By Jose R. Zapata
Last updated: 15/Nov/2024
Description:
Download the dataset and select the columns that are going to be used in the project.
📚 Import libraries
# base libraries for data science
from pathlib import Path
import pandas as pd
💾 Load data
url_data = "https://www.openml.org/data/get_csv/16826755/phpMYEkMl"
titanic_df = pd.read_csv(url_data, low_memory=False) # read the file in one pass so dtype inference is consistent
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null int64
1 survived 1309 non-null int64
2 name 1309 non-null object
3 sex 1309 non-null object
4 age 1309 non-null object
5 sibsp 1309 non-null int64
6 parch 1309 non-null int64
7 ticket 1309 non-null object
8 fare 1309 non-null object
9 cabin 1309 non-null object
10 embarked 1309 non-null object
11 boat 1309 non-null object
12 body 1309 non-null object
13 home.dest 1309 non-null object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB
titanic_df.sample(10)
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1270 | 3 | 0 | Vande Walle, Mr. Nestor Cyriel | male | 28 | 0 | 0 | 345770 | 9.5 | ? | S | ? | ? | ? |
495 | 2 | 0 | Mangiavacchi, Mr. Serafino Emilio | male | ? | 0 | 0 | SC/A.3 2861 | 15.5792 | ? | C | ? | ? | New York, NY |
864 | 3 | 0 | Henriksson, Miss. Jenny Lovisa | female | 28 | 0 | 0 | 347086 | 7.775 | ? | S | ? | ? | ? |
42 | 1 | 1 | Brown, Mrs. John Murray (Caroline Lane Lamson) | female | 59 | 2 | 0 | 11769 | 51.4792 | C101 | S | D | ? | Belmont, MA |
514 | 2 | 1 | Navratil, Master. Edmond Roger | male | 2 | 1 | 1 | 230080 | 26 | F2 | S | D | ? | Nice, France |
1046 | 3 | 0 | Naidenoff, Mr. Penko | male | 22 | 0 | 0 | 349206 | 7.8958 | ? | S | ? | ? | ? |
1116 | 3 | 0 | Peduzzi, Mr. Joseph | male | ? | 0 | 0 | A/5 2817 | 8.05 | ? | S | ? | ? | ? |
458 | 2 | 1 | Ilett, Miss. Bertha | female | 17 | 0 | 0 | SO/C 14885 | 10.5 | ? | S | ? | ? | Guernsey |
7 | 1 | 0 | Andrews, Mr. Thomas Jr | male | 39 | 0 | 0 | 112050 | 0 | A36 | S | ? | ? | Belfast, NI |
1204 | 3 | 0 | Sivola, Mr. Antti Wilhelm | male | 21 | 0 | 0 | STON/O 2. 3101280 | 7.925 | ? | S | ? | ? | ? |
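The sample shows `?` wherever a value is missing, which is also why `info()` reports every column as fully non-null and lists `age` and `fare` as `object`. As a minimal sketch (not part of this notebook, which keeps the raw values as downloaded), passing `na_values="?"` to `pd.read_csv` makes pandas treat that marker as missing. The inline CSV is a small hypothetical stand-in so the example runs without network access.

```python
from io import StringIO

import pandas as pd

# Hypothetical three-row stand-in for the downloaded file; "?" marks a missing value
csv_text = StringIO(
    "pclass,survived,age,fare,cabin\n"
    "3,0,28,9.5,?\n"
    "2,0,?,15.5792,?\n"
    "1,1,59,51.4792,C101\n"
)

# na_values="?" tells pandas to parse "?" as NaN instead of a string
sample_df = pd.read_csv(csv_text, na_values="?")

print(sample_df.dtypes)        # age and fare are parsed as floats
print(sample_df.isna().sum())  # missing values counted per column
```

Whether to apply this at download time or in a later cleaning step is a design choice; keeping the raw file untouched, as done here, leaves that decision to the next stage.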
📊 Data info
This demo project uses the Titanic dataset; the goal is to predict whether a passenger survived.
First iteration: which columns are going to be used in the project?
titanic_df.columns
Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
dtype='object')
These columns are directly tied to the survival variable:
- boat: The lifeboat number (if they survived)
- body: The body number (if they did not survive and the body was recovered)
These columns will not be used because they reveal the answer to the problem, which is a form of data leakage.
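To see why `boat` leaks the target, the sketch below takes only the `survived` and `boat` values of the ten sampled rows shown earlier and cross-tabulates them. The helper column `has_boat` is a name introduced here for illustration. Ten rows are far too few to draw conclusions; the point is the kind of check that could be run on the full `titanic_df`.

```python
import pandas as pd

# survived and boat values copied from the ten sampled rows shown earlier
leak_df = pd.DataFrame(
    {
        "survived": [0, 0, 0, 1, 1, 0, 0, 1, 0, 0],
        "boat": ["?", "?", "?", "D", "D", "?", "?", "?", "?", "?"],
    }
)

# has_boat (illustrative name): True when a lifeboat is recorded
leak_df["has_boat"] = leak_df["boat"] != "?"

# Count survival outcomes for rows with and without a recorded lifeboat
print(pd.crosstab(leak_df["has_boat"], leak_df["survived"]))
```

In this small sample every passenger with a recorded lifeboat survived, so a model given `boat` could read the label almost directly from it.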
columns_to_use = [
"pclass",
"survived",
"name",
"sex",
"age",
"sibsp",
"parch",
"ticket",
"fare",
"cabin",
"embarked",
"home.dest",
]
Final data download
titanic_final_df = pd.read_csv(url_data, usecols=columns_to_use, low_memory=False)
titanic_final_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null int64
1 survived 1309 non-null int64
2 name 1309 non-null object
3 sex 1309 non-null object
4 age 1309 non-null object
5 sibsp 1309 non-null int64
6 parch 1309 non-null int64
7 ticket 1309 non-null object
8 fare 1309 non-null object
9 cabin 1309 non-null object
10 embarked 1309 non-null object
11 home.dest 1309 non-null object
dtypes: int64(4), object(8)
memory usage: 122.8+ KB
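Another way to reach the same column set, sketched here as an alternative rather than the method used in this notebook, is to name the leakage columns once and drop them from a frame that is already loaded, which avoids a second download. `LEAKAGE_COLUMNS` and the small frame below are hypothetical.

```python
import pandas as pd

# Hypothetical constant naming the columns excluded from the project
LEAKAGE_COLUMNS = ["boat", "body"]

# Tiny stand-in for titanic_df with a subset of its columns
toy_df = pd.DataFrame(
    {
        "pclass": [1, 3],
        "survived": [1, 0],
        "sex": ["female", "male"],
        "boat": ["D", "?"],
        "body": ["?", "?"],
    }
)

# drop returns a new frame; toy_df itself is left unchanged
toy_final_df = toy_df.drop(columns=LEAKAGE_COLUMNS)

# Guard against the leakage columns slipping back in
assert not set(LEAKAGE_COLUMNS) & set(toy_final_df.columns)
print(list(toy_final_df.columns))
```

Listing the columns to keep, as `columns_to_use` does, is the stricter choice: a column added to the source later is ignored unless it is explicitly included.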
Save the dataset locally
Path.cwd().resolve().parents[0]
# Define the directory and the file
DATA_DIR = Path.cwd().resolve().parents[0] / "data/01_raw"
file_path = DATA_DIR / "titanic_raw.csv"
# Create the directory if it does not exist
DATA_DIR.mkdir(parents=True, exist_ok=True)
titanic_final_df.to_csv(file_path, index=False)
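A quick round-trip check can confirm that the file was written as expected. The sketch below writes a small hypothetical frame into a temporary directory (so it does not touch the project's `data/01_raw` folder) with the same `mkdir` and `to_csv` calls, then reads it back.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame standing in for titanic_final_df
toy_df = pd.DataFrame(
    {"pclass": [1, 3], "survived": [1, 0], "sex": ["female", "male"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Same folder layout as above, but rooted in a temporary directory
    data_dir = Path(tmp_dir) / "data" / "01_raw"
    data_dir.mkdir(parents=True, exist_ok=True)
    file_path = data_dir / "titanic_raw.csv"

    toy_df.to_csv(file_path, index=False)

    # Read the file back while the temporary directory still exists
    file_existed = file_path.exists()
    reloaded_df = pd.read_csv(file_path)

# Compare the reloaded frame with what was written
print(reloaded_df.equals(toy_df))
```

Note that `Path.cwd().resolve().parents[0]` in the cell above assumes the notebook runs from a folder one level below the project root; run from elsewhere, `data/01_raw` would be created in a different location.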
Results
The dataset is saved in CSV format, containing only the columns that are going to be used in the project.
The boat and body columns are removed to avoid data leakage.
path: data/01_raw/titanic_raw.csv