Data download

Last updated: 15/Nov/2024

Description:

Download the dataset and select the columns that are going to be used in the project.

📚 Import libraries

# base libraries for data science
from pathlib import Path

import pandas as pd

💾 Load data

url_data = "https://www.openml.org/data/get_csv/16826755/phpMYEkMl"
titanic_df = pd.read_csv(url_data, low_memory=False)  # parse the file in one pass to avoid mixed-type inference
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   pclass     1309 non-null   int64
 1   survived   1309 non-null   int64
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64
 6   parch      1309 non-null   int64
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  boat       1309 non-null   object
 12  body       1309 non-null   object
 13  home.dest  1309 non-null   object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB

titanic_df.sample(10)

	pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest
1270	3	0	Vande Walle, Mr. Nestor Cyriel	male	28	0	0	345770	9.5	?	S	?	?	?
495	2	0	Mangiavacchi, Mr. Serafino Emilio	male	?	0	0	SC/A.3 2861	15.5792	?	C	?	?	New York, NY
864	3	0	Henriksson, Miss. Jenny Lovisa	female	28	0	0	347086	7.775	?	S	?	?	?
42	1	1	Brown, Mrs. John Murray (Caroline Lane Lamson)	female	59	2	0	11769	51.4792	C101	S	D	?	Belmont, MA
514	2	1	Navratil, Master. Edmond Roger	male	2	1	1	230080	26	F2	S	D	?	Nice, France
1046	3	0	Naidenoff, Mr. Penko	male	22	0	0	349206	7.8958	?	S	?	?	?
1116	3	0	Peduzzi, Mr. Joseph	male	?	0	0	A/5 2817	8.05	?	S	?	?	?
458	2	1	Ilett, Miss. Bertha	female	17	0	0	SO/C 14885	10.5	?	S	?	?	Guernsey
7	1	0	Andrews, Mr. Thomas Jr	male	39	0	0	112050	0	A36	S	?	?	Belfast, NI
1204	3	0	Sivola, Mr. Antti Wilhelm	male	21	0	0	STON/O 2. 3101280	7.925	?	S	?	?	?
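
In the sample above, missing values appear as `?`, which is why `info()` reports every column as fully non-null and shows `age` and `fare` as `object`. The following is a minimal sketch, not part of the original notebook, of how the `na_values` parameter of `pd.read_csv` could turn those placeholders into real missing values. It assumes `?` is the only missing-value marker and uses a small inline sample (`sample_csv`, a hypothetical name) instead of the real URL.

```python
import io

import pandas as pd

# Tiny inline sample mimicking the raw file; "?" marks a missing value (assumption)
sample_csv = io.StringIO(
    "pclass,survived,age,fare,cabin\n"
    "3,0,28,9.5,?\n"
    "2,0,?,15.5792,?\n"
    "1,1,59,51.4792,C101\n"
)

# na_values tells pandas to treat "?" as NaN while parsing
sample_df = pd.read_csv(sample_csv, na_values="?")

print(sample_df.dtypes)        # age and fare are now parsed as numeric columns
print(sample_df.isna().sum())  # missing values are counted per column
```

Whether to apply this at download time or in a later cleaning step is a design choice; this notebook keeps the raw values untouched.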

📊 Data info

This demo project uses the Titanic dataset, where the goal is to predict whether a passenger survived.

First iteration: which columns are going to be used in the project?

titanic_df.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

These columns are directly related to the survival variable:

boat: The lifeboat number (if they survived)
body: The body number (if they did not survive and the body was recovered)

These columns are not used because they would reveal the answer to the problem, which is a form of data leakage.

columns_to_use = [
    "pclass",
    "survived",
    "name",
    "sex",
    "age",
    "sibsp",
    "parch",
    "ticket",
    "fare",
    "cabin",
    "embarked",
    "home.dest",
]
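
As an alternative to typing the list by hand, the same selection could be derived by excluding the leakage columns from the full column list. This is only a sketch: `all_columns`, `leakage_columns`, and `derived_columns` are hypothetical names, and the column names are copied from the `titanic_df.columns` output above.

```python
# Full column list, copied from the titanic_df.columns output
all_columns = [
    "pclass", "survived", "name", "sex", "age", "sibsp", "parch",
    "ticket", "fare", "cabin", "embarked", "boat", "body", "home.dest",
]

# Columns that reveal the outcome and must be excluded
leakage_columns = {"boat", "body"}

# Keep the original order while dropping the leakage columns
derived_columns = [col for col in all_columns if col not in leakage_columns]

print(derived_columns)
```

Keeping the leakage columns in a named set documents why they were dropped and makes the exclusion easy to extend.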

Final data download

titanic_final_df = pd.read_csv(url_data, usecols=columns_to_use, low_memory=False)
titanic_final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   pclass     1309 non-null   int64
 1   survived   1309 non-null   int64
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64
 6   parch      1309 non-null   int64
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  home.dest  1309 non-null   object
dtypes: int64(4), object(8)
memory usage: 122.8+ KB

Save the dataset locally

# Define the directory and the file (the data folder sits in the parent of the current working directory)
DATA_DIR = Path.cwd().resolve().parents[0] / "data/01_raw"

file_path = DATA_DIR / "titanic_raw.csv"

# Create the directory if it does not exist
DATA_DIR.mkdir(parents=True, exist_ok=True)

titanic_final_df.to_csv(file_path, index=False)
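
A quick way to check the save step is to read the file back and compare it with the frame that was written. The sketch below is an illustration rather than part of the original notebook: it uses a toy DataFrame (`toy_df`, with hypothetical values) and a temporary directory instead of the project paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for titanic_final_df (hypothetical values)
toy_df = pd.DataFrame(
    {"pclass": [1, 3], "survived": [1, 0], "sex": ["female", "male"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the data/01_raw layout inside a throwaway directory
    data_dir = Path(tmp_dir) / "data" / "01_raw"
    data_dir.mkdir(parents=True, exist_ok=True)
    file_path = data_dir / "titanic_raw.csv"

    toy_df.to_csv(file_path, index=False)
    reloaded_df = pd.read_csv(file_path)

# With index=False no extra index column is written,
# so the shape and the column names should match
print(reloaded_df.shape == toy_df.shape)
print(list(reloaded_df.columns) == list(toy_df.columns))
```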

Results

The dataset is saved in CSV format, containing only the columns that are going to be used in the project.

Data leakage is avoided by removing the boat and body columns.

path: data/01_raw/titanic_raw.csv