Data download

By Jose R. Zapata

Last updated: 15/Nov/2024

Description:

Download the dataset and select the columns that are going to be used in the project.

📚 Import libraries

# base libraries for data science
from pathlib import Path

import pandas as pd

💾 Load data

url_data = "https://www.openml.org/data/get_csv/16826755/phpMYEkMl"
titanic_df = pd.read_csv(url_data, low_memory=False)  # read the file in one pass so dtype inference is consistent
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   pclass     1309 non-null   int64
 1   survived   1309 non-null   int64
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64
 6   parch      1309 non-null   int64
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  boat       1309 non-null   object
 12  body       1309 non-null   object
 13  home.dest  1309 non-null   object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB
titanic_df.sample(10)

      pclass  survived  name                                            sex     age  sibsp  parch  ticket             fare     cabin  embarked  boat  body  home.dest
1270  3       0         Vande Walle, Mr. Nestor Cyriel                  male    28   0      0      345770             9.5      ?      S         ?     ?     ?
495   2       0         Mangiavacchi, Mr. Serafino Emilio               male    ?    0      0      SC/A.3 2861        15.5792  ?      C         ?     ?     New York, NY
864   3       0         Henriksson, Miss. Jenny Lovisa                  female  28   0      0      347086             7.775    ?      S         ?     ?     ?
42    1       1         Brown, Mrs. John Murray (Caroline Lane Lamson)  female  59   2      0      11769              51.4792  C101   S         D     ?     Belmont, MA
514   2       1         Navratil, Master. Edmond Roger                  male    2    1      1      230080             26       F2     S         D     ?     Nice, France
1046  3       0         Naidenoff, Mr. Penko                            male    22   0      0      349206             7.8958   ?      S         ?     ?     ?
1116  3       0         Peduzzi, Mr. Joseph                             male    ?    0      0      A/5 2817           8.05     ?      S         ?     ?     ?
458   2       1         Ilett, Miss. Bertha                             female  17   0      0      SO/C 14885         10.5     ?      S         ?     ?     Guernsey
7     1       0         Andrews, Mr. Thomas Jr                          male    39   0      0      112050             0        A36    S         ?     ?     Belfast, NI
1204  3       0         Sivola, Mr. Antti Wilhelm                       male    21   0      0      STON/O 2. 3101280  7.925    ?      S         ?     ?     ?
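
The sample suggests why `info()` reports 1309 non-null values for every column and why `age` and `fare` are not read as numbers: missing entries appear to be encoded as the string "?". A minimal sketch of how `na_values` changes the parsing, assuming "?" is the only missing-value marker. It uses a few rows copied from the sample as inline CSV (the names `csv_text`, `raw` and `parsed` are illustrative) and is not part of the original notebook:

```python
from io import StringIO

import pandas as pd

# A few rows copied from the sample above, written as inline CSV for illustration
csv_text = """pclass,survived,age,fare,cabin
3,0,28,9.5,?
2,0,?,15.5792,?
1,1,59,51.4792,C101
"""

# Default parsing: "?" is kept as text, so age is not a numeric column
raw = pd.read_csv(StringIO(csv_text))

# Treat "?" as missing: age becomes numeric and the gap shows up as NaN
parsed = pd.read_csv(StringIO(csv_text), na_values="?")

print(raw["age"].isna().sum(), parsed["age"].isna().sum())
print(parsed.dtypes)
```

Whether to convert "?" at download time or in a later cleaning step is a design choice; this notebook keeps the raw values as downloaded.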

📊 Data info

This demo project uses the Titanic dataset, where the goal is to predict whether a passenger survived.

First iteration: which columns are going to be used in the project?

titanic_df.columns
Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

These columns are directly related to the survival variable:

  • boat: The lifeboat number (if they survived)
  • body: The body number (if they did not survive and the body was recovered)

These columns will not be used because they reveal the answer to the problem, which is a form of data leakage.
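
A rough way to see the leakage is to cross-tabulate "has a lifeboat number" against `survived`. The sketch below uses only the `survived` and `boat` values from the 10 sampled rows shown earlier, so it illustrates the idea rather than giving a result for the full dataset. The names `sample`, `has_boat` and `leak_table` are illustrative:

```python
import pandas as pd

# survived and boat values copied from the 10 sampled rows; "?" marks a missing value
sample = pd.DataFrame(
    {
        "survived": [0, 0, 0, 1, 1, 0, 0, 1, 0, 0],
        "boat": ["?", "?", "?", "D", "D", "?", "?", "?", "?", "?"],
    }
)

# True when a lifeboat number is recorded for the passenger
has_boat = sample["boat"].ne("?").rename("has_boat")

# Rows: has_boat (False/True), columns: survived (0/1)
leak_table = pd.crosstab(has_boat, sample["survived"])
print(leak_table)
```

In this small sample every passenger with a recorded lifeboat survived, which is the kind of shortcut a model would exploit if the column were kept.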

columns_to_use = [
    "pclass",
    "survived",
    "name",
    "sex",
    "age",
    "sibsp",
    "parch",
    "ticket",
    "fare",
    "cabin",
    "embarked",
    "home.dest",
]
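
The same list could also be derived by exclusion instead of being typed out, which keeps the leakage columns explicit. A minimal sketch, assuming the column names from the `titanic_df.columns` output above. The names `all_columns`, `leakage_columns` and `derived_columns` are hypothetical:

```python
# Column names copied from the titanic_df.columns output
all_columns = [
    "pclass", "survived", "name", "sex", "age", "sibsp", "parch",
    "ticket", "fare", "cabin", "embarked", "boat", "body", "home.dest",
]

# Columns that reveal the target and must be dropped
leakage_columns = {"boat", "body"}

# Keep the original column order, skipping the leakage columns
derived_columns = [col for col in all_columns if col not in leakage_columns]
print(derived_columns)
```

The explicit list used in the notebook is easier to read at a glance. The derived version is less error-prone if more columns are excluded in a later iteration.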

Final data download

titanic_final_df = pd.read_csv(url_data, usecols=columns_to_use, low_memory=False)
titanic_final_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   pclass     1309 non-null   int64
 1   survived   1309 non-null   int64
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64
 6   parch      1309 non-null   int64
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  home.dest  1309 non-null   object
dtypes: int64(4), object(8)
memory usage: 122.8+ KB

Save the dataset locally

# Define the output directory and file: data/01_raw under the parent of the working directory
DATA_DIR = Path.cwd().resolve().parents[0] / "data" / "01_raw"

file_path = DATA_DIR / "titanic_raw.csv"

# Create the directory if it does not exist
DATA_DIR.mkdir(parents=True, exist_ok=True)
titanic_final_df.to_csv(file_path, index=False)
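
A quick round-trip check can confirm that the file was written as expected. The sketch below writes to a temporary directory and uses a small hypothetical stand-in for `titanic_final_df` (two rows taken from the sample shown earlier), so it runs without touching the project folders:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for titanic_final_df, built from two sampled rows
df = pd.DataFrame(
    {
        "pclass": [1, 2],
        "survived": [1, 1],
        "name": [
            "Brown, Mrs. John Murray (Caroline Lane Lamson)",
            "Navratil, Master. Edmond Roger",
        ],
        "fare": [51.4792, 26.0],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the data/01_raw layout inside the temporary directory
    data_dir = Path(tmp_dir) / "data" / "01_raw"
    data_dir.mkdir(parents=True, exist_ok=True)
    file_path = data_dir / "titanic_raw.csv"

    # Write without the index, then read the file back
    df.to_csv(file_path, index=False)
    reloaded = pd.read_csv(file_path)

same_shape = reloaded.shape == df.shape
same_columns = list(reloaded.columns) == list(df.columns)
print(same_shape, same_columns)
```

Writing with `index=False` matters here: otherwise the reloaded file would gain an extra unnamed index column.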

Results

The dataset is saved in CSV format and contains only the columns selected for the project.

Data leakage is avoided by removing the columns boat and body.

path: data/01_raw/titanic_raw.csv