Data exploration and description

By Jose R. Zapata

Last updated: 21/Nov/2024

Description:

Data overview and exploration to check the data types and fix any issues with them.

This is needed in order to analyze and visualize the data correctly.

📚 Import libraries

# base libraries for data science
from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow as pa

💾 Load data

# data directory path
DATA_DIR = Path.cwd().resolve().parents[1] / "data"

titanic_df = pd.read_csv(DATA_DIR / "01_raw/titanic_raw.csv")

📊 Data description

titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1309 non-null   int64
 1   pclass      1309 non-null   int64
 2   survived    1309 non-null   int64
 3   name        1309 non-null   object
 4   sex         1309 non-null   object
 5   age         1309 non-null   object
 6   sibsp       1309 non-null   int64
 7   parch       1309 non-null   int64
 8   ticket      1309 non-null   object
 9   fare        1309 non-null   object
 10  cabin       1309 non-null   object
 11  embarked    1309 non-null   object
 12  home.dest   1309 non-null   object
dtypes: int64(5), object(8)
memory usage: 133.1+ KB

titanic_df.sample(10)

      Unnamed: 0  pclass  survived  name                                    sex     age  sibsp  parch  ticket              fare     cabin  embarked  home.dest
14    14          1       1         Barkworth, Mr. Algernon Henry Wilson    male    80   0      0      27042               30       A23    S         Hessle, Yorks
626   626         3       0         Andersson, Miss. Ida Augusta Margareta  female  38   4      2      347091              7.775    ?      S         Vadsbro, Sweden Ministee, MI
1083  1083        3       0         Olsen, Mr. Henry Margido                male    28   0      0      C 4001              22.525   ?      S         ?
360   360         2       1         Caldwell, Mr. Albert Francis            male    26   1      1      248738              29       ?      S         Bangkok, Thailand / Roseville, IL
1252  1252        3       0         Torber, Mr. Ernst William               male    44   0      0      364511              8.05     ?      S         ?
679   679         3       0         Boulos, Miss. Nourelain                 female  9    1      1      2678                15.2458  ?      C         Syria Kent, ON
469   469         2       1         Keane, Miss. Nora A                     female  ?    0      0      226593              12.35    E101   Q         Harrisburg, PA
1007  1007        3       1         McGowan, Miss. Anna 'Annie'             female  15   0      0      330923              8.0292   ?      Q         ?
187   187         1       1         Lines, Miss. Mary Conover               female  16   0      1      PC 17592            39.4     D28    S         Paris, France
1113  1113        3       0         Peacock, Mrs. Benjamin (Edith Nile)     female  26   0      2      SOTON/O.Q. 3101315  13.775   ?      S         ?

Null values

In this dataset the null values are represented by the string ‘?’, so we need to replace them with np.nan.

titanic_df = titanic_df.replace("?", np.nan)
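As a minimal sketch of what this replacement does, the toy frame below (toy_df and its values are made up for illustration, not taken from the dataset) shows that the ‘?’ markers only become countable as nulls after the replacement:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "?" marks a missing value, as in the raw file.
toy_df = pd.DataFrame(
    {
        "age": ["29", "?", "0.92"],
        "cabin": ["B5", "?", "?"],
    }
)

# Before the replacement, pandas sees "?" as an ordinary string.
nulls_before = toy_df.isna().sum()

# Same call as in the notebook: every "?" becomes np.nan.
cleaned_df = toy_df.replace("?", np.nan)
nulls_after = cleaned_df.isna().sum()

print(nulls_before)
print(nulls_after)
```

An alternative worth considering, assuming the file is read with pd.read_csv, is to pass na_values="?" at load time so the markers are treated as null from the start.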

titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1309 non-null   int64
 1   pclass      1309 non-null   int64
 2   survived    1309 non-null   int64
 3   name        1309 non-null   object
 4   sex         1309 non-null   object
 5   age         1046 non-null   object
 6   sibsp       1309 non-null   int64
 7   parch       1309 non-null   int64
 8   ticket      1309 non-null   object
 9   fare        1308 non-null   object
 10  cabin       295 non-null    object
 11  embarked    1307 non-null   object
 12  home.dest   745 non-null    object
dtypes: int64(5), object(8)
memory usage: 133.1+ KB

Remove columns

  • We will remove the columns that have too many null values and would need too much effort to fill with the correct value.
  • The ticket column is a string that serves only as an identifier, so we will remove it.

We therefore remove the columns cabin, ticket and home.dest.
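The share of nulls behind this decision can be sketched from the non-null counts printed by titanic_df.info() above. The variable names are illustrative; the counts are copied from that output:

```python
import pandas as pd

TOTAL_ROWS = 1309  # number of entries reported by info()

# Non-null counts reported by info() after replacing "?" with np.nan.
non_null_counts = pd.Series(
    {"age": 1046, "fare": 1308, "cabin": 295, "embarked": 1307, "home.dest": 745}
)

# Fraction of missing values per column, largest first.
null_share = (1 - non_null_counts / TOTAL_ROWS).sort_values(ascending=False)
print(null_share.round(3))
```

On the real dataframe the same figures should come out of titanic_df.isna().mean(). cabin is missing in roughly 77% of the rows and home.dest in roughly 43%, well above any other column.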

titanic_df = titanic_df.drop(columns=["cabin", "home.dest", "ticket"])

titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1309 non-null   int64
 1   pclass      1309 non-null   int64
 2   survived    1309 non-null   int64
 3   name        1309 non-null   object
 4   sex         1309 non-null   object
 5   age         1046 non-null   object
 6   sibsp       1309 non-null   int64
 7   parch       1309 non-null   int64
 8   fare        1308 non-null   object
 9   embarked    1307 non-null   object
dtypes: int64(5), object(5)
memory usage: 102.4+ KB

Categorical variables

Ordinal

  • pclass: A proxy for socio-economic status (SES)
    • 1 = Upper
    • 2 = Middle
    • 3 = Lower

Nominal

  • sex: Gender of the passenger
    • female
    • male
  • embarked: Port of embarkation
    • C = Cherbourg
    • Q = Queenstown
    • S = Southampton

Numerical variables

Discrete

  • sibsp: The dataset defines family relations in this way…
    • Sibling = brother, sister, stepbrother, stepsister
    • Spouse = husband, wife (mistresses and fiancés were ignored)
    • sibsp = 0, 1, 2, 3, 4, 5, 8
  • parch: The dataset defines family relations in this way…
    • Parent = mother, father
    • Child = daughter, son, stepdaughter, stepson
    • Some children travelled only with a nanny, therefore parch = 0 for them.
    • parch = 0, 1, 2, 3, 4, 5, 6

Continuous

  • fare: Passenger fare
  • age: Age of the passenger. Some values are fractional and the column contains nulls, so it is kept as float rather than converted to int.

Boolean variables

  • survived: 0 = No, 1 = Yes

String variables

  • name: Name of the passenger with the format Last name, Title. First name
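To illustrate the name format, here is a minimal sketch that splits a few names from the sample above into their parts. The regular expression and the column names last_name, title and first_name are assumptions made for this example, and the pattern has not been validated against the full column:

```python
import pandas as pd

# Three names taken from the sample shown earlier.
names = pd.Series(
    [
        "Barkworth, Mr. Algernon Henry Wilson",
        "McGowan, Miss. Anna 'Annie'",
        "Peacock, Mrs. Benjamin (Edith Nile)",
    ]
)

# "Last name, Title. First name": text up to the comma, then text up to
# the first period, then the rest of the string.
name_pattern = r"^(?P<last_name>[^,]+),\s*(?P<title>[^.]+)\.\s*(?P<first_name>.*)$"
name_parts = names.str.extract(name_pattern)
print(name_parts)
```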

Convert data types

Categorical variables

cols_categoric = ["pclass", "sex", "embarked"]

titanic_df[cols_categoric] = titanic_df[cols_categoric].astype("category")
# pclass is ordinal: order the categories from lowest (3 = Lower) to highest (1 = Upper)
titanic_df["pclass"] = pd.Categorical(
    titanic_df["pclass"], categories=[3, 2, 1], ordered=True
)
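The effect of an ordered categorical can be sketched on a toy series (toy_pclass is a made-up example). With categories=[3, 2, 1], class 3 is the lowest level and class 1 the highest, so comparisons and sorting follow the socio-economic order rather than the numeric one:

```python
import pandas as pd

# Hypothetical toy values using the same category order as the notebook.
toy_pclass = pd.Series(
    pd.Categorical([1, 3, 2, 3], categories=[3, 2, 1], ordered=True)
)

highest_class = toy_pclass.max()     # highest category in the declared order
lowest_class = toy_pclass.min()      # lowest category in the declared order
above_middle = toy_pclass > 2        # True only for passengers ranked above class 2
sorted_classes = toy_pclass.sort_values().tolist()

print(highest_class, lowest_class)
print(above_middle.tolist())
print(sorted_classes)
```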

Numerical variables

cols_numeric_float = ["age", "fare"]

titanic_df[cols_numeric_float] = titanic_df[cols_numeric_float].astype("float")
cols_numeric_int = ["sibsp", "parch"]

titanic_df[cols_numeric_int] = titanic_df[cols_numeric_int].astype("int8")
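A minimal sketch of why a column with nulls, such as age, is cast to float and not to a plain integer type. The series raw_age is made up for illustration: string values plus np.nan, as left by the ‘?’ replacement:

```python
import numpy as np
import pandas as pd

# Hypothetical object column: numbers stored as strings, plus a null.
raw_age = pd.Series(["29", np.nan, "0.92"], dtype="object")

# Casting to float parses the strings and keeps the null as NaN.
age_float = raw_age.astype("float")

# A plain NumPy integer dtype cannot represent NaN, so this cast fails.
try:
    age_float.astype("int64")
    int_cast_failed = False
except ValueError:
    int_cast_failed = True

print(age_float)
print(int_cast_failed)
```

sibsp and parch have no nulls and only small values (at most 8 and 6), so int8 is enough for them. Storing nulls in an integer column would require one of the pandas nullable integer dtypes such as "Int64".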

Boolean variables

cols_boolean = ["survived"]

titanic_df[cols_boolean] = titanic_df[cols_boolean].astype("bool")
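One caveat with astype("bool") is that any non-zero value becomes True. A small guard, sketched here on a made-up series (survived_raw and the check are assumptions for illustration, not part of the notebook), confirms the column holds only 0 and 1 before casting:

```python
import pandas as pd

# Hypothetical toy column with the same 0/1 encoding as survived.
survived_raw = pd.Series([0, 1, 1, 0], name="survived")

# Fail early if anything other than 0 or 1 is present.
unexpected_values = set(survived_raw.unique()) - {0, 1}
if unexpected_values:
    raise ValueError(f"unexpected values in survived: {unexpected_values}")

survived_bool = survived_raw.astype("bool")
print(survived_bool.tolist())
```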
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1309 non-null   int64
 1   pclass      1309 non-null   category
 2   survived    1309 non-null   bool
 3   name        1309 non-null   object
 4   sex         1309 non-null   category
 5   age         1046 non-null   float64
 6   sibsp       1309 non-null   int8
 7   parch       1309 non-null   int8
 8   fare        1308 non-null   float64
 9   embarked    1307 non-null   category
dtypes: bool(1), category(3), float64(2), int64(1), int8(2), object(1)
memory usage: 49.1+ KB

# capture the pyarrow schema inferred from the corrected pandas dtypes
schema = pa.Table.from_pandas(titanic_df).schema

💾 Save dataframe with data types

titanic_df.to_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", index=False, schema=schema
)

📊 Analysis of Results

Some columns have been removed, the data types have been corrected and saved with a pyarrow schema, and null values have been replaced with np.nan, so that the data can be analyzed and visualized correctly.

💡 Proposals and Ideas

  • Compare other tools for describing, exploring and analyzing data to see which one works best.

  • Use pyarrow as the dtype backend.

  • Use pd.NA as the null value; note that ydata-profiling is not working well with the pyarrow backend.

📖 References