Initial data exploration

Last updated: 21/Nov/2024

Description

This notebook gives an overview of the dataset, checks the data type of each column, and fixes any type issues.

Correct data types are needed for a sound analysis and visualization of the data.

📚 Import libraries

# base libraries for data science
from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow as pa

💾 Load data

# data directory path
DATA_DIR = Path.cwd().resolve().parents[1] / "data"

titanic_df = pd.read_csv(DATA_DIR / "01_raw/titanic_raw.csv")

📊 Data description

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1309 non-null   int64
 1   pclass      1309 non-null   int64
 2   survived    1309 non-null   int64
 3   name        1309 non-null   object
 4   sex         1309 non-null   object
 5   age         1309 non-null   object
 6   sibsp       1309 non-null   int64
 7   parch       1309 non-null   int64
 8   ticket      1309 non-null   object
 9   fare        1309 non-null   object
 10  cabin       1309 non-null   object
 11  embarked    1309 non-null   object
 12  home.dest   1309 non-null   object
dtypes: int64(5), object(8)
memory usage: 133.1+ KB

titanic_df.sample(10)

	Unnamed: 0	pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	home.dest
14	14	1	1	Barkworth, Mr. Algernon Henry Wilson	male	80	0	0	27042	30	A23	S	Hessle, Yorks
626	626	3	0	Andersson, Miss. Ida Augusta Margareta	female	38	4	2	347091	7.775	?	S	Vadsbro, Sweden Ministee, MI
1083	1083	3	0	Olsen, Mr. Henry Margido	male	28	0	0	C 4001	22.525	?	S	?
360	360	2	1	Caldwell, Mr. Albert Francis	male	26	1	1	248738	29	?	S	Bangkok, Thailand / Roseville, IL
1252	1252	3	0	Torber, Mr. Ernst William	male	44	0	0	364511	8.05	?	S	?
679	679	3	0	Boulos, Miss. Nourelain	female	9	1	1	2678	15.2458	?	C	Syria Kent, ON
469	469	2	1	Keane, Miss. Nora A	female	?	0	0	226593	12.35	E101	Q	Harrisburg, PA
1007	1007	3	1	McGowan, Miss. Anna 'Annie'	female	15	0	0	330923	8.0292	?	Q	?
187	187	1	1	Lines, Miss. Mary Conover	female	16	0	1	PC 17592	39.4	D28	S	Paris, France
1113	1113	3	0	Peacock, Mrs. Benjamin (Edith Nile)	female	26	0	2	SOTON/O.Q. 3101315	13.775	?	S	?

Null values

In this dataset the null values are represented by the string '?', so we need to replace them with np.nan.

titanic_df = titanic_df.replace("?", np.nan)
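As an alternative sketch, the '?' placeholder can be declared as a missing-value marker at load time through the na_values parameter of pd.read_csv. The small inline CSV below is made-up data used only for illustration; it is not the Titanic file, and the variable names are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample that mimics the "?" placeholder (not the real dataset)
csv_text = "name,age,fare\nA,29,211.3\nB,?,151.55\nC,2,?\n"

# Option 1: load everything, then replace the placeholder (as done above)
replaced_df = pd.read_csv(io.StringIO(csv_text)).replace("?", np.nan)

# Option 2: let the parser treat "?" as missing while reading
parsed_df = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Both approaches flag the same cells as missing
print(replaced_df.isna().sum())
print(parsed_df.isna().sum())

# Option 2 infers numeric dtypes directly; option 1 still holds strings
print(replaced_df.dtypes)
print(parsed_df.dtypes)
```

With option 2 the later conversion of age and fare to float would already be done by the parser, at the cost of hiding the placeholder from the first inspection of the raw file.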

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1309 non-null   int64
 1   pclass      1309 non-null   int64
 2   survived    1309 non-null   int64
 3   name        1309 non-null   object
 4   sex         1309 non-null   object
 5   age         1046 non-null   object
 6   sibsp       1309 non-null   int64
 7   parch       1309 non-null   int64
 8   ticket      1309 non-null   object
 9   fare        1308 non-null   object
 10  cabin       295 non-null    object
 11  embarked    1307 non-null   object
 12  home.dest   745 non-null    object
dtypes: int64(5), object(8)
memory usage: 133.1+ KB

Remove columns

We will remove the columns that have too many null values and would need too much effort to fill in correctly.
The ticket column is only an identifier string, so we will remove it as well.

Therefore, we will remove the columns cabin, ticket and home.dest.

titanic_df = titanic_df.drop(columns=["cabin", "home.dest", "ticket"])
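The decision above rests on how many values are missing per column (for example, cabin has only 295 non-null values out of 1309). A minimal sketch of how that share could be computed is shown below. The frame is made-up data, and the 0.5 cut-off is a hypothetical threshold, not one used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration only
demo_df = pd.DataFrame(
    {
        "age": [29.0, np.nan, 2.0, 30.0],
        "cabin": [np.nan, np.nan, "C22", np.nan],
        "fare": [211.3, 151.55, 151.55, 26.0],
    }
)

# Fraction of missing values per column, highest first
missing_share = demo_df.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cut-off: flag columns with more than half of the values missing
MAX_MISSING_SHARE = 0.5
cols_to_review = missing_share[missing_share > MAX_MISSING_SHARE].index.tolist()
print(cols_to_review)
```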

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1309 non-null   int64
 1   pclass      1309 non-null   int64
 2   survived    1309 non-null   int64
 3   name        1309 non-null   object
 4   sex         1309 non-null   object
 5   age         1046 non-null   object
 6   sibsp       1309 non-null   int64
 7   parch       1309 non-null   int64
 8   fare        1308 non-null   object
 9   embarked    1307 non-null   object
dtypes: int64(5), object(5)
memory usage: 102.4+ KB

Categorical variables

Ordinal

pclass: A proxy for socio-economic status (SES)
- 1 = Upper
- 2 = Middle
- 3 = Lower

Nominal

sex: Gender of the passenger
- female
- male
embarked: Port of embarkation
- C = Cherbourg
- Q = Queenstown
- S = Southampton

Numerical variables

Discrete

sibsp: Number of siblings or spouses aboard. The dataset defines family relations in this way:
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
- sibsp = 0, 1, 2, 3, 4, 5, 8
parch: Number of parents or children aboard. The dataset defines family relations in this way:
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch = 0 for them.
- parch = 0, 1, 2, 3, 4, 5, 6

Continuous

fare: Passenger fare
age: Age of the passenger. Some values are fractional and some are missing, so the column is converted to float below rather than to int.

Boolean variables

survived: 0 = No, 1 = Yes

String variables

name: Name of the passenger with the format Last name, Title. First name
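To illustrate the name format described above, the following sketch splits a few names taken from the sample output into their parts. The regular expression and the column names last_name, title and first_name are assumptions made for this example; the notebook itself does not parse the name column.

```python
import pandas as pd

# Names copied from the sample rows shown earlier
names = pd.Series(
    [
        "Barkworth, Mr. Algernon Henry Wilson",
        "Andersson, Miss. Ida Augusta Margareta",
        "Peacock, Mrs. Benjamin (Edith Nile)",
    ]
)

# Assumed pattern: "<last name>, <title>. <first name>"
pattern = r"^(?P<last_name>[^,]+),\s*(?P<title>[^.]+)\.\s*(?P<first_name>.*)$"

# str.extract returns one column per named group
name_parts = names.str.extract(pattern)
print(name_parts)
```

Names that do not follow the pattern would produce missing values in all three columns, so the result should be checked before it is relied on.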

Convert data types

Categorical variables

cols_categoric = ["pclass", "sex", "embarked"]

titanic_df[cols_categoric] = titanic_df[cols_categoric].astype("category")

titanic_df["pclass"] = pd.Categorical(
    titanic_df["pclass"], categories=[3, 2, 1], ordered=True
)
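The effect of declaring pclass as an ordered categorical can be seen in a small sketch. It uses made-up values with the same category order as above, where 3 is the lowest class and 1 the highest.

```python
import pandas as pd

# Same category order as above: 3 (lower) < 2 (middle) < 1 (upper)
pclass = pd.Series(pd.Categorical([1, 3, 2, 3], categories=[3, 2, 1], ordered=True))

# Sorting follows the declared order, not the numeric value
print(pclass.sort_values().tolist())

# min and max respect the category order as well
print(pclass.min(), pclass.max())

# Comparisons use the category order: "higher class than 3rd"
print((pclass > 3).tolist())
```

Without ordered=True, comparisons such as pclass > 3 would raise an error, and min or max would not be defined for the column.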

Numerical variables

cols_numeric_float = ["age", "fare"]

titanic_df[cols_numeric_float] = titanic_df[cols_numeric_float].astype("float")
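The cast above works because every remaining value in age and fare is either a numeric string or missing. As a more defensive sketch, pd.to_numeric with errors="coerce" turns unparseable entries into NaN instead of raising, and the affected inputs can then be listed. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values: numeric strings, a missing value, and an unexpected token
raw_age = pd.Series(["80", "0.9167", np.nan, "unknown"], dtype="object")

# astype("float") would raise on "unknown"; to_numeric can coerce it to NaN
clean_age = pd.to_numeric(raw_age, errors="coerce")

# Report which non-missing inputs could not be parsed
unparsed = raw_age[clean_age.isna() & raw_age.notna()]
print(clean_age.dtype)
print(unparsed.tolist())
```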

cols_numeric_int = ["sibsp", "parch"]

titanic_df[cols_numeric_int] = titanic_df[cols_numeric_int].astype("int8")
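Casting to int8 is safe here because sibsp and parch only take small values (at most 8 and 6 in the lists above). A minimal sketch of a range check before such a downcast is shown below. The series is made up from the sibsp value list, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up counts using the value range listed above for sibsp
counts = pd.Series([0, 1, 2, 3, 4, 5, 8], dtype="int64")

# int8 holds values from -128 to 127
int8_limits = np.iinfo(np.int8)
fits_int8 = counts.min() >= int8_limits.min and counts.max() <= int8_limits.max

# Cast only when every value fits; otherwise keep the wider dtype
counts_small = counts.astype("int8") if fits_int8 else counts
print(counts_small.dtype)
```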

Boolean variables

cols_boolean = ["survived"]

titanic_df[cols_boolean] = titanic_df[cols_boolean].astype("bool")
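Converting to bool assumes that survived contains only 0 and 1. The sketch below, on made-up values, shows why a quick check can be worthwhile before the cast: any non-zero value becomes True.

```python
import pandas as pd

# Made-up flags; the 2 stands in for an unexpected code
flags = pd.Series([0, 1, 1, 2])

# astype("bool") maps every non-zero value to True, so 2 would pass silently
print(flags.astype("bool").tolist())

# Check that only 0 and 1 occur before converting
is_binary = flags.isin([0, 1]).all()
print(is_binary)
```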

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1309 non-null   int64
 1   pclass      1309 non-null   category
 2   survived    1309 non-null   bool
 3   name        1309 non-null   object
 4   sex         1309 non-null   category
 5   age         1046 non-null   float64
 6   sibsp       1309 non-null   int8
 7   parch       1309 non-null   int8
 8   fare        1308 non-null   float64
 9   embarked    1307 non-null   category
dtypes: bool(1), category(3), float64(2), int64(1), int8(2), object(1)
memory usage: 49.1+ KB

schema = pa.Table.from_pandas(titanic_df).schema

💾 Save dataframe with data types

titanic_df.to_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", index=False, schema=schema
)

📊 Analysis of Results

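Columns with many null values or little analytical value have been removed, the '?' placeholders have been replaced with np.nan, and the remaining columns have been converted to suitable data types. The result is saved as a Parquet file with a pyarrow schema.

This prepares the dataset for a correct analysis and visualization of the data.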

💡 Proposals and Ideas

Compare other tools for describing and exploring the data and for data analysis.
Use pyarrow as the dtype backend.
Use pd.NA as the null value; note that ydata-profiling does not work well with the pyarrow backend.
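As a small sketch of the pd.NA proposal, the nullable "Int64" extension dtype in pandas can hold whole numbers together with missing entries, which a plain NumPy integer column cannot. The values are made up; non-whole values would need to be rounded or otherwise handled before such a cast.

```python
import numpy as np
import pandas as pd

# Made-up whole-number ages with one missing value
age_float = pd.Series([29.0, np.nan, 2.0])

# The nullable "Int64" dtype stores integers and marks missing entries as pd.NA
age_int = age_float.astype("Int64")
print(age_int.dtype)
print(age_int.isna().sum())
print(age_int[1] is pd.NA)
```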

📖 References

https://pandas.pydata.org/docs/user_guide/pyarrow.html
