Data exploration and description

By Jose R. Zapata

Last updated: 21/Nov/2024

Description

Exploratory data analysis (EDA) and description of the data set.

Data manipulation and visualization

📚 Import libraries

# base libraries for data science
import sys
from pathlib import Path

import pandas as pd
import seaborn as sns
# configuration to show only 2 decimal places
pd.set_option("display.float_format", "{:.2f}".format)
# print library version for reproducibility

print("Python version: ", sys.version)
print("Pandas version: ", pd.__version__)
Python version:  3.11.10 (main, Sep 27 2024, 20:27:21) [GCC 11.4.0]
Pandas version:  2.1.4

💾 Load data

The dataset has correct data types, fixed in:

notebooks/2-exploration/01-jrz-data_explore_description-2024_03_01.ipynb

DATA_DIR = Path.cwd().resolve().parents[1] / "data"

titanic_df = pd.read_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)

📊 Data description

titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1309 non-null   int64
 1   pclass      1309 non-null   int64
 2   survived    1309 non-null   bool
 3   name        1309 non-null   object
 4   sex         1309 non-null   category
 5   age         1046 non-null   float64
 6   sibsp       1309 non-null   int64
 7   parch       1309 non-null   int64
 8   fare        1308 non-null   float64
 9   embarked    1307 non-null   category
dtypes: bool(1), category(2), float64(2), int64(4), object(1)
memory usage: 75.8+ KB

Ordinal data has to be converted again

Information about the pclass column can be checked in the notebook:

notebooks/2-exploration/01-jrz-data_explore_description-2024_03_01.ipynb

titanic_df["pclass"] = pd.Categorical(
    titanic_df["pclass"], categories=[3, 2, 1], ordered=True
)

# column Unnamed: 0 is not needed
titanic_df = titanic_df.drop(columns=["Unnamed: 0"])
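
A minimal sketch of what the ordered conversion enables, using a small toy series in place of the real pclass column (the toy values are illustrative assumptions, not data from the notebook). With ordered=True, comparisons, min/max and sorting follow the declared order 3 < 2 < 1 instead of the numeric order.

```python
import pandas as pd

# Toy values standing in for the real pclass column (illustrative only)
pclass = pd.Series(
    pd.Categorical([3, 1, 2, 3, 1], categories=[3, 2, 1], ordered=True)
)

# Comparisons follow the declared category order 3 < 2 < 1
better_than_third = pclass > 3

# min/max and sorting respect the category order, not the numeric value
lowest_class = pclass.min()
highest_class = pclass.max()
sorted_values = pclass.sort_values().tolist()

print(better_than_third.tolist())
print(lowest_class, highest_class, sorted_values)
```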

General information about the data set:

titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   pclass    1309 non-null   category
 1   survived  1309 non-null   bool
 2   name      1309 non-null   object
 3   sex       1309 non-null   category
 4   age       1046 non-null   float64
 5   sibsp     1309 non-null   int64
 6   parch     1309 non-null   int64
 7   fare      1308 non-null   float64
 8   embarked  1307 non-null   category
dtypes: bool(1), category(3), float64(2), int64(2), object(1)
memory usage: 56.8+ KB
# size of the dataframe
titanic_df.shape
(1309, 9)
# sample of the dataframe
titanic_df.sample(5)

      pclass  survived                          name   sex    age  sibsp  parch   fare  embarked
58         1     False        Case, Mr. Howard Brown  male  49.00      0      0  26.00         S
736        3     False             Coxon, Mr. Daniel  male  59.00      0      0   7.25         S
523        2      True     Oxenham, Mr. Percy Thomas  male  22.00      0      0  10.50         S
1073       3     False         O'Connor, Mr. Maurice  male    NaN      0      0   7.75         Q
518        2     False  Nicholls, Mr. Joseph Charles  male  19.00      1      1  36.75         S

Number of missing values

titanic_df.isnull().sum()
pclass        0
survived      0
name          0
sex           0
age         263
sibsp         0
parch         0
fare          1
embarked      2
dtype: int64
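
The raw counts are easier to compare as percentages of the rows: age is missing in 263 of 1309 rows, about 20%. A minimal sketch of that calculation on a toy frame (the toy values are assumptions for illustration); on the notebook data the same expression would be applied to titanic_df.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (illustrative only)
toy_df = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, np.nan],
        "fare": [7.25, 71.28, np.nan, 8.05],
        "sex": ["male", "female", "female", "male"],
    }
)

# isnull() gives booleans, so the column mean is the fraction of missing values
missing_pct = (toy_df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```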

Target Variable = Survived

Numerical variables
titanic_df.describe()

           age    sibsp    parch     fare
count  1046.00  1309.00  1309.00  1308.00
mean     29.88     0.50     0.39    33.30
std      14.41     1.04     0.87    51.76
min       0.17     0.00     0.00     0.00
25%      21.00     0.00     0.00     7.90
50%      28.00     0.00     0.00    14.45
75%      39.00     1.00     0.00    31.27
max      80.00     8.00     9.00   512.33
Categorical variables
# categorical columns description
titanic_df.describe(include="category")

        pclass   sex  embarked
count     1309  1309      1307
unique       3     2         3
top          3  male         S
freq       709   843       914

📈 Univariate Analysis

Target Variable

titanic_df["survived"].value_counts().plot(
    kind="bar", color=["skyblue", "orange"], title="Survived passengers"
);

png
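
The bar chart shows absolute counts; the class balance is clearer as shares. A small sketch using the survived totals reported in this notebook's crosstabs (809 not survived, 500 survived); the dictionary labels are chosen here for illustration.

```python
# Totals of the target variable as reported in the notebook's crosstabs
counts = {"not survived": 809, "survived": 500}

total = sum(counts.values())
# Share of each class in percent, rounded to one decimal place
shares = {label: round(n / total * 100, 1) for label, n in counts.items()}

print(total, shares)
```

In the notebook itself, titanic_df["survived"].value_counts(normalize=True) should give the same proportions directly.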

Numerical Variables

# list of the numerical columns
numerical_columns = list(titanic_df.select_dtypes(include=["number"]).columns)
numerical_columns
['age', 'sibsp', 'parch', 'fare']

age

column = "age"
titanic_df[column].describe()
count   1046.00
mean      29.88
std       14.41
min        0.17
25%       21.00
50%       28.00
75%       39.00
max       80.00
Name: age, dtype: float64
# number of unique values
titanic_df[column].nunique()
98
titanic_df[column].plot(
    kind="hist", bins=30, edgecolor="black", title=f"{column} histogram"
);

png

titanic_df[column].plot(kind="density", title=f"{column} distribution");

png

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

png
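
The points drawn beyond the whiskers of the boxplot follow the usual 1.5 x IQR rule (the matplotlib default). A short worked example with the age quartiles from describe() above; the variable names are illustrative.

```python
# Quartiles of age taken from the describe() output above
q1, q3 = 21.0, 39.0

iqr = q3 - q1
# Values outside these fences are drawn as individual outlier points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(iqr, lower_fence, upper_fence)
```

Under this rule only ages above 66 are flagged as outliers, since the lower fence is negative.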

sibsp

column = "sibsp"
titanic_df[column].describe()
count   1309.00
mean       0.50
std        1.04
min        0.00
25%        0.00
50%        0.00
75%        1.00
max        8.00
Name: sibsp, dtype: float64
# number of unique values
titanic_df[column].nunique()
7
# unique values (there are only a few)
titanic_df[column].unique()
array([0, 1, 2, 3, 4, 5, 8])
# histogram of a column with 7 unique values between 0 and 8
titanic_df[column].plot(
    kind="hist", bins=8, edgecolor="black", title=f"{column} histogram"
);

png

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

png

parch

column = "parch"
titanic_df[column].describe()
count   1309.00
mean       0.39
std        0.87
min        0.00
25%        0.00
50%        0.00
75%        0.00
max        9.00
Name: parch, dtype: float64
# number of unique values

titanic_df[column].nunique()
8
# unique values (there are only a few)
titanic_df[column].unique()
array([0, 2, 1, 4, 3, 5, 6, 9])
titanic_df[column].plot(
    kind="hist", bins=9, edgecolor="black", title=f"{column} histogram"
);

png

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

png

fare

column = "fare"
titanic_df[column].describe()
count   1308.00
mean      33.30
std       51.76
min        0.00
25%        7.90
50%       14.45
75%       31.27
max      512.33
Name: fare, dtype: float64
titanic_df[column].plot(kind="kde", title=f"{column} distribution");

png

titanic_df[column].plot(
    kind="hist", bins=50, edgecolor="black", title=f"{column} histogram"
);

png

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

png
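
The fare mean (33.30) is more than double the median (14.45) and the maximum is 512.33, which points to a strongly right-skewed distribution. One common option, sketched here as an assumption rather than a step of the original notebook, is to look at the fare on a log scale. The toy series reuses the min, quartiles and max from describe(); np.log1p is used because the minimum fare is 0.

```python
import numpy as np
import pandas as pd

# Min, quartiles and max of fare from describe() (used as a toy sample)
fare = pd.Series([0.0, 7.9, 14.45, 31.27, 512.33], name="fare")

# log1p computes log(1 + x), so a fare of 0 maps to 0 instead of -inf
log_fare = np.log1p(fare)

print(log_fare.round(2).tolist())
print(fare.skew(), log_fare.skew())
```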

Categorical Variables

# list of the categorical columns
categorical_columns = list(titanic_df.select_dtypes(include=["category"]).columns)
categorical_columns
['pclass', 'sex', 'embarked']

pclass

column = "pclass"
titanic_df[column].describe()
count     1309
unique       3
top          3
freq       709
Name: pclass, dtype: int64
titanic_df[column].unique()
[1, 2, 3]
Categories (3, int64): [3 < 2 < 1]
titanic_df[column].value_counts()
pclass
3    709
1    323
2    277
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
    kind="bar", color=["skyblue", "orange", "green"], title=f"{column} value counts"
);

png

sex

column = "sex"
titanic_df[column].describe()
count     1309
unique       2
top       male
freq       843
Name: sex, dtype: object
titanic_df[column].unique()
['female', 'male']
Categories (2, object): ['female', 'male']
titanic_df[column].value_counts()
sex
male      843
female    466
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
    kind="bar", color=["skyblue", "orange"], title=f"{column} value counts"
);

png

embarked

column = "embarked"
titanic_df[column].describe()
count     1307
unique       3
top          S
freq       914
Name: embarked, dtype: object
titanic_df[column].unique()
['S', 'C', NaN, 'Q']
Categories (3, object): ['C', 'Q', 'S']
titanic_df[column].value_counts()
embarked
S    914
C    270
Q    123
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
    kind="bar", color=["skyblue", "orange", "green"], title=f"{column} value counts"
);

png

String columns

titanic_df["name"].sample(5)
394                         Dibden, Mr. William
755                          Davies, Mr. Joseph
237                         Robbins, Mr. Victor
459                Jacobsohn, Mr. Sidney Samuel
1136    Rasmussen, Mrs. (Lena Jacobsen Solvang)
Name: name, dtype: object
titanic_df["name"].nunique()
1307
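
There are 1307 unique names in 1309 rows, so two rows repeat a name that already appears elsewhere. A minimal sketch for listing such rows, using made-up placeholder names (not taken from the dataset):

```python
import pandas as pd

# Placeholder names for illustration only
names = pd.Series(
    [
        "Doe, Mr. John",
        "Roe, Miss. Jane",
        "Doe, Mr. John",
        "Poe, Mr. Edgar",
    ],
    name="name",
)

# keep=False marks every occurrence of a repeated value, not only the later ones
repeated = names[names.duplicated(keep=False)]
n_repeated_names = repeated.nunique()

print(repeated)
print(n_repeated_names)
```

A repeated name does not by itself mean a duplicated record; the other columns would need to be compared before dropping anything.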

📈 Bivariate Analysis

It is important to check the relationship between the target variable and the other variables.

The target variable is Survived

Target vs Numerical Variables

survived vs age

variable = "age"

titanic_df.plot(
    kind="box",
    column=variable,
    by="survived",
    grid=False,
    title=f"Survived vs {variable} boxplot",
);

png

titanic_df[titanic_df["survived"] == 1][variable].plot(
    kind="kde",
    label="Survived",
    legend=True,
    title=f"Survived vs {variable} density plot",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
    kind="kde",
    label="Not Survived",
    legend=True,
    xlabel=variable,
    ylabel="Density",
);

png

survived vs sibsp

variable = "sibsp"

titanic_df.plot(
    kind="box",
    column=variable,
    by="survived",
    grid=False,
    title=f"Survived vs {variable} boxplot",
);

png

titanic_df[titanic_df["survived"] == 1][variable].plot(
    kind="hist",
    label="Survived",
    legend=True,
    bins=8,
    alpha=0.5,
    title=f"Survived vs {variable} histogram",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
    kind="hist",
    label="Not Survived",
    legend=True,
    bins=8,
    alpha=0.3,
    xlabel=variable,
    ylabel="Frequency",
);

png

survived vs parch

variable = "parch"

titanic_df.plot(
    kind="box",
    column=variable,
    by="survived",
    grid=False,
    title=f"Survived vs {variable} boxplot",
);

png

titanic_df[titanic_df["survived"] == 1][variable].plot(
    kind="hist",
    label="Survived",
    legend=True,
    bins=9,
    alpha=0.5,
    title=f"Survived vs {variable} histogram",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
    kind="hist",
    label="Not Survived",
    legend=True,
    bins=9,
    alpha=0.3,
    xlabel=variable,
    ylabel="Frequency",
);

png

survived vs fare

variable = "fare"

titanic_df.plot(
    kind="box",
    column=variable,
    by="survived",
    grid=False,
    title=f"Survived vs {variable} boxplot",
);

png

titanic_df[titanic_df["survived"] == 1][variable].plot(
    kind="kde",
    label="Survived",
    legend=True,
    title=f"Survived vs {variable} density plot",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
    kind="kde",
    label="Not Survived",
    legend=True,
    xlabel=variable,
    ylabel="Density",
);

png

Target vs Categorical Variables

survived vs pclass

column = "pclass"

(
    pd.crosstab(titanic_df[column], titanic_df["survived"], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption("Survived vs pclass Heatmap")
)
Survived vs pclass Heatmap
survived  False  True   All
pclass
3           528   181   709
2           158   119   277
1           123   200   323
All         809   500  1309
titanic_df.groupby(column, observed=True).agg({"survived": "mean"}) * 100

        survived
pclass
3          25.53
2          42.96
1          61.92

2nd class passengers had about 1.7 times the survival rate of 3rd class (43% vs 26%), and 1st class passengers did better still (62%).
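
The comparison between classes can be made explicit from the crosstab counts above. A short worked example in plain Python (the dictionary names are illustrative):

```python
# Survivors and totals per class, taken from the crosstab above
survived = {3: 181, 2: 119, 1: 200}
total = {3: 709, 2: 277, 1: 323}

# Survival rate per class and its ratio relative to 3rd class
rate = {c: survived[c] / total[c] for c in total}
ratio_vs_third = {c: round(rate[c] / rate[3], 2) for c in rate}

print({c: round(r * 100, 2) for c, r in rate.items()})
print(ratio_vs_third)
```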

(
    pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
        kind="bar", stacked=True, title=f"Survived vs {column} stacked barplot"
    )
);

png

(
    pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
        kind="bar",
        title=f"Survived vs {column} barplot",
    )
);

png

survived vs sex

column = "sex"

(
    pd.crosstab(titanic_df[column], titanic_df["survived"], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"Survived vs {column} Heatmap")
)
Survived vs sex Heatmap
survived  False  True   All
sex
female      127   339   466
male        682   161   843
All         809   500  1309
titanic_df.groupby(column, observed=True).agg({"survived": "mean"}) * 100

        survived
sex
female     72.75
male       19.10
(
    pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
        kind="bar", stacked=True, title=f"Survived vs {column} stacked barplot"
    )
);

png

(
    pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
        kind="bar",
        title=f"Survived vs {column} barplot",
    )
);

png

survived vs embarked

column = "embarked"

(
    pd.crosstab(titanic_df[column], titanic_df["survived"], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"Survived vs {column} Heatmap")
)
Survived vs embarked Heatmap
survived  False  True   All
embarked
C           120   150   270
Q            79    44   123
S           610   304   914
All         809   498  1307
titanic_df.groupby(column, observed=True).agg({"survived": "mean"}) * 100

          survived
embarked
C            55.56
Q            35.77
S            33.26
(
    pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
        kind="bar", stacked=True, title=f"Survived vs {column} stacked barplot"
    )
);

png

(
    pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
        kind="bar",
        title=f"Survived vs {column} barplot",
    )
);

png
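
The count crosstab and the groupby survival rate used in this section can be combined in one call: pd.crosstab with normalize="index" returns row percentages. A minimal sketch on a toy frame (the toy rows are assumptions for illustration):

```python
import pandas as pd

# Toy data standing in for the embarked and survived columns
toy_df = pd.DataFrame(
    {
        "embarked": ["C", "C", "Q", "S", "S", "S"],
        "survived": [True, False, False, True, False, False],
    }
)

# normalize="index" divides each row by its row total
row_pct = (
    pd.crosstab(toy_df["embarked"], toy_df["survived"], normalize="index") * 100
)
print(row_pct.round(2))
```

Each row sums to 100, so the True column is the survival rate of that category.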

Numerical vs Numerical Variables

age vs sibsp

# scatter plot of age vs sibsp

titanic_df.plot(
    kind="scatter",
    x="age",
    y="sibsp",
    title="age vs Sibsp scatter plot",
);

png

age vs parch

# scatter plot of age vs parch

titanic_df.plot(
    kind="scatter",
    x="age",
    y="parch",
    title="age vs Parch scatter plot",
);

png

age vs fare

# scatter plot of age vs fare

titanic_df.plot(
    kind="scatter",
    x="age",
    y="fare",
    title="age vs fare scatter plot",
);

png

sibsp vs parch

# scatter plot of sibsp vs parch

titanic_df.plot(
    kind="scatter",
    x="sibsp",
    y="parch",
    title="Sibsp vs parch scatter plot",
);

png

sibsp vs fare

# scatter plot of sibsp vs fare

titanic_df.plot(
    kind="scatter",
    x="sibsp",
    y="fare",
    title="Sibsp vs fare scatter plot",
);

png

parch vs fare

# scatter plot of parch vs fare

titanic_df.plot(
    kind="scatter",
    x="parch",
    y="fare",
    title="parch vs fare scatter plot",
);

png

Categorical vs Categorical Variables

pclass vs sex

column_1 = "pclass"
column_2 = "sex"

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"{column_1} vs {column_2} Heatmap")
)
pclass vs sex Heatmap
sex     female  male   All
pclass
3          216   493   709
2          106   171   277
1          144   179   323
All        466   843  1309
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
    )
);

png

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar",
        title=f"{column_1} vs {column_2} barplot",
    )
);

png

pclass vs embarked

column_1 = "pclass"
column_2 = "embarked"

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"{column_1} vs {column_2} Heatmap")
)
pclass vs embarked Heatmap
embarked    C    Q    S   All
pclass
3         101  113  495   709
2          28    7  242   277
1         141    3  177   321
All       270  123  914  1307
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
    )
);

png

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar",
        title=f"{column_1} vs {column_2} barplot",
    )
);

png

sex vs embarked

column_1 = "sex"
column_2 = "embarked"

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"{column_1} vs {column_2} Heatmap")
)
sex vs embarked Heatmap
embarked    C    Q    S   All
sex
female    113   60  291   464
male      157   63  623   843
All       270  123  914  1307
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
    )
);

png

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar",
        title=f"{column_1} vs {column_2} barplot",
    )
);

png

sex vs pclass

column_1 = "sex"
column_2 = "pclass"

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"{column_1} vs {column_2} Heatmap")
)
sex vs pclass Heatmap
pclass    3    2    1   All
sex
female  216  106  144   466
male    493  171  179   843
All     709  277  323  1309
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
    )
);

png

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar",
        title=f"{column_1} vs {column_2} barplot",
    )
);

png

embarked vs pclass

column_1 = "embarked"
column_2 = "pclass"

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"{column_1} vs {column_2} Heatmap")
)
embarked vs pclass Heatmap
pclass      3    2    1   All
embarked
C         101   28  141   270
Q         113    7    3   123
S         495  242  177   914
All       709  277  321  1307
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
    )
);

png

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar",
        title=f"{column_1} vs {column_2} barplot",
    )
);

png

embarked vs sex

column_1 = "embarked"
column_2 = "sex"

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"{column_1} vs {column_2} Heatmap")
)
embarked vs sex Heatmap
sex       female  male   All
embarked
C            113   157   270
Q             60    63   123
S            291   623   914
All          464   843  1307
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
    )
);

png

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar",
        title=f"{column_1} vs {column_2} barplot",
    )
);

png

Categorical vs Numerical Variables

pclass vs age

column_cat = "pclass"
column_num = "age"

titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

png

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

png

pclass vs sibsp

column_num = "sibsp"

titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

png

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

png

pclass vs parch

column_num = "parch"

titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

png

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

png

pclass vs fare

column_num = "fare"

titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

png

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

png

sex vs age

column_cat = "sex"
column_num = "age"

titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

png

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

png

sex vs sibsp

column_num = "sibsp"

titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

png

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

png

sex vs parch

column_num = "parch"

titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

png

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

png

sex vs fare

column_num = "fare"

titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

png

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

png

embarked vs age

column_cat = "embarked"
column_num = "age"

titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

png

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

png

embarked vs sibsp

column_num = "sibsp"

titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

png

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

png

embarked vs parch

column_num = "parch"

titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

png

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

png

embarked vs fare

column_num = "fare"

titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

png

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

png

📈 Multivariate Analysis

It is important to check the relationship between the target variable and multiple variables.

The target variable is Survived

Target vs age vs Sex

sns.boxplot(data=titanic_df, x="sex", y="age", hue="survived");

png

Target vs age vs pclass

sns.boxplot(data=titanic_df, x="pclass", y="age", hue="survived");

png

Target vs age vs embarked

sns.boxplot(data=titanic_df, x="embarked", y="age", hue="survived");

png

Target vs sex vs pclass

(
    titanic_df[["sex", "pclass", "survived"]]
    .groupby(["pclass", "sex"], observed=True)
    .mean()
    * 100
)
               survived
pclass sex
3      female     49.07
       male       15.21
2      female     88.68
       male       14.62
1      female     96.53
       male       34.08

Almost all Pclass 1 females survived (96.5%), as did most Pclass 2 females (88.7%). Pclass 3 female survival was about 49%, and about 85% of Pclass 3 males did not survive.
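
The same rates are easier to compare side by side in a wide table, with one row per class and one column per sex. A minimal sketch with pivot_table on a toy frame (toy rows are assumptions; pclass is a plain integer here, not the ordered categorical used in the notebook):

```python
import pandas as pd

# Toy data standing in for the pclass, sex and survived columns
toy_df = pd.DataFrame(
    {
        "pclass": [1, 1, 3, 3, 3, 3],
        "sex": ["female", "male", "female", "female", "male", "male"],
        "survived": [True, False, True, False, False, False],
    }
)

# The mean of a boolean column is the fraction of True values
rate_table = (
    toy_df.pivot_table(
        index="pclass", columns="sex", values="survived", aggfunc="mean"
    )
    * 100
)
print(rate_table)
```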

sns.catplot(
    data=titanic_df,
    x="sex",
    hue="survived",
    col="pclass",
    kind="count",
    height=4,
    aspect=0.7,
);

png

sns.catplot(data=titanic_df, x="sex", y="survived", hue="pclass", kind="point");

png

sns.catplot(
    data=titanic_df,
    x="pclass",
    y="survived",
    hue="sex",
    palette={"male": "b", "female": "m"},
    markers=["^", "o"],
    linestyles=["-", "--"],
    kind="point",
);

png

sns.catplot(data=titanic_df, x="sex", y="survived", hue="pclass", kind="bar");

png

Target vs pclass vs fare

sns.catplot(
    x="fare",
    y="survived",
    row="pclass",
    kind="box",
    hue="survived",
    orient="h",
    height=1.5,
    aspect=4,
    data=titanic_df.query("fare > 0"),
).set(xscale="log");

png

Target vs sex vs pclass vs embarked

(
    titanic_df[["embarked", "sex", "pclass", "survived"]]
    .groupby(["embarked", "pclass", "sex"], observed=True)
    .agg(["count", "mean"])
)

                        survived
                           count  mean
embarked pclass sex
C        3      female        31  0.71
                male          70  0.21
         2      female        11  1.00
                male          17  0.29
         1      female        71  0.97
                male          70  0.40
Q        3      female        56  0.59
                male          57  0.12
         2      female         2  1.00
                male           5  0.00
         1      female         2  1.00
                male           1  0.00
S        3      female       129  0.40
                male         366  0.14
         2      female        93  0.87
                male         149  0.13
         1      female        69  0.96
                male         108  0.31

Numerical vs All Numerical Variables

numerical_columns
['age', 'sibsp', 'parch', 'fare']
sns.pairplot(titanic_df[numerical_columns], diag_kind="kde");

png

df = titanic_df[[*numerical_columns, "survived"]]

sns.pairplot(df, hue="survived", diag_kind="kde");

png

📏 Heuristics

def calculate_survival_percentage(df, pclass, age_min, age_max=None) -> float:
    """Calculate the percentage of survivors based on the class and age range.

    Args:
        df: DataFrame with the Titanic dataset
        pclass: int with the class of the passenger
        age_min: lower age bound (exclusive)
        age_max: optional upper age bound (exclusive); no upper limit if None

    Returns:
        float with the percentage of survivors

    """

    if age_max is not None:
        survivors = len(
            df[
                (df["pclass"] == pclass)
                & (df["age"] > age_min)
                & (df["age"] < age_max)
                & (df["survived"])
            ]
        )
        total = len(
            df[(df["pclass"] == pclass) & (df["age"] > age_min) & (df["age"] < age_max)]
        )
    else:
        survivors = len(
            df[(df["pclass"] == pclass) & (df["age"] > age_min) & (df["survived"])]
        )
        total = len(df[(df["pclass"] == pclass) & (df["age"] > age_min)])

    return round((survivors / total) * 100, 1)


# Compute and display the survival percentages
print(
    "Pclass 1 survivors above age 60:",
    calculate_survival_percentage(titanic_df, 1, 59),
    "%",
)
print(
    "Pclass 2 survivors above age 60:",
    calculate_survival_percentage(titanic_df, 2, 59),
    "%",
)
print(
    "Pclass 3 survivors above age 60:",
    calculate_survival_percentage(titanic_df, 3, 59),
    "%",
)

print(
    "Pclass 1 survivors between 20-30 age:",
    calculate_survival_percentage(titanic_df, 1, 19, 31),
    "%",
)
print(
    "Pclass 2 survivors between 20-30 age:",
    calculate_survival_percentage(titanic_df, 2, 19, 31),
    "%",
)
print(
    "Pclass 3 survivors between 20-30 age:",
    calculate_survival_percentage(titanic_df, 3, 19, 31),
    "%",
)
Pclass 1 survivors above age 60: 38.5 %
Pclass 2 survivors above age 60: 12.5 %
Pclass 3 survivors above age 60: 16.7 %
Pclass 1 survivors between 20-30 age: 69.8 %
Pclass 2 survivors between 20-30 age: 41.0 %
Pclass 3 survivors between 20-30 age: 25.2 %
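
The age-range heuristic can be generalized to all age groups at once by binning the ages. A minimal sketch with pd.cut on a toy frame (the bin edges and toy rows are assumptions for illustration). Unlike the function above, which excludes both bounds, pd.cut builds right-closed intervals such as (0, 10] by default.

```python
import pandas as pd

# Toy data standing in for the age and survived columns
toy_df = pd.DataFrame(
    {
        "age": [4, 8, 25, 28, 45, 62, 70],
        "survived": [True, True, False, True, False, False, False],
    }
)

# Illustrative bin edges; intervals are (0, 10], (10, 20], ...
age_bins = [0, 10, 20, 30, 60, 80]
toy_df["age_group"] = pd.cut(toy_df["age"], bins=age_bins)

# observed=True keeps only the age groups that actually occur
rate_by_group = (
    toy_df.groupby("age_group", observed=True)["survived"].mean() * 100
)
print(rate_by_group)
```

Adding "pclass" to the groupby keys would reproduce the per-class breakdown printed above for every age group.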

📊 Analysis of Results and Conclusions

  • Almost all Pclass 1 females survived (96.5%), as did most Pclass 2 females (88.7%). Pclass 3 female survival was about 49%, and about 85% of Pclass 3 males did not survive
  • 2nd class passengers had about 1.7 times the survival rate of 3rd class (43% vs 26%), and 1st class passengers did better still (62%)
  • The fare is highly correlated with the class
  • The age is not correlated with the pclass
  • The age is not correlated with the fare
  • The survival rate among passengers who embarked in Cherbourg is higher than the others
  • The survival rate among females is higher
  • The survival rate among 1st class passengers is higher
  • The survival rate among passengers aged 0-10 is higher
  • The two variables that have the highest correlation with the target variable are sex and pclass
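
The correlation statements in the list above come from reading the plots; they could also be checked numerically. A minimal sketch with a Spearman rank correlation on a toy frame (the toy values are assumptions, so the printed coefficients say nothing about the real data):

```python
import pandas as pd

# Toy data: fare falls as the class number rises, age varies freely
toy_df = pd.DataFrame(
    {
        "pclass": [1, 1, 2, 2, 3, 3],
        "fare": [80.0, 60.0, 26.0, 13.0, 8.05, 7.25],
        "age": [30.0, 45.0, 22.0, 50.0, 28.0, 40.0],
    }
)

# Spearman works on ranks, which suits an ordinal variable such as pclass
corr = toy_df.corr(method="spearman")
print(corr.round(2))
```

In the notebook pclass is an ordered categorical, so it would first need a numeric form, for example titanic_df["pclass"].cat.codes.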

💡 Proposals and Ideas

For a first baseline model, we will use the following features:

  • sex
  • pclass
  • age

The baseline can be built with heuristics or with a simple model such as a decision tree.
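
As a rough illustration of the heuristic route, the simplest rule "predict survived for females, not survived for males" can be scored directly from the survived vs sex crosstab in this notebook. This is a sketch of one possible heuristic, not the baseline chosen by the author, and the accuracy is measured on the same data used for the exploration, not on a held-out set.

```python
# Counts from the survived vs sex crosstab in this notebook
female_survived, female_died = 339, 127
male_survived, male_died = 161, 682

total = female_survived + female_died + male_survived + male_died

# Rule: predict "survived" for females and "not survived" for males
correct = female_survived + male_died
accuracy = round(correct / total * 100, 1)

# Reference point: always predict the majority class (not survived)
majority_accuracy = round((female_died + male_died) / total * 100, 1)

print(accuracy, majority_accuracy)
```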
