By Jose R. Zapata
Last updated: 21/Nov/2024
Description
Exploratory data analysis (EDA) and description of the data set.
Data manipulation and visualization
📚 Import libraries
# base libraries for data science
import sys
from pathlib import Path
import pandas as pd
import seaborn as sns
# configuration to show only 2 decimal places
pd.set_option("display.float_format", "{:.2f}".format)
# print library version for reproducibility
print("Python version: ", sys.version)
print("Pandas version: ", pd.__version__)
Python version:  3.11.10 (main, Sep 27 2024, 20:27:21) [GCC 11.4.0]
Pandas version:  2.1.4
💾 Load data
The dataset has correct data types, fixed in:
notebooks/2-exploration/01-jrz-data_explore_description-2024_03_01.ipynb
DATA_DIR = Path.cwd().resolve().parents[1] / "data"
titanic_df = pd.read_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)
📊 Data description
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1309 non-null   int64
 1   pclass      1309 non-null   int64
 2   survived    1309 non-null   bool
 3   name        1309 non-null   object
 4   sex         1309 non-null   category
 5   age         1046 non-null   float64
 6   sibsp       1309 non-null   int64
 7   parch       1309 non-null   int64
 8   fare        1308 non-null   float64
 9   embarked    1307 non-null   category
dtypes: bool(1), category(2), float64(2), int64(4), object(1)
memory usage: 75.8+ KB
Ordinal data has to be converted again
Information about the pclass column can be checked in the notebook
notebooks/2-exploration/01-jrz-data_explore_description-2024_03_01.ipynb
titanic_df["pclass"] = pd.Categorical(
    titanic_df["pclass"], categories=[3, 2, 1], ordered=True
)
# column Unnamed: 0 is not needed
titanic_df = titanic_df.drop(columns=["Unnamed: 0"])
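As a minimal sketch of why the explicit category order matters: the toy example below uses hypothetical values in `toy_classes` (not taken from the dataset) to show that an ordered categorical with `categories=[3, 2, 1]` ranks 1st class highest, so comparisons and `min`/`max` follow the ordinal meaning rather than the numeric one.

```python
import pandas as pd

# hypothetical toy values, only for illustration
toy_classes = pd.Series([1, 3, 2, 3, 1])

# same category order as the notebook: 3rd < 2nd < 1st
ordered = pd.Series(
    pd.Categorical(toy_classes, categories=[3, 2, 1], ordered=True)
)

# "ranked above 3rd class" is True for 2nd and 1st class
above_third = ordered > 3

lowest_class = ordered.min()   # lowest rank in the ordering
highest_class = ordered.max()  # highest rank in the ordering
print(lowest_class, highest_class, above_third.tolist())
```

With a plain integer column the same comparison would select nothing, because no class number is greater than 3.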
General information about the data set:
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   pclass    1309 non-null   category
 1   survived  1309 non-null   bool
 2   name      1309 non-null   object
 3   sex       1309 non-null   category
 4   age       1046 non-null   float64
 5   sibsp     1309 non-null   int64
 6   parch     1309 non-null   int64
 7   fare      1308 non-null   float64
 8   embarked  1307 non-null   category
dtypes: bool(1), category(3), float64(2), int64(2), object(1)
memory usage: 56.8+ KB
# size of the dataframe
titanic_df.shape
(1309, 9)
# sample of the dataframe
titanic_df.sample(5)
| | pclass | survived | name | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|---|---|
| 58 | 1 | False | Case, Mr. Howard Brown | male | 49.00 | 0 | 0 | 26.00 | S | 
| 736 | 3 | False | Coxon, Mr. Daniel | male | 59.00 | 0 | 0 | 7.25 | S | 
| 523 | 2 | True | Oxenham, Mr. Percy Thomas | male | 22.00 | 0 | 0 | 10.50 | S | 
| 1073 | 3 | False | O'Connor, Mr. Maurice | male | NaN | 0 | 0 | 7.75 | Q | 
| 518 | 2 | False | Nicholls, Mr. Joseph Charles | male | 19.00 | 1 | 1 | 36.75 | S | 
Number of missing values
titanic_df.isnull().sum()
pclass        0
survived      0
name          0
sex           0
age         263
sibsp         0
parch         0
fare          1
embarked      2
dtype: int64
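To put these counts in proportion, the short sketch below converts them into percentages of the 1309 rows. The counts are copied from the output above; the variable names are illustrative.

```python
# missing-value counts copied from the isnull().sum() output
missing_counts = {"age": 263, "fare": 1, "embarked": 2}
total_rows = 1309  # number of rows reported by titanic_df.shape

# percentage of rows with a missing value, per column
missing_pct = {
    column: round(100 * count / total_rows, 2)
    for column, count in missing_counts.items()
}
print(missing_pct)
```

On the full DataFrame, `titanic_df.isnull().mean() * 100` should give the same percentages for every column in one step.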
Target Variable = Survived
Numerical variables
titanic_df.describe()
| | age | sibsp | parch | fare |
|---|---|---|---|---|
| count | 1046.00 | 1309.00 | 1309.00 | 1308.00 | 
| mean | 29.88 | 0.50 | 0.39 | 33.30 | 
| std | 14.41 | 1.04 | 0.87 | 51.76 | 
| min | 0.17 | 0.00 | 0.00 | 0.00 | 
| 25% | 21.00 | 0.00 | 0.00 | 7.90 | 
| 50% | 28.00 | 0.00 | 0.00 | 14.45 | 
| 75% | 39.00 | 1.00 | 0.00 | 31.27 | 
| max | 80.00 | 8.00 | 9.00 | 512.33 | 
Categorical variables
# categorical columns description
titanic_df.describe(include="category")
| | pclass | sex | embarked |
|---|---|---|---|
| count | 1309 | 1309 | 1307 | 
| unique | 3 | 2 | 3 | 
| top | 3 | male | S | 
| freq | 709 | 843 | 914 | 
📈 Univariate Analysis
Target Variable
titanic_df["survived"].value_counts().plot(
    kind="bar", color=["skyblue", "orange"], title="Survived passengers"
);

Numerical Variables
# list of the numerical columns
numerical_columns = list(titanic_df.select_dtypes(include=["number"]).columns)
numerical_columns
['age', 'sibsp', 'parch', 'fare']
age
column = "age"
titanic_df[column].describe()
count   1046.00
mean      29.88
std       14.41
min        0.17
25%       21.00
50%       28.00
75%       39.00
max       80.00
Name: age, dtype: float64
# number of unique values
titanic_df[column].nunique()
98
titanic_df[column].plot(
    kind="hist", bins=30, edgecolor="black", title=f"{column} histogram"
);

titanic_df[column].plot(kind="density", title=f"{column} distribution");

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

sibsp
column = "sibsp"
titanic_df[column].describe()
count   1309.00
mean       0.50
std        1.04
min        0.00
25%        0.00
50%        0.00
75%        1.00
max        8.00
Name: sibsp, dtype: float64
# number of unique values
titanic_df[column].nunique()
7
# unique values (there are only a few)
titanic_df[column].unique()
array([0, 1, 2, 3, 4, 5, 8])
# histogram of a column with 7 unique values between 0 and 8
titanic_df[column].plot(
    kind="hist", bins=8, edgecolor="black", title=f"{column} histogram"
);

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

parch
column = "parch"
titanic_df[column].describe()
count   1309.00
mean       0.39
std        0.87
min        0.00
25%        0.00
50%        0.00
75%        0.00
max        9.00
Name: parch, dtype: float64
# number of unique values
titanic_df[column].nunique()
8
# unique values (there are only a few)
titanic_df[column].unique()
array([0, 2, 1, 4, 3, 5, 6, 9])
titanic_df[column].plot(
    kind="hist", bins=9, edgecolor="black", title=f"{column} histogram"
);

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

fare
column = "fare"
titanic_df[column].describe()
count   1308.00
mean      33.30
std       51.76
min        0.00
25%        7.90
50%       14.45
75%       31.27
max      512.33
Name: fare, dtype: float64
titanic_df[column].plot(kind="kde", title=f"{column} distribution");

titanic_df[column].plot(
    kind="hist", bins=50, edgecolor="black", title=f"{column} histogram"
);

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

Categorical Variables
# list of the categorical columns
categorical_columns = list(titanic_df.select_dtypes(include=["category"]).columns)
categorical_columns
['pclass', 'sex', 'embarked']
pclass
column = "pclass"
titanic_df[column].describe()
count     1309
unique       3
top          3
freq       709
Name: pclass, dtype: int64
titanic_df[column].unique()
[1, 2, 3]
Categories (3, int64): [3 < 2 < 1]
titanic_df[column].value_counts()
pclass
3    709
1    323
2    277
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
    kind="bar", color=["skyblue", "orange", "green"], title=f"{column} value counts"
);

sex
column = "sex"
titanic_df[column].describe()
count     1309
unique       2
top       male
freq       843
Name: sex, dtype: object
titanic_df[column].unique()
['female', 'male']
Categories (2, object): ['female', 'male']
titanic_df[column].value_counts()
sex
male      843
female    466
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
    kind="bar", color=["skyblue", "orange"], title=f"{column} value counts"
);

embarked
column = "embarked"
titanic_df[column].describe()
count     1307
unique       3
top          S
freq       914
Name: embarked, dtype: object
titanic_df[column].unique()
['S', 'C', NaN, 'Q']
Categories (3, object): ['C', 'Q', 'S']
titanic_df[column].value_counts()
embarked
S    914
C    270
Q    123
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
    kind="bar", color=["skyblue", "orange", "green"], title=f"{column} value counts"
);

String columns
titanic_df["name"].sample(5)
394                         Dibden, Mr. William
755                          Davies, Mr. Joseph
237                         Robbins, Mr. Victor
459                Jacobsohn, Mr. Sidney Samuel
1136    Rasmussen, Mrs. (Lena Jacobsen Solvang)
Name: name, dtype: object
titanic_df["name"].nunique()
1307
📈 Bivariate Analysis
It is important to check the relationship between the target variable and the other variables.
The target variable is Survived
Target vs Numerical Variables
survived vs age
variable = "age"
titanic_df.plot(
    kind="box",
    column=variable,
    by="survived",
    grid=False,
    title=f"Survived vs {variable} boxplot",
);

titanic_df[titanic_df["survived"] == 1][variable].plot(
    kind="kde",
    label="Survived",
    legend=True,
    title=f"Survived vs {variable} density plot",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
    kind="kde",
    label="Not Survived",
    legend=True,
    xlabel=variable,
    ylabel="Density",
);

survived vs sibsp
variable = "sibsp"
titanic_df.plot(
    kind="box",
    column=variable,
    by="survived",
    grid=False,
    title=f"Survived vs {variable} boxplot",
);

# overlaid histograms of survivors and non-survivors
titanic_df[titanic_df["survived"] == 1][variable].plot(
    kind="hist",
    label="Survived",
    legend=True,
    bins=8,
    alpha=0.5,
    title=f"Survived vs {variable} histogram",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
    kind="hist",
    label="Not Survived",
    legend=True,
    bins=8,
    alpha=0.3,
    xlabel=variable,
    ylabel="Frequency",
);

survived vs parch
variable = "parch"
titanic_df.plot(
    kind="box",
    column=variable,
    by="survived",
    grid=False,
    title=f"Survived vs {variable} boxplot",
);

# overlaid histograms of survivors and non-survivors
titanic_df[titanic_df["survived"] == 1][variable].plot(
    kind="hist",
    label="Survived",
    legend=True,
    bins=9,
    alpha=0.5,
    title=f"Survived vs {variable} histogram",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
    kind="hist",
    label="Not Survived",
    legend=True,
    bins=9,
    alpha=0.3,
    xlabel=variable,
    ylabel="Frequency",
);

survived vs fare
variable = "fare"
titanic_df.plot(
    kind="box",
    column=variable,
    by="survived",
    grid=False,
    title=f"Survived vs {variable} boxplot",
);

titanic_df[titanic_df["survived"] == 1][variable].plot(
    kind="kde",
    label="Survived",
    legend=True,
    title=f"Survived vs {variable} density plot",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
    kind="kde",
    label="Not Survived",
    legend=True,
    xlabel=variable,
    ylabel="Density",
);

Target vs Categorical Variables
survived vs pclass
column = "pclass"
(
    pd.crosstab(titanic_df[column], titanic_df["survived"], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption("Survived vs pclass Heatmap")
)
| survived | False | True | All | 
|---|---|---|---|
| pclass | |||
| 3 | 528 | 181 | 709 | 
| 2 | 158 | 119 | 277 | 
| 1 | 123 | 200 | 323 | 
| All | 809 | 500 | 1309 | 
titanic_df.groupby(column, observed=True).agg({"survived": "mean"}) * 100
| pclass | survived |
|---|---|
| 3 | 25.53 | 
| 2 | 42.96 | 
| 1 | 61.92 | 
2nd class passengers survived at about 1.7 times the rate of 3rd class passengers (42.96% vs 25.53%), and 1st class passengers had the highest rate (61.92%).
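The comparison between classes can be checked directly from the crosstab counts. The sketch below is only the arithmetic, with the counts copied from the table above and illustrative variable names.

```python
# counts copied from the crosstab: pclass -> (survived, total passengers)
counts = {3: (181, 709), 2: (119, 277), 1: (200, 323)}

# survival rate per class
rates = {
    pclass: survived / total for pclass, (survived, total) in counts.items()
}

# how many times higher the rate is compared with 3rd class
ratio_2nd_vs_3rd = rates[2] / rates[3]
ratio_1st_vs_3rd = rates[1] / rates[3]

print(f"2nd vs 3rd: {ratio_2nd_vs_3rd:.2f}x")
print(f"1st vs 3rd: {ratio_1st_vs_3rd:.2f}x")
```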
(
    pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
        kind="bar", stacked=True, title=f"Survived vs {column} stacked barplot"
    )
);

(
    pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
        kind="bar",
        title=f"Survived vs {column} barplot",
    )
);

survived vs sex
column = "sex"
(
    pd.crosstab(titanic_df[column], titanic_df["survived"], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"Survived vs {column} Heatmap")
)
| survived | False | True | All | 
|---|---|---|---|
| sex | |||
| female | 127 | 339 | 466 | 
| male | 682 | 161 | 843 | 
| All | 809 | 500 | 1309 | 
titanic_df.groupby(column, observed=True).agg({"survived": "mean"}) * 100
| sex | survived |
|---|---|
| female | 72.75 | 
| male | 19.10 | 
(
    pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
        kind="bar", stacked=True, title=f"Survived vs {column} stacked barplot"
    )
);

(
    pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
        kind="bar",
        title=f"Survived vs {column} barplot",
    )
);

survived vs embarked
column = "embarked"
(
    pd.crosstab(titanic_df[column], titanic_df["survived"], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"Survived vs {column} Heatmap")
)
| survived | False | True | All | 
|---|---|---|---|
| embarked | |||
| C | 120 | 150 | 270 | 
| Q | 79 | 44 | 123 | 
| S | 610 | 304 | 914 | 
| All | 809 | 498 | 1307 | 
titanic_df.groupby(column, observed=True).agg({"survived": "mean"}) * 100
| embarked | survived |
|---|---|
| C | 55.56 | 
| Q | 35.77 | 
| S | 33.26 | 
(
    pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
        kind="bar", stacked=True, title=f"Survived vs {column} stacked barplot"
    )
);

(
    pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
        kind="bar",
        title=f"Survived vs {column} barplot",
    )
);

Numerical vs Numerical Variables
age vs sibsp
# scatter plot of age vs sibsp
titanic_df.plot(
    kind="scatter",
    x="age",
    y="sibsp",
    title="age vs Sibsp scatter plot",
);

age vs parch
# scatter plot of age vs parch
titanic_df.plot(
    kind="scatter",
    x="age",
    y="parch",
    title="age vs Parch scatter plot",
);

age vs fare
# scatter plot of age vs fare
titanic_df.plot(
    kind="scatter",
    x="age",
    y="fare",
    title="age vs fare scatter plot",
);

sibsp vs parch
# scatter plot of sibsp vs parch
titanic_df.plot(
    kind="scatter",
    x="sibsp",
    y="parch",
    title="Sibsp vs parch scatter plot",
);

sibsp vs fare
# scatter plot of sibsp vs fare
titanic_df.plot(
    kind="scatter",
    x="sibsp",
    y="fare",
    title="Sibsp vs fare scatter plot",
);

parch vs fare
# scatter plot of parch vs fare
titanic_df.plot(
    kind="scatter",
    x="parch",
    y="fare",
    title="parch vs fare scatter plot",
);

Categorical vs Categorical Variables
pclass vs sex
column_1 = "pclass"
column_2 = "sex"
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"{column_1} vs {column_2} Heatmap")
)
| sex | female | male | All | 
|---|---|---|---|
| pclass | |||
| 3 | 216 | 493 | 709 | 
| 2 | 106 | 171 | 277 | 
| 1 | 144 | 179 | 323 | 
| All | 466 | 843 | 1309 | 
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
    )
);

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar",
        title=f"{column_1} vs {column_2} barplot",
    )
);

pclass vs embarked
column_1 = "pclass"
column_2 = "embarked"
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"{column_1} vs {column_2} Heatmap")
)
| embarked | C | Q | S | All | 
|---|---|---|---|---|
| pclass | ||||
| 3 | 101 | 113 | 495 | 709 | 
| 2 | 28 | 7 | 242 | 277 | 
| 1 | 141 | 3 | 177 | 321 | 
| All | 270 | 123 | 914 | 1307 | 
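Raw counts are hard to compare across ports because far more passengers embarked at S. As a rough sketch, the counts from the table above can be turned into column percentages, that is, the class mix within each port. The DataFrame is rebuilt by hand here so the example is self-contained.

```python
import pandas as pd

# counts copied from the crosstab above (rows: pclass, columns: embarked)
counts = pd.DataFrame(
    {"C": [101, 28, 141], "Q": [113, 7, 3], "S": [495, 242, 177]},
    index=pd.Index([3, 2, 1], name="pclass"),
)

# share of each class among the passengers of each port, in percent
class_mix_pct = counts / counts.sum() * 100
print(class_mix_pct.round(2))
```

On the raw data, `pd.crosstab(titanic_df["pclass"], titanic_df["embarked"], normalize="columns")` should produce the same proportions without the manual step.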
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
    )
);

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar",
        title=f"{column_1} vs {column_2} barplot",
    )
);

sex vs embarked
column_1 = "sex"
column_2 = "embarked"
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"{column_1} vs {column_2} Heatmap")
)
| embarked | C | Q | S | All | 
|---|---|---|---|---|
| sex | ||||
| female | 113 | 60 | 291 | 464 | 
| male | 157 | 63 | 623 | 843 | 
| All | 270 | 123 | 914 | 1307 | 
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
    )
);

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar",
        title=f"{column_1} vs {column_2} barplot",
    )
);

sex vs pclass
column_1 = "sex"
column_2 = "pclass"
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"{column_1} vs {column_2} Heatmap")
)
| pclass | 3 | 2 | 1 | All | 
|---|---|---|---|---|
| sex | ||||
| female | 216 | 106 | 144 | 466 | 
| male | 493 | 171 | 179 | 843 | 
| All | 709 | 277 | 323 | 1309 | 
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
    )
);

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar",
        title=f"{column_1} vs {column_2} barplot",
    )
);

embarked vs pclass
column_1 = "embarked"
column_2 = "pclass"
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"{column_1} vs {column_2} Heatmap")
)
| pclass | 3 | 2 | 1 | All | 
|---|---|---|---|---|
| embarked | ||||
| C | 101 | 28 | 141 | 270 | 
| Q | 113 | 7 | 3 | 123 | 
| S | 495 | 242 | 177 | 914 | 
| All | 709 | 277 | 321 | 1307 | 
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
    )
);

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar",
        title=f"{column_1} vs {column_2} barplot",
    )
);

embarked vs sex
column_1 = "embarked"
column_2 = "sex"
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
    .style.background_gradient(cmap="coolwarm")
    .set_caption(f"{column_1} vs {column_2} Heatmap")
)
| sex | female | male | All | 
|---|---|---|---|
| embarked | |||
| C | 113 | 157 | 270 | 
| Q | 60 | 63 | 123 | 
| S | 291 | 623 | 914 | 
| All | 464 | 843 | 1307 | 
(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
    )
);

(
    pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
        kind="bar",
        title=f"{column_1} vs {column_2} barplot",
    )
);

Categorical vs Numerical Variables
pclass vs age
column_cat = "pclass"
column_num = "age"
titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

pclass vs sibsp
column_num = "sibsp"
titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

pclass vs parch
column_num = "parch"
titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

pclass vs fare
column_num = "fare"
titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

sex vs age
column_cat = "sex"
column_num = "age"
titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

sex vs sibsp
column_num = "sibsp"
titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

sex vs parch
column_num = "parch"
titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

sex vs fare
column_num = "fare"
titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

embarked vs age
column_cat = "embarked"
column_num = "age"
titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

embarked vs sibsp
column_num = "sibsp"
titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

embarked vs parch
column_num = "parch"
titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

embarked vs fare
column_num = "fare"
titanic_df.plot(
    kind="box",
    column=column_num,
    by=column_cat,
    xlabel=column_cat,
    ylabel=column_num,
    title=f"{column_num} by {column_cat} boxplot",
    grid=False,
);

sns.barplot(
    data=titanic_df,
    x=column_cat,
    y=column_num,
    errorbar="ci",
    capsize=0.1,
    hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

📈 Multivariate Analysis
It is important to check the relationship between the target variable and multiple variables.
The target variable is Survived
Target vs age vs Sex
sns.boxplot(data=titanic_df, x="sex", y="age", hue="survived");

Target vs age vs pclass
sns.boxplot(data=titanic_df, x="pclass", y="age", hue="survived");

Target vs age vs embarked
sns.boxplot(data=titanic_df, x="embarked", y="age", hue="survived");

Target vs sex vs pclass
# survival rate (%) by class and sex
(
    titanic_df[["sex", "pclass", "survived"]]
    .groupby(["pclass", "sex"], observed=True)
    .mean()
    * 100
)
               survived
pclass sex
3      female     49.07
       male       15.21
2      female     88.68
       male       14.62
1      female     96.53
       male       34.08
Almost all Pclass 1 females survive (96.53%), as do most Pclass 2 females (88.68%). Pclass 3 female survival is about 49%, and most Pclass 3 males (about 85%) unfortunately do not survive.
sns.catplot(
    data=titanic_df,
    x="sex",
    hue="survived",
    col="pclass",
    kind="count",
    height=4,
    aspect=0.7,
);

sns.catplot(data=titanic_df, x="sex", y="survived", hue="pclass", kind="point");

sns.catplot(
    data=titanic_df,
    x="pclass",
    y="survived",
    hue="sex",
    palette={"male": "b", "female": "m"},
    markers=["^", "o"],
    linestyles=["-", "--"],
    kind="point",
);

sns.catplot(data=titanic_df, x="sex", y="survived", hue="pclass", kind="bar");

Target vs pclass vs fare
sns.catplot(
    x="fare",
    y="survived",
    row="pclass",
    kind="box",
    hue="survived",
    orient="h",
    height=1.5,
    aspect=4,
    data=titanic_df.query("fare > 0"),
).set(xscale="log");

Target vs sex vs pclass vs embarked
(
    titanic_df[["embarked", "sex", "pclass", "survived"]]
    .groupby(["embarked", "pclass", "sex"], observed=True)
    .agg(["count", "mean"])
)
| embarked | pclass | sex | survived count | survived mean |
|---|---|---|---|---|
| C | 3 | female | 31 | 0.71 |
| C | 3 | male | 70 | 0.21 |
| C | 2 | female | 11 | 1.00 |
| C | 2 | male | 17 | 0.29 |
| C | 1 | female | 71 | 0.97 |
| C | 1 | male | 70 | 0.40 |
| Q | 3 | female | 56 | 0.59 |
| Q | 3 | male | 57 | 0.12 |
| Q | 2 | female | 2 | 1.00 |
| Q | 2 | male | 5 | 0.00 |
| Q | 1 | female | 2 | 1.00 |
| Q | 1 | male | 1 | 0.00 |
| S | 3 | female | 129 | 0.40 |
| S | 3 | male | 366 | 0.14 |
| S | 2 | female | 93 | 0.87 |
| S | 2 | male | 149 | 0.13 |
| S | 1 | female | 69 | 0.96 |
| S | 1 | male | 108 | 0.31 |
Numerical vs All Numerical Variables
numerical_columns
['age', 'sibsp', 'parch', 'fare']
sns.pairplot(titanic_df[numerical_columns], diag_kind="kde");

df = titanic_df[[*numerical_columns, "survived"]]
sns.pairplot(df, hue="survived", diag_kind="kde");

📏 Heuristics
def calculate_survival_percentage(df, pclass, age_min, age_max=None) -> float:
    """Calculate the percentage of survivors based on the class and age range.
    Both age bounds are exclusive: age_min < age < age_max.
    Args:
        df: DataFrame with the Titanic dataset
        pclass: int with the class of the passenger
        age_min: int with the minimum age of the passenger (exclusive)
        age_max: int with the maximum age of the passenger (exclusive);
            if None, no upper bound is applied
    Returns:
        float with the percentage of survivors
    """
    # rows of the requested class above the minimum age
    mask = (df["pclass"] == pclass) & (df["age"] > age_min)
    # compare against None so that an upper bound of 0 is not ignored
    if age_max is not None:
        mask &= df["age"] < age_max
    # the mean of a boolean column is the fraction of True values
    return round(float(df.loc[mask, "survived"].mean()) * 100, 1)
# calculate and show the survival percentages
print(
    "Pclass 1 survivors above age 60:",
    calculate_survival_percentage(titanic_df, 1, 59),
    "%",
)
print(
    "Pclass 2 survivors above age 60:",
    calculate_survival_percentage(titanic_df, 2, 59),
    "%",
)
print(
    "Pclass 3 survivors above age 60:",
    calculate_survival_percentage(titanic_df, 3, 59),
    "%",
)
print(
    "Pclass 1 survivors between 20-30 age:",
    calculate_survival_percentage(titanic_df, 1, 19, 31),
    "%",
)
print(
    "Pclass 2 survivors between 20-30 age:",
    calculate_survival_percentage(titanic_df, 2, 19, 31),
    "%",
)
print(
    "Pclass 3 survivors between 20-30 age:",
    calculate_survival_percentage(titanic_df, 3, 19, 31),
    "%",
)
Pclass 1 survivors above age 60: 38.5 %
Pclass 2 survivors above age 60: 12.5 %
Pclass 3 survivors above age 60: 16.7 %
Pclass 1 survivors between 20-30 age: 69.8 %
Pclass 2 survivors between 20-30 age: 41.0 %
Pclass 3 survivors between 20-30 age: 25.2 %
📊 Analysis of Results and Conclusions
- Almost all Pclass 1 females survived (96.53%), as did most Pclass 2 females (88.68%). Pclass 3 female survival was about 49%, and about 85% of Pclass 3 males did not survive
- 2nd class passengers survived at about 1.7 times the rate of 3rd class passengers (42.96% vs 25.53%), and 1st class passengers had the highest rate (61.92%)
- The fare is highly correlated with the class
- The age is not correlated with the pclass
- The age is not correlated with the fare
- The Survival rate among passengers who embarked in Cherbourg is higher than the others
- The Survival rate among females is higher
- The Survival rate among 1st class passengers is higher
- The Survival rate among age 0-10 is higher
- The two variables that have the highest correlation with the target variable are sex and pclass
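The notebook does not state which association measure backs the last point. One possible check, assuming Cramér's V (computed from the chi-square statistic, without continuity correction) is an acceptable measure for categorical variables, is sketched below. The counts are copied from the Target vs Categorical crosstabs above; the helper `cramers_v` is a hypothetical name written for this example, not a pandas or seaborn function.

```python
import numpy as np


def cramers_v(table) -> float:
    """Cramér's V for a contingency table of observed counts."""
    observed = np.asarray(table, dtype=float)
    n = observed.sum()
    # expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    # normalize by the smaller table dimension minus one
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# rows: categories, columns: (not survived, survived)
sex_table = [[127, 339], [682, 161]]                     # female, male
pclass_table = [[528, 181], [158, 119], [123, 200]]      # 3rd, 2nd, 1st
embarked_table = [[120, 150], [79, 44], [610, 304]]      # C, Q, S

v_sex = cramers_v(sex_table)
v_pclass = cramers_v(pclass_table)
v_embarked = cramers_v(embarked_table)

print(f"sex: {v_sex:.2f}, pclass: {v_pclass:.2f}, embarked: {v_embarked:.2f}")
```

Under this measure sex shows the strongest association with survival, followed by pclass and then embarked, which is consistent with the ordering stated in the conclusions.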
💡 Proposals and Ideas
For the first baseline model, we will use the following features:
- sex
- pclass
- age
The baseline can be built with heuristics or with a simple model such as a decision tree.
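To give a sense of what a heuristic baseline might achieve, the sketch below scores one very simple rule, "predict survived for every female and not survived for every male", using the counts from the survived vs sex crosstab. The rule is an assumption chosen for illustration, not the baseline selected by the author, and the accuracy is measured on the same data used to derive it.

```python
# counts copied from the survived vs sex crosstab
female_survived, female_died = 339, 127
male_survived, male_died = 161, 682

total = female_survived + female_died + male_survived + male_died

# the rule is right for surviving females and for males who died
correct = female_survived + male_died
accuracy = correct / total

# reference point: always predicting the majority class (not survived)
majority_accuracy = (female_died + male_died) / total

print(f"sex rule accuracy: {accuracy:.3f}")
print(f"majority class accuracy: {majority_accuracy:.3f}")
```

Any model built later, such as a decision tree on sex, pclass, and age, would need to beat this figure on held-out data to justify its extra complexity.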
📖 References
Visualization
- https://joserzapata.github.io/courses/python-ciencia-datos/visualizacion/#visualizacion-con-pandas
- https://joserzapata.github.io/courses/python-ciencia-datos/visualizacion/plotly/
- https://joserzapata.github.io/courses/python-ciencia-datos/visualizacion/seaborn/
EDA
- https://www.analyticsvidhya.com/blog/2022/02/a-quick-guide-to-bivariate-analysis-in-python/
- https://www.kaggle.com/code/allohvk/titanic-advanced-eda
- https://www.kaggle.com/code/imkushwaha/bivariate-multivariate-analysis
Statistical tests