By Jose R. Zapata
Last updated: 21/Nov/2024
Description
Exploratory data analysis (EDA) and description of the data set.
Data manipulation and visualization
📚 Import libraries
# base libraries for data science
import sys
from pathlib import Path
import pandas as pd
import seaborn as sns
# configuration to show only 2 decimal places
pd.set_option("display.float_format", "{:.2f}".format)
# print library version for reproducibility
print("Python version: ", sys.version)
print("Pandas version: ", pd.__version__)
Python version: 3.11.10 (main, Sep 27 2024, 20:27:21) [GCC 11.4.0]
Pandas version: 2.1.4
💾 Load data
The dataset has correct data types, fixed in:
notebooks/2-exploration/01-jrz-data_explore_description-2024_03_01.ipynb
DATA_DIR = Path.cwd().resolve().parents[1] / "data"
titanic_df = pd.read_parquet(
DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)
📊 Data description
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1309 non-null int64
1 pclass 1309 non-null int64
2 survived 1309 non-null bool
3 name 1309 non-null object
4 sex 1309 non-null category
5 age 1046 non-null float64
6 sibsp 1309 non-null int64
7 parch 1309 non-null int64
8 fare 1308 non-null float64
9 embarked 1307 non-null category
dtypes: bool(1), category(2), float64(2), int64(4), object(1)
memory usage: 75.8+ KB
Ordinal data has to be converted again.
Information about the pclass column can be checked in the notebook:
notebooks/2-exploration/01-jrz-data_explore_description-2024_03_01.ipynb
titanic_df["pclass"] = pd.Categorical(
titanic_df["pclass"], categories=[3, 2, 1], ordered=True
)
# column Unnamed: 0 is not needed
titanic_df = titanic_df.drop(columns=["Unnamed: 0"])
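The category order matters because pandas compares ordered categoricals by their position in `categories`, not by numeric value. A minimal sketch on a small hypothetical series (the values are made up, not taken from the dataset) suggests how `categories=[3, 2, 1]` makes first class rank highest:

```python
import pandas as pd

# hypothetical sample of passenger classes, not the real dataset
pclass = pd.Series(
    pd.Categorical([1, 3, 2, 3], categories=[3, 2, 1], ordered=True)
)

# with the order 3 < 2 < 1, the "maximum" is first class
lowest_class = pclass.min()
highest_class = pclass.max()

# comparisons follow the category order: True for 2nd and 1st class
better_than_third = (pclass > 3).tolist()

print(lowest_class, highest_class, better_than_third)
```

With the reverse order (`categories=[1, 2, 3]`) the same comparison would select no rows, so the order should match the meaning intended for "higher class".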
General information about the data set:
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null category
1 survived 1309 non-null bool
2 name 1309 non-null object
3 sex 1309 non-null category
4 age 1046 non-null float64
5 sibsp 1309 non-null int64
6 parch 1309 non-null int64
7 fare 1308 non-null float64
8 embarked 1307 non-null category
dtypes: bool(1), category(3), float64(2), int64(2), object(1)
memory usage: 56.8+ KB
# size of the dataframe
titanic_df.shape
(1309, 9)
# sample of the dataframe
titanic_df.sample(5)
index | pclass | survived | name | sex | age | sibsp | parch | fare | embarked |
---|---|---|---|---|---|---|---|---|---|
58 | 1 | False | Case, Mr. Howard Brown | male | 49.00 | 0 | 0 | 26.00 | S |
736 | 3 | False | Coxon, Mr. Daniel | male | 59.00 | 0 | 0 | 7.25 | S |
523 | 2 | True | Oxenham, Mr. Percy Thomas | male | 22.00 | 0 | 0 | 10.50 | S |
1073 | 3 | False | O'Connor, Mr. Maurice | male | NaN | 0 | 0 | 7.75 | Q |
518 | 2 | False | Nicholls, Mr. Joseph Charles | male | 19.00 | 1 | 1 | 36.75 | S |
Number of missing values
titanic_df.isnull().sum()
pclass 0
survived 0
name 0
sex 0
age 263
sibsp 0
parch 0
fare 1
embarked 2
dtype: int64
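Raw counts are easier to judge as a share of rows (for example, 263 missing ages out of 1309 rows is roughly 20%). The sketch below shows one way to compute that share. It uses a small hypothetical frame named `toy_df` so that it runs on its own; in this notebook the same expression could be applied to `titanic_df`.

```python
import pandas as pd

# hypothetical toy frame standing in for titanic_df
toy_df = pd.DataFrame(
    {
        "age": [22.0, None, 35.0, None],
        "fare": [7.25, 71.28, None, 8.05],
        "sex": ["male", "female", "female", "male"],
    }
)

# isnull() yields booleans, so the column mean is the missing fraction
missing_pct = toy_df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```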
Target Variable = Survived
Numerical variables
titanic_df.describe()
statistic | age | sibsp | parch | fare |
---|---|---|---|---|
count | 1046.00 | 1309.00 | 1309.00 | 1308.00 |
mean | 29.88 | 0.50 | 0.39 | 33.30 |
std | 14.41 | 1.04 | 0.87 | 51.76 |
min | 0.17 | 0.00 | 0.00 | 0.00 |
25% | 21.00 | 0.00 | 0.00 | 7.90 |
50% | 28.00 | 0.00 | 0.00 | 14.45 |
75% | 39.00 | 1.00 | 0.00 | 31.27 |
max | 80.00 | 8.00 | 9.00 | 512.33 |
Categorical variables
# categorical columns description
titanic_df.describe(include="category")
statistic | pclass | sex | embarked |
---|---|---|---|
count | 1309 | 1309 | 1307 |
unique | 3 | 2 | 3 |
top | 3 | male | S |
freq | 709 | 843 | 914 |
📈 Univariate Analysis
Target Variable
titanic_df["survived"].value_counts().plot(
kind="bar", color=["skyblue", "orange"], title="Survived passengers"
);
Numerical Variables
# list of the numerical columns
numerical_columns = list(titanic_df.select_dtypes(include=["number"]).columns)
numerical_columns
['age', 'sibsp', 'parch', 'fare']
age
column = "age"
titanic_df[column].describe()
count 1046.00
mean 29.88
std 14.41
min 0.17
25% 21.00
50% 28.00
75% 39.00
max 80.00
Name: age, dtype: float64
# number of unique values
titanic_df[column].nunique()
98
titanic_df[column].plot(
kind="hist", bins=30, edgecolor="black", title=f"{column} histogram"
);
titanic_df[column].plot(kind="density", title=f"{column} distribution");
titanic_df[column].plot(kind="box", title=f"{column} boxplot");
sibsp
column = "sibsp"
titanic_df[column].describe()
count 1309.00
mean 0.50
std 1.04
min 0.00
25% 0.00
50% 0.00
75% 1.00
max 8.00
Name: sibsp, dtype: float64
# number of unique values
titanic_df[column].nunique()
7
# unique values (there are only a few)
titanic_df[column].unique()
array([0, 1, 2, 3, 4, 5, 8])
# histogram of a column with 7 unique values between 0 and 8
titanic_df[column].plot(
kind="hist", bins=8, edgecolor="black", title=f"{column} histogram"
);
titanic_df[column].plot(kind="box", title=f"{column} boxplot");
parch
column = "parch"
titanic_df[column].describe()
count 1309.00
mean 0.39
std 0.87
min 0.00
25% 0.00
50% 0.00
75% 0.00
max 9.00
Name: parch, dtype: float64
# number of unique values
titanic_df[column].nunique()
8
# unique values (there are only a few)
titanic_df[column].unique()
array([0, 2, 1, 4, 3, 5, 6, 9])
titanic_df[column].plot(
kind="hist", bins=9, edgecolor="black", title=f"{column} histogram"
);
titanic_df[column].plot(kind="box", title=f"{column} boxplot");
fare
column = "fare"
titanic_df[column].describe()
count 1308.00
mean 33.30
std 51.76
min 0.00
25% 7.90
50% 14.45
75% 31.27
max 512.33
Name: fare, dtype: float64
titanic_df[column].plot(kind="kde", title=f"{column} distribution");
titanic_df[column].plot(
kind="hist", bins=50, edgecolor="black", title=f"{column} histogram"
);
titanic_df[column].plot(kind="box", title=f"{column} boxplot");
Categorical Variables
# list of the categorical columns
categorical_columns = list(titanic_df.select_dtypes(include=["category"]).columns)
categorical_columns
['pclass', 'sex', 'embarked']
pclass
column = "pclass"
titanic_df[column].describe()
count 1309
unique 3
top 3
freq 709
Name: pclass, dtype: int64
titanic_df[column].unique()
[1, 2, 3]
Categories (3, int64): [3 < 2 < 1]
titanic_df[column].value_counts()
pclass
3 709
1 323
2 277
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
kind="bar", color=["skyblue", "orange", "green"], title=f"{column} value counts"
);
Sex
column = "sex"
titanic_df[column].describe()
count 1309
unique 2
top male
freq 843
Name: sex, dtype: object
titanic_df[column].unique()
['female', 'male']
Categories (2, object): ['female', 'male']
titanic_df[column].value_counts()
sex
male 843
female 466
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
kind="bar", color=["skyblue", "orange"], title=f"{column} value counts"
);
embarked
column = "embarked"
titanic_df[column].describe()
count 1307
unique 3
top S
freq 914
Name: embarked, dtype: object
titanic_df[column].unique()
['S', 'C', NaN, 'Q']
Categories (3, object): ['C', 'Q', 'S']
titanic_df[column].value_counts()
embarked
S 914
C 270
Q 123
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
kind="bar", color=["skyblue", "orange", "green"], title=f"{column} value counts"
);
String columns
titanic_df["name"].sample(5)
394 Dibden, Mr. William
755 Davies, Mr. Joseph
237 Robbins, Mr. Victor
459 Jacobsohn, Mr. Sidney Samuel
1136 Rasmussen, Mrs. (Lena Jacobsen Solvang)
Name: name, dtype: object
titanic_df["name"].nunique()
1307
📈 Bivariate Analysis
It is important to check the relationship between the target variable and the other variables.
The target variable is Survived
Target vs Numerical Variables
survived vs age
variable = "age"
titanic_df.plot(
kind="box",
column=variable,
by="survived",
grid=False,
title=f"Survived vs {variable} boxplot",
);
titanic_df[titanic_df["survived"] == 1][variable].plot(
kind="kde",
label="Survived",
legend=True,
title=f"Survived vs {variable} density plot",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
kind="kde",
label="Not Survived",
legend=True,
xlabel=variable,
ylabel="Density",
);
survived vs sibsp
variable = "sibsp"
titanic_df.plot(
kind="box",
column=variable,
by="survived",
grid=False,
title=f"Survived vs {variable} boxplot",
);
titanic_df[titanic_df["survived"] == 1][variable].plot(
kind="hist",
label="Survived",
legend=True,
bins=8,
alpha=0.5,
title=f"Survived vs {variable} histogram",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
kind="hist",
label="Not Survived",
legend=True,
bins=8,
alpha=0.3,
xlabel=variable,
ylabel="Frequency",
);
survived vs parch
variable = "parch"
titanic_df.plot(
kind="box",
column=variable,
by="survived",
grid=False,
title=f"Survived vs {variable} boxplot",
);
titanic_df[titanic_df["survived"] == 1][variable].plot(
kind="hist",
label="Survived",
legend=True,
bins=9,
alpha=0.5,
title=f"Survived vs {variable} histogram",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
kind="hist",
label="Not Survived",
legend=True,
bins=9,
alpha=0.3,
xlabel=variable,
ylabel="Frequency",
);
survived vs fare
variable = "fare"
titanic_df.plot(
kind="box",
column=variable,
by="survived",
grid=False,
title=f"Survived vs {variable} boxplot",
);
titanic_df[titanic_df["survived"] == 1][variable].plot(
kind="kde",
label="Survived",
legend=True,
title=f"Survived vs {variable} density plot",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
kind="kde",
label="Not Survived",
legend=True,
xlabel=variable,
ylabel="Density",
);
Target vs Categorical Variables
survived vs pclass
column = "pclass"
(
pd.crosstab(titanic_df[column], titanic_df["survived"], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption("Survived vs pclass Heatmap")
)
pclass / survived | False | True | All |
---|---|---|---|
3 | 528 | 181 | 709 |
2 | 158 | 119 | 277 |
1 | 123 | 200 | 323 |
All | 809 | 500 | 1309 |
titanic_df.groupby(column, observed=True).agg({"survived": "mean"}) * 100
pclass | survived (%) |
---|---|
3 | 25.53 |
2 | 42.96 |
1 | 61.92 |
2nd class passengers survived at about 1.7 times the rate of 3rd class passengers (43% vs 26%), and 1st class passengers had the highest rate (62%).
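The comparison between classes can be checked directly from the crosstab counts above. The following is a small sketch in plain Python; the variable names are illustrative and the counts are copied from the table:

```python
# (not survived, survived) counts copied from the crosstab above
counts = {3: (528, 181), 2: (158, 119), 1: (123, 200)}

# survival rate per class as a fraction of passengers in that class
rates = {
    pclass: survived / (died + survived)
    for pclass, (died, survived) in counts.items()
}

# ratio of each class's survival rate to the 3rd class rate
ratios_vs_third = {pclass: rate / rates[3] for pclass, rate in rates.items()}

for pclass in (3, 2, 1):
    print(
        f"pclass {pclass}: rate={rates[pclass]:.2%}, "
        f"ratio vs 3rd={ratios_vs_third[pclass]:.2f}"
    )
```

The 2nd class rate comes out at roughly 1.7 times the 3rd class rate, and the 1st class rate at roughly 2.4 times.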
(
pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
kind="bar", stacked=True, title=f"Survived vs {column} stacked barplot"
)
);
(
pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
kind="bar",
title=f"Survived vs {column} barplot",
)
);
survived vs sex
column = "sex"
(
pd.crosstab(titanic_df[column], titanic_df["survived"], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"Survived vs {column} Heatmap")
)
sex / survived | False | True | All |
---|---|---|---|
female | 127 | 339 | 466 |
male | 682 | 161 | 843 |
All | 809 | 500 | 1309 |
titanic_df.groupby(column, observed=True).agg({"survived": "mean"}) * 100
sex | survived (%) |
---|---|
female | 72.75 |
male | 19.10 |
(
pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
kind="bar", stacked=True, title=f"Survived vs {column} stacked barplot"
)
);
(
pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
kind="bar",
title=f"Survived vs {column} barplot",
)
);
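To go beyond visual inspection, the association between sex and survival could be checked with a chi-square test of independence. The sketch below is one possible approach, not part of the original analysis: it computes the Pearson statistic by hand from the crosstab counts above, without a continuity correction, and uses the closed form of the p-value that holds for 1 degree of freedom.

```python
import math

# observed counts from the sex vs survived crosstab above:
# rows = (female, male), columns = (not survived, survived)
observed = [[127, 339], [682, 161]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square statistic (no continuity correction)
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

# a 2x2 table has 1 degree of freedom, where the chi-square
# survival function reduces to erfc(sqrt(chi2 / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2={chi2:.1f}, p-value={p_value:.3g}")
```

A very small p-value would be consistent with the large gap between the female and male survival rates shown above.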
survived vs embarked
column = "embarked"
(
pd.crosstab(titanic_df[column], titanic_df["survived"], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"Survived vs {column} Heatmap")
)
embarked / survived | False | True | All |
---|---|---|---|
C | 120 | 150 | 270 |
Q | 79 | 44 | 123 |
S | 610 | 304 | 914 |
All | 809 | 498 | 1307 |
titanic_df.groupby(column, observed=True).agg({"survived": "mean"}) * 100
embarked | survived (%) |
---|---|
C | 55.56 |
Q | 35.77 |
S | 33.26 |
(
pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
kind="bar", stacked=True, title=f"Survived vs {column} stacked barplot"
)
);
(
pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
kind="bar",
title=f"Survived vs {column} barplot",
)
);
Numerical vs Numerical Variables
age vs sibsp
# scatter plot of age vs sibsp
titanic_df.plot(
kind="scatter",
x="age",
y="sibsp",
title="age vs Sibsp scatter plot",
);
age vs parch
# scatter plot of age vs parch
titanic_df.plot(
kind="scatter",
x="age",
y="parch",
title="age vs Parch scatter plot",
);
age vs fare
# scatter plot of age vs fare
titanic_df.plot(
kind="scatter",
x="age",
y="fare",
title="age vs fare scatter plot",
);
sibsp vs parch
# scatter plot of sibsp vs parch
titanic_df.plot(
kind="scatter",
x="sibsp",
y="parch",
title="Sibsp vs parch scatter plot",
);
sibsp vs fare
# scatter plot of sibsp vs fare
titanic_df.plot(
kind="scatter",
x="sibsp",
y="fare",
title="Sibsp vs fare scatter plot",
);
parch vs fare
# scatter plot of parch vs fare
titanic_df.plot(
kind="scatter",
x="parch",
y="fare",
title="parch vs fare scatter plot",
);
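Scatter plots give a visual impression only; a correlation matrix can put a number on each pair. The sketch below uses a hypothetical two-column frame (`toy_df`, made-up values) to contrast Pearson and Spearman coefficients. In this notebook the same call could presumably be applied to `titanic_df[numerical_columns]`.

```python
import pandas as pd

# hypothetical toy data: y grows monotonically but not linearly with x
toy_df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25]})

# Pearson measures linear association, Spearman measures monotonic association
pearson = toy_df.corr(method="pearson").loc["x", "y"]
spearman = toy_df.corr(method="spearman").loc["x", "y"]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

Spearman may be the safer choice for skewed columns such as fare, because it depends only on ranks.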
Categorical vs Categorical Variables
pclass vs sex
column_1 = "pclass"
column_2 = "sex"
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"{column_1} vs {column_2} Heatmap")
)
pclass / sex | female | male | All |
---|---|---|---|
3 | 216 | 493 | 709 |
2 | 106 | 171 | 277 |
1 | 144 | 179 | 323 |
All | 466 | 843 | 1309 |
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
)
);
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar",
title=f"{column_1} vs {column_2} barplot",
)
);
pclass vs embarked
column_1 = "pclass"
column_2 = "embarked"
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"{column_1} vs {column_2} Heatmap")
)
pclass / embarked | C | Q | S | All |
---|---|---|---|---|
3 | 101 | 113 | 495 | 709 |
2 | 28 | 7 | 242 | 277 |
1 | 141 | 3 | 177 | 321 |
All | 270 | 123 | 914 | 1307 |
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
)
);
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar",
title=f"{column_1} vs {column_2} barplot",
)
);
sex vs embarked
column_1 = "sex"
column_2 = "embarked"
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"{column_1} vs {column_2} Heatmap")
)
sex / embarked | C | Q | S | All |
---|---|---|---|---|
female | 113 | 60 | 291 | 464 |
male | 157 | 63 | 623 | 843 |
All | 270 | 123 | 914 | 1307 |
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
)
);
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar",
title=f"{column_1} vs {column_2} barplot",
)
);
sex vs pclass
column_1 = "sex"
column_2 = "pclass"
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"{column_1} vs {column_2} Heatmap")
)
sex / pclass | 3 | 2 | 1 | All |
---|---|---|---|---|
female | 216 | 106 | 144 | 466 |
male | 493 | 171 | 179 | 843 |
All | 709 | 277 | 323 | 1309 |
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
)
);
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar",
title=f"{column_1} vs {column_2} barplot",
)
);
embarked vs pclass
column_1 = "embarked"
column_2 = "pclass"
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"{column_1} vs {column_2} Heatmap")
)
embarked / pclass | 3 | 2 | 1 | All |
---|---|---|---|---|
C | 101 | 28 | 141 | 270 |
Q | 113 | 7 | 3 | 123 |
S | 495 | 242 | 177 | 914 |
All | 709 | 277 | 321 | 1307 |
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
)
);
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar",
title=f"{column_1} vs {column_2} barplot",
)
);
embarked vs sex
column_1 = "embarked"
column_2 = "sex"
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"{column_1} vs {column_2} Heatmap")
)
embarked / sex | female | male | All |
---|---|---|---|
C | 113 | 157 | 270 |
Q | 60 | 63 | 123 |
S | 291 | 623 | 914 |
All | 464 | 843 | 1307 |
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
)
);
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar",
title=f"{column_1} vs {column_2} barplot",
)
);
Categorical vs Numerical Variables
pclass vs age
column_cat = "pclass"
column_num = "age"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);
sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");
pclass vs sibsp
column_num = "sibsp"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);
sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");
pclass vs parch
column_num = "parch"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);
sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");
pclass vs fare
column_num = "fare"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);
sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");
sex vs age
column_cat = "sex"
column_num = "age"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);
sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");
sex vs sibsp
column_num = "sibsp"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);
sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");
sex vs parch
column_num = "parch"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);
sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");
sex vs fare
column_num = "fare"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);
sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");
embarked vs age
column_cat = "embarked"
column_num = "age"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);
sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");
embarked vs sibsp
column_num = "sibsp"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);
sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");
embarked vs parch
column_num = "parch"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);
sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");
embarked vs fare
column_num = "fare"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);
sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");
📈 Multivariate Analysis
It is important to check the relationship between the target variable and multiple variables.
The target variable is Survived
Target vs age vs Sex
sns.boxplot(data=titanic_df, x="sex", y="age", hue="survived");
Target vs age vs pclass
sns.boxplot(data=titanic_df, x="pclass", y="age", hue="survived");
Target vs age vs embarked
sns.boxplot(data=titanic_df, x="embarked", y="age", hue="survived");
Target vs sex vs pclass
(
titanic_df[["sex", "pclass", "survived"]]
.groupby(["pclass", "sex"], observed=True)
.mean()
* 100
)
pclass | sex | survived (%) |
---|---|---|
3 | female | 49.07 |
3 | male | 15.21 |
2 | female | 88.68 |
2 | male | 14.62 |
1 | female | 96.53 |
1 | male | 34.08 |
Almost all Pclass 1 females survived (96.5%), as did most Pclass 2 females (88.7%). Pclass 3 female survival was about 49%, and about 85% of Pclass 3 males did not survive.
sns.catplot(
data=titanic_df,
x="sex",
hue="survived",
col="pclass",
kind="count",
height=4,
aspect=0.7,
);
sns.catplot(data=titanic_df, x="sex", y="survived", hue="pclass", kind="point");
sns.catplot(
data=titanic_df,
x="pclass",
y="survived",
hue="sex",
palette={"male": "b", "female": "m"},
markers=["^", "o"],
linestyles=["-", "--"],
kind="point",
);
sns.catplot(data=titanic_df, x="sex", y="survived", hue="pclass", kind="bar");
Target vs pclass vs fare
sns.catplot(
x="fare",
y="survived",
row="pclass",
kind="box",
hue="survived",
orient="h",
height=1.5,
aspect=4,
data=titanic_df.query("fare > 0"),
).set(xscale="log");
Target vs sex vs pclass vs embarked
(
titanic_df[["embarked", "sex", "pclass", "survived"]]
.groupby(["embarked", "pclass", "sex"], observed=True)
.agg(["count", "mean"])
)
embarked | pclass | sex | survived count | survived mean |
---|---|---|---|---|
C | 3 | female | 31 | 0.71 |
C | 3 | male | 70 | 0.21 |
C | 2 | female | 11 | 1.00 |
C | 2 | male | 17 | 0.29 |
C | 1 | female | 71 | 0.97 |
C | 1 | male | 70 | 0.40 |
Q | 3 | female | 56 | 0.59 |
Q | 3 | male | 57 | 0.12 |
Q | 2 | female | 2 | 1.00 |
Q | 2 | male | 5 | 0.00 |
Q | 1 | female | 2 | 1.00 |
Q | 1 | male | 1 | 0.00 |
S | 3 | female | 129 | 0.40 |
S | 3 | male | 366 | 0.14 |
S | 2 | female | 93 | 0.87 |
S | 2 | male | 149 | 0.13 |
S | 1 | female | 69 | 0.96 |
S | 1 | male | 108 | 0.31 |
Numerical vs All Numerical Variables
numerical_columns
['age', 'sibsp', 'parch', 'fare']
sns.pairplot(titanic_df[numerical_columns], diag_kind="kde");
df = titanic_df[[*numerical_columns, "survived"]]
sns.pairplot(df, hue="survived", diag_kind="kde");
📏 Heuristics
def calculate_survival_percentage(df, pclass, age_min, age_max=None) -> float:
    """Calculate the percentage of survivors based on the class and age range.

    Args:
        df: DataFrame with the Titanic dataset
        pclass: int with the class of the passenger
        age_min: exclusive lower bound for the age of the passenger
        age_max: optional exclusive upper bound for the age of the passenger

    Returns:
        float with the percentage of survivors, rounded to 1 decimal
    """
    # passengers of the class whose age is strictly above age_min
    mask = (df["pclass"] == pclass) & (df["age"] > age_min)
    # keep only ages strictly below age_max when an upper bound is given
    if age_max is not None:
        mask &= df["age"] < age_max
    survivors = int((mask & df["survived"]).sum())
    total = int(mask.sum())
    return round((survivors / total) * 100, 1)
# Calculate and display the survival percentages
print(
"Pclass 1 survivors above age 60:",
calculate_survival_percentage(titanic_df, 1, 59),
"%",
)
print(
"Pclass 2 survivors above age 60:",
calculate_survival_percentage(titanic_df, 2, 59),
"%",
)
print(
"Pclass 3 survivors above age 60:",
calculate_survival_percentage(titanic_df, 3, 59),
"%",
)
print(
"Pclass 1 survivors between 20-30 age:",
calculate_survival_percentage(titanic_df, 1, 19, 31),
"%",
)
print(
"Pclass 2 survivors between 20-30 age:",
calculate_survival_percentage(titanic_df, 2, 19, 31),
"%",
)
print(
"Pclass 3 survivors between 20-30 age:",
calculate_survival_percentage(titanic_df, 3, 19, 31),
"%",
)
Pclass 1 survivors above age 60: 38.5 %
Pclass 2 survivors above age 60: 12.5 %
Pclass 3 survivors above age 60: 16.7 %
Pclass 1 survivors between 20-30 age: 69.8 %
Pclass 2 survivors between 20-30 age: 41.0 %
Pclass 3 survivors between 20-30 age: 25.2 %
📊 Analysis of Results and Conclusions
- Almost all Pclass 1 females survived (96.5%), as did most Pclass 2 females (88.7%). Pclass 3 female survival was about 49%, and about 85% of Pclass 3 males did not survive
- 2nd class passengers survived at about 1.7 times the rate of 3rd class passengers (43% vs 26%), and 1st class passengers had the highest rate (62%)
- The fare is highly correlated with the class
- The age is not correlated with the pclass
- The age is not correlated with the fare
- The survival rate among passengers who embarked in Cherbourg (C, 55.6%) is higher than for the other ports (Q 35.8%, S 33.3%)
- The survival rate among females (72.8%) is higher than among males (19.1%)
- The survival rate among 1st class passengers (61.9%) is the highest of the three classes
- The survival rate among age 0-10 is higher
- The two variables that have the highest correlation with the target variable are sex and pclass
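The statement about ages 0-10 is not backed by a table in this notebook. One way to check it would be to bin the age column and compute the survival rate per band. The sketch below uses a hypothetical toy frame and assumed band edges; both are illustrative only and would need to be replaced with `titanic_df` and bands of your choice.

```python
import pandas as pd

# hypothetical toy rows standing in for titanic_df[["age", "survived"]]
toy_df = pd.DataFrame(
    {
        "age": [4.0, 8.0, 25.0, 31.0, 47.0, 62.0, None],
        "survived": [True, True, False, True, False, False, True],
    }
)

# assumed age bands; right=False makes each bin include its lower edge
bins = [0, 10, 20, 40, 60, 120]
age_band = pd.cut(toy_df["age"], bins=bins, right=False)

# survival rate (%) per band; rows with missing age fall in no band
rate_by_band = toy_df.groupby(age_band, observed=True)["survived"].mean() * 100
print(rate_by_band)
```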
💡 Proposals and Ideas
For a first baseline model, we will use the following features:
- sex
- pclass
- age
The baseline can be built with heuristics or with a simple model such as a decision tree.
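As an illustration of the heuristic option, the sketch below scores a hypothetical one-rule baseline ("predict survival for females only") using the sex vs survived counts from the bivariate analysis. The rule and the function name are assumptions made for illustration, and the accuracy is measured on the full dataset rather than on held-out data.

```python
# counts from the sex vs survived crosstab in the bivariate analysis:
# sex -> (not survived, survived)
counts = {"female": (127, 339), "male": (682, 161)}


def predict_survived(sex: str) -> bool:
    """Hypothetical one-rule baseline: predict survival for females only."""
    return sex == "female"


# a prediction is correct when it matches the actual outcome
correct = sum(
    survived if predict_survived(sex) else died
    for sex, (died, survived) in counts.items()
)
total = sum(died + survived for died, survived in counts.values())
accuracy = correct / total

print(f"baseline accuracy: {accuracy:.1%}")
```

A score of this kind gives a reference point that a decision tree using sex, pclass and age would be expected to beat.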
📖 References
Visualization
- https://joserzapata.github.io/courses/python-ciencia-datos/visualizacion/#visualizacion-con-pandas
- https://joserzapata.github.io/courses/python-ciencia-datos/visualizacion/plotly/
- https://joserzapata.github.io/courses/python-ciencia-datos/visualizacion/seaborn/
EDA
- https://www.analyticsvidhya.com/blog/2022/02/a-quick-guide-to-bivariate-analysis-in-python/
- https://www.kaggle.com/code/allohvk/titanic-advanced-eda
- https://www.kaggle.com/code/imkushwaha/bivariate-multivariate-analysis
Statistical tests