By Jose R. Zapata
Last updated: 19/Feb/2026
📚 Import libraries
# base libraries for data science
import sys
import warnings
from pathlib import Path
import pandas as pd
import seaborn as sns
warnings.simplefilter(action="ignore", category=FutureWarning)
# configuration to show only 2 decimal places
pd.set_option("display.float_format", "{:.2f}".format)
# print library version for reproducibility
print("Python version: ", sys.version)
print("Pandas version: ", pd.__version__)
Python version: 3.11.11 (main, Dec 6 2024, 20:02:44) [Clang 18.1.8 ]
Pandas version: 2.2.3
💾 Load data
The dataset has correct data types, fixed in:
DATA_DIR = Path.cwd().resolve().parents[1] / "data"
titanic_df = pd.read_parquet(
DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)
📊 Data description
General data information
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1309 non-null int64
1 pclass 1309 non-null int64
2 survived 1309 non-null bool
3 name 1309 non-null object
4 sex 1309 non-null category
5 age 1046 non-null float64
6 sibsp 1309 non-null int64
7 parch 1309 non-null int64
8 fare 1308 non-null float64
9 embarked 1307 non-null category
dtypes: bool(1), category(2), float64(2), int64(4), object(1)
memory usage: 75.8+ KB
Ordinal data has to be converted again.
Information about the pclass column can be checked in the notebook
titanic_df["pclass"] = pd.Categorical(
titanic_df["pclass"], categories=[3, 2, 1], ordered=True
)
# column Unnamed: 0 is not needed
titanic_df = titanic_df.drop(columns=["Unnamed: 0"])
General information about the data set:
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null category
1 survived 1309 non-null bool
2 name 1309 non-null object
3 sex 1309 non-null category
4 age 1046 non-null float64
5 sibsp 1309 non-null int64
6 parch 1309 non-null int64
7 fare 1308 non-null float64
8 embarked 1307 non-null category
dtypes: bool(1), category(3), float64(2), int64(2), object(1)
memory usage: 56.8+ KB
# size of the dataframe
titanic_df.shape
(1309, 9)
# sample of the dataframe
titanic_df.sample(5)
| | pclass | survived | name | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|---|---|
| 1033 | 3 | True | Moss, Mr. Albert Johan | male | NaN | 0 | 0 | 7.78 | S |
| 222 | 1 | False | Ovies y Rodriguez, Mr. Servando | male | 28.50 | 0 | 0 | 27.72 | C |
| 1070 | 3 | False | O'Brien, Mr. Timothy | male | NaN | 0 | 0 | 7.83 | Q |
| 952 | 3 | False | Larsson-Rondberg, Mr. Edvard A | male | 22.00 | 0 | 0 | 7.78 | S |
| 706 | 3 | False | Caram, Mrs. Joseph (Maria Elias) | female | NaN | 1 | 0 | 14.46 | C |
Number of missing values
titanic_df.isnull().sum()
pclass 0
survived 0
name 0
sex 0
age 263
sibsp 0
parch 0
fare 1
embarked 2
dtype: int64
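To put these counts in context, they can be expressed as a share of the 1309 rows. The sketch below re-enters the counts from the output above instead of loading the parquet file; `missing_counts`, `n_rows`, and `missing_pct` are illustrative names, not part of the original notebook. On the real data frame, `titanic_df.isnull().mean() * 100` should give the same percentages.

```python
import pandas as pd

# Counts copied from the isnull().sum() output above (columns without gaps omitted)
missing_counts = pd.Series({"age": 263, "fare": 1, "embarked": 2})
n_rows = 1309  # number of rows reported by titanic_df.shape

# Share of missing values per column, in percent
missing_pct = (missing_counts / n_rows * 100).round(2)
print(missing_pct.sort_values(ascending=False))
```

Only age has a large share of missing values (roughly one row in five), so it is the column where the imputation strategy matters most.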
Target Variable = Survived
General statistics of the data set
Numerical variables
titanic_df.describe()
| | age | sibsp | parch | fare |
|---|---|---|---|---|
| count | 1046.00 | 1309.00 | 1309.00 | 1308.00 |
| mean | 29.88 | 0.50 | 0.39 | 33.30 |
| std | 14.41 | 1.04 | 0.87 | 51.76 |
| min | 0.17 | 0.00 | 0.00 | 0.00 |
| 25% | 21.00 | 0.00 | 0.00 | 7.90 |
| 50% | 28.00 | 0.00 | 0.00 | 14.45 |
| 75% | 39.00 | 1.00 | 0.00 | 31.27 |
| max | 80.00 | 8.00 | 9.00 | 512.33 |
Categorical variables
# categorical columns description
titanic_df.describe(include="category")
| | pclass | sex | embarked |
|---|---|---|---|
| count | 1309 | 1309 | 1307 |
| unique | 3 | 2 | 3 |
| top | 3 | male | S |
| freq | 709 | 843 | 914 |
📈 Bivariate Analysis
It is important to check the relationship between the target variable and the other variables.
The target variable is Survived
Target vs Numerical Variables
survived vs age
variable = "age"
titanic_df.plot(
kind="box",
column=variable,
by="survived",
grid=False,
title=f"Survived vs {variable} boxplot",
);

titanic_df[titanic_df["survived"] == 1][variable].plot(
kind="kde",
label="Survived",
legend=True,
title=f"Survived vs {variable} density plot",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
kind="kde",
label="Not Survived",
legend=True,
xlabel=variable,
ylabel="Density",
);

survived vs sibsp
variable = "sibsp"
titanic_df.plot(
kind="box",
column=variable,
by="survived",
grid=False,
title=f"Survived vs {variable} boxplot",
);

titanic_df[titanic_df["survived"] == 1][variable].plot(
kind="hist",
label="Survived",
legend=True,
bins=8,
alpha=0.5,
title=f"Survived vs {variable} histogram",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
kind="hist",
label="Not Survived",
legend=True,
bins=8,
alpha=0.3,
xlabel=variable,
ylabel="Frequency",
);

survived vs parch
variable = "parch"
titanic_df.plot(
kind="box",
column=variable,
by="survived",
grid=False,
title=f"Survived vs {variable} boxplot",
);

titanic_df[titanic_df["survived"] == 1][variable].plot(
kind="hist",
label="Survived",
legend=True,
bins=9,
alpha=0.5,
title=f"Survived vs {variable} histogram",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
kind="hist",
label="Not Survived",
legend=True,
bins=9,
alpha=0.3,
xlabel=variable,
ylabel="Frequency",
);

survived vs fare
variable = "fare"
titanic_df.plot(
kind="box",
column=variable,
by="survived",
grid=False,
title=f"Survived vs {variable} boxplot",
);

titanic_df[titanic_df["survived"] == 1][variable].plot(
kind="kde",
label="Survived",
legend=True,
title=f"Survived vs {variable} density plot",
)
titanic_df[titanic_df["survived"] == 0][variable].plot(
kind="kde",
label="Not Survived",
legend=True,
xlabel=variable,
ylabel="Density",
);

Target vs Categorical Variables
survived vs pclass
column = "pclass"
(
pd
.crosstab(titanic_df[column], titanic_df["survived"], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption("Survived vs pclass Heatmap")
)
| pclass | survived = False | survived = True | All |
|---|---|---|---|
| 3 | 528 | 181 | 709 |
| 2 | 158 | 119 | 277 |
| 1 | 123 | 200 | 323 |
| All | 809 | 500 | 1309 |
titanic_df.groupby(column, observed=True).agg({"survived": "mean"}) * 100
| pclass | survived (%) |
|---|---|
| 3 | 25.53 |
| 2 | 42.96 |
| 1 | 61.92 |
2nd class passengers survived at about 1.7 times the rate of 3rd class passengers (42.96% vs 25.53%), and 1st class passengers had the highest rate (61.92%).
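The comparison between classes can be reproduced directly from the crosstab counts. This is a small sketch that re-enters the counts from the table above rather than reading the data set; `counts`, `rate`, and `ratio_vs_third` are illustrative names.

```python
import pandas as pd

# Counts copied from the survived vs pclass crosstab above
counts = pd.DataFrame(
    {"not_survived": [528, 158, 123], "survived": [181, 119, 200]},
    index=pd.Index([3, 2, 1], name="pclass"),
)

# Survival rate per class, in percent
rate = counts["survived"] / counts.sum(axis=1) * 100
print(rate.round(2))

# How many times higher each class rate is than the 3rd class rate
ratio_vs_third = rate / rate.loc[3]
print(ratio_vs_third.round(2))
```

Expressing the gap as a ratio makes it easier to compare classes than reading the raw percentages side by side.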
(
pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
kind="bar", stacked=True, title=f"Survived vs {column} stacked barplot"
)
);

(
pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
kind="bar",
title=f"Survived vs {column} barplot",
)
);

survived vs sex
column = "sex"
(
pd
.crosstab(titanic_df[column], titanic_df["survived"], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"Survived vs {column} Heatmap")
)
| sex | survived = False | survived = True | All |
|---|---|---|---|
| female | 127 | 339 | 466 |
| male | 682 | 161 | 843 |
| All | 809 | 500 | 1309 |
titanic_df.groupby(column, observed=True).agg({"survived": "mean"}) * 100
| sex | survived (%) |
|---|---|
| female | 72.75 |
| male | 19.10 |
(
pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
kind="bar", stacked=True, title=f"Survived vs {column} stacked barplot"
)
);

(
pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
kind="bar",
title=f"Survived vs {column} barplot",
)
);

survived vs embarked
column = "embarked"
(
pd
.crosstab(titanic_df[column], titanic_df["survived"], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"Survived vs {column} Heatmap")
)
| embarked | survived = False | survived = True | All |
|---|---|---|---|
| C | 120 | 150 | 270 |
| Q | 79 | 44 | 123 |
| S | 610 | 304 | 914 |
| All | 809 | 498 | 1307 |
titanic_df.groupby(column, observed=True).agg({"survived": "mean"}) * 100
| embarked | survived (%) |
|---|---|
| C | 55.56 |
| Q | 35.77 |
| S | 33.26 |
(
pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
kind="bar", stacked=True, title=f"Survived vs {column} stacked barplot"
)
);

(
pd.crosstab(titanic_df[column], titanic_df["survived"]).plot(
kind="bar",
title=f"Survived vs {column} barplot",
)
);

Numerical vs Numerical Variables
age vs sibsp
# scatter plot of age vs sibsp
titanic_df.plot(
kind="scatter",
x="age",
y="sibsp",
title="age vs Sibsp scatter plot",
);

age vs parch
# scatter plot of age vs parch
titanic_df.plot(
kind="scatter",
x="age",
y="parch",
title="age vs Parch scatter plot",
);

age vs fare
# scatter plot of age vs fare
titanic_df.plot(
kind="scatter",
x="age",
y="fare",
title="age vs fare scatter plot",
);

sibsp vs parch
# scatter plot of sibsp vs parch
titanic_df.plot(
kind="scatter",
x="sibsp",
y="parch",
title="Sibsp vs parch scatter plot",
);

sibsp vs fare
# scatter plot of sibsp vs fare
titanic_df.plot(
kind="scatter",
x="sibsp",
y="fare",
title="Sibsp vs fare scatter plot",
);

parch vs fare
# scatter plot of parch vs fare
titanic_df.plot(
kind="scatter",
x="parch",
y="fare",
title="parch vs fare scatter plot",
);

Categorical vs Categorical Variables
pclass vs sex
column_1 = "pclass"
column_2 = "sex"
(
pd
.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"{column_1} vs {column_2} Heatmap")
)
| pclass | sex = female | sex = male | All |
|---|---|---|---|
| 3 | 216 | 493 | 709 |
| 2 | 106 | 171 | 277 |
| 1 | 144 | 179 | 323 |
| All | 466 | 843 | 1309 |
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
)
);

(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar",
title=f"{column_1} vs {column_2} barplot",
)
);

pclass vs embarked
column_1 = "pclass"
column_2 = "embarked"
(
pd
.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"{column_1} vs {column_2} Heatmap")
)
| pclass | embarked = C | embarked = Q | embarked = S | All |
|---|---|---|---|---|
| 3 | 101 | 113 | 495 | 709 |
| 2 | 28 | 7 | 242 | 277 |
| 1 | 141 | 3 | 177 | 321 |
| All | 270 | 123 | 914 | 1307 |
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
)
);

(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar",
title=f"{column_1} vs {column_2} barplot",
)
);

sex vs embarked
column_1 = "sex"
column_2 = "embarked"
(
pd
.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"{column_1} vs {column_2} Heatmap")
)
| sex | embarked = C | embarked = Q | embarked = S | All |
|---|---|---|---|---|
| female | 113 | 60 | 291 | 464 |
| male | 157 | 63 | 623 | 843 |
| All | 270 | 123 | 914 | 1307 |
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
)
);

(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar",
title=f"{column_1} vs {column_2} barplot",
)
);

sex vs pclass
column_1 = "sex"
column_2 = "pclass"
(
pd
.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"{column_1} vs {column_2} Heatmap")
)
| sex | pclass = 3 | pclass = 2 | pclass = 1 | All |
|---|---|---|---|---|
| female | 216 | 106 | 144 | 466 |
| male | 493 | 171 | 179 | 843 |
| All | 709 | 277 | 323 | 1309 |
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
)
);

(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar",
title=f"{column_1} vs {column_2} barplot",
)
);

embarked vs pclass
column_1 = "embarked"
column_2 = "pclass"
(
pd
.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"{column_1} vs {column_2} Heatmap")
)
| embarked | pclass = 3 | pclass = 2 | pclass = 1 | All |
|---|---|---|---|---|
| C | 101 | 28 | 141 | 270 |
| Q | 113 | 7 | 3 | 123 |
| S | 495 | 242 | 177 | 914 |
| All | 709 | 277 | 321 | 1307 |
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
)
);

(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar",
title=f"{column_1} vs {column_2} barplot",
)
);

embarked vs sex
column_1 = "embarked"
column_2 = "sex"
(
pd
.crosstab(titanic_df[column_1], titanic_df[column_2], margins=True)
.style.background_gradient(cmap="coolwarm")
.set_caption(f"{column_1} vs {column_2} Heatmap")
)
| embarked | sex = female | sex = male | All |
|---|---|---|---|
| C | 113 | 157 | 270 |
| Q | 60 | 63 | 123 |
| S | 291 | 623 | 914 |
| All | 464 | 843 | 1307 |
(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar", stacked=True, title=f"{column_1} vs {column_2} stacked barplot"
)
);

(
pd.crosstab(titanic_df[column_1], titanic_df[column_2]).plot(
kind="bar",
title=f"{column_1} vs {column_2} barplot",
)
);

Categorical vs Numerical Variables
pclass vs age
column_cat = "pclass"
column_num = "age"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);

sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

pclass vs sibsp
column_num = "sibsp"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);

sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

pclass vs parch
column_num = "parch"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);

sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

pclass vs fare
column_num = "fare"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);

sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

sex vs age
column_cat = "sex"
column_num = "age"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);

sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

sex vs sibsp
column_num = "sibsp"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);

sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

sex vs parch
column_num = "parch"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);

sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

sex vs fare
column_num = "fare"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);

sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

embarked vs age
column_cat = "embarked"
column_num = "age"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);

sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

embarked vs sibsp
column_num = "sibsp"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);

sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

embarked vs parch
column_num = "parch"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);

sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

embarked vs fare
column_num = "fare"
titanic_df.plot(
kind="box",
column=column_num,
by=column_cat,
xlabel=column_cat,
ylabel=column_num,
title=f"{column_num} by {column_cat} boxplot",
grid=False,
);

sns.barplot(
data=titanic_df,
x=column_cat,
y=column_num,
errorbar="ci",
capsize=0.1,
hue=column_cat,
).set_title(f"{column_num} by {column_cat} barplot");

📊 Analysis of Results
Target vs Numerical Variables
- survived vs age: The age distributions for survived and not-survived passengers are similar overall, but children (age 0–10) show a noticeably higher survival rate. The median age is slightly lower for survivors. There is no strong linear relationship between age and survival.
- survived vs sibsp: Passengers with 1 sibling/spouse aboard had a higher survival rate than those traveling alone (sibsp = 0) or those with many siblings/spouses. Having too many siblings/spouses (sibsp ≥ 3) is associated with very low survival.
- survived vs parch: Similar to sibsp, passengers with 1–2 parents/children aboard survived at higher rates than those alone or with large families. Passengers with parch ≥ 4 had very low survival rates.
- survived vs fare: Survivors paid significantly higher fares on average. The fare distribution for survivors is right-skewed with a higher median, while non-survivors cluster at lower fares. Fare is strongly associated with survival.
Target vs Categorical Variables
- survived vs pclass: 1st class passengers had the highest survival rate (~62%), 2nd class had a moderate rate (~43%), and 3rd class had the lowest (~26%). Almost all 1st class females survived, and so did most 2nd class females. Survival among 3rd class females was around 50%, and most 3rd class males (86%) did not survive.
- survived vs sex: Female passengers had a dramatically higher survival rate (~73%) compared to males (~19%). Sex is the single strongest predictor of survival.
- survived vs embarked: Passengers who embarked in Cherbourg (C) had the highest survival rate (~56%), followed by Queenstown (Q, ~36%) and Southampton (S, ~33%). This is likely confounded by the higher proportion of 1st class passengers embarking at Cherbourg.
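One way to compare how strongly each categorical variable is associated with survival is Cramér's V, which rescales the chi-square statistic to the range 0 to 1. The original notebook does not compute it; the sketch below is an added illustration that re-enters the crosstab counts shown earlier, and `cramers_v` is a hypothetical helper written for this example (plain formula, no bias or continuity correction).

```python
import numpy as np
import pandas as pd


def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V for a contingency table of counts (no bias correction)."""
    observed = table.to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts if the row and column variables were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Counts copied from the crosstabs above (columns: not survived, survived)
sex_table = pd.DataFrame([[127, 339], [682, 161]], index=["female", "male"])
pclass_table = pd.DataFrame([[528, 181], [158, 119], [123, 200]], index=[3, 2, 1])

v_sex = cramers_v(sex_table)
v_pclass = cramers_v(pclass_table)
print(f"Cramér's V, sex vs survived:    {v_sex:.2f}")
print(f"Cramér's V, pclass vs survived: {v_pclass:.2f}")
```

With these counts the value for sex comes out higher than the value for pclass, which agrees with the reading of the bar plots. It measures association only and says nothing about confounding between the variables.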
Numerical vs Numerical Variables
- age vs sibsp / parch: No strong correlation observed. Family size variables are largely independent of age.
- age vs fare: Weak correlation. Older passengers do not necessarily pay higher fares.
- sibsp vs parch: Weak positive correlation. Passengers with siblings/spouses tend to also have parents/children aboard, but the relationship is not strong.
- sibsp vs fare / parch vs fare: Moderate positive correlation. Larger families tend to pay higher total fares.
Categorical vs Categorical Variables
- pclass vs sex: Males are the majority in every class, and their share grows as class decreases: about 55% in 1st class, 62% in 2nd class, and 70% in 3rd class.
- pclass vs embarked: Most Southampton passengers were in 3rd class. Cherbourg had a higher proportion of 1st class passengers, which helps explain the higher survival rate for Cherbourg embarkation.
- sex vs embarked: Males are the majority at every port, but the share varies: about 51% at Queenstown, 58% at Cherbourg, and 68% at Southampton.
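The class mix per port, which underlies the confounding argument for Cherbourg, can be checked by normalizing the embarked vs pclass crosstab by row. This sketch re-enters the counts from that table; `port_class` and `class_share` are illustrative names. On the raw data, `pd.crosstab(titanic_df["embarked"], titanic_df["pclass"], normalize="index")` should give the same shares as fractions.

```python
import pandas as pd

# Counts copied from the embarked vs pclass crosstab above
port_class = pd.DataFrame(
    {"3rd": [101, 113, 495], "2nd": [28, 7, 242], "1st": [141, 3, 177]},
    index=pd.Index(["C", "Q", "S"], name="embarked"),
)

# Row-wise shares: percentage of each port's passengers travelling in each class
class_share = port_class.div(port_class.sum(axis=1), axis=0) * 100
print(class_share.round(1))
```

Just over half of the Cherbourg passengers travelled in 1st class, compared with roughly a fifth at Southampton and almost none at Queenstown.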
Categorical vs Numerical Variables
- pclass vs fare: Fare is very strongly correlated with passenger class — 1st class passengers paid the highest fares, and 3rd class the lowest. This confirms that fare is essentially a proxy for class.
- pclass vs age: Age distribution is similar across classes, with no strong relationship between class and age.
- sex vs age / fare: Males and females have similar age distributions. However, females tend to have slightly higher fares, likely because a higher proportion of female passengers traveled in 1st and 2nd class.
- embarked vs fare: Cherbourg passengers paid the highest average fares (consistent with the higher proportion of 1st class passengers), while Southampton and Queenstown passengers paid lower fares on average.
- embarked vs age: Age distributions are similar across embarkation ports.
Key Takeaways
- The two variables with the strongest association with survival are sex and pclass. Females and higher-class passengers had significantly better survival outcomes.
- Fare is highly correlated with class and acts as a strong proxy for socioeconomic status, which in turn is associated with survival.
- Family size matters: Traveling with 1–2 family members (sibsp or parch) improved survival odds compared to traveling alone or in very large groups.
- Children (age 0–10) had higher survival rates, consistent with the “women and children first” evacuation policy.
- Embarkation port (embarked) is confounded with class: Cherbourg passengers survived more because they were disproportionately 1st class, not because of the port itself.
- Age alone is not a strong predictor of survival, except at the extremes (very young children).
💡 Proposals and Ideas
For a first baseline model, the following features could be used:
- sex
- pclass
- age
The baseline can be built with heuristics or with a simple model such as a decision tree.
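As a reference point for such a baseline, the simplest heuristic is probably "predict survived for every female and not survived for every male". The sketch below estimates its accuracy from the survived vs sex crosstab counts shown earlier; the variable names are illustrative, and the figure is computed on the full data set rather than on a held-out split, so it is only a rough indication.

```python
import pandas as pd

# Counts copied from the survived vs sex crosstab above
sex_counts = pd.DataFrame(
    {"not_survived": [127, 682], "survived": [339, 161]},
    index=pd.Index(["female", "male"], name="sex"),
)

# Heuristic: predict "survived" for every female and "not survived" for every male
correct = sex_counts.loc["female", "survived"] + sex_counts.loc["male", "not_survived"]
total = sex_counts.to_numpy().sum()
accuracy = correct / total

# Reference point: always predicting the majority class ("not survived")
majority_accuracy = sex_counts["not_survived"].sum() / total

print(f"sex-only heuristic accuracy: {accuracy:.3f}")
print(f"majority-class accuracy:     {majority_accuracy:.3f}")
```

Any model that adds pclass and age would need to beat the sex-only figure to justify the extra complexity.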
📖 References
Visualization
- https://joserzapata.github.io/courses/python-ciencia-datos/visualizacion/#visualizacion-con-pandas
- https://joserzapata.github.io/courses/python-ciencia-datos/visualizacion/plotly/
- https://joserzapata.github.io/courses/python-ciencia-datos/visualizacion/seaborn/
EDA
- https://www.analyticsvidhya.com/blog/2022/02/a-quick-guide-to-bivariate-analysis-in-python/
- https://www.kaggle.com/code/allohvk/titanic-advanced-eda
- https://www.kaggle.com/code/imkushwaha/bivariate-multivariate-analysis
Statistical tests