Multivariate Data Analysis

Last updated: 19/Feb/2026

📚 Import libraries

# base libraries for data science
import sys
from pathlib import Path

import pandas as pd
import seaborn as sns

# configuration to show only 2 decimal places
pd.set_option("display.float_format", "{:.2f}".format)

# print library version for reproducibility

print("Python version: ", sys.version)
print("Pandas version: ", pd.__version__)

Python version:  3.11.11 (main, Dec  6 2024, 20:02:44) [Clang 18.1.8 ]
Pandas version:  2.2.3

💾 Load data

The dataset already has the correct data types, which were fixed in the notebook:

Initial data exploration

DATA_DIR = Path.cwd().resolve().parents[1] / "data"

titanic_df = pd.read_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)

📊 Data description

General data information

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1309 non-null   int64
 1   pclass      1309 non-null   int64
 2   survived    1309 non-null   bool
 3   name        1309 non-null   object
 4   sex         1309 non-null   category
 5   age         1046 non-null   float64
 6   sibsp       1309 non-null   int64
 7   parch       1309 non-null   int64
 8   fare        1308 non-null   float64
 9   embarked    1307 non-null   category
dtypes: bool(1), category(2), float64(2), int64(4), object(1)
memory usage: 75.8+ KB

Ordinal data has to be converted again, because the ordered categorical type of pclass came back as int64 after loading.

Information about the pclass column can be checked in the notebook:

Initial data exploration

titanic_df["pclass"] = pd.Categorical(
    titanic_df["pclass"], categories=[3, 2, 1], ordered=True
)

# column Unnamed: 0 is not needed
titanic_df = titanic_df.drop(columns=["Unnamed: 0"])
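
A minimal sketch of why the ordered categorical matters, using a small made-up Series (`toy_pclass` is a hypothetical name, not part of this notebook). With `categories=[3, 2, 1]` and `ordered=True`, third class is the lowest level and first class the highest, so comparisons and sorting follow the ordinal meaning rather than the integer value.

```python
import pandas as pd

# toy data, only for illustration (not the Titanic dataset)
toy_pclass = pd.Series(
    pd.Categorical([1, 3, 2, 3], categories=[3, 2, 1], ordered=True)
)

# the order follows the categories list: 3 < 2 < 1
lowest_class = toy_pclass.min()
highest_class = toy_pclass.max()

# comparisons use the category order, not the integer value
better_than_third = (toy_pclass > 3).tolist()

# sorting also respects the category order
sorted_classes = toy_pclass.sort_values().tolist()

print(lowest_class, highest_class)
print(better_than_third)
print(sorted_classes)
```

This is also why the boxplots and grouped tables further down list the classes in the order 3, 2, 1.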

General information about the data set:

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   pclass    1309 non-null   category
 1   survived  1309 non-null   bool
 2   name      1309 non-null   object
 3   sex       1309 non-null   category
 4   age       1046 non-null   float64
 5   sibsp     1309 non-null   int64
 6   parch     1309 non-null   int64
 7   fare      1308 non-null   float64
 8   embarked  1307 non-null   category
dtypes: bool(1), category(3), float64(2), int64(2), object(1)
memory usage: 56.8+ KB

# size of the dataframe
titanic_df.shape

(1309, 9)

# sample of the dataframe
titanic_df.sample(5)

	pclass	survived	name	sex	age	sibsp	parch	fare	embarked
1176	3	False	Sage, Mr. Douglas Bullen	male	NaN	8	2	69.55	S
581	2	False	Watson, Mr. Ennis Hastings	male	NaN	0	0	0.00	S
18	1	True	Bazzani, Miss. Albina	female	32.00	0	0	76.29	C
921	3	False	Keefe, Mr. Arthur	male	NaN	0	0	7.25	S
494	2	True	Mallet, Mrs. Albert (Antoinette Magnin)	female	24.00	1	1	37.00	C

Number of missing values

titanic_df.isnull().sum()

pclass        0
survived      0
name          0
sex           0
age         263
sibsp         0
parch         0
fare          1
embarked      2
dtype: int64

Target Variable = Survived

General statistics of the data set

Numerical variables

titanic_df.describe()

	age	sibsp	parch	fare
count	1046.00	1309.00	1309.00	1308.00
mean	29.88	0.50	0.39	33.30
std	14.41	1.04	0.87	51.76
min	0.17	0.00	0.00	0.00
25%	21.00	0.00	0.00	7.90
50%	28.00	0.00	0.00	14.45
75%	39.00	1.00	0.00	31.27
max	80.00	8.00	9.00	512.33

Categorical variables

# categorical columns description
titanic_df.describe(include="category")

	pclass	sex	embarked
count	1309	1309	1307
unique	3	2	3
top	3	male	S
freq	709	843	914

📈 Multivariate Analysis

It is important to check the relationship between the target variable and several other variables at once.

The target variable is Survived

Target vs age vs Sex

sns.boxplot(data=titanic_df, x="sex", y="age", hue="survived");

Target vs age vs pclass

sns.boxplot(data=titanic_df, x="pclass", y="age", hue="survived");

Target vs age vs embarked

sns.boxplot(data=titanic_df, x="embarked", y="age", hue="survived");

Target vs sex vs pclass

# survival rate (%) by class and sex
# (the stray ", 1" was removed: it turned the result into a tuple)
(
    titanic_df[["sex", "pclass", "survived"]]
    .groupby(["pclass", "sex"], observed=True)
    .mean()
    * 100
)

               survived
pclass sex
3      female     49.07
       male       15.21
2      female     88.68
       male       14.62
1      female     96.53
       male       34.08

Almost all Pclass 1 females survived (~97%), and so did most Pclass 2 females (~89%). Pclass 3 female survival is about 49%, and most Pclass 3 males (~85%) did not survive.
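
The same rates can be laid out as a class-by-sex table, which makes the comparison described above easier to read. This is a sketch on made-up rows (`toy_df` is hypothetical). On the real data, `titanic_df` would take its place.

```python
import pandas as pd

# toy data, only for illustration (not the Titanic dataset)
toy_df = pd.DataFrame(
    {
        "pclass": [1, 1, 1, 1, 3, 3, 3, 3],
        "sex": ["female", "female", "male", "male"] * 2,
        "survived": [True, True, True, False, True, False, False, False],
    }
)

# the mean of a boolean column is the survival rate;
# unstack moves "sex" from the row index to the columns
survival_pct = (
    toy_df.groupby(["pclass", "sex"])["survived"].mean().unstack("sex") * 100
)
print(survival_pct)
```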

sns.catplot(
    data=titanic_df,
    x="sex",
    hue="survived",
    col="pclass",
    kind="count",
    height=4,
    aspect=0.7,
);

sns.catplot(data=titanic_df, x="sex", y="survived", hue="pclass", kind="point");

sns.catplot(
    data=titanic_df,
    x="pclass",
    y="survived",
    hue="sex",
    palette={"male": "b", "female": "m"},
    markers=["^", "o"],
    linestyles=["-", "--"],
    kind="point",
);

sns.catplot(data=titanic_df, x="sex", y="survived", hue="pclass", kind="bar");

Target vs pclass vs fare

sns.catplot(
    x="fare",
    y="survived",
    row="pclass",
    kind="box",
    hue="survived",
    orient="h",
    height=1.5,
    aspect=4,
    data=titanic_df.query("fare > 0"),
).set(xscale="log");

Target vs sex vs pclass vs embarked

(
    titanic_df[["embarked", "sex", "pclass", "survived"]]
    .groupby(["embarked", "pclass", "sex"], observed=True)
    .agg(["count", "mean"])
)

			survived
			count	mean
embarked	pclass	sex
C	3	female	31	0.71
	3	male	70	0.21
	2	female	11	1.00
	2	male	17	0.29
	1	female	71	0.97
	1	male	70	0.40
Q	3	female	56	0.59
	3	male	57	0.12
	2	female	2	1.00
	2	male	5	0.00
	1	female	2	1.00
	1	male	1	0.00
S	3	female	129	0.40
	3	male	366	0.14
	2	female	93	0.87
	2	male	149	0.13
	1	female	69	0.96
	1	male	108	0.31

Numerical vs All Numerical Variables

# list of the numerical columns
numerical_columns = list(titanic_df.select_dtypes(include=["number"]).columns)
numerical_columns

['age', 'sibsp', 'parch', 'fare']

sns.pairplot(titanic_df[numerical_columns], diag_kind="kde");

df = titanic_df[[*numerical_columns, "survived"]]

sns.pairplot(df, hue="survived", diag_kind="kde");

📏 Heuristics baseline model

This is a simple baseline model that will be used to compare the performance of the machine learning models.

The model will be a simple set of decision rules (a hand-built decision tree) with the following features:

sex
pclass
age

def calculate_survival_percentage(df, pclass, age_min, age_max=None) -> float:
    """Calculate the percentage of survivors based on the class and age range.

    Both age bounds are exclusive. Rows with a missing age are never selected.

    Args:
        df: DataFrame with the Titanic dataset
        pclass: int with the class of the passenger
        age_min: exclusive lower bound of the passenger age
        age_max: exclusive upper bound of the passenger age (None = no upper bound)

    Returns:
        float with the percentage of survivors (NaN if no passenger matches)

    """
    # passengers of the given class older than age_min
    in_group = (df["pclass"] == pclass) & (df["age"] > age_min)

    # "is not None" so that an upper bound of 0 is not silently ignored
    if age_max is not None:
        in_group &= df["age"] < age_max

    total = in_group.sum()
    if total == 0:
        # avoid a ZeroDivisionError when the group is empty
        return float("nan")

    survivors = (in_group & df["survived"]).sum()

    return float(round((survivors / total) * 100, 1))


# calculate and print the survival percentages
print(
    "Pclass 1 survivors above age 60:",
    calculate_survival_percentage(titanic_df, 1, 59),
    "%",
)
print(
    "Pclass 2 survivors above age 60:",
    calculate_survival_percentage(titanic_df, 2, 59),
    "%",
)
print(
    "Pclass 3 survivors above age 60:",
    calculate_survival_percentage(titanic_df, 3, 59),
    "%",
)

print(
    "Pclass 1 survivors between 20-30 age:",
    calculate_survival_percentage(titanic_df, 1, 19, 31),
    "%",
)
print(
    "Pclass 2 survivors between 20-30 age:",
    calculate_survival_percentage(titanic_df, 2, 19, 31),
    "%",
)
print(
    "Pclass 3 survivors between 20-30 age:",
    calculate_survival_percentage(titanic_df, 3, 19, 31),
    "%",
)

Pclass 1 survivors above age 60: 38.5 %
Pclass 2 survivors above age 60: 12.5 %
Pclass 3 survivors above age 60: 16.7 %
Pclass 1 survivors between 20-30 age: 69.8 %
Pclass 2 survivors between 20-30 age: 41.0 %
Pclass 3 survivors between 20-30 age: 25.2 %
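
The per-class, per-age-range percentages above can also be computed for all ranges at once. This is a sketch using `pd.cut` on made-up data. The bin edges, the labels, and `toy_df` are assumptions chosen for illustration, not choices made in this notebook. Rows with a missing age fall into no band and are left out, as in the function above.

```python
import pandas as pd

# toy data, only for illustration (not the Titanic dataset)
toy_df = pd.DataFrame(
    {
        "pclass": [1, 1, 1, 3, 3, 3, 3],
        "age": [25.0, 28.0, 65.0, 22.0, 24.0, 70.0, None],
        "survived": [True, False, False, True, False, False, True],
    }
)

# right=False gives the bands [20, 31), [31, 60) and [60, 120):
# the lower edge is included, the upper edge is excluded
age_band = pd.cut(
    toy_df["age"],
    bins=[20, 31, 60, 120],
    right=False,
    labels=["20-30", "31-59", "60+"],
)

# survival rate (%) per class and age band; the row with a missing
# age has no band and is dropped by groupby
survival_pct = (
    toy_df.assign(age_band=age_band)
    .groupby(["pclass", "age_band"], observed=True)["survived"]
    .mean()
    .mul(100)
    .round(1)
)
print(survival_pct)
```

Note that these bands include the lower edge, whereas the function above uses exclusive bounds, so fractional ages near an edge can land in a different group.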

📊 Analysis of Results

Sex and Passenger Class (Key Predictors)

sex and pclass show the strongest association with the target variable survived in the tables and plots above, making them the most important candidate features for predicting survival.
Almost all Pclass 1 females survived (~97% from Cherbourg, ~96% from Southampton) and so did most Pclass 2 females (100% from Cherbourg and Queenstown, ~87% from Southampton).
Pclass 3 female survival drops significantly (~59% from Queenstown, ~71% from Cherbourg, ~40% from Southampton), showing that class is a strong differentiator even among women.
Most Pclass 3 males did not survive (~85% mortality overall, ranging from ~79% to ~88% depending on the port of embarkation).
Male survival rates were consistently low across all classes, but Pclass 1 males had the best chance (~40% from Cherbourg, ~31% from Southampton).

Age Interactions

Age is not correlated with pclass: the age distributions across the three classes are similar, with overlapping median values and interquartile ranges.
Age is not correlated with fare: knowing a passenger’s age does not help predict the fare they paid.
The survival rate among children (age 0–10) is notably higher than other age groups, suggesting a “women and children first” policy.
Among passengers above age 60, Pclass 1 had a 38.5% survival rate, while Pclass 2 and Pclass 3 had only 12.5% and 16.7% respectively.
Among passengers aged 20–30, Pclass 1 had a 69.8% survival rate, Pclass 2 had 41.0%, and Pclass 3 had only 25.2%.
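
The first two statements in this list are read off the plots, and they can also be checked numerically. This is a sketch on made-up data (`toy_df` is hypothetical): per-class age medians, plus a rank-based (Spearman-style) correlation between age and fare, computed as the Pearson correlation of the ranks. The numbers it produces say nothing about the Titanic data.

```python
import pandas as pd

# toy data, only for illustration (not the Titanic dataset)
toy_df = pd.DataFrame(
    {
        "pclass": [1, 1, 2, 2, 3, 3],
        "age": [40.0, 50.0, 30.0, 34.0, 20.0, 26.0],
        "fare": [80.0, 60.0, 20.0, 26.0, 8.0, 7.0],
    }
)

# similar medians across classes would support "age is not related to pclass"
median_age_by_class = toy_df.groupby("pclass")["age"].median()

# keep only rows where both values are known, then correlate the ranks;
# a value near 0 would support "age is not related to fare"
known = toy_df.dropna(subset=["age", "fare"])
rank_corr = known["age"].rank().corr(known["fare"].rank())

print(median_age_by_class)
print(round(rank_corr, 3))
```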

Fare and Class

The fare is highly correlated with pclass: higher-class passengers paid substantially more. Within each class, survivors tended to have paid higher fares than non-survivors, especially in Pclass 1.

Port of Embarkation

The survival rate among passengers who embarked in Cherbourg (C) is higher than those from Southampton (S) or Queenstown (Q). This is likely because Cherbourg had a higher proportion of Pclass 1 passengers.
Southampton (S) passengers had the lowest overall survival rate (~33%, derived from the counts and means in the table above), slightly below Queenstown (Q, ~36%), even though most Queenstown passengers were Pclass 3.
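
Whether the port acts mainly through class composition can be checked by putting three tables side by side: the overall survival rate per port, the class mix per port, and the survival rate per port within each class. This is a sketch on made-up rows (`toy_df` is hypothetical).

```python
import pandas as pd

# toy data, only for illustration (not the Titanic dataset)
toy_df = pd.DataFrame(
    {
        "embarked": ["C", "C", "C", "C", "S", "S", "S", "S"],
        "pclass": [1, 1, 1, 3, 1, 3, 3, 3],
        "survived": [True, True, False, False, True, False, False, False],
    }
)

# overall survival rate (%) per port
survival_by_port = toy_df.groupby("embarked")["survived"].mean().mul(100)

# class mix (%) per port; normalize="index" makes each row sum to 1
class_share_by_port = pd.crosstab(
    toy_df["embarked"], toy_df["pclass"], normalize="index"
).mul(100)

# survival rate (%) per port within each class, to separate the class effect
survival_by_port_class = (
    toy_df.groupby(["embarked", "pclass"])["survived"].mean().mul(100)
)

print(survival_by_port)
print(class_share_by_port)
print(survival_by_port_class)
```

If the per-port differences shrink once class is held fixed, the port is mostly a proxy for class.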

Summary

The multivariate analysis confirms that survival on the Titanic was primarily determined by sex (females had much higher survival rates) and passenger class (higher class meant better survival odds). Age played a secondary role, with children having an advantage. The port of embarkation is a proxy for class composition rather than an independent predictor. These findings suggest that a baseline model using sex, pclass, and age should capture the main patterns in the data.

💡 Proposals and Ideas

For the first baseline model, we will use the following features:

sex
pclass
age

The baseline can be built with heuristics or with a simple model such as a decision tree.
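
A minimal sketch of what a heuristic baseline could look like. The rule itself (women in first or second class survive, everyone else does not), the function name `predict_survival`, and `toy_df` are assumptions made for illustration, loosely based on the survival rates by sex and class found above. They are not a rule defined in this notebook. Age is left out of this first rule and could be added as a further split.

```python
import pandas as pd


def predict_survival(df: pd.DataFrame) -> pd.Series:
    """Hypothetical rule: women in first or second class survive."""
    return (df["sex"] == "female") & df["pclass"].isin([1, 2])


# toy data, only for illustration (not the Titanic dataset)
toy_df = pd.DataFrame(
    {
        "sex": ["female", "female", "male", "male", "female"],
        "pclass": [1, 3, 1, 3, 2],
        "age": [30.0, 22.0, 45.0, None, 8.0],
        "survived": [True, True, False, False, True],
    }
)

predicted = predict_survival(toy_df)

# share of rows where the rule agrees with the known outcome
accuracy = (predicted == toy_df["survived"]).mean()
print(f"accuracy: {accuracy:.2f}")
```

The accuracy of such a rule gives the reference value that any trained model should beat.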
