Multivariate Data Analysis

By Jose R. Zapata

Last updated: 19/Feb/2026

📚 Import libraries

# base libraries for data science
import sys
from pathlib import Path

import pandas as pd
import seaborn as sns
# configuration to show only 2 decimal places
pd.set_option("display.float_format", "{:.2f}".format)
# print library version for reproducibility

print("Python version: ", sys.version)
print("Pandas version: ", pd.__version__)
Python version:  3.11.11 (main, Dec  6 2024, 20:02:44) [Clang 18.1.8 ]
Pandas version:  2.2.3

💾 Load data

The data types of this dataset were already corrected in the notebook:

Initial data exploration

DATA_DIR = Path.cwd().resolve().parents[1] / "data"

titanic_df = pd.read_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)

📊 Data description

General data information

titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1309 non-null   int64
 1   pclass      1309 non-null   int64
 2   survived    1309 non-null   bool
 3   name        1309 non-null   object
 4   sex         1309 non-null   category
 5   age         1046 non-null   float64
 6   sibsp       1309 non-null   int64
 7   parch       1309 non-null   int64
 8   fare        1308 non-null   float64
 9   embarked    1307 non-null   category
dtypes: bool(1), category(2), float64(2), int64(4), object(1)
memory usage: 75.8+ KB

Ordinal data has to be converted again: pclass was loaded as int64, so its ordered categorical type must be restored.

More information about the pclass column can be checked in the notebook:

Initial data exploration

titanic_df["pclass"] = pd.Categorical(
    titanic_df["pclass"], categories=[3, 2, 1], ordered=True
)

# column Unnamed: 0 is not needed
titanic_df = titanic_df.drop(columns=["Unnamed: 0"])
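
As a minimal sketch of what the ordered conversion provides, the example below uses a small hypothetical Series (toy_pclass is not part of the dataset). With categories=[3, 2, 1] and ordered=True, third class is the lowest level and first class the highest, so comparisons and max follow the class ranking rather than the numeric value.

```python
import pandas as pd

# toy values, hypothetical; the real column is titanic_df["pclass"]
toy_pclass = pd.Series(
    pd.Categorical([1, 3, 2, 3], categories=[3, 2, 1], ordered=True)
)

# the highest category is first class, even though 1 is the smallest number
highest = toy_pclass.max()

# comparisons follow the category order: True where the class ranks above 3
better_than_third = toy_pclass > 3
print(highest, better_than_third.tolist())
```

This ordering is also what seaborn uses to place the categories on the axis, which is probably why the plots show classes in the order 3, 2, 1.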

General information about the data set:

titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   pclass    1309 non-null   category
 1   survived  1309 non-null   bool
 2   name      1309 non-null   object
 3   sex       1309 non-null   category
 4   age       1046 non-null   float64
 5   sibsp     1309 non-null   int64
 6   parch     1309 non-null   int64
 7   fare      1308 non-null   float64
 8   embarked  1307 non-null   category
dtypes: bool(1), category(3), float64(2), int64(2), object(1)
memory usage: 56.8+ KB
# size of the dataframe
titanic_df.shape
(1309, 9)
# sample of the dataframe
titanic_df.sample(5)

      pclass  survived                                     name     sex    age  sibsp  parch   fare  embarked
1176       3     False                 Sage, Mr. Douglas Bullen    male    NaN      8      2  69.55         S
581        2     False               Watson, Mr. Ennis Hastings    male    NaN      0      0   0.00         S
18         1      True                    Bazzani, Miss. Albina  female  32.00      0      0  76.29         C
921        3     False                        Keefe, Mr. Arthur    male    NaN      0      0   7.25         S
494        2      True  Mallet, Mrs. Albert (Antoinette Magnin)  female  24.00      1      1  37.00         C

Number of missing values

titanic_df.isnull().sum()
pclass        0
survived      0
name          0
sex           0
age         263
sibsp         0
parch         0
fare          1
embarked      2
dtype: int64
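
Absolute counts are easier to judge as percentages. The sketch below shows one way to compute them, using a small hypothetical frame (toy_df) in place of titanic_df.

```python
import pandas as pd

# hypothetical toy data standing in for titanic_df
toy_df = pd.DataFrame(
    {
        "age": [22.0, None, 35.0, None],
        "fare": [7.25, 71.28, None, 8.05],
        "sex": ["male", "female", "female", "male"],
    }
)

# isna() gives a boolean frame; the column mean is the fraction of missing values
missing_pct = toy_df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

For the counts shown above, age is missing in 263 of 1309 rows, which is about 20%.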

Target Variable = Survived

General statistics of the data set

Numerical variables
titanic_df.describe()

           age    sibsp    parch     fare
count  1046.00  1309.00  1309.00  1308.00
mean     29.88     0.50     0.39    33.30
std      14.41     1.04     0.87    51.76
min       0.17     0.00     0.00     0.00
25%      21.00     0.00     0.00     7.90
50%      28.00     0.00     0.00    14.45
75%      39.00     1.00     0.00    31.27
max      80.00     8.00     9.00   512.33

Categorical variables
# categorical columns description
titanic_df.describe(include="category")

        pclass   sex  embarked
count     1309  1309      1307
unique       3     2         3
top          3  male         S
freq       709   843       914

📈 Multivariate Analysis

It is important to examine the relationship between the target variable and several other variables at the same time.

The target variable is Survived

Target vs age vs Sex

sns.boxplot(data=titanic_df, x="sex", y="age", hue="survived");


Target vs age vs pclass

sns.boxplot(data=titanic_df, x="pclass", y="age", hue="survived");


Target vs age vs embarked

sns.boxplot(data=titanic_df, x="embarked", y="age", hue="survived");


Target vs sex vs pclass

# survival rate (%) by class and sex; the stray ", 1" made the result a tuple
(
    titanic_df[["sex", "pclass", "survived"]]
    .groupby(["pclass", "sex"], observed=True)
    .mean()
    * 100
)
               survived
pclass sex
3      female     49.07
       male       15.21
2      female     88.68
       male       14.62
1      female     96.53
       male       34.08

Almost all Pclass 1 females survived (about 97%), and so did most Pclass 2 females (about 89%). Pclass 3 female survival was about 49%, and roughly 85% of Pclass 3 males did not survive.
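
The same rates can be easier to compare side by side when sex is moved to the columns. A minimal sketch, assuming a small hypothetical frame (toy_df) with the same column names as titanic_df:

```python
import pandas as pd

# hypothetical toy data, not taken from the Titanic dataset
toy_df = pd.DataFrame(
    {
        "pclass": [1, 1, 1, 1, 3, 3, 3, 3],
        "sex": ["female", "female", "male", "male"] * 2,
        "survived": [True, True, True, False, True, False, False, False],
    }
)

# the mean of a boolean column is the survival rate; unstack moves sex to columns
rate_table = (
    toy_df.groupby(["pclass", "sex"])["survived"].mean().mul(100).unstack("sex")
)
print(rate_table)
```

Each row is a class and each column a sex, so the gap between women and men can be read across a single row.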

sns.catplot(
    data=titanic_df,
    x="sex",
    hue="survived",
    col="pclass",
    kind="count",
    height=4,
    aspect=0.7,
);


sns.catplot(data=titanic_df, x="sex", y="survived", hue="pclass", kind="point");


sns.catplot(
    data=titanic_df,
    x="pclass",
    y="survived",
    hue="sex",
    palette={"male": "b", "female": "m"},
    markers=["^", "o"],
    linestyles=["-", "--"],
    kind="point",
);


sns.catplot(data=titanic_df, x="sex", y="survived", hue="pclass", kind="bar");


Target vs pclass vs fare

sns.catplot(
    x="fare",
    y="survived",
    row="pclass",
    kind="box",
    hue="survived",
    orient="h",
    height=1.5,
    aspect=4,
    data=titanic_df.query("fare > 0"),
).set(xscale="log");


Target vs sex vs pclass vs embarked

(
    titanic_df[["embarked", "sex", "pclass", "survived"]]
    .groupby(["embarked", "pclass", "sex"], observed=True)
    .agg(["count", "mean"])
)

                         survived
                            count  mean
embarked  pclass  sex
C         3       female       31  0.71
                  male         70  0.21
          2       female       11  1.00
                  male         17  0.29
          1       female       71  0.97
                  male         70  0.40
Q         3       female       56  0.59
                  male         57  0.12
          2       female        2  1.00
                  male          5  0.00
          1       female        2  1.00
                  male          1  0.00
S         3       female      129  0.40
                  male        366  0.14
          2       female       93  0.87
                  male        149  0.13
          1       female       69  0.96
                  male        108  0.31
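
Some groups in this table are very small (for example, only 1 or 2 passengers from Queenstown in Pclass 1), so their rates of 0.00 or 1.00 should be read with care. The sketch below filters out small groups; the rows are copied from the table above, and MIN_GROUP_SIZE is a hypothetical cutoff that is not part of the original analysis.

```python
import pandas as pd

# counts and rates copied from four rows of the table above
summary = pd.DataFrame(
    {
        "embarked": ["C", "Q", "Q", "S"],
        "pclass": [1, 2, 1, 3],
        "sex": ["female", "female", "male", "male"],
        "count": [71, 2, 1, 366],
        "mean": [0.97, 1.00, 0.00, 0.14],
    }
)

MIN_GROUP_SIZE = 10  # hypothetical cutoff, chosen only for illustration

# keep only groups with enough passengers for the rate to be meaningful
reliable = summary[summary["count"] >= MIN_GROUP_SIZE]
print(reliable)
```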

Numerical vs All Numerical Variables

# list of the numerical columns
numerical_columns = list(titanic_df.select_dtypes(include=["number"]).columns)
numerical_columns
['age', 'sibsp', 'parch', 'fare']
sns.pairplot(titanic_df[numerical_columns], diag_kind="kde");


df = titanic_df[[*numerical_columns, "survived"]]

sns.pairplot(df, hue="survived", diag_kind="kde");

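
The pair plots can be complemented with a correlation matrix. This is a sketch on hypothetical toy values (toy_df), not a result from the dataset. The boolean target is cast to int so that it enters the Pearson correlation next to the numerical columns.

```python
import pandas as pd

# hypothetical toy data with the same column names as titanic_df
toy_df = pd.DataFrame(
    {
        "age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
        "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
        "survived": [False, True, True, True, False, False],
    }
)

# cast the boolean target to int so it is treated as a 0/1 numeric column
corr = toy_df.astype({"survived": int}).corr()

# correlation of each numerical column with the target
print(corr["survived"].drop("survived"))
```

Pearson correlation with a 0/1 variable only captures linear association, so it is a rough summary rather than a replacement for the plots.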

📏 Heuristics baseline model

This is a simple baseline model that will be used to compare the performance of the machine learning models.

The baseline will be a simple set of decision rules (a hand-built decision tree) based on the following features:

  • sex
  • pclass
  • age

def calculate_survival_percentage(df, pclass, age_min, age_max=None) -> float:
    """Calculate the percentage of survivors based on the class and age range.

    Args:
        df: DataFrame with the Titanic dataset
        pclass: int with the class of the passenger
        age_min: exclusive lower bound for the age of the passenger
        age_max: exclusive upper bound for the age of the passenger (optional)

    Returns:
        float with the percentage of survivors, rounded to one decimal

    """
    # passengers of the requested class who are older than age_min
    # (rows with missing age compare as False and are excluded)
    mask = (df["pclass"] == pclass) & (df["age"] > age_min)

    # `is not None` so that an upper bound of 0 is not silently ignored
    if age_max is not None:
        mask &= df["age"] < age_max

    total = int(mask.sum())
    survivors = int((mask & df["survived"]).sum())

    return round((survivors / total) * 100, 1)


# Calculate and print the survival percentages
print(
    "Pclass 1 survivors above age 60:",
    calculate_survival_percentage(titanic_df, 1, 59),
    "%",
)
print(
    "Pclass 2 survivors above age 60:",
    calculate_survival_percentage(titanic_df, 2, 59),
    "%",
)
print(
    "Pclass 3 survivors above age 60:",
    calculate_survival_percentage(titanic_df, 3, 59),
    "%",
)

print(
    "Pclass 1 survivors between 20-30 age:",
    calculate_survival_percentage(titanic_df, 1, 19, 31),
    "%",
)
print(
    "Pclass 2 survivors between 20-30 age:",
    calculate_survival_percentage(titanic_df, 2, 19, 31),
    "%",
)
print(
    "Pclass 3 survivors between 20-30 age:",
    calculate_survival_percentage(titanic_df, 3, 19, 31),
    "%",
)
Pclass 1 survivors above age 60: 38.5 %
Pclass 2 survivors above age 60: 12.5 %
Pclass 3 survivors above age 60: 16.7 %
Pclass 1 survivors between 20-30 age: 69.8 %
Pclass 2 survivors between 20-30 age: 41.0 %
Pclass 3 survivors between 20-30 age: 25.2 %
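
A heuristic baseline built from these percentages could look like the sketch below. The rules and the age threshold in predict_survival are hypothetical and only illustrate the idea; toy_df is made-up data used to show how the accuracy would be computed.

```python
import pandas as pd


def predict_survival(row) -> bool:
    """Hypothetical hand-written rules; thresholds are illustrative only."""
    # women in first and second class survived in most cases
    if row["sex"] == "female" and row["pclass"] in (1, 2):
        return True
    # assumed rule: young children with a known age are predicted to survive
    if pd.notna(row["age"]) and row["age"] < 10:
        return True
    return False


# hypothetical toy data with the same column names as titanic_df
toy_df = pd.DataFrame(
    {
        "sex": ["female", "male", "female", "male", "male"],
        "pclass": [1, 3, 3, 2, 1],
        "age": [29.0, 25.0, None, 4.0, 40.0],
        "survived": [True, False, True, True, False],
    }
)

# apply the rules row by row and compare with the real outcome
predictions = toy_df.apply(predict_survival, axis=1)
accuracy = (predictions == toy_df["survived"]).mean()
print(f"accuracy: {accuracy:.2f}")
```

Any machine learning model trained later should beat the accuracy of rules like these to justify its extra complexity.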

📊 Analysis of Results

Sex and Passenger Class (Key Predictors)

  • sex and pclass are the two variables that show the strongest association with the target variable survived in the tables and plots above, making them the most important features for predicting survival.
  • Almost all Pclass 1 females survived (~97% from Cherbourg, ~96% from Southampton) and so did most Pclass 2 females (100% from Cherbourg and Queenstown, ~87% from Southampton).
  • Pclass 3 female survival drops significantly (~59% from Queenstown, ~71% from Cherbourg, ~40% from Southampton), showing that class is a strong differentiator even among women.
  • Most Pclass 3 males did not survive (~85% mortality overall), with mortality ranging from ~79% for Cherbourg to ~88% for Queenstown.
  • Male survival rates were consistently low across all classes, but Pclass 1 males had the best chance (~40% from Cherbourg, ~31% from Southampton).

Age Interactions

  • Age is not correlated with pclass: the age distributions across the three classes are similar, with overlapping median values and interquartile ranges.
  • Age is not correlated with fare: knowing a passenger’s age does not help predict the fare they paid.
  • The survival rate among children (age 0–10) is notably higher than other age groups, suggesting a “women and children first” policy.
  • Among passengers above age 60, Pclass 1 had a 38.5% survival rate, while Pclass 2 and Pclass 3 had only 12.5% and 16.7% respectively.
  • Among passengers aged 20–30, Pclass 1 had a 69.8% survival rate, Pclass 2 had 41.0%, and Pclass 3 had only 25.2%.
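
Survival by age group can be checked by binning age with pd.cut. This is a minimal sketch: the bin edges and labels are assumptions, and toy_df is hypothetical data rather than the Titanic dataset.

```python
import pandas as pd

# hypothetical toy data with the same column names as titanic_df
toy_df = pd.DataFrame(
    {
        "age": [4.0, 8.0, 25.0, 28.0, 45.0, 62.0, None],
        "survived": [True, True, False, True, False, False, True],
    }
)

# assumed bin edges; right=False makes each bin [lower, upper)
age_group = pd.cut(
    toy_df["age"],
    bins=[0, 10, 20, 30, 60, 120],
    right=False,
    labels=["0-9", "10-19", "20-29", "30-59", "60+"],
)

# rows with missing age fall in no bin and are left out of the groupby
rate_by_age = toy_df.groupby(age_group, observed=True)["survived"].mean().mul(100)
print(rate_by_age)
```

Passing observed=True drops empty bins, so a group with no passengers does not appear with a NaN rate.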

Fare and Class

  • The fare is highly correlated with pclass: higher-class passengers paid substantially more. Within each class, survivors tended to have paid higher fares than non-survivors, especially in Pclass 1.

Port of Embarkation

  • The survival rate among passengers who embarked in Cherbourg (C) is higher than those from Southampton (S) or Queenstown (Q). This is likely because Cherbourg had a higher proportion of Pclass 1 passengers.
  • Queenstown (Q) passengers had the lowest overall survival rate, which aligns with the fact that most of them were Pclass 3.
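
The class composition of each port can be derived from the counts in the embarked, pclass and sex table. The sketch below adds the female and male counts of that table for each port and class, then converts them to shares; the names counts and class_share are introduced here for illustration.

```python
import pandas as pd

# passenger counts per port and class, summed over sex from the table above
counts = pd.DataFrame(
    {1: [141, 3, 177], 2: [28, 7, 242], 3: [101, 113, 495]},
    index=pd.Index(["C", "Q", "S"], name="embarked"),
)
counts.columns.name = "pclass"

# divide each row by its total to get the class share per port
class_share = counts.div(counts.sum(axis=1), axis=0)
print(class_share.round(2))
```

About 52% of Cherbourg passengers were in Pclass 1, against about 19% for Southampton, and about 92% of Queenstown passengers were in Pclass 3. On the full DataFrame, pd.crosstab with normalize="index" should give the same kind of table directly.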

Summary

The multivariate analysis confirms that survival on the Titanic was primarily determined by sex (females had much higher survival rates) and passenger class (higher class meant better survival odds). Age played a secondary role, with children having an advantage. The port of embarkation is a proxy for class composition rather than an independent predictor. These findings suggest that a baseline model using sex, pclass, and age should capture the main patterns in the data.

💡 Proposals and Ideas

For the first baseline model, we will use the following features:

  • sex
  • pclass
  • age

The baseline can be built with heuristics or with a simple model such as a decision tree.
