By Jose R. Zapata
Last updated: 19/Feb/2026
📚 Import libraries
# base libraries for data science
import sys
from pathlib import Path
import pandas as pd
# configuration to show only 2 decimal places
pd.set_option("display.float_format", "{:.2f}".format)
# print library version for reproducibility
print("Python version: ", sys.version)
print("Pandas version: ", pd.__version__)
Python version: 3.11.10 (main, Sep 27 2024, 20:27:21) [GCC 11.4.0]
Pandas version: 2.1.4
💾 Load data
The dataset has correct data types, fixed in:
DATA_DIR = Path.cwd().resolve().parents[1] / "data"
titanic_df = pd.read_parquet(
DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)
📊 Data description
General data information
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1309 non-null int64
1 pclass 1309 non-null int64
2 survived 1309 non-null bool
3 name 1309 non-null object
4 sex 1309 non-null category
5 age 1046 non-null float64
6 sibsp 1309 non-null int64
7 parch 1309 non-null int64
8 fare 1308 non-null float64
9 embarked 1307 non-null category
dtypes: bool(1), category(2), float64(2), int64(4), object(1)
memory usage: 75.8+ KB
Ordinal data has to be converted again
Information about the pclass column can be checked in the notebook
titanic_df["pclass"] = pd.Categorical(
titanic_df["pclass"], categories=[3, 2, 1], ordered=True
)
# column Unnamed: 0 is not needed
titanic_df = titanic_df.drop(columns=["Unnamed: 0"])
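The effect of declaring pclass as an ordered categorical can be illustrated with a small sketch. The four values below are a hypothetical toy sample, not rows from the dataset; the category order [3, 2, 1] matches the conversion above, so 3rd class ranks lowest and 1st class highest.

```python
import pandas as pd

# Hypothetical toy sample; categories are ordered from lowest (3) to highest (1)
pclass = pd.Series(pd.Categorical([1, 3, 2, 3], categories=[3, 2, 1], ordered=True))

# Comparisons follow the category order, not the numeric value
above_third = pclass > 3

# min and max also respect the declared order
lowest, highest = pclass.min(), pclass.max()

# Sorting places 3rd class first and 1st class last
sorted_values = pclass.sort_values().tolist()
```

With this ordering, sorting and comparisons treat 1st class as the top of the scale even though 1 is the smallest number.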
General information about the data set:
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null category
1 survived 1309 non-null bool
2 name 1309 non-null object
3 sex 1309 non-null category
4 age 1046 non-null float64
5 sibsp 1309 non-null int64
6 parch 1309 non-null int64
7 fare 1308 non-null float64
8 embarked 1307 non-null category
dtypes: bool(1), category(3), float64(2), int64(2), object(1)
memory usage: 56.8+ KB
# size of the dataframe
titanic_df.shape
(1309, 9)
# sample of the dataframe
titanic_df.sample(5)
| | pclass | survived | name | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|---|---|
| 58 | 1 | False | Case, Mr. Howard Brown | male | 49.00 | 0 | 0 | 26.00 | S |
| 736 | 3 | False | Coxon, Mr. Daniel | male | 59.00 | 0 | 0 | 7.25 | S |
| 523 | 2 | True | Oxenham, Mr. Percy Thomas | male | 22.00 | 0 | 0 | 10.50 | S |
| 1073 | 3 | False | O'Connor, Mr. Maurice | male | NaN | 0 | 0 | 7.75 | Q |
| 518 | 2 | False | Nicholls, Mr. Joseph Charles | male | 19.00 | 1 | 1 | 36.75 | S |
Number of missing values
titanic_df.isnull().sum()
pclass 0
survived 0
name 0
sex 0
age 263
sibsp 0
parch 0
fare 1
embarked 2
dtype: int64
Target Variable = Survived
General statistics of the data set
Numerical variables
titanic_df.describe()
| | age | sibsp | parch | fare |
|---|---|---|---|---|
| count | 1046.00 | 1309.00 | 1309.00 | 1308.00 |
| mean | 29.88 | 0.50 | 0.39 | 33.30 |
| std | 14.41 | 1.04 | 0.87 | 51.76 |
| min | 0.17 | 0.00 | 0.00 | 0.00 |
| 25% | 21.00 | 0.00 | 0.00 | 7.90 |
| 50% | 28.00 | 0.00 | 0.00 | 14.45 |
| 75% | 39.00 | 1.00 | 0.00 | 31.27 |
| max | 80.00 | 8.00 | 9.00 | 512.33 |
Categorical variables
# categorical columns description
titanic_df.describe(include="category")
| | pclass | sex | embarked |
|---|---|---|---|
| count | 1309 | 1309 | 1307 |
| unique | 3 | 2 | 3 |
| top | 3 | male | S |
| freq | 709 | 843 | 914 |
📈 Univariate Analysis
Target Variable
titanic_df["survived"].value_counts().plot(
kind="bar", color=["skyblue", "orange"], title="Survived passengers"
);

Numerical Variables
# list of the numerical columns
numerical_columns = list(titanic_df.select_dtypes(include=["number"]).columns)
numerical_columns
['age', 'sibsp', 'parch', 'fare']
age
column = "age"
titanic_df[column].describe()
count 1046.00
mean 29.88
std 14.41
min 0.17
25% 21.00
50% 28.00
75% 39.00
max 80.00
Name: age, dtype: float64
# number of unique values
titanic_df[column].nunique()
98
titanic_df[column].plot(
kind="hist", bins=30, edgecolor="black", title=f"{column} histogram"
);

titanic_df[column].plot(kind="density", title=f"{column} distribution");

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

sibsp
column = "sibsp"
titanic_df[column].describe()
count 1309.00
mean 0.50
std 1.04
min 0.00
25% 0.00
50% 0.00
75% 1.00
max 8.00
Name: sibsp, dtype: float64
# number of unique values
titanic_df[column].nunique()
7
# unique values (there are only a few)
titanic_df[column].unique()
array([0, 1, 2, 3, 4, 5, 8])
# histogram of a column with 7 unique values between 0 and 8
titanic_df[column].plot(
kind="hist", bins=8, edgecolor="black", title=f"{column} histogram"
);

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

parch
column = "parch"
titanic_df[column].describe()
count 1309.00
mean 0.39
std 0.87
min 0.00
25% 0.00
50% 0.00
75% 0.00
max 9.00
Name: parch, dtype: float64
# number of unique values
titanic_df[column].nunique()
8
# unique values (there are only a few)
titanic_df[column].unique()
array([0, 2, 1, 4, 3, 5, 6, 9])
titanic_df[column].plot(
kind="hist", bins=9, edgecolor="black", title=f"{column} histogram"
);

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

fare
column = "fare"
titanic_df[column].describe()
count 1308.00
mean 33.30
std 51.76
min 0.00
25% 7.90
50% 14.45
75% 31.27
max 512.33
Name: fare, dtype: float64
titanic_df[column].plot(kind="kde", title=f"{column} distribution");

titanic_df[column].plot(
kind="hist", bins=50, edgecolor="black", title=f"{column} histogram"
);

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

Categorical Variables
# list of the categorical columns
categorical_columns = list(titanic_df.select_dtypes(include=["category"]).columns)
categorical_columns
['pclass', 'sex', 'embarked']
pclass
column = "pclass"
titanic_df[column].describe()
count 1309
unique 3
top 3
freq 709
Name: pclass, dtype: int64
titanic_df[column].unique()
[1, 2, 3]
Categories (3, int64): [3 < 2 < 1]
titanic_df[column].value_counts()
pclass
3 709
1 323
2 277
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
kind="bar", color=["skyblue", "orange", "green"], title=f"{column} value counts"
);

Sex
column = "sex"
titanic_df[column].describe()
count 1309
unique 2
top male
freq 843
Name: sex, dtype: object
titanic_df[column].unique()
['female', 'male']
Categories (2, object): ['female', 'male']
titanic_df[column].value_counts()
sex
male 843
female 466
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
kind="bar", color=["skyblue", "orange"], title=f"{column} value counts"
);

embarked
column = "embarked"
titanic_df[column].describe()
count 1307
unique 3
top S
freq 914
Name: embarked, dtype: object
titanic_df[column].unique()
['S', 'C', NaN, 'Q']
Categories (3, object): ['C', 'Q', 'S']
titanic_df[column].value_counts()
embarked
S 914
C 270
Q 123
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
kind="bar", color=["skyblue", "orange", "green"], title=f"{column} value counts"
);

String columns
titanic_df["name"].sample(5)
394 Dibden, Mr. William
755 Davies, Mr. Joseph
237 Robbins, Mr. Victor
459 Jacobsohn, Mr. Sidney Samuel
1136 Rasmussen, Mrs. (Lena Jacobsen Solvang)
Name: name, dtype: object
titanic_df["name"].nunique()
1307
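The name column has 1,307 unique values across 1,309 rows, so some names appear more than once. A minimal sketch of how repeated values could be listed is shown below; the names are hypothetical placeholders, not entries from the dataset.

```python
import pandas as pd

# Hypothetical placeholder names; index 0 and 2 hold the same value
names = pd.Series(
    ["Doe, Mr. John", "Roe, Miss. Jane", "Doe, Mr. John", "Poe, Mrs. Ann"],
    name="name",
)

# keep=False marks every occurrence of a repeated value, not only the later ones
repeated = names[names.duplicated(keep=False)]

# Number of rows beyond the first occurrence of each name
n_extra = len(names) - names.nunique()
```

Applied to titanic_df["name"], the same pattern would show whether the repeated names belong to different passengers or to duplicated records.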
📊 Analysis of Results and Conclusions
Dataset Overview
- The dataset contains 1,309 passengers with 9 variables (after dropping the unnamed index column).
- Missing values: age has 263 missing values (~20%), fare has 1, and embarked has 2. These will need to be addressed in the data preparation phase.
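The missing-value counts and percentages can be gathered in one table. The sketch below uses a hypothetical helper, missing_summary, and a small toy frame rather than the real dataset.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    summary = pd.DataFrame(
        {
            "missing": df.isnull().sum(),
            # the mean of a boolean mask is the fraction of missing values
            "percent": df.isnull().mean() * 100,
        }
    )
    return summary[summary["missing"] > 0].sort_values("percent", ascending=False)


# Hypothetical toy data, only to show the shape of the result
toy_df = pd.DataFrame(
    {
        "age": [22.0, None, 30.0, None],
        "fare": [7.25, 8.05, None, 26.0],
        "sibsp": [0, 1, 0, 2],
    }
)
summary = missing_summary(toy_df)
```

Calling missing_summary(titanic_df) would list age, embarked, and fare with their respective shares.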
Target Variable: survived
- The dataset is imbalanced: more passengers did not survive than survived, reflecting the historical reality of the Titanic disaster.
Numerical Variables
Age
- Mean age is 29.88 years (median 28), with a standard deviation of 14.41.
- Ages range from 0.17 (infants) to 80 years.
- The distribution is right-skewed, with a concentration of passengers between 20 and 40 years old.
- There are 98 unique age values and 263 missing values (~20% of the data), which is significant and must be handled carefully during imputation.
- The boxplot reveals outliers at the upper end (elderly passengers).
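The outliers in the boxplot follow from the 1.5 × IQR rule that matplotlib box plots use by default. As a worked example, the fences for age can be computed from the quartiles reported above (Q1 = 21, Q3 = 39); iqr_fences is a hypothetical helper name.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the lower and upper fences of the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of age taken from the describe() output above
age_lower, age_upper = iqr_fences(q1=21.0, q3=39.0)
print(age_lower, age_upper)  # -6.0 66.0
```

Under this rule, passengers older than 66 are drawn as individual points, and no lower outliers are possible because the lower fence is negative.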
SibSp (Siblings/Spouses aboard)
- Most passengers traveled without siblings or spouses (median = 0, 75th percentile = 1).
- The distribution is heavily right-skewed, with values ranging from 0 to 8.
- Only 7 unique values exist, making this variable behave almost like a categorical feature.
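Because sibsp takes so few distinct values, a frequency table sorted by value can be easier to read than a histogram. A minimal sketch on a hypothetical toy sample:

```python
import pandas as pd

# Hypothetical toy sample of sibsp values
sibsp = pd.Series([0, 1, 0, 0, 2, 1, 0, 8], name="sibsp")

# Count each distinct value, then order by the value itself rather than by frequency
counts = sibsp.value_counts().sort_index()
```

Plotting counts with kind="bar" then gives one bar per observed value, with no empty bins between 2 and 8.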
Parch (Parents/Children aboard)
- Similar to sibsp, most passengers had no parents or children aboard (median = 0).
- Values range from 0 to 9, with 8 unique values.
- The distribution is also heavily right-skewed, concentrated at 0.
Fare
- Mean fare is $33.30 but with a very high standard deviation of $51.76, indicating large variability.
- Fares range from $0 to $512.33.
- The distribution is extremely right-skewed: the median ($14.45) is much lower than the mean, indicating that a few very expensive tickets pull the average up.
- The boxplot shows many outliers on the upper end, corresponding to first-class luxury cabins.
- There is 1 missing value.
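The gap between mean and median can be complemented with the sample skewness. The sketch below uses a hypothetical handful of fares, chosen only to mimic one extreme value among many small ones; it is not the real column.

```python
import pandas as pd

# Hypothetical toy fares: mostly low values plus one very expensive ticket
fares = pd.Series([7.25, 7.90, 8.05, 13.00, 14.45, 26.00, 31.27, 512.33])

mean_fare = fares.mean()
median_fare = fares.median()

# Positive skewness indicates a long right tail
skewness = fares.skew()
```

Running titanic_df["fare"].skew() on the real data would quantify the right skew described above.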
Categorical Variables
Pclass (Passenger Class)
- 3rd class is the most frequent (709 passengers, ~54%), followed by 1st class (323) and 2nd class (277).
- The majority of passengers were in the lowest class, consistent with historical records.
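The class shares follow directly from the counts reported by value_counts(). As a worked example with those counts:

```python
import pandas as pd

# Counts per class taken from the value_counts() output above
pclass_counts = pd.Series({3: 709, 1: 323, 2: 277}, name="count")

# Share of each class as a percentage of all passengers
pclass_share = (pclass_counts / pclass_counts.sum() * 100).round(2)
```

On the DataFrame itself, titanic_df["pclass"].value_counts(normalize=True) returns the same proportions as fractions.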
Sex
- Males significantly outnumber females: 843 males (~64%) vs. 466 females (~36%).
- This gender imbalance is important for survival analysis, as the “women and children first” policy was applied during evacuation.
Embarked (Port of Embarkation)
- Southampton (S) is the dominant port with 914 passengers (~70%), followed by Cherbourg (C) with 270 and Queenstown (Q) with 123.
- There are 2 missing values in this variable.
Key Takeaways
- Class disparity: The majority of passengers were in 3rd class, which historically had the lowest survival rates.
- Gender imbalance: Nearly two-thirds of passengers were male, yet survival rates were higher for females.
- Age distribution: The passenger population was predominantly young adults (20-40 years), with significant missing data that requires careful imputation.
- Fare distribution: Highly skewed with extreme outliers, reflecting the vast economic differences between passenger classes.
- Family variables (sibsp, parch): Most passengers traveled alone or with very small family groups.
- Missing data strategy: The age column requires special attention due to its 20% missing rate; fare and embarked have minimal missing values that can be easily handled.
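One possible baseline for the missing data, shown only as a sketch, is median imputation for the numerical columns and mode imputation for embarked. This choice is an assumption for illustration, not the strategy used in this project, and the toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one missing value per column
toy_df = pd.DataFrame(
    {
        "age": [22.0, None, 30.0, 40.0],
        "fare": [7.25, 8.05, None, 26.0],
        "embarked": pd.Categorical(["S", None, "S", "C"]),
    }
)

imputed = toy_df.copy()

# Numerical columns: the median is robust to the skew and outliers seen above
for col in ["age", "fare"]:
    imputed[col] = imputed[col].fillna(imputed[col].median())

# Categorical column: fill with the most frequent category
imputed["embarked"] = imputed["embarked"].fillna(imputed["embarked"].mode().iloc[0])
```

For age, with about 20% of values missing, a single global median flattens the distribution, so a group-wise or model-based approach may be preferable in the data preparation phase.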
📖 References
Visualization
- https://joserzapata.github.io/courses/python-ciencia-datos/visualizacion/#visualizacion-con-pandas
- https://joserzapata.github.io/courses/python-ciencia-datos/visualizacion/plotly/
- https://joserzapata.github.io/courses/python-ciencia-datos/visualizacion/seaborn/
EDA
- https://www.analyticsvidhya.com/blog/2022/02/a-quick-guide-to-bivariate-analysis-in-python/
- https://www.kaggle.com/code/allohvk/titanic-advanced-eda
- https://www.kaggle.com/code/imkushwaha/bivariate-multivariate-analysis
Statistical tests