Univariate Analysis of the Data

Last updated: 19/Feb/2026

📚 Import libraries

# base libraries for data science
import sys
from pathlib import Path

import pandas as pd

# configuration to show only 2 decimal places
pd.set_option("display.float_format", "{:.2f}".format)

# print library version for reproducibility

print("Python version: ", sys.version)
print("Pandas version: ", pd.__version__)

Python version:  3.11.10 (main, Sep 27 2024, 20:27:21) [GCC 11.4.0]
Pandas version:  2.1.4

💾 Load data

The dataset already has the correct data types; they were fixed in the notebook:

Initial data exploration

DATA_DIR = Path.cwd().resolve().parents[1] / "data"

titanic_df = pd.read_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)
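
As a reading aid for the path logic above, the following sketch shows what `parents[1]` selects. The directory names are hypothetical and only illustrate the layout the notebook appears to assume, with the working directory two levels below the project root.

```python
from pathlib import PurePosixPath

# Hypothetical working directory, used only for illustration
cwd = PurePosixPath("/home/user/project/notebooks/eda")

# parents[0] is the immediate parent; parents[1] is one level further up
project_root = cwd.parents[1]
data_dir = project_root / "data"

print(project_root)  # /home/user/project
print(data_dir)      # /home/user/project/data
```

If the notebook is moved to a different depth in the repository, the index passed to `parents` would need to change accordingly.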

📊 Data description

General data information

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1309 non-null   int64
 1   pclass      1309 non-null   int64
 2   survived    1309 non-null   bool
 3   name        1309 non-null   object
 4   sex         1309 non-null   category
 5   age         1046 non-null   float64
 6   sibsp       1309 non-null   int64
 7   parch       1309 non-null   int64
 8   fare        1308 non-null   float64
 9   embarked    1307 non-null   category
dtypes: bool(1), category(2), float64(2), int64(4), object(1)
memory usage: 75.8+ KB

The ordinal column pclass was loaded as int64, so it has to be converted to an ordered category again.

Information about the pclass column can be checked in the notebook:

Initial data exploration

titanic_df["pclass"] = pd.Categorical(
    titanic_df["pclass"], categories=[3, 2, 1], ordered=True
)

# column Unnamed: 0 is not needed
titanic_df = titanic_df.drop(columns=["Unnamed: 0"])
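
To illustrate why the category order matters, here is a minimal sketch on toy values (not the real column). With `ordered=True`, pandas uses the declared order `3 < 2 < 1` for sorting and comparisons instead of the numeric order.

```python
import pandas as pd

# Toy values standing in for the pclass column
pclass = pd.Series([1, 3, 2, 3, 1])

# Same category order as above: 3rd class is the lowest, 1st the highest
pclass_cat = pd.Series(
    pd.Categorical(pclass, categories=[3, 2, 1], ordered=True)
)

# Order-aware operations follow the declared order, not numeric order
print(pclass_cat.min())                   # 3
print(pclass_cat.max())                   # 1
print(pclass_cat.sort_values().tolist())  # [3, 3, 2, 1, 1]
print((pclass_cat > 3).tolist())          # [True, False, True, False, True]
```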

General information about the data set:

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   pclass    1309 non-null   category
 1   survived  1309 non-null   bool
 2   name      1309 non-null   object
 3   sex       1309 non-null   category
 4   age       1046 non-null   float64
 5   sibsp     1309 non-null   int64
 6   parch     1309 non-null   int64
 7   fare      1308 non-null   float64
 8   embarked  1307 non-null   category
dtypes: bool(1), category(3), float64(2), int64(2), object(1)
memory usage: 56.8+ KB

# size of the dataframe
titanic_df.shape

(1309, 9)

# sample of the dataframe
titanic_df.sample(5)

	pclass	survived	name	sex	age	sibsp	parch	fare	embarked
58	1	False	Case, Mr. Howard Brown	male	49.00	0	0	26.00	S
736	3	False	Coxon, Mr. Daniel	male	59.00	0	0	7.25	S
523	2	True	Oxenham, Mr. Percy Thomas	male	22.00	0	0	10.50	S
1073	3	False	O'Connor, Mr. Maurice	male	NaN	0	0	7.75	Q
518	2	False	Nicholls, Mr. Joseph Charles	male	19.00	1	1	36.75	S

Number of missing values

titanic_df.isnull().sum()

pclass        0
survived      0
name          0
sex           0
age         263
sibsp         0
parch         0
fare          1
embarked      2
dtype: int64
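
Raw counts are easier to judge as a share of the rows. A minimal sketch on a toy frame (the values are made up for illustration): `isna().mean()` gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps; not the real Titanic data
toy_df = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, np.nan],
        "fare": [7.25, 71.28, np.nan, 8.05],
        "sex": ["male", "female", "female", "male"],
    }
)

# Percentage of missing values per column, largest first
missing_pct = toy_df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Applied to the counts above, age would come out at 263 / 1309, roughly 20% of the rows.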

Target Variable = Survived

General statistics of the data set

Numerical variables

titanic_df.describe()

	age	sibsp	parch	fare
count	1046.00	1309.00	1309.00	1308.00
mean	29.88	0.50	0.39	33.30
std	14.41	1.04	0.87	51.76
min	0.17	0.00	0.00	0.00
25%	21.00	0.00	0.00	7.90
50%	28.00	0.00	0.00	14.45
75%	39.00	1.00	0.00	31.27
max	80.00	8.00	9.00	512.33

Categorical variables

# categorical columns description
titanic_df.describe(include="category")

	pclass	sex	embarked
count	1309	1309	1307
unique	3	2	3
top	3	male	S
freq	709	843	914

📈 Univariate Analysis

Target Variable

titanic_df["survived"].value_counts().plot(
    kind="bar", color=["skyblue", "orange"], title="Survived passengers"
);

Numerical Variables

# list of the numerical columns
numerical_columns = list(titanic_df.select_dtypes(include=["number"]).columns)
numerical_columns

['age', 'sibsp', 'parch', 'fare']

age

column = "age"
titanic_df[column].describe()

count   1046.00
mean      29.88
std       14.41
min        0.17
25%       21.00
50%       28.00
75%       39.00
max       80.00
Name: age, dtype: float64

# number of unique values
titanic_df[column].nunique()

titanic_df[column].plot(
    kind="hist", bins=30, edgecolor="black", title=f"{column} histogram"
);

titanic_df[column].plot(kind="density", title=f"{column} distribution");

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

sibsp

column = "sibsp"
titanic_df[column].describe()

count   1309.00
mean       0.50
std        1.04
min        0.00
25%        0.00
50%        0.00
75%        1.00
max        8.00
Name: sibsp, dtype: float64

# number of unique values
titanic_df[column].nunique()

# unique values (there are only a few)
titanic_df[column].unique()

array([0, 1, 2, 3, 4, 5, 8])

# histogram of a column with 7 unique values between 0 and 8
titanic_df[column].plot(
    kind="hist", bins=8, edgecolor="black", title=f"{column} histogram"
);

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

parch

column = "parch"
titanic_df[column].describe()

count   1309.00
mean       0.39
std        0.87
min        0.00
25%        0.00
50%        0.00
75%        0.00
max        9.00
Name: parch, dtype: float64

# number of unique values

titanic_df[column].nunique()

# unique values (there are only a few)
titanic_df[column].unique()

array([0, 2, 1, 4, 3, 5, 6, 9])

titanic_df[column].plot(
    kind="hist", bins=9, edgecolor="black", title=f"{column} histogram"
);

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

fare

column = "fare"
titanic_df[column].describe()

count   1308.00
mean      33.30
std       51.76
min        0.00
25%        7.90
50%       14.45
75%       31.27
max      512.33
Name: fare, dtype: float64

titanic_df[column].plot(kind="kde", title=f"{column} distribution");

titanic_df[column].plot(
    kind="hist", bins=50, edgecolor="black", title=f"{column} histogram"
);

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

Categorical Variables

# list of the categorical columns
categorical_columns = list(titanic_df.select_dtypes(include=["category"]).columns)
categorical_columns

['pclass', 'sex', 'embarked']

pclass

column = "pclass"
titanic_df[column].describe()

count     1309
unique       3
top          3
freq       709
Name: pclass, dtype: int64

titanic_df[column].unique()

[1, 2, 3]
Categories (3, int64): [3 < 2 < 1]

titanic_df[column].value_counts()

pclass
3    709
1    323
2    277
Name: count, dtype: int64

titanic_df[column].value_counts().plot(
    kind="bar", color=["skyblue", "orange", "green"], title=f"{column} value counts"
);

sex

column = "sex"
titanic_df[column].describe()

count     1309
unique       2
top       male
freq       843
Name: sex, dtype: object

titanic_df[column].unique()

['female', 'male']
Categories (2, object): ['female', 'male']

titanic_df[column].value_counts()

sex
male      843
female    466
Name: count, dtype: int64

titanic_df[column].value_counts().plot(
    kind="bar", color=["skyblue", "orange"], title=f"{column} value counts"
);

embarked

column = "embarked"
titanic_df[column].describe()

count     1307
unique       3
top          S
freq       914
Name: embarked, dtype: object

titanic_df[column].unique()

['S', 'C', NaN, 'Q']
Categories (3, object): ['C', 'Q', 'S']

titanic_df[column].value_counts()

embarked
S    914
C    270
Q    123
Name: count, dtype: int64

titanic_df[column].value_counts().plot(
    kind="bar", color=["skyblue", "orange", "green"], title=f"{column} value counts"
);

String columns

titanic_df["name"].sample(5)

394                         Dibden, Mr. William
755                          Davies, Mr. Joseph
237                         Robbins, Mr. Victor
459                Jacobsohn, Mr. Sidney Samuel
1136    Rasmussen, Mrs. (Lena Jacobsen Solvang)
Name: name, dtype: object

titanic_df["name"].nunique()

📊 Analysis of Results and Conclusions

Dataset Overview

The dataset contains 1,309 passengers with 9 variables (after dropping the unnamed index column).
Missing values: age has 263 missing values (~20%), fare has 1, and embarked has 2. These will need to be addressed in the data preparation phase.

Target Variable: `survived`

The dataset is imbalanced: more passengers did not survive than survived, reflecting the historical reality of the Titanic disaster.
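
One way to put a number on the imbalance is `value_counts(normalize=True)`, which returns class shares instead of counts. The sketch below uses a toy column, so the printed shares are not the real ones.

```python
import pandas as pd

# Toy target column; the real shares come from titanic_df["survived"]
survived = pd.Series([False, False, False, True, True], name="survived")

# Share of each class instead of the raw count
shares = survived.value_counts(normalize=True)
print(shares)
```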

Numerical Variables

Age

Mean age is 29.88 years (median 28), with a standard deviation of 14.41.
Ages range from 0.17 (infants) to 80 years.
The distribution is slightly right-skewed (mean 29.88 vs. median 28), with a concentration of passengers between 20 and 40 years old.
There are 98 unique age values and 263 missing values (~20% of the data), which is significant and must be handled carefully during imputation.
The boxplot reveals outliers at the upper end (elderly passengers).
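
The outliers in the boxplot follow from the 1.5 × IQR rule that matplotlib applies by default. A small sketch using the quartiles reported by `describe()`; the helper name `iqr_fences` is hypothetical.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) limits of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of age as reported by describe() in this notebook
lower, upper = iqr_fences(q1=21.0, q3=39.0)
print(lower, upper)  # -6.0 66.0
```

Under this rule, ages above 66 are drawn as individual points. The lower limit is negative, so no age can fall below it.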

SibSp (Siblings/Spouses aboard)

Most passengers traveled without siblings or spouses (median = 0, 75th percentile = 1).
The distribution is heavily right-skewed, with values ranging from 0 to 8.
Only 7 unique values exist, making this variable behave almost like a categorical feature.
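
Because the variable is discrete, counting each value directly can be clearer than a histogram. A minimal sketch on toy values (not the real column):

```python
import pandas as pd

# Toy values standing in for titanic_df["sibsp"]
sibsp = pd.Series([0, 1, 0, 0, 2, 1, 0, 8], name="sibsp")

# One row per distinct value, ordered by the value rather than by frequency
counts = sibsp.value_counts().sort_index()
print(counts.to_dict())  # {0: 4, 1: 2, 2: 1, 8: 1}
```

Calling `counts.plot(kind="bar")` on the result would draw one bar per observed value, which avoids the bin-edge effects of a histogram.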

Parch (Parents/Children aboard)

Similar to sibsp, most passengers had no parents or children aboard (median = 0).
Values range from 0 to 9, with 8 unique values.
The distribution is also heavily right-skewed, concentrated at 0.

Fare

Mean fare is £33.30 (fares in this dataset are recorded in British pounds) but with a very high standard deviation of £51.76, indicating large variability.
Fares range from £0 to £512.33.
The distribution is extremely right-skewed: the median (£14.45) is much lower than the mean, indicating that a few very expensive tickets pull the average up.
The boxplot shows many outliers on the upper end, most likely corresponding to first-class tickets; this has to be confirmed in the bivariate analysis.
There is 1 missing value.
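
As a rough check of the skewness claims for age and fare, Pearson's second skewness coefficient can be computed from the summary statistics alone. This is only an approximation and differs from the moment-based value that `Series.skew()` returns; the function name is hypothetical.

```python
def pearson_median_skewness(mean: float, median: float, std: float) -> float:
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std


# Summary statistics as reported by describe() in this notebook
age_skew = pearson_median_skewness(mean=29.88, median=28.00, std=14.41)
fare_skew = pearson_median_skewness(mean=33.30, median=14.45, std=51.76)

print(round(age_skew, 2))   # 0.39
print(round(fare_skew, 2))  # 1.09
```

Both values are positive, and the coefficient for fare is close to three times the one for age, which matches the shapes seen in the histograms.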

Categorical Variables

Pclass (Passenger Class)

3rd class is the most frequent (709 passengers, ~54%), followed by 1st class (323) and 2nd class (277).
The majority of passengers were in the lowest class, consistent with historical records.

Sex

Males significantly outnumber females: 843 males (~64%) vs. 466 females (~36%).
This gender imbalance is important for survival analysis, as the “women and children first” policy was applied during evacuation.

Embarked (Port of Embarkation)

Southampton (S) is the dominant port with 914 passengers (~70%), followed by Cherbourg (C) with 270 and Queenstown (Q) with 123.
There are 2 missing values in this variable.

Key Takeaways

Class disparity: The majority of passengers were in 3rd class, which historically had the lowest survival rates.
Gender imbalance: Nearly two-thirds of passengers were male, yet survival rates were higher for females.
Age distribution: The passenger population was predominantly young adults (20-40 years), with significant missing data that requires careful imputation.
Fare distribution: Highly skewed with extreme outliers, reflecting the vast economic differences between passenger classes.
Family variables (sibsp, parch): Most passengers traveled alone or with very small family groups.
Missing data strategy: The age column requires special attention due to its 20% missing rate; fare and embarked have minimal missing values that can be easily handled.
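
The notebook does not fix an imputation strategy. As one possible baseline (an assumption, not the method used in this project), numeric gaps could be filled with the median and categorical gaps with the most frequent value. The sketch uses a toy frame.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column; not the real Titanic data
toy_df = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, 28.0],
        "fare": [7.25, 71.28, np.nan, 8.05],
        "embarked": pd.Categorical(["S", None, "C", "S"]),
    }
)

# Median for numeric columns, mode for the categorical column
imputed = toy_df.assign(
    age=toy_df["age"].fillna(toy_df["age"].median()),
    fare=toy_df["fare"].fillna(toy_df["fare"].median()),
    embarked=toy_df["embarked"].fillna(toy_df["embarked"].mode().iloc[0]),
)
print(imputed.isna().sum().sum())  # 0
```

With about 20% of age missing, a single median value would flatten the distribution noticeably, so a more careful approach may be preferable for that column.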

📖 References

Visualization

EDA

Statistical tests

https://nathanrosidi.medium.com/commonly-used-statistical-tests-in-data-science-93787568eb36
