Univariate Data Analysis

By Jose R. Zapata

Last updated: 19/Feb/2026

📚 Import libraries

# base libraries for data science
import sys
from pathlib import Path

import pandas as pd
# configuration to show only 2 decimal places
pd.set_option("display.float_format", "{:.2f}".format)
# print library version for reproducibility

print("Python version: ", sys.version)
print("Pandas version: ", pd.__version__)
Python version:  3.11.10 (main, Sep 27 2024, 20:27:21) [GCC 11.4.0]
Pandas version:  2.1.4

💾 Load data

The dataset already has the correct data types; they were fixed in the notebook:

Initial data exploration

DATA_DIR = Path.cwd().resolve().parents[1] / "data"

titanic_df = pd.read_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)

📊 Data description

General data information

titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1309 non-null   int64
 1   pclass      1309 non-null   int64
 2   survived    1309 non-null   bool
 3   name        1309 non-null   object
 4   sex         1309 non-null   category
 5   age         1046 non-null   float64
 6   sibsp       1309 non-null   int64
 7   parch       1309 non-null   int64
 8   fare        1308 non-null   float64
 9   embarked    1307 non-null   category
dtypes: bool(1), category(2), float64(2), int64(4), object(1)
memory usage: 75.8+ KB

Ordinal data has to be converted again

Information about the pclass column can be checked in the notebook:

Initial data exploration

titanic_df["pclass"] = pd.Categorical(
    titanic_df["pclass"], categories=[3, 2, 1], ordered=True
)

# column Unnamed: 0 is not needed
titanic_df = titanic_df.drop(columns=["Unnamed: 0"])
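
As a minimal sketch of why the category order matters, the example below uses a small hypothetical sample (not the real data) with the same `categories=[3, 2, 1]` ordering. Comparisons, `min`/`max`, and sorting follow the declared order rather than the numeric value.

```python
import pandas as pd

# Hypothetical sample standing in for titanic_df["pclass"]
pclass = pd.Series(
    pd.Categorical([1, 3, 2, 3], categories=[3, 2, 1], ordered=True)
)

# With categories=[3, 2, 1], third class is the lowest level and first class the highest
lowest = pclass.min()
highest = pclass.max()

# Comparisons follow the category order: "greater than 3" means a better class
above_third = (pclass > 3).tolist()

# Sorting also follows the category order (3, then 2, then 1)
sorted_values = pclass.sort_values().tolist()

print(lowest, highest, above_third, sorted_values)
```

Without `ordered=True`, comparisons such as `pclass > 3` raise a `TypeError`, which is why the conversion is repeated after loading the file.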

General information about the data set:

titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   pclass    1309 non-null   category
 1   survived  1309 non-null   bool
 2   name      1309 non-null   object
 3   sex       1309 non-null   category
 4   age       1046 non-null   float64
 5   sibsp     1309 non-null   int64
 6   parch     1309 non-null   int64
 7   fare      1308 non-null   float64
 8   embarked  1307 non-null   category
dtypes: bool(1), category(3), float64(2), int64(2), object(1)
memory usage: 56.8+ KB
# size of the dataframe
titanic_df.shape
(1309, 9)
# sample of the dataframe
titanic_df.sample(5)

      pclass  survived  name                          sex     age  sibsp  parch   fare  embarked
58         1     False  Case, Mr. Howard Brown        male  49.00      0      0  26.00  S
736        3     False  Coxon, Mr. Daniel             male  59.00      0      0   7.25  S
523        2      True  Oxenham, Mr. Percy Thomas     male  22.00      0      0  10.50  S
1073       3     False  O'Connor, Mr. Maurice         male    NaN      0      0   7.75  Q
518        2     False  Nicholls, Mr. Joseph Charles  male  19.00      1      1  36.75  S

Number of missing values

titanic_df.isnull().sum()
pclass        0
survived      0
name          0
sex           0
age         263
sibsp         0
parch         0
fare          1
embarked      2
dtype: int64

Target Variable = Survived

General statistics of the data set

Numerical variables
titanic_df.describe()

           age    sibsp    parch     fare
count  1046.00  1309.00  1309.00  1308.00
mean     29.88     0.50     0.39    33.30
std      14.41     1.04     0.87    51.76
min       0.17     0.00     0.00     0.00
25%      21.00     0.00     0.00     7.90
50%      28.00     0.00     0.00    14.45
75%      39.00     1.00     0.00    31.27
max      80.00     8.00     9.00   512.33

Categorical variables
# categorical columns description
titanic_df.describe(include="category")

        pclass   sex  embarked
count     1309  1309      1307
unique       3     2         3
top          3  male         S
freq       709   843       914

📈 Univariate Analysis

Target Variable

titanic_df["survived"].value_counts().plot(
    kind="bar", color=["skyblue", "orange"], title="Survived passengers"
);

[Figure: bar chart "Survived passengers"]

Numerical Variables

# list of the numerical columns
numerical_columns = list(titanic_df.select_dtypes(include=["number"]).columns)
numerical_columns
['age', 'sibsp', 'parch', 'fare']

age

column = "age"
titanic_df[column].describe()
count   1046.00
mean      29.88
std       14.41
min        0.17
25%       21.00
50%       28.00
75%       39.00
max       80.00
Name: age, dtype: float64
# number of unique values
titanic_df[column].nunique()
98
titanic_df[column].plot(
    kind="hist", bins=30, edgecolor="black", title=f"{column} histogram"
);

[Figure: age histogram]

titanic_df[column].plot(kind="density", title=f"{column} distribution");

[Figure: age distribution (density)]

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

[Figure: age boxplot]

sibsp

column = "sibsp"
titanic_df[column].describe()
count   1309.00
mean       0.50
std        1.04
min        0.00
25%        0.00
50%        0.00
75%        1.00
max        8.00
Name: sibsp, dtype: float64
# number of unique values
titanic_df[column].nunique()
7
# unique values (there are only a few)
titanic_df[column].unique()
array([0, 1, 2, 3, 4, 5, 8])
# histogram of a column with 7 unique values between 0 and 8
titanic_df[column].plot(
    kind="hist", bins=8, edgecolor="black", title=f"{column} histogram"
);

[Figure: sibsp histogram]

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

[Figure: sibsp boxplot]
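
For a column with a handful of integer values, the choice of bin edges decides which bar each value lands in. The sketch below uses a hypothetical sample with the same unique values as sibsp and compares `bins=8` with edges centered on the integers. It is an optional alternative, not the method used above.

```python
import numpy as np

# Hypothetical sample with the same unique values as sibsp
values = np.array([0, 0, 0, 1, 1, 2, 3, 4, 5, 8, 8])

# bins=8 over the range 0..8 gives edges 0, 1, ..., 8, so the last bin is [7, 8]
counts_default, edges_default = np.histogram(values, bins=8)

# Edges at -0.5, 0.5, ..., 8.5 give exactly one bin per integer from 0 to 8
centered_edges = np.arange(values.min() - 0.5, values.max() + 1.5, 1.0)
counts_centered, _ = np.histogram(values, bins=centered_edges)

print(counts_default)
print(counts_centered)
```

The same array of edges can be passed as `bins=centered_edges` to `titanic_df[column].plot(kind="hist", ...)`, so that each bar is centered on its integer value.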

parch

column = "parch"
titanic_df[column].describe()
count   1309.00
mean       0.39
std        0.87
min        0.00
25%        0.00
50%        0.00
75%        0.00
max        9.00
Name: parch, dtype: float64
# number of unique values

titanic_df[column].nunique()
8
# unique values (there are only a few)
titanic_df[column].unique()
array([0, 2, 1, 4, 3, 5, 6, 9])
titanic_df[column].plot(
    kind="hist", bins=9, edgecolor="black", title=f"{column} histogram"
);

[Figure: parch histogram]

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

[Figure: parch boxplot]

fare

column = "fare"
titanic_df[column].describe()
count   1308.00
mean      33.30
std       51.76
min        0.00
25%        7.90
50%       14.45
75%       31.27
max      512.33
Name: fare, dtype: float64
titanic_df[column].plot(kind="kde", title=f"{column} distribution");

[Figure: fare distribution (density)]

titanic_df[column].plot(
    kind="hist", bins=50, edgecolor="black", title=f"{column} histogram"
);

[Figure: fare histogram]

titanic_df[column].plot(kind="box", title=f"{column} boxplot");

[Figure: fare boxplot]

Categorical Variables

# list of the categorical columns
categorical_columns = list(titanic_df.select_dtypes(include=["category"]).columns)
categorical_columns
['pclass', 'sex', 'embarked']

pclass

column = "pclass"
titanic_df[column].describe()
count     1309
unique       3
top          3
freq       709
Name: pclass, dtype: int64
titanic_df[column].unique()
[1, 2, 3]
Categories (3, int64): [3 < 2 < 1]
titanic_df[column].value_counts()
pclass
3    709
1    323
2    277
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
    kind="bar", color=["skyblue", "orange", "green"], title=f"{column} value counts"
);

[Figure: pclass value counts bar chart]

sex

column = "sex"
titanic_df[column].describe()
count     1309
unique       2
top       male
freq       843
Name: sex, dtype: object
titanic_df[column].unique()
['female', 'male']
Categories (2, object): ['female', 'male']
titanic_df[column].value_counts()
sex
male      843
female    466
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
    kind="bar", color=["skyblue", "orange"], title=f"{column} value counts"
);

[Figure: sex value counts bar chart]

embarked

column = "embarked"
titanic_df[column].describe()
count     1307
unique       3
top          S
freq       914
Name: embarked, dtype: object
titanic_df[column].unique()
['S', 'C', NaN, 'Q']
Categories (3, object): ['C', 'Q', 'S']
titanic_df[column].value_counts()
embarked
S    914
C    270
Q    123
Name: count, dtype: int64
titanic_df[column].value_counts().plot(
    kind="bar", color=["skyblue", "orange", "green"], title=f"{column} value counts"
);

[Figure: embarked value counts bar chart]

String columns

titanic_df["name"].sample(5)
394                         Dibden, Mr. William
755                          Davies, Mr. Joseph
237                         Robbins, Mr. Victor
459                Jacobsohn, Mr. Sidney Samuel
1136    Rasmussen, Mrs. (Lena Jacobsen Solvang)
Name: name, dtype: object
titanic_df["name"].nunique()
1307
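
With 1309 rows and 1307 unique names, two rows repeat a name that already appears elsewhere. A minimal sketch for listing such rows, using invented names rather than the real column:

```python
import pandas as pd

# Invented names for illustration; the real column is titanic_df["name"]
names = pd.Series(
    ["Doe, Mr. John", "Roe, Miss. Jane", "Doe, Mr. John", "Poe, Mr. Alan"]
)

# keep=False marks every occurrence of a repeated value, not only the later ones
repeated = names[names.duplicated(keep=False)]
n_repeated_rows = len(repeated)

# Rows beyond the first occurrence of each name
n_extra_rows = len(names) - names.nunique()

print(repeated)
print(n_repeated_rows, n_extra_rows)
```

A repeated name does not necessarily mean a duplicated record; the other columns would need to be compared before dropping anything.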

📊 Analysis of Results and Conclusions

Dataset Overview

  • The dataset contains 1,309 passengers with 9 variables (after dropping the unnamed index column).
  • Missing values: age has 263 missing values (~20%), fare has 1, and embarked has 2. These will need to be addressed in the data preparation phase.
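
The percentages quoted above can be reproduced from the missing-value counts. This sketch copies the counts from the `isnull().sum()` output instead of reloading the file:

```python
import pandas as pd

# Counts copied from the isnull().sum() output shown earlier
n_rows = 1309
missing_counts = pd.Series({"age": 263, "fare": 1, "embarked": 2})

# Share of missing values per column, in percent
missing_pct = (missing_counts / n_rows * 100).round(2)

print(missing_pct)  # age 20.09, fare 0.08, embarked 0.15
```

On the DataFrame itself, `titanic_df.isnull().mean() * 100` should give the same percentages in one step, since the mean of a boolean column is the fraction of `True` values.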

Target Variable: survived

  • The dataset is imbalanced: more passengers did not survive than survived, reflecting the historical reality of the Titanic disaster.
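
The bar chart shows the imbalance visually; the class proportions can also be quantified with `value_counts(normalize=True)`. The sketch below uses a hypothetical boolean sample, so the numbers are not the real survival rates.

```python
import pandas as pd

# Hypothetical boolean target standing in for titanic_df["survived"]
survived = pd.Series([False, False, False, True, True])

# normalize=True returns proportions instead of raw counts
class_share = survived.value_counts(normalize=True)

print(class_share)
```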

Numerical Variables

Age

  • Mean age is 29.88 years (median 28), with a standard deviation of 14.41.
  • Ages range from 0.17 (infants) to 80 years.
  • The distribution is slightly right-skewed (mean 29.88 vs. median 28), with a concentration of passengers between 20 and 40 years old.
  • There are 98 unique age values and 263 missing values (~20% of the data), which is significant and must be handled carefully during imputation.
  • The boxplot reveals outliers at the upper end (elderly passengers).
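
The points drawn as outliers in the boxplot follow Tukey's 1.5 × IQR rule, which is the matplotlib default (`whis=1.5`). A short sketch using the quartiles from the `describe()` output for age:

```python
# Quartiles copied from the describe() output for age
q1, q3 = 21.0, 39.0
iqr = q3 - q1  # 18.0

# Tukey's rule: values beyond 1.5 * IQR from the quartiles are flagged
lower_fence = q1 - 1.5 * iqr  # -6.0, below the minimum age, so no low outliers
upper_fence = q3 + 1.5 * iqr  # 66.0

print(lower_fence, upper_fence)
```

Under this rule, passengers older than 66 are the ones plotted as individual points.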

SibSp (Siblings/Spouses aboard)

  • Most passengers traveled without siblings or spouses (median = 0, 75th percentile = 1).
  • The distribution is heavily right-skewed, with values ranging from 0 to 8.
  • Only 7 unique values exist, making this variable behave almost like a categorical feature.

Parch (Parents/Children aboard)

  • Similar to sibsp, most passengers had no parents or children aboard (median = 0).
  • Values range from 0 to 9, with 8 unique values.
  • The distribution is also heavily right-skewed, concentrated at 0.
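
Because sibsp and parch take so few distinct values, a frequency table sorted by value can be easier to read than a histogram. A minimal sketch on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample standing in for titanic_df["parch"]
parch = pd.Series([0, 0, 0, 0, 1, 2, 0, 1, 9])

# sort_index() orders the table by value rather than by frequency
counts = parch.value_counts().sort_index()

# Fraction of passengers with no parents or children aboard
share_zero = (parch == 0).mean()

print(counts)
print(round(share_zero, 2))
```

Calling `counts.plot(kind="bar")` on this table would draw one bar per observed value.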

Fare

  • Mean fare is £33.30 but with a very high standard deviation of £51.76, indicating large variability (fares in this dataset are recorded in British pounds).
  • Fares range from £0 to £512.33.
  • The distribution is extremely right-skewed: the median (£14.45) is much lower than the mean, indicating that a few very expensive tickets pull the average up.
  • The boxplot shows many outliers on the upper end, most likely expensive first-class tickets; this would need to be checked against pclass.
  • There is 1 missing value.
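
Skewness can be measured with `Series.skew()`, and a log transform is one common way to compress a long right tail. The sketch below uses a hypothetical sample, not the real fares, and the transform is only an illustration, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample standing in for titanic_df["fare"]
fare = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 14.45, 26.0, 31.27, 80.0, 512.33])

# Positive skew means a long right tail; the mean sits above the median
raw_skew = fare.skew()
mean_above_median = fare.mean() > fare.median()

# log1p computes log(1 + x), so fares of 0 stay defined
log_skew = np.log1p(fare).skew()

print(raw_skew, log_skew, mean_above_median)
```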

Categorical Variables

Pclass (Passenger Class)

  • 3rd class is the most frequent (709 passengers, ~54%), followed by 1st class (323) and 2nd class (277).
  • The majority of passengers were in the lowest class, consistent with historical records.

Sex

  • Males significantly outnumber females: 843 males (~64%) vs. 466 females (~36%).
  • This gender imbalance is important for survival analysis, as the “women and children first” policy was applied during evacuation.

Embarked (Port of Embarkation)

  • Southampton (S) is the dominant port with 914 passengers (~70%), followed by Cherbourg (C) with 270 and Queenstown (Q) with 123.
  • There are 2 missing values in this variable.
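
By default `value_counts()` leaves out missing values, which is why the counts for embarked add up to 1307 rather than 1309. A sketch on a hypothetical sample showing how to include them:

```python
import pandas as pd

# Hypothetical sample standing in for titanic_df["embarked"]
embarked = pd.Series(
    pd.Categorical(["S", "C", None, "S", "Q"], categories=["C", "Q", "S"])
)

# Default behavior: NaN is excluded from the table
counts_default = embarked.value_counts()

# dropna=False adds a row for the missing values
counts_with_nan = embarked.value_counts(dropna=False)
n_missing = int(embarked.isna().sum())

print(counts_with_nan)
```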

Key Takeaways

  1. Class disparity: The majority of passengers were in 3rd class, which historically had the lowest survival rates.
  2. Gender imbalance: Nearly two-thirds of passengers were male, yet survival rates were higher for females.
  3. Age distribution: The passenger population was predominantly young adults (20-40 years), with significant missing data that requires careful imputation.
  4. Fare distribution: Highly skewed with extreme outliers, reflecting the vast economic differences between passenger classes.
  5. Family variables (sibsp, parch): Most passengers traveled alone or with very small family groups.
  6. Missing data strategy: The age column requires special attention due to its 20% missing rate; fare and embarked have minimal missing values that can be easily handled.
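
As one possible starting point for the missing-data step, and not necessarily the approach taken later in this project, the sketch below fills age with the median and embarked with the most frequent category on a small hypothetical DataFrame.

```python
import pandas as pd

# Hypothetical rows standing in for titanic_df
df = pd.DataFrame(
    {
        "age": [22.0, None, 35.0, 58.0],
        "embarked": pd.Categorical(
            ["S", None, "C", "S"], categories=["C", "Q", "S"]
        ),
    }
)

# The median is less sensitive than the mean to the upper-end outliers
df["age"] = df["age"].fillna(df["age"].median())

# mode() returns a Series; take its first entry as the most frequent category
df["embarked"] = df["embarked"].fillna(df["embarked"].mode().iloc[0])

print(df)
```

In a modeling workflow, the fill values would normally be computed on the training split only and then applied to the other splits.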
