Machine Learning Model Selection

Por Jose R. Zapata

Last updated: 21/Nov/2024

Description

Select the best machine learning model for the Titanic dataset.

The models must perform at least better than the baseline model (heuristic approach): Modelo Base.

📚 Import libraries

# base libraries for data science
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import (
    KFold,
    ShuffleSplit,
    cross_val_score,
    learning_curve,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

💾 Load data

DATA_DIR = Path.cwd().resolve().parents[1] / "data"

titanic_df = pd.read_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)
# print library version for reproducibility

print("Pandas version: ", pd.__version__)
Pandas version:  2.1.4

👷 Data preparation

The columns that will be used are:

[‘pclass’, ‘sex’, ‘age’, ‘sibsp’, ‘parch’, ‘fare’, ’embarked’]

selected_features = [
    "pclass",
    "sex",
    "age",
    "sibsp",
    "parch",
    "fare",
    "embarked",
    "survived",
]

titanic_features = titanic_df[selected_features]

titanic_features.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   pclass    1309 non-null   int64
 1   sex       1309 non-null   category
 2   age       1046 non-null   float64
 3   sibsp     1309 non-null   int64
 4   parch     1309 non-null   int64
 5   fare      1308 non-null   float64
 6   embarked  1307 non-null   category
 7   survived  1309 non-null   bool
dtypes: bool(1), category(2), float64(2), int64(3)
memory usage: 55.3 KB
titanic_features.isna().sum()
pclass        0
sex           0
age         263
sibsp         0
parch         0
fare          1
embarked      2
survived      0
dtype: int64

The target variable is survived and the values are True or False. I’m going to transform the target variable to 1 or 0.

titanic_features.loc[:, "survived"] = titanic_features["survived"].astype(int)

Drop duplicates, if any exist in the dataset. This is important to avoid bias and data leakage when a machine learning model is trained on the data.

Duplicated rows create bias in the dataset: the model is trained on repeated observations, which can lead to overfitting and data leakage problems.

titanic_features = titanic_features.drop_duplicates()
titanic_features.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1114 entries, 0 to 1308
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   pclass    1114 non-null   int64
 1   sex       1114 non-null   category
 2   age       974 non-null    float64
 3   sibsp     1114 non-null   int64
 4   parch     1114 non-null   int64
 5   fare      1113 non-null   float64
 6   embarked  1112 non-null   category
 7   survived  1114 non-null   int64
dtypes: category(2), float64(2), int64(4)
memory usage: 63.3 KB

When only 3 columns were selected (as in the baseline model) there were many duplicates; the more columns are selected, the lower the probability of duplicates.

In this case the duplicates are dropped, so the dataset goes from 1309 to 1114 rows; keep this in mind when comparing against models trained on the full dataset.
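A quick way to see this effect is to count duplicated rows over different column subsets (an optional check; the small subset used here is only an example):

# duplicates when only a small subset of columns is considered
print(titanic_df[["pclass", "sex", "survived"]].duplicated().sum())

# duplicates when all the selected columns are considered
print(titanic_df[selected_features].duplicated().sum())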

titanic_features.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1114 entries, 0 to 1308
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   pclass    1114 non-null   int64
 1   sex       1114 non-null   category
 2   age       974 non-null    float64
 3   sibsp     1114 non-null   int64
 4   parch     1114 non-null   int64
 5   fare      1113 non-null   float64
 6   embarked  1112 non-null   category
 7   survived  1114 non-null   int64
dtypes: category(2), float64(2), int64(4)
memory usage: 63.3 KB
titanic_features.isna().sum()
pclass        0
sex           0
age         140
sibsp         0
parch         0
fare          1
embarked      2
survived      0
dtype: int64
titanic_features.sample(10, random_state=42)

      pclass     sex   age  sibsp  parch     fare embarked  survived
969        3  female  30.0      1      0  15.5500        S         0
101        1    male  39.0      0      0  29.7000        C         0
1306       3    male  26.5      0      0   7.2250        C         0
784        3    male  23.0      1      0  13.9000        S         0
299        1    male  40.0      0      0  27.7208        C         0
958        3  female   NaN      0      4  25.4667        S         0
1075       3    male  23.0      0      0   9.2250        S         0
267        1    male  56.0      0      0  30.6958        C         0
130        1  female  22.0      0      1  59.4000        C         1
1190       3    male  25.0      0      0   9.5000        S         1

👨‍🏭 Feature Engineering

cols_numeric = ["age", "fare", "sibsp", "parch"]
cols_categoric = ["sex", "embarked"]
cols_categoric_ord = ["pclass"]

The age column needs imputation; the median value is used for this model. The categorical columns are imputed with the most frequent value and then encoded (one-hot for the nominal columns, ordinal for pclass).

numeric_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
    ]
)

categorical_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder()),
    ]
)

categorical_ord_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OrdinalEncoder()),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipe, cols_numeric),
        ("categoric", categorical_pipe, cols_categoric),
        ("categoric ordinal", categorical_ord_pipe, cols_categoric_ord),
    ]
)
preprocessor
ColumnTransformer(transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median'))]),
                                 ['age', 'fare', 'sibsp', 'parch']),
                                ('categoric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot', OneHotEncoder())]),
                                 ['sex', 'embarked']),
                                ('categoric ordinal',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot', OrdinalEncoder())]),
                                 ['pclass'])])

Train / Test split

X_features = titanic_features.drop("survived", axis="columns")
Y_target = titanic_features["survived"]

# 80% train, 20% test
x_train, x_test, y_train, y_test = train_test_split(
    X_features, Y_target, test_size=0.2, stratify=Y_target
)
x_train.shape, y_train.shape
((891, 7), (891,))
x_test.shape, y_test.shape
((223, 7), (223,))
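Because the split is stratified, the class proportions should be nearly identical in train and test; a quick optional check:

# verify that stratification kept the class proportions
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))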

Example of the data preprocessing pipeline:

# train pipeline
preprocessor.fit(x_train)

# get the names of the output columns of the preprocessing
# using .get_feature_names_out()
feature_names = preprocessor.get_feature_names_out()

# transform x_test with the preprocessor and wrap the result in a DataFrame
x_test_transformed = preprocessor.transform(x_test)
x_test_transformed = pd.DataFrame(x_test_transformed, columns=feature_names)
x_test_transformed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 223 entries, 0 to 222
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   numeric__age               223 non-null    float64
 1   numeric__fare              223 non-null    float64
 2   numeric__sibsp             223 non-null    float64
 3   numeric__parch             223 non-null    float64
 4   categoric__sex_female      223 non-null    float64
 5   categoric__sex_male        223 non-null    float64
 6   categoric__embarked_C      223 non-null    float64
 7   categoric__embarked_Q      223 non-null    float64
 8   categoric__embarked_S      223 non-null    float64
 9   categoric ordinal__pclass  223 non-null    float64
dtypes: float64(10)
memory usage: 17.6 KB
x_test_transformed.sample(5)

     numeric__age  numeric__fare  numeric__sibsp  numeric__parch  categoric__sex_female  categoric__sex_male  categoric__embarked_C  categoric__embarked_Q  categoric__embarked_S  categoric ordinal__pclass
32           36.0         0.0000             0.0             0.0                    0.0                  1.0                    0.0                    0.0                    1.0                        2.0
17           30.0         8.6625             0.0             0.0                    1.0                  0.0                    0.0                    0.0                    1.0                        2.0
41           28.5        14.5000             1.0             1.0                    0.0                  1.0                    0.0                    0.0                    1.0                        2.0
103          21.0        61.3792             0.0             1.0                    0.0                  1.0                    1.0                    0.0                    0.0                        0.0
174          31.0        37.0042             1.0             1.0                    0.0                  1.0                    1.0                    0.0                    0.0                        1.0
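As an alternative to wrapping the NumPy output manually, scikit-learn can return DataFrames directly; a minimal sketch, assuming scikit-learn >= 1.2 (where OneHotEncoder has the sparse_output parameter and set_config accepts transform_output):

from sklearn import set_config

# ask every transformer to return a DataFrame with named columns
# (the OneHotEncoder must be dense for pandas output, hence sparse_output=False)
set_config(transform_output="pandas")

dense_categorical_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(sparse_output=False)),
    ]
)
dense_preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipe, cols_numeric),
        ("categoric", dense_categorical_pipe, cols_categoric),
        ("categoric ordinal", categorical_ord_pipe, cols_categoric_ord),
    ]
)
dense_preprocessor.fit(x_train)
print(dense_preprocessor.transform(x_test).head())

# restore the default NumPy output
set_config(transform_output="default")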

Models

The best practice is to import the libraries at the beginning of the notebook, but in this notebook some libraries are imported in the cell where they are used.

In these experiments basic machine learning models are used:

  • Logistic Regression
  • Linear Discriminant Analysis
  • Stochastic Gradient Descent classifier
  • Linear Support Vector Machine
  • Radius Neighbors Classifier
  • Gaussian Naive Bayes
  • Decision Tree
  • Random Forest
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

Basic Model Selection

The models are trained with the default hyperparameters, and the models with the best performance will be selected to be tuned in the next step.

Helper functions

def summarize_classification(y_test, y_pred):
    acc = accuracy_score(y_test, y_pred, normalize=True)
    prec = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc = roc_auc_score(y_test, y_pred)

    return {"accuracy": acc, "precision": prec, "recall": recall, "f1": f1, "roc": roc}
def build_model(
    classifier_fn,
    preprocessor: ColumnTransformer,
    data_params: dict,
    test_frac: float = 0.2,
) -> dict:
    """
    Function to train a classification model

    Args:
        classifier_fn: classification function
        preprocessor (ColumnTransformer): preprocessor pipeline object
        data_params (dict): dictionary containing 'name_of_y_col',
                            'names_of_x_cols', and 'dataset'
        test_frac (float): fraction of data for the test, default 0.2

    Returns:
        dict: dictionary with the model performance metrics on train and test

    """

    # Extract data parameters
    name_of_y_col = data_params["name_of_y_col"]
    names_of_x_cols = data_params["names_of_x_cols"]
    dataset = data_params["dataset"]

    # Separate the feature columns and the target column
    X = dataset[names_of_x_cols]
    Y = dataset[name_of_y_col]

    # Split the data into train and test
    x_train, x_test, y_train, y_test = train_test_split(
        X, Y, test_size=test_frac, random_state=1234
    )

    # Create the pipeline with preprocessing and the classification model
    classifier_pipe = Pipeline(
        steps=[("preprocessor", preprocessor), ("model", classifier_fn)]
    )

    # Train the classifier pipeline
    model = classifier_pipe.fit(x_train, y_train)

    # Predict the test data
    y_pred = model.predict(x_test)

    # Predict the train data
    y_pred_train = model.predict(x_train)

    # Calculate the performance metrics
    train_summary = summarize_classification(y_train, y_pred_train)
    test_summary = summarize_classification(y_test, y_pred)

    return {"train": train_summary, "test": test_summary}

First Selection of Models

The idea is to do a simple train-and-evaluate pass over the models to see which ones perform worst and discard them.

FEATURES = list(x_train.columns)
FEATURES
['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']

Simple Train and Evaluate

result_dict = {}
models = {
    "logistic": LogisticRegression(solver="liblinear"),
    "lda": LinearDiscriminantAnalysis(),
    "sgd": SGDClassifier(),
    "svc": LinearSVC(C=1.0, max_iter=1000, tol=1e-3, dual=False),
    "radius_neighbors": RadiusNeighborsClassifier(radius=40.0),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
}

data_params = {
    "name_of_y_col": "survived",
    "names_of_x_cols": FEATURES,
    "dataset": titanic_df,
}
for model_name, model in models.items():
    result_dict[model_name] = build_model(model, preprocessor, data_params)

Model Comparison

# Convert the result_dict to a DataFrame for easier plotting
metrics = ["accuracy", "precision", "recall", "f1", "roc"]
models = list(result_dict.keys())
data_train = {
    metric: {model: result_dict[model]["train"][metric] for model in models}
    for metric in metrics
}
data_test = {
    metric: {model: result_dict[model]["test"][metric] for model in models}
    for metric in metrics
}

df_train = pd.DataFrame(data_train)
df_test = pd.DataFrame(data_test)

# Plot the bar chart for each metric
for metric in metrics:
    fig, ax = plt.subplots(figsize=(10, 4))
    width = 0.35  # the width of the bars

    df_train[metric].plot(
        kind="bar", ax=ax, width=width, position=1, label="Train", color="blue"
    )
    df_test[metric].plot(
        kind="bar", ax=ax, width=width, position=0, label="Test", color="orange"
    )

    # Add horizontal lines for average performance
    avg_train = df_train[metric].mean()
    avg_test = df_test[metric].mean()
    ax.axhline(avg_train, color="blue", linestyle="--", linewidth=1, label="Avg Train")
    ax.axhline(avg_test, color="orange", linestyle="--", linewidth=1, label="Avg Test")

    # Adjust the layout
    ax.set_ylabel(metric.capitalize())
    ax.set_title(f"Model {metric.capitalize()} Comparison")
    ax.legend()

    # Set the x-tick labels inside the bars and rotate by 90 degrees
    ax.set_xticks(range(len(df_train.index)))
    ax.set_xticklabels([])

    # Draw the x-tick labels inside the bars rotated by 90 degrees
    for i, label in enumerate(df_train.index):
        bar_center = (df_train.loc[label, metric] + df_test.loc[label, metric]) / 2
        ax.text(i, bar_center, label, ha="center", va="center_baseline", rotation=45)

    plt.tight_layout()
    plt.show()

[figure: bar charts comparing train and test accuracy, precision, recall, F1, and ROC AUC for each model]

# Create a DataFrame combining df_train and df_test
df_combined = pd.concat(
    [df_train.add_suffix("_train"), df_test.add_suffix("_test")], axis=1
)

# Calculate the difference between train and test values
df_combined["accuracy_diff"] = (
    df_combined["accuracy_train"] - df_combined["accuracy_test"]
)
df_combined["precision_diff"] = (
    df_combined["precision_train"] - df_combined["precision_test"]
)
df_combined["recall_diff"] = df_combined["recall_train"] - df_combined["recall_test"]
df_combined["f1_diff"] = df_combined["f1_train"] - df_combined["f1_test"]
df_combined["roc_diff"] = df_combined["roc_train"] - df_combined["roc_test"]

# Detect models with overfitting (significant difference between train and test)
overfitting_threshold = 0.1  # Threshold to consider overfitting
overfitting_models = df_combined[
    (df_combined["accuracy_diff"] > overfitting_threshold)
    | (df_combined["precision_diff"] > overfitting_threshold)
    | (df_combined["recall_diff"] > overfitting_threshold)
    | (df_combined["f1_diff"] > overfitting_threshold)
    | (df_combined["roc_diff"] > overfitting_threshold)
]

# Calculate the average performance in train and test for each metric
mean_performance_train = df_combined[
    ["accuracy_train", "precision_train", "recall_train", "f1_train", "roc_train"]
].mean()
mean_performance_test = df_combined[
    ["accuracy_test", "precision_test", "recall_test", "f1_test", "roc_test"]
].mean()

# Detect models with low performance
# (performance in both train and test below the average of other models)
low_performance_models = df_combined[
    (df_combined["accuracy_train"] < mean_performance_train["accuracy_train"])
    & (df_combined["accuracy_test"] < mean_performance_test["accuracy_test"])
    & (df_combined["precision_train"] < mean_performance_train["precision_train"])
    & (df_combined["precision_test"] < mean_performance_test["precision_test"])
    & (df_combined["recall_train"] < mean_performance_train["recall_train"])
    & (df_combined["recall_test"] < mean_performance_test["recall_test"])
    & (df_combined["f1_train"] < mean_performance_train["f1_train"])
    & (df_combined["f1_test"] < mean_performance_test["f1_test"])
    & (df_combined["roc_train"] < mean_performance_train["roc_train"])
    & (df_combined["roc_test"] < mean_performance_test["roc_test"])
]

print(f"Models with overfitting: {list(overfitting_models.index)} ")
print(f"Models with low performance: {list(low_performance_models.index)} ")
Models with overfitting: ['radius_neighbors', 'decision_tree', 'random_forest']
Models with low performance: ['radius_neighbors']
# Detect models with similar performance in train and test
similar_performance_threshold = 0.05  # Threshold to consider similar performance
similar_performance_models = df_combined[
    (df_combined["accuracy_diff"].abs() < similar_performance_threshold)
    & (df_combined["precision_diff"].abs() < similar_performance_threshold)
    & (df_combined["recall_diff"].abs() < similar_performance_threshold)
    & (df_combined["f1_diff"].abs() < similar_performance_threshold)
    & (df_combined["roc_diff"].abs() < similar_performance_threshold)
]

print(
    "Models with similar performance in train and test: "
    f"{list(similar_performance_models.index)}"
)
Models with similar performance in train and test: ['logistic', 'lda', 'sgd', 'svc', 'naive_bayes']
overfitting_models

                  accuracy_train  precision_train  recall_train  f1_train  roc_train
radius_neighbors        0.664756         0.666667      0.249377  0.362976   0.585989
decision_tree           0.966571         0.984127      0.927681  0.955071   0.959196
random_forest           0.966571         0.962121      0.950125  0.956085   0.963452

                  accuracy_test  precision_test  recall_test   f1_test  roc_test
radius_neighbors       0.629771        0.527778     0.191919  0.281481  0.543812
decision_tree          0.801527        0.737374     0.737374  0.737374  0.788932
random_forest          0.809160        0.747475     0.747475  0.747475  0.797050

                  accuracy_diff  precision_diff  recall_diff   f1_diff  roc_diff
radius_neighbors       0.034985        0.138889     0.057457  0.081495  0.042176
decision_tree          0.165044        0.246753     0.190307  0.217697  0.170264
random_forest          0.157411        0.214646     0.202650  0.208611  0.166402
low_performance_models

                  accuracy_train  precision_train  recall_train  f1_train  roc_train
radius_neighbors        0.664756         0.666667      0.249377  0.362976   0.585989

                  accuracy_test  precision_test  recall_test   f1_test  roc_test
radius_neighbors       0.629771        0.527778     0.191919  0.281481  0.543812

                  accuracy_diff  precision_diff  recall_diff   f1_diff  roc_diff
radius_neighbors       0.034985        0.138889     0.057457  0.081495  0.042176
similar_performance_models

             accuracy_train  precision_train  recall_train  f1_train  roc_train
logistic           0.788921         0.735602      0.700748  0.717752   0.772201
lda                0.789876         0.736292      0.703242  0.719388   0.773448
sgd                0.609360         0.494318      0.867830  0.629864   0.658373
svc                0.790831         0.739474      0.700748  0.719590   0.773749
naive_bayes        0.780325         0.704057      0.735661  0.719512   0.771855

             accuracy_test  precision_test  recall_test   f1_test  roc_test
logistic          0.793893        0.736842     0.707071  0.721649  0.776848
lda               0.778626        0.711340     0.696970  0.704082  0.762595
sgd               0.576336        0.467033     0.858586  0.604982  0.631747
svc               0.778626        0.711340     0.696970  0.704082  0.762595
naive_bayes       0.786260        0.712871     0.727273  0.720000  0.774679

             accuracy_diff  precision_diff  recall_diff   f1_diff  roc_diff
logistic         -0.004972       -0.001240    -0.006323 -0.003897 -0.004648
lda               0.011250        0.024952     0.006272  0.015306  0.010852
sgd               0.033024        0.027285     0.009245  0.024882  0.026626
svc               0.012205        0.028133     0.003778  0.015509  0.011153
naive_bayes      -0.005935       -0.008814     0.008388 -0.000488 -0.002824

Based on the results:

Models to discard:

  • Stochastic Gradient Descent (sgd)
  • Radius Neighbors Classifier

The models selected for cross-validation in the next step are:

  • Logistic Regression (simple, good performance, and high interpretability)
  • Linear Discriminant Analysis (simple, good performance, and high interpretability)
  • Decision Tree (simple, good performance, interpretable, but prone to overfitting)
  • Random Forest (good performance, interpretable through feature importances, but prone to overfitting)

SVC and Gaussian Naive Bayes are not selected: they perform similarly to the other models, SVC takes longer to train as the data grows, and Naive Bayes leaves little room for improvement because it has few hyperparameters to tune.

Cross-Validation Model Selection

# Define the models to evaluate
models = {
    "logistic": LogisticRegression(solver="liblinear"),
    "lda": LinearDiscriminantAnalysis(),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
}

# Evaluation metrics
scoring_metrics = ["accuracy", "f1", "precision", "recall"]

# KFold for the cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=1234)

# Variable to store the results of the cross-validation
cv_results = {metric: {} for metric in scoring_metrics}

# Cross-validation evaluation for each model and metric
for model_name, model in models.items():
    model_pipe = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])
    for metric in scoring_metrics:
        cv_results[metric][model_name] = cross_val_score(
            model_pipe, x_train, y_train, cv=kfold, scoring=metric
        )

# Convert results into a pandas DataFrame for each metric
cv_results_df = {metric: pd.DataFrame(cv_results[metric]) for metric in scoring_metrics}
# Create a DataFrame to store mean and std for each metric and model
mean_std_data = []

for metric_name in scoring_metrics:
    for model_name in models:
        mean_score = cv_results_df[metric_name][model_name].mean()
        std_score = cv_results_df[metric_name][model_name].std()
        mean_std_data.append(
            {
                "Model": model_name,
                "Metric": metric_name,
                "Mean": mean_score,
                "Std": std_score,
            }
        )

mean_std_df = pd.DataFrame(mean_std_data)
mean_std_df

            Model     Metric      Mean       Std
0        logistic   accuracy  0.754232  0.050505
1             lda   accuracy  0.759838  0.045816
2   decision_tree   accuracy  0.704831  0.031282
3   random_forest   accuracy  0.747541  0.054167
4        logistic         f1  0.696035  0.063003
5             lda         f1  0.706033  0.055827
6   decision_tree         f1  0.638702  0.050070
7   random_forest         f1  0.680603  0.053718
8        logistic  precision  0.725963  0.086759
9             lda  precision  0.727254  0.077692
10  decision_tree  precision  0.649142  0.047249
11  random_forest  precision  0.706709  0.060888
12       logistic     recall  0.679864  0.095904
13            lda     recall  0.695312  0.085353
14  decision_tree     recall  0.640388  0.078705
15  random_forest     recall  0.678029  0.086451
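The same table could be built in a single pass with cross_validate, which accepts several scoring metrics at once; a sketch of an equivalent approach for one pipeline:

from sklearn.model_selection import cross_validate

# evaluate one pipeline on all metrics in a single call
example_pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", LogisticRegression(solver="liblinear")),
    ]
)
cv_scores = cross_validate(
    example_pipe, x_train, y_train, cv=kfold, scoring=scoring_metrics
)
# cross_validate returns one "test_<metric>" array per requested metric
print({name: s.mean() for name, s in cv_scores.items() if name.startswith("test_")})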
# Create a boxplot for the cross-validation results of each metric
for metric_name in scoring_metrics:
    plt.figure(figsize=(10, 6))
    cv_results_df[metric_name].boxplot()
    plt.title(f"Cross Validation Boxplot for {metric_name.capitalize()}")
    plt.ylabel(f"{metric_name.capitalize()}")
    plt.show()

[figure: cross-validation boxplots for accuracy, F1, precision, and recall]

Only one Metric (Recall)

The Recall (sensitivity or true positive rate) metric is chosen for this model because it is more important to identify all positive cases (survivors) in order to take appropriate actions.

The objective of this model is to identify all the survivors without missing any of them; hypothetically the model would be used to take actions to save the passengers, so recall is the most important metric for this model.

Recall measures the model's ability to correctly identify passengers who survived. It is crucial when identifying all positive cases (survivors) matters more than identifying negative cases (non-survivors); the false negative (FN) is the most costly error in this case.
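As a reminder, recall = TP / (TP + FN), so every false negative (a survivor predicted as a non-survivor) lowers it directly. A toy illustration with made-up labels:

# toy example (hypothetical labels, not Titanic data)
y_true_toy = [1, 1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 0]  # two survivors are missed (false negatives)

print(recall_score(y_true_toy, y_pred_toy))  # 2 / (2 + 2) = 0.5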

Statistical Model Comparison

result_df = cv_results_df["recall"]
result_df

   logistic       lda  decision_tree  random_forest
0  0.710526  0.710526       0.605263       0.657895
1  0.545455  0.568182       0.636364       0.590909
2  0.758621  0.758621       0.620690       0.758621
3  0.727273  0.757576       0.666667       0.696970
4  0.789474  0.789474       0.500000       0.552632
5  0.617647  0.647059       0.676471       0.676471
6  0.697674  0.720930       0.697674       0.767442
7  0.800000  0.800000       0.800000       0.828571
8  0.615385  0.615385       0.615385       0.641026
9  0.536585  0.585366       0.585366       0.609756
from scipy.stats import f_oneway

model1 = result_df["logistic"]
model2 = result_df["lda"]
model3 = result_df["decision_tree"]
model4 = result_df["random_forest"]

statistic, p_value = f_oneway(model1, model2, model3, model4)

print(f"Statistic: {statistic}")
print(f"p_value: {p_value}")

alpha = 0.05  # significance level

if p_value < alpha:
    print(
        "There is a statistically significant difference "
        "in the cross-validation results of the models."
    )
else:
    print(
        "There is no statistically significant difference "
        "in the cross-validation results of the models."
    )
Statistic: 0.7222098102210205
p_value: 0.5453297538007467
There is no statistically significant difference in the cross-validation results of the models.
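Since the recall values come from the same 10 folds, a paired test between two specific models is another reasonable check; a sketch using scipy on the result_df defined above:

from scipy.stats import ttest_rel

# paired t-test on the per-fold recall of two candidate models
statistic, p_value = ttest_rel(result_df["logistic"], result_df["random_forest"])
print(f"Statistic: {statistic}, p_value: {p_value}")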

Statistically all the models are similar, so two models are selected to be tuned in the next step:

  • Logistic Regression (because it is simple and has good performance)
  • Random Forest (because its approach is different from logistic regression and it can be tuned with multiple hyperparameters)

Hyperparameter Tuning

Select the best hyperparameters for the models selected in the previous step.

Logistic Regression

from sklearn.model_selection import GridSearchCV

score = "recall"

parameters = {"model__penalty": ["l1", "l2"], "model__C": [0.1, 0.4, 0.8, 1, 2, 5]}

modelo = LogisticRegression(solver="liblinear")
logistic_pipe = Pipeline(steps=[("preprocessor", preprocessor), ("model", modelo)])

grid_search = GridSearchCV(
    logistic_pipe, parameters, cv=5, scoring=score, return_train_score=True
)
grid_search.fit(x_train, y_train)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('numeric',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='median'))]),
                                                                         ['age',
                                                                          'fare',
                                                                          'sibsp',
                                                                          'parch']),
                                                                        ('categoric',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('onehot',
                                                                                          OneHotEncoder())]),
                                                                         ['sex',
                                                                          'embarked']),
                                                                        ('categoric '
                                                                         'ordinal',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('onehot',
                                                                                          OrdinalEncoder())]),
                                                                         ['pclass'])])),
                                       ('model',
                                        LogisticRegression(solver='liblinear'))]),
             param_grid={'model__C': [0.1, 0.4, 0.8, 1, 2, 5],
                         'model__penalty': ['l1', 'l2']},
             return_train_score=True, scoring='recall')
grid_search.best_params_
{'model__C': 0.8, 'model__penalty': 'l1'}
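Beyond best_params_, the full grid can be inspected to see how sensitive recall is to each hyperparameter; an optional sketch using cv_results_:

cv_results = pd.DataFrame(grid_search.cv_results_)
print(
    cv_results[
        ["param_model__C", "param_model__penalty", "mean_test_score", "std_test_score"]
    ]
    .sort_values("mean_test_score", ascending=False)
    .head()
)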

Evaluation Logistic Regression

model = LogisticRegression(solver="liblinear", C=0.8, penalty="l1")
logistic_pipe = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])

logistic_model = logistic_pipe.fit(x_train, y_train)
y_pred = logistic_model.predict(x_test)

summarize_classification(y_test, y_pred)
{'accuracy': 0.8026905829596412,
 'precision': 0.7692307692307693,
 'recall': 0.7526881720430108,
 'f1': 0.7608695652173912,
 'roc': 0.7955748552522746}

Random Forest

from sklearn.model_selection import GridSearchCV

score = "recall"

parameters = {
    "model__max_depth": [4, 5, 7, 9, 10],
    "model__max_features": [2, 3, 4, 5, 6, 7, 8, 9],
    "model__criterion": ["gini", "entropy"],
}

randomforest_pipe = Pipeline(
    steps=[("preprocessor", preprocessor), ("model", RandomForestClassifier())]
)

grid_search = GridSearchCV(
    randomforest_pipe, parameters, cv=5, scoring=score, return_train_score=True
)
grid_search.fit(x_train, y_train)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('numeric',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='median'))]),
                                                                         ['age',
                                                                          'fare',
                                                                          'sibsp',
                                                                          'parch']),
                                                                        ('categoric',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('onehot',
                                                                                          OneHotEncoder())]),
                                                                         ['sex',
                                                                          'embarked']),
                                                                        ('categoric '
                                                                         'ordinal',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('onehot',
                                                                                          OrdinalEncoder())]),
                                                                         ['pclass'])])),
                                       ('model', RandomForestClassifier())]),
             param_grid={'model__criterion': ['gini', 'entropy'],
                         'model__max_depth': [4, 5, 7, 9, 10],
                         'model__max_features': [2, 3, 4, 5, 6, 7, 8, 9]},
             return_train_score=True, scoring='recall')
grid_search.best_params_
{'model__criterion': 'gini', 'model__max_depth': 9, 'model__max_features': 2}

Evaluation Random Forest

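# note: these hyperparameters are set manually and differ from grid_search.best_params_ above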
modelo = RandomForestClassifier(criterion="entropy", max_depth=4, max_features=4)

Tree_pipe = Pipeline(steps=[("preprocessor", preprocessor), ("model", modelo)])
tree_model = Tree_pipe.fit(x_train, y_train)
y_pred = tree_model.predict(x_test)
summarize_classification(y_test, y_pred)
{'accuracy': 0.8251121076233184,
 'precision': 0.8214285714285714,
 'recall': 0.7419354838709677,
 'f1': 0.7796610169491526,
 'roc': 0.8132754342431762}

Final Evaluation Test

from sklearn.metrics import (
    ConfusionMatrixDisplay,
    PrecisionRecallDisplay,
    RocCurveDisplay,
    classification_report,
)

Logistic Regression

y_pred = logistic_model.predict(x_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.83      0.84      0.83       130
           1       0.77      0.75      0.76        93

    accuracy                           0.80       223
   macro avg       0.80      0.80      0.80       223
weighted avg       0.80      0.80      0.80       223
ConfusionMatrixDisplay.from_predictions(y_test, y_pred);

[figure: confusion matrix for the logistic regression model]

PrecisionRecallDisplay.from_predictions(y_test, y_pred);

[figure: precision-recall curve for the logistic regression model]

log_plot = RocCurveDisplay.from_estimator(logistic_model, x_test, y_test)
plt.show()

[figure: ROC curve for the logistic regression model]

Random Forest

y_pred = tree_model.predict(x_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.83      0.88      0.86       130
           1       0.82      0.74      0.78        93

    accuracy                           0.83       223
   macro avg       0.82      0.81      0.82       223
weighted avg       0.82      0.83      0.82       223
ConfusionMatrixDisplay.from_predictions(y_test, y_pred);

[figure: confusion matrix for the random forest model]

PrecisionRecallDisplay.from_predictions(y_test, y_pred);

[figure: precision-recall curve for the random forest model]

dt_plot = RocCurveDisplay.from_estimator(tree_model, x_test, y_test)
plt.show()

[figure: ROC curve for the random forest model]

Model comparison

Recall values are similar, but Random Forest performs slightly better than Logistic Regression, so Random Forest is selected as the best model for this dataset.

ROC curve is used to compare the models.

ax = plt.gca()
dt_plot.plot(ax=ax, alpha=0.8, name="random_forest")
log_plot.plot(ax=ax, alpha=0.8, name="logistic regression")
plt.title("ROC Curve - Random Forest vs Logistic Regression")
plt.show()

[figure: ROC curves of Random Forest vs Logistic Regression]

Random Forest Learning Curve

This curve is used to see if the model is overfitting or underfitting.

model = tree_model

# Parameters for the learning curve
common_params = {
    "X": x_train,
    "y": y_train,
    "train_sizes": np.linspace(0.1, 1.0, 5),
    "cv": ShuffleSplit(n_splits=50, test_size=0.2, random_state=123),
    "n_jobs": -1,
    "return_times": True,
}

scoring_metric = "recall"

# Obtain the learning curve values including fit and score times
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(
    model, **common_params, scoring=scoring_metric
)

# Calculate the mean and standard deviation of the scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Calculate the mean and standard deviation of the fit and score times
fit_times_mean = np.mean(fit_times, axis=1)
fit_times_std = np.std(fit_times, axis=1)
score_times_mean = np.mean(score_times, axis=1)
score_times_std = np.std(score_times, axis=1)

# Plot the learning curve
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 6), sharey=True)
ax.plot(train_sizes, train_mean, "o-", label="Training score")
ax.plot(train_sizes, test_mean, "o-", color="orange", label="Cross-validation score")
ax.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.3)
ax.fill_between(
    train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.3, color="orange"
)

# Configure the title and labels
ax.set_title(f"Learning Curve for {model.__class__.__name__}")
ax.set_xlabel("Training examples")
ax.set_ylabel(scoring_metric)
ax.legend(loc="best")

# Show the plot
plt.show()

# Print the values for analysis
print("Training Sizes:", train_sizes)
print("Training Scores Mean:", train_mean)
print("Training Scores Std:", train_std)
print("Test Scores Mean:", test_mean)
print("Test Scores Std:", test_std)

[figure: learning curve (recall) for the random forest pipeline]

Training Sizes: [ 71 231 391 551 712]
Training Scores Mean: [0.81382816 0.6909304  0.6773471  0.67010412 0.66925488]
Training Scores Std: [0.07939262 0.05756737 0.03772593 0.03222352 0.02729735]
Test Scores Mean: [0.61297436 0.61904138 0.63195306 0.63720044 0.64468211]
Test Scores Std: [0.07503364 0.06395101 0.05970187 0.05418839 0.04888169]
# Plot the scalability regarding fit time and score time
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(10, 12), sharex=True)

# Scalability regarding the fit time
ax[0].plot(train_sizes, fit_times_mean, "o-")
ax[0].fill_between(
    train_sizes,
    fit_times_mean - fit_times_std,
    fit_times_mean + fit_times_std,
    alpha=0.3,
)
ax[0].set_ylabel("Fit time (s)")
ax[0].set_title("Scalability of the Random Forest classifier")

# Scalability regarding the score time
ax[1].plot(train_sizes, score_times_mean, "o-")
ax[1].fill_between(
    train_sizes,
    score_times_mean - score_times_std,
    score_times_mean + score_times_std,
    alpha=0.3,
)
ax[1].set_ylabel("Score time (s)")
ax[1].set_xlabel("Number of training samples")

# Show the plot
plt.show()

# Print the fit and score times for analysis
print("Fit Times Mean:", fit_times_mean)
print("Fit Times Std:", fit_times_std)
print("Score Times Mean:", score_times_mean)
print("Score Times Std:", score_times_std)

[figure: fit time and score time vs number of training samples]

Fit Times Mean: [0.05894506 0.06279021 0.06558596 0.07086827 0.07508377]
Fit Times Std: [0.01044859 0.01142745 0.01078453 0.01128258 0.01317722]
Score Times Mean: [0.00669562 0.00671644 0.00684018 0.00612065 0.00651652]
Score Times Std: [0.00259971 0.00259719 0.00271066 0.00231618 0.002923  ]

Model Interpretation

dfFeatures = pd.DataFrame(
    {
        "Features": tree_model["preprocessor"].get_feature_names_out(),
        "Importances": tree_model["model"].feature_importances_,
    }
)
dfFeatures.sort_values(by="Importances", ascending=False)

                    Features  Importances
4      categoric__sex_female     0.274651
5        categoric__sex_male     0.263831
9  categoric ordinal__pclass     0.146375
1              numeric__fare     0.133972
0               numeric__age     0.086363
2             numeric__sibsp     0.034428
6      categoric__embarked_C     0.022848
3             numeric__parch     0.022657
8      categoric__embarked_S     0.012933
7      categoric__embarked_Q     0.001942
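Impurity-based importances can favour features with many possible split points; permutation importance on the test set is a useful cross-check. A sketch using scikit-learn's inspection module, with the scoring choice following the recall focus above:

from sklearn.inspection import permutation_importance

# permute each raw input column and measure the drop in recall
perm = permutation_importance(
    tree_model, x_test, y_test, scoring="recall", n_repeats=10, random_state=42
)
perm_df = pd.DataFrame(
    {"Features": x_test.columns, "Importances": perm.importances_mean}
).sort_values(by="Importances", ascending=False)
print(perm_df)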

Save the model

from joblib import dump

# save the model to a file
dump(Tree_pipe, "Tree_pipe-titanic.joblib")
['Tree_pipe-titanic.joblib']

Test the saved model

from joblib import load

my_model = load("Tree_pipe-titanic.joblib")
# test data

x_test.head()

      pclass     sex   age  sibsp  parch     fare embarked
1237       3    male  24.0      0      0   7.7958        S
509        2    male  39.0      0      0  26.0000        S
739        3    male  44.0      0      1  16.1000        S
195        1  female  16.0      0      0  86.5000        S
839        3    male  19.0      0      0   7.7750        S
# check predictions
my_model.predict(x_test.head())
array([0, 0, 0, 1, 0])
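A simple sanity check is to confirm that the reloaded pipeline reproduces the in-memory model's predictions:

# the reloaded pipeline should give exactly the same predictions as tree_model
assert (my_model.predict(x_test) == tree_model.predict(x_test)).all()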

📊 Analysis of Results

The selected random forest model performs well overall, with an accuracy of 83%. It has a higher recall for class 0 (0.88) than for class 1 (0.74), indicating that it is better at identifying non-survivors. Precision is similar for both classes, around 0.82-0.83. The F1-score is higher for class 0 (0.86) than for class 1 (0.78), reflecting the better balance between precision and recall for class 0. The macro and weighted averages provide a comprehensive view of the model's performance across both classes.

The model has better results than the heuristic model in all the metrics, most importantly in recall.

  • Random Forest = 0.81 Test Macro average recall
  • Heuristic Model = 0.78 Test Macro average recall

The analysis of feature importances reveals that gender (both male and female) is the most significant predictor in the model, followed by passenger class and fare. Age, port of embarkation, and the number of family members aboard also contribute to the model’s predictions, but to a lesser extent. Understanding these feature importances helps in interpreting the model’s decision-making process and provides insights into the factors that most influence the target variable.

🧑‍🔬 Recommendations

  1. Continue with the current model: given the good generalization and stable performance, it is feasible to keep the current model configuration.

  2. Add features: include the name column to improve the imputation of the age feature, since age is an important feature in the model.

  3. Complex models: try more complex models such as XGBoost or LightGBM to see whether they improve performance; a sketch of how XGBoost could plug into the existing pipeline is shown below.
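
A minimal sketch of how a gradient-boosted model could reuse the same preprocessing pipeline (assuming the xgboost package is installed; the parameters shown are illustrative, not tuned):

from xgboost import XGBClassifier

xgb_pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")),
    ]
)
xgb_pipe.fit(x_train, y_train)
print(summarize_classification(y_test, xgb_pipe.predict(x_test)))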

📖 References