Machine Learning Model Interpretation

By Jose R. Zapata

Last updated: 2/Feb/2025

📚 Import libraries

# base libraries for data science

from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from joblib import load
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

💾 Load model

DATA_MODEL = Path.cwd().resolve().parents[1] / "models"

titanic_model = load(DATA_MODEL / "titanic_classification-random_forest-v1.joblib")
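
Before inspecting importances, it helps to confirm the pipeline's step names; the names "preprocessor" and "model" used throughout this post come from here (a quick optional check):

# list the pipeline step names referenced later in this post
print(list(titanic_model.named_steps))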
# print library version for reproducibility

print("Pandas version: ", pd.__version__)
Pandas version:  2.2.3
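
The same reproducibility note can be extended to the other core libraries used here; a small optional sketch:

# print the remaining library versions for reproducibility
import matplotlib
import sklearn

print("NumPy version:       ", np.__version__)
print("scikit-learn version:", sklearn.__version__)
print("Matplotlib version:  ", matplotlib.__version__)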

Model Interpretation

Feature Importance

Random forests provide a built-in feature importance known as Mean Decrease in Impurity (MDI). It is computed as the mean (with its standard deviation across trees) of the accumulated impurity decrease each feature produces within every tree of the forest. The more a feature decreases the impurity, the more important it is considered.

features = titanic_model["preprocessor"].get_feature_names_out()
importances = titanic_model["model"].feature_importances_

dfFeatures = pd.DataFrame({"Features": features, "Importances": importances})
dfFeatures.sort_values(by="Importances", ascending=False)

                     Features  Importances
5         categoric__sex_male     0.281052
4       categoric__sex_female     0.263076
9   categoric ordinal__pclass     0.168484
1               numeric__fare     0.111280
0                numeric__age     0.093817
2              numeric__sibsp     0.028429
3              numeric__parch     0.026425
6       categoric__embarked_C     0.015069
8       categoric__embarked_S     0.007535
7       categoric__embarked_Q     0.004832
titanic_model.named_steps["model"].feature_importances_
array([0.09381687, 0.11127966, 0.02842946, 0.02642524, 0.26307643,
       0.28105184, 0.01506934, 0.0048317 , 0.00753499, 0.16848448])
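
As a quick sanity check of the description above, the forest's aggregated MDI values are simply the mean of the per-tree (already normalized) importances; a minimal sketch, assuming every tree in the forest made at least one split:

# the mean of the per-tree importances should match the aggregated MDI values
per_tree = np.array(
    [tree.feature_importances_ for tree in titanic_model["model"].estimators_]
)
print(np.allclose(per_tree.mean(axis=0), titanic_model["model"].feature_importances_))
# expected: True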
std = np.std(
    [tree.feature_importances_ for tree in titanic_model["model"].estimators_],
    axis=0,
)

dfFeatures["std"] = std
dfFeatures

                     Features  Importances       std
0                numeric__age     0.093817  0.047078
1               numeric__fare     0.111280  0.068099
2              numeric__sibsp     0.028429  0.026219
3              numeric__parch     0.026425  0.027533
4       categoric__sex_female     0.263076  0.263597
5         categoric__sex_male     0.281052  0.262374
6       categoric__embarked_C     0.015069  0.021990
7       categoric__embarked_Q     0.004832  0.011320
8       categoric__embarked_S     0.007535  0.013045
9   categoric ordinal__pclass     0.168484  0.083930
# Create the bar plot
plt.figure(figsize=(12, 4))
plt.bar(
    dfFeatures["Features"],
    dfFeatures["Importances"],
    yerr=dfFeatures["std"],
    capsize=5,
    edgecolor="black",
)

# Add labels and title
plt.xlabel("Features")
plt.ylabel("Importances")
plt.title("Feature Importances using MDI")
plt.xticks(rotation=45);

[Figure: bar plot of MDI feature importances with standard deviation error bars]

Feature permutation

Permutation importance measures the drop in a model's score (here, recall) when the values of a single feature are randomly shuffled, which breaks that feature's relationship with the target. Because it is computed on held-out data, it complements the impurity-based importances above.

💾 Load data

DATA_DIR = Path.cwd().resolve().parents[1] / "data"

titanic_df = pd.read_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)

selected_features = [
    "pclass",
    "sex",
    "age",
    "sibsp",
    "parch",
    "fare",
    "embarked",
    "survived",
]

# .copy() avoids pandas SettingWithCopyWarning when casting the column below
titanic_features = titanic_df[selected_features].copy()
titanic_features["survived"] = titanic_features["survived"].astype(bool)
titanic_features = titanic_features.drop_duplicates()
X_features = titanic_features.drop("survived", axis="columns")
Y_target = titanic_features["survived"]

# 80% train, 20% test
x_train, x_test, y_train, y_test = train_test_split(
    X_features, Y_target, test_size=0.2, stratify=Y_target, random_state=42
)
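
Permutation importance below uses recall as its score, so the baseline recall on the test set is the reference value that shuffling will degrade; a short optional check:

from sklearn.metrics import recall_score

# baseline recall that permutation will degrade, feature by feature
print("Baseline recall:", recall_score(y_test, titanic_model.predict(x_test)))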
imps = permutation_importance(
    titanic_model, x_test, y_test, scoring="recall", n_repeats=10, random_state=42, n_jobs=8
)
fig = plt.figure(figsize=(10, 8))
perm_sorted_idx = imps.importances_mean.argsort()
plt.boxplot(
    imps.importances[perm_sorted_idx].T, vert=False, tick_labels=x_test.columns[perm_sorted_idx]
)
plt.title("Permutation Importances (test set)");

[Figure: boxplot of permutation importances on the test set]
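
To make explicit what each box in the plot summarizes, here is a minimal sketch of a single permutation round for one feature; `permutation_importance` repeats this `n_repeats` times per feature with proper bookkeeping (the helper name `single_permutation_drop` is illustrative, not part of scikit-learn):

from sklearn.metrics import recall_score

rng = np.random.default_rng(42)


def single_permutation_drop(model, X, y, column):
    """Recall drop after shuffling one feature column (one repeat)."""
    baseline = recall_score(y, model.predict(X))
    X_shuffled = X.copy()
    X_shuffled[column] = rng.permutation(X_shuffled[column].to_numpy())
    return baseline - recall_score(y, model.predict(X_shuffled))


print("sex:", single_permutation_drop(titanic_model, x_test, y_test, "sex"))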

Variables with more predictive power have a higher impact on the model's performance when permuted. Based on the results, the most important features are:

  • Sex
  • Pclass

With moderate importance:

  • Age
  • Fare

The results are consistent with the MDI feature importance analysis above.

📊 Analysis of Results

The analysis of feature importances reveals that the passenger's sex (captured by both one-hot encoded columns) is the most significant predictor in the model, followed by passenger class and fare. Age, port of embarkation, and the number of family members aboard also contribute to the model's predictions, but to a lesser extent. Understanding these feature importances helps in interpreting the model's decision-making process and provides insight into the factors that most influence the target variable.
