By Jose R. Zapata
Last updated: 2/Feb/2025
📚 Import libraries
# base libraries for data science
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from joblib import load
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
💾 Load model
DATA_MODEL = Path.cwd().resolve().parents[1] / "models"
titanic_model = load(DATA_MODEL / "titanic_classification-random_forest-v1.joblib")
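Optionally, a quick inspection of the loaded artifact confirms the pipeline step names used throughout this post; a minimal sketch, assuming the artifact is a scikit-learn Pipeline:

# the steps "preprocessor" and "model" are referenced in the sections below
print(titanic_model.named_steps.keys())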
# print library version for reproducibility
print("Pandas version: ", pd.__version__)
Pandas version: 2.2.3
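Since the model is deserialized with joblib, the scikit-learn and joblib versions also matter for reproducibility; a small check in the same spirit, using the standard __version__ attributes:

import joblib
import sklearn

# a joblib model artifact should be loaded with the same versions it was saved with
print("scikit-learn version: ", sklearn.__version__)
print("joblib version: ", joblib.__version__)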
Model Interpretation
Feature Importance
Random forests provide a built-in measure of feature importance known as Mean Decrease in Impurity (MDI). For each feature, it is computed as the mean (and standard deviation) of the accumulated impurity decrease that the feature produces across the trees in the forest. The more a feature decreases the impurity, the more important it is considered.
features = titanic_model["preprocessor"].get_feature_names_out()
importances = titanic_model["model"].feature_importances_
dfFeatures = pd.DataFrame({"Features": features, "Importances": importances})
dfFeatures.sort_values(by="Importances", ascending=False)
|   | Features | Importances |
|---|---|---|
| 5 | categoric__sex_male | 0.281052 |
| 4 | categoric__sex_female | 0.263076 |
| 9 | categoric ordinal__pclass | 0.168484 |
| 1 | numeric__fare | 0.111280 |
| 0 | numeric__age | 0.093817 |
| 2 | numeric__sibsp | 0.028429 |
| 3 | numeric__parch | 0.026425 |
| 6 | categoric__embarked_C | 0.015069 |
| 8 | categoric__embarked_S | 0.007535 |
| 7 | categoric__embarked_Q | 0.004832 |
titanic_model.named_steps["model"].feature_importances_
array([0.09381687, 0.11127966, 0.02842946, 0.02642524, 0.26307643,
0.28105184, 0.01506934, 0.0048317 , 0.00753499, 0.16848448])
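As a sanity check, the forest-level MDI values are normalized and equal the average of the per-tree importances; a short verification sketch (assuming no degenerate single-node trees in the ensemble):

# MDI importances are normalized to sum to 1
print(titanic_model["model"].feature_importances_.sum())

# ...and equal the mean of the per-tree importances
tree_imps = [tree.feature_importances_ for tree in titanic_model["model"].estimators_]
print(np.allclose(np.mean(tree_imps, axis=0), titanic_model["model"].feature_importances_))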
# standard deviation of the importances across the trees in the forest
std = np.std(
    [tree.feature_importances_ for tree in titanic_model["model"].estimators_], axis=0
)
dfFeatures["std"] = std
dfFeatures
|   | Features | Importances | std |
|---|---|---|---|
| 0 | numeric__age | 0.093817 | 0.047078 |
| 1 | numeric__fare | 0.111280 | 0.068099 |
| 2 | numeric__sibsp | 0.028429 | 0.026219 |
| 3 | numeric__parch | 0.026425 | 0.027533 |
| 4 | categoric__sex_female | 0.263076 | 0.263597 |
| 5 | categoric__sex_male | 0.281052 | 0.262374 |
| 6 | categoric__embarked_C | 0.015069 | 0.021990 |
| 7 | categoric__embarked_Q | 0.004832 | 0.011320 |
| 8 | categoric__embarked_S | 0.007535 | 0.013045 |
| 9 | categoric ordinal__pclass | 0.168484 | 0.083930 |
# Create the bar plot
plt.figure(figsize=(12, 4))
plt.bar(
dfFeatures["Features"],
dfFeatures["Importances"],
yerr=dfFeatures["std"],
capsize=5,
edgecolor="black",
)
# Add labels and title
plt.xlabel("Features")
plt.ylabel("Importances")
plt.title("Feature Importances using MDI")
plt.xticks(rotation=45);
Permutation Feature Importance

MDI is computed from training-set statistics and tends to favor features with many unique values, so it is worth cross-checking with permutation importance: the drop in the model's score when a single feature's values are randomly shuffled, which breaks that feature's relationship with the target.
💾 Load data
DATA_DIR = Path.cwd().resolve().parents[1] / "data"
titanic_df = pd.read_parquet(
DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)
selected_features = [
"pclass",
"sex",
"age",
"sibsp",
"parch",
"fare",
"embarked",
"survived",
]
titanic_features = titanic_df[selected_features].copy()
titanic_features["survived"] = titanic_features["survived"].astype(bool)
titanic_features = titanic_features.drop_duplicates()
X_features = titanic_features.drop("survived", axis="columns")
Y_target = titanic_features["survived"]
# 80% train, 20% test (fixed seed so the split is reproducible)
x_train, x_test, y_train, y_test = train_test_split(
    X_features, Y_target, test_size=0.2, stratify=Y_target, random_state=42
)
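Before permuting anything, it is useful to record the model's baseline score on this held-out split, since permutation importance reports the drop from that score; a short sketch using the same recall metric passed to permutation_importance below:

from sklearn.metrics import recall_score

# baseline recall on the held-out split; permutation importances are drops from this value
baseline_recall = recall_score(y_test, titanic_model.predict(x_test))
print(f"Baseline recall: {baseline_recall:.3f}")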
imps = permutation_importance(
titanic_model, x_test, y_test, scoring="recall", n_repeats=10, random_state=42, n_jobs=8
)
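In addition to the boxplot below, a tabular summary of the mean and standard deviation over the 10 repeats can be easier to scan; a minimal sketch:

# permutation importance per original column (mean and std over n_repeats)
perm_df = pd.DataFrame(
    {
        "Features": x_test.columns,
        "Importances": imps.importances_mean,
        "std": imps.importances_std,
    }
).sort_values(by="Importances", ascending=False)
perm_df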
fig = plt.figure(figsize=(10, 8))
perm_sorted_idx = imps.importances_mean.argsort()
plt.boxplot(
imps.importances[perm_sorted_idx].T, vert=False, tick_labels=x_test.columns[perm_sorted_idx]
)
plt.title("Permutation Importances (test set)");
Variables with more predictive power have a higher impact on the model's performance when permuted. Based on the results, the most important features are:
- Sex
- Pclass
With moderate importance:
- Age
- Fare
These results are consistent with the MDI feature-importance analysis above.
📊 Analysis of Results
The analysis of feature importances reveals that passenger sex is the most significant predictor in the model (both of its one-hot encoded columns rank highest), followed by passenger class and fare. Age, port of embarkation, and the number of family members aboard also contribute to the model's predictions, but to a lesser extent. Understanding these feature importances helps in interpreting the model's decision-making process and provides insight into the factors that most influence the target variable.