Classification with Scikit-Learn

By Jose R. Zapata

Last updated: 17/Nov/2023


Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Data Information

Titanic dataset

Source: https://www.kaggle.com/francksylla/titanic-machine-learning-from-disaster

titanic_df = pd.read_csv("https://www.openml.org/data/get_csv/16826755/phpMYEkMl")
titanic_df.sample(10)

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 855 | 3 | 0 | Hassan, Mr. Houssein G N | male | 11 | 0 | 0 | 2699 | 18.7875 | ? | C | ? | ? | ? |
| 30 | 1 | 0 | Blackwell, Mr. Stephen Weart | male | 45 | 0 | 0 | 113784 | 35.5 | T | S | ? | ? | Trenton, NJ |
| 329 | 2 | 1 | Angle, Mrs. William A (Florence 'Mary' Agnes H... | female | 36 | 1 | 0 | 226875 | 26 | ? | S | 11 | ? | Warwick, England |
| 853 | 3 | 0 | Harmer, Mr. Abraham (David Lishin) | male | 25 | 0 | 0 | 374887 | 7.25 | ? | S | B | ? | ? |
| 1076 | 3 | 0 | O'Donoghue, Ms. Bridget | female | ? | 0 | 0 | 364856 | 7.75 | ? | Q | ? | ? | ? |
| 445 | 2 | 0 | Hiltunen, Miss. Marta | female | 18 | 1 | 1 | 250650 | 13 | ? | S | ? | ? | Kontiolahti, Finland / Detroit, MI |
| 1051 | 3 | 0 | Nancarrow, Mr. William Henry | male | 33 | 0 | 0 | A./5. 3338 | 8.05 | ? | S | ? | ? | ? |
| 116 | 1 | 1 | Fortune, Mrs. Mark (Mary McDougald) | female | 60 | 1 | 4 | 19950 | 263 | C23 C25 C27 | S | 10 | ? | Winnipeg, MB |
| 938 | 3 | 0 | Klasen, Mr. Klas Albin | male | 18 | 1 | 1 | 350404 | 7.8542 | ? | S | ? | ? | ? |
| 734 | 3 | 1 | Coutts, Master. William Loch 'William' | male | 3 | 1 | 1 | C.A. 37671 | 15.9 | ? | S | 2 | ? | England Brooklyn, NY |
titanic_df.shape
(1309, 14)
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  boat       1309 non-null   object
 12  body       1309 non-null   object
 13  home.dest  1309 non-null   object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB

Data preparation

Remove unnecessary columns

The boat and body columns are removed, because with this information alone it is already possible to know whether a person survived.

To keep the exercise simple, the columns name, ticket, cabin and home.dest will not be used either.

titanic_df = titanic_df.drop(['boat', 'body', 'home.dest','name', 'ticket', 'cabin'],
                             axis='columns')

titanic_df.head()

| | pclass | survived | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | female | 29 | 0 | 0 | 211.3375 | S |
| 1 | 1 | 1 | male | 0.9167 | 1 | 2 | 151.55 | S |
| 2 | 1 | 0 | female | 2 | 1 | 2 | 151.55 | S |
| 3 | 1 | 0 | male | 30 | 1 | 2 | 151.55 | S |
| 4 | 1 | 0 | female | 25 | 1 | 2 | 151.55 | S |
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   pclass    1309 non-null   int64 
 1   survived  1309 non-null   int64 
 2   sex       1309 non-null   object
 3   age       1309 non-null   object
 4   sibsp     1309 non-null   int64 
 5   parch     1309 non-null   int64 
 6   fare      1309 non-null   object
 7   embarked  1309 non-null   object
dtypes: int64(4), object(4)
memory usage: 81.9+ KB

The first recommended step is to correct the data types of the columns, but since this demonstration focuses only on the classification algorithms, the rows with missing data will be handled first; in a real project this step must be considered carefully.

Handling missing values

# Count the rows that contain null values
titanic_df[titanic_df.isna().any(axis='columns')].count()
pclass      0
survived    0
sex         0
age         0
sibsp       0
parch       0
fare        0
embarked    0
dtype: int64
titanic_df['age'].unique()
array(['29', '0.9167', '2', '30', '25', '48', '63', '39', '53', '71',
       '47', '18', '24', '26', '80', '?', '50', '32', '36', '37', '42',
       '19', '35', '28', '45', '40', '58', '22', '41', '44', '59', '60',
       '33', '17', '11', '14', '49', '76', '46', '27', '64', '55', '70',
       '38', '51', '31', '4', '54', '23', '43', '52', '16', '32.5', '21',
       '15', '65', '28.5', '45.5', '56', '13', '61', '34', '6', '57',
       '62', '67', '1', '12', '20', '0.8333', '8', '0.6667', '7', '3',
       '36.5', '18.5', '5', '66', '9', '0.75', '70.5', '22.5', '0.3333',
       '0.1667', '40.5', '10', '23.5', '34.5', '20.5', '30.5', '55.5',
       '38.5', '14.5', '24.5', '60.5', '74', '0.4167', '11.5', '26.5'],
      dtype=object)
# missing values are represented by '?'
titanic_df = titanic_df.replace('?',np.nan)

Number of null values per column

titanic_df[titanic_df.isna().any(axis='columns')].count()
pclass      266
survived    266
sex         266
age           3
sibsp       266
parch       266
fare        265
embarked    264
dtype: int64

Drop the rows with null values in the survived column, which is the target variable

titanic_df = titanic_df.dropna(subset=['survived'])
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   pclass    1309 non-null   int64 
 1   survived  1309 non-null   int64 
 2   sex       1309 non-null   object
 3   age       1046 non-null   object
 4   sibsp     1309 non-null   int64 
 5   parch     1309 non-null   int64 
 6   fare      1308 non-null   object
 7   embarked  1307 non-null   object
dtypes: int64(4), object(4)
memory usage: 81.9+ KB

Convert the variables to their correct types

Fix the categorical variables

cols_categoricas = ["pclass", "sex", "embarked"]

titanic_df[cols_categoricas] = titanic_df[cols_categoricas].astype("category")

Fix the ordinal categorical variable

titanic_df["pclass"] = pd.Categorical(titanic_df["pclass"],
                                      categories=[3, 2, 1],
                                      ordered=True)
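
Declaring pclass as an ordered categorical keeps the declared order (3 < 2 < 1), which enables ordered comparisons and min/max on the column. A quick optional check, shown only as an illustration:

# the declared order is kept: third class is the "lowest" level
print(titanic_df["pclass"].cat.categories)  # Index([3, 2, 1], ...)
print(titanic_df["pclass"].min())           # 3, the lowest category in the declared order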

Fix the numerical variables

cols_numericas = ["age", "fare"]

titanic_df[cols_numericas] = titanic_df[cols_numericas].astype("float")

Fix the boolean variables

cols_booleanas = ["survived"]

titanic_df[cols_booleanas] = titanic_df[cols_booleanas].astype("bool")

Dataset information

titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   pclass    1309 non-null   category
 1   survived  1309 non-null   bool    
 2   sex       1309 non-null   category
 3   age       1046 non-null   float64 
 4   sibsp     1309 non-null   int64   
 5   parch     1309 non-null   int64   
 6   fare      1308 non-null   float64 
 7   embarked  1307 non-null   category
dtypes: bool(1), category(3), float64(2), int64(2)
memory usage: 46.5 KB

Statistical description

titanic_df.describe()

| | age | sibsp | parch | fare |
|---|---|---|---|---|
| count | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 |
| mean | 29.881135 | 0.498854 | 0.385027 | 33.295479 |
| std | 14.413500 | 1.041658 | 0.865560 | 51.758668 |
| min | 0.166700 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 21.000000 | 0.000000 | 0.000000 | 7.895800 |
| 50% | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 39.000000 | 1.000000 | 0.000000 | 31.275000 |
| max | 80.000000 | 8.000000 | 9.000000 | 512.329200 |
titanic_df.head()

| | pclass | survived | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | True | female | 29.0000 | 0 | 0 | 211.3375 | S |
| 1 | 1 | True | male | 0.9167 | 1 | 2 | 151.5500 | S |
| 2 | 1 | False | female | 2.0000 | 1 | 2 | 151.5500 | S |
| 3 | 1 | False | male | 30.0000 | 1 | 2 | 151.5500 | S |
| 4 | 1 | False | female | 25.0000 | 1 | 2 | 151.5500 | S |
titanic_df.to_parquet('titanic_processed.parquet',
                      index=False)

Univariate Analysis

Each variable should be analyzed and its characteristics described; here only the target variable survived is visualized, to check whether its classes are balanced.

# visualize the distribution of the boolean survived variable

titanic_df["survived"].value_counts().plot(kind="bar",
                                           color=['skyblue', 'orange']);

(figure: bar chart of the survived class counts)

There is some class imbalance, but it is not severe enough to require a class-balancing technique.
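
To put a number on the imbalance, the class proportions can be checked directly (a quick optional check):

# proportion of each class of the target variable
titanic_df["survived"].value_counts(normalize=True)
# roughly 62% did not survive and 38% survived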

Bivariate Analysis

The output variable is analyzed against the input variables, keeping in mind that the target variable is nominal categorical.

Numerical vs Categorical

cols_numericas
['age', 'fare']
# there are 2 numerical variables
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes = axes.flatten()
for i, col in enumerate(cols_numericas):
    sns.boxplot(data=titanic_df,
                x="survived", y=col,
                ax=axes[i])
    axes[i].set_title(col)
plt.tight_layout()
plt.show()

(figure: box plots of age and fare by survived)

# seaborn plot of the fare distribution by survived
sns.kdeplot(data=titanic_df,  x='fare',hue='survived',
            alpha=0.5, fill=True);
plt.xscale('log')
plt.ylabel("Ticket fare, log scale");

(figure: KDE plot of fare by survived, log-scaled x axis)

Categorical vs Categorical

Categorical variables vs the output variable

# heatmaps to inspect the association between the categorical variables and survived
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes = axes.flatten()
for i, col in enumerate(cols_categoricas):
    sns.heatmap(pd.crosstab(titanic_df[col],
                            titanic_df["survived"]),
                            annot=True, fmt="d",
                            ax=axes[i])
    axes[i].set_title(col)
    
plt.tight_layout()

(figure: heatmaps of pclass, sex and embarked vs survived)

from scipy import stats

resultados_chi2 = []

for col in cols_categoricas:
  # Compute the chi-square test
  chi2, pval, dof, expected = stats.chi2_contingency(pd.crosstab(titanic_df[col],
                                                   titanic_df["survived"]))
  # save the values in a pandas dataframe to concatenate with the results of the other variables
  df = pd.DataFrame({'variable': [col],
                     'chi2': [chi2],
                     'pval': [pval]})
  resultados_chi2.append(df)
df_chi2 = pd.concat(resultados_chi2, ignore_index=True)
df_chi2  

| | variable | chi2 | pval |
|---|---|---|---|
| 0 | pclass | 127.859156 | 1.720826e-28 |
| 1 | sex | 363.617908 | 4.589925e-81 |
| 2 | embarked | 44.241743 | 2.471881e-10 |
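
The chi-square statistic grows with the sample size, so as an optional, complementary effect-size measure one can compute Cramér's V from the same contingency tables (a sketch added here, not part of the original analysis):

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))): ~0 weak association, ~1 strong
def cramers_v(table: pd.DataFrame) -> float:
    chi2 = stats.chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

for col in cols_categoricas:
    print(col, round(cramers_v(pd.crosstab(titanic_df[col], titanic_df["survived"])), 3))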

Feature Engineering

A simple imputation is applied, without a deeper analysis of the data and its distribution.

Numerical variables are imputed with the median and categorical variables with the mode (most frequent value).

# libraries for data preprocessing
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   pclass    1309 non-null   category
 1   survived  1309 non-null   bool    
 2   sex       1309 non-null   category
 3   age       1046 non-null   float64 
 4   sibsp     1309 non-null   int64   
 5   parch     1309 non-null   int64   
 6   fare      1308 non-null   float64 
 7   embarked  1307 non-null   category
dtypes: bool(1), category(3), float64(2), int64(2)
memory usage: 46.5 KB
cols_numericas = ["age", "fare", "sibsp", "parch"]
cols_categoricas = ["sex", "embarked"]
cols_categoricas_ord = ["pclass"]

Create the transformation pipelines (a toy illustration of both encoders appears right after the preprocessor definition below); the encoding is:

  • OneHotEncoder for the nominal categorical variables
  • OrdinalEncoder for the ordinal categorical variables
numeric_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ])

categorical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())])

categorical_ord_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OrdinalEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('numericas', numeric_pipe, cols_numericas),
        ('categoricas', categorical_pipe, cols_categoricas),
        ('categoricas ordinales', categorical_ord_pipe, cols_categoricas_ord)
])
preprocessor
ColumnTransformer(transformers=[('numericas',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median'))]),
                                 ['age', 'fare', 'sibsp', 'parch']),
                                ('categoricas',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot', OneHotEncoder())]),
                                 ['sex', 'embarked']),
                                ('categoricas ordinales',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot', OrdinalEncoder())]),
                                 ['pclass'])])
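As a toy illustration (not part of the original pipeline) of what the two encoders produce: OneHotEncoder expands each nominal category into a binary column, while OrdinalEncoder maps each category to an integer code.

# toy example with a single categorical column
demo = pd.DataFrame({"sex": ["male", "female", "female"]})
print(OneHotEncoder().fit_transform(demo).toarray())  # columns [female, male]: [[0,1],[1,0],[1,0]]
print(OrdinalEncoder().fit_transform(demo))           # codes: female -> 0, male -> 1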

Binary Classification

titanic_df = pd.read_parquet('titanic_processed.parquet')

titanic_df.head()

| | pclass | survived | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | True | female | 29.0000 | 0 | 0 | 211.3375 | S |
| 1 | 1 | True | male | 0.9167 | 1 | 2 | 151.5500 | S |
| 2 | 1 | False | female | 2.0000 | 1 | 2 | 151.5500 | S |
| 3 | 1 | False | male | 30.0000 | 1 | 2 | 151.5500 | S |
| 4 | 1 | False | female | 25.0000 | 1 | 2 | 151.5500 | S |
titanic_df.shape
(1309, 8)

Split the dataset into training and test sets

from sklearn.model_selection import train_test_split

X_features = titanic_df.drop('survived', axis='columns')
Y_target = titanic_df['survived']

x_train, x_test, y_train, y_test = train_test_split(X_features,
                                                    Y_target,
                                                    test_size=0.2,
                                                    stratify=Y_target)
x_train.shape, y_train.shape
((1047, 7), (1047,))
x_test.shape, y_test.shape
((262, 7), (262,))
x_train.head()

| | pclass | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|
| 55 | 1 | female | 14.0 | 1 | 2 | 120.0000 | S |
| 1275 | 3 | male | 16.0 | 2 | 0 | 18.0000 | S |
| 463 | 2 | male | 22.0 | 2 | 0 | 31.5000 | S |
| 185 | 1 | male | 42.0 | 0 | 0 | 26.5500 | S |
| 893 | 3 | male | 29.0 | 0 | 0 | 7.8542 | S |

Classification with a simple model

The following model is rule-based, with rules derived from the univariate and bivariate analysis; because of this, the test dataset must first be transformed using the transformation pipeline fitted on the training dataset.

# fit the pipeline with the training data
preprocessor.fit(x_train)

# get the names of the output columns of the preprocessing
# using .get_feature_names_out()
feature_names = preprocessor.get_feature_names_out()

# transform x_test with the fitted preprocessor and wrap the result in a DataFrame
x_test_transformed = preprocessor.transform(x_test)
x_test_transformed = pd.DataFrame(x_test_transformed, columns=feature_names)
x_test_transformed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262 entries, 0 to 261
Data columns (total 10 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   numericas__age                 262 non-null    float64
 1   numericas__fare                262 non-null    float64
 2   numericas__sibsp               262 non-null    float64
 3   numericas__parch               262 non-null    float64
 4   categoricas__sex_female        262 non-null    float64
 5   categoricas__sex_male          262 non-null    float64
 6   categoricas__embarked_C        262 non-null    float64
 7   categoricas__embarked_Q        262 non-null    float64
 8   categoricas__embarked_S        262 non-null    float64
 9   categoricas ordinales__pclass  262 non-null    float64
dtypes: float64(10)
memory usage: 20.6 KB
x_test_transformed.sample(5)

| | numericas__age | numericas__fare | numericas__sibsp | numericas__parch | categoricas__sex_female | categoricas__sex_male | categoricas__embarked_C | categoricas__embarked_Q | categoricas__embarked_S | categoricas ordinales__pclass |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 22.0 | 49.5000 | 0.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 71 | 74.0 | 7.7750 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| 208 | 27.0 | 10.5000 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 26 | 27.0 | 52.0000 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 76 | 5.0 | 31.3875 | 4.0 | 2.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 2.0 |
def modelo_basico(input:pd.DataFrame)->bool:
    """
    Basic rule-based model that predicts whether a passenger survived,
    using simple rules derived from the exploratory analysis.
    Args:
      input (pd.DataFrame): dataframe with the input data
    Returns:
      (bool): survival prediction
    """

    if (input['categoricas__sex_female'] == 1) & (input['categoricas ordinales__pclass'] < 3):
        return True
    elif (input['numericas__age'] <= 5) | ((input['numericas__age'] >= 10) & (input['numericas__age'] <= 15)):
        return True
    else:
        return False

# predictions of the basic model
y_pred = x_test_transformed.apply(modelo_basico, axis='columns')
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(y_test,y_pred);

(figure: confusion matrix of the baseline model)

Simple Model Evaluation

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc = roc_auc_score(y_test, y_pred)

print("accuracy_score : ", acc)
print("precision_score : ", prec)
print("recall_score : ", recall)
print("f1_score : ", f1)
print("roc_auc_score : ", roc)
accuracy_score :  0.7786259541984732
precision_score :  0.7058823529411765
recall_score :  0.72
f1_score :  0.712871287128713
roc_auc_score :  0.7674074074074074

Every model trained from here on will be evaluated with the same metrics and, to be selected, it must outperform this simple model.
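
As an additional reference point (an optional sketch, not in the original notebook), scikit-learn's DummyClassifier provides an even simpler baseline that always predicts the majority class:

from sklearn.dummy import DummyClassifier

# majority-class baseline, wrapped in the same preprocessing pipeline for consistency
dummy_pipe = Pipeline(steps=[("preprocessor", preprocessor),
                             ("model", DummyClassifier(strategy="most_frequent"))])
dummy_pipe.fit(x_train, y_train)
y_pred_dummy = dummy_pipe.predict(x_test)
print("accuracy:", accuracy_score(y_test, y_pred_dummy))  # roughly the majority-class rate
print("f1:", f1_score(y_test, y_pred_dummy))              # 0.0, it never predicts the positive class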

Logistic Regression for Classification

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

from sklearn.linear_model import LogisticRegression

logistic_model = LogisticRegression(penalty='l2',
                                    C=1.0,
                                    solver='liblinear')

logistic_complete = Pipeline(steps=[("preprocessor", preprocessor),
                                    ("model", logistic_model)])

logistic_complete.fit(x_train, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numericas',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median'))]),
                                                  ['age', 'fare', 'sibsp',
                                                   'parch']),
                                                 ('categoricas',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder())]),
                                                  ['sex', 'embarked']),
                                                 ('categoricas ordinales',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OrdinalEncoder())]),
                                                  ['pclass'])])),
                ('model', LogisticRegression(solver='liblinear'))])
y_pred = logistic_complete.predict(x_test)

Confusion matrix

ConfusionMatrixDisplay.from_predictions(y_test,y_pred);

(figure: confusion matrix of the logistic regression)

Logistic Regression Evaluation

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc = roc_auc_score(y_test, y_pred)

print("accuracy_score : ", acc)
print("precision_score : ", prec)
print("recall_score : ", recall)
print("f1_score : ", f1)
print("roc_auc_score : ", roc)
accuracy_score :  0.8015267175572519
precision_score :  0.7553191489361702
recall_score :  0.71
f1_score :  0.731958762886598
roc_auc_score :  0.7840123456790123

Classification with Multiple Models

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
titanic_df = pd.read_parquet('titanic_processed.parquet')
x_train.columns
Index(['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked'], dtype='object')
FEATURES = list(x_train.columns)

FEATURES
['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
result_dict = {}

Helper functions

def summarize_classification(y_test, y_pred):
    
    acc = accuracy_score(y_test, y_pred, normalize=True)
    prec = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc = roc_auc_score(y_test, y_pred)
    
    return {'accuracy': acc, 
            'precision': prec,
            'recall':recall,
            'f1':f1,
            'roc':roc
            }

The following function defines all the training and evaluation steps, which are the same for every model that will be tested.

def build_model(classifier_fn,
                name_of_y_col: str,
                names_of_x_cols: list,
                dataset: pd.DataFrame,
                test_frac: float = 0.2) -> dict:
    """
    Train and evaluate a classification model.

    Args:
        classifier_fn: classifier instance to train
        name_of_y_col (str): name of the target column
        names_of_x_cols (list): list of feature column names
        dataset (pd.DataFrame): dataframe with the data
        test_frac (float): fraction of the data used for the test set, default 0.2

    Returns:
        dict: dictionary with the train and test performance metrics

    """

    # separate the feature columns and the target column
    X = dataset[names_of_x_cols]
    Y = dataset[name_of_y_col]

    # split the data into train and test
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac)

    # build the pipeline with the preprocessing and the classification model
    classifier_pipe = Pipeline(steps=[("preprocessor", preprocessor),
                                      ("model", classifier_fn)])

    # fit the classifier pipeline
    model = classifier_pipe.fit(x_train, y_train)

    # predict the test data
    y_pred = model.predict(x_test)

    # predict the train data
    y_pred_train = model.predict(x_train)

    # compute the performance metrics
    train_summary = summarize_classification(y_train, y_pred_train)
    test_summary = summarize_classification(y_test, y_pred)

    # store the predictions in a dataframe
    pred_results = pd.DataFrame({'y_test': y_test,
                                 'y_pred': y_pred})

    # compute the confusion matrix
    model_crosstab = pd.crosstab(pred_results.y_pred, pred_results.y_test)

    return {'train': train_summary,
            'test': test_summary,
            'confusion_matrix': model_crosstab}

Logistic regression

result_dict['logistic'] = build_model(LogisticRegression(solver='liblinear'),
                                      'survived',
                                      FEATURES,
                                      titanic_df)

result_dict['logistic'] 
{'train': {'accuracy': 0.792741165234002,
  'precision': 0.7364130434782609,
  'recall': 0.6930946291560103,
  'f1': 0.7140974967061924,
  'roc': 0.7726143877487368},
 'test': {'accuracy': 0.7900763358778626,
  'precision': 0.78125,
  'recall': 0.6880733944954128,
  'f1': 0.7317073170731707,
  'roc': 0.7754092462673143},
 'confusion_matrix': y_test  False  True 
 y_pred              
 False     132     34
 True       21     75}

Linear Discriminant Analysis

result_dict['linear_discriminant_analysis'] = build_model(LinearDiscriminantAnalysis(solver='svd'),
                                                          'survived',
                                                          FEATURES,
                                                          titanic_df)
result_dict['linear_discriminant_analysis']
{'train': {'accuracy': 0.7831900668576887,
  'precision': 0.734375,
  'recall': 0.6928746928746928,
  'f1': 0.7130214917825537,
  'roc': 0.7667498464373464},
 'test': {'accuracy': 0.8206106870229007,
  'precision': 0.7555555555555555,
  'recall': 0.7311827956989247,
  'f1': 0.7431693989071038,
  'roc': 0.8005026404530127},
 'confusion_matrix': y_test  False  True 
 y_pred              
 False     147     25
 True       22     68}
Running build_model again for the same estimator produces different metrics, because each call performs a new random train/test split inside the function.

result_dict['linear_discriminant_analysis'] = build_model(LinearDiscriminantAnalysis(solver='svd'),
                                                          'survived',
                                                          FEATURES,
                                                          titanic_df)
result_dict['linear_discriminant_analysis']
{'train': {'accuracy': 0.7975167144221585,
  'precision': 0.7472826086956522,
  'recall': 0.6979695431472082,
  'f1': 0.7217847769028872,
  'roc': 0.777774970654768},
 'test': {'accuracy': 0.7480916030534351,
  'precision': 0.7127659574468085,
  'recall': 0.6320754716981132,
  'f1': 0.67,
  'roc': 0.729499274310595},
 'confusion_matrix': y_test  False  True 
 y_pred              
 False     129     39
 True       27     67}

SGD

result_dict['sgd'] = build_model(SGDClassifier(max_iter=1000,
                                               tol=1e-3),
                                 'survived',
                                 FEATURES,
                                 titanic_df)

result_dict['sgd']
{'train': {'accuracy': 0.7182425978987583,
  'precision': 0.636604774535809,
  'recall': 0.6030150753768844,
  'f1': 0.6193548387096773,
  'roc': 0.6959605423109384},
 'test': {'accuracy': 0.732824427480916,
  'precision': 0.7051282051282052,
  'recall': 0.5392156862745098,
  'f1': 0.611111111111111,
  'roc': 0.6977328431372547},
 'confusion_matrix': y_test  False  True 
 y_pred              
 False     137     47
 True       23     55}

Linear SVC

https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

  • SVC with a linear kernel
  • dual=False when the number of samples > number of features
result_dict['linear_svc'] = build_model(LinearSVC(C=1.0,
                                                   max_iter=1000,
                                                   tol=1e-3,
                                                   dual=False),
                                        'survived',
                                        FEATURES,
                                        titanic_df)

result_dict['linear_svc']
{'train': {'accuracy': 0.7908309455587392,
  'precision': 0.7317073170731707,
  'recall': 0.6923076923076923,
  'f1': 0.7114624505928854,
  'roc': 0.7708113804004215},
 'test': {'accuracy': 0.7786259541984732,
  'precision': 0.7653061224489796,
  'recall': 0.6818181818181818,
  'f1': 0.721153846153846,
  'roc': 0.7652511961722488},
 'confusion_matrix': y_test  False  True 
 y_pred              
 False     129     35
 True       23     75}

Radius Neighbors Classifier

result_dict['radius_neighbors'] = build_model(RadiusNeighborsClassifier(radius=40.0),
                                              'survived',
                                              FEATURES,
                                              titanic_df)
result_dict['radius_neighbors']
{'train': {'accuracy': 0.6685768863419294,
  'precision': 0.6736111111111112,
  'recall': 0.24433249370277077,
  'f1': 0.3585951940850277,
  'roc': 0.5860124006975392},
 'test': {'accuracy': 0.6297709923664122,
  'precision': 0.575,
  'recall': 0.22330097087378642,
  'f1': 0.32167832167832167,
  'roc': 0.5581913659400378},
 'confusion_matrix': y_test  False  True 
 y_pred              
 False     142     80
 True       17     23}

Decision Tree classifier

result_dict['decision_tree'] = build_model(DecisionTreeClassifier(),
                                           'survived',
                                           FEATURES,
                                           titanic_df)

result_dict['decision_tree']
{'train': {'accuracy': 0.9637058261700095,
  'precision': 0.9811320754716981,
  'recall': 0.9215189873417722,
  'f1': 0.9503916449086162,
  'roc': 0.9553913955113769},
 'test': {'accuracy': 0.8396946564885496,
  'precision': 0.8315789473684211,
  'recall': 0.7523809523809524,
  'f1': 0.79,
  'roc': 0.8252350621777373},
 'confusion_matrix': y_test  False  True 
 y_pred              
 False     141     26
 True       16     79}

Naive Bayes

result_dict['naive_bayes'] = build_model(GaussianNB(),
                                         'survived',
                                         FEATURES,
                                         titanic_df)

result_dict['naive_bayes']
{'train': {'accuracy': 0.7793696275071633,
  'precision': 0.7126168224299065,
  'recall': 0.738498789346247,
  'f1': 0.7253269916765755,
  'roc': 0.772246240098991},
 'test': {'accuracy': 0.7938931297709924,
  'precision': 0.6896551724137931,
  'recall': 0.6896551724137931,
  'f1': 0.6896551724137931,
  'roc': 0.7676847290640394},
 'confusion_matrix': y_test  False  True 
 y_pred              
 False     148     27
 True       27     60}

Model comparison

# Collect the train and test F1 scores of each model
nombre_modelos = result_dict.keys()
resultados_train = {} # empty dictionary
resultados_test = {} # empty dictionary
for nombre in nombre_modelos:
    resultados_train[nombre] = result_dict[nombre]['train']['f1']
    resultados_test[nombre] = result_dict[nombre]['test']['f1']

df_comparacion = pd.DataFrame([resultados_train, resultados_test], index=['train', 'test'])
# Plot the bar chart
fig, ax = plt.subplots(figsize=(10, 4))
df_comparacion.T.plot(kind='bar', ax=ax)

# Adjust the layout
ax.set_ylabel('F1 score')
ax.set_title('Model F1 Comparison')

# Set the x-tick labels inside the bars and rotate by 90 degrees
ax.set_xticks(range(len(df_comparacion.columns)))
ax.set_xticklabels([])

# Draw the x-tick labels inside the bars rotated by 90 degrees
for i, label in enumerate(df_comparacion.columns):
    bar_center = (df_comparacion.loc['train', label] +
                  df_comparacion.loc['test', label]) / 2
    ax.text(i, bar_center, label, ha='center',
            va='center_baseline', rotation=45)

plt.tight_layout()

(figure: bar chart comparing train and test F1 scores per model)

Cross Validation - Model Selection

Analyze the variance of the results to identify the models with the best performance.

Cross-validation also helps detect whether a model is overfitting.
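
One way to see this (an optional sketch, not part of the original notebook) is cross_validate with return_train_score=True, which reports both the train and the validation score for each fold; a large gap between the two suggests overfitting:

from sklearn.model_selection import cross_validate

# train vs validation F1 per fold for a decision tree inside the same preprocessing pipeline
tree_pipe = Pipeline(steps=[("preprocessor", preprocessor),
                            ("model", DecisionTreeClassifier())])
cv_scores = cross_validate(tree_pipe, x_train, y_train,
                           cv=5, scoring="f1", return_train_score=True)
print("train f1:     ", cv_scores["train_score"].mean())
print("validation f1:", cv_scores["test_score"].mean())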

# Record the results of each model
from sklearn import model_selection

models = []

# Logistic Regression
models.append(('Logistic', LogisticRegression(solver='liblinear')))

# Decision Tree classifier
models.append(('Decision Tree', DecisionTreeClassifier()))

# Linear Discriminant Analysis
models.append(('LDA', LinearDiscriminantAnalysis(solver= 'svd')))

# evaluate each model in turn
results = []
names = []
scoring = 'f1'
for name, model in models:
    # K-fold cross-validation for model selection
    kfold = model_selection.KFold(n_splits=10)
    model_pipe = Pipeline(steps=[("preprocessor", preprocessor),
                                 ("model", model)])
    # cross-validate on x_train, y_train
    cv_results = model_selection.cross_val_score(model_pipe,
                                                 x_train, y_train,
                                                 cv=kfold,
                                                 scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = f"{name}, {cv_results.mean()}, {cv_results.std()}"
    print(msg)
Logistic, 0.7021797259716466, 0.04878880609448707
Decision Tree, 0.6448902869325651, 0.06141106864049656
LDA, 0.7024526448038142, 0.04725740067640807
plt.figure(figsize = (15,8)) 
result_df = pd.DataFrame(results, index=names).T
result_df.boxplot()
plt.title("Cross-validation results");

(figure: box plots of the cross-validation F1 scores per model)

Statistical Comparison of Models

from scipy.stats import f_oneway

model1 = result_df['Logistic']
model2 = result_df['Decision Tree']
model3 = result_df['LDA']

statistic, p_value = f_oneway(model1, model2, model3)

print(f'Statistic: {statistic}')
print(f'p_value: {p_value}')

alpha = 0.05  # significance level

if p_value < alpha:
    print("There is a statistically significant difference between the cross-validation results of the models.")
else:
    print("There is no statistically significant difference between the cross-validation results of the models.")
Statistic: 3.539696943277524
p_value: 0.043129128040821176
There is a statistically significant difference between the cross-validation results of the models.
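
The ANOVA only says that at least one model differs. As an optional follow-up (a sketch, not part of the original analysis), a paired t-test on the per-fold scores can compare the two closest models directly, since they were evaluated on the same folds; note that folds share training data, so the p-value is only a rough guide:

from scipy.stats import ttest_rel

# paired t-test on the per-fold F1 scores of Logistic Regression vs LDA
statistic, p_value = ttest_rel(result_df['Logistic'], result_df['LDA'])
print(f'Statistic: {statistic}')
print(f'p_value: {p_value}')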

Hyperparameter tuning

titanic_df = pd.read_parquet('titanic_processed.parquet')
X = titanic_df.drop('survived', axis='columns')

Y = titanic_df['survived']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
def summarize_classification(y_test, y_pred):
    
    acc = accuracy_score(y_test, y_pred, normalize=True)
    f1 = f1_score(y_test, y_pred)

    prec = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    roc = roc_auc_score(y_test, y_pred)
    
    
    print("Test data count: ",len(y_test))
    print("f1_score : " , f1)
    print("accuracy_score : " , acc)
    print("precision_score : " , prec)
    print("recall_score : ", recall)
    print("roc_auc_score : ", roc)

Decision Tree

from sklearn.model_selection import GridSearchCV

# when using pipelines, the parameter grid keys must be
# the step name followed by __ and the parameter name
parameters = {'model__max_depth': [4, 5, 7, 9, 10],
              'model__max_features': [2, 3, 4, 5, 6, 7, 8, 9],
              'model__criterion': ['gini', 'entropy'],
              }

DecisionTree_pipe = Pipeline(steps=[("preprocessor", preprocessor),
                                    ("model", DecisionTreeClassifier())])

grid_search = GridSearchCV(DecisionTree_pipe,
                           parameters, cv=3,
                           scoring='f1',
                           return_train_score=True)
grid_search.fit(x_train, y_train)
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('numericas',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='median'))]),
                                                                         ['age',
                                                                          'fare',
                                                                          'sibsp',
                                                                          'parch']),
                                                                        ('categoricas',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('onehot',
                                                                                          OneHotEncoder())]),
                                                                         ['sex',
                                                                          'embarked']),
                                                                        ('categoricas '
                                                                         'ordinales',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('onehot',
                                                                                          OrdinalEncoder())]),
                                                                         ['pclass'])])),
                                       ('model', DecisionTreeClassifier())]),
             param_grid={'model__criterion': ['gini', 'entropy'],
                         'model__max_depth': [4, 5, 7, 9, 10],
                         'model__max_features': [2, 3, 4, 5, 6, 7, 8, 9]},
             return_train_score=True, scoring='f1')
tree_cv_results = pd.DataFrame(grid_search.cv_results_)

(tree_cv_results[['params', 'mean_test_score', 'mean_train_score']]
 .sort_values(by='mean_test_score', ascending=False)
 .head(10))

| | params | mean_test_score | mean_train_score |
|---|---|---|---|
| 47 | {'model__criterion': 'entropy', 'model__max_de... | 0.737464 | 0.746546 |
| 4 | {'model__criterion': 'gini', 'model__max_depth... | 0.726778 | 0.752055 |
| 2 | {'model__criterion': 'gini', 'model__max_depth... | 0.721987 | 0.740337 |
| 46 | {'model__criterion': 'entropy', 'model__max_de... | 0.720758 | 0.741882 |
| 26 | {'model__criterion': 'gini', 'model__max_depth... | 0.718246 | 0.830027 |
| 42 | {'model__criterion': 'entropy', 'model__max_de... | 0.716765 | 0.731583 |
| 12 | {'model__criterion': 'gini', 'model__max_depth... | 0.711195 | 0.776762 |
| 3 | {'model__criterion': 'gini', 'model__max_depth... | 0.710254 | 0.736806 |
| 52 | {'model__criterion': 'entropy', 'model__max_de... | 0.710147 | 0.750812 |
| 76 | {'model__criterion': 'entropy', 'model__max_de... | 0.709438 | 0.854808 |
grid_search.best_params_
{'model__criterion': 'entropy',
 'model__max_depth': 4,
 'model__max_features': 9}
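
Instead of rebuilding the pipeline by hand from best_params_, the already refitted grid_search.best_estimator_ could also be used directly (an optional alternative sketch; the next cell rebuilds the model explicitly):

# GridSearchCV refits the best pipeline on the whole training set by default (refit=True)
best_tree = grid_search.best_estimator_
summarize_classification(y_test, best_tree.predict(x_test))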
modelo = DecisionTreeClassifier(criterion='entropy',
                                max_depth=4,
                                max_features=8)

DecisionTree_pipe = Pipeline(steps=[("preprocessor", preprocessor),
                                    ("model", modelo)])
decision_tree_model = DecisionTree_pipe.fit(x_train, y_train)
y_pred = decision_tree_model.predict(x_test)
summarize_classification(y_test, y_pred)
Test data count:  262
f1_score :  0.7653061224489796
accuracy_score :  0.8244274809160306
precision_score :  0.7731958762886598
recall_score :  0.7575757575757576
roc_auc_score :  0.8113032162111917

Logistic regression

# when using pipelines, the parameter grid keys must be
# the step name followed by __ and the parameter name
parameters = {'model__penalty': ['l1', 'l2'], 
              'model__C': [0.1, 0.4, 0.8, 1, 2, 5]}

modelo = LogisticRegression(solver='liblinear')
logistic_pipe = Pipeline(steps=[("preprocessor", preprocessor),
                                ("model", modelo)])

grid_search = GridSearchCV(logistic_pipe,
                           parameters, cv=3,
                           scoring='f1',
                           return_train_score=True)
grid_search.fit(x_train, y_train)
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('numericas',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='median'))]),
                                                                         ['age',
                                                                          'fare',
                                                                          'sibsp',
                                                                          'parch']),
                                                                        ('categoricas',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('onehot',
                                                                                          OneHotEncoder())]),
                                                                         ['sex',
                                                                          'embarked']),
                                                                        ('categoricas '
                                                                         'ordinales',
                                                                         Pipeline(steps=[('imputer',
                                                                                          SimpleImputer(strategy='most_frequent')),
                                                                                         ('onehot',
                                                                                          OrdinalEncoder())]),
                                                                         ['pclass'])])),
                                       ('model',
                                        LogisticRegression(solver='liblinear'))]),
             param_grid={'model__C': [0.1, 0.4, 0.8, 1, 2, 5],
                         'model__penalty': ['l1', 'l2']},
             return_train_score=True, scoring='f1')
reg_log_cv_results = pd.DataFrame(grid_search.cv_results_)

(reg_log_cv_results[['params', 'mean_test_score', 'mean_train_score']]
 .sort_values(by='mean_test_score', ascending=False)
 .head(10))

| | params | mean_test_score | mean_train_score |
|---|---|---|---|
| 6 | {'model__C': 1, 'model__penalty': 'l1'} | 0.712230 | 0.711734 |
| 5 | {'model__C': 0.8, 'model__penalty': 'l2'} | 0.711346 | 0.713974 |
| 7 | {'model__C': 1, 'model__penalty': 'l2'} | 0.711346 | 0.714403 |
| 4 | {'model__C': 0.8, 'model__penalty': 'l1'} | 0.711247 | 0.710449 |
| 8 | {'model__C': 2, 'model__penalty': 'l1'} | 0.709520 | 0.711829 |
| 9 | {'model__C': 2, 'model__penalty': 'l2'} | 0.709520 | 0.713576 |
| 3 | {'model__C': 0.4, 'model__penalty': 'l2'} | 0.708792 | 0.713781 |
| 10 | {'model__C': 5, 'model__penalty': 'l1'} | 0.707759 | 0.714807 |
| 11 | {'model__C': 5, 'model__penalty': 'l2'} | 0.707759 | 0.714807 |
| 2 | {'model__C': 0.4, 'model__penalty': 'l1'} | 0.706930 | 0.708156 |
grid_search.best_params_
{'model__C': 1, 'model__penalty': 'l1'}
modelo = LogisticRegression(solver='liblinear',
                            C=0.4,
                            penalty='l2')
logistic_pipe = Pipeline(steps=[("preprocessor", preprocessor),
                                ("model", modelo)])

logistic_model = logistic_pipe.fit(x_train, y_train)
y_pred = logistic_model.predict(x_test)
summarize_classification(y_test, y_pred)
Test data count:  262
f1_score :  0.7346938775510204
accuracy_score :  0.8015267175572519
precision_score :  0.7422680412371134
recall_score :  0.7272727272727273
roc_auc_score :  0.7869492470719465

Final Evaluation Test

from sklearn.metrics import classification_report
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import ConfusionMatrixDisplay

from sklearn.metrics import roc_curve
from sklearn.metrics import RocCurveDisplay

Decision Tree

y_pred = decision_tree_model.predict(x_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       False       0.85      0.87      0.86       163
        True       0.77      0.76      0.77        99

    accuracy                           0.82       262
   macro avg       0.81      0.81      0.81       262
weighted avg       0.82      0.82      0.82       262
ConfusionMatrixDisplay.from_predictions(y_test,y_pred);

(figure: confusion matrix of the decision tree)

PrecisionRecallDisplay.from_predictions(y_test,y_pred);

(figure: precision-recall display of the decision tree)

dt_plot = RocCurveDisplay.from_estimator(decision_tree_model, x_test, y_test)
plt.show()

(figure: ROC curve of the decision tree)

Logistic Regression

y_pred = logistic_model.predict(x_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       False       0.84      0.85      0.84       163
        True       0.74      0.73      0.73        99

    accuracy                           0.80       262
   macro avg       0.79      0.79      0.79       262
weighted avg       0.80      0.80      0.80       262
ConfusionMatrixDisplay.from_predictions(y_test,y_pred);

(figure: confusion matrix of the logistic regression)

PrecisionRecallDisplay.from_predictions(y_test,y_pred);

(figure: precision-recall display of the logistic regression)

log_plot = RocCurveDisplay.from_estimator(logistic_model, x_test, y_test)
plt.show()

(figure: ROC curve of the logistic regression)

Model comparison

ax = plt.gca()
dt_plot.plot(ax=ax, alpha=0.8)
log_plot.plot(ax=ax, alpha=0.8)
plt.title('ROC Curve - Decision Tree vs Logistic Regression')
plt.show()

(figure: ROC curves, decision tree vs logistic regression)

Model Interpretation

The Decision Tree is selected as the best model, with the following parameters:

{'criterion': 'entropy', 'max_depth': 4, 'max_features': 8}

dfFeatures = pd.DataFrame({'Features':decision_tree_model['preprocessor']
                                      .get_feature_names_out(),
                           'Importances':decision_tree_model['model']
                                         .feature_importances_})
dfFeatures.sort_values(by='Importances',ascending=False)

| | Features | Importances |
|---|---|---|
| 5 | categoricas__sex_male | 0.524125 |
| 9 | categoricas ordinales__pclass | 0.227958 |
| 1 | numericas__fare | 0.109974 |
| 0 | numericas__age | 0.080572 |
| 2 | numericas__sibsp | 0.042068 |
| 8 | categoricas__embarked_S | 0.015302 |
| 3 | numericas__parch | 0.000000 |
| 4 | categoricas__sex_female | 0.000000 |
| 6 | categoricas__embarked_C | 0.000000 |
| 7 | categoricas__embarked_Q | 0.000000 |

Based on these results, the most important variables for determining whether a person survived are: 'sex', 'pclass', 'age', 'fare'.
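
As an optional, complementary check (not part of the original analysis), permutation importance on the test set gives a model-agnostic view of the same ranking, computed on the original (untransformed) columns before the less important ones are dropped below:

from sklearn.inspection import permutation_importance

# shuffle each input column and measure the drop in F1 of the full pipeline
perm = permutation_importance(decision_tree_model, x_test, y_test,
                              scoring="f1", n_repeats=10, random_state=0)
print(pd.Series(perm.importances_mean, index=x_test.columns)
        .sort_values(ascending=False))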

x_train = x_train.drop(columns=['sibsp', 'parch', 'embarked'])

cols_numericas = ["age", "fare"]
cols_categoricas = ["sex"]
cols_categoricas_ord = ["pclass"]

numeric_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ])

categorical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())])

categorical_ord_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OrdinalEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('numericas', numeric_pipe, cols_numericas),
        ('categoricas', categorical_pipe, cols_categoricas),
        ('categoricas ordinales', categorical_ord_pipe, cols_categoricas_ord)
])



modelo = DecisionTreeClassifier(criterion='entropy',
                                max_depth=4,
                                max_features=8)

DecisionTree_pipe = Pipeline(steps=[("preprocessor", preprocessor),
                                    ("model", modelo)])

DecisionTree_pipe.fit(x_train, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numericas',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median'))]),
                                                  ['age', 'fare']),
                                                 ('categoricas',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder())]),
                                                  ['sex']),
                                                 ('categoricas ordinales',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OrdinalEncoder())]),
                                                  ['pclass'])])),
                ('model',
                 DecisionTreeClassifier(criterion='entropy', max_depth=4,
                                        max_features=8))])
x_test = x_test.drop(columns=['sibsp', 'parch', 'embarked'])
y_pred = DecisionTree_pipe.predict(x_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       False       0.85      0.88      0.86       163
        True       0.78      0.74      0.76        99

    accuracy                           0.82       262
   macro avg       0.82      0.81      0.81       262
weighted avg       0.82      0.82      0.82       262

Equivalent results are obtained with fewer variables.

Final Preprocessing and Model with scikit-learn

After selecting the final model and its hyperparameters, a complete pipeline is built and trained on all the available data for prediction.

In a data science project, the final model should be trained on all available data when possible, so that it sees as much information as possible. Overfitting is controlled by the hyperparameters optimized in the previous step. This reflects real applications, where data is continuously updated: for example, an anomaly-classification system is periodically retrained on the historical data in order to predict present and future cases.

In this case, the Decision Tree model is used with the hyperparameters

{'criterion': 'entropy', 'max_depth': 4, 'max_features': 8}

Final Pipeline

To train the final pipeline, the clean data is loaded, keeping only the columns selected above.

columnas_base = ['sex', 'pclass', 'age', 'fare', 'survived']
titanic_df = pd.read_parquet('titanic_processed.parquet',
                              columns= columnas_base)

The data is split into the input variables and the output variable.

X = titanic_df.drop('survived', axis='columns')
y = titanic_df['survived']
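
Before fitting on the full dataset, an optional sanity check (sketched here, not in the original notebook) is to cross-validate the chosen pipeline on all the data and confirm the expected performance:

from sklearn.model_selection import cross_val_score

# expected F1 of the final pipeline, estimated by 5-fold cross-validation
scores = cross_val_score(DecisionTree_pipe, X, y, cv=5, scoring="f1")
print("mean f1:", scores.mean(), "+/-", scores.std())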

The final pipeline is trained on all the data.

DecisionTree_pipe.fit(X, y)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numericas',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median'))]),
                                                  ['age', 'fare']),
                                                 ('categoricas',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder())]),
                                                  ['sex']),
                                                 ('categoricas ordinales',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OrdinalEncoder())]),
                                                  ['pclass'])])),
                ('model',
                 DecisionTreeClassifier(criterion='entropy', max_depth=4,
                                        max_features=8))])

Save the Model

from joblib import dump, load # serialization library

# save the model to a file
dump(DecisionTree_pipe, 'DecisionTree_pipe-titanic.joblib')
['DecisionTree_pipe-titanic.joblib']
from joblib import load

mi_modelo = load('DecisionTree_pipe-titanic.joblib')
x_test.head()

| | pclass | sex | age | fare |
|---|---|---|---|---|
| 782 | 3 | male | 65.0 | 7.7500 |
| 351 | 2 | male | 60.0 | 39.0000 |
| 1135 | 3 | male | NaN | 7.8958 |
| 73 | 1 | female | 22.0 | 151.5500 |
| 1243 | 3 | male | NaN | 7.2250 |

Note that some of these rows have NaN in the age column; the pipeline imputes those values before making the prediction.

# predict only the first 5 rows of the test set
mi_modelo.predict(x_test.head())
array([False, False, False,  True, False])
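
The loaded pipeline also exposes class probabilities, which can be more informative than the hard labels (a small optional check):

# probability of each class for the same five rows; columns are [P(False), P(True)]
mi_modelo.predict_proba(x_test.head())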

References

Cheatsheet scikit-learn https://datacamp-community-prod.s3.amazonaws.com/5433fa18-9f43-44cc-b228-74672efcd116

PhD. Jose R. Zapata