Regression with Scikit Learn

By Jose R. Zapata

Buy me a coffee

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import datetime

Dataset information

automobile_df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv')

# View 5 random rows
automobile_df.sample(5)

      mpg  cylinders  displacement  horsepower  weight  acceleration  model_year origin                         name
125  20.0          6         198.0        95.0    3102          16.5          74    usa              plymouth duster
206  26.5          4         140.0        72.0    2565          13.6          76    usa                   ford pinto
100  18.0          6         250.0        88.0    3021          16.5          73    usa                ford maverick
160  17.0          6         231.0       110.0    3907          21.0          75    usa                buick century
295  35.7          4          98.0        80.0    1915          14.4          79    usa  dodge colt hatchback custom
# Dataset size
automobile_df.shape
(398, 9)
automobile_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB

Data Preparation

automobile_df['horsepower'].unique()
array([130., 165., 150., 140., 198., 220., 215., 225., 190., 170., 160.,
        95.,  97.,  85.,  88.,  46.,  87.,  90., 113., 200., 210., 193.,
        nan, 100., 105., 175., 153., 180., 110.,  72.,  86.,  70.,  76.,
        65.,  69.,  60.,  80.,  54., 208., 155., 112.,  92., 145., 137.,
       158., 167.,  94., 107., 230.,  49.,  75.,  91., 122.,  67.,  83.,
        78.,  52.,  61.,  93., 148., 129.,  96.,  71.,  98., 115.,  53.,
        81.,  79., 120., 152., 102., 108.,  68.,  58., 149.,  89.,  63.,
        48.,  66., 139., 103., 125., 133., 138., 135., 142.,  77.,  62.,
       132.,  84.,  64.,  74., 116.,  82.])
automobile_df['horsepower'].isna().sum()
6
automobile_df = automobile_df.dropna()
automobile_df.shape
(392, 9)
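Dropping the six incomplete rows is the simplest option here. As an alternative sketch (not used in this notebook), the missing horsepower values could be imputed instead of dropped, for example with the column median via scikit-learn's SimpleImputer:

from sklearn.impute import SimpleImputer

# Hypothetical alternative: fill the 6 missing horsepower values
# with the column median instead of dropping those rows
imputer = SimpleImputer(strategy='median')
automobile_df[['horsepower']] = imputer.fit_transform(automobile_df[['horsepower']])

Imputation would keep all 398 rows at the cost of 6 estimated values; the analysis below continues with the dropped-rows version.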

Remove unnecessary columns

automobile_df.drop(['origin', 'name'], axis='columns', inplace=True)
automobile_df.sample(5)

      mpg  cylinders  displacement  horsepower  weight  acceleration  model_year
301  34.2          4         105.0        70.0    2200          13.2          79
219  25.5          4         122.0        96.0    2300          15.5          77
321  32.2          4         108.0        75.0    2265          15.2          80
344  39.0          4          86.0        64.0    1875          16.4          81
146  28.0          4          90.0        75.0    2125          14.5          74
automobile_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   model_year    392 non-null    int64  
dtypes: float64(4), int64(3)
memory usage: 24.5 KB

Convert the 'model_year' format to the full year

automobile_df['model_year'] = '19' + automobile_df['model_year'].astype(str)
automobile_df.sample(5)

      mpg  cylinders  displacement  horsepower  weight  acceleration model_year
248  36.1          4          91.0        60.0    1800          16.4       1978
193  24.0          6         200.0        81.0    3012          17.6       1976
231  15.5          8         400.0       190.0    4325          12.2       1977
388  26.0          4         156.0        92.0    2585          14.5       1982
83   28.0          4          98.0        80.0    2164          15.0       1972

Add a column with the car's age in years

automobile_df['age'] = datetime.datetime.now().year - pd.to_numeric(automobile_df['model_year'])
automobile_df.drop(['model_year'], axis='columns', inplace=True)
automobile_df.sample(5)

      mpg  cylinders  displacement  horsepower  weight  acceleration  age
380  36.0          4         120.0        88.0    2160          14.5   41
17   21.0          6         200.0        85.0    2587          16.0   53
33   19.0          6         232.0       100.0    2634          13.0   52
146  28.0          4          90.0        75.0    2125          14.5   49
96   13.0          8         360.0       175.0    3821          11.0   50
automobile_df.dtypes
mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight            int64
acceleration    float64
age               int64
dtype: object
# Make sure horsepower is numeric (invalid entries become NaN)
automobile_df['horsepower'] = pd.to_numeric(automobile_df['horsepower'], errors='coerce')
automobile_df.describe()

              mpg   cylinders  displacement  horsepower       weight  acceleration         age
count  392.000000  392.000000    392.000000  392.000000   392.000000    392.000000  392.000000
mean    23.445918    5.471939    194.411990  104.469388  2977.584184     15.541327   47.020408
std      7.805007    1.705783    104.644004   38.491160   849.402560      2.758864    3.683737
min      9.000000    3.000000     68.000000   46.000000  1613.000000      8.000000   41.000000
25%     17.000000    4.000000    105.000000   75.000000  2225.250000     13.775000   44.000000
50%     22.750000    4.000000    151.000000   93.500000  2803.500000     15.500000   47.000000
75%     29.000000    8.000000    275.750000  126.000000  3614.750000     17.025000   50.000000
max     46.600000    8.000000    455.000000  230.000000  5140.000000     24.800000   53.000000

Univariate Analysis

Each variable should be analyzed individually and its characteristics described.
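As a minimal sketch of that univariate pass (assuming the numeric columns shown above), each variable can be summarized and plotted in turn:

# Sketch: summary statistics plus a histogram for every numeric column
for col in automobile_df.select_dtypes(include='number').columns:
    print(automobile_df[col].describe())
    sns.histplot(automobile_df[col])
    plt.show()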

Bivariate Analysis

Scatter Plots

fig, ax = plt.subplots(figsize=(12, 8))

plt.scatter(automobile_df['age'], automobile_df['mpg'])

plt.xlabel('Age (years)')
plt.ylabel('Miles per gallon');

[Figure: scatter plot of age vs mpg]

fig, ax = plt.subplots(figsize=(12, 8))

plt.scatter(automobile_df['acceleration'], automobile_df['mpg'])

plt.xlabel('Acceleration')
plt.ylabel('Miles per gallon');

[Figure: scatter plot of acceleration vs mpg]

fig, ax = plt.subplots(figsize=(12, 8))

plt.scatter(automobile_df['weight'], automobile_df['mpg'])

plt.xlabel('Weight')
plt.ylabel('Miles per gallon');

[Figure: scatter plot of weight vs mpg]

fig, ax = plt.subplots(figsize=(12, 8))

plt.scatter(automobile_df['displacement'], automobile_df['mpg'])

plt.xlabel('Displacement')
plt.ylabel('Miles per gallon');

[Figure: scatter plot of displacement vs mpg]

fig, ax = plt.subplots(figsize=(12, 8))

plt.scatter(automobile_df['horsepower'], automobile_df['mpg'])

plt.xlabel('Horsepower')
plt.ylabel('Miles per gallon');

[Figure: scatter plot of horsepower vs mpg]

fig, ax = plt.subplots(figsize=(12, 8))

plt.scatter(automobile_df['cylinders'], automobile_df['mpg'])

plt.xlabel('Cylinders')
plt.ylabel('Miles per gallon');

[Figure: scatter plot of cylinders vs mpg]

Correlation

automobile_corr = automobile_df.corr()

automobile_corr

                   mpg  cylinders  displacement  horsepower    weight  acceleration       age
mpg           1.000000  -0.777618     -0.805127   -0.778427 -0.832244      0.423329 -0.580541
cylinders    -0.777618   1.000000      0.950823    0.842983  0.897527     -0.504683  0.345647
displacement -0.805127   0.950823      1.000000    0.897257  0.932994     -0.543800  0.369855
horsepower   -0.778427   0.842983      0.897257    1.000000  0.864538     -0.689196  0.416361
weight       -0.832244   0.897527      0.932994    0.864538  1.000000     -0.416839  0.309120
acceleration  0.423329  -0.504683     -0.543800   -0.689196 -0.416839      1.000000 -0.290316
age          -0.580541   0.345647      0.369855    0.416361  0.309120     -0.290316  1.000000
fig, ax = plt.subplots(figsize=(12, 10))

sns.heatmap(automobile_corr, annot=True);

[Figure: correlation heatmap with annotations]

# Shuffle the rows and reset the index
automobile_df = automobile_df.sample(frac=1).reset_index(drop=True)

automobile_df.head()

    mpg  cylinders  displacement  horsepower  weight  acceleration  age
0  30.0          4          97.0        67.0    1985          16.4   46
1  13.0          8         360.0       170.0    4654          13.0   50
2  32.7          6         168.0       132.0    2910          11.4   43
3  20.5          6         225.0       100.0    3430          17.2   45
4  11.0          8         429.0       208.0    4633          11.0   51
automobile_df.to_csv('auto-mpg-processed.csv', index=False)

Experiments with Linear Regression

Linear regression with one feature (horsepower)

from sklearn.model_selection import train_test_split

X = automobile_df[['horsepower']]
Y = automobile_df['mpg']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)  # 20% for testing
x_train.sample(5)

     horsepower
210        75.0
1         170.0
344        75.0
0          67.0
4         208.0
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression().fit(x_train, y_train)
print('Puntaje Entrenamiento: ', linear_model.score(x_train, y_train))
Puntaje Entrenamiento:  0.5841005932173976
y_pred = linear_model.predict(x_test)
from sklearn.metrics import r2_score

print('Puntaje Testing: ', r2_score(y_test, y_pred))
Puntaje Testing:  0.7130733130588783
from sklearn.metrics import PredictionErrorDisplay
PredictionErrorDisplay.from_predictions(y_true=y_test,
                                        y_pred=y_pred,
                                        kind="actual_vs_predicted");

[Figure: actual vs predicted values]

PredictionErrorDisplay.from_predictions(y_true=y_test,
                                        y_pred=y_pred,
                                        kind="residual_vs_predicted");

[Figure: residuals vs predicted values]

fig, ax = plt.subplots(figsize=(12, 8))

plt.scatter(x_test, y_test)
plt.plot(x_test, y_pred, color='r')

plt.xlabel('Horsepower')
plt.ylabel('Mpg')
plt.show()

[Figure: fitted regression line over the horsepower vs mpg scatter]

Linear regression with one feature (age)

X = automobile_df[['age']]
Y = automobile_df['mpg']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

linear_model = LinearRegression().fit(x_train, y_train)

print('Puntaje de entrenamiento: ', linear_model.score(x_train, y_train))

y_pred = linear_model.predict(x_test)

print('Puntaje de Testing: ', r2_score(y_test, y_pred))
Puntaje de entrenamiento:  0.3068776992523711
Puntaje de Testing:  0.44000610623520187
PredictionErrorDisplay.from_predictions(y_true=y_test,
                                        y_pred=y_pred,
                                        kind="actual_vs_predicted");

[Figure: actual vs predicted values]

PredictionErrorDisplay.from_predictions(y_true=y_test,
                                        y_pred=y_pred,
                                        kind="residual_vs_predicted");

[Figure: residuals vs predicted values]

fig, ax = plt.subplots(figsize=(12, 8))

plt.scatter(x_test, y_test)
plt.plot(x_test, y_pred, color='r')

plt.xlabel('Age')
plt.ylabel('Mpg')
plt.show()

[Figure: fitted regression line over the age vs mpg scatter]

Linear regression with several features

# X = automobile_df[['displacement', 'horsepower', 'weight', 'acceleration', 'cylinders']]

X = automobile_df[['displacement', 'horsepower', 'weight']]
Y = automobile_df['mpg']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
linear_model = LinearRegression().fit(x_train, y_train)
print('Training score: ', linear_model.score(x_train, y_train))
Training score:  0.7052831602383469
predictors = x_train.columns
coef = pd.Series(linear_model.coef_, predictors).sort_values()

print(coef)
horsepower     -0.046900
weight         -0.005304
displacement   -0.003034
dtype: float64
y_pred = linear_model.predict(x_test)
print('Puntaje Testing', r2_score(y_test, y_pred))
Puntaje Testing 0.710191227033769
PredictionErrorDisplay.from_predictions(y_true=y_test,
                                        y_pred=y_pred,
                                        kind="actual_vs_predicted");

[Figure: actual vs predicted values]

PredictionErrorDisplay.from_predictions(y_true=y_test,
                                        y_pred=y_pred,
                                        kind="residual_vs_predicted");

[Figure: residuals vs predicted values]

Regression with Multiple Models

When there are many models, running cross validation on all of them is expensive in both computation and time, so the lower-performing models should be discarded progressively until the final model is reached.

  • First, the data is split into one part for model selection (Model selection dataset) and another for the final performance test (Performance dataset; this part of the data must only be used at the very end of the whole process). A sketch of this split follows the list.
  • The first step is to split the Model selection dataset into a training part (train) and a testing part (test), usually in an 80/20 or 70/30 proportion.
  • Evaluate all the models on the previous split and select the best ones (preferably ones whose working principles differ from each other).
  • With the best models (how many depends on how close their results are), run cross-validation (to detect over-fitting) and keep the ones with the best results.
  • Take the best model or models (those with the best performance and low variance in their results) and perform hyper parameter tuning. This process is computationally expensive, so it should be done with very few models.
  • Then select the best model (best performance and low variance) and keep the hyper parameters that gave the best result.
  • Finally, train the selected model with the found hyper parameters on the Model selection dataset and run the final test on the Performance dataset.
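Below is a minimal sketch of the first two steps, using the automobile_df frame from above; the variable names are illustrative:

from sklearn.model_selection import train_test_split

# Reserve a hold-out Performance dataset for the very end of the process
model_selection_df, performance_df = train_test_split(automobile_df, test_size=0.2)

# Split the Model selection dataset into train (80%) and test (20%)
x_train_sel, x_test_sel, y_train_sel, y_test_sel = train_test_split(
    model_selection_df.drop(columns='mpg'),
    model_selection_df['mpg'],
    test_size=0.2)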
#import statsmodels.api as sm

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Lars
from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

import warnings
warnings.filterwarnings("ignore")
automobile_df = pd.read_csv('auto-mpg-processed.csv')

automobile_df.head()

    mpg  cylinders  displacement  horsepower  weight  acceleration  age
0  30.0          4          97.0        67.0    1985          16.4   46
1  13.0          8         360.0       170.0    4654          13.0   50
2  32.7          6         168.0       132.0    2910          11.4   43
3  20.5          6         225.0       100.0    3430          17.2   45
4  11.0          8         429.0       208.0    4633          11.0   51
result_dict = {}

Helper functions

def entrenar_modelo(modelo,
                    name_of_y_col: str,
                    names_of_x_cols: list,
                    dataset: pd.DataFrame,
                    test_frac: float = 0.2,
                    ):
    """Train a model on a random train/test split and report its R2 scores.

    Returns a dict with the training and test scores.
    """

    X = dataset[names_of_x_cols]
    Y = dataset[name_of_y_col]

    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac)

    model = modelo.fit(x_train, y_train)

    y_pred = model.predict(x_test)

    print("Entrenamiento_score : ", model.score(x_train, y_train))
    print("Prueba_score : ", r2_score(y_test, y_pred))

    return {
            'Entrenamiento_score': model.score(x_train, y_train),
            'Prueba_score': r2_score(y_test, y_pred)
           }
# Function to compare the results of the models
# stored in the result_dict dictionary
def compare_results():
    for key in result_dict:
        print('Regresion: ', key)
        print('Entrenamiento score', result_dict[key]['Entrenamiento_score'])
        print('Prueba score', result_dict[key]['Prueba_score'])
        print()

Linear regression

result_dict['mpg ~ single_linear'] = entrenar_modelo(LinearRegression(),
                                                     'mpg',
                                                     ['weight'],
                                                     automobile_df)
Entrenamiento_score :  0.6929340901372136
Prueba_score :  0.6862492949888326
result_dict['mpg ~ kitchen_sink_linear'] = entrenar_modelo(LinearRegression(),
                                                      'mpg',
                                                     ['cylinders',
                                                      'displacement',
                                                      'horsepower',
                                                      'weight',
                                                      'acceleration'],
                                                      automobile_df)
Entrenamiento_score :  0.7101831758445959
Prueba_score :  0.6729626744895553
result_dict['mpg ~ parsimonius_linear'] = entrenar_modelo(LinearRegression(),
                                                          'mpg',
                                                          ['horsepower',
                                                           'weight'],
                                                           automobile_df)
Entrenamiento_score :  0.7044792740221829
Prueba_score :  0.7112066461045605
compare_results()
Regresion:  mpg ~ single_linear
Entrenamiento score 0.6929340901372136
Prueba score 0.6862492949888326

Regresion:  mpg ~ kitchen_sink_linear
Entrenamiento score 0.7101831758445959
Prueba score 0.6729626744895553

Regresion:  mpg ~ parsimonius_linear
Entrenamiento score 0.7044792740221829
Prueba score 0.7112066461045605

Lasso

result_dict['mpg ~ kitchen_sink_lasso'] = entrenar_modelo(Lasso(alpha=0.5),
                                                          'mpg',
                                                         ['cylinders',
                                                          'displacement',
                                                          'horsepower',
                                                          'weight',
                                                          'acceleration'],
                                                           automobile_df)
Entrenamiento_score :  0.705185556695638
Prueba_score :  0.7125443285785023
compare_results()
Regresion:  mpg ~ single_linear
Entrenamiento score 0.6929340901372136
Prueba score 0.6862492949888326

Regresion:  mpg ~ kitchen_sink_linear
Entrenamiento score 0.7101831758445959
Prueba score 0.6729626744895553

Regresion:  mpg ~ parsimonius_linear
Entrenamiento score 0.7044792740221829
Prueba score 0.7112066461045605

Regresion:  mpg ~ kitchen_sink_lasso
Entrenamiento score 0.705185556695638
Prueba score 0.7125443285785023

Ridge

result_dict['mpg ~ kitchen_sink_ridge'] = entrenar_modelo(Ridge(alpha=0.5),
                                                                 'mpg',
                                                                ['cylinders',
                                                                 'displacement',
                                                                 'horsepower',
                                                                 'weight',
                                                                 'acceleration'],
                                                                  automobile_df)
Entrenamiento_score :  0.70796844071492
Prueba_score :  0.7001902852658111
compare_results()
Regresion:  mpg ~ single_linear
Entrenamiento score 0.6929340901372136
Prueba score 0.6862492949888326

Regresion:  mpg ~ kitchen_sink_linear
Entrenamiento score 0.7101831758445959
Prueba score 0.6729626744895553

Regresion:  mpg ~ parsimonius_linear
Entrenamiento score 0.7044792740221829
Prueba score 0.7112066461045605

Regresion:  mpg ~ kitchen_sink_lasso
Entrenamiento score 0.705185556695638
Prueba score 0.7125443285785023

Regresion:  mpg ~ kitchen_sink_ridge
Entrenamiento score 0.70796844071492
Prueba score 0.7001902852658111

ElasticNet

result_dict['mpg ~ kitchen_sink_elastic_net_ols'] = entrenar_modelo(ElasticNet(alpha=1, l1_ratio=0.5,
                                                                              max_iter= 100000, 
                                                                              warm_start= True),
                                                               'mpg',
                                                              ['cylinders',
                                                               'displacement',
                                                               'horsepower',
                                                               'weight',
                                                               'acceleration'],
                                                                automobile_df)
Entrenamiento_score :  0.7067959317119072
Prueba_score :  0.7027536474756819
compare_results()
Regresion:  mpg ~ single_linear
Entrenamiento score 0.6929340901372136
Prueba score 0.6862492949888326

Regresion:  mpg ~ kitchen_sink_linear
Entrenamiento score 0.7101831758445959
Prueba score 0.6729626744895553

Regresion:  mpg ~ parsimonius_linear
Entrenamiento score 0.7044792740221829
Prueba score 0.7112066461045605

Regresion:  mpg ~ kitchen_sink_lasso
Entrenamiento score 0.705185556695638
Prueba score 0.7125443285785023

Regresion:  mpg ~ kitchen_sink_ridge
Entrenamiento score 0.70796844071492
Prueba score 0.7001902852658111

Regresion:  mpg ~ kitchen_sink_elastic_net_ols
Entrenamiento score 0.7067959317119072
Prueba score 0.7027536474756819

SVR

For SVR on larger datasets, the alternative LinearSVR implementation is preferred:

https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR

  • Uses a different library for implementation
  • More flexibility with choice of penalties
  • Scales to larger datasets
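As a minimal sketch (the hyperparameter values are illustrative), LinearSVR could be wrapped in a scaling pipeline, since this estimator is sensitive to feature scale, and then evaluated with entrenar_modelo exactly like the SVR below:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# LinearSVR handles larger datasets better than SVR(kernel='linear');
# standardize the features first for stable results
linear_svr_pipe = Pipeline([('scaler', StandardScaler()),
                            ('svr', LinearSVR(epsilon=0.05, C=0.3, max_iter=10000))])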
result_dict['mpg ~ kitchen_sink_svr'] = entrenar_modelo(SVR(kernel='linear',
                                                            epsilon=0.05,
                                                            C=0.3),
                                                    'mpg',
                                                   ['cylinders',
                                                    'displacement',
                                                    'horsepower',
                                                    'weight',
                                                    'acceleration'],
                                                     automobile_df)
Entrenamiento_score :  0.6843427697354635
Prueba_score :  0.7440428540582755
compare_results()
Regresion:  mpg ~ single_linear
Entrenamiento score 0.6929340901372136
Prueba score 0.6862492949888326

Regresion:  mpg ~ kitchen_sink_linear
Entrenamiento score 0.7101831758445959
Prueba score 0.6729626744895553

Regresion:  mpg ~ parsimonius_linear
Entrenamiento score 0.7044792740221829
Prueba score 0.7112066461045605

Regresion:  mpg ~ kitchen_sink_lasso
Entrenamiento score 0.705185556695638
Prueba score 0.7125443285785023

Regresion:  mpg ~ kitchen_sink_ridge
Entrenamiento score 0.70796844071492
Prueba score 0.7001902852658111

Regresion:  mpg ~ kitchen_sink_elastic_net_ols
Entrenamiento score 0.7067959317119072
Prueba score 0.7027536474756819

Regresion:  mpg ~ kitchen_sink_svr
Entrenamiento score 0.6843427697354635
Prueba score 0.7440428540582755

KNR

result_dict['mpg ~ kitchen_sink_kneighbors'] = entrenar_modelo(KNeighborsRegressor(n_neighbors=10),
                                                                'mpg',
                                                               ['cylinders',
                                                                'displacement',
                                                                'horsepower',
                                                                'weight',
                                                                'acceleration'],
                                                                automobile_df)
Entrenamiento_score :  0.7853568628076514
Prueba_score :  0.5826866402088287
compare_results()
Regresion:  mpg ~ single_linear
Entrenamiento score 0.6929340901372136
Prueba score 0.6862492949888326

Regresion:  mpg ~ kitchen_sink_linear
Entrenamiento score 0.7101831758445959
Prueba score 0.6729626744895553

Regresion:  mpg ~ parsimonius_linear
Entrenamiento score 0.7044792740221829
Prueba score 0.7112066461045605

Regresion:  mpg ~ kitchen_sink_lasso
Entrenamiento score 0.705185556695638
Prueba score 0.7125443285785023

Regresion:  mpg ~ kitchen_sink_ridge
Entrenamiento score 0.70796844071492
Prueba score 0.7001902852658111

Regresion:  mpg ~ kitchen_sink_elastic_net_ols
Entrenamiento score 0.7067959317119072
Prueba score 0.7027536474756819

Regresion:  mpg ~ kitchen_sink_svr
Entrenamiento score 0.6843427697354635
Prueba score 0.7440428540582755

Regresion:  mpg ~ kitchen_sink_kneighbors
Entrenamiento score 0.7853568628076514
Prueba score 0.5826866402088287

SGD

# This model needs standardized data to perform well

from sklearn.pipeline import Pipeline
SGDRegressor_pipe = Pipeline([('StandardScaler', StandardScaler()),
                              ('SGDRegressor', SGDRegressor(max_iter=10000,
                                                            tol=1e-3))
                             ])
result_dict['mpg ~ kitchen_sink_sgd'] = entrenar_modelo(SGDRegressor_pipe,
                                                   'mpg',
                                                  ['cylinders',
                                                   'displacement',
                                                   'horsepower',
                                                   'weight',
                                                   'acceleration'],
                                                    automobile_df)
Entrenamiento_score :  0.6967509810416522
Prueba_score :  0.7493939349317008
compare_results()
Regresion:  mpg ~ single_linear
Entrenamiento score 0.6929340901372136
Prueba score 0.6862492949888326

Regresion:  mpg ~ kitchen_sink_linear
Entrenamiento score 0.7101831758445959
Prueba score 0.6729626744895553

Regresion:  mpg ~ parsimonius_linear
Entrenamiento score 0.7044792740221829
Prueba score 0.7112066461045605

Regresion:  mpg ~ kitchen_sink_lasso
Entrenamiento score 0.705185556695638
Prueba score 0.7125443285785023

Regresion:  mpg ~ kitchen_sink_ridge
Entrenamiento score 0.70796844071492
Prueba score 0.7001902852658111

Regresion:  mpg ~ kitchen_sink_elastic_net_ols
Entrenamiento score 0.7067959317119072
Prueba score 0.7027536474756819

Regresion:  mpg ~ kitchen_sink_svr
Entrenamiento score 0.6843427697354635
Prueba score 0.7440428540582755

Regresion:  mpg ~ kitchen_sink_kneighbors
Entrenamiento score 0.7853568628076514
Prueba score 0.5826866402088287

Regresion:  mpg ~ kitchen_sink_sgd
Entrenamiento score 0.6967509810416522
Prueba score 0.7493939349317008

Decision Tree

result_dict['mpg ~ kitchen_sink_decision_tree'] = entrenar_modelo(DecisionTreeRegressor(max_depth=2),
                                                             'mpg',
                                                            ['cylinders',
                                                             'displacement',
                                                             'horsepower',
                                                             'weight',
                                                             'acceleration'],
                                                              automobile_df)
Entrenamiento_score :  0.731782801402372
Prueba_score :  0.7273334316094502
compare_results()
Regresion:  mpg ~ single_linear
Entrenamiento score 0.6929340901372136
Prueba score 0.6862492949888326

Regresion:  mpg ~ kitchen_sink_linear
Entrenamiento score 0.7101831758445959
Prueba score 0.6729626744895553

Regresion:  mpg ~ parsimonius_linear
Entrenamiento score 0.7044792740221829
Prueba score 0.7112066461045605

Regresion:  mpg ~ kitchen_sink_lasso
Entrenamiento score 0.705185556695638
Prueba score 0.7125443285785023

Regresion:  mpg ~ kitchen_sink_ridge
Entrenamiento score 0.70796844071492
Prueba score 0.7001902852658111

Regresion:  mpg ~ kitchen_sink_elastic_net_ols
Entrenamiento score 0.7067959317119072
Prueba score 0.7027536474756819

Regresion:  mpg ~ kitchen_sink_svr
Entrenamiento score 0.6843427697354635
Prueba score 0.7440428540582755

Regresion:  mpg ~ kitchen_sink_kneighbors
Entrenamiento score 0.7853568628076514
Prueba score 0.5826866402088287

Regresion:  mpg ~ kitchen_sink_sgd
Entrenamiento score 0.6967509810416522
Prueba score 0.7493939349317008

Regresion:  mpg ~ kitchen_sink_decision_tree
Entrenamiento score 0.731782801402372
Prueba score 0.7273334316094502

Lars

result_dict['mpg ~ kitchen_sink_lars'] = entrenar_modelo(Lars(n_nonzero_coefs=4),
                                                    'mpg',
                                                   ['cylinders',
                                                    'displacement',
                                                    'horsepower',
                                                    'weight',
                                                    'acceleration'],
                                                     automobile_df)
Entrenamiento_score :  0.7074987896418201
Prueba_score :  0.6910722188360159
compare_results()
Regresion:  mpg ~ single_linear
Entrenamiento score 0.6929340901372136
Prueba score 0.6862492949888326

Regresion:  mpg ~ kitchen_sink_linear
Entrenamiento score 0.7101831758445959
Prueba score 0.6729626744895553

Regresion:  mpg ~ parsimonius_linear
Entrenamiento score 0.7044792740221829
Prueba score 0.7112066461045605

Regresion:  mpg ~ kitchen_sink_lasso
Entrenamiento score 0.705185556695638
Prueba score 0.7125443285785023

Regresion:  mpg ~ kitchen_sink_ridge
Entrenamiento score 0.70796844071492
Prueba score 0.7001902852658111

Regresion:  mpg ~ kitchen_sink_elastic_net_ols
Entrenamiento score 0.7067959317119072
Prueba score 0.7027536474756819

Regresion:  mpg ~ kitchen_sink_svr
Entrenamiento score 0.6843427697354635
Prueba score 0.7440428540582755

Regresion:  mpg ~ kitchen_sink_kneighbors
Entrenamiento score 0.7853568628076514
Prueba score 0.5826866402088287

Regresion:  mpg ~ kitchen_sink_sgd
Entrenamiento score 0.6967509810416522
Prueba score 0.7493939349317008

Regresion:  mpg ~ kitchen_sink_decision_tree
Entrenamiento score 0.731782801402372
Prueba score 0.7273334316094502

Regresion:  mpg ~ kitchen_sink_lars
Entrenamiento score 0.7074987896418201
Prueba score 0.6910722188360159
# Build a dictionary with only the test score of each model
nombre_modelos = result_dict.keys()
resultados_prueba = {}  # create an empty dictionary
for nombre in nombre_modelos:
    resultados_prueba[nombre] = result_dict[nombre]['Prueba_score']
plt.figure(figsize = (12,10))  # figure size
plt.barh(range(len(resultados_prueba)), list(resultados_prueba.values()),
         align='center');
plt.title("Test-set score of each model")
plt.yticks(range(len(resultados_prueba)), list(resultados_prueba.keys()));

[Figure: horizontal bar chart of test scores per model]

Cross Validation - Model Selection

Analyze the variance of the results to keep the models with the best performance.

# list to store the models selected for cross validation
models = []

# Store the models as (name, model) tuples
models.append(('kitchen_sink_linear',LinearRegression()))
models.append(('kitchen_sink_lasso',Lasso(alpha=0.5)))
models.append(('kitchen_sink_elastic_net',ElasticNet(alpha=1,
                                                     l1_ratio=0.5, 
                                                    max_iter= 100000, 
                                                    warm_start= True)))
models.append(('kitchen_sink_kneighbors',KNeighborsRegressor(n_neighbors=10)))
models.append(('kitchen_sink_decision_tree',DecisionTreeRegressor(max_depth=2)))
models.append(('kitchen_sink_svr',SVR(kernel='linear', epsilon=0.05, C=0.3)))
# Record the results of each model
from sklearn import model_selection

# Seed to get reproducible test results
seed = 2
results = []
names = []
scoring = 'r2'
for name, model in models:
    # Kfold cross validation for model selection
    kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=seed)
    # evaluate on the training split (x_train, y_train defined above)
    cv_results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = f"({name}, {cv_results.mean()}, {cv_results.std()})"
    print(msg)
(kitchen_sink_linear, 0.6896886399698443, 0.045921282325928056)
(kitchen_sink_lasso, 0.6899674077297042, 0.04551627012666227)
(kitchen_sink_elastic_net, 0.6899856530481917, 0.04549868267546741)
(kitchen_sink_kneighbors, 0.6870342787253639, 0.06558189405850869)
(kitchen_sink_decision_tree, 0.6822895738299007, 0.05731626846239184)
(kitchen_sink_svr, 0.6835596490250043, 0.04594683494610037)
plt.figure(figsize = (15,8)) 
result_df = pd.DataFrame(results, index=names).T
result_df.boxplot()
plt.title("Cross Validation results");

[Figure: box plots of cross validation scores per model]

Statistical Comparison of Models

from scipy.stats import f_oneway

model1 = result_df['kitchen_sink_linear']
model2 = result_df['kitchen_sink_lasso']
model3 = result_df['kitchen_sink_elastic_net']
model4 = result_df['kitchen_sink_kneighbors']
model5 = result_df['kitchen_sink_decision_tree']
model6 = result_df['kitchen_sink_svr']

statistic, p_value = f_oneway(model1, model2, model3, model4, model5, model6)

print(f'Statistic: {statistic}')
print(f'p_value: {p_value}')

alpha = 0.05  # significance level

if p_value < alpha:
    print("There is a statistically significant difference in the cross-validation results of the models.")
else:
    print("There is no statistically significant difference in the cross-validation results of the models.")
Statistic: 0.017736104314674258
p_value: 0.9998609375900367
There is no statistically significant difference in the cross-validation results of the models.

Hyper Parameter Tuning

Hyperparameter optimization. Select the best models, preferably ones with different working principles.

from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings("ignore")
automobile_df = pd.read_csv('auto-mpg-processed.csv')

automobile_df.head()

    mpg  cylinders  displacement  horsepower  weight  acceleration  age
0  31.5          4          89.0        71.0    1990          14.9   45
1  15.0          8         318.0       150.0    3399          11.0   50
2  34.5          4         105.0        70.0    2150          14.9   44
3  19.0          6         250.0       100.0    3282          15.0   52
4  23.9          8         260.0        90.0    3420          22.2   44
X = automobile_df.drop(['mpg', 'age'], axis=1)

Y = automobile_df['mpg']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

Lasso regression

parameters = {'alpha': [0.2, 0.4, 0.6, 0.7, 0.8, 0.9, 1.0]}

grid_search = GridSearchCV(Lasso(), parameters, cv=3, return_train_score=True)
grid_search.fit(x_train, y_train)
GridSearchCV(cv=3, estimator=Lasso(),
             param_grid={'alpha': [0.2, 0.4, 0.6, 0.7, 0.8, 0.9, 1.0]},
             return_train_score=True)

Hyperparameter tuning results

print(f"Mejor resultado = {grid_search.best_score_}")
print(f"Mejor parametros = {grid_search.best_params_}")
Mejor resultado = 0.6780335620781672
Mejor parametros = {'alpha': 0.2}
# To inspect all the cross validation results
# Not required, just informative to see how the model varies
for i in range(len(parameters['alpha'])):
    print('Parametros: ', grid_search.cv_results_['params'][i])

    print('Promedio Score Prueba: ', grid_search.cv_results_['mean_test_score'][i])
    
    print('Rank: ', grid_search.cv_results_['rank_test_score'][i])
Parametros:  {'alpha': 0.2}
Promedio Score Prueba:  0.6780335620781672
Rank:  1
Parametros:  {'alpha': 0.4}
Promedio Score Prueba:  0.6778715866065846
Rank:  3
Parametros:  {'alpha': 0.6}
Promedio Score Prueba:  0.677876267061371
Rank:  2
Parametros:  {'alpha': 0.7}
Promedio Score Prueba:  0.6778496435917755
Rank:  4
Parametros:  {'alpha': 0.8}
Promedio Score Prueba:  0.6778219518141327
Rank:  5
Parametros:  {'alpha': 0.9}
Promedio Score Prueba:  0.6777937919345783
Rank:  6
Parametros:  {'alpha': 1.0}
Promedio Score Prueba:  0.6777669685319033
Rank:  7
lasso_model = Lasso(alpha=grid_search.best_params_['alpha']).fit(x_train, y_train)
y_pred = lasso_model.predict(x_test)

print('Entrenamiento score: ', lasso_model.score(x_train, y_train))
print('Prueba score: ', r2_score(y_test, y_pred))
Entrenamiento score:  0.6964506055697341
Prueba score:  0.7247233580821335

KNeighbors regression

parameters = {'n_neighbors': [10, 12, 14, 18, 20, 25, 30, 35, 50]}

grid_search = GridSearchCV(KNeighborsRegressor(), parameters, cv=3, return_train_score=True)
grid_search.fit(x_train, y_train)
GridSearchCV(cv=3, estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': [10, 12, 14, 18, 20, 25, 30, 35, 50]},
             return_train_score=True)
print(f"Mejor resultado = {grid_search.best_score_}")
print(f"Mejor parametros = {grid_search.best_params_}")
Mejor resultado = 0.6976710386307476
Mejor parametros = {'n_neighbors': 20}
kneighbors_model = KNeighborsRegressor(n_neighbors=grid_search.best_params_['n_neighbors']).fit(x_train, y_train)
y_pred = kneighbors_model.predict(x_test)

print('Entrenamiento score: ', kneighbors_model.score(x_train, y_train))
print('Prueba score: ', r2_score(y_test, y_pred))
Entrenamiento score:  0.7278939175813881
Prueba score:  0.7174936862217003

Decision Tree

parameters = {'max_depth':[1, 2, 3, 4, 5, 7, 8]}

grid_search = GridSearchCV(DecisionTreeRegressor(), parameters, cv=3, return_train_score=True)
grid_search.fit(x_train, y_train)
GridSearchCV(cv=3, estimator=DecisionTreeRegressor(),
             param_grid={'max_depth': [1, 2, 3, 4, 5, 7, 8]},
             return_train_score=True)
print(f"Mejor resultado = {grid_search.best_score_}")
print(f"Mejor parametros = {grid_search.best_params_}")
Mejor resultado = 0.7189376031570488
Mejor parametros = {'max_depth': 2}
decision_tree_model = DecisionTreeRegressor(max_depth=grid_search.best_params_['max_depth']).fit(x_train, y_train)
y_pred = decision_tree_model.predict(x_test)

print('Entrenamiento score: ', decision_tree_model.score(x_train, y_train))
print('Prueba score: ', r2_score(y_test, y_pred))
Entrenamiento score:  0.7350762963796875
Prueba score:  0.7066861561377391

SVR

parameters = {'epsilon': [0.05, 0.1, 0.2, 0.3],
              'C': [0.2, 0.3]}

grid_search = GridSearchCV(SVR(kernel='linear'), parameters, cv=3, return_train_score=True)
grid_search.fit(x_train, y_train)
GridSearchCV(cv=3, estimator=SVR(kernel='linear'),
             param_grid={'C': [0.2, 0.3], 'epsilon': [0.05, 0.1, 0.2, 0.3]},
             return_train_score=True)
print(f"Mejor resultado = {grid_search.best_score_}")
print(f"Mejor parametros = {grid_search.best_params_}")
Mejor resultado = 0.67394146137429
Mejor parametros = {'C': 0.2, 'epsilon': 0.05}

As an academic exercise, the result is estimated on the test data

svr_model = SVR(kernel='linear',
                epsilon=grid_search.best_params_['epsilon'], 
                C=grid_search.best_params_['C']).fit(x_train, y_train)
y_pred = svr_model.predict(x_test)

print('Entrenamiento score: ', svr_model.score(x_train, y_train))
print('Prueba score: ', r2_score(y_test, y_pred))
Entrenamiento score:  0.6883608736085003
Prueba score:  0.6831001462299411

Save the Model

Select the model that obtained the best results in the hyperparameter tuning and performed well on the test dataset.

# Train the model with all the available data
kneighbors_model = KNeighborsRegressor(n_neighbors= 25).fit(X, Y)
from joblib import dump  # serialization library

# save the model to a file
dump(kneighbors_model, 'kneighbors_model-auto_mpg.joblib')
['kneighbors_model-auto_mpg.joblib']

Use the Model

import pandas as pd
from joblib import load 
modelo = load('kneighbors_model-auto_mpg.joblib')
modelo
KNeighborsRegressor(n_neighbors=25)
datos = pd.read_csv('auto-mpg-processed.csv')
datos.head()

    mpg  cylinders  displacement  horsepower  weight  acceleration  age
0  31.5          4          89.0        71.0    1990          14.9   45
1  15.0          8         318.0       150.0    3399          11.0   50
2  34.5          4         105.0        70.0    2150          14.9   44
3  19.0          6         250.0       100.0    3282          15.0   52
4  23.9          8         260.0        90.0    3420          22.2   44
# take two input rows to make a prediction
datos_prueba = datos.iloc[2:4,1:6]
datos_prueba

   cylinders  displacement  horsepower  weight  acceleration
2          4         105.0        70.0    2150          14.9
3          6         250.0       100.0    3282          15.0
# prediction results from the model
modelo.predict(datos_prueba)
array([31.38 , 19.484])

PhD. Jose R. Zapata