By Jose R. Zapata
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
Dataset information
automobile_df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv')
# View 5 random records
automobile_df.sample(5)
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name | |
---|---|---|---|---|---|---|---|---|---|
125 | 20.0 | 6 | 198.0 | 95.0 | 3102 | 16.5 | 74 | usa | plymouth duster |
206 | 26.5 | 4 | 140.0 | 72.0 | 2565 | 13.6 | 76 | usa | ford pinto |
100 | 18.0 | 6 | 250.0 | 88.0 | 3021 | 16.5 | 73 | usa | ford maverick |
160 | 17.0 | 6 | 231.0 | 110.0 | 3907 | 21.0 | 75 | usa | buick century |
295 | 35.7 | 4 | 98.0 | 80.0 | 1915 | 14.4 | 79 | usa | dodge colt hatchback custom |
# Dataset size
automobile_df.shape
(398, 9)
automobile_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mpg 398 non-null float64
1 cylinders 398 non-null int64
2 displacement 398 non-null float64
3 horsepower 392 non-null float64
4 weight 398 non-null int64
5 acceleration 398 non-null float64
6 model_year 398 non-null int64
7 origin 398 non-null object
8 name 398 non-null object
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB
Data Preparation
automobile_df['horsepower'].unique()
array([130., 165., 150., 140., 198., 220., 215., 225., 190., 170., 160.,
95., 97., 85., 88., 46., 87., 90., 113., 200., 210., 193.,
nan, 100., 105., 175., 153., 180., 110., 72., 86., 70., 76.,
65., 69., 60., 80., 54., 208., 155., 112., 92., 145., 137.,
158., 167., 94., 107., 230., 49., 75., 91., 122., 67., 83.,
78., 52., 61., 93., 148., 129., 96., 71., 98., 115., 53.,
81., 79., 120., 152., 102., 108., 68., 58., 149., 89., 63.,
48., 66., 139., 103., 125., 133., 138., 135., 142., 77., 62.,
132., 84., 64., 74., 116., 82.])
automobile_df['horsepower'].isna().sum()
6
# Drop rows with missing values
automobile_df = automobile_df.dropna()
automobile_df.shape
(392, 9)
Drop unnecessary columns
automobile_df.drop(['origin', 'name'], axis='columns', inplace=True)
automobile_df.sample(5)
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | |
---|---|---|---|---|---|---|---|
301 | 34.2 | 4 | 105.0 | 70.0 | 2200 | 13.2 | 79 |
219 | 25.5 | 4 | 122.0 | 96.0 | 2300 | 15.5 | 77 |
321 | 32.2 | 4 | 108.0 | 75.0 | 2265 | 15.2 | 80 |
344 | 39.0 | 4 | 86.0 | 64.0 | 1875 | 16.4 | 81 |
146 | 28.0 | 4 | 90.0 | 75.0 | 2125 | 14.5 | 74 |
automobile_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 397
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mpg 392 non-null float64
1 cylinders 392 non-null int64
2 displacement 392 non-null float64
3 horsepower 392 non-null float64
4 weight 392 non-null int64
5 acceleration 392 non-null float64
6 model_year 392 non-null int64
dtypes: float64(4), int64(3)
memory usage: 24.5 KB
Convert 'model_year' to the full four-digit year
automobile_df['model_year'] = '19' + automobile_df['model_year'].astype(str)
automobile_df.sample(5)
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | |
---|---|---|---|---|---|---|---|
248 | 36.1 | 4 | 91.0 | 60.0 | 1800 | 16.4 | 1978 |
193 | 24.0 | 6 | 200.0 | 81.0 | 3012 | 17.6 | 1976 |
231 | 15.5 | 8 | 400.0 | 190.0 | 4325 | 12.2 | 1977 |
388 | 26.0 | 4 | 156.0 | 92.0 | 2585 | 14.5 | 1982 |
83 | 28.0 | 4 | 98.0 | 80.0 | 2164 | 15.0 | 1972 |
Add a column with the age of the automobile
automobile_df['age'] = datetime.datetime.now().year - pd.to_numeric(automobile_df['model_year'])
automobile_df.drop(['model_year'], axis='columns', inplace=True)
automobile_df.sample(5)
mpg | cylinders | displacement | horsepower | weight | acceleration | age | |
---|---|---|---|---|---|---|---|
380 | 36.0 | 4 | 120.0 | 88.0 | 2160 | 14.5 | 41 |
17 | 21.0 | 6 | 200.0 | 85.0 | 2587 | 16.0 | 53 |
33 | 19.0 | 6 | 232.0 | 100.0 | 2634 | 13.0 | 52 |
146 | 28.0 | 4 | 90.0 | 75.0 | 2125 | 14.5 | 49 |
96 | 13.0 | 8 | 360.0 | 175.0 | 3821 | 11.0 | 50 |
automobile_df.dtypes
mpg float64
cylinders int64
displacement float64
horsepower float64
weight int64
acceleration float64
age int64
dtype: object
# Ensure horsepower is a numeric column
automobile_df['horsepower'] = pd.to_numeric(automobile_df['horsepower'], errors='coerce')
automobile_df.describe()
mpg | cylinders | displacement | horsepower | weight | acceleration | age | |
---|---|---|---|---|---|---|---|
count | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 | 392.000000 |
mean | 23.445918 | 5.471939 | 194.411990 | 104.469388 | 2977.584184 | 15.541327 | 47.020408 |
std | 7.805007 | 1.705783 | 104.644004 | 38.491160 | 849.402560 | 2.758864 | 3.683737 |
min | 9.000000 | 3.000000 | 68.000000 | 46.000000 | 1613.000000 | 8.000000 | 41.000000 |
25% | 17.000000 | 4.000000 | 105.000000 | 75.000000 | 2225.250000 | 13.775000 | 44.000000 |
50% | 22.750000 | 4.000000 | 151.000000 | 93.500000 | 2803.500000 | 15.500000 | 47.000000 |
75% | 29.000000 | 8.000000 | 275.750000 | 126.000000 | 3614.750000 | 17.025000 | 50.000000 |
max | 46.600000 | 8.000000 | 455.000000 | 230.000000 | 5140.000000 | 24.800000 | 53.000000 |
Univariate Analysis
Each variable should be analyzed individually and its characteristics described, as in the sketch below.
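A minimal univariate sketch, assuming the automobile_df prepared above (pandas, matplotlib and seaborn are already imported at the top of the notebook):
# Sketch: histogram + density estimate for every numeric column
numeric_cols = automobile_df.select_dtypes('number').columns
fig, axes = plt.subplots(nrows=len(numeric_cols),
                         figsize=(8, 3 * len(numeric_cols)))
for ax, col in zip(axes, numeric_cols):
    sns.histplot(data=automobile_df, x=col, kde=True, ax=ax)
    ax.set_title(col)
plt.tight_layout()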
Bivariate Analysis
Scatter Plots
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(automobile_df['age'], automobile_df['mpg'])
plt.xlabel('Age (years)')
plt.ylabel('Miles per gallon');
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(automobile_df['acceleration'], automobile_df['mpg'])
plt.xlabel('Acceleration')
plt.ylabel('Miles per gallon');
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(automobile_df['weight'], automobile_df['mpg'])
plt.xlabel('Weight')
plt.ylabel('Miles per gallon');
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(automobile_df['displacement'], automobile_df['mpg'])
plt.xlabel('Displacement')
plt.ylabel('Miles per gallon');
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(automobile_df['horsepower'], automobile_df['mpg'])
plt.xlabel('Horsepower')
plt.ylabel('Miles per gallon');
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(automobile_df['cylinders'], automobile_df['mpg'])
plt.xlabel('Cylinders')
plt.ylabel('Miles per gallon');
Correlation
automobile_corr = automobile_df.corr()
automobile_corr
mpg | cylinders | displacement | horsepower | weight | acceleration | age | |
---|---|---|---|---|---|---|---|
mpg | 1.000000 | -0.777618 | -0.805127 | -0.778427 | -0.832244 | 0.423329 | -0.580541 |
cylinders | -0.777618 | 1.000000 | 0.950823 | 0.842983 | 0.897527 | -0.504683 | 0.345647 |
displacement | -0.805127 | 0.950823 | 1.000000 | 0.897257 | 0.932994 | -0.543800 | 0.369855 |
horsepower | -0.778427 | 0.842983 | 0.897257 | 1.000000 | 0.864538 | -0.689196 | 0.416361 |
weight | -0.832244 | 0.897527 | 0.932994 | 0.864538 | 1.000000 | -0.416839 | 0.309120 |
acceleration | 0.423329 | -0.504683 | -0.543800 | -0.689196 | -0.416839 | 1.000000 | -0.290316 |
age | -0.580541 | 0.345647 | 0.369855 | 0.416361 | 0.309120 | -0.290316 | 1.000000 |
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(automobile_corr, annot=True);
# Shuffle the rows of the dataset
automobile_df = automobile_df.sample(frac=1).reset_index(drop=True)
automobile_df.head()
mpg | cylinders | displacement | horsepower | weight | acceleration | age | |
---|---|---|---|---|---|---|---|
0 | 30.0 | 4 | 97.0 | 67.0 | 1985 | 16.4 | 46 |
1 | 13.0 | 8 | 360.0 | 170.0 | 4654 | 13.0 | 50 |
2 | 32.7 | 6 | 168.0 | 132.0 | 2910 | 11.4 | 43 |
3 | 20.5 | 6 | 225.0 | 100.0 | 3430 | 17.2 | 45 |
4 | 11.0 | 8 | 429.0 | 208.0 | 4633 | 11.0 | 51 |
automobile_df.to_csv('auto-mpg-processed.csv', index=False)
Linear Regression Experiments
Linear regression with a single feature (horsepower)
from sklearn.model_selection import train_test_split
X = automobile_df[['horsepower']]
Y = automobile_df['mpg']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)  # 20% for testing
x_train.sample(5)
horsepower | |
---|---|
210 | 75.0 |
1 | 170.0 |
344 | 75.0 |
0 | 67.0 |
4 | 208.0 |
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression().fit(x_train, y_train)
print('Training score: ', linear_model.score(x_train, y_train))
Training score:  0.5841005932173976
y_pred = linear_model.predict(x_test)
from sklearn.metrics import r2_score
print('Test score: ', r2_score(y_test, y_pred))
Test score:  0.7130733130588783
from sklearn.metrics import PredictionErrorDisplay
PredictionErrorDisplay.from_predictions(y_true=y_test,
y_pred=y_pred,
kind="actual_vs_predicted");
PredictionErrorDisplay.from_predictions(y_true=y_test,
y_pred=y_pred,
kind="residual_vs_predicted");
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(x_test, y_test)
plt.plot(x_test, y_pred, color='r')
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.show()
Linear regression with a single feature (age)
X = automobile_df[['age']]
Y = automobile_df['mpg']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
linear_model = LinearRegression().fit(x_train, y_train)
print('Training score: ', linear_model.score(x_train, y_train))
y_pred = linear_model.predict(x_test)
print('Test score: ', r2_score(y_test, y_pred))
Training score:  0.3068776992523711
Test score:  0.44000610623520187
PredictionErrorDisplay.from_predictions(y_true=y_test,
y_pred=y_pred,
kind="actual_vs_predicted");
PredictionErrorDisplay.from_predictions(y_true=y_test,
y_pred=y_pred,
kind="residual_vs_predicted");
fig, ax = plt.subplots(figsize=(12, 8))
plt.scatter(x_test, y_test)
plt.plot(x_test, y_pred, color='r')
plt.xlabel('Age')
plt.ylabel('MPG')
plt.show()
Linear regression with multiple features
# X = automobile_df[['displacement', 'horsepower', 'weight', 'acceleration', 'cylinders']]
X = automobile_df[['displacement', 'horsepower', 'weight']]
Y = automobile_df['mpg']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
linear_model = LinearRegression().fit(x_train, y_train)
print('Training score: ', linear_model.score(x_train, y_train))
Training score: 0.7052831602383469
predictors = x_train.columns
coef = pd.Series(linear_model.coef_, predictors).sort_values()
print(coef)
horsepower -0.046900
weight -0.005304
displacement -0.003034
dtype: float64
y_pred = linear_model.predict(x_test)
print('Test score: ', r2_score(y_test, y_pred))
Test score:  0.710191227033769
PredictionErrorDisplay.from_predictions(y_true=y_test,
y_pred=y_pred,
kind="actual_vs_predicted");
PredictionErrorDisplay.from_predictions(y_true=y_test,
y_pred=y_pred,
kind="residual_vs_predicted");
Regression with Multiple Models
Running cross-validation on every candidate model is expensive in computation and time, so lower-performing models should be discarded progressively until a final model is reached. The overall procedure is as follows (a sketch of the initial split comes after this list):
- First, split the data into one part for model selection (the model selection dataset) and another for the final performance test (the performance dataset; this part of the data must only be used at the very end of the whole process).
- Split the model selection dataset into a training part (train) and a test part (test), usually in an 80/20 or 70/30 ratio.
- Evaluate all the models on that split and select the best ones (preferably models whose working principles differ from one another).
- With the best models (how many depends on how similar their results are), run cross-validation (to detect over-fitting) and keep those with the best results.
- Take the best model or models (those with the best performance and low variance in their results) and perform hyperparameter tuning. This process is computationally expensive, so it should be done with very few models.
- Then select the best model (best performance and low variance) and keep the hyperparameters that gave the best result.
- Finally, train the selected model with the chosen hyperparameters on the model selection dataset and run the final test on the performance dataset.
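A minimal sketch of the initial split described above; the names model_selection_df and performance_df are illustrative and not part of the original notebook:
# Illustrative: reserve 20% of the data for the final performance test
from sklearn.model_selection import train_test_split

model_selection_df, performance_df = train_test_split(automobile_df,
                                                      test_size=0.2,
                                                      random_state=42)
# the model selection dataset is then split again into train/test (80/20)
x_train, x_test, y_train, y_test = train_test_split(
    model_selection_df.drop('mpg', axis='columns'),
    model_selection_df['mpg'],
    test_size=0.2)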
#import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Lars
from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
import warnings
warnings.filterwarnings("ignore")
automobile_df = pd.read_csv('auto-mpg-processed.csv')
automobile_df.head()
mpg | cylinders | displacement | horsepower | weight | acceleration | age | |
---|---|---|---|---|---|---|---|
0 | 30.0 | 4 | 97.0 | 67.0 | 1985 | 16.4 | 46 |
1 | 13.0 | 8 | 360.0 | 170.0 | 4654 | 13.0 | 50 |
2 | 32.7 | 6 | 168.0 | 132.0 | 2910 | 11.4 | 43 |
3 | 20.5 | 6 | 225.0 | 100.0 | 3430 | 17.2 | 45 |
4 | 11.0 | 8 | 429.0 | 208.0 | 4633 | 11.0 | 51 |
result_dict = {}
Helper functions
def entrenar_modelo(modelo,
                    name_of_y_col: str,
                    names_of_x_cols: list,
                    dataset: pd.DataFrame,
                    test_frac: float = 0.2,
                    ):
    """entrenar_modelo
    Train a model and evaluate it on a held-out test split.
    """
    X = dataset[names_of_x_cols]
    Y = dataset[name_of_y_col]
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_frac)
    model = modelo.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    print("Training_score : ", model.score(x_train, y_train))
    print("Test_score : ", r2_score(y_test, y_pred))
    return {
        'Training_score': model.score(x_train, y_train),
        'Test_score': r2_score(y_test, y_pred)
    }
# Function to compare the results of the models,
# which are stored in a dictionary
def compare_results():
    for key in result_dict:
        print('Regression: ', key)
        print('Training score', result_dict[key]['Training_score'])
        print('Test score', result_dict[key]['Test_score'])
        print()
Linear regression
result_dict['mpg ~ single_linear'] = entrenar_modelo(LinearRegression(),
'mpg',
['weight'],
automobile_df)
Training_score :  0.6929340901372136
Test_score :  0.6862492949888326
result_dict['mpg ~ kitchen_sink_linear'] = entrenar_modelo(LinearRegression(),
'mpg',
['cylinders',
'displacement',
'horsepower',
'weight',
'acceleration'],
automobile_df)
Training_score :  0.7101831758445959
Test_score :  0.6729626744895553
result_dict['mpg ~ parsimonius_linear'] = entrenar_modelo(LinearRegression(),
'mpg',
['horsepower',
'weight'],
automobile_df)
Training_score :  0.7044792740221829
Test_score :  0.7112066461045605
compare_results()
Regression:  mpg ~ single_linear
Training score 0.6929340901372136
Test score 0.6862492949888326
Regression:  mpg ~ kitchen_sink_linear
Training score 0.7101831758445959
Test score 0.6729626744895553
Regression:  mpg ~ parsimonius_linear
Training score 0.7044792740221829
Test score 0.7112066461045605
Lasso
result_dict['mpg ~ kitchen_sink_lasso'] = entrenar_modelo(Lasso(alpha=0.5),
'mpg',
['cylinders',
'displacement',
'horsepower',
'weight',
'acceleration'],
automobile_df)
Training_score :  0.705185556695638
Test_score :  0.7125443285785023
compare_results()
Regression:  mpg ~ single_linear
Training score 0.6929340901372136
Test score 0.6862492949888326
Regression:  mpg ~ kitchen_sink_linear
Training score 0.7101831758445959
Test score 0.6729626744895553
Regression:  mpg ~ parsimonius_linear
Training score 0.7044792740221829
Test score 0.7112066461045605
Regression:  mpg ~ kitchen_sink_lasso
Training score 0.705185556695638
Test score 0.7125443285785023
Ridge
result_dict['mpg ~ kitchen_sink_ridge'] = entrenar_modelo(Ridge(alpha=0.5),
'mpg',
['cylinders',
'displacement',
'horsepower',
'weight',
'acceleration'],
automobile_df)
Training_score :  0.70796844071492
Test_score :  0.7001902852658111
compare_results()
Regression:  mpg ~ single_linear
Training score 0.6929340901372136
Test score 0.6862492949888326
Regression:  mpg ~ kitchen_sink_linear
Training score 0.7101831758445959
Test score 0.6729626744895553
Regression:  mpg ~ parsimonius_linear
Training score 0.7044792740221829
Test score 0.7112066461045605
Regression:  mpg ~ kitchen_sink_lasso
Training score 0.705185556695638
Test score 0.7125443285785023
Regression:  mpg ~ kitchen_sink_ridge
Training score 0.70796844071492
Test score 0.7001902852658111
ElasticNet
result_dict['mpg ~ kitchen_sink_elastic_net_ols'] = entrenar_modelo(ElasticNet(alpha=1, l1_ratio=0.5,
max_iter= 100000,
warm_start= True),
'mpg',
['cylinders',
'displacement',
'horsepower',
'weight',
'acceleration'],
automobile_df)
Training_score :  0.7067959317119072
Test_score :  0.7027536474756819
compare_results()
Regression:  mpg ~ single_linear
Training score 0.6929340901372136
Test score 0.6862492949888326
Regression:  mpg ~ kitchen_sink_linear
Training score 0.7101831758445959
Test score 0.6729626744895553
Regression:  mpg ~ parsimonius_linear
Training score 0.7044792740221829
Test score 0.7112066461045605
Regression:  mpg ~ kitchen_sink_lasso
Training score 0.705185556695638
Test score 0.7125443285785023
Regression:  mpg ~ kitchen_sink_ridge
Training score 0.70796844071492
Test score 0.7001902852658111
Regression:  mpg ~ kitchen_sink_elastic_net_ols
Training score 0.7067959317119072
Test score 0.7027536474756819
SVR
For SVR on larger datasets, the alternative LinearSVR implementation is preferred (a usage sketch follows this list):
https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR
- Uses a different underlying library for the implementation
- Offers more flexibility in the choice of penalties
- Scales to larger datasets
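A hedged sketch of how LinearSVR could be plugged into this workflow (it is not executed in this notebook, and the epsilon, C and max_iter values are illustrative; LinearSVR benefits from standardized features, hence the pipeline):
# Sketch: LinearSVR as a scalable alternative to SVR(kernel='linear')
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

linear_svr_pipe = Pipeline([('StandardScaler', StandardScaler()),
                            ('LinearSVR', LinearSVR(epsilon=0.05, C=0.3,
                                                    max_iter=10000))])
# linear_svr_pipe can be passed to entrenar_modelo like any other estimator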
result_dict['mpg ~ kitchen_sink_svr'] = entrenar_modelo(SVR(kernel='linear',
epsilon=0.05,
C=0.3),
'mpg',
['cylinders',
'displacement',
'horsepower',
'weight',
'acceleration'],
automobile_df)
Training_score :  0.6843427697354635
Test_score :  0.7440428540582755
compare_results()
Regression:  mpg ~ single_linear
Training score 0.6929340901372136
Test score 0.6862492949888326
Regression:  mpg ~ kitchen_sink_linear
Training score 0.7101831758445959
Test score 0.6729626744895553
Regression:  mpg ~ parsimonius_linear
Training score 0.7044792740221829
Test score 0.7112066461045605
Regression:  mpg ~ kitchen_sink_lasso
Training score 0.705185556695638
Test score 0.7125443285785023
Regression:  mpg ~ kitchen_sink_ridge
Training score 0.70796844071492
Test score 0.7001902852658111
Regression:  mpg ~ kitchen_sink_elastic_net_ols
Training score 0.7067959317119072
Test score 0.7027536474756819
Regression:  mpg ~ kitchen_sink_svr
Training score 0.6843427697354635
Test score 0.7440428540582755
KNR
result_dict['mpg ~ kitchen_sink_kneighbors'] = entrenar_modelo(KNeighborsRegressor(n_neighbors=10),
'mpg',
['cylinders',
'displacement',
'horsepower',
'weight',
'acceleration'],
automobile_df)
Training_score :  0.7853568628076514
Test_score :  0.5826866402088287
compare_results()
Regression:  mpg ~ single_linear
Training score 0.6929340901372136
Test score 0.6862492949888326
Regression:  mpg ~ kitchen_sink_linear
Training score 0.7101831758445959
Test score 0.6729626744895553
Regression:  mpg ~ parsimonius_linear
Training score 0.7044792740221829
Test score 0.7112066461045605
Regression:  mpg ~ kitchen_sink_lasso
Training score 0.705185556695638
Test score 0.7125443285785023
Regression:  mpg ~ kitchen_sink_ridge
Training score 0.70796844071492
Test score 0.7001902852658111
Regression:  mpg ~ kitchen_sink_elastic_net_ols
Training score 0.7067959317119072
Test score 0.7027536474756819
Regression:  mpg ~ kitchen_sink_svr
Training score 0.6843427697354635
Test score 0.7440428540582755
Regression:  mpg ~ kitchen_sink_kneighbors
Training score 0.7853568628076514
Test score 0.5826866402088287
SGD
# This model needs the data to be standardized to perform well
from sklearn.pipeline import Pipeline
SGDRegressor_pipe = Pipeline([('StandardScaler', StandardScaler()),
('SGDRegressor', SGDRegressor(max_iter=10000,
tol=1e-3))
])
result_dict['mpg ~ kitchen_sink_sgd'] = entrenar_modelo(SGDRegressor_pipe,
'mpg',
['cylinders',
'displacement',
'horsepower',
'weight',
'acceleration'],
automobile_df)
Training_score :  0.6967509810416522
Test_score :  0.7493939349317008
compare_results()
Regression:  mpg ~ single_linear
Training score 0.6929340901372136
Test score 0.6862492949888326
Regression:  mpg ~ kitchen_sink_linear
Training score 0.7101831758445959
Test score 0.6729626744895553
Regression:  mpg ~ parsimonius_linear
Training score 0.7044792740221829
Test score 0.7112066461045605
Regression:  mpg ~ kitchen_sink_lasso
Training score 0.705185556695638
Test score 0.7125443285785023
Regression:  mpg ~ kitchen_sink_ridge
Training score 0.70796844071492
Test score 0.7001902852658111
Regression:  mpg ~ kitchen_sink_elastic_net_ols
Training score 0.7067959317119072
Test score 0.7027536474756819
Regression:  mpg ~ kitchen_sink_svr
Training score 0.6843427697354635
Test score 0.7440428540582755
Regression:  mpg ~ kitchen_sink_kneighbors
Training score 0.7853568628076514
Test score 0.5826866402088287
Regression:  mpg ~ kitchen_sink_sgd
Training score 0.6967509810416522
Test score 0.7493939349317008
Decision Tree
result_dict['mpg ~ kitchen_sink_decision_tree'] = entrenar_modelo(DecisionTreeRegressor(max_depth=2),
'mpg',
['cylinders',
'displacement',
'horsepower',
'weight',
'acceleration'],
automobile_df)
Training_score :  0.731782801402372
Test_score :  0.7273334316094502
compare_results()
Regression:  mpg ~ single_linear
Training score 0.6929340901372136
Test score 0.6862492949888326
Regression:  mpg ~ kitchen_sink_linear
Training score 0.7101831758445959
Test score 0.6729626744895553
Regression:  mpg ~ parsimonius_linear
Training score 0.7044792740221829
Test score 0.7112066461045605
Regression:  mpg ~ kitchen_sink_lasso
Training score 0.705185556695638
Test score 0.7125443285785023
Regression:  mpg ~ kitchen_sink_ridge
Training score 0.70796844071492
Test score 0.7001902852658111
Regression:  mpg ~ kitchen_sink_elastic_net_ols
Training score 0.7067959317119072
Test score 0.7027536474756819
Regression:  mpg ~ kitchen_sink_svr
Training score 0.6843427697354635
Test score 0.7440428540582755
Regression:  mpg ~ kitchen_sink_kneighbors
Training score 0.7853568628076514
Test score 0.5826866402088287
Regression:  mpg ~ kitchen_sink_sgd
Training score 0.6967509810416522
Test score 0.7493939349317008
Regression:  mpg ~ kitchen_sink_decision_tree
Training score 0.731782801402372
Test score 0.7273334316094502
Lars
result_dict['mpg ~ kitchen_sink_lars'] = entrenar_modelo(Lars(n_nonzero_coefs=4),
'mpg',
['cylinders',
'displacement',
'horsepower',
'weight',
'acceleration'],
automobile_df)
Training_score :  0.7074987896418201
Test_score :  0.6910722188360159
compare_results()
Regression:  mpg ~ single_linear
Training score 0.6929340901372136
Test score 0.6862492949888326
Regression:  mpg ~ kitchen_sink_linear
Training score 0.7101831758445959
Test score 0.6729626744895553
Regression:  mpg ~ parsimonius_linear
Training score 0.7044792740221829
Test score 0.7112066461045605
Regression:  mpg ~ kitchen_sink_lasso
Training score 0.705185556695638
Test score 0.7125443285785023
Regression:  mpg ~ kitchen_sink_ridge
Training score 0.70796844071492
Test score 0.7001902852658111
Regression:  mpg ~ kitchen_sink_elastic_net_ols
Training score 0.7067959317119072
Test score 0.7027536474756819
Regression:  mpg ~ kitchen_sink_svr
Training score 0.6843427697354635
Test score 0.7440428540582755
Regression:  mpg ~ kitchen_sink_kneighbors
Training score 0.7853568628076514
Test score 0.5826866402088287
Regression:  mpg ~ kitchen_sink_sgd
Training score 0.6967509810416522
Test score 0.7493939349317008
Regression:  mpg ~ kitchen_sink_decision_tree
Training score 0.731782801402372
Test score 0.7273334316094502
Regression:  mpg ~ kitchen_sink_lars
Training score 0.7074987896418201
Test score 0.6910722188360159
# Build a dictionary with only the test score of each model
nombre_modelos = result_dict.keys()
resultados_prueba = {}  # create an empty dictionary
for nombre in nombre_modelos:
    resultados_prueba[nombre] = result_dict[nombre]['Test_score']
plt.figure(figsize=(12, 10))  # figure size
plt.barh(range(len(resultados_prueba)), list(resultados_prueba.values()),
         align='center');
plt.title("Test dataset score of each model")
plt.yticks(range(len(resultados_prueba)), list(resultados_prueba.keys()));
Cross Validation - Model Selection
Analyze the variance of the results to select the models with the best performance.
# list to store each of the models selected for cross validation
models = []
# Store the models as (name, model) tuples
models.append(('kitchen_sink_linear',LinearRegression()))
models.append(('kitchen_sink_lasso',Lasso(alpha=0.5)))
models.append(('kitchen_sink_elastic_net',ElasticNet(alpha=1,
l1_ratio=0.5,
max_iter= 100000,
warm_start= True)))
models.append(('kitchen_sink_kneighbors',KNeighborsRegressor(n_neighbors=10)))
models.append(('kitchen_sink_decision_tree',DecisionTreeRegressor(max_depth=2)))
models.append(('kitchen_sink_svr',SVR(kernel='linear', epsilon=0.05, C=0.3)))
# Record the results of each model
from sklearn import model_selection
# Seed to obtain reproducible test results
seed = 2
results = []
names = []
scoring = 'r2'
for name, model in models:
    # K-fold cross validation for model selection
    kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=seed)
    # uses the x_train, y_train split created earlier
    cv_results = model_selection.cross_val_score(model, x_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = f"({name}, {cv_results.mean()}, {cv_results.std()})"
    print(msg)
(kitchen_sink_linear, 0.6896886399698443, 0.045921282325928056)
(kitchen_sink_lasso, 0.6899674077297042, 0.04551627012666227)
(kitchen_sink_elastic_net, 0.6899856530481917, 0.04549868267546741)
(kitchen_sink_kneighbors, 0.6870342787253639, 0.06558189405850869)
(kitchen_sink_decision_tree, 0.6822895738299007, 0.05731626846239184)
(kitchen_sink_svr, 0.6835596490250043, 0.04594683494610037)
plt.figure(figsize=(15, 8))
result_df = pd.DataFrame(results, index=names).T
result_df.boxplot()
plt.title("Cross Validation Results");
Statistical Comparison of Models
from scipy.stats import f_oneway
model1 = result_df['kitchen_sink_linear']
model2 = result_df['kitchen_sink_lasso']
model3 = result_df['kitchen_sink_elastic_net']
model4 = result_df['kitchen_sink_kneighbors']
model5 = result_df['kitchen_sink_decision_tree']
model6 = result_df['kitchen_sink_svr']
statistic, p_value = f_oneway(model1, model2, model3, model4, model5, model6)
print(f'Statistic: {statistic}')
print(f'p_value: {p_value}')
alpha = 0.05  # significance level
if p_value < alpha:
    print("There is a statistically significant difference in the cross-validation results of the models.")
else:
    print("There is no statistically significant difference in the cross-validation results of the models.")
Statistic: 0.017736104314674258
p_value: 0.9998609375900367
There is no statistically significant difference in the cross-validation results of the models.
Hyperparameter Tuning
Hyperparameter optimization: select the best models, preferably ones whose working principles differ from one another.
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings("ignore")
automobile_df = pd.read_csv('auto-mpg-processed.csv')
automobile_df.head()
mpg | cylinders | displacement | horsepower | weight | acceleration | age | |
---|---|---|---|---|---|---|---|
0 | 31.5 | 4 | 89.0 | 71.0 | 1990 | 14.9 | 45 |
1 | 15.0 | 8 | 318.0 | 150.0 | 3399 | 11.0 | 50 |
2 | 34.5 | 4 | 105.0 | 70.0 | 2150 | 14.9 | 44 |
3 | 19.0 | 6 | 250.0 | 100.0 | 3282 | 15.0 | 52 |
4 | 23.9 | 8 | 260.0 | 90.0 | 3420 | 22.2 | 44 |
X = automobile_df.drop(['mpg', 'age'], axis=1)
Y = automobile_df['mpg']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
Lasso regression
parameters = {'alpha': [0.2, 0.4, 0.6, 0.7, 0.8, 0.9, 1.0]}
grid_search = GridSearchCV(Lasso(), parameters, cv=3, return_train_score=True)
grid_search.fit(x_train, y_train)
GridSearchCV(cv=3, estimator=Lasso(), param_grid={'alpha': [0.2, 0.4, 0.6, 0.7, 0.8, 0.9, 1.0]}, return_train_score=True)
Hyperparameter tuning results
print(f"Mejor resultado = {grid_search.best_score_}") print(f"Mejor parametros = {grid_search.best_params_}")
Mejor resultado = 0.6780335620781672 Mejor parametros = {'alpha': 0.2}
# View all the cross validation results
# Not required; just informative, to see how the model varies
for i in range(len(parameters['alpha'])):
    print('Parameters: ', grid_search.cv_results_['params'][i])
    print('Mean test score: ', grid_search.cv_results_['mean_test_score'][i])
    print('Rank: ', grid_search.cv_results_['rank_test_score'][i])
Parameters:  {'alpha': 0.2}
Mean test score:  0.6780335620781672
Rank:  1
Parameters:  {'alpha': 0.4}
Mean test score:  0.6778715866065846
Rank:  3
Parameters:  {'alpha': 0.6}
Mean test score:  0.677876267061371
Rank:  2
Parameters:  {'alpha': 0.7}
Mean test score:  0.6778496435917755
Rank:  4
Parameters:  {'alpha': 0.8}
Mean test score:  0.6778219518141327
Rank:  5
Parameters:  {'alpha': 0.9}
Mean test score:  0.6777937919345783
Rank:  6
Parameters:  {'alpha': 1.0}
Mean test score:  0.6777669685319033
Rank:  7
lasso_model = Lasso(alpha=grid_search.best_params_['alpha']).fit(x_train, y_train)
y_pred = lasso_model.predict(x_test)
print('Training score: ', lasso_model.score(x_train, y_train))
print('Test score: ', r2_score(y_test, y_pred))
Training score:  0.6964506055697341
Test score:  0.7247233580821335
KNeighbors regression
parameters = {'n_neighbors': [10, 12, 14, 18, 20, 25, 30, 35, 50]}
grid_search = GridSearchCV(KNeighborsRegressor(), parameters, cv=3, return_train_score=True)
grid_search.fit(x_train, y_train)
GridSearchCV(cv=3, estimator=KNeighborsRegressor(), param_grid={'n_neighbors': [10, 12, 14, 18, 20, 25, 30, 35, 50]}, return_train_score=True)
print(f"Mejor resultado = {grid_search.best_score_}") print(f"Mejor parametros = {grid_search.best_params_}")
Mejor resultado = 0.6976710386307476 Mejor parametros = {'n_neighbors': 20}
kneighbors_model = KNeighborsRegressor(n_neighbors=grid_search.best_params_['n_neighbors']).fit(x_train, y_train)
y_pred = kneighbors_model.predict(x_test)
print('Training score: ', kneighbors_model.score(x_train, y_train))
print('Test score: ', r2_score(y_test, y_pred))
Training score:  0.7278939175813881
Test score:  0.7174936862217003
Decision Tree
parameters = {'max_depth': [1, 2, 3, 4, 5, 7, 8]}
grid_search = GridSearchCV(DecisionTreeRegressor(), parameters, cv=3, return_train_score=True)
grid_search.fit(x_train, y_train)
GridSearchCV(cv=3, estimator=DecisionTreeRegressor(), param_grid={'max_depth': [1, 2, 3, 4, 5, 7, 8]}, return_train_score=True)
print(f"Mejor resultado = {grid_search.best_score_}") print(f"Mejor parametros = {grid_search.best_params_}")
Mejor resultado = 0.7189376031570488 Mejor parametros = {'max_depth': 2}
decision_tree_model = DecisionTreeRegressor(max_depth=grid_search.best_params_['max_depth']).fit(x_train, y_train)
y_pred = decision_tree_model.predict(x_test)
print('Training score: ', decision_tree_model.score(x_train, y_train))
print('Test score: ', r2_score(y_test, y_pred))
Training score:  0.7350762963796875
Test score:  0.7066861561377391
SVR
parameters = {'epsilon': [0.05, 0.1, 0.2, 0.3], 'C': [0.2, 0.3]}
grid_search = GridSearchCV(SVR(kernel='linear'), parameters, cv=3, return_train_score=True)
grid_search.fit(x_train, y_train)
GridSearchCV(cv=3, estimator=SVR(kernel='linear'), param_grid={'C': [0.2, 0.3], 'epsilon': [0.05, 0.1, 0.2, 0.3]}, return_train_score=True)
print(f"Mejor resultado = {grid_search.best_score_}") print(f"Mejor parametros = {grid_search.best_params_}")
Mejor resultado = 0.67394146137429 Mejor parametros = {'C': 0.2, 'epsilon': 0.05}
As an academic exercise, the performance is also estimated on the test data.
svr_model = SVR(kernel='linear', epsilon=grid_search.best_params_['epsilon'], C=grid_search.best_params_['C']).fit(x_train, y_train)
y_pred = svr_model.predict(x_test)
print('Training score: ', svr_model.score(x_train, y_train))
print('Test score: ', r2_score(y_test, y_pred))
Training score:  0.6883608736085003
Test score:  0.6831001462299411
Save the Model
Select the model that achieved the best results during hyperparameter tuning and also performed well on the test dataset.
# Train the model with all the available data
kneighbors_model = KNeighborsRegressor(n_neighbors=25).fit(X, Y)
from joblib import dump  # serialization library
# save the model to a file
dump(kneighbors_model, 'kneighbors_model-auto_mpg.joblib')
['kneighbors_model-auto_mpg.joblib']
Use the Model
import pandas as pd
from joblib import load
modelo = load('kneighbors_model-auto_mpg.joblib')
modelo
KNeighborsRegressor(n_neighbors=25)
datos = pd.read_csv('auto-mpg-processed.csv')
datos.head()
mpg | cylinders | displacement | horsepower | weight | acceleration | age | |
---|---|---|---|---|---|---|---|
0 | 31.5 | 4 | 89.0 | 71.0 | 1990 | 14.9 | 45 |
1 | 15.0 | 8 | 318.0 | 150.0 | 3399 | 11.0 | 50 |
2 | 34.5 | 4 | 105.0 | 70.0 | 2150 | 14.9 | 44 |
3 | 19.0 | 6 | 250.0 | 100.0 | 3282 | 15.0 | 52 |
4 | 23.9 | 8 | 260.0 | 90.0 | 3420 | 22.2 | 44 |
# take two input rows to make a prediction
datos_prueba = datos.iloc[2:4, 1:6]
datos_prueba
cylinders | displacement | horsepower | weight | acceleration | |
---|---|---|---|---|---|
2 | 4 | 105.0 | 70.0 | 2150 | 14.9 |
3 | 6 | 250.0 | 100.0 | 3282 | 15.0 |
# prediction results from the model
modelo.predict(datos_prueba)
array([31.38 , 19.484])
Ph.D. Jose R. Zapata