By Jose R. Zapata
Last updated: 21/Nov/2024
Description:
Build the feature engineering step based on the data analysis in this notebook: https://github.com/JoseRZapata/demo-data-science-template/blob/main/notebooks/3-analysis/02-jrz-data_description_Manual-pandas-2024_10_24.ipynb
The feature engineering is performed with scikit-learn pipelines and transformers.
📚 Import libraries
# base libraries for data science
from pathlib import Path
import pandas as pd
import sklearn as sk
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
💾 Load data
DATA_DIR = Path.cwd().resolve().parents[1] / "data"
titanic_df = pd.read_parquet(
DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)
# print library version for reproducibility
print("Pandas version: ", pd.__version__)
print("sklearn version: ", sk.__version__)
Pandas version: 2.1.4
sklearn version: 1.3.2
👷 Data preparation
The name column will be dropped because it is not relevant for the model.
selected_features = [
"pclass",
"sex",
"age",
"sibsp",
"parch",
"fare",
"embarked",
"survived",
]
titanic_features = titanic_df[selected_features].copy()
titanic_features.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null int64
1 sex 1309 non-null category
2 age 1046 non-null float64
3 sibsp 1309 non-null int64
4 parch 1309 non-null int64
5 fare 1308 non-null float64
6 embarked 1307 non-null category
7 survived 1309 non-null bool
dtypes: bool(1), category(2), float64(2), int64(3)
memory usage: 55.3 KB
Missing values
titanic_features.isna().sum()
pclass 0
sex 0
age 263
sibsp 0
parch 0
fare 1
embarked 2
survived 0
dtype: int64
Duplicated data
duplicate_rows = titanic_features.duplicated().sum()
print("Number of duplicate rows: ", duplicate_rows)
Number of duplicate rows: 195
titanic_features.sample(10, random_state=42)
titanic_features = titanic_features.drop_duplicates()
titanic_features.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1114 entries, 0 to 1308
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1114 non-null int64
1 sex 1114 non-null category
2 age 974 non-null float64
3 sibsp 1114 non-null int64
4 parch 1114 non-null int64
5 fare 1113 non-null float64
6 embarked 1112 non-null category
7 survived 1114 non-null bool
dtypes: bool(1), category(2), float64(2), int64(3)
memory usage: 55.7 KB
👨🏭 Feature Engineering
# Encode target variable
titanic_features["survived"] = titanic_features["survived"].astype("int")
# True = 1, False = 0
titanic_features.sample(5)
| | pclass | sex | age | sibsp | parch | fare | embarked | survived |
|---|---|---|---|---|---|---|---|---|
| 533 | 2 | female | 21.0 | 0 | 1 | 21.0000 | S | 1 |
| 407 | 2 | female | 29.0 | 1 | 0 | 26.0000 | S | 1 |
| 412 | 2 | male | 35.0 | 0 | 0 | 26.0000 | S | 0 |
| 254 | 1 | male | NaN | 0 | 0 | 30.5000 | S | 1 |
| 1300 | 3 | female | 15.0 | 1 | 0 | 14.4542 | C | 1 |
cols_numeric = ["age", "fare", "sibsp", "parch"]
cols_categoric = ["sex", "embarked"]
cols_categoric_ord = ["pclass"]
numeric_pipe = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="median")),
]
)
categorical_pipe = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder()),
]
)
categorical_ord_pipe = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="most_frequent")),
("onehot", OrdinalEncoder()),
]
)
preprocessor = ColumnTransformer(
transformers=[
("numeric", numeric_pipe, cols_numeric),
("categoric", categorical_pipe, cols_categoric),
("categoric ordinales", categorical_ord_pipe, cols_categoric_ord),
]
)
preprocessor
ColumnTransformer(transformers=[('numeric', Pipeline(steps=[('imputer', SimpleImputer(strategy='median'))]), ['age', 'fare', 'sibsp', 'parch']), ('categoric', Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder())]), ['sex', 'embarked']), ('categoric ordinales', Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OrdinalEncoder())]), ['pclass'])])
Numeric Pipeline:
- Columns: ["age", "fare", "sibsp", "parch"]
- Steps:
  - SimpleImputer(strategy="median"): Imputes missing values using the median of each column.
Categorical Pipeline:
- Columns: ["sex", "embarked"]
- Steps:
  - SimpleImputer(strategy="most_frequent"): Imputes missing values using the most frequent value in each column.
  - OneHotEncoder(): Encodes categorical features as a one-hot numeric array.
Categorical Ordinal Pipeline:
- Columns: ["pclass"]
- Steps:
  - SimpleImputer(strategy="most_frequent"): Imputes missing values using the most frequent value in each column.
  - OrdinalEncoder(): Encodes categorical features as ordinal integers.
Column Transformer:
- Combines the numeric, categorical, and categorical ordinal pipelines into a single preprocessing step.
Example of the data preprocessing pipeline
Train / Test split
X_features = titanic_features.drop("survived", axis="columns")
Y_target = titanic_features["survived"]
# 80% train, 20% test
x_train, x_test, y_train, y_test = train_test_split(
X_features, Y_target, test_size=0.2, stratify=Y_target
)
x_train.shape, y_train.shape
((891, 7), (891,))
x_test.shape, y_test.shape
((223, 7), (223,))
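The split above uses stratify so that both subsets keep the class proportions of the target. A small sketch with a toy target (hypothetical data, 30% positives) illustrates the effect. The random_state value is an assumption added here for reproducibility and is not part of the original split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with 30% positives (hypothetical data)
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([1] * 30 + [0] * 70, name="survived")

# random_state is an assumption added here so the sketch is repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Both subsets keep the 30% share of positives
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```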
Preprocessing pipeline
# fit the preprocessor on the training data only
preprocessor.fit(x_train)
feature_names = preprocessor.get_feature_names_out()
# transform x_train and wrap the result in a DataFrame with the output feature names
x_train_transformed = preprocessor.transform(x_train)
x_train_transformed = pd.DataFrame(x_train_transformed, columns=feature_names)
x_train_transformed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 numeric__age 891 non-null float64
1 numeric__fare 891 non-null float64
2 numeric__sibsp 891 non-null float64
3 numeric__parch 891 non-null float64
4 categoric__sex_female 891 non-null float64
5 categoric__sex_male 891 non-null float64
6 categoric__embarked_C 891 non-null float64
7 categoric__embarked_Q 891 non-null float64
8 categoric__embarked_S 891 non-null float64
9 categoric ordinales__pclass 891 non-null float64
dtypes: float64(10)
memory usage: 69.7 KB
x_train_transformed.head()
| | numeric__age | numeric__fare | numeric__sibsp | numeric__parch | categoric__sex_female | categoric__sex_male | categoric__embarked_C | categoric__embarked_Q | categoric__embarked_S | categoric ordinales__pclass |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 31.2750 | 4.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| 1 | 28.0 | 22.3583 | 0.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 |
| 2 | 10.0 | 46.9000 | 5.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| 3 | 28.0 | 227.5250 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 18.0 | 8.3000 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 2.0 |
x_train.head()
| | pclass | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|
| 624 | 3 | female | 2.0 | 4 | 2 | 31.2750 | S |
| 1123 | 3 | female | NaN | 0 | 2 | 22.3583 | C |
| 828 | 3 | female | 10.0 | 5 | 2 | 46.9000 | S |
| 237 | 1 | male | NaN | 0 | 0 | 227.5250 | C |
| 619 | 3 | male | 18.0 | 0 | 0 | 8.3000 | S |
💡 Recommendations and Ideas
Handling Missing Data:
- Recommendation: Evaluate the impact of different imputation strategies on model performance. Consider using advanced imputation techniques such as KNNImputer or IterativeImputer.
- Rationale: Different imputation strategies can have varying impacts on model performance. Advanced techniques may provide better estimates for missing values.
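As a rough illustration of this recommendation, the sketch below compares median imputation with KNNImputer on a toy numeric table (hypothetical values). The choice of n_neighbors=1 is an assumption made only to keep the result easy to follow. IterativeImputer is not shown. It is still experimental in scikit-learn and requires `from sklearn.experimental import enable_iterative_imputer` before it can be imported.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy numeric data with one missing age (hypothetical values)
toy_numeric = pd.DataFrame(
    {
        "age": [22.0, np.nan, 30.0, 40.0],
        "fare": [7.25, 8.05, 26.0, 80.0],
    }
)

# Median imputation fills the gap with the column median
median_imputed = SimpleImputer(strategy="median").fit_transform(toy_numeric)

# KNN imputation fills the gap from the most similar row(s),
# here the row with the closest fare (n_neighbors=1 is an assumption)
knn_imputed = KNNImputer(n_neighbors=1).fit_transform(toy_numeric)

print("median imputed age:", median_imputed[1, 0])
print("KNN imputed age:", knn_imputed[1, 0])
```

Whether the KNN estimate actually helps should be checked with cross-validation on the real data, since the toy example only shows that the two strategies can disagree.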
Feature Scaling:
- Recommendation: Add a scaling step to the numeric pipeline using StandardScaler or MinMaxScaler.
- Rationale: Scaling numeric features can improve the performance of many machine learning algorithms by ensuring that all features contribute equally to the model.
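A minimal sketch of how the numeric pipeline could be extended with a scaling step is shown below. The name numeric_pipe_scaled and the toy values are hypothetical. StandardScaler is used here, and MinMaxScaler could be swapped in the same position.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same imputer as the numeric pipeline, followed by a scaler
numeric_pipe_scaled = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# Toy numeric data (hypothetical values)
toy_numeric = pd.DataFrame(
    {"age": [20.0, np.nan, 40.0], "fare": [10.0, 20.0, 30.0]}
)

# Each column ends up with mean 0 and unit variance
scaled = numeric_pipe_scaled.fit_transform(toy_numeric)
print(scaled)
```

Imputation runs before scaling, so the scaler's statistics are computed on the filled values.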
Feature Engineering:
- Recommendation: Explore feature engineering techniques to create new features from the existing ones. For example, combining sibsp and parch to create a family_size feature.
- Rationale: Feature engineering can help capture important patterns in the data that are not immediately apparent from the raw features.
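One possible way to build the family_size feature inside a scikit-learn workflow is a FunctionTransformer, sketched below on toy data. The helper add_family_size is hypothetical, and the definition used (sibsp + parch + 1, counting the passenger) is an assumption, since the text does not specify the formula.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def add_family_size(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a family_size column (hypothetical helper)."""
    df = df.copy()
    # Assumption: family size counts siblings/spouses, parents/children,
    # and the passenger
    df["family_size"] = df["sibsp"] + df["parch"] + 1
    return df


family_size_transformer = FunctionTransformer(add_family_size)

# Toy data (hypothetical values)
toy = pd.DataFrame({"sibsp": [4, 0, 1], "parch": [2, 0, 0]})
toy_with_family = family_size_transformer.fit_transform(toy)
print(toy_with_family)
```

A transformer like this could be placed as a first step in a Pipeline, before the ColumnTransformer, so that the new column can be listed among the numeric columns.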
📖 References
- https://joserzapata.github.io/courses/python-ciencia-datos/ml/
- https://joserzapata.github.io/courses/python-ciencia-datos/clasificacion/
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems 2nd Edition - Aurélien Géron
- https://joserzapata.github.io/post/lista-proyecto-machine-learning/