Data exploration and description

Last updated: 21/Nov/2024

Description:

Build a feature engineering pipeline based on the data analysis in this notebook: https://github.com/JoseRZapata/demo-data-science-template/blob/main/notebooks/3-analysis/02-jrz-data_description_Manual-pandas-2024_10_24.ipynb

The feature engineering will be performed using scikit-learn pipelines and transformers.

📚 Import libraries

# base libraries for data science
from pathlib import Path

import pandas as pd
import sklearn as sk
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

💾 Load data

DATA_DIR = Path.cwd().resolve().parents[1] / "data"

titanic_df = pd.read_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)

# print library version for reproducibility

print("Pandas version: ", pd.__version__)
print("sklearn version: ", sk.__version__)

Pandas version:  2.1.4
sklearn version:  1.3.2

👷 Data preparation

The name column will be dropped because it is not relevant for the model.

selected_features = [
    "pclass",
    "sex",
    "age",
    "sibsp",
    "parch",
    "fare",
    "embarked",
    "survived",
]

titanic_features = titanic_df[selected_features].copy()
titanic_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   pclass    1309 non-null   int64
 1   sex       1309 non-null   category
 2   age       1046 non-null   float64
 3   sibsp     1309 non-null   int64
 4   parch     1309 non-null   int64
 5   fare      1308 non-null   float64
 6   embarked  1307 non-null   category
 7   survived  1309 non-null   bool
dtypes: bool(1), category(2), float64(2), int64(3)
memory usage: 55.3 KB

Missing values

titanic_features.isna().sum()

pclass        0
sex           0
age         263
sibsp         0
parch         0
fare          1
embarked      2
survived      0
dtype: int64
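
The counts above are easier to compare as a share of rows. The following is a minimal sketch on a small made-up frame (an assumption for illustration, not the Titanic data) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy_df = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, np.nan],
        "fare": [7.25, 71.28, np.nan, 8.05],
        "sex": ["male", "female", "female", "male"],
    }
)

# isna().sum() counts missing values per column;
# isna().mean() gives the same information as a fraction of rows.
missing_summary = pd.DataFrame(
    {
        "n_missing": toy_df.isna().sum(),
        "pct_missing": (toy_df.isna().mean() * 100).round(1),
    }
).sort_values("n_missing", ascending=False)

print(missing_summary)
```

Applied to the counts reported above, age is missing in 263 of 1309 rows (about 20%), while fare and embarked are each missing in fewer than 1% of rows.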

Duplicated data

duplicate_rows = titanic_features.duplicated().sum()
print("Number of duplicate rows: ", duplicate_rows)

Number of duplicate rows:  195
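
`duplicated()` flags a row only when an identical row has already appeared, so the count above equals the number of rows that `drop_duplicates()` removes with its default `keep="first"`. The following is a small sketch on made-up rows (an assumption for illustration):

```python
import pandas as pd

# Made-up rows for illustration; rows 0, 2 and 3 are identical.
toy_df = pd.DataFrame(
    {
        "pclass": [3, 1, 3, 3],
        "sex": ["male", "female", "male", "male"],
        "fare": [8.05, 71.28, 8.05, 8.05],
    }
)

# With the default keep="first", the first occurrence is not flagged.
dup_mask = toy_df.duplicated()
print(dup_mask.tolist())

# drop_duplicates() removes exactly the flagged rows and keeps the original index labels.
deduped = toy_df.drop_duplicates()
print(len(toy_df) - len(deduped), list(deduped.index))
```

In the notebook, removing 195 of 1309 rows leaves 1114, and since the index is not reset, the remaining labels still run from 0 to 1308.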

titanic_features.sample(10, random_state=42)

titanic_features = titanic_features.drop_duplicates()

titanic_features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1114 entries, 0 to 1308
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   pclass    1114 non-null   int64
 1   sex       1114 non-null   category
 2   age       974 non-null    float64
 3   sibsp     1114 non-null   int64
 4   parch     1114 non-null   int64
 5   fare      1113 non-null   float64
 6   embarked  1112 non-null   category
 7   survived  1114 non-null   bool
dtypes: bool(1), category(2), float64(2), int64(3)
memory usage: 55.7 KB

👨‍🏭 Feature Engineering

# Encode target variable
titanic_features["survived"] = titanic_features["survived"].astype("int")

# True = 1, False = 0

titanic_features.sample(5)

	pclass	sex	age	sibsp	parch	fare	embarked	survived
533	2	female	21.0	0	1	21.0000	S	1
407	2	female	29.0	1	0	26.0000	S	1
412	2	male	35.0	0	0	26.0000	S	0
254	1	male	NaN	0	0	30.5000	S	1
1300	3	female	15.0	1	0	14.4542	C	1

cols_numeric = ["age", "fare", "sibsp", "parch"]
cols_categoric = ["sex", "embarked"]
cols_categoric_ord = ["pclass"]

numeric_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
    ]
)

categorical_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder()),
    ]
)

categorical_ord_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ordinal", OrdinalEncoder()),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipe, cols_numeric),
        ("categoric", categorical_pipe, cols_categoric),
        ("categoric ordinales", categorical_ord_pipe, cols_categoric_ord),
    ]
)

preprocessor

ColumnTransformer(transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median'))]),
                                 ['age', 'fare', 'sibsp', 'parch']),
                                ('categoric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot', OneHotEncoder())]),
                                 ['sex', 'embarked']),
                                ('categoric ordinales',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('ordinal', OrdinalEncoder())]),
                                 ['pclass'])])


Numeric Pipeline:
- Columns: ["age", "fare", "sibsp", "parch"]
- Steps:
  - SimpleImputer(strategy="median"): Imputes missing values using the median of each column.
Categorical Pipeline:
- Columns: ["sex", "embarked"]
- Steps:
  - SimpleImputer(strategy="most_frequent"): Imputes missing values using the most frequent value in each column.
  - OneHotEncoder(): Encodes categorical features as a one-hot numeric array.
Categorical Ordinal Pipeline:
- Columns: ["pclass"]
- Steps:
  - SimpleImputer(strategy="most_frequent"): Imputes missing values using the most frequent value in each column.
  - OrdinalEncoder(): Encodes categorical features as ordinal integers.
Column Transformer:
- Combines the numeric, categorical, and categorical ordinal pipelines into a single preprocessing step.

Example of the data preprocessing pipeline

Train / Test split

X_features = titanic_features.drop("survived", axis="columns")
Y_target = titanic_features["survived"]

# 80% train, 20% test
x_train, x_test, y_train, y_test = train_test_split(
    X_features, Y_target, test_size=0.2, stratify=Y_target
)

x_train.shape, y_train.shape

((891, 7), (891,))

x_test.shape, y_test.shape

((223, 7), (223,))
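
`stratify=Y_target` keeps the proportion of survivors the same in both splits, and a float `test_size` is rounded up to a whole number of rows (0.2 × 1114 = 222.8, hence 223 test rows). The sketch below illustrates the stratification on a synthetic target. The data and the `random_state=42` argument are assumptions added for illustration and are not part of the notebook's split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target for illustration: 40 positives and 60 negatives.
toy_y = pd.Series([1] * 40 + [0] * 60, name="survived")
toy_X = pd.DataFrame({"feature": np.arange(len(toy_y))})

# random_state makes the shuffle repeatable across runs.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# With stratification, the share of positives is the same in both splits.
print(x_tr.shape, x_te.shape)
print(y_tr.mean(), y_te.mean())
```

Because the notebook's call does not set `random_state`, re-running it would likely select different rows than those shown in the recorded outputs.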

Preprocessing pipeline

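# fit() learns the imputation values and categories from the training split
# and returns the fitted preprocessor itself, not transformed data
preprocessor.fit(x_train)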

feature_names = preprocessor.get_feature_names_out()

# transform x_train with the fitted preprocessor and wrap the result in a DataFrame
x_train_transformed = preprocessor.transform(x_train)
x_train_transformed = pd.DataFrame(x_train_transformed, columns=feature_names)
x_train_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   numeric__age                 891 non-null    float64
 1   numeric__fare                891 non-null    float64
 2   numeric__sibsp               891 non-null    float64
 3   numeric__parch               891 non-null    float64
 4   categoric__sex_female        891 non-null    float64
 5   categoric__sex_male          891 non-null    float64
 6   categoric__embarked_C        891 non-null    float64
 7   categoric__embarked_Q        891 non-null    float64
 8   categoric__embarked_S        891 non-null    float64
 9   categoric ordinales__pclass  891 non-null    float64
dtypes: float64(10)
memory usage: 69.7 KB

x_train_transformed.head()

	numeric__age	numeric__fare	numeric__sibsp	numeric__parch	categoric__sex_female	categoric__sex_male	categoric__embarked_C	categoric__embarked_Q	categoric__embarked_S	categoric ordinales__pclass
0	2.0	31.2750	4.0	2.0	1.0	0.0	0.0	0.0	1.0	2.0
1	28.0	22.3583	0.0	2.0	1.0	0.0	1.0	0.0	0.0	2.0
2	10.0	46.9000	5.0	2.0	1.0	0.0	0.0	0.0	1.0	2.0
3	28.0	227.5250	0.0	0.0	0.0	1.0	1.0	0.0	0.0	0.0
4	18.0	8.3000	0.0	0.0	0.0	1.0	0.0	0.0	1.0	2.0

x_train.head()

	pclass	sex	age	sibsp	parch	fare	embarked
624	3	female	2.0	4	2	31.2750	S
1123	3	female	NaN	0	2	22.3583	C
828	3	female	10.0	5	2	46.9000	S
237	1	male	NaN	0	0	227.5250	C
619	3	male	18.0	0	0	8.3000	S

💡 Recommendations and Ideas

Handling Missing Data:
- Recommendation: Compare imputation strategies by their effect on model performance. Beyond the median and most-frequent strategies used here, consider KNNImputer or IterativeImputer.
- Rationale: age is missing in 140 of the 1114 deduplicated rows (about 13%), so the choice of imputation can noticeably affect the model. These imputers estimate each missing value from the other features instead of filling in a single constant.
Feature Scaling:
- Recommendation: Add a scaling step to the numeric pipeline using StandardScaler or MinMaxScaler.
- Rationale: Features such as fare and sibsp have very different ranges. Scaling helps algorithms that are sensitive to feature magnitude (for example, distance-based or gradient-based models), so that no feature dominates only because of its units.
Feature Engineering:
- Recommendation: Create new features from the existing ones, for example by combining sibsp and parch into a family_size feature.
- Rationale: Derived features can capture patterns that are not apparent from the raw columns.
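
The three recommendations can be combined in a short sketch. Everything here is an assumption for illustration: the made-up rows, the definition of `family_size` as `sibsp + parch + 1` (counting the passenger), and the choice of `KNNImputer(n_neighbors=2)` followed by `StandardScaler`. It has not been evaluated against the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rows for illustration only.
toy_df = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, 58.0],
        "fare": [7.25, 8.05, 53.10, 26.55],
        "sibsp": [1, 0, 1, 0],
        "parch": [0, 0, 2, 0],
    }
)

# Hypothetical derived feature: siblings/spouses + parents/children + the passenger.
toy_df["family_size"] = toy_df["sibsp"] + toy_df["parch"] + 1

# Numeric pipeline with a KNN-based imputer followed by standardization.
numeric_pipe_scaled = Pipeline(
    steps=[
        ("imputer", KNNImputer(n_neighbors=2)),
        ("scaler", StandardScaler()),
    ]
)

scaled = numeric_pipe_scaled.fit_transform(toy_df)

# After StandardScaler, each column has mean 0 and unit (population) standard deviation.
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

KNNImputer measures distances on the unscaled features, so columns with large ranges such as fare can dominate the choice of neighbors. Whether that matters is something to check when comparing strategies.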
