Data exploration and description

By Jose R. Zapata

Last updated: 21/Nov/2024

Description:

Build the feature engineering steps based on the data analysis in this notebook: https://github.com/JoseRZapata/demo-data-science-template/blob/main/notebooks/3-analysis/02-jrz-data_description_Manual-pandas-2024_10_24.ipynb

The feature engineering will be performed using scikit-learn pipelines and transformers.

📚 Import libraries

# base libraries for data science
from pathlib import Path

import pandas as pd
import sklearn as sk
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

💾 Load data

DATA_DIR = Path.cwd().resolve().parents[1] / "data"

titanic_df = pd.read_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", engine="pyarrow"
)
# print library version for reproducibility

print("Pandas version: ", pd.__version__)
print("sklearn version: ", sk.__version__)
Pandas version:  2.1.4
sklearn version:  1.3.2

👷 Data preparation

The name column will be dropped because it is not relevant for the model.

selected_features = [
    "pclass",
    "sex",
    "age",
    "sibsp",
    "parch",
    "fare",
    "embarked",
    "survived",
]

titanic_features = titanic_df[selected_features].copy()
titanic_features.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   pclass    1309 non-null   int64
 1   sex       1309 non-null   category
 2   age       1046 non-null   float64
 3   sibsp     1309 non-null   int64
 4   parch     1309 non-null   int64
 5   fare      1308 non-null   float64
 6   embarked  1307 non-null   category
 7   survived  1309 non-null   bool
dtypes: bool(1), category(2), float64(2), int64(3)
memory usage: 55.3 KB

Missing values

titanic_features.isna().sum()
pclass        0
sex           0
age         263
sibsp         0
parch         0
fare          1
embarked      2
survived      0
dtype: int64

Duplicated data

duplicate_rows = titanic_features.duplicated().sum()
print("Number of duplicate rows: ", duplicate_rows)
Number of duplicate rows:  195

titanic_features.sample(10, random_state=42)

titanic_features = titanic_features.drop_duplicates()
titanic_features.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1114 entries, 0 to 1308
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   pclass    1114 non-null   int64
 1   sex       1114 non-null   category
 2   age       974 non-null    float64
 3   sibsp     1114 non-null   int64
 4   parch     1114 non-null   int64
 5   fare      1113 non-null   float64
 6   embarked  1112 non-null   category
 7   survived  1114 non-null   bool
dtypes: bool(1), category(2), float64(2), int64(3)
memory usage: 55.7 KB

👨‍🏭 Feature Engineering

# Encode target variable
titanic_features["survived"] = titanic_features["survived"].astype("int")

# True = 1, False = 0
titanic_features.sample(5)

      pclass     sex   age  sibsp  parch     fare embarked  survived
533        2  female  21.0      0      1  21.0000        S         1
407        2  female  29.0      1      0  26.0000        S         1
412        2    male  35.0      0      0  26.0000        S         0
254        1    male   NaN      0      0  30.5000        S         1
1300       3  female  15.0      1      0  14.4542        C         1

cols_numeric = ["age", "fare", "sibsp", "parch"]
cols_categoric = ["sex", "embarked"]
cols_categoric_ord = ["pclass"]
numeric_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
    ]
)

categorical_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder()),
    ]
)

categorical_ord_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OrdinalEncoder()),  # despite the step name, this applies ordinal encoding
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipe, cols_numeric),
        ("categoric", categorical_pipe, cols_categoric),
        ("categoric ordinales", categorical_ord_pipe, cols_categoric_ord),
    ]
)
preprocessor
ColumnTransformer(transformers=[('numeric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median'))]),
                                 ['age', 'fare', 'sibsp', 'parch']),
                                ('categoric',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot', OneHotEncoder())]),
                                 ['sex', 'embarked']),
                                ('categoric ordinales',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot', OrdinalEncoder())]),
                                 ['pclass'])])
  1. Numeric Pipeline:

    • Columns: ["age", "fare", "sibsp", "parch"]
    • Steps:
      • SimpleImputer(strategy="median"): Imputes missing values using the median of each column.
  2. Categorical Pipeline:

    • Columns: ["sex", "embarked"]
    • Steps:
      • SimpleImputer(strategy="most_frequent"): Imputes missing values using the most frequent value in each column.
      • OneHotEncoder(): Encodes categorical features as a one-hot numeric array.
  3. Categorical Ordinal Pipeline:

    • Columns: ["pclass"]
    • Steps:
      • SimpleImputer(strategy="most_frequent"): Imputes missing values using the most frequent value in each column.
      • OrdinalEncoder(): Encodes categorical features as ordinal integers.
  4. Column Transformer:

    • Combines the numeric, categorical, and categorical ordinal pipelines into a single preprocessing step.
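
To make the steps described above more concrete, here is a minimal sketch that fits a reduced version of the same structure and inspects what each step learned. The toy data and the names toy_df and toy_preprocessor are hypothetical and are not part of the notebook; only the scikit-learn classes match the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data (not the Titanic dataset)
toy_df = pd.DataFrame(
    {
        "age": [20.0, np.nan, 40.0, 30.0],
        "sex": ["male", "female", "female", "male"],
        "pclass": [3, 1, 2, 3],
    }
)

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", Pipeline([("imputer", SimpleImputer(strategy="median"))]), ["age"]),
        ("categoric", Pipeline([("onehot", OneHotEncoder())]), ["sex"]),
        ("ordinal", Pipeline([("encoder", OrdinalEncoder())]), ["pclass"]),
    ]
)
toy_preprocessor.fit(toy_df)

# Median learned by the imputer from the non-missing ages
learned_median = (
    toy_preprocessor.named_transformers_["numeric"].named_steps["imputer"].statistics_
)
# Categories found by each encoder; their order defines the output columns / integers
onehot_categories = (
    toy_preprocessor.named_transformers_["categoric"].named_steps["onehot"].categories_
)
ordinal_categories = (
    toy_preprocessor.named_transformers_["ordinal"].named_steps["encoder"].categories_
)

print(learned_median, onehot_categories, ordinal_categories)
```

Because OrdinalEncoder assigns integers by sorted category order, the classes 1, 2 and 3 in this sketch are mapped to 0, 1 and 2.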

Example of the data preprocessing pipeline

Train / Test split

X_features = titanic_features.drop("survived", axis="columns")
Y_target = titanic_features["survived"]

# 80% train, 20% test (no random_state is set, so the split changes between runs)
x_train, x_test, y_train, y_test = train_test_split(
    X_features, Y_target, test_size=0.2, stratify=Y_target
)
x_train.shape, y_train.shape
((891, 7), (891,))
x_test.shape, y_test.shape
((223, 7), (223,))

Preprocessing pipeline

# fit() returns the fitted preprocessor itself, not transformed data
preprocessor.fit(x_train)
feature_names = preprocessor.get_feature_names_out()

# transform x_train with the fitted preprocessor and wrap the result in a DataFrame
x_train_transformed = preprocessor.transform(x_train)
x_train_transformed = pd.DataFrame(x_train_transformed, columns=feature_names)
x_train_transformed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   numeric__age                 891 non-null    float64
 1   numeric__fare                891 non-null    float64
 2   numeric__sibsp               891 non-null    float64
 3   numeric__parch               891 non-null    float64
 4   categoric__sex_female        891 non-null    float64
 5   categoric__sex_male          891 non-null    float64
 6   categoric__embarked_C        891 non-null    float64
 7   categoric__embarked_Q        891 non-null    float64
 8   categoric__embarked_S        891 non-null    float64
 9   categoric ordinales__pclass  891 non-null    float64
dtypes: float64(10)
memory usage: 69.7 KB
x_train_transformed.head()

   numeric__age  numeric__fare  numeric__sibsp  numeric__parch  categoric__sex_female
0           2.0        31.2750             4.0             2.0                    1.0
1          28.0        22.3583             0.0             2.0                    1.0
2          10.0        46.9000             5.0             2.0                    1.0
3          28.0       227.5250             0.0             0.0                    0.0
4          18.0         8.3000             0.0             0.0                    0.0

   categoric__sex_male  categoric__embarked_C  categoric__embarked_Q  categoric__embarked_S  categoric ordinales__pclass
0                  0.0                    0.0                    0.0                    1.0                          2.0
1                  0.0                    1.0                    0.0                    0.0                          2.0
2                  0.0                    0.0                    0.0                    1.0                          2.0
3                  1.0                    1.0                    0.0                    0.0                          0.0
4                  1.0                    0.0                    0.0                    1.0                          2.0
x_train.head()

      pclass     sex   age  sibsp  parch      fare embarked
624        3  female   2.0      4      2   31.2750        S
1123       3  female   NaN      0      2   22.3583        C
828        3  female  10.0      5      2   46.9000        S
237        1    male   NaN      0      0  227.5250        C
619        3    male  18.0      0      0    8.3000        S

💡 Recommendations and Ideas

  1. Handling Missing Data:

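    • Recommendation: Evaluate the impact of different imputation strategies on model performance. Consider using advanced imputation techniques such as KNNImputer or IterativeImputer (the latter is still experimental in scikit-learn and must be enabled by importing enable_iterative_imputer from sklearn.experimental).
    • Rationale: The choice matters most for age, which is missing in 263 of the 1309 original rows. Median imputation assigns the same value to all of them, whereas advanced techniques estimate each missing value from the other features and may provide better estimates.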
  2. Feature Scaling:

    • Recommendation: Add a scaling step to the numeric pipeline using StandardScaler or MinMaxScaler.
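    • Rationale: Scaling puts numeric features on comparable ranges, which can improve distance-based and gradient-based algorithms (for example k-nearest neighbors, support vector machines, or regularized linear models). Tree-based models are largely insensitive to it.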
  3. Feature Engineering:

    • Recommendation: Explore feature engineering techniques to create new features from the existing ones. For example, combining sibsp and parch to create a family_size feature.
    • Rationale: Feature engineering can help capture important patterns in the data that are not immediately apparent from the raw features.
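
The three recommendations above could be combined roughly as follows. This is only a sketch under stated assumptions: the toy data, the helper add_family_size, the names numeric_pipe_sketch and feature_pipe_sketch, the choice of n_neighbors=2, and the definition family_size = sibsp + parch + 1 (counting the passenger) are all hypothetical and were not evaluated in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def add_family_size(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with family_size = sibsp + parch + 1 (assumed definition)."""
    result = frame.copy()
    result["family_size"] = result["sibsp"] + result["parch"] + 1
    return result


# Hypothetical toy data (not the Titanic dataset)
toy_df = pd.DataFrame(
    {
        "age": [22.0, np.nan, 40.0, 30.0],
        "fare": [7.25, 71.28, 8.05, np.nan],
        "sibsp": [1, 1, 0, 0],
        "parch": [0, 0, 0, 2],
    }
)
numeric_cols = ["age", "fare", "sibsp", "parch", "family_size"]

# Recommendations 1 and 2: KNN-based imputation followed by standardization
numeric_pipe_sketch = Pipeline(
    steps=[
        ("imputer", KNNImputer(n_neighbors=2)),
        ("scaler", StandardScaler()),
    ]
)

# Recommendation 3: derive family_size before the column-wise preprocessing
feature_pipe_sketch = Pipeline(
    steps=[
        ("family_size", FunctionTransformer(add_family_size)),
        (
            "preprocessor",
            ColumnTransformer([("numeric", numeric_pipe_sketch, numeric_cols)]),
        ),
    ]
)

toy_scaled = feature_pipe_sketch.fit_transform(toy_df)
print(toy_scaled.shape)
```

Keeping the derived feature inside the pipeline means the same transformation is applied to the train and test sets, and the imputer and scaler are fitted on training data only.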

📖 References