By Jose R. Zapata
Last updated: 21/Nov/2024
Description:
Data overview and exploration to check the data types and fix any issues with them.
This is in order to do a correct data analysis and visualization of the data.
📚 Import libraries
# base libraries for data science
from pathlib import Path
import numpy as np
import pandas as pd
import pyarrow as pa
💾 Load data
# data directory path
DATA_DIR = Path.cwd().resolve().parents[1] / "data"
titanic_df = pd.read_csv(DATA_DIR / "01_raw/titanic_raw.csv")
📊 Data description
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1309 non-null int64
1 pclass 1309 non-null int64
2 survived 1309 non-null int64
3 name 1309 non-null object
4 sex 1309 non-null object
5 age 1309 non-null object
6 sibsp 1309 non-null int64
7 parch 1309 non-null int64
8 ticket 1309 non-null object
9 fare 1309 non-null object
10 cabin 1309 non-null object
11 embarked 1309 non-null object
12 home.dest 1309 non-null object
dtypes: int64(5), object(8)
memory usage: 133.1+ KB
titanic_df.sample(10)
index | Unnamed: 0 | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14 | 14 | 1 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80 | 0 | 0 | 27042 | 30 | A23 | S | Hessle, Yorks |
626 | 626 | 3 | 0 | Andersson, Miss. Ida Augusta Margareta | female | 38 | 4 | 2 | 347091 | 7.775 | ? | S | Vadsbro, Sweden Ministee, MI |
1083 | 1083 | 3 | 0 | Olsen, Mr. Henry Margido | male | 28 | 0 | 0 | C 4001 | 22.525 | ? | S | ? |
360 | 360 | 2 | 1 | Caldwell, Mr. Albert Francis | male | 26 | 1 | 1 | 248738 | 29 | ? | S | Bangkok, Thailand / Roseville, IL |
1252 | 1252 | 3 | 0 | Torber, Mr. Ernst William | male | 44 | 0 | 0 | 364511 | 8.05 | ? | S | ? |
679 | 679 | 3 | 0 | Boulos, Miss. Nourelain | female | 9 | 1 | 1 | 2678 | 15.2458 | ? | C | Syria Kent, ON |
469 | 469 | 2 | 1 | Keane, Miss. Nora A | female | ? | 0 | 0 | 226593 | 12.35 | E101 | Q | Harrisburg, PA |
1007 | 1007 | 3 | 1 | McGowan, Miss. Anna 'Annie' | female | 15 | 0 | 0 | 330923 | 8.0292 | ? | Q | ? |
187 | 187 | 1 | 1 | Lines, Miss. Mary Conover | female | 16 | 0 | 1 | PC 17592 | 39.4 | D28 | S | Paris, France |
1113 | 1113 | 3 | 0 | Peacock, Mrs. Benjamin (Edith Nile) | female | 26 | 0 | 2 | SOTON/O.Q. 3101315 | 13.775 | ? | S | ? |
Null values
In this dataset the null values are represented by the string '?', so we need to replace them with np.nan
titanic_df = titanic_df.replace("?", np.nan)
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1309 non-null int64
1 pclass 1309 non-null int64
2 survived 1309 non-null int64
3 name 1309 non-null object
4 sex 1309 non-null object
5 age 1046 non-null object
6 sibsp 1309 non-null int64
7 parch 1309 non-null int64
8 ticket 1309 non-null object
9 fare 1308 non-null object
10 cabin 295 non-null object
11 embarked 1307 non-null object
12 home.dest 745 non-null object
dtypes: int64(5), object(8)
memory usage: 133.1+ KB
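As an alternative to replacing the placeholder after loading, the '?' marker can also be declared while reading the file. The sketch below is only an illustration: the three-row CSV and its passenger names are made up, and only the na_values parameter of pd.read_csv is the point being shown.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt in the same shape as the raw file.
raw_csv = io.StringIO(
    "name,age,fare,cabin\n"
    "Passenger A,29,211.3375,B5\n"
    "Passenger B,?,7.775,?\n"
    "Passenger C,0.9167,?,C22\n"
)

# na_values tells the parser to treat "?" as a missing value while reading.
sample_df = pd.read_csv(raw_csv, na_values="?")

# age and fare are parsed as floats because "?" no longer forces object dtype.
print(sample_df.dtypes)
print(sample_df.isna().sum())
```

With this approach the numeric columns arrive as floats directly, at the cost of hiding the fact that the raw file used a custom placeholder.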
Remove columns
- We will remove the columns that have too many null values and would need too much effort to fill with the correct value.
- The column ticket is a string that is just an identifier, so we will remove it.
So we will remove the columns cabin, ticket and home.dest
titanic_df = titanic_df.drop(columns=["cabin", "home.dest", "ticket"])
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1309 non-null int64
1 pclass 1309 non-null int64
2 survived 1309 non-null int64
3 name 1309 non-null object
4 sex 1309 non-null object
5 age 1046 non-null object
6 sibsp 1309 non-null int64
7 parch 1309 non-null int64
8 fare 1308 non-null object
9 embarked 1307 non-null object
dtypes: int64(5), object(5)
memory usage: 102.4+ KB
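One way to make "too many null values" concrete is to look at the fraction of missing values per column. The sketch below reuses the non-null counts reported by info() after the '?' replacement. The 0.4 cut-off is a hypothetical threshold chosen for illustration, not a rule taken from this analysis.

```python
import pandas as pd

# Non-null counts copied from the info() output (1309 rows in total).
n_rows = 1309
non_null = pd.Series(
    {"age": 1046, "fare": 1308, "cabin": 295, "embarked": 1307, "home.dest": 745}
)

# Fraction of missing values per column, highest first.
missing_fraction = (1 - non_null / n_rows).sort_values(ascending=False)
print(missing_fraction.round(3))

# Hypothetical cut-off, used here only to illustrate the idea.
threshold = 0.4
drop_candidates = missing_fraction[missing_fraction > threshold].index.tolist()
print(drop_candidates)
```

On a DataFrame the same fractions can be obtained with titanic_df.isna().mean().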
Categorical variables
Ordinal
pclass: A proxy for socio-economic status (SES)
- 1 = Upper
- 2 = Middle
- 3 = Lower
Nominal
sex: Gender of the passenger
- female
- male
embarked: Port of embarkation
- C = Cherbourg
- Q = Queenstown
- S = Southampton
Numerical variables
Discrete
sibsp: The dataset defines family relations in this way:
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
- Values of sibsp: 0, 1, 2, 3, 4, 5, 8
parch: The dataset defines family relations in this way:
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch = 0 for them.
- Values of parch: 0, 1, 2, 3, 4, 5, 6
Continuous
fare: Passenger fare
age: Age of the passenger. Some values are fractional, so the column is converted to float rather than int.
Boolean variables
survived: 0 = No, 1 = Yes
String variables
name: Name of the passenger with the format Last name, Title. First name
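To illustrate the name format, the sketch below splits three names taken from the sample shown earlier into their parts. The regular expression and the column names last_name, title and first_name are assumptions made for this example, and the pattern assumes every name follows the format exactly.

```python
import pandas as pd

# Names copied from the sample rows shown earlier.
names = pd.Series(
    [
        "Barkworth, Mr. Algernon Henry Wilson",
        "Andersson, Miss. Ida Augusta Margareta",
        "Peacock, Mrs. Benjamin (Edith Nile)",
    ]
)

# Last name up to the comma, title up to the first period, the rest is the first name.
pattern = r"^(?P<last_name>[^,]+),\s*(?P<title>[^.]+)\.\s*(?P<first_name>.*)$"
name_parts = names.str.extract(pattern)
print(name_parts)
```

Names that deviate from the format would produce missing values in name_parts, so checking name_parts.isna().any() is a reasonable first validation.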
Convert data types
Categorical variables
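# nominal and ordinal columns are stored as pandas categoricals
cols_categoric = ["pclass", "sex", "embarked"]
titanic_df[cols_categoric] = titanic_df[cols_categoric].astype("category")
# pclass is ordinal: order the categories from lower (3) to upper (1) class
titanic_df["pclass"] = pd.Categorical(
    titanic_df["pclass"], categories=[3, 2, 1], ordered=True
)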
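The effect of declaring pclass as an ordered categorical can be seen on a small, made-up series. With categories=[3, 2, 1], class 3 is the lowest level and class 1 the highest, so comparisons and min/max follow the category order instead of the numeric order.

```python
import pandas as pd

# Made-up values, same category order as in the conversion above.
pclass = pd.Series(
    pd.Categorical([1, 3, 2, 3], categories=[3, 2, 1], ordered=True)
)

# Comparisons use the category order: 2 and 1 rank above 3.
above_third = (pclass > 3).tolist()
print(above_third)

# min() returns the lowest category (3) and max() the highest (1).
print(pclass.min(), pclass.max())
```

This matters later for sorting and for plots, where the classes appear in the declared order.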
Numerical variables
cols_numeric_float = ["age", "fare"]
titanic_df[cols_numeric_float] = titanic_df[cols_numeric_float].astype("float")
cols_numeric_int = ["sibsp", "parch"]
titanic_df[cols_numeric_int] = titanic_df[cols_numeric_int].astype("int8")
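Two optional checks can make these casts safer. The sketch below uses made-up values. The string "unknown" is a hypothetical stray entry that is not present in this dataset, and pd.to_numeric with errors="coerce" is shown as an alternative to astype("float"), not as the method used here.

```python
import numpy as np
import pandas as pd

# Made-up raw values; "unknown" stands for any string that is not a number.
age_raw = pd.Series(["29", np.nan, "0.9167", "unknown"], dtype="object")

# errors="coerce" turns every value that cannot be parsed into NaN.
age = pd.to_numeric(age_raw, errors="coerce")
print(age)

# Before downcasting, confirm that the values fit in the int8 range (-128 to 127).
sibsp = pd.Series([0, 1, 8])
int8_info = np.iinfo(np.int8)
fits_int8 = bool(sibsp.between(int8_info.min, int8_info.max).all())
sibsp_small = sibsp.astype("int8") if fits_int8 else sibsp
print(sibsp_small.dtype)
```

Coercion silently discards unparseable entries, so it is worth counting the new NaN values before relying on it.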
Boolean variables
cols_boolean = ["survived"]
titanic_df[cols_boolean] = titanic_df[cols_boolean].astype("bool")
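Because astype("bool") maps every non-zero value to True, it can be useful to confirm that the column only contains 0 and 1 before converting it. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up values with the same coding as survived (0 = No, 1 = Yes).
survived = pd.Series([0, 1, 1, 0])

# Guard against unexpected codes that would silently become True.
is_binary = bool(survived.isin([0, 1]).all())
if not is_binary:
    raise ValueError("survived contains values other than 0 and 1")

survived_bool = survived.astype("bool")
print(survived_bool.tolist())
```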
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1309 non-null int64
1 pclass 1309 non-null category
2 survived 1309 non-null bool
3 name 1309 non-null object
4 sex 1309 non-null category
5 age 1046 non-null float64
6 sibsp 1309 non-null int8
7 parch 1309 non-null int8
8 fare 1308 non-null float64
9 embarked 1307 non-null category
dtypes: bool(1), category(3), float64(2), int64(1), int8(2), object(1)
memory usage: 49.1+ KB
schema = pa.Table.from_pandas(titanic_df).schema
💾 Save dataframe with data types
titanic_df.to_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", index=False, schema=schema
)
📊 Analysis of Results
Some columns have been removed and the data types have been converted so that they map to the correct pyarrow data types.
Null values have been replaced with np.nan in order to do a correct analysis and visualization of the data.
💡 Proposals and Ideas
- Use other tools and compare which ones work best to describe and explore the data and do data analysis.
- Use pyarrow as dtype backend.
- Use pd.NA as null value, but ydata-profiling is not working well with the pyarrow backend.