Data overview and exploration to check the data types and fix any issues with them.
This is needed in order to analyze and visualize the data correctly.
📚 Import libraries
# base libraries for data science
from pathlib import Path
import numpy as np
import pandas as pd
import pyarrow as pa
💾 Load data
# data directory path
DATA_DIR = Path.cwd().resolve().parents[1] / "data"
titanic_df = pd.read_csv(DATA_DIR / "01_raw/titanic_raw.csv")
📊 Data description
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1309 non-null int64
1 pclass 1309 non-null int64
2 survived 1309 non-null int64
3 name 1309 non-null object
4 sex 1309 non-null object
5 age 1309 non-null object
6 sibsp 1309 non-null int64
7 parch 1309 non-null int64
8 ticket 1309 non-null object
9 fare 1309 non-null object
10 cabin 1309 non-null object
11 embarked 1309 non-null object
12 home.dest 1309 non-null object
dtypes: int64(5), object(8)
memory usage: 133.1+ KB
| Unnamed: 0 | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest |
14 | 14 | 1 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80 | 0 | 0 | 27042 | 30 | A23 | S | Hessle, Yorks |
626 | 626 | 3 | 0 | Andersson, Miss. Ida Augusta Margareta | female | 38 | 4 | 2 | 347091 | 7.775 | ? | S | Vadsbro, Sweden Ministee, MI |
1083 | 1083 | 3 | 0 | Olsen, Mr. Henry Margido | male | 28 | 0 | 0 | C 4001 | 22.525 | ? | S | ? |
360 | 360 | 2 | 1 | Caldwell, Mr. Albert Francis | male | 26 | 1 | 1 | 248738 | 29 | ? | S | Bangkok, Thailand / Roseville, IL |
1252 | 1252 | 3 | 0 | Torber, Mr. Ernst William | male | 44 | 0 | 0 | 364511 | 8.05 | ? | S | ? |
679 | 679 | 3 | 0 | Boulos, Miss. Nourelain | female | 9 | 1 | 1 | 2678 | 15.2458 | ? | C | Syria Kent, ON |
469 | 469 | 2 | 1 | Keane, Miss. Nora A | female | ? | 0 | 0 | 226593 | 12.35 | E101 | Q | Harrisburg, PA |
1007 | 1007 | 3 | 1 | McGowan, Miss. Anna 'Annie' | female | 15 | 0 | 0 | 330923 | 8.0292 | ? | Q | ? |
187 | 187 | 1 | 1 | Lines, Miss. Mary Conover | female | 16 | 0 | 1 | PC 17592 | 39.4 | D28 | S | Paris, France |
1113 | 1113 | 3 | 0 | Peacock, Mrs. Benjamin (Edith Nile) | female | 26 | 0 | 2 | SOTON/O.Q. 3101315 | 13.775 | ? | S | ? |
Null values
In this dataset the null values are represented by the string ‘?’, so we need to replace them with np.nan.
titanic_df = titanic_df.replace("?", np.nan)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1309 non-null int64
1 pclass 1309 non-null int64
2 survived 1309 non-null int64
3 name 1309 non-null object
4 sex 1309 non-null object
5 age 1046 non-null object
6 sibsp 1309 non-null int64
7 parch 1309 non-null int64
8 ticket 1309 non-null object
9 fare 1308 non-null object
10 cabin 295 non-null object
11 embarked 1307 non-null object
12 home.dest 745 non-null object
dtypes: int64(5), object(8)
memory usage: 133.1+ KB
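The same placeholder handling can also happen while the file is read. The following is a minimal sketch, assuming a small in-memory CSV (the `raw_csv` text and `sample_df` name are hypothetical, not part of the notebook) and using the `na_values` parameter of `pd.read_csv`:

```python
import io

import pandas as pd

# Hypothetical three-row extract that mimics the "?" placeholders of the raw file.
raw_csv = io.StringIO(
    "age,fare,cabin\n"
    "80,30,A23\n"
    "?,12.35,E101\n"
    "28,22.525,?\n"
)

# na_values tells the parser to treat "?" as a missing value while reading.
sample_df = pd.read_csv(raw_csv, na_values="?")

# Count of missing values per column, then the inferred dtypes.
print(sample_df.isna().sum())
print(sample_df.dtypes)
```

A side effect of this approach is that numeric columns such as age are parsed as numbers directly instead of staying as object columns.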
Remove columns
- We will remove the columns that have too many null values and would need too much effort to fill with correct values.
- The column ticket is a string that is just an identifier, so we will remove it too.
So we will remove the columns cabin, ticket and home.dest.
titanic_df = titanic_df.drop(columns=["cabin", "home.dest", "ticket"])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1309 non-null int64
1 pclass 1309 non-null int64
2 survived 1309 non-null int64
3 name 1309 non-null object
4 sex 1309 non-null object
5 age 1046 non-null object
6 sibsp 1309 non-null int64
7 parch 1309 non-null int64
8 fare 1308 non-null object
9 embarked 1307 non-null object
dtypes: int64(5), object(5)
memory usage: 102.4+ KB
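One way to make "too many null values" concrete is to compute the share of missing values per column. In the output above, cabin is missing in roughly 77 % of the rows and home.dest in roughly 43 %. The sketch below uses a hypothetical mini-frame (`demo_df`) and an assumed cut-off (`THRESHOLD`); neither comes from the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; in the notebook this would be titanic_df.
demo_df = pd.DataFrame(
    {
        "age": [80.0, np.nan, 28.0, 26.0],
        "cabin": ["A23", np.nan, np.nan, np.nan],
        "embarked": ["S", "Q", "S", "S"],
    }
)

# Share of missing values per column, highest first.
null_share = demo_df.isna().mean().sort_values(ascending=False)
print(null_share)

# Assumed cut-off of 40 %; choose whatever threshold suits the analysis.
THRESHOLD = 0.4
sparse_cols = null_share[null_share > THRESHOLD].index.tolist()
print(sparse_cols)
```

The resulting list could then be passed to `drop(columns=...)`, together with any identifier columns chosen by hand.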
Categorical variables
- pclass: A proxy for socio-economic status (SES)
  - 1 = Upper
  - 2 = Middle
  - 3 = Lower
- sex: Gender of the passenger
  - female
  - male
- embarked: Port of embarkation
  - C = Cherbourg
  - Q = Queenstown
  - S = Southampton
Numerical variables
- sibsp: The dataset defines family relations in this way:
  - Sibling = brother, sister, stepbrother, stepsister
  - Spouse = husband, wife (mistresses and fiancés were ignored)
  - sibsp = 0, 1, 2, 3, 4, 5, 8
- parch: The dataset defines family relations in this way:
  - Parent = mother, father
  - Child = daughter, son, stepdaughter, stepson
  - Some children travelled only with a nanny, therefore parch = 0 for them.
  - parch = 0, 1, 2, 3, 4, 5, 6
- fare: Passenger fare
- age: Age of the passenger. Some values are fractional and some are missing, so the column is converted to float rather than int.
Boolean variables
- survived: 0 = No, 1 = Yes
String variables
- name: Name of the passenger with the format Last name, Title. First name
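The value sets listed above can be checked against the data before any conversion. This is a minimal sketch on hypothetical rows (`demo_df` stands in for titanic_df), collecting the sorted distinct values of each column:

```python
import pandas as pd

# Hypothetical rows standing in for titanic_df.
demo_df = pd.DataFrame(
    {
        "pclass": [1, 3, 3, 2],
        "sibsp": [0, 4, 0, 1],
        "embarked": ["S", "S", "C", "Q"],
    }
)

# Sorted distinct values per column, ignoring missing entries.
distinct = {
    col: sorted(demo_df[col].dropna().unique().tolist())
    for col in demo_df.columns
}

for col, values in distinct.items():
    print(col, values)
```

Any value outside the documented sets would show up here and could be handled before the dtype conversion.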
Convert data types
Categorical variables
cols_categoric = ["pclass", "sex", "embarked"]
titanic_df[cols_categoric] = titanic_df[cols_categoric].astype("category")
titanic_df["pclass"] = pd.Categorical(
    titanic_df["pclass"], categories=[3, 2, 1], ordered=True
)
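Declaring pclass as an ordered categorical with categories [3, 2, 1] means that third class is treated as the lowest value and first class as the highest. A small sketch on a hypothetical series (`pclass` here is a stand-alone example, not the notebook's column) shows what the ordering enables:

```python
import pandas as pd

# Hypothetical series with the same category order as in the notebook: 3 < 2 < 1.
pclass = pd.Series(
    pd.Categorical([1, 3, 2, 3], categories=[3, 2, 1], ordered=True)
)

# The maximum follows the category order, so first class ranks highest.
print(pclass.max())  # 1

# Element-wise comparison against a category: which passengers rank above third class?
above_third = pclass > 3
print(above_third.tolist())  # [True, False, True, False]

# Sorting also follows the category order, not the numeric order.
print(pclass.sort_values().tolist())  # [3, 3, 2, 1]
```

Without `ordered=True`, comparisons such as `pclass > 3` raise a `TypeError`.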
Numerical variables
cols_numeric_float = ["age", "fare"]
titanic_df[cols_numeric_float] = titanic_df[cols_numeric_float].astype("float")
cols_numeric_int = ["sibsp", "parch"]
titanic_df[cols_numeric_int] = titanic_df[cols_numeric_int].astype("int8")
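`astype("float")` works here because every remaining non-null entry is a numeric string. If a column might still hold unexpected tokens, `pd.to_numeric` with `errors="coerce"` is a more forgiving alternative. The sketch below uses a hypothetical series (`age_raw`) with an invented stray value:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one unexpected token ("unknown") left after cleaning.
age_raw = pd.Series(["80", "22.5", np.nan, "unknown"], dtype="object")

# astype("float") would raise a ValueError on "unknown";
# to_numeric with errors="coerce" turns unparseable entries into NaN instead.
age = pd.to_numeric(age_raw, errors="coerce")

print(age.dtype)
print(age.isna().sum())
```

The missing values are also the reason age and fare stay float: NumPy integer dtypes such as int8 cannot represent NaN, which is why only the complete columns sibsp and parch are converted to int8.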
Boolean variables
cols_boolean = ["survived"]
titanic_df[cols_boolean] = titanic_df[cols_boolean].astype("bool")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1309 non-null int64
1 pclass 1309 non-null category
2 survived 1309 non-null bool
3 name 1309 non-null object
4 sex 1309 non-null category
5 age 1046 non-null float64
6 sibsp 1309 non-null int8
7 parch 1309 non-null int8
8 fare 1308 non-null float64
9 embarked 1307 non-null category
dtypes: bool(1), category(3), float64(2), int64(1), int8(2), object(1)
memory usage: 49.1+ KB
schema = pa.Table.from_pandas(titanic_df).schema
💾 Save dataframe with data types
titanic_df.to_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet", index=False, schema=schema
)
📊 Analysis of Results
Some columns have been removed, the data types have been fixed so that they map to the correct pyarrow data types,
and null values have been replaced with np.nan,
so that the data can be analyzed and visualized correctly.
💡 Proposals and Ideas
Use other tools and compare which one is best suited to describe, explore and analyze the data.
Use pyarrow as dtype backend and pd.NA
as null value, but yprofiling is not working well with the pyarrow backend.