Problem Statement
In this notebook we are dealing with one of the most popular classification problems - “predicting the survival of Titanic passengers based on available characteristics”.
This problem is hosted as a Kaggle competition and can be accessed at https://www.kaggle.com/c/titanic.
In short, the goal is to create a binary classification model that predicts whether a person survived the Titanic shipwreck, based on features like their age, sex, financial situation, etc.
Importing Dataset
The Titanic dataset is freely available at https://www.kaggle.com/c/titanic/data.
from kaggle.api.kaggle_api_extended import KaggleApi
# Authenticate using the token stored in ~/.kaggle/kaggle.json
api = KaggleApi()
api.authenticate()
## Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/dipin/.kaggle/kaggle.json'
# Download the competition files (produces titanic.zip)
api.competition_download_files('titanic')
# Unzip titanic.zip
import zipfile
with zipfile.ZipFile("titanic.zip", "r") as zip_ref:
    zip_ref.extractall("titanic_data")
# Loading Dataset
#! pip3 install pandas
import pandas as pd
pd.set_option('display.expand_frame_repr', False)
df_train = pd.read_csv('titanic_data/train.csv')
df_test = pd.read_csv('titanic_data/test.csv')
print(df_train.head())
print(df_test.head())
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
## 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
## 2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
## 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
## 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
## PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
## 1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
## 2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
## 3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
## 4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
Exploratory Data Analysis
Here, we look into the training data more deeply, starting with summary statistics, missing-value handling, correlation checks, and so on.
Data Summary
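The summary below comes from pandas’ describe(); the call itself is not shown in the excerpt:
print(df_train.describe())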
## PassengerId Survived Pclass Age SibSp Parch Fare
## count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
## mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
## std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
## min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
## 25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
## 50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
## 75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
## max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
The range of Fare is huge, and we may assume a fare of 0.0 was for the ship’s staff.
Visualization - Relationship between the predictors and the predicted variable.
import matplotlib.pyplot as plt
import seaborn as sns
# Passenger Class vs Survival
sns.countplot(x='Pclass',hue='Survived', data=df_train).set_title("Passenger Class vs Survival")
plt.show()
# Passenger Sex vs Survival
sns.countplot(x='Sex',hue='Survived', data=df_train).set_title("Passenger Sex vs Survival")
plt.show()
Some observations from the above visualizations:
- First-class passengers had the greatest chance of survival, while third-class passengers had the least.
- Comparing passenger sex, female passengers had a greater chance of survival than male passengers.
Data Pre-processing
Visualization of Missing Values.
# Visualization - Missing Values
# PassengerIds from 892 onward are part of the test data; combine the splits so preprocessing applies to all rows.
df_train = pd.concat([df_train, df_test])
df_train.index = df_train.PassengerId
# Columns with NAs
print(df_train.columns[df_train.isna().any()].tolist())
## ['Survived', 'Age', 'Fare', 'Cabin', 'Embarked']
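The plot referred to below is not shown in the excerpt; a minimal sketch using missingno (the library hinted at by the install comment):
#! pip3 install missingno
import missingno as msno
# Matrix plot: white gaps mark missing values in each column
msno.matrix(df_train)
plt.show()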
From the plot, we can see that the columns Age and Cabin contain a lot of NAs. Age seems important, so let’s look into suitable imputation steps.
Let’s also check which rows are missing Embarked and fill them according to what we find.
# Filling Fare with median
df_train['Fare'] = df_train['Fare'].fillna(df_train['Fare'].median())
# Rows with missing values for Embarked
print(df_train.loc[~df_train['Embarked'].isin(['S','Q','C'])])
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
## PassengerId
## 62 62 1.0 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN
## 830 830 1.0 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 NaN
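The fill code itself is not shown in the excerpt; a minimal sketch imputing the most applicable value, the mode ('S', the most common port):
# Fill those two rows with the most common embarkation port
df_train['Embarked'] = df_train['Embarked'].fillna(df_train['Embarked'].mode()[0])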
Let’s check the influence of Pclass and Sex on Age with box plots.
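The plotting code is not included in the excerpt; a minimal sketch:
# Age distribution per ticket class, split by sex
sns.boxplot(x='Pclass', y='Age', hue='Sex', data=df_train)
plt.title("Age by Pclass and Sex")
plt.show()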
From the figure, a general inference is that female passengers tend to be younger than male passengers, and age is closely related to ticket class. Median-based imputation seems a good fit here. We will split the data into six (Pclass, Sex) groups, find the median of each group, and use those values to replace the NAs in Age.
# Fill NAs with median of each group
df_train['Age'] = df_train.groupby(['Pclass','Sex'])['Age'].transform(lambda x: x.fillna(x.median()))
print(df_train[df_train.Age.isna()].index)
print(df_train.head())
## Int64Index([], dtype='int64', name='PassengerId')
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
## PassengerId
## 1 1 0.0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
## 2 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
## 3 3 1.0 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
## 4 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
## 5 5 0.0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
Feature Engineering
Columns SibSp and Parch give information about the family members on board. We can combine these to create a new column, family_mem.
# New column family_mem = sum of siblings+parents/children+self
df_train['family_mem'] = df_train.SibSp+df_train.Parch+1
df_train.head()
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked family_mem
## PassengerId
## 1 1 0.0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 2
## 2 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 2
## 3 3 1.0 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 1
## 4 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 2
## 5 5 0.0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 1
If you look at the Name column carefully, honorifics like Mr. and Col. appear alongside passenger names. This information can be extracted into a separate column; let’s see how the honorific matters for survival.
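The extraction step is not shown in the excerpt; a minimal sketch, assuming the honorific is the token between the comma and the first period in Name (the leading space it keeps is consistent with dummy column names like Hon_ Mr seen later):
# Extract the honorific: the token between the comma and the first period
df_train['Hon'] = df_train['Name'].apply(lambda n: n.split(',')[1].split('.')[0])
df_train.head()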
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked family_mem Hon
## PassengerId
## 1 1 0.0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 2 Mr
## 2 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 2 Mrs
## 3 3 1.0 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 1 Miss
## 4 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 2 Mrs
## 5 5 0.0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 1 Mr
sns.countplot(x='Hon',hue='Survived', data=df_train).set_title("Passenger Honorifics vs Survival")
plt.show()
print(df_train.columns.tolist())
## ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'family_mem', 'Hon']
Numeric to Categorical/Factors
Age in our dataset is effectively ordinal but is treated as numeric at the moment. We will look at how age is distributed (sketched below) and convert ages into categories.
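The distribution plot is not included in the excerpt; a minimal sketch:
# Histogram of Age to inspect the distribution before binning
df_train['Age'].hist(bins=16)
plt.xlabel('Age')
plt.show()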
It seems the majority of passengers belong to the 20-30 age group. Let’s divide the age range (0-80) into eight ten-year categories, 0-10 through 71-80.
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
names = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60','61-70', '71-80']
df_train['AgeRange'] = pd.cut(df_train['Age'], bins, labels=names)
# Categorize Pclass and Hon as well
df_train.Pclass = pd.Categorical(df_train.Pclass)
df_train.Hon = pd.Categorical(df_train.Hon)
print (df_train.dtypes)
## PassengerId int64
## Survived float64
## Pclass category
## Name object
## Sex object
## Age float64
## SibSp int64
## Parch int64
## Ticket object
## Fare float64
## Cabin object
## Embarked object
## family_mem int64
## Hon category
## AgeRange category
## dtype: object
Columns Name, Age, SibSp, Parch, Ticket, and Cabin can be dropped since they no longer add significant information. Sex is dropped as well, because Hon carries the same information more precisely.
# Drop columns that no longer add information; Sex is covered by Hon
df_train.drop(['Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Sex'], axis=1, inplace=True)
df_train.head(3)
df_train.head(3)
## PassengerId Survived Pclass Fare Embarked family_mem Hon AgeRange
## PassengerId
## 1 1 0.0 3 7.2500 S 2 Mr 21-30
## 2 2 1.0 1 71.2833 C 2 Mrs 31-40
## 3 3 1.0 3 7.9250 S 1 Miss 21-30
Train/Test Split
Here, we detach the test data (PassengerIds from 892 onward) into a separate dataframe. This step is required to avoid data leakage in later stages.
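The split code itself is not shown in the excerpt; a minimal sketch, assuming the combined frame built earlier:
# Detach the test rows (PassengerId >= 892) that were appended earlier
df_test = df_train[df_train['PassengerId'] >= 892].drop('Survived', axis=1)
df_train = df_train[df_train['PassengerId'] < 892]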
Scaling Numeric Variables
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_train[['Fare','family_mem']] = scaler.fit_transform(df_train[['Fare','family_mem']])
# Transform the test data with the scaler fitted on the training data,
# so no statistics leak from the test split.
df_test[['Fare','family_mem']] = scaler.transform(df_test[['Fare','family_mem']])
Encoding categorical variables
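The encoding code is not included in the excerpt; a minimal sketch using pd.get_dummies, assuming drop_first=True (consistent with Pclass_1, Embarked_C, and AgeRange_0-10 being absent from the output below):
# PassengerId is already the index, so drop the redundant column
df_train = df_train.drop('PassengerId', axis=1)
df_test = df_test.drop('PassengerId', axis=1)
# One-hot encode the categoricals; drop_first removes the redundant first level
cat_cols = ['Pclass', 'Embarked', 'Hon', 'AgeRange']
df_train = pd.get_dummies(df_train, columns=cat_cols, drop_first=True)
df_test = pd.get_dummies(df_test, columns=cat_cols, drop_first=True)
# Align test columns with train (rare honorifics may appear in only one split)
df_test = df_test.reindex(columns=df_train.columns.drop('Survived'), fill_value=0)
print(df_train.head())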
## Survived Fare family_mem Pclass_2 Pclass_3 Embarked_Q Embarked_S Hon_ Col Hon_ Don Hon_ Dona Hon_ Dr Hon_ Jonkheer Hon_ Lady Hon_ Major Hon_ Master Hon_ Miss Hon_ Mlle Hon_ Mme Hon_ Mr Hon_ Mrs Hon_ Ms Hon_ Rev Hon_ Sir AgeRange_11-20 AgeRange_21-30 AgeRange_31-40 AgeRange_41-50 AgeRange_51-60 AgeRange_61-70 AgeRange_71-80
## PassengerId
## 1 0.0 -0.502445 0.059160 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0
## 2 1.0 0.786845 0.059160 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
## 3 1.0 -0.488854 -0.560975 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## 4 1.0 0.420730 0.059160 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0
## 5 0.0 -0.486337 -0.560975 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0
Modelling
Simple Logistic Regression
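X and Y are used below but never defined in the excerpt; a minimal sketch consistent with the output that follows (the target is cast to int, since recent xgboost versions are strict about integer class labels):
# Target and feature matrix
Y = df_train['Survived'].astype(int)
X = df_train.drop('Survived', axis=1)
print(df_train['Survived'].tail())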
## PassengerId
## 887 0.0
## 888 1.0
## 889 0.0
## 890 1.0
## 891 0.0
## Name: Survived, dtype: float64
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X, Y)
pred_log = clf.predict(df_test).astype(int)
#pd.DataFrame(pred_log).head()
pred_log_df = pd.DataFrame()
pred_log_df['PassengerId'] = df_test.index
pred_log_df['Survived'] = pred_log
pred_log_df.head()
## PassengerId Survived
## 0 892 0
## 1 893 1
## 2 894 0
## 3 895 0
## 4 896 1
Cross-Validation of Different Classification Models
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
import xgboost
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, Y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.82 (+/- 0.05)
clf = LogisticRegression()
scores = cross_val_score(clf, X, Y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.82 (+/- 0.04)
clf = RandomForestClassifier(max_depth=2, random_state=0)
scores = cross_val_score(clf, X, Y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.80 (+/- 0.06)
clf = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(clf, X, Y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.79 (+/- 0.06)
clf = GaussianNB()
scores = cross_val_score(clf, X, Y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.61 (+/- 0.25)
clf = xgboost.XGBClassifier()
scores = cross_val_score(clf, X, Y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.82 (+/- 0.06)
Linear SVM, logistic regression, and XGBoost all score around 0.82 in cross-validation; below we generate predictions with XGBoost and the SVM as well.
XGBoost Classification
clf = xgboost.XGBClassifier().fit(X, Y)
pred_xg = clf.predict(df_test).astype(int)
#pd.DataFrame(pred_xg).head()
pred_xg_df = pd.DataFrame()
pred_xg_df['PassengerId'] = df_test.index
pred_xg_df['Survived'] = pred_xg
pred_xg_df.head()
## PassengerId Survived
## 0 892 0
## 1 893 0
## 2 894 0
## 3 895 0
## 4 896 1
SVM Classifier
clf = svm.SVC(kernel='linear', C=1).fit(X, Y)
pred_svm = clf.predict(df_test).astype(int)
#pd.DataFrame(pred_svm).head()
pred_svm_df = pd.DataFrame()
pred_svm_df['PassengerId'] = df_test.index
pred_svm_df['Survived'] = pred_svm
pred_svm_df.head()
## PassengerId Survived
## 0 892 0
## 1 893 1
## 2 894 0
## 3 895 0
## 4 896 1