Problem Statement

In this notebook we deal with one of the most popular classification problems: "Predicting the survival of Titanic passengers based on available characteristics".

This problem is hosted as a Kaggle competition and can be accessed at https://www.kaggle.com/c/titanic.

In short, the goal is to create a binary classification model that predicts whether a passenger survived the Titanic shipwreck based on features such as their age, sex, and financial situation.

Importing Dataset

The Titanic dataset is freely available at https://www.kaggle.com/c/titanic/data.

#! pip3 install kaggle

# Kaggle API setup
from kaggle.api.kaggle_api_extended import KaggleApi
## Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/dipin/.kaggle/kaggle.json'
api = KaggleApi()
api.authenticate()

# Downloading titanic dataset
## Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/dipin/.kaggle/kaggle.json'
api.competition_download_files('titanic')

# Unzip titanic.zip
import zipfile
with zipfile.ZipFile("titanic.zip","r") as zip_ref:
    zip_ref.extractall("titanic_data")
    
# Loading Dataset
#! pip3 install pandas
import pandas as pd
pd.set_option('display.expand_frame_repr', False)

df_train = pd.read_csv('titanic_data/train.csv')
df_test = pd.read_csv('titanic_data/test.csv')
print(df_train.head())
##    PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
## 0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
## 1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
## 2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
## 3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
## 4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
print(df_test.head())
##    PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked
## 0          892       3                              Kelly, Mr. James    male  34.5      0      0   330911   7.8292   NaN        Q
## 1          893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   363272   7.0000   NaN        S
## 2          894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0   240276   9.6875   NaN        Q
## 3          895       3                              Wirz, Mr. Albert    male  27.0      0      0   315154   8.6625   NaN        S
## 4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875   NaN        S

Exploratory Data Analysis

Here, we look into the training data more deeply. We start with summary statistics, then move on to missing-value handling, the relationships between predictors and survival, and so on.

Data Summary

# Summary of Data
df_train.describe()
##        PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
## count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
## mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
## std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
## min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
## 25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
## 50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
## 75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
## max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

The range of Fare is huge, and we may assume a fare of 0.0 was for the ship's staff.
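To sanity-check that assumption, we can look at the zero-fare rows directly (a quick, hedged check on the training data loaded above):

# Passengers with a recorded fare of 0.0 - possibly crew or complimentary tickets
zero_fare = df_train[df_train['Fare'] == 0.0]
print(len(zero_fare), "passengers with Fare == 0.0")
print(zero_fare[['Name', 'Pclass', 'Survived']].head())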

Visualization - Relationship between predictors and predicted variable.

import matplotlib.pyplot as plt
import seaborn as sns

# Passenger Class vs Survival
sns.countplot(x='Pclass',hue='Survived', data=df_train).set_title("Passenger Class vs Survival")
plt.show()
# Passenger Sex vs Survival
sns.countplot(x='Sex', hue='Survived', data=df_train).set_title("Passenger Sex vs Survival")
plt.show()

Some observations from the above visualizations (quantified in the quick check below):

  • First-class passengers had the greatest chance of survival, while third-class passengers had the least.
  • Comparing passenger sex, female passengers had a greater chance of survival than male passengers.
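The survival rates behind these plots can be computed directly; a small sketch using the same dataframe:

# Survival rate by passenger class and by sex
print(df_train.groupby('Pclass')['Survived'].mean())
print(df_train.groupby('Sex')['Survived'].mean())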

Data Pre-processing

Visualization of Missing Values.

# Visualization - Missing Values

# Combine train and test for joint preprocessing; PassengerIds from 892 onwards are the test data.
df_train = pd.concat([df_train, df_test])
df_train.index = df_train.PassengerId

# Columns with NAs
print(df_train.columns[df_train.isna().any()].tolist())
## ['Survived', 'Age', 'Fare', 'Cabin', 'Embarked']

#! pip3 install missingno
import missingno as msno
# %matplotlib inline

msno.matrix(df_train)

From the plot, we can see that the columns Age and Cabin contain a lot of NAs. Age seems important, so let's look into suitable imputation steps for it.
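To put numbers on the missingness, a quick count per column (a small check on the combined frame):

# Count of missing values per column (combined train + test; Survived is missing for the test rows)
print(df_train.isna().sum().sort_values(ascending=False).head())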

We fill the missing Fare value with the median, then check which rows have a missing Embarked and fill them with the most plausible value.

# Filling Fare with median
df_train.Fare.fillna(df_train.Fare.median(), inplace=True)

# Rows with missing values for Embarked
print(df_train.loc[~df_train['Embarked'].isin(['S','Q','C'])])
##              PassengerId  Survived  Pclass                                       Name     Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked
## PassengerId
## 62                    62       1.0       1                        Icard, Miss. Amelie  female  38.0      0      0  113572  80.0   B28      NaN
## 830                  830       1.0       1  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0      0      0  113572  80.0   B28      NaN

# Fill those two rows with the most applicable value
df_train.Embarked.fillna('C', inplace=True)

Let’s check the influence of Pclass and Sex on Age by box plot visualizations.

# Boxplot of Age by Pclass and Sex categories
df_train.boxplot(column='Age', by=['Pclass','Sex'])
plt.show()

From the figure, a general inference is that female passengers are younger than males, and that Age is closely related to ticket class. Median-based imputation seems a good fit in this case: we split the data into the six Pclass-Sex groups, find the median of each group, and use these values to replace the NAs in Age.
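Before filling, we can inspect the medians that will be used for each group (a quick sketch):

# Median age per Pclass-Sex group, used for imputation below
print(df_train.groupby(['Pclass','Sex'])['Age'].median())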

# Fill NAs with median of each group
df_train['Age'] = df_train.groupby(['Pclass','Sex'])['Age'].apply(lambda x: x.fillna(x.median()))
print((df_train[df_train.Age.isna()].index))
## Int64Index([], dtype='int64', name='PassengerId')
df_train.head()
##              PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
## PassengerId                                                                                                                                                        
## 1                      1       0.0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
## 2                      2       1.0       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
## 3                      3       1.0       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
## 4                      4       1.0       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
## 5                      5       0.0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
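Note that, depending on the pandas version, the apply-based fill above may return a differently shaped index; a transform-based variant (a sketch, not the code that produced the output above) is equivalent and more version-robust:

# Alternative: fill Age NAs with the per-group median via transform
df_train['Age'] = df_train['Age'].fillna(
    df_train.groupby(['Pclass','Sex'])['Age'].transform('median'))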

Feature Engineering

The columns SibSp and Parch give information about family members onboard. We can unify these to create a new column, family_mem.

# New column family_mem = siblings/spouses + parents/children + self
df_train['family_mem'] = df_train.SibSp+df_train.Parch+1
df_train.head()
##              PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked  family_mem
## PassengerId                                                                                                                                                                    
## 1                      1       0.0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S           2
## 2                      2       1.0       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C           2
## 3                      3       1.0       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S           1
## 4                      4       1.0       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S           2
## 5                      5       0.0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S           1

If you look at the Name column carefully, honorifics like Mr. and Col. are given along with the passenger names. This honorific information can be extracted into a separate column; let's see how the designation relates to survival.

# Extract the honorific between the comma and the period (the captured group keeps a leading space)
df_train['Hon'] = (df_train['Name'].str.extract(r"\,(.[a-zA-Z]+)\."))
df_train.head()
##              PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked  family_mem    Hon
## PassengerId                                                                                                                                                                           
## 1                      1       0.0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S           2     Mr
## 2                      2       1.0       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C           2    Mrs
## 3                      3       1.0       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S           1   Miss
## 4                      4       1.0       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S           2    Mrs
## 5                      5       0.0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S           1     Mr
sns.countplot(x='Hon',hue='Survived', data=df_train).set_title("Passenger Honorifics vs Survival")
plt.show()

print(list(df_train))
## ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'family_mem', 'Hon']
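The countplot above is hard to read for the rarer titles, so a frequency table helps (a quick check):

# Frequency of each extracted honorific
print(df_train['Hon'].value_counts())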

Numeric to Categorical/Factors

Age in our dataset is currently a continuous numeric variable, but it is more useful here as an ordinal one. We will look at how age is distributed and convert it into age-range categories.

# Histogram of Age column
df_train['Age'].plot(kind='hist')
plt.show()

It seems the majority of passengers belong to the 20-30 age group. Let's divide the age range (0-80) into eight ten-year categories: 0-10, 11-20, ..., 71-80.

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
names = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60','61-70', '71-80']

df_train['AgeRange'] = pd.cut(df_train['Age'], bins, labels=names)

# Categorize Pclass and Hon as well
df_train.Pclass = pd.Categorical(df_train.Pclass)
df_train.Hon = pd.Categorical(df_train.Hon)

print (df_train.dtypes)
## PassengerId       int64
## Survived        float64
## Pclass         category
## Name             object
## Sex              object
## Age             float64
## SibSp             int64
## Parch             int64
## Ticket           object
## Fare            float64
## Cabin            object
## Embarked         object
## family_mem        int64
## Hon            category
## AgeRange       category
## dtype: object

The columns Name, Age, SibSp, Parch, Ticket, and Cabin can be dropped since they no longer add significant information (Age is now captured by AgeRange). We also drop Sex, since Hon carries similar but more precise information.

# Dropping unnecessary columns
#df_train.drop([ 'Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Cabin'], axis=1, inplace=True)
# Dropping Sex col as Hon has similar but precise information.
df_train.drop([ 'Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Sex'], axis=1, inplace=True)
df_train.head(3)
##              PassengerId  Survived Pclass     Fare Embarked  family_mem    Hon AgeRange
## PassengerId                                                                            
## 1                      1       0.0      3   7.2500        S           2     Mr    21-30
## 2                      2       1.0      1  71.2833        C           2    Mrs    31-40
## 3                      3       1.0      3   7.9250        S           1   Miss    21-30

Train/Test Split

Here, we detach the test data (PassengerIds from 892 onwards) into a separate dataframe. This step is required to avoid data-leakage issues in later stages.

# PassengerIds from 892 onwards belong to the test data;
# .copy() avoids SettingWithCopyWarning from the in-place drops below
df_test = df_train.iloc[891:].copy()
df_train = df_train.iloc[:891].copy()

df_train.drop('PassengerId', axis=1, inplace=True)
df_test.drop(['PassengerId','Survived'], axis=1, inplace=True)
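A quick sanity check on the split sizes (the Kaggle train set has 891 rows and the test set 418):

# Verify split sizes: expect (891, ...) for train and (418, ...) for test
print(df_train.shape, df_test.shape)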

Scaling Numeric Variables

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and apply the same transformation to the test data
df_train[['Fare','family_mem']] = scaler.fit_transform(df_train[['Fare','family_mem']])
#df_train = df_train.round({'Fare': 4, 'family_mem': 4})
df_test[['Fare','family_mem']] = scaler.transform(df_test[['Fare','family_mem']])
#df_test = df_test.round({'Fare': 4, 'family_mem': 4})

Encoding categorical variables

df_train = pd.get_dummies(df_train, drop_first=True)
print(df_train.head())
##              Survived      Fare  family_mem  Pclass_2  Pclass_3  Embarked_Q  Embarked_S  Hon_ Col  Hon_ Don  Hon_ Dona  Hon_ Dr  Hon_ Jonkheer  Hon_ Lady  Hon_ Major  Hon_ Master  Hon_ Miss  Hon_ Mlle  Hon_ Mme  Hon_ Mr  Hon_ Mrs  Hon_ Ms  Hon_ Rev  Hon_ Sir  AgeRange_11-20  AgeRange_21-30  AgeRange_31-40  AgeRange_41-50  AgeRange_51-60  AgeRange_61-70  AgeRange_71-80
## PassengerId                                                                                                                                                                                                                                                                                                                                                                       
## 1                 0.0 -0.502445    0.059160         0         1           0           1         0         0          0        0              0          0           0            0          0          0         0        1         0        0         0         0               0               1               0               0               0               0               0
## 2                 1.0  0.786845    0.059160         0         0           0           0         0         0          0        0              0          0           0            0          0          0         0        0         1        0         0         0               0               0               1               0               0               0               0
## 3                 1.0 -0.488854   -0.560975         0         1           0           1         0         0          0        0              0          0           0            0          1          0         0        0         0        0         0         0               0               1               0               0               0               0               0
## 4                 1.0  0.420730    0.059160         0         0           0           1         0         0          0        0              0          0           0            0          0          0         0        0         1        0         0         0               0               0               1               0               0               0               0
## 5                 0.0 -0.486337   -0.560975         0         1           0           1         0         0          0        0              0          0           0            0          0          0         0        1         0        0         0         0               0               0               1               0               0               0               0
df_test = pd.get_dummies(df_test, drop_first=True)
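Because Pclass, Hon, and AgeRange were made categorical on the combined frame, the dummy columns should line up across the two splits; a quick check (the only expected difference is Survived, which exists only in train):

# Columns present in one split but not the other
print(set(df_train.columns) - set(df_test.columns))
print(set(df_test.columns) - set(df_train.columns))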

Modelling

Simple Logistic Regression

Y = df_train.Survived
X = df_train.drop(['Survived'], axis=1)
Y.tail()
## PassengerId
## 887    0.0
## 888    1.0
## 889    0.0
## 890    1.0
## 891    0.0
## Name: Survived, dtype: float64
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X, Y)
pred_log = clf.predict(df_test).astype(int)
#pd.DataFrame(pred_log).head()
pred_log_df =  pd.DataFrame()
pred_log_df['PassengerId'] = df_test.index
pred_log_df['Survived'] = pred_log
pred_log_df.head()
##    PassengerId  Survived
## 0          892         0
## 1          893         1
## 2          894         0
## 3          895         0
## 4          896         1
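Since the model is already fitted, its coefficients give a rough sense of each feature's influence on the survival prediction (a quick sketch; only the two scaled numeric columns are strictly comparable in magnitude):

# Inspect the logistic-regression coefficients per feature
coef = pd.Series(clf.coef_[0], index=X.columns).sort_values()
print(coef.head())  # strongest negative contributions (towards not surviving)
print(coef.tail())  # strongest positive contributions (towards surviving)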

Cross-validation comparison of different classification models

from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
import xgboost
from sklearn.model_selection import cross_val_score

clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, Y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.82 (+/- 0.05)
clf = LogisticRegression()
scores = cross_val_score(clf, X, Y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.82 (+/- 0.04)
clf = RandomForestClassifier(max_depth=2, random_state=0)
scores = cross_val_score(clf, X, Y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.80 (+/- 0.06)
clf = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(clf, X, Y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.79 (+/- 0.06)
clf = GaussianNB()
scores = cross_val_score(clf, X, Y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.61 (+/- 0.25)
clf = xgboost.XGBClassifier()
scores = cross_val_score(clf, X, Y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.82 (+/- 0.06)
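The same comparison can be written more compactly as a loop over the candidate models; a sketch equivalent to the individual calls above:

# Compare all candidate models with the same 5-fold cross-validation
models = {
    'SVM (linear)': svm.SVC(kernel='linear', C=1),
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(max_depth=2, random_state=0),
    'KNN (k=3)': KNeighborsClassifier(n_neighbors=3),
    'Naive Bayes': GaussianNB(),
    'XGBoost': xgboost.XGBClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, Y, cv=5)
    print("%s: %0.2f (+/- %0.2f)" % (name, scores.mean(), scores.std() * 2))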

XGBoost Classification

clf = xgboost.XGBClassifier().fit(X, Y)
pred_xg = clf.predict(df_test).astype(int)

#pd.DataFrame(pred_xg).head()
pred_xg_df =  pd.DataFrame()
pred_xg_df['PassengerId'] = df_test.index
pred_xg_df['Survived'] = pred_xg
pred_xg_df.head()
##    PassengerId  Survived
## 0          892         0
## 1          893         0
## 2          894         0
## 3          895         0
## 4          896         1
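XGBoost also exposes feature importances, which can be plotted to see which engineered features the fitted model relies on (a quick sketch using xgboost's built-in helper):

# Plot the top features by importance for the fitted XGBoost model
xgboost.plot_importance(clf, max_num_features=10)
plt.show()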

SVM Classifier

clf = svm.SVC(kernel='linear', C=1).fit(X, Y)
pred_svm = clf.predict(df_test).astype(int)

#pd.DataFrame(pred_svm).head()
pred_svm_df =  pd.DataFrame()
pred_svm_df['PassengerId'] = df_test.index
pred_svm_df['Survived'] = pred_svm
pred_svm_df.head()
##    PassengerId  Survived
## 0          892         0
## 1          893         1
## 2          894         0
## 3          895         0
## 4          896         1

Kaggle Submission

pred_log_df.to_csv('lg_submission.csv', index=False)
pred_xg_df.to_csv('xg_submission.csv', index=False)
pred_svm_df.to_csv('svm_submission.csv', index=False)
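These files can also be submitted directly through the Kaggle API used at the start (a sketch; it assumes the authenticated api object from the import step is still in scope):

# Submit one of the prediction files to the Titanic competition
api.competition_submit('xg_submission.csv', 'XGBoost baseline', 'titanic')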