1 Titanic Survivor Analysis with Machine Learning

1.1 Introduction

The sinking of the Titanic is one of the most interesting case studies for predicting whether a passenger survived or not.

In this LBB Project, we revisit the Titanic dataset from Kaggle that we explored at an early stage of the LBB Project Programming for Data Science with R.

1.2 Business Question

We will predict whether each passenger aboard the Titanic survived or not.

1.3 Dataset Overview

There are three datasets provided by the original owner of the data, which we can explore and use to create the predictions:

Dataset train.csv contains the details of a subset of 891 passengers aboard the Titanic and, importantly, records whether each of them survived, also known as the “ground truth”.
Dataset test.csv contains similar information to train.csv but does not disclose the “ground truth” for each passenger.
Dataset gender_submission.csv is a set of predictions that assume all and only female passengers survive.
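As a minimal sketch of the rule behind gender_submission.csv, the baseline can be reproduced by predicting survival for every female passenger and non-survival for every male passenger. The toy data frame below is hypothetical and only mimics the PassengerId and Sex columns; it is not read from the Kaggle files.

```r
# Hypothetical toy passengers (not the real Kaggle rows)
toy_test <- data.frame(
  PassengerId = 892:895,
  Sex = c("male", "female", "male", "female")
)

# Baseline rule: all and only female passengers survive
baseline <- data.frame(
  PassengerId = toy_test$PassengerId,
  Survived = as.integer(toy_test$Sex == "female")
)

print(baseline)
```

Any model built later should at least beat this one-rule baseline to be worth its extra complexity.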

2 Data Preparation

Before we dive into further analysis, let us prepare our datasets.

2.1 Import Libraries

# Library setup: load the necessary packages
library(readr)
library(dplyr)
library(gtools)
library(car)
library(caret)
library(CMplot)
library(class)
#library(lubridate) # working with datetime
#library(GGally) # correlation relationship
#library(MLmetrics) # for MAE calculations
#library(performance) # model performance comparison
#library(lmtest) # Testing Linear Regression Model

2.2 Data Exploratory

Before we build the model, we will check the information contained in each of the original Kaggle datasets, train.csv and test.csv.

2.2.1 Train Dataset

# Read the first dataset
titanic_train <- read.csv("data_input/train.csv")

# Check the structure of first dataset
glimpse(titanic_train)

#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
#> $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
#> $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
#> $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
#> $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
#> $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
#> $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
#> $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
#> $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
#> $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
#> $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
#> $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

This dataset contains information for a total of 891 passengers.

2.2.2 Test Dataset

# Read the second dataset
titanic_test <- read.csv("data_input/test.csv")

# Check the structure of second dataset
glimpse(titanic_test)

#> Rows: 418
#> Columns: 11
#> $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903…
#> $ Pclass      <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3…
#> $ Name        <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)", "M…
#> $ Sex         <chr> "male", "female", "male", "male", "female", "male", "femal…
#> $ Age         <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 21.0…
#> $ SibSp       <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0…
#> $ Parch       <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ Ticket      <chr> "330911", "363272", "240276", "315154", "3101298", "7538",…
#> $ Fare        <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292, 2…
#> $ Cabin       <chr> "", "", "", "", "", "", "", "", "", "", "", "", "B45", "",…
#> $ Embarked    <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S", "S"…

This dataset contains information for a total of 418 passengers.

2.3 Dataset Description

The only difference between the train and test dataframes is the extra column Survived in train; all the other columns have the same meaning in both.

Therefore, the data description below applies to both dataframes:
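The column difference between the two dataframes can also be checked programmatically rather than by eye. The sketch below uses two hypothetical miniature dataframes (toy_train and toy_test are illustrative names, not the real data) and base R's setdiff().

```r
# Hypothetical miniature versions of the two dataframes
toy_train <- data.frame(PassengerId = 1:2, Survived = c(0L, 1L), Pclass = c(3L, 1L))
toy_test  <- data.frame(PassengerId = 892:893, Pclass = c(3L, 3L))

# Columns present in train but not in test
only_in_train <- setdiff(names(toy_train), names(toy_test))

# Columns present in test but not in train
only_in_test <- setdiff(names(toy_test), names(toy_train))

print(only_in_train)
print(only_in_test)
```

Applied to titanic_train and titanic_test, the same two calls should report Survived as the only column unique to the training data.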

PassengerId: Row number

Survived: Survival Status of the passenger. 1 for Yes, 0 for No

Pclass: Ticket class and a proxy for socio-economic status (1 = 1st class (Upper), 2 = 2nd class (Middle), 3 = 3rd class (Lower))

Name: Name of the passenger

Sex: Gender of the passenger (male / female)

Age: Age of the passenger in years. It is fractional if less than 1. If the age is estimated, it is in the form of xx.5

SibSp: Number of Siblings / Spouses aboard the Titanic with family relations as follows

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

Parch: Number of Parents / Children aboard the Titanic with family relations as follows

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch = 0 for them

Ticket: Ticket Number

Fare: Passenger Fare

Cabin: Cabin number

Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

2.4 Dataset Exploration

2.4.1 Titanic_Train Dataframe

Change Data Type Appropriately

We will convert the following columns to the categorical factor data type: Survived, Pclass, Sex, SibSp, Parch, and Embarked

# Change Data Type
titanic_train <- titanic_train %>% 
  mutate_at(vars(Survived, Pclass, Sex, SibSp,
                 Parch, Embarked),
            as.factor)

# Confirm Data Type Change
str(titanic_train)

#> 'data.frame':    891 obs. of  12 variables:
#>  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
#>  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
#>  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
#>  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
#>  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
#>  $ SibSp      : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
#>  $ Parch      : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
#>  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
#>  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
#>  $ Cabin      : chr  "" "C85" "" "C123" ...
#>  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
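For readers without dplyr, the same kind of conversion can be sketched in base R. The toy data frame and the factor_cols vector below are hypothetical and only illustrate the pattern. In recent dplyr versions, mutate(across(...)) supersedes mutate_at(), although mutate_at() still works.

```r
# Hypothetical toy data frame with a few of the Titanic columns
toy <- data.frame(
  Survived = c(0L, 1L, 1L),
  Pclass = c(3L, 1L, 3L),
  Sex = c("male", "female", "female"),
  Age = c(22, 38, 26),
  stringsAsFactors = FALSE
)

# Columns to convert to factor; Age stays numeric
factor_cols <- c("Survived", "Pclass", "Sex")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```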

Check Missing Value

anyNA(titanic_train)

#> [1] TRUE

colSums(is.na(titanic_train))

#> PassengerId    Survived      Pclass        Name         Sex         Age 
#>           0           0           0           0           0         177 
#>       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
#>           0           0           0           0           0           0

There are 177 missing values in column Age

177/nrow(titanic_train)

#> [1] 0.1986532

About 19.87% of the rows in titanic_train have a missing value in column Age. Because this share is substantial, we keep the data as it is, without removing or otherwise handling those missing values.
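Note that is.na() only counts true NA values. The str() output above shows empty strings in Cabin and an empty "" level in Embarked, and these are not reported by colSums(is.na(...)). A minimal sketch of counting them, using a hypothetical toy data frame rather than the real data:

```r
# Hypothetical toy data frame with one NA and some empty strings
toy <- data.frame(
  Age = c(22, NA, 26, 35),
  Cabin = c("", "C85", "", "C123"),
  Embarked = factor(c("S", "C", "", "S")),
  stringsAsFactors = FALSE
)

# is.na() only sees the real NA in Age
na_counts <- colSums(is.na(toy))

# Count empty strings per column; as.character() also handles factors
empty_counts <- sapply(toy, function(col) sum(as.character(col) == "", na.rm = TRUE))

print(na_counts)
print(empty_counts)
```

Whether empty strings should be treated as missing is a modelling decision; the sketch only makes them visible.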

2.4.2 Titanic_Test Dataframe

Change Data Type Appropriately

We will convert the following columns to the categorical factor data type: Pclass, Sex, SibSp, Parch, and Embarked

# Change Data Type
titanic_test <- titanic_test %>% 
  mutate_at(vars(Pclass, Sex, SibSp,
                 Parch, Embarked),
            as.factor)

# Confirm Data Type Change
str(titanic_test)

#> 'data.frame':    418 obs. of  11 variables:
#>  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
#>  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 3 2 3 3 3 3 2 3 3 ...
#>  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
#>  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
#>  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
#>  $ SibSp      : Factor w/ 7 levels "0","1","2","3",..: 1 2 1 1 2 1 1 2 1 3 ...
#>  $ Parch      : Factor w/ 8 levels "0","1","2","3",..: 1 1 1 1 2 1 1 2 1 1 ...
#>  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
#>  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
#>  $ Cabin      : chr  "" "" "" "" ...
#>  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...

Check Missing Value

anyNA(titanic_test)

#> [1] TRUE

colSums(is.na(titanic_test))

#> PassengerId      Pclass        Name         Sex         Age       SibSp 
#>           0           0           0           0          86           0 
#>       Parch      Ticket        Fare       Cabin    Embarked 
#>           0           0           1           0           0

There are 86 missing values in column Age and 1 missing value in column Fare

86/nrow(titanic_test)

#> [1] 0.2057416

About 20.57% of the rows in titanic_test have a missing value in column Age. Because this is still a substantial share, we keep the data as it is, without removing or otherwise handling those missing values.
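The two str() outputs also show that the factor levels differ between the dataframes: Parch has 7 levels in titanic_train but 8 in titanic_test, and Embarked has an extra "" level in titanic_train. If such columns were later used as predictors, the mismatch could cause trouble at prediction time. One possible sketch of giving both factors a shared set of levels, using hypothetical toy vectors:

```r
# Hypothetical factors whose levels differ between train and test
train_parch <- factor(c(0, 1, 2, 0))
test_parch  <- factor(c(0, 1, 9))

# Build the union of levels and re-encode both factors with it
all_levels  <- union(levels(train_parch), levels(test_parch))
train_parch <- factor(train_parch, levels = all_levels)
test_parch  <- factor(test_parch, levels = all_levels)

print(levels(train_parch))
print(levels(test_parch))
```

This only aligns the encoding. A level that never occurs in the training data still has no observations from which its effect could be estimated.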

2.5 Build Model

To build the model, we will use the titanic_train dataset, which contains our target variable, Survived.

2.5.1 Check Class Proportion of Our Data

prop.table(table(titanic_train$Survived))

#> 
#>         0         1 
#> 0.6161616 0.3838384

We observe that the classes are reasonably balanced, with 61.6% (Not Survived) to 38.4% (Survived).

2.5.2 Model without any Predictor

titanic_train_null <- glm(Survived ~ 1, data = titanic_train, family = "binomial")
summary(titanic_train_null)

#> 
#> Call:
#> glm(formula = Survived ~ 1, family = "binomial", data = titanic_train)
#> 
#> Coefficients:
#>             Estimate Std. Error z value        Pr(>|z|)    
#> (Intercept) -0.47329    0.06889   -6.87 0.0000000000064 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1186.7  on 890  degrees of freedom
#> Residual deviance: 1186.7  on 890  degrees of freedom
#> AIC: 1188.7
#> 
#> Number of Fisher Scoring iterations: 4

# Odds (exponential of the log of odds)
exp(-0.47329)

#> [1] 0.6229494

# Probability (inverse logit of the log of odds)
inv.logit(-0.47329)

#> [1] 0.3838378

💡 Insight Interpretation :

The odds of survival are about 0.62: for every passenger who did not survive, roughly 0.62 passengers survived.
The probability that a passenger aboard the Titanic survived (alive) is about 38.4%, and the remaining 61.6% did not survive, which matches the class proportion above.
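For an intercept-only logistic regression, the intercept is the log of odds of the positive class, so converting it back to a probability should reproduce the observed class proportion. The sketch below checks this on a hypothetical outcome vector y (not the Titanic data), using base R's plogis(), which computes the same inverse logit as gtools::inv.logit().

```r
# Hypothetical outcome vector: 3 survivors out of 8 passengers
y <- c(0, 0, 1, 0, 1, 0, 0, 1)

# Logistic regression without any predictor
null_fit <- glm(y ~ 1, family = "binomial")

log_odds <- unname(coef(null_fit))  # intercept = log(p / (1 - p))
odds     <- exp(log_odds)           # p / (1 - p)
prob     <- plogis(log_odds)        # inverse logit, applied to the log of odds

print(c(odds = odds, prob = prob, observed = mean(y)))
```

The inverse logit must be applied to the log of odds (the raw coefficient), not to the exponentiated odds.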

2.5.3 Categorical Predictor Model based on Gender

Let us take another look at our titanic_train dataframe

head(titanic_train)

titanic_train_gender <- glm(Survived ~ Sex, data = titanic_train, family = "binomial")
summary(titanic_train_gender)

#> 
#> Call:
#> glm(formula = Survived ~ Sex, family = "binomial", data = titanic_train)
#> 
#> Coefficients:
#>             Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)   1.0566     0.1290   8.191 0.000000000000000258 ***
#> Sexmale      -2.5137     0.1672 -15.036 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1186.7  on 890  degrees of freedom
#> Residual deviance:  917.8  on 889  degrees of freedom
#> AIC: 921.8
#> 
#> Number of Fisher Scoring iterations: 4

💡 Coefficients Information :

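Intercept : 1.0566 is the log of odds that a passenger survived (alive) when the gender is “female”, the reference level
Sexmale : -2.5137 is the log of the odds ratio of survival for “male” relative to “female” passengers, so the log of odds of survival for a male passenger is 1.0566 - 2.5137 = -1.4571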

Calculate Probability of Survival for Passengers with “female” gender

# Odds (exponential of the log of odds)
exp(1.0566)

#> [1] 2.876574

# Probability (inverse logit of the log of odds)
inv.logit(1.0566)

#> [1] 0.7420403
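Instead of copying rounded coefficients by hand, the per-group probabilities can be obtained directly from a fitted model with predict(..., type = "response"). The sketch below uses a hypothetical toy data frame (toy, fit and new_passengers are illustrative names) and compares predict() with the manual inverse-logit calculation.

```r
# Hypothetical toy data: 3 of 4 females and 1 of 4 males survived
toy <- data.frame(
  Survived = c(1, 1, 1, 0, 0, 0, 0, 1),
  Sex = factor(c("female", "female", "female", "female",
                 "male", "male", "male", "male"))
)

fit <- glm(Survived ~ Sex, data = toy, family = "binomial")

# One row per gender, using the same factor levels as the training data
new_passengers <- data.frame(Sex = factor(c("female", "male"),
                                          levels = levels(toy$Sex)))

# Predicted survival probabilities straight from the model
probs <- unname(predict(fit, newdata = new_passengers, type = "response"))

# Same probabilities by hand: inverse logit of the summed coefficients
b <- coef(fit)
by_hand <- plogis(c(b[["(Intercept)"]],
                    b[["(Intercept)"]] + b[["Sexmale"]]))

print(rbind(probs, by_hand))
```

With a single categorical predictor, both routes should recover the observed survival proportion within each group.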