library("data.table")
library("janitor")
library("GGally")
library("missForest")
library("randomForest")
library("MASS")
library("class")
library("caret")
library("ModelMetrics")
library("tidyverse")
library("magrittr")
library("glmnet")
library("mice")

Acknowledgement

Formatting and data processing ideas are inspired by Bisaria’s post on Kaggle.

Skills Involved

Base R, tidyverse, statistical learning, data wrangling, string manipulation, functional programming, data imputation.

What is Kaggle

Kaggle is a website that posts challenges in the field of machine learning. For the Titanic challenge, participants are given two datasets: train and test. The train dataset contains the outcome, while the test dataset does not. After generating predictions for the test dataset with a model trained on the train dataset, participants submit the results to the Kaggle website for scoring. The score for this challenge is the proportion of correct predictions.

Synopsis

This is an application of some of the statistical learning methods in ISLR. In particular, we use logistic regression, ridge regression, lasso regression, and radial SVM to predict the survival status (0 = perished, 1 = survived) of passengers in Kaggle’s Titanic dataset. The main challenge of this dataset lies in imputing the variables “age” and “deck.” Using missForest::missForest, we imputed the age variable. Using other clues, such as passengers sharing a ticket number being on the same deck and certain ticket classes corresponding to certain decks, we imputed all the deck values.

Across our submissions to Kaggle, radial SVM was the most accurate, with a proportion of correct predictions of 0.79904 (top 12% as of 6/30/2020).

The code can be found on GitHub.

Load the Data

The “train” and “test” datasets were loaded. The “full” dataset, a merge of the two, was used to impute NA values.
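A minimal sketch of this step, assuming the standard Kaggle file names; janitor::clean_names() yields the snake_case column names used throughout, and a “type” column marks each row’s origin:

train <- fread("train.csv") %>% clean_names() %>% mutate(type = "Train")
test  <- fread("test.csv")  %>% clean_names() %>% mutate(type = "Test")
full  <- bind_rows(train, test)   # merged set used for imputation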

Exploratory Analysis

Figures and Plots

Training Dataset
| passenger_id | survived | pclass | name | sex | age | sib_sp | parch | ticket | fare | cabin | embarked | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NA | S | Train |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Train |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NA | S | Train |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S | Train |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NA | S | Train |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | NA | Q | Train |

The data dictionary is as follows:

| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No; 1 = Yes |
| pclass | Ticket class | 1 = 1st; 2 = 2nd; 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg; Q = Queenstown; S = Southampton |

The data also comes with the following notes:

Variable Notes
pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way…
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way…
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch=0 for them.

We plot the pairwise scatter plot to take a cursory look at the whole dataset.
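For example, a hedged sketch of the call with GGally::ggpairs (the column selection here is illustrative):

train %>%
        select(survived, pclass, sex, age, sib_sp, parch, fare) %>%
        ggpairs()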

We can see that there are significant correlations between the predictor variables. Ridge and lasso regression should be robust against this multicollinearity.

Comments

The biggest issue with our dataset is the NAs. From the following table, we can see that “age” and “cabin” have the most NAs.

Number of NAs in each dataset by variable

| type | passenger_id | pclass | name | sex | age | sib_sp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 0 | 0 | 0 | 0 | 177 | 0 | 0 | 0 | 0 | 687 | 2 |
| Test | 0 | 0 | 0 | 0 | 86 | 0 | 0 | 0 | 1 | 327 | 0 |

Strategies for “age” and “cabin” imputation are discussed individually below. The imputation is performed on a merged dataset containing both “train” and “test”.

Feature Engineering

“pclass” or ticket class

R did not order the pclass factor levels by class rank, so we recode the variable so that the levels run pclass == 3 < pclass == 2 < pclass == 1 (i.e., first class is the highest level). This change is more meaningful for inference than for prediction.

Before

##  Factor w/ 3 levels "3","1","2": 1 2 1 2 1 1 2 1 1 3 ...

After

##  Factor w/ 3 levels "3","2","1": 1 3 1 3 1 1 3 1 1 2 ...

pclass == 1 is now correctly the highest level.
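A sketch of the recode with forcats (the exact call in the original may differ):

full <- full %>%
        mutate(pclass = fct_relevel(pclass, "3", "2", "1"))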

“name”

The “name” variable is stored in the form “Last Name, Title. First Name” (e.g., Braund, Mr. Owen Harris). We are mostly concerned with the last name and the title of each person. Last names can indicate whether people belong to the same family and are traveling together (hence staying in the same cabin). Titles can hint at the age of a person whose age is missing; for example, “Dr.” would indicate someone older and “Master” someone younger. We also convert all characters to lower case.

Before

| name |
|---|
| Braund, Mr. Owen Harris |
| Cumings, Mrs. John Bradley (Florence Briggs Thayer) |
| Heikkinen, Miss. Laina |
| Futrelle, Mrs. Jacques Heath (Lily May Peel) |
| Allen, Mr. William Henry |
| Moran, Mr. James |

After

| title | first_name | last_name |
|---|---|---|
| mr | owen harris | braund |
| mrs | john bradley (florence briggs thayer) | cumings |
| miss | laina | heikkinen |
| mrs | jacques heath (lily may peel) | futrelle |
| mr | william henry | allen |
| mr | james | moran |
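One way to express this split, assuming the column names shown above; the lazy regex captures the last name before the first comma and the title before the first period (a sketch, not necessarily the original code):

full <- full %>%
        mutate(name = str_to_lower(name)) %>%
        extract(name, into = c("last_name", "title", "first_name"),
                regex = "^(.*?), (.*?)\\. (.*)$")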

Title of a passenger

We examine the number of different titles in the dataset.

| title | count | missing_age | min age | max age |
|---|---|---|---|---|
| mr | 757 | 176 | 11.00 | 80.0 |
| miss | 260 | 50 | 0.17 | 63.0 |
| mrs | 197 | 27 | 14.00 | 76.0 |
| master | 61 | 8 | 0.33 | 14.5 |
| dr | 8 | 1 | 23.00 | 54.0 |
| rev | 8 | 0 | 27.00 | 57.0 |
| col | 4 | 0 | 47.00 | 60.0 |
| major | 2 | 0 | 45.00 | 52.0 |
| mlle | 2 | 0 | 24.00 | 24.0 |
| ms | 2 | 1 | 28.00 | 28.0 |
| capt | 1 | 0 | 70.00 | 70.0 |
| don | 1 | 0 | 40.00 | 40.0 |
| dona | 1 | 0 | 39.00 | 39.0 |
| jonkheer | 1 | 0 | 38.00 | 38.0 |
| lady | 1 | 0 | 48.00 | 48.0 |
| mme | 1 | 0 | 24.00 | 24.0 |
| sir | 1 | 0 | 49.00 | 49.0 |
| the countess | 1 | 0 | 33.00 | 33.0 |

We merge the titles to minimize the number of categories as follows.

Merging titles

| new_title | old_title | description |
|---|---|---|
| sir | mr; capt; col; don; major; rev; jonkheer; sir; dr and sex = male; master and age > 14.5 | Male, age > 14.5 |
| madam | mrs; dona; mlle; mme; dr and sex = female; ms; miss and age > 14.5; the countess; lady | Female, age > 14.5 |
| young_master | master and age <= 14.5; mr and age <= 14.5 | Male, age <= 14.5 |
| young_miss | miss and age <= 14.5 | Female, age <= 14.5 |

First, we convert every “master” with age > 14.5 to “sir”, every “miss” with age > 14.5 to “madam”, every “mr” with age <= 14.5 to “young_master”, every “mrs” with age <= 14.5 to “young_miss”, and male/female “dr” to “sir”/“madam”. Then we merge the remaining titles as in the table above using forcats::fct_collapse, as sketched below.
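A sketch of both steps, assuming the column names used here: case_when() for the age- and sex-based reassignments, then forcats::fct_collapse for the merge.

full <- full %>%
        mutate(title = case_when(
                title == "master" & age > 14.5  ~ "sir",
                title == "miss"   & age > 14.5  ~ "madam",
                title == "mr"     & age <= 14.5 ~ "young_master",
                title == "mrs"    & age <= 14.5 ~ "young_miss",
                title == "dr" & sex == "male"   ~ "sir",
                title == "dr" & sex == "female" ~ "madam",
                TRUE ~ title)) %>%   # titles with missing age keep their raw value
        mutate(title = fct_collapse(title,
                sir   = c("sir", "mr", "capt", "col", "don", "major",
                          "rev", "jonkheer"),
                madam = c("madam", "mrs", "dona", "mlle", "mme", "ms",
                          "the countess", "lady"),
                young_master = c("young_master", "master"),
                young_miss   = c("young_miss", "miss")))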

Dataset with merged titles

| title | count | missing_age | min age | max age |
|---|---|---|---|---|
| sir | 777 | 177 | 15.00 | 80.0 |
| madam | 365 | 28 | 15.00 | 76.0 |
| young_miss | 101 | 50 | 0.17 | 14.5 |
| young_master | 66 | 8 | 0.33 | 14.5 |

From this table, we can see that we have successfully merged the titles as described. We can now move on to the imputation of “age”, “embarked”, and “fare”. Imputation of the “cabin” variable will be dealt with separately.

Imputation of “age”, “embarked”, and “fare” with missForest::missForest

Before imputing with missForest::missForest:

Number of NAs in each dataset by variable

| type | passenger_id | pclass | name | sex | age | sib_sp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 0 | 0 | 0 | 0 | 177 | 0 | 0 | 0 | 0 | 687 | 2 |
| Test | 0 | 0 | 0 | 0 | 86 | 0 | 0 | 0 | 1 | 327 | 0 |
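A hedged sketch of the call; the exact set of columns passed to missForest is an assumption (missForest requires a data frame of numeric or factor columns, so identifier and free-text columns are set aside):

to_impute <- full %>%
        select(pclass, sex, title, age, sib_sp, parch, fare, embarked) %>%
        mutate(across(where(is.character), as.factor)) %>%
        as.data.frame()

imp <- missForest(to_impute)

# write the imputed values back to the merged dataset
full$age      <- imp$ximp$age
full$fare     <- imp$ximp$fare
full$embarked <- imp$ximp$embarked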
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!

After missForest::missForest imputation:

Number of NAs in each dataset by variable

| type | passenger_id | pclass | last_name | title | first_name | sex | sib_sp | parch | ticket | cabin | na | age | fare | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 687 | 0 | 0 | 0 | 0 |
| Test | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 327 | 0 | 0 | 0 | 0 |

missForest::missForest has imputed all the NAs in “age”, “embarked”, and “fare”. Now the only remaining variable with NAs is “cabin”.

“ticket”

Ticket numbers have some information that helps us with imputing the “cabin” variable.

Some ticket numbers

| ticket |
|---|
| A/5 21171 |
| PC 17599 |
| STON/O2. 3101282 |
| 113803 |
| 373450 |
| 330877 |

We see that ticket numbers end with a series of digits. Using regex, we can easily extract these numbers from the tickets. Tickets that contain no digits are assigned the number 99999 (see the sketch after the following table).

Extracted ticket numbers

| tix_num |
|---|
| 21171 |
| 17599 |
| 3101282 |
| 113803 |
| 373450 |
| 330877 |
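A sketch of the extraction with stringr; tickets without any digits come back as NA and fall back to 99999:

full <- full %>%
        mutate(tix_num = str_extract(ticket, "[0-9]+$"),   # trailing digits
               tix_num = replace_na(tix_num, "99999"),     # no digits -> 99999
               tix_num = as.numeric(tix_num))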

“family members”

We merge “sib_sp” and “parch” into “family_mem”, which is short for family members.
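A one-line sketch, assuming the merge is a simple sum:

full <- full %>% mutate(family_mem = sib_sp + parch)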

“cabin” and “deck”

Cabin is the variable with the most NAs. Let’s first extract the first letter of each available cabin number into the variable “deck”. We also change deck “T” into deck “A”: deck “T” is just another first-class deck, and there is only one deck “T” value in the whole dataset.

Creating the ‘deck’ variable from ‘cabin’

| cabin | deck |
|---|---|
| C85 | C |
| C123 | C |
| E46 | E |
| G6 | G |
| C103 | C |
| D56 | D |
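A sketch of both steps (not necessarily the original code):

full <- full %>%
        mutate(deck = str_sub(cabin, 1, 1),              # first letter of cabin
               deck = if_else(deck == "T", "A", deck))   # fold lone deck "T" into "A"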

We know that people who share a ticket number must be on the same deck. Therefore, if one member of a group with the same ticket number has a deck value, we assign that deck to the whole group.

In the example below, two people have ticket number 3; one of them is on deck “E”, so we assign deck “E” to the other person, whose deck value is missing. Two people also share ticket number 2, but since neither of them has a deck value, we impute their information later.

Before

| passenger_id | pclass | tix_num | deck |
|---|---|---|---|
| 1078 | 2 | 2 | NA |
| 1194 | 2 | 2 | NA |
| 773 | 2 | 3 | E |
| 842 | 2 | 3 | NA |
| 1062 | 3 | 251 | NA |
| 474 | 2 | 541 | D |

After

| passenger_id | pclass | tix_num | deck |
|---|---|---|---|
| 1078 | 2 | 2 | NA |
| 1194 | 2 | 2 | NA |
| 773 | 2 | 3 | E |
| 842 | 2 | 3 | E |
| 1062 | 3 | 251 | NA |
| 474 | 2 | 541 | D |
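The original post uses a custom function for this rule; tidyr offers a compact alternative (a sketch, not the author’s code):

full <- full %>%
        group_by(tix_num) %>%
        fill(deck, .direction = "downup") %>%   # propagate any known deck within the group
        ungroup()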

The following table shows the number of missing deck values in each “pclass”.

Number of decks and NAs by pclass

| deck | pclass 3 | pclass 2 | pclass 1 |
|---|---|---|---|
| A | 0 | 0 | 21 |
| B | 0 | 0 | 66 |
| C | 0 | 0 | 105 |
| D | 0 | 6 | 40 |
| E | 3 | 5 | 38 |
| F | 10 | 13 | 0 |
| G | 5 | 0 | 0 |
| NA | 691 | 253 | 53 |

From this table, we see that 997 NAs remain after the ticket-number imputation. For the rest of the missing decks, we randomly assign a deck based on the number of unique ticket numbers per deck in each “pclass”.

Number of unique decks and NAs by pclass

| deck | pclass 3 | pclass 2 | pclass 1 |
|---|---|---|---|
| A | 0 | 0 | 19 |
| B | 0 | 0 | 31 |
| C | 0 | 0 | 49 |
| D | 0 | 6 | 24 |
| E | 2 | 4 | 20 |
| F | 8 | 8 | 0 |
| G | 2 | 0 | 0 |
| NA | 534 | 174 | 43 |

The following method is used for the random sampling so that the existing deck ratios are preserved. For example, in class 3 there are 2 unique ticket numbers on deck E, 8 unique ticket numbers on deck F, and 2 unique ticket numbers on deck G, and there are 534 NAs in class 3. Hence, we assign decks E, F, and G in a 2:8:2 ratio across all 534 NAs. We follow the steps below:

  1. Determine the ratio of “decks” in each “pclass” to be randomly sampled.
  2. Impute the decks based on the calculated ratio; the resulting per-deck counts for each class are printed below.
## [[1]]
## [1]  6  9 15  7  6
## 
## [[2]]
## [1] 58 39 77
## 
## [[3]]
## [1]  89 356  89

Base R provides an elegant way to impute the decks according to the above ratios.
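For instance, the allocation can be drawn with rep() and sample() (a sketch for class 3, using the counts computed above):

counts  <- c(E = 89, F = 356, G = 89)          # per-deck allocation for class 3
decks_3 <- sample(rep(names(counts), counts))  # 534 shuffled deck labels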

Number of unique decks imputed by pclass

| deck | pclass 3 | pclass 2 | pclass 1 |
|---|---|---|---|
| A | 0 | 0 | 6 |
| B | 0 | 0 | 9 |
| C | 0 | 0 | 15 |
| D | 0 | 58 | 7 |
| E | 89 | 39 | 6 |
| F | 356 | 77 | 0 |
| G | 89 | 0 | 0 |

We see that the decks have been imputed according to the desired ratios. We now assign the unique tickets with imputed decks back to the full list, and again give all tickets sharing the same number the same deck, using the function we wrote earlier.

Total number of decks by pclass

| deck | pclass 3 | pclass 2 | pclass 1 |
|---|---|---|---|
| A | 0 | 0 | 28 |
| B | 0 | 0 | 80 |
| C | 0 | 0 | 124 |
| D | 1 | 85 | 47 |
| E | 125 | 65 | 44 |
| F | 466 | 127 | 0 |
| G | 117 | 0 | 0 |

We have now imputed all the data and can move on to regenerating the train/test datasets.

Regenerating the “train” and “test” Datasets

From the full dataset, we split using dplyr::group_split. The train dataset has 891 observations, while the test dataset has 418.

## [1] 891  14
## [1] 418  13
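A sketch of the split, assuming the “type” column labels rows “Test”/“Train” (group_split orders groups alphabetically, so “Test” comes first) and that the empty outcome column is dropped from the test set:

splits    <- full %>% group_split(type)
test_imp  <- splits[[1]] %>% select(-type, -survived)   # "Test"
train_imp <- splits[[2]] %>% select(-type)              # "Train"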

We have regenerated our “train” and “test” datasets. Now we can perform statistical learning.

Fitting Models

pred <- function(x) {
        # predict on the imputed test set and build the submission data frame
        data  <- predict(x, newdata = test_imp)
        final <- data.frame(PassengerId = test_imp$passenger_id,
                            Survived    = data)
        final
}

We wrote this “pred” function to generate the prediction data frame for submission from a model trained with caret.

compare <- function(x, y) {
        # rows where the two prediction sets disagree
        dat <- full_join(x, y, by = "PassengerId") %>%
                mutate(Survived.x = parse_number(as.character(Survived.x)),
                       Survived.y = parse_number(as.character(Survived.y)),
                       diff       = Survived.x - Survived.y) %>%
                filter(diff != 0) %>%
                as.data.frame()

        # passenger details for the disagreements, labelled by which model
        # predicted survival (assumes rows align with dat by passenger order)
        dat_1 <- test_imp %>%
                filter(passenger_id %in% dat$PassengerId) %>%
                mutate(surv = case_when(dat$diff < 0 ~ "model_2",
                                        dat$diff > 0 ~ "model_1")) %>%
                as.data.frame()

        list(dat, dat_1)
}

We wrote this function to compare the predictions generated by different models and to inspect the passengers on which two models disagree.

Logistic Regression

## 
## Call:  NULL
## 
## Coefficients:
##       (Intercept)            pclass2            pclass1         titlemadam  
##         -2.368023           1.107841           2.669288           3.306329  
## titleyoung_master    titleyoung_miss          sexfemale         family_mem  
##          3.111178           2.952375                 NA          -0.417483  
##               age               fare          embarkedC          embarkedQ  
##         -0.025187           0.004429           0.467970           0.207616  
##             deckB              deckC              deckD              deckE  
##         -0.326905          -0.362735           0.725383           0.566957  
##             deckF              deckG  
##          0.478093           0.578372  
## 
## Degrees of Freedom: 890 Total (i.e. Null);  874 Residual
## Null Deviance:       1187 
## Residual Deviance: 729.5     AIC: 763.5

Using logistic regression, our predictive accuracy is 0.76555. Note that the coefficient for sexfemale is NA: after the title merge, sex is perfectly determined by title, so the term is aliased and dropped by glm.
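A hedged sketch of the fit with caret; the formula is inferred from the coefficient names above, and survived is assumed to be a factor:

log_fit <- train(survived ~ pclass + title + sex + family_mem + age +
                         fare + embarked + deck,
                 data = train_imp, method = "glm", family = "binomial")
log_sub <- pred(log_fit)   # submission data frame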

Lasso and Ridge

Coefficients of Ridge model

## 19 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)       -1.127418228
## (Intercept)        .          
## pclass2            0.305182786
## pclass1            0.546395191
## titlemadam         0.839601038
## titleyoung_master  0.815199012
## titleyoung_miss    0.495704094
## sexfemale          0.889401450
## family_mem        -0.070759390
## age               -0.009264014
## fare               0.002652996
## embarkedC          0.297908824
## embarkedQ          0.114297128
## deckB              0.250634363
## deckC              0.161502862
## deckD              0.316098153
## deckE              0.009686306
## deckF             -0.211160249
## deckG             -0.184288770

Coefficients of Lasso model

## 19 x 1 sparse Matrix of class "dgCMatrix"
##                               1
## (Intercept)       -1.8618977091
## (Intercept)        .           
## pclass2            0.4933583388
## pclass1            1.3739437774
## titlemadam         0.1041118049
## titleyoung_master  1.9448457562
## titleyoung_miss    .           
## sexfemale          2.4898806098
## family_mem        -0.1608056082
## age               -0.0039205005
## fare               0.0009603881
## embarkedC          0.1976752457
## embarkedQ          .           
## deckB              .           
## deckC              .           
## deckD              0.1648004146
## deckE              .           
## deckF              .           
## deckG              .

The proportion of correct responses for both the ridge and lasso methods is 0.78947.
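A minimal sketch of the two fits with glmnet (alpha = 0 gives ridge, alpha = 1 gives lasso; the lambda selection shown is an assumption):

x <- model.matrix(survived ~ pclass + title + sex + family_mem + age +
                          fare + embarked + deck, data = train_imp)
y <- train_imp$survived
ridge_cv <- cv.glmnet(x, y, family = "binomial", alpha = 0)
lasso_cv <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(lasso_cv, s = "lambda.min")   # coefficients at the CV-selected lambda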

Radial Kernel SVM

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 891 samples
##   8 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (17), scaled (17) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 802, 803, 802, 802, 802, 802, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.8182857  0.6093306
##   0.50  0.8216440  0.6139442
##   1.00  0.8193971  0.6039277
## 
## Tuning parameter 'sigma' was held constant at a value of 0.05554577
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05554577 and C = 0.5.

The proportion of correct responses for the Radial Kernel SVM method is 0.79904 (Top 12%).
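A hedged sketch reproducing the setup shown in the output above (10-fold cross-validation, centering and scaling; tuneLength = 3 yields C in {0.25, 0.5, 1}):

svm_fit <- train(survived ~ pclass + title + sex + family_mem + age +
                         fare + embarked + deck,
                 data = train_imp, method = "svmRadial",
                 preProcess = c("center", "scale"),
                 trControl  = trainControl(method = "cv", number = 10),
                 tuneLength = 3)
svm_sub <- pred(svm_fit)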

Conclusion

The different classification methods yield very close results. To improve accuracy, we would need a better strategy for imputing both the “age” and “deck” variables.