library("data.table")
library("janitor")
library("GGally")
library("missForest")
library("randomForest")
library("MASS")
library("class")
library("caret")
library("ModelMetrics")
library("tidyverse")
library("magrittr")
library("glmnet")
library("mice")

Acknowledgement

Formatting and data processing ideas are inspired by Bisaria’s post on Kaggle.

Skills Involved

Base R, tidyverse, statistical learning, data wrangling, string manipulation, functional programming, data imputation.

What is Kaggle

Kaggle is a website that posts challenges in the field of machine learning. For the Titanic challenge, participants are given two datasets: train and test. The train dataset contains the outcome, while the test dataset does not. After generating predictions for the test dataset with a model trained on the train dataset, participants submit the results to the Kaggle website for scoring. The score for this challenge is the proportion of correct predictions.

Synopsis

This is an application of some of the statistical learning methods in ISLR. In particular, we use logistic regression, ridge regression, lasso regression, and radial SVM to predict the survival status (0 = perished, 1 = survived) of passengers in Kaggle’s Titanic dataset. The main challenge of this dataset lies in imputing the variables “age” and “deck.” Using missForest::missForest, we imputed the age variable. Using other clues, such as passengers sharing a ticket number being on the same deck and certain ticket classes corresponding to certain decks, we imputed all the deck values.

Across our submissions to Kaggle, radial SVM was the most accurate, with a proportion of correct predictions of 0.79904 (top 12% as of 6/30/2020).

The code can be found on GitHub.

Load the Data

The “train” and “test” datasets were loaded. The “full” dataset, a merge of the two, was used to impute NA values.
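A minimal sketch of this step, assuming the standard Kaggle file names; janitor::clean_names() yields the snake_case column names used throughout, and a “type” column marks each row’s origin:

train <- fread("train.csv") %>% clean_names() %>% mutate(type = "Train")
test  <- fread("test.csv")  %>% clean_names() %>% mutate(type = "Test")
full  <- bind_rows(train, test)   # merged set used for imputation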

Exploratory Analysis

Figures and Plots

Training Dataset
| passenger_id | survived | pclass | name | sex | age | sib_sp | parch | ticket | fare | cabin | embarked | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NA | S | Train |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Train |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NA | S | Train |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S | Train |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NA | S | Train |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | NA | Q | Train |

The data dictionary is as follows:

| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No; 1 = Yes |
| pclass | Ticket class | 1 = 1st; 2 = 2nd; 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg; Q = Queenstown; S = Southampton |

The data also comes with the following notes:

Variable Notes
pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way…
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way…
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch=0 for them.

We plot the pairwise scatter plot to take a cursory look at the whole dataset.
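For example, a hedged sketch of the call with GGally::ggpairs (the column selection here is illustrative):

train %>%
        select(survived, pclass, sex, age, sib_sp, parch, fare) %>%
        ggpairs()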

We can see that there are significant correlations between the predictor variables. Ridge and lasso regression should be robust against this multicollinearity.

Comments

The biggest issue with our dataset is the NAs. From the following table, we can see that “age” and “cabin” have the most NAs.

Number of NAs in each dataset by variable

| type | passenger_id | pclass | name | sex | age | sib_sp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 0 | 0 | 0 | 0 | 177 | 0 | 0 | 0 | 0 | 687 | 2 |
| Test | 0 | 0 | 0 | 0 | 86 | 0 | 0 | 0 | 1 | 327 | 0 |

Strategies for “age” and “cabin” imputation are discussed individually below. The imputation is performed on a merged dataset containing both “train” and “test”.

Feature Engineering

“pclass” or ticket class

R did not order the pclass factor levels by class rank, so we recode the variable so that the levels run pclass == 3 < pclass == 2 < pclass == 1 (i.e., first class is the highest level). This change is more meaningful for inference than for prediction.

Before

##  Factor w/ 3 levels "3","1","2": 1 2 1 2 1 1 2 1 1 3 ...

After

##  Factor w/ 3 levels "3","2","1": 1 3 1 3 1 1 3 1 1 2 ...

pclass == 1 is now correctly the highest level.
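A sketch of the recode with forcats (the exact call in the original may differ):

full <- full %>%
        mutate(pclass = fct_relevel(pclass, "3", "2", "1"))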

“name”

The “name” variable is stored in the form “Last Name, Title. First Name” (e.g., Braund, Mr. Owen Harris). We are mostly concerned with the last name and the title of each person. Last names can indicate whether people belong to the same family and are traveling together (hence staying in the same cabin). Titles can hint at the age of a person whose age is missing; for example, “Dr.” would indicate someone older and “Master” someone younger. We also convert all characters to lower case.

Before

| name |
|---|
| Braund, Mr. Owen Harris |
| Cumings, Mrs. John Bradley (Florence Briggs Thayer) |
| Heikkinen, Miss. Laina |
| Futrelle, Mrs. Jacques Heath (Lily May Peel) |
| Allen, Mr. William Henry |
| Moran, Mr. James |

After

| title | first_name | last_name |
|---|---|---|
| mr | owen harris | braund |
| mrs | john bradley (florence briggs thayer) | cumings |
| miss | laina | heikkinen |
| mrs | jacques heath (lily may peel) | futrelle |
| mr | william henry | allen |
| mr | james | moran |
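One way to express this split, assuming the column names shown above; the lazy regex captures the last name before the first comma and the title before the first period (a sketch, not necessarily the original code):

full <- full %>%
        mutate(name = str_to_lower(name)) %>%
        extract(name, into = c("last_name", "title", "first_name"),
                regex = "^(.*?), (.*?)\\. (.*)$")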

Title of a passenger

We examine the number of different titles in the dataset.

| title | count | missing_age | min age | max age |
|---|---|---|---|---|
| mr | 757 | 176 | 11.00 | 80.0 |
| miss | 260 | 50 | 0.17 | 63.0 |
| mrs | 197 | 27 | 14.00 | 76.0 |
| master | 61 | 8 | 0.33 | 14.5 |
| dr | 8 | 1 | 23.00 | 54.0 |
| rev | 8 | 0 | 27.00 | 57.0 |
| col | 4 | 0 | 47.00 | 60.0 |
| major | 2 | 0 | 45.00 | 52.0 |
| mlle | 2 | 0 | 24.00 | 24.0 |
| ms | 2 | 1 | 28.00 | 28.0 |
| capt | 1 | 0 | 70.00 | 70.0 |
| don | 1 | 0 | 40.00 | 40.0 |
| dona | 1 | 0 | 39.00 | 39.0 |
| jonkheer | 1 | 0 | 38.00 | 38.0 |
| lady | 1 | 0 | 48.00 | 48.0 |
| mme | 1 | 0 | 24.00 | 24.0 |
| sir | 1 | 0 | 49.00 | 49.0 |
| the countess | 1 | 0 | 33.00 | 33.0 |

We merge the titles to minimize the number of categories as follows.

Merging titles

| new_title | old_title | description |
|---|---|---|
| sir | mr; capt; col; don; major; rev; jonkheer; sir; dr and sex = male; master and age > 14.5 | Male, age > 14.5 |
| madam | mrs; dona; mlle; mme; dr and sex = female; ms; miss and age > 14.5; the countess; lady | Female, age > 14.5 |
| young_master | master and age <= 14.5; mr and age <= 14.5 | Male, age <= 14.5 |
| young_miss | miss and age <= 14.5 | Female, age <= 14.5 |

First, we convert every “master” with age > 14.5 to “sir”, every “miss” with age > 14.5 to “madam”, every “mr” with age <= 14.5 to “young_master”, every “mrs” with age <= 14.5 to “young_miss”, and male/female “dr” to “sir”/“madam”. Then we merge the remaining titles as in the table above using forcats::fct_collapse, as sketched below.
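A sketch of both steps, assuming the column names used here: case_when() for the age- and sex-based reassignments, then forcats::fct_collapse for the merge.

full <- full %>%
        mutate(title = case_when(
                title == "master" & age > 14.5  ~ "sir",
                title == "miss"   & age > 14.5  ~ "madam",
                title == "mr"     & age <= 14.5 ~ "young_master",
                title == "mrs"    & age <= 14.5 ~ "young_miss",
                title == "dr" & sex == "male"   ~ "sir",
                title == "dr" & sex == "female" ~ "madam",
                TRUE ~ title)) %>%   # titles with missing age keep their raw value
        mutate(title = fct_collapse(title,
                sir   = c("sir", "mr", "capt", "col", "don", "major",
                          "rev", "jonkheer"),
                madam = c("madam", "mrs", "dona", "mlle", "mme", "ms",
                          "the countess", "lady"),
                young_master = c("young_master", "master"),
                young_miss   = c("young_miss", "miss")))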

Dataset with merged titles

| title | count | missing_age | min age | max age |
|---|---|---|---|---|
| sir | 777 | 177 | 15.00 | 80.0 |
| madam | 365 | 28 | 15.00 | 76.0 |
| young_miss | 101 | 50 | 0.17 | 14.5 |
| young_master | 66 | 8 | 0.33 | 14.5 |

From this table, we can see that we have successfully merged the titles as described. We can now move on to the imputation of “age”, “embarked”, and “fare”. Imputation of the “cabin” variable will be dealt with separately.

Imputation of “age”, “embarked”, and “fare” with missForest::missForest

Before imputing with missForest::missForest:

Number of NAs in each dataset by variable

| type | passenger_id | pclass | name | sex | age | sib_sp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 0 | 0 | 0 | 0 | 177 | 0 | 0 | 0 | 0 | 687 | 2 |
| Test | 0 | 0 | 0 | 0 | 86 | 0 | 0 | 0 | 1 | 327 | 0 |
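A hedged sketch of the call; the exact set of columns passed to missForest is an assumption (missForest requires a data frame of numeric or factor columns, so identifier and free-text columns are set aside):

to_impute <- full %>%
        select(pclass, sex, title, age, sib_sp, parch, fare, embarked) %>%
        mutate(across(where(is.character), as.factor)) %>%
        as.data.frame()

imp <- missForest(to_impute)

# write the imputed values back to the merged dataset
full$age      <- imp$ximp$age
full$fare     <- imp$ximp$fare
full$embarked <- imp$ximp$embarked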
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!

After missForest::missForest imputation:

Number of NAs in each dataset by variable

| type | passenger_id | pclass | last_name | title | first_name | sex | sib_sp | parch | ticket | cabin | na | age | fare | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 687 | 0 | 0 | 0 | 0 |
| Test | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 327 | 0 | 0 | 0 | 0 |

missForest::missForest has imputed all the NAs in “age”, “embarked”, and “fare”. Now the only remaining variable with NAs is “cabin”.

“ticket”

Ticket numbers have some information that helps us with imputing the “cabin” variable.

Some ticket numbers

| ticket |
|---|
| A/5 21171 |
| PC 17599 |
| STON/O2. 3101282 |
| 113803 |
| 373450 |
| 330877 |

We see that ticket numbers end with a series of digits. Using regex, we can easily extract these numbers from the tickets. Tickets that contain no digits are assigned the number 99999 (see the sketch after the following table).

Extracted ticket numbers

| tix_num |
|---|
| 21171 |
| 17599 |
| 3101282 |
| 113803 |
| 373450 |
| 330877 |
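A sketch of the extraction with stringr; tickets without any digits come back as NA and fall back to 99999:

full <- full %>%
        mutate(tix_num = str_extract(ticket, "[0-9]+$"),   # trailing digits
               tix_num = replace_na(tix_num, "99999"),     # no digits -> 99999
               tix_num = as.numeric(tix_num))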

“family members”

We merge “sib_sp” and “parch” into “family_mem”, which is short for family members.
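A one-line sketch, assuming the merge is a simple sum:

full <- full %>% mutate(family_mem = sib_sp + parch)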

“cabin” and “deck”

Cabin is the variable with the most NAs. Let’s first extract the first letter of each available cabin number into the variable “deck”. We also change deck “T” into deck “A”: deck “T” is just another first-class deck, and there is only one deck “T” value in the whole dataset.

Creating the ‘deck’ variable from ‘cabin’

| cabin | deck |
|---|---|
| C85 | C |
| C123 | C |
| E46 | E |
| G6 | G |
| C103 | C |
| D56 | D |
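A sketch of both steps (not necessarily the original code):

full <- full %>%
        mutate(deck = str_sub(cabin, 1, 1),              # first letter of cabin
               deck = if_else(deck == "T", "A", deck))   # fold lone deck "T" into "A"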

We know that people who share a ticket number must be on the same deck. Therefore, if one member of a group with the same ticket number has a deck value, we assign that deck to the whole group.

In the example below, two people have ticket number 3; one of them is on deck “E”, so we assign deck “E” to the other person, whose deck value is missing. Two people also share ticket number 2, but since neither of them has a deck value, we impute their information later.

Before

| passenger_id | pclass | tix_num | deck |
|---|---|---|---|
| 1078 | 2 | 2 | NA |
| 1194 | 2 | 2 | NA |
| 773 | 2 | 3 | E |
| 842 | 2 | 3 | NA |
| 1062 | 3 | 251 | NA |
| 474 | 2 | 541 | D |

After

| passenger_id | pclass | tix_num | deck |
|---|---|---|---|
| 1078 | 2 | 2 | NA |
| 1194 | 2 | 2 | NA |
| 773 | 2 | 3 | E |
| 842 | 2 | 3 | E |
| 1062 | 3 | 251 | NA |
| 474 | 2 | 541 | D |
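The original post uses a custom function for this rule; tidyr offers a compact alternative (a sketch, not the author’s code):

full <- full %>%
        group_by(tix_num) %>%
        fill(deck, .direction = "downup") %>%   # propagate any known deck within the group
        ungroup()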

The following table shows the number of missing deck values in each “pclass”.

Number of decks and NAs by pclass

| deck | pclass 3 | pclass 2 | pclass 1 |
|---|---|---|---|
| A | 0 | 0 | 21 |
| B | 0 | 0 | 66 |
| C | 0 | 0 | 105 |
| D | 0 | 6 | 40 |
| E | 3 | 5 | 38 |
| F | 10 | 13 | 0 |
| G | 5 | 0 | 0 |
| NA | 691 | 253 | 53 |

From this table, we see that 997 NAs remain after the ticket-number imputation. For the rest of the missing decks, we randomly assign a deck based on the number of unique ticket numbers per deck in each “pclass”.

Number of unique decks and NAs by pclass

| deck | pclass 3 | pclass 2 | pclass 1 |
|---|---|---|---|
| A | 0 | 0 | 19 |
| B | 0 | 0 | 31 |
| C | 0 | 0 | 49 |
| D | 0 | 6 | 24 |
| E | 2 | 4 | 20 |
| F | 8 | 8 | 0 |
| G | 2 | 0 | 0 |
| NA | 534 | 174 | 43 |

The following method is used for the random sampling so that the existing deck ratios are preserved. For example, in class 3 there are 2 unique ticket numbers on deck E, 8 unique ticket numbers on deck F, and 2 unique ticket numbers on deck G, and there are 534 NAs in class 3. Hence, we assign decks E, F, and G in a 2:8:2 ratio across all 534 NAs. We follow the steps below:

  1. Determine the ratio of “decks” in each “pclass” to be randomly sampled.
  2. Impute the decks based on the calculated ratio; the resulting per-deck counts for each class are printed below.
## [[1]]
## [1]  6  9 15  7  6
## 
## [[2]]
## [1] 58 39 77
## 
## [[3]]
## [1]  89 356  89

Base R provides an elegant way to impute the decks according to the above ratios.
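For instance, the allocation can be drawn with rep() and sample() (a sketch for class 3, using the counts computed above):

counts  <- c(E = 89, F = 356, G = 89)          # per-deck allocation for class 3
decks_3 <- sample(rep(names(counts), counts))  # 534 shuffled deck labels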

Number of unique decks imputed by pclass

| deck | pclass 3 | pclass 2 | pclass 1 |
|---|---|---|---|
| A | 0 | 0 | 6 |
| B | 0 | 0 | 9 |
| C | 0 | 0 | 15 |
| D | 0 | 58 | 7 |
| E | 89 | 39 | 6 |
| F | 356 | 77 | 0 |
| G | 89 | 0 | 0 |

We see that the decks have been imputed according to the desired ratios. We now assign the unique tickets with imputed decks back to the full list, and again give all tickets sharing the same number the same deck, using the function we wrote earlier.

Total number of decks by pclass

| deck | pclass 3 | pclass 2 | pclass 1 |
|---|---|---|---|
| A | 0 | 0 | 28 |
| B | 0 | 0 | 80 |
| C | 0 | 0 | 124 |
| D | 1 | 85 | 47 |
| E | 125 | 65 | 44 |
| F | 466 | 127 | 0 |
| G | 117 | 0 | 0 |

We have now imputed all the data and can move on to regenerating the train/test datasets.

Regenerating the “train” and “test” Datasets

From the full dataset, we split using dplyr::group_split. The train dataset has 891 observations, while the test dataset has 418.

## [1] 891  14
## [1] 418  13
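A sketch of the split, assuming the “type” column labels rows “Test”/“Train” (group_split orders groups alphabetically, so “Test” comes first) and that the empty outcome column is dropped from the test set:

splits    <- full %>% group_split(type)
test_imp  <- splits[[1]] %>% select(-type, -survived)   # "Test"
train_imp <- splits[[2]] %>% select(-type)              # "Train"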

We have regenerated our “train” and “test” datasets. Now we can perform statistical learning.

Fitting Models

pred <- function(x) {
        # predict on the imputed test set and build the submission data frame
        data  <- predict(x, newdata = test_imp)
        final <- data.frame(PassengerId = test_imp$passenger_id,
                            Survived    = data)
        final
}

We wrote this “pred” function to generate the prediction data frame for submission from a model trained with caret.

compare <- function(x, y) {
        # rows where the two prediction sets disagree
        dat <- full_join(x, y, by = "PassengerId") %>%
                mutate(Survived.x = parse_number(as.character(Survived.x)),
                       Survived.y = parse_number(as.character(Survived.y)),
                       diff       = Survived.x - Survived.y) %>%
                filter(diff != 0) %>%
                as.data.frame()

        # passenger details for the disagreements, labelled by which model
        # predicted survival (assumes rows align with dat by passenger order)
        dat_1 <- test_imp %>%
                filter(passenger_id %in% dat$PassengerId) %>%
                mutate(surv = case_when(dat$diff < 0 ~ "model_2",
                                        dat$diff > 0 ~ "model_1")) %>%
                as.data.frame()

        list(dat, dat_1)
}

We wrote this function to compare the predictions generated by different models and to inspect the passengers on which two models disagree.

Logistic Regression

## 
## Call:  NULL
## 
## Coefficients:
##       (Intercept)            pclass2            pclass1         titlemadam  
##         -2.368023           1.107841           2.669288           3.306329  
## titleyoung_master    titleyoung_miss          sexfemale         family_mem  
##          3.111178           2.952375                 NA          -0.417483  
##               age               fare          embarkedC          embarkedQ  
##         -0.025187           0.004429           0.467970           0.207616  
##             deckB              deckC              deckD              deckE  
##         -0.326905          -0.362735           0.725383           0.566957  
##             deckF              deckG  
##          0.478093           0.578372  
## 
## Degrees of Freedom: 890 Total (i.e. Null);  874 Residual
## Null Deviance:       1187 
## Residual Deviance: 729.5     AIC: 763.5

Using logistic regression, our predictive accuracy is 0.76555. Note that the coefficient for sexfemale is NA: after the title merge, sex is perfectly determined by title, so the term is aliased and dropped by glm.
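A hedged sketch of the fit with caret; the formula is inferred from the coefficient names above, and survived is assumed to be a factor:

log_fit <- train(survived ~ pclass + title + sex + family_mem + age +
                         fare + embarked + deck,
                 data = train_imp, method = "glm", family = "binomial")
log_sub <- pred(log_fit)   # submission data frame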

Lasso and Ridge

Coefficients of Ridge model

## 19 x 1 sparse Matrix of class "dgCMatrix"
##                              1
## (Intercept)       -1.127418228
## (Intercept)        .          
## pclass2            0.305182786
## pclass1            0.546395191
## titlemadam         0.839601038
## titleyoung_master  0.815199012
## titleyoung_miss    0.495704094
## sexfemale          0.889401450
## family_mem        -0.070759390
## age               -0.009264014
## fare               0.002652996
## embarkedC          0.297908824
## embarkedQ          0.114297128
## deckB              0.250634363
## deckC              0.161502862
## deckD              0.316098153
## deckE              0.009686306
## deckF             -0.211160249
## deckG             -0.184288770

Coefficients of Lasso model

## 19 x 1 sparse Matrix of class "dgCMatrix"
##                               1
## (Intercept)       -1.8618977091
## (Intercept)        .           
## pclass2            0.4933583388
## pclass1            1.3739437774
## titlemadam         0.1041118049
## titleyoung_master  1.9448457562
## titleyoung_miss    .           
## sexfemale          2.4898806098
## family_mem        -0.1608056082
## age               -0.0039205005
## fare               0.0009603881
## embarkedC          0.1976752457
## embarkedQ          .           
## deckB              .           
## deckC              .           
## deckD              0.1648004146
## deckE              .           
## deckF              .           
## deckG              .

The proportion of correct responses for both the ridge and lasso methods is 0.78947.
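A minimal sketch of the two fits with glmnet (alpha = 0 gives ridge, alpha = 1 gives lasso; the lambda selection shown is an assumption):

x <- model.matrix(survived ~ pclass + title + sex + family_mem + age +
                          fare + embarked + deck, data = train_imp)
y <- train_imp$survived
ridge_cv <- cv.glmnet(x, y, family = "binomial", alpha = 0)
lasso_cv <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(lasso_cv, s = "lambda.min")   # coefficients at the CV-selected lambda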

Radial Kernel SVM

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 891 samples
##   8 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (17), scaled (17) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 802, 803, 802, 802, 802, 802, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.8182857  0.6093306
##   0.50  0.8216440  0.6139442
##   1.00  0.8193971  0.6039277
## 
## Tuning parameter 'sigma' was held constant at a value of 0.05554577
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05554577 and C = 0.5.

The proportion of correct responses for the Radial Kernel SVM method is 0.79904 (Top 12%).
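A hedged sketch reproducing the setup shown in the output above (10-fold cross-validation, centering and scaling; tuneLength = 3 yields C in {0.25, 0.5, 1}):

svm_fit <- train(survived ~ pclass + title + sex + family_mem + age +
                         fare + embarked + deck,
                 data = train_imp, method = "svmRadial",
                 preProcess = c("center", "scale"),
                 trControl  = trainControl(method = "cv", number = 10),
                 tuneLength = 3)
svm_sub <- pred(svm_fit)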

Conclusion

The different classification methods yield very close results. To improve accuracy, we would need a better strategy for imputing both the “age” and “deck” variables.