library("data.table")
library("janitor")
library("GGally")
library("missForest")
library("randomForest")
library("MASS")
library("class")
library("caret")
library("ModelMetrics")
library("tidyverse")
library("magrittr")
library("glmnet")
library("mice")
Formatting and data processing ideas are inspired by Bisaria’s post on Kaggle.
R base, tidyverse, statistical learning, data wrangling, string manipulation, functional programming, data imputation.
Kaggle is a website that posts challenges in the field of machine learning. For the Titanic challenge, participants are given two datasets, train and test. The train dataset contains the outcome; the test dataset does not. After generating predictions for the test dataset with a model trained on the train dataset, participants submit the results to the Kaggle website for scoring. The score for this challenge is the proportion of correct predictions.
This is an application of some of the statistical learning methods in ISLR. In particular, we use logistic regression, ridge regression, lasso regression, and radial SVM to predict the survival status (0 = perished, 1 = survived) of passengers in Kaggle’s Titanic dataset. The main challenge of this dataset lies in imputing the variables “age” and “deck”. Using missForest::missForest, we imputed the age variable. Using other clues (passengers sharing a ticket number share a deck, and certain ticket classes correspond to certain decks), we imputed all the deck values.
Of the methods we submitted to Kaggle, radial SVM was the most accurate, with a proportion of correct predictions of 0.79904 (top 12% as of 6/30/2020).
The code can be found on GitHub.
The “train” and “test” datasets were loaded and combined into a “full” dataset, which was used to impute NA values.
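A minimal sketch of this step, assuming the Kaggle files train.csv and test.csv sit in the working directory (janitor::clean_names produces the snake_case column names seen below):

train <- read_csv("train.csv") %>% clean_names()
test <- read_csv("test.csv") %>% clean_names()
# Stack both sets so imputation can use every available row;
# the "type" column lets us split them apart again later
full <- bind_rows(train %>% mutate(type = "Train"),
                  test %>% mutate(type = "Test"))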
| passenger_id | survived | pclass | name | sex | age | sib_sp | parch | ticket | fare | cabin | embarked | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NA | S | Train |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Train |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NA | S | Train |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S | Train |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NA | S | Train |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | NA | Q | Train |
The Data Dictionary is as below:
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No;1 = Yes |
| pclass | Ticket class | 1 = 1st;2 = 2nd;3 = 3rd |
| sex | Sex | NA |
| Age | Age in years | NA |
| sibsp | # of siblings / spouses aboard the Titanic | NA |
| parch | # of parents / children aboard the Titanic | NA |
| ticket | Ticket number | NA |
| fare | Passenger fare | NA |
| cabin | Cabin number | NA |
| embarked | Port of Embarkation | C = Cherbourg; Q = Queenstown; S = Southampton |
The data also comes with the following notes:
Variable Notes
pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way…
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way…
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch=0 for them.
We plot the pairwise scatter plot to have a cursory look at the whole dataset.
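A sketch of the plot, assuming GGally::ggpairs on a few illustrative columns of the train portion (the column choice is ours):

full %>%
  filter(type == "Train") %>%
  select(survived, pclass, sex, age, sib_sp, parch, fare) %>%
  ggpairs()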
We can see significant correlations between the predictor variables. Ridge and lasso regression should be robust against this multicollinearity.
R’s default factor ordering (by order of appearance) put pclass == 3 first, which does not reflect the class hierarchy, so we recode the variable so that pclass == 1 > pclass == 2 > pclass == 3. This change matters more for inference than for prediction.
Before
## Factor w/ 3 levels "3","1","2": 1 2 1 2 1 1 2 1 1 3 ...
After
## Factor w/ 3 levels "3","2","1": 1 3 1 3 1 1 3 1 1 2 ...
pclass == 1 is now correctly the highest level.
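For reference, a minimal sketch of the recoding (an explicit level vector is one way; forcats::fct_relevel works as well):

full <- full %>%
  mutate(pclass = factor(pclass, levels = c("3", "2", "1")))
str(full$pclass)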
The “name” variable is stored in the form “Last Name, Title. First Middle Name” (e.g., Braund, Mr. Owen Harris). We are mostly concerned with the last name and the title. Last names can indicate whether people are in the same family and traveling together (hence staying in the same cabin). The title can hint at a person’s age when age is missing: for example, “Dr.” suggests someone older, while “Master” suggests someone younger. We also convert all characters to lower case.
| name |
|---|
| Braund, Mr. Owen Harris |
| Cumings, Mrs. John Bradley (Florence Briggs Thayer) |
| Heikkinen, Miss. Laina |
| Futrelle, Mrs. Jacques Heath (Lily May Peel) |
| Allen, Mr. William Henry |
| Moran, Mr. James |
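A minimal sketch of the split, assuming tidyr::extract with a regex (namespaced, since magrittr also exports extract); the lazy .*? groups stop at the first comma and the first period:

full <- full %>%
  mutate(name = str_to_lower(name)) %>%
  tidyr::extract(name, into = c("last_name", "title", "first_name"),
                 regex = "^(.*?), (.*?)\\. (.*)$", remove = FALSE)

The extracted columns: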
| title | first_name | last_name |
|---|---|---|
| mr | owen harris | braund |
| mrs | john bradley (florence briggs thayer) | cumings |
| miss | laina | heikkinen |
| mrs | jacques heath (lily may peel) | futrelle |
| mr | william henry | allen |
| mr | james | moran |
We examine the number of different titles in the dataset.
| title | count | missing_age | min_age | max_age |
|---|---|---|---|---|
| mr | 757 | 176 | 11.00 | 80.0 |
| miss | 260 | 50 | 0.17 | 63.0 |
| mrs | 197 | 27 | 14.00 | 76.0 |
| master | 61 | 8 | 0.33 | 14.5 |
| dr | 8 | 1 | 23.00 | 54.0 |
| rev | 8 | 0 | 27.00 | 57.0 |
| col | 4 | 0 | 47.00 | 60.0 |
| major | 2 | 0 | 45.00 | 52.0 |
| mlle | 2 | 0 | 24.00 | 24.0 |
| ms | 2 | 1 | 28.00 | 28.0 |
| capt | 1 | 0 | 70.00 | 70.0 |
| don | 1 | 0 | 40.00 | 40.0 |
| dona | 1 | 0 | 39.00 | 39.0 |
| jonkheer | 1 | 0 | 38.00 | 38.0 |
| lady | 1 | 0 | 48.00 | 48.0 |
| mme | 1 | 0 | 24.00 | 24.0 |
| sir | 1 | 0 | 49.00 | 49.0 |
| the countess | 1 | 0 | 33.00 | 33.0 |
We merge the titles to minimize the number of categories as follows.
| new_title | old_title | description |
|---|---|---|
| sir | mr, capt, col, don, major, rev, jonkheer, sir, dr and sex = male, master and age > 14.5 | Male, age > 14.5 |
| madam | mrs, dona, mlle, mme, dr and sex = female, ms, miss and age > 14.5, the countess, lady | Female, age > 14.5 |
| young_master | master and age <= 14.5, mr and age <=14.5 | Male, age <= 14.5 |
| young_miss | miss and age <= 14.5 | Female, age <= 14.5 |
First, we convert every “master” with age > 14.5 to “sir”, every “miss” with age > 14.5 to “madam”, every “mr” with age <= 14.5 to “young_master”, every “mrs” with age <= 14.5 to “young_miss”, and male/female “dr” to “sir”/“madam”. Then we merge the remaining titles as above using forcats::fct_collapse, as sketched below.
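A sketch of these two steps, assuming dplyr::case_when for the age/sex conversions and forcats::fct_collapse for the merge (rows with missing age fall through the case_when and land in the appropriate bucket via the collapse):

full <- full %>%
  mutate(title = case_when(
    title == "master" & age > 14.5 ~ "sir",
    title == "miss" & age > 14.5 ~ "madam",
    title == "mr" & age <= 14.5 ~ "young_master",
    title == "mrs" & age <= 14.5 ~ "young_miss",
    title == "dr" & sex == "male" ~ "sir",
    title == "dr" & sex == "female" ~ "madam",
    TRUE ~ title
  )) %>%
  mutate(title = fct_collapse(factor(title),
    sir = c("sir", "mr", "capt", "col", "don", "major", "rev", "jonkheer"),
    madam = c("madam", "mrs", "dona", "mlle", "mme", "ms", "the countess", "lady"),
    young_master = c("young_master", "master"),
    young_miss = c("young_miss", "miss")
  ))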
| title | count | missing_age | min_age | max_age |
|---|---|---|---|---|
| sir | 777 | 177 | 15.00 | 80.0 |
| madam | 365 | 28 | 15.00 | 76.0 |
| young_miss | 101 | 50 | 0.17 | 14.5 |
| young_master | 66 | 8 | 0.33 | 14.5 |
From this table, we can see that we have successfully merged the titles as described. We can now move on to the imputation of “age”, “embarked”, and “fare”. Imputation of the “cabin” variable will be dealt with separately.
Before imputing with missForest::missForest, the NA counts in each dataset are:
| type | passenger_id | pclass | name | sex | age | sib_sp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 0 | 0 | 0 | 0 | 177 | 0 | 0 | 0 | 0 | 687 | 2 |
| Test | 0 | 0 | 0 | 0 | 86 | 0 | 0 | 0 | 1 | 327 | 0 |
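A hedged sketch of the call; the exact column selection is an assumption (“cabin” is set aside since it is imputed separately, and character columns must be factors for missForest):

to_impute <- full %>%
  select(pclass, sex, age, sib_sp, parch, fare, embarked, title) %>%
  mutate(across(where(is.character), as.factor)) %>%
  as.data.frame()
set.seed(1)
imp <- missForest(to_impute)
# Copy the imputed columns back into the full dataset
full <- full %>%
  mutate(age = imp$ximp$age,
         fare = imp$ximp$fare,
         embarked = imp$ximp$embarked)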
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
After missForest::missForest imputation:
| type | passenger_id | pclass | last_name | title | first_name | sex | sib_sp | parch | ticket | cabin | na | age | fare | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 687 | 0 | 0 | 0 | 0 |
| Test | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 327 | 0 | 0 | 0 | 0 |
missForest::missForest has imputed all the NAs in “age”, “embarked”, and “fare”. Now the only remaining variable with NAs is “cabin”.
Ticket numbers carry information that helps us impute the “cabin” variable.
| ticket |
|---|
| A/5 21171 |
| PC 17599 |
| STON/O2. 3101282 |
| 113803 |
| 373450 |
| 330877 |
We see that ticket numbers end in a series of digits. Using a regex, we can extract these numbers from the tickets. For tickets that have no numbers, we assign the number 99999.
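A sketch of the extraction, assuming stringr (str_extract returns NA when a ticket has no trailing digits, and the NA is replaced with 99999):

full <- full %>%
  mutate(tix_num = str_extract(ticket, "[0-9]+$"),
         tix_num = as.numeric(replace_na(tix_num, "99999")))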
| tix_num |
|---|
| 21171 |
| 17599 |
| 3101282 |
| 113803 |
| 373450 |
| 330877 |
We merge “sib_sp” and “parch” into “family_mem”, which is short for family members.
Cabin is the variable with the most NAs. Let’s first extract the first letter of the available cabin numbers into the variable “deck”. We also change deck “T” into deck “A”: deck “T” is just another first-class deck, and it appears only once in the whole dataset. Both steps are sketched below.
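A sketch of both steps (family_mem is simply the sum of the two counts; str_sub leaves NA cabins as NA decks):

full <- full %>%
  mutate(family_mem = sib_sp + parch,
         deck = str_sub(cabin, 1, 1),
         deck = if_else(deck == "T", "A", deck))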
| cabin | deck |
|---|---|
| C85 | C |
| C123 | C |
| E46 | E |
| G6 | G |
| C103 | C |
| D56 | D |
We know that people who share a ticket number must be on the same deck. Therefore, if a group of people share a ticket number and one of them has a deck value, we assign that deck to the whole group.
In the example below, two people have ticket number 3; one of them is on deck “E”, so we assign deck “E” to the other person, whose deck value is missing. Two people also share ticket number 2, but since neither of them has a deck value, we impute their information later.
| passenger_id | pclass | tix_num | deck |
|---|---|---|---|
| 1078 | 2 | 2 | NA |
| 1194 | 2 | 2 | NA |
| 773 | 2 | 3 | E |
| 842 | 2 | 3 | NA |
| 1062 | 3 | 251 | NA |
| 474 | 2 | 541 | D |
| passenger_id | pclass | tix_num | deck |
|---|---|---|---|
| 1078 | 2 | 2 | NA |
| 1194 | 2 | 2 | NA |
| 773 | 2 | 3 | E |
| 842 | 2 | 3 | E |
| 1062 | 3 | 251 | NA |
| 474 | 2 | 541 | D |
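A sketch of this group-wise fill; the helper name is ours, and we reuse it after the random imputation below:

impute_deck_by_ticket <- function(df) {
  df %>%
    group_by(tix_num) %>%
    # coalesce keeps existing decks and fills NAs with the first
    # non-missing deck observed for the same ticket number
    mutate(deck = coalesce(deck, first(na.omit(deck)))) %>%
    ungroup()
}
full <- impute_deck_by_ticket(full)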
The following table shows the distribution of deck values, including NAs, in each “pclass”.
| deck | pclass 3 | pclass 2 | pclass 1 |
|---|---|---|---|
| A | 0 | 0 | 21 |
| B | 0 | 0 | 66 |
| C | 0 | 0 | 105 |
| D | 0 | 6 | 40 |
| E | 3 | 5 | 38 |
| F | 10 | 13 | 0 |
| G | 5 | 0 | 0 |
| NA | 691 | 253 | 53 |
From this table, we see that 997 NAs (691 + 253 + 53) remain after the ticket-number imputation. For the rest of the missing decks, we randomly assign deck values based on the counts of unique ticket numbers per deck within each “pclass”, shown below.
| deck | pclass 3 | pclass 2 | pclass 1 |
|---|---|---|---|
| A | 0 | 0 | 19 |
| B | 0 | 0 | 31 |
| C | 0 | 0 | 49 |
| D | 0 | 6 | 24 |
| E | 2 | 4 | 20 |
| F | 8 | 8 | 0 |
| G | 2 | 0 | 0 |
| NA | 534 | 174 | 43 |
The following method is used for the random sampling so that the existing ratios are preserved. For example, in class 3 there are 2 unique ticket numbers on deck E, 8 on deck F, and 2 on deck G, and there are 534 NAs among the class-3 unique tickets. Hence, we allocate the 534 NAs to decks E:F:G in the ratio 2:8:2. We follow the steps below:
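A sketch of the allocation arithmetic; the counts are read off the unique-ticket table above, and in this case the rounded allocations happen to sum exactly to the NA counts:

alloc <- function(counts, n_na) round(n_na * counts / sum(counts))
list(
  alloc(c(19, 31, 49, 24, 20), 43),  # pclass 1: decks A, B, C, D, E
  alloc(c(6, 4, 8), 174),            # pclass 2: decks D, E, F
  alloc(c(2, 8, 2), 534)             # pclass 3: decks E, F, G
)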
## [[1]]
## [1] 6 9 15 7 6
##
## [[2]]
## [1] 58 39 77
##
## [[3]]
## [1] 89 356 89
Base R provides an elegant way to impute the decks according to the above allocations.
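A sketch of the idea for class 3 (rep builds a vector with each deck repeated by its allocated count, and sample shuffles it before assignment to the unique tickets with missing deck; classes 1 and 2 are handled the same way):

set.seed(1)
decks_3 <- sample(rep(c("E", "F", "G"), times = c(89, 356, 89)))
length(decks_3)  # 534, one deck per class-3 unique ticket with NA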
| deck | pclass 3 | pclass 2 | pclass 1 |
|---|---|---|---|
| A | 0 | 0 | 6 |
| B | 0 | 0 | 9 |
| C | 0 | 0 | 15 |
| D | 0 | 58 | 7 |
| E | 89 | 39 | 6 |
| F | 356 | 77 | 0 |
| G | 89 | 0 | 0 |
We see that the decks have been imputed according to the desired ratios. We now assign the unique tickets with imputed decks back to the full list, and again give all tickets sharing the same number the same deck, using the function we wrote earlier.
| deck | pclass 3 | pclass 2 | pclass 1 |
|---|---|---|---|
| A | 0 | 0 | 28 |
| B | 0 | 0 | 80 |
| C | 0 | 0 | 124 |
| D | 1 | 85 | 47 |
| E | 125 | 65 | 44 |
| F | 466 | 127 | 0 |
| G | 117 | 0 | 0 |
We have now imputed all the data and can move on to regenerating the train/test datasets.
From the full dataset, we split using dplyr::group_split. The train dataset has 891 observations, while the test dataset has 418.
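A sketch of the split; test_imp is the name the helper functions below expect, while train_imp is our label. group_split orders groups by the grouping value, so "Test" comes before "Train":

splits <- full %>% group_split(type)
test_imp <- splits[[1]] %>% select(-type, -survived)
train_imp <- splits[[2]] %>% select(-type)
dim(train_imp)
dim(test_imp)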
## [1] 891 14
## [1] 418 13
We have regenerated our “train” and “test” datasets. Now we can perform statistical learning.
pred <- function(x) {
  # Predict on the imputed test set and build the Kaggle submission frame
  data <- predict(x, newdata = test_imp)
  data.frame(PassengerId = test_imp$passenger_id, Survived = data)
}
We use this “pred” function to generate the prediction data frame for submission from a caret model.
compare <- function(x, y) {
  # Join two submission frames and keep the passengers where they disagree
  dat <- full_join(x, y, by = "PassengerId") %>%
    mutate(Survived.x = parse_number(as.character(Survived.x)),
           Survived.y = parse_number(as.character(Survived.y)),
           diff = Survived.x - Survived.y) %>%
    filter(diff != 0) %>%
    as.data.frame()
  # Pull those passengers from the test set and flag which model
  # predicted survival, joining by id so rows cannot be misaligned
  dat_1 <- test_imp %>%
    inner_join(dat, by = c("passenger_id" = "PassengerId")) %>%
    mutate(surv = if_else(diff < 0, "model_2", "model_1")) %>%
    as.data.frame()
  list(dat, dat_1)
}
We use this “compare” function to examine the differences between predictions generated by different methods.
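We start with logistic regression. A hedged sketch of the fit, assuming caret::train with method = "glm" and 10-fold cross-validation (the object names are ours; the predictor list mirrors the coefficients printed below):

ctrl <- trainControl(method = "cv", number = 10)
fit_glm <- train(factor(survived) ~ pclass + title + sex + family_mem +
                   age + fare + embarked + deck,
                 data = train_imp, method = "glm",
                 family = "binomial", trControl = ctrl)
fit_glm$finalModel  # prints the coefficients below
submission_glm <- pred(fit_glm)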
##
## Call: NULL
##
## Coefficients:
## (Intercept) pclass2 pclass1 titlemadam
## -2.368023 1.107841 2.669288 3.306329
## titleyoung_master titleyoung_miss sexfemale family_mem
## 3.111178 2.952375 NA -0.417483
## age fare embarkedC embarkedQ
## -0.025187 0.004429 0.467970 0.207616
## deckB deckC deckD deckE
## -0.326905 -0.362735 0.725383 0.566957
## deckF deckG
## 0.478093 0.578372
##
## Degrees of Freedom: 890 Total (i.e. Null); 874 Residual
## Null Deviance: 1187
## Residual Deviance: 729.5 AIC: 763.5
Using logistic regression, our predictive accuracy is 0.76555.
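Next, ridge and lasso. A hedged sketch assuming glmnet::cv.glmnet on a model matrix (alpha = 0 gives ridge, alpha = 1 gives lasso; the choice of s = "lambda.min" is an assumption):

x <- model.matrix(survived ~ pclass + title + sex + family_mem + age +
                    fare + embarked + deck, data = train_imp)
y <- factor(train_imp$survived)
cv_ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0)
cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cv_ridge, s = "lambda.min")  # the 19 x 1 sparse matrix below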
Coefficients of Ridge model
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.127418228
## (Intercept) .
## pclass2 0.305182786
## pclass1 0.546395191
## titlemadam 0.839601038
## titleyoung_master 0.815199012
## titleyoung_miss 0.495704094
## sexfemale 0.889401450
## family_mem -0.070759390
## age -0.009264014
## fare 0.002652996
## embarkedC 0.297908824
## embarkedQ 0.114297128
## deckB 0.250634363
## deckC 0.161502862
## deckD 0.316098153
## deckE 0.009686306
## deckF -0.211160249
## deckG -0.184288770
Coefficients of Lasso model
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.8618977091
## (Intercept) .
## pclass2 0.4933583388
## pclass1 1.3739437774
## titlemadam 0.1041118049
## titleyoung_master 1.9448457562
## titleyoung_miss .
## sexfemale 2.4898806098
## family_mem -0.1608056082
## age -0.0039205005
## fare 0.0009603881
## embarkedC 0.1976752457
## embarkedQ .
## deckB .
## deckC .
## deckD 0.1648004146
## deckE .
## deckF .
## deckG .
The proportion of correct responses for both the ridge and lasso methods is 0.78947.
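Finally, the radial SVM. A hedged sketch assuming caret’s "svmRadial" method with the centering/scaling and 10-fold CV shown in the summary below:

fit_svm <- train(factor(survived) ~ pclass + title + sex + family_mem +
                   age + fare + embarked + deck,
                 data = train_imp, method = "svmRadial",
                 preProcess = c("center", "scale"),
                 trControl = trainControl(method = "cv", number = 10))
submission_svm <- pred(fit_svm)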
## Support Vector Machines with Radial Basis Function Kernel
##
## 891 samples
## 8 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (17), scaled (17)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 802, 803, 802, 802, 802, 802, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.8182857 0.6093306
## 0.50 0.8216440 0.6139442
## 1.00 0.8193971 0.6039277
##
## Tuning parameter 'sigma' was held constant at a value of 0.05554577
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05554577 and C = 0.5.
The proportion of correct responses for the Radial Kernel SVM method is 0.79904 (Top 12%).
The different classification methods yield very close results. To improve accuracy, we would need a better strategy for imputing both the “age” and “deck” variables.
Comments
The biggest issue with our dataset is the NAs. From the following table, we can see that “age” and “cabin” have the most NAs.
Strategies for “age” and “cabin” imputation are discussed individually below. The imputation is performed on a merged dataset of both “test” and “train”.