PART 1: Problem Description

For this project I analyzed the Titanic data set obtained from Kaggle. The data set records whether each of 891 passengers survived or perished, along with several variables: age, sex, passenger class, whether they had family on board the ship, their ticket number, the fare they paid, where they boarded, and the location of their cabin. To tackle this problem I will apply the conditional inference tree algorithm to train classification models of survival using several of the passengers' traits. The resulting models will be evaluated using balanced accuracy on my own validation data set, and ultimately I will submit the model predictions to the Kaggle competition's public leaderboard to receive a score.

PART 2: Analysis Overview

To approach the problem of determining who survived or perished in the Titanic disaster, I broke the analysis into several parts: an initial exploratory analysis, followed by modeling with conditional inference trees from the party package in R. The exploratory analysis consisted of plotting each variable against the 'Survived' field to look for clear relationships. I used the insights from that work to determine which variables to include in an initial model. To keep the model simple, I settled on three variables: age ('Age'), sex ('Sex'), and passenger class ('Pclass'). I ignored any variables that had a large number of missing values.

Initial Data Loading

library(knitr)         # used to render this report
library(caret)         # createDataPartition() and confusionMatrix()
library(party)         # ctree() conditional inference trees
library(randomForest)  # random forests (not used in this section)

train_dat <- read.csv('~/Desktop/UW_Coursera/titanic_project/data/train.csv', header=TRUE)

# Recode the 0/1 outcome as a labeled factor for modeling and reporting
train_dat$Survived <- factor(train_dat$Survived, levels=c(0,1), labels=c('DIED', 'SURVIVED'))

I split the training data into a model-training set and a validation set, and withhold Kaggle's test set for evaluating the final models. To make sure we get a good balance of people who survived and perished, the sampling is stratified on the 'Survived' variable, which maintains the class ratio observed in the full data set.

set.seed(42)

# Stratified 75/25 split on 'Survived' to preserve the class balance
train_idx <- createDataPartition(train_dat$Survived, p=0.75, list=FALSE, times=1)

model_dat <- train_dat[train_idx,]   # 75% used to train the models
val_dat <- train_dat[-train_idx,]    # 25% held out for validation

Exploratory Analysis

Exploratory plots of each variable vs survival are included below.
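The original figures are not reproduced in this document; as a minimal sketch, plots along these lines could generate them with base R graphics (the panel layout and plot types are my assumptions):

par(mfrow=c(2,4))

# Mosaic plots of the categorical/count variables against survival
for (v in c('Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked')) {
  mosaicplot(table(model_dat[[v]], model_dat$Survived),
             main=v, xlab=v, ylab='Survived', color=TRUE)
}

# Box plots of the continuous variables against survival
boxplot(Fare ~ Survived, data=model_dat, main='Fare')
boxplot(Age ~ Survived, data=model_dat, main='Age')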

It's apparent here that 'Pclass', 'Sex', and 'Fare' have some relationship with survival. Values of 2 or more in the 'SibSp' and 'Parch' variables, indicating people traveling with larger families, also appear to correspond to lower survival, although observations with such values are rare. The 'Embarked' variable also has some relationship to survival, although mainly for people who boarded in Cherbourg ('C'). Age does not appear to have any significant relationship with survival, which is surprising given that lifeboats were supposed to be boarded "Women and Children First".

PART 3: Initial Solution Explanation

My initial solution for predicting which passengers survived is to train a conditional inference tree using the 'Age', 'Sex', and passenger class ('Pclass') variables. Conditional inference trees are similar to the rpart (CART) algorithm, except that they use a statistical test to choose each split and use the p-value of that test (with corrections for multiple testing, e.g. Bonferroni) to determine when to stop splitting. I will estimate the model's performance on the 25% validation set I created from the training set, using the metrics reported by caret's confusionMatrix() function.

par(mfrow=c(1,1))  # reset the plotting layout to a single panel

# Fit a conditional inference tree with Bonferroni-adjusted split tests
simple_survival_ctree <- ctree(Survived ~ Age + Sex + Pclass, data=model_dat, 
                               controls=ctree_control(testtype="Bonferroni"))

plot(simple_survival_ctree)

# Class predictions on the held-out validation set
val_dat$ctree_preds <- predict(simple_survival_ctree, newdata=val_dat, type='response')

PART 4: Initial Solution Analysis

This model returned a balanced accuracy of 0.7280. The tree logic can be interpreted with relative ease: for example, the left-most node says "if you were a female in first or second class, you had a very high probability of surviving the disaster". On the other side of the tree, the far-right node suggests "if you were a male over the age of 9 who wasn't in first class, the odds that you survived are very low". This model appears to be a slightly more complex variation of "Women and Children First", where "children" means very young (9 years old or younger). The model also suggests, however, that a child who wasn't in first or second class had low odds of surviving.

Submitting the test-set predictions to Kaggle yielded a score of 0.76077 on the public leaderboard, which is actually lower than the naive "all women survive and all men die" benchmark model they provide. That's a bit disappointing, but the model identified only 40 of the 85 survivors in the validation set, which explains the low sensitivity and the weak leaderboard score.
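For reference, the Kaggle submission could be generated along the lines below (a sketch: the test-set path mirrors the training path above, and the output file name is my own; neither appears in the original):

test_dat <- read.csv('~/Desktop/UW_Coursera/titanic_project/data/test.csv', header=TRUE)

# Predict on the test set and convert the factor labels back to Kaggle's 0/1 coding
test_preds <- predict(simple_survival_ctree, newdata=test_dat, type='response')
submission <- data.frame(PassengerId = test_dat$PassengerId,
                         Survived = ifelse(test_preds == 'SURVIVED', 1, 0))
write.csv(submission, 'ctree_submission.csv', row.names=FALSE)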

confusionMatrix(val_dat$ctree_preds, val_dat$Survived, positive="SURVIVED")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction DIED SURVIVED
##   DIED      135       45
##   SURVIVED    2       40
##                                           
##                Accuracy : 0.7883          
##                  95% CI : (0.7286, 0.8401)
##     No Information Rate : 0.6171          
##     P-Value [Acc > NIR] : 3.488e-08       
##                                           
##                   Kappa : 0.5044          
##  Mcnemar's Test P-Value : 8.993e-10       
##                                           
##             Sensitivity : 0.4706          
##             Specificity : 0.9854          
##          Pos Pred Value : 0.9524          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.3829          
##          Detection Rate : 0.1802          
##    Detection Prevalence : 0.1892          
##       Balanced Accuracy : 0.7280          
##                                           
##        'Positive' Class : SURVIVED        
## 
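The balanced accuracy reported above is simply the mean of the sensitivity and specificity, which prevents the class imbalance (more deaths than survivals) from inflating the score:

# Balanced accuracy = (sensitivity + specificity) / 2
(0.4706 + 0.9854) / 2
## [1] 0.728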

We can quickly look at which passengers the model is misclassifying.
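The inspection code is not shown here; a minimal sketch might subset the disagreements directly:

# Validation rows where the prediction disagrees with the actual outcome
missed <- val_dat[val_dat$ctree_preds != val_dat$Survived,
                  c('Survived', 'Pclass', 'Sex', 'Age', 'Fare')]
head(missed)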

PART 5: Revised Solution and Analysis

To improve the model, I will include information on family size, as well as the fare each passenger paid. The rationale behind including the 'Parch' and 'SibSp' variables is that people with larger families may have been less likely to survive, since they were trying to keep track of family members in the commotion. The exploratory data analysis supports this assumption, although these variables were not included in the original model due to the small number of observations with higher values. 'Fare' was included solely because, in the exploratory data analysis, people who survived appear to have paid higher fares on average.

# Extend the tree with fare and family-size variables
better_survival_ctree <- ctree(Survived ~ Age + Sex + Pclass + Fare + Parch + SibSp, data=model_dat, 
                               controls=ctree_control(testtype="Bonferroni"))

plot(better_survival_ctree)

# Class predictions from the revised model on the same validation set
val_dat$v2_ctree_preds <- predict(better_survival_ctree, newdata=val_dat, type='response')

This model did markedly better, with a leaderboard score of 0.79426 and a balanced accuracy of 0.8444. In addition, the model correctly identified roughly 78% of the survivors (66 of the 85 in the validation set). Most of the gains in predicting survival come from its ability to classify surviving women who were traveling in third class. In fact, the 19 'missed' survivors are all male (see the quick check after the confusion matrix below). It is clear, then, that the model classifies female passengers quite well, but additional work would be required to handle the males.

confusionMatrix(val_dat$v2_ctree_preds, val_dat$Survived, positive="SURVIVED")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction DIED SURVIVED
##   DIED      125       19
##   SURVIVED   12       66
##                                           
##                Accuracy : 0.8604          
##                  95% CI : (0.8077, 0.9031)
##     No Information Rate : 0.6171          
##     P-Value [Acc > NIR] : 1.076e-15       
##                                           
##                   Kappa : 0.6998          
##  Mcnemar's Test P-Value : 0.2812          
##                                           
##             Sensitivity : 0.7765          
##             Specificity : 0.9124          
##          Pos Pred Value : 0.8462          
##          Neg Pred Value : 0.8681          
##              Prevalence : 0.3829          
##          Detection Rate : 0.2973          
##    Detection Prevalence : 0.3514          
##       Balanced Accuracy : 0.8444          
##                                           
##        'Positive' Class : SURVIVED        
## 
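The claim that the missed survivors are all male can be verified with a quick subset (a sketch, not part of the original code):

# Survivors that the revised model predicted as 'DIED'
missed_v2 <- val_dat[val_dat$v2_ctree_preds == 'DIED' & val_dat$Survived == 'SURVIVED', ]
table(missed_v2$Sex)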

PART 6: Conclusions

Two conditional inference trees were trained to recognize survivors of the Titanic disaster using basic passenger characteristics, and they performed reasonably well, ultimately returning a public leaderboard score of 0.79426 on Kaggle, corresponding to a rank of 1201/4431. Unfortunately, both models misclassified a number of male survivors, and even splitting the data by sex and training separate models for males and females did not increase accuracy for men. It is not clear whether a more complex algorithm is needed, or whether more feature engineering of the 'Cabin' and 'Name' fields is required to recognize those passengers. However, given the simplicity and interpretability of the ctree models for this task, I was relatively pleased with the model performance and the Kaggle score.