Titanic Kaggle Notes

30/11/2014

Steve Burr

Introduction

-My top score was 0.79426 on Kaggle

-Apparently anything in the 0.79-0.81 range is pretty good, and anything above that is exceptional / cheating

-Did anyone else do better?

-Did anyone else do anything?

The rest of this presentation explains what I did with R code for those interested

It's heavily inspired by the guide at: http://trevorstephens.com/post/72916401642/titanic-getting-started-with-r

I also made use of the book "Machine Learning with R" by Brett Lantz

Load all packages that I used

library(randomForest)

## Warning: package 'randomForest' was built under R version 3.1.2

## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

library(caret)

## Warning: package 'caret' was built under R version 3.1.2

## Loading required package: lattice
## Loading required package: ggplot2

Data read in

First I read in the test / training datasets provided and then also created a combined dataset
This was so that variables held as a factor had the same number of levels (important so R doesn't get upset)

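#Read in the data (read.csv already returns a data frame)
#stringsAsFactors = TRUE keeps text columns as factors, which is no longer the default from R 4.0
train<-read.csv("train.csv", stringsAsFactors = TRUE)
test<-read.csv("test.csv", stringsAsFactors = TRUE)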
test$Survived<-NA
#Create a total dataset for sorting out features (so they have consistent factors across data)
total<-rbind(train,test)
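
As a small illustration of why the combined dataset helps, here is a sketch using made-up data frames (toy_train and toy_test are hypothetical, not the Kaggle files). A factor built separately on each set can end up with different levels, while one built on the combined rows and then split keeps a shared set of levels.

```r
# Hypothetical toy data, not the Kaggle files
toy_train <- data.frame(Embarked = c("S", "C", "S"), stringsAsFactors = FALSE)
toy_test  <- data.frame(Embarked = c("S", "Q"), stringsAsFactors = FALSE)

# Factors built separately have different level sets
sep_train <- factor(toy_train$Embarked)
sep_test  <- factor(toy_test$Embarked)
levels(sep_train)  # "C" "S"
levels(sep_test)   # "Q" "S"

# Building the factor on the combined rows gives one shared level set
toy_total <- rbind(toy_train, toy_test)
toy_total$Embarked <- factor(toy_total$Embarked)
comb_train <- toy_total[1:3, , drop = FALSE]
comb_test  <- toy_total[4:5, , drop = FALSE]
identical(levels(comb_train$Embarked), levels(comb_test$Embarked))  # TRUE
```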

Variable creation 1

Someone's title may provide additional information about their status, over and above ticket class, which may be predictive of survival
Once the title was stripped out of the name attribute, I simplified it into what I thought were sensible groupings (I did this differently to the online guide)

#Get Title
total$Name<-as.character(total$Name)
total$Title <- sapply(total$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
total$Title <- sub(' ', '', total$Title)

#Simplify Titles
total$Title[total$Title %in% c('Mme', 'Mlle','Ms')] <- 'Mrs'
total$Title[total$Title %in% c('Capt', 'Don', 'Major', 'Sir','Col','Dr','Rev')] <- 'Sir'
total$Title[total$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'

#Code as a factor
total$Title <-as.factor(total$Title)
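
To make the splitting step concrete, here is a short sketch applied to a single name in the dataset's "Surname, Title. Forenames" format (the passenger shown in the missing fare output). The variable names nm, parts and title are only for illustration.

```r
# One name in the format used by the dataset
nm <- "Storey, Mr. Thomas"

# Split on either a comma or a full stop
parts <- strsplit(nm, split = "[,.]")[[1]]
parts  # "Storey" " Mr" " Thomas"

# The title is the second piece; drop its leading space
title <- sub(" ", "", parts[2])
title  # "Mr"
```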

Variable creation 2

I also produced "Total Family Size" based on number of Siblings/Spouses + Parents/Children + 1

#Total Family Size
total$FamilySize <- total$SibSp + total$Parch + 1

Missing value imputation 1

There are missing values in the data. I attempted to impute sensible values before applying models (though depending on the technique used, this could be either vital or not required at all)

total[is.na(total$Fare),]

##      PassengerId Survived Pclass               Name  Sex  Age SibSp Parch
## 1044        1044       NA      3 Storey, Mr. Thomas male 60.5     0     0
##      Ticket Fare Cabin Embarked Title FamilySize
## 1044   3701   NA              S    Mr          1

#There is one missing fare from Pclass 3
#Replace missing value with the median fare for Pclass 3
total$Fare[is.na(total$Fare)]<-median(total$Fare[total$Pclass==3],na.rm=TRUE)

#Make missing sex = male as majority class 
total$Sex[is.na(total$Sex)]<-"male"

Missing value imputation 2

Most variables are categorical, so most models will treat them as a series of binary (0/1) variables. I therefore scaled fare to sit on the same 0 to 1 range
It's not normally distributed, so I opted to scale based on max / min

total$FareScale <- 1 + (total$Fare - max(total$Fare))/(max(total$Fare)-min(total$Fare))
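
The expression above is the usual min / max scaling written in a slightly different form. A quick sketch with made-up fares (the values in fare are hypothetical) checks that it matches (x - min) / (max - min) and lands in the 0 to 1 range.

```r
# Hypothetical fares, for illustration only
fare <- c(0, 7.25, 71.28, 512.33)

# Form used in these notes
scaled_notes <- 1 + (fare - max(fare)) / (max(fare) - min(fare))

# Textbook min / max form
scaled_minmax <- (fare - min(fare)) / (max(fare) - min(fare))

all.equal(scaled_notes, scaled_minmax)  # TRUE
range(scaled_notes)                     # 0 1
```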

Missing value imputation 3

Age is a really important variable with a lot of missing values
I decided to use a (fairly bad) regression model to impute them
However, I've previously just used median values by title with pretty successful results (similar final scores)

total$Pclass<-as.factor(total$Pclass)
NonMising <- total[!is.na(total$Age),]
Age.reg<-lm(Age ~ Pclass + SibSp + Title ,data = NonMising)

summary(Age.reg)[4]

## $coefficients
##             Estimate Std. Error  t value  Pr(>|t|)
## (Intercept)   39.852     5.5243   7.2140 1.047e-12
## Pclass2       -9.261     0.9533  -9.7147 2.066e-21
## Pclass3      -12.147     0.8509 -14.2762 2.394e-42
## SibSp         -1.409     0.4159  -3.3876 7.316e-04
## TitleMaster  -21.070     5.8207  -3.6198 3.091e-04
## TitleMiss     -9.191     5.6060  -1.6396 1.014e-01
## TitleMr        1.514     5.5758   0.2716 7.860e-01
## TitleMrs       4.162     5.6097   0.7419 4.583e-01
## TitleSir      10.719     5.9796   1.7925 7.334e-02

summary(Age.reg)[8]

## $r.squared
## [1] 0.4171

Missing value imputation 5

Predicting missing values:

#Make Prediction
total$Age_Pred<- predict(Age.reg,total)
#Have a few negative cases - set to minimum age
total$Age_Pred <- sapply (total$Age_Pred, FUN = function (x) {
    if (x < 0) {min(total$Age, na.rm = TRUE)}
    else {x}
}) 
#Set missing values to be those predicted by the model
total$Age <- mapply (total$Age, total$Age_Pred, FUN= function(x,y)
  {
    if(is.na(x)) {y} 
    else {x}
})
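
The same two steps can also be written without sapply / mapply. This is only a sketch of an equivalent vectorised form on made-up values (age and age_pred are hypothetical), not the code used for the submission.

```r
# Hypothetical observed ages (with gaps) and model predictions
age      <- c(22, NA, 38, NA)
age_pred <- c(25.1, -3.2, 36.0, 29.4)

# Negative predictions are set to the minimum observed age
age_pred[age_pred < 0] <- min(age, na.rm = TRUE)

# Keep observed ages, use the prediction only where age is missing
age_filled <- ifelse(is.na(age), age_pred, age)
age_filled  # 22.0 22.0 38.0 29.4
```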

Final variable tweaks

#Make a factor of family size 
total$FactorSize <- as.factor(total$FamilySize)

#Create Z standardised Age Var
AverageAge<-mean(total$Age)
AgeSD <- sd(total$Age)
total$AgeZ = ((total$Age - AverageAge) / AgeSD)

#Produce final datasets
train<-total[1:891,]
test<-total[892:1309,]
row.names(train)<-NULL
train$Survived<-as.factor(train$Survived)

Modelling approach

I used the "Random Forest" algorithm to get my best score
To optimise the different options available I made use of the "caret" package
The "caret" package allows you to quickly run many models with different settings, evaluate their performance based on a measure of your choice and then select the best one
I also thought I got this score using the C5.0 algorithm, but I now can't reproduce the result, so I must be mistaken (make sure to set random number seeds / have clear code etc.!)

Model Optimisation

Instead of using accuracy to pick the best model, I used the "Kappa" statistic, which accounts for the probability of getting things right by chance
When measuring performance I used 10-fold cross-validation (this involves producing each model 10 times with 90%/10% train/test ratios and reporting average performance statistics)
In caret, the random forest method has a single parameter to tune, mtry, which specifies the number of variables randomly sampled as candidates at each split
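
For anyone unfamiliar with Kappa, here is a minimal sketch of how it is calculated by hand from a confusion matrix. The counts are made up for illustration and are not results from this model.

```r
# Hypothetical confusion matrix: rows = predicted, columns = actual
conf_mat <- matrix(c(50,  5,
                     10, 35),
                   nrow = 2,
                   dimnames = list(predicted = c("0", "1"),
                                   actual    = c("0", "1")))
n <- sum(conf_mat)

# Observed agreement: proportion of cases on the diagonal
p_observed <- sum(diag(conf_mat)) / n

# Agreement expected by chance, from the row and column totals
p_chance <- sum(rowSums(conf_mat) * colSums(conf_mat)) / n^2

# Kappa: how far observed agreement exceeds chance, rescaled so 1 is perfect
kappa <- (p_observed - p_chance) / (1 - p_chance)
kappa
```

With these counts accuracy is 0.85, but about half of that agreement would be expected by chance, so Kappa comes out noticeably lower than accuracy.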

A Quick Explanation of Random Forest

A popular "black box" method for classification tasks
Produce a large number of trees (by default 500 in R), each grown on a random sample of cases (with replacement) equal to the size of the original data set, with a random subset of the available variables (by default sqrt(nvars)) considered at each split
i.e. the 500 trees produced will all have the same number of records, but the exact cases will vary across each one
All 500 trees then vote on the classification of unseen cases, and the most common class is the one allocated to each case
For more information see:
Berkeley

Wikipedia
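
The three random ingredients described above can be sketched in a few lines of base R. This is a toy illustration with made-up sizes and votes (n_cases, n_vars and tree_votes are hypothetical), not what the randomForest package does internally.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

n_cases <- 10  # hypothetical number of training rows
n_vars  <- 6   # hypothetical number of predictors

# Bootstrap sample for one tree: same size as the data, drawn with replacement
boot_idx <- sample(n_cases, size = n_cases, replace = TRUE)

# Default number of candidate variables for classification: floor(sqrt(nvars))
mtry_default <- floor(sqrt(n_vars))
split_vars <- sample(n_vars, size = mtry_default)

# Majority vote across (here only five) trees for one unseen case
tree_votes <- c("1", "0", "1", "1", "0")
vote_counts <- table(tree_votes)
winner <- names(vote_counts)[which.max(vote_counts)]
winner  # "1"
```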

Optimisation Set Up

As mentioned previously, I wanted the model with the highest Kappa
Rather than selecting the best model outright, I used the simplest model within 1 SE of the best performing model
For tree models, "simplest" is based on the depth of the trees (and the number of boosting iterations if applicable, which doesn't apply to random forest); with only mtry to tune here, that means the smallest mtry
This should reduce the potential for overfitting: a model which performs almost as well as another model but is "simpler" is likely to perform better on unseen data
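
A hand-rolled sketch of the "within 1 SE" rule is shown below. The Kappa values and standard deviations are made up, the SE is taken as SD / sqrt(number of folds) with 10 folds, and the candidates are assumed to be ordered simplest first (smallest mtry first). It illustrates the idea rather than reproducing caret's own code.

```r
# Hypothetical cross-validation results, ordered simplest first
cv_results <- data.frame(
  mtry    = 1:6,
  Kappa   = c(0.58, 0.62, 0.63, 0.64, 0.63, 0.62),
  KappaSD = c(0.08, 0.08, 0.08, 0.08, 0.08, 0.08)
)
n_folds <- 10

# Find the best performer and its standard error
best_row <- which.max(cv_results$Kappa)
best_se  <- cv_results$KappaSD[best_row] / sqrt(n_folds)

# Any model at or above this threshold counts as "within 1 SE"
threshold <- cv_results$Kappa[best_row] - best_se

# Pick the simplest (first) model that clears the threshold
chosen <- cv_results$mtry[min(which(cv_results$Kappa >= threshold))]
chosen  # 2
```

In this toy table the best Kappa is at mtry = 4, but mtry = 2 is within one standard error of it, so the simpler setting is chosen.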

Final Model Code

The best model used mtry = 2, i.e. two candidate features considered at each split

ctrl<- trainControl(method="cv", number = 10, selectionFunction = "oneSE")
grid_rf <- expand.grid(.mtry=c(1:6))
set.seed(1988)
m_rf<- train (Survived ~ Pclass + Sex + AgeZ + FareScale + Title + FactorSize
              ,data = train, method = "rf", metric = "Kappa",
              trControl = ctrl, tuneGrid = grid_rf)

Print Final Model Performance

Final Model

#Row 2 of the cross-validation results table (mtry = 2)
m_rf$results[2,]

##   mtry Accuracy  Kappa AccuracySD KappaSD
## 2    2   0.8271 0.6286    0.03611 0.08049

Exporting Predictions to Upload to Kaggle

Prediction <- predict(m_rf, test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "m_rf_maxKappa_Simple1.csv", row.names = FALSE)

END

Any questions / comments?