Titanic Kaggle Notes

30/11/2014

Steve Burr

Introduction

-My top score was 0.79426 on Kaggle

-Apparently anything in the 0.79-0.81 range is pretty good, anything above is exceptional / cheating

-Did anyone else do better?

-Did anyone else do anything?

The rest of this presentation explains what I did, with R code for those interested

It's heavily inspired by the guide at: http://trevorstephens.com/post/72916401642/titanic-getting-started-with-r

I also made use of the book "Machine Learning with R" by Brett Lantz

Load all packages that I used

library(randomForest)
## Warning: package 'randomForest' was built under R version 3.1.2
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Warning: package 'caret' was built under R version 3.1.2
## Loading required package: lattice
## Loading required package: ggplot2

Data read in

  • First I read in the test / training datasets provided and then also created a combined dataset
  • This was so that variables held as a factor had the same number of levels (important so R doesn't get upset)
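#Read in the data
#stringsAsFactors = TRUE keeps text columns as factors (no longer the default from R 4.0 onwards)
train<-read.csv("train.csv", stringsAsFactors = TRUE)
test<-read.csv("test.csv", stringsAsFactors = TRUE)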
test$Survived<-NA
#Create a total dataset for sorting out features (so they have consistent factors across data)
total<-rbind(train,test)
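
To show why the combined dataset matters, here is a minimal sketch on made-up data (the toy data frames and their values are assumptions, not the Kaggle files). A factor built only from the test rows can end up with fewer levels than the same factor in the training rows, whereas stacking first and splitting afterwards keeps one shared set of levels.

```r
# Toy data (hypothetical values, not the Kaggle files)
train_toy <- data.frame(Embarked = c("S", "C", "Q"), stringsAsFactors = TRUE)
test_toy  <- data.frame(Embarked = c("S", "C"), stringsAsFactors = TRUE)

# Built separately, the two factors disagree on their levels
levels(train_toy$Embarked)  # "C" "Q" "S"
levels(test_toy$Embarked)   # "C" "S"

# Stacking first, then splitting by row index, keeps one shared level set
total_toy   <- rbind(train_toy, test_toy)
train_fixed <- total_toy[1:3, , drop = FALSE]
test_fixed  <- total_toy[4:5, , drop = FALSE]
identical(levels(train_fixed$Embarked), levels(test_fixed$Embarked))  # TRUE
```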

Variable creation 1

  • Someone's title may provide additional information about their status over and above ticket class which may be predictive of survival

  • Once title was stripped out of the name attribute I simplified it into what I thought were sensible groupings (I did this differently to the online guide)

#Get Title
total$Name<-as.character(total$Name)
total$Title <- sapply(total$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
total$Title <- sub(' ', '', total$Title)

#Simplify Titles
total$Title[total$Title %in% c('Mme', 'Mlle','Ms')] <- 'Mrs'
total$Title[total$Title %in% c('Capt', 'Don', 'Major', 'Sir','Col','Dr','Rev')] <- 'Sir'
total$Title[total$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'

#Code as a factor
total$Title <-as.factor(total$Title)
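
As a quick check of how the title extraction behaves, here is a small sketch on three illustrative names written in the "Surname, Title. Given names" layout (the vector name_toy is an assumption for the example). It follows the same split-and-trim steps as the code above.

```r
# Three illustrative names in the "Surname, Title. Given names" layout
name_toy <- c("Braund, Mr. Owen Harris",
              "Heikkinen, Miss. Laina",
              "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)")

# Split on every comma and full stop; the title is the second piece
pieces    <- strsplit(name_toy, split = "[,.]")
title_toy <- sapply(pieces, function(p) p[2])

# Remove the first space, which is the one left behind after the comma
title_toy <- sub(" ", "", title_toy)
title_toy  # "Mr" "Miss" "the Countess"
```

Note that sub() only removes the first space, which is why "the Countess" keeps its internal space and has to be matched in that form when grouping titles.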

Variable creation 2

  • I also produced "Total Family Size" based on number of Siblings/Spouses + Parents/Children + 1
#Total Family Size
total$FamilySize <- total$SibSp + total$Parch + 1 

Missing value imputation 1

  • There are missing values in the data, so I attempted to impute sensible values before applying models (though depending on the technique used this could be either vital or not required at all)
total[is.na(total$Fare),]
##      PassengerId Survived Pclass               Name  Sex  Age SibSp Parch
## 1044        1044       NA      3 Storey, Mr. Thomas male 60.5     0     0
##      Ticket Fare Cabin Embarked Title FamilySize
## 1044   3701   NA              S    Mr          1
#There is one missing fare from Pclass 3
#Replace missing value with the median fare for Pclass 3
total$Fare[is.na(total$Fare)]<-median(total$Fare[total$Pclass==3],na.rm=TRUE)

#Make missing sex = male as majority class 
total$Sex[is.na(total$Sex)]<-"male"

Missing value imputation 2

  • Most variables are categorical, so will be treated as a series of 0/1 binary variables by most models, therefore I rescaled fare to the same 0 to 1 range

  • It's not normally distributed so I opted to scale based on max / min

total$FareScale <- 1 + (total$Fare - max(total$Fare))/(max(total$Fare)-min(total$Fare))
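
The expression above is min-max scaling written in a slightly unusual way: 1 + (x - max)/(max - min) simplifies to (x - min)/(max - min). A minimal sketch on made-up fares (fare_toy is an assumption for the example) confirms the two forms agree and that the result runs from 0 to 1.

```r
# Hypothetical fares, not taken from the Kaggle data
fare_toy <- c(0, 7.25, 71.28, 512.33)
rng <- max(fare_toy) - min(fare_toy)

scaled_a <- 1 + (fare_toy - max(fare_toy)) / rng  # form used above
scaled_b <- (fare_toy - min(fare_toy)) / rng      # textbook min-max form

all.equal(scaled_a, scaled_b)  # TRUE
range(scaled_b)                # 0 1
```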

Missing value imputation 3

  • Age is a really important variable with a lot of missing values
  • I decided to use a (fairly bad) regression model to impute them
  • However I've previously just used median values by title with pretty successful results (similar final scores)
total$Pclass<-as.factor(total$Pclass)
NonMissing <- total[!is.na(total$Age),]
Age.reg<-lm(Age ~ Pclass + SibSp + Title ,data = NonMissing)
summary(Age.reg)[4]
## $coefficients
##             Estimate Std. Error  t value  Pr(>|t|)
## (Intercept)   39.852     5.5243   7.2140 1.047e-12
## Pclass2       -9.261     0.9533  -9.7147 2.066e-21
## Pclass3      -12.147     0.8509 -14.2762 2.394e-42
## SibSp         -1.409     0.4159  -3.3876 7.316e-04
## TitleMaster  -21.070     5.8207  -3.6198 3.091e-04
## TitleMiss     -9.191     5.6060  -1.6396 1.014e-01
## TitleMr        1.514     5.5758   0.2716 7.860e-01
## TitleMrs       4.162     5.6097   0.7419 4.583e-01
## TitleSir      10.719     5.9796   1.7925 7.334e-02
summary(Age.reg)[8]
## $r.squared
## [1] 0.4171
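
For comparison, the simpler median-by-title imputation mentioned above could be sketched as follows. This is only an illustration on made-up data (the toy data frame and its ages are assumptions), not the code used for the final score.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  Title = c("Mr", "Mr", "Mr", "Miss", "Miss", "Master", "Master"),
  Age   = c(30, 40, NA, 20, NA, 4, NA)
)

# Median age within each title, repeated for every row with that title
title_median <- ave(toy$Age, toy$Title,
                    FUN = function(a) median(a, na.rm = TRUE))

# Fill only the gaps, leaving observed ages untouched
toy$Age_filled <- ifelse(is.na(toy$Age), title_median, toy$Age)
toy$Age_filled  # 30 40 35 20 20 4 4
```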

Missing value imputation 5

  • Predicting missing values:
#Make Prediction
total$Age_Pred<- predict(Age.reg,total)
#Have a few negative cases - set to minimum age
total$Age_Pred[total$Age_Pred < 0] <- min(total$Age, na.rm = TRUE)
#Set missing values to be those predicted by the model
total$Age <- ifelse(is.na(total$Age), total$Age_Pred, total$Age)

Final variable tweaks

#Make a factor of family size 
total$FactorSize <- as.factor(total$FamilySize)

#Create Z standardised Age Var
AverageAge<-mean(total$Age)
AgeSD <- sd(total$Age)
total$AgeZ <- (total$Age - AverageAge) / AgeSD

#Produce final datasets
train<-total[1:891,]
test<-total[892:1309,]
row.names(train)<-NULL
train$Survived<-as.factor(train$Survived)

Modelling approach

  • I used the "Random Forest" algorithm to get my best score

  • To optimise the different options available I made use of the "caret" package

  • The "caret" package allows you to quickly run many models with different settings, evaluate their performance based on a measure of your choice and then select the best one

  • I also thought I got this score using the C5.0 algorithm, but now can't reproduce the result so must be mistaken (make sure to set random number seeds / have clear code etc.!!!)

Model Optimisation

  • Instead of using accuracy to pick the best model I used the "Kappa" statistic, which adjusts accuracy for the agreement expected by chance

  • When measuring performance I used 10-fold cross-validation (each candidate model is fitted 10 times on 90%/10% train/test splits and the performance statistics are averaged)

  • In caret the random forest method ("rf") has a single parameter to tune, mtry, which specifies the number of variables randomly sampled as candidates at each split
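
To make the Kappa statistic concrete, here is a minimal sketch that computes it by hand from a made-up confusion matrix (the counts are assumptions chosen for easy arithmetic). Kappa is (observed agreement - chance agreement) / (1 - chance agreement).

```r
# Hypothetical confusion matrix: rows = actual class, columns = predicted class
conf <- matrix(c(45,  5,
                 10, 40),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
n <- sum(conf)

# Observed agreement is plain accuracy
p_observed <- sum(diag(conf)) / n

# Chance agreement comes from the row and column totals
p_chance <- sum(rowSums(conf) * colSums(conf)) / n^2

kappa_toy <- (p_observed - p_chance) / (1 - p_chance)
kappa_toy  # 0.7 (accuracy is 0.85, chance agreement is 0.5)
```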

A Quick Explanation of Random Forest

  • A popular "black box" method for classification tasks
  • Produce a large number of trees (by default 500 in R), each grown on a random sample of cases (drawn with replacement) equal to the size of the original data set, with a random subset of the available variables (by default sqrt(nvars) for classification) considered at each split
  • i.e. the 500 trees produced will all have the same number of records, but the exact cases will vary across each one
  • All 500 trees then vote on the classification of unseen cases, the most common class is the one allocated to each case
  • For more information see: Berkeley, Wikipedia
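
The two mechanics described above, bootstrap sampling and majority voting, can be sketched in a few lines of base R. This is only an illustration with made-up sizes and votes (n_cases and tree_votes are assumptions), not what the randomForest package does internally.

```r
set.seed(1)
n_cases <- 8

# One bootstrap sample: n_cases row indices drawn with replacement,
# so it has the same size as the data but some rows repeat and others are left out
boot_rows <- sample(n_cases, size = n_cases, replace = TRUE)
length(boot_rows)

# Majority vote across trees for a single unseen case (hypothetical votes)
tree_votes  <- c("0", "1", "0", "0", "1")
vote_counts <- table(tree_votes)
final_class <- names(vote_counts)[which.max(vote_counts)]
final_class  # "0"
```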

Optimisation Set Up

  • As mentioned previously, I wanted the model with the highest Kappa
  • Rather than selecting the best-scoring model I used the simplest model within 1 SE of the best performing model
  • For tree models "simplest" is generally based on the depth of the trees (and the number of boosting iterations where applicable); neither is tuned for random forest, where as far as I can tell caret simply prefers the smaller mtry
  • This should reduce the potential for overfitting: a model which performs almost as well as another model but is "simpler" is likely to perform better on unseen data
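
The one-standard-error rule can be sketched by hand as follows. The cross-validation figures below are made up for illustration, and treating a smaller mtry as "simpler" is my assumption about how the candidates are ordered.

```r
# Hypothetical cross-validation summary, one row per candidate mtry
cv <- data.frame(mtry    = 1:6,
                 Kappa   = c(0.55, 0.62, 0.64, 0.63, 0.61, 0.60),
                 KappaSD = rep(0.08, 6))
n_folds <- 10

# Best row and the standard error of its Kappa estimate
best    <- which.max(cv$Kappa)
se_best <- cv$KappaSD[best] / sqrt(n_folds)

# Keep every candidate within one SE of the best, then take the simplest
within_one_se <- cv$Kappa >= cv$Kappa[best] - se_best
chosen_mtry   <- min(cv$mtry[within_one_se])
chosen_mtry  # 2, even though mtry = 3 has the highest Kappa
```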

Final Model Code

The selected model used mtry = 2, i.e. two candidate features considered at each split

ctrl<- trainControl(method="cv", number = 10, selectionFunction = "oneSE")
grid_rf <- expand.grid(.mtry=c(1:6))
set.seed(1988)
m_rf<- train (Survived ~ Pclass + Sex + AgeZ + FareScale + Title + FactorSize
              ,data = train, method = "rf", metric = "Kappa",
              trControl = ctrl, tuneGrid = grid_rf)

Print Final Model Performance

  • Final Model
m_rf$results[2,]
##   mtry Accuracy  Kappa AccuracySD KappaSD
## 2    2   0.8271 0.6286    0.03611 0.08049

Exporting Predictions to Upload to Kaggle

Prediction <- predict(m_rf, test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "m_rf_maxKappa_Simple1.csv", row.names = FALSE)

END

  • Any questions / comments?