Introduction

I have not previously competed in a Kaggle competition, so I decided to start with the Titanic competition, for which there is good introductory guidance.

As I am more familiar with R than Python, I have produced my solution using R.

Competition Aims

The goal of the competition is to predict whether a passenger survived the sinking of the Titanic or not.

The data provided to do this contains the following variables: PassengerId, Survived (in the training set only), Pclass (ticket class), Name, Sex, Age, SibSp (siblings/spouses aboard), Parch (parents/children aboard), Ticket, Fare, Cabin and Embarked (port of embarkation).

The data is split into two parts, a training set of 891 observations, and a test set of 418 observations for which the fate of the passengers is unknown. Entries to the competition are scored on the percentage of correct predictions made within the test dataset.
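
For reference, that score is plain classification accuracy. Kaggle computes it server-side, but given a vector of predictions and the true outcomes it could be sketched in R as:

# Proportion of passengers whose survival is predicted correctly
accuracy <- function(predicted, actual) {
  mean(predicted == actual)
}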

My Approach

Outline

To undertake the analysis, I initially used the “CART” decision tree algorithm found within the rpart package in R. Alongside this, I used the “rattle”, “rpart.plot” and “RColorBrewer” packages to help visualise my initial results.

The variables I considered initially were “Age”, “Sex”, “Fare” and “Pclass”, as I felt these were most likely to have a large impact on survival. Age and Sex are going to be the biggest predictors due to the “women and children first” rule used for allocating passengers to the limited number of lifeboats.

Implementation

I read the data into R and utilised the “data.table” package:

library(data.table)
# Read the raw CSVs and convert to data.tables for convenient manipulation
train<-data.table(read.csv("train.csv"))
test<-data.table(read.csv("test.csv"))
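
As a quick sanity check (not part of my submission pipeline), the data can be inspected and the earlier point about Sex examined:

str(train)  # Check variable types and spot missing values

# Proportion surviving within each sex (rows sum to 1);
# female survival is markedly higher than male survival
prop.table(table(train$Sex, train$Survived), margin = 1)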

I then used the “CART” decision tree algorithm with the packages outlined above to produce my first solution.

The code that I used can be seen below:

library(rpart)         # CART decision trees
library(rattle)        # provides fancyRpartPlot
library(rpart.plot)
library(RColorBrewer)
# Fit a classification tree on the four candidate predictors
fit <- rpart(Survived ~ Pclass + Sex + Age + Fare, data=train, method="class")
fancyRpartPlot(fit)

[Plot: the fitted decision tree, drawn with fancyRpartPlot]

# Predict on the test set and write a submission file
Prediction <- predict(fit, test, type = "class")
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "myfirstdtree.csv", row.names = FALSE)
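
Before submitting, a rough internal check (not in my original script) is the accuracy on the training data itself, though this is optimistic because the tree was fitted to those same rows:

# Resubstitution accuracy; expect this to overstate leaderboard performance
train_pred <- predict(fit, train, type = "class")
mean(train_pred == train$Survived)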

Scoring

My initial approach produced a score of 0.78469 which I felt was good for a first attempt.

I didn’t make any attempt to deal with missing values in “Age” and “Fare”; by imputing these values in a sensible way it should be possible to improve my score.
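
For context, the scale of the problem is easy to check (a quick count, not in the original script):

sum(is.na(train$Age))  # Age is missing for 177 of the 891 training passengers
sum(is.na(test$Fare))  # one test passenger has no recorded fare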

Additionally, there are other techniques such as random forests which may be more effective than a simple decision tree.

Improvements

I decided that the titles (e.g. Mr, Mrs, Miss, Master) used by the passengers were likely to be a good way of dividing them up in order to estimate their ages and fares, because titles are linked to both age and social status.

I created these variables from the “Name” variable as follows:

# Combine test and train so titles and medians are derived consistently
test2<-test
test2$Survived<-NA
all_data<-rbind(test2,train,use.names=TRUE)

# Extract the title from the Name field, e.g. "Braund, Mr. Owen Harris" -> "Mr"
all_data$Name<-as.character(all_data$Name)
all_data$Title <- sapply(all_data$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
all_data$Title <- sub(' ', '', all_data$Title)

#Reduce Complexity
all_data$Title[all_data$Title %in% c('Mme', 'Mlle','Ms')] <- 'Mrs'
all_data$Title[all_data$Title %in% c('Capt', 'Don', 'Major', 'Sir','Col','Dr','Rev')] <- 'Sir'
all_data$Title[all_data$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'
all_data$Title <-as.factor(all_data$Title)

I decided that a lot of the rarer titles could be grouped together without adversely impacting the analysis.
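
A quick frequency table (illustrative, not in my original script) confirms the grouping leaves a small set of reasonably sized categories:

# Distribution of passengers across the consolidated titles
table(all_data$Title)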

Having produced these titles, I then found the median age and fare within each and used these values to replace any missing values in the initial datasets.

This was done as follows:

#Calculate median age/fare within titles and replace missing values
AverageAge<-data.table(aggregate(Age ~ Title, data = all_data, median))
setnames(AverageAge,c("Title","AverageAge"))
AverageFare<-data.table(aggregate(Fare ~ Title, data = all_data, median))
setnames(AverageFare,c("Title","AverageFare"))

#Attach the medians to each passenger, then fill in any gaps
all_data<-merge(all_data,AverageAge,by="Title")
all_data<-merge(all_data,AverageFare,by="Title")
all_data$Age<-ifelse(is.na(all_data$Age),all_data$AverageAge,all_data$Age)
all_data$Fare<-ifelse(is.na(all_data$Fare),all_data$AverageFare,all_data$Fare)

#Split data back into Test/Train
all_data<-all_data[order(PassengerId)]
train<-all_data[1:891,]
test<-all_data[892:1309,]
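
As a sanity check (again, not in the original script), the imputation can be confirmed to have left no gaps:

# Both counts should now be zero
sum(is.na(all_data$Age))
sum(is.na(all_data$Fare))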

I then used the same decision tree methodology as before, with my new data and the variables “Pclass”, “Sex”, “Age”, “Fare” and “Title”, to produce new predictions:

fit <- rpart(Survived ~ Pclass + Sex + Age + Fare + Title, data=train, method="class")
fancyRpartPlot(fit)

[Plot: the new decision tree including Title, drawn with fancyRpartPlot]

Prediction <- predict(fit, test, type = "class")
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "myseconddtree.csv", row.names = FALSE)

My score was 0.77990 which was slightly worse than before.

At this point, I decided to try the “randomForest” package with the same data to see what this much-hyped technique would come up with.

The code I used for this is as follows:

set.seed(69)
library(randomForest)
# Fit a random forest of 2000 trees on the same predictors
fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + Fare + Title, data=train, importance=TRUE, ntree=2000)
varImpPlot(fit)

[Plot: variable importance from varImpPlot]

Prediction <- predict(fit, test)
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "firstforest.csv", row.names = FALSE)

Unfortunately, this performed worse than both of my previous attempts! (0.76555)

Finally, I made use of “Conditional Inference Trees” from the “party” package.

library(party)
# Fit a conditional inference forest of 2000 trees
fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + Fare + Title,
               data = train, controls=cforest_unbiased(ntree=2000, mtry=3))
Prediction <- predict(fit, test, OOB=TRUE, type = "response")
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "ConditionalInference.csv", row.names = FALSE)

This produced my best score of 0.78947.

There is still room for improvement: I have not used all of the variables, and it may not be necessary to use all of those I have. There are also other techniques, such as logistic regression, which may perform better.
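
As an illustration of the logistic regression route (a sketch only; I have not submitted this, and the 0.5 cut-off is an arbitrary choice):

# Fit a logistic regression on the same predictors
logit_fit <- glm(Survived ~ Pclass + Sex + Age + Fare + Title,
                 data = train, family = binomial)

# Convert predicted probabilities into 0/1 survival predictions
logit_prob <- predict(logit_fit, test, type = "response")
logit_pred <- ifelse(logit_prob > 0.5, 1, 0)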

It may also be possible to improve the score by coding ages into sensible bands instead of leaving them as a numeric variable.
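
One way of doing this (an illustrative sketch; the break points below are my own arbitrary choices) is with cut():

# Band ages into broad groups before refitting the models
all_data$AgeBand <- cut(all_data$Age,
                        breaks = c(0, 12, 18, 35, 60, Inf),
                        labels = c("Child", "Teen", "YoungAdult", "Adult", "Senior"))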

Conclusions

I’ve made use of several packages and techniques in R which I hadn’t used before, and made my first entry to a Kaggle competition.

This has been a good learning experience, though there is plenty of room for further study and further improvements.