This is a continuation of my attempts at the Kaggle Titanic Competition. I use the same methodology as before for cleaning the training and testing data sets, so I won't repeat that code here.
Let's try a random forest first: who got onto a lifeboat was ultimately a series of human decisions, so tree-based models should be a good way to capture those decision rules.
library(randomForest)
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
set.seed(200)
# Random forest model using training set
rf.titanic = randomForest(Survived ~ Pclass + Sex + Age + Child + Sex * Pclass +
SibSp + Parch + Family + Mother, data = trainData, mtry = 3, importance = T)
## Warning: The response has five or fewer unique values. Are you sure you
## want to do regression?
# Prediction using test set
yhat.rf = predict(rf.titanic, newdata = testData)
# Importance of variables
importance(rf.titanic)
## %IncMSE IncNodePurity
## Pclass 66.263 21.474
## Sex 93.239 53.881
## Age 28.470 23.801
## Child 22.574 3.822
## SibSp 16.587 6.123
## Parch 9.119 3.470
## Family 24.912 10.474
## Mother 9.589 1.951
varImpPlot(rf.titanic)
# Convert predicted values to 0/1 survival predictions using a 0.5 cutoff
survival.rf <- ifelse(yhat.rf > 0.5, 1, 0)
# Creating CSV for Kaggle Submission
kaggle.sub <- cbind(PassengerId, survival.rf)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "~/Dropbox/Data Science/Kaggle/Titanic/titanic_rf1.csv",
row.names = FALSE)
The random forest did better than linear regression!! My ranking improved by 77 places.
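A side note on the warning above: Survived is stored as a numeric 0/1, so randomForest fits a regression and returns predicted values that then need the 0.5 cutoff. Converting the response to a factor would make it a classification forest and return class labels directly. A minimal sketch, assuming the same trainData and testData (the Sex * Pclass interaction is omitted for brevity; rf.class and yhat.class are hypothetical names):
# Classification forest: a factor response avoids the regression warning
rf.class = randomForest(as.factor(Survived) ~ Pclass + Sex + Age + Child + SibSp +
    Parch + Family + Mother, data = trainData, mtry = 3, importance = TRUE)
yhat.class = predict(rf.class, newdata = testData)  # predicted class labels 0/1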
Now trying a random forest with a different mtry. I can't measure test error directly because Kaggle doesn't provide the outcome for the test set; a more systematic approach would be to compare cross-validated or out-of-bag error on the training set for each value of mtry (a sketch follows below), but here I'll simply try another value and compare leaderboard scores.
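For reference, a minimal sketch of picking mtry from the out-of-bag error on the training data, assuming the same trainData and predictors (the interaction term is omitted for simplicity and the candidate mtry values are arbitrary):
# Compare out-of-bag MSE across candidate mtry values (illustrative only)
oob.err <- sapply(2:6, function(m) {
    fit <- randomForest(Survived ~ Pclass + Sex + Age + Child + SibSp +
        Parch + Family + Mother, data = trainData, mtry = m, ntree = 500)
    fit$mse[500]  # OOB error after 500 trees
})
names(oob.err) <- 2:6
oob.err  # pick the mtry with the smallest OOB error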
set.seed(200)
# Random forest model using training set
rf.titanic2 = randomForest(Survived ~ Pclass + Sex + Age + Child + Sex * Pclass +
SibSp + Parch + Family + Mother, data = trainData, mtry = 4, importance = T)
## Warning: The response has five or fewer unique values. Are you sure you
## want to do regression?
# Prediction using test set
yhat.rf2 = predict(rf.titanic2, newdata = testData)
# Importance of variables
importance(rf.titanic2)
## %IncMSE IncNodePurity
## Pclass 78.622 22.865
## Sex 115.675 57.231
## Age 30.562 31.920
## Child 21.501 4.002
## SibSp 16.968 6.528
## Parch 7.735 3.215
## Family 25.870 11.263
## Mother 9.619 1.927
varImpPlot(rf.titanic2)
# Creating CSV for Kaggle Submission
survival.rf2 <- ifelse(yhat.rf2 > 0.5, 1, 0)
kaggle.sub <- cbind(PassengerId, survival.rf2)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "~/Dropbox/Data Science/Kaggle/Titanic/titanic_rf2.csv",
row.names = FALSE)
This was NOT an improvement at all. So let's try another tree method, Boosting:
library(gbm)
## Loading required package: survival
## Loading required package: splines
## Loading required package: lattice
## Loading required package: parallel
## Loaded gbm 2.1
set.seed(200)
boost.titanic = gbm(Survived ~ Pclass + Sex + Age + Child + Sex * Pclass + SibSp +
Parch + Family + Mother, data = trainData, distribution = "gaussian", n.trees = 5000,
interaction.depth = 4)
summary(boost.titanic)
## var rel.inf
## Sex Sex 49.0366
## Age Age 19.6405
## Pclass Pclass 19.1863
## Family Family 7.2521
## SibSp SibSp 3.1205
## Child Child 0.8146
## Parch Parch 0.5934
## Mother Mother 0.3560
## Pclass:Sex Pclass:Sex 0.0000
# Partial Dependence Plot
par(mfrow = c(1, 2))
plot(boost.titanic, i = "Sex")
plot(boost.titanic, i = "Age")
# Use Boosted Model to predict
yhat.boost = predict(boost.titanic, newdata = testData, n.trees = 5000)
# Creating CSV for Kaggle Submission
survival.boost <- ifelse(yhat.boost > 0.5, 1, 0)
kaggle.sub <- cbind(PassengerId, survival.boost)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "~/Dropbox/Data Science/Kaggle/Titanic/titanic_boost.csv",
row.names = FALSE)
Boosting improved results substantially!! My ranking improved by 322 places, putting me within the top two-thirds.
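One likely next step: gbm was run with distribution = "gaussian" on a 0/1 response, so it is effectively fitting a regression. For a binary outcome the "bernoulli" distribution is usually a better match, and predict() with type = "response" then returns survival probabilities directly. A minimal sketch, assuming the same trainData and testData (the shrinkage value and object names are illustrative, and the Sex * Pclass term is dropped since it showed zero influence above):
set.seed(200)
boost.titanic.bern = gbm(Survived ~ Pclass + Sex + Age + Child + SibSp + Parch +
    Family + Mother, data = trainData, distribution = "bernoulli", n.trees = 5000,
    interaction.depth = 4, shrinkage = 0.01)
# type = 'response' returns predicted survival probabilities
prob.boost = predict(boost.titanic.bern, newdata = testData, n.trees = 5000,
    type = "response")
survival.bern = ifelse(prob.boost > 0.5, 1, 0)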