This is a continuation of my attempts at the Kaggle Titanic Competition. I use the same methodology as before for cleaning the training and testing data sets, so I won't repeat that code here.
Let's try a random forest first: who got onto a lifeboat was ultimately a series of human decisions, so tree-based models should be a good way to capture those decision rules.
library(randomForest)
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
set.seed(200)
# Random forest model using training set
rf.titanic = randomForest(Survived ~ Pclass + Sex + Age + Child + Sex * Pclass +
SibSp + Parch + Family + Mother, data = trainData, mtry = 3, importance = T)
## Warning: The response has five or fewer unique values. Are you sure you
## want to do regression?
# Prediction using test set
yhat.rf = predict(rf.titanic, newdata = testData)
# Importance of variables
importance(rf.titanic)
## %IncMSE IncNodePurity
## Pclass 66.263 21.474
## Sex 93.239 53.881
## Age 28.470 23.801
## Child 22.574 3.822
## SibSp 16.587 6.123
## Parch 9.119 3.470
## Family 24.912 10.474
## Mother 9.589 1.951
varImpPlot(rf.titanic)
# Convert predicted values to 0/1 survival predictions using a 0.5 cutoff
survival.rf <- ifelse(yhat.rf > 0.5, 1, 0)
# Creating CSV for Kaggle Submission
kaggle.sub <- cbind(PassengerId, survival.rf)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "~/Dropbox/Data Science/Kaggle/Titanic/titanic_rf1.csv",
row.names = FALSE)
The random forest did better than linear regression!! My ranking improved by 77 places.
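A side note on the warning above: Survived is stored as a numeric 0/1, so randomForest fits a regression and returns predicted values that then need the 0.5 cutoff. Converting the response to a factor would make it a classification forest and return class labels directly. A minimal sketch, assuming the same trainData and testData (the Sex * Pclass interaction is omitted for brevity; rf.class and yhat.class are hypothetical names):
# Classification forest: a factor response avoids the regression warning
rf.class = randomForest(as.factor(Survived) ~ Pclass + Sex + Age + Child + SibSp +
    Parch + Family + Mother, data = trainData, mtry = 3, importance = TRUE)
yhat.class = predict(rf.class, newdata = testData)  # predicted class labels 0/1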
Now trying a random forest with a different mtry. I can't measure test error directly because Kaggle doesn't provide the outcome for the test set; a more systematic approach would be to compare cross-validated or out-of-bag error on the training set for each value of mtry (a sketch follows below), but here I'll simply try another value and compare leaderboard scores.
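For reference, a minimal sketch of picking mtry from the out-of-bag error on the training data, assuming the same trainData and predictors (the interaction term is omitted for simplicity and the candidate mtry values are arbitrary):
# Compare out-of-bag MSE across candidate mtry values (illustrative only)
oob.err <- sapply(2:6, function(m) {
    fit <- randomForest(Survived ~ Pclass + Sex + Age + Child + SibSp +
        Parch + Family + Mother, data = trainData, mtry = m, ntree = 500)
    fit$mse[500]  # OOB error after 500 trees
})
names(oob.err) <- 2:6
oob.err  # pick the mtry with the smallest OOB error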
set.seed(200)
# Random forest model using training set
rf.titanic2 = randomForest(Survived ~ Pclass + Sex + Age + Child + Sex * Pclass +
SibSp + Parch + Family + Mother, data = trainData, mtry = 4, importance = T)
## Warning: The response has five or fewer unique values. Are you sure you
## want to do regression?
# Prediction using test set
yhat.rf2 = predict(rf.titanic2, newdata = testData)
# Importance of variables
importance(rf.titanic2)
## %IncMSE IncNodePurity
## Pclass 78.622 22.865
## Sex 115.675 57.231
## Age 30.562 31.920
## Child 21.501 4.002
## SibSp 16.968 6.528
## Parch 7.735 3.215
## Family 25.870 11.263
## Mother 9.619 1.927
varImpPlot(rf.titanic2)
# Creating CSV for Kaggle Submission
survival.rf2 <- ifelse(yhat.rf2 > 0.5, 1, 0)
kaggle.sub <- cbind(PassengerId, survival.rf2)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "~/Dropbox/Data Science/Kaggle/Titanic/titanic_rf2.csv",
row.names = FALSE)
This was NOT an improvement at all. So let's try another tree method, Boosting:
library(gbm)
## Loading required package: survival
## Loading required package: splines
## Loading required package: lattice
## Loading required package: parallel
## Loaded gbm 2.1
set.seed(200)
boost.titanic = gbm(Survived ~ Pclass + Sex + Age + Child + Sex * Pclass + SibSp +
Parch + Family + Mother, data = trainData, distribution = "gaussian", n.trees = 5000,
interaction.depth = 4)
summary(boost.titanic)
## var rel.inf
## Sex Sex 49.0366
## Age Age 19.6405
## Pclass Pclass 19.1863
## Family Family 7.2521
## SibSp SibSp 3.1205
## Child Child 0.8146
## Parch Parch 0.5934
## Mother Mother 0.3560
## Pclass:Sex Pclass:Sex 0.0000
# Partial Dependence Plot
par(mfrow = c(1, 2))
plot(boost.titanic, i = "Sex")
plot(boost.titanic, i = "Age")
# Use Boosted Model to predict
yhat.boost = predict(boost.titanic, newdata = testData, n.trees = 5000)
# Creating CSV for Kaggle Submission
survival.boost <- ifelse(yhat.boost > 0.5, 1, 0)
kaggle.sub <- cbind(PassengerId, survival.boost)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "~/Dropbox/Data Science/Kaggle/Titanic/titanic_boost.csv",
row.names = FALSE)
Boosting improved results substantially!! My ranking improved by 322 places, putting me within the top two-thirds.
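One likely next step: gbm was run with distribution = "gaussian" on a 0/1 response, so it is effectively fitting a regression. For a binary outcome the "bernoulli" distribution is usually a better match, and predict() with type = "response" then returns survival probabilities directly. A minimal sketch, assuming the same trainData and testData (the shrinkage value and object names are illustrative, and the Sex * Pclass term is dropped since it showed zero influence above):
set.seed(200)
boost.titanic.bern = gbm(Survived ~ Pclass + Sex + Age + Child + SibSp + Parch +
    Family + Mother, data = trainData, distribution = "bernoulli", n.trees = 5000,
    interaction.depth = 4, shrinkage = 0.01)
# type = 'response' returns predicted survival probabilities
prob.boost = predict(boost.titanic.bern, newdata = testData, n.trees = 5000,
    type = "response")
survival.bern = ifelse(prob.boost > 0.5, 1, 0)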