Predicting Titanic Survivors with Random Forests

First we load the data:

library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

data <- read.csv("train.csv")
datest <- read.csv("test.csv")

We fit a linear model that predicts Age from the number of siblings and spouses aboard (SibSp), and use it to fill in the missing Age values in both the training and testing sets:

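# lm() drops rows with a missing Age by default, so only known ages are used
fit <- lm(Age ~ SibSp, data = data)
# Fill missing ages in the training set with the model's predictions
nohay <- which(is.na(data$Age))
data$Age[nohay] <- predict(fit, newdata = data[nohay, ])
# Apply the same model to the testing set
nohay <- which(is.na(datest$Age))
datest$Age[nohay] <- predict(fit, newdata = datest[nohay, ])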
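To make the imputation step concrete, here is a minimal sketch on a small made-up data frame (the `toy` values and the `toy_fit`, `missing_age` and `manual` names are illustrative assumptions, not part of the Titanic data). It checks that the filled-in ages are just the intercept plus the slope times SibSp:

```r
# Toy data (made up for illustration): two passengers have no recorded age
toy <- data.frame(
  Age   = c(22, 38, NA, 35, 4, NA, 54),
  SibSp = c(1, 1, 0, 0, 3, 2, 0)
)

# lm() drops rows with NA in Age by default, so only known ages are used
toy_fit <- lm(Age ~ SibSp, data = toy)

# Predict an age for exactly the rows where it is missing
missing_age <- is.na(toy$Age)
toy$Age[missing_age] <- predict(toy_fit, newdata = toy[missing_age, ])

# The filled values equal intercept + slope * SibSp (SibSp is 0 and 2 here)
manual <- coef(toy_fit)[["(Intercept)"]] + coef(toy_fit)[["SibSp"]] * c(0, 2)
```

Every passenger with the same SibSp value receives the same imputed age, which is the main limitation of a one-predictor model like this.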

Now we fit a random forest model on the training data. Survived is read in as a 0/1 number, so caret fits a regression forest and the numeric predictions are thresholded at 0.5 later on:

modRF <- train(Survived ~ Pclass + Sex + Age + Fare + Embarked + SibSp + Parch,
               data = data, method = "rf")
save(modRF, file = ".modRF")
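As a possible alternative to thresholding a numeric prediction, the outcome can be turned into a factor so that caret fits a classification forest instead. The sketch below is not the model used in this report; it runs on made-up data, and the names `toy`, `p_survive`, `toy_rf` and `toy_pred`, the 3-fold cross-validation and the fixed `mtry = 2` are all assumptions chosen to keep the example small:

```r
library(caret)  # method = "rf" also needs the randomForest package installed

set.seed(1)
# Toy data (made up for illustration): survival is tied to sex only
n <- 120
toy <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, 1, 70))
)
p_survive <- ifelse(toy$Sex == "female", 0.8, 0.2)
toy$Survived <- factor(ifelse(runif(n) < p_survive, "survived", "died"))

# A factor outcome makes caret fit a classification forest
toy_rf <- train(Survived ~ Pclass + Sex + Age, data = toy, method = "rf",
                trControl = trainControl(method = "cv", number = 3),
                tuneGrid = data.frame(mtry = 2))

# predict() now returns class labels directly, so no 0.5 threshold is needed
toy_pred <- predict(toy_rf, toy)
```

With a factor outcome the predictions can also be passed to confusionMatrix() without any conversion.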

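We now compute the confusion matrix on the same training data, just to get a rough idea; accuracy measured on the data the model was trained on is optimistic. Note that confusionMatrix() expects the predictions first and the reference second, so with the call below the rows labelled "Prediction" in the output hold the actual outcomes and the columns labelled "Reference" hold the model's predictions: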

pred <- predict(modRF, data)

## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.

pred <- as.integer(pred > 0.5)
conf <- confusionMatrix(data$Survived, pred)
conf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 531  18
##          1  81 261
##                                         
##                Accuracy : 0.889         
##                  95% CI : (0.866, 0.909)
##     No Information Rate : 0.687         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.757         
##  Mcnemar's Test P-Value : 4.63e-10      
##                                         
##             Sensitivity : 0.868         
##             Specificity : 0.935         
##          Pos Pred Value : 0.967         
##          Neg Pred Value : 0.763         
##              Prevalence : 0.687         
##          Detection Rate : 0.596         
##    Detection Prevalence : 0.616         
##       Balanced Accuracy : 0.902         
##                                         
##        'Positive' Class : 0             
##
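The headline statistics can be recomputed by hand from the four counts in the table. The sketch below copies the counts as printed (the `tab` matrix and the variable names are illustrative) and uses class 0 as the 'positive' class, as the output states:

```r
# Counts copied from the table above, in the same row/column layout as printed
tab <- matrix(c(531, 81, 18, 261), nrow = 2,
              dimnames = list(row = c("0", "1"), col = c("0", "1")))

accuracy    <- sum(diag(tab)) / sum(tab)        # (531 + 261) / 891
sensitivity <- tab["0", "0"] / sum(tab[, "0"])  # 531 / (531 + 81)
specificity <- tab["1", "1"] / sum(tab[, "1"])  # 261 / (18 + 261)
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced), 3)
```

Rounded to three decimals, these reproduce the Accuracy, Sensitivity, Specificity and Balanced Accuracy lines reported above.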

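One passenger in the testing set has no Fare value; we fill it with the mean fare of the testing set:

# Select the missing value by column name rather than by row and column number
datest$Fare[is.na(datest$Fare)] <- mean(datest$Fare, na.rm = TRUE)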
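A quick way to spot gaps like this one is to count the missing values in each column before predicting. The following is a minimal sketch on a made-up data frame (`toy_test` and its values are assumptions for illustration only):

```r
# Toy test set (made up for illustration) with one missing fare
toy_test <- data.frame(
  PassengerId = 1:5,
  Age         = c(34, 47, 62, 27, 22),
  Fare        = c(7.25, 71.28, NA, 8.05, 53.10)
)

# Number of missing values per column; any non-zero entry needs attention
na_counts <- colSums(is.na(toy_test))

# Row positions of the missing fares
missing_rows <- which(is.na(toy_test$Fare))

# Fill with the mean of the observed fares
toy_test$Fare[missing_rows] <- mean(toy_test$Fare, na.rm = TRUE)
```

Running the same count on the real testing set before calling predict() would show which predictors still contain missing values.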

We now apply the model to the testing set and write the result to a CSV file for submission:

pred <- predict(modRF, datest)
pred <- as.integer(pred > 0.5)

result <- data.frame(PassengerId = datest$PassengerId, Survived = pred)
write.csv(result, "prediction.csv", row.names = FALSE)
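The thresholding and export steps can be checked on a few made-up scores before writing the real file. In this sketch the IDs, the scores and the names `toy_ids`, `toy_score`, `toy_result`, `out_file` and `check` are all hypothetical; the file goes to a temporary location and is read back to confirm its layout:

```r
# Hypothetical numeric predictions for four passengers (made up for illustration)
toy_ids   <- c(892, 893, 894, 895)
toy_score <- c(0.12, 0.50, 0.73, 0.91)

# Strictly greater than 0.5 counts as survived, so exactly 0.5 maps to 0
toy_result <- data.frame(PassengerId = toy_ids,
                         Survived    = as.integer(toy_score > 0.5))

# Write to a temporary file and read it back to confirm the layout
out_file <- tempfile(fileext = ".csv")
write.csv(toy_result, out_file, row.names = FALSE)
check <- read.csv(out_file)
```

Setting `row.names = FALSE` matters here: without it write.csv() adds an extra unnamed column of row numbers to the file.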

Enrique Balp Straffon

Monday, August 04, 2014