Titanic - Kaggle

Adam Tolnay

Competition description:

https://www.kaggle.com/c/titanic

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

library(randomForest)
library(e1071)
library(caret)

Importing data:

titanic.train <- read.csv(file = "train.csv", stringsAsFactors = FALSE, header = TRUE)
titanic.test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE)

# More complete age data was available on Wikipedia.
# The wiki files are assumed to list passengers in the same row order as the Kaggle files.
titanic.wikitrain <- read.csv(file = "wikitrain.csv", stringsAsFactors = FALSE, header = TRUE)
titanic.wikitest <- read.csv(file = "wikitest.csv", stringsAsFactors = FALSE, header = TRUE)
titanic.train$Age <- titanic.wikitrain$Age_wiki
titanic.test$Age <- titanic.wikitest$Age_wiki
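
The two assignments above copy ages by row position, which is only correct if the Wikipedia-augmented files list passengers in the same order as the Kaggle files. A more defensive alternative is to match rows by identifier. The sketch below assumes, hypothetically, that the wiki files carry a PassengerId column; the toy data frames stand in for the real files.

```r
# Toy stand-ins for the Kaggle file and the Wikipedia-augmented file.
# The PassengerId column in the wiki data is an assumption for illustration.
kaggle <- data.frame(PassengerId = c(1, 2, 3), Age = c(22, NA, 26))
wiki   <- data.frame(PassengerId = c(3, 1, 2), Age_wiki = c(26, 22, 38))

# match() returns, for each Kaggle row, the position of the same passenger in wiki
idx <- match(kaggle$PassengerId, wiki$PassengerId)

# Fail loudly if any passenger is missing from the wiki data
stopifnot(!anyNA(idx))

# Copy the ages in Kaggle row order, regardless of how wiki is sorted
kaggle$Age <- wiki$Age_wiki[idx]
print(kaggle)
```

If the row orders already agree, this gives the same result as the positional assignment.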

titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet <- FALSE
titanic.test$Survived <- 0

titanic.full <- rbind(titanic.train, titanic.test)

The combined data set has 1309 observations (891 training, 418 test) of 12 variables. The objective is to predict whether each passenger in the test set survived. I used a logit model, a random forest, and a support vector machine to predict survival. Here is a list of variables:

PassengerId: Unique identifier for each passenger.
Survived: 1 if the passenger survived, 0 otherwise. Unknown for the test set, where it is filled with a placeholder 0.
Pclass: Whether the passenger is in 1st, 2nd or 3rd class.
Name: Name of the passenger.
Sex: Passenger sex.
Age: Passenger age in years.
SibSp: The number of siblings and spouses aboard.
Parch: The number of parents and children aboard.
Ticket: Ticket number.
Fare: Passenger fare.
Cabin: Cabin number.
Embarked: Whether the passenger embarked from Cherbourg, Queenstown or Southampton.

Age and Fare have some missing values, so I fill them in with the median of the combined training and test data.

titanic.full$Age[is.na(titanic.full$Age)] <- median(titanic.full$Age, na.rm = TRUE)
titanic.full$Fare[is.na(titanic.full$Fare)] <- median(titanic.full$Fare, na.rm = TRUE)
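
As a small illustration of what the two lines above do, the following sketch applies the same median imputation to a toy vector. The values are made up and only stand in for titanic.full$Age.

```r
# Toy vector with gaps, standing in for titanic.full$Age
age <- c(22, NA, 38, 26, NA, 35)

# How many values are missing before imputation
n_missing <- sum(is.na(age))

# Median of the observed values only (na.rm = TRUE drops the NAs)
age_median <- median(age, na.rm = TRUE)

# Replace only the missing entries; observed values are untouched
age[is.na(age)] <- age_median
print(age)
```

Because the median is taken over titanic.full, the imputed value in the notebook also reflects the test-set passengers.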

Creating a new variable isMinor, which flags passengers aged 15 or younger:

titanic.full$isMinor <- titanic.full$Age <= 15

Casting categorical variables as factors:

titanic.full$Pclass <- as.factor(titanic.full$Pclass)
titanic.full$Sex <- as.factor(titanic.full$Sex)
titanic.full$Embarked <- as.factor(titanic.full$Embarked)
titanic.full$SurvivedFactor <- as.factor(titanic.full$Survived)

Splitting back into train and test:

titanic.train <- titanic.full[titanic.full$IsTrainSet == TRUE,]
titanic.test <- titanic.full[titanic.full$IsTrainSet == FALSE,]

Regression: Passenger name and cabin are not included. Backward stepwise selection by AIC indicated that all of the remaining variables except Fare and Embarked should be included.
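
The selection step itself is not shown in this notebook. One common way to run backward selection by AIC in R is step() from the stats package. The sketch below uses simulated data with hypothetical variables x1, x2, and noise rather than the Titanic columns, so it illustrates the technique and does not reproduce the original run.

```r
set.seed(1)

# Simulated data: y depends on x1 and x2, while noise is irrelevant
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1 - 1 * toy$x2))

# Start from the full logit model
full <- glm(y ~ x1 + x2 + noise, family = binomial(link = "logit"), data = toy)

# Drop one term at a time for as long as doing so lowers the AIC
reduced <- step(full, direction = "backward", trace = 0)

print(formula(reduced))
print(c(full = AIC(full), reduced = AIC(reduced)))
```

Setting trace = 1 prints the AIC of every candidate model at each step, which shows why a term was kept or dropped.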

set.seed(12)

survived.equation <- "Survived ~ Pclass + Sex + isMinor + SibSp + Parch + Age"
titanic.glm <- glm(survived.equation, family = binomial(link = "logit"), data = titanic.train)

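# Predicted classes for the test set (fitted probabilities rounded to 0 or 1)
GLMpredictions <- round(predict(titanic.glm, newdata = titanic.test, type = "response"))

# Predicted classes for the training set, used for the RMSE estimate below
GLMpredictionsRMSE <- round(predict(titanic.glm, newdata = titanic.train, type = "response"))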

titanic.glm

## 
## Call:  glm(formula = survived.equation, family = binomial(link = "logit"), 
##     data = titanic.train)
## 
## Coefficients:
## (Intercept)      Pclass2      Pclass3      Sexmale  isMinorTRUE  
##     3.87900     -1.25597     -2.45257     -2.80705      0.99529  
##       SibSp        Parch          Age  
##    -0.47249     -0.14159     -0.03142  
## 
## Degrees of Freedom: 890 Total (i.e. Null);  883 Residual
## Null Deviance:       1187 
## Residual Deviance: 779.1     AIC: 795.1

cat('An estimate of the RMSE is ', RMSE(GLMpredictionsRMSE, titanic.train$Survived))

## An estimate of the RMSE is  0.4329317
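
Because both the predictions and the labels are 0 or 1, every squared error is also 0 or 1, so the squared RMSE equals the share of misclassified passengers. The sketch below checks this identity on toy vectors and then applies it to the value reported above. The helper rmse is a base-R stand-in written for this example, and 891 is the number of training rows shown in the model output.

```r
# Base-R helper computing the root mean squared error
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Toy 0/1 labels and predictions (2 of 8 are wrong)
actual    <- c(0, 0, 1, 1, 0, 1, 0, 1)
predicted <- c(0, 1, 1, 1, 0, 0, 0, 1)

error_rate <- mean(predicted != actual)

# For 0/1 vectors, RMSE^2 is exactly the misclassification rate
stopifnot(isTRUE(all.equal(rmse(predicted, actual), sqrt(error_rate))))

# Applying the identity to the RMSE reported for the logit model
glm_rmse <- 0.4329317
glm_error_rate <- glm_rmse^2               # about 0.187
glm_errors <- round(glm_error_rate * 891)  # misclassified training rows
print(c(error_rate = glm_error_rate, errors = glm_errors))
```

Read this way, the figure is a training-set error rate, so it is likely to understate the error on unseen passengers.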

The coefficients show some expected relationships: the probability of survival is lower for the lower classes, for men, and for older passengers, and higher for minors. I expected solo passengers to be less likely to survive, but the probability of survival actually decreases as the number of relatives on board increases.
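
Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for the coefficients copied from the printed model output; with the fitted object available, exp(coef(titanic.glm)) would give the same table.

```r
# Coefficients copied from the printed model output above
coefs <- c(Pclass2 = -1.25597, Pclass3 = -2.45257, Sexmale = -2.80705,
           isMinorTRUE = 0.99529, SibSp = -0.47249, Parch = -0.14159,
           Age = -0.03142)

# exp() turns a log-odds coefficient into an odds ratio:
# below 1 lowers the odds of survival, above 1 raises them
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

Each ratio is the multiplicative change in the odds of survival for a one-unit increase in that variable (or relative to the baseline level for a factor), holding the other variables fixed.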

Random forest:

set.seed(23)

survived.equation <- "SurvivedFactor ~ Pclass + Sex + isMinor + SibSp + Parch + Age"

survived.formula <- as.formula(survived.equation)
titanic.rf <- randomForest(formula = survived.formula, data = titanic.train, ntree = 1000, mtry = 3, nodesize = .01 * nrow(titanic.test))

RFpredictions <- predict(titanic.rf, newdata = titanic.test, type="response")
RFpredictionsRMSE <- predict(titanic.rf, newdata = titanic.train, type="response")

titanic.rf

## 
## Call:
##  randomForest(formula = survived.formula, data = titanic.train,      ntree = 1000, mtry = 3, nodesize = 0.01 * nrow(titanic.test)) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 16.27%
## Confusion matrix:
##     0   1 class.error
## 0 505  44  0.08014572
## 1 101 241  0.29532164

cat('An estimate of the RMSE is ', RMSE(as.numeric(RFpredictionsRMSE)-1, titanic.train$Survived))

## An estimate of the RMSE is  0.359261

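The out-of-bag (OOB) estimate of the error rate is 16.3%, but the error rate on the test set was 21%, so the model was slightly overfit. In the OOB confusion matrix, the model predicts that 606/891, or 68%, of the passengers in the training set died, which is very close to the overall ratio of deaths, 1502/2224, or 67.5%. The share of wrong predictions is similar for both predicted classes: 16.7% of the passengers predicted to die actually survived (101/606), and 15.4% of those predicted to survive actually died (44/285). Measured per actual class, however, the model misclassifies survivors (29.5%) far more often than non-survivors (8.0%).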
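
The figures quoted from the confusion matrix can be recomputed directly. The sketch below rebuilds the OOB matrix from the printed output (rows are actual classes, columns are predicted classes) and derives the overall error rate, the per-class errors, and the share of wrong predictions within each predicted class.

```r
# OOB confusion matrix copied from the output above:
# rows are actual classes, columns are predicted classes
cm <- matrix(c(505,  44,
               101, 241),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

n <- sum(cm)  # number of training passengers

# Overall OOB error rate: off-diagonal counts over the total
oob_error <- (cm["0", "1"] + cm["1", "0"]) / n

# Error rate within each actual class (the class.error column)
class_error <- 1 - diag(cm) / rowSums(cm)

# Share of passengers the model predicts to have died
predicted_died <- sum(cm[, "0"]) / n

# Share of wrong predictions within each predicted class
wrong_among_predicted <- 1 - diag(cm) / colSums(cm)

print(round(c(oob_error = oob_error, predicted_died = predicted_died), 4))
print(round(class_error, 4))
print(round(wrong_among_predicted, 4))
```

The two views differ because they divide by different totals: class_error divides by the number of passengers in each actual class, while wrong_among_predicted divides by the number of passengers assigned to each predicted class.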

A rough ranking of variable importance based on the random forest:

titanic.rf$importance

##         MeanDecreaseGini
## Pclass          47.12901
## Sex            119.23328
## isMinor          6.32162
## SibSp           22.26294
## Parch           11.44942
## Age             65.01094
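
To make the ranking explicit, the sketch below sorts the printed importance values and expresses each one as a share of the total. The numbers are copied from the output above; with the fitted forest available, the same values could be taken from titanic.rf$importance[, "MeanDecreaseGini"].

```r
# Mean decrease in Gini impurity, copied from the output above
importance <- c(Pclass = 47.12901, Sex = 119.23328, isMinor = 6.32162,
                SibSp = 22.26294, Parch = 11.44942, Age = 65.01094)

# Rank from most to least important
ranked <- sort(importance, decreasing = TRUE)

# Express each value as a percentage of the total, for easier comparison
share <- round(100 * ranked / sum(ranked), 1)
print(share)
```

Sex and Age come out on top. isMinor ranks last, possibly because Age already carries most of the same information.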

Support vector machine: I tested linear, polynomial, and radial kernels, and used 10-fold cross-validation to find the best parameters. Tuning takes quite a while, so it is not run here; the tuning calls are left commented out and the selected parameters are passed straight to the models.

# linear.tune <- tune.svm(survived.formula, data = titanic.train, kernel = "linear", cost = seq(1, 20, 1))

# poly.tune <- tune.svm(survived.formula, data = titanic.train, kernel = "polynomial", degree = seq(3, 5, 1), cost = c(0.01, 0.1, 0.2, 0.5, 0.7, 1, 2, 3, 5, 10, 15, 20))

# radial.tune <- tune.svm(survived.formula, data = titanic.train, kernel = "radial", gamma = seq(0, 1, 0.1), cost = c(0.01, 0.1, 0.2, 0.5, 0.7, 1, 2, 3, 5, 10, 15, 20, 50, 100))
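
For readers unfamiliar with k-fold cross-validation, the sketch below shows the mechanics for a single model in base R. To stay self-contained it swaps in a logistic regression on simulated data (hypothetical variables x1 and x2) instead of an SVM on the Titanic data; a tuner such as tune.svm repeats this kind of loop for every parameter combination and keeps the combination with the lowest cross-validated error.

```r
set.seed(1)

# Simulated binary-outcome data standing in for titanic.train
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.2 * toy$x1 - 0.8 * toy$x2))

k <- 10
# Randomly assign each row to one of k folds of equal size
fold <- sample(rep(seq_len(k), length.out = n))

fold_error <- numeric(k)
for (i in seq_len(k)) {
  train    <- toy[fold != i, ]
  held_out <- toy[fold == i, ]

  # Fit on k - 1 folds, then classify the held-out fold
  fit  <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = train)
  prob <- predict(fit, newdata = held_out, type = "response")
  fold_error[i] <- mean((prob > 0.5) != held_out$y)
}

# The cross-validated error is the average over the held-out folds
cv_error <- mean(fold_error)
print(cv_error)
```

Every row is used for validation exactly once, so the averaged error is a less optimistic estimate than an error measured on the data the model was fitted to.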

set.seed(43)

titanic.svm.linear <- svm(formula = survived.formula, data = titanic.train, kernel = "linear", cost = 10)
titanic.svm.poly <- svm(formula = survived.formula, data = titanic.train, kernel = "polynomial", degree = 3, cost = 15)
titanic.svm.radial <- svm(formula = survived.formula, data = titanic.train, kernel = "radial", gamma = .1, cost = 20)

SVMpredictionsLinear <- predict(titanic.svm.linear, titanic.test)
SVMpredictionsPoly <- predict(titanic.svm.poly, titanic.test)
SVMpredictionsRadial <- predict(titanic.svm.radial, titanic.test)

# Predicted classes for the training set, used for the RMSE estimates below
LinearRMSE <- predict(titanic.svm.linear, titanic.train)
PolyRMSE <- predict(titanic.svm.poly, titanic.train)
RadialRMSE <- predict(titanic.svm.radial, titanic.train)

cat('Linear kernel RMSE: ', RMSE(as.numeric(LinearRMSE)-1, titanic.train$Survived))

## Linear kernel RMSE:  0.461783

cat('Polynomial kernel RMSE: ', RMSE(as.numeric(PolyRMSE)-1, titanic.train$Survived))

## Polynomial kernel RMSE:  0.4006168

cat('Radial kernel RMSE: ', RMSE(as.numeric(RadialRMSE)-1, titanic.train$Survived))

## Radial kernel RMSE:  0.3978054

Among the support vector machines, the radial kernel performed best, narrowly ahead of the polynomial kernel.

Output:

Since the random forest had the best RMSE on the training set, I will use its predictions.

PassengerId <- titanic.test$PassengerId
Survived <- RFpredictions

output.df <- as.data.frame(PassengerId)
output.df$Survived <- Survived

write.csv(output.df, file = "Kaggle.csv", row.names = FALSE)

It’s interesting that one submission achieved a good score using only the Name variable, which I initially thought was useless. Each name contains a title, such as Mr, Mrs, or Miss, and minors can often be identified by their titles: Miss for girls and Master for boys. Passengers are grouped into families by last name, and the following rules are used to predict survival:

All males die except boys in families where all females and boys live.
All females live except in families where all females and boys die.
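
The rules above depend on pulling a title and a family name out of each name. A minimal sketch of that extraction step follows, assuming names use the "Surname, Title. Given names" layout; the four example names are used purely for illustration, and the family-level rules themselves are not implemented here.

```r
# Example names in the "Surname, Title. Given names" layout
name <- c("Braund, Mr. Owen Harris",
          "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
          "Heikkinen, Miss. Laina",
          "Palsson, Master. Gosta Leonard")

# Surname: everything before the first comma
surname <- sub(",.*$", "", name)

# Title: the text between the comma and the first period
title <- sub("^[^,]*, *([^.]+)\\..*$", "\\1", name)

# "Master" marks boys, which separates them from adult males
is_boy <- title == "Master"

print(data.frame(surname, title, is_boy))
```

Grouping by surname alone can merge unrelated passengers who share a last name, so a fuller implementation would probably need another field to tell such families apart.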