Introduction

The objective of this analysis is to predict the survival of passengers aboard the RMS Titanic. Although some groups of passengers were more likely to survive than others, group membership alone does not predict survival perfectly. The dataset includes variables that are obvious candidates for prediction (such as sex) as well as others whose importance is less apparent from common sense. It also contains missing values, which must be addressed before constructing a random forest to predict survival.

Data

The training data contains 891 observations of 12 variables. The variables provided are:

PassengerID
Survived (Yes/No)
Passenger Class (1,2,3)
Passenger Name
Passenger Sex
Passenger Age
Number of Passenger Siblings and Spouses
Number of Passenger Parents and Children
Ticket Number
Fare
Cabin
Port of Embarkation (Cherbourg, Queenstown, Southampton)

The training data has an overall survival rate of 342 passengers out of 891, about 38%. Figure 1 is a bar plot showing the distribution of passengers across the three classes. Class 1 accounted for 24% of the passengers, class 2 for 21%, and class 3 for 55%.

Figure 1

Sex is suspected to be a strong predictor of survival. Figure 2 shows the percentages of men and women who survived: only about 19% of men survived, compared with 74% of women. The training data contains 577 men and 314 women.

Figure 2
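
As a minimal sketch of how group-wise survival rates like these can be computed in R, the snippet below uses a small made-up data frame rather than the Titanic data. The column names Sex and Survived mirror the dataset, but the values are illustrative assumptions only.

```r
# Toy data for illustration only; these are not the Titanic records.
toy <- data.frame(
  Sex      = factor(c("male", "male", "male", "male",
                      "female", "female", "female", "female")),
  Survived = c(0, 0, 1, 0, 1, 1, 0, 1)
)

# Overall survival rate: the mean of a 0/1 indicator
overall_rate <- mean(toy$Survived)

# Survival rate within each sex
rate_by_sex <- tapply(toy$Survived, toy$Sex, mean)

# Number of passengers of each sex
count_by_sex <- table(toy$Sex)

print(overall_rate)
print(rate_by_sex)
print(count_by_sex)
```

Because Survived is coded 0/1, the mean within a group is the proportion of that group that survived, so no separate counting step is needed.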

Age is likely a good predictor of survival as well, but it is missing for many passengers, as shown in Figure 3. Among passengers whose age is recorded, the mean age is 29.7 years with a standard deviation of 14.5. Figure 4 plots survival against age with added jitter. The plot does not show a distinct difference between the ages of those who died and those who survived (bearing in mind that passengers with missing ages are not shown).

Figure 3

Figure 4
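
The summary statistics for a variable with missing values can be sketched as follows. The ages here are made up for illustration; the point is only that mean() and sd() need na.rm = TRUE when some values are NA.

```r
# Toy ages for illustration only; NA marks a missing age.
age <- c(22, 38, NA, 35, NA, 54, 2, 27)

n_missing    <- sum(is.na(age))    # number of missing ages
prop_missing <- mean(is.na(age))   # share of missing ages

# mean() and sd() return NA if any value is missing unless na.rm = TRUE
mean_age <- mean(age, na.rm = TRUE)
sd_age   <- sd(age, na.rm = TRUE)

print(c(n_missing = n_missing, prop_missing = prop_missing))
print(round(c(mean = mean_age, sd = sd_age), 1))
```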

Two variables describe the family relationships of passengers: the number of siblings and spouses, and the number of parents and children. Their means are 0.52 and 0.38, respectively. Figure 5 shows boxplots of these two variables.

Figure 5

Ticket is a factor variable with 681 levels. Some tickets are strings of numbers while others contain some letters and words.

Cabin is another variable with missing entries. Multiple imputation is used here to handle the missing values of age, but it is not appropriate for Cabin: the ship has a fixed set of cabin numbers, and some passengers have more than one cabin number listed.

Finally, the port of embarkation is recorded; it is missing for two observations. Southampton had by far the most boardings (644 of the 891 passengers). At Queenstown, 77 passengers boarded, 72 of whom were in class 3.
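
One practical point when counting missing ports is that an empty field in a CSV file is read as an empty string rather than NA unless told otherwise. The sketch below, on a made-up six-row CSV (not the Titanic file), shows one way to treat blanks as missing and cross-tabulate port by class.

```r
# Toy CSV for illustration only; the blank Embarked field mimics a missing port.
csv_text <- "PassengerId,Pclass,Embarked
1,3,S
2,1,C
3,3,Q
4,1,
5,3,Q
6,2,S"

# na.strings = c("", "NA") makes read.csv treat empty fields as NA
toy <- read.csv(text = csv_text, na.strings = c("", "NA"),
                stringsAsFactors = TRUE)

n_missing_port <- sum(is.na(toy$Embarked))

# Cross-tabulate port by class; useNA = "ifany" keeps a row for missing ports
port_by_class <- table(toy$Embarked, toy$Pclass, useNA = "ifany")

print(n_missing_port)
print(port_by_class)
```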

Methods

The missing values had to be addressed before building a random forest to predict survival, since discarding every observation with a missing value would waste data. Age was missing in both the training and test data sets, so the two were combined before using MICE (multiple imputation by chained equations) to impute the missing values. Passenger ID, Cabin, Passenger Name, and Ticket were excluded from the predictor matrix for the multiple imputation. The default imputation methods of the mice package in R were used: predictive mean matching for continuous variables and logistic regression for binary variables. Five imputed datasets were generated, and each was split back into training and test sets: the first 891 observations of each imputed dataset formed a new training set, and the remaining 418 formed a new test set. Figure 6 shows the imputed values in red and the observed values in blue.

Figure 6
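
To give a feel for what predictive mean matching does, here is a deliberately simplified sketch in base R. It is not the mice implementation: it uses made-up data, a single fully observed predictor, and skips the random draw of model parameters. The function name pmm_sketch and the donors argument are hypothetical names chosen for this illustration.

```r
# Simplified predictive mean matching for a single numeric variable.
# Assumptions: toy data, one fully observed predictor, and no parameter draw,
# so this illustrates the matching idea rather than what mice does internally.
pmm_sketch <- function(y, x, donors = 3) {
  d   <- data.frame(x = x, y = y)
  obs <- !is.na(d$y)

  # Fit a linear model on the observed cases and predict for every row
  fit  <- lm(y ~ x, data = d[obs, ])
  yhat <- predict(fit, newdata = d)

  y_obs    <- d$y[obs]
  yhat_obs <- yhat[obs]
  k        <- min(donors, length(y_obs))

  for (i in which(!obs)) {
    # Observed cases whose predictions are closest to this row's prediction
    nearest <- order(abs(yhat_obs - yhat[i]))[seq_len(k)]
    # Borrow the observed value of one randomly chosen donor
    pick   <- nearest[sample.int(length(nearest), 1)]
    d$y[i] <- y_obs[pick]
  }
  d$y
}

set.seed(1)
fare <- c(7, 8, 13, 26, 30, 52, 71, 80, 8, 27)     # made-up predictor
age  <- c(22, NA, 35, 40, NA, 48, 54, 58, 19, 38)  # made-up target with gaps

age_imputed <- pmm_sketch(age, fare)
print(age_imputed)
```

Because each imputed value is copied from an observed donor, imputations always fall within the range of the observed data, which is one reason the method suits a variable such as age.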

With the five new training datasets, five random forests were built. To generate predictions, each forest was used to predict survivorship in each of the five test data sets. This resulted in 25 sets of predicted values.

Results

The best-performing model was an ensemble of five random forests using passenger class, sex, age, number of siblings and spouses, number of parents and children, fare, and port of embarkation as predictors. The predictions from each of the five forests on each of the five test datasets were summed to count how many times each passenger was predicted to have survived. Passengers predicted to have survived 13 or more times out of 25 received a final prediction of “survived.” Passengers predicted to have survived 12 or fewer times received a final prediction of “did not survive.”
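
The voting rule can be sketched on a toy vote matrix as follows. The matrix is made up for illustration, with four passengers as rows and 25 predictions as columns; the variable names are assumptions of this sketch.

```r
# Toy vote matrix for illustration only: 4 passengers (rows) by 25 predictions
# (columns), where 1 means "survived" and 0 means "did not survive".
votes <- rbind(
  rep(1, 25),
  c(rep(1, 13), rep(0, 12)),
  c(rep(1, 12), rep(0, 13)),
  rep(0, 25)
)

n_votes   <- ncol(votes)           # 25 predictions per passenger
threshold <- ceiling(n_votes / 2)  # 13: a strict majority of 25

vote_count <- rowSums(votes)
final      <- as.integer(vote_count >= threshold)

print(vote_count)
print(final)
```

With an odd number of predictions, ties cannot occur, so every passenger receives exactly one final label.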

The model predicted that about 33% of passengers in the test data survived. According to the Kaggle submission, its accuracy was 77.99%.

Conclusions and Future Work

The key methods used here to predict who survived the shipwreck were multiple imputation by chained equations and random forests. There are several ways to deal with missing data. One benefit of MICE is that it does not fill in a single value and carry on the analysis as though that value were true. Instead, it produces several plausible values for each missing entry, and results can be combined across the imputed datasets. In this case, we are not particularly interested in the relationship between the predictors and survival; we only care about prediction. This makes the random forest a good choice of model.

One variable that carries a lot of information but was not used in this analysis is the passenger’s name. There is likely some relationship between the title in a passenger’s name and their age, which could help in imputing the missing ages. In addition, trying other combinations of imputation methods in the multiple imputation could lead to better results. For example, a regression tree might be a better choice than predictive mean matching for some variables. Finally, it would be interesting to consider an interaction between age and class. We would guess that first-class passengers and children were each more likely to survive. Perhaps the effect of each characteristic is amplified when both occur together.
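
As a rough sketch of the title idea, the snippet below pulls the title out of a few example names written in the "Surname, Title. Given names" format of the Name variable. The regular expression is one possible approach and was not part of this analysis.

```r
# Example names in the "Surname, Title. Given names" format
name <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina",
  "Palsson, Master. Gosta Leonard"
)

# Keep the text between the first comma and the first period after it
title <- sub("^[^,]*, *([^.]+)\\..*$", "\\1", name)

print(title)
print(table(title))
```

A title such as "Master" points to a young boy, so a derived title variable could be added to the imputation model as a predictor of age.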

Appendix

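# Read the data. The training file name is an assumption made by analogy with the test file.
# stringsAsFactors = TRUE keeps Sex, Embarked, etc. as factors (the default before R 4.0).
train <- read.csv("titanic_train.csv", header = TRUE, stringsAsFactors = TRUE)
test <- read.csv("titanic_test.csv", header = TRUE, stringsAsFactors = TRUE)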

#combine training and test data for the imputation
test$Survived <- as.factor(rep(NA, nrow(test)))

all_data <- rbind(train, test)
all_data$Survived <- as.factor(all_data$Survived)


library(mice)
library(randomForest)


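# Dry run (maxit = 0) to obtain the default predictor matrix;
# train and all_data have the same columns
impute1 <- mice(train, maxit = 0)

pred_matrix <- impute1$predictorMatrix
# Do not use identifier-like columns as predictors in the imputation models
pred_matrix[, c("PassengerId", "Name", "Ticket", "Cabin")] <- 0

# Five imputed datasets, ten iterations each
impute <- mice(all_data, predictorMatrix = pred_matrix, m = 5, maxit = 10)
# Strip plot of observed and imputed values (Figure 6)
stripplot(impute)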

# Extract the five completed datasets
newdat <- lapply(1:5, function(i) complete(impute, i))

# New training data with imputed values (first 891 rows)
newtrain <- lapply(newdat, function(d) d[1:891, ])

# New test data with imputed values (remaining 418 rows)
newtest <- lapply(newdat, function(d) d[892:1309, ])

set.seed(237)
forests <- list()

for (l in 1:5){
forests[[l]] <- randomForest(Survived ~ Pclass + Sex + Age + SibSp + Parch +Fare + Embarked, data=newtrain[[l]])
}

#get predictions

#store all 25 predictions in list
pred <- list()
y <- 1
 
for(m in 1:5){
  for(n in 1:5){
  pred[[y]] <- predict(forests[[m]], newtest[[n]], type = "response")
  y <- y+1
  }
}


all_predictions <- as.data.frame(pred)

# do.call(rbind.data.frame, pred)

colnames(all_predictions) <- as.character(c(1:25))

#change to numeric for addition
index <- sapply(all_predictions, is.factor)
all_predictions[index] <- lapply(all_predictions[index], function(x) as.numeric(as.character(x)))

# Number of "survived" votes per passenger across the 25 predictions
all_predictions$sum <- rowSums(all_predictions[, 1:25])

# Majority rules: 13 or more of the 25 votes gives a final prediction of survived
all_predictions$final <- as.numeric(all_predictions$sum >= 13)


test$Survived <- all_predictions$final

mean(all_predictions$final)

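# Keep only the ID and the prediction; row.names = FALSE avoids an extra index column
submission <- test[, c("PassengerId", "Survived")]
write.csv(submission, file = "predictions.csv", row.names = FALSE)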