Titanic Kaggle

Introduction

The objective is to build a binary classification model on the Titanic training dataset and use this model to predict which individuals surivived the tragedy.

Let’s start by downloading the data sets and loading necessary packages.

train <- read.csv("train.csv", na.strings = "", colClasses = c("Survived" = "factor", "Name" = "character",
                                                               "Pclass" = "factor"))
test <- read.csv("test.csv", na.strings = "", colClasses = c("Survived" = "factor", "Name" = "character",
                                                             "Pclass" = "factor"))

library(ggplot2)
library(dplyr)
library(cowplot)
library(stringr)
library(knitr)
library(VIM)
library(class)
library(caret)
library(MASS)
library(randomForest)
library(e1071)

Exploratory Analyses

The dataset includes the following columns:

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

Let’s observe the correlations between any numeric predictor variables:

pairs(train[, c("Pclass", "Age", "SibSp", "Parch", "Fare")], col = train$Survived)

Socioeconomic Status

Socioeconomic status is reflected in the pclass variable. Let’s look at this more carefully:

Most passengers were in the third class and of this ticket class over half the population did not survive. In contrast, the number of passengers in first class that survived exceeded the number of first class passengers who perished. Hence you were less likely to survive if you were in the lower ticket class.

Fare

What about the fare that the passenger paid?

Clearly the fare correlates with ticket class and to some extent with port of embarkation.

It appears that the distribution of those who survived is shifted to higher average fare compared to the distribution of those who perished. However this may be due to the previous shown correlations.

Age

Let’s look at age distributions:

This shows that your chances of survival were greater if you were under 10.

Family Aboard

The variables SibSp and Parch represent how many siblings/spouses or how many parents/children were aboard respectively. Let’s combine these into a single feature for number of family members. Did having family aboard increase one’s odd of survival?

It appears that traveling with 1 - 3 family members improved one’s odds of survival over traveling alone or traveling with a large family.

Gender

Did one gender survive more than another?

Women were more likely to survive than men.

Name Analyses

head(train$Name)

## [1] "Braund, Mr. Owen Harris"                            
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"                             
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
## [5] "Allen, Mr. William Henry"                           
## [6] "Moran, Mr. James"

The Name feature is interesting as it provides an individuals title. This can be useful information.

newtrain <- mutate(newtrain, title = str_replace(str_extract(Name, "[a-zA-Z]+[.]"), "[.]", ""))
kable(t(summary(as.factor(newtrain$title))))

Capt	Col	Countess	Don	Dr	Jonkheer	Lady	Major	Master	Miss	Mlle	Mme	Mr	Mrs	Ms	Rev	Sir
1	2	1	1	7	1	1	2	40	182	2	1	517	125	1	6	1

In addition to the more common tiles (such as Ms, Miss, Mr, Mrs, Master), there are entries such as Don, Jonkheer, Lady, Countess etc. Let’s re-categorize these less popular entries as follows:

Capt, Col, Don, Jonkheer, Major, Sir –> Mr
Countess, Lady, Mme, Ms, Dona –> Mrs
Mlle –> Miss

update_title <- function(old_title){
  if (is.element(old_title, c("Capt", "Col", "Don", "Jonkheer", "Major", "Sir"))) {
    new_title <- "Mr"
  } else if (old_title %in% c("Countess", "Lady", "Mme", "Ms", "Dona")) {
    new_title <- "Mrs"
  } else if (old_title == "Mlle") {
    new_title <- "Miss"
  } else if (old_title %in% c("Dr", "Master", "Miss", "Mr", "Mrs", "Ms", "Rev")) {
    new_title <- old_title
  }
  return(new_title)
}

newtitles <- sapply(newtrain$title, update_title)
newtrain$title <- as.factor(newtitles)

kable(t(summary(newtrain$title)))

Dr	Master	Miss	Mr	Mrs	Rev
7	40	184	525	129	6

Missing Data

Several features have missing data. The entire training data set has 891 observations.

aggr(train, prop = FALSE, numbers = TRUE, combined = TRUE, sortVars = TRUE)

## 
##  Variables sorted by number of missings: 
##     Variable Count
##        Cabin   687
##          Age   177
##     Embarked     2
##  PassengerId     0
##     Survived     0
##       Pclass     0
##         Name     0
##          Sex     0
##        SibSp     0
##        Parch     0
##       Ticket     0
##         Fare     0

For the purposes of this analysis, let’s omit the Cabin feature because too much information is missing. Let’s try to impute the missing values for Age and Embarked.

Port of Embarkation

Let’s try to deduce the missing Port of embarkation

subset(train, is.na(Embarked))

##     PassengerId Survived Pclass                                      Name
## 62           62        1      1                       Icard, Miss. Amelie
## 830         830        1      1 Stone, Mrs. George Nelson (Martha Evelyn)
##        Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 62  female  38     0     0 113572   80   B28     <NA>
## 830 female  62     0     0 113572   80   B28     <NA>

It appears that these two women were not accompanied by any family. However we do have information on their fare and ticket class.

Let’s try using k-nearest neighbors on the fare where ticket class is 1 to predict port of embarkation. Note Pclass is not included for knn because it is categorical and we shouldn’t consider information from other ticket classes when these two individuals were in Pclass = 1.

ggplot(train, aes(Pclass, Fare, color = Embarked)) + geom_jitter() + geom_hline(yintercept = 80, color = "black")

pclass1 <- filter(train, Pclass == 1)
predictors <- dplyr::select(pclass1, Fare)
missing_emb <- is.na(pclass1$Embarked)
knn(data.frame(predictors[!missing_emb,]), data.frame(predictors[missing_emb,]), pclass1$Embarked[!missing_emb], k = as.integer(sqrt(dim(pclass1)[1])))

## [1] S S
## Levels: C Q S

We predict that the individuals likely embarked from Southampton. Based on some [research] (https://www.encyclopedia-titanica.org/titanic-survivor/martha-evelyn-stone.html) we can see that these two passengers did in fact boarded from Southampton!

newtrain$Embarked[which(is.na(newtrain$Embarked))] <- "S"

Impute Age

It’s hard to estimate the missing ages using a linear model because none of the continous predictors would logically correlate. However, the title likely correlates on average with the age. So let’s impute ages as grouped averages by title.

missing_age <- is.na(newtrain$Age)
train_age <- newtrain[!missing_age, ]
average_age <- aggregate(train_age$Age, by=list(train_age$title), mean, na.rm = TRUE)

impute.mean <- function(x) replace(x, is.na(x), round(mean(x, na.rm = TRUE)))
newtrain <- newtrain %>% group_by(title) %>% mutate(Age = impute.mean(Age))

Prediction Model

Cleaning Test Data

Let’s start by creating the title column.

newtest <- mutate(test, title = str_replace(str_extract(Name, "[a-zA-Z]+[.]"), "[.]", ""))
kable(t(summary(as.factor(newtest$title))))

Col	Dona	Dr	Master	Miss	Mr	Mrs	Ms	Rev
2	1	1	21	78	240	72	1	2

newtitles <- sapply(newtest$title, update_title)
newtest$title <- as.factor(newtitles)

Now let’s check for missing information in the test data set.

aggr(newtest, prop = FALSE, numbers = TRUE, combined = TRUE, sortVars = TRUE)

## 
##  Variables sorted by number of missings: 
##     Variable Count
##        Cabin   327
##          Age    86
##         Fare     1
##  PassengerId     0
##       Pclass     0
##         Name     0
##          Sex     0
##        SibSp     0
##        Parch     0
##       Ticket     0
##     Embarked     0
##        title     0

We can impute missing age values using the same method.

#Renaming columns to match the model

newage <- c()
for (i in 1:dim(newtest)[1]) {
  if (is.na(newtest$Age[i])) {
    if (newtest$title[i]== "Dr" ) {newage[i] <- 42.00}
    if (newtest$title[i]== "Master" ) {newage[i] <- 5.00}
    if (newtest$title[i]== "Miss" ) {newage[i] <- 22.00}
    if (newtest$title[i]== "Mr" ) {newage[i] <- 33.00}
    if (newtest$title[i]== "Mrs" ) {newage[i] <- 36.00}
    if (newtest$title[i]== "Rev" ) {newage[i] <- 43.00}
  } else {
    newage[i] <- newtest$Age[i]
  }
}

newtest <- mutate(newtest, Age = newage)

Now let’s impute the missing Fare value in the test data set. As shown before, fare is dependent on Pclass and port of embarkation.

missingfare <- subset(newtest, is.na(Fare))

fit_fare <- lm(Fare ~ Pclass + Embarked, newtrain)
model_fare <- predict(fit_fare, missingfare)

newtest$Fare[as.numeric(rownames(missingfare))] = model_fare

Approach

The approach is to create a few models and determine which would be the best model to predict the test set. Here are the models:

Logistic Regression
Linear Discriminant Analysis
Bagging with Decision Trees
Random Forest
Support Vector Machine

In order to determine the best model, let’s perform cross validation using 25% of the training data set.

set.seed(101)
train_subset <- createDataPartition(y = newtrain$Survived, p = 0.75, list = FALSE)
drop.cols <- c('PassengerId', 'Name', 'Ticket', 'Cabin', 'Famsize')
newtrain_rev <- dplyr::select(newtrain, -one_of(drop.cols))
training <- newtrain_rev[train_subset, ]
validation <- newtrain_rev[-train_subset, ]

Prediction Models

Let’s build the models and evaluate the accuracy of each model’s prediction against the validation set

# Logistic Regression
glm_fit <- glm(Survived ~ ., family = binomial(link='logit'), data = training)
glm_fit_simple <- glm(Survived ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Age, 
                      family = binomial(link='logit'), data = training)

# Linear Discriminant Analysis
lda_fit <- lda(Survived ~ ., data = training)

# Bagging with Decision Trees
bag_fit <- train(Survived ~ ., data = training, method = "treebag")

# Random Forest
rf_fit <- train(Survived ~ ., data = training, method = "rf")

# Support Vector Machines
svm_fit <- svm(Survived ~ ., data = training, kernel = "linear")
svm_fit_simple <- svm(Survived ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Age,
                      data = training, kernel = "linear")

CM_glm <- confusionMatrix(validation$Survived, as.factor(round(predict(glm_fit, validation, type = "response"))))
CM_glm_simple <- confusionMatrix(validation$Survived, as.factor(round(predict(glm_fit_simple, validation, type = "response"))))
CM_lda <- confusionMatrix(validation$Survived, predict(lda_fit, validation)$class)
CM_bag <- confusionMatrix(validation$Survived, predict(bag_fit, validation))
CM_rf <- confusionMatrix(validation$Survived, predict(rf_fit, validation))
CM_svm <- confusionMatrix(validation$Survived, predict(svm_fit, validation))
CM_svm_simple <- confusionMatrix(validation$Survived, predict(svm_fit_simple, validation))

# Calculate model accuracies
sapply(list(glm = CM_glm, glm_simple = CM_glm_simple, lda = CM_lda, bagging = CM_bag, RF = CM_rf, SVM = CM_svm, SVM_simple = CM_svm_simple), 
       function(x) x$overall["Accuracy"])

##        glm.Accuracy glm_simple.Accuracy        lda.Accuracy 
##           0.8378378           0.8198198           0.8288288 
##    bagging.Accuracy         RF.Accuracy        SVM.Accuracy 
##           0.8333333           0.8423423           0.8243243 
## SVM_simple.Accuracy 
##           0.7927928

Test Set Prediction

drop.cols <- c('PassengerId', 'Name', 'Ticket', 'Cabin')
newtest_rev <- dplyr::select(newtest, -one_of(drop.cols)) 

glm_pred <- data.frame(PassengerId = newtest$PassengerId, Survived = as.factor(round(predict(glm_fit, newtest_rev, type = "response"))))
glm_sim_pred <- data.frame(PassengerId = newtest$PassengerId, Survived = as.factor(round(predict(glm_fit_simple, newtest_rev, type = "response"))))
lda_pred <- data.frame(PassengerId = newtest$PassengerId, Survived = predict(lda_fit, newtest_rev)$class)
bag_pred <- data.frame(PassengerId = newtest$PassengerId, Survived = predict(bag_fit, newtest_rev))
rf_pred <- data.frame(PassengerId = newtest$PassengerId, Survived = predict(rf_fit, newtest_rev))
svm_pred <- data.frame(PassengerId = newtest$PassengerId, Survived = predict(svm_fit, newtest_rev))
svm_sim_pred <- data.frame(PassengerId = newtest$PassengerId, Survived = predict(svm_fit_simple, newtest_rev))