The objective is to build a binary classification model on the Titanic training dataset and use this model to predict which individuals surivived the tragedy.
Let’s start by downloading the data sets and loading necessary packages.
train <- read.csv("train.csv", na.strings = "", colClasses = c("Survived" = "factor", "Name" = "character",
"Pclass" = "factor"))
test <- read.csv("test.csv", na.strings = "", colClasses = c("Survived" = "factor", "Name" = "character",
"Pclass" = "factor"))
library(ggplot2)
library(dplyr)
library(cowplot)
library(stringr)
library(knitr)
library(VIM)
library(class)
library(caret)
library(MASS)
library(randomForest)
library(e1071)
The dataset includes the following columns:
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
Let’s observe the correlations between any numeric predictor variables:
pairs(train[, c("Pclass", "Age", "SibSp", "Parch", "Fare")], col = train$Survived)
Socioeconomic status is reflected in the pclass variable. Let’s look at this more carefully:
Most passengers were in the third class and of this ticket class over half the population did not survive. In contrast, the number of passengers in first class that survived exceeded the number of first class passengers who perished. Hence you were less likely to survive if you were in the lower ticket class.
What about the fare that the passenger paid?
Clearly the fare correlates with ticket class and to some extent with port of embarkation.
It appears that the distribution of those who survived is shifted to higher average fare compared to the distribution of those who perished. However this may be due to the previous shown correlations.
Let’s look at age distributions:
This shows that your chances of survival were greater if you were under 10.
The variables SibSp and Parch represent how many siblings/spouses or how many parents/children were aboard respectively. Let’s combine these into a single feature for number of family members. Did having family aboard increase one’s odd of survival?
It appears that traveling with 1 - 3 family members improved one’s odds of survival over traveling alone or traveling with a large family.
Did one gender survive more than another?
Women were more likely to survive than men.
head(train$Name)
## [1] "Braund, Mr. Owen Harris"
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
## [5] "Allen, Mr. William Henry"
## [6] "Moran, Mr. James"
The Name feature is interesting as it provides an individuals title. This can be useful information.
newtrain <- mutate(newtrain, title = str_replace(str_extract(Name, "[a-zA-Z]+[.]"), "[.]", ""))
kable(t(summary(as.factor(newtrain$title))))
| Capt | Col | Countess | Don | Dr | Jonkheer | Lady | Major | Master | Miss | Mlle | Mme | Mr | Mrs | Ms | Rev | Sir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | 7 | 1 | 1 | 2 | 40 | 182 | 2 | 1 | 517 | 125 | 1 | 6 | 1 |
In addition to the more common tiles (such as Ms, Miss, Mr, Mrs, Master), there are entries such as Don, Jonkheer, Lady, Countess etc. Let’s re-categorize these less popular entries as follows:
update_title <- function(old_title){
if (is.element(old_title, c("Capt", "Col", "Don", "Jonkheer", "Major", "Sir"))) {
new_title <- "Mr"
} else if (old_title %in% c("Countess", "Lady", "Mme", "Ms", "Dona")) {
new_title <- "Mrs"
} else if (old_title == "Mlle") {
new_title <- "Miss"
} else if (old_title %in% c("Dr", "Master", "Miss", "Mr", "Mrs", "Ms", "Rev")) {
new_title <- old_title
}
return(new_title)
}
newtitles <- sapply(newtrain$title, update_title)
newtrain$title <- as.factor(newtitles)
kable(t(summary(newtrain$title)))
| Dr | Master | Miss | Mr | Mrs | Rev |
|---|---|---|---|---|---|
| 7 | 40 | 184 | 525 | 129 | 6 |
Several features have missing data. The entire training data set has 891 observations.
aggr(train, prop = FALSE, numbers = TRUE, combined = TRUE, sortVars = TRUE)
##
## Variables sorted by number of missings:
## Variable Count
## Cabin 687
## Age 177
## Embarked 2
## PassengerId 0
## Survived 0
## Pclass 0
## Name 0
## Sex 0
## SibSp 0
## Parch 0
## Ticket 0
## Fare 0
For the purposes of this analysis, let’s omit the Cabin feature because too much information is missing. Let’s try to impute the missing values for Age and Embarked.
Let’s try to deduce the missing Port of embarkation
subset(train, is.na(Embarked))
## PassengerId Survived Pclass Name
## 62 62 1 1 Icard, Miss. Amelie
## 830 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn)
## Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 62 female 38 0 0 113572 80 B28 <NA>
## 830 female 62 0 0 113572 80 B28 <NA>
It appears that these two women were not accompanied by any family. However we do have information on their fare and ticket class.
Let’s try using k-nearest neighbors on the fare where ticket class is 1 to predict port of embarkation. Note Pclass is not included for knn because it is categorical and we shouldn’t consider information from other ticket classes when these two individuals were in Pclass = 1.
ggplot(train, aes(Pclass, Fare, color = Embarked)) + geom_jitter() + geom_hline(yintercept = 80, color = "black")
pclass1 <- filter(train, Pclass == 1)
predictors <- dplyr::select(pclass1, Fare)
missing_emb <- is.na(pclass1$Embarked)
knn(data.frame(predictors[!missing_emb,]), data.frame(predictors[missing_emb,]), pclass1$Embarked[!missing_emb], k = as.integer(sqrt(dim(pclass1)[1])))
## [1] S S
## Levels: C Q S
We predict that the individuals likely embarked from Southampton. Based on some [research] (https://www.encyclopedia-titanica.org/titanic-survivor/martha-evelyn-stone.html) we can see that these two passengers did in fact boarded from Southampton!
newtrain$Embarked[which(is.na(newtrain$Embarked))] <- "S"
It’s hard to estimate the missing ages using a linear model because none of the continous predictors would logically correlate. However, the title likely correlates on average with the age. So let’s impute ages as grouped averages by title.
missing_age <- is.na(newtrain$Age)
train_age <- newtrain[!missing_age, ]
average_age <- aggregate(train_age$Age, by=list(train_age$title), mean, na.rm = TRUE)
impute.mean <- function(x) replace(x, is.na(x), round(mean(x, na.rm = TRUE)))
newtrain <- newtrain %>% group_by(title) %>% mutate(Age = impute.mean(Age))
Let’s start by creating the title column.
newtest <- mutate(test, title = str_replace(str_extract(Name, "[a-zA-Z]+[.]"), "[.]", ""))
kable(t(summary(as.factor(newtest$title))))
| Col | Dona | Dr | Master | Miss | Mr | Mrs | Ms | Rev |
|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | 21 | 78 | 240 | 72 | 1 | 2 |
newtitles <- sapply(newtest$title, update_title)
newtest$title <- as.factor(newtitles)
Now let’s check for missing information in the test data set.
aggr(newtest, prop = FALSE, numbers = TRUE, combined = TRUE, sortVars = TRUE)
##
## Variables sorted by number of missings:
## Variable Count
## Cabin 327
## Age 86
## Fare 1
## PassengerId 0
## Pclass 0
## Name 0
## Sex 0
## SibSp 0
## Parch 0
## Ticket 0
## Embarked 0
## title 0
We can impute missing age values using the same method.
#Renaming columns to match the model
newage <- c()
for (i in 1:dim(newtest)[1]) {
if (is.na(newtest$Age[i])) {
if (newtest$title[i]== "Dr" ) {newage[i] <- 42.00}
if (newtest$title[i]== "Master" ) {newage[i] <- 5.00}
if (newtest$title[i]== "Miss" ) {newage[i] <- 22.00}
if (newtest$title[i]== "Mr" ) {newage[i] <- 33.00}
if (newtest$title[i]== "Mrs" ) {newage[i] <- 36.00}
if (newtest$title[i]== "Rev" ) {newage[i] <- 43.00}
} else {
newage[i] <- newtest$Age[i]
}
}
newtest <- mutate(newtest, Age = newage)
Now let’s impute the missing Fare value in the test data set. As shown before, fare is dependent on Pclass and port of embarkation.
missingfare <- subset(newtest, is.na(Fare))
fit_fare <- lm(Fare ~ Pclass + Embarked, newtrain)
model_fare <- predict(fit_fare, missingfare)
newtest$Fare[as.numeric(rownames(missingfare))] = model_fare
The approach is to create a few models and determine which would be the best model to predict the test set. Here are the models:
In order to determine the best model, let’s perform cross validation using 25% of the training data set.
set.seed(101)
train_subset <- createDataPartition(y = newtrain$Survived, p = 0.75, list = FALSE)
drop.cols <- c('PassengerId', 'Name', 'Ticket', 'Cabin', 'Famsize')
newtrain_rev <- dplyr::select(newtrain, -one_of(drop.cols))
training <- newtrain_rev[train_subset, ]
validation <- newtrain_rev[-train_subset, ]
Let’s build the models and evaluate the accuracy of each model’s prediction against the validation set
# Logistic Regression
glm_fit <- glm(Survived ~ ., family = binomial(link='logit'), data = training)
glm_fit_simple <- glm(Survived ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Age,
family = binomial(link='logit'), data = training)
# Linear Discriminant Analysis
lda_fit <- lda(Survived ~ ., data = training)
# Bagging with Decision Trees
bag_fit <- train(Survived ~ ., data = training, method = "treebag")
# Random Forest
rf_fit <- train(Survived ~ ., data = training, method = "rf")
# Support Vector Machines
svm_fit <- svm(Survived ~ ., data = training, kernel = "linear")
svm_fit_simple <- svm(Survived ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Age,
data = training, kernel = "linear")
CM_glm <- confusionMatrix(validation$Survived, as.factor(round(predict(glm_fit, validation, type = "response"))))
CM_glm_simple <- confusionMatrix(validation$Survived, as.factor(round(predict(glm_fit_simple, validation, type = "response"))))
CM_lda <- confusionMatrix(validation$Survived, predict(lda_fit, validation)$class)
CM_bag <- confusionMatrix(validation$Survived, predict(bag_fit, validation))
CM_rf <- confusionMatrix(validation$Survived, predict(rf_fit, validation))
CM_svm <- confusionMatrix(validation$Survived, predict(svm_fit, validation))
CM_svm_simple <- confusionMatrix(validation$Survived, predict(svm_fit_simple, validation))
# Calculate model accuracies
sapply(list(glm = CM_glm, glm_simple = CM_glm_simple, lda = CM_lda, bagging = CM_bag, RF = CM_rf, SVM = CM_svm, SVM_simple = CM_svm_simple),
function(x) x$overall["Accuracy"])
## glm.Accuracy glm_simple.Accuracy lda.Accuracy
## 0.8378378 0.8198198 0.8288288
## bagging.Accuracy RF.Accuracy SVM.Accuracy
## 0.8333333 0.8423423 0.8243243
## SVM_simple.Accuracy
## 0.7927928
drop.cols <- c('PassengerId', 'Name', 'Ticket', 'Cabin')
newtest_rev <- dplyr::select(newtest, -one_of(drop.cols))
glm_pred <- data.frame(PassengerId = newtest$PassengerId, Survived = as.factor(round(predict(glm_fit, newtest_rev, type = "response"))))
glm_sim_pred <- data.frame(PassengerId = newtest$PassengerId, Survived = as.factor(round(predict(glm_fit_simple, newtest_rev, type = "response"))))
lda_pred <- data.frame(PassengerId = newtest$PassengerId, Survived = predict(lda_fit, newtest_rev)$class)
bag_pred <- data.frame(PassengerId = newtest$PassengerId, Survived = predict(bag_fit, newtest_rev))
rf_pred <- data.frame(PassengerId = newtest$PassengerId, Survived = predict(rf_fit, newtest_rev))
svm_pred <- data.frame(PassengerId = newtest$PassengerId, Survived = predict(svm_fit, newtest_rev))
svm_sim_pred <- data.frame(PassengerId = newtest$PassengerId, Survived = predict(svm_fit_simple, newtest_rev))