The Challenge

This document is a response to the Kaggle challenge “Titanic - Machine Learning from Disaster”.

The goal of the challenge is to use the available data to build a machine learning model that predicts which passengers survived the shipwreck of the Titanic in 1912. Three models were used: a logistic regression model, a decision tree model and a random forest model model. The decision tree and the random forest model achieved the exact same performance of 0.7799, while the logistic regression performed slightly worse (0.76794).

Loading required packages

library(rlang)
library(httr)
library(jsonlite)
library(data.table)
library(tidyr)
library(ggplot2)
library(dplyr)
library(showtext)
library(extrafont)
library(viridis)
library(RColorBrewer)
library(ggsci)
library(forcats)
library(broom)
library(WVPlots)
library(caret)
library(pROC)
library(recipes)
library(readr)
library(ROCR)
library(vip)
library(stringr)
library(scales)
library(ggtext)
library(rpart) 
library(rpart.plot)
library(ranger)
library(h2o)
library(randomForest)
library(rattle)

The Data

Uploading training and testing data sets

Titanic_test <- fread("test.csv")
Titanic_train <- fread("train.csv")
gender_sub <- fread("gender_submission.csv")

Preparing Data

First we’ll take a look at the training data set. What infos are available?

str(Titanic_train)

## Classes 'data.table' and 'data.frame':   891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
##  - attr(*, ".internal.selfref")=<externalptr>

We see that passengers are identified by an Id Number (PassengerId) and by their names (Name). The data set also includes the survival variable (Survived): 1 if the passenger survived, 0 if the passenger died. Sex and Age are also displayed for each passenger. SibSp and Parch relates to the number of relatives each passenger had on board. SibSp is the number of siblings and spouses. Parch is the number of parents and children. Fare is the price paid on the ticket. Pclass refers to the socio-economic status of the passenger: 1 is the Upper Class, 2 is the Middle Class, 3 is the Lower Class. Embarked refer to the Port each passenger embarked on: “C” is for Cherbourg, “Q” is for Queenstown and “S” is for Southampton.

Let’s prep the data. We will not be using information that is idiossincratic to each passenger: Name, Ticket Number and Cabin Number. We will also filter to remove NAs from the analysis, since there are a number of passengers that don’t have the Age information available.

Titanic_train1 <- Titanic_train %>%
  mutate(Pclass = factor(Pclass), Sex = factor(Sex), Survived = factor(Survived)) %>%
  filter(Age != is.na(Age)) %>%
  select(-c("Ticket", "Cabin", "Name"))

Defining theme to be used

In this section, we create a theme to be used in all plots.

loadfonts(device = "win")
font_add("Palatino", "pala.ttf")
showtext_auto()
theme1 <- theme_minimal(base_size = 30) + theme(plot.subtitle = element_text(face = "italic"),
                                  text = element_text(family = "Palatino"))
color1 <- pal_jama(palette = "default")(3)[1]
color2 <- pal_jama(palette = "default")(3)[2]
color3 <- pal_jama(palette = "default")(3)[3]

my_theme <- function(type = c("discrete", "continuous")) {
  if(type == "discrete") {
    list(theme1, scale_colour_manual(values = c(color1, color3, color2)),
         scale_fill_manual(values = c(color1, color3, color2)))
  } else {
    list(theme1, scale_fill_continuous(high = color1, low = color3))
  }
}

Exploratory Data Analysis

We’ll perform some EDA to get acquainted with the datasets. First let’s look at Age. Is there some pattern to observe?

Titanic_train2 <- Titanic_train1 %>%
  mutate(Survived = fct_recode(Survived, "Survivor" = "1", "Deceased" = "0"),
         Sex = fct_recode(Sex, "Male" = "male", "Female" = "female"),
         Family_size = factor(SibSp + Parch + 1))

vline_df <- Titanic_train2 %>%
  group_by(Survived, Sex, .drop = FALSE) %>%
  summarise(Mean = mean(Age), Median = median(Age)) %>%
  as.data.frame() %>%
  mutate(Comment_median = paste("Median Age:", Median),
         Comment_mean = paste("Mean Age:", round(Mean, 2)))

Titanic_train2 %>%
  ggplot(aes(x = Age)) +
    geom_density(data = Titanic_train2[Survived == "Survivor"], 
                 aes(x = Age, y = ..density.., color = Survived, fill = Survived),
                 alpha = 0.3, size = 0.8) +
    geom_density(data = Titanic_train2[Survived == "Deceased"], 
                 aes(x = Age, y = -..density.., color = Survived, fill = Survived),
                 alpha = 0.3, size = 0.8) +
    geom_segment(data = vline_df %>% filter(Survived == "Survivor"), 
                 aes(x = Mean, y = 0 , xend = Mean, yend = Inf, color = Survived),
                 size = 1) +
    geom_segment(data = vline_df %>% filter(Survived == "Deceased"), 
                 aes(x = Mean, y = 0 , xend = Mean, yend = -Inf, color = Survived),
                 size = 1) +
    geom_segment(data = vline_df %>% filter(Survived == "Survivor"), 
                 aes(x = Median, y = 0 , xend = Median, yend = Inf, color = Survived),
                 linetype = 'dashed', size = 1) +
    geom_segment(data = vline_df %>% filter(Survived == "Deceased"), 
                 aes(x = Median, y = 0 , xend = Median, yend = -Inf, color = Survived),
                 linetype = 'dashed', size = 1) +
    geom_text(data = vline_df %>% filter(Survived == "Survivor"),
              y = 0.02, aes(x = 70, 
                            label = paste(Comment_mean,"\n",Comment_median), 
                            color = Survived),
              family = "Palatino", size = 9) +
    geom_text(data = vline_df %>% filter(Survived == "Deceased"),
              y = -0.02, aes(x = 70, 
                            label = paste(Comment_mean,"\n",Comment_median), 
                            color = Survived),
              family = "Palatino", size = 9) +
    coord_flip() +
    labs(
      title = "Density of Age Per Sex"
    ) +
    my_theme(type = "discrete") +
    theme(axis.text.x = element_blank(),
          axis.title.x = element_blank(),
          panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank()) +
    facet_wrap(~Sex)

Looking at Age density per Sex, we can see that the Females that died tend to be younger than those that survived. Deceased females hava a median Age of 24.5, versus 28 for the female survivors. This is different for Males: deceased Males have a slightly higher median Age (29) than the survivors (28). All distributions are somewhat leptokurtyc (they have positive excess kurtosis) and right-skewed.

When we look at Class, we see significant Age disparities. Upper Class passengers tend to be older overall (Median Age of 45.25 for Deceased and 35 por Survivors). Lower Class passengers are generally very young (Median Age of 25 for Deceased and 22 for Survivors). It is also noticeable how platycurtic the distribution for Upper Class is compared to the others.

Now let’s look at Fare prices. For this analysis we will cluster the fare prices in groups. We will also cluster Age.

Titanic_train2 %>%
  filter(Survived == "Deceased") %>%
  filter(Fare != is.na(Fare)) %>%
  group_by(grp_age = cut(Age, breaks = seq(0,80, by = 10)),
           grp_fare = cut(Fare, breaks = c(seq(0,100, by = 20), 520)),
           Survived) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = grp_age, y = grp_fare)) +
    geom_tile(aes(fill = n), width = 0.9, height = 0.9) +
    geom_text(aes(x = grp_age, y = grp_fare, label = n),
              family = "Palatino", color = "white", size = 20) +
    labs(
      title = "Number Of People per Age and Fare Group - Deceased Passengers",
      y = "Fare Group",
      x = "Age Group",
      fill = "Count"
    ) +
    my_theme(type = "continuous") +
    theme(legend.position = 'none',
          panel.grid.major.x = element_blank(),
          panel.grid.major.y = element_blank()) +
    scale_fill_gradient2(low = "lightblue", high = color1, trans = "log10")

We see that a large number of deceased passengers where between 10 and 40 years old and payed a very small amount (0-20).

For survivors, we see that it is more evenly distributed. Yes, a lot of survivors also payed a cheap fare, but those that payed the highest prices also seem to have better chances at survival.

What about Family Size? For this one, we will add SibSp and Parch. The goal is not to know what specific relatives were on board, but how many relatives each passenger had on board.

Titanic_train2 %>%
  mutate(Family_size = factor(Family_size)) %>%
  ggplot(aes(x = Family_size)) +
    geom_bar(data = Titanic_train2 %>% filter(Survived == "Survivor"),
             aes(y = ..count.., fill = Survived)) +
    geom_bar(data = Titanic_train2 %>% filter(Survived == "Deceased"),
             aes(y = -..count.., fill = Survived)) +
    geom_text(data = Titanic_train2 %>% filter(Survived == "Survivor"),
              stat = 'count',aes(label = ..count.., y = ..count..),
              family = "Palatino", hjust = -0.5, size = 15) +
    geom_text(data = Titanic_train2 %>% filter(Survived == "Deceased"),
              stat = 'count',aes(label = ..count.., y = -..count..),
              family = "Palatino", hjust = 1.25, size = 15) +
    scale_x_discrete(limits = rev) +
    scale_y_continuous(limits = c(-285, 285)) +
    coord_flip() +
    labs(
      title = "Number of Individuals Per Family Size",
      x = "Size of Family"
    ) +
    my_theme(type = "discrete") +
    theme(axis.text.x = element_blank(),
          panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank(),
          axis.title.x = element_blank())

We see that the majority of passengers had a Family Size of 1. This means that they were not accompanied by other relatives. People with small families (from 2 to 4 relatives) have survived more often, while most of the few people who brought big families aboard (from 5 to 8 relatives) have died.

Let’s combine the variables that seem most important (Sex, Age, Class) to look at survival rate within each group. For this analysis, we’ll cluster Age into groups.

Titanic_train3 <- Titanic_train2 %>%
  group_by(grp_age = cut(Age, breaks = seq(0,80, by = 20)),Survived, Sex, Pclass) %>%
  count() %>%
  group_by(grp_age, Sex, Pclass) %>%
  mutate(prop = n/sum(n))

Titanic_train3

Titanic_train3 %>%
  ggplot(aes(x = grp_age, y = prop, fill = Survived)) +
    geom_col(position = 'fill') +
    geom_text(data = Titanic_train3 %>% filter(Survived == "Survivor"),
              aes(x = grp_age, y = 0.2, 
                  label = paste(percent(prop))),
              color = "white", family = "Palatino", size = 13) +
    geom_text(data = Titanic_train3,
              aes(x = grp_age, y = 0.1,
                  label = paste("(",n/prop,")")),
              color = "white", family = "Palatino", size = 13) +
    facet_grid(Pclass~Sex) +
    my_theme(type = "discrete") +
    theme(plot.subtitle = element_text(color = color3, family="Palatino",
                                       face = "italic"),
          panel.grid.major.y = element_blank(),
          panel.grid.major.x = element_blank(),
          panel.grid.minor.y = element_blank()) +
    labs(
      title = "Survival Rate Analysis",
      subtitle = "Percentage inside each bar indicates survival rate",
      y = "Survival Rate",
      x = "Age Group"
    )

We see that Females in Upper and Middle Classes have a very good chance at survival regardless of their age. Men have a much lower Survival Rate even in Upper Classes. We also see that very few Lower Class Men have survived.

Building the Model

Now that we have taken a look at how different variables affect the chance of survival, it’s time to choose a model to use. Since the response is a categorical, binary variable (Survivor or Deceased), we will try two models, the first one being a logistic regression model and the second one being a decision tree.

The Logistic Regression Model

The Logistic Regression Model solves the problem of estimating a binary response variable by fitting a sigmoidal-shaped, non-linear regression line. The logistic function models the probability of response by giving an output between 0 and 1. Rearranging this function results in the logit transformation. This transformation’s output provides simple interpretation of results. The odds of response = 1 increase by exp(\(\beta\)) for each unit increase of each feature in the model.

For this model, we’ll use the “caret” package to train the training set. This functions automatically perform one-hot encoding of dummy variables. We will also perform pre-processing of the training set by removing zero-variance features and centering and scaling numeric data.

The training function will perform 10-fold cross-validation. The estimation method will be Partial Least Squares, which assesses if reduction of features helps improving accuracy. We will include all possible features and let the model itself trim for what is actually necessary.

set.seed(123)

ctrl <- trainControl(method = "cv", number = 10)

(cv_model_pls <- train(
  Survived ~ .,
  data = Titanic_train1,
  method = "pls",
  family = "binomial",
  preProcess = c("zv","center", "scale"),
  trControl = ctrl,
  tuneLenght = 8
))

## Partial Least Squares 
## 
## 714 samples
##   8 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (11), scaled (11) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 642, 642, 643, 643, 643, 643, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  Accuracy   Kappa    
##   1      0.7660603  0.5013915
##   2      0.7772692  0.5305714
##   3      0.7870501  0.5495633
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 3.

ggplot(cv_model_pls) +
  geom_line(color = color1) +
  geom_point(color = color3, size = 3) +
  geom_text(aes(x = ncomp, y = Accuracy, label = percent(Accuracy)),
            family = "Palatino" , vjust = -1, size = 13) +
  my_theme(type = "discrete") +
  labs(
    title = "Principal component analysis of Accuracy",
    x = "Number of Components"
  )

When plotting principal component analysis of accuracy, we can see that a model with 3 components achieved the highest accuracy. It correctly predicted about 78% of all fates.

But which components were used? Let’s plot component importance to find out. For logistic impression, component importance is estimated using the absolute value of the z-statistic for each parameter.

vip1 <- vip(cv_model_pls, num_features = 9, method = "model")

vip1$data %>%
  ggplot(aes(x = Importance, y = reorder(Variable, Importance), fill = Importance)) +
    geom_col() +
    my_theme(type = "continuous") +
    labs(
      title = "Component Importance",
      y = "Components"
    )

We see that Sexmale, Pclass3 and Fare were the most important variables. Sexmale is the one-hot encoded variable of Sex(1 for Male and 0 for Female). Similarly, the Pclass3 is the one-hot encoded variable of Pclass (1 for Lower Class and 0 if not).

How did the model actually fare? For this analysis, we will print the Confusion Matrix.

Titanic_train1$survival_pred <- predict(cv_model_pls, Titanic_train1)
model_prob <- predict(cv_model_pls, newdata = Titanic_train1, type = "prob")

Titanic_train1 <- Titanic_train1 %>%
  mutate(survival_prob = model_prob[,2])

cfMatrix <- confusionMatrix(
  data = relevel(Titanic_train1$survival_pred, ref = "1"),
  reference = relevel(Titanic_train1$Survived, ref = "1")
)

cfMatrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   0
##          1 199  59
##          0  91 365
##                                           
##                Accuracy : 0.7899          
##                  95% CI : (0.7582, 0.8192)
##     No Information Rate : 0.5938          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.5568          
##                                           
##  Mcnemar's Test P-Value : 0.01137         
##                                           
##             Sensitivity : 0.6862          
##             Specificity : 0.8608          
##          Pos Pred Value : 0.7713          
##          Neg Pred Value : 0.8004          
##              Prevalence : 0.4062          
##          Detection Rate : 0.2787          
##    Detection Prevalence : 0.3613          
##       Balanced Accuracy : 0.7735          
##                                           
##        'Positive' Class : 1               
##

Model accuracy being higher than the No-Information Rate means that the model predicted better than if we simply predicted the same fate(in this case, death) for all passengers. High specificity means that the model was accurate in the prediction of True Negatives. In other words, it had good performance in predicting who actually died.

Plotting the ROC curve will help us understand how the model fared against simply guessing the results at chance.

ROC <- roc(Titanic_train1$Survived, as.numeric(Titanic_train1$survival_pred))

Titanic_train2 <- Titanic_train1 %>%
  mutate(Survived = as.numeric(as.character(Survived)),
         survival_prob = as.numeric(as.character(survival_prob)))

ROCPlot(Titanic_train2, "survival_prob", "Survived", 
        truthTarget = TRUE, title = "ROC Curve", curve_color = color3) +
  my_theme(type = "discrete") +
  labs(
    title = "ROC Curve"
  )

An area under the curve of 0.85 means that the model performed well. It means that the True Positive Rate grows at a faster pace than the False Positive Rate.

Comparing Results

Let’s plot the Confusion Matrix in a more visual way.

This plot gives us a visual idea of the accuracy of logistic regression. We see that the model was more accurate when predicting deaths (high number of true negatives).

cfMatrix_plot <- as.data.frame(cfMatrix$table)
cfMatrix_plot <- cfMatrix_plot %>%
  mutate(Prediction = fct_relevel(Prediction, "1", after = 1),
         Reference = fct_relevel(Reference, "1", after = 1)) %>%
  group_by(Prediction) %>%
  mutate(Accuracy_prediction = Freq/sum(Freq)) %>%
  ungroup() %>%
  group_by(Reference) %>%
  mutate(Accuracy_reference = Freq/sum(Freq))

cfMatrix_plot

cfMatrix_plot %>%
  ggplot(aes(x = Prediction, y = Reference, fill = Freq)) +
    geom_tile() +
    geom_text(aes(x = Prediction, y = Reference, label = Freq),
            family = "Palatino", color = "white", size = 20) +
    my_theme(type = "continuous") +
    theme(legend.position = "none") +
    labs(title = "Confusion Matrix")

80% of all deaths predicted were correct 77% of all survivals predicted were correct. 86% of all deaths were predicted correctly. 69% of all survivals were predicted correctly.

What about probabilities? We’ll plot probabilities by Age, grouping by Sex and Class, to better visualize the tendencies.

Titanic_train1 %>%
  mutate(Sex = fct_recode(Sex , "Male" = "male", "Female" = "female")) %>%
  ggplot(aes(x = Age, y = survival_prob, color = Pclass)) +
    geom_point() +
    geom_smooth(method = "glm", method.args = list(family = "binomial"), 
      se = FALSE) +
    facet_wrap(~Sex) +
    my_theme(type = "discrete") +
    labs(
      title = "Survival Probability with Logistic Regression",
      y = "Survival Probability",
      color = "Passenger Class"
    )

All of the smoothed regression lines are descending. That means that the older the passenger is, the less likely he or she is to survive. We also see the big difference in survival probabilities for men and women. Class, as we’ve seen before, is also a big factor: the wealthier the passenger is, the more likely he or she is to survive.

Plotting the predicted probabilities vs the actual results provides us with a better visualization of the binomial logistic regression.

Titanic_train1 %>%
  ggplot(aes(x = survival_prob, y = as.numeric(as.character(Survived)))) +
    geom_point(alpha = 0.5, color = color3) +
    stat_smooth(method = "glm", method.args = list(family = "binomial"), 
                se = FALSE, color = color1) +
    my_theme(type = "discrete") +
    labs(
      title = "Prediction Probabilities vs Actual Results",
      y = "Actual Results",
      x = "Prediction Probabilities"
    )

Applying the model

To apply the model in the testing set, we will first impute missing numeric information via a K-nearest neighbours imputation using 6 neighbouring data points. For this step we will use the “recipes” package.

Titanic_train_recipe <- Titanic_train %>%
  select(-c("Name", "Ticket", "Cabin"))

Test_recipe <- recipe(Survived ~ ., data = Titanic_train_recipe) %>%
  step_knnimpute(Age, impute_with = imp_vars(Pclass, Fare, SibSp, Parch), neighbors = 6) %>%
  step_knnimpute(Fare, impute_with = imp_vars(Pclass, Age, SibSp, Parch), neighbors = 6)

Test_recipe

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          8
## 
## Operations:
## 
## K-nearest neighbor imputation for Age
## K-nearest neighbor imputation for Fare

prepare <- prep(Test_recipe, training = Titanic_train_recipe, strings_as_factors = TRUE)

Titanic_test1 <- Titanic_test %>%
  mutate(Survived = as.integer(0)) %>%
  relocate(Survived, .after = PassengerId) %>%
  select(-c("Ticket", "Cabin", "Name")) 

baked_test1 <- bake(prepare, new_data = Titanic_test1)

baked_test1 <- baked_test1 %>%
  mutate(Pclass = factor(Pclass),
         Sex = factor(Sex))

baked_test1$prediction <- predict(cv_model_pls, baked_test1)

Upon submitting the prediction results in the testing data on the Kaggle website, the calculated accuracy of the submission was 0.76794.

The Decision Tree Model

A tree-based model estimates the response variable by a set of rules that split the data into smaller groups. The terminal nodes of the tree will contain the probability of survival based on the characteristics of the observation.

We will train the data in a Decision Tree Model using 10-fold cross-validation. The “rpart2” method will authomatically search for the optimal maximum depth of tree.

Titanic_train_tree <- Titanic_train1 %>%
  select(-c(survival_pred, survival_prob, PassengerId)) %>%
  mutate(Survived = factor(Survived))

train.control <- trainControl(
                           method = "repeatedcv",
                           number = 10,
                           repeats = 3,
                           summaryFunction = twoClassSummary, 
                           classProbs = TRUE
                           )

tree_model <- train(
  make.names(Survived) ~ .,
  data = Titanic_train_tree,
  method = "rpart2",
  trControl = train.control,
  tuneLength = 10,
  metric = "ROC"
)

## note: only 8 possible values of the max tree depth from the initial fit.
##  Truncating the grid to 8 .

ggplot(tree_model) +
  geom_line(color = color1) +
  geom_point(color = color3, size = 3) +
  geom_text(aes(x = maxdepth, y = ROC, label = round(ROC,2)),
            family = "Palatino" , vjust = -1, size = 13) +
  my_theme(type = "discrete") +
  labs(title = "ROC by Maximum Tree Depth")

We see that a model with maximum tree depth of 5 had the higher ROC value.

Now let’s plot the decision tree to see which features are important in determining survival.

fancyRpartPlot(tree_model$finalModel,type = 5,  palettes = "BuGn",pal.thresh = 0.5, cex = 1.5, split.family = "Palatino", nn.family = "Palatino", fam.main = "Palatino", family = "Palatino", round = 0, pal.node.fun = TRUE)

Comparing Results

Titanic_train_tree$survival_pred <- predict(tree_model, Titanic_train_tree)
model_prob <- predict(tree_model, newdata = Titanic_train_tree, type = "prob")

Titanic_train_tree <- Titanic_train_tree %>%
  mutate(survival_prob = model_prob[,2]) %>%
  mutate(survival_pred = ifelse(survival_pred == "X0", 0, 1)) %>%
  mutate(survival_pred = factor(survival_pred))

cfMatrix_tree <- confusionMatrix(
  data = relevel(Titanic_train_tree$survival_pred, ref = "1"),
  reference = relevel(Titanic_train_tree$Survived, ref = "1")
)

cfMatrix_tree

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   0
##          1 181  13
##          0 109 411
##                                          
##                Accuracy : 0.8291         
##                  95% CI : (0.7995, 0.856)
##     No Information Rate : 0.5938         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6262         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.6241         
##             Specificity : 0.9693         
##          Pos Pred Value : 0.9330         
##          Neg Pred Value : 0.7904         
##              Prevalence : 0.4062         
##          Detection Rate : 0.2535         
##    Detection Prevalence : 0.2717         
##       Balanced Accuracy : 0.7967         
##                                          
##        'Positive' Class : 1              
##

Consulting the Confusion Matrix gives us a better understanding of model performance. Notice that the accuracy was higher than the one achieved by the Logistic Regression Model. But not unlike the other model, Specificity was way higher than Sensitivity. High specificity means that the model was accurate in the prediction of True Negatives. In other words, it had good performance in predicting who actually died.

Let’s plot it.

cfMatrix_tree_plot <- as.data.frame(cfMatrix_tree$table)
cfMatrix_tree_plot <- cfMatrix_tree_plot %>%
  mutate(Prediction = fct_relevel(Prediction, "1", after = 1),
         Reference = fct_relevel(Reference, "1", after = 1)) %>%
  group_by(Prediction) %>%
  mutate(Accuracy_prediction = Freq/sum(Freq)) %>%
  ungroup() %>%
  group_by(Reference) %>%
  mutate(Accuracy_reference = Freq/sum(Freq))
  

cfMatrix_tree_plot %>%
  ggplot(aes(x = Prediction, y = Reference, fill = Freq)) +
    geom_tile() +
    geom_text(aes(x = Prediction, y = Reference, label = Freq),
            family = "Palatino", color = "white", size = 20) +
    my_theme(type = "continuous") +
    theme(legend.position = "none") +
    labs(title = "Confusion Matrix")

79% of all deaths predicted were correct 93% of all survivals predicted were correct. 97% of all deaths were predicted correctly. 62% of all survivals were predicted correctly.

ROC <- roc(Titanic_train_tree$Survived, as.numeric(Titanic_train_tree$survival_pred))

Titanic_train_tree2 <- Titanic_train_tree %>%
  mutate(Survived = as.numeric(as.character(Survived)),
         survival_prob = as.numeric(as.character(survival_prob)))

ROCPlot(Titanic_train_tree2, "survival_prob", "Survived", 
        truthTarget = TRUE, title = "ROC Curve", curve_color = color3) +
  my_theme(type = "discrete") +
  labs(
    title = "ROC Curve"
  )

Plotting the ROC gives us an Area Under the Curve of 0.84.

Applying the model

We will apply the model in the same “baked” testing data set as before.

baked_test1$tree_prediction <- predict(tree_model, baked_test1)

baked_test1 <- baked_test1 %>%
  mutate(tree_prediction = ifelse(tree_prediction == "X0", 0, 1)) %>%
  mutate(tree_prediction = factor(tree_prediction))

Upon submitting the prediction results in the testing data on the Kaggle website, the calculated accuracy of the submission was 0.77990.

The Random Forest Model

A random forest model is a method for gathering a large collection of de-correlated bagged decision trees. The bagging method reduces variance and improves predictive performance of the model by building many trees on bootstrapped copied of the data set. Combining this with the split-variable randomization method which limits the search for the next split rule in each tree to a subset of the original features, we achieve decorrelation between trees.

We will performe the training of this model allowing “caret” to perform a grid search for the optimal tuning parameters. It will choose: mtry, the number of features to search in each split, splitrule, the splitting rule to use during tree construction, and min.node.size, which is one of the ways to measure tree complexity.

Titanic_train_rf <- Titanic_train1 %>%
  select(-c(survival_pred, survival_prob, PassengerId)) %>%
  mutate(Survived = factor(Survived))

response <- "Survived"
predictors <- setdiff(names(Titanic_train_rf), response)
n_features <- length(predictors)

set.seed(123)

rf_grid <- expand.grid(
  mtry = floor(n_features * c(.25,.33,.50,.66,.75)),
  min.node.size = c(1,3,5,7,10),
  splitrule = c("gini", "extratrees")
)

rf_model <- train(
  Survived~., 
  data=Titanic_train_rf, 
  method='ranger', 
  metric='Accuracy',
  tuneGrid = rf_grid,
  preProcess = c("zv","center", "scale"),
  trControl=trainControl(method = "cv", number = 10))

rf_model

## Random Forest 
## 
## 714 samples
##   7 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 642, 642, 643, 643, 643, 643, ... 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  splitrule   Accuracy   Kappa    
##   1      1             gini        0.7772105  0.5103780
##   1      1             extratrees  0.7856221  0.5322607
##   1      3             gini        0.7772105  0.5103804
##   1      3             extratrees  0.7898279  0.5405408
##   1      5             gini        0.7800078  0.5175955
##   1      5             extratrees  0.7898279  0.5414206
##   1      7             gini        0.7702269  0.4947003
##   1      7             extratrees  0.7841941  0.5269332
##   1     10             gini        0.7785798  0.5151439
##   1     10             extratrees  0.7856808  0.5316079
##   2      1             gini        0.8123044  0.5952571
##   2      1             extratrees  0.7982199  0.5599290
##   2      3             gini        0.8178991  0.6059845
##   2      3             extratrees  0.7996479  0.5631719
##   2      5             gini        0.8164906  0.6035668
##   2      5             extratrees  0.7982394  0.5587275
##   2      7             gini        0.8136933  0.5980372
##   2      7             extratrees  0.7982786  0.5599613
##   2     10             gini        0.8094484  0.5873664
##   2     10             extratrees  0.7912559  0.5446359
##   3      1             gini        0.8137715  0.6043784
##   3      1             extratrees  0.7954617  0.5583201
##   3      3             gini        0.8165689  0.6094015
##   3      3             extratrees  0.7968505  0.5606292
##   3      5             gini        0.8137715  0.6035279
##   3      5             extratrees  0.7954617  0.5581735
##   3      7             gini        0.8165493  0.6091893
##   3      7             extratrees  0.7982394  0.5641082
##   3     10             gini        0.8166080  0.6094427
##   3     10             extratrees  0.7968310  0.5604479
##   4      1             gini        0.8151017  0.6085541
##   4      1             extratrees  0.8094875  0.5924584
##   4      3             gini        0.8178991  0.6129206
##   4      3             extratrees  0.8053013  0.5841202
##   4      5             gini        0.8151408  0.6087635
##   4      5             extratrees  0.8095070  0.5926381
##   4      7             gini        0.8165689  0.6114668
##   4      7             extratrees  0.8053013  0.5830802
##   4     10             gini        0.8221440  0.6221654
##   4     10             extratrees  0.8039124  0.5796170
##   5      1             gini        0.8067293  0.5927149
##   5      1             extratrees  0.8165102  0.6114648
##   5      3             gini        0.8109351  0.6024075
##   5      3             extratrees  0.8137715  0.6052131
##   5      5             gini        0.8109351  0.6009592
##   5      5             extratrees  0.8123239  0.6009310
##   5      7             gini        0.8165493  0.6112550
##   5      7             extratrees  0.8123239  0.6003404
##   5     10             gini        0.8207746  0.6207243
##   5     10             extratrees  0.8151408  0.6063137
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4, splitrule = gini
##  and min.node.size = 10.

By optimizing accuracy, we have selected a model using 4 features in each split, a minimal terminal node size of 10 and a splitting rule which minimizes the Gini impurity. This model achieved an accuracy of about 82%.

rf_impurity <- ranger(
  formula = Survived ~ .,
  data = Titanic_train_rf,
  num.trees = 2000,
  mtry = rf_model$bestTune$mtry,
  min.node.size = rf_model$bestTune$min.node.size,
  sample.fraction = .80,
  replace = FALSE,
  importance = "impurity",
  respect.unordered.factors = "order",
  verbose = FALSE,
  seed = 123
)

vip_rf <- vip(rf_impurity, bar = FALSE)

vip_rf$data %>%
  ggplot(aes(x = Importance, y = reorder(Variable, Importance), fill = Importance)) +
    geom_col() +
    my_theme(type = "continuous") +
    labs(
      title = "Component Importance",
      y = "Components"
    )

Looking at Component Importance, we see this time that Sex, Age and Fare play a huge part. Component Importance is calculated throught the average decrease in the loss function across all trees.

ggplot(rf_model) +
  geom_point(size = 3, aes(color = min.node.size)) +
  geom_line(aes(group = min.node.size, color = min.node.size),
            size = 1) +
  guides(shape = FALSE) +
  theme1 +
  theme(legend.position = "bottom") +
  scale_color_jama() +
  labs(
    title = "Model accuracy using different hyperparameters",
    color = "Minimal Node Size",
    x = "Subset of Features (mtry)"
    )

Here we plot the cross-valitaded accuracy results for the hyperparameter grid-search. We see that overall, models using extratrees had poorer performance, while Gini impurity models using 7 or 10 data points in each node were considerably better.

Titanic_train_rf$prediction <- predict(rf_model, Titanic_train_rf)


cfMatrix_rf <- confusionMatrix(
  data = relevel(Titanic_train_rf$prediction, ref = "1"),
  reference = relevel(Titanic_train_rf$Survived, ref = "1")
)

cfMatrix_rf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   0
##          1 240  17
##          0  50 407
##                                           
##                Accuracy : 0.9062          
##                  95% CI : (0.8824, 0.9265)
##     No Information Rate : 0.5938          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8019          
##                                           
##  Mcnemar's Test P-Value : 9.252e-05       
##                                           
##             Sensitivity : 0.8276          
##             Specificity : 0.9599          
##          Pos Pred Value : 0.9339          
##          Neg Pred Value : 0.8906          
##              Prevalence : 0.4062          
##          Detection Rate : 0.3361          
##    Detection Prevalence : 0.3599          
##       Balanced Accuracy : 0.8937          
##                                           
##        'Positive' Class : 1               
##

This Confusion Matrix had the highest accuracy so far. Once again, Specificity was higher than Sensitivity. High specificity means that the model was accurate in the prediction of True Negatives. In other words, it had good performance in predicting who actually died.

cfMatrix_rf_plot <- as.data.frame(cfMatrix_rf$table)
cfMatrix_rf_plot <- cfMatrix_rf_plot %>%
  mutate(Prediction = fct_relevel(Prediction, "1", after = 1),
         Reference = fct_relevel(Reference, "1", after = 1)) %>%
  group_by(Prediction) %>%
  mutate(Accuracy_prediction = Freq/sum(Freq)) %>%
  ungroup() %>%
  group_by(Reference) %>%
  mutate(Accuracy_reference = Freq/sum(Freq))
  

cfMatrix_rf_plot %>%
  ggplot(aes(x = Prediction, y = Reference, fill = Freq)) +
    geom_tile() +
    geom_text(aes(x = Prediction, y = Reference, label = Freq),
            family = "Palatino", color = "white", size = 20) +
    my_theme(type = "continuous") +
    theme(legend.position = "none") +
    labs(title = "Confusion Matrix")

89% of all deaths predicted were correct 93% of all survivals predicted were correct. 96% of all deaths were predicted correctly. 83% of all survivals were predicted correctly.

Applying the Model

baked_test1$rf_prediction <- predict(rf_model, baked_test1)

We apply this model using the same “baked” data set that we used before. Upon submission to the Kaggle website, this model achieved a performance of 0.77990, the same performance achieved by the Decision Tree Model.

In conclusion, these models have fared better than the logistic regression model. The Decision Tree Model, despite being simpler and more prone to high variance than the Random Forest model, provides results which are easier to interpret.

Titanic - Machine Learning from Disaster

Gustavo Reis

18/03/2021

The Challenge

Loading required packages

The Data

Uploading training and testing data sets

Preparing Data

Defining theme to be used

Exploratory Data Analysis

Building the Model

The Logistic Regression Model

Comparing Results

Applying the model

The Decision Tree Model

Comparing Results

Applying the model

The Random Forest Model

Applying the Model