The Titanic Disaster


RMS Titanic, better known simply as Titanic, was a British ocean liner that sank in the early hours of 15 April 1912 after striking an iceberg during her maiden voyage. Of the estimated 2,224 passengers and crew on board, 1,496 perished, making it one of the deadliest maritime disasters of its time. The sinking has since become infamous, inspiring films and other media.

One of the biggest causes of the high death toll was the lack of lifeboats. There were nowhere near enough for everyone on board, forcing the crew to prioritize some groups of passengers over others.

Last time, we took a look at the passenger data and did some Exploratory Data Analysis to figure out the correlations between variables. This time, we’ll try building a model that predicts whether or not a passenger survived based on their characteristics.

Data Preparation

You can find the data set that we will be using for this project here. It consists of three files: train, test, and gender_submission. In this case, we’ll only be using train, as the test data does not include the correct answers (Survived values), so we’d be unable to check the model’s accuracy with it. As usual, let’s read the train data set using read.csv and use dplyr’s glimpse to check the data types.

library(dplyr)
train <- read.csv("data/train.csv")
glimpse(train)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…


This data set consists of 12 columns:

  • PassengerId: The ID of the passenger, can be treated as an index in this case
  • Survived: A 0 or 1 value indicating whether or not the passenger survived the disaster
  • Pclass: The ticket class of the passenger, which would be first (1), second (2), or third (3)
  • Name: Name of the passenger
  • Sex: Gender of the passenger
  • Age: Age of the passenger in years
  • SibSp: The number of siblings/spouses the passenger has on board
  • Parch: The number of parents/children the passenger has on board
  • Ticket: The passenger’s ticket ID
  • Fare: The fare price that the passenger had to pay
  • Cabin: Passenger’s cabin number
  • Embarked: The passenger’s port of embarkation, in this case Cherbourg (C), Queenstown (Q), or Southampton (S)

First off, let’s drop the columns that have too many distinct categorical values: PassengerId, Name, Cabin, and Ticket.
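
As a quick check of which columns those are (a small sketch, not part of the original write-up), we can count the distinct values in each column; ID-like columns such as PassengerId, Name, and Ticket have close to one unique value per row.

# count distinct values per column to spot ID-like columns
sapply(train, dplyr::n_distinct)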

train <- train %>%
  select(-c(PassengerId, Name, Cabin, Ticket))
glimpse(train)
## Rows: 891
## Columns: 8
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
## $ Pclass   <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ Sex      <chr> "male", "female", "female", "female", "male", "male", "male",…
## $ Age      <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,…
## $ SibSp    <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
## $ Parch    <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
## $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S", "…

NA Values

Let’s check for NA values. Since several columns contain strings/characters, let’s also make sure that blank values are treated as NA. For that, we’ll use mutate with across on all character columns and convert blank values ("") into NA.

train %>% mutate(across(where(is.character), ~ na_if(.,""))) %>% is.na() %>% colSums()
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0      177        0        0        0        2

Seems like we have some NA values. Let’s get rid of those rows using na.omit.

train <- train %>% na.omit()
train %>% is.na() %>% colSums()
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0        0        0        0        0        0

Data Types

Some columns, such as Survived, Pclass, Sex, and Embarked, contain a small set of repeating values and are really categorical. Let’s convert them into factors using mutate.

train <- train %>%
  mutate(Survived=as.factor(Survived),
         Pclass=as.factor(Pclass),
         Sex=as.factor(Sex),
         Embarked=as.factor(Embarked))

glimpse(train)
## Rows: 714
## Columns: 8
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
## $ Pclass   <fct> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3, 1…
## $ Sex      <fct> male, female, female, female, male, male, male, female, femal…
## $ Age      <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, …
## $ SibSp    <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0, 0…
## $ Parch    <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0, 0…
## $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750, 1…
## $ Embarked <fct> S, C, S, S, S, S, S, S, C, S, S, S, S, S, S, Q, S, S, S, Q, S…

Cross-Validation

Although we already have a file labeled test, we still need to split our own data so that we can evaluate the model. Let’s split the data with a proportion of 0.8 using rsample’s initial_split.

library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(777)
splitter <- initial_split(data= train, prop=0.8)

# splitting
train_data <- training(splitter)
test_data <- testing(splitter)
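
As an aside (not used in this post), initial_split() also accepts a strata argument that keeps the class proportions similar in both splits; a hypothetical stratified version would look like this:

# hypothetical stratified split, keeping the Survived proportions similar
splitter_strat <- initial_split(data = train, prop = 0.8, strata = Survived)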

Now, let’s check whether the class distribution in our training data is balanced.

table(train_data$Survived) %>% 
  prop.table()
## 
##         0         1 
## 0.5831874 0.4168126

Seems quite balanced to me. Looks like we’re ready to start making models.

Logistic Regression

The first model we’ll try to make is Logistic Regression. We’ll simply use glm and all of our predictors as a starting point.

logres_all <- glm(formula = Survived ~., family = "binomial",
             data = train_data)
summary(logres_all)
## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  14.963905 535.411352   0.028  0.97770    
## Pclass2      -0.851380   0.364589  -2.335  0.01953 *  
## Pclass3      -1.932228   0.376431  -5.133 2.85e-07 ***
## Sexmale      -2.504881   0.239834 -10.444  < 2e-16 ***
## Age          -0.043218   0.009238  -4.678 2.89e-06 ***
## SibSp        -0.432944   0.141890  -3.051  0.00228 ** 
## Parch         0.043208   0.136527   0.316  0.75164    
## Fare          0.003056   0.003025   1.010  0.31251    
## EmbarkedC   -10.973225 535.411284  -0.020  0.98365    
## EmbarkedQ   -12.083389 535.411561  -0.023  0.98199    
## EmbarkedS   -11.388800 535.411265  -0.021  0.98303    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 775.69  on 570  degrees of freedom
## Residual deviance: 520.98  on 560  degrees of freedom
## AIC: 542.98
## 
## Number of Fisher Scoring iterations: 12

While this model seems fine, we can see that there are still predictors that don’t seem to contribute much to the target variable. Let’s try the step method instead. Using the previous model with all the predictors, we can pass it into step with the backward direction. This will drop predictors one by one until it reaches the lowest AIC value it can find.

logres_step <- step(object = logres_all,
                   direction = "backward",
                   trace=F)
summary(logres_step)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Fare, family = "binomial", 
##     data = train_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.717439   0.550948   6.747 1.51e-11 ***
## Pclass2     -0.942016   0.354094  -2.660  0.00781 ** 
## Pclass3     -2.033789   0.367815  -5.529 3.21e-08 ***
## Sexmale     -2.502982   0.232155 -10.782  < 2e-16 ***
## Age         -0.044320   0.009160  -4.838 1.31e-06 ***
## SibSp       -0.425369   0.133364  -3.190  0.00143 ** 
## Fare         0.003894   0.002972   1.310  0.19023    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 775.69  on 570  degrees of freedom
## Residual deviance: 524.72  on 564  degrees of freedom
## AIC: 538.72
## 
## Number of Fisher Scoring iterations: 5

Prediction

Let’s use this model to predict the data in our test split. With type set to response, predict returns the estimated probability of survival for each passenger.

pred_logres <- predict(object = logres_step,
        newdata = test_data,
        type = "response")
head(pred_logres)
##          1          2          3          4          5          6 
## 0.27351103 0.86373450 0.05460894 0.38926516 0.62944359 0.15191265

Now, let’s convert those probabilities into class predictions using a 0.5 threshold and store them as a new column in a copy of the test data.

test_result <- test_data
test_result$logres_prediction <- ifelse(pred_logres > 0.5, yes = 1, no = 0)
rmarkdown::paged_table(test_result)

Model Evaluation

While we could check manually which predictions are correct, a more efficient way to evaluate the model is with a Confusion Matrix, available through the caret package’s confusionMatrix function. Make sure that the prediction column has the same type as the Survived column, i.e. factor.

library(caret)
logresMatrix <- confusionMatrix(data = as.factor(test_result$logres_prediction),
                reference = test_result$Survived, 
                positive = "1")
logresMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 84 19
##          1  7 33
##                                           
##                Accuracy : 0.8182          
##                  95% CI : (0.7451, 0.8777)
##     No Information Rate : 0.6364          
##     P-Value [Acc > NIR] : 1.573e-06       
##                                           
##                   Kappa : 0.5867          
##                                           
##  Mcnemar's Test P-Value : 0.03098         
##                                           
##             Sensitivity : 0.6346          
##             Specificity : 0.9231          
##          Pos Pred Value : 0.8250          
##          Neg Pred Value : 0.8155          
##              Prevalence : 0.3636          
##          Detection Rate : 0.2308          
##    Detection Prevalence : 0.2797          
##       Balanced Accuracy : 0.7788          
##                                           
##        'Positive' Class : 1               
## 

We can see that our logistic regression model has these values:

  • Accuracy: 81.8%
  • Sensitivity (Recall): 63.5%
  • Pos Pred Value (Precision): 82.5%
  • Specificity: 92.3%

Let’s input these values into a data frame as we go, in order to better evaluate the different models.

evaluation <- tibble(Model= "Logistic Regression",
          Accuracy = logresMatrix$overall[1],
           Sensitivity = logresMatrix$byClass[1],
           Specificity = logresMatrix$byClass[2],
           Precision = logresMatrix$byClass[3])
rmarkdown::paged_table(evaluation)

KNN

K-Nearest Neighbour classifies an observation based on the properties of its nearest neighbouring observations. Before building a K-Nearest Neighbour model, we first have to remove the categorical columns from our data frame, since KNN works with numeric distances. Let’s use select to achieve this.

train_knn <- train_data %>%
  select(-c(Pclass, Sex, Embarked))
test_knn <- test_data %>%
  select(-c(Pclass, Sex, Embarked))
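
As an aside, a possible alternative (not used here) would be to keep the categorical predictors by one-hot encoding them, for example with model.matrix, so that they become numeric dummy columns KNN can work with:

# hypothetical alternative: one-hot encode the factors instead of dropping them
train_dummy <- model.matrix(Survived ~ . - 1, data = train_data)
test_dummy <- model.matrix(Survived ~ . - 1, data = test_data)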

We then have to separate the target and predictors into separate variables like below.

#predictors
train_x <- train_knn %>% select(-Survived)

test_x <- test_knn %>% select(-Survived)

# target
train_y <- train_knn$Survived

test_y <- test_knn$Survived

Then, we have to scale the predictors. Do this for both the training and test data, but use the training data’s center and scale for the test set so that no information from the test set leaks into the preprocessing.

# train
train_xs <- scale(train_x)

# test
test_xs <- scale(test_x,
                        center = attr(train_xs, "scaled:center"),
                        scale = attr(train_xs, "scaled:scale"))

We’ll pick a starting K using a common rule of thumb: the square root of the number of rows in our training data.

sqrt(nrow(train_data))
## [1] 23.89561

Let’s round that up to 24 and build the KNN model. Unlike logistic regression, knn directly returns the predicted classes, as shown below.

library(class)
knn <- knn(train=train_xs,
                  test=test_xs,
                  cl=train_y,
                  k=24)
head(knn)
## [1] 0 1 1 0 0 0
## Levels: 0 1
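
As a sanity check on the square-root heuristic (a small sketch, not part of the original analysis), we could also scan a few K values and compare hold-out accuracy:

# try several K values and compare accuracy on the hold-out split
ks <- c(5, 11, 17, 24, 31)
sapply(ks, function(k) {
  pred <- class::knn(train = train_xs, test = test_xs, cl = train_y, k = k)
  mean(pred == test_y)
})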

Prediction

Now we can add it to our results data frame as a new column.

test_result$knn_prediction <- knn
rmarkdown::paged_table(test_result)

Model Evaluation

Finally, in order to evaluate the model, we’ll use Confusion Matrix.

KNNMatrix <- confusionMatrix(data=test_result$knn_prediction,
                reference= test_result$Survived,
                positive= "1")
KNNMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 76 27
##          1 15 25
##                                           
##                Accuracy : 0.7063          
##                  95% CI : (0.6244, 0.7794)
##     No Information Rate : 0.6364          
##     P-Value [Acc > NIR] : 0.04770         
##                                           
##                   Kappa : 0.3324          
##                                           
##  Mcnemar's Test P-Value : 0.08963         
##                                           
##             Sensitivity : 0.4808          
##             Specificity : 0.8352          
##          Pos Pred Value : 0.6250          
##          Neg Pred Value : 0.7379          
##              Prevalence : 0.3636          
##          Detection Rate : 0.1748          
##    Detection Prevalence : 0.2797          
##       Balanced Accuracy : 0.6580          
##                                           
##        'Positive' Class : 1               
## 

As for our K-Nearest Neighbour model, it has these values:

  • Accuracy: 70.6%
  • Sensitivity (Recall): 48.1%
  • Pos Pred Value (Precision): 62.5%
  • Specificity: 83.5%

Let’s add these to our previous data frame using add_row.

evaluation <- evaluation %>% add_row(Model= "K-Nearest Neighbour",
          Accuracy = KNNMatrix$overall[1],
           Sensitivity = KNNMatrix$byClass[1],
           Specificity = KNNMatrix$byClass[2],
           Precision = KNNMatrix$byClass[3])
rmarkdown::paged_table(evaluation)

Naive Bayes

The next model we’ll try is Naive Bayes. It is based on Bayes’ theorem, in which the probability of an event is updated as new information becomes available. We’ll use the naiveBayes function from the e1071 package, and once more use a formula where all the predictors are included.

library(e1071)
nb <- naiveBayes(
  formula = Survived ~ .,
  data = test_data
)
nb
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.6363636 0.3636364 
## 
## Conditional probabilities:
##    Pclass
## Y           1         2         3
##   0 0.1208791 0.2087912 0.6703297
##   1 0.5192308 0.2500000 0.2307692
## 
##    Sex
## Y       female       male
##   0 0.08791209 0.91208791
##   1 0.59615385 0.40384615
## 
##    Age
## Y       [,1]     [,2]
##   0 30.53297 13.30033
##   1 31.09615 17.26524
## 
##    SibSp
## Y        [,1]      [,2]
##   0 0.4725275 0.8735707
##   1 0.4807692 0.7273477
## 
##    Parch
## Y        [,1]      [,2]
##   0 0.4395604 1.0350394
##   1 0.3653846 0.6576502
## 
##    Fare
## Y       [,1]     [,2]
##   0 24.45302 39.95824
##   1 51.46803 58.49952
## 
##    Embarked
## Y                       C          Q          S
##   0 0.00000000 0.13186813 0.02197802 0.84615385
##   1 0.01923077 0.28846154 0.03846154 0.65384615

Here we can see the conditional probabilities of each predictor given the class, where row 1 corresponds to surviving passengers and row 0 to non-survivors. Some interesting things stand out, such as the high probability that a non-surviving passenger is male (91.2%).
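
As a quick illustration (a sketch, not part of the original analysis) of how the classifier combines these tables through Bayes’ theorem, we can compute the posterior probability of survival for a female passenger by hand, ignoring all the other predictors:

# P(Survived = 1 | female) from the a-priori and conditional tables above
prior_1 <- 0.3636364; prior_0 <- 0.6363636
p_f_1 <- 0.59615385; p_f_0 <- 0.08791209
(prior_1 * p_f_1) / (prior_1 * p_f_1 + prior_0 * p_f_0)
# ≈ 0.79, i.e. roughly a 79% posterior probability of survival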

Prediction

Now, let’s predict the test data with this model using predict. Set the type as class so that it outputs the predicted class.

nb_prediction <- predict(
  object = nb,
  newdata = test_data,
  type = "class"
)
test_result$nb_prediction <- nb_prediction
rmarkdown::paged_table(test_result)

Model Evaluation

Let’s use Confusion Matrix in order to evaluate this model.

NBMatrix <- confusionMatrix(data = nb_prediction,
                reference = test_data$Survived,
                positive = "1",
                mode = "everything")
NBMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 80 14
##          1 11 38
##                                           
##                Accuracy : 0.8252          
##                  95% CI : (0.7528, 0.8836)
##     No Information Rate : 0.6364          
##     P-Value [Acc > NIR] : 5.904e-07       
##                                           
##                   Kappa : 0.6175          
##                                           
##  Mcnemar's Test P-Value : 0.6892          
##                                           
##             Sensitivity : 0.7308          
##             Specificity : 0.8791          
##          Pos Pred Value : 0.7755          
##          Neg Pred Value : 0.8511          
##               Precision : 0.7755          
##                  Recall : 0.7308          
##                      F1 : 0.7525          
##              Prevalence : 0.3636          
##          Detection Rate : 0.2657          
##    Detection Prevalence : 0.3427          
##       Balanced Accuracy : 0.8049          
##                                           
##        'Positive' Class : 1               
## 

Our Naive Bayes model has these values:

  • Accuracy: 82.5%
  • Sensitivity (Recall): 73.1%
  • Pos Pred Value (Precision): 77.6%
  • Specificity: 87.9%

Let’s add these to the evaluation data frame as well.

evaluation <- evaluation %>% add_row(Model= "Naive Bayes",
          Accuracy = NBMatrix$overall[1],
           Sensitivity = NBMatrix$byClass[1],
           Specificity = NBMatrix$byClass[2],
           Precision = NBMatrix$byClass[3])
rmarkdown::paged_table(evaluation)

Decision Tree

Decision Tree is a bit different compared to the other models. Although we can use it to predict data just like the previous models, it also produces a set of rules that are immediately interpretable. To see this, we’ll create the model with all predictors using ctree from the partykit package and plot it.

library(partykit)
tree <- ctree(formula = Survived ~ .,
                   data = train_data)
  
plot(tree, type = "simple")

We can see that the resulting tree has Sex as its root node, making it the most important splitting factor. Then, depending on the sex, it branches into interior nodes on Pclass, Fare, and Age.

As an example of how to interpret this plot, let’s test it on an existing passenger profile. Let’s pick a row at random, say row 17, and remove the Survived column first so that we can check the answer afterwards.

train[17,] %>% select(-Survived)
##    Pclass    Sex Age SibSp Parch Fare Embarked
## 19      3 female  31     1     0   18        S

Following the passenger profile above, their sex (female) sends us first to check their Pclass. Their Pclass is 3, so the corresponding leaf node predicts 0 (did not survive). Now, let’s check our answer.

train[17,]$Survived
## [1] 0
## Levels: 0 1

Seems like we were correct.
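
For a programmatic version of that walk down the tree (a small sketch, not in the original post), partykit can print the fitted rules as text and report which terminal node a given row lands in:

# print the rules as text and find the terminal node for row 17
print(tree)
predict(tree, newdata = train[17, ], type = "node")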

Prediction

Now, let’s predict with this model like we did for the previous ones. Set the type as response in order to output the predicted class.

tree_prediction <- predict(
  object = tree,
  newdata = test_data,
  type = "response"
)
test_result$tree_prediction <- tree_prediction
rmarkdown::paged_table(test_result)

As always, let’s check the performance of our model with Confusion Matrix.

treeMatrix <- confusionMatrix(
  data = tree_prediction,
  reference = test_data$Survived
)
treeMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 90 29
##          1  1 23
##                                           
##                Accuracy : 0.7902          
##                  95% CI : (0.7143, 0.8538)
##     No Information Rate : 0.6364          
##     P-Value [Acc > NIR] : 5.117e-05       
##                                           
##                   Kappa : 0.4876          
##                                           
##  Mcnemar's Test P-Value : 8.244e-07       
##                                           
##             Sensitivity : 0.9890          
##             Specificity : 0.4423          
##          Pos Pred Value : 0.7563          
##          Neg Pred Value : 0.9583          
##              Prevalence : 0.6364          
##          Detection Rate : 0.6294          
##    Detection Prevalence : 0.8322          
##       Balanced Accuracy : 0.7157          
##                                           
##        'Positive' Class : 0               
## 

These values are not too bad, especially when it comes to Sensitivity. But we may be able to improve the model by adjusting some of its controls, such as mincriterion, minsplit, and minbucket. Tightening these is typically called Pruning, but since our tree is already quite simple, let’s instead loosen the parameters a bit from their defaults; it’s possible that our tree is underfitting.

The default mincriterion value is 0.95; let’s lower it to 0.75, and also lower minsplit and minbucket.

# tuning model decision tree
tree_tuned <- ctree(formula = Survived ~ ., 
                            data = train_data,
                            control = ctree_control(mincriterion = 0.75, 
                                                    minsplit = 10,
                                                    minbucket = 5))
plot(tree_tuned, type="simple")

Let’s predict using this new model.

tree_tuned_prediction <- predict(
  object = tree_tuned,
  newdata = test_data,
  type = "response"
)
test_result$tree_tuned_prediction <- tree_tuned_prediction
tunedTreeMatrix <- confusionMatrix(
  data = tree_tuned_prediction,
  reference = test_data$Survived
)
tunedTreeMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 88 20
##          1  3 32
##                                           
##                Accuracy : 0.8392          
##                  95% CI : (0.7685, 0.8952)
##     No Information Rate : 0.6364          
##     P-Value [Acc > NIR] : 7.221e-08       
##                                           
##                   Kappa : 0.6263          
##                                           
##  Mcnemar's Test P-Value : 0.0008492       
##                                           
##             Sensitivity : 0.9670          
##             Specificity : 0.6154          
##          Pos Pred Value : 0.8148          
##          Neg Pred Value : 0.9143          
##              Prevalence : 0.6364          
##          Detection Rate : 0.6154          
##    Detection Prevalence : 0.7552          
##       Balanced Accuracy : 0.7912          
##                                           
##        'Positive' Class : 0               
## 

Model Evaluation

Seems like we managed to get a better accuracy score. Let’s store both into the evaluation data frame to make it easier for us to compare.

evaluation <- evaluation %>% add_row(Model= "Decision Tree",
          Accuracy = treeMatrix$overall[1],
           Sensitivity = treeMatrix$byClass[1],
           Specificity = treeMatrix$byClass[2],
           Precision = treeMatrix$byClass[3]) %>%
  add_row(Model= "Tuned Decision Tree",
          Accuracy = tunedTreeMatrix$overall[1],
           Sensitivity = tunedTreeMatrix$byClass[1],
           Specificity = tunedTreeMatrix$byClass[2],
           Precision = tunedTreeMatrix$byClass[3])

rmarkdown::paged_table(evaluation)

While our Tuned Decision Tree has a higher Accuracy, the original Decision Tree is still ahead when it comes to Sensitivity. Note, however, that for these two confusion matrices we did not set positive = "1", so caret used 0 as the positive class; their Sensitivity and Specificity values are therefore swapped relative to the other models in the table, and their Precision refers to predicting non-survivors.
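
If we wanted the decision-tree rows to be directly comparable, the two matrices could be recomputed with the positive class set to "1", for example (a sketch, not run here):

# recompute with survivors ("1") as the positive class, like the other models
treeMatrix_pos1 <- confusionMatrix(data = tree_prediction,
                reference = test_data$Survived,
                positive = "1")
tunedTreeMatrix_pos1 <- confusionMatrix(data = tree_tuned_prediction,
                reference = test_data$Survived,
                positive = "1")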

Random Forest

The last method we’ll be trying out is Random Forest. This method builds many decision trees on bootstrapped samples of the data and aggregates their votes into the final prediction; caret’s repeated cross-validation is then used to pick the best tuning parameters. To build the model, we first create the control parameters with trainControl and pass them to train.

# train_ctrl <- trainControl(method = "repeatedcv",
#                            number = 5,
#                            repeats = 3)
# 
# forest <- train(Survived ~ .,
#                    data = train_data,
#                    method = "rf",
#                    trControl = train_ctrl)
# saveRDS(forest, "forest.RDS")

Since fitting a Random Forest model takes quite a while, I’ve saved the resulting model in RDS format. Let’s load it back up using readRDS.

forest <- readRDS("forest.RDS")

Prediction

Same as before, we can predict using this model. For a caret model, the type argument only accepts raw or prob, so let’s use raw to get the predicted class.

forest_prediction <- predict(
  object = forest,
  newdata = test_data,
  type = "raw"
)
test_result$forest_prediction <- forest_prediction
rmarkdown::paged_table(test_result)

We can’t interpret a Random Forest the same way we did the Decision Tree, but we can see the importance of each variable/predictor used in the model by plotting varImp.

varImp(forest) %>% plot


Similar to the conclusion that we’ve reached when looking at our Decision Tree, Sex is the most important variable in our model, followed by Fare, Age, then Class.

Model Evaluation

We can access the model’s OOB (Out-of-Bag) error by looking at its finalModel. The OOB error is estimated on the observations each tree did not see during training (those left out of its bootstrap sample), so it serves as an estimate of the error on unseen data.

forest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.32%
## Confusion matrix:
##     0   1 class.error
## 0 308  25  0.07507508
## 1  91 147  0.38235294
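
As a quick check (a small sketch), the accuracy implied by this OOB confusion matrix can be computed directly from its cells:

# correctly classified out-of-bag observations / total observations
(308 + 147) / (308 + 25 + 91 + 147)
# ≈ 0.7968, i.e. about 79.7% OOB accuracy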

Our Random Forest model has an OOB error rate of 20.32%, which corresponds to an accuracy of about 79.68%.

evaluation <- evaluation %>% add_row(Model= "Random Forest",
          Accuracy = 0.7968)

Conclusion

Model Evaluation

To wrap up the performance of all our models, let’s look at the evaluation data frame. In the hypothetical case where we would use these models to predict a passenger’s chance of survival, I think it would be better to be safe than sorry. That means prioritizing the negative (0) case and reducing the number of False Positives (predicted as survived, but ended up not surviving), which points to the Precision metric.
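
As a reminder of what Precision measures here, it is the share of passengers predicted to survive who actually survived; using the counts from the logistic regression confusion matrix above (a small illustrative check):

# Precision = TP / (TP + FP) for the positive class "1"
33 / (33 + 7)
# = 0.825, matching the Pos Pred Value reported by confusionMatrix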

rmarkdown::paged_table(evaluation)

As we can see from the table, it would seem that our Logistic Regression, Tuned Decision Tree, and Random Forest models would be the most suited for the job.

Throughout the course of making these models, we’ve learned that there are many ways to improve a model, such as only using important predictors, or watching out for instances of overfitting or underfitting. Aside from that, we’ve also discovered that there are many tools that may help us in choosing important predictors, such as looking at a model’s summary, using stepwise models, and plotting out varImp.

We can also conclude that different models suit different needs. For example, KNN can only use numerical variables, which explains why its scores are rather poor in our case, since several of our important predictors are categorical. Meanwhile, if we’d like a set of interpretable rules rather than just predictions, we can use the Decision Tree.

Survival Factors

We can conclude that the biggest contributing factors to a passenger’s survival are their Sex, followed by their Fare, Age, and then Class. It would seem that female passengers were prioritized during the evacuation, which made their survival probability higher. What’s probably most interesting to me personally is that Fare seems to affect a passenger’s likelihood of surviving more than their Class does. Through this analysis and the creation of these models, we can safely say that gender as well as socioeconomic status were major factors during the evacuation of the Titanic’s passengers.