Background and Objectives

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the deaths of 1502 of the 2224 passengers and crew.

In this report, we are going to build a model to predict what sorts of people were more likely to survive.

Import Library

library(tidyverse) # for data wrangling
library(yardstick) # to calculate a cross-tabulation of observed and predicted classes
library(ggplot2) # to visualize data
library(plotly) # to visualize data interactively
library(caret) # to pre-process data and evaluate models
library(rsample) # to split the data into train and test sets
library(e1071) # to build the Naive Bayes model
library(ROCR) # to build the ROC graphic
library(tibble) # for tidy data frames

Data Import

The Titanic passenger data were obtained from kaggle.com and contain passenger attributes such as name, age, gender, and socio-economic class; a minimal import sketch follows the data description below.

Data description:
- PassengerId -> Row identifier
- Survived -> Survival (0 = No, 1 = Yes)
- Pclass -> Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name -> Passenger name
- Sex -> Sex/gender
- Age -> Age in years
- SibSp -> # of siblings/spouses aboard the Titanic
- Parch -> # of parents/children aboard the Titanic
- Ticket -> Ticket number
- Fare -> Passenger fare
- Cabin -> Cabin number
- Embarked -> Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
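
The file can be read into R as shown below. This is a minimal import sketch; the local filename "train.csv" (the default name of Kaggle's Titanic training file) is an assumption.

# read the Kaggle training file; the filename "train.csv" is an assumption
# stringsAsFactors = TRUE reproduces the factor columns shown in summary() below
titanic <- read.csv("train.csv", stringsAsFactors = TRUE)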

Data Wrangling

At this phase, we first have to check whether the variables have an appropriate class.

summary(titanic)
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186

Variables to be deleted and the reasons:
- PassengerId -> only serves as a row identifier
- Name, Ticket, Cabin -> each has more than 100 unique values, so they won’t provide any useful information for the model

Variables to be transformed or have their class changed:
- Survived, Pclass -> change into factor class
- Age -> transform into 3 classes (20 and below, 21-40, above 40)
- Fare -> transform into 2 classes. Most of the data lies below 50, so let’s use <= 50 and > 50
- SibSp, Parch -> transform into 2 classes. As we can see from the summary table, most observations have the value ‘0’ or ‘1’ (the 3rd quartile is still either ‘0’ or ‘1’)

# delete variables & change class of variables
titanic_final <- titanic %>% 
  select(-PassengerId, -Name, -Ticket, -Cabin) %>% 
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass))
titanic_final
# inspecting missing value
colSums(is.na(titanic_final))
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0      177        0        0        0        0

There are 177 missing values in the Age variable. Since Age will be one of the predictors, we need to remove the observations with missing Age values.

# delete missing values 
titanic_final <- titanic_final %>% 
  drop_na()

# transform the numeric variables into categorical classes
titanic_final$Age.Range <- ifelse(titanic_final$Age <= 20, "Below 20",
                                  ifelse(titanic_final$Age > 20 & titanic_final$Age <= 40, "21-40",
                                         "above 40"))
titanic_final$SibSp.Exp <- ifelse(titanic_final$SibSp > 0, "Yes", "No")
titanic_final$Parch.Exp <- ifelse(titanic_final$Parch > 0, "Yes", "No")
titanic_final$Fare.Exp <- ifelse(titanic_final$Fare <= 50, "<= 50", ">50")

# drop the original numeric columns that have been replaced
titanic_final <- titanic_final %>% select(-Age, -SibSp, -Parch, -Fare)
titanic_final

Exploratory Data Analysis

Checking the proportion of the target variable (Survived)

prop.table(table(titanic_final$Survived))
## 
##         0         1 
## 0.5938375 0.4061625

The class proportions of Survived are fairly balanced (Yes = 0.41, No = 0.59), so we can use this data to predict both classes.

Cross Validation

set.seed(123)
split <- initial_split(titanic_final, prop = 0.8, strata = "Survived")
train <- training(split)
test <- testing(split)

Checking the proportion of the target variable in the train data.

prop.table(table(train$Survived))
## 
##         0         1 
## 0.5933682 0.4066318

The target class proportions in the train data are still fairly balanced, so we can continue to the next phase.

Modeling: Naive Bayes

There are certain characteristics of Naive Bayes that should be considered:

  • assumes that all features of the dataset are equally important and independent. This allows Naive Bayes to perform fast computation (the algorithm is quite simple).
  • prone to bias due to data scarcity. In some cases, our data may have a distribution where scarce observations lead to probabilities close to 0 or 1, which introduces a heavy bias into the model and could lead to poor performance on unseen data.
  • more appropriate for data with categorical predictors. This is because Naive Bayes is sensitive to data scarcity, and a continuous variable might contain very scarce or even only one observation for a certain value.
  • applies the Laplace estimator/smoothing to handle the data scarcity problem. The Laplace estimator adds a small number (usually 1) to each of the counts in the frequency table, which ensures that each class-feature combination has a non-zero probability of occurring (see the short sketch after this list).
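
Here is a minimal sketch of the Laplace estimator on a hypothetical toy dataset (not the Titanic data): without smoothing, a class-feature combination that never occurs in training gets probability 0 and vetoes the whole prediction, while laplace = 1 keeps every cell non-zero.

# toy example: the level "b" never occurs together with outcome "yes"
# naiveBayes() comes from the e1071 package loaded above
toy <- data.frame(
  outcome = factor(c("yes", "yes", "yes", "no", "no")),
  feature = factor(c("a", "a", "a", "b", "b"))
)

naiveBayes(outcome ~ feature, data = toy, laplace = 0)$tables$feature # contains a hard 0
naiveBayes(outcome ~ feature, data = toy, laplace = 1)$tables$feature # every cell is non-zero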

Based on these characteristics, our Titanic data is quite suitable for a Naive Bayes model since most of the predictors have been transformed into categorical classes. To ensure a non-zero probability of occurring for every class-feature combination, we apply the Laplace estimator.

# model building
naive <- naiveBayes(Survived ~ ., data = train, laplace = 1)
# model fitting
naive_pred <- predict(naive, test, type = "class") # for the class prediction
naive_prob <- predict(naive, test, type = "raw") # for the probability

# result
naive_table <- select(test, Survived) %>% 
  bind_cols(Survived_pred = naive_pred) %>% 
  bind_cols(Survived_eprob = round(naive_prob[,1],4)) %>% 
  bind_cols(Survived_pprob = round(naive_prob[,2],4))

# performance evaluation - confusion matrix
naive_table %>% 
  conf_mat(Survived, Survived_pred) %>% 
  autoplot(type ="heatmap")

naive_perf <- confusionMatrix(data = naive_table$Survived_pred, reference = naive_table$Survived, positive = "1")
data.frame(
  "Accuracy" = naive_perf$overall[1],
  "Recall" = naive_perf$byClass[1],
  "Specificity" = naive_perf$byClass[2],
  "Precision" = naive_perf$byClass[3]
)
naive_ROC <- prediction(naive_prob[,2], labels = test$Survived )

# ROC curve
plot(performance(naive_ROC, "tpr", "fpr"),
     main="ROC")
abline(a=0, b=1)

# AUC
auc_ROCR_n <- performance(naive_ROC, measure = "auc")
auc_ROCR_n <- auc_ROCR_n@y.values[[1]]
auc_ROCR_n
## [1] 0.8435673

Metrics to be evaluated from the model (a small helper expressing them in confusion-matrix counts follows this list):
- Accuracy: the ability to correctly predict both classes from the total observation.
- Precision: the ability to correctly predict the positive class from the total predicted-positive class (false positive is low).
- Recall: the ability to correctly predict the positive class from the total actual-positive class (false negative is low).
- Specificity: the ability to correctly predict the negative class from the total actual-negative class.
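
The following generic helper (a sketch added for clarity, not code from the original analysis) expresses the four metrics in terms of the confusion-matrix cells:

# TP = true positive, FP = false positive, TN = true negative, FN = false negative
metrics_from_counts <- function(TP, FP, TN, FN) {
  c(Accuracy    = (TP + TN) / (TP + FP + TN + FN),
    Precision   = TP / (TP + FP),
    Recall      = TP / (TP + FN),
    Specificity = TN / (TN + FP))
}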

On the other hand, the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) describe how well our model can distinguish the target classes. ROC is a probability curve and AUC represents the degree or measure of separability. AUC ranges from 0.5 to 1; the higher the AUC, the better the model distinguishes the target classes.

For this case, we want to predict what sorts of people were likely to survive, so both categories of the target variable are equally important. On one hand, the model’s accuracy should be high; on the other hand, recall and specificity cannot be very low either. Options include finding the recall-specificity break-even point (the cutoff where recall and specificity are equal) or finding a cutoff where both scores are reasonably high (>= 70%) while accuracy also stays high.
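
As an illustration, here is a minimal sketch (not the cmplot helper used next) of how the recall-specificity break-even point could be located by scanning candidate thresholds over the Naive Bayes probabilities stored in naive_table:

# scan candidate thresholds and compute recall / specificity at each one
thresholds <- seq(0.01, 0.99, by = 0.01)

metrics <- sapply(thresholds, function(t) {
  pred <- factor(ifelse(naive_table$Survived_pprob >= t, "1", "0"), levels = c("0", "1"))
  cm <- table(Prediction = pred, Reference = naive_table$Survived)
  c(recall = cm["1", "1"] / sum(cm[, "1"]),       # TP / (TP + FN)
    specificity = cm["0", "0"] / sum(cm[, "0"]))  # TN / (TN + FP)
})

# threshold where recall and specificity are closest to each other
thresholds[which.min(abs(metrics["recall", ] - metrics["specificity", ]))]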

Let’s demonstrate some model tuning practice by changing the threshold for classification.

library(cmplot) # provides confmat_plot(), used below to plot metrics across classification thresholds

confmat_plot(prob = naive_table$Survived_pprob, ref = test$Survived, postarget = "1", negtarget = "0")

From the plot above, we know that the recall-specificity break-even point is at 0.36, but we need to set our threshold to 0.401 to get all metrics above 70%. Let’s try this threshold on our data.

# tuning final model
naive_table <- naive_table %>% 
  mutate(tuning_pred = as.factor(ifelse(Survived_pprob>= 0.401,"1","0")))

# metrics result
final_n <- confusionMatrix(data = naive_table$tuning_pred, reference = naive_table$Survived, positive = "1")

df_final_n <- data.frame(
  "Accuracy" = final_n$overall[1],
  "Recall" = final_n$byClass[1],
  "Specificity" = final_n$byClass[2],
  "Precision" = final_n$byClass[3]
) %>% 
  cbind(AUC = auc_ROCR_n)
df_final_n

Modeling: Decision Tree

The decision tree is a tree-based model whose major benefit is interpretability. The algorithm builds a set of rules that can be visualized in a diagram that resembles a tree.

There are certain characteristics of decision tree model:

  • performs well on both numerical and categorical variables.

  • all predictors are assumed to interact.

  • quite robust to the problem of multicollinearity. A decision tree will choose the variable that has the highest information gain in one split, whereas a method such as logistic regression would have used both.

  • robust and insensitive to outliers. Splitting happens at the condition that maximizes the homogeneity within the resulting groups, so outliers have little influence on the splitting process (see the small impurity illustration after this list).
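
As a tiny illustration of the homogeneity idea (using hypothetical class proportions, not the Titanic data), Gini impurity drops as a node becomes purer, and a split is preferred when the weighted impurity of the child nodes falls well below that of the parent:

# Gini impurity for a vector of class proportions
gini <- function(p) 1 - sum(p^2)

gini(c(0.5, 0.5)) # impure parent node: 0.50
gini(c(0.9, 0.1)) # much purer child node: 0.18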

Let’s start to build our model to understand what it looks like.

library(rpart)
library(rattle)
library(rpart.plot)

# model building
dtree <- rpart(formula = Survived ~ ., data = train, method = "class")

fancyRpartPlot(dtree, sub = NULL)

# model fitting
dtree_pred <- predict(dtree, test, type = "class")
dtree_prob <- predict(dtree, test, type = "prob")

# result
dtree_table <- select(test, Survived) %>%
  bind_cols(Survived_pred = dtree_pred) %>% 
  bind_cols(Survived_eprob = round(dtree_prob[,1],4)) %>% 
  bind_cols(Survived_pprob = round(dtree_prob[,2],4))

# performance evaluation - confusion matrix
dtree_table %>% 
  conf_mat(Survived, Survived_pred) %>% 
  autoplot(type = "heatmap")

dt_perf <- confusionMatrix(data = dtree_table$Survived_pred, reference = dtree_table$Survived, positive = "1")

data.frame(
  "Accuracy" = dt_perf$overall[1],
  "Recall" = dt_perf$byClass[1],
  "Specificity" = dt_perf$byClass[2],
  "Precision" = dt_perf$byClass[3]
)
# ROC
dtree_ROC <- prediction(dtree_prob[,2], labels = test$Survived )


# ROC curve
plot(performance(dtree_ROC, "tpr","fpr"),
     main ="ROC")
abline(a=0, b =1)

# AUC
auc_ROCR_d <- performance(dtree_ROC, measure = "auc")
auc_ROCR_d <- auc_ROCR_d@y.values[[1]]
auc_ROCR_d
## [1] 0.869152

Using the default parameters, we don’t have to tune the tree itself because we already get a normal-sized tree, so no pruning is required.
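
For reference, here is a brief sketch (an optional check, not part of the original workflow) of how pruning could be examined on the rpart object dtree if the default tree had grown too complex:

# inspect the cross-validated error for each complexity parameter (cp) value
printcp(dtree)

# prune back to the cp with the lowest cross-validated error
best_cp <- dtree$cptable[which.min(dtree$cptable[, "xerror"]), "CP"]
dtree_pruned <- prune(dtree, cp = best_cp)

Instead, let’s try to tune the model by changing the threshold for classification.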

confmat_plot(prob = dtree_table$Survived_pprob, ref = dtree_table$Survived, postarget = "1", negtarget = "0")
# tuning final model of decision tree
dtree_table <- dtree_table %>% 
  mutate(tuning_pred=as.factor(ifelse(Survived_pprob>= 0.369,"1","0")))

# metrics result
final_d <- confusionMatrix(data = dtree_table$tuning_pred, reference = dtree_table$Survived, positive = "1")

df_final_d <- data.frame(
  "Accuracy" = final_d$overall[1],
  "Recall" = final_d$byClass[1],
  "Specificity" = final_d$byClass[2],
  "Precision" = final_d$byClass[3]
) %>% 
  cbind(AUC = auc_ROCR_d)
df_final_d

Modeling: Random Forest

Random forest applies “the wisdom of the crowd”: it aggregates the votes of many decision trees, each built on a different bootstrap sample of the data. Let’s build the random forest model.

#model building
set.seed(123)
ctrl <- trainControl(method = "repeatedcv", number = 4, repeats = 4) # repeated 4-fold cross validation
forest <- train(Survived ~ ., data = train, method = "rf", trControl = ctrl)
forest
## Random Forest 
## 
## 573 samples
##   7 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 4 times) 
## Summary of sample sizes: 430, 430, 430, 429, 430, 430, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8066907  0.5807211
##    6    0.7979737  0.5699642
##   11    0.7944954  0.5631310
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

From the summary, we can see that the optimal number of variables considered for splitting at each tree node (mtry) is 2. We can inspect the importance of each variable by using varImp().

varImp(forest)
## rf variable importance
## 
##                   Overall
## Sexmale           100.000
## Pclass3            29.517
## Fare.Exp>50        16.750
## Age.RangeBelow 20   7.115
## Pclass2             6.312
## Parch.ExpYes        6.257
## EmbarkedC           5.256
## SibSp.ExpYes        4.133
## EmbarkedS           3.941
## Age.Rangeabove 40   2.849
## EmbarkedQ           0.000

In random forest, we don’t strictly have to split our dataset into train and test sets, because the out-of-bag (OOB) estimate already acts as a reliable estimate of the accuracy on unseen examples: each tree is built on a bootstrap sample, and the observations left out of that sample are used to estimate its error. The OOB estimate here was generated from our train dataset. See the summary below.

plot(forest$finalModel)
legend("topright", colnames(forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

forest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 20.24%
## Confusion matrix:
##     0   1 class.error
## 0 314  26  0.07647059
## 1  90 143  0.38626609
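
The printed OOB error rate can also be extracted programmatically; a small sketch using the caret object forest from above:

# final out-of-bag error rate after all trees have been grown
forest$finalModel$err.rate[forest$finalModel$ntree, "OOB"]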

Let’s fit the model to our test data.

# model fitting
forest_pred <- predict(forest, test, type = "raw")
forest_prob <- predict(forest, test, type = "prob")

# result
forest_table <- select(test, Survived) %>% 
  bind_cols(Survived_pred = forest_pred) %>% 
  bind_cols(Survived_eprob = round(forest_prob[,1],4)) %>%
  bind_cols(Survived_pprob = round(forest_prob[,2],4))

# performance evaluation - confusion matrix
forest_table %>% 
  conf_mat(Survived, Survived_pred) %>% 
  autoplot(type ="heatmap")

perf <- confusionMatrix(data = forest_table$Survived_pred, reference = forest_table$Survived, positive = "1")
data.frame(
  "Accuracy" = perf$overall[1],
  "Recall" = perf$byClass[1],
  "Specificity" = perf$byClass[2],
  "Precision" = perf$byClass[3]
)
# ROC
forest_roc <- data.frame(prediction = round(forest_prob[,2],4),
                         trueclass = as.numeric(forest_table$Survived==1))
head(forest_roc)
forest_roc <- ROCR::prediction(forest_roc$prediction, forest_roc$trueclass)

# ROC curve
plot(performance(forest_roc, "tpr","fpr"),
     main = "ROC")
abline(a=0, b=1)

# AUC
auc_ROCR_f <- performance(forest_roc, measure = "auc")
auc_ROCR_f <- auc_ROCR_f@y.values[[1]]
auc_ROCR_f
## [1] 0.8659148

As we can see from the result, this random forest model gives us good accuracy, recall, and precision (> 70%). Besides, we also get a high AUC. Let’s try to tune the threshold to reach a higher score on each metric.

confmat_plot(prob = forest_table$Survived_pprob, ref = test$Survived, postarget = "1", negtarget = "0")
# tuning final model
forest_table <- forest_table %>% 
  mutate(tuning_pred = as.factor(ifelse(Survived_pprob >= 0.345  , "1","0")))

# metrics result
final_f <- confusionMatrix(data = forest_table$tuning_pred, reference = forest_table$Survived, positive = "1")

df_final_f <- data.frame(
  "Accuracy" = final_f$overall[1],
  "Recall" = final_f$byClass[1],
  "Specificity" = final_f$byClass[2],
  "Precision" = final_f$byClass[3]
) %>% 
  cbind(AUC = auc_ROCR_f)
df_final_f

Conclusion

rbind("Naive Bayes" = df_final_n, "Desicion Tree" = df_final_d, "Random Forest" = df_final_f)

Based on the metrics table above, the predictive model built using the Random Forest algorithm performed slightly better than the others. The model achieved the highest accuracy (81%) while maintaining sensitivity, specificity, and precision above 70%, and it gave a high AUC of 86.6%. Therefore, the best model for classifying the survival of Titanic passengers is the Random Forest model.