Introduction

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Objective

In this project, we will try to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data from kaggle. The algorithm that we will use are Naive Bayes, Decision Tree and Random Forest.

Library and Setup

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(e1071)
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
library(ROCR)
library(partykit)
#> Loading required package: grid
#> Loading required package: libcoin
#> Loading required package: mvtnorm

Data Preparation

Data Import

titanic <-  read.csv('data_input/train.csv')

titanic

Data Exploration

  1. Check Data Type
glimpse(titanic)
#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
#> $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
#> $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
#> $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
#> $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal~
#> $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
#> $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
#> $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
#> $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
#> $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
#> $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
#> $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
  1. Data Summary
summary(titanic)
#>   PassengerId       Survived          Pclass          Name          
#>  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
#>  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
#>  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
#>  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
#>  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
#>  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
#>                                                                     
#>      Sex                 Age            SibSp           Parch       
#>  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
#>  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
#>  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
#>                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
#>                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
#>                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
#>                     NA's   :177                                     
#>     Ticket               Fare           Cabin             Embarked        
#>  Length:891         Min.   :  0.00   Length:891         Length:891        
#>  Class :character   1st Qu.:  7.91   Class :character   Class :character  
#>  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
#>                     Mean   : 32.20                                        
#>                     3rd Qu.: 31.00                                        
#>                     Max.   :512.33                                        
#> 

Parch, min value to Q3 tends to 0, meaning that the data is not very informative (it may not be used later in modeling)

  1. Missing Value Check
colSums(is.na(titanic))
#> PassengerId    Survived      Pclass        Name         Sex         Age 
#>           0           0           0           0           0         177 
#>       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
#>           0           0           0           0           0           0

The Age column has 177 missing values, we will fill these missing values with the median value of Age

  1. Data Overview

Glossary data titanic:

  1. PassengerId : Unique ID Number for each Titanic passenger (out of 831)
  2. Survived : Survival (0 = No, 1 = Yes)
  3. Pclass : Ticket Class (1= Upper, 2= Middle, 3= Low)
  4. Name : Unique Name of each Titanic passenger
  5. Sex : Female or Male
  6. Age : Age in year (out of 80)
  7. SibSp : Number of siblings / spouses aboard the Titanic (out of 8)
  8. Parch : Number of parents . children aboard the Titanic (out of 6)
  9. Ticket : Unique Ticket Number (out of 831)
  10. Fare : Passenger Fare (out of 512.33)
  11. Cabin : Cabin Number
  12. Embarked : Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Data Resume: - The data has 891 rows and 12 columns. - We can ignore PassengerId, Ticket, Name because it cannot be classified as a category or number that can affect the model. - Our target variable is Survived. - Survived, Pclass, Sex data type can be changed into a factor.

Data Manipulation

  1. Data Manipulation
titanic <- titanic %>% 
  mutate(Embarked_S = ifelse(Embarked == "S", 1, 0), #create column Embarked_S
         Embarked_C = ifelse(Embarked == "C", 1, 0), #create column Embarked_C
         Embarked_Q = ifelse(Embarked == "Q", 1, 0), #create column Embarked_Q
         Sex = ifelse(Sex == "female", 0, 1), 
         Age = replace_na(Age, replace = round(median(Age, na.rm = T))), #fill missing value with mean
         Age = as.integer(Age)) %>% 
  dplyr::select(-c(PassengerId, Ticket, Cabin, Name, Embarked)) %>% 
  mutate_at(c('Survived', 'Pclass', 'Sex', 'Embarked_S', 'Embarked_C', 'Embarked_Q'), as.factor)
glimpse(titanic)
#> Rows: 891
#> Columns: 10
#> $ Survived   <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,~
#> $ Pclass     <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3,~
#> $ Sex        <fct> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,~
#> $ Age        <int> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14, 5~
#> $ SibSp      <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0,~
#> $ Parch      <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0,~
#> $ Fare       <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, ~
#> $ Embarked_S <fct> 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,~
#> $ Embarked_C <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,~
#> $ Embarked_Q <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,~
  1. Missing Value Check
anyNA(titanic)
#> [1] FALSE

Pre-Processing Data

Variable Target Proportion

prop.table(table(titanic$Survived))
#> 
#>         0         1 
#> 0.6161616 0.3838384

The target variable has a fairly balanced proportion, doesn’t need additional pre-processing.

Cross Validation

Data Splitting.

RNGkind(sample.kind = "Rounding")
set.seed(303)
index <- sample(nrow(titanic), nrow(titanic)*0.7)

#Data Splitting
t_train <- titanic[index,]
t_test <- titanic[-index,]

label_ttrain <- t_train[, "Survived"]
label_ttest <- t_test[, "Survived"]

Target Variable Data Train Proportion

prop.table(table(label_ttrain))
#> label_ttrain
#>         0         1 
#> 0.5971108 0.4028892

Naive Bayes

Modelling

model_naive<- naiveBayes(Survived ~ ., data = t_train)  

Prediction

preds_naive <- predict(model_naive, newdata = t_test)
head(data.frame(actual = label_ttest, prediction = preds_naive))

Model Evaluation

  1. Confussion Matrix
eval_naive <- confusionMatrix(data = as.factor(preds_naive), 
                          reference = as.factor(label_ttest), 
                          positive = "1")

eval_naive
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 156  32
#>          1  21  59
#>                                           
#>                Accuracy : 0.8022          
#>                  95% CI : (0.7494, 0.8482)
#>     No Information Rate : 0.6604          
#>     P-Value [Acc > NIR] : 0.00000022      
#>                                           
#>                   Kappa : 0.5457          
#>                                           
#>  Mcnemar's Test P-Value : 0.1696          
#>                                           
#>             Sensitivity : 0.6484          
#>             Specificity : 0.8814          
#>          Pos Pred Value : 0.7375          
#>          Neg Pred Value : 0.8298          
#>              Prevalence : 0.3396          
#>          Detection Rate : 0.2201          
#>    Detection Prevalence : 0.2985          
#>       Balanced Accuracy : 0.7649          
#>                                           
#>        'Positive' Class : 1               
#> 
  1. ROC
# prediction with probability output
naive_pprob <- predict(model_naive, newdata = t_test, type = "raw")

# creating ROC data frame
data_roc <- data.frame(pred_proba = naive_pprob[, '1'],
                       actual = ifelse(t_test$Survived == "1", 1, 0))

head(data_roc)
#create prediction object
naive_roc <- prediction(predictions = data_roc$pred_proba, 
                       labels = data_roc$actual)

# ROC curve
plot(performance(naive_roc, "tpr", "fpr"),
     main="ROC")
abline(a=0, b=1, col='red') #diagonal line

  1. AUC
auc_naive <- performance(prediction.obj = naive_roc, measure = "auc")
auc_naive@y.values # AUC value from ROC plot
#> [[1]]
#> [1] 0.8123176

Decision Tree

Modelling

model_dt <- ctree(Survived ~ ., data = t_train)
plot(model_dt, type="simple", gp = gpar(fontsize = 8))

From the model, we get the information that the classification is based on variables Sex, Pclass, Embarked_S, Fare, and Age.

Prediction

preds_dt <- predict(model_dt, newdata = t_test, type = "response")
head(preds_dt)
#>  2  4  6  9 15 23 
#>  1  1  0  1  1  1 
#> Levels: 0 1

Model Evaluation

  1. Confusion Matrix
eval_dt <- confusionMatrix(data = preds_dt, reference = label_ttest, positive = "1")
eval_dt
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 155  24
#>          1  22  67
#>                                           
#>                Accuracy : 0.8284          
#>                  95% CI : (0.7778, 0.8715)
#>     No Information Rate : 0.6604          
#>     P-Value [Acc > NIR] : 0.0000000006551 
#>                                           
#>                   Kappa : 0.6153          
#>                                           
#>  Mcnemar's Test P-Value : 0.8828          
#>                                           
#>             Sensitivity : 0.7363          
#>             Specificity : 0.8757          
#>          Pos Pred Value : 0.7528          
#>          Neg Pred Value : 0.8659          
#>              Prevalence : 0.3396          
#>          Detection Rate : 0.2500          
#>    Detection Prevalence : 0.3321          
#>       Balanced Accuracy : 0.8060          
#>                                           
#>        'Positive' Class : 1               
#> 
  1. ROC
# prediction with probability output
dt_pprob <- predict(model_dt, newdata = t_test, type = "prob")

# creating ROC data frame
data2_roc <- data.frame(pred_proba = dt_pprob[, '1'],
                        actual = ifelse(t_test$Survived == "1", 1, 0))

#create prediction object
dt_roc <- prediction(predictions = data2_roc$pred_proba,
                     labels = data2_roc$actual)

# ROC curve
plot(performance(dt_roc, "tpr", "fpr"),
     main="ROC")
abline(a=0, b=1, col='red') #diagonal line

  1. AUC
auc_dt <- performance(prediction.obj = dt_roc, measure = "auc")
auc_dt@y.values # AUC value from ROC plot
#> [[1]]
#> [1] 0.8603092

Random Forest

Modelling

Check for column that have near-zero (we can delete columns that have near-zero or less informative).

nearZeroVar(t_train)
#> integer(0)

We can see that we have no near-zero variance data.

RNGkind(sample.kind = "Rounding")
set.seed(212)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_forest <- train(Survived ~ ., data = t_train, method = "rf", trControl = ctrl)

model_forest
#> Random Forest 
#> 
#> 623 samples
#>   9 predictor
#>   2 classes: '0', '1' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times) 
#> Summary of sample sizes: 498, 499, 498, 499, 498, 498, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.8116946  0.5929486
#>    6    0.7966968  0.5746334
#>   10    0.7779699  0.5360043
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.

From the model summary, we know that the optimum number of variables considered for splitting at each tree node is 2. We can also inspect the importance of each variable that was used in our random forest using varImp().

varImp(model_forest)
#> rf variable importance
#> 
#>             Overall
#> Sex1        100.000
#> Fare         53.841
#> Age          37.394
#> Pclass3      27.839
#> SibSp        12.102
#> Parch         9.964
#> Embarked_S1   4.911
#> Pclass2       4.315
#> Embarked_C1   3.409
#> Embarked_Q1   0.000

Even though our data has been split for train and test, actually we don’t need to split train and test data when using random forest. The result of bootstrap sampling has data that are not used in making random forests which called out-of-bag data. This out-of-bag data can be considered as data test by the model.

plot(model_forest$finalModel)
legend("topright", colnames(model_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

model_forest$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 2
#> 
#>         OOB estimate of  error rate: 19.58%
#> Confusion matrix:
#>     0   1 class.error
#> 0 342  30  0.08064516
#> 1  92 159  0.36653386

Prediction

preds_forest <- predict(model_forest, newdata = t_test, type = "raw")
head(preds_forest)
#> [1] 1 1 0 0 1 1
#> Levels: 0 1

Model Evaluation

  1. Confusion Matrix
eval_forest <- confusionMatrix(data = preds_forest, reference = label_ttest, positive = "1")
eval_forest
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 162  32
#>          1  15  59
#>                                           
#>                Accuracy : 0.8246          
#>                  95% CI : (0.7737, 0.8682)
#>     No Information Rate : 0.6604          
#>     P-Value [Acc > NIR] : 0.000000001618  
#>                                           
#>                   Kappa : 0.5904          
#>                                           
#>  Mcnemar's Test P-Value : 0.0196          
#>                                           
#>             Sensitivity : 0.6484          
#>             Specificity : 0.9153          
#>          Pos Pred Value : 0.7973          
#>          Neg Pred Value : 0.8351          
#>              Prevalence : 0.3396          
#>          Detection Rate : 0.2201          
#>    Detection Prevalence : 0.2761          
#>       Balanced Accuracy : 0.7818          
#>                                           
#>        'Positive' Class : 1               
#> 
  1. ROC
# prediction with probability output
forest_pprob <- predict(model_forest, newdata = t_test, type = "prob")

# creating ROC data frame
data3_roc <- data.frame(pred_proba = forest_pprob[, '1'],
                        actual = ifelse(t_test$Survived == "1", 1, 0))

#create prediction object
forest_roc <- prediction(predictions = data3_roc$pred_proba,
                         labels = data3_roc$actual)

# ROC curve
plot(performance(forest_roc, "tpr", "fpr"),
     main="ROC")
abline(a=0, b=1, col='red') #diagonal line

  1. AUC
auc_forest <- performance(prediction.obj = forest_roc, measure = "auc")
auc_forest@y.values # AUC value from ROC plot
#> [[1]]
#> [1] 0.8908239

Models Perfomance Comparation

#get the accuracy, sensitivity, specificity, precision and AUC of model naive bayes
evals_naive <- data_frame(Model = "Naive Bayes",
                          Accuracy = eval_naive$overall[1],
                          Recall = eval_naive$byClass[1],
                          Specificity = eval_naive$byClass[2],
                          Precision = eval_naive$byClass[3])

#get the accuracy, sensitivity, specificity, precision and AUC of model decision tree
evals_dt <- data_frame(Model = "Decision Tree",
                       Accuracy = eval_dt$overall[1],
                       Recall = eval_dt$byClass[1],
                       Specificity = eval_dt$byClass[2],
                       Precision = eval_dt$byClass[3])

#get the accuracy, sensitivity, specificity, precision and AUC of model random forest
evals_forest <- data_frame(Model = "Random Forest",
                           Accuracy = eval_forest$overall[1],
                           Recall = eval_forest$byClass[1],
                           Specificity = eval_forest$byClass[2],
                           Precision = eval_forest$byClass[3])

rbind(evals_naive, evals_dt, evals_forest)

As a rescue team, we need to know the data of passengers who should have survived the Titanic crash. Therefore, we will look at the Recall metrics, this is to ensure that no survivors are missed in the rescue process.

From the results above, Decision Tree has the best recall performance with a figure of 73.6%. Let’s try to tune the model to get better performance results.

Tuning Model

Down Sampling

In the train dataset, the proportion of our data was around 59,71% not survived and around 40,29% survived. Let’s do down-sampling to balancing the proportion of target variable.

RNGkind(sample.kind = "Rounding")
set.seed(212)
t_train2 <- downSample(x = t_train[, -1], 
                       y = as.factor(t_train[, 1]), 
                       yname = "Survived")
rmarkdown::paged_table(t_train2)

check variable target proportion.

prop.table(table(t_train2$Survived))
#> 
#>   0   1 
#> 0.5 0.5

Modelling

model_dt2 <- ctree(Survived ~ ., data = t_train2)
plot(model_dt2, type="simple", gp = gpar(fontsize = 8))

Prediction

preds_dt2 <- predict(model_dt2, newdata = t_test, type = "response")
head(preds_dt2)
#>  2  4  6  9 15 23 
#>  1  1  0  1  1  1 
#> Levels: 0 1

Model Evaluation

  1. Confusion Matrix
eval_dt2 <- confusionMatrix(data = preds_dt2, reference = label_ttest, positive = "1")
eval_dt2
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 130  18
#>          1  47  73
#>                                           
#>                Accuracy : 0.7575          
#>                  95% CI : (0.7016, 0.8075)
#>     No Information Rate : 0.6604          
#>     P-Value [Acc > NIR] : 0.0003713       
#>                                           
#>                   Kappa : 0.4981          
#>                                           
#>  Mcnemar's Test P-Value : 0.0005147       
#>                                           
#>             Sensitivity : 0.8022          
#>             Specificity : 0.7345          
#>          Pos Pred Value : 0.6083          
#>          Neg Pred Value : 0.8784          
#>              Prevalence : 0.3396          
#>          Detection Rate : 0.2724          
#>    Detection Prevalence : 0.4478          
#>       Balanced Accuracy : 0.7683          
#>                                           
#>        'Positive' Class : 1               
#> 
  1. ROC
# prediction with probability output
dt_pprob2 <- predict(model_dt2, newdata = t_test, type = "prob")

# creating ROC data frame
data4_roc <- data.frame(pred_proba = dt_pprob2[, '1'],
                        actual = ifelse(t_test$Survived == "1", 1, 0))

#create prediction object
dt_roc2 <- prediction(predictions = data4_roc$pred_proba,
                     labels = data4_roc$actual)

# ROC curve
plot(performance(dt_roc2, "tpr", "fpr"),
     main="ROC")
abline(a=0, b=1, col='red') #diagonal line

  1. AUC
auc_dt2 <- performance(prediction.obj = dt_roc2, measure = "auc")
auc_dt2@y.values # AUC value from ROC plot
#> [[1]]
#> [1] 0.8335196

Before VS After Tune Up

evals_dt2 <- data_frame(Model = "DT Tune Up",
                       Accuracy = eval_dt2$overall[1],
                       Recall = eval_dt2$byClass[1],
                       Specificity = eval_dt2$byClass[2],
                       Precision = eval_dt2$byClass[3])

rbind(evals_dt, evals_dt2)

Conlusion

As a rescue team, we need to know the data of passengers who should have survived the Titanic crash. Therefore, we will look at the Recall metrics, this is to ensure that no survivors are missed in the rescue process.

Based on the performance results, Decision Tree with the down sample method has the highest recall performance, which is around 80.22%. In my opinion, this number is enough to predict the victims of a shipwreck as big as the case of the Titanic. I’ve also done modeling using Logistic Regression and K-Nearest Neighbor, but the model only produces the highest Recall of 73.6% (you can check my RPubs here). However, If you expect better results, you can do other model tuning like better data pre-processing or something like that. Thank you.