1 Overview

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered "unsinkable", sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone on board, resulting in the death of 1502 of the 2224 passengers and crew. While there was some element of luck involved in surviving, some groups of people seem to have been more likely to survive than others. In this analysis, we will build a predictive model that answers the question "what sorts of people were more likely to survive?" using passenger data (i.e. name, age, gender, socio-economic class, etc.).

2 Library Setup

library(dplyr)      # data wrangling
library(ggplot2)    # data visualization
library(gridExtra)  # to display multiple graphs
library(inspectdf)  # EDA
library(caret)      # pre-processing and model evaluation
library(tidyverse)  # data wrangling
library(e1071)      # Naive Bayes model
library(partykit)   # Decision Tree
library(rsample)    # cross-validation
library(ROCR)       # ROC and AUC

3 Data Preparation

3.1 Import Data

The analysis uses the Titanic dataset from Kaggle, which is split into two files:

  1. train.csv will be used to build the machine learning models.
  2. test.csv will be used to see how well the models perform on unseen data. However, test.csv contains only the predictor columns; the target column is stored separately in gender_submission.csv, so we need to join it onto the test data.
train <- read.csv("train.csv")
test <- read.csv("test.csv")

survive <- read.csv("gender_submission.csv")

#combine the target variable into test data
test <- cbind(test, Survived = survive$Survived)

rmarkdown::paged_table(train)
rmarkdown::paged_table(test)

3.2 Variable Description

Here is some information about the variables:

  • Pclass : A proxy for socio-economic status (SES).
    1st = Upper
    2nd = Middle
    3rd = Lower
  • Name : Name of the passenger
  • Sex : Gender of passenger
  • Age : Age in years
  • SibSp : Number of siblings / spouses aboard the Titanic.
    Sibling = brother, sister, stepbrother, stepsister;
    Spouse = husband, wife (mistresses and fiancés were ignored)
  • Parch : Number of parents / children aboard the Titanic.
    Parent = mother, father;
    Child = daughter, son, stepdaughter, stepson;
    Some children travelled only with a nanny, therefore Parch = 0 for them.
  • Ticket : Ticket number
  • Fare : Passenger fare
  • Cabin : Cabin number
  • Embarked: Port of Embarkation.
    C = Cherbourg;
    Q = Queenstown;
    S = Southampton;

4 Exploratory Data Analysis

In this step, we explore the variables to understand the condition of the data before modeling.

First, let's inspect all columns in the dataframe:

glimpse(train)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal~
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~

Next, check whether there are missing values in the training data:

colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

There are 177 missing values in the Age column, around 19% of the training data. Instead of removing those rows, let's impute the missing ages with the mean age.

train_clean <- train %>%
    mutate(Age = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age))

colSums(is.na(train_clean))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

We also need to check for missing values in the test data:

colSums(is.na(test))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0          86           0 
##       Parch      Ticket        Fare       Cabin    Embarked    Survived 
##           0           0           1           0           0           0

There are some missing values in the test data as well. To prevent NAs from affecting the prediction results, we will drop the rows that contain them:

test_clean <- test %>% 
    na.omit()

In this analysis, we will use the Survived variable as the target. We also need to convert some predictors to more appropriate data types and remove variables that do not carry much information, such as PassengerId, Name, Ticket, and Cabin.

data_train <- train_clean %>% 
          select(-c(PassengerId, Name, Ticket, Cabin)) %>% 
          mutate(Survived = as.factor(Survived),
           Pclass = as.factor(Pclass),
           Sex = as.factor(Sex),
           SibSp = as.factor(SibSp),
           Parch = as.factor(Parch),
           Embarked = as.factor(Embarked))

data_test <- test_clean %>% 
          select(-c(PassengerId, Name, Ticket, Cabin)) %>% 
          mutate(Survived = as.factor(Survived),
           Pclass = as.factor(Pclass),
           Sex = as.factor(Sex),
           SibSp = as.factor(SibSp),
           Parch = as.factor(Parch),
           Embarked = as.factor(Embarked))

Re-checking the data_train structure:

str(data_train)
## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...

These are the proportions of the target classes in the training data:

prop.table(table(data_train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

As we can see, around 61.6% of passengers in the training data did not survive and around 38.4% survived, which indicates imbalanced data. We can balance the target classes with down-sampling, which reduces the training set from 891 to 684 rows (342 per class).

set.seed(267)
data_traind <- downSample(x = data_train[, -1], y = as.factor(data_train[, 1]), yname = "Survived")
rmarkdown::paged_table(data_traind)
prop.table(table(data_traind$Survived))
## 
##   0   1 
## 0.5 0.5

5 Naive Bayes

Naive Bayes is a supervised classification algorithm based on Bayes' Theorem. It extends the exact Bayes classifier by assuming all predictors are independent of each other. The method is suitable when the dimensionality of the inputs is high: it relies on simple conditional probabilities, is computationally inexpensive, and can outperform more sophisticated models.

Here is the formula from Bayes' Theorem:

\[ P(Y|x) = \frac{P(x|Y)\ P(Y)}{P(x|Y)\ P(Y) + P(x|\neg Y)\ P(\neg Y)} \]
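To make the formula concrete, here is a tiny worked example in R. The probabilities are made-up, illustrative values, not estimates from the Titanic data:

# Illustrative (made-up) probabilities, not estimated from the data
p_y          <- 0.5   # P(Y): prior probability of surviving (after down-sampling)
p_x_given_y  <- 0.70  # P(x | Y): probability of being female given survived
p_x_given_ny <- 0.15  # P(x | not Y): probability of being female given not survived

# Bayes' Theorem: posterior probability of surviving, given "female"
posterior <- (p_x_given_y * p_y) /
    (p_x_given_y * p_y + p_x_given_ny * (1 - p_y))
posterior
## [1] 0.8235294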

5.1 Modeling

Training the Naive Bayes model could not be simpler: we just call naiveBayes() from the e1071 package. Once the model is trained, we use it to predict the outcome on the test dataset.

model_naive <- naiveBayes(x = data_traind %>% select(-Survived), # predictors
                          y = data_traind$Survived,              # target
                          laplace = 1)                           # Laplace smoothing
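The fitted object stores the class prior and the per-predictor conditional probability tables, so we can inspect what the model actually learned (the exact numbers depend on the down-sampled training data):

model_naive$apriori     # class counts used to form the prior P(Y)
model_naive$tables$Sex  # conditional table P(Sex | Survived), Laplace-smoothed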

5.2 Predict Data

pred_label_naive <- predict(model_naive, data_test, type = "class")
head(data.frame(actual = data_test$Survived, prediction = pred_label_naive))
##   actual prediction
## 1      0          0
## 2      1          1
## 3      0          0
## 4      0          0
## 5      1          1
## 6      0          0

6 Decision Tree

A Decision Tree is a fairly simple model that is very easy to interpret, even easier than a linear regression model. The result can be visualized as a flow chart, making it easy to see how the model arrives at its predictions.

6.1 Modeling

model_dt <- ctree(Survived ~ ., data = data_traind)
plot(model_dt, type="simple")

6.2 Predict Data

pred_test_dt <- predict(model_dt, newdata = data_test, 
                          type = "response")
head(pred_test_dt)
## 1 2 3 4 5 6 
## 0 1 0 0 1 0 
## Levels: 0 1

7 Random Forest

Random Forest is a type of ensemble method: a random forest consists of a set of decision trees, each grown with its own random characteristics and independent of the others. The forest predicts the target class by majority voting; the class that receives the most votes across the trees becomes the final prediction, as the toy sketch below shows.
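As a quick illustration of the voting step, with made-up vote tallies rather than values from our model:

# Hypothetical tallies: 500 trees vote on a single passenger
votes <- c("0" = 212, "1" = 288)
names(which.max(votes)) # "1": the majority class becomes the final prediction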

The disadvantage of random forests is their heavy computational cost. This can be reduced by keeping the number of predictors modest. When there is a large number of columns, we can remove those with near-zero (less informative) variance using nearZeroVar() from the caret package.

nearZeroVar(data_traind)
## integer(0)
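integer(0) means no predictor was flagged for removal. To see the per-column diagnostics behind that decision, we can pass saveMetrics = TRUE:

# Returns freqRatio, percentUnique, zeroVar and nzv flags for every column
nearZeroVar(data_traind, saveMetrics = TRUE)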

7.1 Modeling

set.seed(267)

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3) # repeated 5-fold cross-validation
survive_forest <- train(Survived ~ ., data = data_traind, method = "rf", trControl = ctrl)
survive_forest
## Random Forest 
## 
## 684 samples
##   7 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 546, 547, 548, 548, 547, 548, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7738311  0.5476843
##   11    0.8001697  0.6003162
##   20    0.7831127  0.5662113
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.
varImp(survive_forest)
## rf variable importance
## 
##             Overall
## Fare      100.00000
## Sexmale    95.16120
## Age        78.25422
## Pclass3    25.78161
## SibSp1      7.60459
## EmbarkedC   6.03533
## EmbarkedS   5.92940
## Parch1      5.01794
## Parch2      4.58701
## Pclass2     4.46792
## EmbarkedQ   3.36343
## SibSp3      2.04308
## SibSp4      1.89459
## SibSp2      1.38080
## SibSp8      1.01522
## Parch4      0.57022
## Parch5      0.46405
## SibSp5      0.45973
## Parch3      0.04909
## Parch6      0.00000
plot(survive_forest$finalModel)
legend("topright", colnames(survive_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

survive_forest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 11
## 
##         OOB estimate of  error rate: 19.74%
## Confusion matrix:
##     0   1 class.error
## 0 285  57   0.1666667
## 1  78 264   0.2280702

7.2 Predict Data

pred_test_rf <- predict(survive_forest, newdata = data_test, type = "raw")

8 Evaluation

Confusion Matrix

We will evaluate all models with a confusion matrix. A confusion matrix is a table that splits predictions into four categories: True Positive, True Negative, False Positive, and False Negative.

Performance is summarized by Accuracy, Sensitivity/Recall, Specificity, and Precision. Accuracy measures how much of the data is correctly predicted. Sensitivity measures how many of the actual positive outcomes are correctly predicted. Specificity measures how many of the actual negative outcomes are correctly predicted. Precision measures how many of our positive predictions are correct. The sketch below shows how each metric follows from the four counts.
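Using the Naive Bayes counts from section 8.1 below (TP = 99, TN = 171, FP = 33, FN = 28):

TP <- 99; TN <- 171; FP <- 33; FN <- 28
accuracy    <- (TP + TN) / (TP + TN + FP + FN) # 0.8157
sensitivity <- TP / (TP + FN)                  # 0.7795 (also called recall)
specificity <- TN / (TN + FP)                  # 0.8382
precision   <- TP / (TP + FP)                  # 0.7500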

ROC (Receiver Operating Characteristic Curve)

Apart from the confusion matrix, we can use other metrics to compare one model with another. One option is the ROC curve. The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (Sensitivity or Recall) against the False Positive Rate (1 - Specificity). A good model should ideally have a high True Positive Rate and a low False Positive Rate.

AUC (Area Under ROC Curve)

The AUC is the area under the ROC curve and is commonly used to compare performance between models. The closer it is to 1, the better the model performs.

8.1 Naive Bayes

Confusion Matrix

mat1 <- confusionMatrix(data = pred_label_naive, reference = data_test$Survived, positive = "1")
mat1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 171  28
##          1  33  99
##                                          
##                Accuracy : 0.8157         
##                  95% CI : (0.7697, 0.856)
##     No Information Rate : 0.6163         
##     P-Value [Acc > NIR] : 3.183e-15      
##                                          
##                   Kappa : 0.6132         
##                                          
##  Mcnemar's Test P-Value : 0.6085         
##                                          
##             Sensitivity : 0.7795         
##             Specificity : 0.8382         
##          Pos Pred Value : 0.7500         
##          Neg Pred Value : 0.8593         
##              Prevalence : 0.3837         
##          Detection Rate : 0.2991         
##    Detection Prevalence : 0.3988         
##       Balanced Accuracy : 0.8089         
##                                          
##        'Positive' Class : 1              
## 

ROC (Receiver Operating Characteristic Curve)

# get the probability prediction
prob_survive <- predict(model_naive, data_test, type = "raw")

# prepare dataframe for ROC
data_roc <- data.frame(prob = prob_survive[,2], # probability of the positive class (survived)
                       labels = as.numeric(data_test$Survived == "1")) # actual label: 1 if the passenger survived
head(data_roc)
##         prob labels
## 1 0.04078981      0
## 2 0.53590515      1
## 3 0.18470196      0
## 4 0.03860111      0
## 5 0.75699552      1
## 6 0.05263042      0
naive_roc <- ROCR::prediction(data_roc$prob, data_roc$labels) 

# ROC curve
plot(performance(naive_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
     main = "ROC")
abline(a = 0, b = 1)

AUC (Area Under ROC Curve)

auc_n <- performance(naive_roc, measure = "auc")
auc_n@y.values
## [[1]]
## [1] 0.8857496

8.2 Decision Tree

Confusion Matrix

mat2 <- confusionMatrix(data = pred_test_dt, reference = data_test$Survived, positive = "1")
mat2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 195   2
##          1   9 125
##                                           
##                Accuracy : 0.9668          
##                  95% CI : (0.9413, 0.9833)
##     No Information Rate : 0.6163          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9305          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.9843          
##             Specificity : 0.9559          
##          Pos Pred Value : 0.9328          
##          Neg Pred Value : 0.9898          
##              Prevalence : 0.3837          
##          Detection Rate : 0.3776          
##    Detection Prevalence : 0.4048          
##       Balanced Accuracy : 0.9701          
##                                           
##        'Positive' Class : 1               
## 

ROC (Receiver Operating Characteristic Curve)

prob_survive_dt <- predict(model_dt, data_test, type = "prob")

data_roc1 <- data.frame(prob = prob_survive_dt[,2], # probability of the positive class (survived)
                        labels = as.numeric(data_test$Survived == "1")) # actual label: 1 if the passenger survived

dt_roc <- ROCR::prediction(data_roc1$prob, data_roc1$labels) 

# ROC curve
plot(performance(dt_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
     main = "ROC")
abline(a = 0, b = 1)

AUC (Area Under ROC Curve)

dt_auc <- performance(dt_roc, measure = "auc")
dt_auc@y.values
## [[1]]
## [1] 0.9410607

8.3 Random Forest

Confusion Matrix

mat3 <- confusionMatrix(data = pred_test_rf, reference = data_test$Survived, positive = "1")
mat3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 168  20
##          1  36 107
##                                          
##                Accuracy : 0.8308         
##                  95% CI : (0.786, 0.8696)
##     No Information Rate : 0.6163         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.6506         
##                                          
##  Mcnemar's Test P-Value : 0.04502        
##                                          
##             Sensitivity : 0.8425         
##             Specificity : 0.8235         
##          Pos Pred Value : 0.7483         
##          Neg Pred Value : 0.8936         
##              Prevalence : 0.3837         
##          Detection Rate : 0.3233         
##    Detection Prevalence : 0.4320         
##       Balanced Accuracy : 0.8330         
##                                          
##        'Positive' Class : 1              
## 

ROC (Receiver Operating Characteristic Curve)

prob_survive_rf <- predict(survive_forest, data_test, type = "prob")

data_roc2 <- data.frame(prob = prob_survive_rf[,2], # probability of the positive class (survived)
                        labels = as.numeric(data_test$Survived == "1")) # actual label: 1 if the passenger survived
rf_roc <- ROCR::prediction(data_roc2$prob, data_roc2$labels) 

# ROC curve
plot(performance(rf_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
     main = "ROC")
abline(a = 0, b = 1)

AUC (Area Under ROC Curve)

rf_auc <- performance(rf_roc, measure = "auc")
rf_auc@y.values
## [[1]]
## [1] 0.9202563

9 Conclusion

Having built the models, made predictions, and evaluated them, let's compare the Naive Bayes, Decision Tree, and Random Forest models for this case.

#get the accuracy, sensitivity, specificity, precision and AUC of model naive bayes
m_naive <- data.frame(Model = "Naive Bayes",
           Accuracy = round((mat1$table[4] + mat1$table[1]) / sum(mat1$table),4),
           Sensitivity = round(mat1$table[4] / (mat1$table[4] + mat1$table[3]),4),
           Specificity = round(mat1$table[1] / (mat1$table[1] + mat1$table[2]),4),
           Precision = round(mat1$table[4] / (mat1$table[4] + mat1$table[2]),4),
           AUC = round(as.numeric(auc_n@y.values),4))

#get the accuracy, sensitivity, specificity, precision and AUC of model decision tree
m_dt <- data.frame(Model = "Decision Tree",
           Accuracy = round((mat2$table[4] + mat2$table[1]) / sum(mat2$table),4),
           Sensitivity = round(mat2$table[4] / (mat2$table[4] + mat2$table[3]),4),
           Specificity = round(mat2$table[1] / (mat2$table[1] + mat2$table[2]),4),
           Precision = round(mat2$table[4] / (mat2$table[4] + mat2$table[2]),4),
           AUC = round(as.numeric(dt_auc@y.values),4))

#get the accuracy, sensitivity, specificity, precision and AUC of model random forest
m_rf <- data.frame(Model = "Random Forest",
           Accuracy = round((mat3$table[4] + mat3$table[1]) / sum(mat3$table),4),
           Sensitivity = round(mat3$table[4] / (mat3$table[4] + mat3$table[3]),4),
           Specificity = round(mat3$table[1] / (mat3$table[1] + mat3$table[2]),4),
           Precision = round(mat3$table[4] / (mat3$table[4] + mat3$table[2]),4),
           AUC = round(as.numeric(rf_auc@y.values),4))

rmarkdown::paged_table(rbind(m_naive, m_dt, m_rf))

From the comparison table above, we can see that the model built with the Decision Tree algorithm gave the best results. It achieved the highest accuracy (96%) and sensitivity (98%) while keeping specificity and precision above 90%, and it also achieved the highest AUC (94%). The Decision Tree is therefore the best model for predicting the survival of Titanic passengers.