Predicting the Survival of Titanic Passengers Using Naive Bayes, Decision Tree, and Random Forest
Introduction
We have already predicted the survival of Titanic passengers using Logistic Regression and K-NN models in a previous post. Now, we'll predict using Naive Bayes, Decision Tree, and Random Forest models. The target variable is the same as before: Survived. The dataset was provided by Kaggle.
Import Library and Setup
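The setup chunk is not shown here; a plausible set of libraries for this analysis, inferred from the functions used below, is:
library(dplyr)        # data wrangling
library(caret)        # downSample(), train(), confusionMatrix(), nearZeroVar()
library(e1071)        # naiveBayes()
library(partykit)     # ctree() for the decision tree
library(randomForest) # backend for caret's method = "rf"
library(ROCR)         # prediction() and performance() for ROC/AUC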
Data Import
This data contains 3 files:
1. train.csv: training data (complete with predictor variables and the target)
2. test.csv: testing data (without the Survived variable)
3. gender_submission.csv: contains the passenger ID and Survived variables
Let’s read the data
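The import chunk is not shown; a minimal sketch, assuming the Kaggle files sit next to the notebook (file paths and object names are assumptions):
train <- read.csv("train.csv")
test <- read.csv("test.csv")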
Data Train
The train data has 891 observations and 12 columns. We'll use this data to build the models later.
Data Test
The test data has 418 observations and 11 columns. The train data has 12 columns while the test data has 11: the difference is the Survived variable, which the train data has and the test data does not. The Survived variable for the test data is contained in gender_submission.csv. Let's read it as well.
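Again a sketch, with the object name submission as an assumption:
submission <- read.csv("gender_submission.csv")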
Some information about the data:
Pclass: a proxy for socio-economic status (SES); 1st = Upper, 2nd = Middle, 3rd = Lower
Name: name of the passenger
Sex: gender of the passenger
Age: age in years
SibSp: number of siblings/spouses aboard the Titanic (Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife; mistresses and fiancés were ignored)
Parch: number of parents/children aboard the Titanic (Parent = mother, father; Child = daughter, son, stepdaughter, stepson; some children travelled only with a nanny, so Parch = 0 for them)
Ticket: ticket number
Fare: passenger fare
Cabin: cabin number
Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Data Preparation and EDA
To make things easier to work with, let's join the test data with its Survived labels, so that we only work with two datasets.
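The join chunk is not shown; a sketch that attaches the Survived column from gender_submission.csv by PassengerId (object names assumed):
test <- test %>%
  left_join(submission, by = "PassengerId")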
Now we only have two datasets: train and test. Let's check whether there are any missing values and inspect the data types.
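The check producing the output below is presumably:
colSums(is.na(train))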
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
There are 177 missing values in the Age column, around 19% of our data. Instead of removing those rows, let's replace the missing ages with the mean age.
train_clean <- train %>%
mutate(Age = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age))
colSums(is.na(train_clean))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
Let's check if there are any missing values in the test data.
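Presumably the same check on the joined test data:
colSums(is.na(test))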
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 86 0
## Parch Ticket Fare Cabin Embarked Survived
## 0 0 1 0 0 0
There are some missing values in the test data: 86 in Age and 1 in Fare. To prevent NAs from affecting the prediction results, we will remove the rows that contain NA.
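The cleaning chunk is not shown; a sketch of the drop-NA approach (test_clean is the name used later):
test_clean <- na.omit(test)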
Here, Survived is the target variable, and some variables do not contain much information for modeling. Let's select the predictor variables and convert the data types.
data_train <- train_clean %>%
select(-c(PassengerId, Name, Ticket, Cabin)) %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
SibSp = as.factor(SibSp),
Parch = as.factor(Parch),
Embarked = as.factor(Embarked))
data_test <- test_clean %>%
select(-c(PassengerId, Name, Ticket, Cabin)) %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
SibSp = as.factor(SibSp),
Parch = as.factor(Parch),
Embarked = as.factor(Embarked))
Let's check the structure of the train data using str().
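The call producing the output below:
str(data_train)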
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Before we continue to modeling, let's check the class proportions of the target variable.
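Presumably via prop.table():
prop.table(table(data_train$Survived))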
##
## 0 1
## 0.6161616 0.3838384
In the train dataset, around 61.62% of passengers did not survive and around 38.38% survived. Let's do down-sampling to balance the target variable.
set.seed(267)
data_traind <- downSample(x = data_train[, -1], y = as.factor(data_train[, 1]), yname = "Survived")
rmarkdown::paged_table(data_traind)
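Checking the class proportions again shows they are now balanced; the chunk producing the output below is presumably:
prop.table(table(data_traind$Survived))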
##
## 0 1
## 0.5 0.5
Naive Bayes
Modeling
After preparing the data, let's build a Naive Bayes model.
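The fitting chunk is not shown; given the predict(..., type = "class") and predict(..., type = "raw") calls below, the model was presumably fit with e1071's naiveBayes(). A minimal sketch:
# fit Naive Bayes on the down-sampled training data
model_naive <- naiveBayes(Survived ~ ., data = data_traind)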
Predict Data
pred_label_naive <- predict(model_naive, data_test, type = "class")
head(data.frame(actual = data_test$Survived, prediction = pred_label_naive))
## actual prediction
## 1 0 0
## 2 1 1
## 3 0 0
## 4 0 0
## 5 1 1
## 6 0 0
Model Evaluation
We can evaluate the Naive Bayes model using a confusion matrix. We can also check the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the ROC Curve).
Confusion Matrix
mat1 <- confusionMatrix(data = pred_label_naive, reference = data_test$Survived, positive = "1")
mat1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 171 28
## 1 33 99
##
## Accuracy : 0.8157
## 95% CI : (0.7697, 0.856)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : 3.183e-15
##
## Kappa : 0.6132
##
## Mcnemar's Test P-Value : 0.6085
##
## Sensitivity : 0.7795
## Specificity : 0.8382
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.8593
## Prevalence : 0.3837
## Detection Rate : 0.2991
## Detection Prevalence : 0.3988
## Balanced Accuracy : 0.8089
##
## 'Positive' Class : 1
##
ROC (Receiver Operating Characteristic) Curve
The ROC curve plots the True Positive Rate (Sensitivity or Recall) against the False Positive Rate (1 - Specificity). A good model should ideally have a high True Positive Rate and a low False Positive Rate.
# get the probability prediction
prob_survive <- predict(model_naive, data_test, type = "raw")
# prepare dataframe for ROC
data_roc <- data.frame(prob = prob_survive[,2], # probability of the positive class (survived)
labels = as.numeric(data_test$Survived == "1")) # 1 if the passenger actually survived
head(data_roc)
## prob labels
## 1 0.04078981 0
## 2 0.53590515 1
## 3 0.18470196 0
## 4 0.03860111 0
## 5 0.75699552 1
## 6 0.05263042 0
naive_roc <- ROCR::prediction(data_roc$prob, data_roc$labels)
# ROC curve
plot(performance(naive_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
main = "ROC")
abline(a = 0, b = 1)
AUC (Area Under the ROC Curve)
The AUC is the area under the ROC curve; the closer it is to 1, the better the model performs. To get the AUC value, pass "auc" as the measure argument of performance() and retrieve y.values.
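The AUC chunk is not shown; it presumably looks like this (auc_n is the name reused in the Conclusion):
auc_n <- performance(naive_roc, measure = "auc")
auc_n@y.values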
## [[1]]
## [1] 0.8857496
Decision Tree
A decision tree produces a set of splitting rules based on the patterns contained in the data. The result can be visualized as a flow chart, making it easy to understand how the model predicts.
Modeling
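The fitting chunk is not shown; judging from the "Fitted party" printout below, the tree was presumably fit with partykit's ctree(). A minimal sketch:
# fit a conditional inference tree on the down-sampled training data
model_dt <- ctree(Survived ~ ., data = data_traind)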
From the decision tree printout below, we can get some information such as:
Root Node: the first split, based on the most important variable for determining the target. Ex: [1]
Interior Node: the second split and so on, using other variables when the first split is not enough to determine the target. Ex: [2], [4], [7], [9], and [10]
Leaf / Terminal Node: the predicted target value or class. Ex: [3], [5], [6], [8], [11], [12], and [13]
##
## Model formula:
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
##
## Fitted party:
## [1] root
## | [2] Sex in female
## | | [3] Pclass in 1, 2: 1 (n = 165, err = 2.4%)
## | | [4] Pclass in 3
## | | | [5] Fare <= 23.25: 1 (n = 104, err = 33.7%)
## | | | [6] Fare > 23.25: 0 (n = 19, err = 15.8%)
## | [7] Sex in male
## | | [8] Pclass in 1: 0 (n = 91, err = 49.5%)
## | | [9] Pclass in 2, 3
## | | | [10] Age <= 12
## | | | | [11] SibSp in 0, 1, 2: 1 (n = 17, err = 0.0%)
## | | | | [12] SibSp in 3, 4, 5: 0 (n = 9, err = 11.1%)
## | | | [13] Age > 12: 0 (n = 279, err = 16.5%)
##
## Number of inner nodes: 6
## Number of terminal nodes: 7
From the model, we learn that the classification is based on the variables Sex, Pclass, Fare, Age, and SibSp.
Predict Data
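The prediction chunk is not shown; a sketch (for partykit objects, type = "response" returns class labels):
pred_label_dt <- predict(model_dt, data_test, type = "response")
head(pred_label_dt)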
## 1 2 3 4 5 6
## 0 1 0 0 1 0
## Levels: 0 1
Model Evaluation
We can evaluate the decision tree model the same way, using a confusion matrix, the ROC curve, and the AUC.
Confusion Matrix
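Presumably the same pattern as before (mat2 is the name reused in the Conclusion):
mat2 <- confusionMatrix(data = pred_label_dt, reference = data_test$Survived, positive = "1")
mat2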
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 195 2
## 1 9 125
##
## Accuracy : 0.9668
## 95% CI : (0.9413, 0.9833)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9305
##
## Mcnemar's Test P-Value : 0.07044
##
## Sensitivity : 0.9843
## Specificity : 0.9559
## Pos Pred Value : 0.9328
## Neg Pred Value : 0.9898
## Prevalence : 0.3837
## Detection Rate : 0.3776
## Detection Prevalence : 0.4048
## Balanced Accuracy : 0.9701
##
## 'Positive' Class : 1
##
ROC (Receiver Operating Characteristic) Curve
prob_survive_dt <- predict(model_dt, data_test, type = "prob")
data_roc1 <- data.frame(prob = prob_survive_dt[,2], # probability of the positive class (survived)
labels = as.numeric(data_test$Survived == "1")) # 1 if the passenger actually survived
dt_roc <- ROCR::prediction(data_roc1$prob, data_roc1$labels)
# ROC curve
plot(performance(dt_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
main = "ROC")
abline(a = 0, b = 1)
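The decision tree's AUC (used as dt_auc in the Conclusion) is presumably obtained the same way:
dt_auc <- performance(dt_roc, measure = "auc")
dt_auc@y.values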
Random Forest
The main disadvantage of random forest is that the model is large and computationally heavy. This can be mitigated by keeping the number of predictors modest. If there is a large number of columns, we can remove columns that have near-zero (less informative) variance with nearZeroVar() from the caret package.
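The check producing the output below is presumably the following; it returns the indices of near-zero-variance columns, so integer(0) means none:
nearZeroVar(data_traind)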
## integer(0)
We can see that no column has near-zero variance. Let's build the random forest model.
Modeling
set.seed(267)
ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3) # k-fold cross validation
survive_forest <- train(Survived ~ ., data = data_traind, method = "rf", trControl = ctrl)
After fitting the model, let's print it.
## Random Forest
##
## 684 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 546, 547, 548, 548, 547, 548, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7738311 0.5476843
## 11 0.8001697 0.6003162
## 20 0.7831127 0.5662113
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.
From the model summary, we know that the optimal number of variables considered for splitting at each tree node is mtry = 11. We can also inspect the importance of each variable used in our random forest with varImp().
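That is:
varImp(survive_forest)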
## rf variable importance
##
## Overall
## Fare 100.00000
## Sexmale 95.16120
## Age 78.25422
## Pclass3 25.78161
## SibSp1 7.60459
## EmbarkedC 6.03533
## EmbarkedS 5.92940
## Parch1 5.01794
## Parch2 4.58701
## Pclass2 4.46792
## EmbarkedQ 3.36343
## SibSp3 2.04308
## SibSp4 1.89459
## SibSp2 1.38080
## SibSp8 1.01522
## Parch4 0.57022
## Parch5 0.46405
## SibSp5 0.45973
## Parch3 0.04909
## Parch6 0.00000
Even though our data has been split into train and test sets, we don't strictly need a separate test set when using random forest. Bootstrap sampling leaves some observations out of each tree; these are called out-of-bag (OOB) data, and the model can treat them as an internal test set.
plot(survive_forest$finalModel)
legend("topright", colnames(survive_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)
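The printed summary below presumably comes from inspecting the final model directly:
survive_forest$finalModel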
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 11
##
## OOB estimate of error rate: 19.74%
## Confusion matrix:
## 0 1 class.error
## 0 285 57 0.1666667
## 1 78 264 0.2280702
Model Evaluation
We can evaluate the random forest model the same way, using a confusion matrix, the ROC curve, and the AUC.
Confusion Matrix
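Presumably the same pattern as before (pred_label_rf and mat3 are assumed names; mat3 is reused in the Conclusion):
pred_label_rf <- predict(survive_forest, data_test)
mat3 <- confusionMatrix(data = pred_label_rf, reference = data_test$Survived, positive = "1")
mat3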
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 168 20
## 1 36 107
##
## Accuracy : 0.8308
## 95% CI : (0.786, 0.8696)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6506
##
## Mcnemar's Test P-Value : 0.04502
##
## Sensitivity : 0.8425
## Specificity : 0.8235
## Pos Pred Value : 0.7483
## Neg Pred Value : 0.8936
## Prevalence : 0.3837
## Detection Rate : 0.3233
## Detection Prevalence : 0.4320
## Balanced Accuracy : 0.8330
##
## 'Positive' Class : 1
##
ROC (Receiver Operating Characteristic) Curve
prob_survive_rf <- predict(survive_forest, data_test, type = "prob")
data_roc2 <- data.frame(prob = prob_survive_rf[,2], # probability of the positive class (survived)
labels = as.numeric(data_test$Survived == "1")) # 1 if the passenger actually survived
rf_roc <- ROCR::prediction(data_roc2$prob, data_roc2$labels)
# ROC curve
plot(performance(rf_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
main = "ROC")
abline(a = 0, b = 1)
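And the random forest's AUC (rf_auc in the Conclusion):
rf_auc <- performance(rf_roc, measure = "auc")
rf_auc@y.values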
Conclusion
After building the models, making predictions, and evaluating them, let's compare the Naive Bayes, Decision Tree, and Random Forest models for this case.
#get the accuracy, sensitivity, specificity, precision and AUC of model naive bayes
m_naive <- data.frame(Model = "Naive Bayes",
Accuracy = round((mat1$table[4] + mat1$table[1]) / sum(mat1$table),4),
Sensitivity = round(mat1$table[4] / (mat1$table[4] + mat1$table[3]),4),
Specificity = round(mat1$table[1] / (mat1$table[1] + mat1$table[2]),4),
Precision = round(mat1$table[4] / (mat1$table[4] + mat1$table[2]),4),
AUC = round(as.numeric(auc_n@y.values),4))
#get the accuracy, sensitivity, specificity, precision and AUC of model decision tree
m_dt <- data.frame(Model = "Decision Tree",
Accuracy = round((mat2$table[4] + mat2$table[1]) / sum(mat2$table),4),
Sensitivity = round(mat2$table[4] / (mat2$table[4] + mat2$table[3]),4),
Specificity = round(mat2$table[1] / (mat2$table[1] + mat2$table[2]),4),
Precision = round(mat2$table[4] / (mat2$table[4] + mat2$table[2]),4),
AUC = round(as.numeric(dt_auc@y.values),4))
#get the accuracy, sensitivity, specificity, precision and AUC of model random forest
m_rf <- data.frame(Model = "Random Forest",
Accuracy = round((mat3$table[4] + mat3$table[1]) / sum(mat3$table),4),
Sensitivity = round(mat3$table[4] / (mat3$table[4] + mat3$table[3]),4),
Specificity = round(mat3$table[1] / (mat3$table[1] + mat3$table[2]),4),
Precision = round(mat3$table[4] / (mat3$table[4] + mat3$table[2]),4),
AUC = round(as.numeric(rf_auc@y.values),4))
rbind(m_naive, m_dt, m_rf)
## Model Accuracy Sensitivity Specificity Precision AUC
## 1 Naive Bayes 0.8157 0.7795 0.8382 0.7500 0.8857
## 2 Decision Tree 0.9668 0.9843 0.9559 0.9328 0.9411
## 3 Random Forest 0.8308 0.8425 0.8235 0.7483 0.9203
Based on the metrics table above, the Survived prediction for Titanic passengers is best served by the Decision Tree model. It gave the highest accuracy (96.68%) and sensitivity (98.43%) while also maintaining specificity and precision above 93%. It also gave the highest AUC (94.11%). Therefore, the best model to predict the survival of Titanic passengers is the Decision Tree.