Introduction

In this article, we build predictive models to determine whether or not passengers of the Titanic survived the sinking of the ship. The models answer questions like “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.). Previously, we used logistic regression and K-Nearest Neighbor (K-NN) as classification methods, and the precision was 83.87% (using logistic regression). Now we will build other predictive models using Naive Bayes, Decision Tree and Random Forest to see whether we can achieve better accuracy or precision. The dataset is split into two groups, a training set and a test set, taken from here.

Data Preparation

Load libraries

# libraries
library(readr)
library(tidyverse)
library(gtools)
library(class)
library(caret)
library(e1071)
library(car)
library(rsample)
library(ROCR)
library(partykit)
library(randomForest)
library(dplyr)

Read data train.csv

titanic <- read.csv("train.csv")
head(titanic)

Description:

  • PassengerId: Row ID in the data set
  • Survived: If a passenger survived or not (0 = No, 1 = Yes)
  • Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Name: Passenger’s name
  • Sex: Passenger’s gender (male or female)
  • Age: Passenger’s age (in years)
  • SibSp: # of siblings / spouses aboard the Titanic
  • Parch: # of parents / children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Check missing values

colSums(is.na(titanic))
#> PassengerId    Survived      Pclass        Name         Sex         Age 
#>           0           0           0           0           0         177 
#>       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
#>           0           0           0           0           0           0

Since there are missing values in the Age variable, we will keep only the rows that have a value for Age. This is important because we do not want missing data to interfere with our model later on. We will do this in the data wrangling section.

Data wrangling

glimpse(titanic) 
#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
#> $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
#> $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
#> $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
#> $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal~
#> $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
#> $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
#> $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
#> $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
#> $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
#> $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
#> $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
# data selection and transformation
titanic_clean <- titanic %>% 
  filter(!is.na(Age), Embarked != "") %>% # drop rows with missing Age or empty Embarked
  select(-c(PassengerId, Name, Ticket, Cabin)) %>% # drop columns that are not useful for modeling
  mutate(
    Survived = as.factor(Survived),
    Pclass = as.factor(Pclass),
    Sex = as.factor(Sex),
    Embarked = as.factor(Embarked)
  )

str(titanic_clean)
#> 'data.frame':    712 obs. of  8 variables:
#>  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 2 2 ...
#>  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
#>  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
#>  $ Age     : num  22 38 26 35 35 54 2 27 14 4 ...
#>  $ SibSp   : int  1 1 0 1 0 0 3 0 1 1 ...
#>  $ Parch   : int  0 0 0 0 0 0 1 2 0 1 ...
#>  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
#>  $ Embarked: Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 3 3 3 1 3 ...

Check missing values after data wrangling

colSums(is.na(titanic_clean))
#> Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
#>        0        0        0        0        0        0        0        0

Exploratory Data Analysis

Data Distribution Analysis

summary(titanic_clean)
#>  Survived Pclass      Sex           Age            SibSp           Parch       
#>  0:424    1:184   female:259   Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
#>  1:288    2:173   male  :453   1st Qu.:20.00   1st Qu.:0.000   1st Qu.:0.0000  
#>           3:355                Median :28.00   Median :0.000   Median :0.0000  
#>                                Mean   :29.64   Mean   :0.514   Mean   :0.4326  
#>                                3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:1.0000  
#>                                Max.   :80.00   Max.   :5.000   Max.   :6.0000  
#>       Fare        Embarked
#>  Min.   :  0.00   C:130   
#>  1st Qu.:  8.05   Q: 28   
#>  Median : 15.65   S:554   
#>  Mean   : 34.57           
#>  3rd Qu.: 33.00           
#>  Max.   :512.33

Check Class Imbalance

prop.table(table(titanic_clean$Survived))
#> 
#>         0         1 
#> 0.5955056 0.4044944

Based on the proportions above, the target variable classes (Survived) are balanced enough, so we do not need additional pre-processing to balance the classes at this stage.

Cross Validation

When making predictions, we need to show that the model can handle new data. Thus, we have to evaluate the model on unseen data, which requires splitting the data into a training set and a test set. Let’s use 80% of the data as training data and the rest as test data.

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- initial_split(data = titanic_clean,
                       prop = 0.8,
                       strata = "Survived")
titanic_train <- training(index)
titanic_test <- testing(index)
# re-check class imbalance
prop.table(table(titanic_train$Survived))
#> 
#>         0         1 
#> 0.5957821 0.4042179

If we look at the proportions in the training data, we can still consider the classes balanced. Nevertheless, we want to make the proportion exactly 50:50, so we apply downsampling.

# downsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)

titanic_train_down <- downSample(x = titanic_train %>% select(-Survived),
                                 y = titanic_train$Survived,
                                 yname = "Survived")

dim(titanic_train_down)
#> [1] 460   8
dim(titanic_train)
#> [1] 569   8
# re-check class imbalance
prop.table(table(titanic_train_down$Survived))
#> 
#>   0   1 
#> 0.5 0.5

Now we use titanic_train_down as our train data to build models.

Build Model

Naive Bayes

# build model based on train data
model_naive <- naiveBayes(Survived~., data = titanic_train_down, laplace = 1)

# predict class of test data
pred_naive <- predict(model_naive, newdata = titanic_test, type = "class")

# model evaluation with confusion matrix
naive_conf <- confusionMatrix(data = pred_naive, titanic_test$Survived, positive = "1")
naive_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 75 22
#>          1 10 36
#>                                          
#>                Accuracy : 0.7762         
#>                  95% CI : (0.699, 0.8416)
#>     No Information Rate : 0.5944         
#>     P-Value [Acc > NIR] : 0.000003382    
#>                                          
#>                   Kappa : 0.5201         
#>                                          
#>  Mcnemar's Test P-Value : 0.05183        
#>                                          
#>             Sensitivity : 0.6207         
#>             Specificity : 0.8824         
#>          Pos Pred Value : 0.7826         
#>          Neg Pred Value : 0.7732         
#>              Prevalence : 0.4056         
#>          Detection Rate : 0.2517         
#>    Detection Prevalence : 0.3217         
#>       Balanced Accuracy : 0.7515         
#>                                          
#>        'Positive' Class : 1              
#> 

💡 Interpretation:

Based on the confusion matrix above, the accuracy of the model is 77.62%. In this Titanic survival case, we want to avoid predicting that a passenger survived when they actually did not. Therefore, we use Precision as the metric, since it penalizes False Positives (passengers who did not survive but are predicted as survived). The precision of the model is 78.26%.
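
As a quick sanity check, precision can be recomputed by hand from the confusion matrix counts above (TP = 36, FP = 10):

# precision = TP / (TP + FP), using the counts from the confusion matrix above
36 / (36 + 10)
#> [1] 0.7826087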

Accuracy alone is a weak indicator of how well our model classifies the two classes. Therefore, let’s calculate the ROC and AUC as additional model evaluation tools.

The ROC (Receiver Operating Characteristic) curve describes the relationship between the True Positive Rate (Sensitivity/Recall) and the False Positive Rate (1 - Specificity) at every threshold. Ideally, a good model has a high True Positive Rate and a low False Positive Rate.

The AUC (Area Under the Curve) measures the area under the ROC curve. The higher the AUC score, the better the model.
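
To make these rates concrete, here is a minimal sketch with made-up scores and labels (toy values, not the Titanic data) showing how TPR and FPR are computed at a single threshold:

# toy scores and labels, purely for illustration
scores <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1) # predicted P(Survived = 1)
labels <- c(1, 1, 0, 1, 0, 0)             # actual classes
pred <- as.numeric(scores >= 0.5)         # classify at threshold 0.5

tpr <- sum(pred == 1 & labels == 1) / sum(labels == 1) # TP / (TP + FN)
fpr <- sum(pred == 1 & labels == 0) / sum(labels == 0) # FP / (FP + TN)
c(TPR = tpr, FPR = fpr)
#>       TPR       FPR 
#> 0.6666667 0.3333333

The ROC curve is traced by repeating this computation over all thresholds.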

# predict test data in probability type
pred_naive_prob <- predict(model_naive, newdata = titanic_test, type = "raw")

# create prediction object
roc_naive_pred <- prediction(predictions = pred_naive_prob[, 2], # probability of the positive class "1"
                             labels = titanic_test$Survived == "1")

# ROC curve
plot(performance(prediction.obj = roc_naive_pred,
                 measure = "tpr",
                 x.measure = "fpr"))

# AUC
naive_auc <- performance(prediction.obj = roc_naive_pred,
                         measure = "auc")

auc_naive <- naive_auc@y.values
auc_naive
#> [[1]]
#> [1] 0.8227181

💡 Interpretation:

Note that the prediction object must be built from the probability of the positive class (the second column); using the negative-class column would flip the curve and report 1 - AUC. With the positive-class probabilities, the AUC is about 0.82, which means the Naive Bayes model separates the positive and negative classes reasonably well.

Decision Tree

Next, let’s try a Decision Tree to see if we get a model with better interpretability.

# build model based on train data
model_dtree <- ctree(formula = Survived~., 
                     data = titanic_train_down)

# visualize decision tree
plot(model_dtree, type = "simple")

# predict class in test data
pred_dtree <- predict(object = model_dtree,
                      newdata = titanic_test,
                      type = "response")

# model evaluation with confusion matrix
dtree_conf <- confusionMatrix(data = pred_dtree, titanic_test$Survived, positive = "1")
dtree_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 67 19
#>          1 18 39
#>                                           
#>                Accuracy : 0.7413          
#>                  95% CI : (0.6614, 0.8108)
#>     No Information Rate : 0.5944          
#>     P-Value [Acc > NIR] : 0.000171        
#>                                           
#>                   Kappa : 0.4619          
#>                                           
#>  Mcnemar's Test P-Value : 1.000000        
#>                                           
#>             Sensitivity : 0.6724          
#>             Specificity : 0.7882          
#>          Pos Pred Value : 0.6842          
#>          Neg Pred Value : 0.7791          
#>              Prevalence : 0.4056          
#>          Detection Rate : 0.2727          
#>    Detection Prevalence : 0.3986          
#>       Balanced Accuracy : 0.7303          
#>                                           
#>        'Positive' Class : 1               
#> 

Besides evaluating the model on test data with a confusion matrix, it is better to also compare its performance on the training data to assess model fit.

# predict class in train data
pred_dtree_train <- predict(object = model_dtree,
                            newdata = titanic_train_down,
                            type = "response")

# model evaluation with confusion matrix
confusionMatrix(data = pred_dtree_train, titanic_train_down$Survived, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 194  57
#>          1  36 173
#>                                               
#>                Accuracy : 0.7978              
#>                  95% CI : (0.7582, 0.8336)    
#>     No Information Rate : 0.5                 
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.5957              
#>                                               
#>  Mcnemar's Test P-Value : 0.03809             
#>                                               
#>             Sensitivity : 0.7522              
#>             Specificity : 0.8435              
#>          Pos Pred Value : 0.8278              
#>          Neg Pred Value : 0.7729              
#>              Prevalence : 0.5000              
#>          Detection Rate : 0.3761              
#>    Detection Prevalence : 0.4543              
#>       Balanced Accuracy : 0.7978              
#>                                               
#>        'Positive' Class : 1                   
#> 

Precision in train data: 0.8278. Precision in test data: 0.6842.

Based on the comparison above, the difference in precision is 0.1436. This gap indicates that the model is overfitting, which we can try to fix by tuning the tree.

# tune with ctree_control parameters: mincriterion, minsplit and minbucket
model_dtree_new <- ctree(formula = Survived~.,
                         data = titanic_train_down,
                         control = ctree_control(mincriterion = 0.07, # 1 - p-value required to split a node
                                                 minsplit = 35,       # minimum observations in a node to attempt a split
                                                 minbucket = 10))     # minimum observations in a terminal node

plot(model_dtree_new, type = "simple")

Evaluate if the model is still overfitting.

# predict class in test data
pred_dtree_new <- predict(object = model_dtree_new,
                      newdata = titanic_test,
                      type = "response")

# model evaluation with confusion matrix
confusionMatrix(data = pred_dtree_new, titanic_test$Survived, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 61 15
#>          1 24 43
#>                                           
#>                Accuracy : 0.7273          
#>                  95% CI : (0.6465, 0.7983)
#>     No Information Rate : 0.5944          
#>     P-Value [Acc > NIR] : 0.0006414       
#>                                           
#>                   Kappa : 0.448           
#>                                           
#>  Mcnemar's Test P-Value : 0.2001848       
#>                                           
#>             Sensitivity : 0.7414          
#>             Specificity : 0.7176          
#>          Pos Pred Value : 0.6418          
#>          Neg Pred Value : 0.8026          
#>              Prevalence : 0.4056          
#>          Detection Rate : 0.3007          
#>    Detection Prevalence : 0.4685          
#>       Balanced Accuracy : 0.7295          
#>                                           
#>        'Positive' Class : 1               
#> 
# predict class in train data
pred_dtree_train_new <- predict(object = model_dtree_new,
                            newdata = titanic_train_down,
                            type = "response")

# model evaluation with confusion matrix
confusionMatrix(data = pred_dtree_train_new, titanic_train_down$Survived, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 200  46
#>          1  30 184
#>                                               
#>                Accuracy : 0.8348              
#>                  95% CI : (0.7976, 0.8676)    
#>     No Information Rate : 0.5                 
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.6696              
#>                                               
#>  Mcnemar's Test P-Value : 0.08532             
#>                                               
#>             Sensitivity : 0.8000              
#>             Specificity : 0.8696              
#>          Pos Pred Value : 0.8598              
#>          Neg Pred Value : 0.8130              
#>              Prevalence : 0.5000              
#>          Detection Rate : 0.4000              
#>    Detection Prevalence : 0.4652              
#>       Balanced Accuracy : 0.8348              
#>                                               
#>        'Positive' Class : 1                   
#> 
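
Comparing the train/test precision (Pos Pred Value) gaps before and after tuning shows the gap has actually widened:

# train/test precision gap, original vs tuned tree
0.8278 - 0.6842 # model_dtree
#> [1] 0.1436
0.8598 - 0.6418 # model_dtree_new
#> [1] 0.218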

The model is still overfitting. Thus, let’s try to improve on it with an ensemble method, the third approach.

💡 Interpretation:

Based on the confusion matrix of the untuned tree on the test data, the accuracy of the Decision Tree model is 74.13% and its precision is 68.42%. These values are lower than the previous model’s, which means this model is not as good as the Naive Bayes model.

# ROC
# predict test data in probability type
pred_dtree_prob <- predict(model_dtree, newdata = titanic_test, type = "prob")

# create prediction object
roc_dtree_pred <- prediction(predictions = pred_dtree_prob[, 2], # probability of the positive class "1"
                             labels = titanic_test$Survived == "1")

# ROC curve
plot(performance(prediction.obj = roc_dtree_pred,
                 measure = "tpr",
                 x.measure = "fpr"))

# AUC
dtree_auc <- performance(prediction.obj = roc_dtree_pred,
                         measure = "auc")

auc_dtree <- dtree_auc@y.values
auc_dtree
#> [[1]]
#> [1] 0.8307302

💡 Interpretation:

The AUC is about 0.83, slightly higher than the Naive Bayes model’s, indicating the Decision Tree also separates the positive and negative classes reasonably well.

Random Forest

Random Forest is an Ensemble Method that makes predictions by building many decision trees; the final prediction is decided by majority voting among those trees.
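
The voting step itself is simple. Here is a toy illustration with hypothetical votes from five trees (not part of the actual pipeline):

# five hypothetical trees vote on the class of a single observation
tree_votes <- c("1", "0", "1", "1", "0")
names(which.max(table(tree_votes))) # the majority class wins
#> [1] "1"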

# create model
set.seed(417)
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

model_rf <- train(form = Survived~.,
                  data = titanic_train_down,
                  method = "rf",
                  trControl = control)
   
# save the model
saveRDS(model_rf, "titanic_forest.RDS")

# read the model
model_rf <- readRDS("titanic_forest.RDS")

# view the model
model_rf$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 2
#> 
#>         OOB estimate of  error rate: 20.87%
#> Confusion matrix:
#>     0   1 class.error
#> 0 194  36   0.1565217
#> 1  60 170   0.2608696

💡 Interpretation:

Although we already performed cross validation by splitting the data and evaluating on a test set, Random Forest also has built-in validation: each tree is trained on a bootstrap sample, and the observations left out of that sample (the out-of-bag data) act as test data. The model uses them to estimate the prediction error, called the out-of-bag error.

For this Random Forest model, the out-of-bag error is 20.87%, which corresponds to an OOB accuracy estimate of 79.13%.
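
We can verify this from the OOB confusion matrix printed above: 36 + 60 observations are misclassified out of the 460 training rows.

# OOB error = misclassified / total, then accuracy = 1 - error
(36 + 60) / (194 + 36 + 60 + 170)
#> [1] 0.2086957
1 - 0.2086957
#> [1] 0.7913043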

We can also plot the out-of-bag (OOB) error for each class.

plot(model_rf$finalModel)
legend("topright", colnames(model_rf$finalModel$err.rate),
       col=1:6,cex=0.8,fill=1:6)

Based on the visualization above, the error rate drops sharply and flattens out after roughly 100 trees; beyond that point, additional trees reduce the OOB error only marginally.
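
If we want the exact point, the OOB error per number of trees is stored in the underlying randomForest object; a small sketch (the err.rate matrix has one row per tree and an "OOB" column):

# OOB error as a function of the number of trees
oob_err <- model_rf$finalModel$err.rate[, "OOB"]
which.min(oob_err) # number of trees at the lowest OOB error
min(oob_err)       # the lowest OOB error achieved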

We can also inspect the importance of each variable that was used in our random forest using varImp().

plot(varImp(model_rf))

From the plot above, Sexmale (the passenger being male) is the most important variable influencing the model.

Let’s test our model with test data.

# model evaluation
pred_rf <- predict(model_rf, newdata = titanic_test)

rf_conf <- confusionMatrix(pred_rf, titanic_test$Survived, positive = "1")
rf_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 65 15
#>          1 20 43
#>                                           
#>                Accuracy : 0.7552          
#>                  95% CI : (0.6764, 0.8232)
#>     No Information Rate : 0.5944          
#>     P-Value [Acc > NIR] : 0.0000397       
#>                                           
#>                   Kappa : 0.4992          
#>                                           
#>  Mcnemar's Test P-Value : 0.499           
#>                                           
#>             Sensitivity : 0.7414          
#>             Specificity : 0.7647          
#>          Pos Pred Value : 0.6825          
#>          Neg Pred Value : 0.8125          
#>              Prevalence : 0.4056          
#>          Detection Rate : 0.3007          
#>    Detection Prevalence : 0.4406          
#>       Balanced Accuracy : 0.7530          
#>                                           
#>        'Positive' Class : 1               
#> 

💡 Interpretation:

Based on the confusion matrix above, the accuracy of the model is 75.52% and its precision is 68.25%. These values are similar to the Decision Tree model’s, but lower than the first model’s (the Naive Bayes model).

# ROC
# predict test data in probability type
pred_rf_prob <- predict(model_rf, newdata = titanic_test, type = "prob")

# create prediction object
roc_rf_pred <- prediction(predictions = pred_rf_prob[, 2], # probability of the positive class "1"
                          labels = titanic_test$Survived == "1")

# ROC curve
plot(performance(prediction.obj = roc_rf_pred,
                 measure = "tpr",
                 x.measure = "fpr"))

# AUC
rf_auc <- performance(prediction.obj = roc_rf_pred,
                         measure = "auc")

auc_rf <- rf_auc@y.values
auc_rf
#> [[1]]
#> [1] 0.8406694

💡 Interpretation:

The AUC is about 0.84, the highest among the three models, which means the Random Forest separates the positive and negative classes well.

Conclusion

eval_naive <- tibble(Accuracy = naive_conf$overall[1],
                     Recall = naive_conf$byClass[1],
                     Specificity = naive_conf$byClass[2],
                     Precision = naive_conf$byClass[3])
eval_naive$AUC <- auc_naive[[1]]
eval_naive

eval_dtree <- tibble(Accuracy = dtree_conf$overall[1],
                     Recall = dtree_conf$byClass[1],
                     Specificity = dtree_conf$byClass[2],
                     Precision = dtree_conf$byClass[3])
eval_dtree$AUC <- auc_dtree[[1]]
eval_dtree

eval_rf <- tibble(Accuracy = rf_conf$overall[1],
                  Recall = rf_conf$byClass[1],
                  Specificity = rf_conf$byClass[2],
                  Precision = rf_conf$byClass[3])
eval_rf$AUC <- auc_rf[[1]]
eval_rf

# combine all metrics into one comparison table
df <- as.data.frame(rbind(eval_naive, eval_dtree, eval_rf))
rownames(df) <- c("Naive Bayes","Decision Tree","Random Forest")
df

Based on the metrics table above, the predictive model built with the Naive Bayes algorithm gives the best overall result: it has the highest accuracy (77.62%) and the highest precision (78.26%). Its AUC of about 0.82 is solid, although the Random Forest scores slightly higher on that metric (about 0.84). Since precision is the primary metric for this case, the best model for predicting whether or not a Titanic passenger survived the sinking is the Naive Bayes model.