1 Research Background

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

We might wondering what is the best model to predict the survival of Titanic passengers ?

2 Data Preaparation and EDA

2.1 Packages Data

Following are attached some of the packages used, please install the packages first if needed.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)
library(ROCR)
library(partykit) 
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(caret)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine

2.2 Import Data

Before finding the right model to examine the data, we certainly need the data first. In this case, I used data from https://www.kaggle.com/c/titanic

train <- read.csv("train.csv")
rmarkdown::paged_table(train)
test <- read.csv("test.csv")
rmarkdown::paged_table(test)

Test data has 418 obs and 11 columns. If we see in train data has 12 columns and in test data has 11 columns. The difference is that in train data has Survivedvariable which in test data has not. The Survived variable in test data is contained in gender_submission.csv. Let’s read the data.

survive <- read.csv("gender_submission.csv")
rmarkdown::paged_table(survive)

2.3 Variabel Description

Some information about the data :

Pclass : A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower Name : Name of the passenger Sex : Gender of passenger Age : Age in years Sibsp : siblings / spouses aboard the Titanic Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored) Parch : parents / children aboard the Titanic Parent = mother, father Child = daughter, son, stepdaughter, stepson (Some children travelled only with a nanny, therefore parch=0 for them.) Ticket : Ticket number Fare : Passenger fare Cabin : Cabin number Embarked : Port of Embarkation C = Cherbourg Q = Queenstown S = Southampton

3 Data Cleaning

3.1 Missing Data

colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
train_clean <- train %>%
    mutate(Age = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age))

colSums(is.na(train_clean))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
test1 <- cbind(test, Survived = survive$Survived)
colSums(is.na(test1))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0          86           0 
##       Parch      Ticket        Fare       Cabin    Embarked    Survived 
##           0           0           1           0           0           0
test_clean <- test1 %>% 
    na.omit()

3.2 Data Structure

data_train <- train_clean %>% 
          select(-c(PassengerId, Name, Ticket, Cabin)) %>% 
          mutate(Survived = as.factor(Survived),
           Pclass = as.factor(Pclass),
           Sex = as.factor(Sex),
           SibSp = as.factor(SibSp),
           Parch = as.factor(Parch),
           Embarked = as.factor(Embarked))

data_test <- test_clean %>% 
          select(-c(PassengerId, Name, Ticket, Cabin)) %>% 
          mutate(Survived = as.factor(Survived),
           Pclass = as.factor(Pclass),
           Sex = as.factor(Sex),
           SibSp = as.factor(SibSp),
           Parch = as.factor(Parch),
           Embarked = as.factor(Embarked))
str(data_train)
## 'data.frame':    891 obs. of  8 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
prop.table(table(data_train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

In the train dataset, the proportion of our data was around 61,61% not survived and around 38,38% survived. Let’s do down-sampling to balancing the proportion of target variable.

set.seed(267)
data_traind <- downSample(x = data_train[, -1], y = as.factor(data_train[, 1]), yname = "Survived")
rmarkdown::paged_table(data_traind)
prop.table(table(data_traind$Survived))
## 
##   0   1 
## 0.5 0.5

4. Modeling the Data

4.1 Naive Bayes

nvb_result <- naiveBayes(x = data_traind %>% select(-Survived), #prediktor
                          y = data_traind$Survived, #target
                          laplace = 1) 

Predict Data Naive Bayes

pred_label_naive <- predict(nvb_result, data_test, type = "class")
head(data.frame(actual = data_test$Survived, prediction = pred_label_naive))

Naive Bayes Model Evaluation

mat1 <- confusionMatrix(data = pred_label_naive, reference = data_test$Survived, positive = "1")
mat1
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 171  28
##          1  33  99
##                                          
##                Accuracy : 0.8157         
##                  95% CI : (0.7697, 0.856)
##     No Information Rate : 0.6163         
##     P-Value [Acc > NIR] : 3.183e-15      
##                                          
##                   Kappa : 0.6132         
##                                          
##  Mcnemar's Test P-Value : 0.6085         
##                                          
##             Sensitivity : 0.7795         
##             Specificity : 0.8382         
##          Pos Pred Value : 0.7500         
##          Neg Pred Value : 0.8593         
##              Prevalence : 0.3837         
##          Detection Rate : 0.2991         
##    Detection Prevalence : 0.3988         
##       Balanced Accuracy : 0.8089         
##                                          
##        'Positive' Class : 1              
## 

4.2 Decision Tree

Decision tree will produce a decision tree based on the patterns contained in the data. The results are visualized in a Flow Chart so that it can be easily understood how the model predicts.

model_dt <- ctree(Survived~ .,data_traind)
plot(model_dt, type="simple")

From the decision tree model, we can get some information such as:

Root Node: The first branch, the most important variable used to determine the target value. Ex : [1]

Interior Node: The second branch and so on, in the form of other variables that are used if the first branch is not enough to determine the target. Ex : [2], [4], [7], and [9]

Leaf / Terminal Node: Predicted target value or class. Ex : [3], [5], [6], [8], [10], and [11]

model_dt
## 
## Model formula:
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
## Fitted party:
## [1] root
## |   [2] Sex in female
## |   |   [3] Pclass in 1, 2: 1 (n = 165, err = 2.4%)
## |   |   [4] Pclass in 3
## |   |   |   [5] Fare <= 23.25: 1 (n = 104, err = 33.7%)
## |   |   |   [6] Fare > 23.25: 0 (n = 19, err = 15.8%)
## |   [7] Sex in male
## |   |   [8] Pclass in 1: 0 (n = 91, err = 49.5%)
## |   |   [9] Pclass in 2, 3
## |   |   |   [10] Age <= 12
## |   |   |   |   [11] SibSp in 0, 1, 2: 1 (n = 17, err = 0.0%)
## |   |   |   |   [12] SibSp in 3, 4, 5: 0 (n = 9, err = 11.1%)
## |   |   |   [13] Age > 12: 0 (n = 279, err = 16.5%)
## 
## Number of inner nodes:    6
## Number of terminal nodes: 7

From the model, we get the information that the classification is based on variables Sex, Pclass, Fare, and Age

Predict Data Decision Tree

pred_test_dt <- predict(model_dt, newdata = data_test, 
                          type = "response")
head(pred_test_dt)
## 1 2 3 4 5 6 
## 0 1 0 0 1 0 
## Levels: 0 1

Decision Tree Model Evaluation

mat2 <- confusionMatrix(data = pred_test_dt, reference = data_test$Survived, positive = "1")
mat2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 195   2
##          1   9 125
##                                           
##                Accuracy : 0.9668          
##                  95% CI : (0.9413, 0.9833)
##     No Information Rate : 0.6163          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9305          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.9843          
##             Specificity : 0.9559          
##          Pos Pred Value : 0.9328          
##          Neg Pred Value : 0.9898          
##              Prevalence : 0.3837          
##          Detection Rate : 0.3776          
##    Detection Prevalence : 0.4048          
##       Balanced Accuracy : 0.9701          
##                                           
##        'Positive' Class : 1               
## 

4.3 Random Forest

The disadvantage of random forest is its huge and heavy computing model. This can be reduced by selecting the predictor so that it is not too much. If there is large number of columns, we can delete columns that have near-zero (less informative) variance with nearZeroVar () from the caret package.

nearZeroVar(data_traind)
## integer(0)
set.seed(267)

ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3) # k-fold cross validation
survive_forest <- train(Survived ~ ., data = data_traind, method = "rf", trControl = ctrl)
survive_forest
## Random Forest 
## 
## 684 samples
##   7 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 546, 547, 548, 548, 547, 548, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7738311  0.5476843
##   11    0.8001697  0.6003162
##   20    0.7831127  0.5662113
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.

From the model summary, we know that the optimum number of variables considered for splitting at each tree node is 11. We can inspect the importance of each variable that was used in our random forest using varImp().

varImp(survive_forest)
## rf variable importance
## 
##             Overall
## Fare      100.00000
## Sexmale    95.16120
## Age        78.25422
## Pclass3    25.78161
## SibSp1      7.60459
## EmbarkedC   6.03533
## EmbarkedS   5.92940
## Parch1      5.01794
## Parch2      4.58701
## Pclass2     4.46792
## EmbarkedQ   3.36343
## SibSp3      2.04308
## SibSp4      1.89459
## SibSp2      1.38080
## SibSp8      1.01522
## Parch4      0.57022
## Parch5      0.46405
## SibSp5      0.45973
## Parch3      0.04909
## Parch6      0.00000
plot(survive_forest$finalModel)
legend("topright", colnames(survive_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

survive_forest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 11
## 
##         OOB estimate of  error rate: 19.74%
## Confusion matrix:
##     0   1 class.error
## 0 285  57   0.1666667
## 1  78 264   0.2280702

Predict Data Random Forest

pred_test_rf <- predict(survive_forest, newdata = data_test, type = "raw")

Random Forest Model Evaluation

mat3 <- confusionMatrix(data = pred_test_rf, reference = data_test$Survived, positive = "1")
mat3
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 168  20
##          1  36 107
##                                          
##                Accuracy : 0.8308         
##                  95% CI : (0.786, 0.8696)
##     No Information Rate : 0.6163         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.6506         
##                                          
##  Mcnemar's Test P-Value : 0.04502        
##                                          
##             Sensitivity : 0.8425         
##             Specificity : 0.8235         
##          Pos Pred Value : 0.7483         
##          Neg Pred Value : 0.8936         
##              Prevalence : 0.3837         
##          Detection Rate : 0.3233         
##    Detection Prevalence : 0.4320         
##       Balanced Accuracy : 0.8330         
##                                          
##        'Positive' Class : 1              
## 

Conclusion

#get the accuracy, sensitivity, specificity, precision and AUC of model naive bayes
m_naive <- data.frame(Model = "Naive Bayes",
           Accuracy = round((mat1$table[4] + mat1$table[1]) / sum(mat1$table),4),
           Sensitivity = round(mat1$table[4] / (mat1$table[4] + mat1$table[3]),4),
           Specificity = round(mat1$table[1] / (mat1$table[1] + mat1$table[2]),4),
           Precision = round(mat1$table[4] / (mat1$table[4] + mat1$table[2]),4))

#get the accuracy, sensitivity, specificity, precision and AUC of model decision tree
m_dt <- data.frame(Model = "Decision Tree",
           Accuracy = round((mat2$table[4] + mat2$table[1]) / sum(mat2$table),4),
           Sensitivity = round(mat2$table[4] / (mat2$table[4] + mat2$table[3]),4),
           Specificity = round(mat2$table[1] / (mat2$table[1] + mat2$table[2]),4),
           Precision = round(mat2$table[4] / (mat2$table[4] + mat2$table[2]),4))

#get the accuracy, sensitivity, specificity, precision and AUC of model random forest
m_rf <- data.frame(Model = "Random Forest",
           Accuracy = round((mat3$table[4] + mat3$table[1]) / sum(mat3$table),4),
           Sensitivity = round(mat3$table[4] / (mat3$table[4] + mat3$table[3]),4),
           Specificity = round(mat3$table[1] / (mat3$table[1] + mat3$table[2]),4),
           Precision = round(mat3$table[4] / (mat3$table[4] + mat3$table[2]),4))

rbind(m_naive, m_dt, m_rf)

The Survived prediction of Titanic passengers based on the metrics table above : The predictive model built using Decision Tree algorithm gave the best result. The model gave highest accuracy 96% and sensitivity 98% while also maintain specificity and precision above 90%. Therefore the best model to predict the survival of Titanic passengers is Decision Tree algorithm.