Predicting the Survival of Titanic Passengers Using Naive Bayes, Decision Tree, and Random Forest
Introduction
We have already predicted the survival of Titanic passengers using Logistic Regression and K-NN models in a previous post. Now, we'll predict using Naive Bayes, Decision Tree, and Random Forest models. The target variable is the same as before: Survived. The dataset was provided by Kaggle.
Import Library and Setup
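The setup chunk is not shown here; a plausible set of libraries for this analysis, inferred from the functions used below, is:
library(dplyr)        # data wrangling
library(caret)        # downSample(), train(), confusionMatrix(), nearZeroVar()
library(e1071)        # naiveBayes()
library(partykit)     # ctree() for the decision tree
library(randomForest) # backend for caret's method = "rf"
library(ROCR)         # prediction() and performance() for ROC/AUC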
Data Import
This data contains 3 files:
1. train.csv: training data (complete with predictor variables and the target)
2. test.csv: testing data (without the Survived variable)
3. gender_submission.csv: contains the passenger ID and Survived variables
Let’s read the data
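The import chunk is not shown; a minimal sketch, assuming the Kaggle files sit next to the notebook (file paths and object names are assumptions):
train <- read.csv("train.csv")
test <- read.csv("test.csv")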
Data Train
The train data has 891 observations and 12 columns. We'll use this data to build the models later.
Data Test
The test data has 418 observations and 11 columns. The train data has 12 columns while the test data has 11: the difference is the Survived variable, which the train data has and the test data does not. The Survived variable for the test data is contained in gender_submission.csv. Let's read it as well.
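Again a sketch, with the object name submission as an assumption:
submission <- read.csv("gender_submission.csv")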
Some information about the data:
Pclass: a proxy for socio-economic status (SES); 1st = Upper, 2nd = Middle, 3rd = Lower
Name: name of the passenger
Sex: gender of the passenger
Age: age in years
SibSp: number of siblings/spouses aboard the Titanic (Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife; mistresses and fiancés were ignored)
Parch: number of parents/children aboard the Titanic (Parent = mother, father; Child = daughter, son, stepdaughter, stepson; some children travelled only with a nanny, so Parch = 0 for them)
Ticket: ticket number
Fare: passenger fare
Cabin: cabin number
Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Data Preparation and EDA
To make things easier to work with, let's join the test data with its Survived labels, so that we only work with two datasets.
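The join chunk is not shown; a sketch that attaches the Survived column from gender_submission.csv by PassengerId (object names assumed):
test <- test %>%
  left_join(submission, by = "PassengerId")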
Now we only have two datasets: train and test. Let's check whether there are any missing values and inspect the data types.
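The check producing the output below is presumably:
colSums(is.na(train))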
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
There are 177 missing values in the Age column, around 19% of our data. Instead of removing those rows, let's replace the missing ages with the mean age.
train_clean <- train %>%
mutate(Age = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age))
colSums(is.na(train_clean))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
Let's check if there are any missing values in the test data.
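Presumably the same check on the joined test data:
colSums(is.na(test))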
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 86 0
## Parch Ticket Fare Cabin Embarked Survived
## 0 0 1 0 0 0
There are some missing values in the test data: 86 in Age and 1 in Fare. To prevent NAs from affecting the prediction results, we will remove the rows that contain NA.
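The cleaning chunk is not shown; a sketch of the drop-NA approach (test_clean is the name used later):
test_clean <- na.omit(test)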
Here, Survived is the target variable, and some variables do not contain much information for modeling. Let's select the predictor variables and convert the data types.
data_train <- train_clean %>%
select(-c(PassengerId, Name, Ticket, Cabin)) %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
SibSp = as.factor(SibSp),
Parch = as.factor(Parch),
Embarked = as.factor(Embarked))
data_test <- test_clean %>%
select(-c(PassengerId, Name, Ticket, Cabin)) %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
SibSp = as.factor(SibSp),
Parch = as.factor(Parch),
Embarked = as.factor(Embarked))
Let's check the structure of the train data using str().
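The call producing the output below:
str(data_train)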
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Before we continue to modeling, let's check the class proportions of the target variable.
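Presumably via prop.table():
prop.table(table(data_train$Survived))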
##
## 0 1
## 0.6161616 0.3838384
In the train dataset, around 61.62% of passengers did not survive and around 38.38% survived. Let's do down-sampling to balance the target variable.
set.seed(267)
data_traind <- downSample(x = data_train[, -1], y = as.factor(data_train[, 1]), yname = "Survived")
rmarkdown::paged_table(data_traind)
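Checking the class proportions again shows they are now balanced; the chunk producing the output below is presumably:
prop.table(table(data_traind$Survived))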
##
## 0 1
## 0.5 0.5
Naive Bayes
Modeling
After preparing the data, let's build a Naive Bayes model.
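The fitting chunk is not shown; given the predict(..., type = "class") and predict(..., type = "raw") calls below, the model was presumably fit with e1071's naiveBayes(). A minimal sketch:
# fit Naive Bayes on the down-sampled training data
model_naive <- naiveBayes(Survived ~ ., data = data_traind)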
Predict Data
pred_label_naive <- predict(model_naive, data_test, type = "class")
head(data.frame(actual = data_test$Survived, prediction = pred_label_naive))
## actual prediction
## 1 0 0
## 2 1 1
## 3 0 0
## 4 0 0
## 5 1 1
## 6 0 0
Model Evaluation
We can evaluate the Naive Bayes model using a confusion matrix. We can also check the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the ROC Curve).
Confusion Matrix
mat1 <- confusionMatrix(data = pred_label_naive, reference = data_test$Survived, positive = "1")
mat1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 171 28
## 1 33 99
##
## Accuracy : 0.8157
## 95% CI : (0.7697, 0.856)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : 3.183e-15
##
## Kappa : 0.6132
##
## Mcnemar's Test P-Value : 0.6085
##
## Sensitivity : 0.7795
## Specificity : 0.8382
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.8593
## Prevalence : 0.3837
## Detection Rate : 0.2991
## Detection Prevalence : 0.3988
## Balanced Accuracy : 0.8089
##
## 'Positive' Class : 1
##
ROC (Receiver Operating Characteristic) Curve
The ROC curve plots the True Positive Rate (Sensitivity or Recall) against the False Positive Rate (1 - Specificity). A good model should ideally have a high True Positive Rate and a low False Positive Rate.
# get the probability prediction
prob_survive <- predict(model_naive, data_test, type = "raw")
# prepare dataframe for ROC
data_roc <- data.frame(prob = prob_survive[,2], # probability of the positive class (survived)
labels = as.numeric(data_test$Survived == "1")) # 1 if the passenger actually survived
head(data_roc)
## prob labels
## 1 0.04078981 0
## 2 0.53590515 1
## 3 0.18470196 0
## 4 0.03860111 0
## 5 0.75699552 1
## 6 0.05263042 0
naive_roc <- ROCR::prediction(data_roc$prob, data_roc$labels)
# ROC curve
plot(performance(naive_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
main = "ROC")
abline(a = 0, b = 1)
AUC (Area Under the ROC Curve)
The AUC is the area under the ROC curve; the closer it is to 1, the better the model performs. To get the AUC value, pass "auc" as the measure argument of performance() and retrieve y.values.
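The AUC chunk is not shown; it presumably looks like this (auc_n is the name reused in the Conclusion):
auc_n <- performance(naive_roc, measure = "auc")
auc_n@y.values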
## [[1]]
## [1] 0.8857496
Decision Tree
A decision tree produces a set of splitting rules based on the patterns contained in the data. The result can be visualized as a flow chart, making it easy to understand how the model predicts.
Modeling
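The fitting chunk is not shown; judging from the "Fitted party" printout below, the tree was presumably fit with partykit's ctree(). A minimal sketch:
# fit a conditional inference tree on the down-sampled training data
model_dt <- ctree(Survived ~ ., data = data_traind)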
From the decision tree printout below, we can get some information such as:
Root Node: the first split, based on the most important variable for determining the target. Ex: [1]
Interior Node: the second split and so on, using other variables when the first split is not enough to determine the target. Ex: [2], [4], [7], [9], and [10]
Leaf / Terminal Node: the predicted target value or class. Ex: [3], [5], [6], [8], [11], [12], and [13]
##
## Model formula:
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
##
## Fitted party:
## [1] root
## | [2] Sex in female
## | | [3] Pclass in 1, 2: 1 (n = 165, err = 2.4%)
## | | [4] Pclass in 3
## | | | [5] Fare <= 23.25: 1 (n = 104, err = 33.7%)
## | | | [6] Fare > 23.25: 0 (n = 19, err = 15.8%)
## | [7] Sex in male
## | | [8] Pclass in 1: 0 (n = 91, err = 49.5%)
## | | [9] Pclass in 2, 3
## | | | [10] Age <= 12
## | | | | [11] SibSp in 0, 1, 2: 1 (n = 17, err = 0.0%)
## | | | | [12] SibSp in 3, 4, 5: 0 (n = 9, err = 11.1%)
## | | | [13] Age > 12: 0 (n = 279, err = 16.5%)
##
## Number of inner nodes: 6
## Number of terminal nodes: 7
From the model, we learn that the classification is based on the variables Sex, Pclass, Fare, Age, and SibSp.
Predict Data
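The prediction chunk is not shown; a sketch (for partykit objects, type = "response" returns class labels):
pred_label_dt <- predict(model_dt, data_test, type = "response")
head(pred_label_dt)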
## 1 2 3 4 5 6
## 0 1 0 0 1 0
## Levels: 0 1
Model Evaluation
We can evaluate the decision tree model the same way, using a confusion matrix, the ROC curve, and the AUC.
Confusion Matrix
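Presumably the same pattern as before (mat2 is the name reused in the Conclusion):
mat2 <- confusionMatrix(data = pred_label_dt, reference = data_test$Survived, positive = "1")
mat2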
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 195 2
## 1 9 125
##
## Accuracy : 0.9668
## 95% CI : (0.9413, 0.9833)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9305
##
## Mcnemar's Test P-Value : 0.07044
##
## Sensitivity : 0.9843
## Specificity : 0.9559
## Pos Pred Value : 0.9328
## Neg Pred Value : 0.9898
## Prevalence : 0.3837
## Detection Rate : 0.3776
## Detection Prevalence : 0.4048
## Balanced Accuracy : 0.9701
##
## 'Positive' Class : 1
##
ROC (Receiver Operating Characteristic) Curve
prob_survive_dt <- predict(model_dt, data_test, type = "prob")
data_roc1 <- data.frame(prob = prob_survive_dt[,2], # probability of the positive class (survived)
labels = as.numeric(data_test$Survived == "1")) # 1 if the passenger actually survived
dt_roc <- ROCR::prediction(data_roc1$prob, data_roc1$labels)
# ROC curve
plot(performance(dt_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
main = "ROC")
abline(a = 0, b = 1)
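The decision tree's AUC (used as dt_auc in the Conclusion) is presumably obtained the same way:
dt_auc <- performance(dt_roc, measure = "auc")
dt_auc@y.values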
Random Forest
The main disadvantage of random forest is that the model is large and computationally heavy. This can be mitigated by keeping the number of predictors modest. If there is a large number of columns, we can remove columns that have near-zero (less informative) variance with nearZeroVar() from the caret package.
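The check producing the output below is presumably the following; it returns the indices of near-zero-variance columns, so integer(0) means none:
nearZeroVar(data_traind)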
## integer(0)
We can see that no column has near-zero variance. Let's build the random forest model.
Modeling
set.seed(267)
ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3) # k-fold cross validation
survive_forest <- train(Survived ~ ., data = data_traind, method = "rf", trControl = ctrl)
After fitting the model, let's print it.
## Random Forest
##
## 684 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 546, 547, 548, 548, 547, 548, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7738311 0.5476843
## 11 0.8001697 0.6003162
## 20 0.7831127 0.5662113
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.
From the model summary, we know that the optimal number of variables considered for splitting at each tree node is mtry = 11. We can also inspect the importance of each variable used in our random forest with varImp().
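That is:
varImp(survive_forest)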
## rf variable importance
##
## Overall
## Fare 100.00000
## Sexmale 95.16120
## Age 78.25422
## Pclass3 25.78161
## SibSp1 7.60459
## EmbarkedC 6.03533
## EmbarkedS 5.92940
## Parch1 5.01794
## Parch2 4.58701
## Pclass2 4.46792
## EmbarkedQ 3.36343
## SibSp3 2.04308
## SibSp4 1.89459
## SibSp2 1.38080
## SibSp8 1.01522
## Parch4 0.57022
## Parch5 0.46405
## SibSp5 0.45973
## Parch3 0.04909
## Parch6 0.00000
Even though our data has been split into train and test sets, we don't strictly need a separate test set when using random forest. Bootstrap sampling leaves some observations out of each tree; these are called out-of-bag (OOB) data, and the model can treat them as an internal test set.
plot(survive_forest$finalModel)
legend("topright", colnames(survive_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)
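The printed summary below presumably comes from inspecting the final model directly:
survive_forest$finalModel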
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 11
##
## OOB estimate of error rate: 19.74%
## Confusion matrix:
## 0 1 class.error
## 0 285 57 0.1666667
## 1 78 264 0.2280702
Model Evaluation
We can evaluate the random forest model the same way, using a confusion matrix, the ROC curve, and the AUC.
Confusion Matrix
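Presumably the same pattern as before (pred_label_rf and mat3 are assumed names; mat3 is reused in the Conclusion):
pred_label_rf <- predict(survive_forest, data_test)
mat3 <- confusionMatrix(data = pred_label_rf, reference = data_test$Survived, positive = "1")
mat3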
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 168 20
## 1 36 107
##
## Accuracy : 0.8308
## 95% CI : (0.786, 0.8696)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6506
##
## Mcnemar's Test P-Value : 0.04502
##
## Sensitivity : 0.8425
## Specificity : 0.8235
## Pos Pred Value : 0.7483
## Neg Pred Value : 0.8936
## Prevalence : 0.3837
## Detection Rate : 0.3233
## Detection Prevalence : 0.4320
## Balanced Accuracy : 0.8330
##
## 'Positive' Class : 1
##
ROC (Receiver Operating Characteristic) Curve
prob_survive_rf <- predict(survive_forest, data_test, type = "prob")
data_roc2 <- data.frame(prob = prob_survive_rf[,2], # probability of the positive class (survived)
labels = as.numeric(data_test$Survived == "1")) # 1 if the passenger actually survived
rf_roc <- ROCR::prediction(data_roc2$prob, data_roc2$labels)
# ROC curve
plot(performance(rf_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
main = "ROC")
abline(a = 0, b = 1)
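And the random forest's AUC (rf_auc in the Conclusion):
rf_auc <- performance(rf_roc, measure = "auc")
rf_auc@y.values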
Conclusion
After building the models, making predictions, and evaluating them, let's compare the Naive Bayes, Decision Tree, and Random Forest models for this case.
#get the accuracy, sensitivity, specificity, precision and AUC of model naive bayes
m_naive <- data.frame(Model = "Naive Bayes",
Accuracy = round((mat1$table[4] + mat1$table[1]) / sum(mat1$table),4),
Sensitivity = round(mat1$table[4] / (mat1$table[4] + mat1$table[3]),4),
Specificity = round(mat1$table[1] / (mat1$table[1] + mat1$table[2]),4),
Precision = round(mat1$table[4] / (mat1$table[4] + mat1$table[2]),4),
AUC = round(as.numeric(auc_n@y.values),4))
#get the accuracy, sensitivity, specificity, precision and AUC of model decision tree
m_dt <- data.frame(Model = "Decision Tree",
Accuracy = round((mat2$table[4] + mat2$table[1]) / sum(mat2$table),4),
Sensitivity = round(mat2$table[4] / (mat2$table[4] + mat2$table[3]),4),
Specificity = round(mat2$table[1] / (mat2$table[1] + mat2$table[2]),4),
Precision = round(mat2$table[4] / (mat2$table[4] + mat2$table[2]),4),
AUC = round(as.numeric(dt_auc@y.values),4))
#get the accuracy, sensitivity, specificity, precision and AUC of model random forest
m_rf <- data.frame(Model = "Random Forest",
Accuracy = round((mat3$table[4] + mat3$table[1]) / sum(mat3$table),4),
Sensitivity = round(mat3$table[4] / (mat3$table[4] + mat3$table[3]),4),
Specificity = round(mat3$table[1] / (mat3$table[1] + mat3$table[2]),4),
Precision = round(mat3$table[4] / (mat3$table[4] + mat3$table[2]),4),
AUC = round(as.numeric(rf_auc@y.values),4))
rbind(m_naive, m_dt, m_rf)
## Model Accuracy Sensitivity Specificity Precision AUC
## 1 Naive Bayes 0.8157 0.7795 0.8382 0.7500 0.8857
## 2 Decision Tree 0.9668 0.9843 0.9559 0.9328 0.9411
## 3 Random Forest 0.8308 0.8425 0.8235 0.7483 0.9203
Based on the metrics table above, the Survived prediction for Titanic passengers is best served by the Decision Tree model. It gave the highest accuracy (96.68%) and sensitivity (98.43%) while also maintaining specificity and precision above 93%. It also gave the highest AUC (94.11%). Therefore, the best model to predict the survival of Titanic passengers is the Decision Tree.