This time, we want to build a predictive model that predicts whether or not passengers of the Titanic survived the sinking of the ship. It will be a model that answers questions like “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.). Previously, we used logistic regression and K-Nearest Neighbor (K-NN) as the classification methods and obtained a precision of 83.87% (using logistic regression). Now we are going to build other predictive models using Naive Bayes, Decision Tree, and Random Forest to see whether we can get a better accuracy or precision value. The dataset is split into two groups, i.e. a training set and a test set, taken from here.
Load library
# libraries
library(readr)
library(tidyverse)
library(gtools)
library(class)
library(caret)
library(e1071)
library(car)
library(rsample)
library(ROCR)
library(partykit)
library(randomForest)
library(dplyr)
Read data train.csv
titanic <- read.csv("train.csv")
head(titanic)
Description:
PassengerId: Row ID in the data set
Survived: If a passenger survived or not (0 = No, 1 = Yes)
Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name: Passenger’s name
Sex: Passenger’s gender (male or female)
Age: Passenger’s age (in years)
SibSp: # of siblings / spouses aboard the Titanic
Parch: # of parents / children aboard the Titanic
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Check missing values
colSums(is.na(titanic))
#> PassengerId Survived Pclass Name Sex Age
#> 0 0 0 0 0 177
#> SibSp Parch Ticket Fare Cabin Embarked
#> 0 0 0 0 0 0
Since there are missing values in the Age variable, we will keep only the rows that have a value for Age. This is important because we do not want missing data to interfere with our model later on. We will do this in the data wrangling section.
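One caveat: is.na() does not flag empty strings. A quick optional check (a small sketch, not part of the original workflow) shows that Embarked and Cabin contain blank entries even though they report zero NAs, which is why we also drop rows with an empty Embarked value in the wrangling step:
# count empty-string entries per column (blanks are "" rather than NA, so is.na() misses them)
colSums(titanic == "", na.rm = TRUE)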
Data wrangling
glimpse(titanic)
#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
#> $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
#> $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
#> $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
#> $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal~
#> $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
#> $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
#> $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
#> $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
#> $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
#> $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
# data selection and transformation
titanic_clean <- titanic %>%
filter(!is.na(Age), Embarked != "") %>% # keep only rows with an Age value and a non-empty Embarked
select(-c(PassengerId, Name, Ticket, Cabin)) %>% # drop columns that are not useful as predictors
mutate(
Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
Embarked = as.factor(Embarked)
)
str(titanic_clean)
#> 'data.frame': 712 obs. of 8 variables:
#> $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 2 2 ...
#> $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
#> $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
#> $ Age : num 22 38 26 35 35 54 2 27 14 4 ...
#> $ SibSp : int 1 1 0 1 0 0 3 0 1 1 ...
#> $ Parch : int 0 0 0 0 0 0 1 2 0 1 ...
#> $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
#> $ Embarked: Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 3 3 3 1 3 ...
Check missing values after data wrangling
colSums(is.na(titanic_clean))
#> Survived Pclass Sex Age SibSp Parch Fare Embarked
#> 0 0 0 0 0 0 0 0
summary(titanic_clean)
#> Survived Pclass Sex Age SibSp Parch
#> 0:424 1:184 female:259 Min. : 0.42 Min. :0.000 Min. :0.0000
#> 1:288 2:173 male :453 1st Qu.:20.00 1st Qu.:0.000 1st Qu.:0.0000
#> 3:355 Median :28.00 Median :0.000 Median :0.0000
#> Mean :29.64 Mean :0.514 Mean :0.4326
#> 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:1.0000
#> Max. :80.00 Max. :5.000 Max. :6.0000
#> Fare Embarked
#> Min. : 0.00 C:130
#> 1st Qu.: 8.05 Q: 28
#> Median : 15.65 S:554
#> Mean : 34.57
#> 3rd Qu.: 33.00
#> Max. :512.33
prop.table(table(titanic_clean$Survived))
#>
#> 0 1
#> 0.5955056 0.4044944
Based on the proportions above, the target variable classes (Survived) are balanced enough, so we do not need additional data pre-processing to balance them.
When doing prediction, we need to show that the model can handle new data. Thus, we have to evaluate the model on data it has not seen, which means splitting the data into a training set and a test set.
Let’s say we use 80% of the data as training data and the rest as test data.
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- initial_split(data = titanic_clean,
prop = 0.8,
strata = "Survived")
titanic_train <- training(index)
titanic_test <- testing(index)
# re-check class imbalance
prop.table(table(titanic_train$Survived))
#>
#> 0 1
#> 0.5957821 0.4042179
If we look at the proportions in the training data, we can still consider the classes balanced. In this case, however, we want to make them even more balanced, with a 50:50 proportion, so we apply downsampling.
# downsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
titanic_train_down <- downSample(x = titanic_train %>% select(-Survived),
y = titanic_train$Survived,
yname = "Survived")
dim(titanic_train_down)
#> [1] 460 8
dim(titanic_train)
#> [1] 569 8
# re-check class imbalance
prop.table(table(titanic_train_down$Survived))
#>
#> 0 1
#> 0.5 0.5
Now we use titanic_train_down as our train data to build
models.
# build model based on train data
model_naive <- naiveBayes(Survived~., data = titanic_train_down, laplace = 1)
# predict class of test data
pred_naive <- predict(model_naive, newdata = titanic_test, type = "class")
# model evaluation with confusion matrix
naive_conf <- confusionMatrix(data = pred_naive, titanic_test$Survived, positive = "1")
naive_conf
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 75 22
#> 1 10 36
#>
#> Accuracy : 0.7762
#> 95% CI : (0.699, 0.8416)
#> No Information Rate : 0.5944
#> P-Value [Acc > NIR] : 0.000003382
#>
#> Kappa : 0.5201
#>
#> Mcnemar's Test P-Value : 0.05183
#>
#> Sensitivity : 0.6207
#> Specificity : 0.8824
#> Pos Pred Value : 0.7826
#> Neg Pred Value : 0.7732
#> Prevalence : 0.4056
#> Detection Rate : 0.2517
#> Detection Prevalence : 0.3217
#> Balanced Accuracy : 0.7515
#>
#> 'Positive' Class : 1
#>
💡 Interpretation:
Based on the confusion matrix above, the accuracy of the model is
77.62%. In this Titanic survival case, we want to avoid predicting that a
passenger survived when they actually did not. Therefore, we use Precision
as the metric, to minimize False Positives (passengers who did not actually
survive but are predicted as survivors). The precision value of the model is
78.26%.
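As a quick check, the precision can be recomputed by hand from the confusion-matrix counts above (true positives over everything predicted as survived):
# precision = TP / (TP + FP), using the counts from the confusion matrix above
tp <- 36 # predicted survived (1) and actually survived
fp <- 10 # predicted survived (1) but did not actually survive
tp / (tp + fp)
#> [1] 0.7826087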
Accuracy alone does not fully reflect how well the model separates the classes. Therefore, let’s calculate the ROC curve and AUC as additional evaluation tools.
ROC (Receiver Operating Characteristic) describes the relationship between the True Positive Rate (Sensitivity/Recall) and the False Positive Rate (1 - Specificity) at every threshold. Ideally, a good model has a high True Positive Rate and a low False Positive Rate.
AUC (Area Under the Curve) measures the area under the ROC curve. The higher the AUC score, the better the model separates the classes.
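To make the definitions concrete, here is a tiny sketch with made-up scores (not our model’s output) showing how TPR and FPR are obtained at a single threshold; the ROC curve simply repeats this for every possible threshold:
# toy example: TPR and FPR at a 0.5 threshold (hypothetical scores)
actual <- c(1, 1, 1, 0, 0, 0)               # true classes (1 = survived)
score  <- c(0.9, 0.6, 0.2, 0.7, 0.4, 0.1)   # pretend P(Survived = 1)
pred   <- as.integer(score >= 0.5)          # classify at the threshold
tp <- sum(pred == 1 & actual == 1); fn <- sum(pred == 0 & actual == 1)
fp <- sum(pred == 1 & actual == 0); tn <- sum(pred == 0 & actual == 0)
c(TPR = tp / (tp + fn), FPR = fp / (fp + tn))
#> TPR = 2/3, FPR = 1/3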
# predict test data in probability type
pred_naive_prob <- predict(model_naive, newdata = titanic_test, type = "raw")
# create prediction object
# note: column 1 of pred_naive_prob holds P(Survived = 0), so the curve and AUC below come out inverted
roc_naive_pred <- prediction(predictions = pred_naive_prob[,1],
labels = titanic_test$Survived=="1")
# ROC curve
plot(performance(prediction.obj = roc_naive_pred,
measure = "tpr",
x.measure = "fpr"))
# AUC
naive_auc <- performance(prediction.obj = roc_naive_pred,
measure = "auc")
auc_naive <- naive_auc@y.values
auc_naive
#> [[1]]
#> [1] 0.1772819
💡 Interpretation:
The AUC looks very poor (0.177), but an AUC far below 0.5 for a classifier that we already know beats the no-information rate signals inverted scores rather than a bad model: the prediction object was built from the probability of the negative class (column 1). With respect to the positive class the AUC is 1 - 0.177 ≈ 0.82, so the Naive Bayes model actually separates the positive and negative classes reasonably well.
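A minimal sketch of the corrected calculation (using the positive-class column of pred_naive_prob; the result should simply be one minus the figure above):
# build the prediction object from P(Survived = 1) so the AUC reads in the usual direction
roc_naive_pos <- prediction(predictions = pred_naive_prob[, "1"],
labels = titanic_test$Survived == "1")
performance(roc_naive_pos, measure = "auc")@y.values
#> expected: 1 - 0.1772819 = 0.8227181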
Next let’s try using Decision Tree to see if the model has better interpretability.
# build model based on train data
model_dtree <- ctree(formula = Survived~.,
data = titanic_train_down)
# visualize decision tree
plot(model_dtree, type = "simple")
# predict class in test data
pred_dtree <- predict(object = model_dtree,
newdata = titanic_test,
type = "response")
# model evaluation with confusion matrix
dtree_conf <- confusionMatrix(data = pred_dtree, titanic_test$Survived, positive = "1")
dtree_conf
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 67 19
#> 1 18 39
#>
#> Accuracy : 0.7413
#> 95% CI : (0.6614, 0.8108)
#> No Information Rate : 0.5944
#> P-Value [Acc > NIR] : 0.000171
#>
#> Kappa : 0.4619
#>
#> Mcnemar's Test P-Value : 1.000000
#>
#> Sensitivity : 0.6724
#> Specificity : 0.7882
#> Pos Pred Value : 0.6842
#> Neg Pred Value : 0.7791
#> Prevalence : 0.4056
#> Detection Rate : 0.2727
#> Detection Prevalence : 0.3986
#> Balanced Accuracy : 0.7303
#>
#> 'Positive' Class : 1
#>
Besides evaluating the confusion matrix on the test data, it is also useful to compare the model’s performance on the training data to assess how well the model fits (and whether it overfits).
# predict class in train data
pred_dtree_train <- predict(object = model_dtree,
newdata = titanic_train_down,
type = "response")
# model evaluation with confusion matrix
confusionMatrix(data = pred_dtree_train, titanic_train_down$Survived, positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 194 57
#> 1 36 173
#>
#> Accuracy : 0.7978
#> 95% CI : (0.7582, 0.8336)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.5957
#>
#> Mcnemar's Test P-Value : 0.03809
#>
#> Sensitivity : 0.7522
#> Specificity : 0.8435
#> Pos Pred Value : 0.8278
#> Neg Pred Value : 0.7729
#> Prevalence : 0.5000
#> Detection Rate : 0.3761
#> Detection Prevalence : 0.4543
#> Balanced Accuracy : 0.7978
#>
#> 'Positive' Class : 1
#>
Precision in train data: 0.8278
Precision in test data: 0.6842
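The same numbers can also be pulled out programmatically; a small sketch (assuming we also store the training-data confusion matrix, here under the hypothetical name dtree_conf_train, which is not used elsewhere in this post):
# hypothetical helper object: confusion matrix on the training data
dtree_conf_train <- confusionMatrix(data = pred_dtree_train,
titanic_train_down$Survived,
positive = "1")
# precision is reported as "Pos Pred Value"
dtree_conf_train$byClass["Pos Pred Value"] - dtree_conf$byClass["Pos Pred Value"]
#> roughly 0.8278 - 0.6842 = 0.1436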
Based on the comparison above, the difference in precision is 0.1436. This gap suggests the model is overfitting, which we can try to address with model tuning.
# use parameters: mincriterion, minsplit and minbucket
model_dtree_new <- ctree(formula = Survived~.,
data = titanic_train_down,
control = ctree_control(mincriterion = 0.07,
minsplit = 35,
minbucket = 10))
plot(model_dtree_new, type = "simple")
Evaluate if the model is still overfitting.
# predict class in test data
pred_dtree_new <- predict(object = model_dtree_new,
newdata = titanic_test,
type = "response")
# model evaluation with confusion matrix
confusionMatrix(data = pred_dtree_new, titanic_test$Survived, positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 61 15
#> 1 24 43
#>
#> Accuracy : 0.7273
#> 95% CI : (0.6465, 0.7983)
#> No Information Rate : 0.5944
#> P-Value [Acc > NIR] : 0.0006414
#>
#> Kappa : 0.448
#>
#> Mcnemar's Test P-Value : 0.2001848
#>
#> Sensitivity : 0.7414
#> Specificity : 0.7176
#> Pos Pred Value : 0.6418
#> Neg Pred Value : 0.8026
#> Prevalence : 0.4056
#> Detection Rate : 0.3007
#> Detection Prevalence : 0.4685
#> Balanced Accuracy : 0.7295
#>
#> 'Positive' Class : 1
#>
# predict class in train data
pred_dtree_train_new <- predict(object = model_dtree_new,
newdata = titanic_train_down,
type = "response")
# model evaluation with confusion matrix
confusionMatrix(data = pred_dtree_train_new, titanic_train_down$Survived, positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 200 46
#> 1 30 184
#>
#> Accuracy : 0.8348
#> 95% CI : (0.7976, 0.8676)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.6696
#>
#> Mcnemar's Test P-Value : 0.08532
#>
#> Sensitivity : 0.8000
#> Specificity : 0.8696
#> Pos Pred Value : 0.8598
#> Neg Pred Value : 0.8130
#> Prevalence : 0.5000
#> Detection Rate : 0.4000
#> Detection Prevalence : 0.4652
#> Balanced Accuracy : 0.8348
#>
#> 'Positive' Class : 1
#>
It seems that the model is still overfitting: the gap between training and test precision remains large. Thus, let’s try to improve the results with an Ensemble Method, the third approach, after we finish evaluating the original (untuned) Decision Tree model below.
💡 Interpretation:
For the original Decision Tree model, the accuracy is 74.13% and the precision is 68.42%. These values are lower than the previous model’s, which means this model does not perform as well as the Naive Bayes model.
# ROC
# predict test data in probability type
pred_dtree_prob <- predict(model_dtree, newdata = titanic_test, type = "prob")
# create prediction object
roc_dtree_pred <- prediction(predictions = pred_dtree_prob[,1],
labels = titanic_test$Survived=="1")
# ROC curve
plot(performance(prediction.obj = roc_dtree_pred,
measure = "tpr",
x.measure = "fpr"))
# AUC
dtree_auc <- performance(prediction.obj = roc_dtree_pred,
measure = "auc")
auc_dtree <- dtree_auc@y.values
auc_dtree
#> [[1]]
#> [1] 0.1692698
💡 Interpretation:
As in the Naive Bayes evaluation, the prediction object was built from the negative-class probability column, so the reported AUC is inverted; with respect to the positive class it is 1 - 0.169 ≈ 0.83, indicating that the Decision Tree also ranks the positive and negative classes reasonably well.
Random Forest is an Ensemble Method that makes predictions by building many decision trees. Among the trees’ individual predictions, it uses majority voting to decide the final result, as illustrated in the sketch below.
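To illustrate the idea, here is a tiny sketch with made-up votes (not taken from our model) showing how majority voting resolves each prediction:
# toy illustration of majority voting: hypothetical votes from 5 trees for 3 passengers
votes <- matrix(c("0", "1", "1", "1", "0",
"0", "0", "1", "0", "0",
"1", "1", "1", "0", "1"),
nrow = 3, byrow = TRUE)
# the final prediction for each passenger is the most frequent vote across the trees
apply(votes, 1, function(v) names(which.max(table(v))))
#> expected: "1" "0" "1"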
# create model
set.seed(417)
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_rf <- train(form = Survived~.,
data = titanic_train_down,
method = "rf",
trControl = control)
# save the model
saveRDS(model_rf, "titanic_forest.RDS")
# read the model
model_rf <- readRDS("titanic_forest.RDS")
# view the model
model_rf$finalModel
#>
#> Call:
#> randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)), trainControl = ..1)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 2
#>
#> OOB estimate of error rate: 20.87%
#> Confusion matrix:
#> 0 1 class.error
#> 0 194 36 0.1565217
#> 1 60 170 0.2608696
💡 Interpretation:
Although we previously did cross-validation by splitting the data and evaluating on the test set, Random Forest also sets aside out-of-bag (OOB) data for each tree, which acts as built-in test data. The model uses these observations to estimate its error, called the out-of-bag error.
In this Random Forest model, the out-of-bag error is 20.87%, which corresponds to an accuracy of 79.13%.
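As a quick sanity check, both numbers can be recomputed from the OOB confusion matrix printed above:
# OOB error = misclassified OOB observations / all observations (counts from the matrix above)
oob_error <- (36 + 60) / (194 + 36 + 60 + 170)
oob_error       # ~0.2087, i.e. 20.87%
1 - oob_error   # ~0.7913, i.e. 79.13% accuracy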
We can see how the OOB error for each class evolves with the number of trees by using plot.
plot(model_rf$finalModel)
legend("topright", colnames(model_rf$finalModel$err.rate),
col=1:6,cex=0.8,fill=1:6)
Based on the visualization above, the error rate is already low with fewer than 100 trees, but the model keeps adding trees to push the OOB error down a little further.
We can also inspect the importance of each variable that was used in
our random forest using varImp().
plot(varImp(model_rf))
From the plot above, being male (Sexmale) is the most important variable influencing the model.
Let’s test our model with test data.
# model evaluation
pred_rf <- predict(model_rf, newdata = titanic_test)
rf_conf <- confusionMatrix(pred_rf, titanic_test$Survived, positive = "1")
rf_conf
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 65 15
#> 1 20 43
#>
#> Accuracy : 0.7552
#> 95% CI : (0.6764, 0.8232)
#> No Information Rate : 0.5944
#> P-Value [Acc > NIR] : 0.0000397
#>
#> Kappa : 0.4992
#>
#> Mcnemar's Test P-Value : 0.499
#>
#> Sensitivity : 0.7414
#> Specificity : 0.7647
#> Pos Pred Value : 0.6825
#> Neg Pred Value : 0.8125
#> Prevalence : 0.4056
#> Detection Rate : 0.3007
#> Detection Prevalence : 0.4406
#> Balanced Accuracy : 0.7530
#>
#> 'Positive' Class : 1
#>
💡 Interpretation:
Based on the confusion matrix above, the accuracy of the model is 75.52% and the precision is 68.25%. These values are similar to the Decision Tree model’s, but lower than the first model’s (i.e. the Naive Bayes model).
# ROC
# predict test data in probability type
pred_rf_prob <- predict(model_rf, newdata = titanic_test, type = "prob")
# create prediction object
roc_rf_pred <- prediction(predictions = pred_rf_prob[,1],
labels = titanic_test$Survived=="1")
# ROC curve
plot(performance(prediction.obj = roc_rf_pred,
measure = "tpr",
x.measure = "fpr"))
# AUC
rf_auc <- performance(prediction.obj = roc_rf_pred,
measure = "auc")
auc_rf <- rf_auc@y.values
auc_rf
#> [[1]]
#> [1] 0.1593306
💡 Interpretation:
Once again the AUC was computed from the negative-class probability column, so the reported value is inverted; with respect to the positive class it is 1 - 0.159 ≈ 0.84, so the Random Forest also ranks the classes well.
eval_naive <- data_frame(Accuracy = naive_conf$overall[1],
Recall = naive_conf$byClass[1],
Specificity = naive_conf$byClass[2],
Precision = naive_conf$byClass[3])
eval_naive$AUC <- auc_naive[[1]]
eval_naive
eval_dtree <- data_frame(Accuracy = dtree_conf$overall[1],
Recall = dtree_conf$byClass[1],
Specificity = dtree_conf$byClass[2],
Precision = dtree_conf$byClass[3])
eval_dtree$AUC <- auc_dtree[[1]]
eval_dtree
eval_rf <- data_frame(Accuracy = rf_conf$overall[1],
Recall = rf_conf$byClass[1],
Specificity = rf_conf$byClass[2],
Precision = rf_conf$byClass[3])
eval_rf$AUC <- auc_rf[[1]]
eval_rf
df <- as.data.frame(rbind(eval_naive, eval_dtree, eval_rf))
rownames(df) <- c("Naive Bayes","Decision Tree","Random Forest")
df
Based on the metrics table above, the predictive model built with the Naive Bayes algorithm gives the best result. It has the highest accuracy (77.62%) as well as the highest precision (78.26%). The AUC column still shows the inverted, negative-class values discussed earlier; in positive-class terms all three models sit in the 0.82–0.84 range, so the AUC does not change the ranking much. Therefore, the best model to predict whether or not the Titanic’s passengers survived the sinking of the ship is the Naive Bayes model.