The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
In this report, we are going to build a model to predict what sorts of people were more likely to survive.
library(tidyverse) # for data wrangling
library(yardstick) # to calculate a cross-tabulation of observed and predicted classes
library(ggplot2) # to visualize data
library(plotly) # to visualize data
library(caret) # to pre-process data and train models
library(rsample) # to split the data into train and test sets
library(e1071) # to build the Naive Bayes model
library(ROCR) # to build the ROC graphic
library(tibble) # for tibble data frames
The Titanic passenger data were obtained from kaggle.com and contain information such as each passenger's name, age, gender, and socio-economic class.
Data description:
- survival -> Survival (0 = No, 1 = Yes)
- pclass -> Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- sex -> Sex / gender
- Age -> Age in years
- sibsp -> # of siblings / spouses aboard the Titanic
- parch -> # of parents / children aboard the Titanic
- ticket -> Ticket number
- fare -> Passenger fare
- cabin -> Cabin number
- embarked -> Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
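The raw data is not actually loaded in any of the chunks shown here, so below is a minimal sketch of how it could be read in, assuming the Kaggle training file was saved as train.csv (the file name is an assumption); strings are read as factors so that the summary below reports level counts.
# read the downloaded Kaggle training set (file name assumed to be train.csv);
# stringsAsFactors = TRUE so character columns summarise as factor counts
titanic <- read.csv("train.csv", stringsAsFactors = TRUE)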
At this phase, we first have to check whether the variables are of the appropriate class.
summary(titanic)
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
Variables to be deleted and the reasons:
- PassengerId -> only an index of the rows (row identifier)
- Name, Ticket, Cabin -> each has more than 100 unique values, so they won't provide any useful information for the model
Variables to be transformed or to have their class changed:
- Survived, Pclass -> change into factor class
- Age -> transform into 3 classes (below 20, 20-40, above 40)
- Fare -> transform into 2 classes. Most of the fares are below 50 (see the histogram sketch after this list), so let's make 2 classes (<= 50 and > 50)
- SibSp, Parch -> transform into 2 classes. As we can see from the summary table, most of the observations have a value of '0' or '1' (the 3rd quartile is still either '0' or '1')
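The claim about Fare can be checked visually with a quick histogram, a sketch using the ggplot2 package loaded above (the bin width is an arbitrary choice):
# quick look at the Fare distribution to motivate the <= 50 / > 50 split
ggplot(titanic, aes(x = Fare)) +
  geom_histogram(binwidth = 10) +
  labs(title = "Distribution of Fare", x = "Fare", y = "Count")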
# delete variables & change class of variables
titanic_final <- titanic %>%
  select(-PassengerId, -Name, -Ticket, -Cabin) %>%
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass))
titanic_final
# inspecting missing value
colSums(is.na(titanic_final))
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 177 0 0 0 0
There are 177 missing values in the Age variable. Since Age should be one of the predictors, we need to remove the observations with missing Age values.
# delete missing values
titanic_final <- titanic_final %>%
  drop_na()
# bin Age into 3 classes (below 20, 21-40, above 40)
titanic_final$Age.Range <- ifelse(titanic_final$Age <= 20, "Below 20",
                           ifelse(titanic_final$Age > 20 & titanic_final$Age <= 40, "21-40",
                                  "above 40"))
# flag whether the passenger had any siblings/spouses or parents/children aboard
titanic_final$SibSp.Exp <- ifelse(titanic_final$SibSp > 0, "Yes", "No")
titanic_final$Parch.Exp <- ifelse(titanic_final$Parch > 0, "Yes", "No")
# bin Fare into 2 classes
titanic_final$Fare.Exp <- ifelse(titanic_final$Fare <= 50, "<= 50", ">50")
# drop the original numeric columns (Age, SibSp, Parch, Fare)
titanic_final <- titanic_final[, -c(4:7)]
titanic_final
Checking the proportion of the target variable (Survived)
prop.table(table(titanic_final$Survived))
##
## 0 1
## 0.5938375 0.4061625
The proportion of the Survived classes is quite balanced (Yes = 0.41, No = 0.59). We can use this data to predict both classes.
# 80/20 train/test split, stratified by the target variable
set.seed(123)
split <- initial_split(titanic_final, prop = 0.8, strata = "Survived")
train <- training(split)
test <- testing(split)
Checking the proportion of the target variable in the train data.
prop.table(table(train$Survived))
##
## 0 1
## 0.5933682 0.4066318
Since the proportion of the target variable is quite balanced, we can continue to the next phase.
There are certain characteristics of Naive Bayes that should be considered: it assumes the predictors are independent of one another, it works best with categorical predictors, and any predictor level that never occurs together with a class gets a conditional probability of zero, which zeroes out the whole posterior for that class. Based on these characteristics, our Titanic data is quite suitable for a Naive Bayes model since most of the predictors have been transformed into categorical classes. To ensure a non-zero probability for every combination, we apply a Laplace estimator.
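As a small illustration (not part of the final model), the effect of the Laplace estimator can be seen by comparing the conditional probability tables with and without it; which tables actually contain zeros depends on the data:
# illustration only: compare conditional probability tables with and without
# the Laplace estimator (laplace = 1 adds one pseudo-count to every cell,
# so no conditional probability can be exactly zero)
nb_no_laplace <- naiveBayes(Survived ~ ., data = train, laplace = 0)
nb_laplace <- naiveBayes(Survived ~ ., data = train, laplace = 1)
nb_no_laplace$tables$Embarked # may contain exact zeros for rare combinations
nb_laplace$tables$Embarked # every cell is shifted away from zero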
# model building
naive <- naiveBayes(Survived~., train, laplace =1)
# model fitting
naive_pred <- predict(naive,test, type="class") # for the class prediction
naive_prob <- predict(naive,test, type="raw") # for the probability
# result
naive_table <- select(test, Survived) %>%
bind_cols(Survived_pred = naive_pred) %>%
bind_cols(Survived_eprob = round(naive_prob[,1],4)) %>%
bind_cols(Survived_pprob = round(naive_prob[,2],4))
# performance evaluation - confusion matrix
naive_table %>%
conf_mat(Survived, Survived_pred) %>%
autoplot(type ="heatmap")
naive_perf <- confusionMatrix(data = naive_table$Survived_pred, reference = naive_table$Survived, positive = "1")
data.frame(
"Accuracy" = naive_perf$overall[1],
"Recall" = naive_perf$byClass[1],
"Specificity" = naive_perf$byClass[2],
"Precision" = naive_perf$byClass[3]
)
naive_ROC <- prediction(naive_prob[,2], labels = test$Survived )
# ROC curve
plot(performance(naive_ROC, "tpr", "fpr"),
main="ROC")
abline(a=0, b=1)
# AUC
auc_ROCR_n <- performance(naive_ROC, measure = "auc")
auc_ROCR_n <- auc_ROCR_n@y.values[[1]]
auc_ROCR_n
## [1] 0.8435673
Metrics to be evaluated from the model (a small worked example follows this list):
- Accuracy: the ability to correctly predict both classes from the total observation.
- Precision: the ability to correctly predict the positive class from the total predicted-positive class (false positive is low).
- Recall: the ability to correctly predict the positive class from the total actual-positive class (false negative is low).
- Specificity: the ability to correctly predict the negative class from the total actual-negative class.
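To make the definitions concrete, here is a small worked example with hypothetical confusion-matrix counts (the numbers are made up purely for illustration):
# hypothetical counts, with "1" as the positive class
tp <- 60; fn <- 15 # actual 1: predicted 1 / predicted 0
fp <- 10; tn <- 58 # actual 0: predicted 1 / predicted 0
accuracy <- (tp + tn) / (tp + fp + tn + fn)
recall <- tp / (tp + fn) # also called sensitivity
specificity <- tn / (tn + fp)
precision <- tp / (tp + fp) # positive predictive value
c(accuracy = accuracy, recall = recall, specificity = specificity, precision = precision)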
In addition, the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are used to describe how well our model can distinguish the target classes. ROC is a probability curve and AUC represents the degree of separability. The AUC ranges between 0.5 and 1; the higher the AUC, the better the model distinguishes the target classes.
For this case, we want to predict / classify what sorts of people were more likely to survive, so both categories of the target variable are equally important. On one hand, the model's accuracy should be high; on the other hand, we still need to pay attention to recall and specificity, which should not be too low. Several things we can do include finding the recall-specificity break-even point (the cutoff where recall and specificity are equal) or finding a point where both scores are not too low (>= 70%) while accuracy stays high.
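Before using the confusion-matrix plot below, such a break-even cutoff can also be located directly from the ROCR prediction object we already built (a sketch for illustration; the report itself uses the cmplot helper in the next step):
# sketch: locate the recall-specificity break-even point from naive_ROC
sens <- performance(naive_ROC, "sens") # sensitivity (recall) per cutoff
spec <- performance(naive_ROC, "spec") # specificity per cutoff
cutoffs <- sens@x.values[[1]]
gap <- abs(sens@y.values[[1]] - spec@y.values[[1]])
cutoffs[which.min(gap)] # cutoff where recall and specificity are closest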
Let's demonstrate some model tuning practice by changing the classification threshold.
library(cmplot) # provides the confmat_plot() helper used below
confmat_plot(prob = naive_table$Survived_pprob, ref = test$Survived, postarget = "1", negtarget = "0")
From the plot above, we know that the recall-specificity break-even point is at 0.36, but we need to set our threshold to 0.401 to get all metrics above 70%. Let's try this threshold on our data.
# tuning final model
naive_table <- naive_table %>%
mutate(tuning_pred = as.factor(ifelse(Survived_pprob>= 0.401,"1","0")))
# metrics result
final_n <- confusionMatrix(data = naive_table$tuning_pred, reference = naive_table$Survived, positive = "1")
df_final_n <- data.frame(
"Accuracy" = final_n$overall[1],
"Recall" = final_n$byClass[1],
"Specificity" = final_n$byClass[2],
"Precision" = final_n$byClass[3]
) %>%
cbind(AUC = auc_ROCR_n)
df_final_n
The decision tree is one of the tree-based models, and its major benefit is interpretability. A decision tree is an algorithm that produces a set of rules, visualized in a diagram that resembles a tree.
There are certain characteristics of the decision tree model:
- performs well on both numerical and categorical variables.
- all predictors are assumed to interact.
- quite robust to the problem of multicollinearity. A decision tree will choose the variable with the highest information gain in one split, whereas a method such as logistic regression would have used both.
- robust and insensitive to outliers. Splitting happens at the condition that maximizes homogeneity within the resulting groups, so outliers have little influence on the splitting process.
Let’s start to build our model to understand what it looks like.
library(rpart) # to build the decision tree model
library(rattle) # for fancyRpartPlot()
library(rpart.plot) # to visualize the tree and print its rules
# model building
dtree <- rpart(formula = Survived ~ ., data = train, method = "class")
fancyRpartPlot(dtree, sub = NULL)
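Since interpretability is the main selling point of a decision tree, the same splits can also be printed as plain-text rules, a short sketch using rpart.plot::rpart.rules on the dtree object above:
# print the fitted splits as plain-text rules; cover = TRUE adds the share
# of training observations that each rule applies to
rpart.rules(dtree, cover = TRUE)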
# model fitting
dtree_pred <- predict(dtree, test, type = "class")
dtree_prob <- predict(dtree, test, type = "prob")
# result
dtree_table <- select(test, Survived) %>%
bind_cols(Survived_pred = dtree_pred) %>%
bind_cols(Survived_eprob = round(dtree_prob[,1],4)) %>%
bind_cols(Survived_pprob = round(dtree_prob[,2],4))
# performance evaluation - confusion matrix
dtree_table %>%
conf_mat(Survived, Survived_pred) %>%
autoplot(type = "heatmap")
dt_perf <- confusionMatrix(data = dtree_table$Survived_pred, reference = dtree_table$Survived, positive = "1")
data.frame(
"Accuracy" = dt_perf$overall[1],
"Recall" = dt_perf$byClass[1],
"Specificity" = dt_perf$byClass[2],
"Precision" = dt_perf$byClass[3]
)
# ROC
dtree_ROC <- prediction(dtree_prob[,2], labels = test$Survived )
# ROC curve
plot(performance(dtree_ROC, "tpr","fpr"),
main ="ROC")
abline(a=0, b =1)
# AUC
auc_ROCR_d <- performance(dtree_ROC, measure = "auc")
auc_ROCR_d <- auc_ROCR_d@y.values[[1]]
auc_ROCR_d
## [1] 0.869152
Using the default parameters we already obtain a reasonably sized tree, so there is no need to perform pruning; a sketch of how pruning could be done is included below for reference before we tune the classification threshold.
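This is a sketch only, based on the complexity-parameter table of the dtree model above; it is not applied to the results in this report:
# for reference only: how the tree could be pruned if it were overgrown --
# pick the complexity parameter (cp) with the lowest cross-validated error
printcp(dtree)
best_cp <- dtree$cptable[which.min(dtree$cptable[, "xerror"]), "CP"]
dtree_pruned <- prune(dtree, cp = best_cp)
fancyRpartPlot(dtree_pruned, sub = NULL)
Now let's tune the model by changing the classification threshold.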
confmat_plot(prob = dtree_table$Survived_pprob, ref = dtree_table$Survived, postarget = "1", negtarget = "0")
# tuning final model of decision tree
dtree_table <- dtree_table %>%
mutate(tuning_pred=as.factor(ifelse(Survived_pprob>= 0.369,"1","0")))
# metrics result
final_d <- confusionMatrix(data = dtree_table$tuning_pred, reference = dtree_table$Survived, positive = "1")
df_final_d <- data.frame(
"Accuracy" = final_d$overall[1],
"Recall" = final_d$byClass[1],
"Specificity" = final_d$byClass[2],
"Precision" = final_d$byClass[3]
) %>%
cbind(AUC = auc_ROCR_d)
df_final_d
The wisdom of the crowd: a random forest aggregates many decision trees, so let's build a Random Forest model.
#model building
set.seed(123)
ctrl <- trainControl(method = "repeatedcv", number = 4, repeats = 4) # k-fold validation
forest <- train(Survived~., data = train, method="rf", trControl =ctrl)
forest
## Random Forest
##
## 573 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold, repeated 4 times)
## Summary of sample sizes: 430, 430, 430, 429, 430, 430, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8066907 0.5807211
## 6 0.7979737 0.5699642
## 11 0.7944954 0.5631310
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
From the summary, we can see that the optimal number of variables considered for splitting at each tree node is 2. We can inspect the importance of each variable using varImp().
varImp(forest)
## rf variable importance
##
## Overall
## Sexmale 100.000
## Pclass3 29.517
## Fare.Exp>50 16.750
## Age.RangeBelow 20 7.115
## Pclass2 6.312
## Parch.ExpYes 6.257
## EmbarkedC 5.256
## SibSp.ExpYes 4.133
## EmbarkedS 3.941
## Age.Rangeabove 40 2.849
## EmbarkedQ 0.000
In random forest, we don't strictly have to split our dataset into train and test sets, because the out-of-bag (OOB) estimate in random forest modeling acts as a reliable estimate of the accuracy on unseen examples. The OOB estimate we obtained was generated from our train dataset. See the summary below.
plot(forest$finalModel)
legend("topright", colnames(forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)
forest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 20.24%
## Confusion matrix:
## 0 1 class.error
## 0 314 26 0.07647059
## 1 90 143 0.38626609
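As a quick sanity check before touching the test set, the OOB error rate above can be converted into an implied accuracy (a one-line sketch on the forest$finalModel object):
# implied OOB accuracy from the final forest's OOB error rate
1 - forest$finalModel$err.rate[forest$finalModel$ntree, "OOB"]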
Let's fit the model to our test data.
# model fitting
forest_pred <- predict(forest, test, type = "raw")
forest_prob <- predict(forest, test, type = "prob")
# result
forest_table <- select(test, Survived) %>%
bind_cols(Survived_pred = forest_pred) %>%
bind_cols(Survived_eprob = round(forest_prob[,1],4)) %>%
bind_cols(Survived_pprob = round(forest_prob[,2],4))
# performance evaluation - confusion matrix
forest_table %>%
conf_mat(Survived, Survived_pred) %>%
autoplot(type ="heatmap")
perf <- confusionMatrix(data = forest_table$Survived_pred, reference = forest_table$Survived, positive = "1")
data.frame(
"Accuracy" = perf$overall[1],
"Recall" = perf$byClass[1],
"Specificity" = perf$byClass[2],
"Precision" = perf$byClass[3]
)
# ROC
forest_roc <- data.frame(prediction = round(forest_prob[,2],4),
trueclass = as.numeric(forest_table$Survived==1))
head(forest_roc)
forest_roc <- ROCR::prediction(forest_roc$prediction, forest_roc$trueclass)
# ROC curve
plot(performance(forest_roc, "tpr","fpr"),
main = "ROC")
abline(a=0, b=1)
# AUC
auc_ROCR_f <- performance(forest_roc, measure = "auc")
auc_ROCR_f <- auc_ROCR_f@y.values[[1]]
auc_ROCR_f
## [1] 0.8659148
As we can see from the results, this random forest model gives good accuracy, recall, and precision (> 70%), and the AUC is high as well. Let's tune the threshold to try to reach a higher score on each metric.
confmat_plot(prob = forest_table$Survived_pprob, ref = test$Survived, postarget = "1", negtarget = "0")
# tuning final model
forest_table <- forest_table %>%
mutate(tuning_pred = as.factor(ifelse(Survived_pprob >= 0.345 , "1","0")))
# metrics result
final_f <- confusionMatrix(data = forest_table$tuning_pred, reference = forest_table$Survived, positive = "1")
df_final_f <- data.frame(
"Accuracy" = final_f$overall[1],
"Recall" = final_f$byClass[1],
"Specificity" = final_f$byClass[2],
"Precision" = final_f$byClass[3]
) %>%
cbind(AUC = auc_ROCR_f)
df_final_f
rbind("Naive Bayes" = df_final_n, "Decision Tree" = df_final_d, "Random Forest" = df_final_f)
Based on the metrics table above, the predictive model built with the Random Forest algorithm performed slightly better than the others. It achieved the highest accuracy at 81% while maintaining sensitivity, specificity, and precision above 70%, and it gave a high AUC of 86.6%. Therefore, the best model for classifying the survival of Titanic passengers is the Random Forest model.