The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered "unsinkable," sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone on board, resulting in the deaths of 1502 of the 2224 passengers and crew. While there was some element of luck involved in surviving, some groups of people seem to have been more likely to survive than others. In this analysis, we will build a predictive model that answers the question "What sorts of people were more likely to survive?" using passenger data (i.e., name, age, gender, socio-economic class, etc.).
library(dplyr) # for data wrangling
library(ggplot2) # to visualize data
library(gridExtra) # to display multiple graphs
library(inspectdf) # for EDA
library(caret) # to pre-process data and evaluate models
library(tidyverse) # data wrangling
library(e1071) # Naive Bayes model
library(partykit) # Decision Tree
library(rsample) # cross-validation
library(ROCR) # ROC and AUC
The analysis uses a dataset from Kaggle, which is split into two files:
train.csv will be used to build the machine learning models
test.csv will be used to see how well the models perform on unseen data
However, test.csv contains only the predictor columns; the target column is stored separately in gender_submission.csv, so we need to attach it to the testing data.
train <- read.csv("train.csv")
test <- read.csv("test.csv")
survive <- read.csv("gender_submission.csv")
# combine the target column into the test data
test <- cbind(test, Survived = survive$Survived)
rmarkdown::paged_table(train)
rmarkdown::paged_table(test)
Here is some information about the data:
Pclass: a proxy for socio-economic status (SES)
Name: name of the passenger
Sex: gender of the passenger
Age: age in years
SibSp: number of siblings/spouses aboard the Titanic
Parch: number of parents/children aboard the Titanic
Ticket: ticket number
Fare: passenger fare
Cabin: cabin number
Embarked: port of embarkation
In this step, we will explore the variables to understand the condition of the data before modeling.
First, we will check all the columns in the data frame.
glimpse(train)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal~
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
Let's check whether there are any missing values in the train data.
colSums(is.na(train))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
There are 177 missing values in the Age column, around 19% of our data. Instead of removing those rows, let's replace the missing ages with the mean age.
train_clean <- train %>%
mutate(Age = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age))
colSums(is.na(train_clean))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
We also need to check for missing values in the test data.
colSums(is.na(test))
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 86 0
## Parch Ticket Fare Cabin Embarked Survived
## 0 0 1 0 0 0
There are some missing values in the test data. To prevent NAs from affecting the prediction results, we will remove the rows that contain them.
test_clean <- test %>%
na.omit()
In this analysis, we will use the Survived variable as the target. We also need to convert some predictor variables to more appropriate data types and remove variables that do not contain much information, such as PassengerId, Name, Ticket, and Cabin.
data_train <- train_clean %>%
select(-c(PassengerId, Name, Ticket, Cabin)) %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
SibSp = as.factor(SibSp),
Parch = as.factor(Parch),
Embarked = as.factor(Embarked))
data_test <- test_clean %>%
select(-c(PassengerId, Name, Ticket, Cabin)) %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
SibSp = as.factor(SibSp),
Parch = as.factor(Parch),
Embarked = as.factor(Embarked))
Re-checking the data_train structure:
str(data_train)
## 'data.frame': 891 obs. of 8 variables:
## $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
## $ Parch : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
These are the class proportions of the target variable in the training data:
prop.table(table(data_train$Survived))
##
## 0 1
## 0.6161616 0.3838384
As we can see, around 61.61% of the passengers did not survive and around 38.38% survived, which indicates imbalanced data. We can balance the proportions of the target variable with down-sampling.
set.seed(267)
data_traind <- downSample(x = data_train[, -1], y = as.factor(data_train[, 1]), yname = "Survived")
rmarkdown::paged_table(data_traind)
prop.table(table(data_traind$Survived))
##
## 0 1
## 0.5 0.5
Naive Bayes is a supervised learning classification algorithm based on Bayes' Theorem. It is an extension of the Exact Bayes classifier and assumes all predictors are independent of each other. This method is suitable when the dimensionality of the inputs is high. Because it uses simple conditional probabilities, it is computationally inexpensive, yet it can outperform more sophisticated models.
Here is the formula from Bayes Theorem:
\[ P(Y|x) = \frac{P(x|Y)\ P(Y)}{P(x|Y)\ P(Y) + P(x|\neg Y)\ P(\neg Y)} \]
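To make the formula concrete, here is a small worked example in R; the probabilities below are illustrative numbers only, not estimates from the Titanic data:
# worked example of Bayes' theorem (all numbers are hypothetical)
# Y = the passenger survived, x = the passenger is female
p_y      <- 0.38 # P(Y): prior probability of survival
p_x_y    <- 0.68 # P(x|Y): probability of being female given survival
p_x_noty <- 0.15 # P(x|~Y): probability of being female given non-survival
(p_x_y * p_y) / (p_x_y * p_y + p_x_noty * (1 - p_y)) # P(Y|x) ~ 0.74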
Training the Naive Bayes model could not be simpler: just use naiveBayes() from the e1071 package. Once the model is trained, we use it to predict the outcome on the test dataset.
model_naive <- naiveBayes(x = data_traind %>% select(-Survived), # predictors
                          y = data_traind$Survived, # target
                          laplace = 1)
pred_label_naive <- predict(model_naive, data_test, type = "class")
head(data.frame(actual = data_test$Survived, prediction = pred_label_naive))
## actual prediction
## 1 0 0
## 2 1 1
## 3 0 0
## 4 0 0
## 5 1 1
## 6 0 0
A Decision Tree is a fairly simple model and very easy to interpret, even easier than a linear regression model. The result is visualized as a flow chart, making it easy to understand how the model makes its predictions.
model_dt <- ctree(Survived ~ ., data_traind)
plot(model_dt, type = "simple")
pred_test_dt <- predict(model_dt, newdata = data_test,
type = "response")
head(pred_test_dt)
## 1 2 3 4 5 6
## 0 1 0 0 1 0
## Levels: 0 1
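Besides the plot, the fitted tree can also be printed as text, which is handy when the flow chart gets crowded; partykit provides a print method for ctree objects:
# print the fitted tree as a text flow chart (one line per split and terminal node)
print(model_dt)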
Random Forest is a type of ensemble method: it consists of a set of decision trees, each with its own characteristics and unrelated to the others. The Random Forest predicts the target by voting; the class that receives the most votes across the trees becomes the final prediction, as the small sketch below illustrates.
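A minimal sketch of the voting step, using hypothetical per-tree predictions:
# majority voting over hypothetical predictions from five trees
tree_votes <- factor(c("1", "0", "1", "1", "0")) # hypothetical votes
table(tree_votes) # vote counts: class 0 gets 2 votes, class 1 gets 3
names(which.max(table(tree_votes))) # "1" wins the majority vote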
The disadvantage of Random Forest is its heavy computational cost. This can be reduced by keeping the number of predictors small. If there is a large number of columns, we can drop columns with near-zero (less informative) variance using nearZeroVar() from the caret package.
nearZeroVar(data_traind)
## integer(0)
set.seed(267)
ctrl <- trainControl(method="repeatedcv", number = 5, repeats = 3) # k-fold cross validation
survive_forest <- train(Survived ~ ., data = data_traind, method = "rf", trControl = ctrl)
survive_forest
## Random Forest
##
## 684 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 546, 547, 548, 548, 547, 548, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7738311 0.5476843
## 11 0.8001697 0.6003162
## 20 0.7831127 0.5662113
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.
varImp(survive_forest)
## rf variable importance
##
## Overall
## Fare 100.00000
## Sexmale 95.16120
## Age 78.25422
## Pclass3 25.78161
## SibSp1 7.60459
## EmbarkedC 6.03533
## EmbarkedS 5.92940
## Parch1 5.01794
## Parch2 4.58701
## Pclass2 4.46792
## EmbarkedQ 3.36343
## SibSp3 2.04308
## SibSp4 1.89459
## SibSp2 1.38080
## SibSp8 1.01522
## Parch4 0.57022
## Parch5 0.46405
## SibSp5 0.45973
## Parch3 0.04909
## Parch6 0.00000
plot(survive_forest$finalModel)
legend("topright", colnames(survive_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)survive_forest$finalModel##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 11
##
## OOB estimate of error rate: 19.74%
## Confusion matrix:
## 0 1 class.error
## 0 285 57 0.1666667
## 1 78 264 0.2280702
pred_test_rf <- predict(survive_forest, newdata = data_test, type = "raw")
Confusion Matrix
We will evaluate all the models with a confusion matrix. A confusion matrix is a table that shows four categories: True Positive, True Negative, False Positive, and False Negative.
Performance will be reported as Accuracy, Sensitivity/Recall, Specificity, and Precision. Accuracy measures how much of the data is predicted correctly. Sensitivity measures, out of all actual positive outcomes, how many are predicted correctly. Specificity measures how many negative outcomes are predicted correctly. Precision measures how many of our positive predictions are correct.
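As a sanity check, all four metrics can be computed by hand from the confusion matrix cells; the sketch below uses the Naive Bayes counts reported in the next section (TP = 99, TN = 171, FP = 33, FN = 28):
# computing the four metrics from confusion matrix cells
TP <- 99; TN <- 171; FP <- 33; FN <- 28 # counts from the Naive Bayes matrix below
(TP + TN) / (TP + TN + FP + FN) # accuracy: 0.8157
TP / (TP + FN) # sensitivity / recall: 0.7795
TN / (TN + FP) # specificity: 0.8382
TP / (TP + FP) # precision: 0.7500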
ROC (Receiver Operating Characteristic Curve)
Apart from the confusion matrix, we can use other metrics to compare the models. One option is the ROC curve. The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (Sensitivity or Recall) against the False Positive Rate (1 - Specificity). A good model should ideally have a high True Positive Rate and a low False Positive Rate.
AUC (Area Under ROC Curve)
The AUC is the area under the ROC curve and is commonly used to compare performance between models. The closer it is to 1, the better the model performs.
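Equivalently, the AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal sketch without ROCR (auc_manual is our own helper, not a library function):
# AUC as the fraction of positive/negative pairs ranked correctly (ties count half)
auc_manual <- function(scores, labels) {
  pos <- scores[labels == 1] # scores of the positive cases
  neg <- scores[labels == 0] # scores of the negative cases
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}
auc_manual(c(0.9, 0.7, 0.3), c(1, 1, 0)) # 1: both positives outrank the negative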
Confusion Matrix
mat1 <- confusionMatrix(data = pred_label_naive, reference = data_test$Survived, positive = "1")
mat1
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 171 28
## 1 33 99
##
## Accuracy : 0.8157
## 95% CI : (0.7697, 0.856)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : 3.183e-15
##
## Kappa : 0.6132
##
## Mcnemar's Test P-Value : 0.6085
##
## Sensitivity : 0.7795
## Specificity : 0.8382
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.8593
## Prevalence : 0.3837
## Detection Rate : 0.2991
## Detection Prevalence : 0.3988
## Balanced Accuracy : 0.8089
##
## 'Positive' Class : 1
##
ROC (Receiver Operating Characteristic Curve)
# get the probability prediction
prob_survive <- predict(model_naive, data_test, type = "raw")
# prepare dataframe for ROC
data_roc <- data.frame(prob = prob_survive[,2], # probability of the positive class (survived)
                       labels = as.numeric(data_test$Survived == "1")) # actual label: 1 if the passenger survived
head(data_roc)
## prob labels
## 1 0.04078981 0
## 2 0.53590515 1
## 3 0.18470196 0
## 4 0.03860111 0
## 5 0.75699552 1
## 6 0.05263042 0
naive_roc <- ROCR::prediction(data_roc$prob, data_roc$labels)
# ROC curve
plot(performance(naive_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
main = "ROC")
abline(a = 0, b = 1)
AUC (Area Under ROC Curve)
auc_n <- performance(naive_roc, measure = "auc")
auc_n@y.values
## [[1]]
## [1] 0.8857496
Confusion Matrix
mat2 <- confusionMatrix(data = pred_test_dt, reference = data_test$Survived, positive = "1")
mat2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 195 2
## 1 9 125
##
## Accuracy : 0.9668
## 95% CI : (0.9413, 0.9833)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9305
##
## Mcnemar's Test P-Value : 0.07044
##
## Sensitivity : 0.9843
## Specificity : 0.9559
## Pos Pred Value : 0.9328
## Neg Pred Value : 0.9898
## Prevalence : 0.3837
## Detection Rate : 0.3776
## Detection Prevalence : 0.4048
## Balanced Accuracy : 0.9701
##
## 'Positive' Class : 1
##
ROC (Receiver Operating Characteristic Curve)
prob_survive_dt <- predict(model_dt, data_test, type = "prob")
data_roc1 <- data.frame(prob = prob_survive_dt[,2], # probability of the positive class (survived)
                        labels = as.numeric(data_test$Survived == "1")) # actual label: 1 if the passenger survived
dt_roc <- ROCR::prediction(data_roc1$prob, data_roc1$labels)
# ROC curve
plot(performance(dt_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
main = "ROC")
abline(a = 0, b = 1)
AUC (Area Under ROC Curve)
dt_auc <- performance(dt_roc, measure = "auc")
dt_auc@y.values
## [[1]]
## [1] 0.9410607
Confusion Matrix
mat3 <- confusionMatrix(data = pred_test_rf, reference = data_test$Survived, positive = "1")
mat3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 168 20
## 1 36 107
##
## Accuracy : 0.8308
## 95% CI : (0.786, 0.8696)
## No Information Rate : 0.6163
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6506
##
## Mcnemar's Test P-Value : 0.04502
##
## Sensitivity : 0.8425
## Specificity : 0.8235
## Pos Pred Value : 0.7483
## Neg Pred Value : 0.8936
## Prevalence : 0.3837
## Detection Rate : 0.3233
## Detection Prevalence : 0.4320
## Balanced Accuracy : 0.8330
##
## 'Positive' Class : 1
##
ROC (Receiver Operating Characteristic Curve)
prob_survive_rf <- predict(survive_forest, data_test, type = "prob")
data_roc2 <- data.frame(prob = prob_survive_rf[,2], # probability of the positive class (survived)
                        labels = as.numeric(data_test$Survived == "1")) # actual label: 1 if the passenger survived
rf_roc <- ROCR::prediction(data_roc2$prob, data_roc2$labels)
# ROC curve
plot(performance(rf_roc, "tpr", "fpr"), #tpr = true positive rate, fpr = false positive rate
main = "ROC")
abline(a = 0, b = 1)
AUC (Area Under ROC Curve)
rf_auc <- performance(rf_roc, measure = "auc")
rf_auc@y.values
## [[1]]
## [1] 0.9202563
Now that we have built the models, made predictions, and evaluated them, let's compare Naive Bayes, Decision Tree, and Random Forest for this case.
#get the accuracy, sensitivity, specificity, precision and AUC of model naive bayes
m_naive <- data.frame(Model = "Naive Bayes",
Accuracy = round((mat1$table[4] + mat1$table[1]) / sum(mat1$table),4),
Sensitivity = round(mat1$table[4] / (mat1$table[4] + mat1$table[3]),4),
Specificity = round(mat1$table[1] / (mat1$table[1] + mat1$table[2]),4),
Precision = round(mat1$table[4] / (mat1$table[4] + mat1$table[2]),4),
AUC = round(as.numeric(auc_n@y.values),4))
#get the accuracy, sensitivity, specificity, precision and AUC of model decision tree
m_dt <- data.frame(Model = "Decision Tree",
Accuracy = round((mat2$table[4] + mat2$table[1]) / sum(mat2$table),4),
Sensitivity = round(mat2$table[4] / (mat2$table[4] + mat2$table[3]),4),
Specificity = round(mat2$table[1] / (mat2$table[1] + mat2$table[2]),4),
Precision = round(mat2$table[4] / (mat2$table[4] + mat2$table[2]),4),
AUC = round(as.numeric(dt_auc@y.values),4))
#get the accuracy, sensitivity, specificity, precision and AUC of model random forest
m_rf <- data.frame(Model = "Random Forest",
Accuracy = round((mat3$table[4] + mat3$table[1]) / sum(mat3$table),4),
Sensitivity = round(mat3$table[4] / (mat3$table[4] + mat3$table[3]),4),
Specificity = round(mat3$table[1] / (mat3$table[1] + mat3$table[2]),4),
Precision = round(mat3$table[4] / (mat3$table[4] + mat3$table[2]),4),
AUC = round(as.numeric(rf_auc@y.values),4))
rmarkdown::paged_table(rbind(m_naive, m_dt, m_rf))
From the comparison table above, we can see that the model built with the Decision Tree algorithm gave the best results. It achieved the highest accuracy (96%) and sensitivity (98%) while keeping specificity and precision above 90%, and it also produced the highest AUC (94%). So, the best model for predicting the survival of Titanic passengers is the Decision Tree.