Brief History of Titanic Disaster
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On the night of Sunday, April 14, 1912, during her maiden voyage, the RMS Titanic, widely considered unsinkable, struck an iceberg and sank in the early hours of April 15. The Titanic’s distress signals were heard by nearby ships. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the deaths of 1,502 of the 2,224 passengers and crew.
Federal law soon required all large ocean-going vessels to be equipped with wireless equipment for safety reasons. David Sarnoff noted that the Titanic disaster brought radio to the front.
Purpose of the project:
- Learn to use Naive Bayes, Decision Tree and Random Forest to predict Survived based on the data set.
Explanation of the “Titanic” data:
- PassengerId : Row ID in the data set
- Survived : If a passenger survived or not (0 = No, 1 = Yes)
- Pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name : Passenger’s name
- Sex : Passenger’s gender (male or female)
- Age : Passenger’s age (in years)
- SibSp : # of siblings / spouses aboard the Titanic
- Parch : # of parents / children aboard the Titanic
- Ticket : Ticket number
- Fare : Passenger fare
- Cabin : Cabin number
- Embarked : Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Data Preparation
Load library.
library(tidyverse)
library(GGally)
library(car)
library(caret)
library(class)
library(rmarkdown)
library(reshape)
library(lmtest)
library(dplyr)
library(rsample)
library(gtools)
library(e1071)
library(ROCR)
library(partykit)
library(randomForest)
Load dataset.
# Load data
titanic <- read.csv("dataInputs/train.csv")
# Show data as table
paged_table(titanic)
Check the structure of the data frame
# Check structure
titanic %>% glimpse()
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
# Ticket is stored as character, so it has no factor levels
levels(titanic$Ticket)
## NULL
💡 Insight :
PassengerId, Name, Ticket and Cabin aren’t usable variables for the prediction model, so they will be removed. Survived is the target of our prediction. Survived, Pclass, Sex, SibSp, Parch and Embarked should be converted to categorical (factor) type.
titanic <- titanic %>%
select(-c(PassengerId,
Name,
Ticket,
Cabin)) %>%
mutate_at(vars(Survived,
Pclass,
Sex,
SibSp,
Parch,
Embarked), as.factor)
NA values in our data frame
# Check proportion of missing data
table(is.na(titanic))
##
## FALSE TRUE
## 6951 177
titanic <- titanic %>% na.omit()
titanic %>% is.na() %>% colSums()
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 0 0 0 0 0
The proportion of missing values (NA) in the data is only about 2.5% of all cells (177 of 7,128). Therefore, the affected rows can be deleted.
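This proportion can also be computed directly; a minimal sketch (run on the data frame before na.omit() is applied):
# Proportion of NA cells across the whole data frame (before na.omit)
mean(is.na(titanic))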
Re-check missing values using the reshape library
# Check missing value using "reshape" library
missing_data <- melt(apply(titanic[, -2], 2, function(x) sum(is.na(x) | x=="")))
cbind(row.names(missing_data)[missing_data$value>0], missing_data[missing_data$value>0,])
## [,1] [,2]
## [1,] "Embarked" "2"
Update the missing embarkation port using the most common value (S)
titanic$Embarked[which(is.na(titanic$Embarked) | titanic$Embarked=="")] <- 'S'
Exploratory Data Analysis
Take a look at the data summary
titanic %>% summary()
## Survived Pclass Sex Age SibSp Parch Fare
## 0:424 1:186 female:261 Min. : 0.42 0:471 0:521 Min. : 0.00
## 1:290 2:173 male :453 1st Qu.:20.12 1:183 1:110 1st Qu.: 8.05
## 3:355 Median :28.00 2: 25 2: 68 Median : 15.74
## Mean :29.70 3: 12 3: 5 Mean : 34.69
## 3rd Qu.:38.00 4: 18 4: 4 3rd Qu.: 33.38
## Max. :80.00 5: 5 5: 5 Max. :512.33
## 8: 0 6: 1
## Embarked
## : 0
## C:130
## Q: 28
## S:556
##
##
##
💡 Insight :
- 424 passengers died during the tragedy and only 290 survived
- There are 453 male and 261 female passengers
- Age and Fare seem to have outliers
Check possibly outlier variables
# Check outlier on Fare predictor
boxplot(titanic$Fare)
# Check outlier on Age predictor
boxplot(titanic$Age)
💡 Insight :
- As per our assumption, Fare and Age have outliers.
- We need to pay more attention when using them as predictors.
Check class imbalance
# Check class imbalance
prop.table(table(titanic$Survived))
##
## 0 1
## 0.5938375 0.4061625
Based on the proportion above, the target variable (Survived) is fairly balanced, so additional pre-processing is not strictly necessary. However, in this case we will make the proportion 50:50 by down-sampling the training data.
Train Test Split
Before we build a model, we need to split the data into train and test datasets. This is a crucial step in the machine learning process, as it allows us to evaluate the performance of our models and make informed decisions about how to improve them. We will use 80% of the data for training and the rest for testing.
# Train test split
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- initial_split(data = titanic,
prop = 0.8,
strata = "Survived")
titanic_train <- training(index)
titanic_test <- testing(index)
# Down-sampling for titanic_train
titanic_train <- downSample(x = titanic_train %>% select(-Survived),
y = titanic_train$Survived,
yname = "Survived")
# Re-check class imbalance on train dataset
prop.table(table(titanic_train$Survived))
##
## 0 1
## 0.5 0.5
Now, the class proportion in the training data is perfectly balanced.
Create Model
We will create three types of models (Naive Bayes, Decision Tree and Random Forest) to predict whether a passenger survived or not. Each model will be developed in several steps:
- Create the model
- Predict on the test data
- Evaluate the model using a Confusion Matrix, ROC and AUC
- Tune the model (if necessary)
Naive Bayes
Naive Bayes is a simple and popular algorithm used for classification problems. It is based on Bayes’ theorem, which states that the probability of a hypothesis (such as a classification label) given some evidence (such as a set of features) is proportional to the probability of the evidence given the hypothesis multiplied by the prior probability of the hypothesis.
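For reference, Bayes’ theorem combined with the naive conditional-independence assumption over the predictors gives the rule the classifier applies (a standard formulation, not specific to this data set):

$$P(y \mid x_1, \ldots, x_p) \;\propto\; P(y)\,\prod_{j=1}^{p} P(x_j \mid y)$$

The predicted class is the value of y (here, Survived = 0 or 1) that maximizes this quantity.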
First, we will need to create a model using Naive Bayes method.
# Create Naive Bayes Model
model_nb <- naiveBayes(Survived~.,
data = titanic_train,
laplace = 0.1)
Second, we will create a prediction based on our previous model.
# Predict class of test data
pred_nb <- predict(model_nb,
newdata = titanic_test,
type = "class")
Next, we need to create a model evaluation using Confusion Matrix
# Model evaluation using confusion matrix
nb_evaluation <- confusionMatrix(data = pred_nb,
titanic_test$Survived,
positive = "1")
nb_evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 75 29
## 1 10 29
##
## Accuracy : 0.7273
## 95% CI : (0.6465, 0.7983)
## No Information Rate : 0.5944
## P-Value [Acc > NIR] : 0.0006414
##
## Kappa : 0.4033
##
## Mcnemar's Test P-Value : 0.0039478
##
## Sensitivity : 0.5000
## Specificity : 0.8824
## Pos Pred Value : 0.7436
## Neg Pred Value : 0.7212
## Prevalence : 0.4056
## Detection Rate : 0.2028
## Detection Prevalence : 0.2727
## Balanced Accuracy : 0.6912
##
## 'Positive' Class : 1
##
💡 Insight :
- The model accuracy is 72.73%
- We want to avoid “did not survive but predicted as survived” cases. Therefore, we will use Precision to limit False Positives. The Precision value is 74.36% (see the short snippet after this list)
- We need to check ROC and AUC as additional tools for model evaluation
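For reference, the precision (Pos Pred Value) and accuracy quoted above can be pulled straight from the caret confusionMatrix object; a minimal sketch:
# Extract precision (positive predictive value) and accuracy from the evaluation object
nb_evaluation$byClass["Pos Pred Value"]
nb_evaluation$overall["Accuracy"]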
# predict test data in probability type
pred_naive_prob <- predict(model_nb,
newdata = titanic_test,
type = "raw")
# create prediction object
# NOTE: column 1 of the raw output is P(Survived = 0), i.e. the negative class,
# so the ROC/AUC below are computed on inverted scores
roc_naive_pred <- prediction(predictions = pred_naive_prob[,1],
labels = titanic_test$Survived=="1")
# ROC curve
plot(performance(prediction.obj = roc_naive_pred,
measure = "tpr",
x.measure = "fpr"))
# AUC
naive_auc <- performance(prediction.obj = roc_naive_pred,
measure = "auc")
auc_naive <- naive_auc@y.values
auc_naive
## [[1]]
## [1] 0.2162272
💡 Insight :
- Based on the ROC visualisation, the curve lies below the diagonal and does not look good
- The AUC value (≈ 0.22) is far below 0.5. An AUC this low usually means the scores are inverted rather than that the model is truly worse than random: column 1 of the raw prediction is the probability of class “0”, so the ROC/AUC were computed on the negative-class probabilities
- Recomputed on the positive-class probabilities (column 2), the AUC would be 1 − 0.216 ≈ 0.78, so the Naive Bayes model separates the positive and negative classes reasonably well (see the corrected sketch below)
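A corrected version of the ROC/AUC computation, using the positive-class probability column, would look like this (a sketch; the numeric result is not re-run here):
# Build the prediction object from the positive-class ("1") probabilities
roc_naive_pred_pos <- prediction(predictions = pred_naive_prob[, "1"],
labels = titanic_test$Survived == "1")
# ROC curve on the corrected scores
plot(performance(prediction.obj = roc_naive_pred_pos,
measure = "tpr",
x.measure = "fpr"))
# AUC (mathematically equal to 1 minus the value reported above)
performance(prediction.obj = roc_naive_pred_pos, measure = "auc")@y.values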
Decision Tree
A Decision Tree is a simple and interpretable algorithm used for classification and regression problems. It recursively splits the data on the predictor values, choosing at each node the split that best separates the target classes, until a stopping criterion is reached; a prediction is made by following a passenger’s attribute values down the tree to a leaf. Here we use ctree from the partykit package, which selects splits using conditional-inference (significance) tests.
First, we will need to create a model using Decision Tree method.
# Create Decision Tree model
model_dt <- ctree(formula = Survived~.,
data = titanic_train)
# Visualize decision tree
plot(model_dt,
type = "simple")
Second, we will create a prediction based on DT model.
# Predict class of test data
pred_dt <- predict(object = model_dt,
newdata = titanic_test,
type = "response")
# Model evaluation using confusion matrix
dt_evaluation <- confusionMatrix(data = pred_dt,
titanic_test$Survived,
positive = "1")
dt_evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 60 10
## 1 25 48
##
## Accuracy : 0.7552
## 95% CI : (0.6764, 0.8232)
## No Information Rate : 0.5944
## P-Value [Acc > NIR] : 3.97e-05
##
## Kappa : 0.5124
##
## Mcnemar's Test P-Value : 0.01796
##
## Sensitivity : 0.8276
## Specificity : 0.7059
## Pos Pred Value : 0.6575
## Neg Pred Value : 0.8571
## Prevalence : 0.4056
## Detection Rate : 0.3357
## Detection Prevalence : 0.5105
## Balanced Accuracy : 0.7667
##
## 'Positive' Class : 1
##
💡 Insight :
- The Accuracy value is 75.52%
- The Precision value is only 65.75%; we suspect the model may be overfitting
- We need to tune our model
Tuning the current Decision Tree model using the pruning method
We will update the control parameters of the model:
- Minimum criterion to 0.05
- Minimum split to 30
- Minimum bucket to 10
# Pruning
model_dt_new <- ctree(formula = Survived~.,
data = titanic_train,
control = ctree_control(mincriterion = 0.05,
minsplit = 30,
minbucket = 10))
plot(model_dt_new, type = "simple")
Check the prediction result
pred_dt_new <- predict(object = model_dt_new,
newdata = titanic_test,
type = "response")
# Model evaluation using confusion matrix
dt_evaluation_new <- confusionMatrix(data = pred_dt_new,
titanic_test$Survived,
positive = "1")
dt_evaluation_new
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 65 13
## 1 20 45
##
## Accuracy : 0.7692
## 95% CI : (0.6915, 0.8355)
## No Information Rate : 0.5944
## P-Value [Acc > NIR] : 7.98e-06
##
## Kappa : 0.5304
##
## Mcnemar's Test P-Value : 0.2963
##
## Sensitivity : 0.7759
## Specificity : 0.7647
## Pos Pred Value : 0.6923
## Neg Pred Value : 0.8333
## Prevalence : 0.4056
## Detection Rate : 0.3147
## Detection Prevalence : 0.4545
## Balanced Accuracy : 0.7703
##
## 'Positive' Class : 1
##
We also need to compare against the training data to check for overfitting.
pred_dt_train_new <- predict(object = model_dt_new,
newdata = titanic_train,
type = "response")
dt_evaluation_train_new <- confusionMatrix(data = pred_dt_train_new,
titanic_train$Survived,
positive = "1")
dt_evaluation_train_new
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 191 40
## 1 41 192
##
## Accuracy : 0.8254
## 95% CI : (0.7878, 0.8589)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6509
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8276
## Specificity : 0.8233
## Pos Pred Value : 0.8240
## Neg Pred Value : 0.8268
## Prevalence : 0.5000
## Detection Rate : 0.4138
## Detection Prevalence : 0.5022
## Balanced Accuracy : 0.8254
##
## 'Positive' Class : 1
##
💡 Insight :
- Our model is better after tuning, increasing Accuracy to 76.92% and Precision to 69.23%
- However, the gap against the training accuracy (82.54%) shows the model is still somewhat overfitting
- We can still improve the model in the future using ensemble methods, feature selection and regularization
- In terms of Precision, the Decision Tree result is still worse than the Naive Bayes model
Random Forest
Random Forest is a popular machine learning algorithm used for classification and regression problems. It is an ensemble method that combines multiple decision trees to improve the accuracy and reduce the variance of the model.
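For orientation, the same kind of model could also be fit directly with the randomForest package (outside of caret); a minimal sketch, where the ntree and mtry values are assumptions chosen only for illustration:
# Direct randomForest fit (illustrative values for ntree and mtry)
set.seed(123)
model_rf_direct <- randomForest(Survived ~ .,
data = titanic_train,
ntree = 500,
mtry = 3,
importance = TRUE)
model_rf_direct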
First, we will need to create a model using Random Forest method. Then, save the result in RDS format and load it as a new variable.
# Create Random Forest model
set.seed(123)
control <- trainControl(method = "repeatedcv",
number = 5,
repeats = 3)
model_rf <- train(form = Survived~.,
data = titanic_train,
method = "rf",
# NOTE: caret's train() expects `trControl = control`;
# as written, `trainControl = control` is passed on to randomForest() instead
trainControl=control)
# Save the model in RDS format
saveRDS(model_rf,
"titanic_Randomforest.RDS")
# Read the model and save it as new variable
model_rf <- readRDS("titanic_Randomforest.RDS")
# View the model
model_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, trainControl = ..1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 11
##
## OOB estimate of error rate: 20.91%
## Confusion matrix:
## 0 1 class.error
## 0 188 44 0.1896552
## 1 53 179 0.2284483
Next, we can create visualisations to check the Out-of-Bag error and the most important predictors using varImp
# Check Out of Bag Error
plot(model_rf$finalModel)
legend("topright", colnames(model_rf$finalModel$err.rate),
col=1:6,cex=0.8,fill=1:6)
# Check significant predictors
plot(varImp(model_rf))
💡 Insight :
- The Out-of-Bag error estimate is 20.91%. Therefore, the estimated accuracy of the model is 79.09%
- Based on the first visualisation, the error rate is relatively low
- Fare, Age and Sex are the most significant predictors
Predict on the test data
# model evaluation
pred_rf <- predict(model_rf,
newdata = titanic_test)
rf_evaluation <- confusionMatrix(pred_rf,
titanic_test$Survived,
positive = "1")
rf_evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 61 19
## 1 24 39
##
## Accuracy : 0.6993
## 95% CI : (0.6171, 0.7731)
## No Information Rate : 0.5944
## P-Value [Acc > NIR] : 0.006104
##
## Kappa : 0.3848
##
## Mcnemar's Test P-Value : 0.541866
##
## Sensitivity : 0.6724
## Specificity : 0.7176
## Pos Pred Value : 0.6190
## Neg Pred Value : 0.7625
## Prevalence : 0.4056
## Detection Rate : 0.2727
## Detection Prevalence : 0.4406
## Balanced Accuracy : 0.6950
##
## 'Positive' Class : 1
##
💡 Insight :
- Accuracy is 69.93% and Precision is only 61.90%
- The model is still no better than the Naive Bayes model
Evaluation
# Collect evaluation results for the NB, DT and RF models
eval_nb <- tibble(Accuracy = nb_evaluation$overall[1],
Recall = nb_evaluation$byClass[1],
Specificity = nb_evaluation$byClass[2],
Precision = nb_evaluation$byClass[3])
eval_dt <- tibble(Accuracy = dt_evaluation_new$overall[1],
Recall = dt_evaluation_new$byClass[1],
Specificity = dt_evaluation_new$byClass[2],
Precision = dt_evaluation_new$byClass[3])
eval_rf <- tibble(Accuracy = rf_evaluation$overall[1],
Recall = rf_evaluation$byClass[1],
Specificity = rf_evaluation$byClass[2],
Precision = rf_evaluation$byClass[3])
# Show result
eval_result <- as.data.frame(rbind(eval_nb, eval_dt, eval_rf))
rownames(eval_result) <- c("Naive Bayes","Decision Tree","Random Forest")
paged_table(eval_result)
Conclusion
- The Accuracy of the prediction model built with the Decision Tree algorithm is the best result.
- The Precision of the prediction model built with the Naive Bayes algorithm is the best result.
- The reported AUC for the Naive Bayes model is less than 0.5, which would normally mean the model performs worse than random guessing and cannot differentiate between the positive and negative classes; here, however, it mainly reflects that the ROC was computed on the negative-class probabilities (see the Naive Bayes section). Either way, there are several steps that can be taken to improve the models:
- Feature selection: Revisit the feature selection process and ensure that the selected features are relevant and informative for the target variable. Consider adding or removing features to improve the model’s performance.
- Hyperparameter tuning: Perform hyperparameter tuning to find the optimal combination of hyperparameters for the model. This can be done using techniques such as grid search, random search, or Bayesian optimization (a minimal grid-search sketch is given after this list).
- Try different algorithms: If the AUC value is still low after trying the above steps, consider trying a different algorithm that may be better suited to the problem at hand. For example, if the data is highly non-linear, a neural network may perform better than a linear model.
- It is important to note that a low AUC value does not necessarily mean that the model is useless. Depending on the specific problem and application, a model with a low AUC value may still be useful for some tasks. However, it is always important to strive for the best possible performance and to continuously improve the model as new data and techniques become available.
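As a concrete illustration of the hyperparameter tuning point above, a simple caret grid search over the random forest’s mtry parameter might look like the sketch below; the candidate mtry values are assumptions chosen only for illustration:
# Grid search over mtry for the random forest using 5-fold repeated CV
set.seed(123)
rf_grid <- expand.grid(mtry = c(2, 3, 5, 7))
model_rf_tuned <- train(Survived ~ .,
data = titanic_train,
method = "rf",
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3),
tuneGrid = rf_grid)
# Best mtry found by cross-validation
model_rf_tuned$bestTune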