Brief History of Titanic Disaster
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On the night of Sunday, April 14, 1912, during her maiden voyage, the RMS Titanic, widely considered unsinkable, struck an iceberg and sank in the early hours of April 15. The Titanic’s distress signals were heard by nearby ships. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the deaths of 1,502 of the 2,224 passengers and crew.
Federal law soon required all large ocean-going vessels to be equipped with wireless equipment for safety reasons. David Sarnoff noted that the Titanic disaster brought radio to the front.
Purpose of the project:
- Learn to use Naive Bayes, Decision Tree and Random Forest to predict Survived based on the data set.
Explanation of the “Titanic” data:
- PassengerId : Row ID in the data set
- Survived : If a passenger survived or not (0 = No, 1 = Yes)
- Pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name : Passenger’s name
- Sex : Passenger’s gender (male or female)
- Age : Passenger’s age (in years)
- SibSp : # of siblings / spouses aboard the Titanic
- Parch : # of parents / children aboard the Titanic
- Ticket : Ticket number
- Fare : Passenger fare
- Cabin : Cabin number
- Embarked : Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Data Preparation
Load library.
library(tidyverse)
library(GGally)
library(car)
library(caret)
library(class)
library(rmarkdown)
library(reshape)
library(lmtest)
library(dplyr)
library(rsample)
library(gtools)
library(e1071)
library(ROCR)
library(partykit)
library(randomForest)
Load dataset.
# Load data
titanic <- read.csv("dataInputs/train.csv")
# Show data as table
paged_table(titanic)
Check the structure of the data frame
# Check structure
titanic %>% glimpse()
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
# Ticket is stored as character, so it has no factor levels
levels(titanic$Ticket)
## NULL
💡 Insight :
PassengerId, Name, Ticket and Cabin aren’t usable variables for the prediction model, so they will be removed. Survived is the target of our prediction. Survived, Pclass, Sex, SibSp, Parch and Embarked should be converted to categorical (factor) type.
titanic <- titanic %>%
select(-c(PassengerId,
Name,
Ticket,
Cabin)) %>%
mutate_at(vars(Survived,
Pclass,
Sex,
SibSp,
Parch,
Embarked), as.factor)
NA values in our data frame
# Check proportion of missing data
table(is.na(titanic))
##
## FALSE TRUE
## 6951 177
titanic <- titanic %>% na.omit()
titanic %>% is.na() %>% colSums()
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 0 0 0 0 0 0 0 0
The proportion of missing values (NA) in the data is only about 2.5% of all cells (177 of 7,128). Therefore, the affected rows can be deleted.
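This proportion can also be computed directly; a minimal sketch (run on the data frame before na.omit() is applied):
# Proportion of NA cells across the whole data frame (before na.omit)
mean(is.na(titanic))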
Re-check missing values using the reshape library
# Check missing value using "reshape" library
missing_data <- melt(apply(titanic[, -2], 2, function(x) sum(is.na(x) | x=="")))
cbind(row.names(missing_data)[missing_data$value>0], missing_data[missing_data$value>0,])
## [,1] [,2]
## [1,] "Embarked" "2"
Update the missing embarkation port using the most common value (S)
titanic$Embarked[which(is.na(titanic$Embarked) | titanic$Embarked=="")] <- 'S'
Exploratory Data Analysis
Take a look at the data summary
titanic %>% summary()
## Survived Pclass Sex Age SibSp Parch Fare
## 0:424 1:186 female:261 Min. : 0.42 0:471 0:521 Min. : 0.00
## 1:290 2:173 male :453 1st Qu.:20.12 1:183 1:110 1st Qu.: 8.05
## 3:355 Median :28.00 2: 25 2: 68 Median : 15.74
## Mean :29.70 3: 12 3: 5 Mean : 34.69
## 3rd Qu.:38.00 4: 18 4: 4 3rd Qu.: 33.38
## Max. :80.00 5: 5 5: 5 Max. :512.33
## 8: 0 6: 1
## Embarked
## : 0
## C:130
## Q: 28
## S:556
##
##
##
💡 Insight :
- 424 passengers died during the tragedy and only 290 survived
- There are 453 male and 261 female passengers
- Age and Fare seem to have outliers
Check possibly outlier variables
# Check outlier on Fare predictor
boxplot(titanic$Fare)
# Check outlier on Age predictor
boxplot(titanic$Age)
💡 Insight :
- As per our assumption, Fare and Age have outliers.
- We need to pay more attention when using them as predictors.
Check class imbalance
# Check class imbalance
prop.table(table(titanic$Survived))
##
## 0 1
## 0.5938375 0.4061625
Based on the proportion above, the target variable (Survived) is fairly balanced, so additional pre-processing is not strictly necessary. However, in this case we will make the proportion 50:50 by down-sampling the training data.
Train Test Split
Before we build a model, we need to split the data into train and test datasets. This is a crucial step in the machine learning process, as it allows us to evaluate the performance of our models and make informed decisions about how to improve them. We will use 80% of the data for training and the rest for testing.
# Train test split
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- initial_split(data = titanic,
prop = 0.8,
strata = "Survived")
titanic_train <- training(index)
titanic_test <- testing(index)
# Down-sampling for titanic_train
titanic_train <- downSample(x = titanic_train %>% select(-Survived),
y = titanic_train$Survived,
yname = "Survived")
# Re-check class imbalance on train dataset
prop.table(table(titanic_train$Survived))
##
## 0 1
## 0.5 0.5
Now, the class proportion in the training data is perfectly balanced.
Create Model
We will create three types of models (Naive Bayes, Decision Tree and Random Forest) to predict whether a passenger survived or not. Each model will be developed in several steps:
- Create the model
- Predict on the test data
- Evaluate the model using a Confusion Matrix, ROC and AUC
- Tune the model (if necessary)
Naive Bayes
Naive Bayes is a simple and popular algorithm used for classification problems. It is based on Bayes’ theorem, which states that the probability of a hypothesis (such as a classification label) given some evidence (such as a set of features) is proportional to the probability of the evidence given the hypothesis multiplied by the prior probability of the hypothesis.
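For reference, Bayes’ theorem combined with the naive conditional-independence assumption over the predictors gives the rule the classifier applies (a standard formulation, not specific to this data set):

$$P(y \mid x_1, \ldots, x_p) \;\propto\; P(y)\,\prod_{j=1}^{p} P(x_j \mid y)$$

The predicted class is the value of y (here, Survived = 0 or 1) that maximizes this quantity.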
First, we will need to create a model using Naive Bayes method.
# Create Naive Bayes Model
model_nb <- naiveBayes(Survived~.,
data = titanic_train,
laplace = 0.1)
Second, we will create a prediction based on our previous model.
# Predict class of test data
pred_nb <- predict(model_nb,
newdata = titanic_test,
type = "class")
Next, we need to create a model evaluation using Confusion Matrix
# Model evaluation using confusion matrix
nb_evaluation <- confusionMatrix(data = pred_nb,
titanic_test$Survived,
positive = "1")
nb_evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 75 29
## 1 10 29
##
## Accuracy : 0.7273
## 95% CI : (0.6465, 0.7983)
## No Information Rate : 0.5944
## P-Value [Acc > NIR] : 0.0006414
##
## Kappa : 0.4033
##
## Mcnemar's Test P-Value : 0.0039478
##
## Sensitivity : 0.5000
## Specificity : 0.8824
## Pos Pred Value : 0.7436
## Neg Pred Value : 0.7212
## Prevalence : 0.4056
## Detection Rate : 0.2028
## Detection Prevalence : 0.2727
## Balanced Accuracy : 0.6912
##
## 'Positive' Class : 1
##
💡 Insight :
- The model accuracy is 72.73%
- We want to avoid “did not survive but predicted as survived” cases. Therefore, we will use Precision to limit False Positives. The Precision value is 74.36% (see the short snippet after this list)
- We need to check ROC and AUC as additional tools for model evaluation
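For reference, the precision (Pos Pred Value) and accuracy quoted above can be pulled straight from the caret confusionMatrix object; a minimal sketch:
# Extract precision (positive predictive value) and accuracy from the evaluation object
nb_evaluation$byClass["Pos Pred Value"]
nb_evaluation$overall["Accuracy"]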
# predict test data in probability type
pred_naive_prob <- predict(model_nb,
newdata = titanic_test,
type = "raw")
# create prediction object
# NOTE: column 1 of the raw output is P(Survived = 0), i.e. the negative class,
# so the ROC/AUC below are computed on inverted scores
roc_naive_pred <- prediction(predictions = pred_naive_prob[,1],
labels = titanic_test$Survived=="1")
# ROC curve
plot(performance(prediction.obj = roc_naive_pred,
measure = "tpr",
x.measure = "fpr"))
# AUC
naive_auc <- performance(prediction.obj = roc_naive_pred,
measure = "auc")
auc_naive <- naive_auc@y.values
auc_naive
## [[1]]
## [1] 0.2162272
💡 Insight :
- Based on the ROC visualisation, the curve lies below the diagonal and does not look good
- The AUC value (≈ 0.22) is far below 0.5. An AUC this low usually means the scores are inverted rather than that the model is truly worse than random: column 1 of the raw prediction is the probability of class “0”, so the ROC/AUC were computed on the negative-class probabilities
- Recomputed on the positive-class probabilities (column 2), the AUC would be 1 − 0.216 ≈ 0.78, so the Naive Bayes model separates the positive and negative classes reasonably well (see the corrected sketch below)
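A corrected version of the ROC/AUC computation, using the positive-class probability column, would look like this (a sketch; the numeric result is not re-run here):
# Build the prediction object from the positive-class ("1") probabilities
roc_naive_pred_pos <- prediction(predictions = pred_naive_prob[, "1"],
labels = titanic_test$Survived == "1")
# ROC curve on the corrected scores
plot(performance(prediction.obj = roc_naive_pred_pos,
measure = "tpr",
x.measure = "fpr"))
# AUC (mathematically equal to 1 minus the value reported above)
performance(prediction.obj = roc_naive_pred_pos, measure = "auc")@y.values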
Decision Tree
A Decision Tree is a simple and interpretable algorithm used for classification and regression problems. It recursively splits the data on the predictor values, choosing at each node the split that best separates the target classes, until a stopping criterion is reached; a prediction is made by following a passenger’s attribute values down the tree to a leaf. Here we use ctree from the partykit package, which selects splits using conditional-inference (significance) tests.
First, we will need to create a model using Decision Tree method.
# Create Decision Tree model
model_dt <- ctree(formula = Survived~.,
data = titanic_train)
# Visualize decision tree
plot(model_dt,
type = "simple")
Second, we will create a prediction based on DT model.
# Predict class of test data
pred_dt <- predict(object = model_dt,
newdata = titanic_test,
type = "response")
# Model evaluation using confusion matrix
dt_evaluation <- confusionMatrix(data = pred_dt,
titanic_test$Survived,
positive = "1")
dt_evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 60 10
## 1 25 48
##
## Accuracy : 0.7552
## 95% CI : (0.6764, 0.8232)
## No Information Rate : 0.5944
## P-Value [Acc > NIR] : 3.97e-05
##
## Kappa : 0.5124
##
## Mcnemar's Test P-Value : 0.01796
##
## Sensitivity : 0.8276
## Specificity : 0.7059
## Pos Pred Value : 0.6575
## Neg Pred Value : 0.8571
## Prevalence : 0.4056
## Detection Rate : 0.3357
## Detection Prevalence : 0.5105
## Balanced Accuracy : 0.7667
##
## 'Positive' Class : 1
##
💡 Insight :
- The Accuracy value is 75.52%
- The Precision value is only 65.75%; we suspect the model may be overfitting
- We need to tune our model
Tuning the current Decision Tree model using the pruning method
We will update the control parameters of the model:
- Minimum criterion to 0.05
- Minimum split to 30
- Minimum bucket to 10
# Pruning
model_dt_new <- ctree(formula = Survived~.,
data = titanic_train,
control = ctree_control(mincriterion = 0.05,
minsplit = 30,
minbucket = 10))
plot(model_dt_new, type = "simple")
Check the prediction result
pred_dt_new <- predict(object = model_dt_new,
newdata = titanic_test,
type = "response")
# Model evaluation using confusion matrix
dt_evaluation_new <- confusionMatrix(data = pred_dt_new,
titanic_test$Survived,
positive = "1")
dt_evaluation_new
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 65 13
## 1 20 45
##
## Accuracy : 0.7692
## 95% CI : (0.6915, 0.8355)
## No Information Rate : 0.5944
## P-Value [Acc > NIR] : 7.98e-06
##
## Kappa : 0.5304
##
## Mcnemar's Test P-Value : 0.2963
##
## Sensitivity : 0.7759
## Specificity : 0.7647
## Pos Pred Value : 0.6923
## Neg Pred Value : 0.8333
## Prevalence : 0.4056
## Detection Rate : 0.3147
## Detection Prevalence : 0.4545
## Balanced Accuracy : 0.7703
##
## 'Positive' Class : 1
##
We also need to compare against the training data to check for overfitting.
pred_dt_train_new <- predict(object = model_dt_new,
newdata = titanic_train,
type = "response")
dt_evaluation_train_new <- confusionMatrix(data = pred_dt_train_new,
titanic_train$Survived,
positive = "1")
dt_evaluation_train_new
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 191 40
## 1 41 192
##
## Accuracy : 0.8254
## 95% CI : (0.7878, 0.8589)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6509
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8276
## Specificity : 0.8233
## Pos Pred Value : 0.8240
## Neg Pred Value : 0.8268
## Prevalence : 0.5000
## Detection Rate : 0.4138
## Detection Prevalence : 0.5022
## Balanced Accuracy : 0.8254
##
## 'Positive' Class : 1
##
💡 Insight :
- Our model is better after tuning, increasing Accuracy to 76.92% and Precision to 69.23%
- However, the gap against the training accuracy (82.54%) shows the model is still somewhat overfitting
- We can still improve the model in the future using ensemble methods, feature selection and regularization
- In terms of Precision, the Decision Tree result is still worse than the Naive Bayes model
Random Forest
Random Forest is a popular machine learning algorithm used for classification and regression problems. It is an ensemble method that combines multiple decision trees to improve the accuracy and reduce the variance of the model.
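For orientation, the same kind of model could also be fit directly with the randomForest package (outside of caret); a minimal sketch, where the ntree and mtry values are assumptions chosen only for illustration:
# Direct randomForest fit (illustrative values for ntree and mtry)
set.seed(123)
model_rf_direct <- randomForest(Survived ~ .,
data = titanic_train,
ntree = 500,
mtry = 3,
importance = TRUE)
model_rf_direct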
First, we will need to create a model using Random Forest method. Then, save the result in RDS format and load it as a new variable.
# Create Random Forest model
set.seed(123)
control <- trainControl(method = "repeatedcv",
number = 5,
repeats = 3)
model_rf <- train(form = Survived~.,
data = titanic_train,
method = "rf",
# NOTE: caret's train() expects `trControl = control`;
# as written, `trainControl = control` is passed on to randomForest() instead
trainControl=control)
# Save the model in RDS format
saveRDS(model_rf,
"titanic_Randomforest.RDS")
# Read the model and save it as new variable
model_rf <- readRDS("titanic_Randomforest.RDS")
# View the model
model_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, trainControl = ..1)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 11
##
## OOB estimate of error rate: 20.91%
## Confusion matrix:
## 0 1 class.error
## 0 188 44 0.1896552
## 1 53 179 0.2284483
Next, we can create visualisations to check the Out-of-Bag error and the most important predictors using varImp
# Check Out of Bag Error
plot(model_rf$finalModel)
legend("topright", colnames(model_rf$finalModel$err.rate),
col=1:6,cex=0.8,fill=1:6)
# Check significant predictors
plot(varImp(model_rf))
💡 Insight :
- The Out-of-Bag error estimate is 20.91%. Therefore, the estimated accuracy of the model is 79.09%
- Based on the first visualisation, the error rate is relatively low
- Fare, Age and Sex are the most significant predictors
Predict on the test data
# model evaluation
pred_rf <- predict(model_rf,
newdata = titanic_test)
rf_evaluation <- confusionMatrix(pred_rf,
titanic_test$Survived,
positive = "1")
rf_evaluation
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 61 19
## 1 24 39
##
## Accuracy : 0.6993
## 95% CI : (0.6171, 0.7731)
## No Information Rate : 0.5944
## P-Value [Acc > NIR] : 0.006104
##
## Kappa : 0.3848
##
## Mcnemar's Test P-Value : 0.541866
##
## Sensitivity : 0.6724
## Specificity : 0.7176
## Pos Pred Value : 0.6190
## Neg Pred Value : 0.7625
## Prevalence : 0.4056
## Detection Rate : 0.2727
## Detection Prevalence : 0.4406
## Balanced Accuracy : 0.6950
##
## 'Positive' Class : 1
##
💡 Insight :
- Accuracy is 69.93% and Precision is only 61.90%
- The model is still no better than the Naive Bayes model
Evaluation
# Collect evaluation results for the NB, DT and RF models
eval_nb <- tibble(Accuracy = nb_evaluation$overall[1],
Recall = nb_evaluation$byClass[1],
Specificity = nb_evaluation$byClass[2],
Precision = nb_evaluation$byClass[3])
eval_dt <- tibble(Accuracy = dt_evaluation_new$overall[1],
Recall = dt_evaluation_new$byClass[1],
Specificity = dt_evaluation_new$byClass[2],
Precision = dt_evaluation_new$byClass[3])
eval_rf <- tibble(Accuracy = rf_evaluation$overall[1],
Recall = rf_evaluation$byClass[1],
Specificity = rf_evaluation$byClass[2],
Precision = rf_evaluation$byClass[3])
# Show result
eval_result <- as.data.frame(rbind(eval_nb, eval_dt, eval_rf))
rownames(eval_result) <- c("Naive Bayes","Decision Tree","Random Forest")
paged_table(eval_result)
Conclusion
- The Accuracy of the prediction model built with the Decision Tree algorithm is the best result.
- The Precision of the prediction model built with the Naive Bayes algorithm is the best result.
- The reported AUC for the Naive Bayes model is less than 0.5, which would normally mean the model performs worse than random guessing and cannot differentiate between the positive and negative classes; here, however, it mainly reflects that the ROC was computed on the negative-class probabilities (see the Naive Bayes section). Either way, there are several steps that can be taken to improve the models:
- Feature selection: Revisit the feature selection process and ensure that the selected features are relevant and informative for the target variable. Consider adding or removing features to improve the model’s performance.
- Hyperparameter tuning: Perform hyperparameter tuning to find the optimal combination of hyperparameters for the model. This can be done using techniques such as grid search, random search, or Bayesian optimization (a minimal grid-search sketch is given after this list).
- Try different algorithms: If the AUC value is still low after trying the above steps, consider trying a different algorithm that may be better suited to the problem at hand. For example, if the data is highly non-linear, a neural network may perform better than a linear model.
- It is important to note that a low AUC value does not necessarily mean that the model is useless. Depending on the specific problem and application, a model with a low AUC value may still be useful for some tasks. However, it is always important to strive for the best possible performance and to continuously improve the model as new data and techniques become available.
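As a concrete illustration of the hyperparameter tuning point above, a simple caret grid search over the random forest’s mtry parameter might look like the sketch below; the candidate mtry values are assumptions chosen only for illustration:
# Grid search over mtry for the random forest using 5-fold repeated CV
set.seed(123)
rf_grid <- expand.grid(mtry = c(2, 3, 5, 7))
model_rf_tuned <- train(Survived ~ .,
data = titanic_train,
method = "rf",
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3),
tuneGrid = rf_grid)
# Best mtry found by cross-validation
model_rf_tuned$bestTune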