1 Introduction

1.1 Introduction

This report walks through the process of building classification models using Naive Bayes, Decision Tree, and Random Forest supervised learning to predict the survival of passengers of the infamous Titanic shipwreck in April 1912.

The data is obtained from https://www.kaggle.com/competitions/titanic/data

1.2 Packages

To build the models we will use the following libraries:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gtools)
library(ggplot2)
library(class)
library(tidyr)
library(caret)
## Loading required package: lattice
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:gtools':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:gtools':
## 
##     permutations
library(partykit)
## Warning: package 'partykit' was built under R version 4.3.1
## Loading required package: grid
## Loading required package: libcoin
## Warning: package 'libcoin' was built under R version 4.3.1
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.3.1
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.1
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine

2 Data Preparation & Explanation

2.1 Data input

titan_train <- read.csv("Tinanic/train.csv")
head(titan_train)

2.2 Data Explanation

survival : Survival status of the passenger, 0 = No, 1 = Yes

pclass : Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd

sex : Passenger's sex

age : Passenger's age in years

sibsp : number of siblings / spouses aboard the Titanic

parch : number of parents / children aboard the Titanic

ticket : Ticket number

fare : Passenger fare

cabin : Cabin number

embarked : Port of Embarkation location, key : C = Cherbourg, Q = Queenstown, S = Southampton

2.3 Data Structure

glimpse(titan_train)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

The training data has 891 rows and 12 columns. The Survived column, which indicates the survival status of a passenger, will be used as the target variable, while the remaining columns will be used as predictors.

Next, we convert the variables of the training data to the appropriate data types and remove the variables that are not relevant to the model:

titan_train <- titan_train %>%
  mutate(Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         Survived = as.factor(Survived),
         Embarked = as.factor(Embarked),
         Parch = as.factor(Parch),
         SibSp = as.factor(SibSp)) %>%
  select(-PassengerId , -Name, -Ticket, -Cabin)

In case you did not notice, the Cabin variable contains many empty character values (""), which is why we decided to remove it as well.

2.4 Missing Values

colSums(is.na(titan_train))
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0      177        0        0        0        0

Since there is only a small portion of missing values in the data set, we will remove the rows that contain them:

titan_train_clean <- na.omit(titan_train)
glimpse(titan_train_clean)
## Rows: 714
## Columns: 8
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
## $ Pclass   <fct> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3, 1…
## $ Sex      <fct> male, female, female, female, male, male, male, female, femal…
## $ Age      <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, …
## $ SibSp    <fct> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0, 0…
## $ Parch    <fct> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0, 0…
## $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750, 1…
## $ Embarked <fct> S, C, S, S, S, S, S, S, C, S, S, S, S, S, S, Q, S, S, S, Q, S…

2.5 Data Splitting

In this section, we split the dataset into a training set (data_train) and a test set (data_test). The training set will be used to train the models and the test set will be used to evaluate the performance of the trained models. 80% of the data will be used for training and the remaining 20% for testing.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
samplesize <- round(0.8 * nrow(titan_train_clean), 0)
index <- sample(seq_len(nrow(titan_train_clean)), size = samplesize)

data_train <- titan_train_clean[index, ]
data_test <- titan_train_clean[-index, ]

2.6 Data Pre-processing

Before we build the models, we need to examine the class proportions of the target variable in the training set.

prop.table(table(data_train$Survived))
## 
##         0         1 
## 0.5954466 0.4045534

We can see that the proportions of the positive and negative classes of the target are unbalanced (roughly 60/40), which can affect model performance. Because the training set is fairly small, we will use upsampling to balance the proportions rather than discard rows; note, however, that upsampling only duplicates existing minority-class rows, so no new information is added to the data:

RNGkind(sample.kind = "Rounding")
train_down <- upSample(
  x = data_train %>% select(-Survived),
  y = data_train$Survived,
  yname = "Survived"
)
nrow(train_down)
## [1] 680
prop.table(table(train_down$Survived))
## 
##   0   1 
## 0.5 0.5

Now we have a balanced target variable, with 680 rows in the upsampled training set.

3 Model Building : Naive Bayes

3.1 Model Build

model_naive <- naiveBayes(Survived ~ ., data = train_down)  
model_naive
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##   0   1 
## 0.5 0.5 
## 
## Conditional probabilities:
##    Pclass
## Y           1         2         3
##   0 0.1588235 0.2264706 0.6147059
##   1 0.3852941 0.2764706 0.3382353
## 
##    Sex
## Y      female      male
##   0 0.1441176 0.8558824
##   1 0.6441176 0.3558824
## 
##    Age
## Y       [,1]     [,2]
##   0 30.68529 14.01846
##   1 26.61691 14.64662
## 
##    SibSp
## Y             0           1           2           3           4           5
##   0 0.700000000 0.208823529 0.026470588 0.020588235 0.035294118 0.008823529
##   1 0.597058824 0.341176471 0.032352941 0.017647059 0.011764706 0.000000000
##    SibSp
## Y             8
##   0 0.000000000
##   1 0.000000000
## 
##    Parch
## Y             0           1           2           3           4           5
##   0 0.791176471 0.123529412 0.064705882 0.002941176 0.005882353 0.008823529
##   1 0.635294118 0.197058824 0.155882353 0.005882353 0.000000000 0.005882353
##    Parch
## Y             6
##   0 0.002941176
##   1 0.000000000
## 
##    Fare
## Y       [,1]     [,2]
##   0 23.32182 32.04383
##   1 48.39665 68.26737
## 
##    Embarked
## Y                         C           Q           S
##   0 0.000000000 0.123529412 0.041176471 0.835294118
##   1 0.002941176 0.261764706 0.044117647 0.691176471

The provided information represents the output of a Naive Bayes classifier for discrete predictors. Here is an explanation of the different components:

A-priori probabilities: These are the probabilities of the classes (0 and 1) occurring in the target variable. In this case, the probabilities are 0.5 for each class, indicating an equal prior probability for both classes.

Conditional probabilities: These are the conditional probabilities of each predictor variable given each class. Each table represents the probabilities of a specific predictor variable taking certain values (e.g., Pclass 1, 2, or 3) given the class (0 or 1).

Pclass: The conditional probabilities for the variable "Pclass" give the probability of each ticket class given the class label. For example, the probability of Pclass 1 given class 0 is 0.1588, while the probability of Pclass 1 given class 1 is 0.3853.

Sex: The conditional probabilities for the variable "Sex" give the probability of each sex given the class label. For example, the probability of being female given class 0 is 0.1441, while the probability of being female given class 1 is 0.6441.

Age: Because "Age" is numeric, the table reports the mean (first column) and standard deviation (second column) of Age within each class. For example, the mean age for class 0 is 30.69, while the mean age for class 1 is 26.62.

SibSp: The conditional probabilities for the variable "SibSp" give the probability of each SibSp value given the class label. For example, the probability of SibSp 0 given class 0 is 0.700, while the probability of SibSp 0 given class 1 is 0.597.

Parch: The conditional probabilities for the variable "Parch" give the probability of each Parch value given the class label. For example, the probability of Parch 0 given class 0 is 0.791, while the probability of Parch 0 given class 1 is 0.635.

Fare: As with "Age", the table for "Fare" reports the mean and standard deviation of Fare within each class.

Embarked: The conditional probabilities for the variable "Embarked" give the probability of each port given the class label. For example, the probability of Embarked C given class 0 is 0.124, while the probability of Embarked C given class 1 is 0.262. (The unnamed first column corresponds to the empty-string level of Embarked.)

These conditional probabilities can be used to calculate the posterior probabilities and make predictions using the Naive Bayes classifier.
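To see how these pieces combine into a prediction, we can ask the classifier for its posterior probabilities directly. Below is a minimal sketch, reusing model_naive and data_test from above; type = "raw" in e1071's predict() returns the posterior class probabilities instead of the predicted labels.

# Posterior probabilities P(Survived = 0) and P(Survived = 1) for the first
# few test passengers, derived from the a-priori and conditional tables above
head(predict(model_naive, newdata = data_test, type = "raw"))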

3.2 Predict

preds_naive <- predict(model_naive, newdata = data_test) 

(confmatrix_naive <- table(preds_naive, data_test$Survived))  
##            
## preds_naive  0  1
##           0 70 16
##           1 14 43

The confusion matrix displays the following information:

True Negative (TN): The number of instances that were correctly predicted as negative (0) by the classifier. In this case, there are 70 instances that were correctly predicted as negative.

False Positive (FP): The number of instances that were incorrectly predicted as positive (1) by the classifier while the actual class is negative (0). In this case, there are 14 such instances.

False Negative (FN): The number of instances that were incorrectly predicted as negative (0) by the classifier while the actual class is positive (1). In this case, there are 16 such instances.

True Positive (TP): The number of instances that were correctly predicted as positive (1) by the classifier. In this case, there are 43 instances that were correctly predicted as positive.
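These four counts can be turned into the headline metrics by hand before letting caret report them in the next section. A minimal sketch, reusing confmatrix_naive from above (rows are predictions, columns are actual labels):

# Extract the four cells of the confusion matrix
TN <- confmatrix_naive["0", "0"]   # predicted 0, actual 0
FN <- confmatrix_naive["0", "1"]   # predicted 0, actual 1
FP <- confmatrix_naive["1", "0"]   # predicted 1, actual 0
TP <- confmatrix_naive["1", "1"]   # predicted 1, actual 1

(TP + TN) / (TP + TN + FP + FN)    # accuracy, about 0.79
TP / (TP + FN)                     # sensitivity when "1" (survived) is the positive class
TN / (TN + FP)                     # specificity when "1" is the positive class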

3.3 Model Evaluation

confusionMatrix(confmatrix_naive) 
## Confusion Matrix and Statistics
## 
##            
## preds_naive  0  1
##           0 70 16
##           1 14 43
##                                           
##                Accuracy : 0.7902          
##                  95% CI : (0.7143, 0.8538)
##     No Information Rate : 0.5874          
##     P-Value [Acc > NIR] : 2.332e-07       
##                                           
##                   Kappa : 0.565           
##                                           
##  Mcnemar's Test P-Value : 0.8551          
##                                           
##             Sensitivity : 0.8333          
##             Specificity : 0.7288          
##          Pos Pred Value : 0.8140          
##          Neg Pred Value : 0.7544          
##              Prevalence : 0.5874          
##          Detection Rate : 0.4895          
##    Detection Prevalence : 0.6014          
##       Balanced Accuracy : 0.7811          
##                                           
##        'Positive' Class : 0               
## 

We will use accuracy as the metric to compare model performance. We choose this metric because the data are historical, acquired in 1912; given the gap of more than a century, no rescue team can be deployed, so the only purpose of the model is to determine whether a passenger did or did not survive the Titanic shipwreck. If, hypothetically, we wanted to rescue passengers, we would instead prioritize the sensitivity metric (with survived as the positive class) to minimize false negatives (passengers declared not survived who actually survived) and thereby minimize fatalities.

The Naive Bayes model accuracy is 0.7902, indicating that the classifier predicts the target variable correctly 79.02% of the time.
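Note that caret treated "0" as the positive class above. If survivors are the class of interest, the same evaluation can be rerun with "1" as the positive class; a minimal sketch, reusing preds_naive and data_test (the counts stay the same, only Sensitivity and Specificity swap roles):

# Same confusion matrix, but with "1" (survived) treated as the positive class
confusionMatrix(data = preds_naive,
                reference = data_test$Survived,
                positive = "1")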

4 Model Building : Decision Tree

4.1 Model Build

dt_model <- ctree(Survived ~ ., data = data_train)
plot(dt_model, type = "simple")

dt_model
## 
## Model formula:
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
## Fitted party:
## [1] root
## |   [2] Sex in female
## |   |   [3] Pclass in 1, 2: 1 (n = 124, err = 7.3%)
## |   |   [4] Pclass in 3: 0 (n = 77, err = 48.1%)
## |   [5] Sex in male
## |   |   [6] Pclass in 1
## |   |   |   [7] Age <= 36: 1 (n = 33, err = 42.4%)
## |   |   |   [8] Age > 36: 0 (n = 50, err = 26.0%)
## |   |   [9] Pclass in 2, 3
## |   |   |   [10] Age <= 3: 1 (n = 16, err = 25.0%)
## |   |   |   [11] Age > 3: 0 (n = 271, err = 12.9%)
## 
## Number of inner nodes:    5
## Number of terminal nodes: 6

The provided information represents a decision tree model (dt_model) that predicts the survival outcome (Survived) based on several predictor variables (Pclass, Sex, Age, SibSp, Parch, Fare, Embarked). Here is an explanation of the tree structure:

  • The root node is the starting point of the decision tree.

  • The first split is based on the variable Sex: female passengers go to node 2 and male passengers to node 5.

  • For females, the next split is based on the variable Pclass. If Pclass is 1 or 2, the predicted outcome is 1 (survived) in node 3 (n = 124, 7.3% error). If Pclass is 3, the predicted outcome is 0 (not survived) in node 4 (n = 77, 48.1% error).

  • For males in Pclass 1, the next split is based on the variable Age. If Age is less than or equal to 36, the predicted outcome is 1 (survived) in node 7 (n = 33, 42.4% error). If Age is greater than 36, the predicted outcome is 0 (not survived) in node 8 (n = 50, 26.0% error).

  • For males in Pclass 2 or 3, the split is also based on Age. If Age is less than or equal to 3, the predicted outcome is 1 (survived) in node 10 (n = 16, 25.0% error). If Age is greater than 3, the predicted outcome is 0 (not survived) in node 11 (n = 271, 12.9% error). The sketch below shows how to trace which terminal node each test passenger lands in.
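A minimal sketch, reusing dt_model and data_test from above; partykit's predict() with type = "node" returns the id of the terminal node that each observation falls into, so the splits listed above can be traced for individual passengers.

# Terminal node id for every test passenger
nodes <- predict(dt_model, newdata = data_test, type = "node")
table(nodes)   # how many test passengers land in each terminal node

# Pair the node id with the predicted class for the first few passengers
head(data.frame(node = nodes,
                pred = predict(dt_model, newdata = data_test)))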

4.2 Predict

pred_dt <- predict(dt_model, data_test)
(conf_matrix_dtree <- table(pred_dt, data_test$Survived))
##        
## pred_dt  0  1
##       0 79 20
##       1  5 39

True Negative (TN): The model predicted 0 (not survived) correctly, and the actual value is also 0. In this case, the count is 79.

False Positive (FP): The model predicted 1 (survived), but the actual value is 0 (not survived). In this case, the count is 5.

False Negative (FN): The model predicted 0 (not survived), but the actual value is 1 (survived). In this case, the count is 20.

True Positive (TP): The model predicted 1 (survived) correctly, and the actual value is also 1. In this case, the count is 39.

predict(dt_model, head(data_test), type="prob")
##             0         1
## 2  0.07258065 0.9274194
## 11 0.51948052 0.4805195
## 13 0.87084871 0.1291513
## 16 0.07258065 0.9274194
## 21 0.87084871 0.1291513
## 22 0.87084871 0.1291513

4.3 Model Evaluation

confusionMatrix(pred_dt, data_test$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 79 20
##          1  5 39
##                                           
##                Accuracy : 0.8252          
##                  95% CI : (0.7528, 0.8836)
##     No Information Rate : 0.5874          
##     P-Value [Acc > NIR] : 9.969e-10       
##                                           
##                   Kappa : 0.6251          
##                                           
##  Mcnemar's Test P-Value : 0.00511         
##                                           
##             Sensitivity : 0.9405          
##             Specificity : 0.6610          
##          Pos Pred Value : 0.7980          
##          Neg Pred Value : 0.8864          
##              Prevalence : 0.5874          
##          Detection Rate : 0.5524          
##    Detection Prevalence : 0.6923          
##       Balanced Accuracy : 0.8007          
##                                           
##        'Positive' Class : 0               
## 

The decision tree model accuracy is 0.8252, indicating that the classifier predicts the target variable correctly 82.52% of the time.

5 Model Building : Random Forest

5.1 K-fold Cross Validation & Model Build

K-fold cross-validation is a resampling technique used in machine learning and model evaluation. It is commonly used to assess the performance and generalizability of a predictive model.

The process involves splitting the available data into K equally sized subsets or “folds.” The model is then trained and evaluated K times, each time using a different fold as the validation set and the remaining folds as the training set. This allows for a more comprehensive evaluation of the model’s performance and helps to mitigate the impact of data variability and overfitting.
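To make the idea of folds concrete, here is a minimal illustration using caret's createFolds() (shown only for intuition; the actual tuning below is driven by trainControl()):

# Split the row indices of data_train into 5 roughly equal, stratified folds
set.seed(417)
folds <- createFolds(data_train$Survived, k = 5)
sapply(folds, length)   # each fold holds about one fifth of the training rows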

# set.seed(417)
# ctrl <- trainControl(method="repeatedcv", number=5, repeats=3) # k-fold cross validation
# model_rf <- train(Survived ~ ., 
#                   data= data_train, 
#                   method="rf", 
#                   trControl = ctrl)

# the training code above is commented out so that we do not re-run the time-consuming k-fold cross-validation

In summary, the Random Forest model achieved a cross-validated accuracy of approximately 84.03% and a kappa coefficient of around 0.680 when trained with mtry = 11. These metrics indicate the model's overall predictive performance on unseen data during cross-validation.

In this case, the selected model is the one with mtry = 11, which achieved the highest accuracy across the held-out cross-validation folds.

#saveRDS(model_rf, file = "model_rf.rds")

# model_rf trained with the k-fold cross-validation above was saved to model_rf.rds and is loaded here
model_rfload <- readRDS("model_rf.rds")
model_rfload
## Random Forest 
## 
## 678 samples
##   7 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 543, 543, 542, 542, 542, 543, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7866558  0.5733781
##   11    0.8402505  0.6804766
##   20    0.8333551  0.6666677
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.
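The tuning profile summarized above is stored inside the train object and can be inspected or plotted directly; a minimal sketch, assuming model_rfload as loaded above:

model_rfload$results   # accuracy and kappa for each candidate mtry
plot(model_rfload)     # accuracy as a function of mtry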

5.2 Variable Importance

varImp(model_rfload)
## rf variable importance
## 
##             Overall
## Sexmale   100.00000
## Age        95.30008
## Fare       90.04768
## Pclass3    26.36037
## SibSp1     10.79012
## Pclass2     8.78699
## Parch1      4.92440
## EmbarkedC   4.87462
## EmbarkedS   4.29197
## Parch2      3.36415
## SibSp3      3.04347
## SibSp4      1.89729
## EmbarkedQ   1.72423
## SibSp2      1.71779
## Parch5      0.74619
## SibSp5      0.47669
## Parch4      0.36510
## Parch3      0.31629
## Parch6      0.05464
## SibSp8      0.00000

These values represent the relative importance of each predictor variable in the model; higher values indicate greater importance in predicting the outcome variable. In this case, Sex (male), Age, and Fare are by far the most important variables, while Pclass3, SibSp1, Pclass2, Parch1, EmbarkedC, and EmbarkedS have relatively lower importance.
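For a quick visual check, the same importances can be plotted; a minimal sketch, assuming model_rfload as loaded above (the top argument limits the plot to the highest-ranked dummy-coded predictors):

plot(varImp(model_rfload), top = 10)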

5.3 Out of Bag Error

Strictly speaking, it is not necessary to perform cross-validation when using a random forest. Because each tree is built on a bootstrap sample of the training data, some observations are left out of that sample. These observations are referred to as out-of-bag (OOB) data and act as test data for the trees that did not see them: the model predicts them and calculates the resulting error, known as the out-of-bag error.

model_rfload$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 11
## 
##         OOB estimate of  error rate: 15.63%
## Confusion matrix:
##     0   1 class.error
## 0 287  51   0.1508876
## 1  55 285   0.1617647

In the model_rf model, the out-of-bag error estimate is 15.63%. Therefore, the accuracy of the model on the out-of-bag data is 100% - 15.63% = 84.37%.
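As a sanity check, the OOB error can also be read programmatically from the underlying randomForest object; a minimal sketch, reusing model_rfload from above (err.rate stores the cumulative OOB and per-class error after each additional tree):

oob <- model_rfload$finalModel$err.rate
tail(oob[, "OOB"], 1)       # OOB error after all 500 trees, about 0.156
1 - tail(oob[, "OOB"], 1)   # implied OOB accuracy, about 0.844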

5.4 Predict

pred_rf <- predict(object = model_rfload,
                   newdata = data_test,
                   type = "raw")
(conf_matrix_rf <- table(pred_rf, data_test$Survived))
##        
## pred_rf  0  1
##       0 79  6
##       1  5 53

True Negative (TN): The model predicted 0 (not survived) correctly, and the actual value is also 0. In this case, the count is 79.

False Positive (FP): The model predicted 1 (survived), but the actual value is 0 (not survived). In this case, the count is 5.

False Negative (FN): The model predicted 0 (not survived), but the actual value is 1 (survived). In this case, the count is 6.

True Positive (TP): The model predicted 1 (survived) correctly, and the actual value is also 1. In this case, the count is 53.

5.5 Model Evaluation

confusionMatrix(pred_rf, 
                data_test$Survived, 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 79  6
##          1  5 53
##                                          
##                Accuracy : 0.9231         
##                  95% CI : (0.8665, 0.961)
##     No Information Rate : 0.5874         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8409         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.8983         
##             Specificity : 0.9405         
##          Pos Pred Value : 0.9138         
##          Neg Pred Value : 0.9294         
##              Prevalence : 0.4126         
##          Detection Rate : 0.3706         
##    Detection Prevalence : 0.4056         
##       Balanced Accuracy : 0.9194         
##                                          
##        'Positive' Class : 1              
## 

The random forest model accuracy is 0.9231, indicating that the classifier predicts the target variable correctly 92.31% of the time.

6 Conclusion

As stated before, accuracy is used as the metric to compare model performance. We choose this metric because the data are historical, acquired in 1912; given the gap of more than a century, no rescue team can be deployed, so the only purpose of the model is to determine whether a passenger did or did not survive the Titanic shipwreck. If, hypothetically, we wanted to rescue passengers, we would instead prioritize the sensitivity metric to minimize false negatives (passengers declared not survived who actually survived) and thereby minimize fatalities.

  • The Naive Bayes model accuracy is 0.7902

  • The decision tree model accuracy is 0.8252

  • The random forest model accuracy is 0.9231

Therefore we can conclude that the random forest model is the best of the three at predicting the survival of Titanic passengers. An accuracy of 0.9231 means that the random forest model correctly predicted the Survived label (0 = no, 1 = yes) for 92.31% of the instances in the unseen test data, which reflects the overall correctness of its predictions. In other words, it has a high ability to correctly predict both classes on unseen data.