This report walks through building classification models using Naive Bayes, Decision Tree, and Random Forest supervised learning to predict the survivability of passengers in the infamous Titanic shipwreck of April 1912.
The data is obtained from https://www.kaggle.com/competitions/titanic/data
To build the models we will use the following libraries:
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gtools)
library(ggplot2)
library(class)
library(tidyr)
library(caret)
## Loading required package: lattice
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:gtools':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:gtools':
## 
##     permutations
library(partykit)
## Warning: package 'partykit' was built under R version 4.3.1
## Loading required package: grid
## Loading required package: libcoin
## Warning: package 'libcoin' was built under R version 4.3.1
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 4.3.1
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.1
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
titan_train <- read.csv("Tinanic/train.csv")
head(titan_train)
The variables are described as follows:
survival : survival status of the passenger, 0 = No, 1 = Yes
pclass : ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd
sex : passenger's gender
age : passenger's age in years
sibsp : number of siblings / spouses aboard the Titanic
parch : number of parents / children aboard the Titanic
ticket : ticket number
fare : passenger fare
cabin : cabin number
embarked : port of embarkation, key: C = Cherbourg, Q = Queenstown, S = Southampton
glimpse(titan_train)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
The train data has 891 rows and 12 columns. The Survived column, which indicates the survival status of a Titanic passenger, will be used as the target variable, while the other columns will be used as predictors.
Next, we need to convert the variables in the train data to the desired data types and remove the unrelated variables:
titan_train <- titan_train %>%
mutate(Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
Survived = as.factor(Survived),
Embarked = as.factor(Embarked),
Parch = as.factor(Parch),
SibSp = as.factor(SibSp)) %>%
  select(-PassengerId, -Name, -Ticket, -Cabin)
In case you didn't notice, the Cabin variable has many empty character values "", which is why we decided to remove it as well.
Missing Value
colSums(is.na(titan_train))
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0      177        0        0        0        0
Since the missing values make up only a small portion of the data set, we will remove the rows that contain them.
titan_train_clean <- na.omit(titan_train)
glimpse(titan_train_clean)
## Rows: 714
## Columns: 8
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1…
## $ Pclass <fct> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3, 1…
## $ Sex <fct> male, female, female, female, male, male, male, female, femal…
## $ Age <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2, …
## $ SibSp <fct> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0, 0…
## $ Parch <fct> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0, 0…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750, 1…
## $ Embarked <fct> S, C, S, S, S, S, S, S, C, S, S, S, S, S, S, Q, S, S, S, Q, S…
In this section, we will split the dataset into a train set and a test set. The train set will be used to train the models and the test set will be used to evaluate the performance of the trained models. 80% of the dataset will be used for training and the rest for testing.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
samplesize <- round(0.8 * nrow(titan_train_clean), 0)
index <- sample(seq_len(nrow(titan_train_clean)), size = samplesize)
data_train <- titan_train_clean[index, ]
data_test <- titan_train_clean[-index, ]
Before we build the model, we need to examine the class proportions in the target column of the train data.
prop.table(table(data_train$Survived))
## 
##         0         1 
## 0.5954466 0.4045534
We can see that the proportions of the positive and negative classes in the target are imbalanced, which can hurt the performance of the model. Because the training data is not large, we will use the upsampling method to balance the proportions rather than discard majority-class rows; note, however, that upsampling only duplicates existing rows, so no new information is added to the data:
RNGkind(sample.kind = "Rounding")
train_down <- upSample(
x = data_train %>% select(-Survived),
y = data_train$Survived,
yname = "Survived"
)
nrow(train_down)
## [1] 680
prop.table(table(train_down$Survived))
## 
##   0   1 
## 0.5 0.5
Now we have a balanced target proportion, with 680 rows upsampled from the 571 rows of the original train split.
model_naive <- naiveBayes(Survived ~ ., data = train_down)
model_naive
## 
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.5 0.5
##
## Conditional probabilities:
## Pclass
## Y 1 2 3
## 0 0.1588235 0.2264706 0.6147059
## 1 0.3852941 0.2764706 0.3382353
##
## Sex
## Y female male
## 0 0.1441176 0.8558824
## 1 0.6441176 0.3558824
##
## Age
## Y [,1] [,2]
## 0 30.68529 14.01846
## 1 26.61691 14.64662
##
## SibSp
## Y 0 1 2 3 4 5
## 0 0.700000000 0.208823529 0.026470588 0.020588235 0.035294118 0.008823529
## 1 0.597058824 0.341176471 0.032352941 0.017647059 0.011764706 0.000000000
## SibSp
## Y 8
## 0 0.000000000
## 1 0.000000000
##
## Parch
## Y 0 1 2 3 4 5
## 0 0.791176471 0.123529412 0.064705882 0.002941176 0.005882353 0.008823529
## 1 0.635294118 0.197058824 0.155882353 0.005882353 0.000000000 0.005882353
## Parch
## Y 6
## 0 0.002941176
## 1 0.000000000
##
## Fare
## Y [,1] [,2]
## 0 23.32182 32.04383
## 1 48.39665 68.26737
##
## Embarked
## Y            ""           C           Q           S
##   0 0.000000000 0.123529412 0.041176471 0.835294118
##   1 0.002941176 0.261764706 0.044117647 0.691176471
The output above is the fitted Naive Bayes classifier for discrete predictors. Here is an explanation of its components:
A-priori probabilities: These are the prior probabilities of the classes (0 and 1) in the target variable. After upsampling they are 0.5 for each class, indicating an equal prior probability for both classes.
Conditional probabilities: These are the conditional probabilities of each predictor value given each class. Each table shows, for a specific predictor, the probability of it taking certain values (e.g., Pclass 1, 2, or 3) given the class (0 or 1).
Pclass: The probabilities of each Pclass value given the class. For example, the probability of Pclass 1 given class 0 is 0.159, while the probability of Pclass 1 given class 1 is 0.385.
Sex: The probabilities of each Sex value given the class. For example, the probability of being female given class 0 is 0.144, while the probability of being female given class 1 is 0.644.
Age: Since Age is numeric, the table reports its mean (first column) and standard deviation (second column) for each class. For example, the mean age for class 0 is 30.69, while the mean age for class 1 is 26.62.
SibSp: The probabilities of each SibSp value given the class. For example, the probability of SibSp 0 given class 0 is 0.700, while the probability of SibSp 0 given class 1 is 0.597.
Parch: The probabilities of each Parch value given the class. For example, the probability of Parch 0 given class 0 is 0.791, while the probability of Parch 0 given class 1 is 0.635.
Fare: Like Age, the table reports the mean and standard deviation of Fare for each class.
Embarked: The probabilities of each Embarked value given the class; the first column corresponds to the empty-string level "" present in the raw data. For example, the probability of Embarked C given class 0 is 0.124, while the probability of Embarked C given class 1 is 0.262.
These conditional probabilities are combined with the a-priori probabilities via Bayes' rule to compute posterior probabilities and make predictions with the Naive Bayes classifier.
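To make the mechanics concrete, here is a minimal hand calculation of a posterior for a hypothetical passenger (female, Pclass 1), using only the priors and two of the conditional tables printed above and ignoring the other predictors for brevity:
# Posterior via Bayes' rule for a hypothetical passenger (Sex = female,
# Pclass = 1), using only the numbers printed in the model output above.
prior     <- c(`0` = 0.5,       `1` = 0.5)
p_female  <- c(`0` = 0.1441176, `1` = 0.6441176)  # P(Sex = female | class)
p_pclass1 <- c(`0` = 0.1588235, `1` = 0.3852941)  # P(Pclass = 1 | class)
unnorm <- prior * p_female * p_pclass1  # Bayes numerator for each class
unnorm / sum(unnorm)                    # posterior: class 1 comes out far more probable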
preds_naive <- predict(model_naive, newdata = data_test)
(confmatrix_naive <- table(preds_naive, data_test$Survived))
##            
## preds_naive  0  1
##           0 70 16
##           1 14 43
The confusion matrix (rows are predictions, columns are actual values) displays the following information:
True Negative (TN): The number of instances correctly predicted as negative (0) by the classifier. In this case, there are 70 true negatives.
False Negative (FN): The number of instances incorrectly predicted as negative (0) that were actually positive (1). In this case, there are 16 false negatives.
False Positive (FP): The number of instances incorrectly predicted as positive (1) that were actually negative (0). In this case, there are 14 false positives.
True Positive (TP): The number of instances correctly predicted as positive (1) by the classifier. In this case, there are 43 true positives.
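As a sanity check, the headline metrics reported by confusionMatrix() below can be derived by hand from these four counts (note that caret treats "0" as the positive class here, so its reported Sensitivity is the recall of class 0):
# Hand-derived metrics from the confusion matrix above, using the convention
# that 1 = positive as in the list of counts.
TN <- 70; FN <- 16; FP <- 14; TP <- 43
(TP + TN) / (TP + TN + FP + FN)  # accuracy: 113 / 143 = 0.7902
TN / (TN + FP)                   # recall of class "0": 70 / 84 = 0.8333
TP / (TP + FN)                   # recall of class "1": 43 / 59 = 0.7288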
confusionMatrix(confmatrix_naive)
## Confusion Matrix and Statistics
##
##
## preds_naive 0 1
## 0 70 16
## 1 14 43
##
## Accuracy : 0.7902
## 95% CI : (0.7143, 0.8538)
## No Information Rate : 0.5874
## P-Value [Acc > NIR] : 2.332e-07
##
## Kappa : 0.565
##
## Mcnemar's Test P-Value : 0.8551
##
## Sensitivity : 0.8333
## Specificity : 0.7288
## Pos Pred Value : 0.8140
## Neg Pred Value : 0.7544
## Prevalence : 0.5874
## Detection Rate : 0.4895
## Detection Prevalence : 0.6014
## Balanced Accuracy : 0.7811
##
## 'Positive' Class : 0
##
We will use accuracy as the metric to determine model performance. We choose this metric because the data is historical, acquired in 1912; given the century-long gap, no rescue team can be deployed, so the purpose of this model is simply to determine whether a passenger survived the Titanic shipwreck or not. However, if we wanted to rescue the passengers (were that possible), we would choose the sensitivity metric instead, to minimize false negatives (passengers declared not survived who actually survived) and thus minimize fatalities.
The Naive Bayes model accuracy is 0.7902, indicating that the classifier has a 79.02% accuracy in predicting the target variable.
dt_model <- ctree(Survived ~ ., data = data_train)
plot(dt_model, type = "simple")
dt_model
## 
## Model formula:
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
##
## Fitted party:
## [1] root
## | [2] Sex in female
## | | [3] Pclass in 1, 2: 1 (n = 124, err = 7.3%)
## | | [4] Pclass in 3: 0 (n = 77, err = 48.1%)
## | [5] Sex in male
## | | [6] Pclass in 1
## | | | [7] Age <= 36: 1 (n = 33, err = 42.4%)
## | | | [8] Age > 36: 0 (n = 50, err = 26.0%)
## | | [9] Pclass in 2, 3
## | | | [10] Age <= 3: 1 (n = 16, err = 25.0%)
## | | | [11] Age > 3: 0 (n = 271, err = 12.9%)
##
## Number of inner nodes: 5
## Number of terminal nodes: 6
The output above is the fitted decision tree (dt_model), which predicts the survival outcome (Survived) from the predictor variables (Pclass, Sex, Age, SibSp, Parch, Fare, Embarked). Here is an explanation of the tree structure:
The root node [1] is the starting point of the decision tree, and the first split is based on the variable Sex.
For females, the next split is based on Pclass. If Pclass is 1 or 2, the predicted outcome in terminal node [3] is 1 (survived; n = 124, error 7.3%). If Pclass is 3, the predicted outcome in terminal node [4] is 0 (not survived; n = 77, error 48.1%).
For males, the next split is also based on Pclass. For males in Pclass 1, the tree splits on Age: if Age <= 36 the predicted outcome in node [7] is 1 (survived; n = 33, error 42.4%), while if Age > 36 the predicted outcome in node [8] is 0 (not survived; n = 50, error 26.0%).
For males in Pclass 2 or 3, the tree again splits on Age: if Age <= 3 the predicted outcome in node [10] is 1 (survived; n = 16, error 25.0%), while if Age > 3 the predicted outcome in node [11] is 0 (not survived; n = 271, error 12.9%).
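As an illustration of how a new case traverses the tree, the following minimal sketch builds a hypothetical passenger (a 2-year-old boy in Pclass 3) who should land in terminal node [10] and be predicted as survived; the row is copied from data_test only so that the factor levels stay intact:
# Hypothetical passenger: male, Pclass 3, Age 2 -> male branch, Pclass 2/3,
# Age <= 3, i.e. terminal node [10], predicted 1 (survived).
toy        <- data_test[1, ]  # copy one row to inherit column types and levels
toy$Sex    <- factor("male", levels = levels(data_train$Sex))
toy$Pclass <- factor("3",    levels = levels(data_train$Pclass))
toy$Age    <- 2
predict(dt_model, newdata = toy)  # expected prediction: 1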
pred_dt <- predict(dt_model, data_test)
(conf_matrix_dtree <- table(pred_dt, data_test$Survived))
##        
## pred_dt  0  1
##       0 79 20
##       1  5 39
True Negative (TN): The model predicted 0 (not survived) and the actual value is also 0. In this case, the count is 79.
False Negative (FN): The model predicted 0 (not survived), but the actual value is 1 (survived). In this case, the count is 20.
False Positive (FP): The model predicted 1 (survived), but the actual value is 0 (not survived). In this case, the count is 5.
True Positive (TP): The model predicted 1 (survived) and the actual value is also 1. In this case, the count is 39.
predict(dt_model, head(data_test), type = "prob")
##             0         1
## 2 0.07258065 0.9274194
## 11 0.51948052 0.4805195
## 13 0.87084871 0.1291513
## 16 0.07258065 0.9274194
## 21 0.87084871 0.1291513
## 22 0.87084871 0.1291513
confusionMatrix(pred_dt, data_test$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 79 20
## 1 5 39
##
## Accuracy : 0.8252
## 95% CI : (0.7528, 0.8836)
## No Information Rate : 0.5874
## P-Value [Acc > NIR] : 9.969e-10
##
## Kappa : 0.6251
##
## Mcnemar's Test P-Value : 0.00511
##
## Sensitivity : 0.9405
## Specificity : 0.6610
## Pos Pred Value : 0.7980
## Neg Pred Value : 0.8864
## Prevalence : 0.5874
## Detection Rate : 0.5524
## Detection Prevalence : 0.6923
## Balanced Accuracy : 0.8007
##
## 'Positive' Class : 0
##
The decision tree model accuracy is 0.8252, indicating that the classifier has an 82.52% accuracy in predicting the target variable.
K-fold cross-validation is a resampling technique used in machine learning and model evaluation. It is commonly used to assess the performance and generalizability of a predictive model.
The process involves splitting the available data into K equally sized subsets or “folds.” The model is then trained and evaluated K times, each time using a different fold as the validation set and the remaining folds as the training set. This allows for a more comprehensive evaluation of the model’s performance and helps to mitigate the impact of data variability and overfitting.
# set.seed(417)
# ctrl <- trainControl(method="repeatedcv", number=5, repeats=3) # k-fold cross validation
# model_rf <- train(Survived ~ .,
# data= data_train,
# method="rf",
# trControl = ctrl)
# adding comments to the code so we won't generate another k-fold cross-validation run
In summary, the Random Forest model achieved a cross-validated accuracy of approximately 84.03% and a kappa coefficient of around 0.680 when trained with mtry = 11. These metrics indicate the model's overall predictive performance on unseen data during cross-validation.
In this case, the selected model is the one with mtry = 11, which achieved the highest average accuracy across the cross-validation resamples (the folds held out from the data used to build the trees).
# saveRDS(model_rf, file = "model_rf.rds")
# model_rf with the k-fold cross-validation above has been saved to model_rf.rds and will be loaded
model_rfload <- readRDS("model_rf.rds")
model_rfload
## Random Forest
##
## 678 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 543, 543, 542, 542, 542, 543, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7866558 0.5733781
## 11 0.8402505 0.6804766
## 20 0.8333551 0.6666677
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 11.
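The per-mtry results shown above are also stored on the caret train object, so the selection can be inspected programmatically:
# Resampling results for each mtry tried, and the winning tuning value.
model_rfload$results   # accuracy / kappa per mtry
model_rfload$bestTune  # the chosen value, mtry = 11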
varImp(model_rfload)
## rf variable importance
##
## Overall
## Sexmale 100.00000
## Age 95.30008
## Fare 90.04768
## Pclass3 26.36037
## SibSp1 10.79012
## Pclass2 8.78699
## Parch1 4.92440
## EmbarkedC 4.87462
## EmbarkedS 4.29197
## Parch2 3.36415
## SibSp3 3.04347
## SibSp4 1.89729
## EmbarkedQ 1.72423
## SibSp2 1.71779
## Parch5 0.74619
## SibSp5 0.47669
## Parch4 0.36510
## Parch3 0.31629
## Parch6 0.05464
## SibSp8 0.00000
These values represent the relative importance of each predictor variable in the model; higher values indicate greater importance in predicting the outcome variable. In this case, Sex (male), Age, and Fare are by far the most important variables, while Pclass3, SibSp1, Pclass2, EmbarkedC, Parch1, Parch2, and EmbarkedS have relatively lower importance.
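The ranking can also be inspected visually, since caret provides a plot method for varImp objects:
# Dot plot of the top predictors by importance, as a quick visual check of
# the table above.
plot(varImp(model_rfload), top = 10)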
Strictly speaking, it is not necessary to perform cross-validation when using a random forest. From the bootstrap sampling used to build each tree, some data points are left out of the sample; these are referred to as out-of-bag data and are treated as test data by the model. The model makes predictions on these points and calculates the resulting error, known as the out-of-bag error.
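For a quick look at how the OOB error behaves as trees are added, the per-tree error trajectory is stored in the fitted forest (a minimal sketch using the final model that is extracted below):
# err.rate has one row per tree, with an "OOB" column plus one column per class.
oob_curve <- model_rfload$finalModel$err.rate[, "OOB"]
plot(oob_curve, type = "l", xlab = "Number of trees", ylab = "OOB error rate")
tail(oob_curve, 1)  # should be close to the 15.63% reported below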
model_rfload$finalModel
## 
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 11
##
## OOB estimate of error rate: 15.63%
## Confusion matrix:
## 0 1 class.error
## 0 287 51 0.1508876
## 1 55 285 0.1617647
In the model_rf model, the out-of-bag error estimate is 15.63%. Therefore, the accuracy of the model on its test data (the out-of-bag data) is 100% - 15.63% = 84.37%.
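As a quick sanity check, this error rate is simply the off-diagonal share of the OOB confusion matrix printed above:
# OOB error = misclassified OOB cases / all OOB cases.
(51 + 55) / (287 + 51 + 55 + 285)  # 106 / 678 = 0.1563, i.e. 15.63%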
pred_rf <- predict(object = model_rfload,
newdata = data_test,
type = "raw")
(conf_matrix_rf <- table(pred_rf, data_test$Survived))
##        
## pred_rf  0  1
##       0 79  6
##       1  5 53
True Negative (TN): The model predicted 0 (not survived) and the actual value is also 0. In this case, the count is 79.
False Negative (FN): The model predicted 0 (not survived), but the actual value is 1 (survived). In this case, the count is 6.
False Positive (FP): The model predicted 1 (survived), but the actual value is 0 (not survived). In this case, the count is 5.
True Positive (TP): The model predicted 1 (survived) and the actual value is also 1. In this case, the count is 53.
confusionMatrix(pred_rf,
data_test$Survived,
positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 79 6
## 1 5 53
##
## Accuracy : 0.9231
## 95% CI : (0.8665, 0.961)
## No Information Rate : 0.5874
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8409
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8983
## Specificity : 0.9405
## Pos Pred Value : 0.9138
## Neg Pred Value : 0.9294
## Prevalence : 0.4126
## Detection Rate : 0.3706
## Detection Prevalence : 0.4056
## Balanced Accuracy : 0.9194
##
## 'Positive' Class : 1
##
The random forest model accuracy is 0.9231, indicating that the classifier has a 92.31% accuracy in predicting the target variable.
As stated before, accuracy will be used as the metric to determine model performance. We choose this metric because the data is historical, acquired in 1912; given the century-long gap, no rescue team can be deployed, so the purpose of this model is simply to determine whether a passenger survived the Titanic shipwreck or not. However, if we wanted to rescue the passengers (were that possible), we would choose the sensitivity metric instead, to minimize false negatives (passengers declared not survived who actually survived) and thus minimize fatalities.
The Naive Bayes model accuracy is 0.7902
The decision tree model accuracy is 0.8252
The random forest model accuracy is 0.9231
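For convenience, the three results can be collected into a single comparison table (the accuracy values are taken from the confusionMatrix() outputs above):
# Side-by-side comparison of test-set accuracy for the three models.
model_comparison <- data.frame(
  model    = c("Naive Bayes", "Decision Tree", "Random Forest"),
  accuracy = c(0.7902, 0.8252, 0.9231)
)
model_comparison[order(-model_comparison$accuracy), ]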
Therefore we can conclude that the random forest model is the best of the three for predicting the survivability of Titanic passengers. An accuracy of 0.9231 means that the random forest correctly predicted the Survived label (0 = no, 1 = yes) for 92.31% of the instances in the unseen test data, which indicates the overall correctness of the model's predictions. In other words, it has a high ability to correctly predict both classes on unseen data.