# chunk options
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE
)

1 Introduction

This is an assignment for LBB Classification Data Science Algorithms:

  • LBB Classification I (Logistic Regression vs KNN)
  • LBB Classification II (Naive Bayes, Decision Tree & Random Forest)

The sinking of the Titanic shocked the world, and news of the shipwreck spread rapidly. The Titanic was one of the largest passenger ships of its time, and the accident caused the loss of many lives. Data on the passengers who survived and those who did not were compiled to study how factors such as age and gender affected the likelihood of survival. The goal here is to predict whether an individual would survive a ship accident. Such a model could help inform safety standards on passenger ships, for example by guiding the placement of safety equipment with particular genders or age groups in mind.

This report conducts a classification analysis of the Titanic dataset to predict passenger survival using five popular models: Logistic Regression, K-Nearest Neighbors (KNN), Naive Bayes, Decision Tree, and Random Forest. The steps covered are data preprocessing, feature selection, model development, performance evaluation, and interpretation of results, along with comparisons between the models.

  • Target: Survival status (survived).
  • Predictors: All variables except survived.

The dataset can be found at https://www.kaggle.com/c/titanic. It contains biodata from Titanic passengers, and based on this biodata, we aim to predict whether a passenger survived or not.

1.1 Data Understanding

Before diving into the project, we’ll need to import the required libraries to facilitate our analysis.

library(dplyr)        # data preprocessing
library(class)        # KNN models
library(caret)        # confusion matrix, preprocessing, and scaling
library(ggplot2)      # visualization
library(ggthemes)     # ggplot themes
library(tidyverse)    # data processing and visualization
library(car)          # variance inflation factor (VIF)
library(randomForest) # random forest models
library(partykit)     # decision tree models
library(e1071)        # naive Bayes and other statistical learning functions

Next, we import the data:

data <- read.csv("datasets/train.csv")
glimpse(data)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

Metadata:

  • PassengerId: Unique identification number for each passenger.

  • Survived: Indicates whether the passenger survived (1) or did not survive (0).

  • Pclass: Ticket class of the passenger, with values 1, 2, or 3, reflecting the social-economic class (1: First class, 2: Second class, 3: Third class).

  • Name: Full name of the passenger.

  • Sex: Gender of the passenger (male or female).

  • Age: Age of the passenger in years.

  • SibSp: Number of siblings or spouses accompanying the passenger on board.

  • Parch: Number of parents or children accompanying the passenger on board.

  • Ticket: Ticket number of the passenger.

  • Fare: Fare paid by the passenger.

  • Cabin: Cabin number of the passenger.

  • Embarked: Port where the passenger embarked the ship (C: Cherbourg, Q: Queenstown, S: Southampton).

1.1.1 Data Dictionary

Variable   Definition                                    Key
survival   Survival                                      0 = No, 1 = Yes
pclass     Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex
Age        Age in years
sibsp      # of siblings / spouses aboard the Titanic
parch      # of parents / children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton

1.1.2 Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

2 EDA & Data Wrangling

Before proceeding further, we need to examine the condition of our data and clean it to convert the raw dataset into a usable form that is more relevant to our analysis later.

Change data types: several variables are stored with inappropriate data types and need to be converted to factors.

data <- data %>%
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         Embarked = as.factor(Embarked))

head(data)
summary(data)
##   PassengerId    Survived Pclass      Name               Sex     
##  Min.   :  1.0   0:549    1:216   Length:891         female:314  
##  1st Qu.:223.5   1:342    2:184   Class :character   male  :577  
##  Median :446.0            3:491   Mode  :character               
##  Mean   :446.0                                                   
##  3rd Qu.:668.5                                                   
##  Max.   :891.0                                                   
##                                                                  
##       Age            SibSp           Parch           Ticket         
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##  NA's   :177                                                        
##       Fare           Cabin           Embarked
##  Min.   :  0.00   Length:891          :  2   
##  1st Qu.:  7.91   Class :character   C:168   
##  Median : 14.45   Mode  :character   Q: 77   
##  Mean   : 32.20                      S:644   
##  3rd Qu.: 31.00                              
##  Max.   :512.33                              
## 

Insights Obtained:

  • There are 549 passengers who did not survive (0) and 342 passengers who survived (1).

  • There are 216 passengers in class 1, 184 passengers in class 2, and 491 passengers in class 3.

  • There are 577 male passengers and 314 female passengers.

  • The age range of passengers spans from 0.42 years to 80 years. The average age of passengers is 29.7 years.

  • The majority of passengers do not have any siblings or spouses accompanying them (0), followed by 209 passengers with 1 sibling or spouse.

  • The majority of passengers do not have any parents or children accompanying them (0), followed by 118 passengers with 1 parent or child.

  • The ticket fares range from 0 to 512.33. The average ticket fare is 32.2.

  • The majority of passengers (644) boarded the ship from Southampton port (S), followed by 168 passengers from Cherbourg port (C), and 77 passengers from Queenstown port (Q).

prop.table(table(data$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

2.1 Check Missing Values

The first step is to look for missing values. Handling missing data is crucial because any analytical results based on a dataset with missing values could be biased.

After that, we will handle the missing values in the Age column by filling the missing values with the mean age. This is done to make the dataset more complete and eliminate any missing values that may disrupt the analysis process.

data$Embarked[data$Embarked == ""] <- NA
data$Cabin[data$Cabin == ""] <- NA
colSums(is.na(data))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

Three columns have missing values: Age (177 of 891 rows), Cabin (687 of 891 rows), and Embarked (2 of 891 rows). In the raw data, missing Embarked and Cabin entries appear as empty strings (""), which the code above converts to NA. An empty Cabin value indicates that the passenger's cabin was not specified or known during data collection.
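Raw counts are easier to compare when expressed as a share of all rows. The following is a minimal sketch on a small hypothetical data frame (`toy` is an illustrative name, not part of the Titanic data); the same expression could be applied to `data`.

```r
# Hypothetical toy data frame standing in for the Titanic data
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c(NA, "C85", NA, NA),
  Embarked = c("S", "C", NA, "S")
)

# Percentage of missing values per column
missing_pct <- round(colMeans(is.na(toy)) * 100, 1)
missing_pct
```

For the Titanic counts reported above, this works out to roughly 19.9% for Age (177/891), 77.1% for Cabin (687/891), and 0.2% for Embarked (2/891).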

For this treatment, we will perform the following:

  • Impute the missing values in the Age column with the mean age and in the Embarked column with the mode value.
  • Drop the columns Name, PassengerId, Ticket, and Cabin, as they are not needed in the modeling process.

# Filling missing values in the `Age` column with the mean age
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm=TRUE)

# Filling missing values in the `Embarked` column with the mode value
calc_mode <- function(x){
  
  # List the distinct / unique values
  distinct_values <- unique(x)
  
  # Count the occurrence of each distinct value
  distinct_tabulate <- tabulate(match(x, distinct_values))
  
  # Return the value with the highest occurrence
  distinct_values[which.max(distinct_tabulate)]
}

data <- data %>% 
  mutate(Embarked = if_else(is.na(Embarked), 
                         calc_mode(Embarked), 
                         Embarked))
colSums(is.na(data))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           0

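After imputation, Age and Embarked no longer contain missing values. The 687 missing values in Cabin remain, but this column is dropped in the next step.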
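To make the mode imputation concrete, here is a small sketch that applies the same `calc_mode()` helper to a hypothetical vector of port codes (`ports` and `ports_filled` are illustrative names). The helper is repeated only so that the example runs on its own.

```r
# Same helper as above, repeated so the example is self-contained
calc_mode <- function(x) {
  distinct_values <- unique(x)
  distinct_tabulate <- tabulate(match(x, distinct_values))
  distinct_values[which.max(distinct_tabulate)]
}

# Hypothetical port codes with one missing entry
ports <- c("S", "C", "S", NA, "Q", "S")

# Replace the missing entry with the most frequent value
ports_filled <- ifelse(is.na(ports), calc_mode(ports), ports)
ports_filled
```

One caveat of this helper: NA is counted like any other value, so if NA were the most frequent entry, `calc_mode()` would return NA. That is not the case for Embarked, where only 2 values are missing.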

We will now drop several variables that are not needed, such as Name, PassengerId, Ticket, and Cabin.

data_2 <- data
data <- data %>%
  select(-c(Name, PassengerId, Ticket, Cabin))

2.2 Data Visualization

Feature Engineering

data_2 <- data_2 %>% 
  mutate(Survived = ifelse(Survived == 1, "Survived", "Dead")) %>% 
  mutate(Pclass = sapply(as.character(Pclass), switch,
                         "1" = "1st Class",
                         "2" = "2nd Class",
                         "3" = "3rd Class")) %>% 
  mutate(Embarked = sapply(as.character(Embarked), switch, 
                           "C" = "Cherbourg",
                           "Q" = "Queenstown",
                           "S" = "Southampton"))
# Convert the columns to factors
data_2 <- data_2 %>% 
  mutate_at(vars(Pclass,Sex,Embarked,Survived), as.factor)

data_2

To obtain more specific information about Age and Sex, feature engineering can be performed by extracting titles from the Name variable. The gsub function can be used in this case, and the titles will be categorized into 6 different groups. These groups include: Mr, Mrs, Master, Miss, Honorific Titles, and Officers.

# Extract the title from Name using gsub
data_2$Title <- gsub("(.*\\,|\\..*)", "", data_2$Name) %>%    
  gsub("[[:space:]]", "", .)

# Consolidate rare titles into broader groups
data_2$Title[data_2$Title %in% c("Don", "Sir")] <- "Mr"
data_2$Title[data_2$Title %in% c("Ms", "Mme", "Mlle", "Lady", "Dona", "theCountess")] <- "Mrs"
data_2$Title[data_2$Title %in% c("Jonkheer", "Dr")] <- "Honorific Titles"
data_2$Title[data_2$Title %in% c("Capt", "Col", "Major", "Rev")] <- "Officers"
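The two `gsub()` calls can be traced on a few strings. The sketch below uses hypothetical names written in the same "Surname, Title. Given names" layout as the Name column; `sample_names` and `titles` are illustrative and not part of the dataset.

```r
# Hypothetical names in the same layout as the Name column
sample_names <- c(
  "Doe, Mr. John",
  "Doe, Mrs. Jane (Mary Smith)",
  "Roe, the Countess. of (Ann)"
)

# Step 1: delete everything up to the comma and everything from the first period
titles <- gsub("(.*\\,|\\..*)", "", sample_names)

# Step 2: delete the remaining whitespace
titles <- gsub("[[:space:]]", "", titles)
titles
```

Because all whitespace is removed in the second step, a multi-word title collapses into a single token, which is why the grouping code matches on "theCountess".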

Grouping the Age variable into several categories.

Ages <- function(x){
  # Half-open intervals, so that non-integer ages such as the
  # imputed mean (29.7) always fall into a bin
  if (x < 20) {
    "< 20"
  } else if (x < 30) {
    "20-29"
  } else if (x < 40) {
    "30-39"
  } else if (x < 50) {
    "40-49"
  } else if (x < 60) {
    "50-59"
  } else if (x < 70) {
    "60-69"
  } else if (x < 80) {
    "70-79"
  } else {
    ">= 80"
  }
}

data_2$Ages <- as.factor(sapply(data_2$Age, Ages))
summary(data_2)
##   PassengerId        Survived         Pclass        Name               Sex     
##  Min.   :  1.0   Dead    :549   1st Class:216   Length:891         female:314  
##  1st Qu.:223.5   Survived:342   2nd Class:184   Class :character   male  :577  
##  Median :446.0                  3rd Class:491   Mode  :character               
##  Mean   :446.0                                                                 
##  3rd Qu.:668.5                                                                 
##  Max.   :891.0                                                                 
##                                                                                
##       Age            SibSp           Parch           Ticket         
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:22.00   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :29.70   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:35.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##                                                                     
##       Fare           Cabin                  Embarked      Title          
##  Min.   :  0.00   Length:891         Cherbourg  :168   Length:891        
##  1st Qu.:  7.91   Class :character   Queenstown : 77   Class :character  
##  Median : 14.45   Mode  :character   Southampton:646   Mode  :character  
##  Mean   : 32.20                                                          
##  3rd Qu.: 31.00                                                          
##  Max.   :512.33                                                          
##                                                                          
##       Ages    
##  20-29  :397  
##  30-39  :167  
##  < 20   :164  
##  40-49  : 89  
##  50-59  : 48  
##  60-69  : 19  
##  (Other):  7
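Base R's `cut()` offers a vectorized way to bin a continuous variable without calling a function once per row. The following is a minimal sketch on hypothetical ages, assuming half-open bins so that non-integer values such as 29.7 always land in a category; `ages` and `age_group` are illustrative names.

```r
# Hypothetical ages, including a non-integer value and the maximum of 80
ages <- c(0.42, 19, 20, 29.7, 30, 65, 80)

# right = FALSE makes every bin half-open: [20, 30) contains 20 but not 30
age_group <- cut(
  ages,
  breaks = c(-Inf, 20, 30, 40, 50, 60, 70, 80, Inf),
  labels = c("< 20", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", ">= 80"),
  right  = FALSE
)
table(age_group)
```

A side effect of `cut()` is that the factor levels follow the order of the breaks, which keeps the age groups in their natural order in tables and plots.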

Interpretation:

  1. The Max value of 891 in PassengerId corresponds to the number of passengers in this dataset (891), which is only a subset of everyone on board the Titanic.

  2. Survived (1) indicates that 342 passengers survived the Titanic shipwreck.

  3. The Sex of Titanic passengers consists of 577 males and 314 females.

  4. The Age of Titanic passengers ranges from the youngest being 0.42 years old to the oldest being 80 years old.

  5. The highest fare paid by a passenger is 512.33.

  6. The most common port of embarkation in the Embarked column is S (Southampton) with 646 passengers, followed by C (Cherbourg) with 168 passengers and Q (Queenstown) with 77 passengers.

  7. The age range of passengers (Ages) is as follows:

  • < 20 years: 164 people
  • 20-29 years: 397 people
  • 30-39 years: 167 people
  • 40-49 years: 89 people
  • 50-59 years: 48 people
  • 60-69 years: 19 people
  • 70 years and older: 7 people

# Filter Survived and Dead in the `Survived` variable
data_2.clean <- data_2[data_2$Survived %in% c("Survived", "Dead"),]

# Subset data for passengers who survived
data_2.clean.survived <- data_2.clean[data_2.clean$Survived == "Survived",]

# Subset data for passengers who did not survive
data_2.clean.ntsurvived <- data_2.clean[data_2.clean$Survived == "Dead",]

2.2.0.1 What is the overall ratio of the Survived category?

agg_tbl <- data_2.clean %>% group_by(Survived) %>% 
  summarise(total_count=n(),
            .groups = 'drop')

ggplot(agg_tbl, aes(x=Survived, y=total_count, fill=Survived))+
  geom_bar(stat="identity")+
  geom_text(aes(label=total_count), vjust=-0.3, size=3.5) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Comparison of Survived Categories of Passengers",
       x = "Category",
       y = "Total Passengers")

Overall, more passengers did not survive (549) than survived (342).

2.2.0.2 Are there more males than females in each Pclass?

ggplot(data = data_2, aes(x=Pclass, fill=Sex)) +
  geom_bar(position = "dodge", width = 0.4) +
  labs(title = "Comparison of Sex in each Pclass",
       x = "Pclass",
       y = "Total Passengers")

There are more males in all Pclass. However, the male-to-female ratio is significantly higher in Pclass 3.

2.2.0.3 What is the survival rate for each gender?

ggplot(data = data_2.clean, aes(x=Sex, fill= Survived)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Comparison of Survived in each Gender",
       x = "Sex",
       y = "Total Passengers")

  • For females, more passengers survived.

  • For males, significantly more passengers did not survive (deaths).

  • There are more passengers who did not survive than those who survived.

2.2.0.4 What is the survival rate for each Pclass? Were those who paid more prioritized for survival?

ggplot(data = data_2.clean, aes(x=Pclass, fill= Survived)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Comparison of Survived in each PClass",
       x = "PClass",
       y = "Total Passengers")

Yes, it turns out that those in Pclass 1 have a much better chance of survival compared to others.

2.2.0.5 How do survivors and deaths break down by passenger class and gender?

# geom_bar() counts the rows in each class, so no y aesthetic is needed
ggplot(data = data_2.clean.survived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Sex), show.legend = TRUE) +
  labs(title = "Survived Passengers by Passenger Class",
       x = "Passenger Class",
       y = "Survived",
       caption = "Titanic")

Among the survivors, female passengers outnumber male passengers in every passenger class.

# geom_bar() counts the rows in each class, so no y aesthetic is needed
ggplot(data = data_2.clean.ntsurvived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Sex), show.legend = TRUE) +
  labs(title = "Deceased Passengers by Passenger Class",
       x = "Passenger Class",
       y = "Deceased",
       caption = "Titanic")

Among the passengers who did not survive, males outnumber females in every passenger class.

2.2.0.6 What is the survival rate for each title?

ggplot(data = data_2.clean, aes(x=Title, fill=Survived)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Comparison of Survived in each Title Name",
       x = "Title",
       y = "Total Passengers")

data_2.clean.title <- as.data.frame(table(data_2.clean$Survived, data_2.clean$Title))

data_2.clean.title
# Only one fill scale is kept; position = "fill" shows proportions per title
ggplot(data_2.clean.title, aes(x=Var2, y=Freq, fill= Var1)) +
  geom_col(position="fill") + 
  ggthemes::theme_economist() + 
  scale_fill_brewer(palette = "Set1") +
  geom_text(aes(label = Freq), position = position_fill(vjust = .5), col = "white") +
  coord_flip() +
  labs(fill = "Survived",
       title = "Comparison of Survival in each Title Name",
       x = "Title",
       y = "Proportion of Passengers")

  • Male passengers with the title Mr have the worst survival rate.

  • Female passengers with the title Mrs have a better survival rate than males.

  • Among passengers with Honorific Titles and Officers, more died than survived.

  • The title Master is used in English for young boys who are too young to be called Mister, and it seems that children have a better chance of survival.

2.2.0.7 Women have a better survival rate, but how about when we look at it based on their Passenger Class (Pclass), is their survival rate still higher?

ggplot(data = data_2.clean, aes(x=Sex, fill=Survived)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  facet_wrap(~Pclass, nrow = 3, scales = "free_y") +
  labs(title = "Comparison of Survival in each Sex and PClass",
       x = "Sex",
       y = "Total Passengers")

  • In Pclass 1, women have a very high survival rate, while most men did not survive.

  • In Pclass 2, the pattern is almost the same as in Pclass 1; only the numbers differ.

  • In Pclass 3, the survival rate for women is close to 50/50, while men still have a much higher death rate.
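Bar heights show counts, while statements about survival rates refer to proportions within each group. A minimal sketch of how such rates could be computed with `prop.table()`, using hypothetical toy records rather than the Titanic data (`toy` and `rates` are illustrative names):

```r
# Hypothetical toy records (not the Titanic data)
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Survived = c("Survived", "Survived", "Dead", "Dead", "Dead", "Dead", "Survived", "Dead")
)

# margin = 1 normalizes each row, so the proportions within each Sex sum to 1
rates <- prop.table(table(toy$Sex, toy$Survived), margin = 1)
round(rates, 2)
```

Applied to the real data, a three-way `table()` of Sex, Pclass, and Survived with `margin = c(1, 2)` would give the survival rate within each combination of sex and class.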

2.2.0.8 Does having family members on board affect the survival of passengers?

data_2.clean$Family <- data_2.clean$SibSp + data_2.clean$Parch + 1
ggplot(data = data_2.clean, aes(x=Family, fill=Survived)) +
  geom_bar(stat = "count", position = "dodge") +
  scale_x_continuous(breaks = c(1:11)) +
  scale_fill_brewer(palette = "Set1") +
  ylim(c(0, 400)) +
  labs(title = "Comparison of Survival by Family Size",
       x = "Family Size",
       y = "Total Passengers") +
  theme_bw()

  • Passengers without family have a lower chance of survival.
  • Families with smaller size (2-4 members) have a better chance of survival.
  • However, large families (5 or more members) do not have a better chance of survival.

2.2.0.9 How is the distribution of survived and non-survived passengers based on age and Pclass?

# Filter Pclass 1st Class
data_2.pclass1 <- data_2.clean[data_2.clean$Pclass %in% "1st Class", ]
# Plot of Passengers with 1st Class Ticket
p1 <- ggplot(data = data_2.pclass1, aes(x = Fare, y = Age)) +
  geom_point(colour = "yellow", shape = 21, size = 3, aes(fill = Survived)) +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  labs(title = "Passengers with 1st Class Ticket")

# Filter Pclass 2nd Class
data_2.pclass2 <- data_2.clean[data_2.clean$Pclass %in% "2nd Class", ]

# Plot of Passengers with 2nd Class Ticket
p2 <- ggplot(data = data_2.pclass2, aes(x = Fare, y = Age)) +
  geom_point(colour = "yellow", shape = 21, size = 3, aes(fill = Survived)) +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  labs(title = "Passengers with 2nd Class Ticket")

# Filter Pclass 3rd Class
data_2.pclass3 <- data_2.clean[data_2.clean$Pclass %in% "3rd Class", ]

# Plot of Passengers with 3rd Class Ticket
p3 <- ggplot(data = data_2.pclass3, aes(x = Fare, y = Age)) +
  geom_point(colour = "yellow", shape = 21, size = 3, aes(fill = Survived)) +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  labs(title = "Passengers with 3rd Class Ticket")

library(gridExtra) # arrange multiple ggplot objects in one figure
grid.arrange(p1, p2, p3, ncol = 1)

2.2.0.10 What is the proportion of passenger survival based on their age range?

data_2.clean.prop <- as.data.frame(table(data_2.clean$Survived,data_2.clean$Ages))
data_2.clean.prop
ggplot(data = data_2.clean.prop, mapping = aes(x = Freq, y = reorder(Var2, Freq))) +
  geom_col(mapping = aes(fill = Var1), position = "stack") +
  labs(x = "Number of Passengers",
       y = NULL,
       fill = NULL,
       title = "Passenger Survival on Titanic Ship",
       subtitle = "Based on Age Range") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  theme(legend.position = "top")

The 20-29 age range has the largest number of survivors, but it also has by far the largest number of deaths, which reflects the size of this group rather than a higher survival rate.

2.2.0.11 What is the proportion of passenger survival based on their Embarked port?

data_2.clean.emb <- as.data.frame(table(data_2.clean$Survived, data_2.clean$Embarked))
data_2.clean.emb
# Only one fill scale is kept; position = "fill" shows proportions per port
ggplot(data_2.clean.emb, aes(x=Var2, y=Freq, fill= Var1)) +
  geom_col(position = "fill") +
  ggthemes::theme_economist() +
  scale_fill_brewer(palette = "Set1") +
  geom_text(aes(label = Freq), position = position_fill(vjust = .5), col = "white") +
  labs(fill = "Survived",
       title = "Comparison of Survival in each Embarked Port",
       x = "Embarked",
       y = "Proportion of Passengers")

  • For Embarked Cherbourg, there were 93 passengers who survived and 75 passengers who did not survive.
  • For Embarked Queenstown, there were 30 passengers who survived and 47 passengers who did not survive.
  • For Embarked Southampton, there were 219 passengers who survived (including the two passengers whose port was imputed) and 427 passengers who did not survive.

# geom_bar() counts the rows in each class, so no y aesthetic is needed
ggplot(data = data_2.clean.survived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Embarked), show.legend = TRUE) +
  labs(title = "Survival by Passenger Class",
       x = "Passenger Class",
       y = "Survived",
       caption = "Titanic")

The majority of passengers who survived embarked from Southampton (S).

Before proceeding with the modeling, it is essential to pay attention to the proportion of the target variable, which is “Survived”.

agg_tbl <- data_2.clean %>% group_by(Survived) %>% 
  summarise(total_count=n(),
            .groups = 'drop')

ggplot(agg_tbl, aes(x=Survived, y=total_count, fill=Survived))+
  geom_bar(stat="identity")+
  geom_text(aes(label=total_count), vjust=-0.3, size=3.5) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Comparison of Survival Categories of Passengers",
       x = "Category",
       y = "Total Passengers")

prop.table(table(data$Survived))
## 
##         0         1 
## 0.6161616 0.3838384
table(data$Survived)
## 
##   0   1 
## 549 342

Conclusion

  • The number of passengers who did not survive is greater than the number of passengers who survived in general. Among females, more passengers survived, while among males, significantly more passengers did not survive.

  • Among the three Pclasses, it is found that those in Pclass 1 have a much higher chance of survival compared to the others.

  • When comparing with the Sex variable, there are more males in all Pclasses. However, the male-to-female ratio is much higher in Pclass 3. Female passengers have a higher survival rate than male passengers in all Pclass categories. Male passengers have a higher mortality rate in all Pclass categories.

  • In the comparison of survival based on Title Name, males with the Title “Mr” have the worst survival rate, while females with the Title “Mrs” have a better survival rate than males. Among passengers with Honorific Titles and Officers, more died than survived. However, for the Title “Master,” which is used in English for young boys who are too young to be called “Mister,” it seems that children have a better chance of survival.

  • In the comparison of survival based on Family Size, passengers without a family have a lower chance of survival. Families with a smaller size (2-4 members) have a better chance of survival. However, families with a larger size (5 or more members) do not have a better chance of survival.

  • The 20-29 age range has the largest number of survivors, but it also has by far the largest number of deaths, which reflects the size of this group rather than a higher survival rate.

  • For the “Embarked” variable, passengers who embarked from Cherbourg have a relatively higher ratio of survival with 93 passengers survived and 75 passengers deceased. On the other hand, passengers who embarked from Queenstown and Southampton have lower survival ratios.

3 Modeling

3.1 Cross Validation

RNGkind(sample.kind = "Rounding")
set.seed(123)

# index sampling
index <- sample(x = nrow(data),
                size = nrow(data)*0.7) 

# splitting
data_train <- data[index, ]
data_test <- data[-index, ] 
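The split above is a single 70/30 hold-out split rather than k-fold cross-validation. As a quick sanity check, one can confirm the size of each set and that the class proportions stay close to the overall 62/38 balance. The sketch below uses a hypothetical target vector `y` with the same class counts as Survived and omits the `RNGkind()` call, so the sampled indices will differ from the ones used in this report.

```r
set.seed(123)

# Hypothetical target with the same class counts as Survived (549 zeros, 342 ones)
y <- factor(rep(c(0, 1), times = c(549, 342)))

# 70/30 hold-out split; floor() makes the truncation of 891 * 0.7 explicit
n_train   <- floor(length(y) * 0.7)
train_idx <- sample(x = length(y), size = n_train)

y_train <- y[train_idx]
y_test  <- y[-train_idx]

# Compare set sizes and class proportions
c(train = length(y_train), test = length(y_test))
round(prop.table(table(y_train)), 3)
round(prop.table(table(y_test)), 3)
```

If the two proportion tables differed markedly, a stratified split (for example with `caret::createDataPartition()`) would be one option to consider.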

4 LBB1: Logistic Regression vs KNN

Next, we proceed with building models using Logistic Regression and K-Nearest Neighbors (KNN).

4.1 Logistic Regression

In this step, we build a logistic regression model on the training data produced by the 70/30 split above. The glm() function fits the model, with Survived as the response variable and the remaining variables as predictors. We then perform feature selection using the stepwise method. Model performance will be evaluated based on accuracy, precision, recall, and F1-score.

In this section, we will create two models:

  1. Logistic regression model using all available predictor variables.

  2. Logistic regression model with filtered predictor variables using the stepwise feature selection method.

4.1.1 Model All

model_all <- glm(formula = Survived ~ ., family = "binomial", 
             data = data_train)
summary(model_all)
## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = data_train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.372059   0.556361   6.061 1.35e-09 ***
## Pclass2     -0.740048   0.345760  -2.140 0.032326 *  
## Pclass3     -1.668392   0.347153  -4.806 1.54e-06 ***
## Sexmale     -2.571089   0.229212 -11.217  < 2e-16 ***
## Age         -0.031937   0.009419  -3.391 0.000697 ***
## SibSp       -0.270896   0.121928  -2.222 0.026298 *  
## Parch       -0.049235   0.156618  -0.314 0.753246    
## Fare         0.004419   0.003082   1.434 0.151558    
## EmbarkedQ    0.068630   0.451919   0.152 0.879294    
## EmbarkedS   -0.370599   0.287357  -1.290 0.197162    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 831.47  on 622  degrees of freedom
## Residual deviance: 568.33  on 613  degrees of freedom
## AIC: 588.33
## 
## Number of Fisher Scoring iterations: 5

Using this model, we can make the following interpretations:

  • The intercept in this model represents the estimated log-odds of survival when all numeric predictors are zero and all categorical predictors are at their reference levels. In the coefficients table, the intercept is labeled (Intercept) and has an estimated value of 3.372059.

  • A breakdown of the information for the first few predictor variables:

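    • Pclass2, Pclass3, Sexmale, EmbarkedS: These coefficients are negative, so a passenger in Pclass 2 or 3 (compared with Pclass 1), a male passenger (compared with a female), or a passenger who embarked at Southampton (compared with Cherbourg) is less likely to survive. The EmbarkedS effect is not statistically significant.

    • Age, SibSp, Parch: The higher the value of these variables, the less likely the passenger is to survive, although the Parch effect is not statistically significant.

    • Fare: The higher the fare, the more likely the passenger is to survive, although this effect is not statistically significant.

    • EmbarkedQ: A passenger who embarked at Queenstown is slightly more likely to survive than one who embarked at Cherbourg, but this effect is not statistically significant.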

The remaining predictor variables follow a similar pattern. The coefficients table provides information about the statistical significance and impact of each predictor variable on the response variable in the logistic regression model.
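Log-odds coefficients are easier to read after conversion to odds ratios and probabilities. The sketch below copies a few estimates from the `summary(model_all)` output above into a plain named vector (`coefs`, `odds_ratios`, and `baseline_prob` are illustrative names); with the fitted model at hand, `exp(coef(model_all))` would give the same odds ratios for all terms.

```r
# Coefficients copied from the summary(model_all) output above
coefs <- c(
  "(Intercept)" = 3.372059,
  Pclass3       = -1.668392,
  Sexmale       = -2.571089,
  Age           = -0.031937
)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratios <- exp(coefs[-1])
round(odds_ratios, 3)

# plogis() is the inverse logit: it maps log-odds to a probability
baseline_prob <- plogis(coefs[["(Intercept)"]])
round(baseline_prob, 3)
```

The odds ratio for Sexmale is about 0.08, meaning the odds of survival for a male passenger are roughly 92% lower than for a comparable female passenger, holding the other predictors fixed. The baseline probability of about 0.97 refers to the reference profile only: a female first-class passenger who embarked at Cherbourg, with Age, SibSp, Parch, and Fare all equal to zero.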

4.1.2 Model Fitting

In the first model, several predictor variables are not statistically significant. Therefore, we will attempt model fitting using the stepwise method, which adds or removes predictors to lower the AIC.

model_both <- step(model_all, direction = "both")
## Start:  AIC=588.33
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
##            Df Deviance    AIC
## - Parch     1   568.42 586.42
## - Embarked  2   570.87 586.87
## <none>          568.33 588.33
## - Fare      1   570.88 588.88
## - SibSp     1   574.05 592.05
## - Age       1   580.51 598.51
## - Pclass    2   593.63 609.63
## - Sex       1   719.76 737.76
## 
## Step:  AIC=586.42
## Survived ~ Pclass + Sex + Age + SibSp + Fare + Embarked
## 
##            Df Deviance    AIC
## - Embarked  2   571.10 585.10
## <none>          568.42 586.42
## - Fare      1   570.88 586.88
## + Parch     1   568.33 588.33
## - SibSp     1   575.39 591.39
## - Age       1   580.53 596.53
## - Pclass    2   594.56 608.56
## - Sex       1   723.60 739.60
## 
## Step:  AIC=585.1
## Survived ~ Pclass + Sex + Age + SibSp + Fare
## 
##            Df Deviance    AIC
## <none>          571.10 585.10
## - Fare      1   574.42 586.42
## + Embarked  2   568.42 586.42
## + Parch     1   570.87 586.87
## - SibSp     1   579.65 591.65
## - Age       1   583.35 595.35
## - Pclass    2   596.73 606.73
## - Sex       1   735.31 747.31
summary(model_both)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Fare, family = "binomial", 
##     data = data_train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.118507   0.514684   6.059 1.37e-09 ***
## Pclass2     -0.817069   0.340137  -2.402 0.016298 *  
## Pclass3     -1.648740   0.335594  -4.913 8.97e-07 ***
## Sexmale     -2.598402   0.222967 -11.654  < 2e-16 ***
## Age         -0.031777   0.009354  -3.397 0.000681 ***
## SibSp       -0.308111   0.116117  -2.653 0.007967 ** 
## Fare         0.004717   0.002894   1.630 0.103080    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 831.47  on 622  degrees of freedom
## Residual deviance: 571.10  on 616  degrees of freedom
## AIC: 585.1
## 
## Number of Fisher Scoring iterations: 5

This model keeps only 5 of the 7 predictors (Parch and Embarked are dropped), and we can make the following interpretations:

  • The intercept in this model represents the estimated log-odds of the response variable when all the predictor variables are set to zero or their reference levels (for categorical variables). In the provided coefficients table, the intercept is represented by (Intercept) and has an estimated value of 3.118507. This is the baseline log-odds of survival when the numeric predictors are zero and the categorical predictors are at their reference levels.

  • Breakdown of the information for the predictor variables:

    • Pclass2, Pclass3, Sexmale: If the passenger is in Pclass 2 or 3 (compared with Pclass 1) or is male (compared with female), then they are less likely to survive.

    • Age, SibSp: The larger the value of these variables, the more likely the passenger will not survive.

    • Fare: The larger the value of this variable, the more likely the passenger will survive.

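Comparison of AIC (Akaike Information Criterion): AIC estimates the information lost by a model while penalizing the number of parameters. The smaller the AIC value, the better the model.

  • Null deviance: It represents the error or discrepancy when no predictors are used in the model.

  • Residual deviance: It represents the error or discrepancy when the model is built using the predictors.

The null deviance is identical for both models (831.47), and the residual deviance of the stepwise model is slightly higher (571.10 vs 568.33) because it uses fewer predictors. AIC, which accounts for model size, is lower for the stepwise model, so we conclude that the stepwise model is the better model.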

model_all$aic
## [1] 588.3255
model_both$aic
## [1] 585.1033

A smaller AIC means less information is lost. From the results above, the AIC of model_both (585.10) is smaller than the AIC of model_all (588.33), so model_both is used for prediction.
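
As a quick check on how AIC relates to the deviances printed above, the sketch below recomputes it from the reported values. It assumes the relation AIC = residual deviance + 2 × (number of estimated coefficients), which holds for a binomial GLM fitted to 0/1 outcomes. The coefficient counts are derived from the degrees of freedom in the two summaries (622 - 613 + 1 and 622 - 616 + 1), and the helper name aic_from_deviance is hypothetical.

```r
# Hypothetical helper: AIC from residual deviance and number of coefficients.
# Assumes a binomial GLM on 0/1 outcomes, where deviance = -2 * log-likelihood.
aic_from_deviance <- function(residual_deviance, n_coef) {
  residual_deviance + 2 * n_coef
}

# Values copied from the model summaries above
aic_all  <- aic_from_deviance(568.33, n_coef = 10)  # full model
aic_both <- aic_from_deviance(571.10, n_coef = 7)   # stepwise model

c(model_all = aic_all, model_both = aic_both)
```

The stepwise model gives up a little fit (higher deviance) but saves three coefficients, which is why its AIC ends up lower.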

4.1.3 Prediction

data_test$prob_bh <- predict(model_both, type = "response", newdata = data_test)

In the code below, when the predicted probability for an observation in the test set is greater than 0.5, the model predicts the passenger as “Survived.” Otherwise (probability less than or equal to 0.5), the model predicts the passenger as “Not Survived.”

data_test$pred_data <- factor(ifelse(data_test$prob_bh > 0.5, 1, 0))
data_test[1:10, c("pred_data", "Survived")]

4.1.4 Model Performance Evaluation

After making predictions using the model, there are still some incorrect predictions.

  • Recall/Sensitivity: It measures how effectively the model can correctly predict the proportion of actual positive data.

  • Specificity: It measures how effectively the model can correctly predict the proportion of actual negative data.

  • Accuracy: It measures how effectively the model can correctly predict the target variable Y.

  • Precision: It measures how effectively the model can correctly predict the positive class among all predicted results.

In classification, we evaluate the model based on the confusion matrix.

confusion_matrix <- confusionMatrix(data = data_test$pred_data, reference = data_test$Survived, positive = "1")
confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 141  30
##          1  26  71
##                                           
##                Accuracy : 0.791           
##                  95% CI : (0.7374, 0.8381)
##     No Information Rate : 0.6231          
##     P-Value [Acc > NIR] : 2.486e-09       
##                                           
##                   Kappa : 0.5516          
##                                           
##  Mcnemar's Test P-Value : 0.6885          
##                                           
##             Sensitivity : 0.7030          
##             Specificity : 0.8443          
##          Pos Pred Value : 0.7320          
##          Neg Pred Value : 0.8246          
##              Prevalence : 0.3769          
##          Detection Rate : 0.2649          
##    Detection Prevalence : 0.3619          
##       Balanced Accuracy : 0.7736          
##                                           
##        'Positive' Class : 1               
## 
accuracy <- confusion_matrix$overall["Accuracy"]
precision <- confusion_matrix$byClass["Pos Pred Value"]
recall <- confusion_matrix$byClass["Sensitivity"]
f1_score <- confusion_matrix$byClass["F1"]

4.1.5 Displaying Evaluation Results

To evaluate the performance of the model, we can display the results using the confusion matrix and other evaluation metrics such as recall, specificity, accuracy, and precision. These metrics will provide insights into how well the model performs in predicting survival outcomes for the passengers.

cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.7910448
cat("Precision:", precision, "\n")
## Precision: 0.7319588
cat("Recall:", recall, "\n")
## Recall: 0.7029703
cat("F1 Score:", f1_score, "\n")
## F1 Score: 0.7171717

Interpretation

Based on the above confusion matrix results, we can interpret as follows:

  • Of all the test observations classified by the model, 79.10% are predicted correctly (accuracy).

  • For the “Survived” class (the positive class), the model has a recall of 70.30%. This means that out of all the passengers who actually survived, the model correctly classifies about 70.30% of them.

  • For the “Survived” class, the model has a precision of 73.20%. This means that out of the test observations predicted as positive (Survived), about 73.20% actually survived.
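
To make the link between the confusion matrix and these metrics explicit, the sketch below recomputes them by hand from the four cell counts printed above. The variable names with the _manual suffix are introduced here for illustration only.

```r
# Counts copied from the logistic regression confusion matrix above
# (rows = prediction, columns = reference, positive class = 1)
tp <- 71   # predicted 1, actually 1
fp <- 26   # predicted 1, actually 0
fn <- 30   # predicted 0, actually 1
tn <- 141  # predicted 0, actually 0

accuracy_manual  <- (tp + tn) / (tp + tn + fp + fn)
precision_manual <- tp / (tp + fp)   # Pos Pred Value
recall_manual    <- tp / (tp + fn)   # Sensitivity
f1_manual        <- 2 * precision_manual * recall_manual /
  (precision_manual + recall_manual)

round(c(accuracy = accuracy_manual, precision = precision_manual,
        recall = recall_manual, f1 = f1_manual), 4)
```

These values should agree with the ones extracted from confusion_matrix above.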

4.1.6 Model Interpretation

exp(model_both$coefficients) %>%
  data.frame()
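
Exponentiating a logistic regression coefficient turns it into an odds ratio. As a minimal sketch of how to read the result, the block below repeats the calculation on the coefficients copied from summary(model_both) and adds the percentage change in the odds of survival; the column name pct_change is introduced here for illustration.

```r
# Coefficients copied from summary(model_both) above
coefs <- c(Pclass2 = -0.817069, Pclass3 = -1.648740, Sexmale = -2.598402,
           Age = -0.031777, SibSp = -0.308111, Fare = 0.004717)

odds_ratio <- exp(coefs)
# Percentage change in the odds of survival for a one-unit increase
# (or, for dummy variables, relative to the reference level)
pct_change <- (odds_ratio - 1) * 100

round(data.frame(odds_ratio, pct_change), 3)
```

An odds ratio below 1 lowers the odds of survival. For example, Sexmale has an odds ratio of roughly 0.074, so the odds of survival for a male passenger are about 93% lower than for a female passenger with the same values of the other predictors.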

4.2 K-Nearest Neighbour (KNN)

Next, we build the KNN model. In this stage, the categorical predictors are converted to numeric, the class proportions are checked, and the data is summarised to understand the range of each predictor. The training and test sets are then produced with a random 80/20 hold-out split (labelled Cross Validation below).

4.2.1 Data Manipulation

data_knn <- data %>% 
  mutate(Pclass = as.numeric(Pclass),
         Sex = as.numeric(Sex),
         Embarked = as.numeric(Embarked))
glimpse(data_knn)
## Rows: 891
## Columns: 8
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
## $ Pclass   <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ Sex      <dbl> 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2…
## $ Age      <dbl> 22.00000, 38.00000, 26.00000, 35.00000, 35.00000, 29.69912, 5…
## $ SibSp    <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
## $ Parch    <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
## $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ Embarked <dbl> 4, 2, 4, 4, 4, 3, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 3, 4, 4, 2, 4…

4.2.2 Check class proportions

prop.table(table(data_knn$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

4.2.3 Check the Range of Predictor Variables

summary(data_knn)
##  Survived     Pclass           Sex             Age            SibSp      
##  0:549    Min.   :1.000   Min.   :1.000   Min.   : 0.42   Min.   :0.000  
##  1:342    1st Qu.:2.000   1st Qu.:1.000   1st Qu.:22.00   1st Qu.:0.000  
##           Median :3.000   Median :2.000   Median :29.70   Median :0.000  
##           Mean   :2.309   Mean   :1.648   Mean   :29.70   Mean   :0.523  
##           3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:35.00   3rd Qu.:1.000  
##           Max.   :3.000   Max.   :2.000   Max.   :80.00   Max.   :8.000  
##      Parch             Fare           Embarked    
##  Min.   :0.0000   Min.   :  0.00   Min.   :2.000  
##  1st Qu.:0.0000   1st Qu.:  7.91   1st Qu.:3.000  
##  Median :0.0000   Median : 14.45   Median :4.000  
##  Mean   :0.3816   Mean   : 32.20   Mean   :3.536  
##  3rd Qu.:0.0000   3rd Qu.: 31.00   3rd Qu.:4.000  
##  Max.   :6.0000   Max.   :512.33   Max.   :4.000

4.2.4 Cross Validation

set.seed(123)
index <- sample(x = nrow(data_knn), size = nrow(data_knn) * 0.8)
data_knn_train <- data_knn[index, ]
data_knn_test <- data_knn[-index, ]

4.2.5 Pre-processing

data_knn_train_x <- data_knn_train %>% select_if(is.numeric)
data_knn_test_x <- data_knn_test %>% select_if(is.numeric)
data_knn_train_y <- data_knn_train[,"Survived"]
data_knn_test_y <- data_knn_test[,"Survived"]

4.2.6 Scaling Data

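Before training the KNN model, we scale the predictors using the scale() function. The test set is scaled with the center and scale of the training set so that no information from the test data leaks into pre-processing. We then choose k with the square-root rule of thumb (k ≈ √n, where n is the number of training rows), which is a heuristic starting point rather than a tuned optimum. The KNN model is trained and tested using the knn() function, and the evaluation results are reported as accuracy, precision, recall, and F1-score.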

data_knn_train_xs <- scale(x = data_knn_train_x)
data_knn_test_xs <- scale(x = data_knn_test_x,
                      center = attr(data_knn_train_xs, "scaled:center"),
                      scale = attr(data_knn_train_xs, "scaled:scale"))
# Find Optimal k
optimal_k <- sqrt(nrow(data_knn_train_xs))
cat("Optimal k:", optimal_k, "\n")
## Optimal k: 26.68333
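
The value printed above is not an integer, and an even k can produce tied votes in a two-class problem. A common convention, shown here as an assumption and not as what the code in this report does, is to round the square root and force it to be odd. The helper name choose_k is hypothetical.

```r
# Hypothetical helper: square-root rule of thumb, rounded to an odd integer
choose_k <- function(n_train) {
  k <- round(sqrt(n_train))
  if (k %% 2 == 0) k <- k + 1  # make k odd so a two-class vote cannot tie
  as.integer(k)
}

# 712 training rows, as implied by the square root printed above
choose_k(712)
```

Using such a value in knn() would change the results reported below, so it is offered only as an alternative to try.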

4.2.7 KNN Model Training and Prediction

knn_model <- knn(train = data_knn_train_xs,
                 test = data_knn_test_xs,
                 cl = data_knn_train_y,
                 k = optimal_k)

4.2.8 Model Evaluation

confusion_matrix_knn <- confusionMatrix(data = knn_model,
                                        reference = data_knn_test_y,
                                        positive = "1")
confusion_matrix_knn
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 107  20
##          1   9  43
##                                           
##                Accuracy : 0.838           
##                  95% CI : (0.7757, 0.8887)
##     No Information Rate : 0.648           
##     P-Value [Acc > NIR] : 1.311e-08       
##                                           
##                   Kappa : 0.6301          
##                                           
##  Mcnemar's Test P-Value : 0.06332         
##                                           
##             Sensitivity : 0.6825          
##             Specificity : 0.9224          
##          Pos Pred Value : 0.8269          
##          Neg Pred Value : 0.8425          
##              Prevalence : 0.3520          
##          Detection Rate : 0.2402          
##    Detection Prevalence : 0.2905          
##       Balanced Accuracy : 0.8025          
##                                           
##        'Positive' Class : 1               
## 

4.2.9 Displaying Evaluation Results

The evaluation results of the KNN model, including accuracy, precision, recall, and F1-score, are displayed below.

cat("KNN Model Performance\n")
## KNN Model Performance
cat("Accuracy:", confusion_matrix_knn$overall["Accuracy"], "\n")
## Accuracy: 0.8379888
cat("Precision:", confusion_matrix_knn$byClass["Pos Pred Value"], "\n")
## Precision: 0.8269231
cat("Recall:", confusion_matrix_knn$byClass["Sensitivity"], "\n")
## Recall: 0.6825397
cat("F1 Score:", confusion_matrix_knn$byClass["F1"], "\n")
## F1 Score: 0.7478261

Based on the confusion matrix results above, we can interpret as follows:

  • Of all the test observations classified by the model, 83.80% are predicted correctly (accuracy).

  • For the “Survived” class (the positive class), the model has a recall of 68.25%. This means that out of all passengers who actually survived, the model correctly classifies about 68.25% of them.

  • For the “Survived” class, the model has a precision of 82.69%. This means that out of the test observations predicted as positive (Survived), about 82.69% actually survived.

4.3 Model Evaluation: Logistic Regression & KNN

Finally, a comparison of the evaluation results between the Logistic Regression and KNN models is performed. The evaluation metrics for each model are displayed in the form of a data frame containing accuracy, recall, specificity, and precision values.

# Logistic Regression model evaluation
eval_logit <- data.frame(Accuracy = confusion_matrix$overall["Accuracy"],
                         Recall = confusion_matrix$byClass["Sensitivity"],
                         Specificity = confusion_matrix$byClass["Specificity"],
                         Precision = confusion_matrix$byClass["Pos Pred Value"])

# K-NN model evaluation
eval_knn <- data.frame(Accuracy = confusion_matrix_knn$overall["Accuracy"],
                       Recall = confusion_matrix_knn$byClass["Sensitivity"],
                       Specificity = confusion_matrix_knn$byClass["Specificity"],
                       Precision = confusion_matrix_knn$byClass["Pos Pred Value"])
eval_logit
eval_knn
  • Target variable = Survived (1) or Not Survived (0).
  • Positive Class: Survived (1).
  • FN: Passengers predicted as Not Survived who actually Survived.
  • FP: Passengers predicted as Survived who actually did Not Survive.
  • The concerning risk is predicting passengers who actually did not survive as survived (False Positive), so we prioritize maximizing Precision.

In addition, we use Accuracy to compare the overall performance of both models.

Conclusion: The metrics that need to be prioritized or evaluated are Accuracy and Precision.

Comparing both metrics, the KNN model has higher Accuracy and Precision (0.8379888 / 0.8269231) than the Logistic Regression model (0.7910448 / 0.7319588). A higher Precision means that fewer passengers who did not survive are wrongly predicted as survivors. Note that the two models were evaluated on different test sets (268 rows for Logistic Regression, 179 rows for KNN), so the comparison is indicative rather than exact.

Therefore, based on these evaluation results, the KNN method is recommended, since it makes fewer False Positive errors and so has the higher Precision.

4.4 Conclusion

In the classification analysis using the Logistic Regression and K-NN models to predict the survival of Titanic passengers, the evaluation results show that the KNN model performs better, with higher Accuracy and Precision than the Logistic Regression model. In this case, Accuracy and Precision are the prioritized metrics because the most concerning risk is a False Positive, where a passenger who actually did not survive is predicted as having survived.

Therefore, the best metrics that can be used to evaluate the model’s performance are Accuracy and Precision.

5 LBB 2: Naive Bayes, Decision Tree, vs Random Forest

5.1 Data Splitting

In this section, we will split the dataset into data train and data test. The data train will be used to train the model, and the data test will be used to evaluate the performance of the trained models. 80% of the dataset will be used for data train, and the remaining 20% will be used for data test.

glimpse(data)
## Rows: 891
## Columns: 8
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
## $ Pclass   <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ Sex      <fct> male, female, female, female, male, male, male, male, female,…
## $ Age      <dbl> 22.00000, 38.00000, 26.00000, 35.00000, 35.00000, 29.69912, 5…
## $ SibSp    <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
## $ Parch    <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
## $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C, S…
RNGkind(sample.kind = "Rounding")
set.seed(123)
samplesize <- round(0.8 * nrow(data), 0)
index <- sample(seq_len(nrow(data)), size = samplesize)

data_train <- data[index, ]
data_test <- data[-index, ]

5.2 Data Pre-processing

Before building the models, we need to examine the proportion of the target variable in the target column of the data train.

prop.table(table(data_train$Survived))
## 
##         0         1 
## 0.6072931 0.3927069

We can observe that the classes of the target variable are imbalanced (about 61% vs 39%), which can affect the performance of the model. Because the training set is small (713 rows), we balance it by upsampling the minority class instead of discarding rows from the majority class. Note that upsampling only duplicates existing observations; it adds no new information to the data.

RNGkind(sample.kind = "Rounding")
# Upsample the minority class; despite its name, train_down holds the upsampled data
train_down <- upSample(
  x = data_train %>% select(-Survived),
  y = data_train$Survived,
  yname = "Survived"
)
nrow(train_down)
## [1] 866
prop.table(table(train_down$Survived))
## 
##   0   1 
## 0.5 0.5

Now the target variable has a balanced class proportion.
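
The idea behind upsampling can be sketched in base R: every class is resampled with replacement until it matches the size of the largest class. This is an illustration on toy data, not caret's implementation of upSample(); the data frame toy and the helper upsample_base are hypothetical.

```r
# Toy data standing in for data_train (hypothetical values)
set.seed(123)
toy <- data.frame(
  x = rnorm(10),
  Survived = factor(c(rep(0, 7), rep(1, 3)))
)

# Resample each class with replacement up to the majority-class size
upsample_base <- function(df, target) {
  n_max <- max(table(df[[target]]))
  parts <- lapply(split(df, df[[target]]), function(part) {
    extra <- sample(seq_len(nrow(part)), n_max - nrow(part), replace = TRUE)
    # keep every original row, then append the resampled duplicates
    part[c(seq_len(nrow(part)), extra), , drop = FALSE]
  })
  do.call(rbind, parts)
}

toy_up <- upsample_base(toy, "Survived")
table(toy_up$Survived)
```

Every original row is kept, and the extra minority-class rows are exact copies, which is why no new information is added.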

5.3 Naive Bayes

Naive Bayes is a classification method based on Bayes’ theorem, with the simplifying assumption that the predictors are independent of each other given the class. The Naive Bayes model can be created using the naiveBayes() function from the e1071 package. In this analysis, we first create a Naive Bayes model that uses all the predictors in the dataset. One weakness of Naive Bayes is the zero-frequency problem (“scarcity”): if a predictor value never occurs in a certain class, its conditional probability becomes zero and cancels out the whole product. To address this issue, we apply Laplace smoothing with the laplace = 1 parameter.

5.3.1 Model Build

model_naive <- naiveBayes(Survived ~ ., data = train_down, laplace = 1)  
model_naive
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##   0   1 
## 0.5 0.5 
## 
## Conditional probabilities:
##    Pclass
## Y           1         2         3
##   0 0.1513761 0.1811927 0.6674312
##   1 0.4013761 0.2201835 0.3784404
## 
##    Sex
## Y      female      male
##   0 0.1517241 0.8482759
##   1 0.6781609 0.3218391
## 
##    Age
## Y       [,1]     [,2]
##   0 30.94853 12.26125
##   1 28.77417 13.58285
## 
##    SibSp
## Y        [,1]      [,2]
##   0 0.5381062 1.3066968
##   1 0.4595843 0.7257543
## 
##    Parch
## Y        [,1]      [,2]
##   0 0.3025404 0.7658317
##   1 0.4572748 0.7897778
## 
##    Fare
## Y       [,1]     [,2]
##   0 21.90124 31.70266
##   1 50.52230 70.46483
## 
##    Embarked
## Y                       C          Q          S
##   0 0.00228833 0.13729977 0.08695652 0.77345538
##   1 0.00228833 0.27002288 0.08237986 0.64530892

The output above shows the fitted Naive Bayes classifier. Here is an explanation of its components:

A-priori probabilities: These are the probabilities of the classes (0 and 1) occurring in the target variable. In this case, the probabilities are 0.5 for each class, because the model was trained on the upsampled data.

Conditional probabilities: These describe each predictor variable given each class. For categorical predictors (Pclass, Sex, Embarked), each table gives the probability of each level (e.g., Pclass 1, 2, or 3) given the class (0 or 1). For numeric predictors (Age, SibSp, Parch, Fare), column [,1] is the mean and column [,2] is the standard deviation within each class. The unnamed first column of the Embarked table is the blank level; it has no observations in the training data, and its small non-zero probability comes from Laplace smoothing.

These conditional probabilities can be used to calculate the posterior probabilities and make predictions using the Naive Bayes classifier.
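
As a minimal sketch of that calculation, the block below computes the posterior for a hypothetical passenger (female, first class, 30 years old) from the values printed above. For brevity it uses only three of the seven predictors, so the result will differ from what predict(model_naive, ...) returns; the variable names are introduced here for illustration.

```r
# Values copied from the model_naive output above
prior    <- c("0" = 0.5, "1" = 0.5)
p_female <- c("0" = 0.1517241, "1" = 0.6781609)  # P(Sex = female | class)
p_class1 <- c("0" = 0.1513761, "1" = 0.4013761)  # P(Pclass = 1 | class)
age_mean <- c("0" = 30.94853, "1" = 28.77417)
age_sd   <- c("0" = 12.26125, "1" = 13.58285)

# A numeric predictor enters through a Gaussian density per class
lik_age <- dnorm(30, mean = age_mean, sd = age_sd)

# Bayes' theorem under the "naive" independence assumption:
# multiply prior and per-predictor likelihoods, then normalise
joint <- prior * p_female * p_class1 * lik_age
posterior <- joint / sum(joint)
round(posterior, 3)
```

With these three predictors the posterior strongly favours class 1 (Survived), driven mainly by the Sex and Pclass tables.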

5.3.2 Predict

# TRAIN
preds_naive_train <- predict(model_naive, newdata = data_train) 
table(preds_naive_train, data_train$Survived)
##                  
## preds_naive_train   0   1
##                 0 354  75
##                 1  79 205
# TEST
preds_naive_test <- predict(model_naive, newdata = data_test) 
table(preds_naive_test, data_test$Survived)
##                 
## preds_naive_test  0  1
##                0 97 15
##                1 19 47

The test-set table above has the predictions in rows and the actual values in columns. Taking 1 (Survived) as the positive class, it displays the following information:

True Negative (TN): The number of instances that were correctly predicted as negative (0) by the classifier. In this case, there are 97 instances that were correctly predicted as negative.

False Positive (FP): The number of instances that were incorrectly predicted as positive (1) by the classifier. In this case, there are 19 instances that were incorrectly predicted as positive.

False Negative (FN): The number of instances that were incorrectly predicted as negative (0) by the classifier. In this case, there are 15 instances that were incorrectly predicted as negative.

True Positive (TP): The number of instances that were correctly predicted as positive (1) by the classifier. In this case, there are 47 instances that were correctly predicted as positive.

5.3.3 Model Evaluation

  • Training
confusionMatrix_nv_train <- confusionMatrix(preds_naive_train, data_train$Survived) 
confusionMatrix_nv_train
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 354  75
##          1  79 205
##                                          
##                Accuracy : 0.784          
##                  95% CI : (0.752, 0.8137)
##     No Information Rate : 0.6073         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.5483         
##                                          
##  Mcnemar's Test P-Value : 0.809          
##                                          
##             Sensitivity : 0.8176         
##             Specificity : 0.7321         
##          Pos Pred Value : 0.8252         
##          Neg Pred Value : 0.7218         
##              Prevalence : 0.6073         
##          Detection Rate : 0.4965         
##    Detection Prevalence : 0.6017         
##       Balanced Accuracy : 0.7748         
##                                          
##        'Positive' Class : 0              
## 
  • Testing
confusionMatrix_nv_test <- confusionMatrix(preds_naive_test, data_test$Survived)
confusionMatrix_nv_test
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 97 15
##          1 19 47
##                                           
##                Accuracy : 0.809           
##                  95% CI : (0.7434, 0.8639)
##     No Information Rate : 0.6517          
##     P-Value [Acc > NIR] : 2.959e-06       
##                                           
##                   Kappa : 0.5855          
##                                           
##  Mcnemar's Test P-Value : 0.6069          
##                                           
##             Sensitivity : 0.8362          
##             Specificity : 0.7581          
##          Pos Pred Value : 0.8661          
##          Neg Pred Value : 0.7121          
##              Prevalence : 0.6517          
##          Detection Rate : 0.5449          
##    Detection Prevalence : 0.6292          
##       Balanced Accuracy : 0.7971          
##                                           
##        'Positive' Class : 0               
## 

Result:

  • Target variable = Survived(1) or Not Survived (0)

  • Positive Class: Not Survived (0)

  • FN: Passengers predicted as survived, but actually not survived

  • FP: Passengers predicted as not survived, but actually survived

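  • The concerning risk is predicting passengers who actually did not survive as survived. Because the positive class here is 0 (Not Survived), this error is counted as a False Negative, unlike in Section 4.3 where the positive class was 1.

Conclusion: Precision (Pos Pred Value) is reported below, as in Section 4.3. With 0 as the positive class it measures how many passengers predicted as not survived truly did not survive, while the risk described above is reflected in Sensitivity.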

From the output of the Naive Bayes classification model, it can be observed that:

Training: The model has an accuracy rate of 78.4% and a Precision/Positive Predictive Value of 82.52%.

Testing: The model has an accuracy rate of 80.9% and a Precision/Positive Predictive Value of 86.61%.

The model performs comparably on the training and test sets (test performance is even slightly higher), which indicates good performance with no sign of overfitting.
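
Because caret treats the first factor level (0) as the positive class by default, the reported Sensitivity and Precision describe the “Not Survived” class. The sketch below, using the test-set counts copied from the matrix above and a hypothetical helper class_metrics, shows how both metrics change with the choice of positive class.

```r
# Test-set counts copied from the Naive Bayes confusion matrix above
cm <- matrix(c(97, 19, 15, 47), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Hypothetical helper: sensitivity and precision for a chosen positive class
class_metrics <- function(cm, positive) {
  tp <- cm[positive, positive]
  c(sensitivity = tp / sum(cm[, positive]),  # share of actual positives found
    precision   = tp / sum(cm[positive, ]))  # share of predicted positives that are correct
}

round(rbind(positive_0 = class_metrics(cm, "0"),
            positive_1 = class_metrics(cm, "1")), 4)
```

The positive_0 row reproduces the Sensitivity and Pos Pred Value printed by confusionMatrix(), and the positive_1 row matches its Specificity and Neg Pred Value.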

5.4 Decision Tree

Decision Tree is a tree-based model that is relatively simple with robust and powerful performance for predictions. It generates a visualization in the form of a decision tree that can be easily interpreted.

The main drawback of Decision Tree is its tendency to overfit. This is caused by its ability to split the data into very detailed branches, even reaching leaf nodes that contain only one observation. To address overfitting, pruning techniques are used to determine when to stop creating branches in the decision tree, resulting in a simpler and more general tree that can recognize patterns and prevent the model from being too specific to the training data only.

5.4.1 Model Build

Model Without Pruning

dt_model <- ctree(Survived ~ ., data = data_train)

Model With Pruning

Performing model tuning by setting the parameters mincriterion = 0.90 and minsplit = 15.

# dt_model <- ctree (Survived ~ ., data_train,
#                   control = ctree_control(mincriterion = 0.9,
#                                           minsplit = 15,
#                                           minbucket = 0))

However, since the results obtained are the same, we will use the model without pruning only.

plot(dt_model, type = "simple")

dt_model
## 
## Model formula:
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
## Fitted party:
## [1] root
## |   [2] Sex in female
## |   |   [3] Pclass in 1, 2: 1 (n = 139, err = 6.5%)
## |   |   [4] Pclass in 3
## |   |   |   [5] Embarked in C, Q: 1 (n = 46, err = 26.1%)
## |   |   |   [6] Embarked in S: 0 (n = 73, err = 39.7%)
## |   [7] Sex in male
## |   |   [8] Pclass in 1: 0 (n = 95, err = 34.7%)
## |   |   [9] Pclass in 2, 3
## |   |   |   [10] Age <= 12
## |   |   |   |   [11] SibSp <= 2: 1 (n = 12, err = 0.0%)
## |   |   |   |   [12] SibSp > 2: 0 (n = 10, err = 10.0%)
## |   |   |   [13] Age > 12: 0 (n = 338, err = 12.1%)
## 
## Number of inner nodes:    6
## Number of terminal nodes: 7

The provided information represents a decision tree model (dt_model) that predicts the survival outcome (Survived) based on several predictor variables (Pclass, Sex, Age, SibSp, Parch, Fare, Embarked). Here is an explanation of the tree structure:

  • The root node is the starting point of the decision tree.

  • The first split is based on the variable Sex. If the individual is female, the tree proceeds to node 2. If the individual is male, the tree proceeds to node 7.

  • For females, the next split is based on the variable Pclass. If the individual’s Pclass is 1 or 2, the predicted outcome is 1 (survived) in node 3. If the individual’s Pclass is 3, the tree proceeds to node 4.

  • Similarly, the tree continues to split based on different predictor variables until it reaches the leaf nodes, which represent the final predicted outcomes (survived or not survived) based on the combination of predictor variable values.

  • In each terminal node, n is the number of training observations that fall into that node, and err is the percentage of those observations that the node’s prediction misclassifies.

Overall, the decision tree provides a clear visual representation of the decision-making process and how different predictor variables contribute to the final survival prediction.
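
The printed tree can also be read as a set of nested rules. The sketch below writes those rules by hand as an illustration of how a single passenger is routed to a terminal node; the function predict_by_rules is hypothetical and is not a replacement for predict(dt_model, ...).

```r
# Hand-written version of the fitted tree printed above (illustration only).
# Returns the predicted class: 1 = survived, 0 = not survived.
predict_by_rules <- function(Sex, Pclass, Age, SibSp, Embarked) {
  if (Sex == "female") {
    if (Pclass %in% c(1, 2)) return(1)        # node 3
    if (Embarked %in% c("C", "Q")) return(1)  # node 5
    return(0)                                 # node 6
  }
  if (Pclass == 1) return(0)                  # node 8
  if (Age > 12) return(0)                     # node 13
  if (SibSp <= 2) return(1)                   # node 11
  0                                           # node 12
}

predict_by_rules("female", 3, 25, 0, "S")  # path: nodes 2 -> 4 -> 6
predict_by_rules("male", 2, 8, 1, "S")     # path: nodes 7 -> 9 -> 10 -> 11
```

Parch and Fare do not appear in the rules because the fitted tree never splits on them.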

5.4.2 Predict

# TRAIN
preds_dt_train <- predict(dt_model, newdata = data_train) 
table(preds_dt_train, data_train$Survived)
##               
## preds_dt_train   0   1
##              0 412 104
##              1  21 176
# TEST
preds_dt_test <- predict(dt_model, newdata = data_test) 
table(preds_dt_test, data_test$Survived)
##              
## preds_dt_test   0   1
##             0 110  21
##             1   6  41

The test-set table above has the predictions in rows and the actual values in columns. Taking 1 (Survived) as the positive class:

True Negative (TN): The model predicted 0 (not survived), and the actual value is also 0. In this case, the count is 110.

False Positive (FP): The model predicted 1 (survived), but the actual value is 0 (not survived). In this case, the count is 6.

False Negative (FN): The model predicted 0 (not survived), but the actual value is 1 (survived). In this case, the count is 21.

True Positive (TP): The model predicted 1 (survived), and the actual value is also 1. In this case, the count is 41.

5.4.3 Model Evaluation

  • Training
confusionMatrix_dt_train <- confusionMatrix(preds_dt_train, data_train$Survived) 
confusionMatrix_dt_train
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 412 104
##          1  21 176
##                                           
##                Accuracy : 0.8247          
##                  95% CI : (0.7947, 0.8519)
##     No Information Rate : 0.6073          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6121          
##                                           
##  Mcnemar's Test P-Value : 2.229e-13       
##                                           
##             Sensitivity : 0.9515          
##             Specificity : 0.6286          
##          Pos Pred Value : 0.7984          
##          Neg Pred Value : 0.8934          
##              Prevalence : 0.6073          
##          Detection Rate : 0.5778          
##    Detection Prevalence : 0.7237          
##       Balanced Accuracy : 0.7900          
##                                           
##        'Positive' Class : 0               
## 
  • Testing
confusionMatrix_dt_test <- confusionMatrix(preds_dt_test, data_test$Survived) 
confusionMatrix_dt_test
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 110  21
##          1   6  41
##                                          
##                Accuracy : 0.8483         
##                  95% CI : (0.787, 0.8976)
##     No Information Rate : 0.6517         
##     P-Value [Acc > NIR] : 3.516e-09      
##                                          
##                   Kappa : 0.6459         
##                                          
##  Mcnemar's Test P-Value : 0.007054       
##                                          
##             Sensitivity : 0.9483         
##             Specificity : 0.6613         
##          Pos Pred Value : 0.8397         
##          Neg Pred Value : 0.8723         
##              Prevalence : 0.6517         
##          Detection Rate : 0.6180         
##    Detection Prevalence : 0.7360         
##       Balanced Accuracy : 0.8048         
##                                          
##        'Positive' Class : 0              
## 

Result:

  • Target variable = Survived (1) or Not Survived (0)

  • Positive Class: Not Survived (0)

  • FN: Passengers predicted as survived, but they did not survive.

  • FP: Passengers predicted as not survived, but they survived.

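  • The concerning risk is predicting that a passenger survived when they actually did not. With 0 as the positive class, caret counts this error as a False Negative; it is a False Positive only when Survived (1) is treated as the positive class, as in the cell-by-cell reading in 5.4.2.

Conclusion: The metric prioritized in this report is Precision (Pos Pred Value), which caret computes for the positive class 0.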
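Because precision depends on which class is treated as positive, it can help to compute it for both classes. The following is a minimal sketch based on the decision tree test counts shown above; the names tab and precision_for are illustrative and do not appear in the original analysis.

```r
# Decision tree test counts (rows = predicted, columns = actual)
tab <- matrix(c(110, 6, 21, 41), nrow = 2,
              dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

# Precision for a class:
# correct predictions of that class / all predictions of that class
precision_for <- function(tab, class) tab[class, class] / sum(tab[class, ])

precision_not_survived <- precision_for(tab, "0")  # Pos Pred Value when positive = 0
precision_survived     <- precision_for(tab, "1")  # Neg Pred Value when positive = 0

round(c(not_survived = precision_not_survived,
        survived     = precision_survived), 4)
```

The two values correspond to the Pos Pred Value and Neg Pred Value lines of the testing output. caret's confusionMatrix() also accepts a positive argument (for example positive = "1") if Survived should be the positive class; this report keeps the default.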

From the output of the Decision Tree classification model, it can be observed that:

Training: The model has an accuracy of 82.47% and Precision/Positive Predictive Value of 79.84%.

Testing: The model has an accuracy of 84.83% and Precision/Positive Predictive Value of 83.97%.

Training and testing accuracy are close (testing is in fact slightly higher), so the model shows no sign of overfitting, and accuracy above 80% on both sets suggests it is not underfitting either.

5.5 Random Forest

5.5.1 Modelling

K-fold cross-validation is a resampling technique commonly used in machine learning and model evaluation. Its purpose is to assess the performance and generalization ability of a predictive model.

The process involves dividing the available data into K equally sized subsets or “folds.” The model is then trained and evaluated K times, each time using a different fold as the validation set and the remaining folds as the training set. This allows for a more comprehensive evaluation of the model’s performance and helps to mitigate the impact of data variability and overfitting.
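The fold rotation described above can be sketched in a few lines of base R. This is only an illustration on a hypothetical data set of 20 rows; caret performs its own splitting internally when train() is called, and the variable names here are not part of the original analysis.

```r
set.seed(417)  # arbitrary seed for reproducibility
n <- 20        # hypothetical number of observations
k <- 5         # number of folds

# Randomly assign each observation to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

for (i in seq_len(k)) {
  validation_idx <- which(fold_id == i)   # fold i is held out
  training_idx   <- which(fold_id != i)   # remaining k - 1 folds train the model
  cat("Fold", i, ": train on", length(training_idx),
      "rows, validate on", length(validation_idx), "rows\n")
}

# Every observation lands in exactly one validation fold
table(fold_id)
```

Repeated cross-validation, as used below, simply reruns this procedure with a fresh random assignment each time and averages the results.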

set.seed(417)
ctrl <- trainControl(method="repeatedcv", number=5, repeats=3) # 5-fold cross-validation, repeated 3 times
model_rf <- train(Survived ~ .,
                  data= data_train,
                  method="rf",
                  trControl = ctrl)

In summary, the Random Forest model achieved a cross-validated accuracy of approximately 82% and a kappa coefficient of around 0.616 with mtry = 6. These metrics estimate the model’s predictive performance on data it was not trained on, because each fold is evaluated by a model fitted to the remaining folds.

Among the candidate values of mtry (2, 6, and 10), the selected model uses mtry = 6, as it achieved the highest mean accuracy across the held-out folds of the repeated 5-fold cross-validation.

saveRDS(model_rf, file = "model_rf.rds")
model_rfload <- readRDS("model_rf.rds")
model_rfload
## Random Forest 
## 
## 713 samples
##   7 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 570, 571, 571, 570, 570, 570, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8153419  0.5949658
##    6    0.8200171  0.6163380
##   10    0.8055189  0.5863042
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
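The selection rule reported in the output can be reproduced from the resampling table itself. The sketch below copies the three rows printed above into a data frame (rf_results is an illustrative name) and picks the row with the largest mean accuracy.

```r
# Resampling results copied from the caret output above
rf_results <- data.frame(
  mtry     = c(2, 6, 10),
  Accuracy = c(0.8153419, 0.8200171, 0.8055189),
  Kappa    = c(0.5949658, 0.6163380, 0.5863042)
)

# Keep the row with the largest mean cross-validated accuracy
best <- rf_results[which.max(rf_results$Accuracy), ]
best
```

On the fitted object, the same information is stored in model_rfload$results and model_rfload$bestTune.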

5.5.2 Variable Importance

varImp(model_rfload)
## rf variable importance
## 
##           Overall
## Sexmale   100.000
## Fare       90.087
## Age        81.591
## Pclass3    18.835
## SibSp      13.233
## Parch       8.441
## EmbarkedS   4.235
## EmbarkedC   2.174
## Pclass2     1.453
## EmbarkedQ   0.000

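These values represent the relative importance of each predictor in the model, scaled so that the most important predictor scores 100 and the least important scores 0. Factor variables appear as one entry per dummy level (for example Pclass2 and Pclass3). Higher values indicate greater importance in predicting the outcome variable. In this case, Sex (male), Fare, and Age are by far the most important variables in the model.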

On the other hand, Pclass3, SibSp, Parch, EmbarkedS, EmbarkedC, Pclass2, and EmbarkedQ have relatively lower importance.
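The 0 to 100 range comes from rescaling. Assuming the scores are min-max scaled, which is what caret's varImp() does by default (scale = TRUE), the idea can be sketched as follows. The raw scores below are hypothetical values chosen for illustration, not numbers taken from the fitted model.

```r
# Hypothetical raw importance scores (NOT taken from the fitted model)
raw_importance <- c(Sexmale = 52.3, Fare = 47.6, Age = 43.1, EmbarkedQ = 4.8)

# Min-max rescaling to the 0-100 range:
# the smallest score maps to 0 and the largest to 100
scaled <- (raw_importance - min(raw_importance)) /
  (max(raw_importance) - min(raw_importance)) * 100

round(sort(scaled, decreasing = TRUE), 3)
```

This is why a score of 0 for EmbarkedQ means "least important among these predictors" rather than "no contribution at all". Calling varImp(model_rfload, scale = FALSE) would return the unscaled scores.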

5.5.3 Out of Bag Error

Random forest provides a built-in performance estimate, so a separate cross-validation step is not strictly necessary. Each tree is grown on a bootstrap sample of the training data, so some observations are left out of that tree’s sample. These observations are referred to as out-of-bag (OOB) data and act as test data for the trees that did not see them. The model predicts each observation using only those trees and calculates the resulting error, known as the out-of-bag error. This error serves as an estimate of the model’s performance on unseen data.
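To see where out-of-bag observations come from, the sketch below draws a single bootstrap sample of row indices, as would happen for one tree, and counts the rows that were never selected. The sample size of 713 matches the training set reported by caret above; the seed and variable names are illustrative.

```r
set.seed(417)  # arbitrary seed
n <- 713       # number of training rows reported by caret above

# One bootstrap sample: draw n row indices with replacement
in_bag <- sample(seq_len(n), size = n, replace = TRUE)

# Rows never drawn are out-of-bag for this tree
out_of_bag <- setdiff(seq_len(n), in_bag)

# Share of rows left out; in expectation (1 - 1/n)^n, roughly 0.37
oob_share <- length(out_of_bag) / n
oob_share
```

Because each of the 500 trees draws its own sample, every observation is out-of-bag for roughly a third of the trees, which is enough to form a prediction for it.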

model_rfload$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 18.23%
## Confusion matrix:
##     0   1 class.error
## 0 380  53   0.1224018
## 1  77 203   0.2750000

For model_rf, the out-of-bag error is 18.23%, so the accuracy on the out-of-bag data is 100% - 18.23% = 81.77%. This estimate comes entirely from the training data and is separate from the evaluation on data_test in the next section.
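The error rate and the class.error column can be recomputed from the out-of-bag confusion matrix printed above. In randomForest output the rows are the actual classes and the columns are the predictions; the object names in this sketch are illustrative.

```r
# OOB confusion matrix from the output above (rows = actual, columns = predicted)
oob_tab <- matrix(c(380, 77, 53, 203), nrow = 2,
                  dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Overall error: share of observations off the diagonal
oob_error <- 1 - sum(diag(oob_tab)) / sum(oob_tab)

# Per-class error: misclassified share within each actual class
class_error <- 1 - diag(oob_tab) / rowSums(oob_tab)

round(oob_error * 100, 2)
round(class_error, 4)
```

The per-class errors show that survivors (class 1) are misclassified more than twice as often as non-survivors.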

5.5.4 Predict

# TRAIN
preds_rf_train <- predict(model_rfload, newdata = data_train, type = "raw") 
table(preds_rf_train, data_train$Survived)
##               
## preds_rf_train   0   1
##              0 429  18
##              1   4 262
# TEST
preds_rf_test <- predict(model_rfload, newdata = data_test, type = "raw") 
table(preds_rf_test, data_test$Survived)
##              
## preds_rf_test   0   1
##             0 102  12
##             1  14  50

Reading the test table with Survived (1) treated as the positive class:

True Negative (TN): The model predicted 0 (not survived), and the actual value is also 0. The count is 102.

False Positive (FP): The model predicted 1 (survived), but the actual value is 0 (not survived). The count is 14.

False Negative (FN): The model predicted 0 (not survived), but the actual value is 1 (survived). The count is 12.

True Positive (TP): The model predicted 1 (survived), and the actual value is also 1. The count is 50.

5.5.5 Model Evaluation

  • Training
confusionMatrix_rf_train <- confusionMatrix(preds_rf_train, data_train$Survived) 
confusionMatrix_rf_train
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 429  18
##          1   4 262
##                                           
##                Accuracy : 0.9691          
##                  95% CI : (0.9537, 0.9806)
##     No Information Rate : 0.6073          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9347          
##                                           
##  Mcnemar's Test P-Value : 0.005578        
##                                           
##             Sensitivity : 0.9908          
##             Specificity : 0.9357          
##          Pos Pred Value : 0.9597          
##          Neg Pred Value : 0.9850          
##              Prevalence : 0.6073          
##          Detection Rate : 0.6017          
##    Detection Prevalence : 0.6269          
##       Balanced Accuracy : 0.9632          
##                                           
##        'Positive' Class : 0               
## 
  • Testing
confusionMatrix_rf_test <- confusionMatrix(preds_rf_test, data_test$Survived) 
confusionMatrix_rf_test
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 102  12
##          1  14  50
##                                           
##                Accuracy : 0.8539          
##                  95% CI : (0.7933, 0.9023)
##     No Information Rate : 0.6517          
##     P-Value [Acc > NIR] : 1.145e-09       
##                                           
##                   Kappa : 0.6807          
##                                           
##  Mcnemar's Test P-Value : 0.8445          
##                                           
##             Sensitivity : 0.8793          
##             Specificity : 0.8065          
##          Pos Pred Value : 0.8947          
##          Neg Pred Value : 0.7812          
##              Prevalence : 0.6517          
##          Detection Rate : 0.5730          
##    Detection Prevalence : 0.6404          
##       Balanced Accuracy : 0.8429          
##                                           
##        'Positive' Class : 0               
## 

Result:

  • Target variable = Survived (1) or Not Survived (0)

  • Positive Class: Not Survived (0)

  • FN: Passengers predicted as survived, but actually not survived.

  • FP: Passengers predicted as not survived, but actually survived.

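  • The concerning risk is predicting that a passenger survived when they actually did not. With 0 as the positive class, caret counts this error as a False Negative; it is a False Positive only when Survived (1) is treated as the positive class, as in the cell-by-cell reading in 5.5.4.

Conclusion: The metric prioritized in this report is Precision (Pos Pred Value), which caret computes for the positive class 0.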

From the output of the Random Forest classification model, it can be observed that:

Training: The model achieved an accuracy of 96.91% and a Precision/Positive Predictive Value of 95.97%.

Testing: The model achieved an accuracy of 85.39% and a Precision/Positive Predictive Value of 89.47%.
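One simple way to compare how well each tree-based model generalizes is to look at the difference between training and testing accuracy. The sketch below uses the accuracies reported by confusionMatrix() in 5.4.3 and in this section; the data frame acc is introduced here for illustration.

```r
# Accuracies reported by confusionMatrix() in 5.4.3 and 5.5.5
acc <- data.frame(
  model = c("Decision Tree", "Random Forest"),
  train = c(0.8247, 0.9691),
  test  = c(0.8483, 0.8539)
)

# Positive gap = training accuracy exceeds testing accuracy
acc$gap <- acc$train - acc$test
acc
```

A gap near zero (or negative) suggests the model generalizes about as well as it fits, while a large positive gap is a common sign of overfitting.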

The results indicate that the model’s accuracy is above 80% on both sets, but the drop from 96.91% on training data to 85.39% on testing data suggests some overfitting.

Recap Model Evaluation NB/DT/RF

Finally, a comparison of the evaluation results between the Naive Bayes, Decision Tree, and Random Forest models was conducted. The evaluation metrics for each model are presented in a data frame, including accuracy, recall, specificity, and precision.

# Naive Bayes model evaluation
eval_nv <- data.frame(Model = "Naive Bayes", 
                      Accuracy = confusionMatrix_nv_test$overall["Accuracy"],
                      Recall = confusionMatrix_nv_test$byClass["Sensitivity"],
                      Specificity = confusionMatrix_nv_test$byClass["Specificity"],
                      Precision = confusionMatrix_nv_test$byClass["Pos Pred Value"])

# Decision Tree model evaluation
eval_dt <- data.frame(Model = "Decision Tree", 
                      Accuracy = confusionMatrix_dt_test$overall["Accuracy"],
                      Recall = confusionMatrix_dt_test$byClass["Sensitivity"],
                      Specificity = confusionMatrix_dt_test$byClass["Specificity"],
                      Precision = confusionMatrix_dt_test$byClass["Pos Pred Value"])

# Random Forest model evaluation
eval_rf <- data.frame(Model = "Random Forest", 
                      Accuracy = confusionMatrix_rf_test$overall["Accuracy"],
                      Recall = confusionMatrix_rf_test$byClass["Sensitivity"],
                      Specificity = confusionMatrix_rf_test$byClass["Specificity"],
                      Precision = confusionMatrix_rf_test$byClass["Pos Pred Value"])

eval <- rbind(eval_nv, eval_dt, eval_rf)
rownames(eval) <- NULL 

eval[order(eval$Precision, decreasing = TRUE), ]
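The same recap can be cross-checked without the confusionMatrix objects by computing the four metrics directly from the test tables printed earlier. This is a sketch under the assumption that 0 is the positive class, as in the caret output. The helper metrics_row is hypothetical, and Naive Bayes is left out because its test counts are not repeated in this part of the report.

```r
# Test confusion matrices printed above (rows = predicted, columns = actual)
dt_tab <- matrix(c(110, 6, 21, 41), nrow = 2,
                 dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))
rf_tab <- matrix(c(102, 14, 12, 50), nrow = 2,
                 dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

# Hypothetical helper: one row of metrics, with "0" as the positive class
metrics_row <- function(name, tab, positive = "0") {
  negative <- setdiff(rownames(tab), positive)
  tp <- tab[positive, positive]  # predicted positive, actually positive
  fp <- tab[positive, negative]  # predicted positive, actually negative
  fn <- tab[negative, positive]  # predicted negative, actually positive
  tn <- tab[negative, negative]  # predicted negative, actually negative
  data.frame(Model       = name,
             Accuracy    = (tp + tn) / sum(tab),
             Recall      = tp / (tp + fn),
             Specificity = tn / (tn + fp),
             Precision   = tp / (tp + fp))
}

recap <- rbind(metrics_row("Decision Tree", dt_tab),
               metrics_row("Random Forest", rf_tab))
recap[order(recap$Precision, decreasing = TRUE), ]
```

The resulting values agree with the Accuracy, Sensitivity, Specificity, and Pos Pred Value lines of the two testing outputs.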

5.6 Conclusion

As previously stated, Accuracy and Precision were chosen as the metrics to evaluate the model’s performance. Based on the evaluation results, the random forest model outperforms the Naive Bayes and Decision Tree models in predicting the survival of Titanic passengers.

The random forest model achieved the following performance metrics:

  • Training Accuracy: 96.91%, Precision: 95.97%
  • Testing Accuracy: 85.39%, Precision: 89.47%

An accuracy of 85.39% means that the random forest model correctly predicted the survival outcome (0 or 1) for 152 of the 178 passengers in the test data, which the model did not see during training. This suggests that the model classifies both survival and non-survival cases well on unseen data.

In summary, the random forest model demonstrates better predictive performance, with a higher accuracy and precision, making it a preferred model for predicting the survival of Titanic passengers.