This is an assignment for LBB Classification Data Science Algorithms:
The sinking of the Titanic was a shocking event that reverberated around the world. News of the shipwreck spread rapidly and became a topic of global discussion, since the Titanic was one of the largest passenger ships of its time. The accident resulted in the tragic loss of many lives. Data on the passengers who survived and those who did not were compiled in the hope of studying the impact of age and gender on the likelihood of survival in a shipwreck. The goal is to predict whether an individual would survive a ship accident. Such a model could inform safety standards on cruise ships, for example concerning particular genders or age groups, and guide the placement of safety equipment on board.
This report aims to conduct a classification analysis using the
Titanic dataset to predict passenger survival. The report employs the
Logistic Regression,
K-Nearest Neighbors (KNN), Naive Bayes,
Decision Tree, and Random Forest models. The
described steps include data preprocessing, feature selection, model
development, performance evaluation, and interpretation of results,
along with comparisons between the models.
In this report, I utilize five popular models, namely
Logistic Regression,
K-Nearest Neighbors (KNN), Naive Bayes,
Decision Tree, and Random Forest, to predict
passenger survival in the Titanic dataset.
The dataset can be found at https://www.kaggle.com/c/titanic. It contains biodata of Titanic passengers, and based on this biodata we aim to predict whether a passenger survived or not.
Before diving into the project, we’ll need to import the required libraries to facilitate our analysis.
library(dplyr) # Preprocessing Dataset
library(class) # knn models
library(caret) # Confusion Matrix # pre-process and scale the data
library(ggplot2) # ggplot visualization
library(ggthemes) # ggplot theme
library(tidyverse) # For data processing and visualization
library(car) # Calculate VIF
library(randomForest) # make random forest models
library(partykit) # decision tree models
library(e1071) # functions for statistical and probabilistic algorithms such as the naive Bayes classifier
And then import our data:
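The import step itself is not shown; a minimal sketch, assuming the Kaggle training file has been downloaded locally as train.csv (the file name is an assumption):
# Read the Kaggle training data (file name is an assumption)
data <- read.csv("train.csv", stringsAsFactors = FALSE)
# Quick look at the structure of the data
glimpse(data)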
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
Metadata:
PassengerId: Unique identification number for each passenger.
Survived: Indicates whether the passenger survived (1) or did not survive (0).
Pclass: Ticket class of the passenger, with values 1, 2, or 3, reflecting socio-economic class (1: First class, 2: Second class, 3: Third class).
Name: Full name of the passenger.
Sex: Gender of the passenger (male or female).
Age: Age of the passenger in years.
SibSp: Number of siblings or spouses accompanying the passenger on board.
Parch: Number of parents or children accompanying the passenger on board.
Ticket: Ticket number of the passenger.
Fare: Fare paid by the passenger.
Cabin: Cabin number of the passenger.
Embarked: Port where the passenger embarked (C: Cherbourg, Q: Queenstown, S: Southampton).
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status
(SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.
sibsp: The dataset defines family relations in this
way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this
way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for
them.
Before proceeding further, we need to examine the condition of our data and clean it to convert the raw dataset into a usable form that is more relevant to our analysis later.
Change data types and remove unnecessary columns: There are several variables with inappropriate data types that need to be corrected.
data <- data %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
Embarked = as.factor(Embarked))
summary(data)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 0:549 1:216 Length:891 female:314
## 1st Qu.:223.5 1:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Fare Cabin Embarked
## Min. : 0.00 Length:891 : 2
## 1st Qu.: 7.91 Class :character C:168
## Median : 14.45 Mode :character Q: 77
## Mean : 32.20 S:644
## 3rd Qu.: 31.00
## Max. :512.33
##
Insights Obtained:
There are 549 passengers who did not survive (0) and 342 passengers who survived (1).
There are 216 passengers in class 1, 184 passengers in class 2, and 491 passengers in class 3.
There are 577 male passengers and 314 female passengers.
The age range of passengers spans from 0.42 years to 80 years. The average age of passengers is 29.7 years.
The majority of passengers do not have any siblings or spouses accompanying them (0), followed by 209 passengers with 1 sibling or spouse.
The majority of passengers do not have any parents or children accompanying them (0), followed by 118 passengers with 1 parent or child.
The ticket fares range from 0 to 512.33. The average ticket fare is 32.2.
The majority of passengers (644) boarded the ship from Southampton port (S), followed by 168 passengers from Cherbourg port (C), and 77 passengers from Queenstown port (Q).
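The class proportions of the target variable, shown below, can be obtained with prop.table(); a minimal sketch, assuming the data frame is named data:
# Proportion of survivors vs. non-survivors
prop.table(table(data$Survived))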
##
## 0 1
## 0.6161616 0.3838384
The first step is to look for missing values. Handling missing data is crucial because any analytical results based on a dataset with missing values could be biased.
After that, we will handle the missing values in the Age
column by filling the missing values with the mean age. This is done to
make the dataset more complete and eliminate any missing values that may
disrupt the analysis process.
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
The columns with missing or empty values are Age (177 of 891 rows), Embarked (2 rows), and Cabin (687 rows). For Embarked and Cabin, the missing entries originate from empty strings "", indicating that the value was not specified or known during data collection.
For this treatment, we will perform the following: fill the missing values in the Age column with the mean age and in the Embarked column with the mode value, and drop Name, PassengerId, Ticket, and Cabin, as they are not needed in the modeling process.
# Filling missing values in the `Age` column with the mean age
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm=TRUE)
# Filling missing values in the `Embarked` column with the mode value
calc_mode <- function(x){
# List the distinct / unique values
distinct_values <- unique(x)
# Count the occurrence of each distinct value
distinct_tabulate <- tabulate(match(x, distinct_values))
# Return the value with the highest occurrence
distinct_values[which.max(distinct_tabulate)]
}
data <- data %>%
mutate(Embarked = if_else(is.na(Embarked),
calc_mode(Embarked),
                          Embarked))
## PassengerId    Survived      Pclass        Name         Sex         Age
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 0
There are no missing values in the Titanic dataset.
We will now drop several variables that are not needed, namely Name, PassengerId, Ticket, and Cabin.
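The column removal itself is not shown in code; a minimal sketch, assuming a full copy is kept as data_2 for the exploratory analysis that follows, while the reduced data is used for modeling:
# Keep a full copy for exploratory analysis
data_2 <- data
# Drop columns that are not needed for modeling
data <- data %>% select(-PassengerId, -Name, -Ticket, -Cabin)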
Feature Engineering
data_2 <- data_2 %>%
mutate(Survived = ifelse(Survived == 1, "Survived", "Dead")) %>%
mutate(Pclass = sapply(as.character(Pclass), switch,
"1" = "1st Class",
"2" = "2nd Class",
"3" = "3rd Class")) %>%
mutate(Embarked = sapply(as.character(Embarked), switch,
"C" = "Cherbourg",
"Q" = "Queenstown",
"S" = "Southampton"))# Mengubah data menjadi factor
data_2 <- data_2 %>%
mutate_at(vars(Pclass,Sex,Embarked,Survived), as.factor)
data_2
To obtain more specific information about Age and
Sex, feature engineering can be performed by extracting
titles from the Name variable. The gsub
function can be used in this case, and the titles will be categorized
into 6 different groups. These groups include: Mr,
Mrs, Master, Miss,
Honorific Titles, and Officers.
# use the gsub function
data_2$Title <- gsub("(.*\\,|\\..*)", "", data_2$Name) %>%
gsub("[[:space:]]", "", .)
# regroup the Title variable
data_2$Title[data_2$Title %in% c("Don", "Sir")] <- "Mr"
data_2$Title[data_2$Title %in% c("Ms", "Mme", "Mlle", "Lady", "Dona", "theCountess")] <- "Mrs"
data_2$Title[data_2$Title %in% c("Jonkheer", "Dr")] <- "Honorific Titles"
data_2$Title[data_2$Title %in% c("Capt", "Col", "Major", "Rev")] <- "Officers"Grouping the Age variable into several categories.
Ages <- function(x){
if(x < 20) {
x <- "< 20"
} else if (x >= 20 & x <= 29) {
x <- "20-29"
} else if (x >= 30 & x <= 39) {
x <- "30-39"
} else if (x >= 40 & x <= 49) {
x <- "40-49"
} else if (x >= 50 & x <= 59) {
x <- "50-59"
} else if (x >= 60 & x <= 69) {
x <- "60-69"
} else if (x >= 70 & x <= 79) {
x <- "70-79"
} else (x <- "> 80")
}
data_2$Ages <- as.factor(sapply(data_2$Age, Ages))
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 Dead :549 1st Class:216 Length:891 female:314
## 1st Qu.:223.5 Survived:342 2nd Class:184 Class :character male :577
## Median :446.0 3rd Class:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:22.00 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :29.70 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:35.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
##
## Fare Cabin Embarked Title
## Min. : 0.00 Length:891 Cherbourg :168 Length:891
## 1st Qu.: 7.91 Class :character Queenstown : 77 Class :character
## Median : 14.45 Mode :character Southampton:646 Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
## Ages
## 20-29 :220
## > 80 :178
## 30-39 :167
## < 20 :164
## 40-49 : 89
## 50-59 : 48
## (Other): 25
Interpretation:
The Max value: 891 in PassengerId
describes the total number of passengers on the Titanic ship, which is
891.
Survived (1) indicates that 342 passengers survived
the Titanic shipwreck.
The Sex of Titanic passengers consists of 577 males
and 314 females.
The Age of Titanic passengers ranges from the
youngest being 0.42 years old to the oldest being
80 years old.
The highest fare paid by a passenger is 512.33.
The most common port of embarkation for Titanic passengers in the Embarked column is S (Southampton) with 646 passengers, followed by C (Cherbourg) with 168 passengers, and Q (Queenstown) with 77 passengers.
The distribution of passenger age ranges (Ages) is shown in the summary above.
# Filter Survived and Dead in the `Survived` variable
data_2.clean <- data_2[data_2$Survived %in% c("Survived", "Dead"),]
# Subset data for passengers who survived
data_2.clean.survived <- data_2.clean[data_2.clean$Survived == "Survived",]
# Subset data for passengers who did not survive
data_2.clean.ntsurvived <- data_2.clean[data_2.clean$Survived == "Dead",]
agg_tbl <- data_2.clean %>% group_by(Survived) %>%
summarise(total_count=n(),
.groups = 'drop')
ggplot(agg_tbl, aes(x=Survived, y=total_count, fill=Survived))+
geom_bar(stat="identity")+
geom_text(aes(label=total_count), vjust=-0.3, size=3.5) +
scale_fill_brewer(palette = "Set1") +
labs(title = "Comparison of Survived Categories of Passengers",
x = "Category",
y = "Total Passengers")The overall ratio of the Survived category is that the number of those who did not survive is greater than those who survived.
How does Sex compare in each Pclass?
ggplot(data = data_2, aes(x=Pclass, fill=Sex)) +
geom_bar(position = "dodge", width = 0.4) +
labs(title = "Perbandingan Jenis Kelamin dari PClass",
x = "Pclass",
y = "Total Passengers")There are more males in all Pclass. However, the male-to-female ratio is significantly higher in Pclass 3.
How does Survived compare for each Gender?
ggplot(data = data_2.clean, aes(x=Sex, fill= Survived)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set1") +
labs(title = "Comparison of Survived in each Gender",
x = "Sex",
y = "Total Passengers")For females, more passengers survived.
For males, significantly more passengers did not survive (deaths).
There are more passengers who did not survive than those who survived.
How does Survived compare for each Pclass? Are those who paid more prioritized for survival?
ggplot(data = data_2.clean, aes(x=Pclass, fill= Survived)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set1") +
labs(title = "Comparison of Survived in each PClass",
x = "PClass",
y = "Total Passengers")Yes, it turns out that those in Pclass 1 have a much better chance of survival compared to others.
ggplot(data = data_2.clean.survived, mapping = aes(x = Pclass, y = Survived)) +
geom_col(aes(fill = Sex), show.legend = T) +
labs(title = "Survived Passengers by Passenger Class",
x = "Passenger Class",
y = "Survived",
caption = "Titanic")Female passengers have a higher survival rate than male passengers in all Passenger Class categories.
ggplot(data = data_2.clean.ntsurvived, mapping = aes(x = Pclass, y = Survived)) +
geom_col(aes(fill = Sex), show.legend = T) +
labs(title = "Deceased Passengers by Passenger Class",
x = "Passenger Class",
y = "Deceased",
caption = "Titanic")Male passengers have a higher rate of non-survival in all Passenger Class categories.
How does Survived compare based on the Title in the passenger's name?
ggplot(data = data_2.clean, aes(x=Title, fill=Survived)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set1") +
labs(title = "Comparison of Survived in each Title Name",
x = "Title",
y = "Total Passengers")data_2.clean.title <- as.data.frame(table(data_2.clean$Survived, data_2.clean$Title))
data_2.clean.titleggplot(data_2.clean.title, aes(x=Var2, y=Freq, fill= Var1)) +
geom_col(position="fill") +
ggthemes::theme_economist() +
scale_color_gdocs() +
ggthemes::scale_fill_gdocs() +
scale_fill_brewer(palette = "Set1") +
geom_text(aes(label = Freq), position = position_fill(vjust = .5), col = "white") +
coord_flip() +
labs(fill = "Survived",
title = "Comparison of Survival in each Title Name",
x = "Title",
y = "Total Passengers")Male passengers with the title Mr have the worst survival rate.
Female passengers with the title Mrs have a better survival rate than males.
Honorific Titles and Officers have a worse survival rate compared to their death rate.
The title Master is used in English for young boys who are too young to be called Mister, and it seems that children have a better chance of survival.
ggplot(data = data_2.clean, aes(x=Sex, fill=Survived)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set1") +
facet_wrap(~Pclass, nrow = 3, scales = "free_y") +
labs(title = "Comparison of Survival in each Sex and PClass",
x = "Sex",
y = "Total Passengers")In Pclass 1, women have a very high survival rate with a high certainty, while men have a much higher death rate.
In Pclass 2, it is almost the same as what happened in Pclass 1, only the number differs.
In Pclass 3, women's survival is close to 50/50, in stark contrast to men, who have a much higher death rate.
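The plot below uses a Family size variable whose construction is not shown earlier; a minimal sketch, assuming family size counts the passenger plus their siblings/spouses and parents/children:
# Derive family size (the +1 counts the passenger themselves)
data_2.clean$Family <- data_2.clean$SibSp + data_2.clean$Parch + 1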
ggplot(data = data_2.clean, aes(x=Family, fill=Survived)) +
geom_bar(stat = "count", position = "dodge") +
scale_x_continuous(breaks = c(1:11)) +
scale_fill_brewer(palette = "Set1") +
ylim(c(0, 400)) +
labs(title = "Comparison of Survival by Family Size",
x = "Family Size",
y = "Total Passengers") +
  theme_bw()
# Filter Pclass 1st Class
data_2.pclass1 <- data_2.clean[data_2.clean$Pclass %in% "1st Class", ]
# Plot of Passengers with 1st Class Ticket
p1 <- ggplot(data = data_2.pclass1, aes(x = Fare, y = Age)) +
geom_point(colour = "yellow", shape = 21, size = 3, aes(fill = Survived)) +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
labs(title = "Passengers with 1st Class Ticket")
# Filter Pclass 2nd Class
data_2.pclass2 <- data_2.clean[data_2.clean$Pclass %in% "2nd Class", ]
# Plot of Passengers with 2nd Class Ticket
p2 <- ggplot(data = data_2.pclass2, aes(x = Fare, y = Age)) +
geom_point(colour = "yellow", shape = 21, size = 3, aes(fill = Survived)) +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
labs(title = "Passengers with 2nd Class Ticket")
# Filter Pclass 3rd Class
data_2.pclass3 <- data_2.clean[data_2.clean$Pclass %in% "3rd Class", ]
# Plot of Passengers with 3rd Class Ticket
p3 <- ggplot(data = data_2.pclass3, aes(x = Fare, y = Age)) +
geom_point(colour = "yellow", shape = 21, size = 3, aes(fill = Survived)) +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
labs(title = "Passengers with 3rd Class Ticket")
require(gridExtra)
grid.arrange(p1, p2, p3, ncol = 1)
data_2.clean.prop <- as.data.frame(table(data_2.clean$Survived, data_2.clean$Ages))
data_2.clean.prop
ggplot(data = data_2.clean.prop, mapping = aes(x = Freq, y = reorder(Var2, Freq))) +
geom_col(mapping = aes(fill = Var1), position = "stack") +
labs(x = "Proportion of Passenger Survival",
y = NULL,
fill = NULL,
title = "Proportion of Passenger Survival on Titanic Ship",
subtitle = "Based on Age Range") +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
theme(legend.position = "top")The age range of 20-29 years has a higher survival rate compared to others, but it also has a significantly higher death rate compared to the others.
Embarked port?data_2.clean.emb <- as.data.frame(table(data_2.clean$Survived, data_2.clean$Embarked))
data_2.clean.embggplot(data_2.clean.emb, aes(x=Var2, y=Freq, fill= Var1)) +
geom_col(position = "fill") +
ggthemes::theme_economist() +
scale_color_gdocs() +
ggthemes::scale_fill_gdocs() +
scale_fill_brewer(palette = "Set1") +
geom_text(aes(label = Freq), position = position_fill(vjust = .5), col = "white") +
labs(fill = "Survived",
title = "Comparison of Survival in each Embarked Port",
x = "Embarked",
y = "Frequency")ggplot(data = data_2.clean.survived, mapping = aes(x = Pclass,
y = Survived)) +
geom_col(aes(fill = Embarked), show.legend = T) +
labs(title = "Survival by Passenger Class",
x = "Passenger Class",
y = "Survived",
caption = "Titanic")The majority of passengers who survived embarked from Southampton (S).
Before proceeding with the modeling, it is essential to pay attention to the proportion of the target variable, which is “Survived”.
agg_tbl <- data_2.clean %>% group_by(Survived) %>%
summarise(total_count=n(),
.groups = 'drop')
ggplot(agg_tbl, aes(x=Survived, y=total_count, fill=Survived))+
geom_bar(stat="identity")+
geom_text(aes(label=total_count), vjust=-0.3, size=3.5) +
scale_fill_brewer(palette = "Set1") +
labs(title = "Comparison of Survival Categories of Passengers",
x = "Category",
y = "Total Passengers")##
## 0 1
## 0.6161616 0.3838384
##
## 0 1
## 549 342
Conclusion
The number of passengers who did not survive is greater than the number of passengers who survived in general. Among females, more passengers survived, while among males, significantly more passengers did not survive.
Among the three Pclasses, it is found that those in Pclass 1 have a much higher chance of survival compared to the others.
When comparing with the Sex variable, there are more males in all Pclasses. However, the male-to-female ratio is much higher in Pclass 3. Female passengers have a higher survival rate than male passengers in all Pclass categories. Male passengers have a higher mortality rate in all Pclass categories.
In the comparison of survival based on Title Name, males with the Title “Mr” have the worst survival rate, while females with the Title “Mrs” have a better survival rate than males. Honorific Titles and Officers have worse survival rates compared to their mortality rates. However, for the Title “Master,” which is used in English for young boys who are too young to be called “Mister,” it seems that children have a better chance of survival.
In the comparison of survival based on Family Size, passengers without a family have a lower chance of survival. Families with a smaller size (2-4) have a better chance of survival. However, families with a larger size (+5) do not have a better chance of survival.
The age range of 20-29 years has a higher survival rate than others, but it also has a significantly higher death rate compared to other age ranges.
For the Embarked variable, passengers who embarked from Cherbourg have a relatively higher survival ratio, with 93 passengers surviving and 75 deceased. In contrast, passengers who embarked from Queenstown and Southampton have lower survival ratios.
Next, we proceed with building models using Logistic Regression and K-Nearest Neighbors (KNN).
In this step, we create models using Logistic Regression and KNN. The data is split into training and testing sets. For logistic regression modeling, the glm() function is used. The selected variables that influence the target will be the independent variables in the model, while the target variable will be the dependent (response) variable. We also perform feature selection using the stepwise method. The evaluation results of the model are reported using accuracy, precision, recall, and F1-score.
In this section, we will create two models:
Logistic regression model using all available predictor variables.
Logistic regression model with filtered predictor variables using the stepwise feature selection method.
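The fitting code is not shown; a minimal sketch, assuming the data have already been split into data_train and data_test and that the full model is named model_all (a name used later in the text):
# Logistic regression using all available predictors
model_all <- glm(Survived ~ ., family = "binomial", data = data_train)
summary(model_all)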
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = data_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.372059 0.556361 6.061 1.35e-09 ***
## Pclass2 -0.740048 0.345760 -2.140 0.032326 *
## Pclass3 -1.668392 0.347153 -4.806 1.54e-06 ***
## Sexmale -2.571089 0.229212 -11.217 < 2e-16 ***
## Age -0.031937 0.009419 -3.391 0.000697 ***
## SibSp -0.270896 0.121928 -2.222 0.026298 *
## Parch -0.049235 0.156618 -0.314 0.753246
## Fare 0.004419 0.003082 1.434 0.151558
## EmbarkedQ 0.068630 0.451919 0.152 0.879294
## EmbarkedS -0.370599 0.287357 -1.290 0.197162
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 831.47 on 622 degrees of freedom
## Residual deviance: 568.33 on 613 degrees of freedom
## AIC: 588.33
##
## Number of Fisher Scoring iterations: 5
Using this model, we can make the following interpretations:
The intercept in this model represents the estimated log-odds of the response variable when all the predictor variables are set to zero or their reference levels (for categorical variables). In the coefficients table above, the intercept has an estimated value of 3.372059. This indicates the baseline log-odds of the response variable when the predictor variables are at their reference levels.
A breakdown of the information for the first few predictor variables:
Pclass2, Pclass3, Sexmale, EmbarkedS: If the passenger is in Pclass 2 or 3, is male, or embarked from Southampton, they are less likely to survive (relative to the reference levels).
Age, SibSp, Parch: The higher the value of these variables, the less likely the passenger is to survive.
Fare: The higher the fare, the more likely the passenger is to survive.
EmbarkedQ: If the passenger embarked from Queenstown, they are slightly more likely to survive, though this coefficient is not statistically significant.
The remaining predictor variables follow a similar pattern. The coefficients table provides information about the statistical significance and impact of each predictor variable on the response variable in the logistic regression model.
In the first model, there are still many predictor variables that are not significant to the target variable. Therefore, we will attempt model fitting using the stepwise method.
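The stepwise selection whose trace appears below can be reproduced with step(); a minimal sketch, assuming the full model is named model_all and the result model_both:
# Stepwise feature selection in both directions, guided by AIC
model_both <- step(model_all, direction = "both")
summary(model_both)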
## Start: AIC=588.33
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
##
## Df Deviance AIC
## - Parch 1 568.42 586.42
## - Embarked 2 570.87 586.87
## <none> 568.33 588.33
## - Fare 1 570.88 588.88
## - SibSp 1 574.05 592.05
## - Age 1 580.51 598.51
## - Pclass 2 593.63 609.63
## - Sex 1 719.76 737.76
##
## Step: AIC=586.42
## Survived ~ Pclass + Sex + Age + SibSp + Fare + Embarked
##
## Df Deviance AIC
## - Embarked 2 571.10 585.10
## <none> 568.42 586.42
## - Fare 1 570.88 586.88
## + Parch 1 568.33 588.33
## - SibSp 1 575.39 591.39
## - Age 1 580.53 596.53
## - Pclass 2 594.56 608.56
## - Sex 1 723.60 739.60
##
## Step: AIC=585.1
## Survived ~ Pclass + Sex + Age + SibSp + Fare
##
## Df Deviance AIC
## <none> 571.10 585.10
## - Fare 1 574.42 586.42
## + Embarked 2 568.42 586.42
## + Parch 1 570.87 586.87
## - SibSp 1 579.65 591.65
## - Age 1 583.35 595.35
## - Pclass 2 596.73 606.73
## - Sex 1 735.31 747.31
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Fare, family = "binomial",
## data = data_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.118507 0.514684 6.059 1.37e-09 ***
## Pclass2 -0.817069 0.340137 -2.402 0.016298 *
## Pclass3 -1.648740 0.335594 -4.913 8.97e-07 ***
## Sexmale -2.598402 0.222967 -11.654 < 2e-16 ***
## Age -0.031777 0.009354 -3.397 0.000681 ***
## SibSp -0.308111 0.116117 -2.653 0.007967 **
## Fare 0.004717 0.002894 1.630 0.103080
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 831.47 on 622 degrees of freedom
## Residual deviance: 571.10 on 616 degrees of freedom
## AIC: 585.1
##
## Number of Fisher Scoring iterations: 5
In this model, the stepwise procedure retains Pclass, Sex, Age, SibSp, and Fare, dropping Parch and Embarked, and we can make the following interpretations:
The intercept in this model represents the estimated log-odds of the response variable when all the predictor variables are set to zero or their reference levels (for categorical variables). In the coefficients table, the intercept has an estimated value of 3.118507. This indicates the baseline log-odds of the response variable when the predictor variables are at their reference levels.
Breakdown of the information for the predictor variables:
Pclass2, Pclass3, Sexmale: If the passenger is in Pclass 2 or 3, or is male, they are less likely to survive.
Age, SibSp: The larger the value of these variables, the less likely the passenger is to survive.
Fare: The larger the value of this variable, the more likely the passenger is to survive.
Comparison of AIC (Akaike Information Criterion): It estimates the amount of missing information in a model. The smaller the AIC value, the better the model.
Null deviance: It represents the error or discrepancy when no predictors are used in the model.
Residual deviance: It represents the error or discrepancy when the model is built using the predictors.
With these three parameters, we can conclude that the stepwise model is the better model.
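A minimal sketch of the AIC comparison that produces the values below, assuming the two models are named model_all and model_both:
# Compare the AIC of the full and stepwise models
AIC(model_all)
AIC(model_both)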
## [1] 588.3255
## [1] 585.1033
AIC represents the amount of missing information in a model. The
smaller the AIC value, the less information is missing. From the results
above, it is known that the AIC value for model_both is
smaller than the AIC value for model_all.
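The computation of the predicted probabilities used next is not shown; a minimal sketch, assuming the stepwise model model_both is applied to the test set:
# Predicted probability of survival for each test observation
data_test$prob_bh <- predict(model_both, newdata = data_test, type = "response")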
When the predicted probability for a test observation is greater than 0.5, the model classifies the passenger as "Survived"; when it is less than or equal to 0.5, the model classifies the passenger as "Not Survived."
data_test$pred_data <- factor(ifelse(data_test$prob_bh > 0.5, 1, 0))
data_test[1:10, c("pred_data", "Survived")]After making predictions using the model, there are still some incorrect predictions.
Recall/Sensitivity: It measures how effectively the model can correctly predict the proportion of actual positive data.
Specificity: It measures how effectively the model can correctly predict the proportion of actual negative data.
Accuracy: It measures how effectively the model can correctly predict the target variable Y.
Precision: It measures how effectively the model can correctly predict the positive class among all predicted results.
In classification, we evaluate the model based on the confusion matrix.
confusion_matrix <- confusionMatrix(data = data_test$pred_data, reference = data_test$Survived, positive = "1")
confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 141 30
## 1 26 71
##
## Accuracy : 0.791
## 95% CI : (0.7374, 0.8381)
## No Information Rate : 0.6231
## P-Value [Acc > NIR] : 2.486e-09
##
## Kappa : 0.5516
##
## Mcnemar's Test P-Value : 0.6885
##
## Sensitivity : 0.7030
## Specificity : 0.8443
## Pos Pred Value : 0.7320
## Neg Pred Value : 0.8246
## Prevalence : 0.3769
## Detection Rate : 0.2649
## Detection Prevalence : 0.3619
## Balanced Accuracy : 0.7736
##
## 'Positive' Class : 1
##
To evaluate the performance of the model, we can display the results using the confusion matrix and other evaluation metrics such as recall, specificity, accuracy, and precision. These metrics will provide insights into how well the model performs in predicting survival outcomes for the passengers.
## Accuracy: 0.7910448
## Precision: 0.7319588
## Recall: 0.7029703
## F1 Score: 0.7171717
Interpretation
Based on the above confusion matrix results, we can interpret as follows:
From all the data classified by the model, 79.1% have correct predictions (accuracy). This means that approximately 79.1% of the total observations in the dataset can be correctly predicted by the model.
The model's recall (sensitivity) for the positive "Survived" class is 70.29%. This means that out of all the observations that actually belong to the "Survived" class, the model correctly identifies around 70.29% of them.
The model's precision (positive predictive value) for the "Survived" class is 73.19%. This means that out of the test observations predicted as "Survived," around 73.19% actually survived.
Next, we build the KNN model. In this stage, numeric predictor filtering and class-proportion checking are performed. The data is also summarized to understand the range of the predictor variable values. Training and testing sets are generated by splitting the data.
data_knn <- data %>%
mutate(Pclass = as.numeric(Pclass),
Sex = as.numeric(Sex),
         Embarked = as.numeric(Embarked))
## Rows: 891
## Columns: 8
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
## $ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ Sex <dbl> 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2…
## $ Age <dbl> 22.00000, 38.00000, 26.00000, 35.00000, 35.00000, 29.69912, 5…
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ Embarked <dbl> 4, 2, 4, 4, 4, 3, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 3, 4, 4, 2, 4…
## Survived Pclass Sex Age SibSp
## 0:549 Min. :1.000 Min. :1.000 Min. : 0.42 Min. :0.000
## 1:342 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:22.00 1st Qu.:0.000
## Median :3.000 Median :2.000 Median :29.70 Median :0.000
## Mean :2.309 Mean :1.648 Mean :29.70 Mean :0.523
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:35.00 3rd Qu.:1.000
## Max. :3.000 Max. :2.000 Max. :80.00 Max. :8.000
## Parch Fare Embarked
## Min. :0.0000 Min. : 0.00 Min. :2.000
## 1st Qu.:0.0000 1st Qu.: 7.91 1st Qu.:3.000
## Median :0.0000 Median : 14.45 Median :4.000
## Mean :0.3816 Mean : 32.20 Mean :3.536
## 3rd Qu.:0.0000 3rd Qu.: 31.00 3rd Qu.:4.000
## Max. :6.0000 Max. :512.33 Max. :4.000
Before training the KNN model, we perform pre-processing by scaling
the data using the scale() function. Then, we search for
the optimal value of k using the square root method. The KNN model is
trained and tested using the knn() function, and the
evaluation results of the model are displayed based on accuracy,
precision, recall, and F1-score.
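The separation of predictors and labels for KNN is not shown; a minimal sketch, assuming an 80/20 random split (the index name idx and the seed are assumptions):
# 80/20 split of the numeric data for KNN
set.seed(100) # seed value is an assumption
idx <- sample(nrow(data_knn), size = round(0.8 * nrow(data_knn)))
data_knn_train_x <- data_knn[idx, -1]  # predictors only (all columns except Survived)
data_knn_test_x  <- data_knn[-idx, -1]
data_knn_train_y <- data_knn[idx, 1]   # target labels
data_knn_test_y  <- data_knn[-idx, 1]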
data_knn_train_xs <- scale(x = data_knn_train_x)
data_knn_test_xs <- scale(x = data_knn_test_x,
center = attr(data_knn_train_xs, "scaled:center"),
                             scale = attr(data_knn_train_xs, "scaled:scale"))
## Optimal k: 26.68333
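The k selection and the knn() fit are not shown; a minimal sketch, assuming k is taken as the square root of the number of training rows and rounded to an odd number to avoid ties:
# Square-root rule for choosing k
sqrt(nrow(data_knn_train_xs)) # about 26.68, so use k = 27
# Fit and predict with KNN; class::knn() returns predicted labels for the test set
knn_model <- knn(train = data_knn_train_xs,
                 test = data_knn_test_xs,
                 cl = data_knn_train_y,
                 k = 27)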
confusion_matrix_knn <- confusionMatrix(data = knn_model,
reference = data_knn_test_y,
positive = "1")
confusion_matrix_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 107 20
## 1 9 43
##
## Accuracy : 0.838
## 95% CI : (0.7757, 0.8887)
## No Information Rate : 0.648
## P-Value [Acc > NIR] : 1.311e-08
##
## Kappa : 0.6301
##
## Mcnemar's Test P-Value : 0.06332
##
## Sensitivity : 0.6825
## Specificity : 0.9224
## Pos Pred Value : 0.8269
## Neg Pred Value : 0.8425
## Prevalence : 0.3520
## Detection Rate : 0.2402
## Detection Prevalence : 0.2905
## Balanced Accuracy : 0.8025
##
## 'Positive' Class : 1
##
The evaluation results of the KNN model, including accuracy, precision, recall, and F1-score, are displayed below.
## KNN Model Performance
## Accuracy: 0.8379888
## Precision: 0.8269231
## Recall: 0.6825397
## F1 Score: 0.7478261
Based on the confusion matrix results above, we can interpret as follows:
From all the data classified by the model, 83.79% have correct predictions (accuracy). This means that approximately 83.79% of the total observations in the dataset can be correctly predicted by the model.
The model's recall (sensitivity) for the positive "Survived" class is 68.25%. This means that out of all observations that actually belong to the "Survived" class, the model correctly identifies about 68.25% of them.
The model's precision (positive predictive value) for the "Survived" class is 82.69%. This means that out of the test observations predicted as "Survived," about 82.69% actually survived.
Finally, a comparison of the evaluation results between the Logistic Regression and KNN models is performed. The evaluation metrics for each model are displayed in the form of a data frame containing accuracy, recall, specificity, and precision values.
# Evaluasi Model Logistic Regression
eval_logit <- data.frame(Accuracy = confusion_matrix$overall["Accuracy"],
Recall = confusion_matrix$byClass["Sensitivity"],
Specificity = confusion_matrix$byClass["Specificity"],
Precision = confusion_matrix$byClass["Pos Pred Value"])
# Evaluasi Model K-NN
eval_knn <- data.frame(Accuracy = confusion_matrix_knn$overall["Accuracy"],
Recall = confusion_matrix_knn$byClass["Sensitivity"],
Specificity = confusion_matrix_knn$byClass["Specificity"],
Precision = confusion_matrix_knn$byClass["Pos Pred Value"])
eval_logit
However, in this case, we will use Accuracy in the evaluation process for both cases.
Conclusion: The metrics that need to be prioritized or evaluated are Accuracy and Precision.
Based on the evaluation values, the KNN model has higher Accuracy and Precision (0.8379888 / 0.8269231) than the Logistic Regression model (0.7910448 / 0.7319588). This indicates that the KNN model makes fewer errors of the kind where a passenger who did not survive is predicted as a survivor.
Therefore, from the evaluation results, the KNN method is recommended for this prediction task because of its higher Precision value.
In the classification analysis using the Logistic Regression and KNN models to predict the survival of Titanic passengers, the evaluation results show that the KNN model performs better, with a higher Precision value than the Logistic Regression model. In this case, Accuracy and Precision are the prioritized metrics, as the most concerning risk is a False Positive, where a passenger who actually did not survive is predicted as having survived.
Therefore, the best metrics that can be used to evaluate the model’s performance are Accuracy and Precision.
In this section, we will split the dataset into data train and data test. The data train will be used to train the model, and the data test will be used to evaluate the performance of the trained models. 80% of the dataset will be used for data train, and the remaining 20% will be used for data test.
## Rows: 891
## Columns: 8
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
## $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ Sex <fct> male, female, female, female, male, male, male, male, female,…
## $ Age <dbl> 22.00000, 38.00000, 26.00000, 35.00000, 35.00000, 29.69912, 5…
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C, S…
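The 80/20 split itself is not shown; a minimal sketch, assuming a simple random index (the index name and the seed are assumptions):
# 80/20 train/test split
set.seed(100) # seed value is an assumption
idx <- sample(nrow(data), size = round(0.8 * nrow(data)))
data_train <- data[idx, ]
data_test  <- data[-idx, ]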
Before building the models, we need to examine the proportion of the target variable in the target column of the data train.
##
## 0 1
## 0.6072931 0.3927069
We can observe that the proportion of positive and negative values in the target variable is unbalanced. This imbalance can potentially affect the performance of the model. Since the dataset is relatively small, we will use the upsampling method to balance the proportions rather than discarding rows. However, it is essential to note that no new information is added to the data during this process.
RNGkind(sample.kind = "Rounding")
train_down <- upSample(
x = data_train %>% select(-Survived),
y = data_train$Survived,
yname = "Survived"
)
nrow(train_down)
## [1] 866
##
## 0 1
## 0.5 0.5
Now we have a balanced proportion of the target variable.
Naive Bayes is one of the classification methods based on Bayes’
Theorem of Probability. In fact, Bayes’ theorem is often used in daily
life. The Naive Bayes model can be created using the
naiveBayes() function from the e1071 library.
In this analysis, we first create a model using Naive Bayes. The model uses all the predictors in the dataset. One of the weaknesses of Naive Bayes is the zero-frequency problem, where a level of a predictor never occurs in a certain class. To address this issue, we add the laplace = 1 parameter.
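The Naive Bayes fit is not shown; a minimal sketch, assuming the upsampled training set created earlier is named train_down:
# Naive Bayes with Laplace smoothing to handle the zero-frequency problem
model_naive <- naiveBayes(Survived ~ ., data = train_down, laplace = 1)
model_naive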
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.5 0.5
##
## Conditional probabilities:
## Pclass
## Y 1 2 3
## 0 0.1513761 0.1811927 0.6674312
## 1 0.4013761 0.2201835 0.3784404
##
## Sex
## Y female male
## 0 0.1517241 0.8482759
## 1 0.6781609 0.3218391
##
## Age
## Y [,1] [,2]
## 0 30.94853 12.26125
## 1 28.77417 13.58285
##
## SibSp
## Y [,1] [,2]
## 0 0.5381062 1.3066968
## 1 0.4595843 0.7257543
##
## Parch
## Y [,1] [,2]
## 0 0.3025404 0.7658317
## 1 0.4572748 0.7897778
##
## Fare
## Y [,1] [,2]
## 0 21.90124 31.70266
## 1 50.52230 70.46483
##
## Embarked
## Y C Q S
## 0 0.00228833 0.13729977 0.08695652 0.77345538
## 1 0.00228833 0.27002288 0.08237986 0.64530892
The provided information shows the output of a Naive Bayes classifier for discrete predictors. Here’s an explanation of the different components:
A-priori probabilities: These are the probabilities of the classes (0 and 1) occurring in the target variable. In this case, the probabilities are 0.5 for each class, indicating an equal prior probability for both classes.
Conditional probabilities: These are the conditional probabilities of each predictor variable given each class. Each table represents the probabilities of a specific predictor variable taking certain values (e.g., Pclass 1, 2, or 3) given the class (0 or 1).
These conditional probabilities can be used to calculate the posterior probabilities and make predictions using the Naive Bayes classifier.
# TRAIN
preds_naive_train <- predict(model_naive, newdata = data_train)
table(preds_naive_train, data_train$Survived)
##
## preds_naive_train 0 1
## 0 354 75
## 1 79 205
# TEST
preds_naive_test <- predict(model_naive, newdata = data_test)
table(preds_naive_test, data_test$Survived)
##
## preds_naive_test 0 1
## 0 97 15
## 1 19 47
The confusion matrix for the test set displays the following information (taking "Survived" (1) as the positive class for this raw table):
True Negative (TN): The number of instances correctly predicted as negative (0). Here there are 97 such instances.
False Positive (FP): The number of instances incorrectly predicted as positive (1) when they are actually negative (0). Here there are 19 such instances.
False Negative (FN): The number of instances incorrectly predicted as negative (0) when they are actually positive (1). Here there are 15 such instances.
True Positive (TP): The number of instances correctly predicted as positive (1). Here there are 47 such instances.
confusionMatrix_nv_train <- confusionMatrix(preds_naive_train, data_train$Survived)
confusionMatrix_nv_train
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 354 75
## 1 79 205
##
## Accuracy : 0.784
## 95% CI : (0.752, 0.8137)
## No Information Rate : 0.6073
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5483
##
## Mcnemar's Test P-Value : 0.809
##
## Sensitivity : 0.8176
## Specificity : 0.7321
## Pos Pred Value : 0.8252
## Neg Pred Value : 0.7218
## Prevalence : 0.6073
## Detection Rate : 0.4965
## Detection Prevalence : 0.6017
## Balanced Accuracy : 0.7748
##
## 'Positive' Class : 0
##
confusionMatrix_nv_test <- confusionMatrix(preds_naive_test, data_test$Survived)
confusionMatrix_nv_test
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 97 15
## 1 19 47
##
## Accuracy : 0.809
## 95% CI : (0.7434, 0.8639)
## No Information Rate : 0.6517
## P-Value [Acc > NIR] : 2.959e-06
##
## Kappa : 0.5855
##
## Mcnemar's Test P-Value : 0.6069
##
## Sensitivity : 0.8362
## Specificity : 0.7581
## Pos Pred Value : 0.8661
## Neg Pred Value : 0.7121
## Prevalence : 0.6517
## Detection Rate : 0.5449
## Detection Prevalence : 0.6292
## Balanced Accuracy : 0.7971
##
## 'Positive' Class : 0
##
Result:
Target variable = Survived(1) or Not Survived (0)
Positive Class: Not Survived (0)
FN: Passengers predicted as survived, but actually not survived
FP: Passengers predicted as not survived, but actually survived
The concerning risk is an error in which a passenger who actually did not survive is predicted as having survived (a False Positive).
Conclusion: The metric that needs to be prioritized or evaluated is Precision.
From the output of the Naive Bayes classification model, it can be observed that:
Training: The model has an accuracy rate of 78.4% and a Precision/Positive Predictive Value of 82.52%.
Testing: The model has an accuracy rate of 80.9% and a Precision/Positive Predictive Value of 86.61%.
The results indicate good performance and show that the model is not overfitting or underfitting.
Decision Tree is a tree-based model that is relatively simple with robust and powerful performance for predictions. It generates a visualization in the form of a decision tree that can be easily interpreted.
The main drawback of Decision Tree is its tendency to overfit. This is caused by its ability to split the data into very detailed branches, even reaching leaf nodes that contain only one observation. To address overfitting, pruning techniques are used to determine when to stop creating branches in the decision tree, resulting in a simpler and more general tree that can recognize patterns and prevent the model from being too specific to the training data only.
Model Without Pruning
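The unpruned tree is not shown in code; a minimal sketch, assuming partykit's ctree() with its default control parameters:
# Decision tree without pruning (default controls)
dt_model <- ctree(Survived ~ ., data = data_train)
plot(dt_model) # visualize the fitted tree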
Model With Pruning
Performing model tuning by setting the parameters
mincriterion = 0.90 and minsplit = 15.
# dt_model <- ctree (Survived ~ ., data_train,
# control = ctree_control(mincriterion = 0.9,
# minsplit = 15,
#                                              minbucket = 0))
However, since the results obtained are the same, we will use the model without pruning only.
##
## Model formula:
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
##
## Fitted party:
## [1] root
## | [2] Sex in female
## | | [3] Pclass in 1, 2: 1 (n = 139, err = 6.5%)
## | | [4] Pclass in 3
## | | | [5] Embarked in C, Q: 1 (n = 46, err = 26.1%)
## | | | [6] Embarked in S: 0 (n = 73, err = 39.7%)
## | [7] Sex in male
## | | [8] Pclass in 1: 0 (n = 95, err = 34.7%)
## | | [9] Pclass in 2, 3
## | | | [10] Age <= 12
## | | | | [11] SibSp <= 2: 1 (n = 12, err = 0.0%)
## | | | | [12] SibSp > 2: 0 (n = 10, err = 10.0%)
## | | | [13] Age > 12: 0 (n = 338, err = 12.1%)
##
## Number of inner nodes: 6
## Number of terminal nodes: 7
The provided information represents a decision tree model (dt_model) that predicts the survival outcome (Survived) based on several predictor variables (Pclass, Sex, Age, SibSp, Parch, Fare, Embarked). Here is an explanation of the tree structure:
The root node is the starting point of the decision tree.
The first split is based on the variable Sex. If the individual is female, the tree proceeds to node 3. If the individual is male, the tree proceeds to node 7.
For females, the next split is based on the variable Pclass. If the individual’s Pclass is 1 or 2, the predicted outcome is 1 (survived) in node 3. If the individual’s Pclass is 3, the tree proceeds to node 4.
Similarly, the tree continues to split based on different predictor variables until it reaches the leaf nodes, which represent the final predicted outcomes (survived or not survived) based on the combination of predictor variable values.
The numbers in each node represent the number of observations that fall into that particular category based on the given split criteria.
Overall, the decision tree provides a clear visual representation of the decision-making process and how different predictor variables contribute to the final survival prediction.
# TRAIN
preds_dt_train <- predict(dt_model, newdata = data_train)
table(preds_dt_train, data_train$Survived)
##
## preds_dt_train 0 1
## 0 412 104
## 1 21 176
# TEST
preds_dt_test <- predict(dt_model, newdata = data_test)
table(preds_dt_test, data_test$Survived)
##
## preds_dt_test 0 1
## 0 110 21
## 1 6 41
True Negative (TN): The model predicted 0 (not survived) and the actual value is also 0. Count: 110.
False Positive (FP): The model predicted 1 (survived), but the actual value is 0 (not survived). Count: 6.
False Negative (FN): The model predicted 0 (not survived), but the actual value is 1 (survived). Count: 21.
True Positive (TP): The model predicted 1 (survived) and the actual value is also 1. Count: 41.
confusionMatrix_dt_train <- confusionMatrix(preds_dt_train, data_train$Survived)
confusionMatrix_dt_train
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 412 104
## 1 21 176
##
## Accuracy : 0.8247
## 95% CI : (0.7947, 0.8519)
## No Information Rate : 0.6073
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6121
##
## Mcnemar's Test P-Value : 2.229e-13
##
## Sensitivity : 0.9515
## Specificity : 0.6286
## Pos Pred Value : 0.7984
## Neg Pred Value : 0.8934
## Prevalence : 0.6073
## Detection Rate : 0.5778
## Detection Prevalence : 0.7237
## Balanced Accuracy : 0.7900
##
## 'Positive' Class : 0
##
confusionMatrix_dt_test <- confusionMatrix(preds_dt_test, data_test$Survived)
confusionMatrix_dt_test
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 110 21
## 1 6 41
##
## Accuracy : 0.8483
## 95% CI : (0.787, 0.8976)
## No Information Rate : 0.6517
## P-Value [Acc > NIR] : 3.516e-09
##
## Kappa : 0.6459
##
## Mcnemar's Test P-Value : 0.007054
##
## Sensitivity : 0.9483
## Specificity : 0.6613
## Pos Pred Value : 0.8397
## Neg Pred Value : 0.8723
## Prevalence : 0.6517
## Detection Rate : 0.6180
## Detection Prevalence : 0.7360
## Balanced Accuracy : 0.8048
##
## 'Positive' Class : 0
##
Result:
Target variable = Survived (1) or Not Survived (0)
Positive Class: Not Survived (0)
FN: Passengers predicted as survived, but they did not survive.
FP: Passengers predicted as not survived, but they survived.
The concerning risk is if there is an error in predicting passengers who actually did not survive as survived (False Positive).
Conclusion: The metric that needs to be prioritized or evaluated is Precision.
From the output of the Decision Tree classification model, it can be observed that:
Training: The model has an accuracy of 82.47% and Precision/Positive Predictive Value of 79.84%.
Testing: The model has an accuracy of 84.83% and Precision/Positive Predictive Value of 83.97%.
These results indicate that the model performs well and does not suffer from Overfitting or Underfitting.
K-fold cross-validation is a resampling technique commonly used in machine learning and model evaluation. Its purpose is to assess the performance and generalization ability of a predictive model.
The process involves dividing the available data into K equally sized subsets or “folds.” The model is then trained and evaluated K times, each time using a different fold as the validation set and the remaining folds as the training set. This allows for a more comprehensive evaluation of the model’s performance and helps to mitigate the impact of data variability and overfitting.
set.seed(417)
ctrl <- trainControl(method="repeatedcv", number=5, repeats=3) # k-fold cross validation
model_rf <- train(Survived ~ .,
data= data_train,
method="rf",
                  trControl = ctrl)
In summary, the Random Forest model achieved an accuracy of approximately 82% and a kappa coefficient of around 0.616 when trained with mtry = 6. These metrics indicate the model's overall predictive performance on unseen data during cross-validation.
Among the different models tested with varying values of mtry, the selected model is the one with mtry = 6, as it achieved the highest accuracy when tested on the data obtained from bootstrap sampling, which can be considered as the training data used to build the decision tree in the random forest.
## Random Forest
##
## 713 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 570, 571, 571, 570, 570, 570, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8153419 0.5949658
## 6 0.8200171 0.6163380
## 10 0.8055189 0.5863042
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
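The importance values below can be obtained with caret's varImp(); a minimal sketch:
# Variable importance of the trained random forest
varImp(model_rf)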
## rf variable importance
##
## Overall
## Sexmale 100.000
## Fare 90.087
## Age 81.591
## Pclass3 18.835
## SibSp 13.233
## Parch 8.441
## EmbarkedS 4.235
## EmbarkedC 2.174
## Pclass2 1.453
## EmbarkedQ 0.000
These values represent the relative importance of each predictor variable in the model. Higher values indicate greater importance in predicting the outcome variable. In this case, Sex (male), Fare, and Age are the most important variables in the model.
On the other hand, Pclass3, SibSp, Parch, EmbarkedS, EmbarkedC, Pclass2, and EmbarkedQ have relatively lower importance.
It is not necessary to perform cross-validation when using random forest. This is because, from the results of bootstrap sampling, there are data points that are not used in building the random forest. These data points are referred to as out-of-bag data and are considered as test data by the model. The model will make predictions using these data points and calculate the resulting error, known as the out-of-bag error. This out-of-bag error serves as an estimate of the model’s performance on unseen data and can be used as a measure of the model’s accuracy without the need for additional cross-validation.
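The printout below is the underlying randomForest object stored inside the caret train object; a minimal sketch:
# Inspect the final random forest and its out-of-bag error
model_rf$finalModel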
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 18.23%
## Confusion matrix:
## 0 1 class.error
## 0 380 53 0.1224018
## 1 77 203 0.2750000
In the model_rf model, the Out of Bag Error value is
18.23%. Therefore, the accuracy of the model on the test data (out of
bag data) is 100% - 18.23% = 81.77%.
# TRAIN
preds_rf_train <- predict(model_rfload, newdata = data_train, type = "raw")
table(preds_rf_train, data_train$Survived)
##
## preds_rf_train 0 1
## 0 429 18
## 1 4 262
# TEST
preds_rf_test <- predict(model_rfload, newdata = data_test, type = "raw")
table(preds_rf_test, data_test$Survived)
##
## preds_rf_test 0 1
## 0 102 12
## 1 14 50
True Negative (TN): The model correctly predicted 0 (not survived) and the actual value is also 0. Count: 102.
False Positive (FP): The model predicted 1 (survived), but the actual value is 0 (not survived). Count: 14.
False Negative (FN): The model predicted 0 (not survived), but the actual value is 1 (survived). Count: 12.
True Positive (TP): The model correctly predicted 1 (survived) and the actual value is also 1. Count: 50.
confusionMatrix_rf_train <- confusionMatrix(preds_rf_train, data_train$Survived)
confusionMatrix_rf_train
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 429 18
## 1 4 262
##
## Accuracy : 0.9691
## 95% CI : (0.9537, 0.9806)
## No Information Rate : 0.6073
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9347
##
## Mcnemar's Test P-Value : 0.005578
##
## Sensitivity : 0.9908
## Specificity : 0.9357
## Pos Pred Value : 0.9597
## Neg Pred Value : 0.9850
## Prevalence : 0.6073
## Detection Rate : 0.6017
## Detection Prevalence : 0.6269
## Balanced Accuracy : 0.9632
##
## 'Positive' Class : 0
##
confusionMatrix_rf_test <- confusionMatrix(preds_rf_test, data_test$Survived)
confusionMatrix_rf_test
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 102 12
## 1 14 50
##
## Accuracy : 0.8539
## 95% CI : (0.7933, 0.9023)
## No Information Rate : 0.6517
## P-Value [Acc > NIR] : 1.145e-09
##
## Kappa : 0.6807
##
## Mcnemar's Test P-Value : 0.8445
##
## Sensitivity : 0.8793
## Specificity : 0.8065
## Pos Pred Value : 0.8947
## Neg Pred Value : 0.7812
## Prevalence : 0.6517
## Detection Rate : 0.5730
## Detection Prevalence : 0.6404
## Balanced Accuracy : 0.8429
##
## 'Positive' Class : 0
##
Result:
Target variable = Survived (1) or Not Survived (0)
Positive Class: Not Survived (0)
FN: Passengers predicted as survived, but actually not survived.
FP: Passengers predicted as not survived, but actually survived.
The concerning risk is the occurrence of False Positive, where passengers who actually did not survive are predicted as survivors.
Conclusion: The metric that needs to be prioritized or evaluated is Precision.
From the output of the Random Forest classification model, it can be observed that:
Training: The model achieved an accuracy of 96.91% and a Precision/Positive Predictive Value of 95.97%.
Testing: The model achieved an accuracy of 85.39% and a Precision/Positive Predictive Value of 89.47%.
The results indicate that the model's performance is above 80%, though the gap between training and testing accuracy suggests a slight possibility of overfitting.
Recap Model Evaluation NB/DT/RF
Finally, a comparison of the evaluation results between the Naive Bayes, Decision Tree, and Random Forest models was conducted. The evaluation metrics for each model are presented in a data frame, including accuracy, recall, specificity, and precision.
# Evaluasi Model Naive Bayes
eval_nv <- data.frame(Model = "Naive Bayes",
Accuracy = confusionMatrix_nv_test$overall["Accuracy"],
Recall = confusionMatrix_nv_test$byClass["Sensitivity"],
Specificity = confusionMatrix_nv_test$byClass["Specificity"],
Precision = confusionMatrix_nv_test$byClass["Pos Pred Value"])
# Evaluasi Model Decision Tree
eval_dt <- data.frame(Model = "Decision Tree",
Accuracy = confusionMatrix_dt_test$overall["Accuracy"],
Recall = confusionMatrix_dt_test$byClass["Sensitivity"],
Specificity = confusionMatrix_dt_test$byClass["Specificity"],
Precision = confusionMatrix_dt_test$byClass["Pos Pred Value"])
# Evaluasi Model Random Forest
eval_rf <- data.frame(Model = "Random Forest",
Accuracy = confusionMatrix_rf_test$overall["Accuracy"],
Recall = confusionMatrix_rf_test$byClass["Sensitivity"],
Specificity = confusionMatrix_rf_test$byClass["Specificity"],
Precision = confusionMatrix_rf_test$byClass["Pos Pred Value"])
eval <- rbind(eval_nv, eval_dt, eval_rf)
rownames(eval) <- NULL
eval[order(eval$Precision, decreasing = T), ]
As previously stated, Accuracy and Precision were chosen as the metrics to evaluate model performance. Based on the evaluation results, the Random Forest model outperforms the Naive Bayes and Decision Tree models in predicting the survival of Titanic passengers.
The random forest model achieved the following performance metrics:
An accuracy value of 85.39% indicates that the random forest model correctly predicted the survival outcome (0 or 1) for approximately 85.39% of the instances in the test data, which represents unseen data. This suggests that the model has a high ability to correctly classify both survival and non-survival cases in unseen data.
In summary, the random forest model demonstrates better predictive performance, with a higher accuracy and precision, making it a preferred model for predicting the survival of Titanic passengers.