Background
Hi. In this project I will try to build a model using logistic regression and the kNN method based on the famous Titanic data from Kaggle. This is an attempt at a task for a bootcamp course I'm currently attending.
RMS Titanic, mostly known simply as Titanic, was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in April 1912 after striking an iceberg during its maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making it one of the deadliest sinkings of a single ship at the time and the deadliest peacetime sinking of a superliner or cruise ship to date.
Objective:
Our objective here is to build a machine learning model with two methods, logistic regression and kNN, and see which of the two gives the better prediction result.
Data Preparation
Import Data
# import libraries
library(knitr)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.0.5
library(DT)
## Warning: package 'DT' was built under R version 4.0.5
library(tidyverse)
library(dplyr)
library(inspectdf)
## Warning: package 'inspectdf' was built under R version 4.0.5
library(ggplot2)
library(forcats)
library(colorspace)
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.0.5
library(class)
Kaggle gives us 2 datasets: train.csv and test.csv. We're going to use train.csv only, since test.csv was supposed to be the submission data for the challenge (its Survived column is left empty for prediction).
# import data
<- read.csv("data/train.csv",
titanic na.strings = c("null", "", "NA"),
stringsAsFactors = T)
datatable(titanic, rownames = F, class = "nowrap", options = list(scrollX = T))
- PassengerId: passenger's unique ID
- Survived: survival status of passenger (0 = didn't survive, 1 = survived)
- Pclass: passenger's ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
- Name: name of passenger
- Sex: gender of passenger
- Age: age of passenger (in years)
- SibSp: number of siblings/spouses aboard
- Parch: number of parents/children aboard
- Ticket: passenger's ticket number
- Fare: passenger's fare
- Cabin: passenger's cabin number
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Check Duplicates and NAs
Next we’re going to check whether this data has duplicates and/or NAs.
# check duplicates
table(duplicated(titanic))
##
## FALSE
## 891
# check NA
NA_Sum <- c()
for(i in 1:ncol(titanic)){
  NA_Sum[i] <- sum(is.na(titanic[,colnames(titanic)[i]]))
}

na <- data.frame(
  Column_Name = c(colnames(titanic)),
  NA_Sum,
  NA_Percent = c(round(NA_Sum/nrow(titanic)*100, 2))
)
kable(na, align = "c")
Column_Name | NA_Sum | NA_Percent |
---|---|---|
PassengerId | 0 | 0.00 |
Survived | 0 | 0.00 |
Pclass | 0 | 0.00 |
Name | 0 | 0.00 |
Sex | 0 | 0.00 |
Age | 177 | 19.87 |
SibSp | 0 | 0.00 |
Parch | 0 | 0.00 |
Ticket | 0 | 0.00 |
Fare | 0 | 0.00 |
Cabin | 687 | 77.10 |
Embarked | 2 | 0.22 |
There are no duplicates, but unfortunately there are missing values: 19.87% in the Age column, 77.10% in the Cabin column, and 0.22% in the Embarked column.
Missing Values
Age
The missing values in the Age column consist of 177 rows (19.87% of the total data). Age is crucial since I assume that women and children would have been prioritized when boarding the lifeboats, so neither the column nor the rows can simply be removed. We're going to impute the missing values based on the mean age of each passenger's Title (Mr, Mrs, etc.).
We're first going to extract the titles from each name and create a new column, Title.
# extracting name titles into a new column
titanic$Title <- gsub('(.*, )|(\\..*)', '', titanic$Name)

# change title data type
titanic$Title <- as.factor(titanic$Title)

# check which Title has an NA age
title_total <- titanic %>% group_by(Title) %>% summarise(Total = length(Title))
title_na <- titanic %>% group_by(Title) %>% summarise("NA" = sum(is.na(Age)))

datatable(left_join(title_total, title_na), rownames = F)
Before going any further, we can see that there are some uncommon titles, such as Don and Rev. We'll change those titles to their equivalents among Mr/Mrs/Miss/Master.
# changing the titles to the equivalents of mr/mrs/ms/master
titanic$Title <- sapply(X = as.character(titanic$Title),  # data
                        FUN = switch,
                        "Capt" = "Mr",
                        "Col" = "Mr",
                        "Don" = "Mr",
                        "Dr" = "Mr",
                        "Jonkheer" = "Mr",
                        "Major" = "Mr",
                        "Master" = "Master",
                        "Miss" = "Miss",
                        "Mr" = "Mr",
                        "Mrs" = "Mrs",
                        "Rev" = "Mr",
                        "Lady" = "Mrs",
                        "Sir" = "Mr",
                        "Mlle" = "Miss",
                        "Mme" = "Mrs",
                        "Ms" = "Miss",
                        "the Countess" = "Mrs")

# change title data type
titanic$Title <- as.factor(titanic$Title)
After creating the new Title column, we're now going to check whether imputing Age based on title is reasonable by looking at a box plot of age for each title category.
# age plot based on title
titanic %>%
  na.omit() %>%
  mutate(Title = fct_reorder(Title, Age, .fun = "median")) %>%
  ggplot(aes(x = reorder(Title, Age), y = Age, fill = Title)) +
  geom_boxplot() +
  theme_minimal() +
  scale_fill_discrete_qualitative(palette = "cold") +
  labs(title = "Age Distribution per Title",
       x = "") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))
We can see from the plot above that each title has its own age range; the Mr and Mrs titles even share roughly the same range. From this result we can conclude that it's reasonable to impute Age based on the mean/median per title. In this case I chose the mean, since there are few to no outliers.
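As a quick sanity check on that mean-vs-median choice, the per-title averages can be compared directly before imputing; this is a minimal sketch reusing the same titanic data and dplyr verbs as above (the column names Mean_Age and Median_Age are just introduced here for illustration).
# compare the mean and median age for each title before imputing (NAs excluded)
titanic %>%
  group_by(Title) %>%
  summarise(Mean_Age   = round(mean(Age, na.rm = TRUE), 1),
            Median_Age = median(Age, na.rm = TRUE))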
# check which Title has an NA age
title_total_2 <- titanic %>% group_by(Title) %>% summarise(Total = length(Title))
title_na_2 <- titanic %>% group_by(Title) %>% summarise("NA" = sum(is.na(Age)))

kable(left_join(title_total_2, title_na_2), align = "c")
Title | Total | NA |
---|---|---|
Master | 40 | 4 |
Miss | 185 | 36 |
Mr | 538 | 120 |
Mrs | 128 | 17 |
Here we can see that most NAs come from the Mr and Miss titles. We're now going to impute those NAs with the mean age of each title.
# age imputation based on specific titles
## mr.
mr <- titanic %>%
  filter(Title == "Mr") %>%
  mutate(Age = round(replace(Age, is.na(Age), mean(Age, na.rm = T)), 0))

## mrs.
mrs <- titanic %>%
  filter(Title == "Mrs") %>%
  mutate(Age = round(replace(Age, is.na(Age), mean(Age, na.rm = T)), 0))

## miss.
ms <- titanic %>%
  filter(Title == "Miss") %>%
  mutate(Age = round(replace(Age, is.na(Age), mean(Age, na.rm = T)), 0))

## master.
mstr <- titanic %>%
  filter(Title == "Master") %>%
  mutate(Age = round(replace(Age, is.na(Age), mean(Age, na.rm = T)), 0))

# combining all rows
titanic <- bind_rows(mr, mrs, ms, mstr) %>%
  select(1:13) %>%
  mutate(PassengerId = as.integer(PassengerId)) %>%
  arrange(PassengerId)
# check NA for age column
sum(is.na(titanic$Age))
## [1] 0
All clear now for the Age column.
Cabin
Since there are just too many NA values in this column (77.10%), we'll simply remove the entire column.
# remove cabin column
titanic <- titanic %>% select(-Cabin)
Embarked
There are 2 missing rows in the Embarked column, so we're just going to impute them with the most frequent Embarked value (the mode).
# impute NA for Embarked
## creating mode function
mode <- function(x, na.rm) {
  xtab <- table(x)
  xmode <- names(which(xtab == max(xtab)))
  if (length(xmode) > 1) xmode <- ">1 mode"
  return(xmode)
}

## impute NA
titanic <- titanic %>%
  mutate(Embarked = as.factor(Embarked),
         Embarked = replace(Embarked, is.na(Embarked), mode(Embarked, na.rm = T)))
# check NA
anyNA(titanic)
## [1] FALSE
Now we’ve imputed all missing values.
Data Exploration and Analysis
Change Data Type
We're going to fix the columns that have an incorrect data type.
glimpse(titanic)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
## $ Name <fct> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
## $ Sex <fct> male, female, female, female, male, male, male, male, fema~
## $ Age <dbl> 22, 38, 26, 35, 35, 33, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
## $ Ticket <fct> A/5 21171, PC 17599, STON/O2. 3101282, 113803, 373450, 330~
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C~
## $ Title <fct> Mr, Mrs, Miss, Mrs, Mr, Mr, Mr, Master, Mrs, Mrs, Miss, Mi~
# change data type
titanic <- titanic %>%
  mutate_at(vars(Survived, Pclass), as.factor) %>%
  mutate(Age = as.integer(Age))
Descriptive Statistics
# numeric
datatable((titanic %>% inspect_num()), options = list(scrollX = T))
# categorical
datatable(titanic %>% inspect_cat(), options = list(scrollX = T))
From the numerical values, we can see that:
- Passengers' ages range from 0 to 80, with a mean of 29.84 y.o. and a median of 30 y.o., which suggests there are no extreme outliers. However, 75% of the passengers are aged 0 - 36 y.o.
- Most SibSp and Parch values are 0, which means more passengers traveled alone than with their families.
- 75% of the passengers paid less than £31 for the fare, while the maximum price is £512. We're not going to remove these outliers, so we can later see whether the ones who paid a higher price survived.
As for the categorical values, we can see that:
- 72% of the passengers embarked from Southampton (S).
- 55% of the passengers were in the 3rd class, more than the 1st and 2nd classes combined.
- 65% of the passengers are male.
- 62% of the passengers didn't survive.
Create New Factors
Since I still go by the rule of ‘women and children first’, we’re now going to create a new column to categorize the age of passengers by:
- Ages < 18: Children
- Ages 18 - 64: Adult
- Ages 65 and above: Elderly
# create new column
titanic <- titanic %>%
  mutate(AgeCategory = case_when(Age < 18 ~ "Children",
                                 Age < 65 ~ "Adult",
                                 T ~ "Elderly"),
         AgeCategory = as.factor(AgeCategory)) %>%
  select(-Age)
# age category viz
titanic %>%
  group_by(AgeCategory) %>%
  summarise(Total = length(AgeCategory)) %>%
  ungroup() %>%
  mutate(AgeCategory = fct_relevel(AgeCategory,
                                   "Elderly", "Adult", "Children")) %>%
  ggplot(aes(x = AgeCategory, y = Total, fill = AgeCategory)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ylim(0, 850) +
  theme_classic() +
  geom_text(aes(label = Total),
            hjust = -0.5) +
  labs(title = "Total Passengers per Age Category",
       y = "",
       x = "") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5)) +
  scale_fill_discrete_qualitative(palette = "cold")
We can see that 763 passengers are adults, followed by 117 children and 11 elderly.
We'll also create a new column named Travel which shows:
- Father: father traveling with wife and children
- Mother: mother traveling with children
- Child: child traveling with parents
- Siblings: siblings traveling with no parents
- Couples: married couples traveling with no children
- Alone: no partner and children
We'll then remove the Parch, SibSp, Title, LastName, Name, Ticket, and PassengerId columns afterwards.
# create lastname column
titanic$LastName <- gsub(',.*$', '', titanic$Name)

# identify married women with no children
names <- titanic %>%
  filter(Title == "Mrs" & SibSp == 1 & Parch == 0)
names <- unique(names$LastName)

# travel
titanic <- titanic %>%
  mutate(Travel = case_when(Title == "Mr" & SibSp == 1 & Parch != 0 ~ "Father",
                            Title == "Mrs" & SibSp == 1 & Parch != 0 ~ "Mother",
                            Title != "Mrs" & Parch == 1 | Parch == 2 ~ "Child",
                            SibSp == 0 & Parch == 0 ~ "Alone",
                            LastName %in% names & SibSp == 1 & Parch == 0 ~ "Couples",
                            SibSp != 0 & Parch == 0 ~ "Siblings",
                            T ~ "Mother"),
         Travel = as.factor(Travel)) %>%
  select(-c(Parch, SibSp, Title, LastName, PassengerId, Name, Ticket))
# last na check
anyNA(titanic)
## [1] FALSE
titanic %>%
  group_by(Travel) %>%
  summarise(Total = length(Travel)) %>%
  ungroup() %>%
  mutate(Travel = fct_reorder(Travel, Total)) %>%
  ggplot(aes(x = Travel, y = Total, fill = Travel)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Total),
            hjust = -0.5) +
  coord_flip() +
  ylim(0, 600) +
  theme_classic() +
  scale_fill_discrete_qualitative(palette = "cold") +
  labs(title = "Total Passengers per Family",
       y = "",
       x = "") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))
Here we can see that 537 passengers traveled alone, followed by 136 children, 74 couples, 67 siblings, 48 mothers, and 29 fathers.
Check Variables Correlation
In this part we're going to visualize some predictors (individually and/or combined) against the target (Survived).

First I'd like to check whether Embarked is related to Pclass and to the target Survived, to determine whether we need the Embarked column or whether the Pclass column alone is enough.
titanic %>%
  group_by(Embarked, Pclass) %>%
  summarise(Value = length(Pclass)) %>%
  ungroup() %>%
  mutate(Embarked, fct_reorder(Embarked, Value)) %>%
  ggplot(aes(x = Embarked, y = Value)) +
  geom_bar(stat = "identity", aes(fill = Pclass)) +
  coord_flip() +
  theme_classic() +
  labs(title = "Passengers Based on Class and Port of Embarkation") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_discrete_qualitative(palette = "cold")
titanic %>%
  group_by(Embarked, Survived) %>%
  summarise(Value = length(Pclass)) %>%
  ungroup() %>%
  mutate(Survived = fct_reorder(Survived, desc(Value))) %>%
  ggplot(aes(x = Survived, y = Value)) +
  geom_bar(stat = "identity", aes(fill = Embarked)) +
  coord_flip() +
  theme_classic() +
  labs(title = "Survived Passengers Based on Port of Embarkation") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_discrete_qualitative(palette = "cold")
As we can see from the first plot, most passengers embarked from Southampton (across all passenger classes), and the 1st class passengers mostly embarked from Southampton followed by Cherbourg. From the second plot, we can see that the port of embarkation doesn't really affect the survival rate, since passengers who survived and those who didn't have a fairly similar composition of embarkation ports, considering that most passengers boarded at Southampton.

Since the Embarked column serves no real purpose, we're going to remove it.
# remove column
titanic <- titanic %>% select(-Embarked)
We're now going to check the Fare column with the same approach as above.
# plot
plot_fare_1 <- titanic %>%
  ggplot(aes(x = Fare, fill = Pclass)) +
  geom_histogram(position = "identity", binwidth = 15, alpha = 0.7) +
  theme_minimal() +
  scale_fill_discrete_qualitative(palette = "cold")

plot_fare_2 <- titanic %>%
  ggplot(aes(x = Fare, fill = Survived)) +
  geom_histogram(position = "identity", binwidth = 15, alpha = 0.7) +
  theme_minimal() +
  scale_fill_discrete_qualitative(palette = "cold")

egg::ggarrange(plot_fare_1, plot_fare_2, ncol = 2)
Here we can see that the passengers who paid less than £50 are mostly from the 3rd and 2nd classes, although some 3rd class passengers paid around £50 - 100. From the second plot we can also see that passengers who paid more than £50 mostly survived. Therefore we're not going to remove the Fare column, since Pclass alone couldn't replace Fare as a way to indicate whether a passenger survived.
Pre-Modeling
Dummy Variables
We're going to create dummy variables for each categorical variable using dummy_cols().
# create dummy variable
titanic_dummy <- dummy_cols(titanic, remove_first_dummy = T, remove_selected_columns = T)

# train <- train %>% select(-c(Pclass, Sex, AgeCategory, Travel))
datatable(titanic_dummy, rownames = F, options = list(scrollX = T))
Split Data
Next we’re going to split the data with an 80:20 ratio.
# split data
set.seed(902)
samplesize <- round(0.8 * nrow(titanic_dummy), 0)
index <- sample(seq_len(nrow(titanic_dummy)), size = samplesize)

train <- titanic_dummy[index, ]
test <- titanic_dummy[-index, ]
Modeling
Logistic Regression
Model 1
We’re going to build the first model with all variables.
# build model
model_1 <- glm(Survived_1 ~ ., train, family = "binomial")

# summary
summary(model_1)
# summary
summary(model_1)
##
## Call:
## glm(formula = Survived_1 ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6279 -0.6998 -0.4349 0.6506 2.4896
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.256185 0.335857 6.718 0.0000000000185 ***
## Fare 0.002428 0.002379 1.020 0.307548
## Pclass_2 -0.874145 0.311214 -2.809 0.004972 **
## Pclass_3 -1.902837 0.293281 -6.488 0.0000000000869 ***
## Sex_male -2.681517 0.233335 -11.492 < 0.0000000000000002 ***
## AgeCategory_Children 1.288072 0.360600 3.572 0.000354 ***
## AgeCategory_Elderly -1.155967 1.091179 -1.059 0.289430
## Travel_Child -0.491265 0.333969 -1.471 0.141294
## Travel_Couples -0.045730 0.369559 -0.124 0.901519
## Travel_Father -1.220287 0.786293 -1.552 0.120674
## Travel_Mother -0.451344 0.456562 -0.989 0.322874
## Travel_Siblings -0.743653 0.420920 -1.767 0.077273 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 949.90 on 712 degrees of freedom
## Residual deviance: 635.78 on 701 degrees of freedom
## AIC: 659.78
##
## Number of Fisher Scoring iterations: 5
# intercept interpretation
exp(model_1$coefficients[1])
## (Intercept)
## 9.546602
The intercept represents the baseline group: 1st class, female, adult passengers traveling alone (with a fare of 0). For passengers fitting these criteria, the odds of surviving are about 9.55 to 1, i.e. they are roughly 9.55 times more likely to survive than not.
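To make that interpretation concrete, here is a small sketch converting the intercept of model_1 from log-odds to odds and to an implied survival probability (b0 is just a name introduced here for illustration).
# log-odds -> odds -> probability for the baseline group
b0 <- coef(model_1)[1]
exp(b0)                  # odds of survival, ~9.55
exp(b0) / (1 + exp(b0))  # implied survival probability, ~0.91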
Model 2
In the first model there were some insignificant variables, so we're going to build a 2nd model that eliminates them using backward stepwise elimination.
# build 2nd model
model_2 <- step(model_1, direction = "backward", trace = F)
# summary
summary(model_2)
##
## Call:
## glm(formula = Survived_1 ~ Pclass_2 + Pclass_3 + Sex_male + AgeCategory_Children +
## Travel_Father + Travel_Siblings, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5863 -0.6817 -0.4254 0.6910 2.4515
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.2497 0.2451 9.179 < 0.0000000000000002 ***
## Pclass_2 -0.9817 0.2774 -3.539 0.000401 ***
## Pclass_3 -1.9977 0.2475 -8.072 0.000000000000000692 ***
## Sex_male -2.6092 0.2090 -12.481 < 0.0000000000000002 ***
## AgeCategory_Children 1.0587 0.3030 3.494 0.000475 ***
## Travel_Father -1.0974 0.7735 -1.419 0.155983
## Travel_Siblings -0.5969 0.4061 -1.470 0.141652
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 949.90 on 712 degrees of freedom
## Residual deviance: 640.58 on 706 degrees of freedom
## AIC: 654.58
##
## Number of Fisher Scoring iterations: 5
Model 1 and Model 2 Comparison
We're going to compare model_1 and model_2 and choose the model which fits our data better.
# model comparison
broom::glance(model_1) %>% select(AIC, BIC)
broom::glance(model_2) %>% select(AIC, BIC)
Model | Variables | AIC | BIC |
---|---|---|---|
1 | 11 | 659.78 | 714.61 |
2 | 6 | 654.58 | 686.57 |
The AIC and BIC values of model_2 are lower than those of model_1, and the model is simpler, hence we're going to choose model_2 as our final regression model. We're going to rename it to model_glm and predict on the test data.
Prediction
We're going to predict the test data and set the threshold at the usual 0.5.
# rename model
model_glm <- model_2

# prediction
predict_glm <- predict(model_glm, test, type = "response")
label_glm <- ifelse(predict_glm > 0.5, "1", "0")
label_glm <- as.factor(label_glm)
k-NN
Model 1
For the second model we're going to use the kNN method. First we're going to scale the train and test data.
# scaling train and test data
# train
train_scale <- train %>% select(-Survived_1) %>% scale()
train_y <- train %>% select(Survived_1)

# test
test_scale <- test %>% select(-Survived_1) %>% scale(center = attr(train_scale, "scaled:center"),
                                                     scale = attr(train_scale, "scaled:scale"))
test_y <- test %>% select(Survived_1)
We’re going to find the optimum k for our model.
# find optimum k
sqrt(nrow(train_scale))
## [1] 26.70206
The square root of the number of training rows is about 26.7. Since k is usually chosen to be an odd number (to avoid tied votes between the two classes), we can use either k = 25 or k = 27.
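If we wanted to verify that choice empirically, a minimal sketch (reusing train_scale, test_scale, train_y, and test_y from above; k_try and pred_k are hypothetical names introduced here) could compare the two candidate values on the test set:
# hypothetical check: test-set accuracy for k = 25 vs k = 27
for (k_try in c(25, 27)) {
  pred_k <- knn(train = train_scale, test = test_scale,
                cl = train_y$Survived_1, k = k_try)
  acc <- mean(as.character(pred_k) == as.character(test_y$Survived_1))
  print(paste("k =", k_try, "accuracy:", round(acc, 4)))
}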
Model
For the kNN model, we're going to use k = 25.
model_knn <- knn(train = train_scale,
                 test = test_scale,
                 cl = train_y$Survived_1,
                 k = 25)
Evaluation
We're now going to evaluate our models from the logistic regression and k-NN methods with confusionMatrix().
# logistic regression
test$Survived_1 <- as.factor(test$Survived_1)
conf_glm <- confusionMatrix(label_glm, test$Survived_1, positive = "1")
conf_glm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 94 25
## 1 16 43
##
## Accuracy : 0.7697
## 95% CI : (0.7008, 0.8293)
## No Information Rate : 0.618
## P-Value [Acc > NIR] : 0.00001184
##
## Kappa : 0.4995
##
## Mcnemar's Test P-Value : 0.2115
##
## Sensitivity : 0.6324
## Specificity : 0.8545
## Pos Pred Value : 0.7288
## Neg Pred Value : 0.7899
## Prevalence : 0.3820
## Detection Rate : 0.2416
## Detection Prevalence : 0.3315
## Balanced Accuracy : 0.7434
##
## 'Positive' Class : 1
##
# knn
model_knn <- as.factor(model_knn)
test_y <- test_y$Survived_1
test_y <- as.factor(test_y)
conf_knn <- confusionMatrix(model_knn, test_y, positive = "1")
conf_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 93 31
## 1 17 37
##
## Accuracy : 0.7303
## 95% CI : (0.6588, 0.794)
## No Information Rate : 0.618
## P-Value [Acc > NIR] : 0.001061
##
## Kappa : 0.4055
##
## Mcnemar's Test P-Value : 0.060602
##
## Sensitivity : 0.5441
## Specificity : 0.8455
## Pos Pred Value : 0.6852
## Neg Pred Value : 0.7500
## Prevalence : 0.3820
## Detection Rate : 0.2079
## Detection Prevalence : 0.3034
## Balanced Accuracy : 0.6948
##
## 'Positive' Class : 1
##
Model | Accuracy | Recall | Precision | Specificity |
---|---|---|---|---|
Logistic Regression | 0.7696629 | 0.6323529 | 0.7288136 | 0.8545455 |
k-NN | 0.7303371 | 0.5441176 | 0.6851852 | 0.8454545 |
Here we can see that the logistic regression model did better, with higher values for accuracy, recall, precision, and specificity.
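The comparison table above isn't produced by any code shown earlier; a minimal sketch of how it could be assembled from the two confusionMatrix objects (conf_glm and conf_knn) might look like this, where metrics() is a hypothetical helper introduced only for illustration.
# hypothetical helper: pull the four metrics from a caret confusionMatrix object
metrics <- function(cm) {
  c(Accuracy    = unname(cm$overall["Accuracy"]),
    Recall      = unname(cm$byClass["Sensitivity"]),
    Precision   = unname(cm$byClass["Pos Pred Value"]),
    Specificity = unname(cm$byClass["Specificity"]))
}

comparison <- rbind("Logistic Regression" = metrics(conf_glm),
                    "k-NN"                = metrics(conf_knn))
kable(comparison, align = "c")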
In this Titanic case, it would be better to have a higher recall, which means fewer passengers who actually survived are predicted as not having survived (fewer false negatives). In the next part we'll try to tune our model, but we're only going to tune the regression model since its result is better than the k-NN model's.
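As a worked check on those recall numbers, they can be recomputed directly from the confusion-matrix counts printed above (TP = survivors predicted as survived, FN = survivors predicted as not survived):
# recall (sensitivity) = TP / (TP + FN), with "1" (survived) as the positive class
43 / (43 + 25)  # logistic regression: ~0.632, matching conf_glm's Sensitivity
37 / (37 + 31)  # k-NN: ~0.544, matching conf_knn's Sensitivity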
Model Tuning
We're now going to tune the regression model by lowering its threshold to 0.4, 0.3, and 0.2 to find the best recall value and overall result.
# threshold > 0.4
label_glm_2 <- ifelse(predict_glm > 0.4, "1", "0")
label_glm_2 <- as.factor(label_glm_2)

# threshold > 0.3
label_glm_3 <- ifelse(predict_glm > 0.3, "1", "0")
label_glm_3 <- as.factor(label_glm_3)

# threshold > 0.2
label_glm_4 <- ifelse(predict_glm > 0.2, "1", "0")
label_glm_4 <- as.factor(label_glm_4)
# confusion matrix
confusionMatrix(label_glm_2, test$Survived_1, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 81 15
## 1 29 53
##
## Accuracy : 0.7528
## 95% CI : (0.6827, 0.8143)
## No Information Rate : 0.618
## P-Value [Acc > NIR] : 0.00009647
##
## Kappa : 0.4963
##
## Mcnemar's Test P-Value : 0.05002
##
## Sensitivity : 0.7794
## Specificity : 0.7364
## Pos Pred Value : 0.6463
## Neg Pred Value : 0.8437
## Prevalence : 0.3820
## Detection Rate : 0.2978
## Detection Prevalence : 0.4607
## Balanced Accuracy : 0.7579
##
## 'Positive' Class : 1
##
confusionMatrix(label_glm_3, test$Survived_1, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 81 15
## 1 29 53
##
## Accuracy : 0.7528
## 95% CI : (0.6827, 0.8143)
## No Information Rate : 0.618
## P-Value [Acc > NIR] : 0.00009647
##
## Kappa : 0.4963
##
## Mcnemar's Test P-Value : 0.05002
##
## Sensitivity : 0.7794
## Specificity : 0.7364
## Pos Pred Value : 0.6463
## Neg Pred Value : 0.8437
## Prevalence : 0.3820
## Detection Rate : 0.2978
## Detection Prevalence : 0.4607
## Balanced Accuracy : 0.7579
##
## 'Positive' Class : 1
##
confusionMatrix(label_glm_4, test$Survived_1, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 61 7
## 1 49 61
##
## Accuracy : 0.6854
## 95% CI : (0.6117, 0.7528)
## No Information Rate : 0.618
## P-Value [Acc > NIR] : 0.03677
##
## Kappa : 0.404
##
## Mcnemar's Test P-Value : 0.00000004281
##
## Sensitivity : 0.8971
## Specificity : 0.5545
## Pos Pred Value : 0.5545
## Neg Pred Value : 0.8971
## Prevalence : 0.3820
## Detection Rate : 0.3427
## Detection Prevalence : 0.6180
## Balanced Accuracy : 0.7258
##
## 'Positive' Class : 1
##
With the lowest threshold (0.2) the sensitivity of the model goes up to 0.897, but the other metrics drop significantly.
Conclusion
The answer to the single objective I set at the start is that the logistic regression model performs better than k-NN: at the default 0.5 threshold it beats k-NN on accuracy, sensitivity, precision, and specificity, and even at a lowered threshold of 0.3 - 0.4 it still has higher accuracy and sensitivity (at the cost of some precision and specificity).
In this analysis I was too focused on getting a higher recall value, since I was fixated on correctly predicting as many of the actual survivors as possible. I then realized that whether to tune for recall/sensitivity or for accuracy depends entirely on the problem.
In this project, since my only objective was to compare the two methods, I got the result I wanted. But when I tried this model on Kaggle I scored only 0.77 (similar to the accuracy value).
There are many more predictor combinations that could be tried to get a better result. The next time I work on this Titanic data I want to use other classification methods and/or come up with a better way to explore the predictors.