Background

Hi. In this project I will try to build models using logistic regression and the kNN method, based on the famous Titanic data from Kaggle. This is an attempt at a task I need to do for a bootcamp course I’m attending right now.

RMS Titanic, better known simply as Titanic, was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in April 1912 after striking an iceberg during its maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking at the time one of the deadliest of a single ship and the deadliest peacetime sinking of a superliner or cruise ship to date.

Objective

Our objective here is to build machine learning models with two methods, logistic regression and kNN, and see which of the two gives the better prediction results.

Data Preparation

Import Data

# import libraries
library(knitr)
library(kableExtra)
library(DT)
library(tidyverse)
library(dplyr)
library(inspectdf)
library(ggplot2)
library(forcats)
library(colorspace)
library(caret)
library(fastDummies)
library(class)

Kaggle gives us two datasets: train.csv and test.csv. We’re going to use train.csv only, since test.csv is meant as the submission data for the challenge (its Survived column is left empty for prediction).

# import data
titanic <- read.csv("data/train.csv",
                    na.strings = c("null", "", "NA"),
                    stringsAsFactors = T)

datatable(titanic, rownames = F, class = "nowrap", options = list(scrollX = T))
  • PassengerId: passenger’s unique ID
  • Survived: survival status of passenger (0 = didn’t survive, 1 = survived)
  • Pclass: passenger’s ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
  • Name: name of passengers
  • Sex: gender of passengers
  • Age: age of passengers (in years)
  • SibSp: number of siblings/spouse aboard
  • Parch: number of parents/children aboard
  • Ticket: passenger’s ticket number
  • Fare: passenger’s fare
  • Cabin: passenger’s cabin number
  • Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Check Duplicates and NAs

Next we’re going to check whether this data has duplicates and/or NAs.

# check duplicates
table(duplicated(titanic))
## 
## FALSE 
##   891
# check NA
NA_Sum <- c()
for(i in 1:ncol(titanic)){
  NA_Sum[i] <- sum(is.na(titanic[, colnames(titanic)[i]]))
}

na <- data.frame(
  Column_Name = c(colnames(titanic)),
  NA_Sum,
  NA_Percent = c(round(NA_Sum/nrow(titanic)*100, 2))
)

kable(na, align = "c")
Column_Name NA_Sum NA_Percent
PassengerId 0 0.00
Survived 0 0.00
Pclass 0 0.00
Name 0 0.00
Sex 0 0.00
Age 177 19.87
SibSp 0 0.00
Parch 0 0.00
Ticket 0 0.00
Fare 0 0.00
Cabin 687 77.10
Embarked 2 0.22
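
As a side note, the same per-column NA counts could be obtained in one line (a sketch, equivalent to the loop above):

# count NAs per column in one call
colSums(is.na(titanic))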

There are no duplicates, but unfortunately there are missing values: 19.87% in the Age column, 77.10% in the Cabin column, and 0.22% in the Embarked column.

Missing Values

Age

The missing values in the Age column consist of 177 rows (19.87% of the total data). Age is crucial since I suppose that women and children would have been prioritized when boarding the lifeboats, so neither the column nor the rows should be removed. We’re going to impute the missing values based on the mean age of each passenger’s Title (Mr, Mrs, etc.).

We’re first going to extract the titles into a new column, Title.

# extracting name titles into a new column
titanic$Title <- gsub('(.*, )|(\\..*)', '', titanic$Name)
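## a quick illustration of the pattern (hypothetical example):
## gsub('(.*, )|(\\..*)', '', "Braund, Mr. Owen Harris") returns "Mr",
## i.e. everything up to ", " and everything from the first "." onward is stripped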

# change title data type
titanic$Title <- as.factor(titanic$Title)

# check which Title has an NA age
title_total <- titanic %>% group_by(Title) %>% summarise(Total = length(Title))
title_na <- titanic %>% group_by(Title) %>% summarise("NA" = sum(is.na(Age)))

datatable(left_join(title_total, title_na), rownames = F)

Before going any further, we can see that there are some uncommon titles such as Don, Rev, and others. We’ll change those titles to their equivalents among Mr/Mrs/Miss/Master.

# changing the titles to the equivalents of mr/mrs/ms/master
titanic$Title <- sapply(X = as.character(titanic$Title), # Data
                           FUN = switch,
                        "Capt" = "Mr",
                        "Col" = "Mr",
                        "Don" = "Mr",
                        "Dr" = "Mr",
                        "Jonkheer" = "Mr",
                        "Major" = "Mr",
                        "Master" = "Master",
                        "Miss" = "Miss",
                        "Mr" = "Mr",
                        "Mrs" = "Mrs",
                        "Rev" = "Mr",
                        "Lady" = "Mrs",
                        "Sir" = "Mr",
                        "Mlle" = "Miss",
                        "Mme" = "Mrs",
                        "Ms" = "Miss",
                        "the Countess" = "Mrs")

# change title data type
titanic$Title <- as.factor(titanic$Title)

After creating the new Title column, we’re now going to check whether imputing Age based on title is reasonable by looking at a box plot of age for each title category.

# age plot based on title
titanic %>% 
  na.omit() %>% 
  mutate(Title = fct_reorder(Title, Age, .fun = "median")) %>% 
  ggplot(aes(x = reorder(Title, Age), y = Age, fill = Title)) +
  geom_boxplot() +
  theme_minimal() +
  scale_fill_discrete_qualitative(palette = "cold") +
  labs(title = "Age Distribution per Title",
       x = "") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

We can see from the plot above that each title has its own age range; the Mr and Mrs titles even share roughly the same range. From this we can conclude that it’s valid to impute Age per title using the mean/median; here I chose the mean since there are few to no outliers.

# check which Title has an NA age
title_total_2 <- titanic %>% group_by(Title) %>% summarise(Total = length(Title))
title_na_2 <- titanic %>% group_by(Title) %>% summarise("NA" = sum(is.na(Age)))

kable(left_join(title_total_2, title_na_2), align = "c")
Title Total NA
Master 40 4
Miss 185 36
Mr 538 120
Mrs 128 17

Here we can see that most of the NAs come from the Mr and Miss titles. We’re now going to impute those NAs with the mean age of each title.

# age imputation based on specific titles
## mr.
mr <- titanic %>% 
  filter(Title == "Mr") %>% 
  mutate(Age = round(replace(Age, is.na(Age), mean(Age, na.rm = T)), 0))
## mrs.
mrs <- titanic %>% 
  filter(Title == "Mrs") %>% 
  mutate(Age = round(replace(Age, is.na(Age), mean(Age, na.rm = T)), 0))
## miss.
ms <- titanic %>% 
  filter(Title == "Miss") %>% 
  mutate(Age = round(replace(Age, is.na(Age), mean(Age, na.rm = T)), 0))
## master.
mstr <- titanic %>% 
  filter(Title == "Master") %>% 
  mutate(Age = round(replace(Age, is.na(Age), mean(Age, na.rm = T)), 0))

# combining all rows back together
titanic <- bind_rows(mr, mrs, ms, mstr) %>% 
  mutate(PassengerId = as.integer(PassengerId)) %>% 
  arrange(PassengerId)

# check NA for age column
sum(is.na(titanic$Age))
## [1] 0

All clear now for the Age column.
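
As a side note, the four filter-and-bind steps above could be collapsed into one grouped mutate; a minimal sketch, equivalent under the same mean-per-title rule:

# alternative: per-title mean imputation in a single grouped mutate (a sketch)
titanic %>% 
  group_by(Title) %>% 
  mutate(Age = round(replace(Age, is.na(Age), mean(Age, na.rm = T)), 0)) %>% 
  ungroup()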

Cabin

Since 77% of the Cabin values are missing, we’ll just remove the entire column.

# remove cabin column
titanic <- titanic %>% select(-Cabin)

Embarked

There are 2 missing rows in the Embarked column, so we’re just going to impute them with the most frequent port.

# impute NA for Embarked
## creating mode function
mode <- function(x, na.rm = TRUE) {
    # table() drops NAs by default, so na.rm is accepted but not needed
    xtab <- table(x)
    xmode <- names(which(xtab == max(xtab)))
    if (length(xmode) > 1) xmode <- ">1 mode"
    return(xmode)
}
## impute NA
titanic <- titanic %>%
 mutate(Embarked = as.factor(Embarked),
        Embarked = replace(Embarked, is.na(Embarked), mode(Embarked, na.rm = T)))

# check NA
anyNA(titanic)
## [1] FALSE

Now we’ve imputed all missing values.

Data Exploration and Analysis

Change Data Type

We’re going to fix the columns that have an incorrect data type.

glimpse(titanic)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
## $ Name        <fct> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
## $ Sex         <fct> male, female, female, female, male, male, male, male, fema~
## $ Age         <dbl> 22, 38, 26, 35, 35, 33, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
## $ Ticket      <fct> A/5 21171, PC 17599, STON/O2. 3101282, 113803, 373450, 330~
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
## $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C~
## $ Title       <fct> Mr, Mrs, Miss, Mrs, Mr, Mr, Mr, Master, Mrs, Mrs, Miss, Mi~
# change data type
titanic <- titanic %>% 
  mutate_at(vars(Survived, Pclass), as.factor) %>% 
  mutate(Age = as.integer(Age))

Descriptive Statistics

# numeric
datatable((titanic %>% inspect_num()), options = list(scrollX = T))
# categorical
datatable(titanic %>% inspect_cat(), options = list(scrollX = T))

From the numerical variables, we can see that:

  • Passengers’ ages range from 0 to 80 with a mean of 29.84 and a median of 30, which suggests there are no extreme outliers. However, 75% of the passengers are aged 0 - 36.
  • Most SibSp and Parch values are 0, which means more passengers traveled alone than with their families.
  • 75% of the passengers paid < £31 for the fare, while the max. price is £512. We’re not going to remove these outliers, so we can check whether the ones who paid a higher price survived or not.

As for the categorical variables, we can see that:

  • 72% of the passengers embarked from Southampton (S).
  • 55% of the passengers were in 3rd class, more than the 1st and 2nd classes combined.
  • 65% of the passengers were male.
  • 62% of the passengers didn’t survive.

Create New Factors

Since I still go by the rule of ‘women and children first’, we’re now going to create a new column that categorizes passengers’ ages:

  • Ages < 18: Children
  • Ages 18 - 64: Adult
  • Ages 65 and up: Elderly
# create new column
titanic <- titanic %>% 
  mutate(AgeCategory = case_when(Age < 18 ~ "Children",
                                 Age < 65 ~ "Adult",
                                 T ~ "Elderly"),
         AgeCategory = as.factor(AgeCategory)) %>% 
  select(-Age)
# age category viz
titanic %>% 
  group_by(AgeCategory) %>% 
  summarise(Total = length(AgeCategory)) %>% 
  ungroup() %>% 
  mutate(AgeCategory = fct_relevel(AgeCategory,
                                   "Elderly", "Adult", "Children")) %>%
  
  ggplot(aes(x = AgeCategory, y = Total, fill = AgeCategory)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ylim(0, 850) +
  theme_classic() +
  geom_text(aes(label = Total),
            hjust = -0.5) +
  labs(title = "Total Passengers per Age Category",
       y = "",
       x = "") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5)) +
  scale_fill_discrete_qualitative(palette = "cold")

We can see that 763 passengers are adults, followed by 117 children and 11 elderly.

We’ll also create a new column named Travel which shows:

  • Father: father traveling with wife and children
  • Mother: mother traveling with children
  • Child: child traveling with parents
  • Siblings: siblings traveling with no parents
  • Couples: married couples traveling with no children
  • Alone: no partner and children

We’ll then remove the Parch, SibSp, Title, LastName, PassengerId, Name, and Ticket columns afterwards.

# create lastname column
titanic$LastName <- gsub(',.*$', '', titanic$Name)

# identify married women with no children
names <- titanic %>% 
  filter(Title == "Mrs" & SibSp == 1 & Parch == 0)
names <- unique(names$LastName)

# travel
titanic <- titanic %>% 
  mutate(Travel = case_when(Title == "Mr" & SibSp == 1 & Parch != 0 ~ "Father",
                            Title == "Mrs" & SibSp == 1 & Parch != 0 ~ "Mother",
                            # & binds tighter than |, so anyone with Parch == 2
                            # lands here regardless of Title
                            (Title != "Mrs" & Parch == 1) | Parch == 2 ~ "Child",
                            SibSp == 0 & Parch == 0 ~ "Alone",
                            LastName %in% names & SibSp == 1 & Parch == 0 ~ "Couples",
                            SibSp != 0 & Parch == 0 ~ "Siblings",
                            T ~ "Mother"),
         Travel = as.factor(Travel)) %>% 
  select(-c(Parch, SibSp, Title, LastName, PassengerId, Name, Ticket))

# last na check
anyNA(titanic)
## [1] FALSE
titanic %>% 
  group_by(Travel) %>% 
  summarise(Total = length(Travel)) %>% 
  ungroup() %>% 
  mutate(Travel = fct_reorder(Travel, Total)) %>% 
  
  ggplot(aes(x = Travel, y = Total, fill = Travel)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Total),
            hjust = -0.5) +
  coord_flip() +
  ylim(0,600) +
  theme_classic() +
  scale_fill_discrete_qualitative(palette = "cold") +
  labs(title = "Total Passengers per Family",
       y = "",
       x = "") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

Here we could see that 537 of the passengers traveled alone, followed by 136 children, 74 couples, 67 siblings, 48 mothers, and 29 fathers.

Check Variables Correlation

In this part we’re going to visualize some predictors (individually and/or combined) against the target (Survived).

First I’d like to check whether Embarked is related to Pclass and the target Survived, to decide whether we need the Embarked column or just the Pclass column.

titanic %>% 
  group_by(Embarked, Pclass) %>% 
  summarise(Value = length(Pclass)) %>% 
  ungroup() %>% 
  mutate(Embarked = fct_reorder(Embarked, Value)) %>% 
  ggplot(aes(x = Embarked, y = Value)) +
  geom_bar(stat = "identity", aes(fill = Pclass)) +
  coord_flip() +
  theme_classic() +
  labs(title = "Passengers Based on Class and Port of Embarkation") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_discrete_qualitative(palette = "cold")

titanic %>% 
  group_by(Embarked, Survived) %>% 
  summarise(Value = length(Pclass)) %>% 
  ungroup() %>% 
  mutate(Survived = fct_reorder(Survived, desc(Value))) %>% 
  ggplot(aes(x = Survived, y = Value)) +
  geom_bar(stat = "identity", aes(fill = Embarked)) +
  coord_flip() +
  theme_classic() +
  labs(title = "Survived Passengers Based on Port of Embarkation") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_discrete_qualitative(palette = "cold")

As we can see from the plots above, passengers mostly embarked from Southampton (across all passenger classes), and 1st class passengers mostly embarked from Southampton followed by Cherbourg. From the second plot, we can see that the port of embarkation doesn’t really affect the survival rate, since the compositions of passengers who survived and who didn’t are fairly balanced, considering that most passengers boarded at Southampton.

Since the Embarked column doesn’t seem to add information, we’re going to remove it.

# remove column
titanic <- titanic %>% select(-Embarked)

We’re now going to check the Fare column in the same way.

# plot
plot_fare_1 <- titanic %>% 
  ggplot(aes(x = Fare, fill = Pclass)) +
  geom_histogram(position = "identity", binwidth = 15, alpha = 0.7) +
  theme_minimal() +
  scale_fill_discrete_qualitative(palette = "cold")

plot_fare_2 <- titanic %>% 
  ggplot(aes(x = Fare, fill = Survived)) +
  geom_histogram(position = "identity", binwidth = 15, alpha = 0.7) +
  theme_minimal() +
  scale_fill_discrete_qualitative(palette = "cold")

egg::ggarrange(plot_fare_1, plot_fare_2, ncol = 2)

Here we can see that the ones who paid < £50 are mostly 2nd and 3rd class passengers, although some 3rd class passengers paid around £50 - 100. From the second plot we can also see that passengers who paid > £50 mostly survived. Thus we’re going to keep the Fare column, since Pclass can’t replace Fare as a means of indicating whether a passenger survived.

Pre-Modeling

Dummy Variables

We’re going to create dummy variables for each categorical variable using dummy_cols().

# create dummy variable
titanic_dummy <- dummy_cols(titanic, remove_first_dummy = T, remove_selected_columns = T)


datatable(titanic_dummy, rownames = F, options = list(scrollX = T))
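
To make what dummy_cols() produces concrete, here is a toy illustration on hypothetical data (not part of the pipeline):

# toy example: one-hot encoding a 3-level factor, dropping the first level
toy <- data.frame(Pclass = factor(c(1, 2, 3)))
dummy_cols(toy, remove_first_dummy = T, remove_selected_columns = T)
##   Pclass_2 Pclass_3
## 1        0        0
## 2        1        0
## 3        0        1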

Split Data

Next we’re going to split the data with an 80:20 ratio.

# split data
set.seed(902)
samplesize <- round(0.8 * nrow(titanic_dummy), 0)
index <- sample(seq_len(nrow(titanic_dummy)), size = samplesize)

train <- titanic_dummy[index, ]
test <- titanic_dummy[-index, ]
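
As a side note, since caret is already loaded, a class-stratified split (one that preserves the survived/not-survived ratio in both sets) is available via createDataPartition(); a sketch under the same 80:20 ratio:

# alternative: stratified 80:20 split (a sketch)
set.seed(902)
index_strat <- createDataPartition(as.factor(titanic_dummy$Survived_1), p = 0.8, list = F)
train_strat <- titanic_dummy[index_strat, ]
test_strat  <- titanic_dummy[-index_strat, ]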

Modeling

Logistic Regression

Model 1

We’re going to build the first model with all variables.

# build model
model_1 <- glm(Survived_1 ~ ., data = train, family = "binomial")

# summary
summary(model_1)
## 
## Call:
## glm(formula = Survived_1 ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6279  -0.6998  -0.4349   0.6506   2.4896  
## 
## Coefficients:
##                       Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)           2.256185   0.335857   6.718      0.0000000000185 ***
## Fare                  0.002428   0.002379   1.020             0.307548    
## Pclass_2             -0.874145   0.311214  -2.809             0.004972 ** 
## Pclass_3             -1.902837   0.293281  -6.488      0.0000000000869 ***
## Sex_male             -2.681517   0.233335 -11.492 < 0.0000000000000002 ***
## AgeCategory_Children  1.288072   0.360600   3.572             0.000354 ***
## AgeCategory_Elderly  -1.155967   1.091179  -1.059             0.289430    
## Travel_Child         -0.491265   0.333969  -1.471             0.141294    
## Travel_Couples       -0.045730   0.369559  -0.124             0.901519    
## Travel_Father        -1.220287   0.786293  -1.552             0.120674    
## Travel_Mother        -0.451344   0.456562  -0.989             0.322874    
## Travel_Siblings      -0.743653   0.420920  -1.767             0.077273 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 949.90  on 712  degrees of freedom
## Residual deviance: 635.78  on 701  degrees of freedom
## AIC: 659.78
## 
## Number of Fisher Scoring iterations: 5
# intercept interpretation
exp(model_1$coefficients[1])
## (Intercept) 
##    9.546602

The intercept represents the baseline passenger: 1st class, female, adult, traveling alone, with Fare = 0. For passengers fitting these criteria, the odds of survival are about 9.55, i.e., they are about 9.55 times as likely to survive as not.
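
For context, these odds translate to a probability as follows (a small sketch):

# convert the baseline odds into a survival probability (sketch)
odds <- exp(model_1$coefficients[1])  # ~9.55
odds / (1 + odds)                     # ~0.91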

Model 2

The first model contains some insignificant variables, so we’re going to build a second model using backward stepwise elimination, which drops variables to minimize the AIC.

# build 2nd model
model_2 <- step(model_1, direction = "backward", trace = F)

# summary
summary(model_2)
## 
## Call:
## glm(formula = Survived_1 ~ Pclass_2 + Pclass_3 + Sex_male + AgeCategory_Children + 
##     Travel_Father + Travel_Siblings, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5863  -0.6817  -0.4254   0.6910   2.4515  
## 
## Coefficients:
##                      Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)            2.2497     0.2451   9.179 < 0.0000000000000002 ***
## Pclass_2              -0.9817     0.2774  -3.539             0.000401 ***
## Pclass_3              -1.9977     0.2475  -8.072 0.000000000000000692 ***
## Sex_male              -2.6092     0.2090 -12.481 < 0.0000000000000002 ***
## AgeCategory_Children   1.0587     0.3030   3.494             0.000475 ***
## Travel_Father         -1.0974     0.7735  -1.419             0.155983    
## Travel_Siblings       -0.5969     0.4061  -1.470             0.141652    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 949.90  on 712  degrees of freedom
## Residual deviance: 640.58  on 706  degrees of freedom
## AIC: 654.58
## 
## Number of Fisher Scoring iterations: 5

Model 1 and Model 2 Comparison

We’re going to compare between model_1 and model_2 and choose the model which fits better with our data.

# model comparison
broom::glance(model_1) %>% select(AIC, BIC)
broom::glance(model_2) %>% select(AIC, BIC)
Model   Variables   AIC      BIC
1       11          659.78   714.61
2       6           654.58   686.57

The AIC and BIC values of model_2 are lower than model_1’s and the model is simpler, hence we’re going to choose model_2 as our final regression model. We’re going to rename it model_glm and predict on the test data.

Prediction

We’re going to predict on the test data and set the threshold to the usual 0.5.

# rename model
model_glm <- model_2

# prediction
predict_glm <- predict(model_glm, test, type = "response")
label_glm <- ifelse(predict_glm > 0.5, "1", "0")
label_glm <- as.factor(label_glm)

k-NN

Model 1

For the second model we’re going to use the kNN method. We first scale the train and test data; note that the test set is scaled using the train set’s center and scale, so no information leaks from the test set.

# scaling train and test data
# train
train_scale <- train %>% select(-Survived_1) %>% scale()
train_y <- train %>% select(Survived_1)

# test
test_scale <- test %>% select(-Survived_1) %>% scale(center = attr(train_scale, "scaled:center"),
                                                     scale = attr(train_scale, "scaled:scale"))
test_y <- test %>% select(Survived_1)

We’re going to find the optimum k for our model.

# find optimum k
sqrt(nrow(train_scale))
## [1] 26.70206

Since an odd k avoids ties in the majority vote, we have either k = 25 or k = 27.
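
The sqrt(n) rule is only a heuristic; as a side note, k could also be tuned by cross-validation with caret (a sketch, assuming 5-fold CV over odd values of k):

# alternative: pick k by 5-fold cross-validation (a sketch)
set.seed(902)
knn_cv <- train(x = train_scale,
                y = as.factor(train_y$Survived_1),
                method = "knn",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = data.frame(k = seq(3, 31, 2)))
knn_cv$bestTune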

Model

For the kNN model, we’re going to use k = 25.

model_knn <- knn(train = train_scale, 
                test = test_scale,
                cl = train_y$Survived_1,
                k = 25)

Evaluation

We’re now going to evaluate our logistic regression and k-NN models with confusionMatrix().

# logistic regression 
test$Survived_1 <- as.factor(test$Survived_1)
conf_glm <- confusionMatrix(label_glm, test$Survived_1, positive = "1")
conf_glm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 94 25
##          1 16 43
##                                           
##                Accuracy : 0.7697          
##                  95% CI : (0.7008, 0.8293)
##     No Information Rate : 0.618           
##     P-Value [Acc > NIR] : 0.00001184      
##                                           
##                   Kappa : 0.4995          
##                                           
##  Mcnemar's Test P-Value : 0.2115          
##                                           
##             Sensitivity : 0.6324          
##             Specificity : 0.8545          
##          Pos Pred Value : 0.7288          
##          Neg Pred Value : 0.7899          
##              Prevalence : 0.3820          
##          Detection Rate : 0.2416          
##    Detection Prevalence : 0.3315          
##       Balanced Accuracy : 0.7434          
##                                           
##        'Positive' Class : 1               
## 
# knn (knn() already returns a factor, so the conversion below is just being explicit)
model_knn <- as.factor(model_knn)
test_y <- as.factor(test_y$Survived_1)
conf_knn <- confusionMatrix(model_knn, test_y, positive = "1")
conf_knn
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 93 31
##          1 17 37
##                                          
##                Accuracy : 0.7303         
##                  95% CI : (0.6588, 0.794)
##     No Information Rate : 0.618          
##     P-Value [Acc > NIR] : 0.001061       
##                                          
##                   Kappa : 0.4055         
##                                          
##  Mcnemar's Test P-Value : 0.060602       
##                                          
##             Sensitivity : 0.5441         
##             Specificity : 0.8455         
##          Pos Pred Value : 0.6852         
##          Neg Pred Value : 0.7500         
##              Prevalence : 0.3820         
##          Detection Rate : 0.2079         
##    Detection Prevalence : 0.3034         
##       Balanced Accuracy : 0.6948         
##                                          
##        'Positive' Class : 1              
## 
Logistic Regression confusion matrix:

Prediction   Reference   Freq   Condition
0            0           94     TN
1            0           16     FP
0            1           25     FN
1            1           43     TP

k-NN confusion matrix:

Prediction   Reference   Freq   Condition
0            0           93     TN
1            0           17     FP
0            1           31     FN
1            1           37     TP

Model                 Accuracy    Recall      Precision   Specificity
Logistic Regression   0.7696629   0.6323529   0.7288136   0.8545455
k-NN                  0.7303371   0.5441176   0.6851852   0.8454545

Here we can see that the regression model did better, with higher accuracy, recall, precision, and specificity.

In this Titanic case it would be better to have a higher recall, which means fewer passengers who actually survived get predicted as not surviving. In the next part we’ll try to tune our model, but only the regression model, since its results are better than the k-NN model’s.
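
As a quick side note before tuning by hand, the recall (sensitivity) at several thresholds could be scanned in one pass; a minimal sketch using the objects already defined above:

# scan sensitivity over a grid of thresholds (a sketch)
sapply(c(0.2, 0.3, 0.4, 0.5), function(t) {
  pred <- factor(ifelse(predict_glm > t, "1", "0"), levels = c("0", "1"))
  confusionMatrix(pred, test$Survived_1, positive = "1")$byClass["Sensitivity"]
})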

Model Tuning

We’re now going to tune the regression model by lowering its threshold to 0.4, 0.3, and 0.2 to find the best recall value and overall result.

# threshold > 0.4
label_glm_2 <- ifelse(predict_glm > 0.4, "1", "0")
label_glm_2 <- as.factor(label_glm_2)

# threshold > 0.3
label_glm_3 <- ifelse(predict_glm > 0.3, "1", "0")
label_glm_3 <- as.factor(label_glm_3)

# threshold > 0.2
label_glm_4 <- ifelse(predict_glm > 0.2, "1", "0")
label_glm_4 <- as.factor(label_glm_4)

# confusion matrix
confusionMatrix(label_glm_2, test$Survived_1, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 81 15
##          1 29 53
##                                           
##                Accuracy : 0.7528          
##                  95% CI : (0.6827, 0.8143)
##     No Information Rate : 0.618           
##     P-Value [Acc > NIR] : 0.00009647      
##                                           
##                   Kappa : 0.4963          
##                                           
##  Mcnemar's Test P-Value : 0.05002         
##                                           
##             Sensitivity : 0.7794          
##             Specificity : 0.7364          
##          Pos Pred Value : 0.6463          
##          Neg Pred Value : 0.8437          
##              Prevalence : 0.3820          
##          Detection Rate : 0.2978          
##    Detection Prevalence : 0.4607          
##       Balanced Accuracy : 0.7579          
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(label_glm_3, test$Survived_1, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 81 15
##          1 29 53
##                                           
##                Accuracy : 0.7528          
##                  95% CI : (0.6827, 0.8143)
##     No Information Rate : 0.618           
##     P-Value [Acc > NIR] : 0.00009647      
##                                           
##                   Kappa : 0.4963          
##                                           
##  Mcnemar's Test P-Value : 0.05002         
##                                           
##             Sensitivity : 0.7794          
##             Specificity : 0.7364          
##          Pos Pred Value : 0.6463          
##          Neg Pred Value : 0.8437          
##              Prevalence : 0.3820          
##          Detection Rate : 0.2978          
##    Detection Prevalence : 0.4607          
##       Balanced Accuracy : 0.7579          
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(label_glm_4, test$Survived_1, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 61  7
##          1 49 61
##                                           
##                Accuracy : 0.6854          
##                  95% CI : (0.6117, 0.7528)
##     No Information Rate : 0.618           
##     P-Value [Acc > NIR] : 0.03677         
##                                           
##                   Kappa : 0.404           
##                                           
##  Mcnemar's Test P-Value : 0.00000004281   
##                                           
##             Sensitivity : 0.8971          
##             Specificity : 0.5545          
##          Pos Pred Value : 0.5545          
##          Neg Pred Value : 0.8971          
##              Prevalence : 0.3820          
##          Detection Rate : 0.3427          
##    Detection Prevalence : 0.6180          
##       Balanced Accuracy : 0.7258          
##                                           
##        'Positive' Class : 1               
## 

The sensitivity of the model can go up to 0.897 (at the 0.2 threshold), but the other metrics drop significantly.

Conclusion

To answer the objective set at the beginning: the logistic regression model’s results are better than k-NN’s in terms of accuracy, sensitivity, precision, and specificity, as long as the threshold for the logistic model is > 0.3.

In this analysis I was too focused on getting a higher recall, since I was fixated on predicting as many passengers as possible as survivors. I then realized that whether to tune for recall, sensitivity, or accuracy all depends on the problem at hand.

In this project, since my only objective was to compare the two methods, I got the result I wanted. But when I tried this model on Kaggle it only scored 0.77 (similar to the accuracy value).

There are lots and lots of predictor combinations that could be tried to get a better result. The next time I work on this Titanic data, I want to use other classification methods and/or come up with a better way to explore the predictors.

Resources

  1. https://en.wikipedia.org/wiki/Titanic
  2. https://www.historyonthenet.com/the-titanic-lifeboats#
  3. http://www.sthda.com/english/articles/38-regression-model-validation/158-regression-model-accuracy-metrics-r-square-aic-bic-cp-and-more/