Introduction

This project learns to use a logistic regression model (later compared with k-nearest neighbour) on the “Titanic - Machine Learning from Disaster” dataset from Kaggle. We want to use the data to create a model that predicts which passengers survived the Titanic shipwreck.

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone on board, resulting in the death of 1502 of the 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. Thus, we would like to predict what sorts of people were more likely to survive.

There are two datasets: train.csv and test.csv. train.csv contains the details of a subset of the passengers on board (891, to be exact) and, importantly, reveals whether each of them survived: the “ground truth”. test.csv contains similar information but does not disclose the ground truth for each passenger.

Data Preparation

Load the required packages.

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gtools)
library(ggplot2)
library(caret)
## Loading required package: lattice
library(class)

Load the dataset.

titanic <- read.csv("train.csv")
head(titanic)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q
glimpse(titanic)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

The titanic data has 891 rows and 12 columns, with the descriptions below:

- PassengerId : passenger ID
- Survived : survival (0 = No, 1 = Yes)
- Pclass : ticket class (1 = 1st/upper, 2 = 2nd/middle, 3 = 3rd/lower)
- Name : passenger name
- Sex : sex
- Age : age in years
- SibSp : number of siblings/spouses aboard the Titanic
- Parch : number of parents/children aboard the Titanic (parent = mother/father, child = daughter/son/stepdaughter/stepson; children who travelled with a nanny have Parch = 0)
- Ticket : ticket number
- Fare : passenger fare
- Cabin : cabin number
- Embarked : port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Based on the data, the target variable is Survived (1 = survived, 0 = did not survive).

First, we drop the columns that are unique identifiers or that we will not use as predictors:

- PassengerId
- Name
- Ticket
- Cabin
- Embarked

We also change each remaining column to its correct data type.

titanic <- titanic %>% 
  select(-PassengerId, -Name, -Ticket, -Cabin, -Embarked) %>%
  mutate(Survived = as.factor(Survived))

Next, we check whether there are any missing values.

colSums(is.na(titanic))
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare 
##        0        0        0      177        0        0        0

The Age column has 177 missing values; all other columns are complete. We leave these as NA for now: glm() drops the affected rows during modelling, and we remove them explicitly before kNN.
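If we wanted to keep those rows instead, a common alternative (a sketch only, not applied in this workflow) is to impute the missing ages with the median:

# hypothetical alternative (not run): median imputation for Age
titanic <- titanic %>% 
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age))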

For the Sex column, we can convert the character values into a factor, encoding male as 0 and female as 1.

titanic <- titanic %>% 
  mutate(Sex = recode_factor(Sex, "male" = 0, "female" = 1))

We can also bin Age into categorical ranges.

titanic <- titanic %>% 
  mutate(Age = case_when(
    between(Age, 1, 24) ~ "< 25 years",
    between(Age, 25, 34) ~ "25-34 years",
    between(Age, 35, 44) ~ "35-44 years",
    between(Age, 45, 54) ~ "45-54 years",
    between(Age, 55, 64) ~ "55-64 years",
    between(Age, 65, 84) ~ ">65 years"
  ))
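Note that this case_when() leaves ages under 1 year (the dataset contains infants with fractional ages) and over 84 unmatched, so they become NA. An equivalent binning could be written with cut(); a sketch, with hypothetical right-closed breaks that are not identical at the boundaries:

# hypothetical alternative (not run): bin Age with cut()
titanic$Age <- cut(titanic$Age,
                   breaks = c(0, 24, 34, 44, 54, 64, Inf),
                   labels = c("< 25 years", "25-34 years", "35-44 years",
                              "45-54 years", "55-64 years", ">65 years"))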

Pclass represents an ordered ticket class, so it can be changed into a factor; we do the same for SibSp.

titanic <- titanic %>% 
  mutate(Pclass = as.factor(Pclass),
         SibSp = as.factor(SibSp))

Logistic Regression

Pre-Processing Data

prop.table(table(titanic$Survived))
## 
##         0         1 
## 0.6161616 0.3838384
table(titanic$Survived)
## 
##   0   1 
## 549 342

Based on the data, the proportion of not survived to survived is roughly 60:40, which is still reasonably balanced. Thus, we don’t need any additional pre-processing to balance the two classes.
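Had the classes been heavily imbalanced, caret offers downSample() and upSample(); a minimal sketch (not needed here), where titanic_balanced is a hypothetical name:

# hypothetical rebalancing (not run): downsample the majority class
titanic_balanced <- downSample(x = titanic %>% select(-Survived),
                               y = titanic$Survived,
                               yname = "Survived")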

Splitting Train-Test

Next, we split the data into a training set and a test set. The training set is used for modelling, while the test set is used to evaluate the model on unseen data.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

intrain <- sample(x = nrow(titanic), size = nrow(titanic)*0.8)
titanic_train <- titanic[intrain,]
titanic_test <- titanic[-intrain,]
prop.table(table(titanic_train$Survived))
## 
##         0         1 
## 0.6095506 0.3904494

The training data keeps roughly the same 60:40 proportion of not survived to survived, so the split preserves the class balance and no rebalancing is needed.
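The random split happened to preserve the proportions; a split that guarantees stratification could use caret’s createDataPartition() instead. A sketch, with hypothetical object names:

# hypothetical stratified 80/20 split (not run)
idx <- createDataPartition(titanic$Survived, p = 0.8, list = FALSE)
titanic_train2 <- titanic[idx, ]
titanic_test2  <- titanic[-idx, ]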

Modelling

Next, we build the logistic regression model using the glm() function. To choose the predictor variables, we fit the full model with all predictors (named model_forward below) and compare it against a model obtained by backward stepwise elimination with step().

model_forward <- glm(formula = Survived ~ ., data = titanic_train, family = "binomial")
summary(model_forward)
## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.329499   0.390323   0.844 0.398575    
## Pclass2         -1.218327   0.362402  -3.362 0.000774 ***
## Pclass3         -2.080475   0.369029  -5.638 1.72e-08 ***
## Sex1             2.514220   0.243402  10.330  < 2e-16 ***
## Age>65 years    -1.980675   1.125207  -1.760 0.078361 .  
## Age25-34 years  -0.163390   0.280440  -0.583 0.560149    
## Age35-44 years  -0.663440   0.335326  -1.978 0.047873 *  
## Age45-54 years  -1.030226   0.414610  -2.485 0.012962 *  
## Age55-64 years  -1.238150   0.608824  -2.034 0.041984 *  
## SibSp1          -0.129255   0.262322  -0.493 0.622200    
## SibSp2           0.278595   0.731193   0.381 0.703193    
## SibSp3          -1.541956   0.793829  -1.942 0.052085 .  
## SibSp4          -0.850582   0.791018  -1.075 0.282240    
## SibSp5         -14.941010 715.196527  -0.021 0.983333    
## Parch           -0.121038   0.131457  -0.921 0.357187    
## Fare             0.002659   0.002670   0.996 0.319290    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 747.74  on 551  degrees of freedom
## Residual deviance: 519.35  on 536  degrees of freedom
##   (160 observations deleted due to missingness)
## AIC: 551.35
## 
## Number of Fisher Scoring iterations: 14
model_backward <- step(object = model_forward, direction = "backward", trace = F)
summary(model_backward)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age, family = "binomial", 
##     data = titanic_train)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.39265    0.31135   1.261   0.2073    
## Pclass2        -1.32110    0.31537  -4.189 2.80e-05 ***
## Pclass3        -2.29285    0.30663  -7.478 7.57e-14 ***
## Sex1            2.37398    0.22729  10.445  < 2e-16 ***
## Age>65 years   -1.97138    1.12175  -1.757   0.0788 .  
## Age25-34 years -0.07324    0.27291  -0.268   0.7884    
## Age35-44 years -0.56470    0.32453  -1.740   0.0819 .  
## Age45-54 years -0.93976    0.40518  -2.319   0.0204 *  
## Age55-64 years -1.11315    0.59190  -1.881   0.0600 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 747.74  on 551  degrees of freedom
## Residual deviance: 529.12  on 543  degrees of freedom
##   (160 observations deleted due to missingness)
## AIC: 547.12
## 
## Number of Fisher Scoring iterations: 4

Based on the model above (model_backward), we can interpret that:

- The variables significant to the target are Pclass, Sex, and Age.
- The variable that increases the survival odds: Sex1 (female).
- The variables that decrease the survival odds: Pclass2 (second class), Pclass3 (third class), and Age 45-54 years.
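To make the coefficients concrete, a quick worked example (a sketch using the estimates from the summary above): for a first-class female passenger in the reference age group (< 25 years), only the intercept and the Sex1 coefficient enter the linear predictor.

# predicted survival probability = inverse logit of the linear predictor
plogis(0.39265 + 2.37398)   # ≈ 0.94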

Exponentiating a coefficient turns the log-odds into an odds ratio:

# odds ratio from the Pclass2 coefficient
exp(-1.32110)
## [1] 0.2668416
# odds ratio from the Pclass3 coefficient
exp(-2.29285)
## [1] 0.1009783
# odds ratio from the Sex1 coefficient
exp(2.37398)
## [1] 10.74005
# odds ratio from the Age 45-54 years coefficient
exp(-0.93976)
## [1] 0.3907216

Based on the exponentiated coefficients (odds ratios) of the significant predictors, we can interpret that, holding the other variables constant:

- The odds of survival for a Pclass2 (second class) passenger are about 0.27 times those of a first-class passenger (the reference level).
- The odds of survival for a Pclass3 (third class) passenger are about 0.10 times those of a first-class passenger.
- The odds of survival for a female passenger (Sex1) are about 10.7 times those of a male passenger.
- The odds of survival for a passenger aged 45-54 years are about 0.39 times those of a passenger in the reference age group (< 25 years).
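The same odds ratios, together with 95% confidence intervals, can be computed in one step; a minimal sketch (not run in the original workflow):

# odds ratios with profile-likelihood 95% CIs for the backward model
exp(cbind(OR = coef(model_backward), confint(model_backward)))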

Predict

Using the backward model, we predict survival probabilities on the test set and inspect the distribution of the predictions.

titanic_test$prob_Survived <- predict(object = model_backward,
                           newdata = titanic_test,
                           type = "response")
ggplot(titanic_test, aes(x = prob_Survived)) +
  geom_density(lwd = 0.5) +
  labs(title = "Distribution of Predicted Probabilities") +
  theme_minimal()
## Warning: Removed 26 rows containing non-finite values (`stat_density()`).

The density plot shows that the predicted probabilities are concentrated near 0, i.e. most passengers are predicted not to survive. (The warning comes from the 26 test rows with missing Age, whose predicted probability is NA.)
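A quick numeric check of the same distribution (a sketch, not run here):

# five-number summary of the predicted probabilities (NAs from missing Age)
summary(titanic_test$prob_Survived)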

titanic_test$pred_Survived <- factor(ifelse(titanic_test$prob_Survived > 0.5, "Survived", "Not Survived"))
titanic_test[1:10, c("pred_Survived", "Survived")]
##    pred_Survived Survived
## 3       Survived        1
## 4       Survived        1
## 5   Not Survived        0
## 7   Not Survived        0
## 8   Not Survived        0
## 17  Not Survived        0
## 21  Not Survived        0
## 23      Survived        1
## 35      Survived        0
## 38  Not Survived        0

In the syntax above, test observations with a predicted probability greater than 0.5 are labelled “Survived”, and the rest “Not Survived”.
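A sketch to count the resulting labels (the NAs are the test passengers with missing Age):

# tally the predicted classes, keeping NAs visible
table(titanic_test$pred_Survived, useNA = "ifany")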

Model Evaluation

To evaluate the model, we build a confusion matrix.

First, we need to check the unique levels in both vectors.

unique(titanic_test$pred_Survived)
## [1] Survived     Not Survived <NA>        
## Levels: Not Survived Survived
unique(titanic_test$Survived)
## [1] 1 0
## Levels: 0 1

pred_Survived and Survived have different levels, so we need to relabel pred_Survived to match. Since levels() assigns by position and both level sets sort in the same order (“Not Survived” → “0”, “Survived” → “1”), the relabelling below maps correctly.
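An alternative sketch that produces matching 0/1 labels directly at prediction time (hypothetical, equivalent to the relabelling below):

# label with "0"/"1" from the start so the levels already match Survived
titanic_test$pred_Survived <- factor(ifelse(titanic_test$prob_Survived > 0.5, "1", "0"),
                                     levels = c("0", "1"))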

levels(titanic_test$pred_Survived) <- levels(titanic_test$Survived)
titanic_conf <- confusionMatrix(data = titanic_test$pred_Survived, reference = titanic_test$Survived, positive = "1")
titanic_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 86 13
##          1 11 43
##                                           
##                Accuracy : 0.8431          
##                  95% CI : (0.7757, 0.8968)
##     No Information Rate : 0.634           
##     P-Value [Acc > NIR] : 9.294e-09       
##                                           
##                   Kappa : 0.6594          
##                                           
##  Mcnemar's Test P-Value : 0.8383          
##                                           
##             Sensitivity : 0.7679          
##             Specificity : 0.8866          
##          Pos Pred Value : 0.7963          
##          Neg Pred Value : 0.8687          
##              Prevalence : 0.3660          
##          Detection Rate : 0.2810          
##    Detection Prevalence : 0.3529          
##       Balanced Accuracy : 0.8272          
##                                           
##        'Positive' Class : 1               
## 
  • Recall/Sensitivity = of all the actually positive cases, the proportion the model predicts correctly. Based on the model, the recall/sensitivity value is 0.7679.
  • Specificity = of all the actually negative cases, the proportion the model predicts correctly. Based on the model, the specificity value is 0.8866.
  • Accuracy = the proportion of all cases the model predicts correctly. Based on the model, the accuracy value is 0.8431.
  • Pos Pred Value/Precision = of all the cases predicted positive, the proportion that are actually positive. Based on the model, the precision value is 0.7963.

Here Positive = Survived and Negative = Not Survived, so:

- FP = predicted survived (+), but actually did not survive (-)
- FN = predicted not survived (-), but actually survived (+)

In this case we want to minimize false positives. Thus, we focus on “precision”.
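These metrics follow directly from the matrix counts above (TP = 43, FP = 11, FN = 13, TN = 86); a quick check:

# precision = TP / (TP + FP)
43 / (43 + 11)   # 0.7963
# recall = TP / (TP + FN)
43 / (43 + 13)   # 0.7679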

K-Nearest Neighbour

Pre-Processing Data

We check the value range of each variable to see whether further pre-processing is needed. The summary looks reasonable; however, since kNN computes distances it needs numeric inputs, so the factor variables are one-hot encoded with dummyVars() below.

summary(titanic)
##  Survived Pclass  Sex         Age            SibSp       Parch       
##  0:549    1:216   0:577   Length:891         0:608   Min.   :0.0000  
##  1:342    2:184   1:314   Class :character   1:209   1st Qu.:0.0000  
##           3:491           Mode  :character   2: 28   Median :0.0000  
##                                              3: 16   Mean   :0.3816  
##                                              4: 18   3rd Qu.:0.0000  
##                                              5:  5   Max.   :6.0000  
##                                              8:  7                   
##       Fare       
##  Min.   :  0.00  
##  1st Qu.:  7.91  
##  Median : 14.45  
##  Mean   : 32.20  
##  3rd Qu.: 31.00  
##  Max.   :512.33  
## 
glimpse(titanic)
## Rows: 891
## Columns: 7
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
## $ Pclass   <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ Sex      <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0…
## $ Age      <chr> "< 25 years", "35-44 years", "25-34 years", "35-44 years", "3…
## $ SibSp    <fct> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0…
## $ Parch    <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0…
## $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
titanic_knn <- titanic
dummy <- dummyVars("~.", data = titanic_knn)
dummy <- data.frame(predict(dummy, newdata = titanic_knn))
glimpse(dummy)
## Rows: 891
## Columns: 22
## $ Survived.0     <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1…
## $ Survived.1     <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0…
## $ Pclass.1       <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ Pclass.2       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0…
## $ Pclass.3       <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1…
## $ Sex.0          <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0…
## $ Sex.1          <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1…
## $ Age..25.years  <dbl> 1, 0, 0, 0, 0, NA, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, NA,…
## $ Age.65.years   <dbl> 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA,…
## $ Age25.34.years <dbl> 0, 0, 1, 0, 0, NA, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, NA,…
## $ Age35.44.years <dbl> 0, 1, 0, 1, 1, NA, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, NA,…
## $ Age45.54.years <dbl> 0, 0, 0, 0, 0, NA, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA,…
## $ Age55.64.years <dbl> 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, NA,…
## $ SibSp.0        <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0…
## $ SibSp.1        <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1…
## $ SibSp.2        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SibSp.3        <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SibSp.4        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ SibSp.5        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SibSp.8        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Parch          <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0…
## $ Fare           <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86…
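Because kNN relies on distances, a wide-range numeric column such as Fare can dominate the 0/1 dummy columns. A hedged sketch of min-max scaling (not applied in this workflow; rescale01 and dummy_scaled are hypothetical names):

# hypothetical rescaling of the unbounded columns to [0, 1] (not run)
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
dummy_scaled <- dummy
dummy_scaled$Fare  <- rescale01(dummy_scaled$Fare)
dummy_scaled$Parch <- rescale01(dummy_scaled$Parch)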

Removing the rows that contain missing values (NA).

anyNA(dummy)
## [1] TRUE
dummy <- na.omit(dummy)

Splitting the data into training and test sets.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

intrain <- sample(x = nrow(dummy), size = nrow(dummy)*0.8)
dummy_train <- dummy[intrain,]
dummy_test <- dummy[-intrain,]
glimpse(dummy)
## Rows: 705
## Columns: 22
## $ Survived.0     <dbl> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0…
## $ Survived.1     <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1…
## $ Pclass.1       <dbl> 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Pclass.2       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1…
## $ Pclass.3       <dbl> 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0…
## $ Sex.0          <dbl> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
## $ Sex.1          <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0…
## $ Age..25.years  <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0…
## $ Age.65.years   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Age25.34.years <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1…
## $ Age35.44.years <dbl> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0…
## $ Age45.54.years <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Age55.64.years <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0…
## $ SibSp.0        <dbl> 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1…
## $ SibSp.1        <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0…
## $ SibSp.2        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SibSp.3        <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SibSp.4        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ SibSp.5        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SibSp.8        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Parch          <dbl> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Fare           <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0…
# Predictor columns. Note: column 1 is Survived.0, the complement of the
# target, so including it leaks the label into the features; dropping it
# (i.e. using columns 3:22 only) would be the safer choice.
dummy_train_x <- dummy_train[, c(1, 3:22)]
dummy_test_x  <- dummy_test[, c(1, 3:22)]

# Target
dummy_train_y <- dummy_train[, 2]
dummy_test_y  <- dummy_test[, 2]

Predict

We need to choose the optimum K. A common heuristic is the square root of the number of training rows:

sqrt(nrow(titanic_train))
## [1] 26.68333

With two target classes (survived / not survived), we pick an odd K to avoid tie votes: 26.68 rounds up to K = 27. (Note the square root above was taken over titanic_train, the logistic-regression split; sqrt(nrow(dummy_train)) gives a similar value of about 23.7.)
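Rather than relying on the heuristic, K could be tuned by scanning odd values and scoring each; a sketch (properly this should use a validation split or cross-validation rather than the test set):

# hypothetical scan over odd K values (not run)
for (k in seq(3, 31, by = 2)) {
  pred_k <- class::knn(train = dummy_train_x, test = dummy_test_x,
                       cl = dummy_train_y, k = k)
  cat("k =", k, "accuracy =", mean(pred_k == dummy_test_y), "\n")
}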

titanic_pred <- class::knn(train = dummy_train_x,
                           test = dummy_test_x,
                           cl = dummy_train_y,
                           k = 27)
pred_knn_conf <- confusionMatrix(as.factor(titanic_pred),
                                 as.factor(dummy_test_y), "1")
pred_knn_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 71 21
##          1  9 40
##                                           
##                Accuracy : 0.7872          
##                  95% CI : (0.7104, 0.8516)
##     No Information Rate : 0.5674          
##     P-Value [Acc > NIR] : 3.584e-08       
##                                           
##                   Kappa : 0.5562          
##                                           
##  Mcnemar's Test P-Value : 0.04461         
##                                           
##             Sensitivity : 0.6557          
##             Specificity : 0.8875          
##          Pos Pred Value : 0.8163          
##          Neg Pred Value : 0.7717          
##              Prevalence : 0.4326          
##          Detection Rate : 0.2837          
##    Detection Prevalence : 0.3475          
##       Balanced Accuracy : 0.7716          
##                                           
##        'Positive' Class : 1               
## 
  • Recall/Sensitivity = based on the model, the recall/sensitivity value is 0.6557.
  • Specificity = based on the model, the specificity value is 0.8875.
  • Accuracy = based on the model, the accuracy value is 0.7872.
  • Pos Pred Value/Precision = based on the model, the precision value is 0.8163.

Model Evaluation and Conclusion

Metrics summary of both models:

1. Recall/Sensitivity: Logistic Regression 0.7679, KNN 0.6557
2. Specificity: Logistic Regression 0.8866, KNN 0.8875
3. Accuracy: Logistic Regression 0.8431, KNN 0.7872
4. Pos Pred Value/Precision: Logistic Regression 0.7963, KNN 0.8163

Comparing the two models, KNN has the better precision in predicting people who survived. Precision is the metric of choice when we want to minimize false positives, i.e. to make sure that passengers predicted to survive really did survive; by that criterion the KNN model is preferred, while logistic regression scores higher on recall and accuracy.