Intro

We will try to predict the survival of Titanic passengers based on some of the predictor variables using Logistic Regression and K-Nearest Neighbor(KNN).

Library and Setup

Load the required packages.

library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(tidyverse)
library(caret)
library(scales)
library(GGally)
library(class)

Data Preparation

The data that will be use is titanic data from Kaggle. You can download the data here

Load the dataset

data <- read.csv("data_input/titanic.csv")
str(data)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
head(data)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

Brief explanation about the data

PassengerID : ID of the passenger survival: Survival status of the passenger (0 = No, 1 = Yes) pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) name : Name of the passenger sex : Gender of the passenger
Age : Age in years
sibsp : numbers of siblings / spouses aboard the Titanic
parch : numbers of parents / children aboard the Titanic
ticket : Ticket number
fare : Passenger fare
cabin : Cabin number
embarked : Port of Embarkation(C = Cherbourg, Q = Queenstown, S = Southampton)

Check missing values

colSums(is.na(data))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

There are missing values in our data so we need to delete those data.

data_clean <- na.omit(data)

Re-check for missing values

colSums(is.na(data_clean))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Great! There are no more missing values in our data.

Data Manipulation

We need to change the data types of some variables and also eliminate some of the irrelevant data for the model

data_clean <- data_clean %>% 
              mutate(Survived = as.factor(Survived)) %>% 
              mutate(Pclass = as.factor(Pclass)) %>% 
              mutate(Sex = as.factor(Sex)) %>% 
              mutate(Embarked = as.factor(Embarked)) %>% 
              dplyr::select(-PassengerId, -Name, -Ticket, -Cabin)

Data Pre-processing

Let’s see the proportion of the target variable

prop.table(table(data_clean$Survived))
## 
##         0         1 
## 0.5938375 0.4061625

It seems that the proportion of the target variable is pretty balanced so we don’t need to additional pre-processing of the data to balance the proportion

Splitting Train-Test

We will split our data into train and test data with the proportion of 80% train data and 20% test data

RNGkind(sample.kind = "Rounding")
set.seed(417)

index_data_clean <- sample(nrow(data_clean), nrow(data_clean)*0.8)
  
data_train <- data_clean[index_data_clean,]
data_test <- data_clean[-index_data_clean,]

Check the proportion of the train data

prop.table(table(data_train$Survived))
## 
##         0         1 
## 0.5989492 0.4010508
head(data_clean)
##   Survived Pclass    Sex Age SibSp Parch    Fare Embarked
## 1        0      3   male  22     1     0  7.2500        S
## 2        1      1 female  38     1     0 71.2833        C
## 3        1      3 female  26     0     0  7.9250        S
## 4        1      1 female  35     1     0 53.1000        S
## 5        0      3   male  35     0     0  8.0500        S
## 7        0      1   male  54     0     0 51.8625        S

Logistic Regression

We will create a logistic regression model using all predictor variables and named it model

Modelling

model <- glm(formula = Survived~., family = "binomial", 
             data = data_train)
summary(model)
## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = data_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6468  -0.6764  -0.4080   0.6712   2.3997  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  16.529034 609.237874   0.027  0.97836    
## Pclass2      -1.009352   0.355580  -2.839  0.00453 ** 
## Pclass3      -2.207270   0.375362  -5.880 4.09e-09 ***
## Sexmale      -2.540175   0.244564 -10.387  < 2e-16 ***
## Age          -0.040920   0.009115  -4.489 7.15e-06 ***
## SibSp        -0.365919   0.142287  -2.572  0.01012 *  
## Parch        -0.058835   0.134371  -0.438  0.66149    
## Fare          0.001882   0.002871   0.655  0.51218    
## EmbarkedC   -12.377652 609.237721  -0.020  0.98379    
## EmbarkedQ   -13.202104 609.237953  -0.022  0.98271    
## EmbarkedS   -12.776693 609.237702  -0.021  0.98327    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 769.06  on 570  degrees of freedom
## Residual deviance: 522.68  on 560  degrees of freedom
## AIC: 544.68
## 
## Number of Fisher Scoring iterations: 13

Model Fitting

In the first model, there are many insignificant predictor variables, therefore we will create a new model using stepwise method and we will name it model2

library(MASS)
model2 <- stepAIC(model, direction = "backward")
## Start:  AIC=544.68
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
##            Df Deviance    AIC
## - Embarked  3   525.69 541.69
## - Parch     1   522.87 542.87
## - Fare      1   523.14 543.14
## <none>          522.68 544.68
## - SibSp     1   529.88 549.88
## - Age       1   544.65 564.65
## - Pclass    2   562.00 580.00
## - Sex       1   655.64 675.64
## 
## Step:  AIC=541.69
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
## 
##          Df Deviance    AIC
## - Parch   1   525.93 539.93
## - Fare    1   526.59 540.59
## <none>        525.69 541.69
## - SibSp   1   533.43 547.43
## - Age     1   549.12 563.12
## - Pclass  2   568.78 580.78
## - Sex     1   662.94 676.94
## 
## Step:  AIC=539.93
## Survived ~ Pclass + Sex + Age + SibSp + Fare
## 
##          Df Deviance    AIC
## - Fare    1   526.68 538.68
## <none>        525.93 539.93
## - SibSp   1   535.08 547.08
## - Age     1   549.28 561.28
## - Pclass  2   571.60 581.60
## - Sex     1   665.91 677.91
## 
## Step:  AIC=538.68
## Survived ~ Pclass + Sex + Age + SibSp
## 
##          Df Deviance    AIC
## <none>        526.68 538.68
## - SibSp   1   535.33 545.33
## - Age     1   551.21 561.21
## - Pclass  2   603.60 611.60
## - Sex     1   668.90 678.90

The summary of the model2

summary(model2)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = data_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7059  -0.6792  -0.4054   0.6654   2.4088  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.096828   0.489796   8.364  < 2e-16 ***
## Pclass2     -1.261508   0.307725  -4.099 4.14e-05 ***
## Pclass3     -2.502068   0.314380  -7.959 1.74e-15 ***
## Sexmale     -2.528740   0.235241 -10.750  < 2e-16 ***
## Age         -0.042459   0.008987  -4.724 2.31e-06 ***
## SibSp       -0.376970   0.134639  -2.800  0.00511 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 769.06  on 570  degrees of freedom
## Residual deviance: 526.68  on 565  degrees of freedom
## AIC: 538.68
## 
## Number of Fisher Scoring iterations: 5

We will create prediction of survival probability based on model2

data_test$prob_survival<-predict(model2, type = "response", newdata = data_test)

We will create a plot to see the distribution of probability prediction data

ggplot(data_test, aes(x=prob_survival)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()

In the plot above, we can see that the prediction result have a tendency towards 0 meaning there are more predictions of not survived than survived.

We will create a threshold of 0.5, meaning that if the prediction probability > 0.5 then it will be considered that the passengers survived.

data_test$pred_survival <- factor(ifelse(data_test$prob_survival> 0.5, 1 , 0))
data_test[1:10, c("pred_survival", "Survived")]
##    pred_survival Survived
## 4              1        1
## 9              1        1
## 10             1        1
## 12             1        1
## 23             1        1
## 24             1        1
## 28             0        0
## 40             1        1
## 42             1        0
## 51             0        0

We will create a confusion matrix to evaluate the model

log_conf <- confusionMatrix(data_test$pred_survival, data_test$Survived, positive = "1")
log_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 73 14
##          1  9 47
##                                           
##                Accuracy : 0.8392          
##                  95% CI : (0.7685, 0.8952)
##     No Information Rate : 0.5734          
##     P-Value [Acc > NIR] : 9.57e-12        
##                                           
##                   Kappa : 0.6677          
##                                           
##  Mcnemar's Test P-Value : 0.4042          
##                                           
##             Sensitivity : 0.7705          
##             Specificity : 0.8902          
##          Pos Pred Value : 0.8393          
##          Neg Pred Value : 0.8391          
##              Prevalence : 0.4266          
##          Detection Rate : 0.3287          
##    Detection Prevalence : 0.3916          
##       Balanced Accuracy : 0.8304          
##                                           
##        'Positive' Class : 1               
## 

Accuracy = refers to how accurate the model can correctly predict the target variable. Re-call/Sensitivity = refers to the probability of a positive test when the actual outcome is positive Specificity = refers to the probability of a negative test when the actual outcome is negative Precision = refers to how many of our positive prediction is correct

\[ Accuracy=\frac {TP+TN}{TP+TN+FP+FN}\\ Sensitivity = \frac {TP}{TP+FN}\\ Specificity = \frac {TN}{TN+FP}\\ Precision = \frac {TP}{TP+FP} \] Logistic regression produced coefficients in the form of log of odds. To interpret the coefficients from the model, we need to transforms the coefficients into odds.

exp(model2$coefficients) %>% 
  data.frame() 
##                       .
## (Intercept) 60.14918886
## Pclass2      0.28322648
## Pclass3      0.08191538
## Sexmale      0.07975942
## Age          0.95842936
## SibSp        0.68593653

Interpretation: -Pclass2: Passengers in 2nd class is 0.28 times more likely to survived -Pclass3: Passengers in 3rd class is 0.08 times more likely to survived -Sexmale: Male passengers is 0.08 times more likely to survived -Age: For every 1 year increase in age, the passenger is 0.95 times more likely to survived -SibSp: For every 1 sibling, the passenger is 0.68 times more likely to survived

K-Nearest Neighbor

Data Pre-processing

We will create dummy variables for categorical data since K-NN is not suitable to use with categorical data

dmy <- dummyVars(" ~Survived+Sex+Pclass+Embarked", data = data_clean)
dmy <- data.frame(predict(dmy, newdata = data_clean))
str(dmy)
## 'data.frame':    714 obs. of  11 variables:
##  $ Survived.0: num  1 0 0 0 1 1 1 0 0 0 ...
##  $ Survived.1: num  0 1 1 1 0 0 0 1 1 1 ...
##  $ Sex.female: num  0 1 1 1 0 0 0 1 1 1 ...
##  $ Sex.male  : num  1 0 0 0 1 1 1 0 0 0 ...
##  $ Pclass.1  : num  0 1 0 1 0 1 0 0 0 0 ...
##  $ Pclass.2  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Pclass.3  : num  1 0 1 0 1 0 1 1 0 1 ...
##  $ Embarked. : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Embarked.C: num  0 1 0 0 0 0 0 0 1 0 ...
##  $ Embarked.Q: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Embarked.S: num  1 0 1 1 1 1 1 1 0 1 ...

We will delete the dummy variables that only consisted of 2 categories

dmy$Survived.0 <- NULL
dmy$Sex.female <- NULL

Let’s check the names of the dummy variables

names(dmy)
## [1] "Survived.1" "Sex.male"   "Pclass.1"   "Pclass.2"   "Pclass.3"  
## [6] "Embarked."  "Embarked.C" "Embarked.Q" "Embarked.S"

Create train and test data from the dummy variables

set.seed(417)
dmy_train <- dmy[index_data_clean,2:9]
dmy_test <- dmy[-index_data_clean,2:9]

dmy_train_label <- dmy[index_data_clean,1]
dmy_test_label <- dmy[-index_data_clean,1]

Picking optimum K

sqrt(nrow(data_train))
## [1] 23.89561

Modelling

pred_knn <- class::knn(train = dmy_train,
                       test = dmy_test, 
                       cl = dmy_train_label, 
                       k = 24)

Confusion matrix

pred_knn_conf <- confusionMatrix(as.factor(pred_knn), as.factor(dmy_test_label),"1")
pred_knn_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 80 26
##          1  2 35
##                                           
##                Accuracy : 0.8042          
##                  95% CI : (0.7296, 0.8658)
##     No Information Rate : 0.5734          
##     P-Value [Acc > NIR] : 4.642e-09       
##                                           
##                   Kappa : 0.5785          
##                                           
##  Mcnemar's Test P-Value : 1.383e-05       
##                                           
##             Sensitivity : 0.5738          
##             Specificity : 0.9756          
##          Pos Pred Value : 0.9459          
##          Neg Pred Value : 0.7547          
##              Prevalence : 0.4266          
##          Detection Rate : 0.2448          
##    Detection Prevalence : 0.2587          
##       Balanced Accuracy : 0.7747          
##                                           
##        'Positive' Class : 1               
## 

Comparing the results of the confusion matrix from Logistic Regression and K-NN model that we have created

eval_logit <- data_frame(Accuracy = log_conf$overall[1],
           Recall = log_conf$byClass[1],
           Specificity = log_conf$byClass[2],
           Precision = log_conf$byClass[3])

eval_knn <- data_frame(Accuracy = pred_knn_conf$overall[1],
           Recall = pred_knn_conf$byClass[1],
           Specificity = log_conf$byClass[2],
           Precision = pred_knn_conf$byClass[3])
eval_logit
## # A tibble: 1 x 4
##   Accuracy Recall Specificity Precision
##      <dbl>  <dbl>       <dbl>     <dbl>
## 1    0.839  0.770       0.890     0.839
eval_knn
## # A tibble: 1 x 4
##   Accuracy Recall Specificity Precision
##      <dbl>  <dbl>       <dbl>     <dbl>
## 1    0.804  0.574       0.890     0.946

Conclusion

Logistic regression performs better in all metrics compared to K-NN except for precision. In this case, the evaluation metrics will be precision because precision measures how many of our positive prediction is correct. K-NN has 94.5% precision, meanwhile Logistic regression only has 83.9% precision. Therefore, K-NN is the better model in predicting the survival chance of titanic passengers.