Logistic Regression and K-NN on Titanic

Introduction

Logistic regression is a classification algorithm used to fit a regression curve, y = f(x), where y is a categorical variable. When y is binary (1 for yes, 0 for no) the model is called binomial logistic regression; when y can take more than two values, it is referred to as multinomial logistic regression. We can think of logistic regression as a special case of linear regression in which the log of the odds serves as the target variable.
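
To make the link concrete, here is a quick illustrative sketch in base R (illustrative only, not part of the analysis below): qlogis() maps a probability to log-odds, and plogis() maps log-odds back to a probability.

# illustrative only: probability <-> log-odds round trip
p <- 0.8
qlogis(p)          # log(0.8 / 0.2) = 1.386294
plogis(qlogis(p))  # recovers 0.8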

K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k nearest neighbors. Choosing the number of neighbors, i.e. the value of k, plays a significant role in the efficacy of the model: it determines how well the algorithm generalizes from the data. A large k value reduces the variance caused by noisy data, but the side effect is added bias, since the learner tends to ignore smaller patterns that may hold useful insights.

In this case study, we are going to predict the survival of Titanic passengers, using logistic regression and K-Nearest Neighbors (K-NN) as the classification methods. The data is the Titanic passenger dataset (train and test splits). We will briefly explore the data, perform preprocessing where applicable, then build and evaluate the models.

Data Wrangling

Load the required packages

library(dplyr)          
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(kableExtra)
library(scales)

Load Dataset

titanic.train <- read.csv("data_input/train.csv")
titanic.test <- read.csv("data_input/test.csv")
# Combining all data
titanic <- bind_rows(titanic.train,titanic.test)
# check data structure
glimpse(titanic)
Rows: 1,309
Columns: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
$ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
$ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
$ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
$ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal~
$ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
$ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
$ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
$ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
$ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
$ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
$ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~

The data has 1309 rows and 12 columns. Here is a description of each column in the titanic dataset:

  • PassengerId: ID of the passenger.

  • Survived: survival status, 0 = No, 1 = Yes.

  • Pclass: Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd.

  • Name: name of the passenger.

  • Sex: sex, male and female.

  • Age: Age in years.

  • SibSp: number of siblings / spouses aboard the Titanic.

  • Parch: number of parents / children aboard the Titanic.

  • Ticket: Ticket number

  • Fare: Passenger fare

  • Cabin: Cabin number

  • Embarked: Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southampton.

Before we go further, we first need to make sure that our data is clean and usable, and check which columns have an incorrect data type. Any column with the wrong type will need to be converted.

Exploratory Data Analysis

## Let's check for any missing values in the data
colSums(is.na(titanic))
PassengerId    Survived      Pclass        Name         Sex         Age 
          0         418           0           0           0         263 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
          0           0           0           1           0           0 
## Checking for empty values
colSums(titanic=='')
PassengerId    Survived      Pclass        Name         Sex         Age 
          0          NA           0           0           0          NA 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
          0           0           0          NA        1014           2 

There are 5 variables with missing values: Survived, Age, Fare, Cabin, and Embarked.

We will not impute the Cabin variable, as it has too many missing values. PassengerId, Ticket, and Name carry no information we really need, so we can remove those columns as well, and I will remove the rows with missing values in the Survived column. The variables that need to be treated are Age, Fare, and Embarked.

First, we check which observation has a missing Fare value:

# Find which observation has missing fare value
which(is.na(titanic$Fare))
[1] 1044
titanic[1044,]
     PassengerId Survived Pclass               Name  Sex  Age SibSp Parch
1044        1044       NA      3 Storey, Mr. Thomas male 60.5     0     0
     Ticket Fare Cabin Embarked
1044   3701   NA              S

The passenger with the missing fare value is a third-class passenger who departed from Southampton (‘S’). Let’s visualize the fares of all other passengers sharing that class and embarkation port.

ggplot(titanic[titanic$Pclass == '3' & titanic$Embarked == 'S', ], 
  aes(x = Fare)) +
  geom_density(fill = '#99d6ff', alpha=0.4) + 
  geom_vline(aes(xintercept=median(Fare, na.rm=T)),
    colour='red', linetype='dashed', lwd=1) +
  scale_x_continuous(labels=scales::dollar_format()) +
  theme_minimal()

From the visualization, we can reasonably replace the missing value with the median fare for that Pclass and Embarked combination, which is $8.05.

# Replace fare missing value with the median fare for Pclass and Embarked
titanic$Fare[1044] <- median(titanic[titanic$Pclass == 3 & titanic$Embarked == 'S', ]$Fare,na.rm = TRUE)

Let’s continue and check the missing Embarked values:

titanic %>% dplyr::filter(Embarked=='')
  PassengerId Survived Pclass                                      Name    Sex
1          62        1      1                       Icard, Miss. Amelie female
2         830        1      1 Stone, Mrs. George Nelson (Martha Evelyn) female
  Age SibSp Parch Ticket Fare Cabin Embarked
1  38     0     0 113572   80   B28         
2  62     0     0 113572   80   B28         

From this information there are two notable similarities between these passengers: both belong to first class and both paid $80 as fare. So we can visualize the port of embarkation for passengers sharing similar information.

plot.data <- titanic[-c(62,830),]

ggplot(plot.data, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
  geom_boxplot() +
  geom_hline(aes(yintercept = 80), colour = 'red', linetype = 'dashed', lwd = 2) +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal()

This visualization gives a clear picture: the median fare for a first-class passenger departing from Cherbourg (C) coincides nicely with the $80 paid by our embarkment-deficient passengers. So the missing Embarked values can be replaced by ‘C’.

# replace missing value in Embarked with "C"
titanic$Embarked[c(62,830)] <- "C"

Now we deal with imputing the Age variable, which has 263 missing values. For this column I will replace the missing values with the mean of the Age column.

# replace using mean value of the age
titanic$Age[is.na(titanic$Age)] <- mean(titanic$Age, na.rm = T)
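
A variant worth noting, as a sketch only (not applied here, since the rest of the analysis uses the overall mean): imputing the median Age within each passenger class, which respects the systematic age differences across ticket classes.

# alternative sketch (not run): impute median Age per Pclass
# titanic <- titanic %>%
#   group_by(Pclass) %>%
#   mutate(Age = if_else(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
#   ungroup()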

Next, the missing values in Survived. Since this is the target variable, I cannot impute it without biasing the predictions, so I will remove those rows (these are the rows that came from the unlabeled test file).

titanic <- titanic[complete.cases(titanic), ]    

Next, remove Cabin, PassengerId, Ticket, and Name column :

# remove cabin, passengerid, ticket, and name column
titanic <- titanic %>% 
  dplyr::select(-c(Cabin, PassengerId, Ticket, Name))

OK, the missing-value cleaning is done. Let’s check again whether any missing values remain in the dataset:

colSums(is.na(titanic))
Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
       0        0        0        0        0        0        0        0 

Then, I will check the categorical columns.

# Check number of unique values for each column to find out which columns we can convert to factors
sapply(titanic, function(x) length(unique(x)))
Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
       2        3        2       89        7        7      248        3 

From the number of unique values above, Survived, Pclass, Sex, and Embarked are categorical variables whose current data types do not match; we should change them into factors.

# change data type into categorical & create labels for target variable
titanic <- titanic %>% 
  mutate(Survived = factor(Survived, levels = c(0,1), labels = c("No","Yes")),
         Pclass = factor(Pclass),
         Sex = factor(Sex),
         Embarked = factor(Embarked))

glimpse(titanic)
Rows: 891
Columns: 8
$ Survived <fct> No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, Yes, Yes, No, No~
$ Pclass   <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2~
$ Sex      <fct> male, female, female, female, male, male, male, male, female,~
$ Age      <dbl> 22.00000, 38.00000, 26.00000, 35.00000, 35.00000, 29.88114, 5~
$ SibSp    <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0, 0~
$ Parch    <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0~
$ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21~
$ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C, S~
## Create dummy variables for categorical variables
library(dummies)
titanic<- dummy.data.frame(titanic, names=c("Pclass","Sex","Embarked"), sep="_")
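
Note that the dummies package has since been archived on CRAN. If it is unavailable, the same one-column-per-level encoding can be reproduced with base R; the helper below (to_dummy is a hypothetical name, a minimal sketch) appends the dummy columns at the end rather than in place.

# base-R fallback sketch: one indicator column per factor level
to_dummy <- function(df, cols, sep = "_") {
  for (cl in cols) {
    f <- factor(df[[cl]])
    m <- model.matrix(~ f - 1)                  # -1 keeps ALL levels
    colnames(m) <- paste(cl, levels(f), sep = sep)
    df[[cl]] <- NULL
    df <- cbind(df, m)
  }
  df
}
# titanic <- to_dummy(titanic, c("Pclass", "Sex", "Embarked"))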

Modeling

Splitting Data

Before we train our models, we first check for class imbalance in the target variable, since class imbalance can affect the model’s ability to classify the output.

prop.table(table(titanic$Survived))

       No       Yes 
0.6161616 0.3838384 

It is quite balanced, but we can balance the classes further for improvement later. Next, split the dataset into train and test sets; the purpose is to check whether the model can classify new data it has not seen. The data will be split with a ratio of 80/20 (80% of the data for training, 20% for testing).

RNGkind(sample.kind = "Rounding") 
set.seed(417) 
index <- sample(nrow(titanic), nrow(titanic)*0.8)
train <- titanic[index, ]
test <- titanic[-index, ]
# Check proportion of titanic data train
prop.table(table(train$Survived))

       No       Yes 
0.6179775 0.3820225 

Logistic Regression

This is the equation of a logistic regression model:

\[log(\frac{p(x)}{1-p(x)}) = \beta_0 + \beta_1(x)\]

The left-hand side is called the log-odds or logit. On the right-hand side, \(\beta_0\) is the model intercept and \(\beta_1\) is the coefficient of feature x.
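
Solving for \(p(x)\) gives the inverse (logistic) form, which is what predict(..., type = "response") returns later:

\[p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}\]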

titanic_glm <- glm(Survived ~.,family=binomial(link='logit'),data=train)

summary(titanic_glm)

Call:
glm(formula = Survived ~ ., family = binomial(link = "logit"), 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5716  -0.5690  -0.4233   0.6464   2.3842  

Coefficients: (3 not defined because of singularities)
             Estimate Std. Error z value             Pr(>|z|)    
(Intercept) -1.247653   0.307499  -4.057       0.000049617360 ***
Pclass_1     2.140342   0.331617   6.454       0.000000000109 ***
Pclass_2     1.004680   0.262437   3.828             0.000129 ***
Pclass_3           NA         NA      NA                   NA    
Sex_female   2.715390   0.223782  12.134 < 0.0000000000000002 ***
Sex_male           NA         NA      NA                   NA    
Age         -0.034517   0.008877  -3.888             0.000101 ***
SibSp       -0.385369   0.128480  -2.999             0.002705 ** 
Parch       -0.118518   0.137710  -0.861             0.389440    
Fare         0.002327   0.002626   0.886             0.375602    
Embarked_C   0.429872   0.256911   1.673             0.094281 .  
Embarked_Q   0.361507   0.374045   0.966             0.333804    
Embarked_S         NA         NA      NA                   NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 947.02  on 711  degrees of freedom
Residual deviance: 631.32  on 702  degrees of freedom
AIC: 651.32

Number of Fisher Scoring iterations: 5

From the output, 5 variables are significant based on their p-values (< 0.05): Pclass_1, Pclass_2, Sex_female, Age, and SibSp. Let’s interpret these variables with respect to the target. Before interpreting, we first need to convert the log-odds into odds ratios, because log-odds values cannot be interpreted directly.

paste("Pclass_1:", round(exp(2.112772), 4))
[1] "Pclass_1: 8.2711"
paste("Pclass_2:", round(exp(1.203143), 4))
[1] "Pclass_2: 3.3306"
paste("Sex_female:", round(exp(2.833890), 4))
[1] "Sex_female: 17.0115"
paste("Age:", round(exp(-0.033508), 4))
[1] "Age: 0.967"
paste("SibSp:", round(exp(-0.474637), 4))
[1] "SibSp: 0.6221"
  • The odds of survival for a Pclass_1 passenger are about 8.50 times those of a Pclass_3 passenger
  • The odds of survival for a Pclass_2 passenger are about 2.73 times those of a Pclass_3 passenger
  • The odds of survival for a female passenger are about 15.11 times those of a male passenger
  • Each additional year of age multiplies the odds of survival by about 0.97, i.e. older passengers were slightly less likely to survive
  • Each additional sibling/spouse aboard multiplies the odds of survival by about 0.68
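
Rather than hard-coding the estimates, the same odds ratios can be pulled directly from the fitted model in one call (aliased dummy columns such as Pclass_3 will show NA):

# exponentiate all model coefficients at once to get odds ratios
round(exp(coef(titanic_glm)), 4)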

In the first model, several predictor variables are not significant for the target variable, so we try automatic feature selection via stepwise regression with backward elimination. Starting from all features, backward elimination iteratively discards predictors and re-evaluates the model until it finds the one with the lowest Akaike Information Criterion (AIC). Given a collection of models for the data, AIC estimates the quality of each model relative to the others in terms of information loss; a lower AIC means a better model. We’ll use the stepAIC() function from MASS to apply backward elimination.

library(MASS)
titanic_glm2 <- stepAIC(titanic_glm, direction = "backward", trace = F)
summary(titanic_glm2)

Call:
glm(formula = Survived ~ Pclass_1 + Pclass_2 + Sex_female + Age + 
    SibSp + Embarked_C, family = binomial(link = "logit"), data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5723  -0.5669  -0.4245   0.6493   2.4844  

Coefficients:
             Estimate Std. Error z value             Pr(>|z|)    
(Intercept) -1.200849   0.301000  -3.990 0.000066204794888496 ***
Pclass_1     2.241372   0.279150   8.029 0.000000000000000981 ***
Pclass_2     0.965744   0.252964   3.818             0.000135 ***
Sex_female   2.708151   0.217295  12.463 < 0.0000000000000002 ***
Age         -0.034134   0.008836  -3.863             0.000112 ***
SibSp       -0.409213   0.121698  -3.363             0.000772 ***
Embarked_C   0.413600   0.251434   1.645             0.099976 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 947.02  on 711  degrees of freedom
Residual deviance: 633.85  on 705  degrees of freedom
AIC: 647.85

Number of Fisher Scoring iterations: 5

As with the first model, 5 variables are significant based on their p-values (< 0.05): Pclass_1, Pclass_2, Sex_female, Age, and SibSp. The log-odds values are quite similar to the first model; let’s convert them to odds ratios and interpret.

paste("Pclass_1:", round(exp(2.241372), 4))
[1] "Pclass_1: 9.4062"
paste("Pclass_2:", round(exp(0.965744), 4))
[1] "Pclass_2: 2.6267"
paste("Sex_female:", round(exp(2.708151), 4))
[1] "Sex_female: 15.0015"
paste("Age:", round(exp(-0.034134), 4))
[1] "Age: 0.9664"
paste("SibSp:", round(exp(-0.409213), 4))
[1] "SibSp: 0.6642"
  • The odds of survival for a Pclass_1 passenger are about 9.41 times those of a Pclass_3 passenger
  • The odds of survival for a Pclass_2 passenger are about 2.63 times those of a Pclass_3 passenger
  • The odds of survival for a female passenger are about 15.00 times those of a male passenger
  • Each additional year of age multiplies the odds of survival by about 0.97
  • Each additional sibling/spouse aboard multiplies the odds of survival by about 0.66

K-NN

K-Nearest Neighbor is a supervised learning method that classifies a new instance based on the majority category among its k nearest neighbors in the training data. The working principle of K-NN is to find the k training observations closest to the instance being evaluated, where k is the number of nearest neighbors. Closeness between points is usually measured with the Euclidean distance, the straight-line distance between two points in the feature space, shown below.
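
For two points \(p\) and \(q\) in an \(n\)-dimensional feature space, the Euclidean distance is:

\[d(p, q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}\]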

Let’s classify using K-NN; in the end I would like to compare the two algorithms and see which one predicts better.

In the k-Nearest Neighbor algorithm, we need one additional data pre-processing stage: for both the train and the test data, keep only the numeric predictors (omitting the categorical Survived variable) and separate the predictors from the target variable.

Keep in mind that the distance measurement in K-NN is highly dependent on the scale of the predictor variables fed into the model. A predictor with a very different range of values from the others can dominate the distance and cause problems in the classification model. Therefore, we have to normalize the data to equalize the scale of each predictor variable so that all have a standard range of values.
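
Concretely, the scale() call used below performs z-score standardization of each predictor \(x\), using the training-set mean \(\bar{x}\) and standard deviation \(s\):

\[x' = \frac{x - \bar{x}}{s}\]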

# check range of the data
summary(train)
 Survived     Pclass_1         Pclass_2         Pclass_3        Sex_female    
 No :440   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 Yes:272   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
           Median :0.0000   Median :0.0000   Median :1.0000   Median :0.0000  
           Mean   :0.2346   Mean   :0.2037   Mean   :0.5618   Mean   :0.3539  
           3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
           Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
    Sex_male           Age            SibSp            Parch       
 Min.   :0.0000   Min.   : 0.42   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:22.00   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :1.0000   Median :29.88   Median :0.0000   Median :0.0000  
 Mean   :0.6461   Mean   :29.77   Mean   :0.5211   Mean   :0.3778  
 3rd Qu.:1.0000   3rd Qu.:35.00   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :1.0000   Max.   :80.00   Max.   :8.0000   Max.   :6.0000  
      Fare           Embarked_C       Embarked_Q        Embarked_S    
 Min.   :  0.000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:  7.896   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
 Median : 14.454   Median :0.0000   Median :0.00000   Median :1.0000  
 Mean   : 32.669   Mean   :0.2079   Mean   :0.08848   Mean   :0.7037  
 3rd Qu.: 30.696   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.0000  
 Max.   :512.329   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  

The ranges of the variables differ, so we need to scale the data, as below:

# omit categorical variables on predictors & normalize data using scale function in train data
train_x_scaled <- train %>% 
  select_if(is.numeric) %>% 
  scale()

# omit categorical variables on predictors & normalize data using scale function in test data
test_x_scaled <- test %>% 
  select_if(is.numeric) %>% 
  scale(center = attr(train_x_scaled, "scaled:center"),
        scale = attr(train_x_scaled, "scaled:scale"))
# separate target variable for train data
train_y <- train$Survived

# separate target variable for test data
test_y <- test$Survived
# find a candidate k value using the square root of the number of rows
k <- round(sqrt(nrow(train_x_scaled)),2)
k
[1] 26.68

To avoid ties in the majority vote, k should be odd when the number of target classes is even (and vice versa). The square root comes out to about 26.68; taking k = 26 would give an even value, so I will use k = 25 for the K-NN prediction, since our target has two (an even number of) classes.

# create the K-NN model
library(class)
titanic_knn <- knn(train = train_x_scaled, test = test_x_scaled, cl = train_y, k=25)
rmarkdown::paged_table(head(as.data.frame(titanic_knn), 10))

We’ve successfully classified the test data using 25 neighbors.
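
As an optional diagnostic (an addition of mine, not part of the original workflow), we could scan several odd k values and compare test accuracy. Note that tuning k against the test set leaks information; a proper tuning would use cross-validation on the training data.

# diagnostic sketch: test accuracy across a range of odd k values
ks <- seq(3, 35, by = 2)
acc <- sapply(ks, function(k) {
  pred <- class::knn(train = train_x_scaled, test = test_x_scaled,
                     cl = train_y, k = k)
  mean(pred == test_y)
})
data.frame(k = ks, accuracy = round(acc, 4))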

Evaluation

Evaluation of the model will be done with confusion matrix. Confusion matrix is a table that shows four different category: True Positive, True Negative, False Positive, and False Negative.

The performance metrics will be Accuracy, Sensitivity/Recall, Specificity, and Precision (Saito and Rehmsmeier, 2015). Accuracy measures how much of our data is correctly predicted. Sensitivity measures how many of all actual positive outcomes are correctly predicted. Specificity measures how many actual negative outcomes are correctly predicted. Precision measures how many of our positive predictions are correct.

\[Accuracy = \frac{TP + TN}{TP + TN + FP + FN }\]

\[Sensitivity = \frac{TP}{TP + FN }\]

\[Specificity = \frac{TN}{TN + FP }\]

\[Precision = \frac{TP}{TP + FP }\]
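
To connect these formulas with the confusionMatrix() output below, here is a minimal sketch that computes the four metrics by hand from the counts of a 2x2 table (metrics_from_counts is a hypothetical helper name); the example counts are taken from the first logistic model’s confusion matrix shown below.

# compute the four metrics from raw confusion-matrix counts
metrics_from_counts <- function(TP, TN, FP, FN) {
  c(Accuracy    = (TP + TN) / (TP + TN + FP + FN),
    Sensitivity = TP / (TP + FN),
    Specificity = TN / (TN + FP),
    Precision   = TP / (TP + FP))
}
# counts from the first logistic model below: TP = 52, TN = 98, FP = 11, FN = 18
round(metrics_from_counts(TP = 52, TN = 98, FP = 11, FN = 18), 4)
# Accuracy 0.8380, Sensitivity 0.7429, Specificity 0.8991, Precision 0.8254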

Logistic Regression

Let’s make predictions with our logistic regression models. In the previous section I created two models: the first uses all predictors, and the second uses stepwise regression with backward elimination. I will compare both. Logistic regression returns a value in the range [0, 1], not a binary class; the value is an estimate of the probability that the observation belongs to the positive class (Yes).

# predict of model 1
pred_logit1 <- predict(titanic_glm, test, type = "response")

rmarkdown::paged_table(head(as.data.frame(pred_logit1),10))

Because the prediction results of the logistic regression model are probabilities, we must convert these values into our target class using a threshold. Any value above the threshold is classified as the positive class. By default, the threshold is 0.5.

# determine the class based on the threshold 0.5
pred_class1 <- as.factor(if_else(pred_logit1 > 0.5, "Yes", "No"))

# confusion matrix
library(caret)
log_conf1 <- confusionMatrix(data = pred_class1, reference = test$Survived, positive = "Yes") 
log_conf1
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  98  18
       Yes 11  52
                                          
               Accuracy : 0.838           
                 95% CI : (0.7757, 0.8887)
    No Information Rate : 0.6089          
    P-Value [Acc > NIR] : 0.00000000002273
                                          
                  Kappa : 0.6536          
                                          
 Mcnemar's Test P-Value : 0.2652          
                                          
            Sensitivity : 0.7429          
            Specificity : 0.8991          
         Pos Pred Value : 0.8254          
         Neg Pred Value : 0.8448          
             Prevalence : 0.3911          
         Detection Rate : 0.2905          
   Detection Prevalence : 0.3520          
      Balanced Accuracy : 0.8210          
                                          
       'Positive' Class : Yes             
                                          

Let’s check model 2:

# predict of model 2
pred_logit2 <- predict(titanic_glm2, test, type = "response")

rmarkdown::paged_table(head(as.data.frame(pred_logit2),10))
# determine the class based on the threshold 0.5
pred_class2 <- as.factor(if_else(pred_logit2 > 0.5, "Yes", "No"))

# confusion matrix
library(caret)
log_conf2 <- confusionMatrix(data = pred_class2, reference = test$Survived, positive = "Yes") 
log_conf2
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  92  19
       Yes 17  51
                                         
               Accuracy : 0.7989         
                 95% CI : (0.7326, 0.855)
    No Information Rate : 0.6089         
    P-Value [Acc > NIR] : 0.00000004117  
                                         
                  Kappa : 0.5755         
                                         
 Mcnemar's Test P-Value : 0.8676         
                                         
            Sensitivity : 0.7286         
            Specificity : 0.8440         
         Pos Pred Value : 0.7500         
         Neg Pred Value : 0.8288         
             Prevalence : 0.3911         
         Detection Rate : 0.2849         
   Detection Prevalence : 0.3799         
      Balanced Accuracy : 0.7863         
                                         
       'Positive' Class : Yes            
                                         
# combine the performance data
logit1 <- tibble(Accuracy = round(log_conf1$overall[1], 4),
           Sensitivity = round(log_conf1$byClass[1], 4),
           Specificity = round(log_conf1$byClass[2], 4),
           Precision = round(log_conf1$byClass[3], 4))

logit2 <- tibble(Accuracy = round(log_conf2$overall[1], 4),
           Sensitivity = round(log_conf2$byClass[1], 4),
           Specificity = round(log_conf2$byClass[2], 4),
           Precision = round(log_conf2$byClass[3], 4))

performance <- bind_rows(logit1, logit2)
Accuracy Sensitivity Specificity Precision
  0.8380      0.7429      0.8991    0.8254
  0.7989      0.7286      0.8440    0.7500

The results show that the first model performs better than the second: the first logistic regression model has an accuracy of 83.8% on the test dataset, meaning 83.8% of the test data is correctly classified. The sensitivity and specificity are 74.29% and 89.91%, indicating that most positive outcomes are correctly classified. The precision is 82.54%, meaning 82.54% of our positive predictions are correct.

K-NN

# create confusion matrix from KNN predict
knn_conf <- confusionMatrix(data = titanic_knn, reference = test_y, positive = "Yes")
knn_conf
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  96  28
       Yes 13  42
                                          
               Accuracy : 0.7709          
                 95% CI : (0.7024, 0.8303)
    No Information Rate : 0.6089          
    P-Value [Acc > NIR] : 0.000003124     
                                          
                  Kappa : 0.4999          
                                          
 Mcnemar's Test P-Value : 0.02878         
                                          
            Sensitivity : 0.6000          
            Specificity : 0.8807          
         Pos Pred Value : 0.7636          
         Neg Pred Value : 0.7742          
             Prevalence : 0.3911          
         Detection Rate : 0.2346          
   Detection Prevalence : 0.3073          
      Balanced Accuracy : 0.7404          
                                          
       'Positive' Class : Yes             
                                          

The result shows that our K-NN with k = 25 has an accuracy of 77.09% on the test dataset, meaning 77.09% of the test data is correctly classified. The sensitivity and specificity are 60.00% and 88.07%, indicating that the sensitivity for predicting the positive class is quite low. The precision is 76.36%, meaning 76.36% of our positive predictions are correct.

Model Improvement

We want to improve the performance of our models. Here I try making the classes more balanced. There are many methods to handle imbalanced data; a simple technique to reduce its negative impact is subsampling the data. The common subsampling methods used in practice are listed below, and they should be applied only on the training set.

  • Upsampling: this method increases the size of the minority class by sampling with replacement so that the classes have the same size (see the sketch after this list).

  • Downsampling: in contrast to the above method, this one decreases the size of the majority class to the same or a similar size as the minority class by taking a random sample.

  • Hybrid methods: the well-known hybrid methods are ROSE (Random Over-Sampling Examples) and SMOTE (Synthetic Minority Over-sampling Technique); they downsample the majority class and create new artificial points in the minority class.
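
For reference, caret also ships an upSample() helper with the same interface as the downSample() call used below; upsampling the training data would look like this sketch (not applied here).

# upsampling alternative (not used in this analysis)
# train_up <- upSample(x = train %>% dplyr::select(-Survived),
#                      y = train$Survived,
#                      yname = "Survived")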

Here I use downsampling to balance the titanic training data:

# resampling data train using downsampling
library(caret)
RNGkind(sample.kind = "Rounding")
set.seed(417)
train_down <- downSample(x = train %>% dplyr::select(-Survived),
                              y = train$Survived, 
                              yname = "Survived")

# check class proportion
prop.table(table(train_down$Survived))

 No Yes 
0.5 0.5 

The class proportions are now balanced, so next I will use train_down to fit the models again. I will use all predictors, because in the first round the evaluation result was better with all predictors.

Logistic Regression

# Fitting the model
titanic_down <- glm(Survived ~.,family=binomial(link='logit'),data=train_down)

summary(titanic_down)

Call:
glm(formula = Survived ~ ., family = binomial(link = "logit"), 
    data = train_down)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4763  -0.6704  -0.0043   0.6920   2.1682  

Coefficients: (3 not defined because of singularities)
             Estimate Std. Error z value             Pr(>|z|)    
(Intercept) -0.772742   0.343486  -2.250             0.024468 *  
Pclass_1     1.964450   0.377692   5.201          0.000000198 ***
Pclass_2     0.990516   0.292520   3.386             0.000709 ***
Pclass_3           NA         NA      NA                   NA    
Sex_female   2.656155   0.256650  10.349 < 0.0000000000000002 ***
Sex_male           NA         NA      NA                   NA    
Age         -0.033602   0.010259  -3.276             0.001055 ** 
SibSp       -0.435804   0.148191  -2.941             0.003273 ** 
Parch       -0.101009   0.156637  -0.645             0.519015    
Fare         0.004278   0.003883   1.102             0.270543    
Embarked_C   0.472748   0.285512   1.656             0.097764 .  
Embarked_Q   0.361876   0.424357   0.853             0.393791    
Embarked_S         NA         NA      NA                   NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 754.14  on 543  degrees of freedom
Residual deviance: 504.74  on 534  degrees of freedom
AIC: 524.74

Number of Fisher Scoring iterations: 5
# Predict the models
pred_down <- predict(titanic_down, test, type = "response")

# determine the class based on the threshold 0.5
pred_class_down <- as.factor(if_else(pred_down > 0.5, "Yes", "No"))

# Confusion Matrix
perf_down <- confusionMatrix(data = pred_class_down, reference = test$Survived, positive = "Yes")
perf_down
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  90  16
       Yes 19  54
                                          
               Accuracy : 0.8045          
                 95% CI : (0.7387, 0.8599)
    No Information Rate : 0.6089          
    P-Value [Acc > NIR] : 0.0000000157    
                                          
                  Kappa : 0.5926          
                                          
 Mcnemar's Test P-Value : 0.7353          
                                          
            Sensitivity : 0.7714          
            Specificity : 0.8257          
         Pos Pred Value : 0.7397          
         Neg Pred Value : 0.8491          
             Prevalence : 0.3911          
         Detection Rate : 0.3017          
   Detection Prevalence : 0.4078          
      Balanced Accuracy : 0.7986          
                                          
       'Positive' Class : Yes             
                                          

K-NN

# omit categorical variables on predictors & normalize data using scale function in train
train_x_scaled_dwn <- train_down %>% 
  select_if(is.numeric) %>% 
  scale()

# omit categorical variables on predictors & normalize data using scale function in test data
test_x_scaled_dwn <- test %>% 
  select_if(is.numeric) %>% 
  scale(center = attr(train_x_scaled_dwn, "scaled:center"),
        scale = attr(train_x_scaled_dwn, "scaled:scale"))
# separate target variable for train data
train_y_dwn <- train_down$Survived
# find a candidate k value using the square root of the number of rows
k_down <- round(sqrt(nrow(train_x_scaled_dwn)),2)
k_down
[1] 23.32
# create the K-NN model
titanic_knn_dwn <- knn(train = train_x_scaled_dwn, test = test_x_scaled_dwn, cl = train_y_dwn, k=23)
# create confusion matrix from KNN predict
perf_knn_dwn <- confusionMatrix(data = titanic_knn_dwn, reference = test_y, positive = "Yes")
perf_knn_dwn
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  85  16
       Yes 24  54
                                          
               Accuracy : 0.7765          
                 95% CI : (0.7084, 0.8353)
    No Information Rate : 0.6089          
    P-Value [Acc > NIR] : 0.0000014       
                                          
                  Kappa : 0.5402          
                                          
 Mcnemar's Test P-Value : 0.2684          
                                          
            Sensitivity : 0.7714          
            Specificity : 0.7798          
         Pos Pred Value : 0.6923          
         Neg Pred Value : 0.8416          
             Prevalence : 0.3911          
         Detection Rate : 0.3017          
   Detection Prevalence : 0.4358          
      Balanced Accuracy : 0.7756          
                                          
       'Positive' Class : Yes             
                                          

Conclusion

Here is the comparison between the initial logistic regression and K-NN models and the improved models that use downsampling to balance the classes.

Model          Accuracy Sensitivity Specificity Precision
logit_initial    0.8380      0.7429      0.8991    0.8254
knn_initial      0.7709      0.6000      0.8807    0.7636
logit_improve    0.8045      0.7714      0.8257    0.7397
knn_improve      0.7765      0.7714      0.7798    0.6923

The metric I use to choose the best model for predicting who survived the Titanic accident is the accuracy rate, because it tells us how accurately the model predicts the target class. The best accuracy is about 83.80% (logit_initial), from the logistic regression model that used all predictors. When we improve the logistic model with the downsampling method, the accuracy actually decreases to 80.45% (logit_improve). The opposite holds for K-NN: the initial model’s accuracy of 77.09% (at k = 25) is lower than the improved model’s; after downsampling the training data and rescaling, accuracy increases slightly to 77.65% (at k = 23).

I conclude that, so far, the initial logistic regression model is the best with 83.80% accuracy, and if we want to use a K-NN model, the improved K-NN with 77.65% accuracy is preferable. This does not tell us whether the model specification is optimal; we could try other algorithms to obtain better performance, and perhaps also compare recall/sensitivity values rather than accuracy alone.