1 Introduction

The data describe the passengers of the Titanic. They contain each passenger's gender, survival status, ticket class, age, number of siblings or spouses aboard, number of parents or children aboard, fare, and port of embarkation. This analysis predicts whether a passenger survived based on those variables, using two models: logistic regression and k-nearest neighbours (KNN). At the end, we compare the performance of the two models.

The steps of the analysis are as follows:

1.1 1. Read Data

titanic <- read.csv("train_and_test3.csv")

The analysis considers the following 9 variables:

  1. Passengerid : ID of passenger (integer)
  2. Age : Passenger Age (numeric)
  3. Fare : Passenger fare (numeric)
  4. Sex : Passenger gender (integer); 0 as male and 1 as female
  5. sibsp : Number of siblings / spouses aboard the Titanic (integer)
  6. Parch : Number of parents / children aboard the Titanic (integer)
  7. Pclass : Ticket class (integer); 1 as first class, 2 as second class, 3 as third class
  8. Embarked : Port of Embarkation (integer); 0 as Cherbourg, 1 as Queenstown, 2 as Southampton
  9. Survived : Survival (integer); 1 as Survived and 0 as Died

1.2 2. Data Preprocessing

# To see data structure
str(titanic)
## 'data.frame':    1309 obs. of  15 variables:
##  $ Passengerid: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age        : num  22 38 26 35 35 28 54 2 27 14 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Sex        : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ sibsp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Embarked   : int  2 0 2 2 2 1 2 2 2 0 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Age1       : Factor w/ 5 levels "0-10","11-20",..: 3 3 3 3 3 3 4 1 3 2 ...
##  $ X          : logi  NA NA NA NA NA NA ...
##  $ X.1        : logi  NA NA NA NA NA NA ...
##  $ X.2        : logi  NA NA NA NA NA NA ...
##  $ X.3        : logi  NA NA NA NA NA NA ...
##  $ X.4        : logi  NA NA NA NA NA NA ...

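Most columns are integer or numeric. The file also contains `Age1`, a factor that groups `Age` into five bands, and five logical columns (`X` to `X.4`) that show only NA values.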

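# Based on the data structure and variation, keep only the predictors used in the analysis:
# `Passengerid`, `Fare`, the raw `Age` (replaced by the binned `Age1`), and the `X` columns are dropped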
library(dplyr)
titanic1 <- titanic %>% 
  select(Sex, sibsp, Parch, Pclass, Embarked, Survived, Age1)

1.2.1 2.1 Data cleaning

# Check for missing values in each column

colSums(is.na(titanic1))
##      Sex    sibsp    Parch   Pclass Embarked Survived     Age1 
##        0        0        0        0        2        0        0

The `Embarked` variable has 2 missing values, so we remove those observations.
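
As a minimal sketch of this step, the toy data frame below (hypothetical values, not the Titanic data) shows how counting NA values per column identifies the affected column and how dropping those rows changes the row count. It uses base R only; `tidyr::drop_na(Embarked)` removes the same rows.

```r
# Toy data frame with hypothetical values; Embarked has two missing entries
toy <- data.frame(
  Sex      = c(0, 1, 1, 0, 1),
  Embarked = c(2, NA, 0, NA, 1)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows where Embarked is observed
toy_clean <- toy[!is.na(toy$Embarked), , drop = FALSE]
print(nrow(toy_clean))
```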

# Drop the rows with missing `Embarked`, recode the integer codes as labels, and convert `Survived` to a factor

library(dplyr)
library(tidyverse)

titanic1 <- titanic1 %>% 
  drop_na(Embarked) %>% 
  mutate(Survived = case_when(Survived == 1 ~ "Yes",
                              Survived == 0 ~ "No"),
         Survived = as.factor(Survived),
         Pclass = case_when(Pclass == 1 ~ "First Class",
                            Pclass == 2 ~ "Second Class",
                            Pclass == 3 ~ "Third Class"),
         Sex = case_when(Sex == 1 ~ "Female",
                         Sex == 0 ~ "Male"),
         Embarked = case_when(Embarked == 0 ~ "Cherbourg",
                              Embarked == 1 ~ "Queenstown",
                              Embarked == 2 ~ "Southampton")
         )
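
The recoding above can also be sketched with base R's `factor()`, which maps integer codes to labels in one call. The vector `pclass_codes` below is made up for illustration.

```r
# Hypothetical integer codes, as stored in the raw Pclass column
pclass_codes <- c(3, 1, 3, 2, 1)

# Map codes to labels; the first level becomes the reference level in glm()
pclass_labelled <- factor(pclass_codes,
                          levels = c(1, 2, 3),
                          labels = c("First Class", "Second Class", "Third Class"))

print(levels(pclass_labelled))
print(table(pclass_labelled))
```

With `case_when()` the recoded columns are character vectors, which `glm()` treats as factors with alphabetically ordered levels, so "First Class" and "Female" are the reference levels in either approach.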

1.3 3. Exploratory Data Analysis

library(ggplot2)

titanic1 %>% 
  ggplot(mapping = aes(x = Sex)) +
  geom_bar(aes(fill = Survived), position = "dodge") +  # Sex is categorical, so compare counts with bars
  theme_minimal()

1.4 4. Cross Validation (Split Data)

set.seed(100)
idx_l <- sample(nrow(titanic1), size=nrow(titanic1)*0.8)
train <- titanic1[idx_l,]
test <- titanic1[-idx_l,]
# Check the class balance of the target in the training data
prop.table(table(train$Survived))
## 
##        No       Yes 
## 0.7320574 0.2679426

The training data are imbalanced (about 73% No and 27% Yes), so we balance the classes with `upSample()`.

library(caret)

train_up <- upSample(x = train %>% 
                       select(-Survived),
                           y = as.factor(train$Survived),
                           yname = "Survived")

prop.table(table(train_up$Survived))
## 
##  No Yes 
## 0.5 0.5

After upsampling, the two survival classes are equally represented in the training data.
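
Upsampling resamples the minority class with replacement until both classes have the same number of rows. The sketch below mimics that idea in base R on a made-up data frame; it is not the `caret::upSample()` implementation, and the names `toy`, `minority`, and `extra_idx` are hypothetical.

```r
set.seed(100)

# Made-up imbalanced data: 8 "No" rows and 3 "Yes" rows
toy <- data.frame(
  x        = 1:11,
  Survived = factor(c(rep("No", 8), rep("Yes", 3)))
)

counts   <- table(toy$Survived)
minority <- names(counts)[which.min(counts)]
n_extra  <- max(counts) - min(counts)

# Draw extra minority rows with replacement and append them
minority_rows <- which(toy$Survived == minority)
extra_idx     <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
toy_up        <- rbind(toy, toy[extra_idx, ])

print(prop.table(table(toy_up$Survived)))
```

Only the training set is upsampled; the test set keeps its original class proportions, so the evaluation reflects the observed distribution.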

1.5 5. Modelling

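We fit a logistic regression on all predictors and apply backward stepwise selection, which removes variables as long as the AIC decreases.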

model_backward_up <- step(glm(Survived~., data = train_up, family="binomial"), direction = "backward")
## Start:  AIC=1684.28
## Survived ~ Sex + sibsp + Parch + Pclass + Embarked + Age1
## 
##            Df Deviance    AIC
## - Embarked  2   1662.1 1682.1
## - Parch     1   1661.9 1683.9
## <none>          1660.3 1684.3
## - sibsp     1   1674.4 1696.4
## - Age1      4   1709.0 1725.0
## - Pclass    2   1737.6 1757.6
## - Sex       1   1919.7 1941.7
## 
## Step:  AIC=1682.07
## Survived ~ Sex + sibsp + Parch + Pclass + Age1
## 
##          Df Deviance    AIC
## - Parch   1   1663.8 1681.8
## <none>        1662.1 1682.1
## - sibsp   1   1676.8 1694.8
## - Age1    4   1711.3 1723.3
## - Pclass  2   1753.2 1769.2
## - Sex     1   1938.0 1956.0
## 
## Step:  AIC=1681.75
## Survived ~ Sex + sibsp + Pclass + Age1
## 
##          Df Deviance    AIC
## <none>        1663.8 1681.8
## - sibsp   1   1681.8 1697.8
## - Age1    4   1711.5 1721.5
## - Pclass  2   1755.3 1769.3
## - Sex     1   1942.0 1958.0
summary(model_backward_up)$call
## glm(formula = Survived ~ Sex + sibsp + Pclass + Age1, family = "binomial", 
##     data = train_up)

Backward elimination removes `Embarked` first and then `Parch`, because dropping each one lowers the AIC (from 1684.28 to 1682.07 to 1681.75). The logistic model used to predict survival on the Titanic is therefore:

glm(formula = Survived ~ Sex + sibsp + Pclass + Age1, family = "binomial", data = train_up)

Port of embarkation (`Embarked`) and the number of parents or children aboard (`Parch`) are not included in this model.
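
For a logistic regression on a binary (0/1) response, the AIC reported by `step()` equals the residual deviance plus twice the number of estimated coefficients. The short check below reproduces the AIC values in the output from the printed deviances; the coefficient counts are derived from the formulas (intercept, one coefficient per numeric predictor, and one per non-reference factor level), and the object names are chosen for illustration.

```r
# Residual deviances printed by step() for the three models it visits
deviance_vals <- c(full = 1660.3, no_embarked = 1662.1, no_embarked_parch = 1663.8)

# Coefficients: intercept + Sex (1) + sibsp (1) + Pclass (2) + Age1 (4) = 9,
# plus Parch (1) and Embarked (2) where they are still in the model
n_coef <- c(full = 12, no_embarked = 10, no_embarked_parch = 9)

# AIC = deviance + 2 * number of coefficients (binary response)
aic_vals <- deviance_vals + 2 * n_coef
print(aic_vals)

# The model with the lowest AIC is the one step() stops at
print(names(which.min(aic_vals)))
```

The results agree with the printed AIC values (1684.3, 1682.1, and 1681.8) up to rounding.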

1.5.1 5.1 Model interpretation

library(gtools)
data.frame("coef" = coef(model_backward_up),
           "Odds_ratio" = exp(coef(model_backward_up)),
           "prob" = inv.logit(coef(model_backward_up)))

Interpretation:

The `prob` column is the inverse logit of each coefficient. A value below 50% corresponds to a negative coefficient (odds ratio below 1), meaning lower odds of survival than the reference level; it equals a survival probability only when the rest of the linear predictor is zero.

  • Class: the values for second-class and third-class passengers are 38% and 23% respectively, so both groups have lower odds of survival than first-class passengers.

  • Sex: the value for male passengers is 14%, so men have much lower odds of survival than women.

  • sibsp (number of siblings or spouses aboard the Titanic): the value is 45%, so each additional sibling or spouse aboard slightly lowers the odds of survival.

  • Age: children (0-10) have higher odds of survival than adults, and adults (21-40) have higher odds than the elderly (61-80).

To sum up, passengers who were in first class, female, young, and travelling with fewer siblings or spouses were more likely to survive.
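
The relationship between a coefficient, its odds ratio, and its inverse logit can be checked with base R's `plogis()` and `qlogis()`. The sketch below starts from the 14% inverse-logit value quoted for male passengers; the baseline probabilities are hypothetical and serve only to show that the resulting survival probability depends on the rest of the linear predictor.

```r
# Inverse-logit value quoted in the text for the male coefficient
prob_male <- 0.14

# Recover the coefficient (log-odds scale) and the odds ratio
coef_male  <- qlogis(prob_male)   # log(0.14 / 0.86)
odds_ratio <- exp(coef_male)      # 0.14 / 0.86

# Hypothetical baseline survival probabilities for an otherwise identical female passenger
baseline_prob <- c(0.5, 0.7, 0.9)

# Survival probability for the male counterpart: shift the baseline log-odds by the coefficient
male_prob <- plogis(qlogis(baseline_prob) + coef_male)

print(round(odds_ratio, 3))
print(round(male_prob, 3))
```

Only when the baseline probability is 50% does the male probability equal the 14% shown in the `prob` column; for any other baseline the odds ratio is the quantity that stays constant.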

1.6 6. Predict

# Predict survival probabilities for the test data
test$probability_up <- predict(model_backward_up, newdata = test,
                               type = "response")
test1 <- test

# Keep `test` free of the prediction column so it can be reused for KNN
test <- test %>%
  select(-probability_up)
# Compare the predicted probabilities with the actual outcomes
test1 %>%
  select(probability_up, Survived)
# Convert the probabilities to class labels using a 0.5 threshold

test1$prediction_up <- ifelse(test1$probability_up > 0.5, "Yes", "No")
test1 %>%
  select(prediction_up, Survived)

1.7 7. Evaluation

table("prediction" = test1$prediction_up, 
      "actual" = test1$Survived)
##           actual
## prediction  No Yes
##        No  139  16
##        Yes  63  44
library(caret)


confusionMatrix(data = as.factor(test1$prediction_up),
                reference = as.factor(test1$Survived),
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  139  16
##        Yes  63  44
##                                          
##                Accuracy : 0.6985         
##                  95% CI : (0.639, 0.7534)
##     No Information Rate : 0.771          
##     P-Value [Acc > NIR] : 0.9973         
##                                          
##                   Kappa : 0.3305         
##                                          
##  Mcnemar's Test P-Value : 0.0000002274   
##                                          
##             Sensitivity : 0.7333         
##             Specificity : 0.6881         
##          Pos Pred Value : 0.4112         
##          Neg Pred Value : 0.8968         
##              Prevalence : 0.2290         
##          Detection Rate : 0.1679         
##    Detection Prevalence : 0.4084         
##       Balanced Accuracy : 0.7107         
##                                          
##        'Positive' Class : Yes            
## 

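From the confusion matrix above, accuracy is 70%, sensitivity (recall) is 73%, and precision (positive predictive value) is 41%. The accuracy is below the no-information rate of 77%, which is what always predicting "No" would achieve on this test set. We have no preference between sensitivity and precision, so accuracy is the metric used to compare the models.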
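
The metrics can be recomputed by hand from the confusion matrix above, which is a quick way to check the reported values. The object names below are chosen for illustration.

```r
# Counts from the logistic regression confusion matrix (rows = prediction, columns = actual)
cm <- matrix(c(139, 63, 16, 44), nrow = 2,
             dimnames = list(prediction = c("No", "Yes"), actual = c("No", "Yes")))

tp <- cm["Yes", "Yes"]  # predicted Yes, actually Yes
fp <- cm["Yes", "No"]   # predicted Yes, actually No
fn <- cm["No", "Yes"]   # predicted No, actually Yes
tn <- cm["No", "No"]    # predicted No, actually No

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
precision   <- tp / (tp + fp)
# No-information rate: accuracy of always predicting the majority class
nir <- max(colSums(cm)) / sum(cm)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              precision = precision, nir = nir), 4))
```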

1.8 KNN Evaluation

Prepare the predictor and target variables. KNN works on numeric predictors only, so we convert `Sex` to a numeric variable; the remaining non-numeric columns (`Pclass`, `Embarked`, `Age1`) are left out when the numeric columns are selected.

train_up_knn <- train_up %>% 
  mutate(Sex = case_when(Sex == "Female" ~ 1,
                         Sex == "Male" ~ 0))


test_knn <- test %>% 
  mutate(Sex = case_when(Sex == "Female" ~ 1,
                         Sex == "Male" ~ 0))
# Prepare the predictors (X)

train_x <- train_up_knn %>% 
  select_if(is.numeric)

test_x <- test_knn %>% 
  select_if(is.numeric)
# Prepare the target (y)

train_y <- train_up_knn %>% 
  select(Survived)

test_y <- test_knn %>% 
  select(Survived)
  • Z-score standardization
# Standardize the predictors; the test set is scaled with the training means and standard deviations

train_x <- train_x %>%
  scale()

test_x <- test_x %>% 
  scale(center= attr(train_x, "scaled:center"),
        scale = attr(train_x, "scaled:scale"))
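
Scaling the test set with the training means and standard deviations keeps both sets on the same scale and avoids using test-set information. A minimal sketch on made-up matrices (all names and values are hypothetical):

```r
# Made-up training and test predictors
train_toy <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2,
                    dimnames = list(NULL, c("a", "b")))
test_toy  <- matrix(c(2.5, 5, 25, 50), ncol = 2,
                    dimnames = list(NULL, c("a", "b")))

# Standardize the training data and keep its column means and standard deviations
train_scaled <- scale(train_toy)
train_center <- attr(train_scaled, "scaled:center")
train_sd     <- attr(train_scaled, "scaled:scale")

# Apply the training statistics to the test data
test_scaled <- scale(test_toy, center = train_center, scale = train_sd)

print(round(colMeans(train_scaled), 10))
print(test_scaled[1, ])
```

The first test row sits exactly at the training means, so it maps to zero in both columns, while the test columns as a whole are not forced to have mean zero.
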
# Use the square root of the number of training rows as a starting value for k

sqrt(nrow(train_x))
## [1] 39.11521
dim(train_x)
## [1] 1530    3
dim(test_x)
## [1] 262   3
length(train_up$Survived)
## [1] 1530
# sqrt(1530) is about 39.1; with two target classes an odd k avoids tied votes, so we use k = 39

library(class)

knn_prediction <- knn(train = train_x,
    test = test_x,
    cl = train_y$Survived, 
    k= 39)
confusionMatrix(data = as.factor(knn_prediction),
                reference = test_y$Survived,
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  138  16
##        Yes  64  44
##                                          
##                Accuracy : 0.6947         
##                  95% CI : (0.635, 0.7498)
##     No Information Rate : 0.771          
##     P-Value [Acc > NIR] : 0.9983         
##                                          
##                   Kappa : 0.3251         
##                                          
##  Mcnemar's Test P-Value : 0.0000001482   
##                                          
##             Sensitivity : 0.7333         
##             Specificity : 0.6832         
##          Pos Pred Value : 0.4074         
##          Neg Pred Value : 0.8961         
##              Prevalence : 0.2290         
##          Detection Rate : 0.1679         
##    Detection Prevalence : 0.4122         
##       Balanced Accuracy : 0.7083         
##                                          
##        'Positive' Class : Yes            
## 
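
The choice of k = 39 above follows a common rule of thumb: start from the square root of the number of training rows and pick an odd value so that a two-class vote cannot tie. The helper `choose_k()` below is a hypothetical name for a sketch of that rule.

```r
# Hypothetical helper: largest odd integer at or below sqrt(n_train)
choose_k <- function(n_train) {
  k <- floor(sqrt(n_train))
  if (k %% 2 == 0) k <- k - 1  # step down to an odd number
  max(k, 1)
}

k_titanic <- choose_k(1530)  # number of rows in the upsampled training set
print(k_titanic)
```

The rule is a heuristic starting point rather than an optimal choice of k.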

1.9 Comparing the Logistic and KNN Models

Based on the confusion matrices above, logistic regression and KNN reach an accuracy of 69.8% and 69.5% respectively, so the two models perform about the same. Sensitivity is identical for both models at 73.3%, and precision is also nearly the same (41.1% for logistic regression and 40.7% for KNN).
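
The side-by-side comparison can be reproduced from the counts in the two confusion matrices. The function `metrics()` and the object `comparison` are hypothetical names used for this sketch.

```r
# Compute accuracy, sensitivity, and precision from the four confusion-matrix counts
metrics <- function(tn, fn, fp, tp) {
  c(accuracy    = (tp + tn) / (tn + fn + fp + tp),
    sensitivity = tp / (tp + fn),
    precision   = tp / (tp + fp))
}

# Counts read from the two confusion matrices (positive class = "Yes")
comparison <- rbind(
  logistic = metrics(tn = 139, fn = 16, fp = 63, tp = 44),
  knn      = metrics(tn = 138, fn = 16, fp = 64, tp = 44)
)

# Show the metrics as percentages
print(round(100 * comparison, 1))
```

The two models differ by a single test observation, which KNN classifies as "Yes" and logistic regression as "No".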

Therefore, based on this performance, we prefer logistic regression over KNN to predict survival on the Titanic. In addition, logistic regression can use more of the variables, including the categorical ones, as predictors.