Introduction

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the deaths of 1502 of the 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Objective

In this project, we will build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data from Kaggle. The algorithms we will use are logistic regression and k-nearest neighbours (K-NN), both of which are supervised learning methods.

Library and Setup

library(dplyr) #data manipulation
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr) #replace_na()
library(MASS) #stepAIC()
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
library(caret) #confusionMatrix() and other modelling helpers
#> Loading required package: ggplot2
#> Loading required package: lattice

Logistic Regression

Data Preparation

Data Import

titanic <- read.csv('data_input/train.csv')

titanic

Data Exploration

  1. Check Data Type
glimpse(titanic)
#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
#> $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
#> $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
#> $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
#> $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal~
#> $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
#> $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
#> $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
#> $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
#> $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
#> $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
#> $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
  2. Data Summary
summary(titanic)
#>   PassengerId       Survived          Pclass          Name          
#>  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
#>  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
#>  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
#>  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
#>  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
#>  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
#>                                                                     
#>      Sex                 Age            SibSp           Parch       
#>  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
#>  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
#>  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
#>                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
#>                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
#>                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
#>                     NA's   :177                                     
#>     Ticket               Fare           Cabin             Embarked        
#>  Length:891         Min.   :  0.00   Length:891         Length:891        
#>  Class :character   1st Qu.:  7.91   Class :character   Class :character  
#>  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
#>                     Mean   : 32.20                                        
#>                     3rd Qu.: 31.00                                        
#>                     Max.   :512.33                                        
#> 

For Parch, every value from the minimum up to the third quartile is 0, so the column is not very informative (it may be dropped later in modeling).

  3. Missing Value Check
colSums(is.na(titanic))
#> PassengerId    Survived      Pclass        Name         Sex         Age 
#>           0           0           0           0           0         177 
#>       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
#>           0           0           0           0           0           0

The Age column has 177 missing values; we will fill them with the median value of Age.
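
As a quick check, the value that will be used for the imputation can be computed directly; from the summary above it comes out to 28:

median(titanic$Age, na.rm = TRUE) #imputation value for missing Age
#> [1] 28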

  4. Data Overview

Glossary for the titanic data:

  1. PassengerId : Unique ID number for each Titanic passenger (1 to 891)
  2. Survived : Survival (0 = No, 1 = Yes)
  3. Pclass : Ticket Class (1 = Upper, 2 = Middle, 3 = Low)
  4. Name : Name of the passenger
  5. Sex : Female or Male
  6. Age : Age in years (maximum 80)
  7. SibSp : Number of siblings / spouses aboard the Titanic (maximum 8)
  8. Parch : Number of parents / children aboard the Titanic (maximum 6)
  9. Ticket : Ticket number
  10. Fare : Passenger fare (maximum 512.33)
  11. Cabin : Cabin number
  12. Embarked : Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Data summary:

  - The data has 891 rows and 12 columns.
  - PassengerId, Ticket, and Name can be ignored: they are identifiers, not categories or numbers that could inform the model.
  - The target variable is Survived.
  - Survived, Pclass, and Sex can be converted to factors.

Data Manipulation

  1. Data Manipulation
titanic <- titanic %>% 
  mutate(Embarked_S = ifelse(Embarked == "S", 1, 0), #create column Embarked_S
         Embarked_C = ifelse(Embarked == "C", 1, 0), #create column Embarked_C
         Embarked_Q = ifelse(Embarked == "Q", 1, 0), #create column Embarked_Q
         Sex = ifelse(Sex == "female", 0, 1), #encode Sex (female = 0, male = 1)
         Age = replace_na(Age, replace = round(median(Age, na.rm = T))), #fill missing values with the median Age
         Age = as.integer(Age)) %>% 
  dplyr::select(-c(PassengerId, Ticket, Cabin, Name, Embarked)) %>% 
  mutate_at(c('Survived', 'Pclass', 'Sex', 'Embarked_S', 'Embarked_C', 'Embarked_Q'), as.factor)
glimpse(titanic)
#> Rows: 891
#> Columns: 10
#> $ Survived   <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,~
#> $ Pclass     <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3,~
#> $ Sex        <fct> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,~
#> $ Age        <int> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14, 5~
#> $ SibSp      <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0,~
#> $ Parch      <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0,~
#> $ Fare       <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, ~
#> $ Embarked_S <fct> 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,~
#> $ Embarked_C <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,~
#> $ Embarked_Q <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,~
  2. Missing Value Check
anyNA(titanic)
#> [1] FALSE

Pre-Processing Data

  1. Target Variable Proportion
prop.table(table(titanic$Survived))
#> 
#>         0         1 
#> 0.6161616 0.3838384

The target variable has a fairly balanced proportion (62:38), so no additional pre-processing such as class rebalancing is needed.

  2. Cross Validation (train/test split)
RNGkind(sample.kind = "Rounding")
set.seed(303)
index <- sample(nrow(titanic), nrow(titanic)*0.7) #randomly pick 70% of the rows for training

#Data Splitting
t_train <- titanic[index,]
t_test <- titanic[-index,]
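
Note that this simple random split does not guarantee that the 62:38 class proportion is preserved in both sets. As an alternative (a sketch only; the idx_strat, t_train_s, and t_test_s names are illustrative, not part of the original workflow), caret’s createDataPartition performs a stratified split:

set.seed(303)
idx_strat <- createDataPartition(titanic$Survived, p = 0.7, list = FALSE) #samples within each Survived level
t_train_s <- titanic[idx_strat, ]
t_test_s  <- titanic[-idx_strat, ]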

Modelling

  1. Logistic Regression Model with all predictor variables
model_all <- glm(formula = Survived ~ ., family = "binomial", data = t_train)

summary(model_all)
#> 
#> Call:
#> glm(formula = Survived ~ ., family = "binomial", data = t_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.6390  -0.6078  -0.4148   0.5991   2.3877  
#> 
#> Coefficients:
#>               Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)  15.708014 535.411546   0.029               0.9766    
#> Pclass2      -0.837625   0.359627  -2.329               0.0199 *  
#> Pclass3      -2.021656   0.358867  -5.633         0.0000000177 ***
#> Sex1         -2.755338   0.244294 -11.279 < 0.0000000000000002 ***
#> Age          -0.039604   0.009294  -4.261         0.0000203320 ***
#> SibSp        -0.370450   0.135074  -2.743               0.0061 ** 
#> Parch        -0.195371   0.160192  -1.220               0.2226    
#> Fare          0.003919   0.003151   1.244               0.2136    
#> Embarked_S1 -12.010468 535.411300  -0.022               0.9821    
#> Embarked_C1 -11.445806 535.411328  -0.021               0.9829    
#> Embarked_Q1 -11.598464 535.411409  -0.022               0.9827    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 840.01  on 622  degrees of freedom
#> Residual deviance: 546.78  on 612  degrees of freedom
#> AIC: 568.78
#> 
#> Number of Fisher Scoring iterations: 12

Note the huge standard errors (≈535) on the intercept and the Embarked dummies: the three dummy columns sum to 1 for nearly every row (only two passengers in the full data have a blank Embarked), so they are almost perfectly collinear with the intercept and their individual coefficients cannot be estimated reliably.

  2. Logistic Regression Model with predictors selected using the backward stepwise method
model_back <- stepAIC(model_all, direction = "backward", trace = F)

summary(model_back)
#> 
#> Call:
#> glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked_S, 
#>     family = "binomial", data = t_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.6255  -0.5960  -0.4075   0.6132   2.4064  
#> 
#> Coefficients:
#>              Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)  4.455798   0.501566   8.884 < 0.0000000000000002 ***
#> Pclass2     -1.052548   0.318168  -3.308             0.000939 ***
#> Pclass3     -2.282750   0.291912  -7.820  0.00000000000000528 ***
#> Sex1        -2.714222   0.235121 -11.544 < 0.0000000000000002 ***
#> Age         -0.039171   0.009164  -4.275  0.00001914625862058 ***
#> SibSp       -0.389312   0.127386  -3.056             0.002242 ** 
#> Embarked_S1 -0.573857   0.246259  -2.330             0.019790 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 840.01  on 622  degrees of freedom
#> Residual deviance: 549.94  on 616  degrees of freedom
#> AIC: 563.94
#> 
#> Number of Fisher Scoring iterations: 5
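
Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret; for example, exp(-2.714) ≈ 0.066 for Sex1 means that, holding the other predictors fixed, a male passenger’s odds of survival are about 6.6% of a female passenger’s:

exp(coef(model_back)) #coefficients as multiplicative effects on the odds of survival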

Prediction

t_test$prob_survive <- predict(model_back, type = "response", newdata = t_test)
t_test$pred_survive <- factor(ifelse(t_test$prob_survive > 0.5, 1, 0))
t_test[1:10, c("pred_survive", "Survived")]

At a glance, the predicted classes match the actual data for these rows.
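
The 0.5 cutoff used above is a default, not a requirement. Since the conclusion below prioritizes Recall, a lower cutoff could be explored; a minimal sketch (the 0.4 value and the pred_lower name are illustrative only):

pred_lower <- factor(ifelse(t_test$prob_survive > 0.4, 1, 0)) #flags more passengers as survivors
table(Prediction = pred_lower, Actual = t_test$Survived) #trades precision for recall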

Model Evaluation

log_conf <- confusionMatrix(t_test$pred_survive, t_test$Survived, positive = "1")
log_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 145  24
#>          1  32  67
#>                                           
#>                Accuracy : 0.791           
#>                  95% CI : (0.7374, 0.8381)
#>     No Information Rate : 0.6604          
#>     P-Value [Acc > NIR] : 0.000001877     
#>                                           
#>                   Kappa : 0.5439          
#>                                           
#>  Mcnemar's Test P-Value : 0.3496          
#>                                           
#>             Sensitivity : 0.7363          
#>             Specificity : 0.8192          
#>          Pos Pred Value : 0.6768          
#>          Neg Pred Value : 0.8580          
#>              Prevalence : 0.3396          
#>          Detection Rate : 0.2500          
#>    Detection Prevalence : 0.3694          
#>       Balanced Accuracy : 0.7777          
#>                                           
#>        'Positive' Class : 1               
#> 
eval_logit <- tibble(Accuracy = log_conf$overall[1],
                     Recall = log_conf$byClass[1],
                     Specificity = log_conf$byClass[2],
                     Precision = log_conf$byClass[3]) %>% print()
#> # A tibble: 1 x 4
#>   Accuracy Recall Specificity Precision
#>      <dbl>  <dbl>       <dbl>     <dbl>
#> 1    0.791  0.736       0.819     0.677

K-Nearest Neighbour

  1. First, we convert the factor variables to numeric codes and drop any remaining character variables:
titanic2 <- titanic #duplicate dataset

for (i in names(titanic2)){
  if (is.factor(titanic2[,i])){
    titanic2[,i] = as.numeric(titanic2[,i]) #factor levels become 1, 2, ...
  } else if (is.character(titanic2[,i])){
    titanic2 <- dplyr::select(titanic2, -dplyr::all_of(i)) #drop character columns
  }
}
titanic2 <- titanic2 %>% 
  mutate(Survived = as.factor(Survived))
glimpse(titanic2)
#> Rows: 891
#> Columns: 10
#> $ Survived   <fct> 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2,~
#> $ Pclass     <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3,~
#> $ Sex        <dbl> 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1,~
#> $ Age        <int> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14, 5~
#> $ SibSp      <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0,~
#> $ Parch      <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0,~
#> $ Fare       <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, ~
#> $ Embarked_S <dbl> 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1,~
#> $ Embarked_C <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,~
#> $ Embarked_Q <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,~
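
For reference, the same conversion can be written more compactly with dplyr (a sketch assuming dplyr >= 1.0 for across() and where()):

titanic2 <- titanic %>% 
  dplyr::select(where(~ !is.character(.x))) %>% #drop character columns, if any
  mutate(across(where(is.factor), as.numeric), #factor levels become 1, 2, ...
         Survived = as.factor(Survived)) #keep the target as a factor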
  2. Data Splitting
RNGkind(sample.kind = "Rounding")
set.seed(303)
knn_train <- titanic2[index,]
knn_test <- titanic2[-index,]
  3. Defining the Target and Predictor Variables
#Predictor
knn_train_x <- knn_train %>% select_if(is.numeric)
knn_test_x <- knn_test %>% select_if(is.numeric)

#Target
knn_train_y <- knn_train[,'Survived']
knn_test_y <- knn_test[,'Survived']
  4. Scaling (Data Pre-Processing)
knn_train_xs <- scale(knn_train_x)
knn_test_xs <- scale(knn_test_x, 
                  center = attr(knn_train_xs, "scaled:center"), 
                  scale = attr(knn_train_xs,  "scaled:scale"))
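
Note that the test set is centered and scaled with the training set’s parameters, which avoids information leakage from the test data. An optional sanity check (not part of the original workflow) that the training features are now standardized:

round(colMeans(knn_train_xs), 3) #means should be ~0
apply(knn_train_xs, 2, sd) #standard deviations should be 1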
  5. Generating the optimum k
sqrt(nrow(knn_train_xs))
#> [1] 24.95997
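
A common heuristic is k ≈ √n, rounded to an odd number so that a two-class majority vote cannot tie; here that gives k = 25 (a rule of thumb, not a guarantee of optimality):

k_opt <- round(sqrt(nrow(knn_train_xs))) #24.96 rounds to 25
if (k_opt %% 2 == 0) k_opt <- k_opt + 1 #bump even values to the next odd number
k_opt
#> [1] 25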
  6. Making predictions with K-NN
pred_knn <- class::knn(train = knn_train_xs, test = knn_test_xs, cl = knn_train_y, k = 25)
  7. Creating a confusion matrix from the K-NN predictions
pred_knn_conf <- confusionMatrix(as.factor(pred_knn), as.factor(knn_test_y), positive = "2")
pred_knn_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   1   2
#>          1 155  27
#>          2  22  64
#>                                           
#>                Accuracy : 0.8172          
#>                  95% CI : (0.7656, 0.8616)
#>     No Information Rate : 0.6604          
#>     P-Value [Acc > NIR] : 0.000000009161  
#>                                           
#>                   Kappa : 0.5868          
#>                                           
#>  Mcnemar's Test P-Value : 0.5677          
#>                                           
#>             Sensitivity : 0.7033          
#>             Specificity : 0.8757          
#>          Pos Pred Value : 0.7442          
#>          Neg Pred Value : 0.8516          
#>              Prevalence : 0.3396          
#>          Detection Rate : 0.2388          
#>    Detection Prevalence : 0.3209          
#>       Balanced Accuracy : 0.7895          
#>                                           
#>        'Positive' Class : 2               
#> 
  8. Model Evaluation
eval_knn <- tibble(Accuracy = pred_knn_conf$overall[1],
                   Recall = pred_knn_conf$byClass[1],
                   Specificity = pred_knn_conf$byClass[2],
                   Precision = pred_knn_conf$byClass[3]) %>% print()
#> # A tibble: 1 x 4
#>   Accuracy Recall Specificity Precision
#>      <dbl>  <dbl>       <dbl>     <dbl>
#> 1    0.817  0.703       0.876     0.744

Model Evaluation: Logistic Regression and K-NN

# Model Evaluation Logit
eval_logit
# Model Evaluation K-NN
eval_knn
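
The two evaluation tables can also be stacked into a single table for a side-by-side view (a convenience sketch using dplyr’s bind_rows):

bind_rows(`Logistic Regression` = eval_logit, `K-NN` = eval_knn, .id = "Model")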

Based on the performance of the two methods (Logistic Regression and K-NN), K-NN is better at correctly predicting the actual survivors in terms of Accuracy, Specificity, and Precision. Judged by Recall, however, Logistic Regression is better, with a value of 73.6%.

Conclusion

  1. As a rescue team, we need to identify the passengers who should have survived the Titanic sinking. We therefore focus on the Recall metric, to ensure that no survivors are missed in the rescue process.
  2. Neither model delivers fully satisfactory performance. To improve it, the models can be tuned, for example by rebalancing the target class proportions, changing the value of k, applying more suitable pre-processing, or switching to a different algorithm.