Introduction

In this analysis, we want to build a predictive model that predicts whether or not a Titanic passenger survived the sinking of the ship. The model answers questions like “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.). Logistic regression and K-Nearest Neighbor (K-NN) are used as the classification methods. The dataset is split into two groups, a training set and a test set, taken from here.

Data Preparation

Load libraries

# libraries
library(readr)
library(tidyverse)
library(gtools)
library(class)
library(caret)
library(e1071)
library(car)
library(rsample)

Read data train.csv

titanic_train <- read.csv("train.csv")
head(titanic_train)

Description:

  • PassengerId: Row ID in the data set
  • Survived: If a passenger survived or not (0 = No, 1 = Yes)
  • Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Name: Passenger’s name
  • Sex: Passenger’s gender (male or female)
  • Age: Passenger’s age (in years)
  • SibSp: # of siblings / spouses aboard the Titanic
  • Parch: # of parents / children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Check missing values

colSums(is.na(titanic_train))
#> PassengerId    Survived      Pclass        Name         Sex         Age 
#>           0           0           0           0           0         177 
#>       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
#>           0           0           0           0           0           0

Since there are missing values in the Age variable, we will keep only the rows that have a value for Age. This is important because we do not want missing values to interfere with our model later on. We will do this in the data wrangling section.
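
As an alternative (not used in this analysis), the missing ages could be imputed instead of dropped, e.g. with the median age. A minimal sketch, assuming we wanted to keep all 891 rows:

# hypothetical alternative: fill missing Age with the median age instead of dropping the rows
titanic_imputed <- titanic_train %>% 
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age))

colSums(is.na(titanic_imputed)) # Age should now report 0 missing values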

Data wrangling

glimpse(titanic_train) 
#> Rows: 891
#> Columns: 12
#> $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
#> $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
#> $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
#> $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
#> $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal~
#> $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
#> $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
#> $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
#> $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
#> $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
#> $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
#> $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
# data selection and transformation
titanic_train_clean <- titanic_train %>% 
  filter(!is.na(Age), Embarked != "") %>% # drop rows with missing Age or an empty Embarked value
  select(-c("PassengerId", "Name", "Ticket", "Cabin")) %>% # drop columns that are not useful as predictors
  mutate(
    Survived = as.factor(Survived),
    Pclass = as.factor(Pclass),
    Sex = as.factor(Sex),
    Embarked = as.factor(Embarked)
  )

str(titanic_train_clean)
#> 'data.frame':    712 obs. of  8 variables:
#>  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 2 2 2 ...
#>  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 1 3 3 2 3 ...
#>  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 1 1 1 ...
#>  $ Age     : num  22 38 26 35 35 54 2 27 14 4 ...
#>  $ SibSp   : int  1 1 0 1 0 0 3 0 1 1 ...
#>  $ Parch   : int  0 0 0 0 0 0 1 2 0 1 ...
#>  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
#>  $ Embarked: Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 3 3 3 1 3 ...

Check missing values after data wrangling

colSums(is.na(titanic_train_clean))
#> Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
#>        0        0        0        0        0        0        0        0

Exploratory Data Analysis

Data Distribution Analysis

summary(titanic_train_clean)
#>  Survived Pclass      Sex           Age            SibSp           Parch       
#>  0:424    1:184   female:259   Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
#>  1:288    2:173   male  :453   1st Qu.:20.00   1st Qu.:0.000   1st Qu.:0.0000  
#>           3:355                Median :28.00   Median :0.000   Median :0.0000  
#>                                Mean   :29.64   Mean   :0.514   Mean   :0.4326  
#>                                3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:1.0000  
#>                                Max.   :80.00   Max.   :5.000   Max.   :6.0000  
#>       Fare        Embarked
#>  Min.   :  0.00   C:130   
#>  1st Qu.:  8.05   Q: 28   
#>  Median : 15.65   S:554   
#>  Mean   : 34.57           
#>  3rd Qu.: 33.00           
#>  Max.   :512.33

Check Class Imbalance

prop.table(table(titanic_train_clean$Survived))
#> 
#>         0         1 
#> 0.5955056 0.4044944

Based on the proportions above, the target variable classes (Survived) are balanced enough (roughly 60:40), so we do not need additional data pre-processing to balance them.
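
For reference only: had the classes been heavily imbalanced, one common option is to downsample the majority class, e.g. with downSample() from caret. A minimal sketch (not applied to this data):

# hypothetical balancing step (not needed here): downsample the majority class
titanic_balanced <- downSample(x = titanic_train_clean %>% select(-Survived),
                               y = titanic_train_clean$Survived,
                               yname = "Survived")
prop.table(table(titanic_balanced$Survived)) # both classes would now be 50:50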

Cross Validation

When doing prediction, we need to show that the model can handle new data. Thus, we have to evaluate the model on data it has not seen, which means splitting the data into train data and test data. Let’s use 80% of the data as train data and the rest as test data.

RNGkind(sample.kind = "Rounding")
set.seed(417)

index <- sample(nrow(titanic_train_clean), nrow(titanic_train_clean)*0.8)
titanic_train <- titanic_train_clean[index,]
titanic_test <- titanic_train_clean[-index,]
# re-check class imbalance
table(titanic_train$Survived)
#> 
#>   0   1 
#> 343 226

Build Model

titanic_survival <- glm(Survived~., data = titanic_train, family = "binomial")
summary(titanic_survival)
#> 
#> Call:
#> glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.6780  -0.6943  -0.4103   0.6845   2.3938  
#> 
#> Coefficients:
#>              Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)  4.050751   0.583123   6.947     0.00000000000374 ***
#> Pclass2     -1.150354   0.356663  -3.225              0.00126 ** 
#> Pclass3     -2.376974   0.373226  -6.369     0.00000000019061 ***
#> Sexmale     -2.491100   0.244823 -10.175 < 0.0000000000000002 ***
#> Age         -0.037553   0.009033  -4.157     0.00003221446050 ***
#> SibSp       -0.261659   0.141787  -1.845              0.06497 .  
#> Parch       -0.027426   0.131446  -0.209              0.83473    
#> Fare         0.001380   0.002765   0.499              0.61754    
#> EmbarkedQ   -1.019041   0.697578  -1.461              0.14406    
#> EmbarkedS   -0.310492   0.302544  -1.026              0.30476    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 764.57  on 568  degrees of freedom
#> Residual deviance: 522.99  on 559  degrees of freedom
#> AIC: 542.99
#> 
#> Number of Fisher Scoring iterations: 5

💡 Interpretation:

Convert log of odds to odds

exp(titanic_survival$coefficients)
#> (Intercept)     Pclass2     Pclass3     Sexmale         Age       SibSp 
#> 57.44055420  0.31652480  0.09283107  0.08281880  0.96314357  0.76977338 
#>       Parch        Fare   EmbarkedQ   EmbarkedS 
#>  0.97294705  1.00138142  0.36094098  0.73308636

Holding the other variables constant, the odds of survival for a passenger with ticket class 2 are 0.31652480 times the odds for a passenger with ticket class 1.

Holding the other variables constant, the odds of survival for a male passenger are 0.08281880 times the odds for a female passenger.

Holding the other variables constant, the odds of survival for a passenger who embarked at S (Southampton) are 0.73308636 times the odds for a passenger who embarked at C (Cherbourg).

Holding the other variables constant, every 1-year increase in a passenger’s age multiplies the odds of survival by 0.96314357.

Holding the other variables constant, every additional sibling/spouse aboard multiplies the odds of survival by 0.76977338.

Holding the other variables constant, every additional parent/child aboard multiplies the odds of survival by 0.97294705.

Holding the other variables constant, every 1-point increase in the passenger fare multiplies the odds of survival by 1.00138142.

Looking at the Pr(>|z|) values, the significant predictor variables are Pclass, Sex, and Age.
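
As a quick programmatic check, the significant predictors can also be pulled straight from the coefficient table (a small base R sketch):

# list the terms with a p-value below 0.05 in the coefficient table
coefs <- summary(titanic_survival)$coefficients
rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
# should return the intercept plus the Pclass, Sex and Age terms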

Stepwise Regression

Let’s try stepwise regression (backward elimination) to find a better-fitting model.

# logistic regression with step
step(titanic_survival, direction = "backward", trace = 0)
#> 
#> Call:  glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
#>     data = titanic_train)
#> 
#> Coefficients:
#> (Intercept)      Pclass2      Pclass3      Sexmale          Age        SibSp  
#>     3.96186     -1.32407     -2.60295     -2.46487     -0.03837     -0.25703  
#> 
#> Degrees of Freedom: 568 Total (i.e. Null);  563 Residual
#> Null Deviance:       764.6 
#> Residual Deviance: 525.9     AIC: 537.9
# save the model resulting from the stepwise selection
titanic_survival2 <- glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
    data = titanic_train)

summary(titanic_survival2)
#> 
#> Call:
#> glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
#>     data = titanic_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.7034  -0.6870  -0.4194   0.6725   2.4041  
#> 
#> Coefficients:
#>              Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)  3.961858   0.489064   8.101 0.000000000000000546 ***
#> Pclass2     -1.324069   0.311397  -4.252 0.000021184394278967 ***
#> Pclass3     -2.602950   0.315110  -8.260 < 0.0000000000000002 ***
#> Sexmale     -2.464869   0.235042 -10.487 < 0.0000000000000002 ***
#> Age         -0.038373   0.008942  -4.291 0.000017773245301492 ***
#> SibSp       -0.257034   0.132223  -1.944               0.0519 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 764.57  on 568  degrees of freedom
#> Residual deviance: 525.95  on 563  degrees of freedom
#> AIC: 537.95
#> 
#> Number of Fisher Scoring iterations: 5

Predict

With the model from the stepwise selection, we predict on the test data we already have.

# predict probability of survived and save to a new column in test data
titanic_test$pred.Survived <- predict(object = titanic_survival2, newdata = titanic_test, type="response")

Classify the titanic_test data based on pred.Survived (using a 0.5 threshold) and save the result to a new column named pred.Label

titanic_test$pred.Label <- ifelse(titanic_test$pred.Survived>0.5, 1, 0) %>% as.factor()
# see the prediction result
titanic_test %>% select(pred.Survived, pred.Label, Survived)

Let’s look at the probability distribution of the predicted data.

ggplot(titanic_test, aes(x=pred.Survived)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()

💡 Insight: The graph shows that the predicted probabilities are skewed toward 0, which means the model tends to predict 0 (not survived) more often when predicting new data.
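
If we cared more about catching survivors (recall), the 0.5 cutoff could be lowered. A small sketch of how the number of test passengers labelled as survived shifts with the cutoff (the cutoff values here are just illustrative):

# count how many test passengers would be labelled "survived" at different cutoffs
sapply(c(0.3, 0.5, 0.7), function(cutoff) sum(titanic_test$pred.Survived > cutoff))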

Model Evaluation

After making predictions with a model, there is still the possibility of wrong predictions. In this situation, we want to evaluate our model based on the Confusion Matrix.

There are 4 metrics to evaluate a model’s performance (a small helper mirroring these formulas is sketched after the list):

  • Recall/Sensitivity: how well the model predicts the positive class of the target variable correctly
  • Specificity: how well the model predicts the negative class of the target variable correctly
  • Accuracy: how well the model predicts the target variable correctly overall
  • Precision: how precisely the model predicts the positive class of the target variable correctly
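
As a sketch, these formulas can be wrapped in a small helper that mirrors the manual calculations further below (eval_metrics is a hypothetical name, not part of any package):

# hypothetical helper: compute the four metrics from raw confusion matrix counts
eval_metrics <- function(TP, TN, FP, FN) {
  c(Accuracy    = (TP + TN) / (TP + TN + FP + FN),
    Recall      = TP / (TP + FN),
    Specificity = TN / (TN + FP),
    Precision   = TP / (TP + FP))
}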

log_conf <- confusionMatrix(data=titanic_test$pred.Label, reference=titanic_test$Survived, positive="1")
log_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 71 10
#>          1 10 52
#>                                              
#>                Accuracy : 0.8601             
#>                  95% CI : (0.7923, 0.9124)   
#>     No Information Rate : 0.5664             
#>     P-Value [Acc > NIR] : 0.00000000000003934
#>                                              
#>                   Kappa : 0.7153             
#>                                              
#>  Mcnemar's Test P-Value : 1                  
#>                                              
#>             Sensitivity : 0.8387             
#>             Specificity : 0.8765             
#>          Pos Pred Value : 0.8387             
#>          Neg Pred Value : 0.8765             
#>              Prevalence : 0.4336             
#>          Detection Rate : 0.3636             
#>    Detection Prevalence : 0.4336             
#>       Balanced Accuracy : 0.8576             
#>                                              
#>        'Positive' Class : 1                  
#> 

Accuracy: how well the model predicts the target class (positive or negative) overall. It is used when the positive and negative classes are equally important or the class proportions are balanced.

(TP+TN)/TOTAL

(52+71)/(71+10+10+52)
#> [1] 0.8601399

Sensitivity/Recall: how many of the actually positive cases are predicted as positive

(TP/(TP+FN))

52/(52+10)
#> [1] 0.8387097

Pos Pred Value/Precision: how many of the cases predicted as positive are actually positive

(TP/(TP+FP))

52/(52+10)
#> [1] 0.8387097

In this Titanic passenger survival case, we want to avoid False Positives (passengers who actually did not survive, but are predicted as survived). Therefore, we use Precision as the evaluation metric.

Specificity: how many of the actually negative cases are predicted as negative

(TN/(TN+FP))

71/(71+10)
#> [1] 0.8765432

Assumptions

There are 3 assumptions in Logistic Regression:

  • No multicollinearity: the predictor variables should not be strongly correlated with each other
  • Independence of observations: the observations are independent of one another
  • Linearity of predictor and log of odds: each numeric predictor has a linear relationship with the log of odds

vif(titanic_survival2)
#>            GVIF Df GVIF^(1/(2*Df))
#> Pclass 1.430189  2        1.093574
#> Sex    1.105275  1        1.051321
#> Age    1.464801  1        1.210289
#> SibSp  1.138885  1        1.067186

All GVIF values are close to 1 (well below the common cutoff of 10), so the no-multicollinearity assumption is fulfilled. A quick visual check of the linearity of the log of odds assumption is sketched below.
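
To check it visually, the numeric predictors can be plotted against the logit of the fitted probabilities; if the assumption holds, the loess smoother should look roughly linear. A minimal sketch using the model’s training data:

# plot the numeric predictors against the logit of the fitted probabilities
titanic_train %>% 
  mutate(logit = log(fitted(titanic_survival2) / (1 - fitted(titanic_survival2)))) %>% 
  select(Age, SibSp, logit) %>% 
  pivot_longer(cols = c(Age, SibSp), names_to = "predictor") %>% 
  ggplot(aes(x = value, y = logit)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess") +
  facet_wrap(~predictor, scales = "free_x") +
  theme_minimal()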

k-NN (K-Nearest Neighbor)

This method classifies new data by comparing the test data with the train data. How close the characteristics of two observations are is measured with the Euclidean Distance, as illustrated below. The method then selects the k closest neighbors of the new observation and decides its class by majority voting.
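
To illustrate the distance measure: the Euclidean distance between two observations is the square root of the summed squared differences of their (scaled) predictor values. A toy sketch with two made-up passengers:

# toy example: Euclidean distance between two hypothetical scaled passengers
passenger_a <- c(Age = -0.5, SibSp = 1.2, Parch = 0.0, Fare = -0.3)
passenger_b <- c(Age =  0.8, SibSp = 0.0, Parch = 0.0, Fare =  0.4)

sqrt(sum((passenger_a - passenger_b)^2))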

Cross Validation

set.seed(100) # seed for a reproducible train/test split

index <- initial_split(data=titanic_train_clean,  # beginning data before splitting
                       prop = 0.8, # split proportion 80:20
                       strata = Survived) # class label 

titanic_knn_train <- training(index)
titanic_knn_test <- testing(index)
# check proportion
prop.table(table(titanic_knn_train$Survived))
#> 
#>         0         1 
#> 0.5957821 0.4042179
prop.table(table(titanic_knn_test$Survived))
#> 
#>         0         1 
#> 0.5944056 0.4055944

Picking Optimum k

# train data predictor
train_titanic_x <- titanic_knn_train %>% select_if(is.numeric) # select only numeric column for scaling

# train data target
train_titanic_y <- titanic_knn_train %>% select(Survived) # select target class only

# test data predictor
test_titanic_x <- titanic_knn_test %>% select_if(is.numeric)

# test data target
test_titanic_y <-  titanic_knn_test %>% select(Survived)

Scaling for test data and train data

# run this chunk only once (re-running would scale already-scaled data)
train_titanic_x <- scale(train_titanic_x)
test_titanic_x <- scale(test_titanic_x, 
                center=attr(train_titanic_x, "scaled:center"), # mean value of train data
                scale=attr(train_titanic_x, "scaled:scale")) #  std deviation value of train data

Select the k value based on the square root of the number of train data rows

sqrt(nrow(train_titanic_x))
#> [1] 23.85372

Since the square root above is about 23.85 and there are 2 target classes (Survived or Not Survived), we use k = 23 (an odd number, to avoid tied votes) for our model.

# create k-NN model
titanic_knn_pred <- knn(train = train_titanic_x, # train data predictor
    test = test_titanic_x, # test data predictor
    cl = train_titanic_y$Survived, # train data target
    k=23) # no of k used for classification

Model Evaluation

knn_conf <- confusionMatrix(data=titanic_knn_pred, reference = test_titanic_y$Survived, positive="1")
knn_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 74 30
#>          1 11 28
#>                                           
#>                Accuracy : 0.7133          
#>                  95% CI : (0.6318, 0.7858)
#>     No Information Rate : 0.5944          
#>     P-Value [Acc > NIR] : 0.002108        
#>                                           
#>                   Kappa : 0.3727          
#>                                           
#>  Mcnemar's Test P-Value : 0.004937        
#>                                           
#>             Sensitivity : 0.4828          
#>             Specificity : 0.8706          
#>          Pos Pred Value : 0.7179          
#>          Neg Pred Value : 0.7115          
#>              Prevalence : 0.4056          
#>          Detection Rate : 0.1958          
#>    Detection Prevalence : 0.2727          
#>       Balanced Accuracy : 0.6767          
#>                                           
#>        'Positive' Class : 1               
#> 

Logistic Regression vs k-NN

eval_logit <- tibble(Accuracy = log_conf$overall[1],
           Recall = log_conf$byClass[1],
           Specificity = log_conf$byClass[2],
           Precision = log_conf$byClass[3])

eval_knn <- tibble(Accuracy = knn_conf$overall[1],
           Recall = knn_conf$byClass[1],
           Specificity = knn_conf$byClass[2],
           Precision = knn_conf$byClass[3])

# Model Evaluation Logit
eval_logit
# Model Evaluation K-NN
eval_knn

Conclusion

According to the evaluation above, the Logistic Regression model has better precision than the k-NN model (83.87% vs 71.79%). This means that, since we prioritize avoiding False Positives, the Logistic Regression model is the better choice for predicting whether or not a Titanic passenger survived the sinking of the ship.