Intro

What We’ll Do

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

There are many methods for predicting who was more likely to survive. In this case we want to compare the Logistic Regression and K-NN methods for predicting who survived the Titanic shipwreck, and determine which method performs better.

How To Do It

Our goal is to create a model that predicts which passengers survived the Titanic shipwreck. We will use “Survived” as the target and the other columns as predictor variables.

We will analyze the Titanic data and classify survivors based on the dataset we obtained from this link.

We will use Logistic Regression and K-Nearest Neighbor (K-NN) as classification methods to predict survivors, and we will compare the performance of the two methods.

We will use a confusion matrix to evaluate the models and, based on its results, compare their performance.

Data Preparation

Load the required packages

library(dplyr)
library(tidyr)
library(car)
# K-NN modelling
library(class)
# Confusion matrix
library(caret)

As noted, the dataset comes from the link above. It consists of three files: gender_submission.csv contains the survival status and passenger ID, while train.csv and test.csv both contain passenger biodata, already separated into train and test sets.

submission <- read.csv("data/gender_submission.csv")
rmarkdown::paged_table(submission)
train_titanic <- read.csv("data/train.csv")
rmarkdown::paged_table(train_titanic)
test_titanic <- read.csv("data/test.csv")
rmarkdown::paged_table(test_titanic)

Here is the data dictionary:

Although the information at the source states that the dataset has been prepared properly, exploring the data is still a must. After importing the data, we find that some variables have incorrect classes, so we fix them in both the train and test data.

submission <- submission %>% 
  mutate(PassengerId = as.factor(PassengerId))

test_titanic <- test_titanic %>% 
  mutate(PassengerId = as.factor(PassengerId),
         Pclass = as.factor(Pclass),
         Name = as.character(Name),
         Age = as.integer(Age))

train_titanic <- train_titanic %>% 
  mutate(PassengerId = as.factor(PassengerId),
         Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         Name = as.character(Name),
         Age = as.integer(Age))

There is one difference between the test and train data: test does not have the Survived variable. We can find this variable in the submission data, so we join it in by passenger ID.

test_titanic <- inner_join(test_titanic, submission) %>% 
  mutate(Survived = as.factor(Survived))
## Joining, by = "PassengerId"

Checking all the data for NAs, we find that the Age variable has missing values.

colSums(is.na(test_titanic))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0          86           0 
##       Parch      Ticket        Fare       Cabin    Embarked    Survived 
##           0           0           1           0           0           0
nrow(test_titanic)
## [1] 418
colSums(is.na(train_titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0
nrow(train_titanic)
## [1] 891

At first we will handle the missing data by deleting the affected rows. test_titanic is filtered from 418 rows down to 331 (86 rows with a missing age, plus one with a missing fare), and train_titanic from 891 rows down to 714.

# drop rows with a missing Age (na.omit also removes the one missing Fare)
test_titanic <- test_titanic %>% 
  filter(!is.na(Age)) %>% 
  na.omit()

train_titanic <- train_titanic %>% 
  filter(!is.na(Age))
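
Dropping rows is the simplest option. A possible alternative, sketched below but not used in this analysis, is to impute the missing ages with the train-set median so no rows are lost (median_age, train_imputed, and test_imputed are hypothetical names):

# sketch only (not applied here): impute missing Age with the train-set median
median_age <- median(train_titanic$Age, na.rm = TRUE)
train_imputed <- train_titanic %>% 
  mutate(Age = ifelse(is.na(Age), median_age, Age))
test_imputed <- test_titanic %>% 
  mutate(Age = ifelse(is.na(Age), median_age, Age))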

Exploratory Data Analysis

The Survived variable is binary, with “0” and “1” indicating whether a passenger survived. To make it clearer, we relabel “1” as “yes” and “0” as “no”.

test_titanic <- test_titanic %>% 
  mutate(Survived = as.factor(ifelse(Survived == 1, "yes", "no")))

train_titanic <- train_titanic %>% 
  mutate(Survived = as.factor(ifelse(Survived == 1, "yes", "no")))

We check for multicollinearity based on the data dictionary and do not find any obviously related variables.
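
If we also want a numeric check, the car package we loaded provides vif(). A minimal sketch, assuming it is run after logres_model_all is fitted in the Modelling section below:

# sketch: variance inflation factors for the full logistic regression model
# (run after fitting logres_model_all below); GVIF values near 1 suggest
# no problematic multicollinearity
car::vif(logres_model_all)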

glimpse(test_titanic)
## Observations: 331
## Variables: 12
## $ PassengerId <fct> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 903, 90…
## $ Pclass      <fct> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3, 3, …
## $ Name        <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)", "…
## $ Sex         <fct> male, female, male, male, female, male, female, male, fem…
## $ Age         <int> 34, 47, 62, 27, 22, 14, 30, 26, 18, 21, 46, 23, 63, 47, 2…
## $ SibSp       <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 1, 1, 1, 1, 0, 0, 1, 0, …
## $ Parch       <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Ticket      <fct> 330911, 363272, 240276, 315154, 3101298, 7538, 330972, 24…
## $ Fare        <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292, …
## $ Cabin       <fct> , , , , , , , , , , , B45, , E31, , , , , , , , , B57 B59…
## $ Embarked    <fct> Q, S, Q, S, S, S, Q, S, C, S, S, S, S, S, C, Q, C, S, C, …
## $ Survived    <fct> no, yes, no, no, yes, no, yes, no, yes, no, no, yes, no, …

Some of our variables will not be used in modelling because they have no relationship with our target variable.

# drop identifier-like variables: PassengerId, Ticket, Name, and Cabin

test_titanic <- test_titanic %>% 
  select(-c(PassengerId,Ticket,Name,Cabin))

train_titanic <- train_titanic %>% 
  select(-c(PassengerId,Ticket,Name,Cabin))

# confirm that no missing values remain
which(is.na(test_titanic))
## integer(0)
# row index 2778 is out of range after filtering, so an all-NA row is returned
test_titanic[2778,]
##    Pclass  Sex Age SibSp Parch Fare Embarked Survived
## NA   <NA> <NA>  NA    NA    NA   NA     <NA>     <NA>

Before training the models, recall that the source data already comes separated into train and test sets. We should check each set’s share of the total rows. Train has more rows than test: about 68% train and 32% test, so we do not need to re-split the data.

# total rows across both datasets
total_row <- nrow(test_titanic) + nrow(train_titanic)
# test share of the total (%)
round((nrow(test_titanic) / total_row) * 100, 3)
## [1] 31.675
# train share of the total (%)
round((nrow(train_titanic) / total_row) * 100, 3)
## [1] 68.325

Checking the proportion of survival status among passengers in the train data, we find that the Survived variable is reasonably balanced.

prop.table(table(train_titanic$Survived))
## 
##        no       yes 
## 0.5938375 0.4061625
table(train_titanic$Survived)
## 
##  no yes 
## 424 290

Modelling

Logistic Regression

We will train a logistic regression model. At first we use all the variables in the train data frame as predictors, with Survived as the target.

logres_model_all <- glm(formula = Survived ~ . ,data = train_titanic, family = "binomial")
summary(logres_model_all)
## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7217  -0.6450  -0.3767   0.6294   2.4462  
## 
## Coefficients:
##               Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)  16.690199 607.920508   0.027             0.978097    
## Pclass2      -1.190492   0.329224  -3.616             0.000299 ***
## Pclass3      -2.395513   0.343296  -6.978     0.00000000000299 ***
## Sexmale      -2.638768   0.223042 -11.831 < 0.0000000000000002 ***
## Age          -0.043281   0.008310  -5.208     0.00000019039880 ***
## SibSp        -0.362899   0.129307  -2.806             0.005009 ** 
## Parch        -0.060848   0.123959  -0.491             0.623515    
## Fare          0.001454   0.002595   0.560             0.575255    
## EmbarkedC   -12.259056 607.920379  -0.020             0.983911    
## EmbarkedQ   -13.082493 607.920581  -0.022             0.982831    
## EmbarkedS   -12.660510 607.920362  -0.021             0.983385    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 632.29  on 703  degrees of freedom
## AIC: 654.29
## 
## Number of Fisher Scoring iterations: 13

K-NN

K-NN classifies a data point by looking at the classes of its nearest neighbouring data points. Because the distance calculation requires numeric features, we keep only the numeric predictors, separate out the target variable, and scale the predictors.
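
To make the idea concrete, here is a minimal sketch of how one scaled test point would be classified by hand (classify_one is a hypothetical helper used only for illustration; the class::knn() call below does this for every test row at once):

# illustration only: classify a single scaled test point "by hand"
classify_one <- function(point, train_m, train_labels, k) {
  dists <- sqrt(rowSums(sweep(train_m, 2, point)^2)) # Euclidean distance to every train row
  nearest <- train_labels[order(dists)[1:k]]         # labels of the k nearest rows
  names(which.max(table(nearest)))                   # majority vote among those labels
}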

# keep only the numeric predictors (K-NN needs numeric features)
# and separate the target variable
train_x <- train_titanic %>% 
  select(-c(Survived, Pclass, Sex, Embarked))

test_x <- test_titanic %>% 
  select(-c(Survived, Pclass, Sex, Embarked))

train_y <- train_titanic %>% 
  select(c(Survived))

test_y <- test_titanic %>% 
  select(c(Survived))
# scale the train predictors (z-score), then apply the SAME train-set
# center and scale values to the test predictors
train_p <- scale(x = train_x, center = T)
test_p <- scale(x = test_x,
                center = attr(train_p, "scaled:center"),
                scale = attr(train_p, "scaled:scale")
                )
knn_survivor_pred <- knn(train = train_p,
                         test = test_p,
                         cl = train_y$Survived,
                         k = 1)

Evaluation

Evaluation of the models will be done with a confusion matrix. A confusion matrix is a table that shows four different categories: True Positive, True Negative, False Positive, and False Negative.

The performance measures will be Accuracy, Sensitivity/Recall, Specificity, and Precision. Accuracy measures how much of our data is correctly predicted. Sensitivity measures how many of the actual positive outcomes are correctly predicted. Specificity measures how many of the actual negative outcomes are correctly predicted. Precision measures how many of our positive predictions are correct.
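
As a quick illustration, these four metrics come straight from the confusion matrix counts. A minimal sketch using the counts from the logistic regression matrix shown in the next subsection (tp, tn, fp, fn are just local names for the four cells):

# metrics computed directly from confusion matrix counts
# (counts taken from the logistic regression result below)
tp <- 113; tn <- 186; fp <- 18; fn <- 14
(tp + tn) / (tp + tn + fp + fn)  # accuracy:    0.9033
tp / (tp + fn)                   # sensitivity: 0.8898
tn / (tn + fp)                   # specificity: 0.9118
tp / (tp + fp)                   # precision:   0.8626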

Logistic Regression

We have built our logistic regression model using all the predictor variables in the dataset. Next, we use this model to predict on the test dataset.

logres_survivor_pred <- predict(logres_model_all, newdata = test_titanic, type = "response")

rmarkdown::paged_table(head(as.data.frame(logres_survivor_pred), 20))

Based on these results, we convert the probabilities into classes using a threshold (by default we use 0.5). We then use a confusion matrix to compare the model’s predictions against the test data we prepared earlier.

logres_survivor_pred <- as.factor(if_else(logres_survivor_pred > 0.5, "yes", "no"))

confusionMatrix(data = logres_survivor_pred,
                reference = as.factor(test_y$Survived),
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  186  14
##        yes  18 113
##                                              
##                Accuracy : 0.9033             
##                  95% CI : (0.8663, 0.9329)   
##     No Information Rate : 0.6163             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.7968             
##                                              
##  Mcnemar's Test P-Value : 0.5959             
##                                              
##             Sensitivity : 0.8898             
##             Specificity : 0.9118             
##          Pos Pred Value : 0.8626             
##          Neg Pred Value : 0.9300             
##              Prevalence : 0.3837             
##          Detection Rate : 0.3414             
##    Detection Prevalence : 0.3958             
##       Balanced Accuracy : 0.9008             
##                                              
##        'Positive' Class : yes                
## 

The results show that our prediction on the test dataset using the logistic regression model reaches 90.33% accuracy, meaning 90.33% of the predictions are correctly classified. Precision/positive predictive value is around 86.26%, meaning 86.26% of our positive predictions are correct. Sensitivity is 88.98% and specificity is 91.18%, indicating that both actual positives and actual negatives are classified correctly around 86-91% of the time. This model performs well for the prediction we need.

K-NN

confusionMatrix(data = knn_survivor_pred,
                reference = as.factor(test_y$Survived),
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  125  61
##        yes  79  66
##                                           
##                Accuracy : 0.577           
##                  95% CI : (0.5218, 0.6309)
##     No Information Rate : 0.6163          
##     P-Value [Acc > NIR] : 0.9358          
##                                           
##                   Kappa : 0.129           
##                                           
##  Mcnemar's Test P-Value : 0.1508          
##                                           
##             Sensitivity : 0.5197          
##             Specificity : 0.6127          
##          Pos Pred Value : 0.4552          
##          Neg Pred Value : 0.6720          
##              Prevalence : 0.3837          
##          Detection Rate : 0.1994          
##    Detection Prevalence : 0.4381          
##       Balanced Accuracy : 0.5662          
##                                           
##        'Positive' Class : yes             
## 

The results show that our prediction on the test dataset using the K-NN model with K = 1 reaches 57.7% accuracy, meaning 57.7% of the predictions are correctly classified. Precision/positive predictive value is around 45.52%, meaning 45.52% of our positive predictions are correct. Sensitivity is 51.97% and specificity is 61.27%, indicating that positive and negative outcomes are classified correctly only around 50-60% of the time.

Model Improvement

Logistic Regression

The logistic regression confusion matrix we made before already shows good results for predicting survivors. We will still try to improve the model so it fits better; in this case we will use the stepwise method.

Prepare the null (intercept-only) model

logres_model_none <- glm(formula = Survived ~ 1 ,data = train_titanic, family = "binomial")

Backward elimination: starting from the model with all predictors, it removes predictors one at a time. The result suggests the reduced model with the lowest AIC.

step(object = logres_model_all, direction = "backward", trace = 0)
## 
## Call:  glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = train_titanic)
## 
## Coefficients:
## (Intercept)      Pclass2      Pclass3      Sexmale          Age        SibSp  
##     4.33307     -1.41497     -2.65290     -2.62827     -0.04473     -0.38026  
## 
## Degrees of Freedom: 713 Total (i.e. Null);  708 Residual
## Null Deviance:       964.5 
## Residual Deviance: 636.5     AIC: 648.5

Both-direction elimination combines the backward and forward methods to find the model with the lowest AIC.

step(object = logres_model_all, scope = list(lower = logres_model_none, upper = logres_model_all), direction = "both", trace = 0)
## 
## Call:  glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = train_titanic)
## 
## Coefficients:
## (Intercept)      Pclass2      Pclass3      Sexmale          Age        SibSp  
##     4.33307     -1.41497     -2.65290     -2.62827     -0.04473     -0.38026  
## 
## Degrees of Freedom: 713 Total (i.e. Null);  708 Residual
## Null Deviance:       964.5 
## Residual Deviance: 636.5     AIC: 648.5

Both the backward and both-direction elimination methods suggest the same model: glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial"). Let’s use this model to predict.

logres_model_stepwise = glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", data = train_titanic)

logres_survivor_pred_2 <- predict(logres_model_stepwise, newdata = test_titanic, type = "response")

logres_survivor_pred_2 <- as.factor(if_else(logres_survivor_pred_2 > 0.5, "yes", "no"))

confusionMatrix(data = logres_survivor_pred_2,
                reference = as.factor(test_y$Survived),
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  188  10
##        yes  16 117
##                                              
##                Accuracy : 0.9215             
##                  95% CI : (0.887, 0.948)     
##     No Information Rate : 0.6163             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.8354             
##                                              
##  Mcnemar's Test P-Value : 0.3268             
##                                              
##             Sensitivity : 0.9213             
##             Specificity : 0.9216             
##          Pos Pred Value : 0.8797             
##          Neg Pred Value : 0.9495             
##              Prevalence : 0.3837             
##          Detection Rate : 0.3535             
##    Detection Prevalence : 0.4018             
##       Balanced Accuracy : 0.9214             
##                                              
##        'Positive' Class : yes                
## 

Comparing this result with the previous logistic regression model, accuracy increases from 90.33% to 92.15%, precision increases from 86.26% to 87.97%, and sensitivity increases from 88.98% to 92.13%.

K-NN

In the K-NN model above we used K = 1 by default. To improve it, we pick an optimum K using a common rule of thumb, K ≈ √n with n the number of training rows; this gives an optimum K of 27.

# rule of thumb: k ≈ sqrt(number of training rows)
optimum_k = round(sqrt(nrow(train_p)))
optimum_k
## [1] 27
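
The √n rule is only a heuristic. An alternative sketch, not part of the original workflow: tune K empirically over a grid of odd values and keep the best-performing one. Strictly, this should use a validation split rather than the test set to avoid leakage:

# sketch: tune k over a grid of odd values (odd k avoids voting ties)
candidate_k <- seq(1, 51, by = 2)
acc <- sapply(candidate_k, function(k) {
  pred <- knn(train = train_p, test = test_p, cl = train_y$Survived, k = k)
  mean(pred == test_y$Survived)  # accuracy against the held-out labels
})
candidate_k[which.max(acc)]      # k with the highest accuracy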

We implement the optimum K in the K-NN model, using the same test dataset to make an apples-to-apples comparison.

knn_survivor_pred <- knn(train = train_p,
                         test = test_p,
                         cl = train_y$Survived,
                         k = optimum_k)

We use a confusion matrix to summarise the results of our improved K-NN model.

confusionMatrix(data = knn_survivor_pred,
                reference = as.factor(test_y$Survived),
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  153  66
##        yes  51  61
##                                          
##                Accuracy : 0.6465         
##                  95% CI : (0.5924, 0.698)
##     No Information Rate : 0.6163         
##     P-Value [Acc > NIR] : 0.1413         
##                                          
##                   Kappa : 0.2356         
##                                          
##  Mcnemar's Test P-Value : 0.1956         
##                                          
##             Sensitivity : 0.4803         
##             Specificity : 0.7500         
##          Pos Pred Value : 0.5446         
##          Neg Pred Value : 0.6986         
##              Prevalence : 0.3837         
##          Detection Rate : 0.1843         
##    Detection Prevalence : 0.3384         
##       Balanced Accuracy : 0.6152         
##                                          
##        'Positive' Class : yes            
## 

The results show our prediction on the test dataset using the K-NN model with the optimum K of 27. Compared with the earlier K-NN model with K = 1, accuracy increases from 57.7% to 64.65% and precision/positive predictive value increases from 45.52% to 54.46%; sensitivity decreases from 51.97% to 48.03%, but specificity increases from 61.27% to 75%.

Conclusion

Our goal is to create a model that predicts which passengers survived the Titanic shipwreck, with “Survived” as the target and the other columns as predictor variables. The source does not specify whether recall or precision has higher priority, so in this case we look for the model with the highest accuracy. The source also does not restrict how many variables we may use, nor forbid removing variables.

Comparing the results before and after model improvement, logistic regression performs better than the K-NN model in every aspect. The logistic regression model improved with the stepwise method achieves the highest accuracy (92.15%, versus 64.65% for the tuned K-NN) and the best values on the other metrics as well.

So, as a result, we can use logistic regression to predict the survivors of the Titanic shipwreck.

Source Code

This analysis was made for educational purposes, and the creator has made the data and source code publicly accessible.

The files can be accessed and downloaded on GitHub: alfandash github

The result of this R Markdown can be accessed on RPubs: alfandash rpubs