Introduction

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Task

We will try to analyze what sorts of people were more likely to survive with predictive model using the titanic data.

Library

This is library that we will use in this analysis.

library(tidyverse)
library(ggplot2)
library(caret)
library(car)

Data Preprocessing

Read data and see data structure.

train <- read.csv("train.csv")
test <- read.csv("test.csv")
glimpse(train)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "...
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6",...
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", ...

Based on the result, we need to change Survived, Sex, and Embarked into factor.

train <- train %>% 
  mutate(Survived = as.factor(ifelse(Survived == 0, "No", "Yes")),
         Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         Embarked = as.factor(Embarked))
glimpse(train)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Survived    <fct> No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, Yes, Yes, ...
## $ Pclass      <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
## $ Sex         <fct> male, female, female, female, male, male, male, male, f...
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6",...
## $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S...

Lets check if there is any NA.

colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Based on the result, we can see that Age have 177 NAs, but it is weird because based on the summary, the cabin’s coloumn have NA but in form of "" and it is not detected by colsums. So we need to change every "" in the data into NA.

train[train == ""]<-NA
colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

And we can see that Age, Cabin, and Embarked have missing values. Before do Exploratory Data Analysis and Processing we need also to check test data.

glimpse(test)
## Rows: 418
## Columns: 11
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, ...
## $ Pclass      <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3...
## $ Name        <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)",...
## $ Sex         <chr> "male", "female", "male", "male", "female", "male", "fe...
## $ Age         <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 2...
## $ SibSp       <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1...
## $ Parch       <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Ticket      <chr> "330911", "363272", "240276", "315154", "3101298", "753...
## $ Fare        <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292...
## $ Cabin       <chr> "", "", "", "", "", "", "", "", "", "", "", "", "B45", ...
## $ Embarked    <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S", ...

Based on the result, we need to change Pclass, Sex, and Embarked into factor.

test <- test %>% 
  mutate(Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         Embarked = as.factor(Embarked))
glimpse(test)
## Rows: 418
## Columns: 11
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, ...
## $ Pclass      <fct> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3, 3...
## $ Name        <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)",...
## $ Sex         <fct> male, female, male, male, female, male, female, male, f...
## $ Age         <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0, 2...
## $ SibSp       <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0, 1...
## $ Parch       <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Ticket      <chr> "330911", "363272", "240276", "315154", "3101298", "753...
## $ Fare        <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6292...
## $ Cabin       <chr> "", "", "", "", "", "", "", "", "", "", "", "", "B45", ...
## $ Embarked    <fct> Q, S, Q, S, S, S, Q, S, C, S, S, S, S, S, S, C, Q, C, S...

We also can see that the cabin also have "". So we need to change it into NA.

test[test == ""]<-NA
colSums(is.na(test))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0          86           0 
##       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           1         327           0

As we can see there are 3 columns that have NA. After this we do Exploratory Data Analysis and Processing to solve the NA problems.

EDA and Data Processing

We will fix the NA in data train. First we need to fill NA in Age. To decide whether we use mean or median for Age, we check if the distribution is normal.

shapiro.test(train$Age)
## 
##  Shapiro-Wilk normality test
## 
## data:  train$Age
## W = 0.98146, p-value = 7.337e-08

As we can see, the distribution is not normal so we use median for Age. After that we need to fill NA in Embarked. To decide whether we fill NA with “C”,“Q” or “S”, we need to see the number of passenger from every port.

ggplot(train, aes(x = Embarked))+
  geom_bar()+
  labs(x = "Embarked", y = "Count")

From the data, we see that “S” is the highest so we fill the NA with “S”. For cabin, we don’t need to fill the NA because the NA are too many. We will deselect PassengerId, Name, Ticket, and Cabin because we don’t need it in our predictive modelling.

train_processed <- train %>%
   mutate(Age = replace_na(Age,median(Age, na.rm = TRUE)),
         Embarked = replace_na(Embarked,"S")) %>% 
   select(-c(PassengerId,Name,Ticket,Cabin))

Lets check if the NA issue is solved.

colSums(is.na(train_processed))
## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0        0        0        0        0        0

After we fill the NA, and filter the coloumns we want to use, we left with this data. After this, We also need to solve NA issue in data test. First we will solve the NA in Age, we will look the distribution.

shapiro.test(test$Age)
## 
##  Shapiro-Wilk normality test
## 
## data:  test$Age
## W = 0.97101, p-value = 3.288e-06

Based on the result, the distribution is not normal so we will use median. After that we will solve the issue in Fare. First we need to check which passenger and from what class is the passenger.

test %>% filter(is.na(Fare))

Based on the data, we can see that the NA is from passenger that from pclass 3. So we need to see the distribution of pclass to decide whether we use median or mean.

shapiro.test(as.numeric(test$Pclass))
## 
##  Shapiro-Wilk normality test
## 
## data:  as.numeric(test$Pclass)
## W = 0.73333, p-value < 2.2e-16

Based on the result, we need to use median. To do so, we aggregate the data.

aggregate(Fare ~ Pclass,
          data = test,
          FUN = median) 

Based on the result, the median is 7.8958 and like data train, we exclude cabin because there are too many NA.

test_processed <- test %>%
   mutate(Age = replace_na(Age,median(Age, na.rm = TRUE)),
         Fare = replace_na(Fare, 7.8958)) %>% 
   select(-c(Cabin))
colSums(is.na(test_processed))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0           0           0 
##       Parch      Ticket        Fare    Embarked 
##           0           0           0           0

And we good to go. Next we will do predictive modelling.

Modelling

First, we need to check whether the proportion of the target is balance. If target is imbalance we need to upscale the data.

round(prop.table(table(train_processed$Survived)),2)
## 
##   No  Yes 
## 0.62 0.38

The proportion is still Ok (60:40) so we don’t need to upscale the data.

For this case, We will use logistic regression. First, we need to split the data train into data train and data validation with proportion of 80% for data train and 20% for data validation.

set.seed(123)
row_data <- nrow(train_processed)
index <- sample(row_data, row_data*0.8)

data_train <- train_processed[ index, ]
data_validation <- train_processed[ -index, ]

Lets make base model

set.seed(123)
model_full <- glm(Survived ~ ., data_train, family = "binomial")

summary(model_full)
## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = data_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5513  -0.6138  -0.4133   0.6229   2.4532  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.122458   0.524881   7.854 4.03e-15 ***
## Pclass2     -0.909675   0.330529  -2.752  0.00592 ** 
## Pclass3     -2.058254   0.325573  -6.322 2.58e-10 ***
## Sexmale     -2.770524   0.225659 -12.277  < 2e-16 ***
## Age         -0.037250   0.008886  -4.192 2.76e-05 ***
## SibSp       -0.352950   0.124118  -2.844  0.00446 ** 
## Parch       -0.082107   0.127182  -0.646  0.51855    
## Fare         0.001814   0.002542   0.714  0.47531    
## EmbarkedQ    0.066711   0.419519   0.159  0.87365    
## EmbarkedS   -0.590536   0.263752  -2.239  0.02516 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 947.98  on 711  degrees of freedom
## Residual deviance: 625.28  on 702  degrees of freedom
## AIC: 645.28
## 
## Number of Fisher Scoring iterations: 5

Lets use the model to predict data validation

pred_valid_base <- predict(model_full, data_validation, type = "response")
pred_class_validation_base <- ifelse(pred_valid_base > 0.5, "Yes", "No") %>% 
  as.factor
confusionMatrix(pred_class_validation_base, data_validation$Survived, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  94  25
##        Yes 16  44
##                                           
##                Accuracy : 0.7709          
##                  95% CI : (0.7024, 0.8303)
##     No Information Rate : 0.6145          
##     P-Value [Acc > NIR] : 6.222e-06       
##                                           
##                   Kappa : 0.5045          
##                                           
##  Mcnemar's Test P-Value : 0.2115          
##                                           
##             Sensitivity : 0.6377          
##             Specificity : 0.8545          
##          Pos Pred Value : 0.7333          
##          Neg Pred Value : 0.7899          
##              Prevalence : 0.3855          
##          Detection Rate : 0.2458          
##    Detection Prevalence : 0.3352          
##       Balanced Accuracy : 0.7461          
##                                           
##        'Positive' Class : Yes             
## 

Based on the result, we can see the accuracy is 77%, Sensitivity is 63%, Specificity is 85%, and Presicion is 73% with positive class “Yes”.

Model Tuning

After we have base model, lets tuning it to become simpler, lower AIC, and only consist of significance coloumns. To do this we will make model_none and do stepwise modelling with the lower is model_none and the upper is model_full.

set.seed(123)
model_none <- glm(Survived ~ 1, data_train, family = "binomial")
model_step <- step(model_none, 
                   scope = list(lower = model_none, upper = model_full),
                   direction = "both",
                   trace = 0
                   )
summary(model_step)
## 
## Call:
## glm(formula = Survived ~ Sex + Pclass + Age + SibSp + Embarked, 
##     family = "binomial", data = data_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5421  -0.6137  -0.4089   0.6330   2.4666  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.251380   0.469602   9.053  < 2e-16 ***
## Sexmale     -2.751059   0.219520 -12.532  < 2e-16 ***
## Pclass2     -1.001797   0.302114  -3.316 0.000913 ***
## Pclass3     -2.181286   0.281243  -7.756 8.77e-15 ***
## Age         -0.037529   0.008861  -4.235 2.28e-05 ***
## SibSp       -0.361967   0.119262  -3.035 0.002405 ** 
## EmbarkedQ    0.069979   0.415866   0.168 0.866369    
## EmbarkedS   -0.623392   0.259775  -2.400 0.016407 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 947.98  on 711  degrees of freedom
## Residual deviance: 626.07  on 704  degrees of freedom
## AIC: 642.07
## 
## Number of Fisher Scoring iterations: 5

As we can see we get lower AIC, and the model only consist of significance columns. Lets use this model to predict the validation data.

pred_valid_step <- predict(model_step, data_validation, type = "response")
pred_class_validation_step <- ifelse(pred_valid_step > 0.5, "Yes", "No") %>% 
  as.factor
confusionMatrix(pred_class_validation_step, data_validation$Survived, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  93  25
##        Yes 17  44
##                                           
##                Accuracy : 0.7654          
##                  95% CI : (0.6964, 0.8254)
##     No Information Rate : 0.6145          
##     P-Value [Acc > NIR] : 1.317e-05       
##                                           
##                   Kappa : 0.4938          
##                                           
##  Mcnemar's Test P-Value : 0.2801          
##                                           
##             Sensitivity : 0.6377          
##             Specificity : 0.8455          
##          Pos Pred Value : 0.7213          
##          Neg Pred Value : 0.7881          
##              Prevalence : 0.3855          
##          Detection Rate : 0.2458          
##    Detection Prevalence : 0.3408          
##       Balanced Accuracy : 0.7416          
##                                           
##        'Positive' Class : Yes             
## 

As we can see, the accuracy is 76.5%, sensitivity is 63%, specificity is 84%, and precision is 72%. With the simpler model and lower AIC, it lower the accuracy by 0.5% and specificity by 1%.

Model Diagnostic

After we see the difference between base model and stepwise model. We will use stepwise model because the AIC is lower and the model is simpler than base model. With simpler model and lower AIC, it only cost us 0.5% in accuracy and 1% in specificity.

Before we make prediction to data test. We need to check if the model is good. First lets check if there is multicollinearity.

vif(model_step)
##              GVIF Df GVIF^(1/(2*Df))
## Sex      1.153201  1        1.073872
## Pclass   1.504479  2        1.107507
## Age      1.292843  1        1.137033
## SibSp    1.156015  1        1.075181
## Embarked 1.192914  2        1.045087

Based on the result, there is no multicollinearity between predictors.

After that, lets see if the model optimum.

pred_train_step <- predict(model_step, data_train, type = "response")
pred_class_train_step <- ifelse(pred_train_step > 0.5, "Yes", "No") %>% 
  as.factor
confusionMatrix(pred_class_train_step, data_train$Survived, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  384  77
##        Yes  55 196
##                                           
##                Accuracy : 0.8146          
##                  95% CI : (0.7841, 0.8425)
##     No Information Rate : 0.6166          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6018          
##                                           
##  Mcnemar's Test P-Value : 0.06758         
##                                           
##             Sensitivity : 0.7179          
##             Specificity : 0.8747          
##          Pos Pred Value : 0.7809          
##          Neg Pred Value : 0.8330          
##              Prevalence : 0.3834          
##          Detection Rate : 0.2753          
##    Detection Prevalence : 0.3525          
##       Balanced Accuracy : 0.7963          
##                                           
##        'Positive' Class : Yes             
## 

The difference between accuracy, sensitivity, specificity, and precision between data train and data test are below 5%. So we can say that the model is optimum.

Make prediction for data test

Lets see the prediction

pred_probs_test <- predict(model_step, test_processed, type = "response")
pred_test <- ifelse(pred_probs_test > 0.5, "Yes", "No")
prediction <- data.frame(test_processed,pred_test)
prediction