Prediction of survival on the Titanic

Objection

In this research we will analyze the survivability of a titanic passenger during the tragic disaster of the cruise ship. Many will consider the survival of the person is a luck or even a miracle, it can be true, but in this case we will try to interpret those miracle into information based on the provided data. Hopefully we can undertand the parameters that are crucial to the survival of a passenger.

Library and Setup

But first we will need to load the necessary packages to help this research. Those packages must be installed first to be able to be load as below.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)

Data Import

The dataset we are using today are provided from Kaggle containing passenger information from the Titanic, more specifically their data regarding their time at the ship.

data_titanic <- read.csv("titanic/train.csv")
str(data_titanic)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Detailed information on the column:

Survived : Passenger survival, 0 = No; 1 = Yes
Pclass : Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd
Name : Passenger Name
Sex : Passenger Sex
Age : Passenger Age
SibSp : Number of siblings / spouses aboard the Titanic
Parch : Number of parents / children aboard the Titanic
Ticket : Ticket Number
Fare : The ticket price
Cabin : Cabin number
Embarked : Port of Embarkation, C = Cherbourg; Q = Queenstown; S = Southampton

Data Manipulation

In the provided dataset there are some unmatched data type, in other words it is still poorly presented. A adjustment that are needed to be done.

data_titanic <- data_titanic %>%
  mutate_at(vars(Survived, Pclass,Sex), as.factor) %>% 
  mutate(Survived = factor(Survived, levels = c(0,1), labels = c("Not Survived", "Survived")))
glimpse(data_titanic)

## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <fct> Not Survived, Survived, Survived, Survived, Not Survived, …
## $ Pclass      <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <fct> male, female, female, female, male, male, male, male, fema…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

Next step is to recognize the missing data in the dataset

colSums(is.na(data_titanic))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

It is observable that in the Age column there are significant amount of missing data. But we decide to drop it all because in our consideration that the Age column is such a determinant factor in deciding the rate of survival in each passenger.

data_titanic <-  data_titanic[!is.na(data_titanic$Age),]
colSums(is.na(data_titanic))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Data Preprocessing

Our target variable is in the Survival column wheter the passenger survived or not. Therefore it is essential to make sure the dataset in balanced so that the model is sensitive to both target class (0=Not Survived, 1=Survived).

prop.table(table(data_titanic$Survived))

## 
## Not Survived     Survived 
##    0.5938375    0.4061625

table(data_titanic$Survived)

## 
## Not Survived     Survived 
##          424          290

As we can see, the data is fairly balance, so we can continue using this dataset without anymore action about skewed data

Feature Engineering

From the given predictor, there has been an attention to the Cabin column, it contain an information about a passenger’s cabin in string. We will process the string data to extract the number of cabin a passenger has. We suspect it can be a predicting factor for out model.

data_titanic$totalCabin <- ifelse(test = data_titanic$Cabin == "", yes = 0, no=sapply(strsplit(data_titanic$Cabin, " "), length))

Data Train-Test Split

We will devide the dataset into two parts. The first one served as the trainning data, that will be the base of the prediction model. And the second part is the testing data, a unseen data for the model so that we can evaluate the model’s performance.

set.seed(303)
intrain <- sample(nrow(data_titanic), nrow(data_titanic)*0.8)
titanic_train <- data_titanic[intrain,]
titanic_test <- data_titanic[-intrain,]

names(data_titanic)

##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"    "totalCabin"

Modelling

The step where we actually make the prediction model using logistic regression or in this case using glm() function. We will not use all of the variables since not all of the variable are considered important to predict the survavibility of a passenger.

titanic_model <- glm(formula = Survived~Pclass+Sex+Age+SibSp+Parch+Fare+Embarked+totalCabin, family = "binomial", 
             data = titanic_train)
summary(titanic_model)

## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + 
##     Fare + Embarked + totalCabin, family = "binomial", data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5696  -0.7152  -0.4085   0.6871   2.3991  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.570e+01  5.354e+02   0.029   0.9766    
## Pclass2     -8.621e-01  4.094e-01  -2.106   0.0352 *  
## Pclass3     -2.011e+00  4.291e-01  -4.686 2.78e-06 ***
## Sexmale     -2.513e+00  2.454e-01 -10.244  < 2e-16 ***
## Age         -3.776e-02  8.794e-03  -4.294 1.75e-05 ***
## SibSp       -3.706e-01  1.469e-01  -2.522   0.0117 *  
## Parch       -1.225e-01  1.340e-01  -0.914   0.3608    
## Fare        -5.504e-04  3.047e-03  -0.181   0.8566    
## EmbarkedC   -1.171e+01  5.354e+02  -0.022   0.9826    
## EmbarkedQ   -1.289e+01  5.354e+02  -0.024   0.9808    
## EmbarkedS   -1.217e+01  5.354e+02  -0.023   0.9819    
## totalCabin   2.555e-01  2.903e-01   0.880   0.3788    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 771.40  on 570  degrees of freedom
## Residual deviance: 533.81  on 559  degrees of freedom
## AIC: 557.81
## 
## Number of Fisher Scoring iterations: 12

We have successfully made a model using all the predictor, but as we can see, based on the data there are still many predictor that are not significant that cost us efficiency. So we decide on using stepwise method to reduce the predictor unto a set of variables that are important only.

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

titanic_model_step <- stepAIC(titanic_model, direction = "backward", trace = FALSE)
summary(titanic_model_step)

## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = titanic_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6247  -0.6973  -0.4178   0.6843   2.3653  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.903644   0.471246   8.284  < 2e-16 ***
## Pclass2     -1.207460   0.298297  -4.048 5.17e-05 ***
## Pclass3     -2.414140   0.300302  -8.039 9.05e-16 ***
## Sexmale     -2.443079   0.233547 -10.461  < 2e-16 ***
## Age         -0.039576   0.008629  -4.586 4.51e-06 ***
## SibSp       -0.412410   0.137856  -2.992  0.00278 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 771.40  on 570  degrees of freedom
## Residual deviance: 539.77  on 565  degrees of freedom
## AIC: 551.77
## 
## Number of Fisher Scoring iterations: 4

After using the stepwise method, we can conclude that the most significant predictors are Pclass, Sex, Age and SibSp. In other words, those parameters are the deciding factors wheter a passenger can survive the disaster or not.

Prediction

Using model that are developed using stepwise method, we will try to predict the rate of survival of each passenger.

titanic_test$pred <- predict(titanic_model_step, type = "response", newdata = titanic_test)
titanic_test[,c("PassengerId","Survived","pred")]

##     PassengerId     Survived       pred
## 3             3     Survived 0.61314086
## 4             4     Survived 0.89149291
## 11           11     Survived 0.71479577
## 12           12     Survived 0.83316813
## 17           17 Not Survived 0.06402196
## 19           19 Not Survived 0.46263115
## 25           25 Not Survived 0.48392270
## 36           36 Not Survived 0.35114094
## 39           39 Not Survived 0.48808242
## 41           41 Not Survived 0.37614655
## 45           45     Survived 0.67646305
## 57           57     Survived 0.86588942
## 60           60 Not Survived 0.03074031
## 62           62     Survived 0.91680934
## 68           68 Not Survived 0.15374636
## 69           69     Survived 0.30302733
## 85           85     Survived 0.88323167
## 104         104 Not Survived 0.09452667
## 109         109 Not Survived 0.07889511
## 111         111 Not Survived 0.40143322
## 112         112 Not Survived 0.62322177
## 113         113 Not Survived 0.13892562
## 114         114 Not Survived 0.57091343
## 121         121 Not Survived 0.19737115
## 133         133 Not Survived 0.31368076
## 145         145 Not Survived 0.38716151
## 158         158 Not Survived 0.10518950
## 165         165 Not Survived 0.06643478
## 171         171 Not Survived 0.27816988
## 173         173     Survived 0.73837095
## 174         174 Not Survived 0.14372788
## 175         175 Not Survived 0.31958515
## 176         176 Not Survived 0.11121907
## 190         190 Not Survived 0.08484223
## 195         195     Survived 0.89681349
## 196         196     Survived 0.83316813
## 201         201 Not Survived 0.11287580
## 209         209     Survived 0.70188533
## 217         217     Survived 0.60371253
## 229         229 Not Survived 0.38716151
## 232         232 Not Survived 0.10897319
## 240         240 Not Survived 0.25866935
## 243         243 Not Survived 0.29016289
## 253         253 Not Survived 0.27029358
## 262         262     Survived 0.06169098
## 263         263 Not Survived 0.26702181
## 267         267 Not Survived 0.03781770
## 268         268     Survived 0.08663914
## 270         270     Survived 0.92542804
## 282         282 Not Survived 0.11287580
## 294         294 Not Survived 0.63173928
## 295         295 Not Survived 0.12972485
## 297         297 Not Survived 0.13197524
## 303         303 Not Survived 0.15374636
## 310         310     Survived 0.93798576
## 312         312     Survived 0.91423373
## 314         314 Not Survived 0.11287580
## 315         315 Not Survived 0.13457911
## 319         319     Survived 0.93564340
## 320         320     Survived 0.87081714
## 322         322 Not Survived 0.11689984
## 323         323     Survived 0.81889969
## 329         329     Survived 0.46263115
## 342         342     Survived 0.84768985
## 345         345 Not Survived 0.23656166
## 351         351 Not Survived 0.13425865
## 354         354 Not Survived 0.08663914
## 361         361 Not Survived 0.04978304
## 362         362 Not Survived 0.21298881
## 371         371     Survived 0.51468587
## 373         373 Not Survived 0.15374636
## 387         387 Not Survived 0.04499351
## 388         388     Survived 0.78099187
## 406         406 Not Survived 0.18169856
## 408         408     Survived 0.43093617
## 415         415     Survived 0.06327441
## 425         425 Not Survived 0.11121907
## 434         434 Not Survived 0.16432916
## 436         436     Survived 0.94965375
## 462         462 Not Survived 0.09119325
## 472         472 Not Survived 0.07889511
## 479         479 Not Survived 0.13892562
## 490         490     Survived 0.15159226
## 497         497     Survived 0.79480600
## 531         531     Survived 0.90066385
## 536         536     Survived 0.91827624
## 540         540     Survived 0.95404188
## 544         544     Survived 0.19376484
## 551         551     Survived 0.68735369
## 557         557     Survived 0.83084065
## 559         559     Survived 0.87520425
## 562         562 Not Survived 0.07333146
## 566         566 Not Survived 0.06132893
## 577         577     Survived 0.79422892
## 589         589 Not Survived 0.13892562
## 593         593 Not Survived 0.05659171
## 596         596 Not Survived 0.05782810
## 601         601     Survived 0.71535693
## 604         604 Not Survived 0.06327441
## 617         617 Not Survived 0.06229458
## 620         620 Not Survived 0.31521160
## 625         625 Not Survived 0.14372788
## 636         636     Survived 0.83034328
## 661         661     Survived 0.20700994
## 666         666 Not Survived 0.13727142
## 669         669 Not Survived 0.06566104
## 676         676 Not Survived 0.15896635
## 683         683 Not Survived 0.14866748
## 685         685 Not Survived 0.07351884
## 686         686 Not Survived 0.24072666
## 688         688 Not Survived 0.15374636
## 695         695 Not Survived 0.28618568
## 700         700 Not Survived 0.06813114
## 704         704 Not Survived 0.12532198
## 705         705 Not Survived 0.08355820
## 708         708     Survived 0.44976555
## 709         709     Survived 0.95404188
## 720         720 Not Survived 0.09452667
## 721         721     Survived 0.92119746
## 734         734 Not Survived 0.34138273
## 738         738     Survived 0.51884418
## 764         764     Survived 0.88760488
## 770         770 Not Survived 0.09796880
## 780         780     Survived 0.90041871
## 781         781     Survived 0.72611879
## 785         785 Not Survived 0.12532198
## 787         787     Survived 0.68506345
## 797         797     Survived 0.87701168
## 805         805     Survived 0.11689984
## 806         806 Not Survived 0.10152223
## 813         813 Not Survived 0.24378340
## 824         824     Survived 0.60371253
## 825         825 Not Survived 0.06402196
## 832         832     Survived 0.45210897
## 835         835 Not Survived 0.15896635
## 842         842 Not Survived 0.40610087
## 848         848 Not Survived 0.08796596
## 859         859     Survived 0.63173928
## 876         876     Survived 0.71009956
## 880         880     Survived 0.84388199
## 881         881     Survived 0.84641940
## 882         882 Not Survived 0.09452667
## 884         884 Not Survived 0.29838143

ggplot(titanic_test, aes(x=pred)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()

Model Evaluation

We already produce the survivability rate of each passenger, now to be able to examine our model we will set a threshold of 0.75 to decide a passenger will survive or not.

titanic_test$pred_label <- factor(ifelse(titanic_test$pred > 0.75, "Survived","Not Survived"))

library(caret)

## Loading required package: lattice

conf_matrix <- confusionMatrix(titanic_test$pred_label, titanic_test$Survived, positive = "Survived")
conf_matrix

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Not Survived Survived
##   Not Survived           85       27
##   Survived                0       31
##                                           
##                Accuracy : 0.8112          
##                  95% CI : (0.7373, 0.8717)
##     No Information Rate : 0.5944          
##     P-Value [Acc > NIR] : 2.541e-08       
##                                           
##                   Kappa : 0.5772          
##                                           
##  Mcnemar's Test P-Value : 5.624e-07       
##                                           
##             Sensitivity : 0.5345          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.7589          
##              Prevalence : 0.4056          
##          Detection Rate : 0.2168          
##    Detection Prevalence : 0.2168          
##       Balanced Accuracy : 0.7672          
##                                           
##        'Positive' Class : Survived        
##

Based on the confusion matrix above we can conclude that our model can predict class Survived with the accuracy of 79,72 percent with 0.75 threshold, with many errors on Survived passengers are predicted not survived but actually do survive the accident.

Conclusion

In these research of understanding the dataset of Titanic passengers, we found a pattern in parameters of survival in the disaster is ticket class, sex, age, and number of siblings or spouses. The prediction model can only achieve accuracy of 79.7% with many errors on passengers came out survived despite all otherwise factors, and this passengers can be truly considered miracluous statistically.