Introduction


If you were one of the unfortunate passengers who boarded the Titanic, how confident would you be of surviving? Well, if you have watched James Cameron’s Titanic (who hasn’t?!), the scene where Jack had to let Rose go to the lifeboat surely comes to mind. You may have noticed that the lifeboats were dominated by women. But does the movie really portray the true event?

To find out whether gender really influenced the chance of survival, we use the Titanic dataset from Kaggle to examine which factors contributed and to build a model that predicts passenger survival.

Titanic Dataset

We begin by importing both the train and test datasets. To make preprocessing and prediction easier, we combine the two into a single data frame.

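A minimal sketch of this step, assuming the Kaggle files are named train.csv and test.csv and that dplyr is available; the factor recoding below (Dead/Survived labels) follows the glimpse output that comes next:

```r
library(dplyr)

# Read the Kaggle files (file names are assumptions)
train <- read.csv("train.csv", stringsAsFactors = FALSE, na.strings = "")
test  <- read.csv("test.csv",  stringsAsFactors = FALSE, na.strings = "")

# The test set has no Survived column, so add it before stacking
test$Survived <- NA
titanic <- bind_rows(train, test) %>%
  mutate(Survived = factor(Survived, levels = c(0, 1),
                           labels = c("Dead", "Survived")),
         Pclass   = as.factor(Pclass),
         Sex      = as.factor(Sex),
         Embarked = as.factor(Embarked))

glimpse(titanic)
```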
## Observations: 1,309
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ Survived    <fct> Dead, Survived, Survived, Survived, Dead, Dead, Dead, Dea…
## $ Pclass      <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, …
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (F…
## $ Sex         <fct> male, female, female, female, male, male, male, male, fem…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14,…
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, …
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, …
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "3…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625…
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "…
## $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, …

The Titanic dataset consists of 12 variables. We will explain each variable in the Exploratory Data Analysis section.

Exploratory Data Analysis

In this EDA section, we explore the relationship between each variable and the chance of survival. We also show how we handle missing values and create new variables to improve the model.

Numerical Variables

Age

We begin with the Age variable. Were children prioritized over adults?

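A sketch of the kind of plot used here, assuming ggplot2 and the combined titanic data frame (binning and aesthetics are assumptions):

```r
library(ggplot2)

# Age distribution of the labelled (training) passengers, split by survival
ggplot(subset(titanic, !is.na(Survived)), aes(x = Age, fill = Survived)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Age distribution by survival", y = "Count")
```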
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 177 rows containing non-finite values (stat_bin).
## Warning: Removed 177 rows containing non-finite values (stat_density).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 177 rows containing non-finite values (stat_bin).
## Warning: Removed 177 rows containing non-finite values (stat_density).

From the graph, more children survived than died, while for adults the numbers look roughly equal. If we combine the Sex and Age variables, a large share of the males who survived were babies. However, it is not yet clear whether gender influenced the survival chances.

Fare

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most of the Titanic’s passengers paid fares under 50 dollars. However, the survival rate is very low for passengers with fares under 25 dollars.

Family

SibSp defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).

Parch defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore Parch = 0 for them.

The survival rate tends to be lower when the number of siblings and spouses is higher. However, the same is not true for the number of parents and children.

We now create a new variable that describes the size of the family that boarded the ship, as sketched below.

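A sketch of the derived variable, assuming Family_Size counts the passenger together with SibSp and Parch (which matches the values shown later in the data):

```r
# Family size = the passenger + siblings/spouses + parents/children
titanic <- titanic %>%
  mutate(Family_Size = SibSp + Parch + 1)
```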
In summary, many passengers were travelling alone. The survival rate is high for passengers who travelled with 1 to 3 family members, but the chance drops sharply for those who travelled with more than 3.

Categorical Variables

Sex

The 1997 movie suggests that females were prioritized over males. Let us see how that actually looks in our data.

From the data, it is clear that females had a higher chance of survival than males. This indication will be very important for our model.

Embarked

Embarked is the port of embarkation, the city where the passengers boarded the ship: C = Cherbourg, Q = Queenstown, S = Southampton.

## Warning: Factor `Embarked` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## Warning: Factor `Embarked` contains implicit NA, consider using
## `forcats::fct_explicit_na`

It seems that passengers from Cherbourg had the best survival rate. But why is that?

If we combine Embarked with Fare and Pclass, it looks like the passengers from Cherbourg paid more and most of them travelled in first class.

Feature Engineering

Titles

To capture more specific information than Age and Sex alone, we can do some feature engineering by extracting the title from the Name variable. In this case, we use the gsub function.

We then group the extracted titles into 6 categories: Mr, Mrs, Master, Miss, Honorific Titles, and Officers.

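A sketch of the extraction and grouping; the gsub pattern strips everything before the comma and after the period in Name, and the exact mapping of the rare titles into Honorific Titles and Officers is an assumption:

```r
# Pull the title out of strings like "Braund, Mr. Owen Harris"
titanic$title <- gsub("(.*, )|(\\..*)", "", titanic$Name)

# Collapse rare titles into broader groups (grouping is an assumption)
officers  <- c("Capt", "Col", "Major", "Dr", "Rev")
honorific <- c("Lady", "the Countess", "Sir", "Jonkheer", "Don", "Dona")
titanic$title[titanic$title %in% officers]        <- "Officers"
titanic$title[titanic$title %in% honorific]       <- "Honorific Titles"
titanic$title[titanic$title %in% c("Mlle", "Ms")] <- "Miss"
titanic$title[titanic$title == "Mme"]             <- "Mrs"
titanic$title <- as.factor(titanic$title)

levels(titanic$title)
```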
## [1] "Honorific Titles" "Master"           "Miss"             "Mr"              
## [5] "Mrs"              "Officers"

The Mr title, which indicates an adult male, has a really poor survival rate of about 20%. On the other hand, Mrs, Miss, and Master have survival rates above 50%. As indicated before, women and children were indeed prioritized.

Missing Value Handling

Unfortunately, there are many missing values in our data, mainly in the Age, Cabin, Embarked, and Fare variables.

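One way to get the per-column counts shown below:

```r
# Missing values per column (Survived is NA for the 418 unlabelled test rows)
colSums(is.na(titanic))
```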
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0         418           0           0           0         263 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           1        1014           2 
## Family_Size       title 
##           0           0

Age

## 
## FALSE  TRUE 
## 16628  1698

In total there are 1,698 missing cells in our data, including 263 rows with a missing Age. Instead of deleting those rows, we try to impute Age based on its relationship with the other variables.

## Warning: Removed 263 rows containing non-finite values (stat_boxplot).

Unfortunately, Sex and Age do not appear to be related, so we cannot predict Age from Sex.

## Warning: Removed 263 rows containing non-finite values (stat_boxplot).

Just like the Sex variable, the port where the passengers boarded does not give us much information. The missing values in Embarked also cause some issues here.

## Warning: Removed 263 rows containing non-finite values (stat_boxplot).

If we look at Pclass, we can see that first-class passengers were older than second-class passengers, who in turn were older than third-class passengers.

From the boxplot above, the median ages for the three classes are 37, 29, and 24.

We use these medians to fill in the missing Age values.

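A possible implementation of this imputation with dplyr; the per-class medians are exactly the 37, 29, and 24 read off the boxplot above:

```r
# Median Age per passenger class, computed from the non-missing rows
class_median <- titanic %>%
  filter(!is.na(Age)) %>%
  group_by(Pclass) %>%
  summarise(med_age = median(Age))

# Replace each missing Age with the median of that passenger's class
titanic <- titanic %>%
  left_join(class_median, by = "Pclass") %>%
  mutate(Age = ifelse(is.na(Age), med_age, Age)) %>%
  select(-med_age)
```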
Embarked

The Embarked variable also has some missing values. We fill them with the most frequent value, S.

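A one-line sketch of the fill, assuming "S" is already a level of the Embarked factor:

```r
# Fill the two missing ports with the most frequent value
titanic$Embarked[is.na(titanic$Embarked)] <- "S"
```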
Cross Validation

We need to inspect the target variable ratio:

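One way to inspect it, assuming the first nrow(train) rows of the combined frame are the labelled training data:

```r
# Proportion of Dead vs Survived among the training rows
prop.table(table(titanic$Survived[1:nrow(train)]))
```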
## .
##      Dead  Survived 
## 0.6161616 0.3838384

The ratio between dead and survived is roughly 62:38, which is balanced enough for our model.

Modeling

We use three models: logistic regression, random forest, and k-NN. For modeling, we use 5-fold cross-validation, which aims to produce a less biased estimate of performance:

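A sketch of the resampling setup with caret (the seed value is an assumption):

```r
library(caret)

set.seed(123)                                    # reproducible folds (assumed seed)
ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation
```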
Logistic Regression

We start with the logistic regression model:

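The fitted model, reconstructed from the Call shown in the summary below (the object name is an assumption):

```r
model_logistic <- glm(Survived ~ Pclass + title + Fare + Family_Size +
                        Parch + SibSp + Embarked + Age + Sex,
                      family = "binomial",
                      data = titanic[1:nrow(train), ])
summary(model_logistic)
```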
## 
## Call:
## glm(formula = Survived ~ Pclass + title + Fare + Family_Size + 
##     Parch + SibSp + Embarked + Age + Sex, family = "binomial", 
##     data = titanic[1:nrow(train), ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3538  -0.5529  -0.3838   0.5380   2.5442  
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error z value        Pr(>|z|)    
## (Intercept)    15.673658 535.411435   0.029        0.976646    
## Pclass2        -1.044921   0.321855  -3.247        0.001168 ** 
## Pclass3        -2.123569   0.316094  -6.718 0.0000000000184 ***
## titleMaster     3.666652   1.035631   3.541        0.000399 ***
## titleMiss     -11.400407 535.411289  -0.021        0.983012    
## titleMr         0.114645   0.904279   0.127        0.899115    
## titleMrs      -10.643087 535.411300  -0.020        0.984140    
## titleOfficers  -0.025944   1.205630  -0.022        0.982832    
## Fare            0.003503   0.002651   1.321        0.186350    
## Family_Size    -0.547278   0.126128  -4.339 0.0000143087964 ***
## Parch           0.193246   0.196892   0.981        0.326355    
## SibSp                 NA         NA      NA              NA    
## EmbarkedQ      -0.089604   0.394641  -0.227        0.820384    
## EmbarkedS      -0.430448   0.250043  -1.721        0.085161 .  
## Age            -0.024912   0.009158  -2.720        0.006524 ** 
## Sexmale       -14.488715 535.411932  -0.027        0.978411    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  725.86  on 876  degrees of freedom
## AIC: 755.86
## 
## Number of Fisher Scoring iterations: 12

SibSp returns NA values in our model because it can be expressed as a linear combination of other variables. In our case Family_Size is the culprit, since it is derived from Parch and SibSp. We therefore drop Family_Size from the model.

## 
## Call:
## glm(formula = Survived ~ Pclass + title + Fare + Parch + SibSp + 
##     Embarked + Age + Sex, family = "binomial", data = titanic[1:nrow(train), 
##     ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3538  -0.5529  -0.3838   0.5380   2.5442  
## 
## Coefficients:
##                 Estimate Std. Error z value        Pr(>|z|)    
## (Intercept)    15.126380 535.411415   0.028        0.977461    
## Pclass2        -1.044921   0.321855  -3.247        0.001168 ** 
## Pclass3        -2.123569   0.316094  -6.718 0.0000000000184 ***
## titleMaster     3.666652   1.035631   3.541        0.000399 ***
## titleMiss     -11.400407 535.411289  -0.021        0.983012    
## titleMr         0.114645   0.904279   0.127        0.899115    
## titleMrs      -10.643087 535.411300  -0.020        0.984140    
## titleOfficers  -0.025944   1.205630  -0.022        0.982832    
## Fare            0.003503   0.002651   1.321        0.186350    
## Parch          -0.354032   0.133991  -2.642        0.008237 ** 
## SibSp          -0.547278   0.126128  -4.339 0.0000143087964 ***
## EmbarkedQ      -0.089604   0.394641  -0.227        0.820384    
## EmbarkedS      -0.430448   0.250043  -1.721        0.085161 .  
## Age            -0.024912   0.009158  -2.720        0.006524 ** 
## Sexmale       -14.488715 535.411932  -0.027        0.978411    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  725.86  on 876  degrees of freedom
## AIC: 755.86
## 
## Number of Fisher Scoring iterations: 12

Our model produces very large standard errors for the title and Sex variables. We investigate this with the variance inflation factor (VIF); values below 5 are generally acceptable:

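The check can be done with car::vif on the refitted model (model_logistic2 is a hypothetical name for the model without Family_Size):

```r
library(car)

# Generalized VIF; for factors, GVIF^(1/(2*Df)) plays the role of sqrt(VIF)
vif(model_logistic2)
```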
##                     GVIF Df GVIF^(1/(2*Df))
## Pclass          2.193549  2        1.216990
## title    17719466.847276  5        5.306951
## Fare            1.619052  1        1.272420
## Parch           1.461297  1        1.208841
## SibSp           1.609432  1        1.268634
## Embarked        1.302684  2        1.068341
## Age             1.741526  1        1.319669
## Sex       7822381.571822  1     2796.852083

Sex and title generate very high VIF values! This means they are highly correlated with other variables in our model, which is understandable since title is a derived variable driven by Sex and Age. We therefore drop the Sex variable.

## 
## Call:
## glm(formula = Survived ~ Pclass + title + Fare + Parch + SibSp + 
##     Embarked + Age, family = "binomial", data = titanic[1:nrow(train), 
##     ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3534  -0.5552  -0.3850   0.5404   2.5516  
## 
## Coefficients:
##                Estimate Std. Error z value        Pr(>|z|)    
## (Intercept)    1.069610   0.886977   1.206        0.227854    
## Pclass2       -1.059341   0.320970  -3.300        0.000965 ***
## Pclass3       -2.128815   0.315280  -6.752 0.0000000000146 ***
## titleMaster    3.235751   0.943470   3.430        0.000604 ***
## titleMiss      2.646031   0.821130   3.222        0.001271 ** 
## titleMr       -0.335742   0.792192  -0.424        0.671701    
## titleMrs       3.395526   0.840794   4.038 0.0000537998599 ***
## titleOfficers -0.484886   1.121404  -0.432        0.665457    
## Fare           0.003438   0.002634   1.305        0.191839    
## Parch         -0.352041   0.133770  -2.632        0.008496 ** 
## SibSp         -0.551418   0.126103  -4.373 0.0000122677469 ***
## EmbarkedQ     -0.089235   0.394354  -0.226        0.820981    
## EmbarkedS     -0.414961   0.249782  -1.661        0.096655 .  
## Age           -0.024445   0.009130  -2.678        0.007417 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  728.03  on 877  degrees of freedom
## AIC: 756.03
## 
## Number of Fisher Scoring iterations: 5
##              GVIF Df GVIF^(1/(2*Df))
## Pclass   2.199306  2        1.217787
## title    2.760855  5        1.106890
## Fare     1.615927  1        1.271191
## Parch    1.460806  1        1.208638
## SibSp    1.609455  1        1.268643
## Embarked 1.299747  2        1.067738
## Age      1.746630  1        1.321601

Finally, we have a model with reasonable standard errors and low VIF values. In this logistic regression model, Pclass, titleMaster, titleMrs, and SibSp contribute the most. Now we apply the model to our train and test data:

Then we compare the model’s predictions to our train data:

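A sketch of this step, assuming a 0.5 probability cutoff and hypothetical object names (model_logistic_final stands for the last model above):

```r
# Predicted probability of survival on the training rows
prob_train <- predict(model_logistic_final, type = "response")
pred_train <- factor(ifelse(prob_train > 0.5, "Survived", "Dead"),
                     levels = c("Dead", "Survived"))

# Compare predictions against the known labels
confusionMatrix(pred_train,
                titanic$Survived[1:nrow(train)],
                positive = "Survived")
```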
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Dead Survived
##   Dead      482       86
##   Survived   67      256
##                                              
##                Accuracy : 0.8283             
##                  95% CI : (0.8019, 0.8525)   
##     No Information Rate : 0.6162             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.6331             
##                                              
##  Mcnemar's Test P-Value : 0.1456             
##                                              
##             Sensitivity : 0.7485             
##             Specificity : 0.8780             
##          Pos Pred Value : 0.7926             
##          Neg Pred Value : 0.8486             
##              Prevalence : 0.3838             
##          Detection Rate : 0.2873             
##    Detection Prevalence : 0.3625             
##       Balanced Accuracy : 0.8132             
##                                              
##        'Positive' Class : Survived           
## 

The accuracy of our logistic regression model is 82.83%. Not bad for a first try, is it? Now, let us try the model on the test data:

## [1] "Dead"     "Survived"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Dead Survived
##   Dead      242       10
##   Survived   24      142
##                                               
##                Accuracy : 0.9187              
##                  95% CI : (0.8882, 0.943)     
##     No Information Rate : 0.6364              
##     P-Value [Acc > NIR] : < 0.0000000000000002
##                                               
##                   Kappa : 0.8276              
##                                               
##  Mcnemar's Test P-Value : 0.02578             
##                                               
##             Sensitivity : 0.9342              
##             Specificity : 0.9098              
##          Pos Pred Value : 0.8554              
##          Neg Pred Value : 0.9603              
##              Prevalence : 0.3636              
##          Detection Rate : 0.3397              
##    Detection Prevalence : 0.3971              
##       Balanced Accuracy : 0.9220              
##                                               
##        'Positive' Class : Survived            
## 

Surprisingly, the accuracy is even better on the test data. The sensitivity and positive predictive value are also much better on the test data.

Random Forest

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Dead Survived
##   Dead      506       89
##   Survived   43      253
##                                                
##                Accuracy : 0.8519               
##                  95% CI : (0.8268, 0.8745)     
##     No Information Rate : 0.6162               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.6787               
##                                                
##  Mcnemar's Test P-Value : 0.00008975           
##                                                
##             Sensitivity : 0.7398               
##             Specificity : 0.9217               
##          Pos Pred Value : 0.8547               
##          Neg Pred Value : 0.8504               
##              Prevalence : 0.3838               
##          Detection Rate : 0.2840               
##    Detection Prevalence : 0.3322               
##       Balanced Accuracy : 0.8307               
##                                                
##        'Positive' Class : Survived             
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Dead Survived
##   Dead      249       12
##   Survived   17      140
##                                              
##                Accuracy : 0.9306             
##                  95% CI : (0.9019, 0.953)    
##     No Information Rate : 0.6364             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.8511             
##                                              
##  Mcnemar's Test P-Value : 0.4576             
##                                              
##             Sensitivity : 0.9211             
##             Specificity : 0.9361             
##          Pos Pred Value : 0.8917             
##          Neg Pred Value : 0.9540             
##              Prevalence : 0.3636             
##          Detection Rate : 0.3349             
##    Detection Prevalence : 0.3756             
##       Balanced Accuracy : 0.9286             
##                                              
##        'Positive' Class : Survived           
## 

The random forest model gives us relatively high accuracy even on a dataset with strongly correlated predictors such as the Titanic data. Considering how random forests work, building many trees on random subsets of rows and features, this is not surprising: the method is good at absorbing multicollinearity.

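For reference, a sketch of how the random forest above could be fitted with caret, reusing the 5-fold control from earlier; the predictor set and seed are assumptions:

```r
# Random forest with 5-fold cross-validation (requires the randomForest package)
set.seed(123)
model_rf <- train(Survived ~ Pclass + title + Fare + Parch + SibSp +
                    Embarked + Age + Family_Size,
                  data = titanic[1:nrow(train), ],
                  method = "rf",
                  trControl = ctrl)
```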
k-NN

Next, we build the k-nearest neighbors model. This method classifies each test observation by computing its Euclidean distance to the training observations, taking the k nearest neighbors, and assigning the class by majority vote. First, we need to convert the factor variables into numeric codes and remove the remaining categorical variables:

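A sketch of the conversion with dplyr; the numeric codes simply follow the factor level order, which matches the glimpse output below:

```r
# Keep the training rows, drop identifier/text columns,
# and turn factors into numeric codes
titanic_knn <- titanic[1:nrow(train), ] %>%
  select(-PassengerId, -Name, -Ticket, -Cabin) %>%
  mutate(Survived = as.factor(as.numeric(Survived)), # Dead = 1, Survived = 2
         Pclass   = as.numeric(Pclass),
         Sex      = as.numeric(Sex),                 # female = 1, male = 2
         Embarked = as.numeric(Embarked),            # C = 1, Q = 2, S = 3
         title    = as.numeric(title))

glimpse(titanic_knn)
```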
## Observations: 891
## Variables: 10
## $ Survived    <fct> 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, …
## $ Pclass      <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, …
## $ Sex         <dbl> 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, …
## $ Age         <dbl> 22, 38, 26, 35, 35, 29, 54, 2, 27, 14, 4, 58, 20, 39, 14,…
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, …
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, …
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625…
## $ Embarked    <dbl> 3, 1, 3, 3, 3, 2, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 2, 3, 3, …
## $ Family_Size <dbl> 2, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, 2, …
## $ title       <dbl> 4, 5, 3, 5, 4, 4, 4, 2, 5, 5, 3, 3, 4, 4, 3, 5, 2, 4, 5, …

To choose the optimal number of neighbors k, we use the caret library:

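A sketch of the tuning step, consistent with the resampling summary below (the seed and the k grid are assumptions):

```r
set.seed(123)
model_knn <- train(Survived ~ ., data = titanic_knn,
                   method = "knn",
                   trControl = trainControl(method = "cv", number = 5),
                   tuneGrid  = data.frame(k = 1:10))
model_knn
```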
## k-Nearest Neighbors 
## 
## 891 samples
##   9 predictor
##   2 classes: '1', '2' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 713, 713, 713, 712, 713 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.6947524  0.3526385
##    2  0.7037097  0.3659510
##    3  0.7261691  0.4135471
##    4  0.6880045  0.3323226
##    5  0.7070931  0.3712995
##    6  0.7126734  0.3821138
##    7  0.7171992  0.3905793
##    8  0.7138221  0.3769667
##    9  0.7239094  0.3943054
##   10  0.7059193  0.3554381
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.

Based on the calculation above, the best accuracy is achieved with 3 neighbors, so k = 3 is used for the final k-NN model.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2
##          1 500  81
##          2  49 261
##                                               
##                Accuracy : 0.8541              
##                  95% CI : (0.8292, 0.8766)    
##     No Information Rate : 0.6162              
##     P-Value [Acc > NIR] : < 0.0000000000000002
##                                               
##                   Kappa : 0.686               
##                                               
##  Mcnemar's Test P-Value : 0.00655             
##                                               
##             Sensitivity : 0.7632              
##             Specificity : 0.9107              
##          Pos Pred Value : 0.8419              
##          Neg Pred Value : 0.8606              
##              Prevalence : 0.3838              
##          Detection Rate : 0.2929              
##    Detection Prevalence : 0.3479              
##       Balanced Accuracy : 0.8370              
##                                               
##        'Positive' Class : 2                   
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2
##          1 237  16
##          2  15 150
##                                              
##                Accuracy : 0.9258             
##                  95% CI : (0.8964, 0.9491)   
##     No Information Rate : 0.6029             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.845              
##                                              
##  Mcnemar's Test P-Value : 1                  
##                                              
##             Sensitivity : 0.9036             
##             Specificity : 0.9405             
##          Pos Pred Value : 0.9091             
##          Neg Pred Value : 0.9368             
##              Prevalence : 0.3971             
##          Detection Rate : 0.3589             
##    Detection Prevalence : 0.3947             
##       Balanced Accuracy : 0.9220             
##                                              
##        'Positive' Class : 2                  
## 

Conclusion

Logistic Regression

## # A tibble: 1 x 7
##   Model Accuracy_Train Accuracy_Test Recall_Train Recall_Test Precision_Train
##   <chr>          <dbl>         <dbl>        <dbl>       <dbl>           <dbl>
## 1 Logi…          0.828         0.919        0.749       0.934           0.793
## # … with 1 more variable: Precision_Test <dbl>

Random Forest

## # A tibble: 1 x 7
##   Model Accuracy_Train Accuracy_Test Recall_Train Recall_Test Precision_Train
##   <chr>          <dbl>         <dbl>        <dbl>       <dbl>           <dbl>
## 1 Rand…          0.852         0.931        0.740       0.921           0.855
## # … with 1 more variable: Precision_Test <dbl>

k-NN

## # A tibble: 1 x 7
##   Model Accuracy_Train Accuracy_Test Recall_Train Recall_Test Precision_Train
##   <chr>          <dbl>         <dbl>        <dbl>       <dbl>           <dbl>
## 1 KNN            0.854         0.926        0.763       0.904           0.842
## # … with 1 more variable: Precision_Test <dbl>

All of the models give us roughly the same accuracy. However, in terms of precision, the random forest and k-NN models give better results on the Titanic dataset.