Introduction

The RMS Titanic sank in the early morning hours of 15 April 1912 in the North Atlantic Ocean, four days into the ship’s maiden voyage from Southampton to New York City. The largest ocean liner in service at the time, Titanic had an estimated 2,224 people on board when she struck an iceberg at around 23:40 (ship’s time) on Sunday, 14 April 1912. Her sinking two hours and forty minutes later at 02:20 (ship’s time; 05:18 GMT) on Monday, 15 April, resulted in the deaths of more than 1,500 people, making it one of the deadliest peacetime marine disasters in history. In this time, we’ll try to predict whether the passengers is survived or not.

Import Library and Setup

library(dplyr)
library(ggplot2)
library(class)
library(caret)

Data Import

This data contains 3 dataset, such as :
1. train.csv: is using for training data (complete with predictor variable)
2. test.csv : is using for testing data (without survived variable)
3. gender_submission.csv: contains id and survived variable

Let’s read the data

Data train

train <- read.csv("train.csv")
rmarkdown::paged_table(train)

Train data has 891 obs and 12 columns. We’ll use this data to make model later.

Data Test

test <- read.csv("test.csv")
rmarkdown::paged_table(test)

Test data has 418 obs and 11 columns. If we see in train data has 12 columns and in test data has 11 columns. The difference is that in train data has Survivedvariable which in test data has not. The Survived variable in test data is contained in gender_submission.csv. Let’s read the data.

survive <- read.csv("gender_submission.csv")
rmarkdown::paged_table(survive)

Some information about the data :

  • Pclass : A proxy for socio-economic status (SES)
    1st = Upper
    2nd = Middle
    3rd = Lower
  • Name : Name of the passenger
  • Sex : Gender of passenger
  • Age : Age in years
  • Sibsp : siblings / spouses aboard the Titanic
    Sibling = brother, sister, stepbrother, stepsister
    Spouse = husband, wife (mistresses and fiancés were ignored)
  • Parch : parents / children aboard the Titanic
    Parent = mother, father
    Child = daughter, son, stepdaughter, stepson
    Some children travelled only with a nanny, therefore parch=0 for them.
  • Ticket : Ticket number
  • Fare : Passenger fare
  • Cabin : Cabin number
  • Embarked : Port of Embarkation
    C = Cherbourg
    Q = Queenstown
    S = Southampton

Data Preparation and EDA

To make it easier to read, let’s join the test data and survive, so that we just only work with only two data.

test1 <- cbind(test, Survived = survive$Survived)

Now, we only have 2 data : train and test dataset. Let’s check the if there is missing value in data and the data type of data.

colSums(is.na(train))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

There is 177 missing value in age column. It was around 19% of our data. Instead of remove the data. let’s try to replace the age number with the mean of age.

train_clean <- train %>%
    mutate(Age = if_else(is.na(Age), mean(Age, na.rm = TRUE), Age))

colSums(is.na(train_clean))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Let’s check if there’s any missing value in data test.

colSums(is.na(test1))
## PassengerId      Pclass        Name         Sex         Age       SibSp 
##           0           0           0           0          86           0 
##       Parch      Ticket        Fare       Cabin    Embarked    Survived 
##           0           0           1           0           0           0

There’s some missing value in test data. To avoid NA from affecting the prediction results, we will delete data that contains NA

test_clean <- test1 %>% 
    na.omit()

In this case, Survived is the target variable, and some variables are not contain much information about the modeling. Let’s select some predictor variable and change the data type.

data_train <- train_clean %>% 
        select(-c(PassengerId, Name, Ticket, Embarked, Cabin)) %>% 
        mutate(Survived = as.factor(Survived),
           Pclass = as.factor(Pclass),
           Sex = as.factor(Sex),
           SibSp = as.factor(SibSp),
           Parch = as.factor(Parch))

data_test <- test_clean %>% 
        select(-c(PassengerId, Name, Ticket, Embarked, Cabin)) %>% 
        mutate(Survived = as.factor(Survived),
           Pclass = as.factor(Pclass),
           Sex = as.factor(Sex),
           SibSp = as.factor(SibSp),
           Parch = as.factor(Parch))

Let’s check the structure of data train by using str()

str(data_train)
## 'data.frame':    891 obs. of  7 variables:
##  $ Survived: Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass  : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 ...
##  $ SibSp   : Factor w/ 7 levels "0","1","2","3",..: 2 2 1 2 1 1 1 4 1 2 ...
##  $ Parch   : Factor w/ 7 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 3 1 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...

Before we continue to modeling, let’s check the proportion of our data in class target.

prop.table(table(data_train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

In the train dataset, the proportion of our data was around 61,61% not survived and around 38,38% survived. It’s balanced enough. Let’s continue to make the model.

Modeling

Logistic Regression

m_LR1 <- glm(Survived~., data_train, family = "binomial") 
summary(m_LR1)
## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = data_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8306  -0.6225  -0.4347   0.6019   2.4601  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.510e+00  4.662e-01   7.531 5.05e-14 ***
## Pclass2     -1.057e+00  2.987e-01  -3.539 0.000402 ***
## Pclass3     -2.016e+00  2.946e-01  -6.841 7.86e-12 ***
## Sexmale     -2.719e+00  2.007e-01 -13.544  < 2e-16 ***
## Age         -3.639e-02  8.343e-03  -4.362 1.29e-05 ***
## SibSp1       8.279e-02  2.239e-01   0.370 0.711570    
## SibSp2      -2.858e-01  5.342e-01  -0.535 0.592645    
## SibSp3      -2.353e+00  7.175e-01  -3.280 0.001040 ** 
## SibSp4      -1.773e+00  7.687e-01  -2.306 0.021108 *  
## SibSp5      -1.611e+01  9.549e+02  -0.017 0.986541    
## SibSp8      -1.609e+01  7.542e+02  -0.021 0.982984    
## Parch1       3.808e-01  2.891e-01   1.317 0.187769    
## Parch2       4.165e-02  3.789e-01   0.110 0.912462    
## Parch3       2.485e-01  1.040e+00   0.239 0.811241    
## Parch4      -1.609e+01  1.053e+03  -0.015 0.987805    
## Parch5      -1.329e+00  1.168e+00  -1.138 0.255228    
## Parch6      -1.671e+01  2.400e+03  -0.007 0.994444    
## Fare         2.810e-03  2.483e-03   1.132 0.257803    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  768.71  on 873  degrees of freedom
## AIC: 804.71
## 
## Number of Fisher Scoring iterations: 15

From model m_LR1, we see that so many variables is not signicificant effected the survival rate. Let’s try do model fitting by using stepwise with backward method.

m_LR2 <- step(object = glm(Survived~., data_train, family = "binomial"),
                    direction = "backward")
## Start:  AIC=804.71
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare
## 
##          Df Deviance     AIC
## - Parch   6   777.98  801.98
## - Fare    1   770.13  804.13
## <none>        768.71  804.71
## - SibSp   6   794.41  818.41
## - Age     1   789.10  823.10
## - Pclass  2   817.59  849.59
## - Sex     1   997.14 1031.14
## 
## Step:  AIC=801.98
## Survived ~ Pclass + Sex + Age + SibSp + Fare
## 
##          Df Deviance     AIC
## - Fare    1   779.24  801.24
## <none>        777.98  801.98
## - SibSp   6   805.17  817.17
## - Age     1   805.91  827.91
## - Pclass  2   836.86  856.86
## - Sex     1  1013.15 1035.15
## 
## Step:  AIC=801.24
## Survived ~ Pclass + Sex + Age + SibSp
## 
##          Df Deviance     AIC
## <none>        779.24  801.24
## - SibSp   6   805.29  815.29
## - Age     1   808.21  828.21
## - Pclass  2   880.32  898.32
## - Sex     1  1020.51 1040.51
summary(m_LR2)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp, family = "binomial", 
##     data = data_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8259  -0.5998  -0.4326   0.6147   2.4463  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   3.914045   0.406967   9.618  < 2e-16 ***
## Pclass2      -1.220029   0.264067  -4.620 3.83e-06 ***
## Pclass3      -2.288869   0.245340  -9.329  < 2e-16 ***
## Sexmale      -2.708998   0.195246 -13.875  < 2e-16 ***
## Age          -0.041263   0.008001  -5.157 2.51e-07 ***
## SibSp1        0.142801   0.210125   0.680  0.49676    
## SibSp2       -0.142497   0.519745  -0.274  0.78396    
## SibSp3       -2.073128   0.685422  -3.025  0.00249 ** 
## SibSp4       -1.668710   0.744919  -2.240  0.02508 *  
## SibSp5      -16.004773 956.874492  -0.017  0.98666    
## SibSp8      -15.833751 753.839723  -0.021  0.98324    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  779.24  on 880  degrees of freedom
## AIC: 801.24
## 
## Number of Fisher Scoring iterations: 15

Model m_LR2 which use stepwise has AIC of 801.24 and m_LR1 has AIC of 804.71. We’ll using model m_LR2 to predict.

Prediction

data_test$prob <- predict(m_LR2, type = "response", newdata = data_test)

Let’s see the distribution of predicted data.

ggplot(data_test, aes(x=prob)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()

From the plot, we can see that the data interpreted the distribution of prediction data lean more towards 0 which means Not Survived. If we set the threshold is 0.5 (by default). Let’s give the label of the prediction data by :

  1. If the probability is greater than 0.5, the label will get 1 (survived).
  2. If the probability is lower than 0.5, the label will get 0 (not survived)
data_test$pred_label <- factor(ifelse(data_test$prob > 0.5, 1, 0)) #1 = survived, 0 = not survived
data_test %>% 
    select(Survived, pred_label) %>% 
    head()
##   Survived pred_label
## 1        0          0
## 2        1          0
## 3        0          0
## 4        0          0
## 5        1          1
## 6        0          0

K-NN Model

Because of K-NN is counting based on distance, we should make sure our data is only contain numeric data.

#Select data train and data test without PassengerId, Name, Sex, Ticket, Cabin, and Embarked
train_k <- train_clean %>% 
    select(-c(PassengerId, Name, Sex, Ticket, Cabin, Embarked))

test_k <- test_clean %>% 
    select(-c(PassengerId, Name, Sex, Ticket, Cabin, Embarked))

After select the predictor and target, let’s scale the data because the scale in some variable like fare is not the same with Age

#scale the predictor and the target variable(Survived don't need to be scale)
train_x <- scale(train_k[, -1]) 
test_x <- scale(test_k[, -6], center = attr(train_x, "scaled:center"), 
                scale = attr(train_x, "scaled:scale")) 

Then, let’s create the target Variable Survived

#Create the target variable in data train and data test
train_y <- train_k$Survived 
test_y <- test_k$Survived 

Before we go ahead, let’s find the optimum k-number of KNN based on the number of observation in data train.

k <- sqrt(nrow(train_x)) 
k
## [1] 29.84962

We, get the number of k = 29.84962 and we’ll use 29 for the k. Use the function knn() from package class to do modeling. In KNN, there is no need to do the prediction. We just need to make the model, and the prediction is automatically done in modeling.

model_knn <- knn(train = train_x, test = test_x, cl = train_y, k = 29)

Let’s see some sample of model_knn

head(model_knn,10)
##  [1] 0 0 0 0 0 0 0 1 0 0
## Levels: 0 1

Evaluation Model

We have made model and prediction with logistic regression and K-NN model, let’s make some model evaluation to see if our model is good enough. Before we do some evaluation model, let’s get to know what metric will we value more.

TN : True Negative FP : False Positive FN : False Negative TP : True Positive

There is 4 types of metrics in evaluation model, such as :

  • Accuracy: how precisely our model predicts the target class (globally)

(TP+TN/TOTAL)

  • Sensitivity/ Recall: measure of the model’s goodness of the positive class

(TP/(TP+FN))

  • Specificity: a measure of the model’s goodness of the negative class

(TN/(TN+FP))

  • Pos Pred Value/Precision: how precision the model predicts positive classes

(TP/(TP+FP))

In this case, the positive class (1) is survived. We’ll use sensitivity / recall metric for model evaluation in this case, because we don’t wish to see the False Negative class (actually survived but the model predict not survived).

Logistic Regression

For model evaluation, we’ll use confusionMatrix() function from package caret

confusionMatrix(reference = as.factor(data_test$Survived), 
                data = as.factor(data_test$pred_label), positive = "1" )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 191   5
##          1  13 122
##                                           
##                Accuracy : 0.9456          
##                  95% CI : (0.9154, 0.9675)
##     No Information Rate : 0.6163          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8864          
##                                           
##  Mcnemar's Test P-Value : 0.09896         
##                                           
##             Sensitivity : 0.9606          
##             Specificity : 0.9363          
##          Pos Pred Value : 0.9037          
##          Neg Pred Value : 0.9745          
##              Prevalence : 0.3837          
##          Detection Rate : 0.3686          
##    Detection Prevalence : 0.4079          
##       Balanced Accuracy : 0.9485          
##                                           
##        'Positive' Class : 1               
## 

From the confusion matrix of the prediction using logistic regression model, we can get information such as:
1. The model can predict correctly in positive class and negative class base on the real data is 94.56% (Accuracy)
2. The measurement of the model’s goodness of the positive class (survived) / Sensitivity is 96.06%
3. Based on the Specificity metrics, we assume the model is good enough to predict the data test.

Model K-NN

Let’s try the model evaluation of K-NN model.

confusionMatrix(data = as.factor(model_knn), reference = as.factor(test_y), 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 151  63
##          1  53  64
##                                           
##                Accuracy : 0.6495          
##                  95% CI : (0.5955, 0.7009)
##     No Information Rate : 0.6163          
##     P-Value [Acc > NIR] : 0.1173          
##                                           
##                   Kappa : 0.2478          
##                                           
##  Mcnemar's Test P-Value : 0.4034          
##                                           
##             Sensitivity : 0.5039          
##             Specificity : 0.7402          
##          Pos Pred Value : 0.5470          
##          Neg Pred Value : 0.7056          
##              Prevalence : 0.3837          
##          Detection Rate : 0.1934          
##    Detection Prevalence : 0.3535          
##       Balanced Accuracy : 0.6221          
##                                           
##        'Positive' Class : 1               
## 

From the confusion matrix of the prediction using K-NN model, we can get information such as:
1. The model can predict correctly in positive class and negative class base on the real data is 64.95% (Accuracy)
2. The measurement of the model’s goodness of the positive class (survived) / Sensitivity is 50.39%
3. Based on the Specificity metrics, we assume the model has not good enough and we’ll try to tune the model a bit.

Model Improvement

In this case, logistic regression model has perdict the data test well. So, we’ll try to tune the model in K-NN. Let’s try tune the model by balancing the class target.

prop.table(table(train_k$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

The proportion of survived and not survived in data train train_k is not balance enough. Let’s do down-sampling to make it balance.

#do down sampling in train data
train_tune <- downSample(x = train_k[, -1], y = as.factor(train_k[, 1]), yname = "Survived")
rmarkdown::paged_table(train_tune)

The number of rows get decrease because we do down-sampling to balancing the proportion in class target Survived. let’s check the proportion of class target.

prop.table(table(train_tune$Survived))
## 
##   0   1 
## 0.5 0.5

Now, the propotion is balance enough. Let’s do the K-NN modeling and check the evaluation model. Is it get better result ? P.s : we only make some improvement in data train, and the data test in still the same.

#the column of Survived (target) is in column 6
train_xT <- scale(train_tune[, -6]) 
train_yT <- train_tune$Survived 

#make K-NN modeling by the data which we have balancing and store in model_knnT
model_knnT <- knn(train = train_xT, test = test_x, cl = train_yT, k = 29)

#check the evaluation model again of model_knnT by using confusion Matrix
confusionMatrix(data = as.factor(model_knnT), reference = as.factor(test_y), 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 104  39
##          1 100  88
##                                           
##                Accuracy : 0.5801          
##                  95% CI : (0.5249, 0.6338)
##     No Information Rate : 0.6163          
##     P-Value [Acc > NIR] : 0.9206          
##                                           
##                   Kappa : 0.1859          
##                                           
##  Mcnemar's Test P-Value : 3.597e-07       
##                                           
##             Sensitivity : 0.6929          
##             Specificity : 0.5098          
##          Pos Pred Value : 0.4681          
##          Neg Pred Value : 0.7273          
##              Prevalence : 0.3837          
##          Detection Rate : 0.2659          
##    Detection Prevalence : 0.5680          
##       Balanced Accuracy : 0.6014          
##                                           
##        'Positive' Class : 1               
## 

The previous K-NN model we have done model_knn which the class target has not balance only got the score of Sensitivity is 50.39%. After we balancing the proportion of positive and negative class in target Survived, we got the Sensitivity score of 65.35%. This K-NN model model_knnT is better than model_knn by evaluating with Sensitivity metric.

Conclusion

The titanic ship accident that occurred in 1912 might have caused trauma for Survivors. If I am a psychologist who want to help the Survivors overcome the trauma, I’ll use the Sensitivity metric. I can tolerate False Positive rather than False Negative. I don’t wish Survivor who actually survived, but we predict not survive. Therefore, we can get more people to get treatment. For the model of prediction, I rather use logistic regression rather than K-NN model for this case.