The RMS Titanic in port during 1912 (Source: Universal History Archive/Getty Images)

The RMS Titanic in port during 1912 (Source: Universal History Archive/Getty Images)

An overview

This is a continuation of Titanic Survival Prediction Part 1 where we tried to predict whether a passenger will survived or not using the two algorithms in classification: the logistic regression and k-Nearest Neighbor. The ending of Part 1 tells us which of the two predictive models that produced a better performance in general. As it turns out, the prediction that was generated from the kNN model has a slightly better output compared to the one we just made with logistic regression.

Now, in part 2, we will explore the titanic survival prediction with two other algorithms that are widely used in classification, the Naive Bayes and Decision Trees.

Let’s get started!

Import files and packages

We were provided with three different datasets to work on (the train, test, and gender_submission). What we are going to do is to load all of the datasets and to merge the “gender_submission.csv” dataset with the “test.csv”. This is done in order to add column “Survived” from the gender_submission to the entire dataset in “test.csv”.

library(readr)
# Import files
# data train
data_train <- read_csv("Titanic/train.csv")

# data test
data_test <- read_csv("Titanic/test.csv")
gender <- read_csv("Titanic/gender_submission.csv")

# merging column survived in "gender" with "data_test"
data_test_merge <- merge(data_test, gender, by = "PassengerId")

Inspect the first six rows of “data_train” and “data_test_merge”.

head(data_train)
head(data_test)
head(gender)
head(data_test_merge)

Columns description:

  • Survived: Survival. 0 = No, 1 = Yes.
  • Pclass: Ticket class. A proxy for socio-economic status (SES)
    • 1st = Upper
    • 2nd = Middle
    • 3rd = Lower
  • Sex: Sex
  • Age: Age in years. Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
  • SibSp: of siblings / spouses aboard the Titanic. The dataset defines family relations in this way:
    • Sibling = brother, sister, stepbrother, stepsister
    • Spouse = husband, wife (mistresses and fiancés were ignored)
  • parch: of parents / children aboard the Titanic. The dataset defines family relations in this way:
    • Parent = mother, father
    • Child = daughter, son, stepdaughter, stepson
    • Some children travelled only with a nanny, therefore parch=0 for them.
  • Ticket: Ticket’s number
  • Fare: Passenger fare (In Britain’s currency / Pound sterling (£))
  • Cabin: Cabin’s number
  • Embarked: Port of embarkation.
    • C = Cherbourg
    • Q = Queenstown
    • S = Southampton

Data wrangling

In this step, we would merge both of the datasets into a single object called “titanic” and to perform EDA with the whole dataset. But before we do that, we need to add a new column to indicates which rows that belongs to the training or test dataset. This is done to help us in the cross validation later.

# add new column
data_train$Separators <- "TRAIN"
data_test_merge$Separators <- "TEST"
# merge the training and test dataset into one
titanic <- rbind(data_train, data_test_merge)

# check dimensions
dim(titanic)
## [1] 1309   13
dim(data_train)
## [1] 891  13
dim(data_test_merge)
## [1] 418  13
# maximum missing values that we tolerated for each columns
nrow(titanic)*0.05
## [1] 65.45
# check missing values from all columns
colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         263 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           1        1014           2 
##  Separators 
##           0

Inspect data types for each columns.

library(dplyr)
glimpse(titanic)
## Rows: 1,309
## Columns: 13
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Survived    <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
## $ Pclass      <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal~
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
## $ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
## $ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C~
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
## $ Separators  <chr> "TRAIN", "TRAIN", "TRAIN", "TRAIN", "TRAIN", "TRAIN", "TRA~

For data wrangling, we would need to:

  1. Change columns to its correct data types
  • Columns that need to be converted to factor: Survived, Pclass, Sex and Embarked.
  1. Drop column Cabin, Name, PassengerId and Ticket.
  • PassengerId only stored the identifier numbers for each passengers and doesn’t contain any valuable informations for us to analyse later.
  • For Cabin, we will drop it since it has a total of missing values that counts more than half of the total observations.
  • In the case of column Name, to be honest I think it actually stored valuable information that we can use to explore the family relations among the titanic passengers. It is a very useful variable to use when we are exploring the data sets with more advanced algorithm in classification. We could have keep it as it is but since we are using a much simpler model, that is with logistic regression and kNN, we would not use it to create our models.
  • Ticket has a character data type and stored no valuable information.
# change columns data types into the correct ones
titanic <- titanic %>% 
  mutate_at(.vars = c("Survived", "Pclass", "Sex", "Embarked"), .funs = as.factor) %>% 
  select(-c(PassengerId, Cabin, Name, Ticket))

head(titanic)

Treatment for missing values

# check again for columns that have missing values
colSums(is.na(titanic))
##   Survived     Pclass        Sex        Age      SibSp      Parch       Fare 
##          0          0          0        263          0          0          1 
##   Embarked Separators 
##          2          0

We have located the columns that have missing values and those columns consisted of Age, Fare and Embarked. Since Embarked is a factor, we will replace the two missing values with its mode or the value that appeared most frequently. As for the Age column, we can either replace the 177 missing values with the mean or median. If there’s no outlier, then we’ll replace them with mean but if it does, then it is better to use the median number instead. The same goes for column Fare.

# check the mode for column `Embarked`
table(titanic$Embarked)
## 
##   C   Q   S 
## 270 123 914

Turns out, most of the passengers in the Titanic embarked from the Southampton (S) port with 914 passengers in total.

# replace missing values in column `Embarked` with the mode (Southampton port)
which(is.na(titanic$Embarked)) # locate missing values index in Embarked
## [1]  62 830
library(psych)
## Warning: package 'psych' was built under R version 4.1.2
library(tracerer)
## Warning: package 'tracerer' was built under R version 4.1.2
titanic <- titanic %>% 
  mutate(Embarked = if_else(is.na(Embarked), calc_mode(Embarked), Embarked))
# check outliers in Age
boxplot(titanic$Age)

Since we have plenty of outliers in the Age column, it is better to replace the missing values with the median.

# replace missing values in column `Age` with the median
median(titanic$Age, na.rm = T) # check the median age from all passengers without taking into account the NA
## [1] 28
titanic <- titanic %>% 
  mutate(Age = if_else(is.na(Age), median(Age, na.rm = T), Age)) %>% 
  mutate(Fare = if_else(is.na(Fare), median(Fare, na.rm = T), Fare))
# check again whether we still have missing values or not
anyNA(titanic)
## [1] FALSE

Great! All variables within the titanic dataframe is now stored in their correct data types and has no more missing values. So far, we have performed data cleansing and we’re ready to go and create classification models from it.

titanic

Exploratory Data Analysis

summary(titanic)
##  Survived Pclass      Sex           Age            SibSp            Parch      
##  0:815    1:323   female:466   Min.   : 0.17   Min.   :0.0000   Min.   :0.000  
##  1:494    2:277   male  :843   1st Qu.:22.00   1st Qu.:0.0000   1st Qu.:0.000  
##           3:709                Median :28.00   Median :0.0000   Median :0.000  
##                                Mean   :29.50   Mean   :0.4989   Mean   :0.385  
##                                3rd Qu.:35.00   3rd Qu.:1.0000   3rd Qu.:0.000  
##                                Max.   :80.00   Max.   :8.0000   Max.   :9.000  
##       Fare         Embarked  Separators       
##  Min.   :  0.000   C:270    Length:1309       
##  1st Qu.:  7.896   Q:123    Class :character  
##  Median : 14.454   S:916    Mode  :character  
##  Mean   : 33.281                              
##  3rd Qu.: 31.275                              
##  Max.   :512.329
library(tidyverse)

titanic %>% 
  ggplot(aes(Pclass, Fare)) + geom_boxplot(aes(fill = Pclass)) +
  scale_fill_manual(name = "Ticket class",
                    labels = c("1st","2nd","3rd"),
                    values = c('#732A24', '#735246', '#D9B4A7')) +
  labs(title = "Fare distribution for each ticket class",
       x = "Ticket class",
       y = "Fare") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "top")

The median fare for the first class is approximately at £60 with a few outliers that cost roughly more than £200. The 2nd and 3rd class tickets cost less than £50 in general.

# the median for ticket's fare for each class
titanic %>% 
  group_by(Pclass) %>% 
  summarise(fare = median(Fare))

The median fare for each ticket class: * 1st class (Upper): 60 * 2nd class (Middle): 15.04 * 3rd class (Lower): 8.05

titanic %>%
  ggplot(aes(x=Embarked)) + geom_bar(aes(fill = Pclass), position = "dodge") +
  scale_fill_manual(name = "Ticket class",
                    labels = c("1st","2nd","3rd"),
                    values = c('#732A24', '#735246', '#D9B4A7')) +
  labs(title = "Distribution of ticket class in different ports",
       x = "Port of embarkation",
       y = "Total passengers") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "top")

table(titanic$Embarked)
## 
##   C   Q   S 
## 270 123 916
prop.table(table(titanic$Embarked))
## 
##          C          Q          S 
## 0.20626432 0.09396486 0.69977082

There were three different ports of embarkation; The Cherbourg (C), Queenstown (Q) and Southampton (S) port. With total passengers such as follows: 1. Southampton = 916 passengers 2. Cherbourg = 270 passengers 3. Queenstown = 123 passengers

Other insights we can get from the bar plot above are: 1) 69.97% of titanic passengers were boarded from the Southampton port. 2) Cherbourg has the highest proportion of 1st class passengers compared to the other ports. 3) The majority of titanic passengers that embarked from Queenstown were consisted of 3rd class passengers and has a very low number of 1st and 2nd class passengers. 4) Approximately half of the titanic passengers that embarked from Southampton came from the 3rd class passengers. Also, unlike the other two ports, there’s not a huge gap of proportion between the 1st and 2nd class.

Machine Learning

As we have mentioned earlier, we will create a prediction of whether a passenger will survived or not using two different classification models; the logistic regression and the k-Nearest Neighbor (kNN).

The first step in creating a prediction with both models is to decide which of the variable that we want to set as the target and the predictor variables. Now, because our aim is to predicted which passengers that will survived or not, we will use the Survived column as the target variable while the rest of the variables will be used as the predictors.

Cross-validation

# data test
titanic_train <- titanic %>%
  filter(Separators == "TRAIN")

# data train
titanic_test <- titanic %>% 
  filter(Separators == "TEST")
unique(titanic_train$Separators)
## [1] "TRAIN"
unique(titanic_test$Separators)
## [1] "TEST"
# remove column Separators
titanic_train <- titanic_train %>% select(-Separators)
titanic_test <- titanic_test %>% select(-Separators)
dim(titanic_train)
## [1] 891   8
dim(titanic_test)
## [1] 418   8
  1. Target variable: Column Survived. We will also take “1” (Survived) as the positive class for our models.
  2. Predictor variable(s): All variables beside column Survived

Check whether the target variables have balanced or imbalanced class.

prop.table(table(titanic_train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384

The proportion for our target class seems to be quite balanced and at this stage, we won’t try to downsampling / upsampling the target variable just yet.

Naive Bayes

Model fitting

Naive bayes works well with predictor variables that are categorical, therefore we would try to use categorical variables from the titanic_train dataset which are the Pclass, Sex and Embarked columns. The algorithm also assumed that each of the predictors are independent to each other.

library(e1071)
## Warning: package 'e1071' was built under R version 4.1.2
model_titanic <- naiveBayes(Survived~Pclass + Sex + Embarked, data = titanic_train, laplace = 1)
model_titanic
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.6161616 0.3838384 
## 
## Conditional probabilities:
##    Pclass
## Y           1         2         3
##   0 0.1467391 0.1775362 0.6757246
##   1 0.3971014 0.2550725 0.3478261
## 
##    Sex
## Y      female      male
##   0 0.1488203 0.8511797
##   1 0.6802326 0.3197674
## 
##    Embarked
## Y            C          Q          S
##   0 0.13768116 0.08695652 0.77536232
##   1 0.27246377 0.08985507 0.63768116

A few interpretations that we can get from the model above: - 39.71% survivors were from the Upper class. - Compared to the other two classes, the lower class ticket holders have the highest numbers of passengers who didn’t survive (67.57%). - 68% of the survivors were females and 85.11% of passengers who died were males. - Since most of the titanic passengers embarked from Port Southampton (72.5%), it is not odd to see that it has the highest proportion of passengers who did (63.76%) and did not survived (77.53%).

Model Prediction

survival_predClass <- predict(object = model_titanic, newdata = titanic_test, type = "class")
head(survival_predClass)
## [1] 0 1 0 0 1 0
## Levels: 0 1

Model evaluation

For model evaluation, we can use both confusion matrix and ROC - AUC to see whether our model has performed well or not.

# confusion matrix
library(caret)
confusionMatrix(data = as.factor(survival_predClass), reference = as.factor(titanic_test$Survived), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 238   0
##          1  28 152
##                                          
##                Accuracy : 0.933          
##                  95% CI : (0.9046, 0.955)
##     No Information Rate : 0.6364         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8608         
##                                          
##  Mcnemar's Test P-Value : 3.352e-07      
##                                          
##             Sensitivity : 1.0000         
##             Specificity : 0.8947         
##          Pos Pred Value : 0.8444         
##          Neg Pred Value : 1.0000         
##              Prevalence : 0.3636         
##          Detection Rate : 0.3636         
##    Detection Prevalence : 0.4306         
##       Balanced Accuracy : 0.9474         
##                                          
##        'Positive' Class : 1              
## 

From the summary above we get a result of 93.3% accuracy, a recall of 100% and 84.44% for precision. However, as our data is imbalanced (the proportion between TN (True Negative) and TP (True Positive)), it is better for us to use the two other metrics beside accuracy which are the precision and recall. Here is a brief explanation for both metrics:

Precision: The number of true positives divided by the total number of true positives and false positives. Recall: The number of true positives divided by the total number of true positives and false negatives.

Model evaluation with Receiver Operating Curve (ROC) and Area Under Curve (AUC).

  • Receiver Operating Curve (ROC)

According to Analytics Vidhya, ROC or Receiver Operating Curve can be defined as “an evaluation metric for binary classification problems. It is a probability curve that plots the TPR against FPR at various threshold values”. TPR in here means the True Positive Rate from our target class while FPR stands for False Positive Rate.

# prediction
pred_titanicProb <- predict(object = model_titanic, newdata = titanic_test, type = "raw")
head(pred_titanicProb)
##              0          1
## [1,] 0.8892999 0.11070006
## [2,] 0.4534273 0.54657267
## [3,] 0.7421463 0.25785367
## [4,] 0.9098572 0.09014285
## [5,] 0.4534273 0.54657267
## [6,] 0.9098572 0.09014285
# prepare data frame for ROC
roc <- data.frame(pred_prob = pred_titanicProb[, "1"],
                  actual = titanic_test$Survived)
roc

Create a ROC prediction object and then to make plots from it.

# install.packages("ROCR", dependencies = T)
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.1.2
# create a prediction object
roc_pred <- prediction(predictions = roc$pred_prob, 
                       labels = roc$actual)

# create a plot from the roc_pred
plot(performance(prediction.obj = roc_pred, measure = "tpr", x.measure = "fpr"))

Our model is quite good at classifying both the positive and negative class.

  • Area Under The ROC Curve (AUC)

AUC represents the area under the curve of a ROC plot. We can determine whether our model has a good performance or not by looking at its AUC. If it’s closed to 1, then it means that our model is capable to correctly classified both of the negative and positive class.

auc <- performance(prediction.obj = roc_pred, measure = "auc")
auc@y.values # to display the AUC from the ROC plot above
## [[1]]
## [1] 0.9716066

Our model has an AUC of 0.9716066–which is closed to 1. It means that our model performed quite well in predicting both of the target class.

Decision Trees

# create a decision tree model
library(partykit)
## Warning: package 'partykit' was built under R version 4.1.2
## Warning: package 'libcoin' was built under R version 4.1.2
model_dt <- ctree(formula = Survived~., data = titanic_train)
model_dt
## 
## Model formula:
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked
## 
## Fitted party:
## [1] root
## |   [2] Sex in female
## |   |   [3] Pclass in 1, 2: 1 (n = 170, err = 5.3%)
## |   |   [4] Pclass in 3
## |   |   |   [5] Fare <= 23.25: 1 (n = 117, err = 41.0%)
## |   |   |   [6] Fare > 23.25: 0 (n = 27, err = 11.1%)
## |   [7] Sex in male
## |   |   [8] Pclass in 1: 0 (n = 122, err = 36.9%)
## |   |   [9] Pclass in 2, 3
## |   |   |   [10] Age <= 9
## |   |   |   |   [11] SibSp <= 2: 1 (n = 16, err = 0.0%)
## |   |   |   |   [12] SibSp > 2: 0 (n = 14, err = 7.1%)
## |   |   |   [13] Age > 9: 0 (n = 425, err = 11.1%)
## 
## Number of inner nodes:    6
## Number of terminal nodes: 7
# visualized the result
plot(model_dt, type = "simple")

An example on how to interpret the decision trees model: 170 passengers who are female and came from the 1st or 2nd class are predicted to survived with 5.3% of errors.

Model Evaluation

Just like we did in naive bayes, we can use confusion matrix to evaluate our decisions tree model.

# prediksi kelas di data test
survived_pred <- predict(object = model_dt, newdata = titanic_test, type = "response")
head(survived_pred)
## 1 2 3 4 5 6 
## 0 1 0 0 1 0 
## Levels: 0 1
# confusion matrix data test
confusionMatrix(data = as.factor(survived_pred),
                reference = titanic_test$Survived, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 258   6
##          1   8 146
##                                           
##                Accuracy : 0.9665          
##                  95% CI : (0.9444, 0.9816)
##     No Information Rate : 0.6364          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9278          
##                                           
##  Mcnemar's Test P-Value : 0.7893          
##                                           
##             Sensitivity : 0.9605          
##             Specificity : 0.9699          
##          Pos Pred Value : 0.9481          
##          Neg Pred Value : 0.9773          
##              Prevalence : 0.3636          
##          Detection Rate : 0.3493          
##    Detection Prevalence : 0.3684          
##       Balanced Accuracy : 0.9652          
##                                           
##        'Positive' Class : 1               
## 

For model with decision tree, we get a quite good result as it scored 96.65% for the accuracy, 96% in recall and a precision of 94.81%. However, instead of using accuracy, we will focused our attention to the other two metrics; recall (96%) and precision (94.81%). It seems that our model performed pretty well, judging from the confusion matrix above.

Conclusion

In Part 2, we managed to create a titanic survival prediction with naive bayes and decision trees, which are the two algorithms that are widely used in classification problems. The model prediction with naive bayes used three variables as its predictors (Pclass, Sex, Embarked) and has a quite good performance for the returns. It has a recall of 100%, 84.44% for precision and roughly 0.97 for its AUC (model evaluation using ROC-AUC).

While the decision tree model that used all of the variables as its predictors has an output of 96% recall and 94.81% precision (which is 10% lower than the one we got from naive bayes).

Overall, our naive bayes model has a better performance compared to the decision tree model. This was shown by having zero false negatives predictions with a slightly lower score in precision.