Titanic: Survival Prediction using Machine Learning

Introduction

Titanic sinking illustration.

The sinking of the Titanic is one of the most infamous shipwrecks in history.

Royal Mail Ship (RMS) Titanic, a British passenger liner, sank in the North Atlantic Ocean on 15 April 1912 after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died. There were not enough lifeboats, and because of a “women and children first” protocol for loading them, a disproportionate number of men were left aboard. This project aims to predict passenger survival aboard the Titanic using a dataset from Kaggle.

Data Analysis

Data Preparation

Load Libraries

library(tidyverse) # data manipulation
library(caret) # machine learning workflow and cross-validation
library(e1071) # building naive bayes model
library(partykit) # building decision tree model
library(randomForest) # building random forest model
library(mice) # imputation
library(GGally) # correlation checking (ggcorr)
library(RColorBrewer) # visualization
library(yardstick) # model evaluation
library(plotly) # interactive visualization

Load the Dataset

train <- read.csv("train.csv")
test <- read.csv("test.csv")
test_result <- read.csv("gender_submission.csv")
test <- inner_join(test, test_result, by = "PassengerId")
titanic <- rbind(train, test)
str(titanic)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

There are 12 variables in this dataset, and Survived will be the target variable.

Dataset Explanations:
* PassengerId: Unique passenger ID
* Pclass: Ticket class (1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower))
* Name: Name of the passenger
* Sex: Sex of the passenger
* Age: Age of the passenger in years
* SibSp: Number of siblings/spouses aboard the Titanic
* Parch: Number of parents/children aboard the Titanic
* Ticket: Ticket number
* Fare: Passenger fare
* Cabin: Cabin number
* Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
* Survived: Survival status (0 = No, 1 = Yes)

Data Wrangling

Convert data types

titanic <- titanic %>% mutate(
  Survived = factor(Survived, levels = c(0, 1), labels = c("No", "Yes")), 
  Pclass = factor(Pclass, levels = c(1, 2, 3), labels = c("1st", "2nd", "3rd")),
  Sex = factor(Sex, levels = c("female", "male"), labels = c("Female", "Male")),
  Embarked = factor(Embarked)
  )

We also have the Name variable, which on its own is unlikely to provide useful information to the model. However, we can extract each passenger’s title from it and see whether that is useful.

# Modifying name
titanic <- titanic %>% separate(Name, into = c("LastName", "firstname"), sep = ", ") %>%
  separate(firstname, into = c("Title", "FirstName"), sep = "\\. ")
## Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [514].
titanic[titanic$Title == "Mlle", ]$Title <- "Miss"
titanic[titanic$Title == "Mme", ]$Title <- "Mrs"
titanic[titanic$Title == "Ms", ]$Title <- "Miss"

table(titanic$Title)
## 
##         Capt          Col          Don         Dona           Dr     Jonkheer 
##            1            4            1            1            8            1 
##         Lady        Major       Master         Miss           Mr          Mrs 
##            1            2           61          264          757          198 
##          Rev          Sir the Countess 
##            8            1            1

There were 18 distinct titles originally; after merging Mlle, Mme, and Ms into Miss and Mrs, 15 remain, some of them quite rare. I will combine the rare titles into a single category.

titanic$Title <- ifelse(titanic$Title %in% c("Master", "Miss", "Mr", "Mrs"), titanic$Title, "Other")
titanic$Title <- as.factor(titanic$Title)
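
To verify the consolidation, we can re-run the frequency table (the counts also appear in the data summary later on):

table(titanic$Title)
## 
## Master   Miss     Mr    Mrs  Other 
##     61    264    757    198     29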

This dataset also gives information on the number of parents/children (Parch) and siblings/spouses (SibSp) each passenger had aboard the Titanic. We can create a new variable representing a passenger’s family size from these two variables.

titanic$FamilySize <- 1 + titanic$SibSp + titanic$Parch
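
FamilySize counts the passenger plus their siblings/spouses and parents/children aboard; a passenger with 1 sibling and 2 parents aboard, for example, gets FamilySize = 4. A quick frequency table shows how common each family size is:

table(titanic$FamilySize)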

Then, check for missing values.

anyNA(titanic)
## [1] TRUE
colSums(is.na(titanic))
## PassengerId    Survived      Pclass    LastName       Title   FirstName 
##           0           0           0           0           0           0 
##         Sex         Age       SibSp       Parch      Ticket        Fare 
##           0         263           0           0           0           1 
##       Cabin    Embarked  FamilySize 
##           0           0           0

There are 263 missing values in passengers’ Age and 1 missing value in Fare. Let’s look at the record with the missing fare first.

titanic %>% filter(is.na(Fare))
##   PassengerId Survived Pclass LastName Title FirstName  Sex  Age SibSp Parch
## 1        1044       No    3rd   Storey    Mr    Thomas Male 60.5     0     0
##   Ticket Fare Cabin Embarked FamilySize
## 1   3701   NA              S          1

The missing Fare belongs to Mr. Thomas Storey from the 3rd class. We can impute the missing value with the mean or median, depending on the data distribution.

hist(titanic$Fare)
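
The histogram shows a strong right skew. As a numeric check, we can compute the skewness (e1071, already loaded, provides skewness(); a value well above 0 confirms the right skew):

skewness(titanic$Fare, na.rm = TRUE)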

Since the variable is not normally distributed, we impute the missing value using the median fare of the corresponding ticket class.

titanic %>% group_by(Pclass) %>%
  summarise(median = median(Fare, na.rm = TRUE))
## # A tibble: 3 x 2
##   Pclass median
##   <fct>   <dbl>
## 1 1st     60   
## 2 2nd     15.0 
## 3 3rd      8.05
titanic[is.na(titanic$Fare), ]$Fare <- median(titanic$Fare[titanic$Pclass == "3rd"], na.rm = TRUE) # the 3rd-class median, 8.05

Next, let’s deal with the missing Age values. We can do it using the mice package.

set.seed(123)
mice_mod <- mice(titanic[, !names(titanic) %in% c('PassengerId','LastName', 'FirstName', 'Ticket','Cabin','FamilySize','Survived')], method='pmm')
## 
##  iter imp variable
##   1   1  Age
##   1   2  Age
##   1   3  Age
##   1   4  Age
##   1   5  Age
##   2   1  Age
##   2   2  Age
##   2   3  Age
##   2   4  Age
##   2   5  Age
##   3   1  Age
##   3   2  Age
##   3   3  Age
##   3   4  Age
##   3   5  Age
##   4   1  Age
##   4   2  Age
##   4   3  Age
##   4   4  Age
##   4   5  Age
##   5   1  Age
##   5   2  Age
##   5   3  Age
##   5   4  Age
##   5   5  Age
mice_output <- complete(mice_mod)

Let’s compare the distribution of the original data and the imputed data.

par(mfrow=c(1,2))
hist(titanic$Age, freq=F, main='Age: Original Data', 
  col='blue')
hist(mice_output$Age, freq=F, main='Age: MICE Output', 
  col='lightblue')

The original and imputed data distributions are similar, so we can use the mice output to fill in our missing values.
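
Beyond the visual check, comparing summary statistics before and after imputation is another quick sanity check (at this point the original Age column still contains the 263 NAs):

summary(titanic$Age)
summary(mice_output$Age)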

titanic$Age <- mice_output$Age

Then, check for the missing values once again to make sure.

anyNA(titanic)
## [1] FALSE

Now we’re good to go. Next, check the data summary.

summary(titanic)
##   PassengerId   Survived  Pclass      LastName            Title    
##  Min.   :   1   No :815   1st:323   Length:1309        Master: 61  
##  1st Qu.: 328   Yes:494   2nd:277   Class :character   Miss  :264  
##  Median : 655             3rd:709   Mode  :character   Mr    :757  
##  Mean   : 655                                          Mrs   :198  
##  3rd Qu.: 982                                          Other : 29  
##  Max.   :1309                                                      
##   FirstName             Sex           Age            SibSp       
##  Length:1309        Female:466   Min.   : 0.17   Min.   :0.0000  
##  Class :character   Male  :843   1st Qu.:20.00   1st Qu.:0.0000  
##  Mode  :character                Median :28.00   Median :0.0000  
##                                  Mean   :29.45   Mean   :0.4989  
##                                  3rd Qu.:37.00   3rd Qu.:1.0000  
##                                  Max.   :80.00   Max.   :8.0000  
##      Parch          Ticket               Fare            Cabin          
##  Min.   :0.000   Length:1309        Min.   :  0.000   Length:1309       
##  1st Qu.:0.000   Class :character   1st Qu.:  7.896   Class :character  
##  Median :0.000   Mode  :character   Median : 14.454   Mode  :character  
##  Mean   :0.385                      Mean   : 33.276                     
##  3rd Qu.:0.000                      3rd Qu.: 31.275                     
##  Max.   :9.000                      Max.   :512.329                     
##  Embarked   FamilySize    
##   :  2    Min.   : 1.000  
##  C:270    1st Qu.: 1.000  
##  Q:123    Median : 1.000  
##  S:914    Mean   : 1.884  
##           3rd Qu.: 2.000  
##           Max.   :11.000

It looks like there are 2 passengers with an empty port of embarkation. Let’s check the corresponding passengers.

titanic %>% filter(Embarked == "")
##   PassengerId Survived Pclass LastName Title                     FirstName
## 1          62      Yes    1st    Icard  Miss                        Amelie
## 2         830      Yes    1st    Stone   Mrs George Nelson (Martha Evelyn)
##      Sex Age SibSp Parch Ticket Fare Cabin Embarked FamilySize
## 1 Female  38     0     0 113572   80   B28                   1
## 2 Female  62     0     0 113572   80   B28                   1

Both passengers with a missing port of embarkation were in the 1st class and paid a fare of 80. Let’s see where most 1st-class passengers embarked from.

ggplot(titanic, aes(Embarked, fill = Pclass)) +
  geom_bar(position = "fill")

It looks like roughly half of the passengers who embarked from Cherbourg were in the 1st class. To make sure, let’s look at the average fare passengers paid per port of embarkation.

titanic %>% filter(Embarked != "") %>% group_by(Embarked) %>%
  summarise(mean = mean(Fare),
            median = median(Fare))
## # A tibble: 3 x 3
##   Embarked  mean median
##   <fct>    <dbl>  <dbl>
## 1 C         62.3  28.5 
## 2 Q         12.4   7.75
## 3 S         27.4  13

On average, passengers who embarked from Cherbourg paid higher fares than passengers from the other ports. Based on this exploration, we can conclude that the 2 passengers with a missing port of embarkation embarked from Cherbourg.

titanic[titanic$Embarked == "", ]$Embarked <- "C"
titanic$Embarked <- droplevels(titanic$Embarked)
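
A quick check that the empty level has been dropped (levels() should now list only C, Q, and S):

levels(titanic$Embarked)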

Before we continue our analysis, let’s check whether the predictors are correlated with one another.

ggcorr(titanic, label = T)
## Warning in ggcorr(titanic, label = T): data in column(s) 'Survived', 'Pclass',
## 'LastName', 'Title', 'FirstName', 'Sex', 'Ticket', 'Cabin', 'Embarked' are not
## numeric and were ignored

There is a strong correlation between Parch, SibSp, and FamilySize. That is inevitable, since FamilySize is derived from those variables. To avoid redundant, collinear predictors, we keep just one of them: FamilySize.

Some variables contain unique, identifier-like values and do not provide useful information for our model, so let’s exclude them.

titanic <- titanic %>% select(-c(PassengerId, LastName, FirstName, Ticket, Cabin, Parch, SibSp))

Exploratory Data Analysis

How was the survival rate overall?

prop.table(table(titanic$Survived))
## 
##        No       Yes 
## 0.6226127 0.3773873

More than half of the passengers did not survive the disaster. Let’s see whether passengers with particular characteristics had a better chance of survival than others.

Family Size

ggplot(titanic, aes(as.factor(FamilySize), fill = Survived)) +
  geom_bar(position = "fill") +
  labs(title = "Survival by Family Size Class",
       x = "Family Size",
       y = "Survived") +
    theme_minimal() +
    scale_fill_brewer(palette = "Set2")

Passengers who were aboard alone or with a large number of family members were less likely to survive.

Sex

ggplot(titanic, aes(Sex, fill = Survived)) +
  geom_bar(position = "fill") +
  labs(title = "Survival by Sex",
       y = "Survived")+
  theme_minimal()  +
  scale_fill_brewer(palette = "Set2")

Female passengers had a higher chance of surviving than their male counterparts.

Pclass

ggplot(titanic, aes(Pclass, fill = Survived)) +
  geom_bar(position = "fill") +
  labs(title = "Survival by Ticket Class",
       y = "Survived") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

Passengers in the 1st class were more likely to survive than passengers in other classes.

Fare

ggplot(train, aes(x = Fare, y = as.numeric(Survived)))+
  geom_point(alpha = 0.1)+
  geom_smooth(color="red", fill="#69b3a2", se=TRUE)+
  scale_y_continuous(breaks = seq(0, 1, 0.5), labels = c(0, 0.5, 1)) +
  labs(title = "Survival by Fare",
       y = "Survived")+
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Passengers who paid more fare had a higher chance of surviving.

From the data wrangling done beforehand, it seems that 1st-class passengers paid much higher fares than the others. To make sure, let’s visualize it.

ggplot(titanic, aes(Pclass, Fare)) +
  geom_boxplot() + labs(title = "Fare Paid by Passengers from Each Ticket Class") +
  theme_classic()

Passengers in the 1st class indeed paid higher fares than passengers in the other classes. Since Fare and ticket class are strongly correlated and capture much the same information, I will exclude one of these 2 variables.
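
As a quick quantitative check on that relationship, we can compute Spearman’s rank correlation between ticket class and fare (a sketch; as.numeric() encodes the ordered classes as 1-3, so a strongly negative value means higher classes paid more):

cor(as.numeric(titanic$Pclass), titanic$Fare, method = "spearman")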

titanic <- titanic %>% select(-Fare)

Embarked

ggplot(titanic, aes(Embarked, fill = Survived)) +
  geom_bar(position = "fill") +
  labs(title = "Survival by Port of Embarkation",
       y = "Survived") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

Passengers embarked from Cherbourg were more likely to survive than passengers who embarked from other ports.

Title

ggplot(titanic, aes(Title, fill = Survived)) +
  geom_bar(position = "fill") +
  labs(title = "Survival by Title",
       y = "Survived") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

Passengers titled Miss and Mrs had a higher chance of surviving. That makes sense, since women were more likely to survive this disaster.

Age

ggplot(train, aes(x = Age, y = as.numeric(Survived)))+
  geom_point(alpha = 0.1)+
  geom_smooth(color="red", fill="#69b3a2", se=TRUE)+
  scale_y_continuous(breaks = seq(0, 1, 0.5), labels = c (0, 0.5, 1))+
  labs(title = "Survival by Age",
       y = "Survived") +
  theme_minimal()
## Warning: Removed 177 rows containing non-finite values (stat_smooth).
## Warning: Removed 177 rows containing missing values (geom_point).

The younger the passenger, the more likely they were to survive.

Modeling

For this project, I will use 3 machine learning algorithms to predict passenger survival, namely Naive Bayes, Decision Tree, and Random Forest, and compare their performance.

Naive Bayes

Split the data into train and test sets. Although the dataset originally comes split into two files, those were combined earlier, so I will use a random 80/20 split instead.

set.seed(610)
index <- sample(nrow(titanic), nrow(titanic)*0.8)
train <- titanic[index, ]
test <- titanic[-index, ]

Before continuing the analysis, let’s see the proportion of both classes.

prop.table(table(train$Survived))
## 
##        No       Yes 
## 0.6150907 0.3849093

Build the model and use it to predict survival on the test data.

model_nb <- naiveBayes(Survived ~ ., data = train, laplace = 1)
naive_pred <- predict(model_nb, newdata = test, type = "class")
naive_prob <- predict(model_nb, newdata = test, type = "raw")

naive_table <- select(test, Survived) %>%
  bind_cols(surv_pred = naive_pred) %>%
  bind_cols(surv_eprob = round(naive_prob[,1],4)) %>% 
  bind_cols(surv_pprob = round(naive_prob[,2],4))

Let’s see how well the model predicts new data.

naive_table %>% 
  conf_mat(Survived, surv_pred) %>% 
  autoplot(type = "heatmap")

eval_naive <- naive_table %>%
  summarise(
    Accuracy = accuracy_vec(Survived, surv_pred),
    Sensitivity = sens_vec(Survived, surv_pred),
    Specificity = spec_vec(Survived, surv_pred),
    Precision = precision_vec(Survived, surv_pred)
  )

eval_naive
##    Accuracy Sensitivity Specificity Precision
## 1 0.8549618   0.8947368   0.7802198 0.8843931

This model predicts roughly 85.5% of cases correctly overall.
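
Since we will repeat this evaluation for two more models, a small helper keeps things tidy (a sketch; eval_model is a hypothetical convenience wrapper around the same yardstick functions used above):

# hypothetical helper: the four metrics for any truth/prediction pair of factors
eval_model <- function(truth, estimate) {
  data.frame(
    Accuracy = accuracy_vec(truth, estimate),
    Sensitivity = sens_vec(truth, estimate),
    Specificity = spec_vec(truth, estimate),
    Precision = precision_vec(truth, estimate)
  )
}
# usage: eval_model(naive_table$Survived, naive_table$surv_pred)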

Another way to evaluate model performance is the AUC - ROC curve, a performance measurement for classification problems. It tells us how well the model distinguishes between classes: the higher the AUC, the better the model is at predicting the negative class as negative and the positive class as positive. AUC ranges from 0 to 1.

library(ROCR)
pred <- prediction(predictions = naive_prob[, 2], labels = test$Survived)
perf <- performance(prediction.obj = pred, measure = "tpr", x.measure = "fpr")
plot(perf)

auc <- performance(pred, "auc")
auc@y.values[[1]]
## [1] 0.8960221

This model also has an AUC of about 0.90, which means it performs well at distinguishing the positive and negative classes.

eval_naive <- eval_naive %>%
  cbind(AUC = auc@y.values[[1]])
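
The same ROCR steps will recur for the other models, so they too can be wrapped in a small hypothetical helper (a sketch built from the exact calls shown above):

# hypothetical helper: AUC from predicted probabilities of the positive class
get_auc <- function(prob_yes, truth) {
  pred_obj <- prediction(predictions = prob_yes, labels = truth)
  performance(pred_obj, "auc")@y.values[[1]]
}
# usage: get_auc(naive_prob[, 2], test$Survived)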

Decision Tree

Build the model.

model_dt <- ctree(Survived ~ ., data = train)

Predict on the test data using the model.

dt_pred <- predict(model_dt, newdata = test, type = "response")
dt_prob <- predict(model_dt, newdata = test, type = "prob")

dt_table <- select(test, Survived) %>%
  bind_cols(surv_pred = dt_pred) %>%
  bind_cols(surv_eprob = round(dt_prob[,1],4)) %>% 
  bind_cols(surv_pprob = round(dt_prob[,2],4))

Check the model’s performance on new data.

eval_dt <- dt_table %>%
  summarise(
    Accuracy = accuracy_vec(Survived, surv_pred),
    Sensitivity = sens_vec(Survived, surv_pred),
    Specificity = spec_vec(Survived, surv_pred),
    Precision = precision_vec(Survived, surv_pred)
  )

eval_dt
##    Accuracy Sensitivity Specificity Precision
## 1 0.8625954   0.8947368   0.8021978 0.8947368

This decision tree model gives 86.3% accuracy in predicting new data, slightly higher than the previous model.

plot(model_dt, type = "simple")

ROC

pred.dt <- prediction(predictions = dt_prob[, 2], labels = test$Survived)
perf.dt <- performance(prediction.obj = pred.dt, measure = "tpr", x.measure = "fpr")
plot(perf.dt)

auc2 <- performance(pred.dt, "auc")
auc2@y.values[[1]]
## [1] 0.8880535

Based on the AUC value, even though it’s slightly lower than the previous model’s, this model also performs well.

eval_dt <- eval_dt %>%
  cbind(AUC = auc2@y.values[[1]])

Random Forest

Before starting with the analysis, we should check for and exclude variables with near-zero variance, since they provide little information for the model.

n0_var <- nearZeroVar(titanic)
n0_var
## integer(0)

It turns out our data has no near-zero-variance variables. Now we’re good to go.

set.seed(417)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_rf <- train(Survived ~ ., data = train, method = "rf", trControl= ctrl)
# print output 
model_rf
## Random Forest 
## 
## 1047 samples
##    6 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 839, 838, 837, 837, 837, 838, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8631264  0.7070843
##    6    0.8590055  0.6978656
##   11    0.8453044  0.6710996
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# final model used
model_rf$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 13.94%
## Confusion matrix:
##      No Yes class.error
## No  584  60   0.0931677
## Yes  86 317   0.2133995
# visualize model
plot(model_rf)

# the most important predictors
varImp(model_rf) %>% plot()

With a Random Forest model, we can inspect which variables matter most. The 3 most important variables in this model are Title, Sex, and ticket class.

Use the fitted model to predict survival on the test data.

pred_rf <- predict(object = model_rf, newdata = test, type = "raw")
prob_rf <- predict(object = model_rf, newdata = test, type = "prob")

rf_table <- select(test, Survived) %>%
  bind_cols(surv_pred = pred_rf) %>%
  bind_cols(surv_eprob = round(prob_rf[,1],4)) %>% 
  bind_cols(surv_pprob = round(prob_rf[,2],4))

Let’s see how well the model predicts new data.

rf_table %>% 
  conf_mat(Survived, surv_pred) %>% 
  autoplot(type = "heatmap")

eval_rf <- rf_table %>%
  summarise(
    Accuracy = accuracy_vec(Survived, surv_pred),
    Sensitivity = sens_vec(Survived, surv_pred),
    Specificity = spec_vec(Survived, surv_pred),
    Precision = precision_vec(Survived, surv_pred)
  )

eval_rf
##    Accuracy Sensitivity Specificity Precision
## 1 0.8587786   0.8947368   0.7912088 0.8895349

The random forest model also performs fairly well; its accuracy is slightly lower than the decision tree’s, though a bit higher than Naive Bayes’. Let’s also check its ROC curve and AUC.

pred.rf <- prediction(predictions = prob_rf[, 2], labels = test$Survived)
perf.rf <- performance(prediction.obj = pred.rf, measure = "tpr", x.measure = "fpr")
plot(perf.rf)

auc3 <- performance(pred.rf, "auc")
auc3@y.values[[1]]
## [1] 0.871088

It turns out the Random Forest has the lowest AUC of the three models.

eval_rf <- eval_rf %>%
  cbind(AUC = auc3@y.values[[1]])

Conclusion

rbind("Naive Bayes" = eval_naive, "Decision Tree" = eval_dt, "Random Forest" = eval_rf) %>% data.frame()
##                Accuracy Sensitivity Specificity Precision       AUC
## Naive Bayes   0.8549618   0.8947368   0.7802198 0.8843931 0.8960221
## Decision Tree 0.8625954   0.8947368   0.8021978 0.8947368 0.8880535
## Random Forest 0.8587786   0.8947368   0.7912088 0.8895349 0.8710880

Overall, the Naive Bayes model is the best at distinguishing the positive and negative classes, as it has the highest AUC of the three models, even though the Decision Tree edges it out slightly on accuracy, specificity, and precision. In conclusion, Naive Bayes is a sound choice for predicting the survival of passengers aboard the Titanic based on ticket class, title, sex, age, port of embarkation, and family size.
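
As a closing illustration, here is a sketch of scoring a hypothetical new passenger with the Naive Bayes model (the passenger’s values are invented, and the factor levels must match those used in training):

# hypothetical new passenger: a 30-year-old married woman in 1st class,
# embarked at Cherbourg, travelling with one family member
new_passenger <- data.frame(
  Pclass = factor("1st", levels = levels(titanic$Pclass)),
  Title = factor("Mrs", levels = levels(titanic$Title)),
  Sex = factor("Female", levels = levels(titanic$Sex)),
  Age = 30,
  Embarked = factor("C", levels = levels(titanic$Embarked)),
  FamilySize = 2
)
predict(model_nb, newdata = new_passenger, type = "class")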