1 SUMMARY

The sinking of the Titanic in 1912 was one of the deadliest sinkings of a single ship at the time. Approximately 1500 of the 2224 people aboard died in the tragedy. This project uses an incomplete dataset to reconstruct the overall picture.

The dataset holds information on 1309 passengers, including the ticket class they bought, gender, family size, ticket fare, age and, most importantly, their survivorship. 32% of the listed passengers did not have their survivorship recorded, and there were also many missing values in the other columns. Data manipulation, imputation, feature engineering, and machine learning algorithms (k-nearest neighbours, random forest, and extreme gradient boosting) were applied to clean the dataset. A final, fully cleaned dataset was synthesised for data visualisation to understand the trends in the tragedy.

This project concluded that 62% of passengers died in the sinking, that the death rate was highest among 3rd-class ticket passengers, and that the death rate was highest among adult males. Statistically, 76% of 3rd-class, 57% of 2nd-class, and 37% of 1st-class ticket passengers died in the sinking. Among age groups, 47% of children, 57% of teenagers, 65% of adults, and 90% of elders died in the sinking of the Titanic.


2 R PACKAGES

library(tidyverse)           # data manipulation and ggplot2 graphics
library(kableExtra)          # styled HTML tables
library(skimr)               # quick data summaries
library(corrplot)            # correlation plots
library(e1071)               # statistical helpers (e.g. skewness)
library(caret)               # machine learning workflow
library(doSNOW)              # allow training in parallel 
library(ipred)               # bagging, used by bagged-tree imputation
library(xgboost)             # extreme gradient boosting
library(plotly)              # interactive ggplot graphics
library(highcharter)         # interactive charts
library(leaflet)             # interactive maps
library(leaflet.minicharts)  # mini charts on leaflet maps

3 INTRODUCTION

RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean on 15 April 1912, after striking an iceberg during her maiden voyage from Southampton to New York City (Wikipedia 2021).

According to Wikipedia, an estimated 2224 passengers and crew were aboard, and the sinking caused an estimated 1500 deaths, making it one of the deadliest sinkings of a single ship at the time.

In this project, I will analyse a Titanic dataset publicly available from Kaggle. The dataset holds information on 1309 passengers, including their survivorship status (survived, did not survive, or missing).

I will apply various data science methodologies, including data exploration, data manipulation, feature engineering, algorithmic imputation, and machine learning models, to fill in the missing values in the dataset, including predicting the survival of passengers whose survivorship was unrecorded. I will then use the final table for data visualisation with graphs and maps to understand the overall trends.

4 DATA PREPARATION

4.1 Data import

The following code imports the datasets into R.

# Data import 

train <- read.csv("train.csv")
test <- read.csv("test.csv")

# Combine dataset

train <- train %>% 
  relocate(Survived, .after = Embarked) %>% 
  mutate(source = "train")

test <- test %>% 
  mutate(source = "test")
    
titanic <- full_join(train, test) 

# write.csv(titanic, "titanic_raw.csv")
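A quick sanity check (a sketch) confirms that the join stacked every row from both files:

nrow(train)    # expect 891
nrow(test)     # expect 418
nrow(titanic)  # expect 891 + 418 = 1309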

The following shows a random draw of 10 rows from the imported dataset. We can see information such as Name, Sex, Age, ticket class, fare, etc.

sample_n(titanic, 10) %>% kbl(align = "c") %>% kable_styling(bootstrap_options = "border")
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Survived source
1256 1 Harder, Mrs. George Achilles (Dorothy Annan) female 25.0 1 0 11765 55.4417 E50 C NA test
397 3 Olsson, Miss. Elina female 31.0 0 0 350407 7.8542 S 0 train
1048 1 Bird, Miss. Ellen female 29.0 0 0 PC 17483 221.7792 C97 S NA test
836 1 Compton, Miss. Sara Rebecca female 39.0 1 1 PC 17756 83.1583 E49 C 1 train
1207 3 Hagardon, Miss. Kate female 17.0 0 0 AQ/3. 30631 7.7333 Q NA test
682 1 Hassab, Mr. Hammad male 27.0 0 0 PC 17572 76.7292 D49 C 1 train
705 3 Hansen, Mr. Henrik Juul male 26.0 1 0 350025 7.8542 S 0 train
20 3 Masselmani, Mrs. Fatima female NA 0 0 2649 7.2250 C 1 train
883 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 S 0 train
123 2 Nasser, Mr. Nicholas male 32.5 1 0 237736 30.0708 C 0 train

4.2 Data description

This table is adapted from Kaggle.

Variable <- c("PassengerId",
              "Pclass",
              "Name",
              "Sex",
              "Age",
              "SibSp",
              "Parch",
              "Ticket",
              "Fare",
              "Cabin",
              "Embarked",
              "Survived",
              "Source")

Definition <- c("Id of the passenger",
                "Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd. It is a A proxy for socio-economic status (SES) with 1st = Upper, 2nd = Middle, 3rd = Lower",
                "Name of the passenger",
                "Sex",
                "Age in years",
                "# of siblings / spouses aboard the Titanic",
                "# of parents / children aboard the Titanic",
                "Ticket number",
                "Passenger fare",
                "Cabin number",
                "Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton",
                "Survivalship information: 0 = No, 1 = Yes, blank = missing value",
                "train and test. Train has either survived or not survived recorded, whereas test does not. Machine learning will be used to make the prediction.")

data.frame(Variable, Definition) %>% 
  kbl() %>% 
  kable_styling(bootstrap_options = c("bordered", "striped"))
Variable Definition
PassengerId Id of the passenger
Pclass Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd. A proxy for socio-economic status (SES) with 1st = Upper, 2nd = Middle, 3rd = Lower
Name Name of the passenger
Sex Sex
Age Age in years
SibSp # of siblings / spouses aboard the Titanic
Parch # of parents / children aboard the Titanic
Ticket Ticket number
Fare Passenger fare
Cabin Cabin number
Embarked Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton
Survived Survivorship information: 0 = No, 1 = Yes, blank = missing value
Source train or test. Rows from train have survivorship recorded, whereas rows from test do not; machine learning will be used to predict it.

4.3 Data exploration

There are 1309 rows of observations and 13 columns: 6 columns are recognised as character type and 7 as numeric. Some of these should be converted into factors for the analysis; I will identify the columns that need this conversion.


skim_without_charts(titanic)
Data summary
Name titanic
Number of rows 1309
Number of columns 13
_______________________
Column type frequency:
character 6
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Name 0 1 12 82 0 1307 0
Sex 0 1 4 6 0 2 0
Ticket 0 1 3 18 0 929 0
Cabin 0 1 0 15 1014 187 0
Embarked 0 1 0 1 2 4 0
source 0 1 4 5 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
PassengerId 0 1.00 655.00 378.02 1.00 328.0 655.00 982.00 1309.00
Pclass 0 1.00 2.29 0.84 1.00 2.0 3.00 3.00 3.00
Age 263 0.80 29.88 14.41 0.17 21.0 28.00 39.00 80.00
SibSp 0 1.00 0.50 1.04 0.00 0.0 0.00 1.00 8.00
Parch 0 1.00 0.39 0.87 0.00 0.0 0.00 0.00 9.00
Fare 1 1.00 33.30 51.76 0.00 7.9 14.45 31.27 512.33
Survived 418 0.68 0.38 0.49 0.00 0.0 0.00 1.00 1.00

I identified 263 missing values in Age, 1 in Fare, and 418 in Survived. However, there are also many missing values in "Cabin"; they were recorded as empty strings rather than true blanks, so they are not recognised as "NA" in R and go undetected.

head(titanic$Cabin, 10)
##  [1] ""     "C85"  ""     "C123" ""     ""     "E46"  ""     ""     ""

77% of the values in the Cabin column are missing, so this column will be removed. A common rule of thumb recommends removing a column from predictive analysis when 60% or more of its values are missing.

titanic %>% 
  dplyr::select(Cabin) %>% 
  mutate(value = case_when(Cabin == "" ~ "Missing",
                           TRUE ~ "Not_Missing")) %>% 
  group_by(value) %>% 
  summarise(statistics = n()) %>% 
  mutate(total = sum(statistics),
         percent = paste0(round(statistics/total * 100), "%"))
## # A tibble: 2 x 4
##   value       statistics total percent
##   <chr>            <int> <int> <chr>  
## 1 Missing           1014  1309 77%    
## 2 Not_Missing        295  1309 23%
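If these blanks needed to be visible to standard missing-value tooling, the empty strings could be converted to NA first. A minimal sketch, assuming dplyr 1.0+ for across():

titanic %>% 
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>% 
  summarise(across(everything(), ~ sum(is.na(.x))))   # NA count per column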

The following is another way of looking at the dataset, showing data types and initial values.

glimpse(titanic)
## Rows: 1,309
## Columns: 13
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal~
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
## $ source      <chr> "train", "train", "train", "train", "train", "train", "tra~

5 DATA CLEANING

Identified cleaning tasks:

  • PassengerId will be removed, because it adds nothing to the analysis of this project.
  • Name will be removed, because I am not doing text analytics in this project.
  • Ticket will be removed, because it adds nothing to the analysis of this project.
  • Cabin will be removed, because 77% of its values are missing.
  • Convert all remaining character variables into factors.
  • Pclass should be converted into a factor.
  • Survived should be converted into a factor.

5.1 Variable removal

The following code removes PassengerId, Name, Ticket, and Cabin.

titanic <- titanic %>% 
  dplyr::select(-PassengerId, -Name, -Ticket, -Cabin)

5.2 Factor conversion

The following code converts all remaining character variables into factors, along with the numeric columns Pclass and Survived.

titanic <- titanic %>% 
  mutate_if(is.character, as.factor) %>% 
  mutate(Pclass = as.factor(Pclass),
         Survived = as.factor(Survived))

5.3 Renaming levels

This section renames the levels of several variables. It will not affect the analysis; instead, it helps readers understand the levels, especially when they are displayed in graphs.

titanic <- titanic %>% 
  mutate(Pclass = fct_recode(Pclass,
                             "1st_class" = "1",
                             "2nd_class" = "2",
                             "3rd_class" = "3"),
         Sex = fct_recode(Sex, 
                          "Male" = "male",
                          "Female" = "female"),
         Embarked = fct_recode(Embarked,
                               "Cherbourg" = "C",
                               "Queenstown" = "Q",
                               "Southampton" = "S"),
         Survived = fct_recode(Survived,
                               "Yes" = "1",
                               "No" = "0"))

5.4 Renaming variables

Converting all variable names to lower case, as more than one naming format is currently in use. This does not affect the analysis, but it makes the table look cleaner and tidier.

names(titanic) <- tolower(names(titanic))

5.5 Imputation

This section applies imputation models to fill in the missing values in the dataset. There are many imputation methods, including the mean, the median, the mode (the most frequent value, generally applied to categorical data), and machine learning models that use the entire dataset to predict the missing values.

Missing values that need imputation are present in the following columns:

  • Fare (will be imputed with the median)
  • Embarked (will be imputed with the mode)
  • Age (will be imputed with an imputation model)
summary(titanic)
##        pclass        sex           age            sibsp            parch      
##  1st_class:323   Female:466   Min.   : 0.17   Min.   :0.0000   Min.   :0.000  
##  2nd_class:277   Male  :843   1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.000  
##  3rd_class:709                Median :28.00   Median :0.0000   Median :0.000  
##                               Mean   :29.88   Mean   :0.4989   Mean   :0.385  
##                               3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.000  
##                               Max.   :80.00   Max.   :8.0000   Max.   :9.000  
##                               NA's   :263                                     
##       fare                embarked   survived     source   
##  Min.   :  0.000              :  2   No  :549   test :418  
##  1st Qu.:  7.896   Cherbourg  :270   Yes :342   train:891  
##  Median : 14.454   Queenstown :123   NA's:418              
##  Mean   : 33.295   Southampton:914                         
##  3rd Qu.: 31.275                                           
##  Max.   :512.329                                           
##  NA's   :1
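The summary above confirms that Southampton (914 passengers) is the most frequent port. Base R has no built-in mode function for categorical data, so a small helper makes the choice explicit (a sketch; fct_mode is a hypothetical name):

# Most frequent non-blank value of a factor or character vector
fct_mode <- function(x) {
  x <- x[!is.na(x) & x != ""]
  names(which.max(table(x)))
}

fct_mode(titanic$embarked)   # expect "Southampton"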

The following code (1) replaces the NA in Fare with the overall median fare, and (2) replaces the 2 missing values in Embarked with "Southampton", the most frequently occurring level in the column.

titanic <- titanic %>% 
  mutate(embarked = as.character(embarked),
         embarked = na_if(embarked, "")) %>%
  mutate(fare = replace_na(fare, median(fare, na.rm = T)),       # Impute with median for Fare
         embarked = replace_na(embarked, "Southampton"),
         embarked = as.factor(embarked))                   # Impute with mode for Embarked

The following code completes the imputation of the missing values in "age" using a bagged-tree algorithm.

# Dummy-variable conversion, because caret's imputation functions do not work with categorical data

dummy_function <- dummyVars(~., data = titanic[, -8])   # Exclude "survived" (column 8)
titanic_dummy <- dummy_function %>% predict(titanic[, -8])

# Impute with Bagged tree models

Bagimpute_function <- titanic_dummy %>% preProcess(method = "bagImpute")
titanic_dummy_impute <- Bagimpute_function %>% predict(titanic_dummy)

# Extract Age from titanic_dummy_impute into titanic table

titanic$age <- titanic_dummy_impute[, 6]
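Extracting by column position is fragile if the dummy layout ever changes; a name-based alternative works too (a sketch, relying on dummyVars leaving the numeric column named "age"):

titanic$age <- titanic_dummy_impute[, "age"]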

All missing values in the dataset have now been filled in, leaving only the "survived" column with 418 missing values. This is the response variable of the analysis, and the survivorship of these passengers will be predicted via machine learning in a later section.

summary(titanic)
##        pclass        sex           age            sibsp            parch      
##  1st_class:323   Female:466   Min.   : 0.17   Min.   :0.0000   Min.   :0.000  
##  2nd_class:277   Male  :843   1st Qu.:21.87   1st Qu.:0.0000   1st Qu.:0.000  
##  3rd_class:709                Median :28.38   Median :0.0000   Median :0.000  
##                               Mean   :29.72   Mean   :0.4989   Mean   :0.385  
##                               3rd Qu.:36.00   3rd Qu.:1.0000   3rd Qu.:0.000  
##                               Max.   :80.00   Max.   :8.0000   Max.   :9.000  
##       fare                embarked   survived     source   
##  Min.   :  0.000   Cherbourg  :270   No  :549   test :418  
##  1st Qu.:  7.896   Queenstown :123   Yes :342   train:891  
##  Median : 14.454   Southampton:916   NA's:418              
##  Mean   : 33.281                                           
##  3rd Qu.: 31.275                                           
##  Max.   :512.329

Cleaning up the age column: ages should not have decimal places, so I round the imputed values, since they are only estimates.

titanic <- titanic %>% 
  mutate(age = round(age))

5.6 Round the Fare

Since fares are usually expressed with 2 decimal places, I round "fare" from 4 decimal places to 2.

titanic <- titanic %>% 
  mutate(fare = round(fare, 2))

5.7 Feature Engineering

Since “SibSp” (number of siblings or spouses) and “Parch” (parents or children) are the total number of family a passenger was with, and a combination of them would create a new variable “familysize”.

titanic <- titanic %>% 
  mutate(familysize = sibsp + parch) %>% 
  relocate(familysize, .after = parch)

Grouping age ranges into a new variable, "age_group", with levels Kid, Teenage, Adult, and Elder.

titanic <- titanic %>% 
  mutate(age_group = case_when(age <= 12 ~ "Kid",
                               age >= 13 & age <= 19 ~ "Teenage",
                               age >= 20 & age <= 65 ~ "Adult",
                               age >= 66 ~ "Elder"),
         age_group = factor(age_group, levels = c("Kid", "Teenage", "Adult", "Elder"))) %>% 
  relocate(age_group, .after = age)
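An equivalent, more compact binning uses base R's cut(). This sketch produces the same four levels, relying on ages having been rounded to whole numbers earlier:

titanic$age_group <- cut(titanic$age,
                         breaks = c(-Inf, 12, 19, 65, Inf),
                         labels = c("Kid", "Teenage", "Adult", "Elder"))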

The dataset has now been cleaned.

6 MACHINE LEARNING

There are 418 passengers whose survivorship is not recorded in the dataset. I will predict their survivorship from the remaining data with the aid of machine learning algorithms.

summary(titanic$survived)
##   No  Yes NA's 
##  549  342  418

The dataset will be split into three subsets: the rows without survivorship (to be predicted later) and the rows with survivorship, which are further divided into a train set and a test set.

First, split the dataset into two: one with survivorship recorded and one without.

titanic_with_survivorship <- titanic %>% 
  filter(source == "train") %>% 
  dplyr::select(-source)

titanic_without_survivorship <- titanic %>% 
  filter(source == "test") %>% 
  dplyr::select(-source)

Next, split the one with survivorship information into an 80% train set and a 20% test set.

set.seed(123)

# Create data partition 

training.set <- titanic_with_survivorship$survived %>% createDataPartition(p = 0.8, list = F)

# Get train and test set

train.set <- titanic_with_survivorship[training.set, ]
  
test.set <- titanic_with_survivorship[-training.set, ]
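createDataPartition samples within each outcome class, so the 80/20 split should preserve the survival ratio. A quick check (sketch):

prop.table(table(train.set$survived))   # class balance in the train set
prop.table(table(test.set$survived))    # should be very similar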

6.1 K-Nearest Neighbours (KNN)

This section trains a non-parametric algorithm, KNN, on the train set and makes predictions on the test set.

model_knn <- train(survived ~., data = train.set,
                   method = "knn",
                   trControl = trainControl(method = "repeatedcv", 
                                            number = 10,
                                            repeats = 3),
                   preProcess = c("center", "scale")
                   )

plot(model_knn)

According to the graph above and the following output, the best k is 9. It is selected automatically when this KNN model is used for predictions.

model_knn$bestTune
##   k
## 3 9

Applying the KNN model to make predictions on the test set and evaluating its predictive performance (accuracy of 82.5%).

# Make predictions

prediction_knn <- model_knn %>% predict(test.set)

# Test performance

mean(prediction_knn == test.set$survived)
## [1] 0.8248588

Computing a confusion matrix to check other performance metrics of this model.

CM <- confusionMatrix(prediction_knn, test.set$survived)
CM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  100  22
##        Yes   9  46
##                                           
##                Accuracy : 0.8249          
##                  95% CI : (0.7607, 0.8778)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 1.304e-09       
##                                           
##                   Kappa : 0.6161          
##                                           
##  Mcnemar's Test P-Value : 0.03114         
##                                           
##             Sensitivity : 0.9174          
##             Specificity : 0.6765          
##          Pos Pred Value : 0.8197          
##          Neg Pred Value : 0.8364          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5650          
##    Detection Prevalence : 0.6893          
##       Balanced Accuracy : 0.7970          
##                                           
##        'Positive' Class : No              
## 

6.2 Random Forest

This section applies the random forest algorithm on the train set and makes predictions on the test set.

set.seed(123)

model_rf <- train(survived ~., data = train.set,
                   method = "rf",
                   trControl = trainControl(method = "repeatedcv", 
                                            number = 10,
                                            repeats = 3),
               importance = TRUE,
               tuneLength = 9
                   )

Making predictions with the random forest model and evaluating its predictive performance (accuracy of 82.5%).

# Make predictions

prediction_rf <- model_rf %>% predict(test.set)
  
# Test performance

mean(prediction_rf == test.set$survived)
## [1] 0.8248588

Computing a confusion matrix to check other performance metrics of this model.

confusionMatrix(prediction_rf, test.set$survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  98  20
##        Yes 11  48
##                                           
##                Accuracy : 0.8249          
##                  95% CI : (0.7607, 0.8778)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 1.304e-09       
##                                           
##                   Kappa : 0.6204          
##                                           
##  Mcnemar's Test P-Value : 0.1508          
##                                           
##             Sensitivity : 0.8991          
##             Specificity : 0.7059          
##          Pos Pred Value : 0.8305          
##          Neg Pred Value : 0.8136          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5537          
##    Detection Prevalence : 0.6667          
##       Balanced Accuracy : 0.8025          
##                                           
##        'Positive' Class : No              
## 

6.3 XGBoost

This section applies extreme gradient boosting (XGBoost), an alternative to the random forest algorithm. Build the model with the following code.

set.seed(123)

# Tuning

my_tunes <- expand.grid(eta = c(0.05, 0.075, 0.1),
                        nrounds = c(50, 75, 100),
                        max_depth = 6:8, 
                        min_child_weight = c(2.0, 2.25, 2.5),
                        colsample_bytree = c(0.3, 0.4, 0.5),
                        gamma = 0,
                        subsample = 1)

# Initiate parallel computing to speed up boosting process

my_cluster <- makeCluster(5, type = "SOCK")
registerDoSNOW(my_cluster)

# Build the model

model_xgb <- train(survived ~., data = train.set,
                   method = "xgbTree",
                   trControl = trainControl(method = "repeatedcv", 
                                            number = 10,
                                            repeats = 3,
                                            search = "grid"),
                   tuneGrid = my_tunes)

# Stop my_cluster

stopCluster(my_cluster)
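After training, the winning hyper-parameter combination can be inspected with caret's bestTune (a sketch):

model_xgb$bestTune   # grid row with the highest cross-validated accuracy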

Making predictions with the XGBoost model and evaluating its performance (accuracy of 80.2%).

# Make predictions

prediction_xgb <- model_xgb %>% predict(test.set)
  
# Test performance

mean(prediction_xgb == test.set$survived)
## [1] 0.8022599

Computing a confusion matrix to check other performance metrics of this model.

confusionMatrix(prediction_xgb, test.set$survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  94  20
##        Yes 15  48
##                                           
##                Accuracy : 0.8023          
##                  95% CI : (0.7359, 0.8582)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 7.432e-08       
##                                           
##                   Kappa : 0.5762          
##                                           
##  Mcnemar's Test P-Value : 0.499           
##                                           
##             Sensitivity : 0.8624          
##             Specificity : 0.7059          
##          Pos Pred Value : 0.8246          
##          Neg Pred Value : 0.7619          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5311          
##    Detection Prevalence : 0.6441          
##       Balanced Accuracy : 0.7841          
##                                           
##        'Positive' Class : No              
## 

6.4 Model comparison

I will use the KNN model to make predictions on the new dataset that does not have survivorship recorded, because the KNN model ties for the highest accuracy and has the highest sensitivity, despite the lowest specificity.

tests <- c("Model_knn", "Model_rf", "Model_xgb")

accuracy <- c(mean(prediction_knn == test.set$survived),
              mean(prediction_rf == test.set$survived),
              mean(prediction_xgb == test.set$survived))

sensitivity <- c(0.9174, 0.8991, 0.8624)   # from the confusion matrices above

specificity <- c(0.6765, 0.7059, 0.7059)   # from the confusion matrices above

 

# Data frame

model_compare <- data.frame(tests, accuracy, sensitivity, specificity)


# Data transform

df6.4 <- model_compare %>% 
  pivot_longer(c(2:4), names_to = "metric", values_to = "values")

# plot

ggplot(df6.4, aes(x = metric, y = values, fill = tests)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_text(aes(label = round(values, 2)), position = position_dodge(width = 0.9), vjust = -0.7) +
  theme_bw() +
  labs(x = "Metric",
       y = "Values",
       fill = "Model",
       title = "Model_KNN has the best accuracy, sensitivity, and third-place in specificity") +
  scale_y_continuous(lim = c(0, 1))
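Rather than hard-coding sensitivity and specificity, the values could be pulled from the confusionMatrix objects directly. A sketch, assuming the random forest and XGBoost matrices were stored the same way as CM above:

CM$byClass[c("Sensitivity", "Specificity")]   # KNN metrics for the "No" class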

7 PREDICTIONS

A quick recap: the dataset has 1309 rows, and 418 passengers do not have their survivorship recorded. I trained 3 machine learning algorithms and found that the KNN model had the best predictive power.

Now, I will use the KNN model to predict the survivorship of these 418 passengers to obtain a final cleaned dataset for visualisation.

Make the prediction with the following code.

predicted <- model_knn %>% predict(titanic_without_survivorship)

predicted
##   [1] No  No  No  No  No  No  Yes No  Yes No  No  No  Yes No  Yes Yes No  No 
##  [19] No  No  No  Yes Yes No  Yes No  Yes No  Yes No  No  No  No  No  Yes No 
##  [37] No  No  No  No  No  No  No  Yes Yes No  No  No  Yes No  No  No  Yes Yes
##  [55] No  No  No  No  No  Yes No  No  No  Yes Yes Yes Yes No  No  Yes Yes No 
##  [73] No  No  Yes No  No  Yes No  Yes Yes No  No  No  No  No  Yes Yes Yes Yes
##  [91] No  No  Yes No  No  No  No  No  No  No  Yes No  No  No  Yes No  No  No 
## [109] No  No  No  Yes Yes Yes Yes No  No  Yes No  Yes Yes No  Yes No  No  Yes
## [127] No  Yes No  No  No  No  No  No  No  No  No  No  Yes No  No  Yes Yes No 
## [145] Yes No  No  No  No  No  Yes No  No  No  No  No  Yes Yes Yes No  Yes Yes
## [163] Yes No  No  Yes No  No  Yes No  No  No  No  No  No  Yes Yes No  Yes Yes
## [181] No  Yes Yes No  Yes No  Yes No  No  No  No  No  Yes No  Yes No  Yes Yes
## [199] No  No  Yes Yes No  Yes No  No  Yes No  Yes No  No  No  No  Yes No  No 
## [217] Yes Yes Yes No  Yes No  Yes No  Yes Yes No  Yes No  No  No  Yes No  No 
## [235] Yes No  No  No  Yes Yes Yes Yes Yes No  No  No  Yes No  Yes No  Yes No 
## [253] Yes No  No  No  No  No  No  No  No  No  Yes Yes No  No  No  No  No  No 
## [271] No  No  Yes Yes No  Yes No  No  No  No  No  Yes Yes Yes Yes No  No  No 
## [289] No  No  No  Yes No  No  No  No  Yes No  Yes No  No  No  No  No  Yes Yes
## [307] Yes Yes Yes No  No  No  No  Yes Yes Yes Yes No  No  No  No  No  No  Yes
## [325] Yes No  Yes No  No  No  Yes No  No  Yes No  Yes No  No  No  No  No  No 
## [343] No  Yes No  Yes No  No  No  Yes Yes No  No  Yes Yes No  Yes No  No  No 
## [361] No  Yes Yes No  Yes No  No  No  Yes No  No  Yes No  No  Yes Yes No  No 
## [379] No  No  No  No  No  Yes No  Yes No  No  No  No  Yes Yes No  No  No  Yes
## [397] No  Yes No  No  Yes No  Yes No  No  No  No  No  Yes Yes Yes Yes No  No 
## [415] Yes No  No  No 
## Levels: No Yes

Insert the predictions into the table.

titanic_without_survivorship$survived <- predicted

Combine both titanic tables.

titanic_final <- rbind(titanic_without_survivorship, titanic_with_survivorship)

Final check of the dataset:

summary(titanic_final)
##        pclass        sex           age          age_group        sibsp       
##  1st_class:323   Female:466   Min.   : 0.00   Kid    : 108   Min.   :0.0000  
##  2nd_class:277   Male  :843   1st Qu.:22.00   Teenage: 132   1st Qu.:0.0000  
##  3rd_class:709                Median :28.00   Adult  :1059   Median :0.0000  
##                               Mean   :29.69   Elder  :  10   Mean   :0.4989  
##                               3rd Qu.:36.00                  3rd Qu.:1.0000  
##                               Max.   :80.00                  Max.   :8.0000  
##      parch         familysize           fare               embarked   survived 
##  Min.   :0.000   Min.   : 0.0000   Min.   :  0.00   Cherbourg  :270   No :819  
##  1st Qu.:0.000   1st Qu.: 0.0000   1st Qu.:  7.90   Queenstown :123   Yes:490  
##  Median :0.000   Median : 0.0000   Median : 14.45   Southampton:916            
##  Mean   :0.385   Mean   : 0.8839   Mean   : 33.28                              
##  3rd Qu.:0.000   3rd Qu.: 1.0000   3rd Qu.: 31.27                              
##  Max.   :9.000   Max.   :10.0000   Max.   :512.33

There are no more missing values in the dataset, and it is now ready for visualisation.

Saving the file.

#write.csv(titanic_final, "titanic_final.csv")

8 VISUALISATION

tf <- titanic_final

8.1 Passengers across Classes

There are 1309 passenger records. Among them, 323 passengers bought a first-class ticket, 277 a second-class ticket, and 709 a third-class ticket.

# df

df8.1 <- tf %>% 
  group_by(pclass) %>% 
  summarise(count = n())

# plot

plot_ly(df8.1,
        labels = ~pclass,
        values = ~count,
        type = "pie", 
        textinfo = "label+percent",
        textposition = "inside",
        textfont = list(color = "White", size = 30),
        marker = list(line = list(color = "White", width = 2)), 
        showlegend = FALSE) %>% 
  layout(title = "Passenger Counts by Ticket Classes")

8.2 Passengers from Each Port

The following map shows the number of passengers who boarded at each port.

# df of location

embarked <- c("Cherbourg", "Queenstown", "Southampton")
long <- c(-1.620000,  -8.299167, -1.404351)
lat <- c(49.630001, 51.857222, 50.909698)

port_loc <- data.frame(long, lat, embarked)
  
# merge to tf

df8.2 <- tf %>% 
  group_by(embarked, pclass) %>% 
  summarise(count = n()) %>% 
  left_join(port_loc, by = "embarked") %>% 
  ungroup() %>%
  group_by(embarked) %>% 
  mutate(embarked_total = sum(count)) %>% 
  pivot_wider(values_from = count, names_from = pclass)

# plot

leaflet() %>% 
  addTiles(group = "OpenStreetMap (Default)") %>% 
  addProviderTiles(providers$Esri.WorldImagery, group = "Esri") %>%
  addProviderTiles(providers$Stamen.TonerLite, group = "Toner Lite") %>%  
  setView(lng = -4.143841, lat = 50.376289, zoom = 6) %>% 
  addMinicharts(df8.2$long, df8.2$lat,
                chartdata = df8.2$embarked_total,
                width =  4 * sqrt(df8.2$embarked_total),
                showLabels = T,
                fill = "orange",
                opacity = 0.8) %>% 
  addLabelOnlyMarkers(group = "city",
                      data = df8.2,
                      lat = ~ lat, 
                      lng = ~ long,
                      label = ~ embarked,
                      labelOptions = labelOptions(noHide = T, textOnly = F, textsize = 20, opacity = 0.8)) %>% 
  addLayersControl(baseGroups = c("Esri", "Street Map", "Toner Lite"),
                   overlayGroups = c("city"),
                   options = layersControlOptions(collapsed = F))

The following map shows the proportion of ticket-class holders boarding at each port.

leaflet() %>% 
  addTiles(group = "OpenStreetMap (Default)") %>% 
  addProviderTiles(providers$Esri.WorldImagery, group = "Esri") %>%
  addProviderTiles(providers$Stamen.TonerLite, group = "Toner Lite") %>% 
  setView(lng = -4.143841, lat = 50.376289, zoom = 6) %>% 
  addMinicharts(df8.2$long, df8.2$lat,
                type = "pie",
                chartdata = df8.2[, c("1st_class", "2nd_class", "3rd_class")],
                width = 3*sqrt(df8.2$embarked_total),
                colorPalette = c("black", "lightgreen", "blue"),
                opacity = 0.9) %>% 
  addLabelOnlyMarkers(group = "Total Passenger",
                      data = df8.2,
                      lat = ~ lat,
                      lng = ~ long,
                      label = ~ paste0("Total: ", as.character(embarked_total)),
                      labelOptions = labelOptions(noHide = T, textsize = 20, textOnly = F)) %>% 
  addLabelOnlyMarkers(group = "city",
                      data = df8.2,
                      lat = ~ lat, 
                      lng = ~ long,
                      label = ~ embarked,
                      labelOptions = labelOptions(noHide = T, textOnly = F, textsize = 20, opacity = 0.8)) %>% 
  addLayersControl(baseGroups = c("Street Map", "Esri", "Toner Lite"),
                   overlayGroups = c("city", "Total Passenger"),
                   options = layersControlOptions(collapsed = F))

8.3 Ticket Prices

There is not much information about how ticket prices were determined. The following box plot shows that, generally, a higher-class ticket was more expensive.

df8.3 <- ggplot(tf, aes(y = fare, x = pclass)) +
  geom_boxplot() +
  geom_jitter(size = 3, alpha = 0.4, shape = 21, colour = "grey") +
  labs(x = "Ticket Classes",
       y = "Fare") + 
  stat_summary(fun = "mean", geom = "point", size = 6, stroke = 1, shape = 4, colour = "blue") +
  theme_bw()


ggplotly(df8.3)

8.4 Family Sizes

There were many large families in the third-class ticket group.

plot8.4 <- ggplot(tf, aes(x = sex, y = age, fill = age_group, size = familysize)) +
  geom_jitter(alpha = 0.5, shape = 21, width = 0.3) +
  facet_wrap(~pclass) +
  scale_size(range = c(0, 6)) +
  theme_bw()


ggplotly(plot8.4)

8.5 Mortality - Social Classes

Mortality in the third-class group was the highest at 76%, compared with 57% in the second-class group and 37% in the first-class group.

# df

df8.5 <- tf %>% 
  group_by(pclass, survived) %>% 
  summarise(count = n()) %>% 
  group_by(pclass) %>% 
  mutate(pclass_sum = sum(count), 
         percent = paste0(round(count/pclass_sum * 100, 0), "%"))

# plot 

plot8.5 <- ggplot(df8.5, aes(x = pclass, y = count, fill = survived)) +
  geom_bar(stat = "identity", alpha = 0, size = 1, aes(colour = survived), width = 0.6) +
  geom_text(aes(label = percent, colour = survived), position = position_stack(vjust = 0.5), size = 6) +
  theme_minimal() +
  labs(x = "Ticket class", y = "Survival Count",
       title = "Mortality: Social Class") +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"),
        panel.grid = element_blank(),
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0))
      ) +
  scale_colour_manual(values = c("red", "green3")) 

ggplotly(plot8.5) 
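Sections 8.5 to 8.9 repeat the same recipe: group, compute percentages, and draw an outlined stacked bar chart. A reusable helper would avoid the duplication (a sketch; mortality_plot is a hypothetical name, relying on dplyr's {{ }} embracing):

mortality_plot <- function(df, var, x_lab, title) {
  df_sum <- df %>% 
    group_by({{ var }}, survived) %>% 
    summarise(count = n(), .groups = "drop_last") %>% 
    mutate(percent = paste0(round(count / sum(count) * 100), "%"))
  
  ggplot(df_sum, aes(x = {{ var }}, y = count, colour = survived)) +
    geom_bar(stat = "identity", alpha = 0, size = 1, width = 0.6) +
    geom_text(aes(label = percent), position = position_stack(vjust = 0.5), size = 6) +
    theme_minimal() +
    labs(x = x_lab, y = "Survival Count", title = title) +
    theme(legend.position = "none", panel.grid = element_blank()) +
    scale_colour_manual(values = c("red", "green3"))
}

# Usage, e.g.: ggplotly(mortality_plot(tf, sex, "Sex", "Mortality: Sex"))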

8.6 Mortality - Genders

Gender-wise, 84% of males and 24% of females died in the sinking.

# df

df8.6 <- tf %>% 
  group_by(sex, survived) %>% 
  summarise(count = n()) %>% 
  group_by(sex) %>% 
  mutate(sex_sum = sum(count), 
         percent = paste0(round(count/sex_sum * 100, 0), "%"))

# plot

plot8.6 <- ggplot(df8.6, aes(x = sex, y = count, fill = survived)) +
  geom_bar(stat = "identity", alpha = 0, size = 1, aes(colour = survived), width = 0.6) +
  geom_text(aes(label = percent, colour = survived), position = position_stack(vjust = 0.5), size = 6) +
  theme_minimal() +
  labs(x = "Sex", y = "Survival Count", 
       title = "Mortality: Sex") +
  theme(legend.position = "none",
        panel.grid = element_blank(),
        plot.title = element_text(face = "bold"),
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0))
      ) +
  scale_colour_manual(values = c("red", "green3")) 

ggplotly(plot8.6) 

8.7 Mortality - Age Group

Across age groups, 47% of children, 57% of teenagers, 65% of adults, and 90% of elders died in the sinking.

# df

df8.7 <- tf %>% 
  group_by(age_group, survived) %>% 
  summarise(count = n()) %>% 
  group_by(age_group) %>% 
  mutate(age_group_sum = sum(count), 
         percent = paste0(round(count/age_group_sum * 100, 0), "%"))
  

# plot


plot8.7 <- ggplot(df8.7, aes(x = age_group, y = count, fill = survived)) +
  geom_bar(stat = "identity", alpha = 0, size = 1, aes(colour = survived), width = 0.6) +
  geom_text(aes(label = percent, colour = survived), position = position_stack(vjust = 0.5), size = 5) +
  theme_minimal() +
  labs(x = "Sex", y = "Survival Count",
       title = "Mortality: Age Group") +
  theme(legend.position = "none",
        panel.grid = element_blank(),
        plot.title = element_text(face = "bold"),
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0))
      ) +
  scale_colour_manual(values = c("red", "green3")) 

ggplotly(plot8.7) 

8.8 Mortality - Port

All ports had passenger mortality close to or above 50%.

# df

df8.8 <- tf %>% 
  group_by(embarked, survived) %>% 
  summarise(count = n()) %>% 
  group_by(embarked) %>% 
  mutate(embarked_sum = sum(count), 
         percent = paste0(round(count/embarked_sum * 100, 0), "%"))

# plot

plot8.8 <- ggplot(df8.8, aes(x = embarked, y = count, fill = survived)) +
  geom_bar(stat = "identity", alpha = 0, size = 1, aes(colour = survived), width = 0.6) +
  geom_text(aes(label = percent, colour = survived), position = position_stack(vjust = 0.5), size = 5) +
  theme_minimal() +
  labs(x = "Sex", y = "Survival Count",
       title = "Mortality: Port") +
  theme(legend.position = "none",
        panel.grid = element_blank(),
        plot.title = element_text(face = "bold"),
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0))
      ) +
  scale_colour_manual(values = c("red", "green3")) 

ggplotly(plot8.8) 

8.9 Mortality - Overall

The overall mortality was 62%.

# df

df8.9 <- tf %>% 
  group_by(survived) %>% 
  summarise(count = n()) %>% 
  mutate(embarked_sum = sum(count), 
         percent = paste0(round(count/embarked_sum * 100, 0), "%"))

# plot

plot8.9 <- ggplot(df8.9, aes(x = survived, y = count, fill = survived)) +
  geom_bar(stat = "identity", alpha = 0, size = 1, aes(colour = survived), width = 0.6) +
  geom_text(aes(label = percent, colour = survived), position = position_stack(vjust = 0.5), size = 5) +
  theme_minimal() +
  labs(x = "Sex", y = "Survival Count",
       title = "Mortality: embarked") +
  theme(legend.position = "none",
        panel.grid = element_blank(),
        plot.title = element_text(face = "bold"),
        axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0))
      ) +
  scale_colour_manual(values = c("red", "green3")) 

ggplotly(plot8.9) 

8.10 Death Group Analysis

Adult male passengers holding third-class tickets recorded the most deaths.

# df

death_group <- tf %>% 
  dplyr::filter(survived == "No") %>% 
  dplyr::select(-age, -sibsp, -parch, -familysize, -fare)

# plot

plot8.10 <- ggplot(death_group, aes(x = age_group, fill = sex)) + 
  geom_bar(position = position_dodge()) +
  facet_grid(~pclass) +
  theme_bw() +
  theme(axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
        axis.title.y = element_text(margin = margin(0, 10, 0, 0))) +
  labs(x = "Age Group",
       y = "Death Count",
       title = "Analysing the Death Group")



ggplotly(plot8.10)

9 CONCLUSION

From this dataset,

  • 62% of passengers died in the sinking.

  • The death rate was highest among 3rd-class ticket passengers; ticket class is a proxy for socio-economic status.

  • The death rate was highest in the adult male group.

  • 76% of 3rd-class, 57% of 2nd-class, and 37% of 1st-class ticket passengers died in the sinking.

  • 47% of children, 57% of teenagers, 65% of adults, and 90% of elders died in the sinking.

Thank you for reading!

10 REFERENCE

Dave Langer 2017, Intro to Machine Learning with R & caret, Data Science Dojo, viewed 22 October 2021, https://www.youtube.com/watch?v=z8PRU46I3NY&t=1492s

Kaggle 2021, Titanic - Machine Learning from Disaster, viewed 22 October 2021, https://www.kaggle.com/c/titanic/data?select=gender_submission.csv

Willy Stöwer, “Untergang der Titanic”, Die Gartenlaube magazine, public domain, https://commons.wikimedia.org/w/index.php?curid=97646

Wikipedia 2021, Titanic, viewed 24 October 2021, https://en.wikipedia.org/wiki/Titanic