The sinking of the Titanic in 1912 was the deadliest sinking of a single ship at the time. Approximately 1500 of the 2224 people aboard died in the tragedy. This project uses an incomplete dataset to reconstruct the overall picture.
The dataset contains information on 1309 passengers, including their ticket class, gender, family size, ticket fare, age, and, most importantly, their survivorship. 32% of the passengers in the dataset did not have their survivorship recorded, and many other variables also had missing values. Data manipulation, imputation, feature engineering, and machine learning algorithms (k-nearest neighbours, random forest, and extreme gradient boosting) were applied to clean the dataset. A final, fully cleaned dataset was then synthesised for data visualisation to understand the trends in the tragedy.
This project concluded that 62% of passengers died in the sinking, that the death rate was highest among 3rd-class ticket holders, and that it was highest among adult males. Statistically, 76% of 3rd-class, 57% of 2nd-class, and 37% of 1st-class ticket passengers died in the sinking. Among age groups, 47% of children, 57% of teenagers, 65% of adults, and 90% of elders died in the sinking of the Titanic.
Highlights
library(tidyverse)
library(kableExtra)
library(skimr)
library(corrplot)
library(e1071)
library(caret)
library(doSNOW) # allow training in parallel
library(ipred)
library(xgboost)
library(plotly)
library(highcharter)
library(leaflet)
library(leaflet.minicharts)
RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean on 15 April 1912, after striking an iceberg during her voyage from Southampton to New York City (Wikipedia 2021).
According to Wikipedia, there were an estimated 2224 passengers and crew aboard, and the sinking caused an estimated 1500 deaths. The sinking of the Titanic was known at the time as one of the deadliest of a single ship.
In this project, I will analyse a Titanic dataset publicly available from Kaggle. The dataset has information on 1309 passengers, including their survivorship (survived, did not survive, or missing).
I will apply various data science techniques, including data exploration, data manipulation, feature engineering, algorithmic imputation, and machine learning models, to fill in the missing values, including predicting the survival of passengers whose survivorship was unrecorded. I will then use the final table for data visualisation with graphs and maps to understand the overall trends.
The following code loads the datasets into R.
# Data import
train <- read.csv("train.csv")
test <- read.csv("test.csv")
# Combine dataset
train <- train %>%
relocate(Survived, .after = Embarked) %>%
mutate(source = "train")
test <- test %>%
mutate(source = "test")
titanic <- full_join(train, test)
# write.csv(titanic, "titanic_raw.csv")
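As a quick sanity check (a minimal sketch, not part of the original pipeline), the combined table should contain the 891 train rows plus the 418 test rows:
# The combined table should have 891 + 418 = 1309 rows
stopifnot(nrow(titanic) == nrow(train) + nrow(test))
table(titanic$source)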
The following shows a random draw of 10 rows from the imported dataset. We can see variables such as Name, Sex, Age, ticket class, and fare.
sample_n(titanic, 10) %>% kbl(align = "c") %>% kable_styling(bootstrap_options = "border")
| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived | source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1256 | 1 | Harder, Mrs. George Achilles (Dorothy Annan) | female | 25.0 | 1 | 0 | 11765 | 55.4417 | E50 | C | NA | test |
| 397 | 3 | Olsson, Miss. Elina | female | 31.0 | 0 | 0 | 350407 | 7.8542 | | S | 0 | train |
| 1048 | 1 | Bird, Miss. Ellen | female | 29.0 | 0 | 0 | PC 17483 | 221.7792 | C97 | S | NA | test |
| 836 | 1 | Compton, Miss. Sara Rebecca | female | 39.0 | 1 | 1 | PC 17756 | 83.1583 | E49 | C | 1 | train |
| 1207 | 3 | Hagardon, Miss. Kate | female | 17.0 | 0 | 0 | AQ/3. 30631 | 7.7333 | | Q | NA | test |
| 682 | 1 | Hassab, Mr. Hammad | male | 27.0 | 0 | 0 | PC 17572 | 76.7292 | D49 | C | 1 | train |
| 705 | 3 | Hansen, Mr. Henrik Juul | male | 26.0 | 1 | 0 | 350025 | 7.8542 | | S | 0 | train |
| 20 | 3 | Masselmani, Mrs. Fatima | female | NA | 0 | 0 | 2649 | 7.2250 | | C | 1 | train |
| 883 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | | S | 0 | train |
| 123 | 2 | Nasser, Mr. Nicholas | male | 32.5 | 1 | 0 | 237736 | 30.0708 | | C | 0 | train |
This table is adapted from Kaggle.
Variable <- c("PassengerId",
"Pclass",
"Name",
"Sex",
"Age",
"SibSp",
"Parch",
"Ticket",
"Fare",
"Cabin",
"Embarked",
"Survived",
"Source")
Definition <- c("Id of the passenger",
"Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd. It is a A proxy for socio-economic status (SES) with 1st = Upper, 2nd = Middle, 3rd = Lower",
"Name of the passenger",
"Sex",
"Age in years",
"# of siblings / spouses aboard the Titanic",
"# of parents / children aboard the Titanic",
"Ticket number",
"Passenger fare",
"Cabin number",
"Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton",
"Survivalship information: 0 = No, 1 = Yes, blank = missing value",
"train and test. Train has either survived or not survived recorded, whereas test does not. Machine learning will be used to make the prediction.")
data.frame(Variable, Definition) %>%
kbl() %>%
kable_styling(bootstrap_options = c("bordered", "striped"))
| Variable | Definition |
|---|---|
| PassengerId | Id of the passenger |
| Pclass | Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd. A proxy for socio-economic status (SES) with 1st = Upper, 2nd = Middle, 3rd = Lower |
| Name | Name of the passenger |
| Sex | Sex |
| Age | Age in years |
| SibSp | # of siblings / spouses aboard the Titanic |
| Parch | # of parents / children aboard the Titanic |
| Ticket | Ticket number |
| Fare | Passenger fare |
| Cabin | Cabin number |
| Embarked | Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton |
| Survived | Survivorship information: 0 = No, 1 = Yes, blank = missing value |
| Source | train and test. Train has either survived or not survived recorded, whereas test does not. Machine learning will be used to make the prediction. |
There are 1309 rows of observations and 13 columns: 6 columns are recognised as character type and 7 as numeric. It is important to convert some of these to factor for the analysis. I will identify the columns that need this conversion.
skim_without_charts(titanic)
| Name | titanic |
| Number of rows | 1309 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Name | 0 | 1 | 12 | 82 | 0 | 1307 | 0 |
| Sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| Ticket | 0 | 1 | 3 | 18 | 0 | 929 | 0 |
| Cabin | 0 | 1 | 0 | 15 | 1014 | 187 | 0 |
| Embarked | 0 | 1 | 0 | 1 | 2 | 4 | 0 |
| source | 0 | 1 | 4 | 5 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 0 | 1.00 | 655.00 | 378.02 | 1.00 | 328.0 | 655.00 | 982.00 | 1309.00 |
| Pclass | 0 | 1.00 | 2.29 | 0.84 | 1.00 | 2.0 | 3.00 | 3.00 | 3.00 |
| Age | 263 | 0.80 | 29.88 | 14.41 | 0.17 | 21.0 | 28.00 | 39.00 | 80.00 |
| SibSp | 0 | 1.00 | 0.50 | 1.04 | 0.00 | 0.0 | 0.00 | 1.00 | 8.00 |
| Parch | 0 | 1.00 | 0.39 | 0.87 | 0.00 | 0.0 | 0.00 | 0.00 | 9.00 |
| Fare | 1 | 1.00 | 33.30 | 51.76 | 0.00 | 7.9 | 14.45 | 31.27 | 512.33 |
| Survived | 418 | 0.68 | 0.38 | 0.49 | 0.00 | 0.0 | 0.00 | 1.00 | 1.00 |
I identified 263 missing values in Age, 1 in Fare, and 418 in Survived. However, there are also many missing values in “Cabin”: they were recorded as empty strings rather than true blanks, which would be recognised as “NA” in R and detected.
head(titanic$Cabin, 10)
## [1] "" "C85" "" "C123" "" "" "E46" "" "" ""
77% of the values in the Cabin column are missing, so the column will be removed. A common rule of thumb recommends removing a column from predictive analysis when 60% or more of its values are missing.
titanic %>%
dplyr::select(Cabin) %>%
mutate(value = case_when(Cabin == "" ~ "Missing",
TRUE ~ "Not_Missing")) %>%
group_by(value) %>%
summarise(statistics = n()) %>%
mutate(total = sum(statistics),
percent = paste0(round(statistics/total * 100), "%"))
## # A tibble: 2 x 4
## value statistics total percent
## <chr> <int> <int> <chr>
## 1 Missing 1014 1309 77%
## 2 Not_Missing 295 1309 23%
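Empty strings like these slip past NA-based checks such as skim. A small hypothetical helper (count_missing is my own name, not from any package) counts both true NAs and empty strings in every column:
# Count values that are NA or an empty string ("") in each column
count_missing <- function(df) {
  sapply(df, function(x) sum(is.na(x) | x == ""))
}
count_missing(titanic)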
The following output shows another way of looking at the dataset, displaying the data types and initial values.
glimpse(titanic)
## Rows: 1,309
## Columns: 13
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal~
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
## $ source <chr> "train", "train", "train", "train", "train", "train", "tra~
Identified cleaning tasks:
- Remove identifier-like columns: PassengerId, Name, Ticket, and the mostly missing Cabin.
- Convert the remaining character variables, plus the numeric Pclass and Survived, into factors.
- Recode factor levels into readable labels.
- Convert all variable names to lower case.
The following code removes PassengerId, Name, Ticket, and Cabin.
titanic <- titanic %>%
dplyr::select(-PassengerId, -Name, -Ticket, -Cabin)
The following code converts all remaining character variables to factor, as well as the numeric columns Pclass and Survived.
titanic <- titanic %>%
mutate_if(is.character, as.factor) %>%
mutate(Pclass = as.factor(Pclass),
Survived = as.factor(Survived))
This section renames the levels of several variables. It will not affect the analysis; rather, it helps readers understand the levels, especially when they are displayed in graphs.
titanic <- titanic %>%
mutate(Pclass = fct_recode(Pclass,
"1st_class" = "1",
"2nd_class" = "2",
"3rd_class" = "3"),
Sex = fct_recode(Sex,
"Male" = "male",
"Female" = "female"),
Embarked = fct_recode(Embarked,
"Cherbourg" = "C",
"Queenstown" = "Q",
"Southampton" = "S"),
Survived = fct_recode(Survived,
"Yes" = "1",
"No" = "0"))
Converting all variable names to lower case, as more than one naming format is currently in use. This does not affect the analysis, but it makes the table look cleaner and tidier.
names(titanic) <- tolower(names(titanic))
This section applies imputation models to fill in the missing values in the dataset. There are many imputation methods, including the mean, the median, the mode (the most frequent value, generally used for categorical data), and machine learning models that use the entire dataset to predict the missing values.
Missing values that need imputation are present in the following columns:
summary(titanic)
## pclass sex age sibsp parch
## 1st_class:323 Female:466 Min. : 0.17 Min. :0.0000 Min. :0.000
## 2nd_class:277 Male :843 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## 3rd_class:709 Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## fare embarked survived source
## Min. : 0.000 : 2 No :549 test :418
## 1st Qu.: 7.896 Cherbourg :270 Yes :342 train:891
## Median : 14.454 Queenstown :123 NA's:418
## Mean : 33.295 Southampton:914
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
The following code (1) replaces the NA in Fare with the overall fare median, and (2) replaces the 2 NAs in Embarked with “Southampton”, the most frequently occurring level in the column.
titanic <- titanic %>%
mutate(embarked = as.character(embarked),
embarked = na_if(embarked, "")) %>%
mutate(fare = replace_na(fare, median(fare, na.rm = T)), # Impute with median for Fare
embarked = replace_na(embarked, "Southampton"),
embarked = as.factor(embarked)) # Impute with mode for Embarked
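The modal level was hardcoded as “Southampton” above. As a sketch, the mode can also be derived programmatically, which is safer if the underlying data ever changes:
# Derive the most frequent embarkation port instead of hardcoding it
mode_embarked <- names(which.max(table(titanic$embarked)))
mode_embarked # "Southampton" in this dataset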
The following code imputes the missing values in “Age” using a bagged tree algorithm.
# Convert to dummy format because the relevant caret functions do not work with categorical data
dummy_function <- dummyVars(~., data = titanic[, -8]) # Exclude "survived" (column 8)
titanic_dummy <- dummy_function %>% predict(titanic[, -8])
# Impute with Bagged tree models
Bagimpute_function <- titanic_dummy %>% preProcess(method = "bagImpute")
titanic_dummy_impute <- Bagimpute_function %>% predict(titanic_dummy)
# Extract Age from titanic_dummy_impute into titanic table
titanic$age <- titanic_dummy_impute[, 6]
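Extracting by position is fragile if columns are added or reordered. Assuming the imputed matrix keeps its column names (dummyVars passes numeric variable names through unchanged), extraction by name is a safer sketch:
# Equivalent extraction by column name rather than by position
titanic$age <- titanic_dummy_impute[, "age"]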
All missing values in the dataset have now been filled in, leaving only the “survived” column with 418 missing values. This is the response variable of the analysis, and the survivorship of these passengers will be predicted via machine learning in a later section.
summary(titanic)
## pclass sex age sibsp parch
## 1st_class:323 Female:466 Min. : 0.17 Min. :0.0000 Min. :0.000
## 2nd_class:277 Male :843 1st Qu.:21.87 1st Qu.:0.0000 1st Qu.:0.000
## 3rd_class:709 Median :28.38 Median :0.0000 Median :0.000
## Mean :29.72 Mean :0.4989 Mean :0.385
## 3rd Qu.:36.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## fare embarked survived source
## Min. : 0.000 Cherbourg :270 No :549 test :418
## 1st Qu.: 7.896 Queenstown :123 Yes :342 train:891
## Median : 14.454 Southampton:916 NA's:418
## Mean : 33.281
## 3rd Qu.: 31.275
## Max. :512.329
Cleaning up the age column: ages should not have decimal values, so I am rounding the imputed values, since they are only estimates.
titanic <- titanic %>%
mutate(age = round(age))
Since fares usually come with 2 decimal places, I will round “fare” from 4 decimal places to 2.
titanic <- titanic %>%
mutate(fare = round(fare, 2))
Since “SibSp” (number of siblings or spouses) and “Parch” (number of parents or children) together give the number of family members a passenger travelled with, combining them creates a new variable, “familysize”.
titanic <- titanic %>%
mutate(familysize = sibsp + parch) %>%
relocate(familysize, .after = parch)
Grouping age ranges into an “age_group” variable with the levels kid, teenage, adult, and elder.
titanic <- titanic %>% mutate(age_group = case_when(age <= 12 ~ "Kid",
age >= 13 & age <= 19 ~ "Teenage",
age >= 20 & age <= 65 ~ "Adult",
age >= 66 ~ "Elder"),
age_group = factor(age_group, levels = c("Kid", "Teenage", "Adult", "Elder"))) %>%
relocate(age_group, .after = age)
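As a cross-check (a sketch only), the same binning can be expressed with cut(); since the imputed ages were rounded to whole numbers, the two approaches should agree exactly:
# Equivalent binning with cut(); right-closed intervals match the case_when rules
age_group_chk <- cut(titanic$age,
                     breaks = c(-Inf, 12, 19, 65, Inf),
                     labels = c("Kid", "Teenage", "Adult", "Elder"))
identical(as.character(age_group_chk), as.character(titanic$age_group))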
The dataset has now been cleaned.
There are 418 passengers who do not have their survivorship recorded in the dataset. I will predict their survivorship from the relevant variables with the aid of machine learning algorithms.
summary(titanic$survived)
## No Yes NA's
## 549 342 418
The dataset will be split in two stages.
First, split the dataset into two parts: one with survivorship recorded and one without.
titanic_with_survivorship <- titanic %>%
filter(source == "train") %>%
dplyr::select(-source)
titanic_without_survivorship <- titanic %>%
filter(source == "test") %>%
dplyr::select(-source)
Then split the part with survival information into an 80% train set and a 20% test set.
set.seed(123)
# Create data partition
training.set <- titanic_with_survivorship$survived %>% createDataPartition(p = 0.8, list = F)
# Get train and test set
train.set <- titanic_with_survivorship[training.set, ]
test.set <- titanic_with_survivorship[-training.set, ]
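Because createDataPartition stratifies on the outcome, both subsets should show a similar survival ratio; a quick check (a sketch, not in the original workflow):
# Survival proportions should be nearly identical across the two subsets
prop.table(table(train.set$survived))
prop.table(table(test.set$survived))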
This section trains a non-parametric algorithm, KNN, on the train set and makes predictions on the test set.
model_knn <- train(survived ~., data = train.set,
method = "knn",
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 3),
preProcess = c("center", "scale")
)
plot(model_knn)
According to the graph above and the following output, the best k is 9. It will automatically be used as the k value when this KNN model makes predictions.
model_knn$bestTune
## k
## 3 9
Applying the KNN model to make predictions on the test set and evaluating its predictive performance (accuracy).
# Make predictions
prediction_knn <- model_knn %>% predict(test.set)
# Test performance
mean(prediction_knn == test.set$survived)
## [1] 0.8248588
A confusion matrix provides the other performance metrics of this model.
CM <- confusionMatrix(prediction_knn, test.set$survived)
CM
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 100 22
## Yes 9 46
##
## Accuracy : 0.8249
## 95% CI : (0.7607, 0.8778)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.304e-09
##
## Kappa : 0.6161
##
## Mcnemar's Test P-Value : 0.03114
##
## Sensitivity : 0.9174
## Specificity : 0.6765
## Pos Pred Value : 0.8197
## Neg Pred Value : 0.8364
## Prevalence : 0.6158
## Detection Rate : 0.5650
## Detection Prevalence : 0.6893
## Balanced Accuracy : 0.7970
##
## 'Positive' Class : No
##
This section applies the random forest algorithm to the train set and makes predictions on the test set.
set.seed(123)
model_rf <- train(survived ~., data = train.set,
method = "rf",
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 3),
importance = TRUE,
tuneLength = 9
)
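Since importance = TRUE was set above, caret's varImp() can show which predictors the random forest relied on most; an optional sketch:
# Inspect and plot variable importance for the random forest model
plot(varImp(model_rf))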
Making predictions with the random forest model and evaluating its predictive performance (accuracy).
# Make predictions
prediction_rf <- model_rf %>% predict(test.set)
# Test performance
mean(prediction_rf == test.set$survived)
## [1] 0.8248588
A confusion matrix provides the other performance metrics of this model.
confusionMatrix(prediction_rf, test.set$survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 98 20
## Yes 11 48
##
## Accuracy : 0.8249
## 95% CI : (0.7607, 0.8778)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.304e-09
##
## Kappa : 0.6204
##
## Mcnemar's Test P-Value : 0.1508
##
## Sensitivity : 0.8991
## Specificity : 0.7059
## Pos Pred Value : 0.8305
## Neg Pred Value : 0.8136
## Prevalence : 0.6158
## Detection Rate : 0.5537
## Detection Prevalence : 0.6667
## Balanced Accuracy : 0.8025
##
## 'Positive' Class : No
##
This section applies extreme gradient boosting (XGBoost), an alternative to the random forest algorithm. The model is built with the following code.
set.seed(123)
# Tuning
my_tunes <- expand.grid(eta = c(0.05, 0.075, 0.1),
nrounds = c(50, 75, 100),
max_depth = 6:8,
min_child_weight = c(2.0, 2.25, 2.5),
colsample_bytree = c(0.3, 0.4, 0.5),
gamma = 0,
subsample = 1)
# Initiate parallel computing to speed up boosting process
my_cluster <- makeCluster(5, type = "SOCK")
registerDoSNOW(my_cluster)
# Build the model
model_xgb <- train(survived ~., data = train.set,
method = "xgbTree",
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
search = "grid"),
tuneGrid = my_tunes)
# Stop my_cluster
stopCluster(my_cluster)
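The cluster size of 5 is hardcoded above. A more portable sketch (assuming the base parallel package is available) sizes the cluster from the machine, leaving one core free:
# Size the cluster from the available cores, leaving one free for the OS
n_workers <- max(1, parallel::detectCores() - 1)
my_cluster <- makeCluster(n_workers, type = "SOCK")
registerDoSNOW(my_cluster)
# ... run train() as above, then:
stopCluster(my_cluster)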
Making predictions with the XGBoost model and evaluating its performance.
# Make predictions
prediction_xgb <- model_xgb %>% predict(test.set)
# Test performance
mean(prediction_xgb == test.set$survived)
## [1] 0.8022599
A confusion matrix provides the other performance metrics of this model.
confusionMatrix(prediction_xgb, test.set$survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 94 20
## Yes 15 48
##
## Accuracy : 0.8023
## 95% CI : (0.7359, 0.8582)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 7.432e-08
##
## Kappa : 0.5762
##
## Mcnemar's Test P-Value : 0.499
##
## Sensitivity : 0.8624
## Specificity : 0.7059
## Pos Pred Value : 0.8246
## Neg Pred Value : 0.7619
## Prevalence : 0.6158
## Detection Rate : 0.5311
## Detection Prevalence : 0.6441
## Balanced Accuracy : 0.7841
##
## 'Positive' Class : No
##
Since the KNN model has the highest sensitivity and ties with random forest for the highest accuracy, I will use it to make predictions on the passengers whose survivorship was not recorded.
tests <- c("Model_knn", "Model_rf", "Model_xgb")
accuracy <- c(mean(prediction_knn == test.set$survived),
mean(prediction_rf == test.set$survived),
mean(prediction_xgb == test.set$survived))
sensitivity <- c(0.9174, 0.8991, 0.8624) # taken from the confusion matrices above
specificity <- c(0.6765, 0.7059, 0.7059)
# Data frame
model_compare <- data.frame(tests, accuracy, sensitivity, specificity)
# Data transform
df6.4 <- model_compare %>%
pivot_longer(c(2:4), names_to = "metric", values_to = "values")
# plot
ggplot(df6.4, aes(x = metric, y = values, fill = tests)) +
geom_bar(stat = "identity", position = position_dodge()) +
geom_text(aes(label = round(values, 2)), position = position_dodge(width = 0.9), vjust = -0.7) +
theme_bw() +
labs(x = "Metric",
y = "Values",
fill = "Model",
title = "Model_KNN has the best accuracy, sensitivity, and third-place in specificity") +
scale_y_continuous(lim = c(0, 1))
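The sensitivity and specificity values above were typed in by hand from the printed confusion matrices. A less error-prone sketch (assuming all three prediction vectors are still in memory) extracts the metrics programmatically:
# Build the three confusion matrices and pull the metrics from them directly
cm_list <- list(Model_knn = confusionMatrix(prediction_knn, test.set$survived),
                Model_rf = confusionMatrix(prediction_rf, test.set$survived),
                Model_xgb = confusionMatrix(prediction_xgb, test.set$survived))
sapply(cm_list, function(cm) c(accuracy = unname(cm$overall["Accuracy"]),
                               sensitivity = unname(cm$byClass["Sensitivity"]),
                               specificity = unname(cm$byClass["Specificity"])))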
A quick recap: the dataset has 1309 rows, of which 418 passengers do not have their survivorship recorded. I trained 3 machine learning algorithms and found that the KNN model had the best predictive power.
Now, I will use the KNN model to predict the survivorship of these 418 passengers to obtain a final cleaned dataset for visualisation.
Making the predictions with the following code.
predicted <- model_knn %>% predict(titanic_without_survivorship)
predicted
## [1] No No No No No No Yes No Yes No No No Yes No Yes Yes No No
## [19] No No No Yes Yes No Yes No Yes No Yes No No No No No Yes No
## [37] No No No No No No No Yes Yes No No No Yes No No No Yes Yes
## [55] No No No No No Yes No No No Yes Yes Yes Yes No No Yes Yes No
## [73] No No Yes No No Yes No Yes Yes No No No No No Yes Yes Yes Yes
## [91] No No Yes No No No No No No No Yes No No No Yes No No No
## [109] No No No Yes Yes Yes Yes No No Yes No Yes Yes No Yes No No Yes
## [127] No Yes No No No No No No No No No No Yes No No Yes Yes No
## [145] Yes No No No No No Yes No No No No No Yes Yes Yes No Yes Yes
## [163] Yes No No Yes No No Yes No No No No No No Yes Yes No Yes Yes
## [181] No Yes Yes No Yes No Yes No No No No No Yes No Yes No Yes Yes
## [199] No No Yes Yes No Yes No No Yes No Yes No No No No Yes No No
## [217] Yes Yes Yes No Yes No Yes No Yes Yes No Yes No No No Yes No No
## [235] Yes No No No Yes Yes Yes Yes Yes No No No Yes No Yes No Yes No
## [253] Yes No No No No No No No No No Yes Yes No No No No No No
## [271] No No Yes Yes No Yes No No No No No Yes Yes Yes Yes No No No
## [289] No No No Yes No No No No Yes No Yes No No No No No Yes Yes
## [307] Yes Yes Yes No No No No Yes Yes Yes Yes No No No No No No Yes
## [325] Yes No Yes No No No Yes No No Yes No Yes No No No No No No
## [343] No Yes No Yes No No No Yes Yes No No Yes Yes No Yes No No No
## [361] No Yes Yes No Yes No No No Yes No No Yes No No Yes Yes No No
## [379] No No No No No Yes No Yes No No No No Yes Yes No No No Yes
## [397] No Yes No No Yes No Yes No No No No No Yes Yes Yes Yes No No
## [415] Yes No No No
## Levels: No Yes
Inserting the predictions into the dataset.
titanic_without_survivorship$survived <- predicted
Combine both titanic tables.
titanic_final <- rbind(titanic_without_survivorship, titanic_with_survivorship)
Final check of the dataset:
summary(titanic_final)
## pclass sex age age_group sibsp
## 1st_class:323 Female:466 Min. : 0.00 Kid : 108 Min. :0.0000
## 2nd_class:277 Male :843 1st Qu.:22.00 Teenage: 132 1st Qu.:0.0000
## 3rd_class:709 Median :28.00 Adult :1059 Median :0.0000
## Mean :29.69 Elder : 10 Mean :0.4989
## 3rd Qu.:36.00 3rd Qu.:1.0000
## Max. :80.00 Max. :8.0000
## parch familysize fare embarked survived
## Min. :0.000 Min. : 0.0000 Min. : 0.00 Cherbourg :270 No :819
## 1st Qu.:0.000 1st Qu.: 0.0000 1st Qu.: 7.90 Queenstown :123 Yes:490
## Median :0.000 Median : 0.0000 Median : 14.45 Southampton:916
## Mean :0.385 Mean : 0.8839 Mean : 33.28
## 3rd Qu.:0.000 3rd Qu.: 1.0000 3rd Qu.: 31.27
## Max. :9.000 Max. :10.0000 Max. :512.33
There are no more missing values in the dataset, and it is now ready for visualisation.
Saving the file.
#write.csv(titanic_final, "titanic_final.csv")
tf <- titanic_final
There are 1309 rows of passenger information; among them, 323 passengers bought a first-class ticket, 277 a second-class ticket, and 709 a third-class ticket.
# df
df8.1 <- tf %>%
group_by(pclass) %>%
summarise(count = n())
# plot
plot_ly(df8.1,
labels = ~pclass,
values = ~count,
type = "pie",
textinfo = "label+percent",
textposition = "inside",
textfont = list(color = "White", size = 30),
marker = list(line = list(color = "White", width = 2)),
showlegend = FALSE) %>%
layout(title = "Passenger Counts by Ticket Classes")
The following map shows the number of passengers who boarded at each port.
# df of location
embarked <- c("Cherbourg", "Queenstown", "Southampton")
long <- c(-1.620000, -8.299167, -1.404351)
lat <- c(49.630001, 51.857222, 50.909698)
port_loc <- data.frame(long, lat, embarked)
# merge to tf
df8.2 <- tf %>%
group_by(embarked, pclass) %>%
summarise(count = n()) %>%
left_join(port_loc, by = "embarked") %>%
ungroup() %>%
group_by(embarked) %>%
mutate(embarked_total = sum(count)) %>%
pivot_wider(values_from = count, names_from = pclass)
# plot
leaflet() %>%
addTiles(group = "Street Map") %>%
addProviderTiles(providers$Esri.WorldImagery, group = "Esri") %>%
addProviderTiles(providers$Stamen.TonerLite, group = "Toner Lite") %>%
setView(lng = -4.143841, lat = 50.376289, zoom = 6) %>%
addMinicharts(df8.2$long, df8.2$lat,
chartdata = df8.2$embarked_total,
width = 4 * sqrt(df8.2$embarked_total),
showLabels = T,
fill = "orange",
opacity = 0.8) %>%
addLabelOnlyMarkers(group = "city",
data = df8.2,
lat = ~ lat,
lng = ~ long,
label = ~ embarked,
labelOptions = labelOptions(noHide = T, textOnly = F, textsize = 20, opacity = 0.8)) %>%
addLayersControl(baseGroups = c("Esri", "Street Map", "Toner Lite"),
overlayGroups = c("city"),
options = layersControlOptions(collapsed = F))
The following map shows the proportion of each ticket class among passengers boarding at each port.
leaflet() %>%
addTiles(group = "Street Map") %>%
addProviderTiles(providers$Esri.WorldImagery, group = "Esri") %>%
addProviderTiles(providers$Stamen.TonerLite, group = "Toner Lite") %>%
setView(lng = -4.143841, lat = 50.376289, zoom = 6) %>%
addMinicharts(df8.2$long, df8.2$lat,
type = "pie",
chartdata = df8.2[, c("1st_class", "2nd_class", "3rd_class")],
width = 3*sqrt(df8.2$embarked_total),
colorPalette = c("black", "lightgreen", "blue"),
opacity = 0.9) %>%
addLabelOnlyMarkers(group = "Total Passenger",
data = df8.2,
lat = ~ lat,
lng = ~ long,
label = ~ paste0("Total: ", as.character(embarked_total)),
labelOptions = labelOptions(noHide = T, textsize = 20, textOnly = F)) %>%
addLabelOnlyMarkers(group = "city",
data = df8.2,
lat = ~ lat,
lng = ~ long,
label = ~ embarked,
labelOptions = labelOptions(noHide = T, textOnly = F, textsize = 20, opacity = 0.8)) %>%
addLayersControl(baseGroups = c("Street Map", "Esri", "Toner Lite"),
overlayGroups = c("city", "Total Passenger"),
options = layersControlOptions(collapsed = F))
There is not much information about how ticket prices were determined. The following box plot shows that a higher-class ticket was generally more expensive.
df8.3 <- ggplot(tf, aes(y = fare, x = pclass)) +
geom_boxplot() +
geom_jitter(size = 3, alpha = 0.4, shape = 21, colour = "grey") +
labs(x = "Ticket Classes",
y = "Fare") +
stat_summary(fun = "mean", geom = "point", size = 6, stroke = 1, shape = 4, colour = "blue") +
theme_bw()
ggplotly(df8.3)
There are many large families in the third-class ticket group.
plot8.4 <- ggplot(tf, aes(x = sex, y = age, fill = age_group, size = familysize)) +
geom_jitter(alpha = 0.5, shape = 21, width = 0.3) +
facet_wrap(~pclass) +
scale_size(range = c(0, 6)) +
theme_bw()
ggplotly(plot8.4)
Gender-wise, 84% of males and 24% of females died in the sinking.
# df
df8.6 <- tf %>%
group_by(sex, survived) %>%
summarise(count = n()) %>%
group_by(sex) %>%
mutate(sex_sum = sum(count),
percent = paste0(round(count/sex_sum * 100, 0), "%"))
# plot
plot8.6 <- ggplot(df8.6, aes(x = sex, y = count, fill = survived)) +
geom_bar(stat = "identity", alpha = 0, size = 1, aes(colour = survived), width = 0.6) +
geom_text(aes(label = percent, colour = survived), position = position_stack(vjust = 0.5), size = 6) +
theme_minimal() +
labs(x = "Sex", y = "Survival Count",
title = "Mortality: Sex") +
theme(legend.position = "none",
panel.grid = element_blank(),
plot.title = element_text(face = "bold"),
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
axis.title.y = element_text(margin = margin(0, 10, 0, 0))
) +
scale_colour_manual(values = c("red", "green3"))
ggplotly(plot8.6)
Across age groups, 47% of children, 57% of teenagers, 65% of adults, and 90% of elders died in the sinking.
# df
df8.7 <- tf %>%
group_by(age_group, survived) %>%
summarise(count = n()) %>%
group_by(age_group) %>%
mutate(age_group_sum = sum(count),
percent = paste0(round(count/age_group_sum * 100, 0), "%"))
# plot
plot8.7 <- ggplot(df8.7, aes(x = age_group, y = count, fill = survived)) +
geom_bar(stat = "identity", alpha = 0, size = 1, aes(colour = survived), width = 0.6) +
geom_text(aes(label = percent, colour = survived), position = position_stack(vjust = 0.5), size = 5) +
theme_minimal() +
labs(x = "Sex", y = "Survival Count",
title = "Mortality: Age Group") +
theme(legend.position = "none",
panel.grid = element_blank(),
plot.title = element_text(face = "bold"),
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
axis.title.y = element_text(margin = margin(0, 10, 0, 0))
) +
scale_colour_manual(values = c("red", "green3"))
ggplotly(plot8.7)
All embarkation ports had casualty rates close to or above 50%.
# df
df8.8 <- tf %>%
group_by(embarked, survived) %>%
summarise(count = n()) %>%
group_by(embarked) %>%
mutate(embarked_sum = sum(count),
percent = paste0(round(count/embarked_sum * 100, 0), "%"))
# plot
plot8.8 <- ggplot(df8.8, aes(x = embarked, y = count, fill = survived)) +
geom_bar(stat = "identity", alpha = 0, size = 1, aes(colour = survived), width = 0.6) +
geom_text(aes(label = percent, colour = survived), position = position_stack(vjust = 0.5), size = 5) +
theme_minimal() +
labs(x = "Sex", y = "Survival Count",
title = "Mortality: Port") +
theme(legend.position = "none",
panel.grid = element_blank(),
plot.title = element_text(face = "bold"),
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
axis.title.y = element_text(margin = margin(0, 10, 0, 0))
) +
scale_colour_manual(values = c("red", "green3"))
ggplotly(plot8.8)
The overall casualty rate is 62%.
# df
df8.9 <- tf %>%
group_by(survived) %>%
summarise(count = n()) %>%
mutate(embarked_sum = sum(count),
percent = paste0(round(count/embarked_sum * 100, 0), "%"))
# plot
plot8.9 <- ggplot(df8.9, aes(x = survived, y = count, fill = survived)) +
geom_bar(stat = "identity", alpha = 0, size = 1, aes(colour = survived), width = 0.6) +
geom_text(aes(label = percent, colour = survived), position = position_stack(vjust = 0.5), size = 5) +
theme_minimal() +
labs(x = "Sex", y = "Survival Count",
title = "Mortality: embarked") +
theme(legend.position = "none",
panel.grid = element_blank(),
plot.title = element_text(face = "bold"),
axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
axis.title.y = element_text(margin = margin(0, 10, 0, 0))
) +
scale_colour_manual(values = c("red", "green3"))
ggplotly(plot8.9)
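The same group-count-percent pattern has now been repeated for sex, age group, port, and the overall rate. A small hypothetical helper (survival_breakdown is my own name, using dplyr's tidy evaluation) would avoid the duplication:
# Survival counts and percentages for any grouping variable
survival_breakdown <- function(df, group_var) {
  df %>%
    group_by({{ group_var }}, survived) %>%
    summarise(count = n(), .groups = "drop_last") %>%
    mutate(percent = paste0(round(count / sum(count) * 100), "%")) %>%
    ungroup()
}
survival_breakdown(tf, sex)
survival_breakdown(tf, age_group)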
Adult male passengers holding 3rd-class tickets had the highest death rate.
# df
death_group <- tf %>%
dplyr::filter(survived == "No") %>%
dplyr::select(-age, -sibsp, -parch, -familysize, -fare)
# plot
plot8.10 <- ggplot(death_group, aes(x = age_group, fill = sex)) +
geom_bar(position = position_dodge()) +
facet_grid(~pclass) +
theme_bw() +
theme(axis.title.x = element_text(margin = margin(10, 0, 0, 0)),
axis.title.y = element_text(margin = margin(0, 10, 0, 0))) +
labs(x = "Age Group",
y = "Death Count",
title = "Analysing the Death Group")
ggplotly(plot8.10)
From this dataset,
- 62% of passengers died in the sinking.
- The death rate was highest among 3rd-class ticket passengers; ticket class is a proxy for socio-economic status.
- The death rate was highest in the adult male group.
- 76% of 3rd-class, 57% of 2nd-class, and 37% of 1st-class ticket passengers died in the sinking.
- 47% of children, 57% of teenagers, 65% of adults, and 90% of elders died in the sinking.
Thank you for reading!
Dave Langer 2017, Intro to Machine Learning with R & caret, Data Science Dojo, viewed 22 October 2021, https://www.youtube.com/watch?v=z8PRU46I3NY&t=1492s
Kaggle 2021, Titanic - Machine Learning from Disaster, viewed 22 October 2021, https://www.kaggle.com/c/titanic/data?select=gender_submission.csv
“Untergang der Titanic”, By Willy Stöwer - Magazine Die Gartenlaube, en:Die Gartenlaube and de:Die Gartenlaube, Public Domain, https://commons.wikimedia.org/w/index.php?curid=97646
Wikipedia 2021, Titanic, viewed 24 October 2021, https://en.wikipedia.org/wiki/Titanic