1 Background Information

The R.M.S. Titanic, an ocean passenger ship commissioned in England in 1910, was one of the most luxurious, state-of-the-art ships of its time. At nearly 269 meters long, the ship was thought to be unsinkable.

However, on April 15, 1912, the unthinkable happened. During the ship’s maiden voyage from Southampton, England, to New York City, New York, the Titanic struck an iceberg in the North Atlantic Ocean off the coast of Newfoundland, Canada and sank, tragically taking the lives of more than 1,500 passengers and crew. It was, and remains, one of the greatest nautical disasters in history.

Image by Olisa Obiora from Unsplash

More than a century has passed since the R.M.S. Titanic sank in the Atlantic Ocean. While the sinking of the Titanic remains a tragedy, the world has learned many lessons from it to prevent such a disaster from happening again. For example, after the disaster, both the British and American Boards of Inquiry recommended that ships carry enough lifeboats for all aboard, that lifeboat drills be mandatory, that lifeboats be inspected, and so on. Many of these recommendations were incorporated into the International Convention for the Safety of Life at Sea, adopted in 1914.

From a data science perspective, the passenger data can give us valuable insights that help us understand this tragedy better. More about this in later sections.

2 Environment Setup and Dataset

2.1 Environment and Library Setup

To carry out the data wrangling and data visualization, we are going to use tidyverse and waffle.

Tidyverse is a collection of packages commonly used for data science. Some of these packages are:

dplyr -> Data Wrangling
ggplot2 -> Visualization
readr -> Input & Output

Waffle is an extension of ggplot2 that lets you create waffle plots, which we will use in this analysis report.

Plotly is also used in conjunction with ggplot2 to make the plots interactive.

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(waffle)
library(plotly)

## Warning: package 'plotly' was built under R version 4.0.4

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

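Do not worry if you see warnings or conflict messages. A warning that does not lead to an error usually does not stop you from continuing your analysis. A conflict, on the other hand, means that two or more attached packages export a function with the same name. The most recently attached package masks the others, which can cause an error or an unexpected result if the wrong function is called. Below is how you can resolve a conflict when it happens.

State explicitly which function you want by prefixing it with its package name: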

dplyr::filter
stats::filter

In short, it is okay to ignore a warning as long as it does not result in an error, and the package::function syntax above avoids conflicts.
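As a minimal sketch of the package::function syntax, the example below calls both versions of filter() on a small toy data frame. The data frame and the variable names (passengers, adults_over_30, smoothed_age) are made up for illustration and are not part of the Titanic dataset.

```r
library(dplyr)

# Toy data frame, used only for illustration
passengers <- data.frame(id = 1:5, age = c(22, 38, 26, 35, 54))

# dplyr::filter() keeps the rows that satisfy a condition
adults_over_30 <- dplyr::filter(passengers, age > 30)

# stats::filter() applies a linear filter to a numeric series
# (here: a centred 3-point moving average)
smoothed_age <- stats::filter(passengers$age, rep(1 / 3, 3))
```

Because each call names its package, the result does not depend on the order in which the libraries were attached.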

What to do when warnings appear

Source: https://twitter.com/data_question/status/1328346236747870208/photo/1

2.2 The Titanic Dataset

The full dataset can be downloaded from the link below:

Link: https://www.kaggle.com/c/titanic/data

Since we are not building a machine learning model but rather performing several relatively simple statistical analyses and data visualizations, we are only going to use train.csv. It is the file that contains the Survived column, which the survival analysis below depends on.

2.2.1 Data Explanation

titanic <- read_csv("train.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   PassengerId = col_double(),
##   Survived = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )

A nice feature of readr::read_csv() compared with base R's utils::read.csv() is that it prints the data type each column was parsed as.

Looking at the column specification above, several columns are really categorical: Survived and Pclass were parsed as numbers, and Sex and Embarked as characters. We can convert them to factors with the following code:

# Convert the categorical columns to factors
factor_cols <- c("Survived", "Pclass", "Sex", "Embarked")
titanic[factor_cols] <- lapply(titanic[factor_cols], as.factor)
sapply(titanic, class) # check the column data types

## PassengerId    Survived      Pclass        Name         Sex         Age 
##   "numeric"    "factor"    "factor" "character"    "factor"   "numeric" 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##   "numeric"   "numeric" "character"   "numeric" "character"    "factor"
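Since the tidyverse is already loaded, the same conversion could also be written with dplyr. The sketch below is one possible alternative, not the code used in this report. It runs on a small made-up tibble (toy) so that it is self-contained, and it assumes dplyr 1.0.0 or later, where across() is available.

```r
library(dplyr)

# Toy tibble with the same column names as the Titanic data (illustrative values)
toy <- tibble(
  Survived = c(0, 1, 1),
  Pclass   = c(3, 1, 2),
  Sex      = c("male", "female", "female"),
  Embarked = c("S", "C", NA),
  Age      = c(22, 38, 26)
)

# Convert only the categorical columns to factors; Age stays numeric
toy <- toy %>%
  mutate(across(c(Survived, Pclass, Sex, Embarked), as.factor))

sapply(toy, class) # check the column data types
```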

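From Kaggle, we can check the description for each column. Below is a brief overview/explanation.

PassengerId -> Passenger identifier
Survived -> Survival (0 = No, 1 = Yes)
Pclass -> Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name -> Passenger name
Sex -> Sex
Age -> Age in years
SibSp -> Number of siblings/spouses aboard
Parch -> Number of parents/children aboard
Ticket -> Ticket number
Fare -> Passenger fare
Cabin -> Cabin number
Embarked -> Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)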

3 Discussion

3.1 Passenger Identity

Before we get into the analysis, let’s first get to know the Titanic passengers. The distributions we would like to look at are:

Sex
Ticket class, as a proxy for socio-economic status
Port of embarkation

Before we calculate the distributions, we will first use switch to replace several coded values with more descriptive labels, using the following code:

titanic$Pclass <- sapply(
  X = as.character(titanic$Pclass), # Data
  FUN = switch, 
  "1" = "First Class",
  "2" = "Middle Class",
  "3" = "Lower Class"
  )

titanic$Embarked <- sapply(
  X = as.character(titanic$Embarked), # Data
  FUN = switch, 
  "C" = "Cherbourg",
  "Q" = "Queenstown",
  "S" = "Southampton"
  )
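An alternative to sapply() with switch is a named lookup vector. The sketch below is one possible approach rather than the method used in this report. It works on a made-up vector of port codes (embarked_raw) that includes a missing value, and the names port_labels and embarked_label are hypothetical.

```r
# Raw codes as they might appear in the Embarked column (toy data, with one NA)
embarked_raw <- c("S", "C", NA, "Q", "S")

# Named lookup vector: names are the raw codes, values are the readable labels
port_labels <- c(C = "Cherbourg", Q = "Queenstown", S = "Southampton")

# Indexing by name translates every element at once; a missing code stays NA
embarked_label <- unname(port_labels[embarked_raw])
```

The result is a plain character vector of the same length as the input, which makes it straightforward to pass on to count() or as.factor().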

Now that we’re done, let’s look at the passenger distribution…

3.1.1 Sex distribution with stacked bar plot

titanic_agg1 <- titanic %>%
  count(Sex)

# The percentage labels are computed from the counts. alpha is a fixed setting,
# so it is passed to geom_col() instead of being mapped inside aes().
ggplot(titanic_agg1, aes(x = "", y = n, fill = Sex,
                         label = paste0(round(100 * n / sum(n), 2), "%"))) +
  geom_col(col = "black", width = 0.5, alpha = 0.5) +
  labs(title = "Titanic Passengers' Sex Distribution",
       caption = "source: Kaggle",
       y = "Number of passengers",
       x = ""
       ) +
  guides(fill = "none") + # hide the fill legend
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        plot.title = element_text(hjust = 0.5)
        ) +
  coord_flip() +
  geom_text(size = 6, position = position_stack(vjust = 0.5))

We can conclude that male passengers outnumber female passengers in this dataset, at roughly 65% versus 35%.
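The percentages shown on the plot can also be computed as a column, which makes them easy to inspect as numbers. This is a minimal sketch on a made-up tibble (toy), not the real passenger data, and the name sex_share is hypothetical.

```r
library(dplyr)

# Toy passenger table (illustrative only, not the real data)
toy <- tibble(Sex = c("male", "male", "female", "male", "female"))

# Count passengers per sex, then express each count as a percentage of the total
sex_share <- toy %>%
  count(Sex) %>%
  mutate(pct = round(100 * n / sum(n), 2))

sex_share
```

Applied to the titanic data, the same two steps would give the shares that the labels in the bar plot display.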

3.1.2 Socio-economic status with waffle chart

titanic_agg2 <- titanic %>%
  count(Pclass) %>%
  arrange(desc(n))

#create a named numeric vector
titanic_vector <- structure(titanic_agg2$n, 
                            .Names = as.character(titanic_agg2$Pclass))

waffle(parts = titanic_vector/10,
       title="Titanic Passengers Socio-Economic Class Distribution"
       ) +
  coord_flip() +
  labs(
    y = "1 square = 10 persons"
  )

## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

From this plot, we see that most of the passengers were in the lower class. There were slightly more first class than middle class passengers, but the two groups are comparable in size.

3.1.3 Port of embarkation with bar plot

titanic_agg3 <- titanic %>%
  count(Embarked)

titanic_agg3$Embarked <- as.factor(as.character(titanic_agg3$Embarked))


# Bars are ordered by passenger count, with the count printed inside each bar
ggplot(titanic_agg3, aes(x = reorder(Embarked, n), y = n, label = n)) +
  geom_col(fill = "#f68060", alpha = .6, width = .4) +
  labs(title = "Titanic Passengers' Port of Embarkation Distribution",
       caption = "source: Kaggle",
       y = "Number of passengers",
       x = ""
       ) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        plot.title = element_text(hjust = 0.5)
        ) +
  geom_text(size = 3.5, position = position_stack(vjust = 0.5))

Most passengers embarked at Southampton, followed by Cherbourg and finally Queenstown. Interestingly, there were 2 passengers whose port of embarkation could not be identified.

3.2 Survival Based on Gender and Socio-Economic Class

The data is prepared as follows:

titanic_agg4 <- titanic %>%
  group_by(Survived) %>%
  count(Sex, Pclass)

titanic_agg4$Survived <- sapply(
  X = as.character(titanic_agg4$Survived ), # Data
  FUN = switch, 
  "0" = "Perished",
  "1" = "Survived"
  )

3.2.1 Gender

ggplot(data = titanic_agg4, mapping = aes(x = Survived, y = n, fill = Sex)) +
  geom_col() +
  labs(
    x = "",
    y = "",
    title = "Survival of Titanic Passengers based on Gender"
  ) +
  theme(
    plot.background = element_rect(fill = "#263238"), 
    panel.background = element_rect(fill = "#263238", color = "white"),
    panel.grid = element_blank(),
    axis.text.y = element_text(color="white", 
                           size=10),
    axis.text.x = element_text(color="white", 
                           size=10),
    plot.subtitle = element_text(color = "white",  hjust = 0.5),
    plot.title = element_text(color = "white", hjust = 0.5),
    legend.background = element_rect(fill = "#263238"),
    legend.text = element_text(color = "white"),
    legend.title = element_text(color = "white")
  )

From this training set (we don’t use the complete dataset), it can be inferred that most of the Titanic passengers perished and that most of those who perished were male. The passengers who survived the tragedy were mostly women.

3.2.2 Socio-Economic Class

ggplot(data = titanic_agg4, mapping = aes(x = Survived, y = n, fill = Pclass)) +
  geom_col() +
  labs(
    x = "",
    y = "",
    title = "Survival of Titanic Passengers based on Class"
  ) +
  theme(
    plot.background = element_rect(fill = "#263238"), 
    panel.background = element_rect(fill = "#263238", color = "white"),
    panel.grid = element_blank(),
    axis.text.y = element_text(color="white", 
                           size=10),
    axis.text.x = element_text(color="white", 
                           size=10),
    plot.subtitle = element_text(color = "white",  hjust = 0.5),
    plot.title = element_text(color = "white", hjust = 0.5),
    legend.background = element_rect(fill = "#263238"),
    legend.text = element_text(color = "white"),
    legend.title = element_text(color = "white")
  )

From this result, most lower class passengers did not survive. Middle class passengers perished and survived in roughly equal proportions. Perhaps unsurprisingly, first class passengers had the highest survival rate, which suggests that their safety was prioritized.
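The stacked bars show counts, so comparing proportions between classes takes some reading by eye. A survival rate per class can be computed directly, as in the sketch below. It uses a small made-up tibble (toy) rather than the real data, and the name survival_rate is hypothetical.

```r
library(dplyr)

# Toy data: 3 first class and 4 lower class passengers (illustrative only)
toy <- tibble(
  Pclass   = c(rep("First Class", 3), rep("Lower Class", 4)),
  Survived = c("Survived", "Survived", "Perished",
               "Perished", "Perished", "Perished", "Survived")
)

# Count each class/outcome pair, then divide by the class total
survival_rate <- toy %>%
  count(Pclass, Survived) %>%
  group_by(Pclass) %>%
  mutate(rate = n / sum(n)) %>%
  ungroup()

survival_rate
```

In a plot, a similar effect can be obtained with geom_col(position = "fill"), which rescales each bar to a total of 1.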

3.2.3 Gender + Class

titanic_agg5 <- titanic_agg4 %>%
  unite(col = "Sex_Class", Sex, Pclass, sep = " & ")

g <-
  ggplot(data = titanic_agg5, mapping = aes(x = Survived, y = n, fill = Sex_Class)) +
  geom_col() +
  labs(
    x = "",
    y = "",
    title = "Survival of Titanic Passengers based on Gender and Class"
  ) +
  theme(
    plot.background = element_rect(fill = "#263238"), 
    panel.background = element_rect(fill = "#263238", color = "white"),
    panel.grid = element_blank(),
    axis.text.y = element_text(color="white", 
                           size=10),
    axis.text.x = element_text(color="white", 
                           size=10),
    plot.subtitle = element_text(color = "white",  hjust = 0.5),
    plot.title = element_text(color = "white", hjust = 0.5),
    legend.background = element_rect(fill = "#263238"),
    legend.text = element_text(color = "white"),
    legend.title = element_text(color = "white")
  )
ggplotly(g)

This result is quite interesting: female middle and first class passengers survived at a much higher rate than lower class female passengers. For male first class passengers, however, casualties were still considerable despite their first class status.
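The three plots in section 3.2 repeat the same theme() call. One way to avoid the repetition, sketched here under the assumption that the dark styling should be identical in all three plots, is to store the theme in an object and add it to each plot. The names dark_theme, toy and p are hypothetical.

```r
library(ggplot2)

# Dark styling defined once, so it can be reused across plots
dark_theme <- theme(
  plot.background = element_rect(fill = "#263238"),
  panel.background = element_rect(fill = "#263238", color = "white"),
  panel.grid = element_blank(),
  axis.text = element_text(color = "white", size = 10),
  plot.title = element_text(color = "white", hjust = 0.5),
  legend.background = element_rect(fill = "#263238"),
  legend.text = element_text(color = "white"),
  legend.title = element_text(color = "white")
)

# Toy counts, used only to show how the theme object is added to a plot
toy <- data.frame(Survived = c("Perished", "Survived"), n = c(3, 2))

p <- ggplot(toy, aes(x = Survived, y = n)) +
  geom_col() +
  labs(x = "", y = "", title = "Toy example") +
  dark_theme
```

A change to the styling then only has to be made in one place.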

4 Conclusion

Most of the passengers were male and belonged to the lower class
More female than male passengers survived the tragedy
First class passengers had the fewest casualties
Female first and middle class passengers had the highest survival rate of all groups

Learning From World’s Most Iconic Disaster: The Titanic

Zainul Arifin

31 March, 2021