1 Background Information

The R.M.S. Titanic, an ocean passenger ship commissioned in England in 1910, was one of the most luxurious, state-of-the-art ships of its time. At nearly 269 meters long, the ship was thought to be unsinkable.

However, on April 15, 1912, the unthinkable happened. During the ship’s maiden voyage from Southampton, England, to New York City, New York, the Titanic struck an iceberg in the North Atlantic Ocean off the coast of Newfoundland, Canada and sank, tragically taking the lives of more than 1,500 passengers and crew. It was, and remains, one of the greatest nautical disasters in history.

Image by Olisa Obiora from Unsplash

More than 110 years have passed since the R.M.S. Titanic sank in the Atlantic Ocean. While the sinking remains a tragedy, the world learned many lessons from it to prevent such a disaster from happening again. For example, after the disaster, both the British and American Boards of Inquiry recommended that ships carry enough lifeboats for everyone aboard, that lifeboat drills be mandatory, and that lifeboats be inspected. Many of these recommendations were incorporated into the International Convention for the Safety of Life at Sea, adopted in 1914.

From a data science perspective, the passenger data can yield valuable insights that help us understand this tragedy better. Later sections cover this in more detail.

2 Environment Setup and Dataset

2.1 Environment and Library Setup

To support data wrangling and data visualization, we are going to use tidyverse and waffle.

Tidyverse is a collection of packages commonly used in data science. Some of these packages are:

  • dplyr -> Data Wrangling
  • ggplot2 -> Visualization
  • readr -> Input & Output

Waffle is an extension of ggplot2 that lets you create waffle plots, which we will use in this analysis report.

Plotly is also used in conjunction with ggplot2 to make the plots interactive.
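The loading chunk itself is not shown in this report. A minimal sketch of what it presumably looks like, assuming all three packages are already installed (for example with install.packages(c("tidyverse", "waffle", "plotly"))):

```r
# Attach the packages named in the text above.
# Loading order matters for masking: a package attached later masks
# same-named functions from packages attached earlier.
library(tidyverse)
library(waffle)
library(plotly)
```

Attaching the packages prints startup messages like the ones that follow.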

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Warning: package 'plotly' was built under R version 4.0.4
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Do not worry if you see warnings or conflicts. As long as a warning does not result in an error, you can continue your analysis. Conflicts, on the other hand, can cause errors or unexpected results when two or more attached packages share a function name, because the package attached last masks the others. Below is how you can resolve a conflict when it happens.

State explicitly which function you want by prefixing it with its package name:

  • dplyr::filter
  • stats::filter
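As an illustration of the package::function syntax, here is a minimal sketch on a made-up three-row data frame (the values are for illustration only and are not taken from the dataset). It shows that the two filter functions do entirely different things, which is why calling the wrong one usually fails:

```r
# Made-up rows for illustration only
passengers <- data.frame(
  Sex = c("male", "female", "female"),
  Age = c(22, 38, 26)
)

# dplyr::filter() keeps the rows that satisfy a condition
women <- dplyr::filter(passengers, Sex == "female")

# stats::filter() applies a linear filter to a numeric series;
# here, a two-point moving average over the current and previous value
moving_avg <- stats::filter(passengers$Age, rep(1 / 2, 2), sides = 1)

women
moving_avg
```

With the prefix in place, the result no longer depends on the order in which the packages were attached.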

Bottom line: it is fine to ignore a warning as long as it does not result in an error, and you can use the package::function syntax above to avoid conflicts.

What to do when warnings appear

Source: https://twitter.com/data_question/status/1328346236747870208/photo/1

2.2 The Titanic Dataset

The full dataset can be downloaded from the link below:

Link: https://www.kaggle.com/c/titanic/data

Since we are not building a machine learning model, but rather performing several relatively simple statistical analyses and data visualizations, we are only going to use the train.csv data, which, unlike test.csv, includes the Survived column.

2.2.1 Data Explanation
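
The import chunk is not shown in this report. The sketch below illustrates the idea with readr::read_csv(). So that it runs anywhere, it reads a small temporary file whose header matches train.csv but whose two rows are made up; the object name titanic is an assumption. For the real analysis, pass the path to the downloaded train.csv instead.

```r
library(readr)

# Toy stand-in for train.csv: the header matches the Kaggle file,
# but the two rows are made up for illustration.
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked",
  "1,0,3,\"Doe, Mr. John\",male,30,0,0,T1,8.05,A1,S",
  "2,1,1,\"Roe, Mrs. Jane\",female,28,1,0,T2,80,B2,C"
), toy_csv)

# Replace toy_csv with the path to train.csv for the real data
titanic <- read_csv(toy_csv)

# Show the column types that readr guessed while parsing
spec(titanic)
```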

## 
## -- Column specification --------------------------------------------------------
## cols(
##   PassengerId = col_double(),
##   Survived = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )

A nice feature of readr::read_csv() compared with base R's utils::read.csv() is that it reports the data type of each column as the file is parsed into the R environment.

Looking at the result above, several columns need to be converted to factor (categorical) type: Survived, Pclass, Sex, and Embarked. The column classes after the conversion are shown below:

## PassengerId    Survived      Pclass        Name         Sex         Age 
##   "numeric"    "factor"    "factor" "character"    "factor"   "numeric" 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##   "numeric"   "numeric" "character"   "numeric" "character"    "factor"
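
The conversion chunk is not shown in this report. One way to obtain classes like those above is dplyr's mutate() with across(), sketched here on a made-up two-row tibble (the object name titanic and the rows are assumptions for illustration):

```r
library(dplyr)

# Made-up rows with a subset of the train.csv columns
titanic <- tibble(
  Survived = c(0, 1),
  Pclass   = c(3, 1),
  Sex      = c("male", "female"),
  Age      = c(30, 28),
  Embarked = c("S", "C")
)

# Convert the categorical columns to factors; other columns are untouched
titanic <- titanic %>%
  mutate(across(c(Survived, Pclass, Sex, Embarked), as.factor))

# Inspect the resulting class of every column
sapply(titanic, class)
```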

From Kaggle, we can check the description for each column. Below is a brief overview/explanation.

3 Discussion

3.1 Passenger Identity

Before we get into the analysis, let’s first get to know the Titanic passengers. The passenger distributions that we would like to know are:

  • Sex
  • Ticket class, as a proxy for socio-economic status
  • Port of embarkation

Before we calculate the distributions, we first use switch to replace several coded values with more descriptive labels.
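
The recoding chunk is not shown in this report, so the following is only a sketch of how switch could be used for this. The helper names and the class labels are assumptions; the port names follow the Kaggle data description (C = Cherbourg, Q = Queenstown, S = Southampton).

```r
# switch() handles one value at a time, so vapply() applies the helper
# to every element of a vector.

# Hypothetical helper: ticket class code -> descriptive label
label_class <- function(code) {
  switch(as.character(code),
         "1" = "First class",
         "2" = "Middle class",
         "3" = "Lower class",
         NA_character_)  # fallback for codes not listed above
}

# Hypothetical helper: embarkation code -> port name
label_port <- function(code) {
  switch(code,
         "C" = "Cherbourg",
         "Q" = "Queenstown",
         "S" = "Southampton",
         NA_character_)  # fallback for codes not listed above
}

# Made-up input vectors for illustration
class_labels <- vapply(c(3, 1, 2), label_class, character(1))
port_labels  <- vapply(c("S", "C", "Q"), label_port, character(1),
                       USE.NAMES = FALSE)

class_labels
port_labels
```

In a data frame, the same helpers could be applied inside dplyr::mutate() to overwrite the coded columns.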

Now that we’re done, let’s look at the passenger distribution…

3.1.2 Socio-economic status with waffle chart

## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
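
The plotting chunk is not shown in this report. A minimal sketch of a waffle chart for ticket class, assuming the waffle package is installed, could look like this. The counts are the per-class passenger totals in train.csv; with the data loaded as titanic (a hypothetical name), table(titanic$Pclass) returns them. The labels and title are assumptions.

```r
library(waffle)

# Passenger counts per ticket class in train.csv (891 passengers in total)
class_counts <- c(
  "First class"  = 216,
  "Middle class" = 184,
  "Lower class"  = 491
)

# One square per passenger: 27 rows x 33 columns holds exactly 891 squares
class_waffle <- waffle(
  class_counts,
  rows  = 27,
  size  = 0.3,
  title = "Passengers by ticket class"
)

class_waffle
```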

This plot shows that most of the passengers were in the lower class. There were actually more first-class than middle-class passengers, but the two groups are comparable in size.

3.2 Survival Based on Gender and Socio-Economic Class

The data is prepared as follows:

3.2.3 Gender + Class

This result is quite interesting: female middle-class and first-class passengers survived at a much higher rate than female lower-class passengers. For male passengers, on the other hand, casualties were still considerable even in first class.
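
A comparison like this can be computed with a grouped summary. The sketch below uses dplyr on a made-up six-row tibble, so the numbers it prints are illustrative only and are not results from the Titanic data; the object and column names of the summary are assumptions.

```r
library(dplyr)

# Made-up rows for illustration only (Survived: 1 = yes, 0 = no)
toy_passengers <- tibble(
  Sex      = c("female", "female", "female", "male", "male", "male"),
  Pclass   = c(1, 3, 3, 1, 1, 3),
  Survived = c(1, 1, 0, 1, 0, 0)
)

# Count passengers and compute the share of survivors per sex and class
survival_by_group <- toy_passengers %>%
  group_by(Sex, Pclass) %>%
  summarise(
    passengers    = n(),
    survival_rate = mean(Survived),
    .groups       = "drop"
  )

survival_by_group
```

Running the same pipeline on the real data, with Survived kept numeric, gives one survival rate per combination of sex and class.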

4 Conclusion

  • Most of the passengers were male and traveled in the lower class
  • More female than male passengers survived the tragedy
  • First-class passengers suffered the fewest casualties
  • Female first-class and middle-class passengers had the highest survival rates of any group