The Titanic Disaster


RMS Titanic, better known simply as Titanic, was a British ocean liner that sank on the 15th of April 1912 after striking an iceberg during its first voyage. Out of the estimated 2224 passengers and crew on board, 1496 perished. It was one of the deadliest sinkings of its time, and it has since become infamous, inspiring movies and other multimedia works.

One of the biggest causes of the high death toll was the lack of lifeboats. There weren’t nearly enough for everyone on board, which led the crew to prioritize some groups of people over others.

Although tragic, there is always something to be learned from accidents such as this one, which is why we’ll be taking a look at some passenger data from the Titanic to see if there are any patterns among its survivors.

Data Preparation

You can find the data set that we will be using for this project here. It consists of three files: train, test, and gender_submission. The data set was originally created for machine learning purposes, which is why it is split this way. For this project, we’ll only be using the train file, as it has information regarding the survival status of the passengers.

Let’s start by reading the main data set, train.csv, using read.csv.

passengers <- read.csv("data/train.csv")

Then, let’s take a quick look at the data frame using head.

head(passengers)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q


As you can see, the dataframe consists of 12 columns:

  • PassengerId: The ID of the passenger, can be treated as an index in this case
  • Survived: A 0 or 1 value indicating whether or not the passenger survived the disaster
  • Pclass: The ticket class of the passenger, which would be first (1), second (2), or third (3)
  • Name: Name of the passenger
  • Sex: Gender of the passenger
  • Age: Age of the passenger in years
  • SibSp: The number of siblings and spouses the passenger has on board
  • Parch: The number of parents and children the passenger has on board
  • Ticket: The passenger’s ticket ID
  • Fare: The fare price that the passenger had to pay
  • Cabin: Passenger’s cabin number
  • Embarked: The passenger’s port of embarkation, in this case Cherbourg (C), Queenstown (Q), or Southampton (S)

Now that we’ve seen what our data looks like, we can start cleaning it up.

Changing Data Types

To make the data preparation process easier, we will be using the dplyr package. It provides functions such as glimpse, which makes it easier to see all the different data types and values at once.

library(dplyr)
glimpse(passengers)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

It seems that some columns, such as Survived, Pclass, Sex, and Embarked, only take a handful of repeating values. Let’s change them into a categorical format (factors) using mutate and store the new dataframe inside passengers_clean.

passengers_clean <- passengers %>%
  mutate(Survived=as.factor(Survived),
         Pclass=as.factor(Pclass),
         Sex=as.factor(Sex),
         Embarked=as.factor(Embarked))

glimpse(passengers_clean)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <fct> male, female, female, female, male, male, male, male, fema…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…
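
If more readable categories are wanted, factor also accepts explicit levels and labels. The sketch below is only an illustration on a made-up data frame (toy and the column names ending in _raw, _lbl, and _ord are hypothetical); the rest of this project keeps the plain 0/1 and 1/2/3 levels.

```r
# Toy data frame standing in for a few rows of the real data (values made up)
toy <- data.frame(Survived = c(0, 1, 1, 0), Pclass = c(3, 1, 3, 2))

# as.factor() keeps the original codes as level names
toy$Survived_raw <- as.factor(toy$Survived)
levels(toy$Survived_raw)

# factor() lets us set readable labels and an explicit level order
toy$Survived_lbl <- factor(toy$Survived, levels = c(0, 1), labels = c("No", "Yes"))
toy$Pclass_ord <- factor(toy$Pclass, levels = c(1, 2, 3), ordered = TRUE)

# Count of rows per label
table(toy$Survived_lbl)
```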

Removing NA Values

Next up is taking care of NA values. Let’s check whether we have NA values and where they are in our dataframe using which and is.na.

which(is.na(passengers_clean))
##   [1] 4461 4473 4475 4482 4484 4485 4487 4488 4492 4498 4501 4502 4503 4504 4511
##  [16] 4520 4521 4532 4533 4538 4543 4551 4557 4563 4565 4577 4582 4584 4596 4610
##  [31] 4614 4615 4622 4624 4632 4636 4637 4641 4642 4652 4654 4657 4670 4679 4685
##  [46] 4691 4696 4697 4706 4712 4716 4720 4726 4730 4733 4740 4751 4754 4756 4757
##  [61] 4759 4760 4762 4780 4786 4790 4791 4803 4807 4810 4814 4815 4820 4823 4824
##  [76] 4831 4840 4844 4865 4866 4867 4869 4871 4876 4881 4884 4887 4900 4907 4910
##  [91] 4913 4915 4920 4922 4924 4926 4931 4937 4941 4946 4951 4953 4958 4963 4967
## [106] 4973 4978 4980 4983 4987 4989 4994 5003 5008 5013 5016 5019 5020 5024 5029
## [121] 5034 5040 5045 5049 5052 5054 5057 5058 5067 5068 5069 5085 5089 5095 5099
## [136] 5104 5106 5109 5112 5123 5125 5130 5136 5148 5153 5165 5167 5174 5183 5188
## [151] 5194 5195 5196 5216 5222 5224 5229 5232 5234 5239 5246 5248 5249 5271 5281
## [166] 5282 5284 5288 5293 5295 5302 5305 5315 5319 5324 5334 5344

Seems like we have quite a few. These positions are linear indices counted down each column in turn, and with 891 rows they all fall inside the sixth column, Age. Let’s drop every row that contains at least one NA value using na.omit. Then, we’ll use anyNA to check if there are still any NA values in our dataframe.

passengers_clean <- 
  passengers_clean %>%
  na.omit()
anyNA(passengers_clean)
## [1] FALSE

Seems like we successfully got rid of all the NA values.
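
As a side note, here is a small sketch of the NA handling above in base R. The toy data frame and its values are made up for illustration, and filtering on a single column is just one possible alternative, not what this project does.

```r
# Made-up data frame standing in for the real one
toy <- data.frame(
  Age  = c(22, NA, 26, 35),
  Fare = c(7.25, 71.28, NA, 8.05)
)

# NA count per column is easier to read than linear positions
na_per_column <- colSums(is.na(toy))
na_per_column

# Linear positions run down column 1, then column 2, and so on. With 891 rows,
# positions 4461 and 5344 (the first and last ones reported by which) both
# fall in column 6, which is Age
na_columns <- ceiling(c(4461, 5344) / 891)
na_columns

# na.omit() drops any row with at least one NA (rows 2 and 3 here)
complete_rows <- na.omit(toy)
nrow(complete_rows)

# Filtering on one column keeps rows whose NA is elsewhere (only row 2 goes)
age_known <- toy[!is.na(toy$Age), ]
nrow(age_known)
```

The trade-off is that na.omit is simpler, but it can throw away rows that are only missing a value in a column we never end up using.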

Dropping Columns

Some columns such as PassengerId, Cabin, and Ticket would be difficult to analyze since they consist almost entirely of unique values. In this case, we can simply drop them. In order to do so, we’ll use select and put a - in front of the vector of column names we’d like to drop. This does the opposite of what select usually does: it keeps everything but the chosen columns.

passengers_clean <- 
  passengers_clean %>%
  select(-c(PassengerId, Cabin, Ticket))

head(passengers_clean)
##   Survived Pclass                                                Name    Sex
## 1        0      3                             Braund, Mr. Owen Harris   male
## 2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female
## 3        1      3                              Heikkinen, Miss. Laina female
## 4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female
## 5        0      3                            Allen, Mr. William Henry   male
## 7        0      1                             McCarthy, Mr. Timothy J   male
##   Age SibSp Parch    Fare Embarked
## 1  22     1     0  7.2500        S
## 2  38     1     0 71.2833        C
## 3  26     0     0  7.9250        S
## 4  35     1     0 53.1000        S
## 5  35     0     0  8.0500        S
## 7  54     0     0 51.8625        S

Now we have a much cleaner dataframe.
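
One possible way to spot identifier-like columns before dropping them is to count the distinct values in each column. This is a sketch on a made-up data frame (toy is hypothetical), using base R only.

```r
# Made-up data frame with one real category and two identifier-like columns
toy <- data.frame(
  PassengerId = 1:5,
  Sex = c("male", "female", "female", "male", "male"),
  Ticket = c("A1", "B2", "C3", "D4", "E5")
)

# Number of distinct values per column; columns where this is close to
# nrow(toy) behave like identifiers rather than categories
distinct_counts <- sapply(toy, function(col) length(unique(col)))
distinct_counts
```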

Analysis

Gender

We can now start taking a look at our data. Let’s start by doing some aggregation regarding the gender of the survivors. For this we can use group_by first and then use summarise with n() to count the number of rows in each group.

gender_survival <- 
  passengers_clean %>%
  group_by(Sex, Survived) %>%
  summarise(Count=n())
gender_survival
## # A tibble: 4 × 3
## # Groups:   Sex [2]
##   Sex    Survived Count
##   <fct>  <fct>    <int>
## 1 female 0           64
## 2 female 1          197
## 3 male   0          360
## 4 male   1           93

While we can just read from the table above, it’s rather dull and takes a bit to process, doesn’t it? Let’s try making a visualization using ggplot and geom_col to draw a bar plot. To see how many passengers survived, let’s set the fill so that the bars are colored by the Survived column.

library(ggplot2)
gender_survival %>%
  ggplot(mapping= aes(x=Count, y=reorder(Sex, Count), fill=Survived)) +
  geom_col()


Much better. From the chart above we can see that there were far more male than female passengers on board, but more female than male survivors. If you have seen the Titanic movie, perhaps you may recall the evacuation scene where the crew prioritized loading women and children onto the lifeboats. This would explain why most of the survivors were women.
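
To put numbers on this, the four counts from the gender_survival table can be turned into rates by hand. The sketch below copies those counts into a small data frame (counts, Total, and Rate are names picked for this example) and uses base R only. Roughly three quarters of the women in the cleaned data survived, against about one fifth of the men.

```r
# Counts copied from the gender_survival table
counts <- data.frame(
  Sex = c("female", "female", "male", "male"),
  Survived = c(0, 1, 0, 1),
  Count = c(64, 197, 360, 93)
)

# Total passengers per sex, repeated on each row of that sex
counts$Total <- ave(counts$Count, counts$Sex, FUN = sum)

# Share of each sex that did or did not survive
counts$Rate <- round(counts$Count / counts$Total, 3)

# Survival rate per sex
survival_rates <- counts[counts$Survived == 1, c("Sex", "Rate")]
survival_rates
```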

Age

I wonder if the age of the passenger has any correlation with their survival chances? Let’s map it out using geom_boxplot.

passengers_clean %>%
  ggplot(mapping=aes(x=Survived,y=Age, color = Survived)) +
  geom_boxplot()


While there doesn’t seem to be much difference, the survivors skew slightly younger overall, with one notable outlier at age 80. A density chart is another way to view the distribution of values. Let’s try it out using geom_density and color the lines by survival status.

passengers_clean %>%
  ggplot(mapping=aes(x=Age)) +
  geom_density(aes(color=Survived))


From this we can better see that very young children were also prioritized for evacuation, as their survival rates are relatively high.
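
A density chart compares the shape of the two age distributions. To read off a survival rate per age band directly, one option is to bin the ages with cut. This is only a sketch: the toy data is made up, and the band boundaries (12, 18, 60) are arbitrary choices for the example.

```r
# Made-up ages and outcomes; in the real data Survived is a factor and would
# need as.integer(as.character(Survived)) before taking a mean
toy <- data.frame(
  Age = c(2, 8, 15, 24, 31, 45, 58, 80),
  Survived = c(1, 1, 0, 0, 1, 0, 0, 1)
)

# Bin ages into bands; right = FALSE makes each band [lower, upper)
toy$AgeGroup <- cut(toy$Age, breaks = c(0, 12, 18, 60, Inf), right = FALSE,
                    labels = c("child", "teen", "adult", "senior"))

# The mean of a 0/1 column is the survival rate within each band
rate_by_group <- tapply(toy$Survived, toy$AgeGroup, mean)
rate_by_group
```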

Class

I feel like it would also be interesting to see how class factors into the equation. Let’s use group_by and summarise again for this. To make things faster let’s also just pipe this into another ggplot and geom_col so that we can immediately see the visuals.

passengers_clean %>%
  group_by(Pclass, Survived) %>%
  summarise(Count=n()) %>%
  ggplot(mapping=aes(x=Count,y=Pclass, fill=Survived)) +
  geom_col()


As you can probably expect, First Class passengers make up the largest group of survivors. In absolute numbers, there doesn’t seem to be much difference between the Second and Third Class survivors, but there are far more Third Class passengers than Second Class.

Since the number of passengers differs between classes, we can also look at the survival chances proportionally by setting the position in geom_col to fill.

passengers_clean %>%
  group_by(Pclass, Survived) %>%
  summarise(Count=n()) %>%
  ggplot(mapping=aes(x=Count,y=Pclass, fill=Survived)) +
  geom_col(position = "fill")


Proportion-wise, the highest survival chance is still held by the First Class, followed by the Second, and finally the Third. Says a lot about the privilege of the wealthy, doesn’t it? They were most likely prioritized in the evacuation, or perhaps their cabins were closer to the lifeboats than those of the Third Class passengers.
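
For reference, position = "fill" simply rescales every bar so that its segments add up to 1. The same arithmetic can be done by hand with xtabs and prop.table. The counts below are hypothetical, not the real Titanic numbers.

```r
# Hypothetical counts, NOT the real Titanic numbers
counts <- data.frame(
  Pclass = c("1", "1", "2", "2"),
  Survived = c("0", "1", "0", "1"),
  Count = c(40, 60, 30, 20)
)

# Cross-tabulate the counts: one row per class, one column per outcome
tab <- xtabs(Count ~ Pclass + Survived, data = counts)

# margin = 1 divides each row by its row total, like position = "fill"
shares <- prop.table(tab, margin = 1)
shares
```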

Fare

I imagine that fare has about the same effect as class does in terms of survival, but let’s make a density chart using geom_density to make sure.

passengers_clean %>%
  ggplot(mapping=aes(x=Fare)) +
  geom_density(aes(color=Survived))


Interestingly, there seem to be people who straight-up did not have to pay a fare at all. We can confirm this using range as well.

range(passengers_clean$Fare)
## [1]   0.0000 512.3292

But as expected, mortality is high among passengers whose fare was close to zero, and a higher fare generally goes along with better chances of survival.

We’d expect fare prices to relate to the passenger’s class, so let’s see the fare ranges of each class using some boxplots. Mapping Pclass to the x-axis is enough to get one box per class, so there is no need to call group_by first. Let’s also separate the color by the passenger’s survival status.

passengers_clean %>%
  ggplot(mapping=aes(x=Pclass,y=Fare, color = Survived)) +
  geom_boxplot()


Seems like the difference in fare range between the First Class and the other classes is quite big. Interestingly, the fare seems to affect the passenger’s survival rate even within the same class, as the survivors tend to have higher median fares. Another interesting thing to note is the First Class fare outlier who had to pay more than 500. I’m not surprised that they survived the ordeal.

Family Members

I wonder if the number of family members on board affected the chances at all? Let’s use geom_histogram this time, as these columns only take a few distinct values. We’ll set binwidth to 1 so that each bar covers exactly one value. Once more, set the fill to Survived, and let’s use the dodge position so we can see the real value count.

passengers_clean %>%
  ggplot(mapping=aes(x=Parch, fill=Survived)) +
  geom_histogram(binwidth=1, position="dodge")


It would seem that passengers with 1-3 parents or children on board (family sizes of around 2-4 people) had a higher chance of survival. The rate drops dramatically beyond that point. Perhaps having family members on board meant that there was at least someone who could ensure that you were not left behind?

Let’s try making the same chart type with the SibSp column, again with a binwidth of 1, but using the identity position.

passengers_clean %>%
  ggplot(mapping=aes(x=SibSp, fill=Survived)) +
  geom_histogram(binwidth=1, position="identity")


When it comes to siblings and spouses, it would seem that having 1-2 of them on board gives you the highest chances of survival.
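
Since SibSp and Parch each describe only part of a family, they could also be combined into a single family size column. This is a sketch under the assumption that family size is simply SibSp + Parch + 1. FamilySize and Alone are hypothetical columns that do not exist in the data set, and the toy values are made up.

```r
# Made-up SibSp and Parch values for four passengers
toy <- data.frame(SibSp = c(1, 0, 3, 0), Parch = c(0, 0, 2, 1))

# Family size = siblings/spouses + parents/children + the passenger themselves
toy$FamilySize <- toy$SibSp + toy$Parch + 1

# Flag passengers travelling without any family on board
toy$Alone <- toy$FamilySize == 1
toy
```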

Port of Embarkation

How does the port of embarkation affect the passenger’s chances of survival? The workflow for this question is similar to when we had to work with the gender category. Simply group_by the relevant categories and summarise the count of data.

passengers_clean %>%
  group_by(Embarked, Survived) %>%
  summarise(Count=n()) %>%
  ggplot(mapping=aes(x=Count, y=Embarked, fill=Survived)) +
  geom_col(position="fill")


Oops, seems like getting rid of NA values is not the same as getting rid of blank/empty string values. This is because an empty string is not equal to NA in R. No problem, let’s just filter those values out by adding a ! in front of the condition that matches the values we want to get rid of. Similar to - from before, this keeps every row except the ones that fulfill that condition. An empty string is simply written as two quotation marks with nothing between them, like this: ""

passengers_clean %>%
  filter(!Embarked %in% "") %>%
  group_by(Embarked, Survived) %>%
  summarise(Count=n()) %>%
  ggplot(mapping=aes(x=Count, y=Embarked, fill=Survived)) +
  geom_col(position="fill")
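
An alternative worth knowing about is to turn empty strings into NA at read time, using the na.strings argument of read.csv. The sketch below reads a tiny made-up CSV from a string instead of train.csv, so the names and values in csv_text are hypothetical.

```r
# A tiny made-up CSV; the second row has an empty Embarked field
csv_text <- "Name,Embarked\nA,S\nB,\nC,C\n"

# By default an empty character field is read as "", not NA
default_read <- read.csv(text = csv_text)
is.na(default_read$Embarked)

# na.strings tells read.csv which raw strings should become NA
na_read <- read.csv(text = csv_text, na.strings = c("", "NA"))
is.na(na_read$Embarked)
```

With the data read this way, a single na.omit call would also remove the rows with a blank port. For this project that would drop every passenger with a blank Cabin as well, unless Cabin is dropped first.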


From this chart we can see that passengers from Cherbourg have the highest survival rate, followed by Southampton, and finally Queenstown. Perhaps those with knowledge of the area can immediately deduce why passengers departing from Cherbourg have a higher chance of survival. However, since I live nowhere near those ports and don’t have sufficient geographical knowledge of the area, let’s try to deduce the reason by looking at the characteristics of the passengers from each port, for example their gender and class.

Let’s start with gender.

passengers_clean %>%
  filter(!Embarked %in% "") %>%
  group_by(Embarked, Sex) %>%
  summarise(Count=n()) %>%
  ggplot(mapping=aes(x=Count, y=Embarked, fill=Sex)) +
  geom_col(position="fill")


As we can see from the chart above, Cherbourg has the highest proportion of female passengers of the three ports. This could explain why Cherbourg passengers have a higher survival chance, but Queenstown has the second largest proportion of female passengers even though it has the highest mortality rate. Perhaps we’re missing something?

Now, let’s check by the passenger’s class.

passengers_clean %>%
  filter(!Embarked %in% "") %>%
  group_by(Embarked, Pclass) %>%
  summarise(Count=n()) %>%
  ggplot(mapping=aes(x=Count, y=Embarked, fill=Pclass)) +
  geom_col(position="fill")


Seems like class has a bigger effect on survival than gender in this regard, seeing how Cherbourg has the highest proportion of First Class passengers, followed by Southampton, with Queenstown last. This is the same order as when we looked at the survival rate of passengers by port only. This suggests that Cherbourg has the highest survival rate most likely because of the number of First Class passengers who embarked from that port. In turn, we could also guess that perhaps the area around Cherbourg had more wealthy citizens than the other two ports.
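
One way to probe this further would be to compare survival rates across ports within each class: if the ports look similar once class is held fixed, class is the more likely explanation. Below is a minimal sketch with aggregate on made-up rows (toy is hypothetical, so the resulting rates mean nothing on their own).

```r
# Made-up passengers; Survived is numeric 0/1 here so that mean() is a rate
toy <- data.frame(
  Embarked = c("C", "C", "C", "S", "S", "S", "Q", "Q"),
  Pclass   = c(1, 1, 3, 1, 3, 3, 3, 3),
  Survived = c(1, 1, 0, 1, 0, 1, 0, 0)
)

# Survival rate for every port and class combination present in the data
rates <- aggregate(Survived ~ Embarked + Pclass, data = toy, FUN = mean)
rates
```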

Conclusion

From this analysis, we can conclude that a variety of factors affected a passenger’s survival chances on board the Titanic, and that some of those factors mattered more than others. The biggest factor would seem to be the passenger’s ticket class, which tends to relate directly to their socioeconomic status. From this alone, we can see how much socioeconomic status mattered in 1912, especially in ocean travel.

Thankfully, the Titanic disaster led to many changes (for the better) in maritime safety regulations. One of them is the mandate that every ship needs to carry enough lifeboats for everyone on board. In the event that a similar accident happens, hopefully no one will have to be left behind again.