RMS Titanic, better known simply as Titanic,
was a British ocean liner that sank after striking an iceberg on
15 April 1912, during its maiden voyage. Of the estimated 2224
passengers and crew on board, 1496 perished. It was one of
the deadliest sinking accidents of its time, and it has since become
infamous, inspiring movies and other multimedia works.
One of the biggest causes of the high death toll was the lack of lifeboats. There weren’t nearly enough for everyone on board, so the crew prioritized some groups of people over others.
Although tragic, there is always something to be learned from accidents such as this one, which is why we’ll be taking a look at some passenger data from the Titanic to see if there are any patterns among its survivors.
You can find the data set that we will be using for this project here. It consists of three different files,
train, test, and
gender_submission. Since this data set was originally
created for machine learning purposes, it was split into these three
files. For the purposes of this project, we’ll only be using the
train document as it has information regarding the survival
status of the passengers.
Let’s start by reading the main data set, train.csv, using
read.csv.
passengers <- read.csv("data/train.csv")
Then, let’s take a quick look at the data frame using
head.
head(passengers)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q |
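As you can see, the dataframe consists of 12 columns: PassengerId,
Survived (0 = did not survive, 1 = survived), Pclass (ticket class), Name,
Sex, Age, SibSp (siblings and spouses aboard), Parch (parents and children
aboard), Ticket, Fare, Cabin, and Embarked (port of embarkation:
C = Cherbourg, Q = Queenstown, S = Southampton).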
Now that we’ve seen what our data looks like, we can start cleaning it up.
In order to make the data preparation process easier, we will be
using the dplyr package. It allows us to use functions such as
glimpse, which will make it easier for us to see all the
different data types and values.
library(dplyr)
glimpse(passengers)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
It would seem like some columns such as Survived, Pclass, Sex, and
Embarked have commonly repeating values. Let’s change them into a
categorical format using mutate and store the new dataframe
inside passengers_clean.
passengers_clean <- passengers %>%
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         Embarked = as.factor(Embarked))
glimpse(passengers_clean)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <fct> male, female, female, female, male, male, male, male, fema…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S, C…
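The levels of Survived are still the raw codes 0 and 1. As an optional sketch, factor can attach readable labels at conversion time. The vector `survived_raw` and the labels "No" and "Yes" are hypothetical choices of mine, and the rest of this post keeps the 0/1 coding.

```r
# Hypothetical stand-in for the Survived column
survived_raw <- c(0, 1, 1, 0)

# levels = the codes found in the data, labels = how they should be displayed
survived_fct <- factor(survived_raw,
                       levels = c(0, 1),
                       labels = c("No", "Yes"))

levels(survived_fct)
table(survived_fct)
```

Labelled levels show up in plot legends automatically, which saves relabelling them later.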
Next up would be taking care of NA values. Let’s check whether or not
we have NA values and where they are in our dataframe using
which and is.na.
which(is.na(passengers_clean))
## [1] 4461 4473 4475 4482 4484 4485 4487 4488 4492 4498 4501 4502 4503 4504 4511
## [16] 4520 4521 4532 4533 4538 4543 4551 4557 4563 4565 4577 4582 4584 4596 4610
## [31] 4614 4615 4622 4624 4632 4636 4637 4641 4642 4652 4654 4657 4670 4679 4685
## [46] 4691 4696 4697 4706 4712 4716 4720 4726 4730 4733 4740 4751 4754 4756 4757
## [61] 4759 4760 4762 4780 4786 4790 4791 4803 4807 4810 4814 4815 4820 4823 4824
## [76] 4831 4840 4844 4865 4866 4867 4869 4871 4876 4881 4884 4887 4900 4907 4910
## [91] 4913 4915 4920 4922 4924 4926 4931 4937 4941 4946 4951 4953 4958 4963 4967
## [106] 4973 4978 4980 4983 4987 4989 4994 5003 5008 5013 5016 5019 5020 5024 5029
## [121] 5034 5040 5045 5049 5052 5054 5057 5058 5067 5068 5069 5085 5089 5095 5099
## [136] 5104 5106 5109 5112 5123 5125 5130 5136 5148 5153 5165 5167 5174 5183 5188
## [151] 5194 5195 5196 5216 5222 5224 5229 5232 5234 5239 5246 5248 5249 5271 5281
## [166] 5282 5284 5288 5293 5295 5302 5305 5315 5319 5324 5334 5344
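The numbers printed above are linear positions counted column by column, not row numbers, which is why they start in the thousands. Below is a minimal sketch of two more readable alternatives. It uses a small made-up data frame (`toy` and its values are hypothetical), then maps the first index from the output above back to its column.

```r
# Hypothetical toy data frame standing in for passengers_clean
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Age      = c(22, NA, 26, NA),
  Fare     = c(7.25, 71.28, NA, 8.05)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Row and column position of every NA instead of one linear index
na_positions <- which(is.na(toy), arr.ind = TRUE)
na_positions

# The real data frame has 891 rows, so linear index 4461 sits in this column
column_of_first_na <- ceiling(4461 / 891)
column_of_first_na
```

The last line evaluates to 6, and the sixth column in the glimpse output is Age. The same holds for the final index, 5344, so all of the missing values reported above are missing ages.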
Seems like we have quite a few: 177 in total. Let’s drop every row that
contains at least one NA value using na.omit.
Then, we’ll use anyNA to check if there are
still any NA values in our dataframe.
passengers_clean <-
  passengers_clean %>%
  na.omit()
anyNA(passengers_clean)
## [1] FALSE
Seems like we successfully got rid of all the NA values.
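Note that na.omit removes a row as soon as a single value in it is NA, so it can discard a lot of otherwise usable data. Here is a small sketch of the difference between that behaviour and dropping only rows that are entirely empty. The `toy` data frame is hypothetical.

```r
# Hypothetical toy data: row 2 is partly missing, row 4 is entirely missing
toy <- data.frame(
  Age  = c(22, NA, 35, NA),
  Fare = c(7.25, 71.28, 8.05, NA)
)

# na.omit() drops every row that has at least one NA
complete_rows <- na.omit(toy)
nrow(complete_rows)

# Alternative: drop a row only when all of its values are NA
all_missing <- rowSums(is.na(toy)) == ncol(toy)
not_empty   <- toy[!all_missing, ]
nrow(not_empty)
```

In this post the stricter na.omit behaviour is what we get, so every passenger without a recorded age leaves the analysis.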
Some columns such as PassengerId, Cabin, and Ticket would
be difficult to analyze since they have so many unique values. In this
case, we can simply drop them. In order to do so, we’ll use
select and put a - in front of the vector of
the column names we’d like to drop. This does the opposite of what
select usually does: it keeps everything
but the chosen columns.
passengers_clean <-
  passengers_clean %>%
  select(-c(PassengerId, Cabin, Ticket))
head(passengers_clean)
## Survived Pclass Name Sex
## 1 0 3 Braund, Mr. Owen Harris male
## 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female
## 3 1 3 Heikkinen, Miss. Laina female
## 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
## 5 0 3 Allen, Mr. William Henry male
## 7 0 1 McCarthy, Mr. Timothy J male
## Age SibSp Parch Fare Embarked
## 1 22 1 0 7.2500 S
## 2 38 1 0 71.2833 C
## 3 26 0 0 7.9250 S
## 4 35 1 0 53.1000 S
## 5 35 0 0 8.0500 S
## 7 54 0 0 51.8625 S
Now we have a much cleaner dataframe.
We can now start taking a look at our data. Let’s start by doing some
aggregation regarding the gender of the survivors. For this we can use
group_by first and then use summarise with
n() to count the number of times that value appears.
gender_survival <-
  passengers_clean %>%
  group_by(Sex, Survived) %>%
  summarise(Count = n())
gender_survival
## # A tibble: 4 × 3
## # Groups: Sex [2]
## Sex Survived Count
## <fct> <fct> <int>
## 1 female 0 64
## 2 female 1 197
## 3 male 0 360
## 4 male 1 93
While we can just read from the table above, it’s rather dull and
takes a bit to process, doesn’t it? Let’s try making a visualization
using ggplot and geom_col in order to make a
bar plot. In order to see the number of passengers who survived, let’s
set the fill to differentiate the colors by the Survived
column.
library(ggplot2)
gender_survival %>%
  ggplot(mapping = aes(x = Count, y = reorder(Sex, Count), fill = Survived)) +
  geom_col()
Much better. From the chart above we can see that there were far
more male than female passengers on board, but more female
than male survivors. If you have seen the Titanic movie, you may
recall the evacuation scene where the crew prioritized loading women and
children onto the lifeboats. This would explain why most of the
survivors were female.
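To put numbers on that, the counts from the gender_survival table can be turned into survival rates per sex. This is a base R sketch that simply retypes those four counts. The object names `counts`, `count_table`, and `rate_table` are mine.

```r
# Counts copied from the gender_survival table shown earlier
counts <- data.frame(
  Sex      = c("female", "female", "male", "male"),
  Survived = c(0, 1, 0, 1),
  Count    = c(64, 197, 360, 93)
)

# Cross-tabulate, then divide each row (one row per sex) by its row total
count_table <- xtabs(Count ~ Sex + Survived, data = counts)
rate_table  <- prop.table(count_table, margin = 1)

round(rate_table, 3)
```

In the cleaned data this works out to roughly 75% of women surviving (197 of 261) against roughly 21% of men (93 of 453).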
I wonder if the age of the passenger has any correlation to their
survival chances? Let’s map it out using geom_boxplot.
passengers_clean %>%
  ggplot(mapping = aes(x = Survived, y = Age, color = Survived)) +
  geom_boxplot()
While there doesn’t seem to be much difference, those
who survived were slightly younger on average, with one notable
outlier at age 80. A density chart is another way to view
the distribution of values. Let’s try it out using
geom_density and color the lines by
survival status.
passengers_clean %>%
  ggplot(mapping = aes(x = Age)) +
  geom_density(aes(color = Survived))
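From this we can better see that the survivors’ curve has an extra bump
at the youngest ages, which suggests that very young children were also
prioritized for evacuation. Keep in mind that a density chart compares the
shape of each group’s age distribution; it does not show survival rates
directly.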
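If we want an actual survival rate per age group rather than the shape of a density curve, one option is to bin the ages and average the 0/1 outcome within each bin. The sketch below uses made-up data. The `toy` values, the break points, and the group labels are all arbitrary choices of mine and not taken from the Titanic data.

```r
# Hypothetical toy data; ages and outcomes are invented for illustration
toy <- data.frame(
  Age      = c(2, 4, 9, 15, 22, 30, 41, 58, 63, 71),
  Survived = c(1, 1, 0, 1, 0, 1, 0, 0, 1, 0)
)

# Bin ages into groups: (0,12], (12,18], (18,60], (60,Inf]
toy$AgeGroup <- cut(toy$Age,
                    breaks = c(0, 12, 18, 60, Inf),
                    labels = c("child", "teen", "adult", "senior"))

# The mean of a 0/1 column is the share of survivors in each group
rate_by_age <- tapply(toy$Survived, toy$AgeGroup, mean)
rate_by_age
```

To try this on passengers_clean, Survived would first need to be turned back into numbers, for example with as.integer(as.character(Survived)), because it was converted to a factor earlier.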
I feel like it would also be interesting to see how class factors
into the equation. Let’s use group_by and
summarise again for this. To make things faster let’s also
just pipe this into another ggplot and
geom_col so that we can immediately see the visuals.
passengers_clean %>%
  group_by(Pclass, Survived) %>%
  summarise(Count = n()) %>%
  ggplot(mapping = aes(x = Count, y = Pclass, fill = Survived)) +
  geom_col()
As you might expect, First Class passengers make up the largest group of
survivors. In absolute numbers, there doesn’t seem to be much difference
between the Second and Third Class survivors, even though there are far more
Third Class passengers than Second Class.
Since the number of passengers differs between
classes, we can also view survival proportionally by
setting the position in geom_col to
fill.
passengers_clean %>%
  group_by(Pclass, Survived) %>%
  summarise(Count = n()) %>%
  ggplot(mapping = aes(x = Count, y = Pclass, fill = Survived)) +
  geom_col(position = "fill")
Proportion-wise, the highest survival chance is still held by
First Class, followed by Second, and finally Third. Says a lot
about the privilege of the wealthy, doesn’t it? They were most likely
prioritized in the evacuation, or perhaps their cabins
were closer to the lifeboats than those of passengers in Third Class.
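For reference, the fill position rescales every bar so that its segments add up to 1. The same shares can be computed by hand by dividing each count by its class total. The counts in this sketch are hypothetical round numbers chosen to make the arithmetic easy to follow; they are not the real Titanic figures.

```r
# Hypothetical counts per class and survival status
counts <- data.frame(
  Pclass   = c("1", "1", "2", "2", "3", "3"),
  Survived = c("0", "1", "0", "1", "0", "1"),
  Count    = c(40, 60, 50, 50, 150, 50)
)

# ave() returns the class total on every row, so this is count / class total
counts$Share <- counts$Count / ave(counts$Count, counts$Pclass, FUN = sum)
counts
```

Having the shares as a column is handy when you want to print the exact percentages next to the chart.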
I imagine that fare has about the same effect as class does in terms
of survival, but let’s make a density chart using
geom_density in order to make sure.
passengers_clean %>%
  ggplot(mapping = aes(x = Fare)) +
  geom_density(aes(color = Survived))
Interestingly, there seem to be people who straight-up did not
have to pay a fare at all. We can confirm this using
range as well.
range(passengers_clean$Fare)
## [1] 0.0000 512.3292
But as expected, mortality is high among passengers whose fare was close to zero, and a higher fare generally goes along with a better chance of survival.
We’d expect fare prices to relate to the passenger’s class, so let’s see the fare ranges of each class using some boxplots. Mapping Pclass to the x-axis already gives us one set of boxes per class, so no group_by is needed here. Let’s also separate the color by the passenger’s survival status.
passengers_clean %>%
  ggplot(mapping = aes(x = Pclass, y = Fare, color = Survived)) +
  geom_boxplot()
Seems like the difference in fare range between First Class and
the other classes is quite big. Interestingly, the fare seems to
be related to survival even within the same class,
as the survivors tend to have higher median fares. Another
interesting thing to note is the First Class outlier who paid
more than 500. I’m not surprised that they survived the ordeal.
I wonder if the number of family members on board affected the
chances at all? Let’s use geom_histogram this time as we
have far fewer distinct values. Once more, set the fill
to Survived, and let’s use the dodge position so we can see
the real value count.
passengers_clean %>%
  ggplot(mapping = aes(x = Parch, fill = Survived)) +
  geom_histogram(position = "dodge")
It would seem that passengers with 1-3 parents or children on board (a
Parch count of 1-3) have a higher chance of survival. The rate drops
sharply beyond that point. Perhaps having family members on board would mean
that there’s at least someone who could ensure that you were not left
behind?
Let’s try making the same chart type with the SibSp column, but using
the identity position.
passengers_clean %>%
  ggplot(mapping = aes(x = SibSp, fill = Survived)) +
  geom_histogram(position = "identity")
When it comes to SibSp, it would seem that having 1-2 siblings or spouses
on board gives you the highest chances of survival.
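Since Parch and SibSp each describe only part of a family, one possible follow-up is to combine them into a single family size and cross-tabulate it against survival. This is only a sketch on invented data. The `toy` data frame and the `FamilySize` column are hypothetical and not part of the original data set.

```r
# Hypothetical toy data with the two family-related columns
toy <- data.frame(
  SibSp    = c(1, 0, 0, 3, 1, 0),
  Parch    = c(0, 0, 2, 1, 1, 0),
  Survived = c(0, 1, 1, 0, 1, 0)
)

# Family size = siblings/spouses + parents/children + the passenger themselves
toy$FamilySize <- toy$SibSp + toy$Parch + 1

# Exact counts per family size and survival status
family_table <- table(FamilySize = toy$FamilySize, Survived = toy$Survived)
family_table
```

A table like this also gives the exact counts, which can be hard to read off a histogram drawn with the identity position because the bars overlap.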
How does the port of embarkation affect the passenger’s chances of
survival? The workflow for this question is similar to when we had to
work with the gender category. Simply group_by the relevant
categories and summarise the count of data.
passengers_clean %>%
  group_by(Embarked, Survived) %>%
  summarise(Count = n()) %>%
  ggplot(mapping = aes(x = Count, y = Embarked, fill = Survived)) +
  geom_col(position = "fill")
Oops, seems like getting rid of NA values is not the same as
getting rid of blank/empty string values. This is because an empty
string is not equal to NA in R. No problem, let’s just
filter those values out by adding a ! in front
of the condition of the values we want to get rid of. Similar to
- from before, this keeps all values except the ones that
fulfill that condition. An empty string is simply written as two
straight quotation marks with nothing in between: ""
passengers_clean %>%
  filter(!Embarked %in% "") %>%
  group_by(Embarked, Survived) %>%
  summarise(Count = n()) %>%
  ggplot(mapping = aes(x = Count, y = Embarked, fill = Survived)) +
  geom_col(position = "fill")
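An alternative to filtering afterwards is to tell read.csv to treat blank fields as NA while importing, using its na.strings argument. The sketch below reads a tiny made-up CSV from a string instead of the real file. The `csv_text` content is hypothetical.

```r
# Hypothetical three-row CSV; the second passenger has no port recorded
csv_text <- "Name,Embarked
A,S
B,
C,Q"

# Default import: the blank field stays an empty string
default_read <- read.csv(text = csv_text)
default_read$Embarked == ""

# With na.strings, blank fields are converted to NA during import
na_read <- read.csv(text = csv_text, na.strings = c("", "NA"))
is.na(na_read$Embarked)
```

One caveat if you try this on train.csv: the Cabin column is blank for many passengers, as the glimpse output shows, so those would become NA too. Calling na.omit afterwards would then drop far more rows unless Cabin is removed first.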
From this chart we can see that passengers from Cherbourg have the
highest survival rate, followed by Southampton, and finally Queenstown.
Perhaps those with knowledge of the area can immediately deduce why
passengers departing from Cherbourg have a higher chance of survival.
However, since I don’t live anywhere near those ports and don’t have sufficient
geographical knowledge of the area, let’s try to deduce the reason by
looking at the characteristics of the passengers from each port. For
example, by their gender and class.
Let’s start with gender.
passengers_clean %>%
  filter(!Embarked %in% "") %>%
  group_by(Embarked, Sex) %>%
  summarise(Count = n()) %>%
  ggplot(mapping = aes(x = Count, y = Embarked, fill = Sex)) +
  geom_col(position = "fill")
As we can see by the chart above, Cherbourg has the highest
proportion of female passengers when compared to the other ports. This
could explain why Cherbourg passengers have a higher survival chance,
but Queenstown has the second largest proportion of female passengers
even though it has the highest mortality rate. Perhaps we’re missing
something?
Now, let’s check by the passenger’s class.
passengers_clean %>%
  filter(!Embarked %in% "") %>%
  group_by(Embarked, Pclass) %>%
  summarise(Count = n()) %>%
  ggplot(mapping = aes(x = Count, y = Embarked, fill = Pclass)) +
  geom_col(position = "fill")
Seems like class has a bigger effect on survival than gender in
this regard, seeing how Cherbourg has the highest proportion of First
Class passengers, followed by Southampton, with Queenstown last:
the same order as when we looked at the survival rate of passengers by
port alone. This suggests that Cherbourg has the
highest survival rate most likely because of the number of First Class
passengers who embarked from that port. In turn, we could also guess
that perhaps the area around Cherbourg had more wealthy citizens than
the other two ports.
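One way to check this kind of explanation more directly is to compare survival rates by port within each class, so that the class mix no longer blurs the comparison. Below is a minimal base R sketch using aggregate. The `toy` rows are invented and say nothing about the real result.

```r
# Hypothetical toy data; ports, classes and outcomes are made up
toy <- data.frame(
  Embarked = c("C", "C", "C", "S", "S", "S", "Q", "Q"),
  Pclass   = c(1, 1, 3, 1, 3, 3, 3, 3),
  Survived = c(1, 1, 0, 1, 0, 1, 0, 0)
)

# Mean of the 0/1 outcome for every port and class combination
rate_by_port_class <- aggregate(Survived ~ Embarked + Pclass,
                                data = toy, FUN = mean)
rate_by_port_class
```

If the ports still differ a lot within the same class, something other than class is at play. If the differences mostly vanish, class composition is a plausible explanation.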
From this analysis, we can conclude that a variety of factors affected a passenger’s survival chances on board the Titanic, and that some of those factors mattered more than others. The biggest factor would seem to be the passenger’s class, which tends to relate directly to their socioeconomic status. From this alone, we can see how much socioeconomic status mattered in 1912, especially in passenger shipping.
Thankfully, the Titanic disaster led to many changes (for the better) in maritime safety regulations, one of them being the requirement that every ship carry enough lifeboats for everyone on board. In the event that a similar accident happens, hopefully no one will have to be left behind again.