“There is no danger that Titanic will sink. The boat is unsinkable and nothing but inconvenience will be suffered by the passengers.”
Phillip Franklin, White Star Line vice-president, 1912
Once thought to be unsinkable, the Titanic ironically sank on her maiden voyage in the North Atlantic Ocean on 15 April 1912. It is generally believed that some 1,500 people perished and only 706 survived. It has been claimed that most of the 706 survivors were women and children, and that the majority were first-class passengers.
To fact-check these claims, we will do a simple analysis of the Titanic dataset taken from Kaggle.
The Titanic dataset has three CSV files: gender_submission.csv, test.csv, and train.csv. Because this assignment does not involve any machine learning, we will only use the train.csv file as a data source.
First, read train.csv and assign it to
titanic:
Check the head of titanic:
Check the tail of titanic:
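The reading and inspection steps above can be sketched as follows. So that the snippet runs on its own, it writes a three-row stand-in for train.csv to a temporary file; the `toy` name and the reduced set of columns are assumptions of this sketch, and with the real file you would pass its path to `read.csv()` instead.

```r
# Toy stand-in for train.csv: the first three passengers, reduced to six columns.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "PassengerId,Survived,Pclass,Sex,Age,Embarked",
  "1,0,3,male,22,S",
  "2,1,1,female,38,C",
  "3,1,3,female,26,S"
), csv_path)

# stringsAsFactors = FALSE keeps text columns as character vectors.
toy <- read.csv(csv_path, stringsAsFactors = FALSE)

head(toy, 2)  # first two rows
tail(toy, 2)  # last two rows
```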
Let’s see the structure of titanic dataframe:
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
The explanation of each column is:
PassengerId: The id of the passenger
Survived: The survival of the passenger (1 = Yes and 0 = No)
Pclass: Ticket class
Name: The name of the passenger
Sex: The sex of the passenger
Age: The age of the passenger (in years)
SibSp: The number of siblings or spouses aboard the Titanic
Parch: The number of parents or children aboard the Titanic
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown and S = Southampton)
We also want to know how many rows and columns there are in
titanic:
## [1] 891 12
There are 891 rows and 12 columns.
Because of the scope of this analysis, we will remove the unused columns SibSp, Parch, Ticket, Fare, and Cabin, bringing the column count down to 7:
## [1] 891 7
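A minimal sketch of dropping columns by name. The `toy` data frame is a hypothetical stand-in that holds only the first three rows of a few columns, as shown in the structure output earlier, and `drop_cols` is a name introduced for this sketch.

```r
# Toy stand-in for titanic (first three rows, selected columns only).
toy <- data.frame(
  PassengerId = 1:3,
  Survived    = c(0, 1, 1),
  SibSp       = c(1, 1, 0),
  Parch       = c(0, 0, 0),
  Ticket      = c("A/5 21171", "PC 17599", "STON/O2. 3101282"),
  Fare        = c(7.25, 71.28, 7.92),
  Cabin       = c("", "C85", ""),
  stringsAsFactors = FALSE
)

# Keep every column whose name is not in the drop list.
drop_cols <- c("SibSp", "Parch", "Ticket", "Fare", "Cabin")
toy <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

dim(toy)  # rows, then columns
```

Selecting by name rather than by position keeps the step valid even if the column order of the source file changes.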
The next step is to make sure that all the data types are correct. One way is to check the unique values of a column to determine whether it should be a factor. We will only check the columns suspected to have fewer than 10 unique values, in this case Survived, Pclass, Sex, and Embarked:
## [1] 0 1
## [1] 3 1 2
## [1] "male" "female"
## [1] "S" "C" "Q" ""
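One way to run this check over several columns at once is `lapply()` with `unique()`. This is a sketch on a hypothetical four-row data frame; the `toy`, `cols`, and `unique_values` names are introduced here for illustration.

```r
# Toy stand-in for titanic; values are illustrative only.
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 3, 2),
  Sex      = c("male", "female", "female", "male"),
  Embarked = c("S", "C", "", "Q"),
  stringsAsFactors = FALSE
)

# Collect the distinct values of each candidate column in a named list.
cols <- c("Survived", "Pclass", "Sex", "Embarked")
unique_values <- lapply(toy[cols], unique)
unique_values
```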
Notice that the Embarked column contains an empty value (""). This is likely because the port of embarkation was unknown for those passengers. So, before changing the data types, we will replace the empty value with another value. Check the positions of the empty Embarked values in the titanic dataframe:
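The empty values are in rows 62 and 830. Embarked is an unordered category, so a median is not meaningful for it (calling median() on a character column only works by accident of sort order). Instead, we will assign the empty values of the Embarked column to its most frequent value (the mode):
embarked_mode <- names(which.max(table(titanic$Embarked[titanic$Embarked != ""])))
titanic$Embarked[titanic$Embarked == ""] <- embarked_mode
After assigning the empty values to the most frequent value, let’s recheck to make sure that the empty string value has disappeared: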
## [1] "S" "C" "Q"
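A minimal sketch, on a toy vector, of locating empty strings and filling them with the most frequent non-empty value. Using the mode is an assumption of this sketch, chosen because a port of embarkation is an unordered category; the vector and variable names are hypothetical.

```r
# Toy stand-in for titanic$Embarked.
embarked <- c("S", "C", "", "S", "Q", "S", "")

# Positions of the empty strings.
empty_rows <- which(embarked == "")
empty_rows

# Most frequent value among the non-empty entries.
known <- embarked[embarked != ""]
embarked_mode <- names(which.max(table(known)))

# Fill the gaps and confirm that no empty string is left.
embarked[empty_rows] <- embarked_mode
unique(embarked)
```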
Great. Now we return to the earlier step of changing data types. The results above show that these four columns have fewer than 10 unique values, so they can be converted to factors:
titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
Recheck the newly changed data types:
## 'data.frame': 891 obs. of 7 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
Done.
Now, we need to make sure that there are no missing values. Let’s check for missing values:
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## Embarked
## 0
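A per-column count of missing values like the one above can be produced with `colSums(is.na(...))`. A minimal sketch on a hypothetical three-row data frame:

```r
# Toy stand-in for titanic; values are illustrative only.
toy <- data.frame(
  Sex = c("male", "female", "male"),
  Age = c(22, NA, NA),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUEs in each column.
missing_per_column <- colSums(is.na(toy))
missing_per_column
```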
There are 177 missing values in the Age column. To keep the data complete, we will replace each missing age with the median age of passengers of the same sex. Find the median age of male passengers and of female passengers:
male_pass <- titanic[titanic$Sex == "male", ]
male_median <- median(male_pass$Age, na.rm = TRUE)
male_median
## [1] 29
female_pass <- titanic[titanic$Sex == "female", ]
female_median <- median(female_pass$Age, na.rm = TRUE)
female_median
## [1] 27
Set all NA in column Age as 29 for male passengers and
27 for female passengers:
titanic$Age <- ifelse(titanic$Sex == "male" & is.na(titanic$Age), male_median, titanic$Age)
titanic$Age <- ifelse(titanic$Sex == "female" & is.na(titanic$Age), female_median, titanic$Age)
Recheck for any missing values:
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 0
## Embarked
## 0
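The imputation above handles each sex with its own `ifelse()` call. As an alternative sketch (not the approach used in this analysis), base R's `ave()` can fill each missing age with the median of the passenger's own group in one step. The toy data frame and its values are hypothetical.

```r
# Toy stand-in for titanic; values are illustrative only.
toy <- data.frame(
  Sex = c("male", "female", "male", "female", "male"),
  Age = c(22, 38, NA, NA, 30),
  stringsAsFactors = FALSE
)

# ave() applies the function to the ages of each sex separately and
# returns the results in the original row order.
toy$Age <- ave(toy$Age, toy$Sex, FUN = function(a) {
  ifelse(is.na(a), median(a, na.rm = TRUE), a)
})

toy$Age
```

The grouped form scales to more than two groups without adding a line per group.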
Now we’re good! No missing values remain. Let’s move on to the next point, which is Data Explanation.
Check the statistical summary of the titanic dataframe:
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 0:549 1:216 Length:891 female:314
## 1st Qu.:223.5 1:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
## Age Embarked
## Min. : 0.42 C:168
## 1st Qu.:22.00 Q: 77
## Median :29.00 S:646
## Mean :29.44
## 3rd Qu.:35.00
## Max. :80.00
The insights gained from the summary above:
342 passengers survived and 549 passengers did not
There were 216 passengers in the 1st class, 184 passengers in the 2nd class, and 491 passengers in the 3rd class
The passengers consisted of 314 females and 577 males
The average age of the passengers is 29.44, whereas the minimum age is 0.42 and the maximum age is 80.00
In the original data, before the Fare column was dropped, the average fare was 32.20, the minimum fare was 0.00, and the maximum fare was 512.33
As many as 646 passengers embarked from Southampton, followed by 168 from Cherbourg and 77 from Queenstown
In this section, we will analyze how each column in titanic relates to survival. But before that, let’s look at the overall survival distribution.
survival_dist <- prop.table(table(titanic$Survived)) * 100
survival_dist <- round(survival_dist, 1)
survival_dist_df <- as.data.frame(survival_dist)
names(survival_dist_df) <- c("Survived", "Percentage")
survival_dist_df$Survived <- ifelse(survival_dist_df$Survived == 0, "No", "Yes")
survival_dist_df
Take a look at the plot below:
library(ggplot2)
ggplot(survival_dist_df, aes(x = Percentage, y = Survived, fill = Survived)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("#dd7777", "#75a375")) +
labs(
title = "Survival Distribution",
subtitle = "In The Titanic Sinking Incident (1912)",
x = "Percentage",
y = "Survived?",
fill = "Survived") +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
This is a devastating finding: more than half of the passengers, 61.6%, did not survive. Only 38.4% of passengers survived.
We will break down survival by four different criteria: sex, passenger class, age, and embarkment.
The earlier claim was that more women survived. Let’s see if the claim is true:
sex_survival_dist <- prop.table(table(titanic$Sex, titanic$Survived)) * 100
sex_survival_dist <- round(sex_survival_dist, 1)
sex_survival_dist_df <- as.data.frame(sex_survival_dist)
colnames(sex_survival_dist_df) <- c("Sex", "Survived", "Percentage")
sex_survival_dist_df$Survived <- ifelse(sex_survival_dist_df$Survived == 0, "No", "Yes")
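sex_survival_dist_df
Because prop.table() is called without a margin, each value is a share of all passengers. Women who survived make up 26.2% of all passengers, and women who did not make up 9.1%. Meanwhile, men who survived make up 12.2% of all passengers and, sadly, men who did not make up 52.5%.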
ggplot(sex_survival_dist_df, aes(x = Percentage, y = Sex, fill = Survived)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("#dd7777", "#75a375")) +
labs(
title = "Survival By Sex",
subtitle = "In The Titanic Sinking Incident (1912)",
x = "Percentage",
y = "Sex",
fill = "Survived?") +
theme_minimal() +
theme(
legend.position = "right",
legend.title = element_text(margin = margin(t = 10)),
legend.box.margin = margin(0, 0, -10, 0),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
The claim is true. Women who survived make up 26.2% of all passengers, compared with 12.2% for men, even though women were only about a third of the passengers.
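The percentages above are shares of all 891 passengers. A complementary view is the survival rate within each sex, which `prop.table()` gives when `margin = 1` is passed. In the sketch below, the counts are reconstructed from the percentages above and the totals in the summary (314 women, 577 men), so treat them as back-calculated rather than read from the file; `sex_counts` and `sex_rates` are names introduced here.

```r
# Counts back-calculated from the reported shares of 891 passengers.
sex_counts <- matrix(
  c( 81, 233,   # female: did not survive, survived
    468, 109),  # male:   did not survive, survived
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("female", "male"), Survived = c("No", "Yes"))
)

# margin = 1 divides each cell by its row total, so every row sums to 100.
# With the real data this would be
# prop.table(table(titanic$Sex, titanic$Survived), margin = 1).
sex_rates <- round(prop.table(sex_counts, margin = 1) * 100, 1)
sex_rates
```

Read this way, about 74.2% of women survived against about 18.9% of men, which points in the same direction as the conclusion above.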
The earlier claim was that more first-class passengers survived. Let’s see if the claim is true:
pclass_survival_dist <- prop.table(table(titanic$Pclass, titanic$Survived)) * 100
pclass_survival_dist <- round(pclass_survival_dist, 1)
pclass_survival_dist_df <- as.data.frame(pclass_survival_dist)
colnames(pclass_survival_dist_df) <- c("Passenger Class", "Survived", "Percentage")
pclass_survival_dist_df$Survived <- ifelse(pclass_survival_dist_df$Survived == 0, "No", "Yes")
pclass_survival_dist_df
As shares of all passengers, first-class passengers who survived make up 15.3% and those who did not make up 9.0%. For second class, 9.8% survived and 10.9% did not. Lastly, third-class passengers who survived make up 13.4% and those who did not make up 41.8%.
ggplot(pclass_survival_dist_df, aes(x = Percentage, y = `Passenger Class`, fill = Survived)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("#dd7777", "#75a375")) +
labs(
title = "Survival Rate by Passenger Class",
subtitle = "In The Titanic Sinking Incident (1912)",
x = "Percentage",
y = "Passenger Class",
fill = "Survived?"
) +
theme_minimal() +
theme(
legend.position = "right",
legend.title = element_text(margin = margin(t = 10)),
legend.box.margin = margin(0, 0, -10, 0),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
The claim is true. First-class survivors make up the largest share at 15.3% of all passengers, followed by third-class survivors at 13.4% and second-class survivors at 9.8%.
The earlier claim was that many children survived. Let’s see if the claim is true. First, let’s use a histogram to check the age distribution of the survivors:
age_survived <- titanic[titanic$Survived == 1, ]$Age
ggplot(data = NULL, aes(x = age_survived)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
labs(
title = "Age Distribution of Survived Passengers",
subtitle = "In The Titanic Sinking Incident (1912)",
x = "Age",
y = "Frequency"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Based on the histogram above, there were fewer survivors under the age of 20 than over the age of 20. To make sure, we will build a proportion table based on Age and Survived. First, create a new column named IsAdult. This column contains 1 if the passenger is an adult (Age greater than or equal to 18) and 0 if the passenger is a child (Age less than 18).
titanic$IsAdult <- ifelse(titanic$Age >= 18, 1, 0)
age_survival <- round(prop.table(table(titanic$IsAdult, titanic$Survived)) * 100, 1)
age_survival_df <- as.data.frame(age_survival)
colnames(age_survival_df) <- c("IsAdult", "Survived", "Percentage")
age_survival_df$IsAdult <- ifelse(age_survival_df$IsAdult == 0, "No", "Yes")
age_survival_df$Survived <- ifelse(age_survival_df$Survived == 0, "No", "Yes")
age_survival_df
As shares of all passengers, children who survived make up 6.8% and children who did not make up 5.8%. Adults who survived make up 31.5% of all passengers, and adults who did not make up 55.8%.
ggplot(age_survival_df, aes(x = Percentage, y = IsAdult, fill = Survived)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("#dd7777", "#75a375")) +
labs(
title = "Survival By Age Category",
subtitle = "In The Titanic Sinking Incident (1912)",
x = "Percentage",
y = "Is Adult?",
fill = "Survived?") +
theme_minimal() +
theme(
legend.position = "right",
legend.title = element_text(margin = margin(t = 10)),
legend.box.margin = margin(0, 0, -10, 0),
plot.title = element_text(hjust = 0.5),
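plot.subtitle = element_text(hjust = 0.5))
Measured this way, the claim is false: adult survivors make up 31.5% of all passengers, whereas child survivors make up only 6.8%. Keep in mind that children were a small group (12.6% of passengers), so these figures compare the number of survivors, not the survival rate within each age group.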
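The table above reports each cell as a share of all passengers. To compare survival within each age group instead, one option is to average a numeric 0/1 survival flag per group with `tapply()`. The sketch below uses made-up values: the `toy` data frame, its ages, and its outcomes are hypothetical and are not taken from the dataset.

```r
# Hypothetical passengers; Survived must be numeric 0/1 for mean() to work.
# With the factor column used in this analysis, convert it first, for
# example with as.integer(as.character(titanic$Survived)).
toy <- data.frame(
  Age      = c(2, 14, 17, 18, 27, 29, 35, 54),
  Survived = c(1, 1, 0, 0, 1, 0, 0, 0)
)

# Same rule as IsAdult: 18 and over is an adult.
toy$AgeGroup <- ifelse(toy$Age >= 18, "Adult", "Child")

# The mean of a 0/1 flag is the fraction of the group that survived.
group_rates <- tapply(toy$Survived, toy$AgeGroup, mean)
round(group_rates * 100, 1)
```

A within-group rate and a share of all passengers answer different questions, so it is worth stating which one a conclusion rests on.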
There was no claim about the point of embarkment being a factor in survival. That being said, let’s see which port had the largest number of survivors:
embarked_survival_dist <- round(prop.table(table(titanic$Embarked, titanic$Survived)) * 100, 1)
embarked_survival_dist_df <- as.data.frame(embarked_survival_dist)
colnames(embarked_survival_dist_df) <- c("Embarkment", "Survived", "Percentage")
embarked_survival_dist_df$Survived <- ifelse(embarked_survival_dist_df$Survived == 0, "No", "Yes")
embarked_survival_dist_df
As a reminder, the values in the Embarked column are abbreviations for cities:
C for Cherbourg
Q for Queenstown
S for Southampton
As shares of all passengers, those who embarked from Cherbourg and survived make up 10.4%, and those who did not make up 8.4%. For Queenstown, 3.4% survived and 5.3% did not. Lastly, for Southampton, 24.6% survived and 47.9% did not.
ggplot(embarked_survival_dist_df, aes(x = Percentage, y = Embarkment, fill = Survived)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("#dd7777", "#75a375")) +
labs(
title = "Survival Rate by Embarkment",
subtitle = "In The Titanic Sinking Incident (1912)",
x = "Percentage",
y = "Embarkment",
fill = "Survived?"
) +
theme_minimal() +
theme(
legend.position = "right",
legend.title = element_text(margin = margin(t = 10)),
legend.box.margin = margin(0, 0, -10, 0),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
The majority of passengers embarked from Southampton, so it also has the most survivors at 24.6% and the most casualties at 47.9%.
We have analyzed survival based on sex, passenger class, age, and embarkment. Counting survivors as a share of all passengers, we reach these conclusions:
More women survived than men
More first-class passengers survived than the other classes
More adult passengers survived than child passengers
More passengers who embarked from Southampton survived than passengers from the other ports
Keep in mind that this is a simple analysis based on a subset of the data (891 passengers). The conclusions may be different if we use additional data.
Considering all of the facts, the sinking of the Titanic changed various safety procedures that made traveling by ship safer. Its legacy will be remembered by people for hundreds of years to come.