1 Background

“There is no danger that Titanic will sink. The boat is unsinkable and nothing but inconvenience will be suffered by the passengers.”

Phillip Franklin, White Star Line vice-president, 1912

Once thought to be unsinkable, the Titanic ironically sank on her first voyage in the North Atlantic Ocean on 15 April 1912. It is generally believed that some 1,500 people perished and only 706 survived. It has been claimed that most of the 706 survivors were women and children, and that the majority were first-class passengers.

We will fact-check these claims with a simple analysis of the Titanic dataset taken from Kaggle.

2 Data Preparation

The Titanic dataset has three CSV files: gender_submission.csv, test.csv, and train.csv. Because this assignment does not involve machine learning, we will use only train.csv as the data source.

First, read train.csv and assign it to titanic:

titanic <- read.csv("data_input/train.csv")
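As an optional variation, read.csv() can turn blank fields into NA at load time through its na.strings argument, so that they are caught by is.na() later on. The sketch below is self-contained: it parses a small inline sample (made-up rows, not the real file) through the text argument instead of reading data_input/train.csv.

```r
# Hypothetical three-row sample standing in for train.csv
sample_csv <- "PassengerId,Survived,Embarked
1,0,S
2,1,
3,1,C"

# na.strings = c("", "NA") converts blank fields to NA while reading
sample_df <- read.csv(text = sample_csv, na.strings = c("", "NA"))

# Count missing values per column
colSums(is.na(sample_df))
```

This report keeps the default behaviour and handles the empty strings by hand in the cleansing step.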

2.1 Data Inspection

Check the head of titanic:

head(titanic)

Check the tail of titanic:

tail(titanic)

Let’s see the structure of the titanic dataframe:

str(titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

The explanation of each column is:

  • PassengerId: The id of the passenger

  • Survived: The survival of the passenger (1 = Yes and 0 = No)

  • Pclass: Ticket class

  • Name: The name of the passenger

  • Sex: The sex of the passenger

  • Age: The age of the passenger (in years)

  • SibSp: The number of siblings or spouses aboard the Titanic

  • Parch: The number of parents or children aboard the Titanic

  • Ticket: Ticket number

  • Fare: Passenger fare

  • Cabin: Cabin number

  • Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown and S = Southampton)

We also want to know how many rows and columns there are in titanic:

dim(titanic)
## [1] 891  12

There are 891 rows and 12 columns.

2.2 Data Cleansing & Coercion

Because of the scope of this analysis, we will remove the unused columns SibSp, Parch, Ticket, Fare, and Cabin, bringing the column count down to 7:

titanic <- subset(titanic, select = c(-SibSp, -Parch, -Ticket, -Fare, -Cabin))
dim(titanic)
## [1] 891   7

The next step is to make sure that all the data types are correct. One way to decide whether a column should be a factor is to check its unique values. We will only check the columns suspected of having fewer than 10 unique values, in this case Survived, Pclass, Sex, and Embarked:

unique(titanic$Survived)
## [1] 0 1
unique(titanic$Pclass)
## [1] 3 1 2
unique(titanic$Sex)
## [1] "male"   "female"
unique(titanic$Embarked)
## [1] "S" "C" "Q" ""

Notice that in the Embarked column, there is an empty value ("").

This is likely because the port at which those passengers embarked was unknown. So, before changing the data types, we will replace the empty value with another value. First, locate the rows with an empty Embarked value in the titanic dataframe:

titanic[titanic$Embarked == "", ]

The empty values are in rows 62 and 830. Because Embarked is a categorical column, a median is not meaningful for it, so we will replace the empty values with the most frequent value of Embarked (the mode):

# Most frequent port, ignoring the empty strings
embarked_mode <- names(which.max(table(titanic$Embarked[titanic$Embarked != ""])))
titanic$Embarked[titanic$Embarked == ""] <- embarked_mode

After replacing the empty values with the mode, let’s recheck to make sure that the empty string value has disappeared:

unique(titanic$Embarked)
## [1] "S" "C" "Q"
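The replacement of empty strings with the most frequent category can also be wrapped in a small reusable function. The sketch below is only an illustration: fill_with_mode is a hypothetical helper name, and ports is toy data rather than the real Embarked column.

```r
# Hypothetical helper: replace empty strings with the most frequent non-empty value
fill_with_mode <- function(x) {
  known <- x[!is.na(x) & x != ""]
  mode_value <- names(which.max(table(known)))
  x[!is.na(x) & x == ""] <- mode_value
  x
}

ports <- c("S", "C", "", "S", "Q", "")  # toy data, not the real column
fill_with_mode(ports)
```

When two categories are tied, which.max() returns the first one in table order, so the result depends on the alphabetical order of the levels.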

Now we will continue with the earlier step, changing the data types. The results above show that these four columns have fewer than 10 unique values, so they can be converted to the factor data type:

titanic$Survived <- as.factor(titanic$Survived)
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
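The four conversions can also be done in one step by applying as.factor() over a vector of column names with lapply(). This is a minimal sketch on a made-up data frame that only borrows the column names; factor_cols is an illustrative variable name.

```r
# Toy data frame with the same column names (values are made up)
toy <- data.frame(
  Survived = c(0, 1, 1),
  Pclass   = c(3, 1, 2),
  Sex      = c("male", "female", "female"),
  Embarked = c("S", "C", "Q")
)

# Convert every listed column to a factor in one step
factor_cols <- c("Survived", "Pclass", "Sex", "Embarked")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```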

Recheck for the newly changed datatypes:

str(titanic)
## 'data.frame':    891 obs. of  7 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

Done.

2.3 Missing Data

Now, we need to make sure that there are no missing values. Let’s check for missing values:

colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##    Embarked 
##           0

There are 177 missing values in the Age column. To keep the data complete, we will replace the missing ages with the median age for each sex. First, find the median age of male and female passengers:

male_pass <- titanic[titanic$Sex == "male", ]
male_median <- median(male_pass$Age, na.rm = TRUE)
male_median
## [1] 29
female_pass <- titanic[titanic$Sex == "female", ]
female_median <- median(female_pass$Age, na.rm = TRUE)
female_median
## [1] 27

Set all NA in column Age as 29 for male passengers and 27 for female passengers:

titanic$Age <- ifelse(titanic$Sex == "male" & is.na(titanic$Age), male_median, titanic$Age)
titanic$Age <- ifelse(titanic$Sex == "female" & is.na(titanic$Age), female_median, titanic$Age)

Recheck for any missing values:

colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##    Embarked 
##           0

Now we’re good! None of the columns has missing values anymore. Let’s move on to the next point, which is checking for duplicate data.
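The sex-based median imputation above can also be written without separate subsets by using ave(), which applies a function within each group and returns the values in the original row order. The sketch below uses made-up ages, not the Titanic data.

```r
# Toy data (made up) with missing ages
toy <- data.frame(
  Sex = c("male", "male", "male", "female", "female", "female"),
  Age = c(20, NA, 40, 10, 30, NA)
)

# Within each Sex group, replace NA with that group's median age
toy$Age <- ave(toy$Age, toy$Sex, FUN = function(a) {
  ifelse(is.na(a), median(a, na.rm = TRUE), a)
})

toy
```

In this toy example the missing male age becomes 30 and the missing female age becomes 20, the medians of the known ages in each group.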

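2.4 Duplicate Data

Let’s move on to the next step, which is to check for duplicated data. We want to check titanic for rows with the same Name, Sex, and Age:

# Pass the columns as one data frame; a second positional argument would be read as `incomparables`
duplicates_pssgr <- duplicated(titanic[, c("Name", "Sex", "Age")], fromLast = TRUE)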
sum(duplicates_pssgr)
## [1] 0

There are no duplicate records.
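To illustrate how duplicated() behaves when several columns are compared at once, here is a minimal sketch on made-up rows. Passing a data frame makes duplicated() compare whole rows, whereas the second positional argument of duplicated() is incomparables, not another column.

```r
# Toy data (made up): rows 1 and 3 share Name, Sex, and Age
toy <- data.frame(
  Name = c("Doe, Mr. John", "Roe, Miss. Jane", "Doe, Mr. John"),
  Sex  = c("male", "female", "male"),
  Age  = c(30, 25, 30)
)

# duplicated() on a data frame flags rows that repeat an earlier row
dup_rows <- duplicated(toy[, c("Name", "Sex", "Age")])
sum(dup_rows)
```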

3 Data Explanation

Check the summary statistics of the titanic dataframe:

summary(titanic)
##   PassengerId    Survived Pclass      Name               Sex     
##  Min.   :  1.0   0:549    1:216   Length:891         female:314  
##  1st Qu.:223.5   1:342    2:184   Class :character   male  :577  
##  Median :446.0            3:491   Mode  :character               
##  Mean   :446.0                                                   
##  3rd Qu.:668.5                                                   
##  Max.   :891.0                                                   
##       Age        Embarked
##  Min.   : 0.42   C:168   
##  1st Qu.:22.00   Q: 77   
##  Median :29.00   S:646   
##  Mean   :29.44           
##  3rd Qu.:35.00           
##  Max.   :80.00

The insights gained from the summary above:

  • 342 passengers survived and 549 did not

  • There were 216 passengers in 1st class, 184 in 2nd class, and 491 in 3rd class

  • The passengers consisted of 314 females and 577 males

  • The average age of the passengers is 29.44, the minimum age is 0.42, and the maximum age is 80.00

  • The Fare column was dropped during cleansing, so it is absent from this summary; in the original data, the average fare is 32.20, the minimum fare is 0.00, and the maximum fare is 512.33

  • As many as 646 passengers embarked at Southampton, followed by 168 at Cherbourg and 77 at Queenstown

4 Data Analysis

In this section, we will analyze how each column in titanic relates to the survival rate. But before that, let’s look at the overall survival distribution.

4.1 Distribution of Survival Rate

survival_dist <- prop.table(table(titanic$Survived)) * 100
survival_dist <- round(survival_dist, 1)

survival_dist_df <- as.data.frame(survival_dist)
names(survival_dist_df) <- c("Survived", "Percentage")
survival_dist_df$Survived <- ifelse(survival_dist_df$Survived == 0, "No", "Yes")

survival_dist_df

Take a look at the plot below:

library(ggplot2)

ggplot(survival_dist_df, aes(x = Percentage, y = Survived, fill = Survived)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("#dd7777", "#75a375")) +
  labs(
    title = "Survival Distribution",
    subtitle = "In The Titanic Sinking Incident (1912)",
    x = "Percentage",
    y = "Survived?",
    fill = "Survived") +
  theme_minimal() +
  theme(
    legend.position = "none", 
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5))

This is a devastating finding: more than half of the passengers, 61.6%, did not survive. Only 38.4% of the passengers survived.
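These two percentages can be cross-checked directly from the counts printed by summary() (549 and 342), without the data file. In the sketch below, survival_counts and survival_pct are illustrative names.

```r
# Counts taken from the summary() output (Survived: 0 = 549, 1 = 342)
survival_counts <- c(No = 549, Yes = 342)

# Share of each outcome among all 891 passengers, in percent
survival_pct <- round(prop.table(survival_counts) * 100, 1)
survival_pct
```

The result is 61.6 for No and 38.4 for Yes, matching the table.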

4.2 Survivability Rate

We will break down the survivability rate by four criteria: sex, passenger class, age, and embarkment.

4.2.1 By Sex

Previous claims said that more women survived. Let’s see if the claim is true:

sex_survival_dist <- prop.table(table(titanic$Sex, titanic$Survived)) * 100
sex_survival_dist <- round(sex_survival_dist, 1)

sex_survival_dist_df <- as.data.frame(sex_survival_dist)
colnames(sex_survival_dist_df) <- c("Sex", "Survived", "Percentage")
sex_survival_dist_df$Survived <- ifelse(sex_survival_dist_df$Survived == 0, "No", "Yes")
sex_survival_dist_df

Women who survived make up 26.2% of all passengers, and women who didn’t make up 9.1%. Meanwhile, men who survived make up 12.2% of all passengers, and sadly men who didn’t make up 52.5%.

ggplot(sex_survival_dist_df, aes(x = Percentage, y = Sex, fill = Survived)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("#dd7777", "#75a375")) +
  labs(
    title = "Survival By Sex",
    subtitle = "In The Titanic Sinking Incident (1912)",
    x = "Percentage",
    y = "Sex",
    fill = "Survived?") +
  theme_minimal() +
  theme(
    legend.position = "right", 
    legend.title = element_text(margin = margin(t = 10)),
    legend.box.margin = margin(0, 0, -10, 0),
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5))

The claim is true. Women who survived account for 26.2% of all passengers, more than twice the 12.2% for men, even though women made up only about a third of the passengers.
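The percentages in this subsection are shares of all 891 passengers, so they depend on how many women and men were aboard. To compare the survival rate within each sex, prop.table() accepts margin = 1, which makes each row sum to 100%. The sketch below is self-contained; its counts are an assumption, reconstructed from the rounded percentages (26.2%, 9.1%, 12.2%, and 52.5% of 891), although they do add up to the 314 females and 577 males reported by summary().

```r
# Counts reconstructed from the rounded percentages (an assumption)
sex_counts <- matrix(
  c(81, 233,
    468, 109),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("female", "male"), Survived = c("No", "Yes"))
)

# margin = 1 normalises each row, giving the survival rate within each sex
sex_rates <- round(prop.table(sex_counts, margin = 1) * 100, 1)
sex_rates
```

Under these counts, about 74.2% of women survived compared with about 18.9% of men, which supports the same conclusion.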

4.2.2 By Passenger Class

Previous claims said that more first-class passengers survived. Let’s see if the claim is true:

pclass_survival_dist <- prop.table(table(titanic$Pclass, titanic$Survived)) * 100
pclass_survival_dist <- round(pclass_survival_dist, 1)

pclass_survival_dist_df <- as.data.frame(pclass_survival_dist)
colnames(pclass_survival_dist_df) <- c("Passenger Class", "Survived", "Percentage")
pclass_survival_dist_df$Survived <- ifelse(pclass_survival_dist_df$Survived == 0, "No", "Yes")
pclass_survival_dist_df

First-class passengers who survived make up 15.3% of all passengers, and those who didn’t make up 9.0%. For second class, the figures are 9.8% and 10.9%. Lastly, third-class passengers who survived make up 13.4% of all passengers, and those who didn’t make up 41.8%.

ggplot(pclass_survival_dist_df, aes(x = Percentage, y = `Passenger Class`, fill = Survived)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("#dd7777", "#75a375")) +
  labs(
    title = "Survival Rate by Passenger Class",
    subtitle = "In The Titanic Sinking Incident (1912)",
    x = "Percentage",
    y = "Passenger Class",
    fill = "Survived?"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right", 
    legend.title = element_text(margin = margin(t = 10)),
    legend.box.margin = margin(0, 0, -10, 0),
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5))

The claim is true. First-class passengers form the largest group of survivors at 15.3% of all passengers, followed by third-class passengers at 13.4% and second-class passengers at 9.8%.
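Since the original claim is about the make-up of the survivors, it can also help to normalise by column instead: prop.table() with margin = 2 gives each class’s share among survivors and among non-survivors. The counts below are an assumption, reconstructed from the rounded percentages in this subsection; they add up to the 216, 184, and 491 passengers per class reported by summary().

```r
# Counts reconstructed from the rounded percentages (an assumption)
pclass_counts <- matrix(
  c(80, 136,
    97, 87,
    372, 119),
  nrow = 3, byrow = TRUE,
  dimnames = list(Pclass = c("1", "2", "3"), Survived = c("No", "Yes"))
)

# margin = 2 normalises each column, giving each class's share of survivors
survivor_share <- round(prop.table(pclass_counts, margin = 2) * 100, 1)
survivor_share
```

Under these counts, first class accounts for about 39.8% of the survivors, ahead of third class (34.8%) and second class (25.4%): the largest single group, though less than half.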

4.2.3 By Age

Previous claims said that many children survived. Let’s see if the claim is true. First, let’s look at a histogram of the ages of the survivors:

age_survived <- titanic[titanic$Survived == 1, ]$Age

ggplot(data = NULL, aes(x = age_survived)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(
    title = "Age Distribution of Survived Passengers",
    subtitle = "In The Titanic Sinking Incident (1912)",
    x = "Age",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Based on the histogram above, fewer survivors were under the age of 20 than over it. To make sure, we will build a proportion table of Age and Survival. First, create a new column named IsAdult. It contains 1 if the passenger is an adult (Age greater than or equal to 18) and 0 if the passenger is a child (Age less than 18).

titanic$IsAdult <- ifelse(titanic$Age >= 18, 1, 0)
age_survival <- round(prop.table(table(titanic$IsAdult, titanic$Survived)) * 100, 1)

age_survival_df <- as.data.frame(age_survival)
colnames(age_survival_df) <- c("IsAdult", "Survived", "Percentage")
age_survival_df$IsAdult <- ifelse(age_survival_df$IsAdult == 0, "No", "Yes")
age_survival_df$Survived <- ifelse(age_survival_df$Survived == 0, "No", "Yes")
age_survival_df

Children who survived make up 6.8% of all passengers, and a devastating 5.8% were children who didn’t. Adults who survived make up 31.5% of all passengers, and adults who didn’t make up 55.8%.

ggplot(age_survival_df, aes(x = Percentage, y = IsAdult, fill = Survived)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("#dd7777", "#75a375")) +
  labs(
    title = "Survival By Age Category",
    subtitle = "In The Titanic Sinking Incident (1912)",
    x = "Percentage",
    y = "Is Adult?",
    fill = "Survived?") +
  theme_minimal() +
  theme(
    legend.position = "right", 
    legend.title = element_text(margin = margin(t = 10)),
    legend.box.margin = margin(0, 0, -10, 0),
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5))

The claim is false in terms of these shares: adult survivors account for 31.5% of all passengers, whereas child survivors account for only 6.8%.
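Because there are far fewer children than adults in the data, a complementary view is the survival rate within each age group, computed by dividing the survivors in a group by the size of that group. The sketch below is self-contained; its counts are an assumption, reconstructed from the rounded percentages in this subsection, and they add up to 891. Note also that the 177 imputed ages (27 or 29) all fall in the adult group.

```r
# Counts reconstructed from the rounded percentages (an assumption)
age_counts <- matrix(
  c(52, 61,
    497, 281),
  nrow = 2, byrow = TRUE,
  dimnames = list(IsAdult = c("No", "Yes"), Survived = c("No", "Yes"))
)

# Group sizes and the percentage of each group that survived
group_size <- rowSums(age_counts)
survival_rate <- round(age_counts[, "Yes"] / group_size * 100, 1)

data.frame(
  IsAdult = names(group_size),
  Passengers = group_size,
  SurvivalRate = survival_rate
)
```

Under these counts, the within-group rate is about 54.0% for children and 36.1% for adults, so the two views answer different questions: adults supply most of the survivors, while a child was more likely to survive than an adult.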

4.2.4 By Embarkment

There was no claim about the port of embarkment affecting survivability. That being said, let’s see which port had the most survivors:

embarked_survival_dist <- round(prop.table(table(titanic$Embarked, titanic$Survived)) * 100, 1)

embarked_survival_dist_df <- as.data.frame(embarked_survival_dist)
colnames(embarked_survival_dist_df) <- c("Embarkment", "Survived", "Percentage")
embarked_survival_dist_df$Survived <- ifelse(embarked_survival_dist_df$Survived == 0, "No", "Yes")
embarked_survival_dist_df

As a reminder, the values in the Embarked column are codes for the ports:

  • C for Cherbourg

  • Q for Queenstown

  • S for Southampton

Passengers who embarked at Cherbourg and survived make up 10.4% of all passengers, and those who didn’t make up 8.4%. For Queenstown, the figures are 3.4% and 5.3%. Lastly, passengers who embarked at Southampton and survived make up 24.6% of all passengers, and those who didn’t make up 47.9%.

ggplot(embarked_survival_dist_df, aes(x = Percentage, y = Embarkment, fill = Survived)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("#dd7777", "#75a375")) +
  labs(
    title = "Survival Rate by Embarkment",
    subtitle = "In The Titanic Sinking Incident (1912)",
    x = "Percentage",
    y = "Embarkment",
    fill = "Survived?"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right", 
    legend.title = element_text(margin = margin(t = 10)),
    legend.box.margin = margin(0, 0, -10, 0),
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5))

The majority of passengers embarked at Southampton, so it also has the most survivors at 24.6% and the most casualties at 47.9%.

5 Conclusion

We have analyzed survivability based on sex, passenger class, age, and embarkment. In doing so, we reach these conclusions:

  • More women survived than men

  • More first-class passengers survived than the other classes

  • More adult passengers survived than child passengers

  • More passengers who embarked at Southampton survived than passengers from the other ports

Keep in mind that this is a simple analysis based on a subset of the data (the 891 passengers in train.csv). The conclusions may be different if we use additional data.

Considering all of the facts, the sinking of the Titanic changed various safety procedures that made traveling by ship safer. Its legacy will be remembered by people for hundreds of years to come.

6 References

  1. Wikipedia
  2. Britannica