Introduction

The RMS Titanic story has living with us through numerous stories and films until now. Titanic was known as the one of the world’s most celebrated ship because it was the biggest ship that has been made by human at that time. Through its popularity, many wealthy people, high-ranking officials and celebrities were on-board to witness the Titanic’s Maiden Voyage along with other 2,222 passengers and crew.

Departure of Titanic on April 10, 1912

On April 15, after 5 days of sailing, Titanic has sank in the North Atlantic after crashing an iceberg. Of the 2,222 passengers and crew on board. More than 1,500 lost their lives in the disaster. Because of this unfortunate event, there are so many studies and speculation about the lack of satisfactory emergency procedures when the disaster is happening. Such as the lifeboats shortage, the violation of unwritten law of the sea , first-class priority, and much more.

In this document, we will try to answer some speculation with the analysis about this unfortunate event that happened in 1912.

Understanding our Data

We will use “Kaggle’s Titanic Dataset” that’s already famous for many similar prediction.

titanic <- read.csv("datainput/train.csv",row.names = "PassengerId")

Initial Column Description:

Survival: Survival, 0 = No, 1 = Yes
Pclass: Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd
Sex: Gender
Age: Age in years
SibSp: number of siblings / spouses aboard the Titanic
Parch: number of parents / children aboard the Titanic
Ticket: Ticket number
Fare: Passenger Fare
Cabin: Cabin number
Embarked: Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southampton

Data Cleansing

Before we go any further, we should see the condition of our data then cleaning the data to convert a raw dataset into a useable form to make it more relevant with our analysis later.

Missing Values

First thing that we want to do is looking for missing values. It is important to handle missing data because any analytic results based on a dataset with missing values could be biased. 11 https://en.wikipedia.org/wiki/Missing_data

Data Structure:

str(titanic)

## 'data.frame':    891 obs. of  11 variables:
##  $ Survived: int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass  : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name    : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex     : chr  "male" "female" "female" "female" ...
##  $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket  : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin   : chr  "" "C85" "" "C123" ...
##  $ Embarked: chr  "S" "C" "S" "S" ...

sort(colSums(is.na(titanic)),decreasing=T)

##      Age Survived   Pclass     Name      Sex    SibSp    Parch   Ticket 
##      177        0        0        0        0        0        0        0 
##     Fare    Cabin Embarked 
##        0        0        0

nrow(titanic)

## [1] 891

The only column that has missing values is Age column. However, the proportion of the missing values is quite large: 177 rows of 891 rows which is about 19,9% of the data. And If we do further scrunity with our Data Structure, we can see in the Cabin column, there are values with “” means the Passenger’s Cabin was not specified/known during the collection of the data. Thus we can treat the value as missing values and check its proportion within the column.

sort(head(table(titanic$Cabin)), decreasing=T)

## 
##     A10 A14 A16 A19 A20 
## 687   1   1   1   1   1

There are, too, large amount of missing values (687 rows of 891) in Cabin column which is 77% of the data. Since our goal is to avoid biased analytics, we need to drop these 2 columns because we will not use it for our analysis.

titanic <- subset(titanic,select=-c(Age,Cabin))

Inessential Columns

What made a column to be Inessential? In this project, we will determine it by looking at which column that’s already been represented with another column and which column that doesn’t have any relation with the Survival column by just looking at it. As of now, by our understanding, there are 6 columns with such characteristics: Ticket, Embarked, Fare, Name, SibSp, and Parch.

Reason of Removal:

Ticket: Too much unique values and it doesn’t represent anything that related to Survivability.
Embarked: It doesn’t mean much since when the disaster happened, all passengers were already embarked from their respective docks.
Fare: If we talk about how cheap/expensive each fare. It should be already represented with Pclass since the higher the Fare, the higher Pclass that you can get.
Name: Same as Fare, we can represent each name Title with Pclass along with Sex.

Drop those columns from our dataset.

titanic <- subset(titanic,select=-c(Ticket,Embarked,Fare,Name))

`Relatives`

SibSp and Parch: Since these two columns represent how much Passenger’s Relatives that were also on-board, then there are chances that passenger was on-board without any relatives.

Before we drop those columns altogether, we want to simplify them into one new column called Relatives. This column will give you value of No if you’re alone and Yes if you’re with relatives no matter how much is it.

titanic$FamilySize <- titanic$SibSp + titanic$Parch

convert_family = function(x){
  if(x == 0) {x <- "No"}
  else {x <- "Yes"}
}

titanic$Relatives <- sapply(X = titanic$FamilySize, 
                            FUN = convert_family)

Check the if the Relatives values are as we desired or not:

head(titanic[,c("SibSp","Parch","FamilySize","Relatives")],8)

##   SibSp Parch FamilySize Relatives
## 1     1     0          1       Yes
## 2     1     0          1       Yes
## 3     0     0          0        No
## 4     1     0          1       Yes
## 5     0     0          0        No
## 6     0     0          0        No
## 7     0     0          0        No
## 8     3     1          4       Yes

Drop SibSp, Parch, and FamilySize columns since their values already represented by our new Relatives column.

titanic <- subset(titanic,select=-c(SibSp,Parch,FamilySize))

Final Dataset

knitr::kable(titanic[1:6,c("Sex","Relatives","Pclass","Survived")], 
             caption = 'Final Dataset.')

Final Dataset.

Sex	Relatives	Pclass	Survived
male	Yes	3	0
female	Yes	1	1
female	No	3	1
female	Yes	1	1
male	No	3	0
male	No	3	0

Check the final data type:

str(titanic)

## 'data.frame':    891 obs. of  4 variables:
##  $ Survived : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass   : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Sex      : chr  "male" "female" "female" "female" ...
##  $ Relatives: chr  "Yes" "Yes" "No" "Yes" ...

Change the final data type to the desired one:

titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Relatives <- as.factor(titanic$Relatives)

Analysis

The rest of this document will be consists of our analysis about Survivability Factor based on three variables left in our dataset.

Pclass Factor

“Did you know? Passengers traveling first class on Titanic were roughly 44 percent more likely to survive than other passengers.”22 https://www.history.com/topics/early-20th-century-us/titanic

— history.com

To answer that same question, we will try to calculate the chance of survival between each class.

# Calculate the chance
surv_mean_class <- aggregate(Survived~Pclass,titanic,mean)
surv_mean_class$Survived <- surv_mean_class$Survived * 100
colnames(surv_mean_class)[2] <- "Survival_Chance"
surv_mean_class$Pclass <- factor(surv_mean_class$Pclass,
                                 labels = c('1st Class', '2nd Class', '3rd Class'))
knitr::kable(surv_mean_class)

Pclass	Survival_Chance
1st Class	62.96296
2nd Class	47.28261
3rd Class	24.23625

# Plot
ggplot(surv_mean_class, aes(Pclass, Survival_Chance))+
  geom_col(aes(fill=Pclass), fill = c("#C6563D","#292625","#959595"), col="Black")+
  theme(plot.background = element_rect(fill = "#EEEEEE", color = "#EEEEEE"),
        panel.background = element_rect(fill = "#EEEEEE"),
        panel.grid = element_line(colour = "#EEEEEE"))+
  labs(x=NULL,
       y="Survival Chance(%)")

The 1st Class were roughly 16% and 39% more likely to survive than the 2nd Class and 3rd Class respectively. But What if we want to know the chance between the 1st Class and the 2nd & 3rd altogether?

# Change the 2nd & 3rd Class into 'Other Class'
titanic_other <- titanic

convert_class = function(x){
  if(x == "1") {x <- "1st Class"}
  else {x <- "Other Class"}
}

titanic_other$Pclass <- sapply(X = titanic_other$Pclass, 
                            FUN = convert_class)
titanic_other$Pclass <- as.factor(titanic_other$Pclass)

# Calculate the chance
surv_mean_class_other <- aggregate(Survived~Pclass,titanic_other,mean)
surv_mean_class_other$Survived <- surv_mean_class_other$Survived * 100
colnames(surv_mean_class_other)[2] <- "Survival_Chance"
knitr::kable(surv_mean_class_other)

Pclass	Survival_Chance
1st Class	62.96296
Other Class	30.51852

# Plot
ggplot(surv_mean_class_other, aes(Pclass, Survival_Chance))+
  geom_col(aes(fill=Pclass), fill = c("#C6563D","#292625"), col="Black")+
  theme(plot.background = element_rect(fill = "#EEEEEE", color = "#EEEEEE"),
        panel.background = element_rect(fill = "#EEEEEE"),
        panel.grid = element_line(colour = "#EEEEEE"))+
  labs(x=NULL,
       y="Survival Chance(%)")

When we grouped the other class altogether, the differences between the 1st Class and the Other Class is about 32%. The result, however, is different with the number that has been stated by history.com. The reason of this might be because the dataset from Kaggle is not complete enough which only 891 Passengers recorded out of 2222 Passengers.33 https://titanicfacts.net/titanic-passengers/

However, we can clearly see the same pattern within these two results, that, the 1st Class has always likely to have better survival chance rather than the other classes.

Sex Factor

The origin of the phrase “Women and Children First” was started when HMS Birkenhead sank off the coast of South Africa on 26th February 1852.44 https://www.phrases.org.uk/meanings/women-and-children-first.html Saving women and children first was tied so closely to the disaster that the practice became known as the Birkenhead Drill. 55 https://www.historic-uk.com/CultureUK/Women-Children-First/

Margaret Brown, better known as ‘the unsinkable Molly Brown’, earned her nickname because she survived the sinking of the Titanic and later went on to become a staunch philanthropist and activist.

In this section, we am going to calculate the chance of survival based on Passenger’s Gender to see if the drill has been maintained when the disaster is happening.

# Calculate the chance
surv_mean_sex <- aggregate(Survived~Sex,titanic,mean)
surv_mean_sex$Survived <- surv_mean_sex$Survived * 100
colnames(surv_mean_sex)[2] <- "Survival_Chance"

knitr::kable(surv_mean_sex)

Sex	Survival_Chance
female	74.20382
male	18.89081

# Plot
ggplot(surv_mean_sex, aes(Sex, Survival_Chance))+
  geom_col(aes(fill=Sex), fill = c("#C6563D","#292625"), col="Black")+
  theme(plot.background = element_rect(fill = "#EEEEEE", color = "#EEEEEE"),
        panel.background = element_rect(fill = "#EEEEEE"),
        panel.grid = element_line(colour = "#EEEEEE"))+
  labs(x=NULL,
       y="Survival Chance(%)")

Women has 55% more chance to survive rather than its counterpart so we can assume that the drill has been maintained. However, Titanic had total of 20 Lifeboats, in which each Lifeboat can held total of 68 people and we can conclude that the Lifeboats could take about 1360 passengers or more.66 https://en.wikipedia.org/wiki/Lifeboats_of_the_Titanic

Then we want to see how much women were on-board based on the dataset.

#Calculate the frequency
sex_freq <- as.data.frame(table(titanic$Sex))
colnames(sex_freq)[1] <- "Sex"
knitr::kable(sex_freq)

Sex	Freq
female	314
male	577

There were only 314 Female Passengers and if we look for the actual number, there were 402 Women Passengers on-board77 https://titanicfacts.net/titanic-passengers, which were still under the capacity of all lifeboats. Meaning that all women were supposedly can be saved but unfortunately, they’re not.

By this, though the Birkenhead Drill was upheld as a “chivalric ideal” among ship-crews, in fact, 60 years after the sinking of the troopship (which was the time when RMS Titanic disaster happened), this “ideal” was not completely adhered by the crews on-board.

Family Factor

“Hundreds of human dramas unfolded between the order to load the lifeboats and the ship’s final plunge: Men saw off wives and children, families were separated in the confusion and selfless individuals gave up their spots to remain with loved ones or allow a more vulnerable passenger to escape.”

— history.com

The Sad Parting, illustration of 1912.

This is the last factor that we want to analyze in this document. As we know before, there were many children and women on-board along with their Husband/Dad and there were, too, people on-board without any relatives with them. Now we want see how much proportion of those who were alone and those who’re with families.

#Calculate the proportion
fam_prop <- as.data.frame(prop.table(table(titanic$Relatives)))
colnames(fam_prop)[1] <- "Relatives"
colnames(fam_prop)[2] <- "Proportion%"
fam_prop$`Proportion%` <- fam_prop$`Proportion%`*100
knitr::kable(fam_prop)

Relatives	Proportion%
No	60.26936
Yes	39.73064

There were roughly 60% people that on-board without any relatives and 40% with their relatives. And we are going to check their Survival Chance based on our dataset.

# Calculate the chance
surv_mean_fam <- aggregate(Survived~Relatives,titanic,mean)
surv_mean_fam$Survived <- surv_mean_fam$Survived * 100
colnames(surv_mean_fam)[2] <- "Survival_Chance"

knitr::kable(surv_mean_fam)

Relatives	Survival_Chance
No	30.35382
Yes	50.56497

# Plot
ggplot(surv_mean_fam, aes(Relatives, Survival_Chance))+
  geom_col(aes(fill=Relatives), fill = c("#C6563D","#292625"), col="Black")+
  theme(plot.background = element_rect(fill = "#EEEEEE", color = "#EEEEEE"),
        panel.background = element_rect(fill = "#EEEEEE"),
        panel.grid = element_line(colour = "#EEEEEE"))+
  labs(x="Relatives",
       y="Survival Chance(%)")

Passengers with relatives has a better chance of survival (20% more) rather than people that were alone. This might be because the number of children and wives prioritized by each families. Moreover, if we see the “quote” from above, there were “selfless individuals gave up their spots to remain with loved ones or allow a more vulnerable passenger to escape” and this can be the reason that why not all Women were saved in the first place.

Conclusion

Now we know that there are 3 factors that affect Passenger’s Survivability:

Pclass factor : Passengers from the 1st Class are more likely to survive in the event. This answers the speculation that 1st Class passengers were highly prioritized.
Sex factor : Female passenger are more likely to survive in the event. This answers the speculation that Female passengers were highly prioritized.
Relatives factor : Passenger that with Relatives on-board are more likely to survive in the event. This answers the speculation that Couples/Families were highly prioritized.

However, there are still much speculation that we can’t answers with our analysis about the lack of satisfactory emergency procedures and we hope the same speculation will find some certitude in the future.

RMS Titanic Disaster

Passengers Survivability Factor

Rangga Gemilang

2022-08-05