Titanic: A Giant of the Seas

Ahmad Fauzi

2023-06-11

The History

Why did the Titanic sink?

The immediate cause of RMS Titanic’s demise was a collision with an iceberg that caused the ocean liner to sink on April 14–15, 1912. While the ship could reportedly stay afloat if as many as 4 of its 16 compartments were breached, the impact had affected at least 5 compartments. It was originally believed that the iceberg had caused a long gash in the hull. After examining the wreck, however, scientists discovered that the collision had produced a series of thin gashes as well as brittle fracturing and separation of seams in the adjacent hull plates, thus allowing water to flood into the Titanic.

How many people died when the Titanic sank?

The exact number of people killed is unknown. Original passenger and crew lists were rendered inaccurate by such factors as misspellings, omissions, aliases, and failure to count musicians and other contracted employees as either passengers or crew members. However, it is generally believed that of the ship’s approximately 2,200 passengers and crew members, some 1,500 people perished when the ship sank.

knitr::include_graphics("assets/sititanic.png")

Departure of Titanic on April 10, 1912

Where is the wreck of the Titanic?

The wreck of the Titanic—which was discovered on September 1, 1985—is located at the bottom of the Atlantic Ocean, some 13,000 feet (4,000 metres) underwater. It is approximately 400 nautical miles (740 km) from Newfoundland, Canada.

Why is the Titanic famous?

From the outset, the Titanic captured the public’s imagination. At the time, it was one of the largest and most opulent ships in the world. It was also considered unsinkable, due to a series of compartment doors that could be closed if the bow was breached. However, four days into its maiden voyage in 1912, the Titanic struck an iceberg, and less than three hours later it sank.

This project attempts to address some of the speculation surrounding this unfortunate event of 1912 by analyzing passenger data.

Reading and Extracting Data

We will use Kaggle’s Titanic dataset, which is well known and widely used for similar prediction tasks.

titanic <- read.csv("datainput/train.csv",row.names = "PassengerId") # read data

str(titanic) # displays the structure of the data frame object
## 'data.frame':    891 obs. of  11 variables:
##  $ Survived: int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass  : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name    : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex     : chr  "male" "female" "female" "female" ...
##  $ Age     : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket  : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin   : chr  "" "C85" "" "C123" ...
##  $ Embarked: chr  "S" "C" "S" "S" ...

Columnwise Description

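Each column corresponds to a specific piece of information about a passenger. Here’s a description of each column:

  1. Survived: whether the passenger survived (0 = no, 1 = yes)
  2. Pclass: ticket class (1 = first, 2 = second, 3 = third)
  3. Name: passenger name
  4. Sex: passenger sex
  5. Age: age in years
  6. SibSp: number of siblings or spouses aboard
  7. Parch: number of parents or children aboard
  8. Ticket: ticket number
  9. Fare: passenger fare
  10. Cabin: cabin number
  11. Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)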

Data Cleansing

Before proceeding, it’s a good idea to check the state of the data and clean it up to transform the raw data set into a usable format, making it suitable for later analysis.

Check Missing Values

First, let’s check for missing values. Dealing with missing data is important because analytical results based on datasets with missing values can be biased [1].

sort(colSums(is.na(titanic)), decreasing = TRUE) # count missing values (NA) per column
##      Age Survived   Pclass     Name      Sex    SibSp    Parch   Ticket 
##      177        0        0        0        0        0        0        0 
##     Fare    Cabin Embarked 
##        0        0        0
nrow(titanic) # number of rows/data
## [1] 891

The only column with NA values is Age, but the share is large: 177 of 891 rows, or about 19.9% of the data. If we take a closer look at the data structure, we’ll also see that the Cabin column contains empty strings (""). This means that the passenger’s cabin was unknown at the time the data was collected, so we can treat these values as missing and check their share in the column.

sort(head(table(titanic$Cabin)), decreasing = TRUE) # first six cabin values; "" marks an unknown cabin
## 
##     A10 A14 A16 A19 A20 
## 687   1   1   1   1   1
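
The two checks above can also be combined by recoding empty strings as NA before counting. The following is only a minimal sketch on a small made-up data frame: `toy` and its values are assumptions for illustration, not rows from the Kaggle file.

```r
# Hypothetical toy data: Age marks unknowns with NA, Cabin with ""
toy <- data.frame(
  Age   = c(22, NA, 26, 35, NA),
  Cabin = c("", "C85", "", "C123", ""),
  stringsAsFactors = FALSE
)

# Recode empty strings as NA so both columns share one definition of "missing"
toy$Cabin[toy$Cabin == ""] <- NA

# Share of missing values per column, in percent
missing_pct <- round(colMeans(is.na(toy)) * 100, 1)
print(missing_pct)
```

With this recoding, a single call reports the missing share of every column, which makes it harder to overlook a column like Cabin that hides its gaps behind empty strings.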

The Cabin column has a large number of missing values (687 out of 891 rows), accounting for about 77% of the data. Since our goal is to avoid a biased analysis, and neither column is needed for the questions below, we remove both the Age and Cabin columns.

titanic <- subset(titanic,select=-c(Age,Cabin))
dim(titanic) # data dimensions on dataframe
## [1] 891   9

Explicit Coercion

SibSp and Parch columns: These two columns count how many relatives of a passenger were also on board, so a passenger with zero in both was travelling without any relatives. We’re going to simplify them into one new column called Alone. This column is Yes if the passenger was alone and No if they were with relatives, regardless of how many.

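# Total number of relatives on board (siblings/spouses + parents/children)
titanic$SumPeople <- titanic$SibSp + titanic$Parch

# Return "Yes" for a passenger without relatives, "No" otherwise
convert_people <- function(x) {
  if (x == 0) "Yes" else "No"
}

titanic$Alone <- sapply(X = titanic$SumPeople,
                        FUN = convert_people)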
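
As an alternative to calling a helper once per row, the same recoding can be written with the vectorized `ifelse()`. This is just a sketch of that option, and the vector `sum_people` below is made up for illustration.

```r
# Hypothetical relative counts (SibSp + Parch) for five passengers
sum_people <- c(1, 0, 0, 4, 0)

# ifelse() evaluates the condition for the whole vector at once
alone <- ifelse(sum_people == 0, "Yes", "No")
print(alone)
```

Both approaches give the same labels. The vectorized form is shorter and avoids looping over rows with `sapply()`.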

Change the final data type to the desired one:

titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Alone <- as.factor(titanic$Alone)

The Questions

1) Which passenger class has a survival chance above 50%?

To answer the question, let’s calculate the survival rate for each class.

knitr::include_graphics("assets/boat.png")

Titanic Survivors in a Lifeboat on April 15, 1912

# Calculate the chance
surv_class <- aggregate(Survived~Pclass,titanic,mean)
surv_class$Survived <- surv_class$Survived * 100
colnames(surv_class)[2] <- "Survival_Chance"
surv_class$Pclass <- factor(surv_class$Pclass,
                                 labels = c('1st Class', '2nd Class', '3rd Class'))
knitr::kable(surv_class)
|Pclass    | Survival_Chance|
|:---------|---------------:|
|1st Class |        62.96296|
|2nd Class |        47.28261|
|3rd Class |        24.23625|

# Plot
library(ggplot2)
ggplot(surv_class, aes(Pclass, Survival_Chance)) +
  # fill is set to fixed colours (one per bar), so no fill mapping is needed
  geom_col(fill = c("#3F2305", "#DFD7BF", "#F2EAD3"), col = "black") +
  theme(plot.background = element_rect(fill = "#F5F5F5", color = "#F5F5F5"),
        panel.background = element_rect(fill = "#F5F5F5"),
        panel.grid = element_line(colour = "#F5F5F5")) +
  labs(x = NULL,
       y = "Survival Chance (in percent)")

Insight💡

Only first class has a survival chance above 50%, at about 63.0%. Second class comes close at 47.3%, while third class is far behind at 24.2%. In this dataset, 1st Class passengers clearly had a better chance of survival than the other classes.
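
A percentage is easier to judge next to the group size it is based on. As a rough sketch, the counts and the rate can be computed together. The data frame `toy` below is made up for illustration, so its numbers are not the Titanic figures.

```r
# Hypothetical toy data: ten passengers in three classes
toy <- data.frame(
  Pclass   = factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)),
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0)
)

# Passengers and survivors per class
n_total    <- tapply(toy$Survived, toy$Pclass, length)
n_survived <- tapply(toy$Survived, toy$Pclass, sum)

# One row per class: group size, survivors, and survival rate in percent
summary_class <- data.frame(
  Pclass          = names(n_total),
  Passengers      = as.vector(n_total),
  Survivors       = as.vector(n_survived),
  Survival_Chance = round(as.vector(n_survived / n_total) * 100, 1)
)
print(summary_class)
```

Showing the group sizes alongside the rates makes it clear how many passengers each percentage represents.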

2) What is a passenger’s chance of survival by gender?

knitr::include_graphics("assets/family.png")

Lifeboats carrying survivors of the Titanic, April 15, 1912

First, we want to see how many women and men were on board according to the dataset.

#Calculate the frequency
gender_freq <- as.data.frame(table(titanic$Sex))
colnames(gender_freq)[1] <- "Gender"
knitr::kable(gender_freq)
|Gender | Freq|
|:------|----:|
|female |  314|
|male   |  577|

# Calculate the chance
surv_gender <- aggregate(Survived~Sex,titanic,mean)
surv_gender$Survived <- surv_gender$Survived * 100
colnames(surv_gender)[2] <- "Survival_Chance"

knitr::kable(surv_gender)
|Sex    | Survival_Chance|
|:------|---------------:|
|female |        74.20382|
|male   |        18.89081|

# Plot
ggplot(surv_gender, aes(Sex, Survival_Chance)) +
  # surv_gender has no Pclass column, so the bars use fixed fill colours only
  geom_col(fill = c("#DFD7BF", "#F2EAD3"), col = "black") +
  theme(plot.background = element_rect(fill = "#F5F5F5", color = "#F5F5F5"),
        panel.background = element_rect(fill = "#F5F5F5"),
        panel.grid = element_line(colour = "#F5F5F5")) +
  labs(x = NULL,
       y = "Survival Chance (in percent)")

Insight💡

Female passengers had a much higher survival rate than male passengers: 74.2% compared to only 18.9%.
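
The same rates can be cross-checked with a contingency table, which also shows the share of passengers who did not survive. The sketch below uses `prop.table()` on a made-up data frame (`toy` is an assumption for illustration, not the Kaggle data).

```r
# Hypothetical toy data: four women and five men
toy <- data.frame(
  Sex      = c("female", "female", "female", "female",
               "male", "male", "male", "male", "male"),
  Survived = c(1, 1, 1, 0, 0, 0, 1, 0, 0)
)

# Rows are genders, columns are outcomes (0 = died, 1 = survived)
counts <- table(toy$Sex, toy$Survived)

# margin = 1 normalizes each row, so every row sums to 100 percent
row_share <- prop.table(counts, margin = 1) * 100
print(round(row_share, 1))
```

The column labelled 1 should match what `aggregate(Survived ~ Sex, toy, mean)` would report, so the two approaches can be used to check each other.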

3) What is the proportion of passengers who were alone and those with family?

knitr::include_graphics("assets/leaving.png")

The Titanic leaving Queenstown (Cobh), Ireland, April 11, 1912; it is one of the last known photographs of the liner

#Calculate the proportion
prop_pass <- as.data.frame(prop.table(table(titanic$Alone)))
colnames(prop_pass)[1] <- "Alone"
colnames(prop_pass)[2] <- "Proportion%"
prop_pass$`Proportion%` <- prop_pass$`Proportion%`*100
knitr::kable(prop_pass)
|Alone | Proportion%|
|:-----|-----------:|
|No    |    39.73064|
|Yes   |    60.26936|

Roughly 60% of the passengers were on board without any relatives, and 40% were with their relatives. Next, we check the survival chance of each group based on our dataset.

# Calculate the chance
surv_propp <- aggregate(Survived~Alone,titanic,mean)
surv_propp$Survived <- surv_propp$Survived * 100
colnames(surv_propp)[2] <- "Survival_Chance"

knitr::kable(surv_propp)
|Alone | Survival_Chance|
|:-----|---------------:|
|No    |        50.56497|
|Yes   |        30.35382|

# Plot
ggplot(surv_propp, aes(Alone, Survival_Chance)) +
  # surv_propp has no Pclass column, so the bars use fixed fill colours only
  geom_col(fill = c("#DFD7BF", "#F2EAD3"), col = "black") +
  theme(plot.background = element_rect(fill = "#F5F5F5", color = "#F5F5F5"),
        panel.background = element_rect(fill = "#F5F5F5"),
        panel.grid = element_line(colour = "#F5F5F5")) +
  labs(x = "Alone",
       y = "Survival Chance (in percent)")

Insight💡

Passengers with family had a better chance of survival than passengers who were alone: 50.6% compared to 30.4%, a gap of about 20 percentage points. This may be because families prioritized their wives and children.
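
The gap between the two rates can be expressed either in percentage points or as a relative increase, and the two numbers differ considerably. A quick calculation, using the two rates from the table above (the variable names are only for illustration):

```r
# Survival rates (in percent) copied from the table above
with_family <- 50.56497
alone       <- 30.35382

# Absolute gap in percentage points
gap_points <- with_family - alone

# Relative gap: how much higher the first rate is compared with the second
gap_relative <- (with_family / alone - 1) * 100

print(round(c(points = gap_points, relative = gap_relative), 1))
```

The absolute gap is about 20 percentage points, while in relative terms the survival rate of passengers with family is roughly two thirds higher.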

Conclusion

We found three factors that are related to passenger survival in this dataset.

  1. Pclass factor: First-class passengers were more likely to survive. This is consistent with the speculation that first-class passengers were given priority.

  2. Gender factor: Female passengers were more likely to survive. This is consistent with the speculation that female passengers were given priority.

  3. Family factor: Passengers with family on board were more likely to survive. This is consistent with the speculation that couples and families were given priority.

However, our analysis leaves unanswered much of the speculation about the lack of adequate contingency plans, and we hope that future work will be able to address it.

References


  1. https://en.wikipedia.org/wiki/Missing_data