The immediate cause of RMS Titanic’s demise was a collision with an iceberg that caused the ocean liner to sink on April 14–15, 1912. While the ship could reportedly stay afloat if as many as 4 of its 16 compartments were breached, the impact had affected at least 5 compartments. It was originally believed that the iceberg had caused a long gash in the hull. After examining the wreck, however, scientists discovered that the collision had produced a series of thin gashes as well as brittle fracturing and separation of seams in the adjacent hull plates, thus allowing water to flood into the Titanic.
The exact number of people killed is unknown. Original passenger and crew lists were rendered inaccurate by such factors as misspellings, omissions, aliases, and failure to count musicians and other contracted employees as either passengers or crew members. However, it is generally believed that of the ship’s approximately 2,200 passengers and crew members, some 1,500 people perished when the ship sank.
knitr::include_graphics("assets/sititanic.png")
Departure of Titanic on April 10, 1912
The wreck of the Titanic—which was discovered on September 1, 1985—is located at the bottom of the Atlantic Ocean, some 13,000 feet (4,000 metres) underwater. It is approximately 400 nautical miles (740 km) from Newfoundland, Canada.
From the outset, the Titanic captured the public’s imagination. At the time, it was one of the largest and most opulent ships in the world. It was also considered unsinkable, due to a series of compartment doors that could be closed if the bow was breached. However, four days into its maiden voyage in 1912, the Titanic struck an iceberg, and less than three hours later it sank.
This project examines some of the speculation surrounding this unfortunate event by analyzing passenger data.
We will use Kaggle’s Titanic dataset, which is widely used for survival-prediction exercises.
titanic <- read.csv("datainput/train.csv",row.names = "PassengerId") # read data
str(titanic) # displays the structure of the data frame object
## 'data.frame': 891 obs. of 11 variables:
## $ Survived: int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked: chr "S" "C" "S" "S" ...
Each column in the given list corresponds to specific information in the dataset. Here’s a description of each column:
- Survived: Survival, 0 = No, 1 = Yes
- Pclass: Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd
- Name: Passenger’s name
- Sex: Gender
- Age: Age in years
- SibSp: Number of siblings / spouses aboard the Titanic
- Parch: Number of parents / children aboard the Titanic
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton

Before proceeding, it’s a good idea to check the state of the data and clean it up, transforming the raw data set into a format suitable for later analysis.
First, let’s check for missing values. Dealing with missing data is important because analytical results based on datasets with missing values can be biased. 1
sort(colSums(is.na(titanic)), decreasing = TRUE) # count the missing values in each column
## Age Survived Pclass Name Sex SibSp Parch Ticket
## 177 0 0 0 0 0 0 0
## Fare Cabin Embarked
## 0 0 0
nrow(titanic) # number of rows/data
## [1] 891
The only column with NA values is the Age column, but the share of
missing values is substantial: 177 of 891 rows, or about 19.9% of the
data. A closer look at the data structure also shows that the Cabin
column contains empty strings (""). This means that the passenger’s
cabin was unknown or not recorded when the data was collected, so we
can treat these values as missing and check how many there are.
sort(head(table(titanic$Cabin)), decreasing = TRUE) # counts for the first six cabin values; the unnamed entry is the empty string
##
## A10 A14 A16 A19 A20
## 687 1 1 1 1 1
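To put both columns on the same footing, one option is to recode the empty strings as NA and then compute the percentage of missing values per column. A minimal sketch on a made-up toy data frame (the `toy` object and its values are assumptions for illustration, not rows from the dataset):

```r
# Toy data standing in for the Titanic data (values are made up)
toy <- data.frame(
  Age   = c(22, NA, 26, NA, 35),
  Cabin = c("", "C85", "", "C123", ""),
  stringsAsFactors = FALSE
)

# Recode empty strings as NA so that is.na() can see them
toy$Cabin[toy$Cabin == ""] <- NA

# Percentage of missing values in each column
missing_pct <- round(colMeans(is.na(toy)) * 100, 1)
missing_pct
```

Applied to the real data frame, the same two steps would report the Age and Cabin shares side by side.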
The Cabin column has an even larger share of missing values: 687 of
891 rows, or about 77% of the data. To avoid biasing the analysis, and
because neither column is used in the questions below, we remove both
Age and Cabin.
titanic <- subset(titanic,select=-c(Age,Cabin))
dim(titanic) # data dimensions on dataframe
## [1] 891 9
SibSp and Parch columns: these two columns count how many relatives of
a passenger were also on board, so a passenger with zero in both was
travelling without relatives. We simplify them into one new column
called Alone, which is Yes if the passenger had no relatives aboard and
No otherwise, regardless of how many relatives there were.
titanic$SumPeople <- titanic$SibSp + titanic$Parch # total relatives aboard
# Return "Yes" for a passenger with no relatives aboard, "No" otherwise
convert_people <- function(x) {
  if (x == 0) "Yes" else "No"
}
titanic$Alone <- sapply(X = titanic$SumPeople,
                        FUN = convert_people)
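Because the rule is a simple comparison, it can also be written without a helper function. The sketch below uses the vectorized `ifelse()` on a made-up vector (`sum_people` is a hypothetical stand-in for the SumPeople column) and checks that it agrees with an element-wise version of the same rule:

```r
# Hypothetical relative counts for five passengers (made-up values)
sum_people <- c(0, 1, 0, 3, 2)

# Vectorized recoding: "Yes" when there are no relatives aboard
alone <- ifelse(sum_people == 0, "Yes", "No")
alone

# Cross-check against an element-wise version of the same rule
alone_loop <- sapply(sum_people, function(x) if (x == 0) "Yes" else "No")
identical(alone, alone_loop)
```

Either form gives the same column; the vectorized one is shorter and avoids calling a function once per row.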
Finally, convert the categorical columns to factors:
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Alone <- as.factor(titanic$Alone)
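Converting to factors tells R to treat these columns as categories rather than as numbers or free text, which is what grouping and plotting functions expect. A small sketch with made-up values (`pclass_raw` is a hypothetical vector, not taken from the dataset):

```r
# Made-up ticket classes stored as numbers, as in the raw data
pclass_raw <- c(3, 1, 3, 2, 1)

# As numbers, the classes are treated as quantities
mean(pclass_raw)

# As a factor, they become three discrete categories
pclass_fct <- as.factor(pclass_raw)
levels(pclass_fct)
table(pclass_fct)
```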
To see whether ticket class made a difference, let’s calculate the survival rate for each class.
knitr::include_graphics("assets/boat.png")
Titanic Survivors in a Lifeboat on April 15, 1912
# Calculate the chance
surv_class <- aggregate(Survived~Pclass,titanic,mean)
surv_class$Survived <- surv_class$Survived * 100
colnames(surv_class)[2] <- "Survival_Chance"
surv_class$Pclass <- factor(surv_class$Pclass,
labels = c('1st Class', '2nd Class', '3rd Class'))
knitr::kable(surv_class)
| Pclass | Survival_Chance |
|---|---|
| 1st Class | 62.96296 |
| 2nd Class | 47.28261 |
| 3rd Class | 24.23625 |
# Plot
library(ggplot2)
ggplot(surv_class, aes(Pclass, Survival_Chance)) +
  geom_col(fill = c("#3F2305", "#DFD7BF", "#F2EAD3"), col = "Black") +
  theme(plot.background = element_rect(fill = "#F5F5F5", color = "#F5F5F5"),
        panel.background = element_rect(fill = "#F5F5F5"),
        panel.grid = element_line(colour = "#F5F5F5")) +
  labs(x = NULL,
       y = "Survival chance (in percent)")
Insight💡
First-class passengers were the only group with a survival rate above 50%, at about 63.0%, compared with about 47.3% for second class and 24.2% for third class. In this dataset, first class had a clearly better chance of survival than the other classes.
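The survival chance used here is the mean of a 0/1 column, which is the same thing as the share of survivors in each group. A minimal sketch on made-up data (the `toy` data frame is hypothetical) that computes the rate both ways:

```r
# Made-up toy data: six passengers in two classes
toy <- data.frame(
  Pclass   = c(1, 1, 1, 3, 3, 3),
  Survived = c(1, 1, 0, 0, 0, 1)
)

# Mean of a 0/1 column within each class = share of survivors
rate_mean <- aggregate(Survived ~ Pclass, toy, mean)

# The same figure computed explicitly as survivors / passengers
survivors   <- tapply(toy$Survived, toy$Pclass, sum)
passengers  <- tapply(toy$Survived, toy$Pclass, length)
rate_manual <- survivors / passengers

rate_mean
rate_manual
```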
knitr::include_graphics("assets/family.png")
Lifeboats carrying survivors of the Titanic, April 15, 1912
First, we want to see how many women were on board according to the dataset.
#Calculate the frequency
gender_freq <- as.data.frame(table(titanic$Sex))
colnames(gender_freq)[1] <- "Gender"
knitr::kable(gender_freq)
| Gender | Freq |
|---|---|
| female | 314 |
| male | 577 |
# Calculate the chance
surv_gender <- aggregate(Survived~Sex,titanic,mean)
surv_gender$Survived <- surv_gender$Survived * 100
colnames(surv_gender)[2] <- "Survival_Chance"
knitr::kable(surv_gender)
| Sex | Survival_Chance |
|---|---|
| female | 74.20382 |
| male | 18.89081 |
# Plot
# The fill colours are set directly; surv_gender has no Pclass column to map to
ggplot(surv_gender, aes(Sex, Survival_Chance)) +
  geom_col(fill = c("#DFD7BF", "#F2EAD3"), col = "Black") +
  theme(plot.background = element_rect(fill = "#F5F5F5", color = "#F5F5F5"),
        panel.background = element_rect(fill = "#F5F5F5"),
        panel.grid = element_line(colour = "#F5F5F5")) +
  labs(x = NULL,
       y = "Survival chance (in percent)")
Insight💡
Female passengers had a much higher survival rate than male passengers: about 74.2%, compared with only 18.9% for men.
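The same rates can also be read off a cross-tabulation, which shows the counts behind each percentage. A minimal sketch on made-up data (the `toy` data frame is hypothetical), using `prop.table()` with `margin = 1` so that each row sums to 100:

```r
# Made-up toy data: three women and four men
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male"),
  Survived = c(1, 1, 0, 0, 0, 0, 1)
)

# Counts of survivors (1) and non-survivors (0) by sex
counts <- table(toy$Sex, toy$Survived)

# Row percentages: the share of each sex that did or did not survive
row_pct <- prop.table(counts, margin = 1) * 100
round(row_pct, 1)
```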
knitr::include_graphics("assets/leaving.png")
The Titanic leaving Queenstown (Cobh), Ireland, April 11, 1912; it is one of the last known photographs of the liner
#Calculate the proportion
prop_pass <- as.data.frame(prop.table(table(titanic$Alone)))
colnames(prop_pass)[1] <- "Alone"
colnames(prop_pass)[2] <- "Proportion%"
prop_pass$`Proportion%` <- prop_pass$`Proportion%`*100
knitr::kable(prop_pass)
| Alone | Proportion% |
|---|---|
| No | 39.73064 |
| Yes | 60.26936 |
Roughly 60% of passengers were on board without any relatives and 40% were travelling with relatives. Next, we check the survival chance of each group based on our dataset.
# Calculate the chance
surv_propp <- aggregate(Survived~Alone,titanic,mean)
surv_propp$Survived <- surv_propp$Survived * 100
colnames(surv_propp)[2] <- "Survival_Chance"
knitr::kable(surv_propp)
| Alone | Survival_Chance |
|---|---|
| No | 50.56497 |
| Yes | 30.35382 |
# Plot
# The fill colours are set directly; surv_propp has no Pclass column to map to
ggplot(surv_propp, aes(Alone, Survival_Chance)) +
  geom_col(fill = c("#DFD7BF", "#F2EAD3"), col = "Black") +
  theme(plot.background = element_rect(fill = "#F5F5F5", color = "#F5F5F5"),
        panel.background = element_rect(fill = "#F5F5F5"),
        panel.grid = element_line(colour = "#F5F5F5")) +
  labs(x = "Alone",
       y = "Survival chance (in percent)")
Insight💡
Passengers travelling with family had a better chance of survival than those travelling alone: about 50.6% versus 30.4%, a difference of roughly 20 percentage points. One possible explanation is that families prioritized getting their children and wives to safety.
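The gap between the two groups can be expressed in two ways, in percentage points or as a ratio, and the two are easy to confuse. A short worked calculation using the figures from the table above (the variable names are only illustrative):

```r
# Survival chances taken from the table above (in percent)
with_family <- 50.56497
alone       <- 30.35382

# Absolute gap in percentage points
gap_points <- with_family - alone

# Relative gap: how many times higher the rate is for passengers with family
ratio <- with_family / alone

round(gap_points, 1)
round(ratio, 2)
```

The absolute gap is about 20 percentage points, while in relative terms the rate for passengers with family is roughly 1.67 times the rate for those travelling alone.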
We found three factors associated with passenger survival:
- Pclass factor: First-class passengers were more likely to survive. This is consistent with the speculation that first-class passengers were given priority.
- Gender factor: Female passengers were more likely to survive. This is consistent with the speculation that female passengers were given priority.
- Family factor: Passengers with family on board were more likely to survive. This is consistent with the speculation that couples and families were prioritized.

However, much of the speculation about the lack of adequate contingency plans is left unanswered by our analysis, and we hope it can be examined with better evidence in the future.