The R.M.S. Titanic, an ocean passenger ship commissioned in England in 1910, was one of the most luxurious, state-of-the-art ships of its time. At about 269 meters long, the ship was thought to be unsinkable.
However, on April 15, 1912, the unthinkable happened. During the ship’s maiden voyage from Southampton, England, to New York City, New York, the Titanic struck an iceberg in the North Atlantic Ocean off the coast of Newfoundland, Canada and sank, tragically taking the lives of more than 1,500 passengers and crew. It was, and remains, one of the greatest nautical disasters in history.
More than 110 years have passed since the R.M.S. Titanic sank in the Atlantic Ocean. While the sinking remains a tragedy, the world learned many lessons from it to prevent such a disaster from recurring. For example, after the disaster, both the British and American boards of inquiry recommended that ships carry enough lifeboats for everyone aboard, that lifeboat drills be mandatory, that lifeboats be inspected, and so on. Many of these recommendations were incorporated into the International Convention for the Safety of Life at Sea, adopted in 1914.
From a data science perspective, we can extract valuable insights from the passenger data in the hope of understanding this tragedy better. More on this in later sections.
To assist with data wrangling and data visualization, we are going to use tidyverse and waffle.
Tidyverse is a collection of packages commonly used for data science. Its core packages are ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats.
Waffle is an extension of ggplot2 that lets you create waffle plots, which we will use in this analysis report.
Plotly is also used in conjunction with ggplot2 to make the plots interactive.
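The chunk that loads these packages is not shown in this report. The attach messages that follow are consistent with something like the sketch below, which assumes all three packages are already installed.

```r
# Assumed loading chunk; the original code is not shown in this report.
library(tidyverse) # attaches ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
library(waffle)    # waffle plots built on ggplot2
library(plotly)    # ggplotly() turns a ggplot object into an interactive plot
```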
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Warning: package 'plotly' was built under R version 4.0.4
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
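Do not worry if you are shown warnings or conflicts. If a warning does not result in an error, you should be fine to continue your analysis. Conflicts, on the other hand, arise when two or more attached packages export a function with the same name: the most recently attached package masks the others, which can cause an error or unexpected results if you meant to call the masked function. Below is how you can resolve a conflict when it happens.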
Choose which functions to use:
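The original snippet is not reproduced here. A minimal sketch, assuming the intended approach is the explicit `package::function()` prefix, using the `filter()` conflict reported in the messages above (the `fares` data frame is made up for illustration):

```r
library(dplyr) # attaching dplyr masks stats::filter() and stats::lag()

# Illustrative data, not taken from the Titanic dataset.
fares <- data.frame(id = 1:4, fare = c(7.25, 71.28, 8.05, 53.10))

# Explicit prefix: row filtering from dplyr.
expensive <- dplyr::filter(fares, fare > 50)

# Explicit prefix: linear filtering (here a 2-point moving average) from stats.
smoothed <- stats::filter(fares$fare, rep(1 / 2, 2))
```

With the prefix in place, the call no longer depends on the order in which the packages were attached.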
Bottom line: it is okay to ignore a warning as long as it does not result in an error, and you can avoid conflicts by calling the function with an explicit package prefix, such as dplyr::filter().
The full dataset can be downloaded from the link below:
Link: https://www.kaggle.com/c/titanic/data
Since we are not building a machine learning model but rather performing several relatively simple statistical analyses and data visualizations, we are only going to use the train.csv data. Note that test.csv does not include the Survived column, so it cannot be used for the survival analysis.
##
## -- Column specification --------------------------------------------------------
## cols(
## PassengerId = col_double(),
## Survived = col_double(),
## Pclass = col_double(),
## Name = col_character(),
## Sex = col_character(),
## Age = col_double(),
## SibSp = col_double(),
## Parch = col_double(),
## Ticket = col_character(),
## Fare = col_double(),
## Cabin = col_character(),
## Embarked = col_character()
## )
The good thing about readr::read_csv() compared to utils::read.csv() (base R) is that it reports the data type of each column as the file is parsed into the R environment.
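To make the comparison concrete, here is a small sketch that reads the same file both ways. It uses a temporary stand-in file with a few illustrative rows rather than the real train.csv, so the file contents and the names `tmp`, `sample_tbl`, and `base_df` are assumptions.

```r
library(readr)

# Write a tiny stand-in file; the rows are illustrative only.
tmp <- tempfile(fileext = ".csv")
writeLines(c("PassengerId,Survived,Sex,Age",
             "1,0,male,22",
             "2,1,female,38"), tmp)

# readr: the column specification is reported while parsing
# and can be retrieved again later with spec().
sample_tbl <- read_csv(tmp)
spec(sample_tbl)

# base R: no report is printed, so the types must be inspected manually.
base_df <- utils::read.csv(tmp)
sapply(base_df, class)
```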
Looking at the result above, several columns (both numeric and character) actually hold categories and should be converted to factors. We can use the following code:
titanic[, c("Survived", "Pclass", "Sex", "Embarked")] <-
  lapply(titanic[, c("Survived", "Pclass", "Sex", "Embarked")],
         FUN = as.factor)
sapply(titanic, class) # check column data types
## PassengerId Survived Pclass Name Sex Age
## "numeric" "factor" "factor" "character" "factor" "numeric"
## SibSp Parch Ticket Fare Cabin Embarked
## "numeric" "numeric" "character" "numeric" "character" "factor"
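The same conversion can also be written with dplyr. This is only an alternative sketch, assuming dplyr 1.0.0 or later for `across()`; the `demo` tibble is a stand-in with illustrative values, not the real data.

```r
library(dplyr)

# Stand-in for a few rows of the data; values are illustrative.
demo <- tibble(
  Survived = c(0, 1, 1),
  Pclass   = c(3, 1, 3),
  Sex      = c("male", "female", "female"),
  Embarked = c("S", "C", "S"),
  Age      = c(22, 38, 26)
)

# Convert only the categorical columns; Age stays numeric.
demo <- demo %>%
  mutate(across(c(Survived, Pclass, Sex, Embarked), as.factor))

sapply(demo, class) # check column data types
```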
From Kaggle, we can check the description for each column. Below is a brief overview/explanation.
Before we get into the analysis, let's first get to know the Titanic passengers. The distributions we would like to look at are those of sex, socio-economic class, and port of embarkation.
Before we calculate the distributions, we will use switch to replace several coded values with more descriptive ones, using the following code:
titanic$Pclass <- sapply(
  X = as.character(titanic$Pclass), # Data
  FUN = switch,
  "1" = "First Class",
  "2" = "Middle Class",
  "3" = "Lower Class"
)
titanic$Embarked <- sapply(
  X = as.character(titanic$Embarked), # Data
  FUN = switch,
  "C" = "Cherbourg",
  "Q" = "Queenstown",
  "S" = "Southampton"
)
Now that we're done, let's look at the passenger distribution…
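As an aside on the recoding step: with sapply() and switch, a value that matches none of the cases (for example a missing port) yields NULL, so the result can become a list rather than a character vector. A hedged alternative, not the method used in this report, is a named lookup vector. The names `port_labels`, `codes`, and `ports` below are illustrative.

```r
# Named lookup vector: names are the codes, values are the labels.
port_labels <- c(C = "Cherbourg", Q = "Queenstown", S = "Southampton")

# Illustrative codes, including one missing value.
codes <- factor(c("S", "C", NA, "Q"))

# Index the lookup by code; a missing code stays NA instead of becoming NULL.
ports <- unname(port_labels[as.character(codes)])
```

Because the result is always a plain character vector, it can be assigned straight back to a data frame column.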
titanic_agg1 <- titanic %>%
  count(Sex)
ggplot(titanic_agg1, aes(x = "", y = n,
                         fill = Sex, alpha = 0.5,
                         label = paste0(round(100 * n / sum(titanic_agg1$n), 2), "%"))) +
  geom_col(col = "black", width = 0.5) +
  labs(title = "Titanic Passenger's Sex Distribution",
       caption = "source: Kaggle",
       y = "Number of passengers",
       x = ""
  ) +
  # "none" hides the legends; passing FALSE is deprecated in newer ggplot2 versions
  guides(fill = "none", alpha = "none") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        plot.title = element_text(hjust = 0.5)
  ) +
  coord_flip() +
  geom_text(size = 6, position = position_stack(vjust = 0.5))
We can conclude that there were more male than female passengers.
titanic_agg2 <- titanic %>%
  count(Pclass) %>%
  arrange(desc(n))
# create a named numeric vector
titanic_vector <- structure(titanic_agg2$n,
                            .Names = as.character(titanic_agg2$Pclass))
waffle(parts = titanic_vector / 10,
       title = "Titanic Passengers Socio-Economic Class Distribution"
) +
  coord_flip() +
  labs(
    y = "1 square = 10 persons"
  )
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
From this plot, we see that most of the passengers were in the lower class. There were more first-class than middle-class passengers, although the two groups are of comparable size.
titanic_agg3 <- titanic %>%
  count(Embarked)
titanic_agg3$Embarked <- as.factor(as.character(titanic_agg3$Embarked))
ggplot(titanic_agg3, aes(x = reorder(Embarked, n), y = n, label = n)) +
  geom_bar(stat = "identity", fill = "#f68060", alpha = .6, width = .4) +
  labs(title = "Titanic Passenger's Embarkment Location Distribution",
       caption = "source: Kaggle",
       y = "Number of passengers",
       x = ""
  ) +
  guides(fill = "none", alpha = "none") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"),
        plot.title = element_text(hjust = 0.5)
  ) +
  geom_text(size = 3.5, position = position_stack(vjust = 0.5))
The main port of embarkation was Southampton, followed by Cherbourg and finally Queenstown. Curiously, there were 2 passengers whose port of embarkation could not be identified.
The data is prepared as follows:
titanic_agg4 <- titanic %>%
  group_by(Survived) %>%
  count(Sex, Pclass)
titanic_agg4$Survived <- sapply(
  X = as.character(titanic_agg4$Survived), # Data
  FUN = switch,
  "0" = "Perished",
  "1" = "Survived"
)
ggplot(data = titanic_agg4, mapping = aes(x = Survived, y = n, fill = Sex)) +
  geom_col() +
  labs(
    x = "",
    y = "",
    title = "Survival of Titanic Passengers based on Gender"
  ) +
  theme(
    plot.background = element_rect(fill = "#263238"),
    panel.background = element_rect(fill = "#263238", color = "white"),
    panel.grid = element_blank(),
    axis.text.y = element_text(color = "white",
                               size = 10),
    axis.text.x = element_text(color = "white",
                               size = 10),
    plot.subtitle = element_text(color = "white", hjust = 0.5),
    plot.title = element_text(color = "white", hjust = 0.5),
    legend.background = element_rect(fill = "#263238"),
    legend.text = element_text(color = "white"),
    legend.title = element_text(color = "white")
  )
From this training set (we do not use the complete dataset), it can be inferred that most of the Titanic's passengers perished, and that most of those who perished were male. The passengers who survived the tragedy were mostly women.
ggplot(data = titanic_agg4, mapping = aes(x = Survived, y = n, fill = Pclass)) +
  geom_col() +
  labs(
    x = "",
    y = "",
    title = "Survival of Titanic Passengers based on Class"
  ) +
  theme(
    plot.background = element_rect(fill = "#263238"),
    panel.background = element_rect(fill = "#263238", color = "white"),
    panel.grid = element_blank(),
    axis.text.y = element_text(color = "white",
                               size = 10),
    axis.text.x = element_text(color = "white",
                               size = 10),
    plot.subtitle = element_text(color = "white", hjust = 0.5),
    plot.title = element_text(color = "white", hjust = 0.5),
    legend.background = element_rect(fill = "#263238"),
    legend.text = element_text(color = "white"),
    legend.title = element_text(color = "white")
  )
From this result, most lower-class passengers failed to survive. Middle-class passengers perished and survived in roughly equal proportions. Perhaps not too surprisingly, first-class passengers had a higher survival rate, which suggests that their safety was prioritized.
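The next plot uses a data frame `titanic_agg5` with a `Sex_Class` column whose construction is not shown in this report. A minimal sketch, assuming `Sex_Class` is simply the sex and class labels pasted together; `agg4_demo` is a stand-in for `titanic_agg4` with made-up counts, and the `" - "` separator is an arbitrary choice.

```r
library(dplyr)

# Stand-in for titanic_agg4 with made-up counts, for illustration only.
agg4_demo <- tibble(
  Survived = c("Perished", "Perished", "Survived", "Survived"),
  Sex      = c("male", "female", "male", "female"),
  Pclass   = c("Lower Class", "First Class", "First Class", "Lower Class"),
  n        = c(30L, 2L, 5L, 7L)
)

# Assumed definition: one label per sex/class combination.
agg5_demo <- agg4_demo %>%
  ungroup() %>% # drop any grouping left over from group_by()
  mutate(Sex_Class = paste(Sex, Pclass, sep = " - "))
```

Applying the same two steps to `titanic_agg4` would give a data frame that can be passed to the plotting code below.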
g <-
  ggplot(data = titanic_agg5, mapping = aes(x = Survived, y = n, fill = Sex_Class)) +
  geom_col() +
  labs(
    x = "",
    y = "",
    title = "Survival of Titanic Passengers based on Gender and Class"
  ) +
  theme(
    plot.background = element_rect(fill = "#263238"),
    panel.background = element_rect(fill = "#263238", color = "white"),
    panel.grid = element_blank(),
    axis.text.y = element_text(color = "white",
                               size = 10),
    axis.text.x = element_text(color = "white",
                               size = 10),
    plot.subtitle = element_text(color = "white", hjust = 0.5),
    plot.title = element_text(color = "white", hjust = 0.5),
    legend.background = element_rect(fill = "#263238"),
    legend.text = element_text(color = "white"),
    legend.title = element_text(color = "white")
  )
ggplotly(g)
This result is quite interesting: female middle- and first-class passengers survived at a markedly higher rate than lower-class female passengers, whereas among male first-class passengers the casualties were still considerable despite their class.