The sinking of the Titanic was a shocking event that reverberated across the world. News of the shipwreck spread rapidly, in part because the Titanic was one of the largest passenger ships of its time. The accident resulted in the tragic loss of many lives. Data on the passengers who survived and those who did not were compiled with the hope of studying the impact of age and gender on the likelihood of survival during a shipwreck.
The Titanic disaster occurred in 1912 and resulted in the loss of more than 1,500 lives. In this report, we aim to conduct a comprehensive analysis of the Titanic passenger data through the lens of data visualization. By visualizing the data, we seek to gain insights into various factors that might have influenced the survival of passengers on board the ill-fated ship.
The dataset can be found at https://www.kaggle.com/c/titanic. The dataset used for this analysis contains information about the passengers on the Titanic, including their age, gender, passenger class, fare, and survival status. Before proceeding with the visualization, we perform data preprocessing, including handling missing values and converting categorical variables into appropriate formats.
Before diving into the project, we’ll need to import the required libraries to facilitate our analysis.
library(dplyr) # Preprocessing Dataset
library(ggplot2) # ggplot visualization
library(ggthemes) # ggplot theme
library(tidyverse) # For data processing and visualization
And then import our data:
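The import chunk itself does not appear in this report. A minimal sketch of that step, assuming the Kaggle training file has been saved locally as `train.csv` (the file name and location are assumptions), reads it with base R's `read.csv()`. The inline two-row sample below, with the second passenger's name abbreviated, stands in for the file so the snippet runs on its own.

```r
# Two-row stand-in for the Kaggle file so the sketch is self-contained.
# In practice one would call read.csv("train.csv") instead (assumed file name).
sample_csv <- 'PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,PC 17599,71.2833,C85,C'

data_sample <- read.csv(text = sample_csv, stringsAsFactors = FALSE)

# str() is the base R counterpart of dplyr::glimpse()
str(data_sample)
```

Reading the file this way leaves blank fields in text columns as empty strings rather than `NA`, which matches the `""` entries shown for `Cabin` in the preview.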
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
Metadata:
PassengerId: Unique identification number for each
passenger.
Survived: Indicates whether the passenger survived
(1) or did not survive (0).
Pclass: Ticket class of the passenger, with values
1, 2, or 3, reflecting the social-economic class (1: First class, 2:
Second class, 3: Third class).
Name: Full name of the passenger.
Sex: Sex of the passenger (male or female).
Age: Age of the passenger in years.
SibSp: Number of siblings or spouses accompanying
the passenger on board.
Parch: Number of parents or children accompanying
the passenger on board.
Ticket: Ticket number of the passenger.
Fare: Fare paid by the passenger.
Cabin: Cabin number of the passenger.
Embarked: Port where the passenger embarked the ship
(C: Cherbourg, Q: Queenstown, S: Southampton).
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status
(SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is
estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this
way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this
way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for
them.
Before proceeding further, we need to examine the condition of our data and clean it to convert the raw dataset into a usable form that is more relevant to our analysis later.
Change data types and remove unnecessary columns: There are several variables with inappropriate data types that need to be corrected.
data <- data %>%
mutate(Survived = as.factor(Survived),
Pclass = as.factor(Pclass),
Sex = as.factor(Sex),
Embarked = as.factor(Embarked))
head(data)
summary(data)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 0:549 1:216 Length:891 female:314
## 1st Qu.:223.5 1:342 2:184 Class :character male :577
## Median :446.0 3:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :28.00 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Fare Cabin Embarked
## Min. : 0.00 Length:891 : 2
## 1st Qu.: 7.91 Class :character C:168
## Median : 14.45 Mode :character Q: 77
## Mean : 32.20 S:644
## 3rd Qu.: 31.00
## Max. :512.33
##
Insights Obtained:
There are 549 passengers who did not survive (0) and 342 passengers who survived (1).
There are 216 passengers in class 1, 184 passengers in class 2, and 491 passengers in class 3.
There are 577 male passengers and 314 female passengers.
The age range of passengers spans from 0.42 years to 80 years. The average age of passengers is 29.7 years.
The majority of passengers do not have any siblings or spouses accompanying them (0), followed by 209 passengers with 1 sibling or spouse.
The majority of passengers do not have any parents or children accompanying them (0), followed by 118 passengers with 1 parent or child.
The ticket fares range from 0 to 512.33. The average ticket fare is 32.2.
The majority of passengers (644) boarded the ship from Southampton port (S), followed by 168 passengers from Cherbourg port (C), and 77 passengers from Queenstown port (Q).
##
## 0 1
## 0.6161616 0.3838384
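The proportions shown above can be reproduced with `table()` and `prop.table()`. The sketch below rebuilds the `Survived` column from the two counts reported in the summary (549 and 342) rather than from the data file, so the vector `survived` is a stand-in for `data$Survived`.

```r
# Counts taken from the summary: 549 did not survive (0), 342 survived (1)
survived <- rep(c(0, 1), times = c(549, 342))

counts <- table(survived)      # absolute frequencies
props <- prop.table(counts)    # shares that sum to 1

round(props, 7)
```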
The first step is to look for missing values. Handling missing data is crucial because any analytical results based on a dataset with missing values could be biased.
After that, we will handle the missing values in the Age
column by filling the missing values with the mean age. This is done to
make the dataset more complete and eliminate any missing values that may
disrupt the analysis process.
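The chunk that produced the counts below is not shown. One way to obtain such counts, sketched here under the assumption that both `NA` and empty strings should be treated as missing, is to test each column for either condition. The helper name `count_missing` and the toy values are hypothetical.

```r
# Toy data frame (hypothetical values) containing both kinds of gaps
toy <- data.frame(
  Age = c(22, NA, 26, NA),
  Cabin = c("", "C85", "", "C123"),
  Embarked = c("S", "", "C", "S"),
  stringsAsFactors = FALSE
)

# Count entries that are NA or an empty string, column by column
count_missing <- function(df) {
  sapply(df, function(col) sum(is.na(col) | col == ""))
}

missing_counts <- count_missing(toy)
missing_counts
```

A plain `colSums(is.na(data))` would report only the `NA` values in `Age`, because the gaps in `Cabin` and `Embarked` are stored as empty strings.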
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 2
Missing or empty values occur in three columns: Age, Cabin, and Embarked. Age has 177 missing values out of 891 rows (about 20%), which is a sizeable share, while Embarked has only 2. In Embarked, the missing entries are stored as empty strings (""). The Cabin column likewise contains 687 empty strings, indicating that the passenger's cabin was not specified or known during data collection.
For this treatment, we will perform the following:
Fill the missing values in the Age column with the mean age and those in the Embarked column with the mode value.
Drop Name, PassengerId, Ticket, and Cabin, as they are not needed in the analysis process.
# Filling missing values in the `Age` column with the mean age
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm=TRUE)
# Filling missing values in the `Embarked` column with the mode value
calc_mode <- function(x){
# List the distinct / unique values
distinct_values <- unique(x)
# Count the occurrence of each distinct value
distinct_tabulate <- tabulate(match(x, distinct_values))
# Return the value with the highest occurrence
distinct_values[which.max(distinct_tabulate)]
}
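data <- data %>%
mutate(Embarked = as.character(Embarked),
# Empty strings mark the missing ports, so treat them like NA
Embarked = if_else(is.na(Embarked) | Embarked == "",
calc_mode(Embarked),
Embarked),
Embarked = as.factor(Embarked))
## PassengerId Survived Pclass Name Sex Age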
## 0 0 0 0 0 0
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 687 0
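After this step, Age and Embarked no longer contain missing values. Only the 687 empty Cabin entries remain, in a column that is not needed for the analysis.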
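To see how the mode-based fill behaves, the sketch below applies the same `calc_mode` logic to a short hypothetical vector of ports in which empty strings mark the gaps. The vector and the intermediate names are illustrative only.

```r
# Same logic as calc_mode above, repeated so the snippet runs on its own
calc_mode <- function(x) {
  distinct_values <- unique(x)
  distinct_values[which.max(tabulate(match(x, distinct_values)))]
}

# Hypothetical ports; "" marks a missing entry
embarked <- c("S", "C", "", "S", "Q", "S", "")

# Turn empty strings into NA, then fill with the most frequent observed port
embarked[embarked == ""] <- NA
fill_value <- calc_mode(embarked[!is.na(embarked)])
embarked_filled <- ifelse(is.na(embarked), fill_value, embarked)

embarked_filled
```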
We will now drop several variables that are not needed, such as Name, PassengerId, Ticket, and Cabin.
Feature Engineering
# Assumption: data_2 starts as a copy of the cleaned data, since it is not created elsewhere
data_2 <- data %>%
mutate(Survived = ifelse(Survived == 1, "Survived", "Dead")) %>%
mutate(Pclass = sapply(as.character(Pclass), switch,
"1" = "1st Class",
"2" = "2nd Class",
"3" = "3rd Class")) %>%
mutate(Embarked = sapply(as.character(Embarked), switch,
"C" = "Cherbourg",
"Q" = "Queenstown",
"S" = "Southampton"))
# Convert the data to factors
data_2 <- data_2 %>%
mutate_at(vars(Pclass,Sex,Embarked,Survived), as.factor)
data_2
To obtain more specific information about Age and
Sex, feature engineering can be performed by extracting
titles from the Name variable. The gsub
function can be used in this case, and the titles will be categorized
into 6 different groups. These groups include: Mr,
Mrs, Master, Miss,
Honorific Titles, and Officers.
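As a quick illustration of the `gsub` approach, the sketch below applies the two substitutions to three sample names. The first name appears in the data preview; the other two are hypothetical and only serve to show how multi-word titles collapse.

```r
sample_names <- c(
  "Braund, Mr. Owen Harris",       # appears in the data preview
  "Doe, Mlle. Jane",               # hypothetical
  "Roe, the Countess. of (Ann)"    # hypothetical
)

# Drop everything up to the comma and everything from the first period onward
titles <- gsub("(.*\\,|\\..*)", "", sample_names)

# Remove the remaining whitespace, so "the Countess" becomes "theCountess"
titles <- gsub("[[:space:]]", "", titles)

titles
```

Removing all whitespace is what makes the value `theCountess` appear in the grouping step.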
# Extract the title using the gsub function
data_2$Title <- gsub("(.*\\,|\\..*)", "", data_2$Name) %>%
gsub("[[:space:]]", "", .)
# Consolidate the values of the Title variable
data_2$Title[data_2$Title %in% c("Don", "Sir")] <- "Mr"
data_2$Title[data_2$Title %in% c("Ms", "Mme", "Mlle", "Lady", "Dona", "theCountess")] <- "Mrs"
data_2$Title[data_2$Title %in% c("Jonkheer", "Dr")] <- "Honorific Titles"
data_2$Title[data_2$Title %in% c("Capt", "Col", "Major", "Rev")] <- "Officers"
Grouping the Age variable into several categories.
Ages <- function(x){
if(x < 20) {
x <- "< 20"
} else if (x >= 20 & x <= 29) {
x <- "20-29"
} else if (x >= 30 & x <= 39) {
x <- "30-39"
} else if (x >= 40 & x <= 49) {
x <- "40-49"
} else if (x >= 50 & x <= 59) {
x <- "50-59"
} else if (x >= 60 & x <= 69) {
x <- "60-69"
} else if (x >= 70 & x <= 79) {
x <- "70-79"
} else (x <- "> 80")
}
data_2$Ages <- as.factor(sapply(data_2$Age, Ages))
summary(data_2)
## PassengerId Survived Pclass Name Sex
## Min. : 1.0 Dead :549 1st Class:216 Length:891 female:314
## 1st Qu.:223.5 Survived:342 2nd Class:184 Class :character male :577
## Median :446.0 3rd Class:491 Mode :character
## Mean :446.0
## 3rd Qu.:668.5
## Max. :891.0
##
## Age SibSp Parch Ticket
## Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
## 1st Qu.:22.00 1st Qu.:0.000 1st Qu.:0.0000 Class :character
## Median :29.70 Median :0.000 Median :0.0000 Mode :character
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:35.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
##
## Fare Cabin Embarked Title
## Min. : 0.00 Length:891 Cherbourg :168 Length:891
## 1st Qu.: 7.91 Class :character Queenstown : 77 Class :character
## Median : 14.45 Mode :character Southampton:646 Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
## Ages
## 20-29 :220
## > 80 :178
## 30-39 :167
## < 20 :164
## 40-49 : 89
## 50-59 : 48
## (Other): 25
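The if/else chain in `Ages()` uses integer upper bounds, so a non-integer age that sits between two ranges (29.7, for instance) matches no branch and ends up in the final `> 80` group. A minimal sketch of a gap-free alternative, assuming half-open intervals are acceptable, uses base R's `cut()`. The breaks and the `>= 80` label are choices made for this illustration, not the author's code.

```r
# Sample ages, including the non-integer value 29.7
ages <- c(0.42, 19.5, 22, 29.7, 35, 80)

breaks <- c(0, 20, 30, 40, 50, 60, 70, 80, Inf)
labels <- c("< 20", "20-29", "30-39", "40-49",
            "50-59", "60-69", "70-79", ">= 80")

# right = FALSE makes each interval closed on the left: [20, 30), [30, 40), ...
age_group <- cut(ages, breaks = breaks, labels = labels, right = FALSE)

age_group
```

Because `cut()` is vectorised, no `sapply()` call is needed, and the result is already a factor with its levels in ascending order.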
Interpretation:
The Max value of 891 in PassengerId
corresponds to the number of passengers in this dataset (not everyone
who was aboard the ship).
Survived (1) indicates that 342 passengers survived
the Titanic shipwreck.
The Sex of Titanic passengers consists of 577 males
and 314 females.
The Age of Titanic passengers ranges from the
youngest being 0.42 years old to the oldest being
80 years old.
The highest fare paid by a passenger is 512.33.
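The most common port of embarkation in the
Embarked column is S (Southampton) with 646
passengers, followed by C (Cherbourg) with 168 passengers
and Q (Queenstown) with 77 passengers.
The age groups (Ages) are distributed as follows: 20-29 (220), > 80 (178), 30-39 (167), < 20 (164), 40-49 (89), 50-59 (48), and the remaining groups combined (25).
Since the maximum age is 80, the 178 passengers in the > 80 group cannot all be that old; the count is consistent with the 177 mean-imputed ages (29.70) plus the one passenger aged 80.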
# Filter Survived and Dead in the `Survived` variable
data_2.clean <- data_2[data_2$Survived %in% c("Survived", "Dead"),]
# Subset data for passengers who survived
data_2.clean.survived <- data_2.clean[data_2.clean$Survived == "Survived",]
# Subset data for passengers who did not survive
data_2.clean.ntsurvived <- data_2.clean[data_2.clean$Survived == "Dead",]
agg_tbl <- data_2.clean %>% group_by(Survived) %>%
summarise(total_count=n(),
.groups = 'drop')
ggplot(agg_tbl, aes(x=Survived, y=total_count, fill=Survived))+
geom_bar(stat="identity")+
geom_text(aes(label=total_count), vjust=-0.3, size=3.5) +
scale_fill_brewer(palette = "Set1") +
labs(title = "Comparison of Survived Categories of Passengers",
x = "Category",
y = "Total Passengers")
Overall, more passengers did not survive than survived.
How does Sex compare across Pclass?
ggplot(data = data_2, aes(x=Pclass, fill=Sex)) +
geom_bar(position = "dodge", width = 0.4) +
labs(title = "Comparison of Sex in each PClass",
x = "Pclass",
y = "Total Passengers")
There are more males in every Pclass, and the male-to-female ratio is markedly higher in Pclass 3.
How does Survived compare for each Gender?
ggplot(data = data_2.clean, aes(x=Sex, fill= Survived)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set1") +
labs(title = "Comparison of Survived in each Gender",
x = "Sex",
y = "Total Passengers")
For females, more passengers survived.
For males, significantly more passengers did not survive.
There are more passengers who did not survive than those who survived.
How does Survived compare for each Pclass, and were those who paid more prioritized for survival?
ggplot(data = data_2.clean, aes(x=Pclass, fill= Survived)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set1") +
labs(title = "Comparison of Survived in each PClass",
x = "PClass",
y = "Total Passengers")
Yes, it turns out that those in Pclass 1 have a much better chance of survival compared to others.
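A dodged bar chart shows counts, so comparing chances of survival between classes of different sizes is easier with row-wise proportions. The sketch below shows the idea with `prop.table(..., margin = 1)` on a small hypothetical data frame; the values are made up for illustration and are not the Titanic figures.

```r
# Hypothetical toy data, not the Titanic counts
toy <- data.frame(
  Pclass = c("1st Class", "1st Class", "1st Class",
             "2nd Class", "2nd Class",
             "3rd Class", "3rd Class", "3rd Class", "3rd Class"),
  Survived = c("Survived", "Survived", "Dead",
               "Survived", "Dead",
               "Dead", "Dead", "Dead", "Survived")
)

counts <- table(toy$Pclass, toy$Survived)

# margin = 1 divides each row by its row total, giving a rate per class
rates <- prop.table(counts, margin = 1)

round(rates, 2)
```

Applied to `data_2.clean$Pclass` and `data_2.clean$Survived`, the same two calls would give the survival rate within each passenger class.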
ggplot(data = data_2.clean.survived, mapping = aes(x = Pclass, y = Survived)) +
geom_col(aes(fill = Sex), show.legend = T) +
labs(title = "Survived Passengers by Passenger Class",
x = "Passenger Class",
y = "Survived",
caption = "Titanic")
Female passengers have a higher survival rate than male passengers in all Passenger Class categories.
ggplot(data = data_2.clean.ntsurvived, mapping = aes(x = Pclass, y = Survived)) +
geom_col(aes(fill = Sex), show.legend = T) +
labs(title = "Deceased Passengers by Passenger Class",
x = "Passenger Class",
y = "Deceased",
caption = "Titanic")
Male passengers have a higher rate of non-survival in all Passenger Class categories.
How does Survived compare based on the Title of the name?
ggplot(data = data_2.clean, aes(x=Title, fill=Survived)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set1") +
labs(title = "Comparison of Survived in each Title Name",
x = "Title",
y = "Total Passengers")
data_2.clean.title <- as.data.frame(table(data_2.clean$Survived, data_2.clean$Title))
data_2.clean.title
ggplot(data_2.clean.title, aes(x=Var2, y=Freq, fill= Var1)) +
geom_col(position="fill") +
ggthemes::theme_economist() +
scale_fill_brewer(palette = "Set1") +
geom_text(aes(label = Freq), position = position_fill(vjust = .5), col = "white") +
coord_flip() +
labs(fill = "Survived",
title = "Comparison of Survival in each Title Name",
x = "Title",
y = "Total Passengers")
Male passengers with the title Mr have the worst survival rate.
Female passengers with the title Mrs have a better survival rate than males.
For Honorific Titles and Officers, fewer passengers survived than died.
The title Master is used in English for young boys who are too young to be called Mister, and it seems that children have a better chance of survival.
ggplot(data = data_2.clean, aes(x=Sex, fill=Survived)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Set1") +
facet_wrap(~Pclass, nrow = 3, scales = "free_y") +
labs(title = "Comparison of Survival in each Sex and PClass",
x = "Sex",
y = "Total Passengers")
In Pclass 1, women have a very high survival rate, while men have a much higher death rate.
In Pclass 2, the pattern is almost the same as in Pclass 1; only the numbers differ.
In Pclass 3, the survival rate for women is almost 50/50, in stark contrast to the much higher death rate for men.
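The `Family` variable plotted in the next chart is not created in any chunk shown in this report. One plausible construction, stated here as an assumption rather than the author's code, is the passenger plus their siblings, spouses, parents, and children, that is `SibSp + Parch + 1`. This would be consistent with an axis that starts at 1 for passengers travelling alone. The sketch uses the first ten `SibSp` and `Parch` values from the data preview.

```r
# First ten SibSp and Parch values as shown in the data preview
toy <- data.frame(
  SibSp = c(1, 1, 0, 1, 0, 0, 0, 3, 0, 1),
  Parch = c(0, 0, 0, 0, 0, 0, 0, 1, 2, 0)
)

# Assumed definition: the passenger plus accompanying relatives
toy$Family <- toy$SibSp + toy$Parch + 1

toy$Family
```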
ggplot(data = data_2.clean, aes(x=Family, fill=Survived)) +
geom_bar(stat = "count", position = "dodge") +
scale_x_continuous(breaks = c(1:11)) +
scale_fill_brewer(palette = "Set1") +
ylim(c(0, 400)) +
labs(title = "Comparison of Survival by Family Size",
x = "Family Size",
y = "Total Passengers") +
theme_bw()
# Filter Pclass 1st Class
data_2.pclass1 <- data_2.clean[data_2.clean$Pclass %in% "1st Class", ]
# Plot of Passengers with 1st Class Ticket
p1 <- ggplot(data = data_2.pclass1, aes(x = Fare, y = Age)) +
geom_point(colour = "yellow", shape = 21, size = 3, aes(fill = Survived)) +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
labs(title = "Passengers with 1st Class Ticket")
# Filter Pclass 2nd Class
data_2.pclass2 <- data_2.clean[data_2.clean$Pclass %in% "2nd Class", ]
# Plot of Passengers with 2nd Class Ticket
p2 <- ggplot(data = data_2.pclass2, aes(x = Fare, y = Age)) +
geom_point(colour = "yellow", shape = 21, size = 3, aes(fill = Survived)) +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
labs(title = "Passengers with 2nd Class Ticket")
# Filter Pclass 3rd Class
data_2.pclass3 <- data_2.clean[data_2.clean$Pclass %in% "3rd Class", ]
# Plot of Passengers with 3rd Class Ticket
p3 <- ggplot(data = data_2.pclass3, aes(x = Fare, y = Age)) +
geom_point(colour = "yellow", shape = 21, size = 3, aes(fill = Survived)) +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
labs(title = "Passengers with 3rd Class Ticket")
library(gridExtra)
grid.arrange(p1, p2, p3, ncol = 1)
data_2.clean.prop <- as.data.frame(table(data_2.clean$Survived,data_2.clean$Ages))
data_2.clean.prop
ggplot(data = data_2.clean.prop, mapping = aes(x = Freq, y = reorder(Var2, Freq))) +
geom_col(mapping = aes(fill = Var1), position = "stack") +
labs(x = "Proportion of Passenger Survival",
y = NULL,
fill = NULL,
title = "Proportion of Passenger Survival on Titanic Ship",
subtitle = "Based on Age Range") +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
theme(legend.position = "top")
The 20-29 age range has more survivors than any other range, but also markedly more deaths, as it is the largest age group in the data.
How does Survived compare for each Embarked port?
data_2.clean.emb <- as.data.frame(table(data_2.clean$Survived, data_2.clean$Embarked))
data_2.clean.emb
ggplot(data_2.clean.emb, aes(x=Var2, y=Freq, fill= Var1)) +
geom_col(position = "fill") +
ggthemes::theme_economist() +
scale_fill_brewer(palette = "Set1") +
geom_text(aes(label = Freq), position = position_fill(vjust = .5), col = "white") +
labs(fill = "Survived",
title = "Comparison of Survival in each Embarked Port",
x = "Embarked",
y = "Frequency")
ggplot(data = data_2.clean.survived, mapping = aes(x = Pclass,
y = Survived)) +
geom_col(aes(fill = Embarked), show.legend = T) +
labs(title = "Survival by Passenger Class",
x = "Passenger Class",
y = "Survived",
caption = "Titanic")
The majority of passengers who survived embarked from Southampton (S).
Before proceeding with the analysis, it is essential to pay attention to the proportion of the target variable, which is “Survived”.
agg_tbl <- data_2.clean %>% group_by(Survived) %>%
summarise(total_count=n(),
.groups = 'drop')
ggplot(agg_tbl, aes(x=Survived, y=total_count, fill=Survived))+
geom_bar(stat="identity")+
geom_text(aes(label=total_count), vjust=-0.3, size=3.5) +
scale_fill_brewer(palette = "Set1") +
labs(title = "Comparison of Survival Categories of Passengers",
x = "Category",
y = "Total Passengers")
##
## 0 1
## 0.6161616 0.3838384
##
## 0 1
## 549 342
The number of passengers who did not survive is greater than the number of passengers who survived in general. Among females, more passengers survived, while among males, significantly more passengers did not survive.
Among the three Pclasses, it is found that those in Pclass 1 have a much higher chance of survival compared to the others.
When comparing with the Sex variable, there are more males in all Pclasses. However, the male-to-female ratio is much higher in Pclass 3. Female passengers have a higher survival rate than male passengers in all Pclass categories. Male passengers have a higher mortality rate in all Pclass categories.
In the comparison of survival based on Title, males with the title “Mr” have the worst survival rate, while females with the title “Mrs” have a better survival rate than males. For Honorific Titles and Officers, fewer passengers survived than died. However, for the title “Master,” which is used in English for young boys who are too young to be called “Mister,” it seems that children have a better chance of survival.
In the comparison of survival based on Family Size, passengers travelling alone have a lower chance of survival. Small families (2-4 members) have a better chance of survival, whereas large families (5 or more members) do not.
The 20-29 age range has more survivors than any other range, but also markedly more deaths, as it is the largest age group in the data.
For the “Embarked” variable, passengers who embarked from Cherbourg have a relatively higher ratio of survival with 93 passengers survived and 75 passengers deceased. On the other hand, passengers who embarked from Queenstown and Southampton have lower survival ratios.
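The Cherbourg ratio can be expressed as a share using only the two counts quoted above (93 survived, 75 deceased). This is a small worked calculation; the vector name `cherbourg` is illustrative.

```r
# Counts for passengers who embarked at Cherbourg, as quoted above
cherbourg <- c(Survived = 93, Dead = 75)

# Share of each outcome among the 168 Cherbourg passengers
share <- cherbourg / sum(cherbourg)

round(share, 3)
```

The result shows that roughly 55% of the Cherbourg passengers in this dataset survived.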