# Global chunk options
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE
)

1 Introduction

The sinking of the Titanic shocked the world, and news of the shipwreck spread rapidly. The Titanic was one of the largest passenger ships of its time, and the accident caused a tragic loss of life. Data on the passengers who survived and those who did not were compiled in the hope of studying how factors such as age and sex affected the likelihood of surviving the shipwreck.

The disaster occurred in 1912 and claimed about 1,500 lives. In this report, we analyze the Titanic passenger data through data visualization to gain insight into the factors that may have influenced the survival of passengers on board.

The dataset is available at https://www.kaggle.com/c/titanic. It contains information about passengers on the Titanic, including their age, sex, passenger class, fare, and survival status. Before visualizing the data, we preprocess it by handling missing values and converting categorical variables to appropriate types.

1.1 Data Understanding

Before diving into the project, we load the libraries required for the analysis.

library(dplyr) # Preprocessing Dataset
library(ggplot2) # ggplot visualization
library(ggthemes) # ggplot theme
library(tidyverse) # For data processing and visualization

Then we import the data and take a first look at its structure:

data <- read.csv("datasets/train.csv")
glimpse(data)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

Metadata:

  • PassengerId: Unique identification number for each passenger.

  • Survived: Indicates whether the passenger survived (1) or did not survive (0).

  • Pclass: Ticket class of the passenger, with values 1, 2, or 3, reflecting socio-economic class (1: First class, 2: Second class, 3: Third class).

  • Name: Full name of the passenger.

  • Sex: Sex of the passenger (male or female).

  • Age: Age of the passenger in years.

  • SibSp: Number of siblings or spouses accompanying the passenger on board.

  • Parch: Number of parents or children accompanying the passenger on board.

  • Ticket: Ticket number of the passenger.

  • Fare: Fare paid by the passenger.

  • Cabin: Cabin number of the passenger.

  • Embarked: Port where the passenger embarked the ship (C: Cherbourg, Q: Queenstown, S: Southampton).

1.1.1 Data Dictionary

Variable   Definition                                    Key
survival   Survival                                      0 = No, 1 = Yes
pclass     Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex
Age        Age in years
sibsp      # of siblings / spouses aboard the Titanic
parch      # of parents / children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton

1.1.2 Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

2 EDA & Data Wrangling

Before proceeding further, we need to examine the condition of our data and clean it to convert the raw dataset into a usable form that is more relevant to our analysis later.

Change data types: several variables are stored with inappropriate data types and need to be converted to factors. Unnecessary columns are removed later, after the missing values have been handled.

data <- data %>%
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         Embarked = as.factor(Embarked))

head(data)
summary(data)
##   PassengerId    Survived Pclass      Name               Sex     
##  Min.   :  1.0   0:549    1:216   Length:891         female:314  
##  1st Qu.:223.5   1:342    2:184   Class :character   male  :577  
##  Median :446.0            3:491   Mode  :character               
##  Mean   :446.0                                                   
##  3rd Qu.:668.5                                                   
##  Max.   :891.0                                                   
##                                                                  
##       Age            SibSp           Parch           Ticket         
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##  NA's   :177                                                        
##       Fare           Cabin           Embarked
##  Min.   :  0.00   Length:891          :  2   
##  1st Qu.:  7.91   Class :character   C:168   
##  Median : 14.45   Mode  :character   Q: 77   
##  Mean   : 32.20                      S:644   
##  3rd Qu.: 31.00                              
##  Max.   :512.33                              
## 

Insights Obtained:

  • There are 549 passengers who did not survive (0) and 342 passengers who survived (1).

  • There are 216 passengers in class 1, 184 passengers in class 2, and 491 passengers in class 3.

  • There are 577 male passengers and 314 female passengers.

  • The age range of passengers spans from 0.42 years to 80 years. The average age of passengers is 29.7 years.

  • The majority of passengers do not have any siblings or spouses accompanying them (0), followed by 209 passengers with 1 sibling or spouse.

  • The majority of passengers do not have any parents or children accompanying them (0), followed by 118 passengers with 1 parent or child.

  • The ticket fares range from 0 to 512.33. The average ticket fare is 32.2.

  • The majority of passengers (644) boarded the ship from Southampton port (S), followed by 168 passengers from Cherbourg port (C), and 77 passengers from Queenstown port (Q).

prop.table(table(data$Survived))
## 
##         0         1 
## 0.6161616 0.3838384
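
The same `prop.table()` call can be applied row-wise to a two-way table, which turns counts into a survival rate per group. The sketch below is only an illustration: `toy` is a small made-up data frame (not the Kaggle data) that reuses the `Sex` and `Survived` codings described above.

```r
# Hypothetical toy data, NOT taken from train.csv; only the column
# names and codings mirror the dataset described above.
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Survived = c(1, 1, 0, 0, 0, 0, 1, 0)
)

# Cross-tabulate, then convert each row to proportions (margin = 1),
# so every row sums to 1 and the "1" column is the survival rate.
counts <- table(toy$Sex, toy$Survived)
rates  <- prop.table(counts, margin = 1)
print(round(rates, 3))
```

Reading the `1` column of `rates` gives the share of survivors within each sex, which is easier to compare across groups of different sizes than raw counts.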

2.1 Check Missing Values

The first step is to look for missing values. Handling missing data is crucial because any analytical results based on a dataset with missing values could be biased.

After that, we will handle the missing values in the Age column by filling the missing values with the mean age. This is done to make the dataset more complete and eliminate any missing values that may disrupt the analysis process.

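# Empty strings mark unknown values; recode them as NA so they are counted
data$Embarked[data$Embarked == ""] <- NA
data$Embarked <- droplevels(data$Embarked)  # remove the now-unused "" level
data$Cabin[data$Cabin == ""] <- NA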
colSums(is.na(data))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2

Three columns have missing values: Age, Cabin, and Embarked. The proportions differ widely: 177 of 891 rows (about 19.9%) for Age, 687 of 891 rows (about 77.1%) for Cabin, and only 2 of 891 rows (about 0.2%) for Embarked. In Embarked and Cabin, missing entries were stored as empty strings (""), indicating that the value was not specified or known when the data were collected, so they were converted to NA before counting.
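
The share of missing values per column can also be computed directly rather than by hand. This is a minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration); it recodes empty strings as NA in all character columns and then reports percentages.

```r
# Hypothetical toy data with the same missing-value conventions as above:
# NA for Age, empty strings for Cabin and Embarked.
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing in every character column.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], function(col) replace(col, col == "", NA))

# Share of missing values per column, as a percentage.
missing_pct <- round(100 * colMeans(is.na(toy)), 1)
print(missing_pct)
```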

For this treatment, we will perform the following:

  • Impute the missing values in the Age column with the mean age and in the Embarked column with the mode.
  • Drop the columns Name, PassengerId, Ticket, and Cabin, as they are not needed in the analysis.

# Fill missing values in the `Age` column with the mean age
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm=TRUE)

# Filling missing values in the `Embarked` column with the mode value
calc_mode <- function(x){
  
  # List the distinct / unique values
  distinct_values <- unique(x)
  
  # Count the occurrence of each distinct value
  distinct_tabulate <- tabulate(match(x, distinct_values))
  
  # Return the value with the highest occurrence
  distinct_values[which.max(distinct_tabulate)]
}

data <- data %>% 
  mutate(Embarked = if_else(is.na(Embarked), 
                         calc_mode(Embarked), 
                         Embarked))
colSums(is.na(data))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           0

Age and Embarked no longer contain missing values. Cabin still has 687 missing values, but that column is dropped in the next step.
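
One caveat about `calc_mode()` as written: `unique()` keeps NA as a candidate value, so for a column where NA is the most frequent entry (Cabin, for example) the function would return NA. That does not affect Embarked, which has only two missing values. A possible NA-safe variant is sketched below; the name `calc_mode_no_na` and the sample vector are hypothetical.

```r
# Variant of calc_mode() that ignores NA (the function name is hypothetical).
calc_mode_no_na <- function(x) {
  x <- x[!is.na(x)]                              # drop missing values first
  distinct_values <- unique(x)                   # candidate values
  counts <- tabulate(match(x, distinct_values))  # occurrences of each candidate
  distinct_values[which.max(counts)]             # first most frequent value
}

# Made-up vector in which NA is the most common entry.
ports <- c(NA, NA, NA, "S", "C", "S")
print(calc_mode_no_na(ports))
```

If every value is missing, this variant returns a zero-length vector, so a caller would need to handle that case separately.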

We will now drop several variables that are not needed, such as Name, PassengerId, Ticket, and Cabin.

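# Keep a full copy for feature engineering, which still needs the Name column
data_2 <- data

# Drop the columns that are not needed in `data`
data <- data %>%
  select(-c(Name, PassengerId, Ticket, Cabin))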

2.2 Data Visualization

Feature Engineering

data_2 <- data_2 %>% 
  mutate(Survived = ifelse(Survived == 1, "Survived", "Dead")) %>% 
  mutate(Pclass = sapply(as.character(Pclass), switch,
                         "1" = "1st Class",
                         "2" = "2nd Class",
                         "3" = "3rd Class")) %>% 
  mutate(Embarked = sapply(as.character(Embarked), switch, 
                           "C" = "Cherbourg",
                           "Q" = "Queenstown",
                           "S" = "Southampton"))
# Convert the columns to factors
data_2 <- data_2 %>% 
  mutate_at(vars(Pclass,Sex,Embarked,Survived), as.factor)

data_2

To obtain more specific information about Age and Sex, feature engineering can be performed by extracting titles from the Name variable. The gsub function can be used in this case, and the titles will be categorized into 6 different groups. These groups include: Mr, Mrs, Master, Miss, Honorific Titles, and Officers.

# Extract the title from Name using gsub
data_2$Title <- gsub("(.*\\,|\\..*)", "", data_2$Name) %>%    
  gsub("[[:space:]]", "", .)

# Consolidate rare titles into broader groups
data_2$Title[data_2$Title %in% c("Don", "Sir")] <- "Mr"
data_2$Title[data_2$Title %in% c("Ms", "Mme", "Mlle", "Lady", "Dona", "theCountess")] <- "Mrs"
data_2$Title[data_2$Title %in% c("Jonkheer", "Dr")] <- "Honorific Titles"
data_2$Title[data_2$Title %in% c("Capt", "Col", "Major", "Rev")] <- "Officers"
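
To see what the two `gsub` steps do, it helps to run them on a few names. In the sketch below the first name comes from the `glimpse()` output above and the other two are invented to follow the same "Surname, Title. Given names" layout; the pattern is written with an unescaped comma, which is assumed to behave the same as the escaped form used above.

```r
# Sample names; only the first is from the dataset, the rest are illustrative.
sample_names <- c("Braund, Mr. Owen Harris",
                  "Doe, Mrs. Jane (Mary Smith)",
                  "Roe, the Countess. of Somewhere")

# Step 1: delete everything up to the comma and everything from the
#         period after the title onwards.
# Step 2: delete the remaining whitespace.
titles <- gsub("(.*,|\\..*)", "", sample_names)
titles <- gsub("[[:space:]]", "", titles)
print(titles)
```

Because all whitespace is removed in the second step, a multi-word title collapses into one token, which is why the recoding list contains `"theCountess"` rather than `"the Countess"`.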

Grouping the Age variable into several categories.

# Map an age to its age group. The bins are contiguous, so fractional
# ages such as the imputed mean (29.7) fall into the correct group.
Ages <- function(x) {
  if (x < 20) {
    "< 20"
  } else if (x < 30) {
    "20-29"
  } else if (x < 40) {
    "30-39"
  } else if (x < 50) {
    "40-49"
  } else if (x < 60) {
    "50-59"
  } else if (x < 70) {
    "60-69"
  } else if (x < 80) {
    "70-79"
  } else {
    ">= 80"
  }
}

data_2$Ages <- as.factor(sapply(data_2$Age, Ages))
summary(data_2)
##   PassengerId        Survived         Pclass        Name               Sex     
##  Min.   :  1.0   Dead    :549   1st Class:216   Length:891         female:314  
##  1st Qu.:223.5   Survived:342   2nd Class:184   Class :character   male  :577  
##  Median :446.0                  3rd Class:491   Mode  :character               
##  Mean   :446.0                                                                 
##  3rd Qu.:668.5                                                                 
##  Max.   :891.0                                                                 
##                                                                                
##       Age            SibSp           Parch           Ticket         
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:22.00   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :29.70   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:35.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##                                                                     
##       Fare           Cabin                  Embarked      Title          
##  Min.   :  0.00   Length:891         Cherbourg  :168   Length:891        
##  1st Qu.:  7.91   Class :character   Queenstown : 77   Class :character  
##  Median : 14.45   Mode  :character   Southampton:646   Mode  :character  
##  Mean   : 32.20                                                          
##  3rd Qu.: 31.00                                                          
##  Max.   :512.33                                                          
##                                                                          
##       Ages    
##  20-29  :397  
##  30-39  :167  
##  < 20   :164  
##  40-49  : 89  
##  50-59  : 48  
##  60-69  : 19  
##  (Other):  7
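
Age binning of this kind can also be written with base R's `cut()`, which is vectorised and guarantees that neighbouring bins share a boundary, so fractional ages cannot fall between two groups. A minimal sketch, assuming the same group labels as above; the `ages` vector is made up apart from the minimum (0.42), the imputed mean (29.7), and the maximum (80) reported in the summaries.

```r
# Example ages (illustrative values, see the note above).
ages <- c(0.42, 19.9, 20, 29.7, 30, 59.5, 80)

# Half-open intervals [lower, upper): right = FALSE puts 20 in "20-29"
# and leaves no gaps between neighbouring bins.
age_group <- cut(
  ages,
  breaks = c(-Inf, 20, 30, 40, 50, 60, 70, 80, Inf),
  labels = c("< 20", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", ">= 80"),
  right  = FALSE
)
print(data.frame(ages, age_group))
```

`cut()` returns a factor whose levels follow the order of `labels`, which also keeps the groups in age order on plot axes.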

Interpretation:

  1. The Max value of 891 in PassengerId corresponds to the number of passengers in this training dataset, not the total number of people aboard the ship.

  2. Survived shows that 342 passengers survived the shipwreck and 549 did not.

  3. The Sex of Titanic passengers consists of 577 males and 314 females.

  4. The Age of Titanic passengers ranges from the youngest being 0.42 years old to the oldest being 80 years old.

  5. The highest fare paid by a passenger is 512.33.

  6. The most common port of embarkation in the Embarked column is Southampton (S) with 646 passengers, followed by Cherbourg (C) with 168 passengers and Queenstown (Q) with 77 passengers.

  7. The age groups of passengers (Ages) are distributed as follows:

  • < 20 years: 164 people
  • 20-29 years: 397 people (including the 177 passengers whose missing age was imputed with the mean of 29.7)
  • 30-39 years: 167 people
  • 40-49 years: 89 people
  • 50-59 years: 48 people
  • 60-69 years: 19 people
  • 70 years and over: 7 people

# Filter Survived and Dead in the `Survived` variable
data_2.clean <- data_2[data_2$Survived %in% c("Survived", "Dead"),]

# Subset data for passengers who survived
data_2.clean.survived <- data_2.clean[data_2.clean$Survived == "Survived",]

# Subset data for passengers who did not survive
data_2.clean.ntsurvived <- data_2.clean[data_2.clean$Survived == "Dead",]

2.2.0.1 What is the overall ratio of the Survived category?

agg_tbl <- data_2.clean %>% group_by(Survived) %>% 
  summarise(total_count=n(),
            .groups = 'drop')

ggplot(agg_tbl, aes(x=Survived, y=total_count, fill=Survived))+
  geom_bar(stat="identity")+
  geom_text(aes(label=total_count), vjust=-0.3, size=3.5) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Comparison of Survived Categories of Passengers",
       x = "Category",
       y = "Total Passengers")

Overall, more passengers died (549, about 61.6%) than survived (342, about 38.4%).

2.2.0.2 Are there more males than females in each Pclass?

ggplot(data = data_2, aes(x=Pclass, fill=Sex)) +
  geom_bar(position = "dodge", width = 0.4) +
  labs(title = "Comparison of Sex in each PClass",
       x = "Pclass",
       y = "Total Passengers")

There are more males than females in every Pclass, and the male-to-female ratio is markedly higher in Pclass 3.

2.2.0.3 What is the survival rate for each sex?

ggplot(data = data_2.clean, aes(x=Sex, fill= Survived)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Comparison of Survived in each Gender",
       x = "Sex",
       y = "Total Passengers")

  • For females, more passengers survived.

  • For males, significantly more passengers did not survive (deaths).

  • There are more passengers who did not survive than those who survived.

2.2.0.4 What is the survival rate for each Pclass? Were those who paid more prioritized for survival?

ggplot(data = data_2.clean, aes(x=Pclass, fill= Survived)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Comparison of Survived in each PClass",
       x = "PClass",
       y = "Total Passengers")

Yes, it turns out that those in Pclass 1 have a much better chance of survival compared to others.

2.2.0.5 How do survivals and deaths break down by passenger class and sex?

# Survived is a factor, so let geom_bar() count the rows per class
# instead of mapping it to the y axis
ggplot(data = data_2.clean.survived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Sex), show.legend = TRUE) +
  labs(title = "Survived Passengers by Passenger Class",
       x = "Passenger Class",
       y = "Survived",
       caption = "Titanic")

Female passengers have a higher survival rate than male passengers in all Passenger Class categories.

# Same approach for the passengers who did not survive
ggplot(data = data_2.clean.ntsurvived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Sex), show.legend = TRUE) +
  labs(title = "Deceased Passengers by Passenger Class",
       x = "Passenger Class",
       y = "Deceased",
       caption = "Titanic")

Male passengers have a higher rate of non-survival in all Passenger Class categories.

2.2.0.6 What is the survival rate for each title?

ggplot(data = data_2.clean, aes(x=Title, fill=Survived)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Comparison of Survived in each Title Name",
       x = "Title",
       y = "Total Passengers")

data_2.clean.title <- as.data.frame(table(data_2.clean$Survived, data_2.clean$Title))

data_2.clean.title
# position = "fill" scales each bar to 1, so the axis shows proportions;
# the labels inside the bars show the underlying counts
ggplot(data_2.clean.title, aes(x=Var2, y=Freq, fill= Var1)) +
  geom_col(position="fill") + 
  ggthemes::theme_economist() + 
  scale_fill_brewer(palette = "Set1") +
  geom_text(aes(label = Freq), position = position_fill(vjust = .5), col = "white") +
  coord_flip() +
  labs(fill = "Survived",
       title = "Comparison of Survival in each Title Name",
       x = "Title",
       y = "Proportion of Passengers")

  • Male passengers with the title Mr have the worst survival rate.

  • Female passengers with the title Mrs have a better survival rate than males.

  • Among Honorific Titles and Officers, more passengers died than survived.

  • The title Master is used in English for young boys who are too young to be called Mister, and it seems that children have a better chance of survival.

2.2.0.7 Women have a better survival rate, but how about when we look at it based on their Passenger Class (Pclass), is their survival rate still higher?

ggplot(data = data_2.clean, aes(x=Sex, fill=Survived)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  facet_wrap(~Pclass, nrow = 3, scales = "free_y") +
  labs(title = "Comparison of Survival in each Sex and PClass",
       x = "Sex",
       y = "Total Passengers")

  • In Pclass 1, women have a very high survival rate, while far more men died than survived.

  • In Pclass 2, the pattern is almost the same as in Pclass 1; only the counts differ.

  • In Pclass 3, the survival rate for women is almost 50/50, in stark contrast to the much higher death rate for men.
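
The faceted chart shows counts, so the rates quoted above have to be read off by eye. A numeric version can be obtained by averaging a 0/1 survival indicator within each class and sex combination. The sketch below uses invented values in a hypothetical `toy` data frame; the real data would need `Survived` in its original 0/1 coding.

```r
# Hypothetical toy data; values are invented for illustration only.
toy <- data.frame(
  Pclass   = c("1st Class", "1st Class", "1st Class", "1st Class",
               "3rd Class", "3rd Class", "3rd Class", "3rd Class"),
  Sex      = c("female", "female", "male", "male",
               "female", "female", "male", "male"),
  Survived = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# The mean of a 0/1 indicator is the share of 1s, i.e. the survival rate.
rate_by_group <- aggregate(Survived ~ Pclass + Sex, data = toy, FUN = mean)
print(rate_by_group)
```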

2.2.0.8 Does having family members on board affect the survival of passengers?

data_2.clean$Family <- data_2.clean$SibSp + data_2.clean$Parch + 1
ggplot(data = data_2.clean, aes(x=Family, fill=Survived)) +
  geom_bar(stat = "count", position = "dodge") +
  scale_x_continuous(breaks = c(1:11)) +
  scale_fill_brewer(palette = "Set1") +
  ylim(c(0, 400)) +
  labs(title = "Comparison of Survival by Family Size",
       x = "Family Size",
       y = "Total Passengers") +
  theme_bw()

  • Passengers travelling without family have a lower chance of survival.
  • Passengers in small families (2-4 members) have a better chance of survival.
  • However, passengers in large families (5 or more members) do not have a better chance of survival.
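
The three bands in these bullets can be made explicit by grouping family size and computing a survival rate per band. This is a sketch under assumptions: `toy` is made-up data, and the band names `Alone`, `Small (2-4)`, and `Large (5+)` are labels chosen here, not columns of the dataset.

```r
# Hypothetical toy data with the same SibSp / Parch / Survived columns.
toy <- data.frame(
  SibSp    = c(0, 0, 1, 1, 0, 4, 3, 0),
  Parch    = c(0, 0, 0, 2, 1, 2, 2, 0),
  Survived = c(0, 1, 1, 1, 0, 0, 0, 0)
)

# Family size counts the passenger plus relatives aboard, as in the chunk above.
toy$Family <- toy$SibSp + toy$Parch + 1

# Group sizes into the three bands used in the bullets above:
# (0, 1] = alone, (1, 4] = small, (4, Inf] = large.
toy$FamilyGroup <- cut(toy$Family,
                       breaks = c(0, 1, 4, Inf),
                       labels = c("Alone", "Small (2-4)", "Large (5+)"))

# Survival rate per band = mean of the 0/1 indicator.
rate <- tapply(toy$Survived, toy$FamilyGroup, mean)
print(round(rate, 2))
```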

2.2.0.9 How are survivors and non-survivors distributed by age and fare within each Pclass?

# Filter Pclass 1st Class
data_2.pclass1 <- data_2.clean[data_2.clean$Pclass %in% "1st Class", ]
# Plot of Passengers with 1st Class Ticket
p1 <- ggplot(data = data_2.pclass1, aes(x = Fare, y = Age)) +
  geom_point(colour = "yellow", shape = 21, size = 3, aes(fill = Survived)) +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  labs(title = "Passengers with 1st Class Ticket")

# Filter Pclass 2nd Class
data_2.pclass2 <- data_2.clean[data_2.clean$Pclass %in% "2nd Class", ]

# Plot of Passengers with 2nd Class Ticket
p2 <- ggplot(data = data_2.pclass2, aes(x = Fare, y = Age)) +
  geom_point(colour = "yellow", shape = 21, size = 3, aes(fill = Survived)) +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  labs(title = "Passengers with 2nd Class Ticket")

# Filter Pclass 3rd Class
data_2.pclass3 <- data_2.clean[data_2.clean$Pclass %in% "3rd Class", ]

# Plot of Passengers with 3rd Class Ticket
p3 <- ggplot(data = data_2.pclass3, aes(x = Fare, y = Age)) +
  geom_point(colour = "yellow", shape = 21, size = 3, aes(fill = Survived)) +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  labs(title = "Passengers with 3rd Class Ticket")

library(gridExtra) # arrange several ggplot objects on one page
grid.arrange(p1, p2, p3, ncol = 1)

2.2.0.10 How many passengers survived in each age range?

data_2.clean.prop <- as.data.frame(table(data_2.clean$Survived,data_2.clean$Ages))
data_2.clean.prop
# position = "stack" plots counts, so the axis is labelled as a count
ggplot(data = data_2.clean.prop, mapping = aes(x = Freq, y = reorder(Var2, Freq))) +
  geom_col(mapping = aes(fill = Var1), position = "stack") +
  labs(x = "Number of Passengers",
       y = NULL,
       fill = NULL,
       title = "Passenger Survival on the Titanic",
       subtitle = "Based on Age Range") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  theme(legend.position = "top")

The 20-29 age range has the most survivors, but it also has by far the most deaths. This mainly reflects that it is the largest age group rather than a higher survival rate.

2.2.0.11 What is the proportion of passenger survival based on their Embarked port?

data_2.clean.emb <- as.data.frame(table(data_2.clean$Survived, data_2.clean$Embarked))
data_2.clean.emb
# position = "fill" scales each bar to 1, so the axis shows proportions;
# the labels inside the bars show the underlying counts
ggplot(data_2.clean.emb, aes(x=Var2, y=Freq, fill= Var1)) +
  geom_col(position = "fill") +
  ggthemes::theme_economist() +
  scale_fill_brewer(palette = "Set1") +
  geom_text(aes(label = Freq), position = position_fill(vjust = .5), col = "white") +
  labs(fill = "Survived",
       title = "Comparison of Survival in each Embarked Port",
       x = "Embarked",
       y = "Proportion of Passengers")

  • For Cherbourg, 93 passengers survived and 75 did not (168 in total).
  • For Queenstown, 30 passengers survived and 47 did not (77 in total).
  • For Southampton, 219 passengers survived and 427 did not (646 in total, including the two passengers whose missing port was imputed as Southampton).

# Survived is a factor, so let geom_bar() count the rows per class
# instead of mapping it to the y axis
ggplot(data = data_2.clean.survived, mapping = aes(x = Pclass)) +
  geom_bar(aes(fill = Embarked), show.legend = TRUE) +
  labs(title = "Survival by Passenger Class",
       x = "Passenger Class",
       y = "Survived",
       caption = "Titanic")

The majority of passengers who survived embarked from Southampton (S).

Before proceeding with any further analysis, it is important to check the proportion of the target variable, Survived.

agg_tbl <- data_2.clean %>% group_by(Survived) %>% 
  summarise(total_count=n(),
            .groups = 'drop')

ggplot(agg_tbl, aes(x=Survived, y=total_count, fill=Survived))+
  geom_bar(stat="identity")+
  geom_text(aes(label=total_count), vjust=-0.3, size=3.5) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Comparison of Survival Categories of Passengers",
       x = "Category",
       y = "Total Passengers")

prop.table(table(data$Survived))
## 
##         0         1 
## 0.6161616 0.3838384
table(data$Survived)
## 
##   0   1 
## 549 342

2.3 Conclusion

  • The number of passengers who did not survive is greater than the number of passengers who survived in general. Among females, more passengers survived, while among males, significantly more passengers did not survive.

  • Among the three Pclasses, it is found that those in Pclass 1 have a much higher chance of survival compared to the others.

  • When comparing with the Sex variable, there are more males in all Pclasses. However, the male-to-female ratio is much higher in Pclass 3. Female passengers have a higher survival rate than male passengers in all Pclass categories. Male passengers have a higher mortality rate in all Pclass categories.

  • In the comparison of survival by title, males with the title “Mr” have the worst survival rate, while females with the title “Mrs” have a better survival rate than males. Among Honorific Titles and Officers, more passengers died than survived. The title “Master,” which is used in English for boys who are too young to be called “Mister,” suggests that children had a better chance of survival.

  • In the comparison of survival by family size, passengers travelling without family have a lower chance of survival. Passengers in small families (2-4 members) have a better chance of survival, whereas passengers in large families (5 or more members) do not.

  • The 20-29 age range has the most survivors, but it also has by far the most deaths, which mainly reflects that it is the largest age group.

  • For the “Embarked” variable, passengers who embarked at Cherbourg have the highest survival ratio, with 93 survivors and 75 deaths (about 55% survived). Passengers who embarked at Queenstown and Southampton have lower survival ratios.