TITANIC

Who doesn’t know the Titanic and its sad story? The movie alone has grossed a total of $2.19B, and if that number doesn’t tell you anything, it’s currently the 3rd highest-grossing film of all time.

RMS Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean on 15 April 1912, after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking at the time one of the deadliest of a single ship and the deadliest peacetime sinking of a superliner or cruise ship to date. With much public attention in the aftermath, the disaster has since been the material of many artistic works and a founding material of the disaster film genre.

The Data

The data that I’ll be using is train.csv data from Kaggle’s Titanic - Machine Learning from Disaster Competition.

Data Overview

Here is an overall view of the Titanic passengers’ data.

# load packages and read data
library(tidyverse)
titanic <- read_csv("data/train.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   PassengerId = col_double(),
##   Survived = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )

# quick data overview
titanic

## # A tibble: 891 x 12
##    PassengerId Survived Pclass Name   Sex     Age SibSp Parch Ticket  Fare Cabin
##          <dbl>    <dbl>  <dbl> <chr>  <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
##  1           1        0      3 Braun~ male     22     1     0 A/5 2~  7.25 <NA> 
##  2           2        1      1 Cumin~ fema~    38     1     0 PC 17~ 71.3  C85  
##  3           3        1      3 Heikk~ fema~    26     0     0 STON/~  7.92 <NA> 
##  4           4        1      1 Futre~ fema~    35     1     0 113803 53.1  C123 
##  5           5        0      3 Allen~ male     35     0     0 373450  8.05 <NA> 
##  6           6        0      3 Moran~ male     NA     0     0 330877  8.46 <NA> 
##  7           7        0      1 McCar~ male     54     0     0 17463  51.9  E46  
##  8           8        0      3 Palss~ male      2     3     1 349909 21.1  <NA> 
##  9           9        1      3 Johns~ fema~    27     0     2 347742 11.1  <NA> 
## 10          10        1      2 Nasse~ fema~    14     1     0 237736 30.1  <NA> 
## # ... with 881 more rows, and 1 more variable: Embarked <chr>

This data contains 12 columns which are:

PassengerId: passenger’s id.
Survived: whether the passenger survived in the end (1 = survived, 0 = deceased).
Pclass: passenger’s ticket class (1 = upper class, 2 = mid class, 3 = lower class).
Name: name of the passenger.
Sex: passenger’s gender.
Age: passenger’s age.
SibSp: number of siblings/spouses aboard the Titanic.
Parch: number of parents/children aboard the Titanic.
Ticket: passenger’s ticket number.
Fare: passenger’s fare.
Cabin: passenger’s cabin number.
Embarked: passenger’s port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

Data Pre-processing

First, I’m going to drop the columns that won’t serve any purpose in this analysis.

# drop columns
titanic <- titanic %>% 
  select(-c(Cabin, Ticket))

Missing Values

Next, I’m going to check whether this data contains missing values.

# change 0 and 1 in Survived column to logical
titanic$Survived <- as.logical(titanic$Survived)

# filling blank space with NA
# (note: as.character() turns every column into text; types are restored below)
titanic <- titanic %>% 
  mutate(across(everything(), ~ifelse(. == "", NA, as.character(.))))

# change data type
titanic <- titanic %>% 
  mutate(Pclass = as.factor(Pclass),
         Sex = as.factor(Sex),
         Age = as.integer(Age),
         Survived = as.logical(Survived)
  )

# check missing values
colSums(is.na(titanic))

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch        Fare    Embarked 
##           0           0           0           2

Age is missing in 177 of the 891 rows, which is about 20% of the data and well above 5%, so simply dropping every row with an NA is not a wise option. Instead, I decided to impute Age with the median age of the passengers who share the same title (Mr. / Mrs. / etc.). I chose this over the overall median age because I’m assuming that most of the passengers on the Titanic were adults, so filling a child’s missing age with an adult median would not be very precise, especially when the title already hints at a passenger’s age range.
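To make the 5% check concrete, here is a minimal sketch of how the share of missing values per column could be computed. The `toy` data frame and its values are made up for illustration; only the 177 and 891 figures come from the output above.

```r
# Toy data standing in for the real columns (values are made up)
toy <- data.frame(
  Age      = c(22, NA, 26, NA, 35),
  Embarked = c("S", "C", NA, "S", "S")
)

# Share of missing values per column, in percent
na_share <- round(colMeans(is.na(toy)) * 100, 1)
na_share

# The same calculation for the real Age column: 177 missing out of 891 rows
age_na_share <- round(177 / 891 * 100, 1)
age_na_share
```

On the real data, the same `colMeans(is.na(...))` call would be applied to `titanic` directly.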

First, I would like to see the passengers’ titles and how many NAs each title has.

# extracting name titles into a new column
titanic$Title <- gsub('(.*, )|(\\..*)', '', titanic$Name)

# change title data type
titanic$Title <- as.factor(titanic$Title)
table(titanic$Title)

## 
##         Capt          Col          Don           Dr     Jonkheer         Lady 
##            1            2            1            7            1            1 
##        Major       Master         Miss         Mlle          Mme           Mr 
##            2           40          182            2            1          517 
##          Mrs           Ms          Rev          Sir the Countess 
##          125            1            6            1            1
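To show what the regular expression above does, here is a small sketch that runs the same `gsub()` call on three names from the data preview. The variable names `sample_names` and `sample_titles` are mine, not part of the analysis.

```r
# Sample names taken from the data shown earlier
sample_names <- c(
  "Braund, Mr. Owen Harris",
  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
  "Heikkinen, Miss. Laina"
)

# '(.*, )' removes everything up to and including the comma after the surname;
# '(\\..*)' removes the first period and everything after it
sample_titles <- gsub("(.*, )|(\\..*)", "", sample_names)
sample_titles
```

What is left between the surname and the first period is the title, which is why the pattern relies on every name following the "Surname, Title. Given names" layout.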

# check which Title has an NA Age
na <- titanic %>% 
  filter(is.na(Age))
table(na$Title)

## 
##         Capt          Col          Don           Dr     Jonkheer         Lady 
##            0            0            0            1            0            0 
##        Major       Master         Miss         Mlle          Mme           Mr 
##            0            4           36            0            0          119 
##          Mrs           Ms          Rev          Sir the Countess 
##           17            0            0            0            0

# titles with NA
title <- unique(na$Title)

Before going further, I’ve noticed that some of the titles aren’t commonly used these days, so I’m going to map them to their equivalents among Mr./Mrs./Miss/Master.

# changing the titles to the equivalents of mr/mrs/ms/master
titanic$Title <- sapply(X = as.character(titanic$Title), # Data
                           FUN = switch,
                        "Capt" = "Mr",
                        "Col" = "Mr",
                        "Don" = "Mr",
                        "Dr" = "Mr",
                        "Jonkheer" = "Mr",
                        "Major" = "Mr",
                        "Master" = "Master",
                        "Miss" = "Miss",
                        "Mr" = "Mr",
                        "Mrs" = "Mrs",
                        "Rev" = "Mr",
                        "Lady" = "Mrs",
                        "Sir" = "Mr",
                        "Mlle" = "Miss",
                        "Mme" = "Mrs",
                        "Ms" = "Miss",
                        "the Countess" = "Mrs")

Next, I’m going to go through the imputation process. For this step, I will split the titanic data set into 4 dataframes, one per title (Mr., Mrs., Miss, and Master), according to the result obtained above. I will then calculate the median Age of each and use it to fill that dataframe’s NAs. At the end of the process, I’ll combine them back into one dataframe, titanic.

# age imputation based on specific titles
## mr.
mr <- titanic %>% 
  filter(Title == "Mr") %>% 
  mutate(Age = round(replace(Age, is.na(Age), median(Age, na.rm = TRUE)), 0))
## mrs.
mrs <- titanic %>% 
  filter(Title == "Mrs") %>% 
  mutate(Age = round(replace(Age, is.na(Age), median(Age, na.rm = TRUE)), 0))
## miss.
ms <- titanic %>% 
  filter(Title == "Miss") %>% 
  mutate(Age = round(replace(Age, is.na(Age), median(Age, na.rm = TRUE)), 0))
## master.
mstr <- titanic %>% 
  filter(Title == "Master") %>% 
  mutate(Age = round(replace(Age, is.na(Age), median(Age, na.rm = TRUE)), 0))

# combining all rows
titanic <- bind_rows(mr, mrs, ms, mstr) %>% 
  select(1:11) %>% 
  mutate(PassengerId = as.integer(PassengerId)) %>% 
  arrange(PassengerId)

# impute NA for Embarked
## creating mode function (table() already ignores NA)
get_mode <- function(x) {
    xtab <- table(x)
    xmode <- names(which(xtab == max(xtab)))
    if (length(xmode) > 1) xmode <- ">1 mode"
    return(xmode)
}
## impute NA
titanic <- titanic %>%
 mutate(Embarked = as.factor(Embarked),
        Embarked = replace(Embarked, is.na(Embarked), get_mode(Embarked)))
 
# check NA
anyNA(titanic)

## [1] FALSE
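As a side note, the split, impute, and recombine steps for Age could probably also be written as one grouped step. This is only a sketch on made-up toy data (the `toy` and `toy_imputed` names and all values are assumptions for illustration), not the code used in this analysis.

```r
library(dplyr)

# Toy data (made-up values) with one missing age per title
toy <- data.frame(
  Title = c("Mr", "Mr", "Mr", "Master", "Master", "Master"),
  Age   = c(30, 40, NA, 2, 6, NA)
)

# Fill each missing Age with the median Age of passengers sharing the same Title
toy_imputed <- toy %>%
  group_by(Title) %>%
  mutate(Age = if_else(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
  ungroup()

toy_imputed
```

Because `group_by()` keeps the rows in their original order, this version would not need the `arrange(PassengerId)` step afterwards.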

Column Data Type

The data type of each column is shown below.

# check column data type
str(titanic)

## tibble [891 x 11] (S3: tbl_df/tbl/data.frame)
##  $ PassengerId: int [1:891] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : logi [1:891] FALSE TRUE TRUE TRUE FALSE FALSE ...
##  $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr [1:891] "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num [1:891] 22 38 26 35 35 30 54 2 27 14 ...
##  $ SibSp      : chr [1:891] "1" "1" "0" "1" ...
##  $ Parch      : chr [1:891] "0" "0" "0" "0" ...
##  $ Fare       : chr [1:891] "7.25" "71.2833" "7.925" "53.1" ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
##  $ Title      : Named chr [1:891] "Mr" "Mrs" "Miss" "Mrs" ...
##   ..- attr(*, "names")= chr [1:891] "Mr" "Mrs" "Miss" "Mrs" ...

Some of the columns still don’t have the correct data type, so I’m going to change that.

# change data type
titanic <- titanic %>% 
  mutate(Title = as.factor(Title),
         Age = as.integer(Age),
         SibSp = as.integer(SibSp),
         Parch = as.integer(Parch),
         Fare = as.numeric(Fare))

Aaaand end of Data Pre-processing. 😊

About the Passengers

Here is a quick summary of the titanic data.

summary(titanic)

##   PassengerId     Survived       Pclass      Name               Sex     
##  Min.   :  1.0   Mode :logical   1:216   Length:891         female:314  
##  1st Qu.:223.5   FALSE:549       2:184   Class :character   male  :577  
##  Median :446.0   TRUE :342       3:491   Mode  :character               
##  Mean   :446.0                                                          
##  3rd Qu.:668.5                                                          
##  Max.   :891.0                                                          
##       Age            SibSp           Parch             Fare        Embarked
##  Min.   : 0.00   Min.   :0.000   Min.   :0.0000   Min.   :  0.00   C:168   
##  1st Qu.:21.00   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:  7.91   Q: 77   
##  Median :30.00   Median :0.000   Median :0.0000   Median : 14.45   S:646   
##  Mean   :29.36   Mean   :0.523   Mean   :0.3816   Mean   : 32.20           
##  3rd Qu.:35.00   3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.: 31.00           
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000   Max.   :512.33           
##     Title    
##  Master: 40  
##  Miss  :185  
##  Mr    :538  
##  Mrs   :128  
##              
##

According to this dataset, there were 891 passengers on board the Titanic: 577 males and 314 females, including elderly people and children, with ages ranging from under 1 year to 80 years. The highest SibSp value is 8, so at least one passenger had 8 siblings or spouses aboard. Fares varied widely, from £0 to £512.33. Most of the passengers embarked at Southampton. In the end, 342 of them survived the sinking and the rest tragically didn’t.

Most people these days probably know the Titanic’s story from the movie, which told the story of the ship’s tragic collision with an iceberg... and of Jack and Rose, or probably the other way around. Some articles also mention that this voyage was supposed to be a romantic trip for some passengers. It could also have been a family vacation for others. Now let’s dig deeper into these passengers!

But before anything else, I’ll create my own theme for this Markdown.

theme_lbb2 <- theme(panel.background = element_rect("#f7f8f4"),
                    plot.title = element_text(hjust = 0.5),
                    plot.subtitle = element_text(hjust = 0.5))

Let’s get on to the next part: how many of the passengers were male and how many were female?

# gender
gender <- titanic %>% 
  group_by(Sex) %>% 
  summarise(Value = length(Sex)) %>% 
  ungroup()
gender

## # A tibble: 2 x 2
##   Sex    Value
##   <fct>  <int>
## 1 female   314
## 2 male     577

There were definitely more men than women on board the ship. How about breaking the gender down by class?

gender_class <- titanic %>% 
  group_by(Pclass, Sex) %>% 
  summarise(Value = length(Sex)) %>% 
  ungroup()

## `summarise()` has grouped output by 'Pclass'. You can override using the `.groups` argument.

ggplot(gender_class, aes(x = Pclass,
                         y = Value)) +
  geom_col(aes(fill = Sex)) +
  scale_fill_manual(values = c("#e2bbb6", "#e2cfc0")) +
  
  labs(x = "Passenger Class",
       y = "",
       title = "Overview of Passenger Class",
       subtitle = "Based on Gender") +
  guides(fill=guide_legend(title="Gender")) +
  
  theme_lbb2

We can now see that most of the passengers were in 3rd class, which had the highest number of both female and male passengers. There were also more passengers in 1st class than in 2nd class. The gap between 3rd class and the other two classes is pretty large, which may mean that the Titanic was targeted more towards 3rd class passengers.

What I really want to know is how many of those 891 people on board were travelling with family, how many were travelling alone, and what their gender composition was.

# creating a new column "Travel" from single passengers
single <- titanic %>% 
  filter(Parch == 0,
         SibSp == 0) %>% 
  mutate(Travel = "Single")
# joining single dataframe to the main titanic dataframe
titanic <- left_join(titanic, single)

## Joining, by = c("PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Title")

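# passengers without a "Single" match were travelling with family
titanic <- titanic %>% 
  mutate(Travel = tidyr::replace_na(Travel, "Family"))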


gender_travel <- titanic %>% 
  group_by(Travel, Sex) %>% 
  summarise(Value = length(Sex)) %>% 
  ungroup()

## `summarise()` has grouped output by 'Travel'. You can override using the `.groups` argument.

ggplot(gender_travel, aes(x = Travel,
                         y = Value)) +
  geom_col(aes(fill = Sex)) +
  scale_fill_manual(values = c("#e2bbb6", "#e2cfc0")) +
  
  labs(x = "Travel Category",
       y = "",
       title = "Overview of Vacation Category",
       subtitle = "Based on Gender") +
  guides(fill=guide_legend(title="Gender")) +
  
  theme_lbb2
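The Single/Family flag used in the plot above could also be derived directly from SibSp and Parch, without filtering and joining. Here is a minimal sketch on made-up toy data; the `toy` data frame is an assumption for illustration only.

```r
library(dplyr)

# Toy data (made-up values): siblings/spouses and parents/children aboard
toy <- data.frame(
  SibSp = c(0, 1, 0, 0),
  Parch = c(0, 0, 2, 0)
)

# A passenger with no relatives aboard is "Single", everyone else is "Family"
toy <- toy %>%
  mutate(Travel = if_else(SibSp == 0 & Parch == 0, "Single", "Family"))

toy
```

This keeps the rule in one place, so changing the definition of "Single" later would only need one edit.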

The next thing we can see from the graph above is that most of the women on board the Titanic were traveling with their family. How about the age range of the passengers?

ggplot(titanic, aes(x = Age,
                    fill = Sex)) +
  geom_histogram(binwidth = 1) +
  
  scale_fill_manual(values = c("#e2bbb6", "#e2cfc0")) +
  
  labs(x = "Age",
       y = "Number of Passengers",
       title = "Passenger Age Range",
       subtitle = "Based on Gender") +
  guides(fill=guide_legend(title="Gender")) +
  
  theme_lbb2 +
  theme(legend.position = "bottom")

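Here we can see that most passengers (both female and male) were aged around 20-40, with the largest number at around 30 years old. Keep in mind that the missing ages imputed earlier pile up at each title’s median, which likely exaggerates that spike.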

Next, I want to know how the different classes were priced.

ggplot(titanic, aes(x = Pclass,
                    y = Fare,
                    fill = Pclass)) +
  geom_boxplot(color = "#77575a",
               alpha = 0.7) +
  scale_fill_manual(values = c("#e2bbb6", "#e2cfc0", "#e8d3d2")) +
  
  labs(title = "Ship Fare",
       subtitle = "Based on Passenger Class",
       x = "Passenger Class") +
  scale_y_continuous(labels = scales::label_dollar(prefix = "£")) +
  theme_lbb2 +
  theme(legend.position = "none")

From the box plot above, we can see one outlier far above the rest in 1st class, where the passenger paid more than £500. Other than that, the price ranges of 2nd and 3rd class aren’t very different, and some 3rd class passengers even paid as much as 2nd class passengers did.

Next, we’re going to see whether the embarkation port made a difference to the price range.

ggplot(titanic, aes(x = Embarked,
                    y = Fare,
                    fill = Embarked)) +
  geom_boxplot(color = "#77575a",
               alpha = 0.5) +
  scale_fill_manual(values = c("#e2bbb6", "#e2cfc0", "#e8d3d2")) +
  
  labs(title = "Ship Fare",
       subtitle = "Based on Embarkation Port",
       x = "Embarkation Port") +
  scale_y_continuous(labels = scales::label_dollar(prefix = "£")) +
  theme_lbb2 +
  theme(legend.position = "none")

From the graph above, we can see that passengers who embarked at Cherbourg typically paid the highest fares, and those who embarked at Queenstown typically paid the lowest. Based on this graph alone, my guess is that most of the 1st class passengers embarked at Cherbourg and Southampton. To check whether that assumption holds, I’m going to create another plot that compares Pclass and Embarked.

titanic %>% 
  group_by(Pclass, Embarked) %>% 
  summarise(Value = length(Pclass)) %>% 
  ungroup() %>%
  
  ggplot(aes(x = Embarked,
             y = Value,
             fill = Pclass)) +
  geom_col()+
  scale_fill_manual(values = c("#e2bbb6", "#e2cfc0", "#e8d3d2")) +
  labs(x = "Port of Embarkation",
       y = "Number of Passengers",
       title = "Passenger Class and Port of Embarkation") +
  guides(fill=guide_legend(title="Passenger Class")) +
  
  theme_lbb2 +
  theme(legend.position = "bottom")

## `summarise()` has grouped output by 'Pclass'. You can override using the `.groups` argument.

From the new graph above, we can see that 1st class passengers indeed mostly embarked at Cherbourg and Southampton, with few to none from Queenstown. Another visible insight is that most of the 2nd class and 3rd class passengers embarked at Southampton.

Next I’ll be calculating the numbers of family (by family name) and the total passengers in each category.
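One possible way to count families by family name is sketched below. The names in `toy_names` are made up, and the approach assumes that every name starts with the surname followed by a comma, as in the Name column. Note that unrelated passengers can share a surname, so this would only be a rough estimate.

```r
# Made-up names in the same "Surname, Title. Given names" layout as the Name column
toy_names <- c(
  "Smith, Mr. John",
  "Smith, Mrs. Jane",
  "Smith, Master. Tom",
  "Doe, Miss. Ann"
)

# Keep everything before the first comma as the family name
surname <- sub(",.*", "", toy_names)

# Number of passengers per family name
family_size <- table(surname)
family_size

# Number of family names shared by more than one passenger
n_families <- sum(family_size > 1)
n_families
```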

Survivors

From all the data above, we’re going to find out how many of them survived in the end.

How do the survived and deceased counts compare between genders?

titanic %>% 
  group_by(Sex, Survived) %>% 
  summarise(Value = length(Sex)) %>% 
  ungroup() %>% 
  
  ggplot(aes(x = Sex,
             y = Value,
             fill = Survived)) +
  geom_col() +
  scale_fill_manual(values = c("#e2bbb6", "#e2cfc0")) +
  labs(x = "Gender",
       y = "Number of Passengers",
       title = "Survivors Based on Gender") +
  guides(fill=guide_legend(title="Survived Status")) +
  
  theme_lbb2

## `summarise()` has grouped output by 'Sex'. You can override using the `.groups` argument.

We can definitely see a major difference between the male and female survival rates. The lifeboats on the ship went mostly to the women, and as a result more than 50% of the men didn’t make it.

Next, let’s look at survival by age. First, I’m going to split the ages into 4 categories:

children (ages 0 - 14),
young adult (ages 15 - 25),
adult (ages 26 - 55),
elderly (ages 56 and above)

# children
children <- titanic %>% 
  filter(Age <=14) %>% 
  mutate(AgeCategory = "Children")
# young adult
young_adult <- titanic %>% 
  filter(Age > 14 & Age <= 25) %>% 
  mutate(AgeCategory = "Young Adult")
# adult
adult <- titanic %>% 
  filter(Age >25 & Age <= 55) %>% 
  mutate(AgeCategory = "Adult")
# elderly
elderly <- titanic %>% 
  filter(Age >55) %>% 
  mutate(AgeCategory = "Elderly")

# combining everything
titanic <- bind_rows(children, young_adult, adult, elderly) %>% 
  arrange(PassengerId)

titanic %>% 
  group_by(AgeCategory, Survived) %>% 
  summarise(Value = length(Survived)) %>% 
  ungroup() %>% 
  
  ggplot(aes(x = AgeCategory,
             y = Value,
             fill = Survived)) + 
  geom_col() +
  scale_fill_manual(values = c("#e2bbb6", "#e2cfc0", "#e8d3d2")) +
  labs(x = "Age Category",
       y = "Number of Passengers",
       title = "Survivors Based on Age Category") +
  guides(fill=guide_legend(title="Survived Status")) +
  
  theme_lbb2 +
  theme(legend.position = "bottom")

## `summarise()` has grouped output by 'AgeCategory'. You can override using the `.groups` argument.
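The four age groups used above could also be assigned in one call with base R’s `cut()`. This is a sketch with a handful of boundary ages chosen by me, not the code that produced the plot.

```r
# Boundary ages chosen to show where each category starts and ends
ages <- c(0, 14, 15, 25, 26, 55, 56, 80)

# Intervals are closed on the right: (-Inf, 14], (14, 25], (25, 55], (55, Inf)
age_category <- cut(
  ages,
  breaks = c(-Inf, 14, 25, 55, Inf),
  labels = c("Children", "Young Adult", "Adult", "Elderly")
)

data.frame(ages, age_category)
```

Since `cut()` returns a factor whose levels follow the order of `labels`, the x axis of a plot would run from Children to Elderly instead of alphabetically.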

At first I thought that children, women, and the elderly were prioritized during the evacuation. While it’s true that the female survivor count is much higher than the male one, the elderly have the fewest survivors, followed by children, young adults, and adults. However, these are counts rather than rates, so the comparison can be misleading: the elderly are also the smallest group overall, followed by children, young adults, and adults, which is the same order as the survivor counts.
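One way to take group size out of the comparison is to look at the share of survivors in each group instead of the raw count. Here is a minimal sketch on made-up toy data; the `toy` values and the `rates` name are assumptions for illustration and say nothing about the real result.

```r
library(dplyr)

# Toy data (made-up values): a small group and a larger group
toy <- data.frame(
  AgeCategory = c("Children", "Children", "Adult", "Adult", "Adult", "Adult"),
  Survived    = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# mean() of a logical column is the share of TRUE values, i.e. the survival rate
rates <- toy %>%
  group_by(AgeCategory) %>%
  summarise(Passengers = n(),
            SurvivalRate = mean(Survived),
            .groups = "drop")

rates
```

In this toy example both groups have one survivor, yet the rates differ (0.5 against 0.25). For the chart itself, `geom_col(position = "fill")` would show the same idea by stretching every bar to 100%.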

Next we’re going to find out survival rate of each passenger class.

titanic %>% 
  group_by(Pclass, Survived) %>% 
  summarise(Value = length(Pclass)) %>% 
  ungroup() %>% 
  
  ggplot(aes(x = Pclass,
             y = Value,
             fill = Survived)) +
  geom_col() +
  scale_fill_manual(values = c("#e2bbb6", "#e2cfc0")) +
  labs(x = "Passenger Class",
       y = "Number of Passengers",
       title = "Survivors Based on Passenger Class") +
  guides(fill=guide_legend(title="Survived Status")) +
  
  theme_lbb2

## `summarise()` has grouped output by 'Pclass'. You can override using the `.groups` argument.

Based on the graph above, we can see that more than 50% of the 1st class passengers survived and almost 50% of the 2nd class passengers did too, but a much smaller share of the 3rd class passengers survived. This suggests that the safety of 1st class passengers was prioritized, even though 1st class was far smaller than 3rd class.

Conclusion

From all the graphs above, we can see that there were more male passengers than female ones, and that female passengers tended to travel with their family. Passengers were mostly aged 20-40. Fares varied, with the highest prices, of course, in 1st class, while 2nd and 3rd class prices weren’t much different. 1st class passengers mostly boarded at Cherbourg and Southampton, while very few passengers boarded at Queenstown.

Among all the passengers, female passengers fared best, making up more than 60% of the total survivors. Survivors came from every age group: children, young adults, adults, and the elderly. Judging by the survival rate per passenger class, I would also assume that most of the 1st class passengers were put on the lifeboats first, since more than 50% of them survived, compared with around 50% of 2nd class, while 3rd class fared worst at around 30%.

This story is definitely very tragic. There were not many survivors: the lifeboats departed below their maximum capacity, the crew wasn’t really trained to prepare for the worst, and the number of lifeboats had even been reduced before departure because the deck looked cluttered. If there’s anything we can learn from this story, it is to always be aware of our surroundings, to not be selfish, and to put safety first, especially never jeopardizing safety for aesthetics. Safety equipment is there for a reason, and bad things can happen anytime and anywhere. So stay safe everyone! 😊