1 Introduction

RMS Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in the early morning hours of 15 April 1912 after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking one of modern history’s deadliest peacetime commercial marine disasters. More than 100 years after the sinking, the Titanic still appears in many different contexts. Here we look for insight by visualizing the passenger data and comparing the characteristics of those who survived with those who died.

library(dplyr) # for data wrangling
library(ggplot2) # to visualize data
library(gridExtra) # to display multiple graphs
library(inspectdf) # for EDA
library(tidymodels) # to build tidy models
library(caret) # to pre-process data
library(yardstick) # for model performance metrics

2 Reading Data

The source of the data is https://www.kaggle.com/c/titanic

Titanic_Survival <- read.csv("train (1).csv")

The data contain the following information:

  • survival : Survival (0 = No, 1 = Yes)
  • pclass : Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • sex : Male or female
  • age : Age in years
  • sibsp : Number of siblings / spouses aboard the Titanic
  • parch : Number of parents / children aboard the Titanic
  • ticket : Ticket number
  • fare : Passenger fare
  • cabin : Cabin number
  • embarked : Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

rmarkdown::paged_table(Titanic_Survival)

3 Data Pre-Processing

Before we build the visualizations, we need to check whether the data contain any NA values.

colSums(is.na(Titanic_Survival))
#> PassengerId    Survived      Pclass        Name         Sex         Age 
#>           0           0           0           0           0         177 
#>       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
#>           0           0           0           0           0           0
sum(is.na(Titanic_Survival$Age))/nrow(Titanic_Survival)*100
#> [1] 19.86532

Age is missing for 177 observations, almost 20% of the dataset. Next, we count the blank (empty-string) values, which is.na() does not detect.

sapply(Titanic_Survival, function(x){sum(x=='')})
#> PassengerId    Survived      Pclass        Name         Sex         Age 
#>           0           0           0           0           0          NA 
#>       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
#>           0           0           0           0         687           2

The Cabin variable has many blank values (687). We will remove this variable, along with the two observations whose Embarked value is blank.
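
The NA reported for Age in the output above arises because comparing a missing value with '' yields NA, and sum() propagates it. As a minimal sketch, an NA-safe count could look like the following. The data frame `toy` and its values are invented for illustration and are not taken from the Kaggle file.

```r
# Toy data frame standing in for Titanic_Survival (values are invented).
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Count blank strings per column; na.rm = TRUE skips comparisons that yield NA.
blank_counts <- sapply(toy, function(x) sum(x == "", na.rm = TRUE))

# Count NA values per column for comparison.
na_counts <- colSums(is.na(toy))

print(blank_counts)
print(na_counts)
```

With this variant, the blank count and the NA count are reported separately for every column, so a column such as Age no longer shows NA in the blank summary.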

The observations with a missing Age will also be deleted, since they make up less than 20% of the data. As a recommendation for further work, imputing the NA values with a suitable technique would be better than simply removing the observations. We also remove the variables Name, PassengerId, and Ticket, although it might be interesting to analyze how the characteristics of each passenger's name relate to individual survival.
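
As an illustration of the imputation alternative mentioned above, here is a minimal sketch of median imputation in base R. It is not the approach used in this analysis, which drops the rows instead. The data frame `toy`, the new column names, and the choice of the median (overall or per Sex group) are assumptions made for the example.

```r
# Toy ages with missing values (invented for illustration).
toy <- data.frame(
  Sex = c("male", "female", "male", "female", "male"),
  Age = c(22, 38, NA, 30, NA)
)

# Option 1: replace every missing Age with the overall median.
age_median <- median(toy$Age, na.rm = TRUE)
toy$Age_overall <- ifelse(is.na(toy$Age), age_median, toy$Age)

# Option 2: replace each missing Age with the median of the same Sex group.
group_median <- ave(toy$Age, toy$Sex,
                    FUN = function(x) median(x, na.rm = TRUE))
toy$Age_by_sex <- ifelse(is.na(toy$Age), group_median, toy$Age)

print(toy)
```

Group-wise imputation keeps more of the structure in the data than a single global value, but either option changes the age distribution, so the histograms in the visualization section would need to be read with that in mind.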

Titanic_new <-  Titanic_Survival %>% 
  dplyr::select(-c(Cabin, PassengerId, Name, Ticket)) %>% 
  filter(!is.na(Age)) %>% # drop rows with a missing Age
  filter(Embarked != '')
glimpse(Titanic_new)
#> Rows: 712
#> Columns: 8
#> $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1...
#> $ Pclass   <int> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3...
#> $ Sex      <chr> "male", "female", "female", "female", "male", "male", "mal...
#> $ Age      <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, ...
#> $ SibSp    <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0...
#> $ Parch    <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0...
#> $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750...
#> $ Embarked <chr> "S", "C", "S", "S", "S", "S", "S", "S", "C", "S", "S", "S"...
colSums(is.na(Titanic_new))
#> Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
#>        0        0        0        0        0        0        0        0

We need to convert the variables to the proper types: Sex and Embarked should be converted to factors.

Titanic_clean <- Titanic_new %>% 
  mutate(Sex = as.factor(Sex),
         Embarked = as.factor(Embarked))
glimpse(Titanic_clean)
#> Rows: 712
#> Columns: 8
#> $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1...
#> $ Pclass   <int> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3...
#> $ Sex      <fct> male, female, female, female, male, male, male, female, fe...
#> $ Age      <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, ...
#> $ SibSp    <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0...
#> $ Parch    <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0...
#> $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750...
#> $ Embarked <fct> S, C, S, S, S, S, S, S, C, S, S, S, S, S, S, Q, S, S, S, Q...

4 Data Visualization

4.1 Age

titanic_survived <- Titanic_clean %>% 
  filter(Survived == 1)

titanic_not_survived <- Titanic_clean %>% 
  filter(Survived == 0)

hist(titanic_survived$Age, breaks=20, col = "Green", xlab = "Age",  main   = "Distribution of Survived Passengers")

hist(titanic_not_survived$Age, breaks=20, col = "Red", xlab = "Age", main = "Distribution of Non-Survived Passengers")

Many young passengers survived the sinking. This suggests that saving young passengers may have been a priority during the critical hours.
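
One way to check the impression from the two histograms numerically is to bin ages into bands and compute the survival rate within each band. The sketch below uses an invented data frame `toy`, and the band boundaries and labels are assumptions chosen for the example, not part of the original analysis.

```r
# Toy data (invented): ages and 0/1 survival flags.
toy <- data.frame(
  Age      = c(2, 9, 15, 24, 31, 40, 47, 55, 63, 71),
  Survived = c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)
)

# Bin ages into bands; right = FALSE gives intervals [0,12), [12,18), ...
toy$AgeBand <- cut(toy$Age,
                   breaks = c(0, 12, 18, 40, 60, Inf),
                   labels = c("child", "teen", "adult", "middle", "senior"),
                   right  = FALSE)

# The mean of a 0/1 column is the survival rate within each band.
rate_by_band <- tapply(toy$Survived, toy$AgeBand, mean)
print(round(rate_by_band, 2))
```

Applied to Titanic_clean, the same two steps would give one survival rate per age band, which is easier to compare than two separate histograms with different counts.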

4.2 Sex

Titanic_clean %>% group_by(Sex) %>% summarise(avgAge = mean(Age), stdev = sd(Age))
ggplot(data=Titanic_clean, aes(x=Age, fill=Sex)) + geom_density(alpha=0.5)

The male age distribution has a heavier tail between 60 and 80: the fraction of passengers in that age range is higher among males than among females.

prop.table(table(Titanic_clean$Survived, Titanic_clean$Sex))
#>    
#>         female       male
#>   0 0.08988764 0.50561798
#>   1 0.27387640 0.13061798

Male passengers were unlikely to survive the tragedy: males who died make up more than 50% of all passengers in the cleaned data.
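
The table above gives joint proportions, where each cell is divided by all 712 passengers. The survival rate within each sex is a conditional proportion, which prop.table() returns when margin = 2 is set. The sketch below rebuilds the cell counts from the proportions printed above (each multiplied by 712); the object names `counts` and `rate` are illustrative.

```r
# Cell counts recovered from the joint proportions above (proportion * 712).
# matrix() fills column by column: female = (64, 195), male = (360, 93).
counts <- matrix(c(64, 195, 360, 93), nrow = 2,
                 dimnames = list(Survived = c("0", "1"),
                                 Sex      = c("female", "male")))

# margin = 2 divides each cell by its column total, so every column sums to 1.
rate <- prop.table(counts, margin = 2)
print(round(rate, 3))
```

Read this way, roughly 75% of the females in the cleaned data survived, compared with roughly 21% of the males. On the original object, the equivalent call would be prop.table(table(Titanic_clean$Survived, Titanic_clean$Sex), margin = 2).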

4.3 Pclass

hist(titanic_survived$Pclass, col="green", main="Distribution of Surviving Passengers by Class", xlab = "Class")

hist(titanic_not_survived$Pclass, col="red", main="Distribution of Deceased Passengers by Class", xlab = "Class")

Most of the passengers who died travelled in third class, the lowest class, while first-class passengers make up the largest share of the survivors.

posn.j <- position_jitter(0.3, 1)
ggplot(Titanic_clean,aes(x=factor(Pclass),y=Age,col=factor(Sex)))+
  geom_jitter(size=3,alpha=0.5,position=posn.j)+
  facet_grid(". ~ Survived")

Among the third-class passengers who died, males dominate, while the panel of survivors contains more females, which suggests that females had a higher chance of living.

4.4 Family Member

ggplot(data = Titanic_clean, aes(x = SibSp + Parch, fill = as.factor(Survived))) +
  geom_bar(position = 'dodge')

Having one to three family members on board increased the chances of survival. This makes sense, as small groups could organize better and find space on a lifeboat.
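
The bar chart shows counts, so a survival rate per family-size group can make the comparison more direct. The following is a minimal sketch on an invented data frame `toy`. The grouping into "alone", "1-3", and "4+" and the column names Family and FamilyGroup are assumptions made for the example.

```r
# Toy data (invented): family counts and 0/1 survival flags.
toy <- data.frame(
  SibSp    = c(0, 0, 1, 1, 0, 2, 4, 5, 0, 1),
  Parch    = c(0, 0, 0, 1, 0, 1, 2, 2, 0, 2),
  Survived = c(0, 1, 1, 1, 0, 1, 0, 0, 0, 0)
)

# Family members aboard = siblings/spouses + parents/children.
toy$Family <- toy$SibSp + toy$Parch

# Group into (-Inf,0], (0,3], (3,Inf): travelling alone, 1 to 3, 4 or more.
toy$FamilyGroup <- cut(toy$Family,
                       breaks = c(-Inf, 0, 3, Inf),
                       labels = c("alone", "1-3", "4+"))

# Survival rate (mean of the 0/1 flag) for each group.
rate_by_group <- aggregate(Survived ~ FamilyGroup, data = toy, FUN = mean)
print(rate_by_group)
```

Running the same steps on Titanic_clean would show whether the "1-3" group really has the highest rate, rather than just the tallest bars.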

4.5 Fare

ggplot(data = Titanic_clean, aes(x = as.factor(Pclass), y = Fare, colour = Sex)) +
  geom_boxplot() + # boxplot of fare by class and sex
  scale_y_log10()

First-class passengers paid more than those in the other classes. Interestingly, males paid less than females within the same class.
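
Because the boxplot uses a log scale, differences between groups can be hard to judge by eye. A table of median fares by class and sex gives a numeric check. The sketch below uses an invented data frame `toy`, so its numbers say nothing about the real fares.

```r
# Toy data (invented): class, sex, and fare for eight passengers.
toy <- data.frame(
  Pclass = c(1, 1, 1, 1, 3, 3, 3, 3),
  Sex    = c("male", "male", "female", "female",
             "male", "male", "female", "female"),
  Fare   = c(50, 60, 80, 90, 7, 8, 10, 12)
)

# Median fare for every combination of class and sex.
median_fare <- aggregate(Fare ~ Pclass + Sex, data = toy, FUN = median)
print(median_fare)
```

The same aggregate() call with data = Titanic_clean would give the medians behind the boxes in the plot above.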

4.6 Embarked

posn.t <- position_jitter(0.2, 1)
ggplot(Titanic_clean,aes(x=factor(Embarked),y=Age,col=factor(Sex)))+
  geom_jitter(size=3,alpha=0.5,position=posn.t)+
  facet_grid(". ~ Survived")

Most passengers embarked at Southampton, while the fewest embarked at Queenstown. Females who embarked at Cherbourg mostly survived the tragedy, while most males who embarked at Southampton died in the disaster.

5 Final Words

Data visualization is an excellent way to inform the reader about what the data show, since telling a story with graphics is more potent than presenting a chunk of numbers. The plots above suggest that several factors affected the passengers' chances of survival. Female passengers had a higher chance of living than males, and the age distributions suggest that younger passengers were given priority for the lifeboats during the tragedy. We also see that first-class passengers tended to have a better chance of surviving than passengers in the other two classes. Passengers with one to three family members on board had better chances of survival as well. Most passengers embarked at Southampton, and the females who embarked at Cherbourg mostly survived. It is interesting that male passengers paid less than females within the same class, and further analysis might reveal why.