1 Overview

This makeover aims to visualize who survived the Titanic disaster, a historical catastrophe. The data set was downloaded from Kaggle. It is interesting to look into the data and infer what was happening during the sinking and the rescue.

2 Load Packages and Data

2.1 Load Packages

The package tidyverse is an opinionated collection of R packages designed for data science, including dplyr, tidyr, stringr, readr, tibble, ggplot2, purrr and so on. plotly helps to make visualizations interactive. gridExtra and crosstalk can help to rearrange charts.

packages <- c("tidyverse", "gridExtra", "plotly", "crosstalk")

for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
## Loading required package: tidyverse
## -- Attaching packages ---------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.2     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Loading required package: gridExtra
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## Loading required package: crosstalk

2.2 Load Data

The function read_csv from package readr is used to read the Titanic dataset.

df = read_csv("E:/111Visual Analytics & Applications/Session8 Assign4/Titanic.csv")
## Parsed with column specification:
## cols(
##   PassengerId = col_double(),
##   Survived = col_double(),
##   Pclass = col_double(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Fare = col_double(),
##   Embarked = col_character()
## )

| Column Name | Explanation |
| --- | --- |
| PassengerId | Sequential passenger number |
| Survived | 0 = did not survive; 1 = survived |
| Pclass | Ticket class: 1st, 2nd, or 3rd class |
| Sex | Male or female |
| Age | Age in years |
| SibSp | Number of siblings/spouses |
| Parch | Number of parents/children |
| Fare | Ticket fare |
| Embarked | Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton |

The column specification shows the data type of each column. Survived and Pclass should be categorical variables, but read_csv parsed them as numeric. We should also check the dataset for any other problems.

summary(df)
##   PassengerId       Survived          Pclass          Sex           
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##       Age            SibSp           Parch             Fare       
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:  7.91  
##  Median :28.00   Median :0.000   Median :0.0000   Median : 14.45  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816   Mean   : 32.20  
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.: 31.00  
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000   Max.   :512.33  
##  NA's   :177                                                      
##    Embarked        
##  Length:891        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

The result shows that we should change PassengerId, SibSp, and Parch to integer variables, and change Survived and Pclass to categorical variables. The summary also shows 177 missing values in Age. After the conversion, we check the dataset again to make sure it looks right.

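# Change data types: identifiers and counts to integer, categories to factor
df$PassengerId <- as.integer(df$PassengerId)
df$SibSp <- as.integer(df$SibSp)
df$Parch <- as.integer(df$Parch)
df$Survived <- as.factor(df$Survived)
df$Pclass <- as.factor(df$Pclass)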

summary(df)
##   PassengerId    Survived Pclass      Sex                 Age       
##  Min.   :  1.0   0:549    1:216   Length:891         Min.   : 0.42  
##  1st Qu.:223.5   1:342    2:184   Class :character   1st Qu.:20.12  
##  Median :446.0            3:491   Mode  :character   Median :28.00  
##  Mean   :446.0                                       Mean   :29.70  
##  3rd Qu.:668.5                                       3rd Qu.:38.00  
##  Max.   :891.0                                       Max.   :80.00  
##                                                      NA's   :177    
##      SibSp           Parch             Fare          Embarked        
##  Min.   :0.000   Min.   :0.0000   Min.   :  0.00   Length:891        
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:  7.91   Class :character  
##  Median :0.000   Median :0.0000   Median : 14.45   Mode  :character  
##  Mean   :0.523   Mean   :0.3816   Mean   : 32.20                     
##  3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.: 31.00                     
##  Max.   :8.000   Max.   :6.0000   Max.   :512.33                     
## 

2.2.1 Proposed Sketch for Dataset

After a first glance at the dataset, I sketched my visualization plan as follows.

3 Visualization & Story

The first thing we should know is how many people survived and how many did not. This gives us a general view of the catastrophe.

Code Design: Derive a new dataframe from the original long table, with the survival outcome as one column and its count as another. Use ggplot to plot a bar chart that gives an overview of survival. Since the background of the output HTML is white, the theme is set to light (same below). Change the title and y-axis name accordingly (same below). Make the chart interactive with ggplotly (same below).
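The design above could be sketched roughly as follows. This is a minimal sketch rather than the original chunk: the data frame is rebuilt from the survival counts reported by summary(df) (549 and 342), and the names survival_counts, p, and fig, as well as the title and axis text, are assumptions.

```r
library(tidyverse)
library(plotly)

# Stand-in for df, rebuilt from the counts in summary(df): 549 died, 342 survived
df <- tibble(Survived = factor(rep(c(0, 1), times = c(549, 342))))

# Collapse the long table to one row per outcome with its count
survival_counts <- df %>%
  count(Survived, name = "Count")

# Bar chart with a light theme, a custom title, and a renamed y-axis
p <- ggplot(survival_counts, aes(x = Survived, y = Count, fill = Survived)) +
  geom_bar(stat = "identity") +
  theme_light() +
  labs(title = "Survival Overview", y = "Number of Passengers")

# Convert to an interactive plotly widget; printing fig renders it
fig <- ggplotly(p)
```

In an R Markdown chunk, leaving fig on the last line would display the interactive chart.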

The chart above shows that only 342 out of 891 passengers (about 38%) survived.

3.1 Influential Factor Detection

What factors affected the survival rate in the Titanic disaster? Let’s see what happens when we bring other variables into the visualization of survival.

First, we check whether the number of people accompanying a passenger on the trip improved the chance of survival. In addition, we would like to know whether there is a gap between genders.

Code Design: Since SibSp, Parch, and Sex are all categorical or discrete data, geom_bar() is used here to plot bar charts. The x mapping in aes() is changed for each chart.
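One possible way to write these three charts is sketched below. The toy data frame is a made-up stand-in for the Titanic data, and survival_bar is a hypothetical helper introduced only for illustration. Placing the charts side by side with crosstalk's bscols() is also an assumption about the layout.

```r
library(tidyverse)
library(plotly)
library(crosstalk)

# Hypothetical toy rows standing in for the Titanic df (values are made up)
df <- tibble(
  Survived = factor(c(0, 1, 1, 0, 0, 1)),
  Sex      = c("male", "female", "female", "male", "male", "female"),
  SibSp    = c(0, 1, 1, 0, 2, 0),
  Parch    = c(0, 0, 2, 0, 1, 1)
)

# Hypothetical helper: stacked bar chart of one variable, filled by survival
survival_bar <- function(data, var) {
  ggplot(data, aes(x = factor(.data[[var]]), fill = Survived)) +
    geom_bar() +  # geom_bar() counts the rows in each category
    theme_light() +
    labs(title = paste("Survival by", var), x = var, y = "Count")
}

p_sibsp <- survival_bar(df, "SibSp")
p_parch <- survival_bar(df, "Parch")
p_sex   <- survival_bar(df, "Sex")

# Place the three interactive charts side by side
panel <- bscols(ggplotly(p_sibsp), ggplotly(p_parch), ggplotly(p_sex))
```

Only the variable name passed to the helper changes between charts, which mirrors changing x under aes().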

The result shows that gender plays an important role in survival. However, the number of people accompanying a passenger seems to help only a little: people with one or two companions had a higher survival rate than those travelling alone. (Note: Users can click Compare data on hover in the pane above each chart to see the data of two categories simultaneously.)

Second, we check whether age, cabin class, and port of embarkation play important roles in the survival rate.

Code Design: Since Pclass and Embarked are categorical data, geom_bar() is used here to plot bar charts. The x mapping in aes() is changed for each chart. Since Age is continuous data, geom_histogram() is used to create its chart.
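A minimal sketch of this step, under stated assumptions, is shown below. The toy rows are made up, the 5-year bin width is an arbitrary choice, and dropping rows with missing Age before plotting is one way (not necessarily the original one) to handle the 177 missing values.

```r
library(tidyverse)
library(plotly)

# Hypothetical toy rows standing in for the Titanic df (values are made up)
df <- tibble(
  Survived = factor(c(0, 1, 1, 0, 0, 1)),
  Pclass   = factor(c(3, 1, 2, 3, 3, 1)),
  Embarked = c("S", "C", "S", "Q", "S", "C"),
  Age      = c(22, 38, 26, NA, 35, 54)
)

# Categorical variables: geom_bar() counts passengers per category
p_class <- ggplot(df, aes(x = Pclass, fill = Survived)) +
  geom_bar() +
  theme_light() +
  labs(title = "Survival by Cabin Class", y = "Count")

p_port <- ggplot(df, aes(x = Embarked, fill = Survived)) +
  geom_bar() +
  theme_light() +
  labs(title = "Survival by Embarked Port", y = "Count")

# Continuous variable: geom_histogram() bins Age (bin width is an assumption);
# rows with a missing Age are dropped first
p_age <- ggplot(drop_na(df, Age), aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 5) +
  theme_light() +
  labs(title = "Survival by Age", y = "Count")

fig_class <- ggplotly(p_class)
fig_port  <- ggplotly(p_port)
fig_age   <- ggplotly(p_age)
```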

The result shows that cabin class does play an important role: passengers in a higher-class cabin had a much higher survival rate. However, port of embarkation and age group do not show a strong bias.

3.2 Discover Underlying Stories

Now that we have found that gender and cabin class are the most important factors, let’s delve deeper into them.

3.2.1 Gender

The following chart aims to show the survival rate of each gender.

Code Design: Data aggregation is needed here to compute the percentage of survival for each gender. The group_by() and mutate() functions are used. Next, geom_bar() is used to create a bar chart, with Sex as x, the Percentage derived just now as y, and Survived as fill. Because y is already computed, stat="identity" must be set in geom_bar() so that it plots the values instead of counting rows. Use ggplotly to make the chart interactive, so that users see the data label when hovering their cursor over it.
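The aggregation and plot described above might look like the sketch below. The toy rows are made up (they do not reproduce the real rates), and the names gender_pct, p, and fig are assumptions.

```r
library(tidyverse)
library(plotly)

# Hypothetical toy rows: 4 females (3 survived) and 4 males (1 survived)
df <- tibble(
  Survived = factor(c(1, 1, 1, 0, 1, 0, 0, 0)),
  Sex      = rep(c("female", "male"), each = 4)
)

# Count each Sex/Survived pair, then turn counts into within-gender percentages
gender_pct <- df %>%
  count(Sex, Survived, name = "Count") %>%
  group_by(Sex) %>%
  mutate(Percentage = round(100 * Count / sum(Count), 1)) %>%
  ungroup()

# y is precomputed, so stat = "identity" plots the values as they are
p <- ggplot(gender_pct, aes(x = Sex, y = Percentage, fill = Survived)) +
  geom_bar(stat = "identity") +
  theme_light() +
  labs(title = "Survival Rate by Gender", y = "Percentage (%)")

fig <- ggplotly(p)
```

Because the percentages within each gender sum to 100, every bar has the same height and the fill color alone carries the survival rate.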

Insight 1: 74% of females survived this catastrophe, while less than 19% of males made it. We could infer that it is also a heroic story: gentlemen left the chance of survival to ladies. “Ladies first” still holds in the face of a disaster.

3.2.2 Cabin Class

The following chart aims to show the survival rate of each cabin class.

Code Design: Data aggregation is needed here to compute the percentage of survival for each cabin class. The group_by() and mutate() functions are used. Next, geom_bar() is used to create a bar chart, with Pclass as x, the Percentage derived just now as y, and Survived as fill. Because y is already computed, stat="identity" must be set in geom_bar() so that it plots the values instead of counting rows. Use ggplotly to make the chart interactive, so that users see the data label when hovering their cursor over it.
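A sketch of the cabin class version follows, again on made-up toy rows. Restricting the hover label with ggplotly's tooltip argument is an optional extra, not something the original design states. The name class_pct is an assumption.

```r
library(tidyverse)
library(plotly)

# Hypothetical toy rows: 2 first-class, 2 second-class, 4 third-class passengers
df <- tibble(
  Pclass   = factor(c(1, 1, 2, 2, 3, 3, 3, 3)),
  Survived = factor(c(1, 1, 1, 0, 1, 0, 0, 0))
)

# Within-class survival percentages
class_pct <- df %>%
  count(Pclass, Survived, name = "Count") %>%
  group_by(Pclass) %>%
  mutate(Percentage = round(100 * Count / sum(Count), 1)) %>%
  ungroup()

p <- ggplot(class_pct, aes(x = Pclass, y = Percentage, fill = Survived)) +
  geom_bar(stat = "identity") +
  theme_light() +
  labs(title = "Survival Rate by Cabin Class", y = "Percentage (%)")

# Show only the class, the percentage, and the outcome in the hover label
fig <- ggplotly(p, tooltip = c("x", "y", "fill"))
```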

Insight 2: The survival rates are 63%, 47%, and 24% for the first, second, and third class respectively. The more expensive the ticket, the better the chance of survival.

3.2.3 Gender & Cabin Class

What happens if we combine these two most important factors? Let’s see.

Code Design: Data aggregation is needed here to compute the percentage of survival for each combination of gender and cabin class. The group_by() and mutate() functions are used. Next, geom_bar() is used to create a bar chart, with Sex as x, the Percentage derived just now as y, and Survived as fill. Because y is already computed, stat="identity" must be set in geom_bar() so that it plots the values instead of counting rows. Moreover, use facet_wrap to split the chart by Pclass. Finally, use ggplotly to make the chart interactive, so that users see the data label when hovering their cursor over it.
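The combined view could be sketched as below. The toy rows are made up and cover only two classes to keep the example short, and the names combo_pct, p, and fig are assumptions.

```r
library(tidyverse)
library(plotly)

# Hypothetical toy rows: two passengers per gender and class combination
df <- tibble(
  Sex      = rep(c("female", "male"), each = 4),
  Pclass   = factor(rep(c(1, 1, 3, 3), times = 2)),
  Survived = factor(c(1, 1, 1, 0, 1, 0, 0, 0))
)

# Percentages are computed within each gender and class combination
combo_pct <- df %>%
  count(Sex, Pclass, Survived, name = "Count") %>%
  group_by(Sex, Pclass) %>%
  mutate(Percentage = round(100 * Count / sum(Count), 1)) %>%
  ungroup()

# One panel per cabin class, with gender on the x-axis inside each panel
p <- ggplot(combo_pct, aes(x = Sex, y = Percentage, fill = Survived)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ Pclass) +
  theme_light() +
  labs(title = "Survival Rate by Gender and Cabin Class", y = "Percentage (%)")

fig <- ggplotly(p)
```

Grouping by both variables is the key difference from the single-factor charts: grouping by Sex alone would make the percentages sum to 100 across classes rather than within each panel.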

Insight 3: Even though some males were in the first class, they helped females in the second and third classes to board lifeboats. Such humane behavior under great danger is precious. The story of the Titanic has endured because it combines the human arrogance that led to the mistakes, the misfortune of striking the iceberg, and, most importantly, the LOVE that shone in the darkest moment.

4 End

4.1 Difficulties

  • Several charts need to be compiled into one in the variable detection part. It is hard to find a single package that solves this problem universally. I used gridExtra and crosstalk in two different situations.

  • Also, when charts are compiled together, the font size of the titles becomes a problem. Luckily, we can use plot.title = element_text(size=?) under theme to adjust the font size of the titles.

  • The original dataset is a long table, and ggplot2 needs an aggregated dataset to plot percentages. Hence, I needed to learn how to manipulate data in R.

  • As proposed in my sketch, I wanted to visualize the final percentages with a pie chart. However, ggplot2 has no direct way to create pie charts; a bar chart has to be converted into one. The resulting chart is not attractive, with count numbers around its edge. So I changed my mind in the end and used equal-height bar charts, with the survival rate as the fill color, to show the percentages.
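The title-size adjustment and the bar-to-pie conversion mentioned in the list above can be sketched as follows. The counts come from summary(df) (549 and 342). The title text, the size value of 12, and the names survival_pct, bar, and pie are assumptions chosen for illustration.

```r
library(tidyverse)

# Overall survival counts taken from summary(df)
survival_pct <- tibble(
  Survived = factor(c(0, 1)),
  Count    = c(549, 342)
) %>%
  mutate(Percentage = round(100 * Count / sum(Count), 1))

# A single stacked bar; plot.title shrinks the title for combined layouts
bar <- ggplot(survival_pct, aes(x = "", y = Percentage, fill = Survived)) +
  geom_bar(stat = "identity", width = 1) +
  theme_light() +
  labs(title = "Overall Survival Rate", x = NULL) +
  theme(plot.title = element_text(size = 12))

# Mapping y to the angle turns the stacked bar into a pie chart
pie <- bar + coord_polar(theta = "y")
```

Printing bar and pie one after the other makes it easy to compare the two forms before choosing one.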

4.2 Gains

  • I have learnt how to do basic data manipulation with R and have come to appreciate the flexibility of R Markdown.

  • The visualization of the Titanic data gave me a brand new view of this incident. In my sketch, I thought males were more likely to survive, probably because they are stronger and could run faster. But to my surprise, females had a much higher survival rate than males did. It’s a tragedy. But at the same time, it’s a beautiful story of humanity.