This makeover aims to visualize survival in the Titanic incident, a historical catastrophe. The data set was downloaded from Kaggle, see here. It would be interesting to look into the data and infer what happened during the sinking and the rescue.
The package tidyverse is an opinionated collection of R packages designed for data science, including dplyr, tidyr, stringr, readr, tibble, ggplot2, purrr and so on. plotly helps to make visualizations interactive. gridExtra and crosstalk can help to rearrange charts.
packages <- c("tidyverse", "gridExtra", "plotly", "crosstalk")
for (p in packages) {
  # Install the package only if it is not available yet
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
## Loading required package: tidyverse
## -- Attaching packages ---------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.2 v dplyr 1.0.0
## v tidyr 1.1.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: crosstalk
The function read_csv from package readr is used to read the Titanic dataset.
df = read_csv("E:/111Visual Analytics & Applications/Session8 Assign4/Titanic.csv")
## Parsed with column specification:
## cols(
## PassengerId = col_double(),
## Survived = col_double(),
## Pclass = col_double(),
## Sex = col_character(),
## Age = col_double(),
## SibSp = col_double(),
## Parch = col_double(),
## Fare = col_double(),
## Embarked = col_character()
## )
| Column Name | Explanation |
|---|---|
| PassengerId | A number series |
| Survived | 0 = did not survive; 1 = survived |
| Pclass | Ticket class: 1st, 2nd, or 3rd |
| Sex | Male or female |
| Age | Age in years |
| SibSp | Number of siblings/spouses |
| Parch | Number of parents/children |
| Fare | Ticket fare |
| Embarked | Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton |
The column specification shows the data type of each column. Survived and Pclass should be categorical variables, but read_csv parsed them as numeric (col_double). We should also check the dataset for any other problems.
summary(df)
## PassengerId Survived Pclass Sex
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Age SibSp Parch Fare
## Min. : 0.42 Min. :0.000 Min. :0.0000 Min. : 0.00
## 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 7.91
## Median :28.00 Median :0.000 Median :0.0000 Median : 14.45
## Mean :29.70 Mean :0.523 Mean :0.3816 Mean : 32.20
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.: 31.00
## Max. :80.00 Max. :8.000 Max. :6.0000 Max. :512.33
## NA's :177
## Embarked
## Length:891
## Class :character
## Mode :character
##
##
##
##
The result suggests that PassengerId, SibSp, and Parch should be integer variables, and that Survived and Pclass should be categorical variables. The code below converts Survived and Pclass to factors, and then we check the dataset again to make sure it looks right.
# Change data type: Survived and Pclass become factors
df$Survived <- as.factor(df$Survived)
df$Pclass <- as.factor(df$Pclass)
summary(df)
## PassengerId Survived Pclass Sex Age
## Min. : 1.0 0:549 1:216 Length:891 Min. : 0.42
## 1st Qu.:223.5 1:342 2:184 Class :character 1st Qu.:20.12
## Median :446.0 3:491 Mode :character Median :28.00
## Mean :446.0 Mean :29.70
## 3rd Qu.:668.5 3rd Qu.:38.00
## Max. :891.0 Max. :80.00
## NA's :177
## SibSp Parch Fare Embarked
## Min. :0.000 Min. :0.0000 Min. : 0.00 Length:891
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 7.91 Class :character
## Median :0.000 Median :0.0000 Median : 14.45 Mode :character
## Mean :0.523 Mean :0.3816 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 Max. :512.33
##
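The integer conversion mentioned earlier is not part of the code above. A minimal sketch of how it could be done with dplyr is shown here; `demo` and its rows are made up for illustration, and in the document the same call would be applied to `df`. It assumes dplyr 1.0.0 or later, where `across()` is available.

```r
library(dplyr)

# Made-up rows for illustration only (not real passengers); use df in practice
demo <- tibble::tibble(
  PassengerId = c(1, 2, 3),
  SibSp       = c(1, 0, 2),
  Parch       = c(0, 0, 1)
)

# Convert the whole-number columns from double to integer
demo <- demo %>%
  mutate(across(c(PassengerId, SibSp, Parch), as.integer))

str(demo)
```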
After a first glance at the dataset, I plan to carry out the visualization as follows.
The first thing we should know is how many people survived and how many did not. This gives us a general view of the catastrophe.
Code Design: Derive a new data frame from the original long table, with the survival outcome as one column and its count as another. Use ggplot to plot a bar chart that gives an overview of survival. Since the background of the output HTML is white, the theme is set to light (same below). Change the title and the y-axis name accordingly (same below). Make the chart interactive with ggplotly (same below).
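The design above could be sketched roughly as follows. This is an illustration rather than the original chunk: `demo` holds made-up rows standing in for `df`, and the names `survival_counts` and `Count` as well as the title text are assumptions.

```r
library(dplyr)
library(ggplot2)
library(plotly)

# Made-up rows for illustration only (not real passengers); use df in practice
demo <- tibble::tibble(Survived = factor(c(0, 0, 0, 1, 1, 0, 1, 0)))

# One row per survival outcome, with its count in a second column
survival_counts <- demo %>%
  count(Survived, name = "Count")

p <- ggplot(survival_counts, aes(x = Survived, y = Count, fill = Survived)) +
  geom_bar(stat = "identity") +   # y is already aggregated, so no counting here
  theme_light() +
  labs(title = "Overview of survival", y = "Number of passengers")

# Turn the static chart into an interactive one with hover labels
ggplotly(p)
```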
From the above chart we can see that only 342 of the 891 passengers survived.
What factors affected the survival rate in the Titanic incident? Let’s see what happens if we add other variables to the visualization of survival.
First, we check whether the number of people accompanying a passenger during the trip improves the survival rate. In addition, we would like to know whether there is a gap between genders.
Code Design: Since SibSp, Parch, and Sex are treated as categorical data here (SibSp and Parch are small whole-number counts), geom_bar() is used to plot bar charts. The x aesthetic under aes() is changed accordingly.
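A sketch of this step, under assumptions: `demo` contains made-up rows in place of `df`, `count_bar()` is a hypothetical helper introduced only to avoid repeating the plotting code, and `crosstalk::bscols()` is one possible way to place the interactive charts side by side.

```r
library(ggplot2)
library(plotly)
library(crosstalk)

# Made-up rows for illustration only (not real passengers); use df in practice
demo <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 1, 0, 0, 1)),
  Sex      = c("male", "female", "female", "male",
               "female", "male", "male", "female"),
  SibSp    = c(1, 1, 0, 0, 2, 0, 0, 1),
  Parch    = c(0, 0, 0, 0, 1, 0, 2, 1)
)

# Hypothetical helper: bar chart of passenger counts for one variable,
# with the bars split by survival outcome
count_bar <- function(data, var, title) {
  ggplot(data, aes(x = factor(.data[[var]]), fill = Survived)) +
    geom_bar() +
    theme_light() +
    labs(title = title, x = var, y = "Number of passengers")
}

p_sibsp <- count_bar(demo, "SibSp", "Siblings/spouses")
p_parch <- count_bar(demo, "Parch", "Parents/children")
p_sex   <- count_bar(demo, "Sex", "Gender")

# Place the three interactive charts in one row
bscols(ggplotly(p_sibsp), ggplotly(p_parch), ggplotly(p_sex))
```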
The result shows that gender plays an important role in survival. However, the number of people accompanying a passenger seems to help only a little: passengers with one or two companions had a higher survival rate than those travelling alone. (Note: Users can click Compare data on hover in the toolbar above each chart to see the data of both categories simultaneously.)
Second, we check whether age, cabin class, and port of embarkation play important roles in the survival rate.
Code Design: Since Pclass and Embarked are categorical data, geom_bar() is used to plot bar charts. The x aesthetic under aes() is changed accordingly. Since Age is continuous data, geom_histogram() is used to create its chart.
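This step might look roughly like the sketch below. The rows in `demo` are made up and stand in for `df`; the bin width of 10 years and the use of `gridExtra::grid.arrange()` for the layout are assumptions, not necessarily what the original charts used.

```r
library(ggplot2)
library(gridExtra)

# Made-up rows for illustration only (not real passengers); use df in practice
demo <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 1, 0, 0, 1)),
  Pclass   = factor(c(3, 1, 2, 3, 1, 3, 2, 1)),
  Embarked = c("S", "C", "S", "Q", "C", "S", "S", "C"),
  Age      = c(22, 38, 26, NA, 35, 54, 2, 27)
)

# Categorical variables: bar charts of counts, split by survival
p_class <- ggplot(demo, aes(x = Pclass, fill = Survived)) +
  geom_bar() +
  theme_light() +
  labs(title = "Cabin class", y = "Number of passengers")

p_port <- ggplot(demo, aes(x = Embarked, fill = Survived)) +
  geom_bar() +
  theme_light() +
  labs(title = "Port of embarkation", y = "Number of passengers")

# Continuous variable: histogram; rows with missing Age are dropped silently
p_age <- ggplot(demo, aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 10, na.rm = TRUE) +
  theme_light() +
  labs(title = "Age", y = "Number of passengers")

# Draw the three static charts side by side
grid.arrange(p_class, p_port, p_age, ncol = 3)
```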
The result shows that cabin class does play an important role. From the cabin class chart, those in a higher class had a much higher survival rate. However, the charts for port of embarkation and age group do not show a strong difference.
Now that we have found gender and cabin class to be the most important factors, let’s delve deeper into them.
The following chart aims to show the survival rate of each gender.
Code Design: Data aggregation is needed here to compute the percentage of survival for each gender. The group_by() and mutate() functions are used. Next, geom_bar() is used to create a bar chart, with Sex as x, the Percentage derived just now as y, and Survived as fill. Because y is already computed, stat="identity" must be set in geom_bar(); otherwise geom_bar() would try to count rows itself. Use ggplotly to make the chart interactive, so that users can see the data label when hovering over it.
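A minimal sketch of this aggregation and chart, assuming made-up rows in `demo` instead of `df`; the name `sex_pct` and the rounding to one decimal are illustrative choices.

```r
library(dplyr)
library(ggplot2)
library(plotly)

# Made-up rows for illustration only (not real passengers); use df in practice
demo <- tibble::tibble(
  Sex      = c("male", "male", "male", "male",
               "female", "female", "female", "female"),
  Survived = factor(c(0, 0, 0, 1, 1, 1, 1, 0))
)

# Count each Sex/Survived combination, then turn counts into
# percentages within each gender
sex_pct <- demo %>%
  count(Sex, Survived) %>%
  group_by(Sex) %>%
  mutate(Percentage = round(n / sum(n) * 100, 1)) %>%
  ungroup()

p <- ggplot(sex_pct, aes(x = Sex, y = Percentage, fill = Survived)) +
  geom_bar(stat = "identity") +   # use the precomputed percentages as bar heights
  theme_light() +
  labs(title = "Survival rate by gender", y = "Percentage")

ggplotly(p)
```

Because the percentages within each gender add up to 100, the stacked bars all have the same height and only the fill proportions differ.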
Insight 1: 74% of females survived this catastrophe, while fewer than 19% of males made it. We could infer that it is also a heroic story: gentlemen left the chance of survival to ladies. “Ladies first” still held in the face of disaster.
The following chart aims to show the survival rate of each cabin class.
Code Design: Data aggregation is needed here to compute the percentage of survival for each cabin class. The group_by() and mutate() functions are used. Next, geom_bar() is used to create a bar chart, with Pclass as x, the Percentage derived just now as y, and Survived as fill. Because y is already computed, stat="identity" must be set in geom_bar(). Use ggplotly to make the chart interactive, so that users can see the data label when hovering over it.
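Since this aggregation has the same shape for any grouping column, it could be wrapped in a small function. The sketch below is one way to do that; `survival_pct()` is a hypothetical helper, not part of the original code, and `demo` holds made-up rows in place of `df`. It relies on the `{{ }}` operator, which needs rlang 0.4.0 or later.

```r
library(dplyr)
library(ggplot2)
library(plotly)

# Made-up rows for illustration only (not real passengers); use df in practice
demo <- tibble::tibble(
  Pclass   = factor(c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3)),
  Survived = factor(c(1, 1, 1, 0, 1, 0, 0, 0, 0, 1))
)

# Hypothetical helper: percentage of each survival outcome
# within every level of group_var
survival_pct <- function(data, group_var) {
  data %>%
    count({{ group_var }}, Survived) %>%
    group_by({{ group_var }}) %>%
    mutate(Percentage = n / sum(n) * 100) %>%
    ungroup()
}

class_pct <- survival_pct(demo, Pclass)

p <- ggplot(class_pct, aes(x = Pclass, y = Percentage, fill = Survived)) +
  geom_bar(stat = "identity") +
  theme_light() +
  labs(title = "Survival rate by cabin class", y = "Percentage")

ggplotly(p)
```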
Insight 2: The survival rates are 63%, 47%, and 24% for the first, second, and third class respectively. The more expensive the ticket, the better the chance of survival.
What happens if we combine these two most important factors? Let’s see.
Code Design: Data aggregation is needed here to compute the percentage of survival for each combination of gender and cabin class. The group_by() and mutate() functions are used. Next, geom_bar() is used to create a bar chart, with Sex as x, the Percentage derived just now as y, and Survived as fill. Because y is already computed, stat="identity" must be set in geom_bar(). Moreover, facet_wrap is used to add Pclass as panels. Finally, use ggplotly to make the chart interactive, so that users can see the data label when hovering over it.
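The combined view could be sketched as below, again with made-up rows in `demo` standing in for `df`; `combo_pct` and the title text are illustrative names.

```r
library(dplyr)
library(ggplot2)
library(plotly)

# Made-up rows for illustration only (not real passengers); use df in practice
demo <- tibble::tibble(
  Pclass   = factor(c(1, 1, 1, 1, 3, 3, 3, 3)),
  Sex      = c("female", "female", "male", "male",
               "female", "female", "male", "male"),
  Survived = factor(c(1, 1, 1, 0, 1, 0, 0, 0))
)

# Percentage of each outcome within every class/gender combination
combo_pct <- demo %>%
  count(Pclass, Sex, Survived) %>%
  group_by(Pclass, Sex) %>%
  mutate(Percentage = n / sum(n) * 100) %>%
  ungroup()

p <- ggplot(combo_pct, aes(x = Sex, y = Percentage, fill = Survived)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ Pclass) +          # one panel per cabin class
  theme_light() +
  labs(title = "Survival rate by gender and cabin class", y = "Percentage")

ggplotly(p)
```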
Insight 3: The chart suggests that even males in the first class helped females in the second and third class to board the lifeboats. Such humane behavior under great danger is precious. The reason the story of the Titanic has lasted so long without being erased by the passage of time is that it contains the human arrogance that caused those mistakes, the misfortune of hitting the iceberg, and, most importantly, the LOVE that shines in the darkest moment.
Several charts needed to be combined into one in the variable detection part. It is hard to find a single package that solves the problem universally. I used gridExtra and crosstalk in two different situations.
Also, when charts are combined, the font size of the titles becomes a problem. Luckily, we can use plot.title = element_text(size=?) under theme to adjust the font size of titles.
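As a small illustration of the title adjustment, the sketch below shrinks the titles of two charts before placing them side by side. The size of 10 is an arbitrary choice, `small_title` is an illustrative name, and `demo` contains made-up rows.

```r
library(ggplot2)
library(gridExtra)

# Made-up rows for illustration only (not real passengers); use df in practice
demo <- data.frame(
  Sex    = c("male", "female", "female", "male"),
  Pclass = factor(c(3, 1, 2, 3))
)

# Light theme with a smaller title, reused for every chart in the grid
small_title <- theme_light() +
  theme(plot.title = element_text(size = 10))

p1 <- ggplot(demo, aes(x = Sex)) +
  geom_bar() +
  labs(title = "Passengers by gender") +
  small_title

p2 <- ggplot(demo, aes(x = Pclass)) +
  geom_bar() +
  labs(title = "Passengers by cabin class") +
  small_title

grid.arrange(p1, p2, ncol = 2)
```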
The original dataset is a long table, and the percentage charts in ggplot2 need an aggregated dataset. Hence, I needed to learn how to manipulate data in R.
As proposed in my sketch, I wanted to visualize the final percentages with a pie chart. However, ggplot2 has no direct way to create pie charts; a bar chart has to be converted into one. The resulting chart was not aesthetic, with count numbers around it. So I changed my idea in the end and used an equal-height bar chart, with survival as the fill color, to show the percentages.
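For reference, the usual way to turn a bar chart into a pie chart in ggplot2 is to draw a single stacked bar and switch to polar coordinates. The sketch below shows that idea with made-up counts; it is not the chart from the original sketch, and `theme_void()` is used here only as one way to hide the axis text around the circle.

```r
library(ggplot2)

# Made-up counts for illustration only; in practice these come from the data
survival_counts <- data.frame(
  Survived = factor(c(0, 1)),
  Count    = c(5, 3)
)

# A single stacked bar, bent into a circle by coord_polar()
p_pie <- ggplot(survival_counts, aes(x = "", y = Count, fill = Survived)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  labs(title = "Survival as a pie chart")

print(p_pie)
```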
I have learnt how to do basic data manipulation with R and have experienced the flexibility of R Markdown.
The visualization of the Titanic data gave me a brand new view of this incident. In my sketch, I thought males were more likely to survive, probably because they were stronger and could run faster. But to my surprise, females had a much higher survival rate than males did. It’s a tragedy. But at the same time, it’s a beautiful story of humanity.