Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original


Source: Our World in Data - Life Expectancy (2013).


Objective

The selected data visualisation shows the annual number of deaths around the world. The data is given for each year from 1950 to 2019, a span of 69 years, and is broken down by world region: Africa, Asia, North America, Latin America and the Caribbean, Europe, and Oceania.

This data can be used in the study of life expectancy by scientists, healthcare professionals and researchers in general. The life expectancy and death rate of a region also help policy makers and public works professionals make crucial policy decisions for that region. Therefore, the target audience for this data visualisation is scientists, healthcare professionals, researchers, policy makers and public works professionals.

The visualisation chosen had the following three main issues:

  • Incorrect Use of Visualisation Technique: The original graph is a stacked area chart, which represents the categories of a single variable. Here, the categories are the regions. However, the data is a set of time series, one per region, and a simple line plot shows the trend of each series more clearly.
  • Confusing Layout and Visual Bombardment: Each region is filled with a different colour, which makes the figure too colourful for the viewer, and it is difficult to read the actual numbers off the chart. The stacked layout also suggests a serial relationship between the regions. That is not the case, as the dataset consists of a separate time series for each region.
  • Inaccurate Area: The size of each area does not match the actual proportion of the death counts. The areas for Africa, North America, Europe, and Latin America and the Caribbean do not have a correct start and end. They seem to be plotted to fit inside the area created by Asia and forced to appear stacked.

Code

The following code was used to fix the issues identified in the original.

# Importing the necessary libraries
library(ggplot2)
library(readr)
library(tidyr) 
library(dplyr)
library(scales)

# Reading the Data and storing it in a dataframe
world_death_rate <- read_csv("~/Master of Data Science (RMIT)/Semester 3/Data Visualization/Assignments/Assignment 2/annual-number-of-deaths-by-world-region.csv")
head(world_death_rate)
## # A tibble: 6 x 4
##   Entity     Code   Year `Estimates, 1950 - 2020: Annually interpolated demogra~
##   <chr>      <chr> <dbl>                                                   <dbl>
## 1 Afghanist~ AFG    1950                                                 298139 
## 2 Afghanist~ AFG    1951                                                 297342 
## 3 Afghanist~ AFG    1952                                                 295841 
## 4 Afghanist~ AFG    1953                                                 294524 
## 5 Afghanist~ AFG    1954                                                 293392 
## 6 Afghanist~ AFG    1955                                                 292446.
# Renaming the columns
names(world_death_rate)[names(world_death_rate) == "Estimates, 1950 - 2020: Annually interpolated demographic indicators - Deaths (thousands)"] <- "death_count"
names(world_death_rate)[names(world_death_rate) == "Entity"] <- "entity"
names(world_death_rate)[names(world_death_rate) == "Year"] <- "year"
names(world_death_rate)[names(world_death_rate) == "Code"] <- "code"
head(world_death_rate)
## # A tibble: 6 x 4
##   entity      code   year death_count
##   <chr>       <chr> <dbl>       <dbl>
## 1 Afghanistan AFG    1950     298139 
## 2 Afghanistan AFG    1951     297342 
## 3 Afghanistan AFG    1952     295841 
## 4 Afghanistan AFG    1953     294524 
## 5 Afghanistan AFG    1954     293392 
## 6 Afghanistan AFG    1955     292446.
# Storing the data for the continents in separate data frames
deaths_asia <- world_death_rate %>% filter(entity == "Asia")
deaths_africa <- world_death_rate %>% filter(entity == "Africa")
deaths_oceania <- world_death_rate %>% filter(entity == "Oceania")
deaths_north_america <- world_death_rate %>% filter(entity == "Northern America")
deaths_europe <- world_death_rate %>% filter(entity == "Europe")
deaths_latin_america <- world_death_rate %>% filter(entity == "Latin America and the Caribbean")

# Plotting the data to fix the issues identified (with geom = "point" in layer)
p1 <- ggplot(deaths_asia, aes(x = deaths_asia$year), legend = TRUE) + 
  
  geom_line(aes(y = deaths_asia$death_count, color = "red"), size = 1) + 
  layer(data = deaths_asia, mapping = aes(x = deaths_asia$year, y = deaths_asia$death_count), stat = "identity", position = position_identity(), geom =  "point") +
  
  geom_line(aes(y = deaths_africa$death_count, color = "blue"), size = 1) + 
  layer(data = deaths_africa, mapping = aes(x = deaths_africa$year, y = deaths_africa$death_count), stat = "identity", position = position_identity(), geom =  "point") +
  
  geom_line(aes(y = deaths_europe$death_count, color = "green"), size = 1) + 
  layer(data = deaths_europe, mapping = aes(x = deaths_europe$year, y = deaths_europe$death_count), stat = "identity", position = position_identity(), geom =  "point") +
  
  geom_line(aes(y = deaths_oceania$death_count, color = "yellow"), size = 1) + 
  layer(data = deaths_oceania, mapping = aes(x = deaths_oceania$year, y = deaths_oceania$death_count), stat = "identity", position = position_identity(), geom =  "point") +
  
  geom_line(aes(y = deaths_north_america$death_count, color = "khaki4"), size = 1) + 
  layer(data = deaths_north_america, mapping = aes(x = deaths_north_america$year, y = deaths_north_america$death_count), stat = "identity", position = position_identity(), geom =  "point") +
  
  geom_line(aes(y = deaths_latin_america$death_count, color = "hotpink"), size = 1) + 
  layer(data = deaths_latin_america, mapping = aes(x = deaths_latin_america$year, y = deaths_latin_america$death_count), stat = "identity", position = position_identity(), geom =  "point") +
  
  ggtitle("Number of Deaths around the World") + 
  xlab('Year') + 
  ylab('Death Count (in millions)') + 
  theme_minimal() + 
  scale_x_continuous(limits=c(1950, 2019)) + 
  scale_y_continuous(labels = comma) + 
  scale_colour_identity(guide ="legend", name = "Regions", labels = c("Africa", "Europe", "Latin America and the Caribbean", "Northern America", "Asia", "Oceania"))

# Plotting the data to fix the issues identified (without the data points)
p2 <- ggplot(deaths_asia, aes(x = deaths_asia$year), legend = TRUE) + 
  
  geom_line(aes(y = deaths_asia$death_count, color = "red"), size = 1) + 
  geom_line(aes(y = deaths_africa$death_count, color = "blue"), size = 1) + 
  geom_line(aes(y = deaths_europe$death_count, color = "green"), size = 1) + 
  geom_line(aes(y = deaths_oceania$death_count, color = "yellow"), size = 1) + 
  geom_line(aes(y = deaths_north_america$death_count, color = "khaki4"), size = 1) + 
  geom_line(aes(y = deaths_latin_america$death_count, color = "hotpink"), size = 1) + 
  ggtitle("Number of Deaths around the World") + 
  xlab('Year') + 
  ylab('Death Count (in millions)') + 
  theme_minimal() + 
  scale_x_continuous(limits=c(1950, 2019)) + 
  scale_y_continuous(labels = comma) + 
  scale_colour_identity(guide ="legend", name = "Regions", labels = c("Africa", "Europe", "Latin America and the Caribbean", "Northern America", "Asia", "Oceania"))

Reconstruction

The original chart was reconstructed as a line plot of the same data. The dataset has four columns, which were first renamed to make the data easier to work with. The data for each region was then stored in a separate data frame and plotted as a multiple line plot, in which each line represents one region of the world.
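The renaming and filtering steps can also be written more compactly. The following is a minimal sketch, not the code used for the reconstruction: it assumes dplyr is available, uses a small toy data frame with made-up values in place of the CSV file, and the column name "Deaths (toy column)" and the object deaths_by_region are placeholders introduced only for illustration.

```r
library(dplyr)

# Toy stand-in for the CSV file; all values are made up for illustration.
world_death_rate <- data.frame(
  Entity = c("Asia", "Asia", "Africa", "Africa", "Afghanistan"),
  Code = c(NA, NA, NA, NA, "AFG"),
  Year = c(1950, 1951, 1950, 1951, 1950),
  "Deaths (toy column)" = c(10, 11, 5, 6, 1),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Rename all four columns in one call (new_name = old_name).
world_death_rate <- world_death_rate %>%
  rename(
    entity = Entity,
    code = Code,
    year = Year,
    death_count = `Deaths (toy column)`
  )

# Keep only the six world regions, in a single long data frame,
# instead of creating one data frame per region.
regions <- c("Africa", "Asia", "Europe", "Oceania",
             "Northern America", "Latin America and the Caribbean")
deaths_by_region <- world_death_rate %>%
  filter(entity %in% regions)
```

Keeping all regions in one long data frame means a region can be added or dropped by editing the regions vector only.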

The ggplot2 library was used to plot the graph. A plot object was created and one line per region was added to it. The main x-axis aesthetic was set to the ‘year’ column from the Asia data, because the x variable is the same for all regions. The geom_line() function was used to add a line for each region, and a different colour was set for each line inside the aes() call of geom_line(). Initially, layer() calls with the geom parameter set to ‘point’ were also added to mark the data points on the lines. This showed the exact position of each data point. However, the points looked congested on the plot and the line colours were hardly visible.

Hence, the layer() calls were removed in the second version of the plot, which shows only the lines without the data points.
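As an alternative to adding one geom_line() per region, the colour aesthetic can be mapped to the region column of a single long data frame. The sketch below is only an illustration of that idea under stated assumptions: the data frame deaths_by_region and its values are made up, and the plot is not the one produced by the code in this report.

```r
library(ggplot2)
library(scales)

# Toy long-format data: one row per region and year (values are made up).
deaths_by_region <- data.frame(
  entity = rep(c("Asia", "Africa"), each = 3),
  year = rep(c(1950, 1951, 1952), times = 2),
  death_count = c(300, 310, 320, 100, 110, 120)
)

# One geom_line() call draws a line per region because colour is
# mapped to the entity column; ggplot2 builds the legend itself.
p <- ggplot(deaths_by_region,
            aes(x = year, y = death_count, colour = entity)) +
  geom_line() +
  labs(title = "Number of Deaths around the World",
       x = "Year", y = "Death Count", colour = "Regions") +
  theme_minimal() +
  scale_y_continuous(labels = comma)

# Optional variant with a marker on every data point.
p_points <- p + geom_point()
```

With this layout the legend labels come from the data, so they cannot drift out of step with the line colours.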

The scale_x_continuous() function was used to set the limits of the x axis to 1950 and 2019. With the default labels, the large values on the y axis were shown in scientific (exponential) notation. Passing the comma formatter from the scales package to scale_y_continuous() shows the axis values as ordinary numbers with thousands separators.
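The effect of the comma formatter can be checked outside a plot. This is a small sketch assuming the scales package is installed; the two input values are arbitrary examples, not figures from the dataset.

```r
library(scales)

# Arbitrary large values, similar in magnitude to axis breaks.
axis_values <- c(1000000, 25000000)

# Scientific (exponential) notation, for comparison.
scientific_labels <- format(axis_values, scientific = TRUE)

# comma() inserts thousands separators and avoids exponents.
comma_labels <- comma(axis_values)
print(comma_labels)
# "1,000,000" "25,000,000"
```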

The problems identified in the original chart were fixed as follows.

  • The visualisation technique was changed to a multiple line plot. The time series for the different regions were plotted on a common scale.

  • The layout of the regions was also fixed by representing each region with a separate line. As the areas were replaced by lines, the amount of colour in the graph was also reduced. Hence, visual bombardment is no longer present.

  • The inaccuracy in area disappeared as the regions are now represented using lines.