Objective
The selected data visualisation shows the annual number of deaths around the world for each year from 1950 to 2019, a period of 69 years. It plots the death count for six regions of the world: Africa, Asia, North America, Latin America and the Caribbean, Europe, and Oceania.
This data can be used by scientists, healthcare professionals and researchers in general to study life expectancy. Understanding the life expectancy and death rate of a region is also essential for policy makers and public works professionals, who rely on it to make crucial policy decisions for that region. The target audience for this data visualisation is therefore scientists, healthcare professionals, researchers, policy makers and public works professionals.
The visualisation chosen had the following three main issues:
Reference
The following code was used to fix the issues identified in the original.
# Importing the necessary libraries
library(ggplot2)
library(readr)
library(tidyr)
library(dplyr)
library(scales)
# Reading the Data and storing it in a dataframe
world_death_rate <- read_csv("~/Master of Data Science (RMIT)/Semester 3/Data Visualization/Assignments/Assignment 2/annual-number-of-deaths-by-world-region.csv")
head(world_death_rate)
## # A tibble: 6 x 4
## Entity Code Year `Estimates, 1950 - 2020: Annually interpolated demogra~
## <chr> <chr> <dbl> <dbl>
## 1 Afghanist~ AFG 1950 298139
## 2 Afghanist~ AFG 1951 297342
## 3 Afghanist~ AFG 1952 295841
## 4 Afghanist~ AFG 1953 294524
## 5 Afghanist~ AFG 1954 293392
## 6 Afghanist~ AFG 1955 292446.
# Renaming the columns
names(world_death_rate)[names(world_death_rate) == "Estimates, 1950 - 2020: Annually interpolated demographic indicators - Deaths (thousands)"] <- "death_count"
names(world_death_rate)[names(world_death_rate) == "Entity"] <- "entity"
names(world_death_rate)[names(world_death_rate) == "Year"] <- "year"
names(world_death_rate)[names(world_death_rate) == "Code"] <- "code"
head(world_death_rate)
## # A tibble: 6 x 4
## entity code year death_count
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan AFG 1950 298139
## 2 Afghanistan AFG 1951 297342
## 3 Afghanistan AFG 1952 295841
## 4 Afghanistan AFG 1953 294524
## 5 Afghanistan AFG 1954 293392
## 6 Afghanistan AFG 1955 292446.
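The renaming step above could also be written with dplyr's rename(), which takes new_name = old_name pairs. This is only a minimal sketch on a one-row toy data frame: `toy_raw` is hypothetical, and its `Deaths` column stands in for the long estimates column name of the real dataset.

```r
library(dplyr)

# Hypothetical one-row frame mimicking the raw column names;
# "Deaths" is a shortened stand-in for the long estimates column
toy_raw <- data.frame(
  Entity = "Afghanistan",
  Code = "AFG",
  Year = 1950,
  Deaths = 298139
)

# rename() takes new_name = old_name pairs and leaves the data untouched
toy_renamed <- toy_raw %>%
  rename(entity = Entity, code = Code, year = Year, death_count = Deaths)

names(toy_renamed)
```

All four columns are renamed in a single call, which may be easier to read than four separate assignments to names().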
# Storing the data for the continents in separate data frames
deaths_asia <- world_death_rate %>% filter(entity == "Asia")
deaths_africa <- world_death_rate %>% filter(entity == "Africa")
deaths_oceania <- world_death_rate %>% filter(entity == "Oceania")
deaths_north_america <- world_death_rate %>% filter(entity == "Northern America")
deaths_europe <- world_death_rate %>% filter(entity == "Europe")
deaths_latin_america <- world_death_rate %>% filter(entity == "Latin America and the Caribbean")
# Plotting the data to fix the issues identified (with geom = "point" in layer)
# Columns are referenced by name inside aes(); each region's data frame is passed via `data`
p1 <- ggplot(deaths_asia, aes(x = year)) +
  geom_line(aes(y = death_count, color = "red"), size = 1) +
  layer(data = deaths_asia, mapping = aes(x = year, y = death_count), stat = "identity", position = position_identity(), geom = "point") +
  geom_line(data = deaths_africa, aes(y = death_count, color = "blue"), size = 1) +
  layer(data = deaths_africa, mapping = aes(x = year, y = death_count), stat = "identity", position = position_identity(), geom = "point") +
  geom_line(data = deaths_europe, aes(y = death_count, color = "green"), size = 1) +
  layer(data = deaths_europe, mapping = aes(x = year, y = death_count), stat = "identity", position = position_identity(), geom = "point") +
  geom_line(data = deaths_oceania, aes(y = death_count, color = "yellow"), size = 1) +
  layer(data = deaths_oceania, mapping = aes(x = year, y = death_count), stat = "identity", position = position_identity(), geom = "point") +
  geom_line(data = deaths_north_america, aes(y = death_count, color = "khaki4"), size = 1) +
  layer(data = deaths_north_america, mapping = aes(x = year, y = death_count), stat = "identity", position = position_identity(), geom = "point") +
  geom_line(data = deaths_latin_america, aes(y = death_count, color = "hotpink"), size = 1) +
  layer(data = deaths_latin_america, mapping = aes(x = year, y = death_count), stat = "identity", position = position_identity(), geom = "point") +
  ggtitle("Number of Deaths around the World") +
  xlab('Year') +
  ylab('Death Count (in millions)') +
  theme_minimal() +
  scale_x_continuous(limits = c(1950, 2019)) +
  scale_y_continuous(labels = comma) +
  # Labels are listed in the alphabetical order of the colour names
  # (blue, green, hotpink, khaki4, red, yellow)
  scale_colour_identity(guide = "legend", name = "Regions", labels = c("Africa", "Europe", "Latin America and the Caribbean", "Northern America", "Asia", "Oceania"))
# Plotting the data to fix the issues identified (without the data points)
p2 <- ggplot(deaths_asia, aes(x = year)) +
  geom_line(aes(y = death_count, color = "red"), size = 1) +
  geom_line(data = deaths_africa, aes(y = death_count, color = "blue"), size = 1) +
  geom_line(data = deaths_europe, aes(y = death_count, color = "green"), size = 1) +
  geom_line(data = deaths_oceania, aes(y = death_count, color = "yellow"), size = 1) +
  geom_line(data = deaths_north_america, aes(y = death_count, color = "khaki4"), size = 1) +
  geom_line(data = deaths_latin_america, aes(y = death_count, color = "hotpink"), size = 1) +
  ggtitle("Number of Deaths around the World") +
  xlab('Year') +
  ylab('Death Count (in millions)') +
  theme_minimal() +
  scale_x_continuous(limits = c(1950, 2019)) +
  scale_y_continuous(labels = comma) +
  # Labels are listed in the alphabetical order of the colour names
  # (blue, green, hotpink, khaki4, red, yellow)
  scale_colour_identity(guide = "legend", name = "Regions", labels = c("Africa", "Europe", "Latin America and the Caribbean", "Northern America", "Asia", "Oceania"))
Data Reference
The original chart was reconstructed as a line plot of the same data. The dataset had four distinct columns, which were first renamed to make the data easier to understand. The data for each region was then stored in a separate data frame and plotted as a multiple line plot, in which each line represents one region of the world.
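As an alternative to one data frame per region, the same kind of multiple line plot can be sketched from a single long-format data frame by mapping colour to the region column. This is a minimal sketch and not the code used for the reconstruction: `toy_deaths` and all of its values are invented for illustration, and only the column names follow the renamed dataset.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data with the renamed columns (all values are invented)
toy_deaths <- data.frame(
  entity = rep(c("Asia", "Africa", "Afghanistan"), each = 3),
  year = rep(c(1950, 1985, 2019), times = 3),
  death_count = c(300, 280, 320, 100, 110, 120, 3, 3, 2)
)

# Regions to keep; country-level rows such as Afghanistan are dropped
regions <- c("Asia", "Africa")
regional_deaths <- toy_deaths %>% filter(entity %in% regions)

# Mapping colour to entity draws one line per region and builds the legend
p_sketch <- ggplot(regional_deaths, aes(x = year, y = death_count, colour = entity)) +
  geom_line() +
  labs(colour = "Regions")
```

With this layout the legend labels come from the data itself, so they do not have to be matched to the colours by hand.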
The ggplot2 library was used to plot the graph. A plot object was created and multiple lines were added to it. The main x-axis aesthetic was set to the ‘year’ column of the Asia data, because the x variable is the same for all regions. The geom_line() function was used to add a line for each region, and a different colour was set for each line in the aes() argument of geom_line(). Layers were then added with the layer() function, with its geom parameter set to ‘point’, to show the data points on the lines. This marked the exact position of every yearly value on each line. However, the points looked congested on the plot and the line colours were hardly visible.
Hence, the layer() calls can be removed from the plot to remove the data points.
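A possible middle ground, which was not tried in the reconstruction, is to keep the markers but draw them only for every tenth year. The sketch below uses geom_point(), the usual shorthand for a point layer with an identity stat and position. The `toy_region` data frame and its values are hypothetical.

```r
library(ggplot2)

# Hypothetical toy series standing in for one region (values are invented)
toy_region <- data.frame(year = 1950:2019, death_count = seq(100, 169))

# Keep only every tenth year for the markers
decade_points <- toy_region[toy_region$year %% 10 == 0, ]

# The line uses all years; the points use the thinned data only
p_points <- ggplot(toy_region, aes(x = year, y = death_count)) +
  geom_line(colour = "red") +
  geom_point(data = decade_points)
```

Thinning the markers keeps the line colour visible while still anchoring a few exact values, at the cost of hiding the remaining yearly points.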
The scale_x_continuous() function was used to set the limits of the x axis. By default, the y-axis labels were shown in scientific (exponential) notation. The label type in scale_y_continuous() was therefore changed to ‘comma’ so that the scale figures are displayed as ordinary numbers with thousands separators.
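The difference between the two label styles can be illustrated outside the plot. This is a small sketch with made-up values, assuming R's default print options: format() is base R, and comma() is the formatter from the scales package that is passed to scale_y_continuous().

```r
library(scales)

# Made-up death counts of the magnitude seen on the y axis
deaths <- c(1500000, 30000000)

# Base R switches to scientific notation for large round values
format(30000000)
# "3e+07"

# comma() inserts thousands separators instead
comma(deaths)
# "1,500,000"  "30,000,000"
```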
The problems identified in the original chart were fixed as follows.
The visualisation technique was changed to a multiple line plot. The time series for the different regions were plotted on a common scale.
The layout of the regions was also fixed by representing each region with a separate line. As the areas were replaced by lines, the amount of colour in the graph was also reduced. Hence, visual bombardment is no longer present.
The inaccuracy in area disappeared, as the regions are now represented by lines rather than areas.