Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original


Source: IHME, Global Burden of Disease (GBD).


Objective

The chosen visualization aims to show the total number of deaths caused by cancer worldwide in different age groups from 1990 to 2017. The target audience includes members of the public who want to learn more about cancer statistics and the effects of cancer, cancer patients, health department officials, and cancer researchers.

The visualisation chosen had the following three main issues:

  • Similar colours: the visualization uses two nearly identical shades of green (light and dark) right next to each other, which may confuse the reader.

  • Deception due to poor scaling and overplotting: at first glance, the number of deaths in 1990 for people above 70 years appears to be around 4-6 million, whereas the data analysed here gives a count of around 10 million for that group in 1990. The poor scaling and overplotting lead readers to misread the values.

  • Wrong choice of graph: the chart type makes it hard for a user to track or understand how the death counts vary over time. The data combines a categorical variable (age group) with a time series, so a time series plot that shows how the counts change over the years is the better choice.

Reference

Code

The following code was used to fix the issues identified in the original.

#Loading libraries
library(ggplot2)
library(dplyr)
library(readr)
library(lubridate)
library(tidyr)
library(scales)

#Reading the chosen dataset 
cancer_df <- read_csv("C:/Users/thyagu/rmit/semester 3/DVC/Assi2/cancer-deaths-by-age.csv")
#Renaming column names for simpler use
cancer_df <- cancer_df %>% 
  rename(
    Death_under_age5 = `Deaths - Neoplasms - Sex: Both - Age: Under 5 (Number)`,
    Death_age5_to_age14 = `Deaths - Neoplasms - Sex: Both - Age: 5-14 years (Number)`,
    Death_age15_to_age49 = `Deaths - Neoplasms - Sex: Both - Age: 15-49 years (Number)`,
    Death_age50_to_age69 = `Deaths - Neoplasms - Sex: Both - Age: 50-69 years (Number)`,
    Death_above_age70 = `Deaths - Neoplasms - Sex: Both - Age: 70+ years (Number)`
    )
#Summing the number of deaths per year, separately for each age category
a <- cancer_df %>% group_by(Year) %>% summarise(Val_a = sum(Death_under_age5, na.rm = TRUE))
b <- cancer_df %>% group_by(Year) %>% summarise(Val_b = sum(Death_age5_to_age14, na.rm = TRUE))
c <- cancer_df %>% group_by(Year) %>% summarise(Val_c = sum(Death_age15_to_age49, na.rm = TRUE))
d <- cancer_df %>% group_by(Year) %>% summarise(Val_d = sum(Death_age50_to_age69, na.rm = TRUE))
e <- cancer_df %>% group_by(Year) %>% summarise(Val_e = sum(Death_above_age70, na.rm = TRUE))

#Create a new dataframe called cancer_df_new by using the above sum of count of deaths grouped by year 

cancer_df_new <-data.frame(under_age5 = a$Val_a,
                           betw_age5to14 = b$Val_b,
                           betw_age15to49 = c$Val_c,
                           betw_age50to69 = d$Val_d,
                           above_age70 = e$Val_e,
                           Year = a$Year)
head(cancer_df_new,5)
##   under_age5 betw_age5to14 betw_age15to49 betw_age50to69 above_age70 Year
## 1   328315.9      360135.8        4212007       12401442    10858030 1990
## 2   327373.4      370123.7        4313488       12598342    11174685 1991
## 3   323350.2      381875.0        4465921       12757476    11478261 1992
## 4   318516.8      388858.2        4620017       13003419    11874614 1993
## 5   312528.4      390315.2        4763845       13121812    12190458 1994
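
The five separate `summarise()` calls above can be collapsed into one. The following is a minimal sketch on a small made-up data frame (the values and the `Entity` column name are assumptions for illustration, not taken from the dataset). It uses `dplyr::across()` to sum every death column per year in a single pass. It also shows why the row selection matters: if a file contains aggregate rows, such as a world total alongside individual countries, summing over all rows counts the same deaths more than once.

```r
library(dplyr)

# Toy data: hypothetical values, NOT from the GBD file.
# `Entity` is an assumed column name for the country/region label.
toy <- tibble(
  Entity = c("World", "World", "A", "A", "B", "B"),
  Year   = c(1990, 1991, 1990, 1991, 1990, 1991),
  Death_under_age5  = c(30, 28, 10, 9, 20, 19),
  Death_above_age70 = c(300, 320, 100, 110, 200, 210)
)

# Summing every row double counts here: "World" already includes A and B.
all_rows <- toy %>%
  group_by(Year) %>%
  summarise(across(starts_with("Death_"), ~ sum(.x, na.rm = TRUE)))

# Restricting to the aggregate row counts each death once.
world_only <- toy %>%
  filter(Entity == "World") %>%
  group_by(Year) %>%
  summarise(across(starts_with("Death_"), ~ sum(.x, na.rm = TRUE)))

print(all_rows)
print(world_only)
```

Whether such a filter is needed depends on which rows the CSV actually contains, so it is worth checking the distinct entity labels before summing.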
# This wide-format data frame does not help us fix the issue, so we convert
# it to long format with the gather function and name the result my_plot.
# There is now a separate row for each combination of age category and year,
# with its respective count.

my_plot = gather(cancer_df_new,key="age_category",value="death_counts",1:5)
head(my_plot,5)
##   Year age_category death_counts
## 1 1990   under_age5     328315.9
## 2 1991   under_age5     327373.4
## 3 1992   under_age5     323350.2
## 4 1993   under_age5     318516.8
## 5 1994   under_age5     312528.4
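
`gather()` still works, but the tidyr documentation describes it as superseded by `pivot_longer()`. Below is a minimal sketch of the same wide-to-long reshape, using the first two rows and two of the columns from the printed `cancer_df_new` output (hard-coded here so the block runs without the CSV file).

```r
library(tidyr)

# First two rows of cancer_df_new, copied from the printed output above
wide <- data.frame(
  under_age5  = c(328315.9, 327373.4),
  above_age70 = c(10858030, 11174685),
  Year        = c(1990, 1991)
)

# Every column except Year becomes an (age_category, death_counts) pair
long <- pivot_longer(
  wide,
  cols      = -Year,
  names_to  = "age_category",
  values_to = "death_counts"
)

print(long)
```

The row order differs from `gather()`: `pivot_longer()` keeps the rows for each year together, while `gather()` lists one category after another. The resulting plot is not affected by this.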
#The data frame is now ready to be plotted with ggplot.
#The reconstructed plot is stored as final_plot.

final_plot <- ggplot(my_plot, aes(x = Year, y = death_counts, color = age_category)) +
  geom_point() +   # one point per year and age category
  geom_line() +    # connect the points to show the trend over time
  facet_wrap(~age_category) +   # one panel per age category
  ggtitle("Deaths from cancer, 1990-2017, by age category") +
  xlab("Year") +
  ylab("Total number of deaths") +
  scale_x_continuous(limits = c(1990, 2018)) +
  scale_y_continuous(labels = comma)   # comma() comes from the scales package

#Display the plot
final_plot

Data Reference

Reconstruction

  • The following plot fixes the main issues in the original. The reconstruction uses the ggplot2 library. First, a more suitable graph type (a time series plot) was chosen so that the trends in death counts over the years can be seen easily for each age group.
  • The original also had clear overlapping, caused by poor scaling, which made the counts hard to read. This was mitigated by faceting: the visualisation is split into small panels, one per age group.
  • Finally, each facet was given its own distinct colour according to its age group.
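
With `facet_wrap()` at its defaults, all panels share one y-axis, so the much smaller counts in the youngest groups can look flat next to the largest groups. One option, not used in the code above, is `scales = "free_y"`, which gives each panel its own y-range at the cost of making panels harder to compare directly. A minimal sketch, using a few values copied from the printed `cancer_df_new` output:

```r
library(ggplot2)

# A few rows in long format, values copied from the printed output
long <- data.frame(
  Year         = rep(c(1990, 1991, 1992), times = 2),
  age_category = rep(c("under_age5", "above_age70"), each = 3),
  death_counts = c(328315.9, 327373.4, 323350.2,
                   10858030, 11174685, 11478261)
)

p <- ggplot(long, aes(x = Year, y = death_counts, colour = age_category)) +
  geom_line() +
  geom_point() +
  # free_y: each panel gets its own y-axis range
  facet_wrap(~ age_category, scales = "free_y") +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Year", y = "Total number of deaths")

print(p)
```

Shared axes make it easy to compare magnitudes between age groups, while free axes make the trend within each group easier to read, so the choice depends on which comparison the reader should make.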