Objective
The chosen visualisation aims to show the total number of deaths caused by cancer worldwide, by age group, from 1990 to 2017. The target audience is any member of the public who wants to learn more about cancer statistics and the effects of cancer, as well as cancer patients, health department officials, and cancer researchers.
The visualisation chosen had the following three main issues:
Similar colours: the visualisation places two very similar shades of green (light and dark) next to each other, which may confuse the reader.
Deceptive scaling and overplotting: at first glance, the chart suggests that deaths among people aged 70 and over in 1990 were around 4-6 million, whereas the count obtained from the data during analysis for the same group and year is around 10 million. The poor scaling and overplotting therefore mislead the reader.
Inappropriate chart type: the chosen chart makes it hard for a reader to track how the death counts change over time. The data combine a categorical variable (age group) with a time series, so a time series plot, which shows clearly how the counts vary over the years, is the better choice.
Reference
Roser, M., & Ritchie, H. (2015a). Cancer. Our World in Data. https://ourworldindata.org/cancer
Roser, M., & Ritchie, H. (2015b). Deaths from cancer, by age. Our World in Data. https://ourworldindata.org/grapher/cancer-deaths-by-age?stackMode=relative&country=%7EOWID_WRL
Baglin, J. (2020). Data Visualisation: From Theory to Practice. Retrieved 3 May 2021, from https://dark-star-161610.appspot.com/secured/_book/index.html
The following code was used to fix the issues identified in the original.
#Loading libraries
library(ggplot2)
library(dplyr)
library(readr)
library(lubridate)
library(tidyr)
library(scales)
#Reading the chosen dataset
cancer_df <- read_csv("C:/Users/thyagu/rmit/semester 3/DVC/Assi2/cancer-deaths-by-age.csv")
#Renaming column names for simpler use
cancer_df <- cancer_df %>%
  rename(
    Death_under_age5 = `Deaths - Neoplasms - Sex: Both - Age: Under 5 (Number)`,
    Death_age5_to_age14 = `Deaths - Neoplasms - Sex: Both - Age: 5-14 years (Number)`,
    Death_age15_to_age49 = `Deaths - Neoplasms - Sex: Both - Age: 15-49 years (Number)`,
    Death_age50_to_age69 = `Deaths - Neoplasms - Sex: Both - Age: 50-69 years (Number)`,
    Death_above_age70 = `Deaths - Neoplasms - Sex: Both - Age: 70+ years (Number)`
  )
#Summing the number of deaths per year, separately for each age category
#Note: each sum runs over every row of the file for that year. If the file
#contains aggregate rows (for example a world total alongside individual
#countries), filter to a single entity first to avoid double counting.
a <- cancer_df %>% group_by(Year) %>% summarise(Val_a = sum(Death_under_age5, na.rm = TRUE))
b <- cancer_df %>% group_by(Year) %>% summarise(Val_b = sum(Death_age5_to_age14, na.rm = TRUE))
c <- cancer_df %>% group_by(Year) %>% summarise(Val_c = sum(Death_age15_to_age49, na.rm = TRUE))
d <- cancer_df %>% group_by(Year) %>% summarise(Val_d = sum(Death_age50_to_age69, na.rm = TRUE))
e <- cancer_df %>% group_by(Year) %>% summarise(Val_e = sum(Death_above_age70, na.rm = TRUE))
#Create a new dataframe called cancer_df_new by using the above sum of count of deaths grouped by year
cancer_df_new <- data.frame(under_age5 = a$Val_a,
                            betw_age5to14 = b$Val_b,
                            betw_age15to49 = c$Val_c,
                            betw_age50to69 = d$Val_d,
                            above_age70 = e$Val_e,
                            Year = a$Year)
head(cancer_df_new, 5)
## under_age5 betw_age5to14 betw_age15to49 betw_age50to69 above_age70 Year
## 1 328315.9 360135.8 4212007 12401442 10858030 1990
## 2 327373.4 370123.7 4313488 12598342 11174685 1991
## 3 323350.2 381875.0 4465921 12757476 11478261 1992
## 4 318516.8 388858.2 4620017 13003419 11874614 1993
## 5 312528.4 390315.2 4763845 13121812 12190458 1994
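The five per-age summaries above can also be computed in one pass. The following is a minimal sketch, assuming dplyr 1.0.0 or later (for `across()`). It uses a small made-up data frame, `toy_df`, in place of the real file. The `Entity` column, its values, and the helper `sum_by_year()` are assumptions for illustration and are not confirmed by the code above. If the real file holds an aggregate row such as "World" next to individual countries, summing every row per year would count the same deaths more than once, so the sketch also shows filtering to a single entity first.

```r
library(dplyr)

# Toy stand-in for the renamed cancer_df; all values are made up.
# The Entity column is an assumption about the file's layout.
toy_df <- data.frame(
  Entity = c("World", "World", "Country A", "Country A"),
  Year = c(1990, 1991, 1990, 1991),
  Death_under_age5 = c(10, 12, 1, NA),
  Death_above_age70 = c(100, 110, 5, 6)
)

# Hypothetical helper: sum every Death_* column per year, ignoring NAs.
sum_by_year <- function(df) {
  df %>%
    group_by(Year) %>%
    summarise(
      across(starts_with("Death_"), ~ sum(.x, na.rm = TRUE)),
      .groups = "drop"
    )
}

# Sums over all rows, as in the code above (mixes aggregates and countries).
all_rows <- sum_by_year(toy_df)

# Sums over a single entity only, which avoids double counting.
world_only <- sum_by_year(filter(toy_df, Entity == "World"))
```

Comparing `all_rows` with `world_only` on the real data would show whether the yearly totals are inflated by aggregate rows.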
# This wide-format data frame does not help us fix the issues, so we
# change it to long format using the gather function and name it my_plot.
# Now there is a separate entry for each age category in each year,
# together with its respective count.
my_plot <- gather(cancer_df_new, key = "age_category", value = "death_counts", 1:5)
head(my_plot, 5)
## Year age_category death_counts
## 1 1990 under_age5 328315.9
## 2 1991 under_age5 327373.4
## 3 1992 under_age5 323350.2
## 4 1993 under_age5 318516.8
## 5 1994 under_age5 312528.4
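The tidyr documentation marks `gather()` as superseded and recommends `pivot_longer()` for new code. The following is a minimal sketch of the same wide-to-long step, assuming tidyr 1.0.0 or later; the data frame `wide` and its values are made up for illustration.

```r
library(tidyr)

# Toy wide-format data frame with the same shape as cancer_df_new
# (fewer columns and made-up values).
wide <- data.frame(
  under_age5 = c(3, 4),
  above_age70 = c(30, 40),
  Year = c(1990, 1991)
)

# Pivot every column except Year into (age_category, death_counts) pairs.
long <- pivot_longer(
  wide,
  cols = -Year,
  names_to = "age_category",
  values_to = "death_counts"
)
```

Unlike `gather()`, `pivot_longer()` returns the rows ordered by the original row (here, by year) rather than by category, which makes no difference to the plot.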
# The data frame is now ready to be plotted with ggplot2.
# The reconstructed plot is stored as final_plot.
# A point layer and a line layer share the same data and mapping,
# with one facet per age category.
final_plot <- ggplot() +
  layer(
    data = my_plot,
    mapping = aes(x = Year, y = death_counts, color = age_category),
    geom = "point",
    stat = "identity",
    position = position_identity()
  ) +
  layer(
    data = my_plot,
    mapping = aes(x = Year, y = death_counts, color = age_category),
    geom = "line",
    stat = "identity",
    position = position_identity()
  ) +
  facet_wrap(~age_category) +
  ggtitle("deaths by cancer from 1990-2017 for various age categories") +
  xlab('Year') +
  ylab('total number of deaths') +
  scale_x_continuous(limits = c(1990, 2018)) +
  scale_y_continuous(labels = comma)
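The same kind of faceted time series plot can be written more compactly with `geom_line()` and `geom_point()`. The sketch below is a variation under stated assumptions, not the plot produced above: it uses a made-up data frame `toy_long`, and it sets `scales = "free_y"` so that small age groups are not flattened against the largest one, a choice the original code does not make. It also hides the colour legend, since the facet titles already name each category.

```r
library(ggplot2)
library(scales)

# Toy long-format data with the same columns as my_plot; values are made up.
toy_long <- data.frame(
  Year = rep(c(1990, 1991, 1992), times = 2),
  age_category = rep(c("under_age5", "above_age70"), each = 3),
  death_counts = c(3, 4, 5, 30000, 40000, 50000)
)

# Data and mapping are declared once and inherited by both geoms.
p <- ggplot(toy_long, aes(x = Year, y = death_counts, color = age_category)) +
  geom_line() +
  geom_point() +
  facet_wrap(~age_category, scales = "free_y") +  # assumption: independent y axes
  labs(x = "Year", y = "Total number of deaths") +
  scale_y_continuous(labels = comma) +
  theme(legend.position = "none")
```

With free y axes the panels can no longer be compared by height, so this variation suits reading each age group's trend rather than comparing magnitudes across groups.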