Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original


Source: Leading Causes of Death Report 2020, Wisqars Data Visualization, Centers for Disease Control and Prevention, National Center for Injury Prevention and Control (c. 2022).


Objective

The purpose of the data visualization is to show the top 10 causes of death by age group in the United States in 2020. The Centers for Disease Control and Prevention, National Center for Injury Prevention and Control (c. 2022) state that ‘Researchers, the media, public health professionals, and the public can use WISQARS™ data to learn more about the public health and economic burden associated with unintentional and violence-related injury in the United States.’ The 10 Leading Causes of Death report is one dataset and visualization within the WISQARS database.

Using the Data Visualization Checklist (Evergreen and Emery 2016), the author identified that the selected visualization has the following three main issues:

  • Use of color: The visualization uses red and green for Homicide and Suicide, which may be difficult for severely color blind individuals to distinguish; for those who are not color blind, these color choices may be inappropriately emotive. A valuable further test would be to print the chart to determine whether the colors are distinguishable when printed with black toner only (the author did not have access to a printer while compiling this report, but a greyscale print preview rendered the colors indistinguishable onscreen). Furthermore, the use of color to highlight significant trends is inconsistent and potentially biased toward communicating a particular message: Malignant Neoplasms is also a significant cause of death for a number of age groups but has not been colored, nor has Heart Disease, which is significant in older age groups. As a result, the reader may overlook significant causes of death because the eye is drawn to the brightly colored boxes rather than the white boxes.
  • Type of chart: A better type of chart could be used to depict the data. It is difficult to follow the boxes through the age groups (this is particularly exacerbated by the inconsistent use of color, as noted above). A stacked column chart could better illustrate ranked causes of death by age group.
  • Significant findings / conclusion: Although the chart allows the audience to see the rank order of causes of death by age group, the ‘so what’ is unclear. This links back to the chart type and the inconsistent use of color: without some explanatory narrative, no key take-away message emerges. The visualization also fails to highlight the impact of COVID on the US population, which is a significant deviation from the 2019 data from the same source, particularly for older age groups.

In addition to these three main points, a number of additional observations could be made about this visualization with reference to Evergreen and Emery (2016).

Reference

Code

The following code was used to fix the issues identified in the original.

# The author loaded the packages needed to reconstruct the visualization

library(ggplot2)
library(dplyr)
library(stringr)
library(scales)
library(hrbrthemes)
library(RColorBrewer)
# The author set the working directory in RStudio and read in the data, setting it as a data frame
setwd("~/Documents/RMIT Data Visualization & Communication/Assignment 2 2022")
LCD2020 <- read.csv("WISQARS_lcd_ypll_BarChart_20220802-172602.csv")

# The author kept only the columns desired for the visualization, to make it easier to work with
keepcolumns<-c("Age.Group","Cause.Category","Percentage")
LCD2020tidy <- LCD2020[keepcolumns]

# Upon noticing some redundant values (totals), the author removed these rows, along with a row with error message data
LCD2020tidy <- LCD2020tidy[-c(101:111), ]

# The author checked the class of the Percentage object within the Data Frame and noticing it was a character object, set it to be a numeric object
class(LCD2020tidy$Percentage)
## [1] "character"
LCD2020tidy$Percentage <- as.numeric(LCD2020tidy$Percentage)
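
As a hedged aside, the conversion step can be checked for values that fail to parse. The sketch below uses made-up values rather than the WISQARS file; the names `raw_pct`, `num_pct` and `n_failed` are hypothetical and only illustrate that as.numeric() returns NA for entries that are not plain numbers.

```r
# Hypothetical character column; the real Percentage values may differ
raw_pct <- c("12.5", "3.1", "n/a")

# as.numeric() returns NA (with a warning) for entries it cannot parse
num_pct <- suppressWarnings(as.numeric(raw_pct))

# Count the entries that failed to convert so they can be inspected
n_failed <- sum(is.na(num_pct) & !is.na(raw_pct))
print(n_failed)
```

If a count like this is non-zero on real data, the affected rows are worth inspecting before plotting.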

# The author checked the class of the Age.Group object within the Data Frame and noticing it was a character object, set it to be a factor vector
class(LCD2020tidy$Age.Group)
## [1] "character"
LCD2020tidy$Age.Group <- as.factor(LCD2020tidy$Age.Group)

# The author checked the class of the Cause.Category object within the Data Frame and noticing it was a character object, set it to be a factor vector
class(LCD2020tidy$Cause.Category)
## [1] "character"
LCD2020tidy$Cause.Category <- as.factor(LCD2020tidy$Cause.Category)

# The author ordered the data descending by number of deaths (in Percentage order)
LCD2020tidyordered <- LCD2020tidy[order(LCD2020tidy$Percentage, decreasing = TRUE), ]  

# The author then selected the top 10 values per age group
LCD2020tidyordered <- Reduce(rbind,                                 
                    by(LCD2020tidyordered,
                       LCD2020tidyordered["Age.Group"],
                       head,
                       n = 10))
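
The same "top n rows per group" step could also be written with dplyr, which is already loaded above. The following is a minimal sketch on a toy data frame, not the author's method: the data frame `toy`, its values and the choice of n = 2 are assumptions made purely for illustration.

```r
library(dplyr)

# Hypothetical toy data: two age groups, four causes each
toy <- data.frame(
  Age.Group = rep(c("<1", "65+"), each = 4),
  Cause.Category = rep(c("A", "B", "C", "D"), times = 2),
  Percentage = c(40, 10, 30, 20, 5, 50, 25, 15)
)

# Keep the 2 largest Percentage rows within each age group
top2 <- toy %>%
  group_by(Age.Group) %>%
  slice_max(Percentage, n = 2, with_ties = FALSE) %>%
  ungroup()

print(top2)
```

Setting with_ties = FALSE guarantees exactly n rows per group; with the default, tied values can return more than n rows.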

# The author made a conscious decision to remove data where the cause accounted for 5% or less of deaths (given the intent to show leading causes of death), in order to make the chart less cluttered and allow some key messages to be extracted
LCD2020tidyordered <- LCD2020tidyordered %>% filter(Percentage > 5)

# Given the character data type of the Age Group and the practical challenges arising in sorting this data and getting the data to display correctly, the author constructed an object to be used to assist in ordering the data correctly within the plot
level_order <- factor(LCD2020tidyordered$Age.Group, levels = c('<1', '1-4', '5-9', '10-14', '15-24', '25-34', '35-44', '45-54', '55-64', '65+'))
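
To illustrate why explicit factor levels help here, the sketch below uses a short, hypothetical vector of age-group labels (`age_labels`, assumed to resemble those in the data). Sorting the labels as plain text does not give age order, whereas a factor with explicit levels sorts, and is drawn by ggplot2 on a discrete axis, in the order given.

```r
# Age-group labels as they might appear in the data (assumed for illustration)
age_labels <- c("10-14", "<1", "5-9", "1-4", "65+")

# Explicit levels fix the display order regardless of alphabetical sorting
age_order <- c("<1", "1-4", "5-9", "10-14", "65+")
age_factor <- factor(age_labels, levels = age_order)

# Sorting a factor follows its levels, not the character order
sorted_labels <- as.character(sort(age_factor))
print(sorted_labels)
```

Any label that is missing from the levels vector becomes NA, so the levels need to match the labels in the data exactly.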

# The author then built the new plot in ggplot
plot_title <- "Leading causes of death in United States in 2020 by Age Group"
p1 <- ggplot(LCD2020tidyordered,
             aes(x = level_order, fill = Cause.Category, y = Percentage)) +
  geom_bar(position = "dodge", stat = "identity") +
  ggtitle(plot_title)

# The author attempted to overcome the challenges of using color in the chart with ColorBrewer, noting there were sixteen unique values for Cause of Death category - the author selected Set2 as a color blind friendly palette, expanding the base palette to accommodate 16 colors
nb.cols <- 16
mycolors <- colorRampPalette(brewer.pal(8, "Set2"))(nb.cols)
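
One way to sanity-check the expanded palette is sketched below, under stated assumptions. RColorBrewer's `brewer.pal.info` table flags whether the base palette is color blind friendly, but that flag describes the original 8 Set2 colors, so the interpolated colors may not keep the property. The luma calculation is only a rough stand-in for a greyscale print test; the names `set2_info`, `expanded` and `luma` are introduced here for illustration.

```r
library(RColorBrewer)

# brewer.pal.info records whether each base palette is flagged colorblind friendly
set2_info <- brewer.pal.info["Set2", ]
print(set2_info)

# Interpolate the 8 Set2 colors up to 16, as in the code above
expanded <- colorRampPalette(brewer.pal(8, "Set2"))(16)

# Approximate greyscale value of each color (standard luma weights)
rgb_vals <- col2rgb(expanded)
luma <- as.vector(c(0.299, 0.587, 0.114) %*% rgb_vals)
print(round(sort(luma)))
```

Colors whose luma values sit close together are likely to be hard to tell apart in a black-and-white print, so the sorted values give a quick indication of where the palette may struggle.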
 
# The author then visualized the new ggplot
p1 + theme_ipsum() +
  labs(y = "Percentage of Deaths", x = "Age Group") +
  scale_fill_manual(values = mycolors) +
  theme(legend.position = "bottom",
        plot.title = element_text(size = 12),
        axis.text.x = element_text(size = 7),
        axis.text.y = element_text(size = 6),
        legend.key.size = unit(0.3, "cm"),
        legend.key.height = unit(0.3, "cm"),
        legend.key.width = unit(0.3, "cm"),
        legend.title = element_text(size = 7),
        legend.text = element_text(size = 6))

Data Reference

  • Centers for Disease Control and Prevention, National Center for Injury Prevention and Control (c. 2022) Leading Causes of Death Report 2020, WISQARS Data Visualization website, accessed 2 August 2022. https://wisqars.cdc.gov/data/lcd/home

Reconstruction

The following plot fixes the main issues in the original, including:

  • Use of color now conforms to a standard ColorBrewer palette (that is color blind friendly) and is less emotive.

  • The chart type is improved, allowing the viewer to see that leading causes of death differ by age group and what the most significant causes of death were in 2020, including COVID.

  • Linked to the above points, the viewer can take away some factual key messages from the visualization: causes of death differ by age group in the US; unintentional injury, not homicide or suicide, is the leading cause of death for young to middle-aged adults; and older age groups tended to suffer from expected health problems (such as heart disease and malignant neoplasms) as well as COVID.