Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.
Objective
The purpose of the data visualization is to show the top 10 causes of death by age group in the United States in 2020. The Centers for Disease Control and Prevention, National Center for Injury Prevention and Control (c. 2022) state that ‘Researchers, the media, public health professionals, and the public can use WISQARS™ data to learn more about the public health and economic burden associated with unintentional and violence-related injury in the United States.’ The 10 Leading Causes of Death report is one dataset and visualization within the WISQARS database.
Using the Data Visualization Checklist (Evergreen and Emery 2016), the author identified that the selected visualization has the following three main issues:
In addition to these three main points, a number of additional observations could be made about this visualization with reference to Evergreen and Emery (2016).
References
Evergreen, S and Emery, A (2016) Data Visualization Checklist, Evergreen Data website, accessed 30 July 2022. https://stephanieevergreen.com/wp-content/uploads/2016/10/DataVizChecklist_May2016.pdf
Centers for Disease Control and Prevention, National Center for Injury Prevention and Control (c. 2022) WISQARS Injury Data website, accessed 2 August 2022. https://www.cdc.gov/injury/wisqars/index.html
Centers for Disease Control and Prevention, National Center for Injury Prevention and Control (c. 2022) WISQARS Data Visualization, Leading Causes of Death Report 2020 website, accessed 2 August 2022. https://wisqars.cdc.gov/data/lcd/home
The following code was used to fix the issues identified in the original.
# The author loaded the packages needed to reconstruct the visualization
library(ggplot2)
library(dplyr)
library(stringr)
library(scales)
library(hrbrthemes)
library(RColorBrewer)
# The author set the working directory in RStudio and read in the data, setting it as a data frame
setwd("~/Documents/RMIT Data Visualization & Communication/Assignment 2 2022")
LCD2020 <- read.csv("WISQARS_lcd_ypll_BarChart_20220802-172602.csv")
# The author kept only the columns desired for the visualization, to make it easier to work with
keepcolumns<-c("Age.Group","Cause.Category","Percentage")
LCD2020tidy <- LCD2020[keepcolumns]
# Upon noticing some redundant values (totals), the author removed these rows, along with a row with error message data
LCD2020tidy <- LCD2020tidy[-c(101:111), ]
# The author checked the class of the Percentage object within the Data Frame and noticing it was a character object, set it to be a numeric object
class(LCD2020tidy$Percentage)
## [1] "character"
LCD2020tidy$Percentage <- as.numeric(LCD2020tidy$Percentage)
# The author checked the class of the Age.Group object within the Data Frame and noticing it was a character object, set it to be a factor vector
class(LCD2020tidy$Age.Group)
## [1] "character"
LCD2020tidy$Age.Group <- as.factor(LCD2020tidy$Age.Group)
# The author checked the class of the Cause.Category object within the Data Frame and noticing it was a character object, set it to be a factor vector
class(LCD2020tidy$Cause.Category)
## [1] "character"
LCD2020tidy$Cause.Category <- as.factor(LCD2020tidy$Cause.Category)
# The author ordered the data descending by number of deaths (in Percentage order)
LCD2020tidyordered <- LCD2020tidy[order(LCD2020tidy$Percentage, decreasing = TRUE), ]
# The author then selected the top 10 values per age group
LCD2020tidyordered <- Reduce(rbind,
                             by(LCD2020tidyordered,
                                LCD2020tidyordered["Age.Group"],
                                head,
                                n = 10))
# The author made a conscious decision to remove data where the cause accounted for 5% or less of deaths (given the intent to show leading causes of death), in order to make the chart less cluttered and allow some key messages to be extracted
LCD2020tidyordered <- LCD2020tidyordered %>% filter(Percentage > 5)
# Because Age Group started as character data, it sorts alphabetically rather than by age and does not display in the correct order, so the author constructed an object with explicit factor levels to order the data correctly within the plot
level_order <- factor(LCD2020tidyordered$Age.Group, levels = c('<1', '1-4', '5-9', '10-14', '15-24', '25-34', '35-44', '45-54', '55-64', '65+'))
# The author then built the new plot in ggplot
plot_title <- "Leading causes of death in the United States in 2020 by age group"
p1 <- ggplot(LCD2020tidyordered,
             aes(x = level_order, y = Percentage, fill = Cause.Category)) +
  geom_bar(position = "dodge", stat = "identity") +
  ggtitle(plot_title)
# The author attempted to overcome the challenges of using color in the chart with ColorBrewer, noting there were sixteen unique values for the Cause of Death category - the author selected Set2 as a color blind friendly palette, expanding the base palette to accommodate 16 colors
nb.cols <- 16
mycolors <- colorRampPalette(brewer.pal(8, "Set2"))(nb.cols)
# The author then visualized the new ggplot
p1 + theme_ipsum() + theme(legend.position="bottom") + labs(y= "Percentage of Deaths", x = "Age Group") +
scale_fill_manual(values = mycolors) + theme(plot.title = element_text(size = 12)) +
theme(axis.text.x = element_text(size = 7)) +
theme(axis.text.y = element_text(size = 6)) +
theme(legend.key.size = unit(0.3, 'cm'), legend.key.height = unit(0.3, "cm"), legend.key.width = unit(0.3, "cm"), legend.title = element_text(size=7), legend.text = element_text(size=6))
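As a self-contained illustration of the data preparation steps above (numeric conversion, top-n selection per age group, the 5% threshold and the age group ordering), the following base R sketch runs on a small made-up data frame. The `toy` data, its values and the helper `top_n_per_group` are hypothetical and are not part of the author's code or the WISQARS export; n = 2 is used instead of 10 to keep the example short.

```r
# Made-up rows standing in for the WISQARS export (illustrative values only).
toy <- data.frame(
  Age.Group      = c("1-4", "1-4", "1-4", "10-14", "10-14", "<1", "<1"),
  Cause.Category = c("Unintentional Injury", "Congenital Anomalies", "Homicide",
                     "Unintentional Injury", "Suicide",
                     "Congenital Anomalies", "Short Gestation"),
  Percentage     = c("31.2", "11.0", "4.1", "22.5", "4.8", "20.3", "15.8"),
  stringsAsFactors = FALSE
)

# Percentages arrive as text, so convert them before sorting or filtering.
toy$Percentage <- as.numeric(toy$Percentage)

# Sort descending so that head() picks the largest values in each group.
toy <- toy[order(toy$Percentage, decreasing = TRUE), ]

# Hypothetical helper: keep the first n rows of each group.
top_n_per_group <- function(df, group_col, n) {
  pieces <- lapply(split(df, df[[group_col]]), head, n = n)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

top2 <- top_n_per_group(toy, "Age.Group", 2)

# Apply the same threshold as the reconstruction: keep causes above 5%.
kept <- top2[top2$Percentage > 5, ]

# Store the age ordering in the column itself via explicit factor levels.
age_levels <- c("<1", "1-4", "5-9", "10-14", "15-24",
                "25-34", "35-44", "45-54", "55-64", "65+")
kept$Age.Group <- factor(kept$Age.Group, levels = age_levels)

# Ordering by the factor now follows age, not alphabetical order.
kept_sorted <- kept[order(kept$Age.Group), ]
print(kept_sorted)
```

Storing the levels on the column, rather than in a separate vector, is one possible design choice: the ordering then travels with the data frame if rows are filtered or re-sorted later.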
Data Reference
The following plot fixes the main issues in the original, including:
Use of color now conforms to a standard ColorBrewer palette (that is color blind friendly) and is less emotive.
The chart type is improved, allowing the viewer to see that leading causes of death differ by age group and what the most significant causes of death were in 2020, including COVID.
Linked to the above points, it is possible for the viewer to take away some factual key messages from the visualization, i.e. causes of death differ by age group in the US, unintentional injury is the leading cause of death for young to middle-aged adults (not homicide or suicide) and older age groups tended to suffer from expected health problems (such as heart disease or malignant neoplasms, as well as COVID).
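The palette expansion behind the first point can be illustrated without drawing the plot. The sketch below hard-codes the eight Set2 hex codes so that it runs without RColorBrewer, then interpolates them to 16 colors with `colorRampPalette()` from grDevices, in the same way as the reconstruction code. The variable names are hypothetical.

```r
# The eight colors of the ColorBrewer Set2 palette, hard-coded for portability.
set2 <- c("#66C2A5", "#FC8D62", "#8DA0CB", "#E78AC3",
          "#A6D854", "#FFD92F", "#E5C494", "#B3B3B3")

# colorRampPalette() returns a function that interpolates between the anchors.
expand_palette <- colorRampPalette(set2)
mycolors16 <- expand_palette(16)

# One color per cause category is now available.
print(length(mycolors16))

# Count how many of the expanded colors are original Set2 swatches.
print(sum(toupper(mycolors16) %in% set2))
```

With 16 colors drawn from 8 anchors, the interpolation points coincide with the anchors only at the two ends of the ramp, so most of the expanded colors are blends of neighboring Set2 swatches rather than the swatches themselves.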