The Data Viz Checklist

Whenever you make a data visualization, you can check through this list of questions to see whether you might have missed something.

There are two parts to the checklist: whether your visualization communicates effectively, and whether its aesthetic characteristics are well chosen.

Communication

  1. Fill in X in the sentence “someone who sees this visualization should learn X”
  2. Look at your visualization and ask yourself “Is X true?” and see if you can answer that question.
  3. Look at your visualization and ask yourself “If I had no idea that this visualization was trying to tell me X, would I figure that out on my own?”
  4. Have you picked the geometry (type of graph - scatterplot, bar, etc.) that makes X easiest to see?
  5. Have you used preattentive attributes (color, shape, linetype, spatial position, etc.) to draw our attention to the data that tells us X or to encourage the comparison that tells us X?
  6. Have you used text to help guide a reader towards X?
  7. Have you eliminated information or graphical features that distract from the story or data? (assuming doing so does not make the graph misrepresentative)
  8. Is all the information on the visualization in easy-to-interpret terms and clearly visible? i.e. if I see the number “5” on your graph, will I know what “5” means? 5 what? In what units?

Aesthetics

  1. Are all text and value labels in easy-to-understand language that clearly communicates what they are doing to your target audience?
  2. Have you selected a theme that lets the important elements be clearly visible? Often this means avoiding dark backgrounds, complex shading, and 3-D elements.
  3. Have you selected colors that are aesthetically pleasing and are clearly distinguishable by colorblind people? Can you pick colors that help make X more clear? If the colors correspond to labels, have you made it as easy as possible to tell which color goes to which label?
  4. Have you selected shapes, line types, etc., that are clearly distinguished? If they correspond to a label, have you made it as easy as possible to tell which style goes to which label?
  5. Have you selected a font that is easy to read and visually appealing, and have you made it large enough for those without perfect eyesight to read? Have you made sure that your important labels aren’t cut off?

An Example

Let’s walk through an example of applying the checklist.

We’ll start with this graph, which has the intended takeaway of “People with higher incomes have more variance in their wealth than people with lower incomes.”

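library(tidyverse) # Loads tibble, dplyr, and ggplot2
library(paletteer) # Palettes used in the later graphs

set.seed(2000) # Make the simulated data reproducible
dat <- tibble(Income = 100000*exp(rnorm(400))) %>%
  mutate(Wealth = (100*log(Income) + rnorm(400, 0, sqrt(Income)))*100 - 80000)

ggplot(dat, aes(x = Income, y = Wealth)) + geom_point() + 
  geom_smooth(method = 'lm', se = FALSE) +
  scale_x_log10() + 
  labs(x = 'Income (Log Scale)', y = 'Wealth Holdings',
       title = 'Wealth and Income') # The title argument must be named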

Communication

  1. Fill in X in the sentence “someone who sees this visualization should learn X”

“People with higher incomes have more variance in their wealth than people with lower incomes”

  2. Look at your visualization and ask yourself “Is X true?” and see if you can answer that question.

Yes, you can see that the points have a higher vertical spread on the right than on the left.
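
One way to double-check that answer is to compute the spread directly rather than eyeballing it. The following is a minimal sketch, not part of the original analysis: it rebuilds the simulated data from above and compares the standard deviation and interquartile range of wealth on either side of median income. The `AboveMedian` column and the `spread_check` name are made up for this illustration.

```r
library(dplyr)
library(tibble)

# Rebuild the simulated data from above so this check runs on its own
set.seed(2000)
dat <- tibble(Income = 100000*exp(rnorm(400))) %>%
  mutate(Wealth = (100*log(Income) + rnorm(400, 0, sqrt(Income)))*100 - 80000)

# AboveMedian is a hypothetical helper column, used only for this check
spread_check <- dat %>%
  mutate(AboveMedian = Income > median(Income)) %>%
  group_by(AboveMedian) %>%
  summarize(SD_Wealth = sd(Wealth),
            IQR_Wealth = IQR(Wealth))

spread_check
```

If X is true, both spread measures should be larger in the row where `AboveMedian` is `TRUE`.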

  3. Look at your visualization and ask yourself “If I had no idea that this visualization was trying to tell me X, would I figure that out on my own?”

Not so good. You would be unlikely to walk away with that idea by just looking at the graph.

  4. Have you picked the geometry that makes X easiest to see?

If I want people to see a difference in variance, I should use a geometry that lets people see a difference in variance. The point geometry doesn’t lead us there, and the trendline is a distraction from it, guiding us in a different direction. Lots of options - I could plot a density distribution at different income levels, or box plots. Let’s see how density plots look. This will require us to bin our income variable.

By the way, notice the use of “Below Median” and “Above Median” as labels. They make explicit what we’re comparing without a lengthy explanation, unlike the more nebulous “Low income” and “High income”, which might require one.

dat <- dat %>%
  mutate(IncomeBins = case_when(
    Income <= median(Income) ~ 'Below Median',
    TRUE ~ 'Above Median'
  ))

ggplot(dat, aes(x = Wealth, color = IncomeBins)) + 
  geom_density()  +
  labs(x = 'Wealth Holdings', y = 'Density',
       title = 'Distribution of Wealth by Income Category')
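
Box plots were the other option mentioned above. For comparison, a sketch of that alternative might look like the following. It is only an illustration (the walkthrough continues with the density plot), the `p_box` name is made up here, and `dat` is rebuilt so the block can run by itself.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Rebuild the simulated data and the income bins from above
set.seed(2000)
dat <- tibble(Income = 100000*exp(rnorm(400))) %>%
  mutate(Wealth = (100*log(Income) + rnorm(400, 0, sqrt(Income)))*100 - 80000,
         IncomeBins = if_else(Income <= median(Income), 'Below Median', 'Above Median'))

# One box per income group; a taller box and longer whiskers mean more spread
p_box <- ggplot(dat, aes(x = IncomeBins, y = Wealth)) +
  geom_boxplot() +
  labs(x = 'Income Category', y = 'Wealth Holdings',
       title = 'Distribution of Wealth by Income Category')

p_box
```

Box plots put spread front and center, but they assume the reader knows how to read one, which is worth weighing against your audience.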

  5. Have you used preattentive attributes (color, shape, linetype, spatial position, etc.) to draw our attention to the data that tells us X or to encourage the comparison that tells us X?

The comparison we want to make is between the variances of the two distributions. We can compare the distributions easily, although nothing really points us to compare their variance. We might instead be inclined to notice how one of the distributions is to the right of the other, encouraging the comparison of the level of wealth rather than the variance.

After trying a few different ways of emphasizing variance, no easy solution pops up that’s immediately clear, easily understandable, and doesn’t make it look like both groups have the same average wealth (misleading!). So we may need to rely on text guidance (although we’d want text guidance even if we did have a visual solution).

  6. Have you used text to help guide a reader towards X?

Don’t be afraid to just tell people what you want them to see on the graph! And don’t feel the need to explain in full technical detail, either. Note the use below of a general description of spread, rather than an exact figure such as a standard deviation (confusing to many readers!). Of course, the appropriate level of technical detail will change depending on the audience.

Color here, too, helps guide interpretation, with the label color matching the curve it goes with.

ggplot(dat, aes(x = Wealth, color = IncomeBins)) + 
  geom_density()  +
  annotate(geom = 'text', x = -10000, y = .000009, label = 'Low-income wealth is\nheavily concentrated\nat low, positive values.', hjust = 1, color = palettes_d$basetheme$clean[2], size = 10/.pt) + # Add annotations
  annotate(geom = 'text', x = 150000, y = .000003, label = 'High-income wealth\nis spread out. Higher highs\nand more big debts.', hjust = 0, color = palettes_d$basetheme$clean[1], size = 10/.pt) + # Add annotations
  expand_limits(x = 450000) + # Push the right edge of the graph so we can see the full annotation
  labs(x = 'Wealth Holdings', y = 'Density',
       title = 'People with Higher Incomes have More Variation in Wealth') + 
  scale_color_paletteer_d('basetheme::clean') # Pick a palette explicitly so we can match it in the annotations

Notice also that the labels here answer the unasked question of what those negative wealth values mean (they’re debt) without being distracting about it. If you were wondering that, the answer is right there. If not, the explanation doesn’t get in your way.

  7. Have you eliminated information or graphical features that distract from the story or data? (assuming doing so does not make the graph misrepresentative)

Here, the tails of the distributions make it harder to compare the distributions. Since we’ve labeled the curves with our annotations, the legend is also unnecessary (if possible, do the work necessary to make the legend unnecessary!). Finally, the density information on the y-axis doesn’t tell us much.

ggplot(dat, aes(x = Wealth, color = IncomeBins)) + 
  geom_density()  +
  annotate(geom = 'text', x = -10000, y = .000009, label = 'Below-median-income wealth is\nheavily concentrated\nat low, positive values.', hjust = 1, color = palettes_d$basetheme$clean[2], size = 10/.pt) +
  annotate(geom = 'text', x = 150000, y = .000003, label = 'Above-median-income wealth\nis spread out. Higher highs\nand more big debts.', hjust = 0, color = palettes_d$basetheme$clean[1], size = 10/.pt) +
  scale_x_continuous(limits = c(-100000, 300000)) + # Cut out the edge values so we can zoom in
  guides(color = 'none') + # Get rid of color legend ('none' replaces the deprecated FALSE)
  labs(x = 'Wealth Holdings', y = 'Density',
       title = 'People with Higher Incomes have More Variation in Wealth',
       caption = 'Values outside -$100,000 to $300,000 omitted for visual clarity.') + # Be honest!
  scale_color_paletteer_d('basetheme::clean') + 
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) # Get rid of y-axis values
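
One detail worth knowing about the code above: setting `limits` in `scale_x_continuous()` drops observations outside the range before the density is estimated (ggplot2 warns about removed rows), so the curves are computed on the trimmed data, which is what the caption discloses. If you would rather keep every observation in the estimate and only crop the view, `coord_cartesian()` does that. The sketch below contrasts the two. The names `n_outside`, `base_plot`, `p_trim`, and `p_zoom` are made up for illustration, and `dat` is rebuilt so the block runs alone.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Rebuild the simulated data and the income bins from above
set.seed(2000)
dat <- tibble(Income = 100000*exp(rnorm(400))) %>%
  mutate(Wealth = (100*log(Income) + rnorm(400, 0, sqrt(Income)))*100 - 80000,
         IncomeBins = if_else(Income <= median(Income), 'Below Median', 'Above Median'))

# How many observations fall outside the plotted range?
n_outside <- sum(dat$Wealth < -100000 | dat$Wealth > 300000)
n_outside

base_plot <- ggplot(dat, aes(x = Wealth, color = IncomeBins)) +
  geom_density()

# Scale limits: out-of-range rows are removed, then the density is estimated
p_trim <- base_plot + scale_x_continuous(limits = c(-100000, 300000))

# Coordinate limits: the density uses all rows and only the view is cropped
p_zoom <- base_plot + coord_cartesian(xlim = c(-100000, 300000))
```

Either choice can be defensible. With the `coord_cartesian()` version, the caption should say the view is cropped rather than that values were omitted.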

  8. Is all the information on the visualization in easy-to-interpret terms and clearly visible?

We have wealth here in scientific notation, which most people don’t understand. Let’s fix that! And let’s make sure our labels aren’t cut off in any way, moving them if necessary.

We also haven’t made fully explicit what “Median income” is - you can’t figure that out by looking at the graph! Seems caption-appropriate.

ggplot(dat, aes(x = Wealth, color = IncomeBins)) + 
  geom_density()  +
  annotate(geom = 'text', x = 70000, y = .000012, label = 'Below-median-income wealth is\nheavily concentrated\nat low, positive values.', hjust = 0, color = palettes_d$basetheme$clean[2], size = 10/.pt) + # Move this annotation to the right where there's space
  annotate(geom = 'text', x = 150000, y = .000003, label = 'Above-median-income wealth\nis spread out. Higher highs\nand more big debts.', hjust = 0, color = palettes_d$basetheme$clean[1], size = 10/.pt) +
  scale_x_continuous(limits = c(-100000, 300000),
                     labels = scales::dollar) + # Label wealth as a dollar value
  guides(color = 'none') + # Get rid of color legend ('none' replaces the deprecated FALSE)
  labs(x = 'Wealth Holdings', y = 'Density',
       title = 'People with Higher Incomes have More Variation in Wealth',
       caption = paste0('Values outside -$100,000 to $300,000 omitted for visual clarity.\nMedian income is ',scales::dollar(median(dat$Income)))) + # Let people know what median income is!
  scale_color_paletteer_d('basetheme::clean') + 
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) # Get rid of y-axis values

Aesthetics

  1. Are all text and value labels in easy-to-understand language that clearly communicate what they are doing?

Seems good on this. Mostly, anyway. Will our audience know what “Density” is? They may not! This may be clearer if we rename that “Proportion” or “Proportion with this wealth” - technically not accurate, but close enough, and for a lay audience this would get the idea across more clearly.

  2. Have you selected a theme that lets the important elements be clearly visible?

The background is just distracting, and we don’t need the gridlines; they don’t help us much here. Also, the density curves are the main piece of information we want the reader to see and they’re thin.

Without the gridlines, though, there’s no way to distinguish “positive values”, which the story is partially about for low-income people. So we’ll add that line back in.

ggplot(dat, aes(x = Wealth, color = IncomeBins)) + 
  geom_density(linewidth = 1)  + # Thicker lines (linewidth replaces size for lines in ggplot2 3.4.0+)
  geom_vline(aes(xintercept = 0), linetype = 'dashed') + # A line at 0 so we can tell the positives from negatives
  annotate(geom = 'text', x = 70000, y = .000012, label = 'Below-median-income wealth is\nheavily concentrated\nat low, positive values.', hjust = 0, color = palettes_d$basetheme$clean[2], size = 10/.pt) + # Move this annotation to the right where there's space
  annotate(geom = 'text', x = 150000, y = .000003, label = 'Above-median-income wealth\nis spread out. Higher highs\nand more big debts.', hjust = 0, color = palettes_d$basetheme$clean[1], size = 10/.pt) +
  scale_x_continuous(limits = c(-100000, 300000),
                     labels = scales::dollar) + # Label wealth as a dollar value
  guides(color = 'none') + 
  labs(x = 'Wealth Holdings', y = 'Proportion With This Wealth',
       title = 'People with Higher Incomes have More Variation in Wealth',
       caption = paste0('Values outside -$100,000 to $300,000 omitted for visual clarity.\nMedian income is ',scales::dollar(median(dat$Income)))) + 
  scale_color_paletteer_d('basetheme::clean') + 
  ggpubr::theme_pubr() + # A clean, no-gridline theme
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

  3. Have you selected colors that are aesthetically pleasing and are clearly distinguishable by colorblind people? Can you pick colors that help make X more clear? If the colors correspond to labels, have you made it as easy as possible to tell which color goes to which label?

While blue/yellow would be a colorblindness issue, the blue/orange we have seems fine. As for aesthetics, I’m not the biggest fan of the orange, so we could change that. There aren’t really “high-income colors” and “low-income colors” we could pick to help show the low/high income difference, but if there were, we could use them.

We’ve already made connecting the labels easy by moving them onto the graph and making them textual, rather than off to the side in a far-off legend. Remember, going back and forth from data to legend to see what’s what is tiring!
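
If you did want to swap out the orange, one option is a palette designed with color vision deficiency in mind. The sketch below uses the Okabe-Ito palette that ships with base R (via `palette.colors()`, available in R 4.0.0 and later) as a stand-in. This is a substitution for illustration, not the palette used in the rest of this walkthrough, and the `okabe_ito`, `group_colors`, and `p_colors` names are made up here.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Rebuild the simulated data and the income bins from above
set.seed(2000)
dat <- tibble(Income = 100000*exp(rnorm(400))) %>%
  mutate(Wealth = (100*log(Income) + rnorm(400, 0, sqrt(Income)))*100 - 80000,
         IncomeBins = if_else(Income <= median(Income), 'Below Median', 'Above Median'))

# Named vector of Okabe-Ito colors from base R
okabe_ito <- palette.colors(palette = 'Okabe-Ito')

# Map each income group to a color by name so the assignment is explicit
group_colors <- c('Above Median' = unname(okabe_ito['blue']),
                  'Below Median' = unname(okabe_ito['vermillion']))

p_colors <- ggplot(dat, aes(x = Wealth, color = IncomeBins)) +
  geom_density() +
  scale_color_manual(values = group_colors)

p_colors
```

If you adopt something like this, the annotation colors would need to come from `group_colors` as well so the text still matches its curve.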

  4. Have you selected shapes, line types, etc., that are clearly distinguished? If they correspond to a label, have you made it as easy as possible to tell which style goes to which label?

Not applicable here - we do have a line type (solid), which is easy to see, and we’re not relying on it to label data.

  5. Have you selected a font that is easy to read and visually appealing, and have you made it large enough for those without perfect eyesight to read? Have you made sure that your important labels aren’t cut off?

This text is too small, especially on the annotations! Now we have more room for them anyway. And I prefer a serif font, but that’s just me.

ggplot(dat, aes(x = Wealth, color = IncomeBins)) + 
  geom_density(linewidth = 1)  + # Thicker lines (linewidth replaces size for lines in ggplot2 3.4.0+)
  geom_vline(aes(xintercept = 0), linetype = 'dashed') + 
  annotate(geom = 'text', x = 70000, y = .000012, label = 'Below-median-income wealth is\nheavily concentrated\nat low, positive values.', hjust = 0, color = palettes_d$basetheme$clean[2], 
           size = 14/.pt, family = 'serif') + # Bigger, serif font
  annotate(geom = 'text', x = 125000, y = .000004, label = 'Above-median-income wealth\nis spread out. Higher highs\nand more big debts.', hjust = 0, color = palettes_d$basetheme$clean[1],
           size = 14/.pt, family = 'serif') + # Bigger, serif font
  scale_x_continuous(limits = c(-100000, 300000),
                     labels = scales::dollar) + 
  guides(color = 'none') + 
  labs(x = 'Wealth Holdings', y = 'Proportion With This Wealth',
       title = 'People with Higher Incomes have More Variation in Wealth',
       caption = paste0('Values outside -$100,000 to $300,000 omitted for visual clarity.\nMedian income is ',scales::dollar(median(dat$Income)))) + 
  scale_color_paletteer_d('basetheme::clean') + 
  ggpubr::theme_pubr() + # A clean, no-gridline theme
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        text = element_text(size = 13, family = 'serif')) # Bigger, serif font 

And there we have it! It may not be perfect, but it at least can get the point across.