Whenever you make a data visualization, you can check through this list of questions to see whether you might have missed something.
There are two parts to the checklist: whether your visualization communicates effectively, and whether its aesthetic characteristics are well chosen.
Let’s walk through an example of applying the checklist.
We’ll start with this graph, which has the intended takeaway of “People with higher incomes have more variance in their wealth than people with lower incomes.”
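library(tidyverse) # tibble, dplyr, and ggplot2 are all used below
set.seed(2000)
# Simulated data: the noise in Wealth grows with Income
dat <- tibble(Income = 100000*exp(rnorm(400))) %>%
mutate(Wealth = (100*log(Income) + rnorm(400, 0, sqrt(Income)))*100 - 80000)
ggplot(dat, aes(x = Income, y = Wealth)) + geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
scale_x_log10() +
labs(x = 'Income (Log Scale)', y = 'Wealth Holdings',
title = 'Wealth and Income') # Name the argument so the title is actually set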
“People with higher incomes have more variance in their wealth than people with lower incomes”
Yes, you can see that the points have a higher vertical spread on the right than on the left.
Not so good. You would be unlikely to walk away with that idea by just looking at the graph.
If I want people to see a difference in variance, I should use a geometry that lets people see a difference in variance. The point geometry doesn’t lead us there, and the trendline is a distraction from it, guiding us in a different direction. There are lots of options: I could plot density distributions at different income levels, or box plots. Let’s see how density plots look. This will require us to bin our income variable.
By the way, notice the use of “Below Median” and “Above Median” as labels. They make explicit what we’re comparing without a lengthy explanation, unlike the more nebulous “Low income” and “High income,” which might require one.
dat <- dat %>%
mutate(IncomeBins = case_when(
Income <= median(Income) ~ 'Below Median',
TRUE ~ 'Above Median'
))
ggplot(dat, aes(x = Wealth, color = IncomeBins)) +
geom_density() +
labs(x = 'Wealth Holdings', y = 'Density',
title = 'Distribution of Wealth by Income Category')
The comparison we want to make is between the variances of the two distributions. We can compare the distributions easily, although nothing really points us to compare their variance. We might instead be inclined to notice how one of the distributions is to the right of the other, encouraging the comparison of the level of wealth rather than the variance.
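Before reaching for a different design, it can help to confirm numerically that the spread really does differ between the two groups. The sketch below recreates the simulated data in base R (assuming the same seed and the same order of random draws as the tibble version) and compares the standard deviation and interquartile range of each group. The name `spread_by_bin` is made up for this example.

```r
# Recreate the simulated data in base R so the snippet runs on its own
set.seed(2000)
Income <- 100000 * exp(rnorm(400))
Wealth <- (100 * log(Income) + rnorm(400, 0, sqrt(Income))) * 100 - 80000
IncomeBins <- ifelse(Income <= median(Income), 'Below Median', 'Above Median')

# Group sizes: splitting at the median gives two equal halves
table(IncomeBins)

# Two measures of spread for each income group (rows: measure, columns: group)
spread_by_bin <- sapply(split(Wealth, IncomeBins),
                        function(w) c(SD = sd(w), IQR = IQR(w)))
round(spread_by_bin)
```

If the above-median column is clearly larger on both rows, the takeaway is supported by the data, and the remaining problem is purely one of presentation.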
After trying a few different ways of emphasizing variance, no easy solution pops up: nothing is immediately clear and easily understandable without also making it look like both groups have the same average wealth (misleading!). So we may need to rely on text guidance (although we’d want text guidance even if we did have a visual solution).
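For reference, one of the alternatives mentioned earlier, box plots, could be sketched as follows. This is only an illustration of the kind of attempt described here, not necessarily one the author tried, and it rebuilds the data so it runs on its own. The name `p_box` is made up for this example.

```r
library(tidyverse)

# Rebuild the simulated data, with the income bins added in the same step
set.seed(2000)
dat <- tibble(Income = 100000*exp(rnorm(400))) %>%
  mutate(Wealth = (100*log(Income) + rnorm(400, 0, sqrt(Income)))*100 - 80000,
         IncomeBins = if_else(Income <= median(Income),
                              'Below Median', 'Above Median'))

# One box per income group; the height of each box is the interquartile range
p_box <- ggplot(dat, aes(x = IncomeBins, y = Wealth)) +
  geom_boxplot() +
  labs(x = 'Income Category', y = 'Wealth Holdings',
       title = 'Distribution of Wealth by Income Category')
p_box
```

The box heights do show the difference in spread, but the difference in medians may be just as prominent, so the comparison of levels still competes with the comparison of variances.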
Don’t be afraid to just tell people what you want them to see on the graph! And don’t feel the need to explain in full technical detail, either. Note the use below of a general description of spread rather than an exact figure such as a standard deviation (confusing to many readers!). Of course, the appropriate level of technical detail will change depending on the audience.
Color here, too, helps guide interpretation, with the label color matching the curve it goes with.
library(paletteer) # Provides palettes_d and scale_color_paletteer_d()
ggplot(dat, aes(x = Wealth, color = IncomeBins)) +
geom_density() +
annotate(geom = 'text', x = -10000, y = .000009, label = 'Low-income wealth is\nheavily concentrated\nat low, positive values.', hjust = 1, color = palettes_d$basetheme$clean[2], size = 10/.pt) + # Add annotations
annotate(geom = 'text', x = 150000, y = .000003, label = 'High-income wealth\nis spread out. Higher highs\nand more big debts.', hjust = 0, color = palettes_d$basetheme$clean[1], size = 10/.pt) + # Add annotations
expand_limits(x = 450000) + # Push the right edge of the graph so we can see the full annotation
labs(x = 'Wealth Holdings', y = 'Density',
title = 'People with Higher Incomes have More Variation in Wealth') +
scale_color_paletteer_d('basetheme::clean') # Pick a palette explicitly so we can match it in the annotations
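The palette indices in the annotations depend on how the discrete color scale orders the groups. To my understanding, ggplot2 treats a character column like a factor with alphabetically sorted levels and hands out palette colors in that order, which is why `[1]` goes with the above-median curve and `[2]` with the below-median one. Here is a quick base R check of that ordering, where `income_bins` is a made-up stand-in for `dat$IncomeBins`:

```r
# Made-up stand-in for dat$IncomeBins
income_bins <- c('Below Median', 'Above Median', 'Below Median', 'Above Median')

# Levels are sorted alphabetically, so 'Above Median' comes first
group_levels <- levels(factor(income_bins))
group_levels

# Position of each group in the palette: 1 = first color, 2 = second color
palette_index <- setNames(seq_along(group_levels), group_levels)
palette_index
```

If the bin labels ever change, it is worth rerunning a check like this, since a new alphabetical order would silently swap the annotation colors.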
Notice also that the labels here answer the unasked question of what those negative wealth values mean (they’re debt) without being distracting about it. If you were wondering that, the answer is right there. If not, the explanation doesn’t get in your way.
Here, the tails of the distributions make it harder to compare them. Since we’ve labeled the curves with our annotations, the legend is unnecessary (if possible, do the work necessary to make the legend unnecessary!). And the density values on the y-axis don’t tell us much.
ggplot(dat, aes(x = Wealth, color = IncomeBins)) +
geom_density() +
annotate(geom = 'text', x = -10000, y = .000009, label = 'Below-median-income wealth is\nheavily concentrated\nat low, positive values.', hjust = 1, color = palettes_d$basetheme$clean[2], size = 10/.pt) +
annotate(geom = 'text', x = 150000, y = .000003, label = 'Above-median-income wealth\nis spread out. Higher highs\nand more big debts.', hjust = 0, color = palettes_d$basetheme$clean[1], size = 10/.pt) +
scale_x_continuous(limits = c(-100000, 300000)) + # Cut out the edge values so we can zoom in
guides(color = 'none') + # Get rid of color legend ('none' replaces the deprecated FALSE)
labs(x = 'Wealth Holdings', y = 'Density',
title = 'People with Higher Incomes have More Variation in Wealth',
caption = 'Values outside -$100,000 to $300,000 omitted for visual clarity.') + # Be honest!
scale_color_paletteer_d('basetheme::clean') +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) # Get rid of y-axis values
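To back up the caption, it is worth checking how much data the axis limits actually remove. Setting `limits` on the scale drops out-of-range observations before the density is estimated, whereas `coord_cartesian(xlim = ...)` would only zoom the view and keep every observation in the estimate. The sketch below counts the omitted values on a base R copy of the simulated data. The names `x_limits`, `outside`, and `n_omitted` are made up for this example.

```r
# Recreate the simulated data in base R so the snippet runs on its own
set.seed(2000)
Income <- 100000 * exp(rnorm(400))
Wealth <- (100 * log(Income) + rnorm(400, 0, sqrt(Income))) * 100 - 80000

# Same limits as scale_x_continuous(limits = c(-100000, 300000))
x_limits <- c(-100000, 300000)
outside <- Wealth < x_limits[1] | Wealth > x_limits[2]

n_omitted <- sum(outside)
n_omitted                        # How many observations the caption refers to
round(100 * mean(outside), 1)    # The same thing as a percentage
```

If that share were large, dropping the values would change the story, and zooming with `coord_cartesian()` would be the safer choice.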
Wealth is shown here in scientific notation, which most people don’t understand. Let’s fix that! Let’s also make sure our labels aren’t cut off, moving them if necessary.
We also haven’t made fully explicit what “Median income” is - you can’t figure that out by looking at the graph! Seems caption-appropriate.
ggplot(dat, aes(x = Wealth, color = IncomeBins)) +
geom_density() +
annotate(geom = 'text', x = 70000, y = .000012, label = 'Below-median-income wealth is\nheavily concentrated\nat low, positive values.', hjust = 0, color = palettes_d$basetheme$clean[2], size = 10/.pt) + # Move this annotation to the right where there's space
annotate(geom = 'text', x = 150000, y = .000003, label = 'Above-median-income wealth\nis spread out. Higher highs\nand more big debts.', hjust = 0, color = palettes_d$basetheme$clean[1], size = 10/.pt) +
scale_x_continuous(limits = c(-100000, 300000),
labels = scales::dollar) + # Label wealth as a dollar value
guides(color = 'none') + # Get rid of color legend ('none' replaces the deprecated FALSE)
labs(x = 'Wealth Holdings', y = 'Density',
title = 'People with Higher Incomes have More Variation in Wealth',
caption = paste0('Values outside -$100,000 to $300,000 omitted for visual clarity.\nMedian income is ',scales::dollar(median(dat$Income)))) + # Let people know what median income is!
scale_color_paletteer_d('basetheme::clean') +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank()) # Get rid of y-axis values
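To see what the relabeling does outside of a plot, `scales::dollar()` can be called directly on a few numbers. With default settings, R formats large round numbers like these in scientific notation, and the formatter turns them into the dollar strings that end up on the axis and in the caption. The values below are arbitrary examples, not taken from the data.

```r
library(scales)

# Arbitrary example values, chosen to match the axis limits
example_values <- c(-100000, 0, 300000)

# Default formatting falls back to scientific notation for a value like this
format(example_values[3])

# Dollar formatting, as used for the axis labels
dollar_labels <- dollar(example_values)
dollar_labels

# Building a caption string the same way the plot code does
paste0('Median income is ', dollar(98765.43))
```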
Seems good on this. Mostly, anyway. Will our audience know what “Density” is? They may not! This may be clearer if we rename that “Proportion” or “Proportion with this wealth” - technically not accurate, but close enough, and for a lay audience this would get the idea across more clearly.
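The reason “Proportion” is only approximately right is that a density is scaled so that the area under the curve is 1, which makes its height a proportion per dollar rather than a proportion. Here is a rough sketch of that relationship in base R, using `density()` as a stand-in for what `geom_density()` computes (the defaults may differ slightly, and the $10,000 window is an arbitrary choice):

```r
# Recreate the simulated data in base R so the snippet runs on its own
set.seed(2000)
Income <- 100000 * exp(rnorm(400))
Wealth <- (100 * log(Income) + rnorm(400, 0, sqrt(Income))) * 100 - 80000

d <- density(Wealth)

# The curve's heights are tiny, but the area under it is approximately 1
max(d$y)
area <- sum(d$y) * diff(d$x)[1]
area

# Height times width approximates the share of people in a narrow window
window <- c(30000, 40000)
height_mid <- approx(d$x, d$y, xout = mean(window))$y
approx_share <- height_mid * diff(window)
actual_share <- mean(Wealth >= window[1] & Wealth < window[2])
c(approx = approx_share, actual = actual_share)
```

So a taller curve does mean “more people around this wealth level,” which is the reading the relabeled axis invites, even though the raw numbers on the axis are not proportions.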
The background is just distracting, and we don’t need the gridlines; they don’t help us much here. Also, the density curves are the main piece of information we want the reader to see and they’re thin.
Without the gridlines, though, there’s no way to pick out “positive values,” which are part of the story for the below-median group. So we’ll add a single vertical line at zero.
ggplot(dat, aes(x = Wealth, color = IncomeBins)) +
geom_density(linewidth = 1) + # Thicker lines (linewidth replaces size for lines as of ggplot2 3.4.0)
geom_vline(aes(xintercept = 0), linetype = 'dashed') + # A line at 0 so we can tell the positives from negatives
annotate(geom = 'text', x = 70000, y = .000012, label = 'Below-median-income wealth is\nheavily concentrated\nat low, positive values.', hjust = 0, color = palettes_d$basetheme$clean[2], size = 10/.pt) + # Move this annotation to the right where there's space
annotate(geom = 'text', x = 150000, y = .000003, label = 'Above-median-income wealth\nis spread out. Higher highs\nand more big debts.', hjust = 0, color = palettes_d$basetheme$clean[1], size = 10/.pt) +
scale_x_continuous(limits = c(-100000, 300000),
labels = scales::dollar) + # Label wealth as a dollar value
guides(color = 'none') +
labs(x = 'Wealth Holdings', y = 'Proportion With This Wealth',
title = 'People with Higher Incomes have More Variation in Wealth',
caption = paste0('Values outside -$100,000 to $300,000 omitted for visual clarity.\nMedian income is ',scales::dollar(median(dat$Income)))) +
scale_color_paletteer_d('basetheme::clean') +
ggpubr::theme_pubr() + # A clean, no-gridline theme
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())
While blue/yellow would be a colorblindness issue, the blue/orange we have seems fine. As for aesthetics, I’m not the biggest fan of the orange, so we could change that. There aren’t really “high-income colors” and “low-income colors” that would help show the low/high income difference, but if there were, we could use them.
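If we did want to swap out the orange, one option is to name the colors ourselves and pass them to `scale_color_manual()`, which also gives the annotations a single place to look up their colors. The sketch below borrows two colors from the Okabe-Ito palette that ships with R 4.0 and later. The particular picks, the toy data, and the names `group_colors`, `toy`, and `p_colors` are assumptions made for illustration, not the original’s choices.

```r
library(ggplot2)

# Okabe-Ito is designed to stay distinguishable under common color vision deficiencies
okabe_ito <- palette.colors(palette = 'Okabe-Ito')

# Named vector: each group gets its color by name, not by position
group_colors <- c('Above Median' = unname(okabe_ito[6]),  # a blue
                  'Below Median' = unname(okabe_ito[7]))  # a vermillion

# Small made-up data set standing in for dat
set.seed(1)
toy <- data.frame(
  Wealth = c(rnorm(100, 60000, 50000), rnorm(100, 30000, 20000)),
  IncomeBins = rep(c('Above Median', 'Below Median'), each = 100)
)

# The annotation looks up its color from the same named vector as the scale
p_colors <- ggplot(toy, aes(x = Wealth, color = IncomeBins)) +
  geom_density() +
  annotate(geom = 'text', x = 100000, y = .00001, label = 'Above median',
           hjust = 0, color = group_colors[['Above Median']]) +
  scale_color_manual(values = group_colors)
p_colors
```

Because the colors are matched by group name, reordering or relabeling the groups cannot silently swap which curve gets which color.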
We’ve already made connecting the labels easy by moving them onto the graph and making them textual, rather than off to the side in a far-off legend. Remember, going back and forth from data to legend to see what’s what is tiring!
Not applicable here - we do have a line type (solid), which is easy to see, and we’re not relying on it to label data.
This text is too small, especially on the annotations! We have more room for them now anyway. And I prefer a serif font, but that’s just me.
ggplot(dat, aes(x = Wealth, color = IncomeBins)) +
geom_density(linewidth = 1) + # Thicker lines (linewidth replaces size for lines as of ggplot2 3.4.0)
geom_vline(aes(xintercept = 0), linetype = 'dashed') +
annotate(geom = 'text', x = 70000, y = .000012, label = 'Below-median-income wealth is\nheavily concentrated\nat low, positive values.', hjust = 0, color = palettes_d$basetheme$clean[2],
size = 14/.pt, family = 'serif') + # Bigger, serif font
annotate(geom = 'text', x = 125000, y = .000004, label = 'Above-median-income wealth\nis spread out. Higher highs\nand more big debts.', hjust = 0, color = palettes_d$basetheme$clean[1],
size = 14/.pt, family = 'serif') + # Bigger, serif font
scale_x_continuous(limits = c(-100000, 300000),
labels = scales::dollar) +
guides(color = 'none') +
labs(x = 'Wealth Holdings', y = 'Proportion With This Wealth',
title = 'People with Higher Incomes have More Variation in Wealth',
caption = paste0('Values outside -$100,000 to $300,000 omitted for visual clarity.\nMedian income is ',scales::dollar(median(dat$Income)))) +
scale_color_paletteer_d('basetheme::clean') +
ggpubr::theme_pubr() + # A clean, no-gridline theme
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
text = element_text(size = 13, family = 'serif')) # Bigger, serif font
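A note on the `14/.pt` in the annotations: text in geoms is sized in millimeters, while theme text is sized in points, and ggplot2 exports the constant `.pt` to convert between them. Dividing a point size by `.pt` gives the equivalent `size` for `annotate()` or `geom_text()`, which is how the annotation text can be kept close to the 13-point theme text. Here is a small check of the arithmetic, where `annotation_size_mm` is a name made up for this example:

```r
library(ggplot2)

# .pt is the number of points in one millimeter
.pt

# A 14-point annotation expressed in the millimeters that geoms expect
annotation_size_mm <- 14 / .pt
annotation_size_mm

# Converting back recovers the point size
annotation_size_mm * .pt
```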
And there we have it! It may not be perfect, but it at least can get the point across.