A brief theory of data visualization

How you choose to display your data is as important to communicating your message as the statistical analysis itself. Thinking critically about which data you want to display, and how you want them displayed, is therefore key to doing good science. After all, what good is your data if it’s displayed unclearly? For those of us who do applied science, complicated research must be communicated to managers and stakeholders who don’t necessarily have the time to sit and digest an overly complicated figure, or to try to understand the nuts and bolts of a complex analysis. So, if you want your research to be read and ultimately understood by decision-makers in your field, my advice is to harness the power of modern data visualization techniques to communicate your research as clearly and effectively as possible.

Plenty has been written on how to design effective figures for science, but there are no universally agreed-upon commandments for how to display data. For interested readers, I recommend checking out the following papers:

How to display data badly by Howard Wainer, 1984.

Let’s practice what we preach: turning tables into graphs by Andrew Gelman et al., 2002.

However, I will share with you here what I think are good rules to follow when creating figures:

1. Your figures should tell a story

The point of a figure should be to tell a story. For example, if you were to plot the means of some variable across various groups, say average reported income by education level, then the implied story you are telling is about the effect of education level on expected income. Alternatively, if you make a plot that shows the cumulative number of birds sighted on the y-axis and the number of hours spent searching on the x-axis, then the implied story you are telling is that the number of birds you saw is somehow related to the amount of time you spent searching for them. When you create figures for your research, first spend some time thinking about the story you want to tell with them. Are you trying to compare two groups? Do you want to show a potential relationship between two variables? Do you suspect there is an interaction occurring between two variables? Do you want to show how some variable has changed over time? Once you’ve thought about the stories you want to tell, then you can consider how best to display your data to tell those stories.

2. Keep it simple, but not too simple

The best figures give you just the right amount of information to tell their stories. Not too much, and not too little. Sometimes this can be a delicate balancing act. Surely you will have a ton of data that you could show us. But the key question is which data should you show us? We often collect so much data that it would be impractical to share all of it. This is where you must again consider your story. What story are you trying to tell with your data, and what data are both sufficient and necessary to communicate your message? Also consider the aesthetics of your figures and try to avoid having too many distracting visuals. A clean, minimalist theme will help you emphasize the data over all else.

3. Figures should be able to stand alone

Do you need to explain what’s going on in your figure with a paragraph of text? That’s a sign you should probably reconsider how you’re displaying your data. Good figures should be easy to understand on their own. If you find that you are constantly having to explain your figures to people, maybe it’s time to try a different approach.

4. Consider your audience

It’s important to think about who you are trying to reach with your figure. Are you making figures intended to be seen by your peers? Is it for communicating to the general public? Are you speaking to a 4th grade science class? Considering your audience can drastically change how you decide to display your data. For example, I probably wouldn’t show a box-and-whisker plot to a class of 4th graders, because I wouldn’t expect them to know how to read that type of plot. At the same time, I would be more comfortable making slightly more complicated figures for a professional conference if the data warrant it.

5. Don’t use your figures to mislead

Producing misleading figures with your data is unfortunately all too easy, and it can even happen accidentally if you aren’t careful. Here is a collection of some of my favorite examples of horribly misleading and/or downright confusing figures:

Source: Fox Business

This one is clearly doing its best to convince you that the top tax rate will go way up if the Bush tax cuts expire. But a closer look at the scale reveals that the difference is only about 4.6 percentage points. Is that very meaningful, given the context? Moreover, does it merit a y-axis that zoomed in? Also notice how tiny those numbers are compared to the graphics. This is an example of a figure that manipulates the audience with a misleading scale.
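To see how much the axis range alone changes the impression, here is a minimal sketch in ggplot2. It assumes the two rates in the chart are 35% and 39.6% (consistent with the 4.6-point gap); the data frame `rates` and the plot names are made up for illustration.

```r
library(ggplot2)

# Assumed values: the two top tax rates compared in the chart.
rates <- data.frame(
  scenario = factor(c("Now", "If cuts expire"),
                    levels = c("Now", "If cuts expire")),
  top_rate = c(35, 39.6)
)

base_plot <- ggplot(rates, aes(x = scenario, y = top_rate)) +
  geom_col() +
  labs(x = NULL, y = "Top tax rate (%)") +
  theme_bw()

# Zoomed-in y-axis: the second bar looks several times taller than the first.
zoomed <- base_plot + coord_cartesian(ylim = c(34, 42))

# Zero-based y-axis: bar heights are proportional to the actual rates.
zero_based <- base_plot + coord_cartesian(ylim = c(0, 42))

zoomed
zero_based
```

Both plots show exactly the same two numbers; only the visible range of the y-axis differs.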

Source: Trip Savvy

Ok, so while maybe not technically wrong, this graph is still quite misleading. The key assumption here is that those states are directly comparable. But what if the differences in shark attacks actually have little to do with what state you’re in and more to do with, say, population density? Or how about kilometers of coastline? Number of public beaches? These data need some kind of normalizing factor applied to them so that we can actually compare the relative risk of a shark attack between states instead of just the absolute number of shark attacks. The context of your data matters!

Source: Business Insider

I. Uh. What even?

Exploring data with ggplot2

Without further ado, let’s get into some of the many wonderful features of the ggplot2 package.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
library(palmerpenguins)

head(penguins)
## # A tibble: 6 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
## 1 Adelie  Torge~           39.1          18.7              181        3750
## 2 Adelie  Torge~           39.5          17.4              186        3800
## 3 Adelie  Torge~           40.3          18                195        3250
## 4 Adelie  Torge~           NA            NA                 NA          NA
## 5 Adelie  Torge~           36.7          19.3              193        3450
## 6 Adelie  Torge~           39.3          20.6              190        3650
## # ... with 2 more variables: sex <fct>, year <int>

The penguins dataset is a collection of observations made on three species of penguin by researchers at the Palmer Station Antarctica LTER. It was recently made freely available in an R package by Allison Horst as an alternative to the iris data. Let’s dig into the penguins data, starting with a simple scatter plot.

ggplot(penguins) +
  geom_point(aes(y = bill_depth_mm, x = body_mass_g))
## Warning: Removed 2 rows containing missing values (geom_point).

Let’s take a look at the individual pieces of that call. The ggplot() function is where we specify the data frame we want to plot. Next, we add a geom. Geom stands for geometric object, and it can be a point, a line, a bar, and more. In our case, we used geom_point(), which just tells R we want to make a scatter plot. Other types of plots are specified with other geom functions, for example geom_boxplot(), geom_line(), and so on.
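As a small sketch of swapping one geom for another, the same data frame can feed a box plot instead of a scatter plot. The object name `box` and the choice of variables are just for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Same data, different geom: one box per species showing body mass.
box <- ggplot(penguins) +
  geom_boxplot(aes(x = species, y = body_mass_g))

box
```

Only the geom function and the variables inside aes() changed; the ggplot(penguins) part stays the same.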

Next we fill in our aesthetics aes() arguments. This is where we tell R what we want to actually put on the plot. You can make all kinds of customizations within aes(), including point and line types, color mapping, groupings, and so on. But more about that later. The key thing to remember with aes() is that this is where you specify your x and y variables. For a quick example, we can ask R to encode color information by species.

ggplot(penguins) +
  geom_point(aes(y = bill_depth_mm, x = body_mass_g, color = species))
## Warning: Removed 2 rows containing missing values (geom_point).

ggplot2 makes a pretty nice figure with its default settings, but I think we can spruce it up a bit.

ggplot(penguins) +
  geom_point(aes(y = bill_depth_mm, x = body_mass_g, color = species), size = 2) +
  labs(x = "Body mass (g)", y = "Bill depth (mm)", color = "Species") +
  scale_color_manual(values = c("dodgerblue2", "orangered2", "darkgoldenrod2")) +
  theme_bw()
## Warning: Removed 2 rows containing missing values (geom_point).

We did quite a bit here with just a couple of lines of code. Let’s break it down. First, I increased the size of the points by passing the argument size = 2 outside of aes(). Adding it inside of aes() would treat the value 2 as data to be mapped to point size (and add a legend for it), which is not what we want. We just want to adjust the size of the points to make them easier to see. You can go ahead and try passing the size argument inside aes() to see what I’m talking about.
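Here is a minimal side-by-side sketch of the two placements, in case you want to compare them directly. The object names `size_set` and `size_mapped` are made up for this example.

```r
library(ggplot2)
library(palmerpenguins)

# size outside aes(): a fixed setting applied to every point, no legend.
size_set <- ggplot(penguins) +
  geom_point(aes(y = bill_depth_mm, x = body_mass_g), size = 2)

# size inside aes(): the constant 2 is treated like a variable and mapped
# to point size, so the plot gains a size legend with a single entry.
size_mapped <- ggplot(penguins) +
  geom_point(aes(y = bill_depth_mm, x = body_mass_g, size = 2))

size_set
size_mapped
```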

Next, I changed the theme of the plot to theme_bw(). ggplot2 offers a bunch of different built-in themes, which are listed in the package documentation.

Finally, I chose custom colors for the data points. R has a ton of color options available for your plots. You can print the full list of built-in color names by running colors(). In addition, you can also create custom colors in R using hexadecimal color codes.

ggplot(penguins) +
  geom_point(aes(y = bill_depth_mm, x = body_mass_g, color = species), size = 2) +
  labs(x = "Body mass (g)", y = "Bill depth (mm)", color = "Species") +
  scale_color_manual(values = c("#ff4d94", "#6666ff", "#558000")) +
  theme_bw()
## Warning: Removed 2 rows containing missing values (geom_point).
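If you want to move between color names, RGB values, and hex codes, base R has a few helpers. This is only a small sketch; the variable names are made up for illustration.

```r
# Check that the names used earlier are built-in R colors.
named_colors <- c("dodgerblue2", "orangered2", "darkgoldenrod2")
named_colors %in% colors()

# Look up the red/green/blue components (0-255) of a named color.
col2rgb("dodgerblue2")

# Build a hex code from red/green/blue components.
pink_hex <- rgb(255, 77, 148, maxColorValue = 255)
pink_hex
```

The hex string returned by rgb() can be passed straight to scale_color_manual(values = ...).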

However, there are already lots of preset color palettes that work nicely together. An especially helpful one is the viridis color palette, which helps you create color-blindness friendly figures, making your science more accessible.

library(viridis)
## Warning: package 'viridis' was built under R version 3.5.3
## Loading required package: viridisLite
ggplot(penguins) +
  geom_point(aes(y = bill_depth_mm, x = body_mass_g, color = species), size = 2) +
  labs(x = "Body mass (g)", y = "Bill depth (mm)", color = "Species") +
  scale_color_viridis(discrete = TRUE) +
  theme_bw()
## Warning: Removed 2 rows containing missing values (geom_point).

Alternatively, we can ditch color altogether and represent our different species by shapes instead, using either the shape = or the pch = argument.

ggplot(penguins) +
  geom_point(aes(y = bill_depth_mm, x = body_mass_g, shape = species), size = 2) +
  labs(x = "Body mass (g)", y = "Bill depth (mm)", shape = "Species") +
  theme_bw()
## Warning: Removed 2 rows containing missing values (geom_point).
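If the default shapes don’t suit your figure, you can pick them yourself with scale_shape_manual(). The sketch below uses open symbols; the particular shape codes are just one possible choice.

```r
library(ggplot2)
library(palmerpenguins)

open_shapes <- ggplot(penguins) +
  geom_point(aes(y = bill_depth_mm, x = body_mass_g, shape = species), size = 2) +
  # Shape codes: 1 = open circle, 2 = open triangle, 0 = open square.
  scale_shape_manual(values = c(1, 2, 0)) +
  labs(x = "Body mass (g)", y = "Bill depth (mm)", shape = "Species") +
  theme_bw()

open_shapes
```

Open symbols can make overlapping points a little easier to tell apart than filled ones.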

Or better yet, we can combine shapes and colors!

ggplot(penguins) +
  geom_point(aes(y = bill_depth_mm, x = body_mass_g, color = species, shape = species), size = 2) +
  labs(x = "Body mass (g)", y = "Bill depth (mm)", color = "Species", shape = "Species") +
  scale_color_manual(values = c("dodgerblue2", "orangered2", "darkgoldenrod2")) +
  theme_bw()
## Warning: Removed 2 rows containing missing values (geom_point).

I think our figure is looking pretty good now, but the text on the axes is a bit small. We can make the text bigger with just one extra line of code.

ggplot(penguins) +
  geom_point(aes(y = bill_depth_mm, x = body_mass_g, color = species, shape = species), size = 2) +
  labs(x = "Body mass (g)", y = "Bill depth (mm)", color = "Species", shape = "Species") +
  scale_color_manual(values = c("dodgerblue2", "orangered2", "darkgoldenrod2")) +
  theme_bw() +
  theme(text = element_text(size = 14)) 
## Warning: Removed 2 rows containing missing values (geom_point).

Adding a linear model

Finally, ggplot2 can overlay fitted models with the geom_smooth() function. It’s important to make sure you know what geom_smooth() is doing, though. Basically, geom_smooth() fits a model to the plotted data (a linear regression when you set method = "lm") and draws the predictions along with the associated confidence bands. However, geom_smooth() doesn’t hand the fitted model back to you, so to inspect the coefficients you will need to build the lm() objects and extract them yourself. We can save this topic for another time. In the meantime, just be careful when using canned functions like geom_smooth(): they will do a nice job of giving you a pretty picture, but they won’t necessarily tell you what’s going on under the hood. And we usually want more than just a pretty picture.

ggplot(penguins) +
  geom_point(aes(y = bill_depth_mm, x = body_mass_g, color = species, shape = species), size = 2) +
  geom_smooth(se = TRUE, method = "lm", aes(y = bill_depth_mm, x = body_mass_g)) +
  labs(x = "Body mass (g)", y = "Bill depth (mm)", color = "Species", shape = "Species") +
  scale_color_manual(values = c("dodgerblue2", "orangered2", "darkgoldenrod2")) +
  theme_bw() +
  theme(text = element_text(size = 14)) 
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

As you can see, without a grouping variable in geom_smooth(), ggplot2 fits a single line through all of the data, which suggests a negative relationship between bill depth and body mass. But we know just by looking at the scatter plot that we have some strong species-level effects. To fix this, let’s plot the individual regressions for each species. We can do this by simply adding color = species to the aes() call inside geom_smooth().

ggplot(penguins) +
  geom_point(aes(y = bill_depth_mm, x = body_mass_g, color = species, shape = species), size = 2) +
  geom_smooth(se = TRUE, method = "lm", formula = y ~ x, aes(y = bill_depth_mm, x = body_mass_g, color = species)) +
  labs(x = "Body mass (g)", y = "Bill depth (mm)", color = "Species", shape = "Species") +
  scale_color_manual(values = c("dodgerblue2", "orangered2", "darkgoldenrod2")) +
  theme_bw() +
  theme(text = element_text(size = 14)) 
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

Notice how the direction of the species-level relationships is reversed from the original model. Bill depth increases with body mass within each species, but considered in aggregate the relationship was negative because Gentoos as a group have smaller bill depths despite being the heaviest of the three species. This is a nice example of Simpson’s paradox, a phenomenon where the sign of a relationship can flip depending on whether or not you condition on group-level effects. If you suspect you have strong group-level effects in your own data, you should always check for a potential Simpson’s paradox before trudging forward with an aggregated analysis. Suspected group-level effects are also a good reason to implement a mixed-effects model. But that’s a topic for another time.
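If you’d like to check the sign flip numerically rather than by eye, here is a minimal sketch using lm() directly. It fits one pooled regression and one separate regression per species, which is my reading of what the two plots show; the object names are made up for illustration.

```r
library(palmerpenguins)

# Pooled model: a single slope for all penguins, ignoring species.
# lm() drops the rows with missing values by default.
pooled_fit <- lm(bill_depth_mm ~ body_mass_g, data = penguins)
pooled_slope <- coef(pooled_fit)[["body_mass_g"]]

# One model per species: split the data, then extract each slope.
species_slopes <- sapply(
  split(penguins, penguins$species),
  function(d) coef(lm(bill_depth_mm ~ body_mass_g, data = d))[["body_mass_g"]]
)

pooled_slope    # negative: the aggregated trend
species_slopes  # positive for each species
```

Comparing the sign of pooled_slope with the signs in species_slopes is a quick way to spot this kind of reversal before interpreting an aggregated trend.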

Facetting

Facetting allows you to split your data into many small multiples. For example, using facet_wrap(), I could split the previous plot into three panels, one for each island where the data were collected.

p <- ggplot(penguins) +
        geom_point(aes(y = bill_depth_mm, x = body_mass_g, color = species, shape = species), size = 2) +
        geom_smooth(se = TRUE, method = "lm", formula = y ~ x, aes(y = bill_depth_mm, x = body_mass_g, color = species)) +
        labs(x = "Body mass (g)", y = "Bill depth (mm)", color = "Species", shape = "Species") +
        scale_color_manual(values = c("dodgerblue2", "orangered2", "darkgoldenrod2")) +
        theme_bw() +
        theme(text = element_text(size = 14)) 

p + facet_wrap(~ island) 
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
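Two small things are worth trying from here, sketched below with a simplified version of the plot. The ncol argument of facet_wrap() controls the panel layout, and a quick cross-tabulation shows which species were recorded on which island, which is why some panels contain fewer species than others. The object names are made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)

facet_base <- ggplot(penguins) +
  geom_point(aes(y = bill_depth_mm, x = body_mass_g, color = species), size = 2) +
  labs(x = "Body mass (g)", y = "Bill depth (mm)", color = "Species") +
  theme_bw()

# Stack the island panels in one column instead of laying them out in a row.
stacked <- facet_base + facet_wrap(~ island, ncol = 1)
stacked

# Count the observations of each species on each island.
species_by_island <- table(penguins$species, penguins$island)
species_by_island
```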

That’s all for today’s blog.