Notes on visualizing data

These notes describe some details about visualization in R using the ggplot system in the tidyverse. They are adapted from Kieran Healy’s Data Visualization: A Practical Introduction. We’ll follow Healy’s examples and use the Gapminder dataset, which you can obtain by installing the gapminder package (run install.packages("gapminder")).

library(gapminder)
library(tidyverse)

Take a look at the gapminder data (just run gapminder) before proceeding. It has data on life expectancy, population, and GDP per capita for a number of countries over time.
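If just printing gapminder is a lot to take in, here is a minimal sketch of a few other ways to peek at it. It assumes the gapminder and tidyverse packages are installed; which summaries you look at is just a suggestion.

```r
library(gapminder)
library(tidyverse)

# Print only the first few rows of the data
head(gapminder)

# List every column with its type and its first few values
glimpse(gapminder)

# Quick numeric summaries of the two variables plotted in these notes
summary(gapminder$lifeExp)
summary(gapminder$gdpPercap)
```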

Basics of ggplot

The “gg” in “ggplot” stands for grammar of graphics. It was originally (and still is) in the ggplot2 package, which has been incorporated into the tidyverse. As the name suggests, ggplot takes a variety of plotting functions and brings them together into a cohesive grammar. This grammar is useful because when you want to do new things, you have a set of rules from which you can envision how to accomplish your intent. In case you want to learn more, the theoretical underpinnings of the ggplot philosophy of graphics can be found in this book and this paper (you don’t need to read them for this course—we’ll develop all the theory we need along the way).

The notion of a mapping is central to the ggplot philosophy. A mapping relates variables in the data to visual elements of the plot. You can have a consistent mapping throughout a plot, or have layer-specific mappings. We’ll play with a few different examples so that you get the intuition, so don’t sweat it if it doesn’t make sense right away.

Plots created by ggplot are, like everything else in R, objects. This means we can save our plots and build them up iteratively, as we’ll see below. You should use this feature.

The `ggplot` function specifies the base for the plot

To start, let’s make a plot of life expectancy against GDP per capita.

plot_base <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
plot_base

Just running plot_base creates a blank canvas with GDP per capita on the x axis and life expectancy on the y axis. Every plot you’ll make in ggplot will start with a ggplot() call, where you’ll almost always specify the dataset you want to use in the plot and the axis mappings.

The data argument is the dataset where we want ggplot() to look for the variables we’ll be plotting.

The mapping argument is built with aes(), which is a function (not a data object or a character string). The arguments to aes() are definitions telling ggplot() which variables go on the plot axes (later in the course we’ll see examples of other arguments to aes()). We don’t need to tell aes() where to look for objects with those names because ggplot() assumes they’re in whatever we gave as the data argument.
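To see that aes() really is a function call and that the plot really is an object, here is a small sketch. The name axis_mapping is made up for illustration; it isn't something ggplot requires.

```r
library(gapminder)
library(tidyverse)

# aes() is an ordinary function call, so its result can be stored and printed
axis_mapping <- aes(x = gdpPercap, y = lifeExp)
axis_mapping

# Passing the stored mapping gives the same base plot as writing aes() inline
plot_base <- ggplot(data = gapminder, mapping = axis_mapping)

# The plot is an object too, so we can inspect it without drawing it
class(plot_base)
```

Nothing is drawn until you print plot_base (or add layers to it and print the result).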

`geom_` objects add layers

Suppose we want to add a scatterplot. Since we’ve already specified the mapping, we can just run

scatterplot <- plot_base + geom_point()
scatterplot

geom_point() is a function which adds points (which are a geometric object). If we hadn’t specified the mapping in the ggplot() call, we could have specified it in the geom_point() call:

ggplot(data = gapminder) + geom_point(mapping = aes(x = gdpPercap, y = lifeExp))

If we want to show a line going through the points with a confidence interval (more on those later on in the course), we can use geom_smooth() to draw a “smoothing” line through the points:

scatterplot + geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

geom_smooth() has some options for how it draws the lines. We’ll ignore them for now, since they’re beyond the scope of this class. But you should know that you can customize the line.
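Just to show that customization is possible, here is a minimal sketch using three geom_smooth() arguments: method, se, and color. The particular values are my choices for illustration, not recommendations, and straight_fit is a made-up name.

```r
library(gapminder)
library(tidyverse)

scatterplot <- ggplot(data = gapminder,
                      mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# A straight-line fit, with no confidence band, drawn in a fixed color
straight_fit <- scatterplot +
  geom_smooth(method = "lm", se = FALSE, color = "black")

straight_fit
```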

geom_line() draws a line connecting all the points in the mapping in order of their x axis values; probably not what we want here:

plot_base + geom_line()

There are many more geoms in ggplot: geom_bar(), geom_histogram(), geom_vline(), etc. If you’re trying to draw a particular kind of plot, just search for the plot type plus “in ggplot” and you’ll find plenty of results. The ggplot philosophy means you’ll interact with all geoms in similar ways, by specifying mappings (or having them inherit mappings from the plot base) and so on.
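As a quick sketch of how similar other geoms feel, here is a histogram of life expectancy with a vertical reference line. The bin width of 5 years and the dashed line at the mean are arbitrary choices for illustration, and life_hist is a made-up name.

```r
library(gapminder)
library(tidyverse)

# A histogram needs only an x mapping: geom_histogram() computes the counts
life_hist <- ggplot(data = gapminder, mapping = aes(x = lifeExp)) +
  geom_histogram(binwidth = 5)

# Add a vertical reference line at the mean life expectancy
life_hist +
  geom_vline(xintercept = mean(gapminder$lifeExp), linetype = "dashed")
```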

You can add other plot elements too

Now suppose we wanted to change the axis labels and give the graph a title. We can add those just like we added the points for the scatterplot:

scatterplot + xlab("GDP per capita ($/person)") +
              ylab("Life expectancy (years)") +
              ggtitle("Life expectancy against GDP per capita")

Notice that I added the plot elements across multiple lines. R doesn’t care whether you do this or not, but it helps human readers track what’s going on in the plot’s construction. When you split across multiple lines, make sure the + symbol is at the end of the line. This way R knows to expect more components on the next line. If you don’t put the + at the end of the line and instead put it at the start of the new line, R will think you finished the plot and then get confused when a new line starts with a +:

labeled_scatterplot <- scatterplot + xlab("GDP per capita ($/person)") +
              ylab("Life expectancy (years)")
              + ggtitle("Life expectancy against GDP per capita")

## Error in +ggtitle("Life expectancy against GDP per capita"): invalid argument to unary operator

labeled_scatterplot

Notice that R read the first two lines correctly. When the second line ended without a +, it assumed I was done with the plot and saved what I’d told it to until that point (so labeled_scatterplot has the axis labels but no title). Then R read the third line as beginning with a + and hit an error (“Hey buddy, you can’t just start a line with a +”).
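For comparison, here is the same plot with every + at the end of its line. I've saved it under a different, made-up name (titled_scatterplot) so it doesn't overwrite labeled_scatterplot.

```r
library(gapminder)
library(tidyverse)

scatterplot <- ggplot(data = gapminder,
                      mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# Every line that continues the plot ends with a +
titled_scatterplot <- scatterplot +
  xlab("GDP per capita ($/person)") +
  ylab("Life expectancy (years)") +
  ggtitle("Life expectancy against GDP per capita")

titled_scatterplot
```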

Using aesthetic mappings

Suppose you wanted to color-code the points in the plot above by continent. Putting aside the question of whether you should do this, how would you go about it?

ggplot allows us to define an aesthetic mapping: a rule that ties a variable in the data to an aesthetic feature of the plot, such as position, color, or size. In this case, we’re mapping continent names to point colors. Let’s look at how we might do this:

plot_base + geom_point(aes(color=continent)) + xlab("GDP per capita ($/person)") +
              ylab("Life expectancy (years)") +
              ggtitle("Life expectancy against GDP per capita")

Notice that plot_base already had one aesthetic mapping (mapping = aes(x = gdpPercap, y = lifeExp)), which specified that the plot should have (1) a coordinate system with x axis given by GDP per capita and y axis given by life expectancy; and (2) the default scale (the scale of the data). We’ll see an example of a scale transformation in the next section.

Now we’ve added a new aesthetic mapping, specific to the geom_point() layer: we mapped the continent variable in the data to the color aesthetic. In doing so, we didn’t have to explicitly specify that each continent should get its own color, what those colors should be, or whether the data were numerical or categorical. ggplot inferred that each continent should get its own color because we did this using an aesthetic mapping; it figured out that the data were categorical from the underlying structure (continent is a factor variable); and it used the default color palette for categorical data. It also created a helpful legend, because ggplot is nice like that.

We could have specified this mapping at the overall plot level by putting the color=continent argument inside the aes() call that went into plot_base. If we’d done that, any layer we created which could make sense of a color argument (e.g. geom_point, geom_bar, etc.) would have been color-coded by continent. In the approach we used above, the color coding only applies to the geom_point layer.
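Here is a minimal sketch of the plot-level version. The name plot_by_continent is made up for illustration, and adding geom_smooth() is just my way of showing a second layer picking up the mapping.

```r
library(gapminder)
library(tidyverse)

# Color is mapped inside ggplot(), so every layer inherits it
plot_by_continent <- ggplot(data = gapminder,
                            mapping = aes(x = gdpPercap, y = lifeExp,
                                          color = continent))

# Both the points and the smoothing lines are now split by continent
plot_by_continent + geom_point() + geom_smooth()
```

Because geom_smooth() inherits the color mapping, it draws one smoothing line per continent rather than a single line through all the points.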

Arguments inside the aes() function (“aesthetic mappings”) behave differently from the same arguments given outside of it. Look at what happens when we specify color=continent outside the mapping:

plot_base + geom_point(color=continent) + xlab("GDP per capita ($/person)") +
              ylab("Life expectancy (years)") +
              ggtitle("Life expectancy against GDP per capita")

## Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomPoint, : object 'continent' not found

ggplot searches for an object named “continent” outside the data we told it to use in plot_base. Finding none, it returns an error. The correct syntax would be to give it a color name as a string. Let’s try that and see what happens:

plot_base + geom_point(color="red") + xlab("GDP per capita ($/person)") +
              ylab("Life expectancy (years)") +
              ggtitle("Life expectancy against GDP per capita")

Notice that now all the points are red. Specifying an option like color outside the aes() argument sets that aesthetic to a fixed value for the layer, but it doesn’t map it to the data in any way.

There are many, many more ways to use aes() to specify aesthetic mappings. You can find a lot of helpful information on StackOverflow and other places.
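As one example, here is a sketch that maps two aesthetics to the data and sets a third to a fixed value in the same layer. Mapping point size to population and using alpha = 0.4 are my choices for illustration, and bubble_plot is a made-up name.

```r
library(gapminder)
library(tidyverse)

plot_base <- ggplot(data = gapminder,
                    mapping = aes(x = gdpPercap, y = lifeExp))

# Inside aes(): color and size vary with the data (continent, population).
# Outside aes(): alpha (transparency) is one fixed value for every point.
bubble_plot <- plot_base +
  geom_point(mapping = aes(color = continent, size = pop), alpha = 0.4)

bubble_plot
```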

General principles of visualization

There are few hard and fast rules for making “good” visualizations, but many suggestions. We list two suggestions here. Practice, time, and curiosity will help you make appealing, intuitive, and memorable visuals.

Build up your plots in layers

The syntax of ggplot makes it natural for final plotting code to be constructed in layers. Take advantage of ggplot’s ability to add elements and save plots as objects by building your plots in layers too. Suppose we wanted to make a labeled scatterplot like above, but we also wanted to change the background (i.e. change the theme) and plot it on a log scale. Here’s some final plotting code:

plot_base + geom_point() +
            scale_x_log10(labels = scales::dollar) +
            geom_smooth(method = "lm") +
            xlab("GDP per capita ($/person)") +
            ylab("Life expectancy (years)") +
            ggtitle("Life expectancy against GDP per capita") +
            theme_bw()

## `geom_smooth()` using formula 'y ~ x'

(The method = "lm" argument to geom_smooth() tells ggplot to draw a straight (rather than bendy) line through the points. We’ll discuss lm a little near the end of the course, but the other possible arguments to geom_smooth() are beyond our scope in ECON 210. The labels = scales::dollar argument to scale_x_log10() makes the labels a bit nicer than the default scientific notation. There are other scale_ functions which you can include in your plots. If no scale_ function is used, the default is to plot with the scale of the data.)

In building up the code above, I didn’t write and run it all in one go. I started with plot_base and geom_point(). Then I tried out the log scale, and made the labels into dollars rather than scientific notation. Then I added geom_smooth(), and experimented with methods until I had one I liked. Finally, I added the axis labels and titles, and set the theme. Building a plot in layers is especially valuable when you’re experimenting with different plot designs and elements.
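Roughly, the steps looked like the sketch below. The step_ names are made up for illustration; saving each stage as its own object is one way to work, not the only one.

```r
library(gapminder)
library(tidyverse)

plot_base <- ggplot(data = gapminder,
                    mapping = aes(x = gdpPercap, y = lifeExp))

# Step 1: points only
step_points <- plot_base + geom_point()

# Step 2: try the log scale, with dollar labels instead of scientific notation
step_scaled <- step_points + scale_x_log10(labels = scales::dollar)

# Step 3: add a straight-line fit
step_smoothed <- step_scaled + geom_smooth(method = "lm")

# Step 4: axis labels, title, and theme
step_final <- step_smoothed +
  xlab("GDP per capita ($/person)") +
  ylab("Life expectancy (years)") +
  ggtitle("Life expectancy against GDP per capita") +
  theme_bw()

step_final
```

Printing any of the intermediate objects (say, step_scaled) shows the plot as it stood at that stage, which makes it easy to back up if a later step doesn't work out.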

Don’t make a plot too busy

Put too many embellishments on a plot and it starts to look like “chartjunk”. Think about some of the worst and best visualizations you’ve seen: which had unnecessary shadows and 3D effects? Which had lines that were tangential (at best) to the point being communicated? Odds are, the worse visuals had more chartjunk.

The human visual system is powerful, but it’s still not a good idea to overwhelm it with noise. Every plot should have a central message. Remove anything which doesn’t help it convey that message. A useful concept here is the “data-ink ratio”: the ratio of pixels which are data to all pixels plotted. A data-ink ratio of 1 means that every pixel is representing data. A data-ink ratio of 0 means that no pixels are representing data.

In general, you want to maximize the data-ink ratio, but sometimes a little “non-data” ink is helpful. Let’s compare three plots: one with the default gray background with gridlines (lots of “non-data” ink), one with a white background and gridlines (less “non-data” ink), and one with a plain white background and no gridlines (least “non-data” ink).

labeled_scatterplot

labeled_scatterplot + theme_bw() # white background with gridlines

labeled_scatterplot + theme_classic() # white background with no gridlines

The first graph gives gridlines. Gridlines are helpful when you want the reader to be able to identify rough magnitudes for the points, and to track them more easily. But the background is a lot of gray ink. This background may be good if it helps draw out a contrast in the main figure lines. Here, it isn’t doing much.

The second graph gives the same gridlines as the first, but removes the gray background ink. This feels a little less busy.

The third graph has the highest data-ink ratio, with just axis lines and no gridlines or box bounding the plot. It’s a little harder to identify the values of the points farther into the northeast, though. Whether this matters for the point you’re trying to make with the graph will depend on your use case.
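If you want something between the second and third graphs, here is a sketch of one option: keep the white background and major gridlines but drop the minor gridlines. The plot is rebuilt here so the code runs on its own, and light_grid_plot is a made-up name.

```r
library(gapminder)
library(tidyverse)

labeled_scatterplot <- ggplot(data = gapminder,
                              mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  xlab("GDP per capita ($/person)") +
  ylab("Life expectancy (years)")

# Start from the white-background theme, then remove only the minor gridlines
light_grid_plot <- labeled_scatterplot +
  theme_bw() +
  theme(panel.grid.minor = element_blank())

light_grid_plot
```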

There are many more adjustments possible, each suitable to different tasks. Sometimes the defaults are fine, sometimes more customization is called for. Don’t spend time on these kinds of issues while you’re iterating on the analysis. Do spend (some) time experimenting and thinking through presentation details before sharing your graph with your intended audience.