The goals of this notebook are two-fold: to look at different ways of graphically encoding the same data, and to introduce the ggplot2 library in R.
The motivating example for the first part of the practical is going to be the movement(s) of North Atlantic right whales, an endangered large whale species living on the East coast of North America.
I do research on these whales, and as part of that have been working on a project to estimate movement transitions between certain areas. In particular, I am interested in movements into and through the mid-Atlantic region (hereafter, MIDA), which comprises the nearshore areas between Georgia and Rhode Island. As part of this research, we’ve been conducting interviews with right whale experts to estimate their beliefs about movement transitions in this MIDA area.
Imagine then we have three regions: the mid-Atlantic (MIDA), a northern region (NORTH), and the southeastern region (SEUS).
In our example, we’d ask the experts to express their belief using natural frequencies (recall that Spiegelhalter et al. 2011 discussed this in their paper). The questions looked like:
“Of 100 Adult Female right whales in the mid-Atlantic region in December, how many moved to the southeastern region the next month?”
We’ll make some fake data that we can use for subsequent plotting. These data are the movement probabilities from MIDA to each of the three regions, including staying in MIDA. For example, a value of MIDA = 0.6 means that of 100 hypothetical whales, 60 would remain in the MIDA region, etc. Here are the data, which are point estimates of the three possible transitions.
library(knitr)  # provides kable() for the table below
mida <- 0.4
north <- 0.31
seus <- 1 - (mida + north)  # the three probabilities must sum to 1
df <- data.frame(regions = c('MIDA', 'NORTH', 'SEUS'), moveProb = c(mida, north, seus))
kable(df, format = 'html')
| regions | moveProb |
|---|---|
| MIDA | 0.40 |
| NORTH | 0.31 |
| SEUS | 0.29 |
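Before plotting, it can help to sanity-check the fake data. The short sketch below confirms that the three probabilities sum to one and converts them to the natural-frequency form used in the interview questions. The column name `per100` is my own invention for illustration.

```r
# Rebuild the fake data so this chunk runs on its own
mida <- 0.4
north <- 0.31
seus <- 1 - (mida + north)
df <- data.frame(regions = c('MIDA', 'NORTH', 'SEUS'),
                 moveProb = c(mida, north, seus))

# The three transitions are exhaustive, so they should sum to 1
# (all.equal() allows for floating-point rounding error)
stopifnot(isTRUE(all.equal(sum(df$moveProb), 1)))

# Express each probability as "whales out of 100" (hypothetical column name)
df$per100 <- round(df$moveProb * 100)
df
```

Read this way, each row answers a question of the form "of 100 whales in MIDA, how many end up in this region?"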
OK, now that we have data, we can look at some different ways to encode these movement data. (For now we’ll work in base R graphics; in part 2 we’ll use the ggplot2 library.)
Let’s start with our favourite - the pie chart.
pie(df$moveProb)
That’s fine, but we should add a label so we know what regions correspond to what probabilities:
pie(df$moveProb, labels = df$regions)
That’s better, but I think it would be better still if we shifted the pie slices to start at 12 o’clock instead of 3 o’clock:
pie(df$moveProb, labels = df$regions, init.angle = 90)
While this looks better to me, recall from lecture that pie charts are bad for encoding information. Indeed if you type ?pie into an R prompt you’ll see this:
Note
Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.
Also recall by looking at the table of the data, that I’ve made two of the probabilities very close together. It’s almost impossible to distinguish them with the pie chart, but as we’ll see, it is much easier with a plot type higher up on the hierarchy (e.g. the dotchart).
How about a bar chart? Here we can see the slight differences easily.
barplot(df$moveProb, names.arg = df$regions)
Simple, but very effective. How about lines, where the length of the line corresponds to the movement probability:
plot(0:1, 0:1, type = 'n', ylim = c(0, 1), xlim = c(0, 0.8), ylab = "", xlab = "")
segments(x0 = c(0, 0.2, 0.3),
x1 = c(0.4, 0.2 + 0.31, 0.3 + 0.29),
y0 = c(0.25, 0.5, 0.75),
y1 = c(0.25, 0.5, 0.75))
text(0.5, 0.25, labels = 'MIDA')
text(0.6, 0.5, labels = 'NORTH')
text(0.7, 0.75, labels = 'SEUS')
Do you recall a) where this is on the visual hierarchy, and b) why it’s so much harder than something that has a common position? For example, here’s the same plot but starting at 0:
plot(0:1, 0:1, type = 'n', ylim = c(0, 1), xlim = c(0, 0.8), ylab = "", xlab = "")
segments(x0 = c(0, 0, 0),
x1 = c(0.4, 0.31, 0.29),
y0 = c(0.25, 0.5, 0.75),
y1 = c(0.25, 0.5, 0.75))
text(0.5, 0.25, labels = 'MIDA')
text(0.5, 0.5, labels = 'NORTH')
text(0.5, 0.75, labels = 'SEUS')
Or maybe with points (I didn’t add the labels…sorry):
plot(c(1, 2, 3), df$moveProb, type = 'n', ylab = "", xlab = "")
points(x = 1:3, y = df$moveProb)
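If you did want the region labels on the points plot, one possible sketch is to switch off the default numeric x-axis and draw your own with `axis()`. The axis limits, point symbol, and y-axis label here are my choices, not part of the original plot.

```r
# Rebuild the fake data so this chunk runs on its own
df <- data.frame(regions = c('MIDA', 'NORTH', 'SEUS'),
                 moveProb = c(0.4, 0.31, 0.29))

# xaxt = 'n' suppresses the default numeric x-axis
plot(1:3, df$moveProb, xaxt = 'n', pch = 19,
     xlim = c(0.5, 3.5), ylim = c(0, 0.5),
     xlab = "", ylab = "Movement probability")

# Draw a custom axis with one tick per region
axis(side = 1, at = 1:3, labels = df$regions)
```

Starting the y-axis at 0 is deliberate: it keeps the positions comparable against a common baseline.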
How about colour to signify the movement probabilities? What would we expect that to look like? Here the darker red is a higher movement probability.
library(RColorBrewer)  # provides brewer.pal()
mypal <- brewer.pal(n = 3, name = 'Reds')
plot(c(1, 2, 3), y = c(0, 0, 0), type = 'n', xlim = c(0, 11), ylim = c(0, 1), ylab = "", xlab = "")
symbols(c(2, 5.5, 9), c(0.5, 0.5, 0.5), squares = c(1, 1, 1), bg = rev(mypal), add = TRUE)
text(c(2, 5.5, 9), c(0.2, 0.2, 0.2), labels = df$regions)
Do you think it’s better than or worse than area:
plot(c(1, 2, 3), df$moveProb, type = 'n', xlim = c(0, 10), ylim = c(0, 1), ylab = "", xlab = "")
symbols(c(2, 6, 9), c(0.5, 0.5, 0.5), squares = df$moveProb, bg = 'black', add = TRUE)
text(c(2, 6, 9), c(0.2, 0.2, 0.2), labels = df$regions)
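One caveat about the area plot: `symbols()` treats the `squares` argument as side lengths, so passing the probabilities directly makes each square's area proportional to the probability squared. If you want area itself to carry the value, a minimal sketch is to pass the square roots instead. The variable name `side` is mine.

```r
# Rebuild the fake data so this chunk runs on its own
df <- data.frame(regions = c('MIDA', 'NORTH', 'SEUS'),
                 moveProb = c(0.4, 0.31, 0.29))

# Side length = sqrt(probability), so that side^2 (the area) tracks probability
side <- sqrt(df$moveProb)

plot(c(1, 2, 3), df$moveProb, type = 'n', xlim = c(0, 10), ylim = c(0, 1),
     ylab = "", xlab = "")
symbols(c(2, 6, 9), c(0.5, 0.5, 0.5), squares = side, bg = 'black', add = TRUE)
text(c(2, 6, 9), c(0.2, 0.2, 0.2), labels = df$regions)
```

Either way, the two close probabilities remain hard to tell apart, which is the point of the comparison.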
Finally, what if we went old school, and looked at a dot chart—Cleveland and Tukey’s favorite:
dotchart(df$moveProb, labels = df$regions)
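A small refinement you might try, sketched here with my own choice of axis limits and label: sort the values before plotting, and start the x-axis at zero. `dotchart()` draws the first value at the bottom, so sorting in ascending order puts the largest probability at the top.

```r
# Rebuild the fake data so this chunk runs on its own
df <- data.frame(regions = c('MIDA', 'NORTH', 'SEUS'),
                 moveProb = c(0.4, 0.31, 0.29))

# Indices that sort the probabilities from smallest to largest
ord <- order(df$moveProb)

dotchart(df$moveProb[ord], labels = df$regions[ord],
         xlim = c(0, 0.5), pch = 19, xlab = "Movement probability")
```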
Now think back to the hierarchy from Munzner that I showed in class:
In all the plots we’ve seen above, we’re using the exact same data—just showing the data with different graphical encodings. I’d argue that some are more effective than others!
If we think back over a handful of these plots, we can order them from most to least effective:
1. dotchart
2. rectangles with different area
3. coloured rectangles
4. pie chart
And then Bill Cleveland will be happy!
One of the most impactful libraries right now is the ggplot2 library, created by Hadley Wickham—the Chief Scientist at RStudio. ggplot2 implements the Grammar of Graphics, which was originally formulated by Leland Wilkinson. Hadley wrote this library as part of his PhD work at Iowa State University.
Let’s start by looking at how you might specify a scatterplot:
library(ggplot2)
ggplot() +
layer(
data = diamonds,
mapping = aes(x = carat, y = price),
geom = "point",
stat = "identity",
position = "identity") +
scale_y_continuous() +
scale_x_continuous() +
coord_cartesian()
OMG you might say—that’s a lot of stuff to make a simple scatterplot. And while indeed it is a lot of stuff, there’s a nice theory underlying this, and it has a lot of nice defaults that allow you to replace the above, with this code below (which will make the same exact plot).
ggplot(diamonds, aes(carat, price)) +
geom_point()
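Where do those defaults come from? One way to peek, as a rough sketch, is to look at the default arguments of `geom_point()` itself. As far as I know, both `stat` and `position` default to "identity", which is why the short version matches the long one.

```r
library(ggplot2)

# formals() lists a function's arguments and their default values;
# pull out the two that we had to spell out in the layer() call
point_defaults <- formals(geom_point)[c("stat", "position")]
point_defaults
```

The scales and the coordinate system are filled in similarly: continuous x and y scales and Cartesian coordinates are what ggplot2 picks when you don't say otherwise.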
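So that’s interesting. But what is in all of that code? Let’s look at it in a bit more detail. A full “grammar” contains 5 elements: a default dataset and set of aesthetic mappings; one or more layers (each with a geom, a stat, and a position adjustment); a scale for each aesthetic mapping; a coordinate system; and a facet specification.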
How does that map on to the code?
- Data and aesthetic mappings: data = diamonds, mapping = aes(x = carat, y = price)
- Layer: geom = "point", stat = "identity", position = "identity"
- Scales (one for each aesthetic, here an x and a y mapping): scale_y_continuous(), scale_x_continuous()
- Coordinate system: coord_cartesian()
Why bother with all of this? Great question! If this seems like a lot of extra bookkeeping compared to pointing and clicking a barchart in Excel or Google Docs, well, it probably is. But what it forces you to do is think about the data, how you want to represent it, and what story you want to tell.
Instead of thinking, “Oh, I’m going to make a bar chart,” you shift to thinking “Okay, I’m going to use the best possible graphical perception tools to indicate the relative and absolute destinations of whales.” In the words of Hadley Wickham: “Use your graph to answer a question.”
Also, by thinking this way, you might easily shift the emphasis of the story. Say you want to show the relationship between diamond size and diamond price, but a different scale, say a log scale, would better depict the relationship. In addition, you might want to add a summary layer to highlight the relationship. These different components of the grammar are easily added as follows:
ggplot(diamonds, aes(carat, price)) +
geom_point() +
stat_smooth(method = lm) +
scale_x_log10() +
scale_y_log10()
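One way to see what the `stat_smooth(method = lm)` layer is doing: ggplot2 applies the scale transformations before it computes the stat, so the line is effectively a straight-line fit on the log10 values. As a rough sketch (the object name `fit` is mine), you could fit the same kind of model by hand:

```r
library(ggplot2)  # only needed here for the diamonds data

# Straight-line fit of log10(price) on log10(carat)
fit <- lm(log10(price) ~ log10(carat), data = diamonds)

# Intercept and slope on the log-log scale
coef(fit)
```

A straight line on log-log axes corresponds to a power-law relationship between carat and price on the original scale, with the slope as the exponent.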
Now we’re composing a more complex plot using these same grammar rules. All we’ve done is add a layer and change the scales. The idea to take away is that a complex plot is built up from a series of structured components.
Let’s think some more about the coordinate system. We often think of data being plotted in the Cartesian coordinate system. But ggplot2 allows us to quickly go from the Cartesian to a different one. Above we saw a shift to plotting data in log space, but what happens when we take a bar chart and plot it in polar coordinate space? Recall that our data haven’t changed. Compare these two plots:
ggplot(diamonds, aes(x = clarity, fill = clarity)) +
  geom_bar(position = 'dodge')
ggplot(diamonds, aes(x = clarity, fill = clarity)) +
  geom_bar(position = 'dodge') +
  coord_polar(theta = "x")
What do you notice? We have the same data, the same aesthetics, the same layer. But all we do is change the coordinate system and get a radically different plot. This is because the grammar of graphics readily facilitates these kinds of manipulations.
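The same trick connects back to the pie chart from part 1. As a sketch using the fake whale data: a single stacked bar, once bent into polar coordinates with the angle tied to y, becomes a pie chart. The object names `bar` and `pie_chart` are mine.

```r
library(ggplot2)

# Rebuild the fake whale data so this chunk runs on its own
df <- data.frame(regions = c('MIDA', 'NORTH', 'SEUS'),
                 moveProb = c(0.4, 0.31, 0.29))

# One stacked bar: a constant x, with bar segments filled by region
bar <- ggplot(df, aes(x = "", y = moveProb, fill = regions)) +
  geom_col(width = 1)

# Same data, same layer; only the coordinate system changes
pie_chart <- bar + coord_polar(theta = "y")

bar
pie_chart
```

Seen this way, a pie chart is not a separate chart type at all, just a stacked bar in a different coordinate system.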
Going forward, here’s a template to use to make any kind of plot using ggplot2. This comes from Hadley Wickham’s new book on using R for Data Science.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
While in practice you can take advantage of ggplot2’s defaults and won’t typically need to specify all of these parameters, the seven parameters compose the formal grammar:
The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.
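To make the template concrete, here is one possible way of filling in every slot, using the diamonds data from earlier. The particular choices (a bar geom, filling by clarity, faceting by color) are mine and purely illustrative; the stat and position shown are, as far as I know, just the defaults for `geom_bar()` written out.

```r
library(ggplot2)

p_full <- ggplot(data = diamonds) +        # <DATA>
  geom_bar(                                # <GEOM_FUNCTION>
    mapping = aes(x = cut, fill = clarity),  # <MAPPINGS>
    stat = "count",                          # <STAT>: count the rows in each bar
    position = "stack"                       # <POSITION>: stack the fill groups
  ) +
  coord_cartesian() +                      # <COORDINATE_FUNCTION>
  facet_wrap(~ color)                      # <FACET_FUNCTION>

p_full
```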
We saw above Wickham’s suggestion of using the graph to answer a question. So we might ask: do cars with big engines use more fuel than cars with small engines?
How could we look at big versus small with existing data? Well the ggplot2 library has an mpg data frame built in:
head(mpg)
Let’s look at the relationship between engine size and fuel economy with ggplot2. How would we do that? What elements do we need? What goes into this template?
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
n.b. Charles, you could have them puzzle through this interactively before giving them this code:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
We now have a plot with two aesthetics mapped, yet we could add a lot more (recall the ‘triple scatterplot’ idea of Anscombe from the lecture). Why don’t we use the qualitative description of the car type to see what jumps out at us:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, colour = class))
What does this graph tell you? What information can you quickly pick out?
In terms of the grammar, recall that the aesthetics are “visual properties that you can map to variables to display information about the data.” So in this case we have mapped three visual properties: x, y, and colour (to the variables displ, hwy, and class, respectively).
In the lecture we talked about the idea of small multiples; recall the examples of Galileo’s sunspots and Muybridge’s pictures of a running horse. In ggplot2 the way to show these is with the facet commands, facet_wrap and facet_grid. These are very powerful specifications that quickly allow you to see many different views of the same data.
For example, let’s say we wanted to have individual plots for each of the 7 car types:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
Note above that we’ve removed one of the aesthetic specifications (colour for class). We could have left it, but it would be redundant once we’ve split the data out into unique sub-plots.
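We only used facet_wrap above. As a sketch of its sibling, facet_grid lays the panels out by two variables at once, one for rows and one for columns. The choice of drv and cyl here is mine; both are columns in the mpg data.

```r
library(ggplot2)

# Confirm how many car types there are before faceting by class
n_class <- length(unique(mpg$class))
n_class

# Rows are split by drive train (drv), columns by number of cylinders (cyl)
p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

p_grid
```

Some panels may come out empty, which is itself informative: not every combination of drive train and cylinder count exists in the data.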
We’ve talked about the use of different geoms to show the same data. Compare these two plots (n.b. Charles—you might have them try to make this before you show the code):
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
What if you wanted them on the same plot?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
Let’s recap what is going on again. We have:
- Data: mpg
- Aesthetics: x and y
- Geoms: points and a smooth
Now this brings up an important point. We have to have aesthetics mapped, right? But what if we still want the colour attribute included to represent the class of car? How do we put that in given that we have two different geoms?
The natural thought would be to just add it in as before, but see what happens when we do that:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() +
geom_smooth()
versus this:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
Aaah, much better!
In this second case we are only mapping the colour aesthetic for class to the point layer, while the smooth is now over all of the data. The first code was grammatically correct, but probably not what we wanted.
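If you want to confirm how many curves each version fits, one rough check is to pull out the data computed for the smooth layer with `layer_data()` and count the groups. Note that I have swapped in `method = "lm"` here, rather than the default smoother used above, purely to keep the sketch quiet on the smaller classes; the object names are mine.

```r
library(ggplot2)

# Colour mapped globally: every layer, including the smooth, inherits it
p_global <- ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Colour mapped in the point layer only
p_local <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Layer 2 is the smooth; count how many separate groups it was fitted to
n_global <- length(unique(layer_data(p_global, 2)$group))
n_local  <- length(unique(layer_data(p_local, 2)$group))
c(global = n_global, local = n_local)
```

In the global version the smooth is fitted once per car class; in the local version it is fitted once to all of the data.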
Here’s another wrinkle. Let’s say we want to see a smooth, but only for one class type, e.g. subcompact. How could we do that?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = dplyr::filter(mpg, class == "subcompact"), se = FALSE)
Again, you can start to see and appreciate the power of this framework.
Imagine that the last series of exploratory graphics were made by you at the R console for your own benefit. Now, however, you need to communicate these results to an audience who are probably seeing the data for the first time, so we might add a few things to help orient them to the story.
Let’s try to add some labels of groups, as well as highlighting the worst performing cars. We’ll also add a title and clearer axis and legend labels.
As before, we’ll build this up using ggplot2. To make the labels, we first have to do a bit of calculating to find the worst performers. This chunk of code groups all of the cars in the mpg dataset by class and then finds the one with the worst highway mileage in each class. If you are interested, this is done with code from the dplyr package, also written by Hadley Wickham.
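library(dplyr)  # provides %>%, group_by(), and filter()
worst_in_class <- mpg %>%
  group_by(class) %>%
  filter(row_number(hwy) == 1)  # keep the row ranked lowest on hwy within each class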
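The `row_number(hwy) == 1` idiom keeps exactly one row per class, even when several cars tie for the lowest hwy. If you find it a bit cryptic, a roughly equivalent sketch uses `slice_min()`, which I believe needs dplyr 1.0.0 or later. The name `worst_alt` is mine, and I have not checked that it picks the identical row when there are ties.

```r
library(dplyr)
library(ggplot2)  # only needed here for the mpg data

worst_alt <- mpg %>%
  group_by(class) %>%
  # Lowest hwy within each class; with_ties = FALSE keeps a single row
  slice_min(hwy, n = 1, with_ties = FALSE) %>%
  ungroup()

# Should show one row for each car class
count(worst_alt, class)
```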
With those calculated, let’s look at the plot:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_point(size = 3, shape = 1, data = worst_in_class) +
geom_smooth(se = FALSE) +
ggrepel::geom_label_repel(aes(label = model), data = worst_in_class) +
labs(
title = "Fuel efficiency generally decreases with engine size",
x = "Engine displacement (L)",
y = "Highway fuel economy (mpg)",
colour = "Car type"
)
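`labs()` also accepts `subtitle` and `caption` arguments if you want to give the audience a bit more orientation. Here is a pared-down sketch; the subtitle and caption wording are placeholders of mine, not text from the original plot.

```r
library(ggplot2)

p_labelled <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Each point is one row of the mpg data",  # placeholder wording
    caption = "Data: ggplot2::mpg",                      # placeholder wording
    x = "Engine displacement (L)",
    y = "Highway fuel economy (mpg)",
    colour = "Car type"
  )

p_labelled
```

The subtitle sits directly under the title, and the caption goes below the plot, which makes it a natural place for a data source.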
So let’s recap starting with the default ggplot syntax for building up a plot and compare that to the code we just called:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
In the code to make the last plot we have:
- Two datasets (mpg and worst_in_class)
- Aesthetic mappings for x, y, and class
So while we’ve come a long way from pie(<data>), we also have an extremely complicated, yet elegant, plot with a consistent way to think about and graphically represent data. Instead of thinking about making a pre-selected type of chart, you can use this system to make any kind of plot.
If you want to learn more about ggplot2 here are some good starting places:
Also, note that you can make any kind of plot in R’s base graphic system. Some people prefer it, e.g. Nathan Yau, who has a blog post explaining why he prefers base graphics to ggplot2. I prefer the ggplot2 way of thinking about data and graphics, but wanted to point out contrasting opinions!
I haven’t even mentioned themes, which I’ll let you explore on your own. ggplot2 ships with 8 default plotting themes, which are preset templates. With one line of code you can change your plot from this:
p <- ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_point(size = 3, shape = 1, data = worst_in_class) +
geom_smooth(se = FALSE) +
ggrepel::geom_label_repel(aes(label = model), data = worst_in_class) +
labs(
title = "Fuel efficiency generally decreases with engine size",
x = "Engine displacement (L)",
y = "Highway fuel economy (mpg)",
colour = "Car type"
)
p + theme_bw()
to this:
p + theme_minimal()
And there is a package called ggthemes that allows you a broad range of defaults, e.g. the Economist:
library(ggthemes)  # provides theme_economist()
p + theme_economist()