Let’s load in the tidyverse package.

library(tidyverse)

Now let’s read in our data, which we’ll use later.[1]

df <- read_csv("http://dartgo.org/dc-complete")
## Parsed with column specification:
## cols(
##   record_id = col_double(),
##   month = col_double(),
##   day = col_double(),
##   year = col_double(),
##   plot_id = col_double(),
##   species_id = col_character(),
##   sex = col_character(),
##   hindfoot_length = col_double(),
##   weight = col_double(),
##   genus = col_character(),
##   species = col_character(),
##   taxa = col_character(),
##   plot_type = col_character()
## )

To get started, we’ll use a test dataset, mpg, to review ggplot’s basic functionality.

data(mpg)

mpg
## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4      3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

Try running this code:

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

ggplot2 is built on the idea of the Grammar of Graphics, a system described by Leland Wilkinson (see readings). There are a lot of nuances to explore in the book and in the ggplot2 package, but one of the core concepts to keep in mind is that a visualization has three critical components: the data, a coordinate system, and geoms (the shapes that render the data).

We can see this in action, starting with nothing but the data:

ggplot(mpg)

It doesn’t mean much yet. Let’s give it a coordinate system and see what happens.

ggplot(mpg, aes(x = cty, y = hwy))

Fantastic! The data is in there, describing a 2D plane (the most we can really have in ggplot2; there are no “traditional” 3D visualizations for it). Every row of our dataset corresponds to some point on this plane, but we still need to decide how those points will be rendered. For these two continuous variables, a scatterplot is a good place to start. We can add another “layer” to our plot with geom_point() to show where the data sits on this plane.

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()
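
Because a plot is just an object that layers get added to, one way to experiment is to store the data and coordinate system once and then try different geoms on top. The sketch below is only an illustration: the names base, scatter, and points_drawn are our own, and layer_data() is a ggplot2 helper that the rest of this workshop does not rely on.

```r
library(ggplot2)

# Data + coordinate system only: no geoms yet, so no points are drawn.
base <- ggplot(mpg, aes(x = cty, y = hwy))

# Adding a layer returns a new plot; `base` itself is unchanged.
scatter <- base + geom_point()

# layer_data() returns the data frame a layer will actually draw:
# one row per car, with its x and y position on the plane.
points_drawn <- layer_data(scatter, 1)
nrow(points_drawn)

scatter
```

Reusing base this way makes it easy to compare several renderings of the same plane.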

The points are just one way of rendering this. We could do it another way if we wanted to.

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_text(aes(label = model))

Or we could try this:

ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()

Or this:

ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_col()

ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_col(aes(color = model))

ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_col(aes(fill = model))

Let’s go back to our points for now, though. Although we’re limited to “2D” visualizations, a technique called “small multiples” (“facets” in ggplot) allows us to show the data along additional dimensions. For instance, we can separate things out by year:

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_grid(year~.)

Activity

Using what you know so far, create a plot that lays out city and highway mileage by vehicle class.

Solution

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_grid(class~.)

We could also write this as:

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_wrap(~class)
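
As a side note, the formula shorthand is not the only way to specify facets. The sketch below assumes a ggplot2 version recent enough to provide vars(), which the workshop code does not use. It spells out the facet variable explicitly and checks that each version yields one panel per vehicle class. The helper count_panels() is hypothetical, written just for this check.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# Formula shorthand: rows ~ columns, where "." means "nothing here".
by_row_formula <- base + facet_grid(class ~ .)

# The same layout with the row variable named explicitly.
by_row_vars <- base + facet_grid(rows = vars(class))

# facet_wrap() wraps the panels into a grid instead of a single column.
wrapped <- base + facet_wrap(vars(class))

# Hypothetical helper: count the panels a plot will draw.
count_panels <- function(p) length(unique(layer_data(p, 1)$PANEL))
sapply(list(by_row_formula, by_row_vars, wrapped), count_panels)
```

All three describe the same split of the data; they differ only in how the panels are arranged on the page.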

Activity

Try using our df data set to create a scatter plot of hindfoot length vs. weight. Then try breaking it out into facets by sex.

ggplot(df, aes(x = weight, y = hindfoot_length)) +
  geom_point()

ggplot(df, aes(x = weight, y = hindfoot_length)) +
  geom_point() +
  facet_grid(.~sex)

Activity

Try breaking that plot out by both sex and plot_type.

ggplot(df, aes(x = weight, y = hindfoot_length)) +
  geom_point() +
  facet_grid(plot_type~sex)

Another answer to this is:

ggplot(df, aes(x = weight, y = hindfoot_length)) +
  geom_point() +
  facet_grid(.~sex+plot_type)

Adding color

We can also encode variables as color, shape, or size. Let’s try coloring the points of mpg by vehicle class.

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(color = class)) +
  facet_grid(year~.) 

What if I want to make all the points blue?

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(color = "blue")) +
  facet_grid(year~.) 

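That doesn’t work: the points come out in a default color, with a legend entry labeled “blue”. That’s because we set the color inside of the aes() function, where ggplot treats "blue" as a data value to be mapped rather than as a color name. We use aes() when we’re referring to something inside the data, such as class, cty, or hwy. If we’re just trying to assign a characteristic to every point, we do that inside of geom_point() but outside of aes().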

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(color = "blue") +
  facet_grid(year~.) 
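
To see the difference between mapping and setting without squinting at the plot, we can inspect what each layer will draw. This is a small sketch using layer_data(), a ggplot2 helper not otherwise covered in this workshop; the names mapped and fixed are our own.

```r
library(ggplot2)

# "blue" inside aes() is treated as data: a one-level variable that the
# color scale maps to its own default color (and gives a legend).
mapped <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(color = "blue"))

# "blue" outside aes() is a fixed setting applied to every point.
fixed <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(color = "blue")

# Inspect the color each layer will actually draw with.
unique(layer_data(mapped, 1)$colour)
unique(layer_data(fixed, 1)$colour)
```

Only the second plot reports "blue"; the first reports whatever color the default scale assigned to the single value "blue".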

Activity

Try coloring our df points by genus.

ggplot(df, aes(x = weight, y = hindfoot_length)) +
  geom_point(aes(color = genus)) +
  facet_grid(.~sex)

These are great for exploratory data visualization. What we’d like to do is pivot to data visualizations for presentation - this means making plots not just informative for a researcher, but aesthetically pleasing and informative for a wider audience.

Labels

Let’s talk for a moment about text labels. We can use labs() to set the labels for just about anything. There are a lot of different philosophies about titles, but I usually like them to act as an explanation of the story for a graph.

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  labs(title = "Highway and City MPG Are Positively Correlated",
       subtitle = "And it's easy to see",
       caption = "Data from fueleconomy.gov",
       x = "Miles Per Gallon (City)",
       y = "Miles Per Gallon (Highway)")

Themes

ggplot is incredibly flexible; just about every part of a graph can be customized, from the colors to the size to the label text and font family. All you really ever have to do is add the + and put on a new layer that describes what you want to do.

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.5) +
  facet_grid(year~.) +
  theme_minimal()

There are several built-in themes. Which one do you think makes the graph look like this [SEE SLIDE]:

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.5) +
  facet_grid(year~.) +
  theme_dark()

There is also a function in ggplot called theme(), which lets us fine-tune these customizations. When we’re thinking about advanced graphics, often what we mean is fine-tuning graphs to get them ready for publication or presentation.

I often like to use theme_minimal() as a base, but it has some drawbacks, like a lack of borders when we use facets.

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.5) +
  facet_grid(year~.) +
  theme_minimal()

We can tweak this using theme():

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.5) +
  facet_grid(year~.) +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA))

Almost all of the elements we can tweak with theme() are built from:

- element_rect()
- element_line()
- element_text()

For instance:

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.5) +
  facet_grid(year~.) +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA),
        axis.title = element_text(family = "mono"),
        panel.grid.major.x = element_line(color = "black"))

If we want to completely remove something, we can use element_blank():

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.5) +
  facet_grid(year~.) +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA),
        axis.title = element_text(family = "mono"),
        panel.grid.major.x = element_line(color = "black"),
        panel.grid.minor.y = element_blank())
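
Since the same theme() tweaks keep reappearing, one option is to store them once and reuse them. This is a minimal sketch of that idea, not something the workshop requires; theme_workshop is a hypothetical name of our own.

```r
library(ggplot2)

# Hypothetical name: a base theme plus our tweaks, stored once.
theme_workshop <- theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA),
        axis.title = element_text(family = "mono"),
        panel.grid.minor.y = element_blank())

# Any plot can now pick up the same look with a single added layer.
p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.5) +
  facet_grid(year ~ .) +
  theme_workshop

p
```

Keeping the tweaks in one object means a change to the border or fonts only has to be made in one place.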

Activity

See if you can make this horrible-looking chart using theme()

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(alpha = 0.5) +
  facet_grid(year~.) +
  theme_minimal() +
  theme(panel.border = element_rect(color = "blue", fill = NA),
        axis.title.x = element_text(family = "serif"),
        axis.title.y = element_text(family = "mono", color = "red"),
        axis.text.x = element_text(family = "serif", size = 20),
        panel.grid.major.y = element_line(color = "green", size = 2),
        panel.grid.major.x = element_line(color = "black"))

Highlights

That’s enough of bad graphics. Let’s try making something good.

Aside from details like font choice and facet borders, one of the things that makes a graphic effective as a presentation piece is its ability to draw the eye to important details and to tell a story. Here’s an example [SHOW SLIDE]

We can make this sort of simple highlight happen by creating a subset of the data, and using that to add a second points layer.

vw <- mpg %>%
  filter(manufacturer == "volkswagen")

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_point(color = "red", data = vw)

Data Labels

To make a finer point of it, let’s throw some labels on there:

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_point(color = "red", data = vw) +
  geom_text(data = vw, aes(label = model))

That’s not so great: the labels overlap each other and the points. Let’s work on that with a package called ggrepel.

library(ggrepel)

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_point(color = "red", data = vw) +
  geom_text_repel(data = vw, aes(label = model))

This gets crowded fast. Before we move on, though, it’s worth noting that we can change the case of the labels with str_to_title(), which comes from stringr (loaded as part of the tidyverse):

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_point(color = "red", data = vw) +
  geom_text_repel(data = vw, aes(label = str_to_title(model)))

Activity

Create a plot that highlights the top performing cars (hwy mileage > 40), and labels them appropriately.

high <- mpg %>%
  filter(hwy > 40)

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_point(color = "red", data = high) +
  geom_text_repel(data = high, aes(label = model))

Another way to do this is to manipulate the dataset itself, but this does make it harder to add labels:

mpg2 <- mpg %>%
  mutate(highlight = ifelse(hwy > 40, "Yes", "No"))

ggplot(mpg2, aes(x = cty, y = hwy)) +
  geom_point(aes(color = highlight))
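
A third option sits between those two. A layer’s data argument can also be a function, which receives the plot’s data and returns the rows that layer should draw. The sketch below assumes that behavior of ggplot2 layers. The helper top_performers() is a hypothetical name, and plain geom_text() stands in for geom_text_repel() only to keep the example free of extra packages.

```r
library(ggplot2)

# Hypothetical helper: receives the plot's data (mpg here) and returns
# only the rows this layer should draw.
top_performers <- function(d) d[d$hwy > 40, ]

p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_point(data = top_performers, color = "red") +
  geom_text(data = top_performers, aes(label = model), vjust = -1)

# The second layer only contains the highlighted cars.
nrow(layer_data(p, 2))

p
```

This keeps the filter next to the layer that uses it, so there is no separate subset object to keep in sync with the main data.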

A Bit on Scales

When we talk about scales, we’re thinking of the many ways that we can encode data. Some of these are:

- X position
- Y position
- Size
- Shape
- Color
- Fill (different from color)
- Alpha (transparency)

There are various scale_* functions that handle how these work. Let’s use one to clean up the look of our highlight plot a bit.

ggplot(mpg2, aes(x = cty, y = hwy)) +
  geom_point(aes(color = highlight)) +
  scale_color_manual(values = c("black", "red"), guide = "none")
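
An unnamed values vector is matched to the levels in order ("No", then "Yes"), which is why c("black", "red") works here. As a hedged alternative, the sketch below names each value so the pairing stays correct even if the levels change order. It rebuilds mpg2 with base R so that it runs on its own.

```r
library(ggplot2)

# Same highlight column as above, built with base R for self-containment.
mpg2 <- mpg
mpg2$highlight <- ifelse(mpg2$hwy > 40, "Yes", "No")

p <- ggplot(mpg2, aes(x = cty, y = hwy)) +
  geom_point(aes(color = highlight)) +
  # Named values: each level is paired with its color explicitly.
  scale_color_manual(values = c(No = "black", Yes = "red"), guide = "none")

# How many points will be drawn in each color?
table(layer_data(p, 1)$colour)

p
```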

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(color = hwy)) +
  scale_color_viridis_c()

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(color = factor(hwy))) +
  scale_color_viridis_d()

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(color = hwy)) +
  scale_color_continuous(low = "blue", high = "yellow")

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(color = drv)) +
  scale_color_manual(values = c("steelblue", "forestgreen", "firebrick"))

Note: a package called ggthemes provides some nice colors and themes. For instance:

library(ggthemes)

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(color = drv)) +
  scale_color_fivethirtyeight() +
  theme_fivethirtyeight()

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(size = hwy)) +
  scale_size_continuous(range = c(1, 3))

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(shape = drv)) +
  scale_shape_manual(values = c(8, 10, 12))

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(shape = drv)) +
  scale_x_reverse()

Activity

Make a log10 scale for the x-axis of our df plot, and use the viridis palette to color the points by genus.

ggplot(df, aes(x = weight, y = hindfoot_length)) +
  geom_point(aes(color = genus)) +
  scale_x_log10() +
  scale_color_viridis_d()
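
To get a feel for what the log scale does, it can help to look at the positions ggplot2 actually plots. The sketch below uses mpg rather than df so that it runs without a download; displ is chosen arbitrarily as the x variable, and p_linear and p_log are our own names.

```r
library(ggplot2)

p_linear <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

p_log <- p_linear + scale_x_log10()

# On the log scale, x positions are stored as log10 of the original
# values; the axis labels are shown in the original units.
range(layer_data(p_linear, 1)$x)
range(layer_data(p_log, 1)$x)
```

The data itself is untouched: only the positions used for drawing are transformed.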

Annotating with shapes

Another commonly encountered issue is the need to put guiding lines or shapes onto a graph. Let’s add cutoff lines that mark 30 mpg highway and 25 mpg city.

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_hline(yintercept = 30) +
  geom_vline(xintercept = 25)

Let’s say we wanted to shame the low-performing vehicles. Let’s throw a box around some of the low-performers using annotate():

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_hline(yintercept = 30) +
  geom_vline(xintercept = 25) +
  annotate("rect", xmin = 10, xmax = 15, ymin = 15, ymax = 20)

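That’s out of order: the box is drawn on top of the points, because ggplot2 draws layers in the order they’re added. Let’s move annotate() ahead of geom_point(), then soften the box with a fill color and some transparency: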

ggplot(mpg, aes(x = cty, y = hwy)) +
  annotate("rect", xmin = 10, xmax = 15, ymin = 15, ymax = 20) +
  geom_point() +
  geom_hline(yintercept = 30) +
  geom_vline(xintercept = 25) 

ggplot(mpg, aes(x = cty, y = hwy)) +
  annotate("rect", 
           xmin = 10, 
           xmax = 15, 
           ymin = 15, 
           ymax = 20, 
           alpha = 0.2,
           fill = "red") +
  geom_point() +
  geom_hline(yintercept = 30) +
  geom_vline(xintercept = 25) 

Annotations can be all kinds of shapes, including text.

ggplot(mpg, aes(x = cty, y = hwy)) +
  annotate("rect", 
           xmin = 10, 
           xmax = 15, 
           ymin = 15, 
           ymax = 20, 
           alpha = 0.2,
           fill = "red") +
  annotate("text",
           x = -Inf,
           y = -Inf,
           label = "Bottom Left Corner") +
  annotate("text",
           x = Inf,
           y = Inf,
           label = "Top Right Corner") +
  geom_point() +
  geom_hline(yintercept = 30) +
  geom_vline(xintercept = 25) 

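Those labels are centered on the corners, so part of each one falls outside the plot area. If we want to fix where the text annotations are anchored, we can use hjust and vjust: 0 anchors the text at its left (or bottom) edge, 1 at its right (or top) edge, and 0.5, the default, at its center.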

ggplot(mpg, aes(x = cty, y = hwy)) +
  annotate("rect", 
           xmin = 10, 
           xmax = 15, 
           ymin = 15, 
           ymax = 20, 
           alpha = 0.2,
           fill = "red") +
  annotate("text",
           x = -Inf,
           y = -Inf,
           hjust = 0,
           vjust = 0,
           label = "Bottom Left Corner") +
  annotate("text",
           x = Inf,
           y = Inf,
           hjust = 1,
           vjust = 1,
           label = "Top Right Corner") +
  geom_point() +
  geom_hline(yintercept = 30) +
  geom_vline(xintercept = 25) 

Activity

Create a text annotation right where the hline and vline meet. If we wanted to anchor it at the top left portion of the text, how would we do that?

ggplot(mpg, aes(x = cty, y = hwy)) +
  annotate("rect", 
           xmin = 10, 
           xmax = 15, 
           ymin = 15, 
           ymax = 20, 
           alpha = 0.2,
           fill = "red") +
  annotate("text",
           x = -Inf,
           y = -Inf,
           hjust = 0,
           vjust = 0,
           label = "Bottom Left Corner") +
  annotate("text",
           x = Inf,
           y = Inf,
           hjust = 1,
           vjust = 1,
           label = "Top Right Corner") +
  annotate("text",
           x = 25,
           y = 30,
           hjust = 0,
           vjust = 1,
           label = "Hello, world!") +
  geom_point() +
  geom_hline(yintercept = 30) +
  geom_vline(xintercept = 25) 

Activity

Let’s bring it all together. Using themes, highlights, labels, and titles, let’s make a graph that’s good enough for publication.

high <- mpg %>%
  filter(hwy > 40)

low <- mpg %>%
  filter(cty <= 15, hwy <= 20)

ggplot(mpg, aes(x = cty, y = hwy)) +
  annotate("rect", 
           xmin = 15, 
           xmax = 20, 
           ymin = 20, 
           ymax = 30,
           fill = "blue",
           alpha = 0.2) +
  annotate("text",
           x = 12.5, 
           y = 22, 
           label = "Less Than Ideal",
           color = "red") +
  geom_point() +
  geom_point(data = high, color = "forestgreen") +
  geom_point(data = low, color = "firebrick") +
  geom_text_repel(data = high, aes(label = str_to_title(model))) +
  facet_grid(year~.) +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA)) +
  labs(title = "Highway and City MPG Are Positively Correlated",
       subtitle = "And it was better in 1999!",
       caption = "Data from fueleconomy.gov",
       x = "Miles Per Gallon (City)",
       y = "Miles Per Gallon (Highway)")

Now remember to save it out!

ggsave("./images/final_plot.png", dpi = 300, width = 10, height = 10)
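
By default, ggsave() saves the last plot displayed, and depending on the ggplot2 version it may stop or prompt if the target folder does not exist yet. The sketch below is one cautious way to handle both, not the only one. It writes to a temporary folder (an assumption made so the example runs anywhere; in the workshop project the folder would be ./images), and the names out_dir and out_file are our own.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# Assumption: a temporary folder stands in for "./images" here.
out_dir <- file.path(tempdir(), "images")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

out_file <- file.path(out_dir, "final_plot.png")

# Passing `plot = p` saves that specific plot rather than the last one displayed.
ggsave(out_file, plot = p, dpi = 300, width = 10, height = 10)

file.exists(out_file)
```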

  1. Data adapted from Data Carpentry, sourced from Portal Project Teaching Data Set. Ernest, Morgan; Brown, James; Valone, Thomas; White, Ethan P. (2017): Portal Project Teaching Database. figshare. https://doi.org/10.6084/m9.figshare.1314459.v6