Data Sources

The data used in this workshop are adapted from the following:

City of Cambridge Public Safety. “Police Department Crash Data - Updated.” Cambridge Open Data, September 13, 2022. https://data.cambridgema.gov/Public-Safety/Police-Department-Crash-Data-Updated/gb5w-yva3.

National Centers for Environmental Information, National Oceanic and Atmospheric Administration. “Past Weather.” Past Weather | CAMBRIDGE MA | US1MAMD0011, October 3, 2022. https://www.ncei.noaa.gov/access/past-weather/42.55502054693938,-71.76079845408827,42.040922629398835,-70.48652687279844.

Load the tidyverse and lubridate packages, then read in our data:

library(tidyverse)
library(lubridate)

crashes <- read_csv("./data/processed/crashes.csv")
weather <- read_csv("./data/processed/weather.csv")
crashes_weather <- left_join(crashes, weather)
weather_crashes <- read_csv("./data/processed/weather_crashes.csv")  %>%
  mutate(year = year(date))
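
The left_join() call above lets dplyr guess which columns to match on. As a minimal sketch of naming the key explicitly, the toy tables below stand in for the workshop files; their column names (date, n_crashes, precip) are assumptions based on how weather_crashes is used later, not a description of the real CSVs.

```r
library(tidyverse)

# Hypothetical stand-ins for crashes.csv and weather.csv.
# The real files may use different column names.
crashes_toy <- tibble(
  date = as.Date(c("2022-01-01", "2022-01-02", "2022-01-03")),
  n_crashes = c(3, 7, 5)
)
weather_toy <- tibble(
  date = as.Date(c("2022-01-01", "2022-01-02")),
  precip = c(0, 0.4)
)

# Naming the key silences the "Joining with ..." message and guards against
# accidentally matching on other columns that happen to share a name.
joined <- left_join(crashes_toy, weather_toy, by = "date") %>%
  mutate(year = lubridate::year(date))

joined
```

Because this is a left join, days with no matching weather record are kept, with NA in the weather columns.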

ggplot2 review

To get started, we’ll use a test dataset and start with a review of ggplot’s basic functionality.

data(mpg)

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

Try running this code:

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

ggplot2 is built on the idea of the Grammar of Graphics, a system described by Leland Wilkinson (see readings). There are a lot of nuances to explore in the book and in the ggplot2 package, but one of the core concepts to keep in mind is that a visualization has three critical components:

  • data
  • a coordinate system
  • geometry to display the data

We can see this in action by building a plot up one piece at a time, starting with only the data:

ggplot(mpg)

It doesn’t mean much yet. Let’s give it a coordinate system and see what happens.

ggplot(mpg, aes(x = cty, y = hwy))

Fantastic! The data are now mapped onto a 2D plane, which is the most ggplot2 really offers (it has no “traditional” 3D visualizations). Every row of our dataset corresponds to a point on this plane, but we still need to decide how those points will be rendered. For these two continuous variables, a scatterplot is a good place to start. We can add another “layer” to our plot, geom_point(), to show where the data fall on this plane.

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

Though we’re limited to “2D” visualizations, a technique called “small multiples” (“facets” in ggplot) allows us to show the data along additional dimensions. For instance, we can separate things out by year:

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_grid(year~.)

Activity

Using what you know so far, create a plot that lays out city and highway mileage by vehicle class.

Solution

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_grid(class~.)

We could also use facet_wrap(), which wraps the panels into a grid instead of stacking them in a single column:

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_wrap(~class)
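
The formula in facet_grid() can also name a variable on each side, giving rows for one variable and columns for another. The sketch below is an optional extension of the examples above; the choice of drv and year is only for illustration.

```r
library(ggplot2)

# Rows: drive train (drv); columns: model year (year)
p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ year)

p
```

facet_grid() lays out every row-by-column combination, so it suits cases where both variables have only a few levels.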

Activity

Try using our weather_crashes data set to create a scatter plot of precipitation vs. number of crashes. Then try breaking it out into facets by year.

ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
  geom_point() +
  facet_wrap(~year)

Labels

Let’s talk for a moment about text labels. We can use labs() to set the labels for just about anything. There are a lot of different philosophies about titles, but I usually like them to act as an explanation of the story for a graph.

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  labs(title = "Highway and City MPG Are Positively Correlated",
       subtitle = "And it's easy to see",
       caption = "Data from fueleconomy.gov",
       x = "Miles Per Gallon (City)",
       y = "Miles Per Gallon (Highway)")
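
labs() also accepts the name of any mapped aesthetic, which is how legend titles are set. A small sketch, assuming we map color to drv (a mapping not used in the plot above):

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
  geom_point() +
  labs(x = "Miles Per Gallon (City)",
       y = "Miles Per Gallon (Highway)",
       color = "Drive Train")  # sets the legend title for the color aesthetic

p
```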

Themes

ggplot is incredibly flexible; just about every part of a graph can be customized, from the colors to the size to the label text and font family. All you really ever have to do is add the + and put on a new layer that describes what you want to do.

ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
  geom_point() +
  facet_wrap(~year) +
  theme_minimal()

There are several built-in themes. Which one do you think makes the graph look like this [SEE SLIDE]:

ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
  geom_point() +
  facet_wrap(~year) +
  theme_dark()

There is also a function in ggplot called theme(), which lets us fine-tune these customizations. When we’re thinking about advanced graphics, often what we mean is fine-tuning graphs to get them ready for publication or presentation.

I often like to use theme_minimal() as a base, but it has some drawbacks, like a lack of borders when we use facets.

ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
  geom_point() +
  facet_wrap(~year) +
  theme_minimal()

We can tweak this using theme():

ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
  geom_point() +
  facet_wrap(~year) +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA))

Almost all of the elements we can tweak with theme() are built from:

  • element_rect()
  • element_line()
  • element_text()

For instance:

ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
  geom_point() +
  facet_wrap(~year) +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA),
        axis.title = element_text(family = "mono"),
        panel.grid.major.x = element_line(color = "black"))

If we want to completely remove something, we can use element_blank():

ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
  geom_point() +
  facet_wrap(~year) +
  theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA),
        axis.title = element_text(family = "mono"),
        panel.grid.major.x = element_line(color = "black"),
        panel.grid.minor.y = element_blank())
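
Since the same theme() tweaks keep reappearing, one option is to store them in an object and add that object to each plot. This is a sketch rather than part of the original workshop code; theme_workshop is a hypothetical name, and mpg is used so the example runs on its own.

```r
library(ggplot2)

# A base theme plus our tweaks, saved once for reuse
theme_workshop <- theme_minimal() +
  theme(panel.border = element_rect(color = "black", fill = NA),
        panel.grid.minor.y = element_blank())

p1 <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_wrap(~year) +
  theme_workshop

p2 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~class) +
  theme_workshop

p1
p2
```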

Activity

See if you can make this horrible-looking chart using theme()

ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
  geom_point() +
  facet_wrap(~year) +
  theme_minimal() +
  theme(panel.border = element_rect(color = "blue", fill = NA),
        axis.title.x = element_text(family = "serif"),
        axis.title.y = element_text(family = "mono", color = "red"),
        axis.text.x = element_text(family = "serif", size = 20),
        panel.grid.major.y = element_line(color = "green", size = 2),  # ggplot2 >= 3.4.0: use linewidth = 2 (size is deprecated for lines)
        panel.grid.major.x = element_line(color = "black"))

Highlights

That’s enough of bad graphics. Let’s try making something good.

Aside from details like font choice and facet borders, one of the things that makes a graphic effective as a presentation piece is its ability to draw the eye to important details and to tell a story. Here’s an example [SHOW SLIDE]

We can make this sort of simple highlight happen by creating a subset of the data, and using that to add a second points layer.

vw <- mpg %>%
  filter(manufacturer == "volkswagen")

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_point(color = "red", data = vw)
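
One optional refinement is to mute the background points so the highlighted subset stands out more. The gray shade and point size below are arbitrary choices for illustration.

```r
library(tidyverse)

vw <- mpg %>%
  filter(manufacturer == "volkswagen")

p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(color = "grey70") +                  # de-emphasized context
  geom_point(data = vw, color = "red", size = 2)  # highlighted subset, drawn on top

p
```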

Data Labels

To make a finer point of it, let’s throw some labels on there:

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_point(color = "red", data = vw) +
  geom_text(data = vw, aes(label = model))

That’s not great: the labels overlap each other. Let’s fix that with a package called ggrepel.

library(ggrepel)

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_point(color = "red", data = vw) +
  geom_text_repel(data = vw, aes(label = model))
## Warning: ggrepel: 14 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

This gets crowded fast. Before we move on, though, it’s worth noting how we can change the case of the labels, here with str_to_title():

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_point(color = "red", data = vw) +
  geom_text_repel(data = vw, aes(label = str_to_title(model)))
## Warning: ggrepel: 14 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
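
The warning above means ggrepel gave up on some labels. Two possible responses, sketched here under the assumption that one label per distinct point is enough: drop duplicate label rows with distinct(), and raise max.overlaps as the warning suggests. vw_labels is a hypothetical helper name.

```r
library(tidyverse)
library(ggrepel)

vw <- mpg %>%
  filter(manufacturer == "volkswagen")

# Keep one row per unique point/label combination so identical labels
# are not stacked on the same spot
vw_labels <- vw %>%
  distinct(cty, hwy, model)

p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_point(color = "red", data = vw) +
  geom_text_repel(
    data = vw_labels,
    aes(label = str_to_title(model)),
    max.overlaps = Inf  # always draw every label, even in crowded regions
  )

p
```

Setting max.overlaps = Inf trades the warning for a busier plot, so it works best once the label set has been thinned out.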

Let’s start off with this graph for the next activity:

ggplot(weather_crashes, aes(x = precip, y = n_crashes)) + 
  geom_point()

Activity

Create a plot using weather_crashes that highlights the days with the most crashes (n_crashes > 14).

high_crashes <- weather_crashes %>%
  filter(n_crashes > 14)

ggplot(weather_crashes, aes(x = precip, y = n_crashes)) + 
  geom_point() + 
  geom_point(data = high_crashes, color = "red") +
  geom_label_repel(data = high_crashes, aes(label = date))

Another way to do this is to manipulate the dataset itself, but this does make it harder to add labels:

weather_crashes2 <- weather_crashes %>%
  mutate(highlight = ifelse(n_crashes > 14, "Yes", "No"))

ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) + 
  geom_point(aes(color = highlight))
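
Labels are still possible with this approach: a layer’s data argument can be a function that filters the plot’s data. The sketch below uses a small made-up table (toy, with invented values) in place of weather_crashes2 so that it runs on its own.

```r
library(tidyverse)

# Made-up stand-in for weather_crashes2; the values are illustrative only
toy <- tibble(
  date = as.Date("2022-01-01") + 0:5,
  precip = c(0, 0.1, 0.05, 1.2, 2.5, 0.3),
  n_crashes = c(15, 16, 4, 7, 9, 12)
) %>%
  mutate(highlight = ifelse(n_crashes > 14, "Yes", "No"))

p <- ggplot(toy, aes(x = precip, y = n_crashes)) +
  geom_point(aes(color = highlight)) +
  geom_text(
    data = function(d) filter(d, highlight == "Yes"),  # label flagged rows only
    aes(label = format(date, "%b %d")),
    vjust = -1
  )

p
```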

A Bit on Scales

When we talk about scales, we’re thinking of the many ways that we can encode data. Some of these are:

  • X position
  • Y position
  • Size
  • Shape
  • Color
  • Fill (different from color)
  • Alpha (transparency)

There are various scale_* functions that control how each of these encodings is drawn. Let’s use them to clean up the look of the highlight plot from above.

# Manual colors, assigned in level order ("No", then "Yes"); guide = "none" hides the legend
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
  geom_point(aes(color = highlight)) +
  scale_color_manual(values = c("black", "red"), guide = "none")

# Continuous viridis palette for a numeric variable
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
  geom_point(aes(color = n_crashes)) +
  scale_color_viridis_c()

# Discrete viridis palette after converting the counts to a factor
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
  geom_point(aes(color = factor(n_crashes))) +
  scale_color_viridis_d()

# Custom gradient running from blue (low) to yellow (high)
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
  geom_point(aes(color = n_crashes)) +
  scale_color_continuous(low = "blue", high = "yellow")

# Named colors, with the legend kept this time
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
  geom_point(aes(color = highlight)) +
  scale_color_manual(values = c("steelblue", "firebrick"))

Note: a package called ggthemes provides some nice colors and themes. For instance:

library(ggthemes)

# Color palette and theme from ggthemes
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
  geom_point(aes(color = highlight)) +
  scale_color_fivethirtyeight() +
  theme_fivethirtyeight()

# Size encodes the crash count; range keeps point sizes between 1 and 3
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
  geom_point(aes(size = n_crashes)) +
  scale_size_continuous(range = c(1, 3))

# Shape encodes the highlight flag: 8 is an asterisk, 10 is a circled plus
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
  geom_point(aes(shape = highlight)) +
  scale_shape_manual(values = c(8, 10))

# Position scales can be transformed too; this flips the x axis
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
  geom_point(aes(shape = highlight)) +
  scale_x_reverse()
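
Fill and alpha from the list above are not shown in these examples. As a rough sketch of how fill differs from color, point shape 21 has an outline (color) and an interior (fill), and alpha makes overlapping points visible. Mapping fill to displ is an arbitrary choice for illustration, and mpg is used so the example runs on its own.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(
    aes(fill = displ),  # interior color comes from the data
    shape = 21,         # a circle with a separate outline and fill
    color = "black",    # fixed outline color
    size = 3,
    alpha = 0.6         # partial transparency reveals overplotting
  ) +
  scale_fill_viridis_c()

p
```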

Annotating with Shapes

Another commonly encountered issue is the need to put guiding lines or shapes onto a graph. Let’s try doing that with our graph.

ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) + 
  geom_point(aes(shape = highlight)) +
  geom_hline(yintercept = 14) +
  geom_vline(xintercept = 2)

Let’s say we wanted to call out the high-crash, low-precip days. Let’s add a box using annotate():

ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) + 
  geom_point(aes(shape = highlight)) +
  geom_hline(yintercept = 14) +
  geom_vline(xintercept = 2) +
  annotate(
    "rect",
    xmin = -0.15,
    xmax = 0.25,
    ymin = 14.5,
    ymax = 16.5
  )

Layers are drawn in the order they are added, so this rectangle sits on top of the points and can hide them. Moving annotate() ahead of geom_point() puts the box behind the points:

ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) + 
  annotate(
    "rect",
    xmin = -0.15,
    xmax = 0.25,
    ymin = 14.5,
    ymax = 16.5
  ) +
  geom_point(aes(shape = highlight)) +
  geom_hline(yintercept = 14) +
  geom_vline(xintercept = 2) 

The box is still a solid dark gray. Setting fill and alpha turns it into a translucent highlight, and a second annotate() call adds a text label next to it:

ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) + 
  annotate(
    "rect",
    xmin = -0.15,
    xmax = 0.25,
    ymin = 14.5,
    ymax = 16.5,
    alpha = 0.2,
    fill = "red"
  ) + 
  annotate(
    "text",
    x = .25,
    y = 15.5,
    color = "red",
    label = "Bad Days",
    hjust = 0
  ) +
  geom_point(aes(shape = highlight)) +
  geom_hline(yintercept = 14) +
  geom_vline(xintercept = 2)
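
The box coordinates above are typed in by hand. One alternative, sketched here with made-up data, is to compute them from the highlighted rows so the box follows the data. The table toy, the helper box, and the padding values are all assumptions for illustration.

```r
library(tidyverse)

# Made-up stand-in for weather_crashes2; the values are illustrative only
toy <- tibble(
  precip = c(0, 0.1, 0.05, 1.2, 2.5, 0.3),
  n_crashes = c(15, 16, 4, 7, 9, 15)
) %>%
  mutate(highlight = ifelse(n_crashes > 14, "Yes", "No"))

# Bounding box around the highlighted points, with a little padding
box <- toy %>%
  filter(highlight == "Yes") %>%
  summarise(
    xmin = min(precip) - 0.1,
    xmax = max(precip) + 0.1,
    ymin = min(n_crashes) - 0.5,
    ymax = max(n_crashes) + 0.5
  )

p <- ggplot(toy, aes(x = precip, y = n_crashes)) +
  annotate(
    "rect",
    xmin = box$xmin, xmax = box$xmax,
    ymin = box$ymin, ymax = box$ymax,
    alpha = 0.2, fill = "red"
  ) +
  geom_point(aes(shape = highlight)) +
  geom_hline(yintercept = 14)

p
```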