The data used in this workshop are adapted from the following:
City of Cambridge Public Safety. “Police Department Crash Data - Updated.” Cambridge Open Data, September 13, 2022. https://data.cambridgema.gov/Public-Safety/Police-Department-Crash-Data-Updated/gb5w-yva3.
National Centers for Environmental Information, National Oceanic and Atmospheric Administration. “Past Weather.” Past Weather | CAMBRIDGE MA | US1MAMD0011, October 3, 2022. https://www.ncei.noaa.gov/access/past-weather/42.55502054693938,-71.76079845408827,42.040922629398835,-70.48652687279844.
Load the tidyverse and lubridate packages, then read in our data:
library(tidyverse)
library(lubridate)
crashes <- read_csv("./data/processed/crashes.csv")
weather <- read_csv("./data/processed/weather.csv")
crashes_weather <- left_join(crashes, weather)
weather_crashes <- read_csv("./data/processed/weather_crashes.csv") %>%
mutate(year = year(date))
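The left_join() call above has no `by` argument, so dplyr joins on every column name the two tables share and prints a message naming them. Here is a minimal sketch of naming the key explicitly. It uses toy tibbles with made-up values; the column names `date`, `n_crashes`, and `precip` are assumptions based on how weather_crashes is used later in the workshop, not the confirmed schema of the processed files.

```r
library(dplyr)
library(lubridate)

# Toy stand-ins for the workshop data (values are made up for illustration)
crashes_toy <- tibble(
  date = as.Date(c("2021-01-01", "2021-01-02", "2022-01-01")),
  n_crashes = c(3, 7, 5)
)
weather_toy <- tibble(
  date = as.Date(c("2021-01-01", "2021-01-02", "2022-01-01")),
  precip = c(0, 0.4, 1.2)
)

# Name the key explicitly so the join does not depend on which
# column names the two tables happen to share
joined_toy <- left_join(crashes_toy, weather_toy, by = "date") %>%
  mutate(year = year(date))

joined_toy
```

Spelling out the key also guards against an accidental join on a second shared column that was never meant to be a key.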
To get started, we’ll use a test dataset and start with a review of ggplot’s basic functionality.
data(mpg)
mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows
Try running this code:
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()
ggplot2 is built on the idea of the Grammar of Graphics, a system described by Leland Wilkinson (see readings). There are a lot of nuances to explore in the book and in the ggplot2 package, but one of the core concepts to keep in mind is that a visualization has three critical components: the data, a coordinate system, and the geometric objects (geoms) that represent the data.
We can see this in action, starting with the data alone:
ggplot(mpg)
It doesn’t mean much yet. Let’s give it a coordinate system and see what happens.
ggplot(mpg, aes(x = cty, y = hwy))
Fantastic! The data is in there, and it now describes a 2D plane. Two dimensions is the most we can really have in ggplot2; it has no “traditional” 3D visualizations. Every row of our dataset corresponds to a point on this plane, but we still need to decide how those points will be rendered. For these two continuous variables, a scatterplot is a good place to start. We can add another “layer” to our plot containing geom_point() to show where the data sits on this plane.
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()
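The layering idea can be checked directly on the plot object. This is a small sketch rather than part of the original exercise; the variable names `base` and `scatter` are only for illustration. It shows that the plot holds no layers until a geom is added.

```r
library(ggplot2)

# Data plus aesthetic mappings: a plane with axes, but nothing drawn on it yet
base <- ggplot(mpg, aes(x = cty, y = hwy))
length(base$layers)

# Adding a geom appends one layer that says how to render each row
scatter <- base + geom_point()
length(scatter$layers)

scatter
```

Because `base` is an ordinary R object, it can be reused as the starting point for several different plots.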
Though we’re limited to “2D” visualizations, there’s a technique called “small multiples” (“facets” in ggplot) that allows us to show additional dimensions of the data. For instance, we can separate the points out by year:
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
facet_grid(year~.)
Using what you know so far, create a plot that lays out city and highway mileage by vehicle class.
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
facet_grid(class~.)
We could also write this as:
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
facet_wrap(~class)
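The two functions differ in layout: facet_grid() places panels on a strict row-by-column grid, while facet_wrap() wraps a single sequence of panels into as many rows as it needs. Here is a sketch of each on mpg. Faceting by `drv` and setting `ncol = 3` are choices made for this illustration only.

```r
library(ggplot2)

# facet_grid(): rows by year, columns by drive train (drv)
grid_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_grid(year ~ drv)

# facet_wrap(): one panel per class, wrapped into three columns
wrap_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_wrap(~class, ncol = 3)

grid_plot
wrap_plot
```

facet_grid() tends to suit crossing two variables, and facet_wrap() tends to suit a single variable with many levels.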
Try using our weather_crashes data set to create a
scatter plot of precipitation vs. number of crashes. Then try breaking
it out into facets by year.
ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
geom_point() +
facet_wrap(~year)
Let’s talk for a moment about text labels. We can use labs() to set the labels for just about anything. There are a lot of different philosophies about titles, but I usually like them to act as an explanation of the story for a graph.
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
labs(title = "Highway and City MPG Are Positively Correlated",
subtitle = "And it's easy to see",
caption = "Data from fueleconomy.gov",
x = "Miles Per Gallon (City)",
y = "Miles Per Gallon (Highway)")
ggplot is incredibly flexible; just about every part of a graph can be customized, from the colors to the size to the label text and font family. All you really ever have to do is add the + and put on a new layer that describes what you want to do.
ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
geom_point() +
facet_wrap(~year) +
theme_minimal()
There are several built-in themes. Which one do you think makes the graph look like this [SEE SLIDE]:
ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
geom_point() +
facet_wrap(~year) +
theme_dark()
There is also a function in ggplot called theme(), which
lets us fine-tune these customizations. When we’re thinking about
advanced graphics, often what we mean is fine-tuning graphs to get them
ready for publication or presentation.
I often like to use theme_minimal() as a base, but it
has some drawbacks, like a lack of borders when we use facets.
ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
geom_point() +
facet_wrap(~year) +
theme_minimal()
We can tweak this using theme()
ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
geom_point() +
facet_wrap(~year) +
theme_minimal() +
theme(panel.border = element_rect(color = "black", fill = NA))
Almost all of the elements we can tweak with theme() are
built from element_rect(), element_line(), and element_text().
For instance:
ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
geom_point() +
facet_wrap(~year) +
theme_minimal() +
theme(panel.border = element_rect(color = "black", fill = NA),
axis.title = element_text(family = "mono"),
panel.grid.major.x = element_line(color = "black"))
If we want to completely remove something, we can use
element_blank()
ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
geom_point() +
facet_wrap(~year) +
theme_minimal() +
theme(panel.border = element_rect(color = "black", fill = NA),
axis.title = element_text(family = "mono"),
panel.grid.major.x = element_line(color = "black"),
panel.grid.minor.y = element_blank())
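Since theme_minimal() and theme() both return theme objects, their combination can be stored once and reused across plots. A sketch follows; `theme_workshop` is a hypothetical name, and mpg stands in for the crash data so the example runs on its own.

```r
library(ggplot2)

# Hypothetical reusable theme: minimal base, a black border around each
# panel, and no minor horizontal grid lines
theme_workshop <- theme_minimal() +
  theme(
    panel.border = element_rect(color = "black", fill = NA),
    panel.grid.minor.y = element_blank()
  )

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_wrap(~year) +
  theme_workshop
```

Keeping the tweaks in one object makes it easier to give every figure in a report the same look.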
See if you can make this horrible-looking chart using
theme()
ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
geom_point() +
facet_wrap(~year) +
theme_minimal() +
theme(panel.border = element_rect(color = "blue", fill = NA),
axis.title.x = element_text(family = "serif"),
axis.title.y = element_text(family = "mono", color = "red"),
axis.text.x = element_text(family = "serif", size = 20),
panel.grid.major.y = element_line(color = "green", size = 2),
panel.grid.major.x = element_line(color = "black"))
That’s enough of bad graphics. Let’s try making something good.
Aside from details like font choice and facet borders, one of the things that makes a graphic effective as a presentation piece is its ability to draw the eye to important details and to tell a story. Here’s an example [SHOW SLIDE]
We can make this sort of simple highlight happen by creating a subset of the data, and using that to add a second points layer.
vw <- mpg %>%
filter(manufacturer == "volkswagen")
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
geom_point(color = "red", data = vw)
To make a finer point of it, let’s throw some labels on there:
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
geom_point(color = "red", data = vw) +
geom_text(data = vw, aes(label = model))
That’s not great: the labels overlap each other and the points. Let’s work on that with a package called
ggrepel
library(ggrepel)
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
geom_point(color = "red", data = vw) +
geom_text_repel(data = vw, aes(label = model))
## Warning: ggrepel: 14 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
This gets crowded fast. Before we move on, though, it’s worth noting how we can change the case of the labels with str_to_title():
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
geom_point(color = "red", data = vw) +
geom_text_repel(data = vw, aes(label = str_to_title(model)))
## Warning: ggrepel: 14 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
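The warning above suggests increasing max.overlaps. As a sketch of that suggestion, setting `max.overlaps = Inf` in geom_text_repel() asks ggrepel to place every label no matter how crowded the result. The other lines repeat the code above so the block runs on its own.

```r
library(ggplot2)
library(dplyr)
library(stringr)
library(ggrepel)

vw <- mpg %>%
  filter(manufacturer == "volkswagen")

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_point(color = "red", data = vw) +
  # max.overlaps = Inf tells ggrepel to keep every label
  geom_text_repel(
    data = vw,
    aes(label = str_to_title(model)),
    max.overlaps = Inf
  )
```

Every Volkswagen row now gets a label, which removes the warning but makes the plot busier. Whether that trade is worth it depends on the story the graph needs to tell.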
Let’s start off with this graph for the next activity
ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
geom_point()
Create a plot using weather_crashes that highlights the
days with the most crashes (n_crashes > 14).
high_crashes <- weather_crashes %>%
filter(n_crashes > 14)
ggplot(weather_crashes, aes(x = precip, y = n_crashes)) +
geom_point() +
geom_point(data = high_crashes, color = "red") +
geom_label_repel(data = high_crashes, aes(label = date))
Another way to do this is to manipulate the dataset itself, but this does make it harder to add labels:
weather_crashes2 <- weather_crashes %>%
mutate(highlight = ifelse(n_crashes > 14, "Yes", "No"))
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
geom_point(aes(color = highlight))
When we talk about scales, we’re thinking of the many ways that we can encode data, such as color, size, shape, and position.
There are various scale_* functions that handle how these work. Let’s clean up the look of that a bit.
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
geom_point(aes(color = highlight)) +
scale_color_manual(values = c("black", "red"), guide = "none")
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
geom_point(aes(color = n_crashes)) +
scale_color_viridis_c()
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
geom_point(aes(color = factor(n_crashes))) +
scale_color_viridis_d()
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
geom_point(aes(color = n_crashes)) +
scale_color_continuous(low = "blue", high = "yellow")
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
geom_point(aes(color = highlight)) +
scale_color_manual(values = c("steelblue", "firebrick"))
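The scale_color_manual() calls above pass unnamed colors, which are assigned to the levels in order ("No" before "Yes" alphabetically). Naming the values states the pairing explicitly. This sketch uses a toy tibble with made-up numbers and the same highlight rule; `toy` and `highlight_plot` are illustrative names.

```r
library(ggplot2)
library(dplyr)

# Toy data (made-up values) with the same highlight rule as above
toy <- tibble(
  precip = c(0.0, 0.1, 1.5),
  n_crashes = c(5, 16, 7)
) %>%
  mutate(highlight = if_else(n_crashes > 14, "Yes", "No"))

highlight_plot <- ggplot(toy, aes(x = precip, y = n_crashes)) +
  geom_point(aes(color = highlight)) +
  # Names tie each color to a level, independent of level order
  scale_color_manual(values = c(Yes = "red", No = "black"), guide = "none")

# Inspect the colors ggplot assigned to each row
layer_data(highlight_plot)$colour
```

With named values, reordering the vector or the factor levels does not silently swap the colors.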
Note: a package called ggthemes provides some nice colors and themes. For instance:
library(ggthemes)
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
geom_point(aes(color = highlight)) +
scale_color_fivethirtyeight() +
theme_fivethirtyeight()
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
geom_point(aes(size = n_crashes)) +
scale_size_continuous(range = c(1, 3))
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
geom_point(aes(shape = highlight)) +
scale_shape_manual(values = c(8, 10))
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
geom_point(aes(shape = highlight)) +
scale_x_reverse()
Another common need is to put guiding lines or shapes onto a graph. Let’s try doing that with our graph.
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
geom_point(aes(shape = highlight)) +
geom_hline(yintercept = 14) +
geom_vline(xintercept = 2)
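The intercepts above are hand-picked constants, but they can also be computed from the data. Here is a sketch on mpg that places dashed reference lines at the mean of each variable. Using the mean and a dashed linetype are choices made for this illustration, not part of the original exercise.

```r
library(ggplot2)

# Reference lines placed at the observed means rather than hand-picked values
mean_cty <- mean(mpg$cty)
mean_hwy <- mean(mpg$hwy)

ref_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_hline(yintercept = mean_hwy, linetype = "dashed") +
  geom_vline(xintercept = mean_cty, linetype = "dashed")

ref_plot
```

The two lines split the plot into quadrants of above-average and below-average cars, which is the same idea as the crash and precipitation thresholds above.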
Let’s say we wanted to call out the high-crash, low-precip days.
Let’s add a box using annotate():
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
geom_point(aes(shape = highlight)) +
geom_hline(yintercept = 14) +
geom_vline(xintercept = 2) +
annotate(
"rect",
xmin = -0.15,
xmax = 0.25,
ymin = 14.5,
ymax = 16.5
)
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
annotate(
"rect",
xmin = -0.15,
xmax = 0.25,
ymin = 14.5,
ymax = 16.5
) +
geom_point(aes(shape = highlight)) +
geom_hline(yintercept = 14) +
geom_vline(xintercept = 2)
ggplot(weather_crashes2, aes(x = precip, y = n_crashes)) +
annotate(
"rect",
xmin = -0.15,
xmax = 0.25,
ymin = 14.5,
ymax = 16.5,
alpha = 0.2,
fill = "red"
) +
annotate(
"text",
x = .25,
y = 15.5,
color = "red",
label = "Bad Days",
hjust = 0
) +
geom_point(aes(shape = highlight)) +
geom_hline(yintercept = 14) +
geom_vline(xintercept = 2)
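The annotate() examples above differ in where the rectangle appears in the chain. ggplot2 draws layers in the order they are added, so a rectangle added before geom_point() sits underneath the points. Here is a sketch on mpg; the box coordinates are arbitrary values chosen to cover the high-mileage corner of that dataset.

```r
library(ggplot2)

# Rectangle added last: it is drawn on top of the points
on_top <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  annotate("rect", xmin = 25, xmax = 36, ymin = 35, ymax = 45,
           fill = "red", alpha = 0.2)

# Rectangle added first: it is drawn underneath the points
underneath <- ggplot(mpg, aes(x = cty, y = hwy)) +
  annotate("rect", xmin = 25, xmax = 36, ymin = 35, ymax = 45,
           fill = "red", alpha = 0.2) +
  geom_point()

# The layer list records the drawing order
unname(sapply(underneath$layers, function(layer) class(layer$geom)[1]))
```

With a semi-transparent fill the difference is subtle, but with an opaque rectangle the first version would hide the points entirely.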