Part 1 is here.
Compare the ggplot2 code to the base R code we saw earlier:
ggplot(iris) +
# This aes() call sets up the PLOT LEVEL aesthetics
aes(x = Sepal.Length, y = Sepal.Width, color = Species) +
geom_point()
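# Base R equivalent: the legend has to be built by hand
plot(iris$Sepal.Length, iris$Sepal.Width, col = iris$Species)
legend(7, 4.3,
       levels(iris$Species),
       col = seq_along(levels(iris$Species)),  # one color per species, not per row
       pch = 1)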
Now the advantages of ggplot2 and the grammar-of-graphics approach start to become clear.
Geoms have different aesthetic requirements, and not every geom works with every dataset (visually or computationally):
# Note now the aes() call is INSIDE the ggplot() call--makes it obvious that these
# are the initial, PLOT LEVEL aesthetics
iris_plot <- ggplot(iris_long, aes(x = Sepal,
y = Petal,
color = Species))
iris_plot + geom_point()
iris_plot + geom_line()
iris_plot + geom_bin2d()
iris_plot + geom_violin()
## Warning: position_dodge requires non-overlapping x intervals
iris_plot + geom_dotplot()
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
## Error: geom_dotplot requires the following missing aesthetics: y
You can have as many geoms in a plot as you want. Later geoms are drawn in front of earlier ones:
iris_plot + geom_density_2d() + geom_point()
Because of the layering principle we discussed above, later steps and in particular geoms override earlier ones as the plot is built up.
In particular, if you define aesthetics in geoms they override any earlier definitions but only within that geom:
# Note that, as before, aesthetics are mapped to variables inside an aes() call
# If we want to map an aesthetic to a constant value we do so OUTSIDE aes()
# E.g., + geom_point(color = "red")
iris_plot + geom_density_2d() + geom_point(aes(color = dimension))
This doesn’t make much sense, but this might:
iris_plot +
geom_density_2d(aes(linetype = dimension)) +
geom_point(aes(shape = dimension))
What aesthetics are operative in this plot?
The plot-level aesthetics are x = Sepal, y = Petal, and color = Species.
geom_point inherits these, but it also uses shape = dimension.
geom_density_2d inherits these, but it also uses linetype = dimension.
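One way to check which aesthetics apply where is to inspect the plot object itself. The sketch below is illustrative only: it rebuilds a stand-in for iris_long (an assumption about how that data frame was made in Part 1) and then lists the plot-level mapping and each layer's own additions.

```r
library(ggplot2)

# Assumption: a stand-in for the iris_long data frame built in Part 1
iris_long <- rbind(
  data.frame(Species = iris$Species, dimension = "Length",
             Sepal = iris$Sepal.Length, Petal = iris$Petal.Length),
  data.frame(Species = iris$Species, dimension = "Width",
             Sepal = iris$Sepal.Width, Petal = iris$Petal.Width)
)

p <- ggplot(iris_long, aes(x = Sepal, y = Petal, color = Species)) +
  geom_density_2d(aes(linetype = dimension)) +
  geom_point(aes(shape = dimension))

# Plot-level aesthetics (ggplot2 stores "color" under the name "colour")
plot_aes <- names(p$mapping)

# Aesthetics each layer adds on top of the ones it inherits
layer_aes <- lapply(p$layers, function(l) names(l$mapping))

print(plot_aes)
print(layer_aes)
```

Each layer lists only what it adds; everything else comes from the plot-level mapping.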
Some geoms don’t take both x and y aesthetics; rather, they take just one and compute the other, or they transform one of the aesthetics by some computation. How this happens is beyond our scope here (come back next week!) but let’s look at a couple of examples.
ggplot(iris_long) +
aes(x = Sepal) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
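To see the "compute the other" step concretely, you can ask ggplot2 for a layer's data after its statistic has run. This is a small sketch using the built-in iris data rather than iris_long, with bins set explicitly to silence the message above:

```r
library(ggplot2)

p_hist <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 30)

# layer_data() returns the data frame the geom actually draws:
# we supplied only x, and the stat added a computed count for each bin
hist_data <- layer_data(p_hist)
head(hist_data[, c("x", "count")])

# Every observation lands in exactly one bin
total_count <- sum(hist_data$count)
total_count == nrow(iris)
```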
ggplot(iris_long) +
aes(x = Species, y = Sepal) +
geom_boxplot()
ggplot(iris_long) +
aes(x = Species, y = Sepal) +
geom_boxplot() +
geom_jitter(alpha = 0.5, aes(color = dimension))
Notice how the final boxplot-and-point plot makes it clear that the boxplot-only plot is highly deceptive: there are two groups of data here (two dimensions that were measured), and you probably wouldn’t want to present them combined together.
Remember one of the 10 data visualization principles above: visualize your data, not just summaries!
Some geoms are handy for annotation or to help with interpretation:
iris_plot +
geom_density_2d(aes(linetype = dimension)) +
geom_point(aes(shape = dimension)) +
# vertical line
geom_vline(xintercept = 5, size = 3) +
# horizontal line
geom_hline(yintercept = 6, color = "purple", linetype = 2) +
# a-b line
geom_abline(size = 10, alpha = 0.25, slope = 0.3, intercept = 1)
There’s also a useful annotate() function; see its documentation.
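As a minimal sketch of annotate(), here is a one-off text label added to a scatterplot of the built-in iris data. The label text and its position are arbitrary choices for illustration.

```r
library(ggplot2)

p_annotated <- ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species)) +
  geom_point() +
  # annotate() takes constant values directly; no data frame or aes() needed
  annotate("text", x = 5, y = 6.5, label = "A one-off note", hjust = 0)

p_annotated
```

Unlike a regular geom, the annotation does not inherit the plot-level aesthetics, so the label is not colored or split by Species.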
The question here is really: what are you trying to communicate?
(See notes under Data visualization section above.)
| If you want to… | …try this geom |
|---|---|
| show the relationship between two variables | geom_point, geom_jitter |
| show values over time or a series | geom_bar, geom_line |
| show data distribution (1D) | geom_histogram, geom_density |
| show data distribution (against another variable) | geom_boxplot, geom_violin |
| show data distribution (2D) | geom_hex, geom_bin2d |
| analyze trends | geom_smooth, geom_line |
We frequently would like to fit, or show, trend lines with point data. The easiest way to do this is with geom_smooth():
iris_plot + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
What’s happened here?
geom_smooth uses a loess smoother by default, which is just a form of local regression.
geom_smooth inherited the plot aesthetics (x, y, color), so it fit separately to each color (group of data).
More likely, we would like a simple linear regression, i.e., straight trend lines. For this we just need to override the default and tell it to use R’s lm function:
iris_plot + geom_point() + geom_smooth(method = "lm")
We might also want to fit an overall trend line, to the pooled data (i.e., without groups). You might be tempted in this case to override the color aesthetic to a constant value. This works, but it’s not ideal. Better:
iris_plot + geom_point() +
geom_smooth(method = "lm") + # per-group trend line
geom_smooth(method = "lm",
color = "black",
group = 1, # pooled; intent is clear
linetype = 2)
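To confirm that the two smoothers really are fitted to different groupings, you can count the groups in each layer's computed data. This is a sketch on the built-in iris data; the layer indices assume the points are layer 1 and the two smoothers are layers 2 and 3.

```r
library(ggplot2)

p_trend <- ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +       # per-species fits
  geom_smooth(method = "lm", formula = y ~ x,
              color = "black", group = 1, linetype = 2)  # pooled fit

# Number of distinct groups seen by each smoothing layer
groups_per_layer <- sapply(2:3, function(i) {
  length(unique(layer_data(p_trend, i)$group))
})
groups_per_layer
```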
Behind the scenes:
The color aesthetic controls the group aesthetic in this plot.
group is ultimately what determines how the data are split up for computation and plotting.
group = 1 makes it crystal clear what we want: a single group for the second geom_smooth.
Finally, we can use any smoothing function we want, including custom ones.
ggplot(iris_long, aes(Sepal, Petal)) +
geom_point() +
geom_smooth(method = "lm") +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), color = "red")
Labeling your plots well is important.
As we’ve seen, ggplot2 takes its default axis and aesthetic labels from the columns you specify. These are easy to change using the xlab and ylab functions; the ggtitle function is available as well:
iris_plot + geom_point() +
geom_smooth() +
xlab("What is a sepal again?") +
ylab("Petal (cm)") +
ggtitle("This is starting to look like a real plot!")
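The same labels can also be set in a single call with labs(), which additionally lets you rename the legend by naming the aesthetic. This is a sketch on the built-in iris data; the label wording is just an example.

```r
library(ggplot2)

p_labeled <- ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species)) +
  geom_point() +
  labs(x = "Sepal length (cm)",
       y = "Petal length (cm)",
       color = "Iris species",            # legend title for the color aesthetic
       title = "Petal vs. sepal length")

p_labeled
```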
ggplot2’s theme system is powerful and sometimes confusing. Like the rest of the ggplot system, it uses the idea of inheritance: you can apply themes, or aspects of them, to entire plots, sub-elements, or small details, and changes cascade down.
The simplest step is to apply a theme to an entire plot:
iris_plot + geom_point() + theme_gray()
iris_plot + geom_point() + theme_dark()
iris_plot + geom_point() + theme_minimal()
Many more themes are available in other R packages and online repositories:
library(cowplot)
iris_plot + geom_point() + theme_cowplot()
library(ggthemes)
iris_plot + geom_point() + theme_economist()
The theming system is also how we change specific aspects of plots.
Let’s say we want to apply some formatting to text in our plot. But, which text?
iris_plot + geom_point() + ggtitle("Iris") +
theme(text = element_text(size = 20, color = "red"))
iris_plot + geom_point() + ggtitle("Iris") +
theme(axis.text = element_text(size = 20, color = "red"))
iris_plot + geom_point() + ggtitle("Iris") +
theme(axis.text.x = element_text(size = 20, color = "red"))
iris_plot + geom_point() + ggtitle("Iris") +
theme(axis.text.x.bottom = element_text(size = 20, color = "red"))
iris_plot + geom_point() + ggtitle("Iris") +
theme(title = element_text(size = 20, color = "red"))
iris_plot + geom_point() + ggtitle("Iris") +
theme(axis.title = element_text(size = 20, color = "red"))
We can do similar things with any other aspect of the plot: grid lines, legends, backgrounds, etc. From the help page:
Theme elements inherit properties from other theme elements hierarchically. For example, axis.title.x.bottom inherits from axis.title.x which inherits from axis.title, which in turn inherits from text. All text elements inherit directly or indirectly from text; all lines inherit from line, and all rectangular objects inherit from rect. This means that you can modify the appearance of multiple elements by setting a single high-level component.
Interestingly (I just realized this in preparing these slides) this seems to be wrong or at least incomplete? I don’t understand why axis.text isn’t inheriting from text (which is what the documentation says). 🤷
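A likely explanation, offered as a sketch rather than a definitive answer: axis.text does inherit from text, but the default theme also gives axis.text a relative size and a colour of its own, and explicitly set properties take precedence over inherited ones. calc_element() resolves the inheritance chain for a single element, so the two cases can be compared directly:

```r
library(ggplot2)

# Default theme plus the high-level text override used above
th <- theme_gray() + theme(text = element_text(size = 20, color = "red"))

axis_text  <- calc_element("axis.text", th)   # has its own colour and relative size
axis_title <- calc_element("axis.title", th)  # sets neither, so inherits both

# Colour: the title picks up "red", the tick labels keep the theme's own grey
c(axis_text = axis_text$colour, axis_title = axis_title$colour)

# Size: the tick labels are scaled relative to the new base size
c(axis_text = axis_text$size, axis_title = axis_title$size)
```

So the inheritance is there, but it only fills in properties that the child element has not already set.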
Facets are multi-panel plots that show different subsets of a data frame. They work best on discrete variables and can help clear up a busy plot.
facet_grid is a rigid m x n matrix and best used with multiple variables.
facet_wrap is a long ribbon of panels that can be wrapped into any number of columns using ncol and best used with a single variable.
Which of these is the best visualization? Why?
# No facets
iris_plot + geom_point()
# Faceted by Species
iris_plot + geom_point() + facet_wrap(~Species)
# Faceted by dimension
iris_plot + geom_point() + facet_wrap(~dimension)
# Species (rows) by dimension (columns)
iris_plot + geom_point() + facet_grid(Species ~ dimension)
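The ncol argument mentioned above controls how facet_wrap wraps its ribbon of panels. This sketch on the built-in iris data stacks the three species panels in a single column, then reads the resulting layout back from the built plot. Treat the internal layout table as an implementation detail that may differ between ggplot2 versions.

```r
library(ggplot2)

p_facets <- ggplot(iris, aes(Sepal.Length, Petal.Length)) +
  geom_point() +
  facet_wrap(~Species, ncol = 1)   # one column, so panels stack vertically

# The built plot records which row and column each panel ends up in
panel_layout <- ggplot_build(p_facets)$layout$layout
panel_layout[, c("PANEL", "ROW", "COL", "Species")]
```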
At this point, we’ve pretty much arrived at our final plot!
At a minimum, be aware that some of the default colors in ggplot2 have equal luminance and so can be difficult to distinguish, particularly for colorblind viewers.
We can change colors and palettes, however; for example, the Viridis color scales included with ggplot2 are designed to be perceptually uniform in both color and black-and-white. They are also designed to be perceived by viewers with common forms of color blindness.
Use scale_color_viridis_d() with discrete data, and scale_color_viridis_c() with continuous data (the scale_fill_viridis_*() variants do the same for fill).
The RColorBrewer package is another good option worth looking into.
tx_housing # down-sampled from the "txhousing" dataset
tx_housing + scale_color_viridis_d()
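Since tx_housing is not defined here, the following self-contained sketch shows the same idea on the built-in iris data: a discrete viridis scale for a factor, and a continuous one for a numeric variable.

```r
library(ggplot2)

# Discrete variable (Species) gets the _d scale
p_discrete <- ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species)) +
  geom_point() +
  scale_color_viridis_d()

# Continuous variable (Petal.Width) gets the _c scale
p_continuous <- ggplot(iris, aes(Sepal.Length, Petal.Length, color = Petal.Width)) +
  geom_point() +
  scale_color_viridis_c()

# One color per species is actually used in the discrete plot
discrete_colours <- unique(layer_data(p_discrete)$colour)
length(discrete_colours)
```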
Note that ggplot() objects are just like any other R object, and can be printed (displaying the plot), passed to functions, saved to disk, etc.
The primary way to save a plot image is the ggsave() function:
p <- ggplot(cars, aes(speed, dist)) + geom_point()
print(p)
ggsave("cars_plot.pdf")
# You can specify file types and dimensions:
q <- ggplot(iris, aes(Sepal.Length)) + geom_histogram()
print(q)
ggsave("iris_plot.png", width = 8, height = 5)
# By default ggsave assumes you mean "save the last plot generated"
# but you can save arbitrary objects:
ggsave("cars_plot.jpg", plot = p + theme_bw())
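Relying on "the last plot" can be fragile in scripts, so a slightly more defensive pattern is to always pass the plot explicitly and state the units. This sketch writes to a temporary directory (an arbitrary choice for illustration) so that it leaves nothing behind in your project:

```r
library(ggplot2)

p_cars <- ggplot(cars, aes(speed, dist)) + geom_point()

# Save an explicitly named plot object; the file extension picks the format
out_file <- file.path(tempdir(), "cars_plot.pdf")
ggsave(out_file, plot = p_cars, width = 6, height = 4, units = "in")

file.exists(out_file)
```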
This figure is from a recent paper we published with many co-authors in GCB:
This seems—and is—complex, but you now have the tools to see what’s happening in the code that generates the figure.
# Whittaker biome plot
library(plotbiomes)
p_inset <- whittaker_base_plot() +
geom_point(data = cosore_points,
aes(x = mat_cosore, y = map_cosore / 10),
color = "black", shape = 4) +
coord_cartesian(ylim = c(0, 500)) +
theme(axis.title = element_blank(),
axis.text = element_text(size = 8),
legend.text = element_text(size = 7),
legend.key.size = unit(0.4, "lines"),
legend.position = c(0.35, 0.75),
legend.title = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white"),
panel.border = element_rect(colour = "black",
fill = NA, size = 0.5))
# SP's main climate space plot
p <- ggplot() +
geom_hex(data = map_mat_global,
aes(x = mat, y = map / 10),
bins = 100, na.rm = TRUE) +
scale_fill_viridis_c(name = "Grid cells", begin = 0.85, end = 0) +
geom_point(data = cosore_points,
aes(x = mat_cosore, y = map_cosore / 10),
color = "black", shape = 4, size = 1.5, na.rm = TRUE) +
theme_minimal() +
coord_cartesian(ylim = c(0, 500)) +
labs(x = "MAT (°C)", y = "MAP (cm)")
# Inset the first inside the second
library(cowplot, quietly = TRUE)
ggdraw() +
draw_plot(p) +
draw_plot(p_inset, x = 0.1, y = 0.52, width = 0.4, height = 0.45)
This workshop has covered a lot of ground, and necessarily skipped over and/or fudged some concepts for clarity and time.
Important things we haven’t talked about include:
Some of these are built into ggplot2, but there’s also a whole ecosystem of extension packages that people have written.
This is a great walkthrough of the evolution of a complex and beautiful visualization:
This presentation was written in R Markdown.