Data needs to be imported, tidied, and transformed before any visualization can be created. We will discuss these steps, which are necessary for almost all real-world datasets, in the following chapters. However, because we can use tidy example datasets that are already available in R, we will focus on creating visualizations first.
Specifically, we will use the ggplot2 package, which is part of the Tidyverse. It is based on a layered “grammar of graphics” described in this article. In this chapter, we will mention only the most basic commands that enable you to quickly create simple (but beautiful) visualizations.
Of course the first step is to activate the package:
library(ggplot2)
Note that you could also use library(tidyverse) to
activate all core Tidyverse packages, including ggplot2.
The ggplot2 package contains an example data frame called
mpg. It is a good idea to read its documentation
(?mpg) before moving on to the next step. Let’s take a look
at the data:
mpg
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto(l5)   f        18    29 p     compact
 2 audi         a4           1.8  1999     4 manual(m5) f        21    29 p     compact
 3 audi         a4           2    2008     4 manual(m6) f        20    31 p     compact
 4 audi         a4           2    2008     4 auto(av)   f        21    30 p     compact
 5 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p     compact
 6 audi         a4           2.8  1999     6 manual(m5) f        18    26 p     compact
 7 audi         a4           3.1  2008     6 auto(av)   f        18    27 p     compact
 8 audi         a4 quattro   1.8  1999     4 manual(m5) 4        18    26 p     compact
 9 audi         a4 quattro   1.8  1999     4 auto(l5)   4        16    25 p     compact
10 audi         a4 quattro   2    2008     4 manual(m6) 4        20    28 p     compact
# … with 224 more rows
Here’s the first question we want to answer from this dataset: do
cars with big engines consume more or less fuel than cars with small
engines? We can use the variables displ (engine size in
liters) and hwy (fuel efficiency on the highway in miles
per gallon) to try to address this question.
Let’s create our first plot using the following code (just copy and run this code for now):
ggplot(data=mpg) +
geom_point(mapping=aes(x=displ, y=hwy))
This scatterplot shows a clear negative relationship, meaning that larger engines tend to have lower fuel efficiency.
All ggplot2 plots start with calling the ggplot()
function, to which we pass a data frame that contains the data to be
plotted (using the data argument). This function call on
its own just produces an empty plot, because we haven’t specified any
details about our plot yet.
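As a quick check of this behavior, the following sketch stores the result of ggplot() in a variable and counts its layers. The name p_empty is only illustrative, and we assume that the plot object exposes its layers via $layers, which is the case in the ggplot2 versions we are aware of.

```r
library(ggplot2)

# ggplot() on its own sets up the plot but does not draw any data
p_empty = ggplot(data=mpg)

# The plot object does not contain any layers yet
length(p_empty$layers)

# Printing the object shows an empty plotting area
p_empty
```

The layer count should be zero at this point; every geom function we add to the plot increases it by one.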
To add something interesting to the plot, we literally add
another layer. In this example, the new layer should contain points,
which we can create with the geom_point() function. The
ggplot2 package contains many geom functions that enable us to create
many different kinds of plots. They all have a mapping
argument, which specifies how columns in the data frame should be
represented (note that this mapping needs to be wrapped inside an
aes() function). In our example, we map the
displ column to the x axis and the
hwy column to the y axis, resulting in a
scatterplot.
We can use the code from this example as a template for ggplot2 plots. It looks as follows:
ggplot(data=<DATA>) +
<GEOM_FUNCTION>(mapping=aes(<MAPPINGS>))
Note that this template is not valid R code! We need to replace all code in angle brackets with our actual data, geom function, and mappings.
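To illustrate how the placeholders might be filled in, here is a minimal sketch that uses the mtcars data frame, which ships with R. The variable name p_cars is an arbitrary choice. Note that mpg in this example refers to a column of mtcars (miles per gallon), not to the mpg data frame from ggplot2.

```r
library(ggplot2)

# <DATA>          -> mtcars
# <GEOM_FUNCTION> -> geom_point
# <MAPPINGS>      -> x=wt, y=mpg (weight versus miles per gallon)
p_cars = ggplot(data=mtcars) +
  geom_point(mapping=aes(x=wt, y=mpg))
p_cars
```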
What does ggplot(data=mpg) produce?
What does mpg consist of? What are the column data types?
What does the drv column describe?
Why is a plot of class versus drv not very useful?
What is the problem with the plot of displ versus hwy that we produced previously? (Hint: the plot does not show all 234 individual data points contained in the mpg data frame.)
Aesthetics are visual properties of a plot. We have already used the
x and y aesthetics in our previous
mpg example when we mapped these aesthetics to variables
(columns) displ and hwy in
geom_point(mapping=aes(x=displ, y=hwy)). We have also noted
that we need to wrap aesthetic mappings with the aes()
function.
There are other aesthetics that we could use. For example, we could
map to the color aesthetic to visualize a third variable in
the data such as class, which contains the type (or class)
of each car:
ggplot(data=mpg) +
geom_point(mapping=aes(x=displ, y=hwy, color=class))
The scatterplot now conveys additional information on the types of
vehicles. Notice that ggplot2 automatically adds a legend. Instead of
color, we could also use the shape,
size or alpha aesthetics (try it out and see
what the result looks like). The documentation of the
geom_point() function lists all supported aesthetics. It is
also possible to use multiple aesthetics for the same variable.
If you ever want to manually set an aesthetic to a fixed value (not
depending on data values), such as plotting all points in blue, you need
to specify this outside of aes():
ggplot(data=mpg) +
geom_point(mapping=aes(x=displ, y=hwy), color="blue")
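A common mistake is to put the fixed value inside aes(). The following sketch contrasts both variants; p_fixed and p_mapped are illustrative names. In the second plot, ggplot2 treats "blue" as data (a single constant category), so it picks a color from its default scale and adds a legend instead of drawing blue points.

```r
library(ggplot2)

# Fixed value: color is set outside aes(), so all points are drawn in blue
p_fixed = ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ, y=hwy), color="blue")

# Mapped value: inside aes(), "blue" is interpreted as a data value,
# so the points get a default scale color and a legend appears
p_mapped = ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ, y=hwy, color="blue"))

p_fixed
p_mapped
```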
See https://r4ds.had.co.nz/data-visualisation.html#exercises-1.
Facets can be used to split a plot into multiple subplots, each of which shows a subset of the original data. There are two functions that create facets based on variables (columns) in the data:
facet_wrap() facets by one variable; use ~ <VARIABLE> as its first argument.
facet_grid() facets by two variables; use <VARIABLE_ROWS> ~ <VARIABLE_COLS> as its first argument.
It’s very easy to add faceting to an existing plot by adding one of these functions as if it were a new layer. Here are two examples:
ggplot(data=mpg) +
geom_point(mapping=aes(x=displ, y=hwy)) +
facet_wrap(~ class, nrow=2)
ggplot(data=mpg) +
geom_point(mapping=aes(x=displ, y=hwy)) +
facet_grid(drv ~ cyl)
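facet_grid() can also facet by a single variable if a dot is used as a placeholder for the other dimension. The following sketch (with the illustrative names p_cols and p_rows) shows both directions:

```r
library(ggplot2)

# Facet by columns only: one panel per number of cylinders, side by side
p_cols = ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ, y=hwy)) +
  facet_grid(. ~ cyl)

# Facet by rows only: one panel per drive train, stacked vertically
p_rows = ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ, y=hwy)) +
  facet_grid(drv ~ .)

p_cols
p_rows
```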
See https://r4ds.had.co.nz/data-visualisation.html#exercises-2.
Facets create subplots based on subsets of the data. If instead you
want to combine multiple independent plots, I recommend the patchwork package.
Using operators such as +, |, and
/, it is intuitive to create custom arrangements of
existing ggplot objects (which we assign to variables for
easier access), for example:
library(patchwork)
p1 = ggplot(data=mpg) +
geom_point(mapping=aes(x=hwy, y=cty))
p2 = ggplot(data=mtcars) +
geom_boxplot(mapping=aes(x=cyl, y=qsec, group=cyl))
p3 = ggplot(data=mpg) +
geom_point(mapping=aes(x=displ, y=hwy))
(p1 | p2) / p3
Consider the following two plots:
p1 = ggplot(data=mpg) +
geom_point(mapping=aes(x=displ, y=hwy))
p2 = ggplot(data=mpg) +
geom_smooth(mapping=aes(x=displ, y=hwy))
p1 | p2
These plots show the same data (displ versus
hwy), but they use different visual representations.
Whereas the left plot visualizes each data point, the right plot
represents the same data with a smoothing function. In ggplot2, we call
these different visual representations geoms (geometric
objects). The Data
Visualization cheatsheet lists some of the over 40 available geoms
in ggplot2.
Every geom function takes a mapping argument where you
specify which data columns should be mapped to which aesthetics. Note
that any given geom supports only aesthetics that make sense (check the
documentation to find out). For example, you can set the
shape of a point, but the shape of a line does
not make sense (so geom_line does not support it). However,
you could use the linetype aesthetic instead.
p1 = ggplot(data=mpg) +
geom_point(mapping=aes(x=displ, y=hwy, shape=drv))
p2 = ggplot(data=mpg) +
geom_smooth(mapping=aes(x=displ, y=hwy, linetype=drv))
p1 | p2
Importantly, we can use multiple geoms in a plot by adding each one as a new layer:
ggplot(data=mpg) +
geom_point(mapping=aes(x=displ, y=hwy)) +
geom_smooth(mapping=aes(x=displ, y=hwy))
Because both geoms use the same mapping argument, we
could also specify a global mapping inside the
ggplot() function call. All geoms will inherit this global
mapping, and the resulting figure is identical to the previous one.
ggplot(data=mpg, mapping=aes(x=displ, y=hwy)) +
geom_point() +
geom_smooth()
Global and local mappings can be combined. If you don’t specify a
mapping for a particular geom function, the function will use the global
mapping. However, if you do specify a local mapping by passing a
mapping argument to a specific geom function, it will
extend the global one for that layer only (an aesthetic that appears in
both mappings takes its local value):
ggplot(data=mpg, mapping=aes(x=displ, y=hwy)) +
geom_point(mapping=aes(color=class)) +
geom_smooth()
We can also apply this concept of global and local settings to the
data argument. This is useful if a particular layer is
based on a different data (sub)set.
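As a sketch of a layer-specific data argument, the following example draws all cars as points but fits the smooth line only to the subcompact cars. The subset is created with base R indexing, and the names subcompact and p_local are illustrative. The argument se=FALSE merely hides the confidence band.

```r
library(ggplot2)

# Keep only the rows that belong to the subcompact class
subcompact = mpg[mpg$class == "subcompact", ]

# Points use the global data (mpg), the smooth line uses the local subset
p_local = ggplot(data=mpg, mapping=aes(x=displ, y=hwy)) +
  geom_point(mapping=aes(color=class)) +
  geom_smooth(data=subcompact, se=FALSE)
p_local
```

The local data argument replaces the global data for that layer only, whereas the global mapping (x and y) is still inherited.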
See https://r4ds.had.co.nz/data-visualisation.html#exercises-3.
Consider the following bar chart created with the
geom_bar() function:
ggplot(data=mpg) +
geom_bar(mapping=aes(x=class))
Notice how we only specified a mapping for the x
aesthetic (the class column). However, the y-axis
shows counts, a variable that is not contained in the data.
Although many plots show the raw data (as we saw with scatterplots),
other kinds of plots transform the original data to show counts
of the binned data (like in the bar chart), summary statistics, or
predictions from a model fitted to the data. This works automatically,
because each geom function is associated with a default statistical
transformation. You can check the documentation of a geom to see its
default transformation listed as the default argument for the
stat parameter. For example, geom_point() uses
stat="identity" (no transformation), whereas
geom_bar() uses stat="count".
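To see what the "count" transformation actually computes, we can inspect the data that ggplot2 generates for a layer. The sketch below uses layer_data() for this purpose and compares the result with counts obtained from table() in base R; p_bar and computed are illustrative names.

```r
library(ggplot2)

p_bar = ggplot(data=mpg) +
  geom_bar(mapping=aes(x=class))

# layer_data() returns the data frame that ggplot2 computed for a layer;
# for geom_bar(), it contains one row per class with a count column
computed = layer_data(p_bar)
computed$count

# The same counts computed directly in base R
table(mpg$class)
```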
Sometimes, it is necessary to change the stat of a geom. For example, let’s assume that we had the count data directly available in a data frame as follows:
library(tibble)
mpg_count = tribble(
~class, ~count,
"2seater", 5,
"compact", 47,
"midsize", 41,
"minivan", 11,
"pickup", 33,
"subcompact", 35,
"suv", 62
)
If we want to visualize this as a bar chart, we need to pass
stat="identity" to avoid the default "count"
transformation:
ggplot(data=mpg_count) +
geom_bar(mapping=aes(x=class, y=count), stat="identity")
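As an alternative, ggplot2 provides geom_col(), which uses the y values as bar heights without counting anything. The following sketch repeats the data frame from above so that it can be run on its own; p_col is an illustrative name.

```r
library(ggplot2)
library(tibble)

# Same precomputed counts as in the example above
mpg_count = tribble(
  ~class, ~count,
  "2seater", 5,
  "compact", 47,
  "midsize", 41,
  "minivan", 11,
  "pickup", 33,
  "subcompact", 35,
  "suv", 62
)

# geom_col() draws bars whose heights are taken directly from the y column
p_col = ggplot(data=mpg_count) +
  geom_col(mapping=aes(x=class, y=count))
p_col
```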
Finally, since each geom function has a default stat function, and
each stat function has a default geom function, we can use these
functions interchangeably. For example, geom_bar() uses
stat_count() and stat_count() uses
geom_bar() by default. Therefore, we can also produce the
bar chart we have encountered previously with the following code:
ggplot(data=mpg) +
stat_count(mapping=aes(x=class))
See https://r4ds.had.co.nz/data-visualisation.html#exercises-4.
Geoms have a position argument, which controls how objects that
would otherwise occupy the same location (for example, bars belonging to
different groups) are arranged.
Consider the following bar chart:
ggplot(data=mpg) +
geom_bar(mapping=aes(x=class, fill=drv))
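This is a stacked bar chart, because it uses
position="stack" by default. If you don’t want a stacked
bar chart, you can use "identity" (bars are drawn at their actual
heights and overlap), "dodge" (bars are placed side by side), or
"fill" (bars are stacked and scaled to the same total height, showing
proportions):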
ggplot(data=mpg) +
geom_bar(mapping=aes(x=class, color=drv), position="identity", fill=NA)
ggplot(data=mpg) +
geom_bar(mapping=aes(x=class, fill=drv), position="dodge")
ggplot(data=mpg) +
geom_bar(mapping=aes(x=class, fill=drv), position="fill")
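The position argument is not limited to bar charts. As a sketch, the following scatterplot uses position="jitter", which adds a small amount of random noise to each point so that points with identical coordinates no longer hide each other (p_jitter is an illustrative name). The shortcut geom_jitter() should produce a similar result.

```r
library(ggplot2)

# Jittering spreads out points that would otherwise be plotted
# on top of each other
p_jitter = ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ, y=hwy), position="jitter")
p_jitter
```

Because the noise is random, the plot looks slightly different each time the code is run.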
See https://r4ds.had.co.nz/data-visualisation.html#exercises-5.
Most ggplot2 plots use the Cartesian coordinate system spanned by two
axes x and y. However, it is possible to change the
coordinate system with so-called coord functions. We won’t go
into any detail here, mostly because working with coordinate systems is
quite an advanced topic that is not relevant for many real-world plots.
However, there is one handy function that you should remember:
coord_flip() flips the x and y axes. You
can simply add it like an additional layer to an existing plot:
ggplot(data=mpg, mapping=aes(x=class, y=hwy)) +
geom_boxplot() +
coord_flip()
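In recent ggplot2 versions (3.3.0 and later, as far as we are aware), many geoms can infer their orientation from the mapping, so a similar horizontal plot can often be created without coord_flip(). The following sketch assumes such a version; p_horizontal is an illustrative name.

```r
library(ggplot2)

# Mapping the categorical variable to y yields horizontal boxplots directly
p_horizontal = ggplot(data=mpg, mapping=aes(x=hwy, y=class)) +
  geom_boxplot()
p_horizontal
```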
With this set of ggplot2 building blocks, you can create a basic
version of almost any kind of data visualization. In addition to the
“Export” button in the “Plots” pane, you can use ggsave()
to save a plot to a file.
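A minimal sketch of saving a plot with ggsave() could look as follows. The file name and the temporary output folder are hypothetical choices for this example; replace them with a path of your choice.

```r
library(ggplot2)

p_scatter = ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ, y=hwy))

# Hypothetical output location inside the temporary directory of the R session
out_file = file.path(tempdir(), "mpg_scatter.png")

# The file type is derived from the file extension;
# width and height are given in inches by default
ggsave(out_file, plot=p_scatter, width=6, height=4)
```

If the plot argument is omitted, ggsave() saves the most recently displayed plot.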
© 2022, Clemens Brunner