Visualization

Introduction

Data needs to be imported, tidied, and transformed before any visualization can be created. We will discuss these steps, which are necessary for almost all real-world datasets, in the following chapters. However, because we can use tidy example datasets that are already available in R, we will focus on creating visualizations first.

Specifically, we will use the ggplot2 package, which is part of the Tidyverse. It is based on a layered “grammar of graphics” described in this article. In this chapter, we will mention only the most basic commands that enable you to quickly create simple (but beautiful) visualizations.

The first step is to load the package:

library(ggplot2)

Note that you could also use library(tidyverse) to load all core Tidyverse packages, including ggplot2.

Basic usage

The ggplot2 package contains an example data frame called mpg. It is a good idea to read its documentation (?mpg) before moving on to the next step. Let’s take a look at the data:

mpg

# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class  
   <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>  
 1 audi         a4           1.8  1999     4 auto(l5)   f        18    29 p     compact
 2 audi         a4           1.8  1999     4 manual(m5) f        21    29 p     compact
 3 audi         a4           2    2008     4 manual(m6) f        20    31 p     compact
 4 audi         a4           2    2008     4 auto(av)   f        21    30 p     compact
 5 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p     compact
 6 audi         a4           2.8  1999     6 manual(m5) f        18    26 p     compact
 7 audi         a4           3.1  2008     6 auto(av)   f        18    27 p     compact
 8 audi         a4 quattro   1.8  1999     4 manual(m5) 4        18    26 p     compact
 9 audi         a4 quattro   1.8  1999     4 auto(l5)   4        16    25 p     compact
10 audi         a4 quattro   2    2008     4 manual(m6) 4        20    28 p     compact
# … with 224 more rows

Here’s the first question we want to answer from this dataset: do cars with big engines consume more or less fuel than cars with small engines? We can use the variables displ (engine size in liters) and hwy (fuel efficiency on the highway in miles per gallon) to try to address this question.

Let’s create our first plot using the following code (just copy and run this code for now):

ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy))

This scatterplot shows a clear negative relationship, meaning that larger engines tend to have lower fuel efficiency.

All ggplot2 plots start with calling the ggplot() function, to which we pass a data frame that contains the data to be plotted (using the data argument). This function call on its own just produces an empty plot, because we haven’t specified any details about our plot yet.

To add something interesting to the plot, we literally add another layer. In this example, the new layer should contain points, which we can create with the geom_point() function. The ggplot2 package contains many geom functions that enable us to create many different kinds of plots. They all have a mapping argument, which specifies how columns in the data frame should be represented (note that this mapping needs to be wrapped inside an aes() function). In our example, we map the displ column to the x axis and the hwy column to the y axis, resulting in a scatterplot.

We can use the code from this example as a template for ggplot2 plots. It looks as follows:

ggplot(data=<DATA>) +
    <GEOM_FUNCTION>(mapping=aes(<MAPPINGS>))

Note that this template is not valid R code! We need to replace all code in angle brackets with our actual data, geom function, and mappings.
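
As a minimal sketch of how the template can be filled in, the following replaces <DATA> with mpg, <GEOM_FUNCTION> with geom_boxplot, and <MAPPINGS> with x=drv, y=cty. The choice of geom and columns is only an illustration, not a recommendation for this dataset.

```r
library(ggplot2)

# <DATA> is mpg, <GEOM_FUNCTION> is geom_boxplot, <MAPPINGS> is x=drv, y=cty
p = ggplot(data=mpg) +
    geom_boxplot(mapping=aes(x=drv, y=cty))

# Evaluating the plot object draws it
p
```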

Exercises

What does ggplot(data=mpg) produce?
How many rows and columns does mpg consist of? What are the column data types?
What does the drv column describe?
Efficient cars should consume less fuel on both highways and in cities than inefficient cars. Create a scatterplot to investigate this hypothesis!
How does highway fuel efficiency correlate with the number of cylinders?
Why is the scatterplot of class versus drv not very useful?
Do you see any problem with the scatterplot of hwy versus displ that we produced previously? (Hint: the plot does not show all 234 individual data points contained in the mpg data frame.)

Aesthetic mappings

Aesthetics are visual properties of a plot. We have already used the x and y aesthetics in our previous mpg example when we mapped these aesthetics to variables (columns) displ and hwy in geom_point(mapping=aes(x=displ, y=hwy)). We have also noted that we need to wrap aesthetic mappings with the aes() function.

There are other aesthetics that we could use. For example, we could map to the color aesthetic to visualize a third variable in the data such as class, which contains the type (or class) of each car:

ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy, color=class))

The scatterplot now conveys additional information on the types of vehicles. Notice that ggplot2 automatically adds a legend. Instead of color, we could also use the shape, size or alpha aesthetics (try it out and see what the result looks like). The documentation of the geom_point() function lists all supported aesthetics. It is also possible to use multiple aesthetics for the same variable.
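
Mapping one variable to several aesthetics could look like the following sketch, which encodes drv as both color and shape. The choice of drv (three levels) is an assumption made here to keep the legend small.

```r
library(ggplot2)

# The same column (drv) drives both the color and the shape of each point
p = ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy, color=drv, shape=drv))

p
```

Because both aesthetics refer to the same variable, ggplot2 merges them into a single legend.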

If you ever want to manually set an aesthetic to a fixed value (not depending on data values), such as plotting all points in blue, you need to specify this outside of aes():

ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy), color="blue")

Exercises

See https://r4ds.had.co.nz/data-visualisation.html#exercises-1.

Facets

Facets can be used to split a plot into multiple subplots, each of which shows a subset of the original data. There are two functions that create facets based on variables (columns) in the data:

facet_wrap() facets by one variable; use ~ <VARIABLE> as its first argument.
facet_grid() facets by two variables; use <VARIABLE_ROWS> ~ <VARIABLE_COLS> as its first argument.

It’s very easy to add faceting to an existing plot by adding one of these functions as if it were a new layer. Here are two examples:

ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy)) +
    facet_wrap(~ class, nrow=2)

ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy)) +
    facet_grid(drv ~ cyl)
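
facet_grid() also accepts a dot as a placeholder on one side of the formula, which facets by a single variable in only the row or only the column dimension. A short sketch, using cyl as an arbitrary example variable:

```r
library(ggplot2)

# The dot means "no faceting variable for the rows",
# so all panels are arranged side by side in columns (one per cyl value)
p = ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy)) +
    facet_grid(. ~ cyl)

p
```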

Exercises

See https://r4ds.had.co.nz/data-visualisation.html#exercises-2.

Composing plots

Facets create subplots based on subsets of the data. If instead you want to combine multiple independent plots, I recommend the patchwork package. Using operators such as +, |, and /, it is intuitive to create custom arrangements of existing ggplot objects (which we assign to variables for easier access), for example:

library(patchwork)

p1 = ggplot(data=mpg) +
    geom_point(mapping=aes(x=hwy, y=cty))
p2 = ggplot(data=mtcars) +
    geom_boxplot(mapping=aes(x=cyl, y=qsec, group=cyl))
p3 = ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy))

(p1 | p2) / p3

Geometric objects

Consider the following two plots:

p1 = ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy))
p2 = ggplot(data=mpg) +
    geom_smooth(mapping=aes(x=displ, y=hwy))
p1 | p2

These plots show the same data (displ versus hwy), but they use different visual representations. Whereas the left plot visualizes each data point, the right plot represents the same data with a smoothing function. In ggplot2, we call these different visual representations geoms (geometric objects). The Data Visualization cheatsheet lists some of the over 40 available geoms in ggplot2.

Every geom function takes a mapping argument where you specify which data columns should be mapped to which aesthetics. Note that any given geom supports only aesthetics that make sense for it (check the documentation to find out). For example, you can set the shape of a point, but the shape of a line does not make sense (so geom_line() does not support it). However, you could use the linetype aesthetic instead.

p1 = ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy, shape=drv))
p2 = ggplot(data=mpg) +
    geom_smooth(mapping=aes(x=displ, y=hwy, linetype=drv))
p1 | p2

Importantly, we can use multiple geoms in a plot by adding each one as a new layer:

ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy)) +
    geom_smooth(mapping=aes(x=displ, y=hwy))

Because both geoms use the same mapping argument, we could also specify a global mapping inside the ggplot() function call. All geoms will inherit this global mapping, and the resulting figure is identical to the previous one.

ggplot(data=mpg, mapping=aes(x=displ, y=hwy)) +
    geom_point() +
    geom_smooth()

Global and local mappings can be combined; if you don’t specify a mapping for a particular geom function, the function will use the global mapping. However, if you do specify a local mapping by passing a mapping argument to a specific geom function, it will extend (not replace) the global one for that layer; only aesthetics that appear in both are overridden by the local value:

ggplot(data=mpg, mapping=aes(x=displ, y=hwy)) +
    geom_point(mapping=aes(color=class)) +
    geom_smooth()
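
When a local mapping names an aesthetic that the global mapping already defines, the local value takes precedence in that layer only. The following sketch (the column and color choices are illustrative assumptions) draws highway values with the global mapping and city values with a local y:

```r
library(ggplot2)

p = ggplot(data=mpg, mapping=aes(x=displ, y=hwy)) +
    # Uses the global mapping: x=displ, y=hwy
    geom_point() +
    # Local y overrides the global y; x=displ is still inherited
    geom_point(mapping=aes(y=cty), color="blue")

p
```

Note that the y axis label is still taken from the global mapping (hwy), so a plot like this would need a more general axis label in practice.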

We can also apply this concept of global and local settings to the data argument. This is useful if a particular layer is based on a different data (sub)set.
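
A layer-specific data argument might be used as in the following sketch, which highlights one subset on top of the full data. The variable name two_seaters and the highlight color and size are assumptions chosen for illustration; the subset is created with base R indexing.

```r
library(ggplot2)

# Subset of the data: only cars in the "2seater" class
two_seaters = mpg[mpg$class == "2seater", ]

p = ggplot(data=mpg, mapping=aes(x=displ, y=hwy)) +
    # Uses the global data (all cars)
    geom_point() +
    # Uses the local data (two-seaters only), drawn larger and in red
    geom_point(data=two_seaters, color="red", size=3)

p
```

The second layer still inherits the global mapping, so only the rows being plotted change.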

Exercises

See https://r4ds.had.co.nz/data-visualisation.html#exercises-3.

Statistical transformations

Consider the following bar chart created with the geom_bar() function:

ggplot(data=mpg) +
    geom_bar(mapping=aes(x=class))

Notice how we only specified a mapping for the x aesthetic (the class column). However, the y-axis shows counts, a variable that is not contained in the data.

Although many plots show the raw data (as we saw with scatterplots), other kinds of plots transform the original data to show counts of the binned data (like in the bar chart), summary statistics, or predictions from a model fitted to the data. This works automatically, because each geom function is associated with a default statistical transformation. You can check the documentation of a geom to see its default transformation listed as the default argument for the stat parameter. For example, geom_point() uses stat="identity" (no transformation), whereas geom_bar() uses stat="count".
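
To get a feeling for what the "count" transformation computes, we can tabulate the class column ourselves. This sketch uses base R's table() as a stand-in; it is not how ggplot2 implements the transformation internally, but the resulting numbers correspond to the bar heights.

```r
library(ggplot2)

# Number of rows per car class, i.e. the heights of the bars in the bar chart
class_counts = table(mpg$class)

class_counts
```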

Sometimes, it is necessary to change the stat of a geom. For example, let’s assume that we had the count data directly available in a data frame as follows:

library(tibble)

mpg_count = tribble(
    ~class,   ~count,
    "2seater",     5,
    "compact",    47,
    "midsize",    41,
    "minivan",    11,
    "pickup",     33,
    "subcompact", 35,
    "suv",        62
)

If we want to visualize this as a bar chart, we need to pass stat="identity" to avoid the default "count" transformation:

ggplot(data=mpg_count) +
    geom_bar(mapping=aes(x=class, y=count), stat="identity")
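
For pre-computed values like these, ggplot2 also provides geom_col(), which uses the "identity" transformation by default. The following sketch should produce the same chart; it rebuilds the counts with data.frame() only so that the example runs on its own.

```r
library(ggplot2)

# Same counts as above, recreated here to keep the example self-contained
mpg_count = data.frame(
    class=c("2seater", "compact", "midsize", "minivan", "pickup", "subcompact", "suv"),
    count=c(5, 47, 41, 11, 33, 35, 62)
)

# geom_col() plots the y values as they are, without counting rows
p = ggplot(data=mpg_count) +
    geom_col(mapping=aes(x=class, y=count))

p
```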

Finally, since each geom function has a default stat function, and each stat function has a default geom function, we can use these functions interchangeably. For example, geom_bar() uses stat_count() and stat_count() uses geom_bar() by default. Therefore, we can also produce the bar chart we have encountered previously with the following code:

ggplot(data=mpg) +
    stat_count(mapping=aes(x=class))

Exercises

See https://r4ds.had.co.nz/data-visualisation.html#exercises-4.

Position adjustments

Geoms have a position argument, which determines how objects that would otherwise occupy the same place (for example, bars of different groups at the same x value) are arranged. Consider the following bar chart:

ggplot(data=mpg) +
    geom_bar(mapping=aes(x=class, fill=drv))

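This is a stacked bar chart, because it uses position="stack" by default. If you don’t want a stacked bar chart, you can also use "identity" (bars are drawn exactly where they fall and therefore overlap), "dodge" (bars are placed side by side), or "fill" (bars are stacked and scaled to the same height, which makes it easier to compare proportions):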

ggplot(data=mpg) +
    geom_bar(mapping=aes(x=class, color=drv), position="identity", fill=NA)

ggplot(data=mpg) +
    geom_bar(mapping=aes(x=class, fill=drv), position="dodge")

ggplot(data=mpg) +
    geom_bar(mapping=aes(x=class, fill=drv), position="fill")
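
Position adjustments are not limited to bar charts. As a sketch that goes slightly beyond the examples above, position="jitter" adds a small amount of random noise to each point in a scatterplot, so that points with identical coordinates no longer hide each other:

```r
library(ggplot2)

# Each point is shifted by a small random amount,
# so the exact positions differ from run to run
p = ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy), position="jitter")

p
```

The plot becomes slightly less accurate at small scales, but the overall distribution of the data is easier to see.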

Exercises

See https://r4ds.had.co.nz/data-visualisation.html#exercises-5.

Coordinate systems

Most ggplot2 plots use the Cartesian coordinate system spanned by two axes x and y. However, it is possible to change the coordinate system with so-called coord functions. We won’t go into any detail here, mostly because working with coordinate systems is quite an advanced topic that is not relevant for many real-world plots. However, there is one handy function that you should remember: coord_flip() flips the x and y axes. You can simply add it like an additional layer to an existing plot:

ggplot(data=mpg, mapping=aes(x=class, y=hwy)) +
    geom_boxplot() +
    coord_flip()

Conclusion

With this set of ggplot2 building blocks, you can create a basic version of almost any kind of data visualization. In addition to the “Export” button in the “Plots” pane of RStudio, you can use ggsave() to save a plot to a file; by default, it saves the most recently displayed plot.
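
Saving a plot with ggsave() might look like the following sketch. The file name and the temporary directory are hypothetical placeholders, and the size is an arbitrary choice; the file format is derived from the file extension.

```r
library(ggplot2)

p = ggplot(data=mpg) +
    geom_point(mapping=aes(x=displ, y=hwy))

# Hypothetical output location: a PDF file in the session's temporary directory
out_file = file.path(tempdir(), "displ_hwy.pdf")

# width and height are interpreted in inches by default
ggsave(filename=out_file, plot=p, width=6, height=4)
```

Passing the plot explicitly through the plot argument avoids relying on whichever plot happened to be displayed last.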