Setup

knitr::opts_chunk$set(echo = TRUE)

#  Install the tidyverse package, if you didn't already.  
#  Only install it once, like this:

#  Go to bottom right pane, click Packages,
#  then click Install, then type tidyverse
#  in the blank in the middle, click Install.

#  If you have problems with the tidyverse, install and load ggplot2 instead.
#  (It is a package within the tidyverse package).

# Load your package when you want to use it.
# (p_load() comes from the pacman package; library(tidyverse) also works.)
pacman::p_load(tidyverse)

#  We'll use the data frame mpg, stored in 
#  the ggplot2 package.  Take a look at it:

tibble(mpg)
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows

ggplot function: Building a map

When creating graphs, ggplot() can be a powerful tool. But with great power comes great… complexity. So how does it work?

The ggplot() function itself doesn’t actually create a graph. It’s better to think of it as laying down the foundation that we will build our graph on top of. In fact, don’t think of ggplot as creating a graph, but as a way of building one piece by piece. What do we mean by that?

The built-in graphing functions in R create a complete graph in one call. They use a single function with multiple arguments to form the plot:

# Basic histogram using hist()
hist(x = mpg$cty,                                      # dataframe$variable 
     breaks = 15,                                      # Suggested number of bars (R may adjust it)
     col = "steelblue",                                # Color to fill in the bars
     main = "Histogram for City Miles Per Gallon",     # Title of the plot
     xlab = "City MPG")                                # xlab = label of x-axis
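One detail worth checking for yourself: when breaks is a single number, hist() treats it as a suggestion and may settle on a nearby "pretty" number of bars. A minimal sketch, assuming ggplot2 is installed so the mpg data frame is available, that counts the bars without drawing anything:

```r
library(ggplot2)  # only needed here for the mpg data frame

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(mpg$cty, breaks = 15, plot = FALSE)

length(h$counts)  # number of bars R actually used
h$breaks          # the bin edges R chose
```

If the number of bars differs from 15, that is hist() rounding to convenient bin edges, not an error.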

hist() and its arguments tell R everything it needs to create the histogram.

A benefit of using hist() is that it is relatively simple to use. The drawback is that adding more info to the plot can be complicated (it’s also not the most visually appealing graph).

GGPlot Step 1: Mapping the data to aesthetics

GGPlot works differently. It’s more of a step-by-step process in building the graph.

In step 1 we need to invoke the actual ggplot() function itself. What do we do with ggplot()?

In the ggplot() function, we need to:

  • Specify the data set the variables are stored in using data =
  • Map the data to the aesthetics using mapping = aes(…)

For instance, if we want a histogram of city mpg from the mpg data set:

ggplot(data = mpg,
       mapping = aes(x = cty))
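To see that this call only lays the foundation, we can store the result and inspect it. This is a small sketch rather than a required step: the name p_base is just illustrative, and it assumes the plot object exposes its layers as p_base$layers, which may vary between ggplot2 versions.

```r
library(ggplot2)

# Store the "foundation" instead of printing it
p_base <- ggplot(data = mpg,
                 mapping = aes(x = cty))

inherits(p_base, "ggplot")  # it is a plot object, not a finished graph
length(p_base$layers)       # how many geoms have been added so far
```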

GGPlot Step 2: Add Shapes to the Plot using geom_

Calling ggplot() builds the “foundation” of the plot, but it doesn’t add any “data ink”: no histogram has been drawn yet. Notice that the only ink on the plot is the x-axis labels for city miles per gallon, plus the grey background and gridlines. Everything else is blank! So what do we do?

Once we create the foundation of our graph, then we add the shapes (data ink). And ggplot() takes “add” literally!

What do we add to ggplot()? If we want to add shapes, we use a family of functions that all start with geom_. To create a histogram, we use geom_histogram(). We just need to tell the function how many bars (bins) it should create using bins =.

ggplot(data = mpg,
       mapping = aes(x = cty)) + 
  
  geom_histogram(bins = 15)
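If you want to see the numbers behind the bars, ggplot2's layer_data() function returns the data a layer computed. A hedged sketch (p_hist and bins are illustrative names, not part of the original example):

```r
library(ggplot2)

p_hist <- ggplot(data = mpg,
                 mapping = aes(x = cty)) +
  geom_histogram(bins = 15)

# Data computed by the first (and only) layer: one row per bar
bins <- layer_data(p_hist)

bins[, c("xmin", "xmax", "count")]  # edges and height of each bar
nrow(bins)                          # number of bars drawn
sum(bins$count)                     # total number of cars counted
```

Comparing sum(bins$count) with nrow(mpg) is a quick check that no rows were dropped.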

Notice that we literally added geom_histogram() to ggplot(). This is one of the biggest advantages of using GGPlot to create graphs: it adds flexibility when forming and customizing the visuals!

Unfortunately, it still doesn’t look great. So how do we improve it?

Setting Aesthetics in GGPlot

We can improve the histogram by changing the default color choices of the bars and lines that ggplot() uses. The argument to set or map a color in GGPlot is color = (for tea drinkers, colour = … also works).

ggplot(data = mpg,
       mapping = aes(x = cty)) + 
  
  geom_histogram(bins = 15,
                 color = "steelblue")

Wait, that doesn’t look right! So what went wrong?

GGPlot has 2 different arguments to specify color:

  • color = … changes the color of lines and dots (the lines that form the bars in our histogram)
  • fill = … shades in the area of shapes

When working with histograms in GGPlot, it’s advisable to have color = and fill = use different colors, so the outline of each bar stays visible:

ggplot(data = mpg,
       mapping = aes(x = cty)) + 
  
  geom_histogram(bins = 15,
                 fill = "steelblue",
                 color = "black")

If you’re curious about why color = and fill = should be different, try removing color = "black" from the plot above!

Mapping vs setting an aesthetic

What is the difference between “mapping” an aesthetic vs setting an aesthetic with GGPlot?

  • Mapping: Used when an aesthetic represents a variable in the data.
    • Specified inside aes(), such as mapping = aes(x = cty, fill = factor(year)) (factor() turns the numeric year into a grouping variable)
    • The variable name should NOT be in quotes!
    • Can be mapped in either ggplot() or in the geom function

# Mapping fill in ggplot()
ggplot(data = mpg,
       mapping = aes(x = cty,
                     fill = factor(year))) + 
  
  geom_histogram(bins = 15,
                 color = "black")

# Mapping fill in geom_histogram()
ggplot(data = mpg,
       mapping = aes(x = cty)) + 
  
  geom_histogram(mapping = aes(fill = factor(year)),
                 bins = 15,
                 color = "black")

  • Setting: Used when the aesthetic should be constant throughout the graph.
    • The aesthetic should be assigned outside of aes()
    • The value of the aesthetic should be in quotes (shape = "square") unless it is a number, as with alpha = or size =
    • Should be set in the geom it is affecting and NOT in ggplot()

# What would happen if we included fill = "steelblue" in aes()?
ggplot(data = mpg,
       mapping = aes(x = cty,
                     fill = "steelblue")) + 
  
  geom_histogram(bins = 15,
                 color = "black")

Oh no!

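The mistake happened because fill = "steelblue" was inside aes(). Anything inside aes() is treated as data, so ggplot() read "steelblue" as a variable with a single value, filled the bars with its own default color, and added a legend. If we place it outside of aes() and inside geom_histogram(), it works as intended: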

ggplot(data = mpg,
       mapping = aes(x = cty)) + 
  
  geom_histogram(bins = 15,
                 color = "black",
                 fill = "steelblue")
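One way to confirm the difference is to look at the fill values each plot actually uses. The sketch below builds both versions and inspects them with layer_data(); the names p_mapped and p_set are made up for illustration.

```r
library(ggplot2)

# fill inside aes(): "steelblue" is treated as a data value
p_mapped <- ggplot(data = mpg,
                   mapping = aes(x = cty,
                                 fill = "steelblue")) +
  geom_histogram(bins = 15,
                 color = "black")

# fill outside aes(): "steelblue" is used as the actual color
p_set <- ggplot(data = mpg,
                mapping = aes(x = cty)) +
  geom_histogram(bins = 15,
                 color = "black",
                 fill = "steelblue")

unique(layer_data(p_mapped)$fill)  # a color chosen by the default fill scale
unique(layer_data(p_set)$fill)     # the color we asked for
```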

And if we set fill = "steelblue" in ggplot() instead of geom_histogram(), it will be ignored:

ggplot(data = mpg,
       mapping = aes(x = cty),
       fill = "steelblue") + 
  
  geom_histogram(bins = 15,
                 color = "black")

Saving GG Objects

We can also save ggplot objects, just like any other object we’ve seen so far:

gg_hist_cty <- 
  
  ggplot(data = mpg,
         mapping = aes(x = cty)) + 
  
  geom_histogram(bins = 15,
                 color = "black",
                 fill = "steelblue")

gg_hist_cty
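A saved plot can be built on later or written to a file. The following is a sketch under a few assumptions: gg_hist_titled and out_file are illustrative names, and the file is written to a temporary location with an arbitrary size.

```r
library(ggplot2)

gg_hist_cty <- ggplot(data = mpg,
                      mapping = aes(x = cty)) +
  geom_histogram(bins = 15,
                 color = "black",
                 fill = "steelblue")

# Adding to a saved plot returns a new plot; gg_hist_cty itself is unchanged
gg_hist_titled <- gg_hist_cty +
  labs(x = "City MPG")

# Write the plot to a temporary PNG file (width and height are in inches)
out_file <- tempfile(fileext = ".png")
ggsave(filename = out_file,
       plot = gg_hist_titled,
       width = 6,
       height = 4)
```

In practice you would replace out_file with a path of your own, such as a file in your project folder.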

GGPlot Step 3: Adding and Adjusting Labels

The labels for aesthetics (x-axis, y-axis, legend titles, etc…) will be the name of the variable they’re mapped to, which often isn’t the best choice for a graph.

  • What does “cty” mean?
  • “factor(year)” isn’t as great as just “Year” for the fill legend

Also, a title can add a lot of context to the graph with just a little more data ink. So how do we make those additions and alterations? By adding labs() to our ggplot object!

Let’s have the x-axis named “City MPG”, the fill legend named “Year”, and the plot titled “Histogram of City Miles Per Gallon”. To change the label of an aesthetic, you use the same name as the aesthetic itself!

ggplot(data = mpg,
       mapping = aes(x = cty,
                     fill = factor(year))) + 
  
  geom_histogram(bins = 15,
                 color = "black") + 
  
  labs(title = "Histogram of City Miles Per Gallon",
       x = "City MPG",
       fill = "Year",
       caption = "Data: mpg")

GGPlot Step 4: Choosing the Theme

Graphs made with ggplot() have plenty of choices for non-data ink:

  • Should we have a box around the data area (called the panel)?
  • What color should the graph background be?
  • What color should the gridlines be?

And there are a lot of other choices we could alter. We could change any of these individually by adding a theme() function to our ggplot object, but there are some “preset” themes we can add that change many at once.
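As a rough illustration of changing these choices one at a time, the sketch below uses theme() to address the three questions above. The specific colors are arbitrary picks, and p_theme is an illustrative name.

```r
library(ggplot2)

p_theme <- ggplot(data = mpg,
                  mapping = aes(x = cty)) +
  geom_histogram(bins = 15,
                 color = "black",
                 fill = "steelblue") +
  theme(panel.border     = element_rect(color = "black", fill = NA),  # box around the panel
        panel.background = element_rect(fill = "white"),              # background color
        panel.grid.major = element_line(color = "grey85"))            # major gridline color

p_theme
```

Note the fill = NA in panel.border: the border is drawn on top of the panel, so a filled rectangle would hide the bars.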

The default theme is theme_grey(), hence the grey background. Some others are:

  • theme_bw()
  • theme_test()
  • theme_minimal()
  • theme_classic()
  • theme_dark()
  • theme_void()

While the theme choice is often overlooked, once the rest of the graph has been finalized, trying a few different themes is highly recommended!

Try adding some different themes below and compare the differences!

ggplot(data = mpg,
       mapping = aes(x = cty)) + 
  
  geom_histogram(bins = 15,
                 color = "black",
                 fill = "steelblue") + 
  
  labs(title = "Histogram of City Miles Per Gallon",
       x = "City MPG") +
  
  theme_bw()
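To compare several themes without retyping the whole plot, one option is to save the plot once and add each theme to it. This is only a sketch; gg_base, themes, and themed_plots are illustrative names.

```r
library(ggplot2)

# The plot without any theme added
gg_base <- ggplot(data = mpg,
                  mapping = aes(x = cty)) +
  geom_histogram(bins = 15,
                 color = "black",
                 fill = "steelblue") +
  labs(title = "Histogram of City Miles Per Gallon",
       x = "City MPG")

# A named list of preset themes to try
themes <- list(bw      = theme_bw(),
               minimal = theme_minimal(),
               classic = theme_classic(),
               dark    = theme_dark())

# One version of the plot per theme
themed_plots <- lapply(themes, function(th) gg_base + th)

themed_plots$minimal  # print them one at a time to compare
```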

Additional: Using Facets to Create Small Multiples

Previously we used color to represent the year the car was made. For some graphs, using color to represent different groups works well. Unfortunately, a single histogram with multiple groups can be hard to read, since the bars for each group are stacked on top of one another.

So what do we do instead? Why don’t we make several histograms and place them side by side in the same plot! We can achieve this using facets. There are two main ways to specify facets with GGPlot:

  • facet_wrap(): Useful if there is only 1 variable that will create the small multiples
  • facet_grid(): Useful if there are 2 variables that will create the small multiples

Facet_wrap()

First, let’s use year to create 2 histograms stacked on top of one another with facet_wrap(). The common arguments for facet_wrap() are:

  • facets = The variable that will be used to create the small multiples. Specified by using a formula: ~ variable_name

  • nrow = and/or ncol =: how many rows or columns of graphs there should be in the plot

  • scales = : Should the x- and y-axes be the same or different for all the graphs?

    • scales = "fixed" will have the same axes for all the graphs (DEFAULT)
    • scales = "free_x" will have the x-axis be different for each graph but the y-axis will be the same
    • scales = "free_y" will have the same x-axis but different y-axis for each graph
    • scales = "free" will have different set of axes (both x and y) for all the graphs

If the purpose of the plot is to compare groups, you want the scales to be fixed.

ggplot(data = mpg,
       mapping = aes(x = cty)) + 
  
  geom_histogram(bins = 15,
                 color = "black",
                 fill = "steelblue") + 
  
  labs(title = "Histogram of City Miles Per Gallon by Year",
       x = "City MPG") +
  
  theme_bw() + 
  
  facet_wrap(facets = ~ year,
             ncol = 1,
             scales = "fixed")
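To check what ended up in each panel, layer_data() includes a PANEL column identifying the facet each bar belongs to. A small sketch, with p_wrap and panel_bins as illustrative names:

```r
library(ggplot2)

p_wrap <- ggplot(data = mpg,
                 mapping = aes(x = cty)) +
  geom_histogram(bins = 15,
                 color = "black",
                 fill = "steelblue") +
  facet_wrap(facets = ~ year,
             ncol = 1,
             scales = "fixed")

panel_bins <- layer_data(p_wrap)

table(panel_bins$PANEL)                          # bars per panel
tapply(panel_bins$count, panel_bins$PANEL, sum)  # cars per panel
```

If the groups have very different sizes, fixed y-axes will make the smaller group's bars look short; that is a real feature of the data rather than a plotting problem.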

Facet_grid()

facet_grid() is very similar to facet_wrap(), but the formula has two variables, one on each side of the ~. The left variable will determine the rows of the grid and the right variable will determine the columns. So we don’t need to specify nrow or ncol anymore!

Let’s look at city mpg by cylinder (cyl) and drive train (drv):

ggplot(data = mpg,
       mapping = aes(x = cty)) + 
  
  geom_histogram(bins = 15,
                 color = "black",
                 fill = "steelblue") + 
  
  labs(title = "Histogram of City Miles Per Gallon",
       subtitle = "By Cylinders and Drive Train",
       x = "City MPG") +
  
  theme_bw() + 
  
  facet_grid(rows = cyl ~ drv,
             scales = "fixed")
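Some panels in the grid may come out empty, because not every combination of cylinders and drive train appears in the data. A quick way to check before plotting is to cross-tabulate the two variables (combo_counts is an illustrative name):

```r
library(ggplot2)  # only needed here for the mpg data frame

# Rows are cyl (the grid's rows), columns are drv (the grid's columns)
combo_counts <- with(mpg, table(cyl, drv))

combo_counts            # number of cars in each panel
sum(combo_counts == 0)  # number of panels that will be empty
```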