knitr::opts_chunk$set(echo = TRUE)
# Install the tidyverse package, if you didn't already.
# Only install it once, like this:
# Go to bottom right pane, click Packages,
# then click Install, then type tidyverse
# in the blank in the middle, click Install.
# If you have problems with the tidyverse, install and load ggplot2 instead.
# (It is a package within the tidyverse package).
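# Load your package when you want to use it.
# p_load() comes from the pacman package (which must be installed too)
# and installs a missing package before loading it;
# library(tidyverse) also works: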
pacman::p_load(tidyverse)
# We'll use the data frame mpg, stored in
# the ggplot2 package. Take a look at it:
tibble(mpg)
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
When creating graphs, ggplot() can be a powerful tool. But with great power comes great… complexity. So how does it work?
The ggplot() function itself doesn’t actually create a graph. It’s better to think of it as laying down the foundation that we will build our graph on top of. In fact, don’t think of ggplot as creating a graph, but as a way of building one piece by piece. What do we mean by that?
The built-in graphing functions in R create graphs. They use a single function with multiple arguments to form the plot:
# Basic histogram using hist()
hist(x = mpg$cty, # dataframe$variable
breaks = 15, # Suggested number of bars (hist() may adjust it)
col = "steelblue", # Color to fill in the bars
main = "Histogram for City Miles Per Gallon", # Title of the plot
xlab = "City MPG") # xlab = label of x-axis
hist() and its arguments tell R everything it needs to create the histogram.
A benefit of using hist() is that it is relatively simple to use. The drawback is that adding more info to the plot can be complicated (it’s also not the most visually appealing graph).
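As a small aside on what hist() does with those arguments, the sketch below (an illustration, not part of the original example) calls hist() with plot = FALSE so that it returns its computed bins instead of drawing them. It assumes ggplot2 is installed so that mpg is available. When breaks is a single number, base R treats it as a suggestion and may choose a different number of bars.

```r
library(ggplot2)  # only needed here for the mpg data frame

# plot = FALSE returns the histogram object without drawing it
h <- hist(x = mpg$cty, breaks = 15, plot = FALSE)

h$breaks          # the bin edges hist() actually chose
length(h$counts)  # the number of bars, which may differ from 15
sum(h$counts)     # every car falls in exactly one bin
```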
GGPlot works differently. It’s more of a step-by-step process in building the graph.
In step 1 we need to invoke the actual ggplot() function itself. What do we do with ggplot()?
In the ggplot() function, we need to specify the data frame to plot with data = and map variables to aesthetics with mapping = aes().
For instance, if we want a histogram of city mpg from the mpg data set:
ggplot(data = mpg,
mapping = aes(x = cty))
Using ggplot() builds the “foundation” of the plot, but it doesn’t actually add any “data ink” (no histogram was created). Notice the only ink on the plot is the labels for city miles per gallon on the x-axis. Everything else is blank! So what do we do?
Once we create the foundation of our graph, then we add the shapes (data ink). And ggplot() takes “add” literally!
What do we add to ggplot()? If we want to add shapes, we use a family of functions that all start with geom_. To create a histogram, we use geom_histogram(). The function just needs us to specify how many bars, or bins, it should create using the bins = argument:
ggplot(data = mpg,
mapping = aes(x = cty)) +
geom_histogram(bins = 15)
Notice that we literally added geom_histogram() to ggplot() with +. This is one of the biggest advantages of using GGPlot to create graphs: it adds flexibility when forming and customizing the visuals!
Unfortunately, it still doesn’t look great. So how do we improve it?
We can improve the histogram by changing the default color choices of the bars and lines that ggplot() uses. The argument to set or map a color in GGPlot is color = (for tea drinkers, colour = … also works):
ggplot(data = mpg,
mapping = aes(x = cty)) +
geom_histogram(bins = 15,
color = "steelblue")
Wait, that doesn’t look right! So what went wrong?
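GGPlot has 2 different arguments to specify color: color = sets the outline of a shape (here, the border of each bar), while fill = sets its interior.
When working with histograms in GGPlot, it’s advisable to have color = and fill = use different colors: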
ggplot(data = mpg,
mapping = aes(x = cty)) +
geom_histogram(bins = 15,
fill = "steelblue",
color = "black")
If you’re curious about why color = and fill = should be different, try removing color = "black" from the plot above!
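What is the difference between “mapping” an aesthetic vs setting an aesthetic with GGPlot? Mapping ties an aesthetic to a variable inside aes(), so it varies with the data and gets a legend. Setting gives the aesthetic one fixed value outside aes().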
# Mapping fill in ggplot()
ggplot(data = mpg,
mapping = aes(x = cty,
fill = factor(year))) +
geom_histogram(bins = 15,
color = "black")
# Mapping fill in geom_histogram()
ggplot(data = mpg,
mapping = aes(x = cty)) +
geom_histogram(mapping = aes(fill = factor(year)),
bins = 15,
color = "black")
# What would happen if we included fill = "steelblue" in aes()?
ggplot(data = mpg,
mapping = aes(x = cty,
fill = "steelblue")) +
geom_histogram(bins = 15,
color = "black")
Oh no!
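The mistake happened because fill = "steelblue" was in aes(), so GGPlot treated "steelblue" as a data value rather than a color: it picked its own default color and added a legend. If we place it outside of aes() and in geom_histogram(), it works as intended: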
ggplot(data = mpg,
mapping = aes(x = cty)) +
geom_histogram(bins = 15,
color = "black",
fill = "steelblue")
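To see why the two versions differ, one option is to inspect the data that ggplot2 computes for the layer. The sketch below uses layer_data() from ggplot2; the object names p_mapped and p_set are made up for this illustration.

```r
library(ggplot2)

# fill inside aes(): "steelblue" is treated as a data value (one group)
p_mapped <- ggplot(data = mpg,
                   mapping = aes(x = cty, fill = "steelblue")) +
  geom_histogram(bins = 15, color = "black")

# fill outside aes(): "steelblue" is used as the actual color
p_set <- ggplot(data = mpg,
                mapping = aes(x = cty)) +
  geom_histogram(bins = 15, color = "black", fill = "steelblue")

# The fill each bar is drawn with
unique(layer_data(p_mapped)$fill)  # a color picked by the default fill scale
unique(layer_data(p_set)$fill)     # the color we asked for
```

Only the second plot should report steelblue as the fill of its bars.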
And if we set fill = “steelblue” in ggplot() instead of geom_histogram(), it will be ignored:
ggplot(data = mpg,
mapping = aes(x = cty),
fill = "steelblue") +
geom_histogram(bins = 15,
color = "black")
You can also save ggplot objects just like any other object we’ve seen so far:
gg_hist_cty <-
ggplot(data = mpg,
mapping = aes(x = cty)) +
geom_histogram(bins = 15,
color = "black",
fill = "steelblue")
gg_hist_cty
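Because a plot is just an object, we can also save the foundation on its own and add shapes to it later. The sketch below is one way to check this; gg_base and gg_hist are names invented for this example, and looking at the layers element of a plot is used here only as a quick illustration.

```r
library(ggplot2)

# Save only the foundation: data and mapping, no shapes yet
gg_base <- ggplot(data = mpg,
                  mapping = aes(x = cty))
length(gg_base$layers)   # no layers have been added yet

# Adding a geom returns a new plot; gg_base itself is unchanged
gg_hist <- gg_base +
  geom_histogram(bins = 15, color = "black", fill = "steelblue")
length(gg_hist$layers)   # one layer: the histogram

gg_hist   # typing the name draws the plot
```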
The labels for aesthetics (x-axis, y-axis, legend titles, etc.) will be the name of the variable they’re mapped to, which often isn’t the best choice for a graph.
Also, a title can add a lot of context to the graph with just a little more data ink. So how do we make those additions and alterations? By adding labs() to our ggplot object!
Let’s name the x-axis “City MPG”, label the fill legend “Year”, and title the plot “Histogram of City Miles Per Gallon”. To change the label of an aesthetic, use the name of the aesthetic itself as the argument!
ggplot(data = mpg,
mapping = aes(x = cty,
fill = factor(year))) +
geom_histogram(bins = 15,
color = "black") +
labs(title = "Histogram of City Miles Per Gallon",
x = "City MPG",
fill = "Year",
caption = "Data: mpg")
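The same idea extends to labels that were not changed above. As a hedged sketch, the version below also relabels the y-axis (which otherwise shows the default "count") and adds a subtitle; the label text and the name gg_labeled are just examples.

```r
library(ggplot2)

gg_labeled <- ggplot(data = mpg,
                     mapping = aes(x = cty,
                                   fill = factor(year))) +
  geom_histogram(bins = 15,
                 color = "black") +
  labs(title = "Histogram of City Miles Per Gallon",
       subtitle = "Model years 1999 and 2008",
       x = "City MPG",
       y = "Number of cars",   # replaces the default "count"
       fill = "Year")

gg_labeled
```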
Graphs made with ggplot() have plenty of choices for non-data ink:
And there are a lot of other choices we could alter. We could change any of these individually by adding a theme() function to our ggplot object, but there are some “preset” themes we can add that change many at once.
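The default theme is theme_grey(), hence the grey background. Some others built into ggplot2 are theme_bw(), theme_minimal(), theme_classic(), theme_light(), theme_dark(), and theme_void().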
While the theme choice is often overlooked, once the rest of the graph has been finalized, trying a few different themes is highly recommended!
Try adding some different themes below and compare the differences!
ggplot(data = mpg,
mapping = aes(x = cty)) +
geom_histogram(bins = 15,
color = "black",
fill = "steelblue") +
labs(title = "Histogram of City Miles Per Gallon",
x = "City MPG") +
theme_bw()
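A preset theme and individual theme() changes can be combined. The sketch below starts from theme_bw() and then adjusts two elements by hand; the particular tweaks (a bold, centered title and no minor grid lines) are arbitrary choices for illustration, and gg_themed is a made-up name.

```r
library(ggplot2)

gg_themed <- ggplot(data = mpg,
                    mapping = aes(x = cty)) +
  geom_histogram(bins = 15,
                 color = "black",
                 fill = "steelblue") +
  labs(title = "Histogram of City Miles Per Gallon",
       x = "City MPG") +
  theme_bw() +                                   # preset theme first
  theme(plot.title = element_text(face = "bold", # then individual tweaks
                                  hjust = 0.5),
        panel.grid.minor = element_blank())

gg_themed
```

The order matters: a preset theme replaces the whole theme, so adding theme_bw() after theme() would undo the individual changes.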
Previously we used color to represent the year the car was made. For some graphs, using color to represent different groups works well. Unfortunately, one histogram for multiple groups can be hard to read, because the bars for the groups are stacked on top of each other.
So what do we do instead? Why don’t we make several histograms and place them in the same plot! We can achieve this using facets. There are two main choices for specifying facets with GGPlot: facet_wrap() and facet_grid().
First, let’s use year to create 2 histograms stacked on top of one another with facet_wrap(). The common arguments for facet_wrap() are:
facets =: the variable that will be used to create the small multiples, specified with a formula: ~ variable_name
nrow = and/or ncol =: how many rows or columns of graphs there should be in the plot
scales =: should the x- and y-axes be the same ("fixed", the default) or different ("free", "free_x", or "free_y") for all the graphs?
If the purpose of the plot is to compare groups, you want the scales to be fixed.
ggplot(data = mpg,
mapping = aes(x = cty)) +
geom_histogram(bins = 15,
color = "black",
fill = "steelblue") +
labs(title = "Histogram of City Miles Per Gallon by Year",
x = "City MPG") +
theme_bw() +
facet_wrap(facets = ~ year,
ncol = 1,
scales = "fixed")
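For comparison, here is a sketch of the same plot with scales = "free_y", which lets each panel choose its own y-axis range. It is shown only to make the effect of scales = visible; gg_free is a name made up for this example.

```r
library(ggplot2)

gg_free <- ggplot(data = mpg,
                  mapping = aes(x = cty)) +
  geom_histogram(bins = 15,
                 color = "black",
                 fill = "steelblue") +
  labs(title = "Histogram of City Miles Per Gallon by Year",
       x = "City MPG") +
  theme_bw() +
  facet_wrap(facets = ~ year,
             ncol = 1,
             scales = "free_y")   # each panel gets its own y-axis

gg_free
```

With free scales, bar heights can no longer be compared directly across panels, which is why fixed scales are the better choice when comparing groups.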
facet_grid() is very similar to facet_wrap(), but its formula has two variables, one on each side of the ~. The left variable will determine the rows of the grid and the right variable will determine the columns. So we don’t need to specify nrow or ncol anymore!
Let’s look at city mpg by cylinder (cyl) and drive train (drv):
ggplot(data = mpg,
mapping = aes(x = cty)) +
geom_histogram(bins = 15,
color = "black",
fill = "steelblue") +
labs(title = "Histogram of City Miles Per Gallon",
subtitle = "By Cylinders and Drive Train",
x = "City MPG") +
theme_bw() +
facet_grid(rows = cyl ~ drv, # formula: rows ~ columns (facets = is deprecated here)
scales = "fixed")
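facet_grid() also accepts the row and column variables separately through vars(), which some readers find easier to read than a formula. The sketch below shows that form, together with labeller = label_both so each strip shows the variable name as well as its value. Both are optional alternatives rather than the approach used above, and gg_grid is a made-up name.

```r
library(ggplot2)

gg_grid <- ggplot(data = mpg,
                  mapping = aes(x = cty)) +
  geom_histogram(bins = 15,
                 color = "black",
                 fill = "steelblue") +
  labs(title = "Histogram of City Miles Per Gallon",
       subtitle = "By Cylinders and Drive Train",
       x = "City MPG") +
  theme_bw() +
  facet_grid(rows = vars(cyl),        # one row of panels per cylinder count
             cols = vars(drv),        # one column of panels per drive train
             labeller = label_both)   # strips read "cyl: 4", "drv: f", ...

gg_grid
```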