ggplot2: a grammar of graphics
Let’s first load the packages that we need for this chapter. You can click on the green arrow to execute the code chunk below.
library("knitr") # for rendering the RMarkdown file
library("tidyverse") # for plotting (and many more cool things we'll discover later)
# these options here change the formatting of how comments are rendered
# in RMarkdown
opts_chunk$set(comment = "",
fig.show = "hold")
The tidyverse is a collection of packages that includes ggplot2.
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
stat_summary(fun = "mean", geom = "bar", color = "black", fill = "lightblue", width = 0.85) +
stat_summary(fun.data = "mean_cl_boot", geom = "linerange", linewidth = 1.5) +
labs(title = "Price as a function of quality of cut", subtitle = "Note: The price is in US dollars", tag = "A", x = "Quality of the cut", y = "Price")
You may want to write it this way instead:
ggplot(data = diamonds,
mapping = aes(x = cut,
y = price)) +
# display the means
stat_summary(fun = "mean",
geom = "bar",
color = "black",
fill = "lightblue",
width = 0.85) +
# display the error bars
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange",
linewidth = 1.5) +
# change labels
labs(title = "Price as a function of quality of cut",
subtitle = "Note: The price is in US dollars", # we might want to change this later
tag = "A",
x = "Quality of the cut",
y = "Price")
This makes it much easier to see what’s going on, and you can easily add comments to individual lines of code.
I will use the ggplot2 package to visualize data.
Figure 1.1: What a nice figure!
Now let’s figure out (pun intended!) how to get there.
Let’s first get some data.
df.diamonds = diamonds
The diamonds dataset comes with the ggplot2 package. We can get a description of the dataset by running the following command:
?diamonds
Above, we assigned the diamonds dataset to the variable df.diamonds so that we can see it in the data explorer.
Let’s take a look at the full dataset by clicking on it in the explorer.
The df.diamonds data frame contains information about almost 60,000 diamonds, including their price, carat value, size, etc. Let’s use visualization to get a better sense for this dataset.
We start by setting up the plot. To do so, we pass a data frame to the function ggplot() in the following way.
ggplot(data = df.diamonds)
This, by itself, won’t do anything yet. We also need to specify what to plot.
Let’s take a look at how much diamonds of different color cost. The help file says that diamonds labeled D have the best color, and diamonds labeled J the worst color. Let’s make a bar plot that shows the average price of diamonds for different colors.
We do so via specifying a mapping from the data to the plot aesthetics with the function aes(). We need to tell aes() what we would like to display on the x-axis, and the y-axis of the plot.
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price))
Here, we specified that we want to plot color on the x-axis, and price on the y-axis. As you can see, ggplot2 has already figured out how to label the axes. However, we still need to specify how to plot it.
Let’s make a bar graph:
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price)) +
stat_summary(fun = "mean",
geom = "bar")
Neat! Three lines of code produce an almost-publication-ready plot (to be published in the Proceedings of Unnecessary Diamonds)! Note how we used a + at the end of the first line of code to specify that there will be more. This is a very powerful idea underlying ggplot2. We can start simple and keep adding things to the plot step by step.
We used the stat_summary() function to define what we want to plot (the “mean”), and how (as a “bar” chart). Let’s take a closer look at that function.
help(stat_summary)
Not the easiest help file … We supplied two arguments to the function, fun = and geom =.
The fun argument specifies what function we’d like to apply to the data for each value of x. Here, we said that we would like to take the mean and we specified that as a string.
The geom (= geometric object) argument specifies how we would like to plot the result, namely as a “bar” plot.
Instead of showing the “mean”, we could also show the “median”.
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price)) +
stat_summary(fun = "median",
geom = "bar")
And instead of making a bar plot, we could plot some points.
ggplot(df.diamonds,
aes(x = color,
y = price)) +
stat_summary(fun = "mean",
geom = "point")
Tip: Take a look here to see what other geoms ggplot2 supports.
Somewhat surprisingly, diamonds with the best color (D) are not the most expensive ones. What’s going on here? We’ll need to do some more exploration to figure this out.
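To double-check what stat_summary(fun = "mean") is drawing, we can compute the same per-color means by hand. This is just a minimal sketch using base R’s tapply(); the variable name mean_price is my own choice for illustration.

```r
library("ggplot2") # provides the diamonds dataset

# average price for each color grade (D = best, J = worst)
mean_price = tapply(X = diamonds$price,
                    INDEX = diamonds$color,
                    FUN = mean)

# one rounded mean per color, in the same order as the bars in the plot
round(mean_price)
```

The values should line up with the heights of the bars (or points) in the plots above.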
Before moving on, let’s set a different default theme for our plots. Personally, I’m not a big fan of the gray background and the white grid lines. Also, the default size of the text should be bigger. We can change the default theme using the theme_set() function like so:
theme_set(theme_classic() + # set the theme
theme(text = element_text(size = 20))) # set the default text size
From now on, all our plots will use what’s specified in theme_classic(), and the default text size will be larger, too. For any individual plot, we can still override these settings.
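Here is a small sketch of what overriding the defaults for a single plot could look like. The choice of theme_minimal() and a text size of 12 is arbitrary and only meant as an illustration.

```r
library("ggplot2")

# default theme for all subsequent plots
theme_set(theme_classic() +
            theme(text = element_text(size = 20)))

# override the defaults for this one plot only
p = ggplot(data = diamonds,
           mapping = aes(x = color,
                         y = price)) +
  stat_summary(fun = "mean",
               geom = "bar") +
  theme_minimal() + # use a different complete theme
  theme(text = element_text(size = 12)) # and smaller text
p
```

Plots created afterwards without these two extra lines still use the defaults that were set via theme_set().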
I don’t know much about diamonds, but I do know that diamonds with a higher carat value tend to be more expensive. color was a discrete variable with seven different values. carat, however, is a continuous variable. We want to see how the price of diamonds differs as a function of the carat value. Since we are interested in the relationship between two continuous variables, plotting a bar graph won’t work. Instead, let’s make a scatter plot. Let’s put the carat value on the x-axis, and the price on the y-axis.
ggplot(data = df.diamonds,
mapping = aes(x = carat,
y = price)) +
geom_point()
Figure 1.2: Scatterplot.
Cool! That looks sensible. Diamonds with a higher carat value tend to have a higher price. Our dataset has 53940 rows. So the plot actually shows 53940 circles even though we can’t see all of them since they overlap.
Let’s make some progress on trying to figure out why the diamonds with the better color weren’t the most expensive ones on average. We’ll add some color to the scatter plot in Figure 1.2. We color each of the points based on the diamond’s color. To do so, we pass another argument to the aesthetics of the plot via aes().
ggplot(data = df.diamonds,
mapping = aes(x = carat,
y = price,
color = color)) +
geom_point()
Figure 1.3: Scatterplot with color.
Aha! Now we’ve got some color. Notice how in Figure 1.3 ggplot2 added a legend for us, thanks! We’ll see later how to play around with legends. From just eyeballing the plot, it looks like the diamonds with the best color (D) tended to have a lower carat value, and the ones with the worst color (J) tended to have the highest carat values.
So this is why diamonds with better colors are less expensive – these diamonds have a lower carat value overall.
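Eyeballing is good, checking is better. As a quick sketch (again using base R’s tapply(), with a variable name that I made up), we can compute the average carat value for each color:

```r
library("ggplot2") # provides the diamonds dataset

# average carat value for each color grade (D = best, J = worst)
mean_carat = tapply(X = diamonds$carat,
                    INDEX = diamonds$color,
                    FUN = mean)

round(mean_carat, 2)
```

If the impression from the scatter plot is right, the average carat value for J should be larger than the one for D.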
There are many other things that we can define in aes(). Take a quick look at the vignette:
vignette("ggplot2-specs")
What else do we know about the diamonds? We actually know the quality of how they were cut. The cut variable ranges from “Fair” to “Ideal”. First, let’s take a look at the relationship between cut and price. This time, we’ll make a line plot instead of a bar plot (just because we can).
ggplot(data = df.diamonds,
mapping = aes(x = cut,
y = price)) +
stat_summary(fun = "mean",
geom = "line")
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
Oops! All we did was replace x = color with x = cut, and geom = "bar" with geom = "line". However, the plot doesn’t look as expected (i.e. there is no real plot). What happened here? The reason is that the line plot needs to know which points to connect. The message tells us that each group consists of only one observation. Let’s adjust the group aesthetic to fix this.
ggplot(data = df.diamonds,
mapping = aes(x = cut,
y = price,
group = 1)) +
stat_summary(fun = "mean",
geom = "line")
By adding the parameter group = 1 to mapping = aes(), we specify that we would like all the levels in x = cut to be treated as coming from the same group. The reason for this is that cut (our x-axis variable) is a factor (and not a numeric variable), so, by default, ggplot2 tries to draw a separate line for each factor level. We’ll learn more about grouping below (and about factors later).
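We can quickly confirm that cut is indeed a factor and see what its levels are. A minimal check in base R:

```r
library("ggplot2") # provides the diamonds dataset

# cut is stored as a factor rather than as a numeric variable
is.factor(diamonds$cut)

# the levels of the factor, in order
levels(diamonds$cut)

# how many levels (and therefore default groups) there are
nlevels(diamonds$cut)
```

Since stat_summary() computes one mean per level, each default group ends up with a single point, and a line cannot be drawn through a single point. Setting group = 1 puts all the means into one group.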
Interestingly, there is no simple relationship between the quality of the cut and the price of the diamond. In fact, “Ideal” diamonds tend to be cheapest.
We often don’t just want to show the means but also give a sense for how much the data varies. ggplot2 has some convenient ways of specifying error bars. Let’s take a look at how much price varies as a function of clarity (another variable in our diamonds data frame).
ggplot(data = df.diamonds,
mapping = aes(x = clarity,
y = price)) +
stat_summary(fun.data = "mean_cl_boot",
geom = "pointrange")
Figure 1.4: Relationship between diamond clarity and price. Error bars indicate 95% bootstrapped confidence intervals.
Here we have it. The average price of our diamonds for different levels of clarity together with bootstrapped 95% confidence intervals. How do we know that we have 95% confidence intervals? That’s what mean_cl_boot() computes as a default. Let’s take a look at that function:
help(mean_cl_boot)
Note that I had to use the fun.data = argument here instead of fun = because the mean_cl_boot() function produces three data points for each value of the x-axis (the mean, lower, and upper confidence interval).
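To get a feel for what a function passed to fun.data = has to return, here is a rough sketch of a hand-written alternative. The helper boot_mean_ci() is hypothetical (it is not part of ggplot2), and it uses a simple percentile bootstrap, so its details may differ from what mean_cl_boot() does internally. The important part is the output format: a data frame with the columns y, ymin, and ymax.

```r
# hypothetical helper (not part of ggplot2): percentile bootstrap
# confidence interval of the mean
boot_mean_ci = function(x, n_boot = 1000, conf = 0.95) {
  # resample the data with replacement and compute the mean each time
  boot_means = replicate(n_boot,
                         mean(sample(x, size = length(x), replace = TRUE)))
  alpha = 1 - conf
  # return the format that stat_summary(fun.data = ...) expects
  data.frame(y = mean(x),
             ymin = unname(quantile(boot_means, probs = alpha / 2)),
             ymax = unname(quantile(boot_means, probs = 1 - alpha / 2)))
}

set.seed(1) # make the resampling reproducible
result = boot_mean_ci(x = c(2, 4, 4, 5, 7, 9, 12))
result
```

A function like this could then be used in place of the string, e.g. stat_summary(fun.data = boot_mean_ci, geom = "pointrange").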
The order in which we add geoms to a ggplot matters! Generally, we want to plot error bars before the points that represent the means. To illustrate, let’s set the color in which we show the means to “red”.
ggplot(data = df.diamonds,
mapping = aes(x = clarity,
y = price)) +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange") +
stat_summary(fun = "mean",
geom = "point",
color = "red")
Figure 1.5: This figure looks good. Error bars and means are drawn in the correct order.
Figure 1.5 looks good.
# I've changed the order in which the means and error bars are drawn.
ggplot(df.diamonds,
aes(x = clarity,
y = price)) +
stat_summary(fun = "mean",
geom = "point",
color = "red") +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange")
Figure 1.6: This figure looks bad. Error bars and means are drawn in the incorrect order.
Figure 1.6 doesn’t look good. The error bars are on top of the points that represent the means.
One cool feature about using stat_summary() is that we did not have to change anything about the data frame that we used to make the plots. We directly used our raw data instead of having to make separate data frames that contain the relevant information (such as the means and the confidence intervals).
You may not remember exactly what confidence intervals actually are. Don’t worry! We’ll have a recap later in class.
Let’s take a look at two more principles for plotting data that are extremely helpful: groups and facets. But before, another practice plot.
Grouping in ggplot2 is a very powerful idea. It allows us to plot subsets of the data – again without the need to make separate data frames first.
Let’s make a plot that shows the relationship between price and color separately for the different qualities of cut.
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
group = cut)) +
stat_summary(fun = "mean",
geom = "line")
Well, we got some separate lines here but we don’t know which line corresponds to which cut. Let’s add some color!
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
group = cut,
color = cut)) +
stat_summary(fun = "mean",
geom = "line",
linewidth = 2)
Nice! In addition to adding color, I’ve made the lines a little thicker here by setting the linewidth argument to 2. (Before ggplot2 3.4.0, the thickness of lines was set via the size argument. Using size for lines is deprecated now and produces a warning.)
Grouping is very useful for bar plots. Let’s take a look at what the average price of diamonds looks like when taking into account both cut and color (I know – exciting times!). Let’s put the color on the x-axis and then group by the cut.
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
group = cut,
color = cut)) +
stat_summary(fun = "mean",
geom = "bar")
That’s a fail! Several things went wrong here. All the bars are gray and only their outline is colored differently. Instead we want the bars to have a different color. For that we need to specify the fill argument rather than the color argument! But things are worse. The bars currently are shown on top of each other. Instead, we’d like to put them next to each other. Here is how we can do that:
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
group = cut,
fill = cut)) +
stat_summary(fun = "mean",
geom = "bar",
position = position_dodge())
Neato! We’ve changed the color argument to fill, and have added the position = position_dodge() argument to the stat_summary() call. This argument makes it such that the bars are nicely dodged next to each other. Let’s add some error bars just for kicks.
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
group = cut,
fill = cut)) +
stat_summary(fun = "mean",
geom = "bar",
position = position_dodge(width = 0.9),
color = "black") +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange",
position = position_dodge(width = 0.9))
Voila! Now with error bars. Note that we’ve added the width = 0.9 argument to position_dodge(). R complains when this is not defined for geom “linerange”, because a linerange has no width of its own that position_dodge() could use. I’ve also added some outline to the bars by including the argument color = "black". I think it looks nicer this way.
So, still somewhat surprisingly, diamonds with the worst color (J) are more expensive than diamonds with the best color (D), and diamonds with better cuts are not necessarily more expensive.
Having too much information in a single plot can be overwhelming. The previous plot is already pretty busy. Facets are a nice way of splitting up plots and showing information in separate panels.
Let’s take a look at how wide these diamonds tend to be. The width in mm is given in the y column of the diamonds data frame. We’ll make a histogram first. To make a histogram, the only aesthetic we need to specify is x.
ggplot(data = df.diamonds,
mapping = aes(x = y)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
That looks bad! Let’s pick a different value for the width of the bins in the histogram.
ggplot(data = df.diamonds,
mapping = aes(x = y)) +
geom_histogram(binwidth = 0.1)
Still bad. There seems to be an outlier diamond that happens to be almost 60 mm wide, while most of the rest is much narrower. One option would be to remove the outlier from the data before plotting it. But generally, we don’t want to make new data frames. Instead, let’s just limit what data we show in the plot.
ggplot(data = df.diamonds,
mapping = aes(x = y)) +
geom_histogram(binwidth = 0.1) +
coord_cartesian(xlim = c(3, 10))
I’ve used the coord_cartesian() function to restrict the range of data to show by passing a minimum and maximum to the xlim argument. This looks better now.
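It can be useful to check how much data we are hiding by zooming in like this. The sketch below counts the diamonds whose width falls outside of the range that we show (the variable name n_outside is my own). Note that coord_cartesian() only changes which part of the plot is visible; the data frame itself is left untouched.

```r
library("ggplot2") # provides the diamonds dataset

# smallest and largest width (in mm) in the data
range(diamonds$y)

# number of diamonds that fall outside of the plotted range
n_outside = sum(diamonds$y < 3 | diamonds$y > 10)
n_outside

# the data frame still contains all of the rows
nrow(diamonds)
```

If n_outside is small relative to the total number of rows, we are not hiding much by restricting the x-axis.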
Instead of histograms, we can also plot a density fitted to the distribution.
ggplot(data = df.diamonds,
mapping = aes(x = y)) +
geom_density() +
coord_cartesian(xlim = c(3, 10))
Looks pretty similar to our histogram above! Just like we can play around with the binwidth of the histogram, we can change the smoothing bandwidth of the kernel that is used to create the density. Here is a density with a much wider bandwidth:
ggplot(data = df.diamonds,
mapping = aes(x = y)) +
geom_density(bw = 0.5) +
coord_cartesian(xlim = c(3, 10))
We’ll learn more about how these densities are determined later in class.
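If you want to play around with the bandwidth outside of a plot, base R’s density() function takes a bw argument as well. As far as I know, geom_density() relies on the same kind of kernel density estimate, but treat this as a sketch rather than a description of what ggplot2 does internally.

```r
library("ggplot2") # provides the diamonds dataset

# kernel density estimates of diamond width with two different bandwidths
dens_narrow = density(diamonds$y, bw = 0.1)
dens_wide = density(diamonds$y, bw = 0.5)

# the bandwidth that was used is stored in the result
dens_narrow$bw
dens_wide$bw

# base R plot of the smoother estimate, zoomed in like the plots above
plot(dens_wide, xlim = c(3, 10), main = "bw = 0.5")
```

The wider the bandwidth, the smoother the resulting curve.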
I promised that this section was about making facets, right? We’re getting there! Let’s first take a look at how wide diamonds of different color are. We can use grouping to make this happen.
ggplot(data = df.diamonds,
mapping = aes(x = y,
group = color,
fill = color)) +
geom_density(bw = 0.2,
alpha = 0.2) +
coord_cartesian(xlim = c(3, 10))
OK! That’s a little tricky to tell apart. Notice that I’ve specified the alpha argument in the geom_density() function so that the densities in the front don’t completely hide the densities in the back. But this plot still looks too busy. Instead of grouping, let’s put the densities for the different colors in separate panels. That’s what faceting allows you to do.
ggplot(data = df.diamonds,
mapping = aes(x = y,
fill = color)) +
geom_density(bw = 0.2) +
facet_grid(cols = vars(color)) +
coord_cartesian(xlim = c(3, 10))
Now we have the densities next to each other in separate panels. I’ve removed the alpha argument since the densities aren’t overlapping anymore. To make the different panels, I used the facet_grid() function and specified that I want separate columns for the different colors (cols = vars(color)). What’s the deal with vars()? Why couldn’t we just write facet_grid(cols = color) instead? The short answer is: that’s what the function wants. The long answer is: long. (We’ll learn more about this later in the course.)
To show the facets in different rows instead of columns we simply replace cols = vars(color) with rows = vars(color).
ggplot(data = df.diamonds,
mapping = aes(x = y,
fill = color)) +
geom_density(bw = 0.2) +
facet_grid(rows = vars(color)) +
coord_cartesian(xlim = c(3, 10))
Several aspects about this plot should be improved:
So, what does this plot actually show us? Well, J-colored diamonds tend to be wider than D-colored diamonds. Fascinating!
Of course, we could go completely overboard with facets and groups. So let’s do it! Let’s look at how the average price (somewhat more interesting) varies as a function of color, cut, and clarity. We’ll put color on the x-axis, and make separate rows for cut and columns for clarity.
ggplot(data = df.diamonds,
mapping = aes(y = price,
x = color,
fill = color)) +
stat_summary(fun = "mean",
geom = "bar",
color = "black") +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange") +
facet_grid(rows = vars(cut),
cols = vars(clarity))
Warning: Removed 1 rows containing missing values (`geom_segment()`).
Warning: Removed 3 rows containing missing values (`geom_segment()`).
Warning: Removed 1 rows containing missing values (`geom_segment()`).
Figure 1.7: A figure that is stretching it in terms of information.
Figure 1.7 is stretching it in terms of how much information it presents. But it gives you a sense for how to combine the different bits and pieces we’ve learned so far.
ggplot2 allows you to specify the plot aesthetics in different ways.
ggplot(data = df.diamonds,
mapping = aes(x = carat,
y = price,
color = color)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE)
`geom_smooth()` using formula = 'y ~ x'
Here, I’ve drawn a scatter plot of the relationship between carat and price, and I have added the best-fitting regression lines via the geom_smooth(method = "lm") call. (We will learn more about what these regression lines mean later in class.)
Because I have defined all the aesthetics at the top level (i.e. directly within the ggplot() function), the aesthetics apply to all the functions afterwards, in this case geom_point() and geom_smooth(). Aesthetics defined in the ggplot() call are global. This is why the geom_smooth() function produces a separate best-fit regression line for each color.
But what if we only wanted to show one regression line instead that applies to all the data? Here is one way of doing so:
ggplot(data = df.diamonds,
mapping = aes(x = carat,
y = price)) +
geom_point(mapping = aes(color = color)) +
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Here, I’ve moved the color aesthetic into the geom_point() function call. Now, the x and y aesthetics still apply to both the geom_point() and the geom_smooth() function call (they are global), but the color aesthetic applies only to geom_point() (it is local). Alternatively, we can simply overwrite global aesthetics within local function calls.
ggplot(data = df.diamonds,
mapping = aes(x = carat,
y = price,
color = color)) +
geom_point() +
geom_smooth(method = "lm",
color = "black")
`geom_smooth()` using formula = 'y ~ x'
Here, I’ve set color = "black" within the geom_smooth() function, and now only one overall regression line is displayed since the global color aesthetic was overwritten in the local function call.
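In case you are curious what the single black line corresponds to: with method = "lm" and the formula y ~ x (see the message above), the line is an ordinary linear regression of price on carat. Here is a minimal sketch that fits this model directly with lm(); the variable name fit is my own.

```r
library("ggplot2") # provides the diamonds dataset

# linear regression of price on carat, using all of the diamonds
fit = lm(formula = price ~ carat,
         data = diamonds)

# intercept and slope of the overall regression line
coef(fit)
```

The slope tells us by how much the predicted price changes for each additional carat.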
Let’s first install the new packages that you might not have yet.
install.packages(c("gganimate", "gapminder", "ggridges", "devtools", "png", "gifski", "patchwork"))
Now, let’s load the packages that we need for this chapter.
library("knitr") # for rendering the RMarkdown file
library("patchwork") # for making figure panels
library("ggridges") # for making joyplots
library("gganimate") # for making animations
library("gapminder") # data available from Gapminder.org
library("tidyverse") # for plotting (and many more cool things we'll discover later)
And set some settings:
# these options here change the formatting of how comments are rendered
opts_chunk$set(comment = "#>",
fig.show = "hold")
# this just suppresses an unnecessary message about grouping
options(dplyr.summarise.inform = F)
# set the default plotting theme
theme_set(theme_classic() + #set the theme
theme(text = element_text(size = 20))) #set the default text size
And let’s load the diamonds data again.
df.diamonds = diamonds
Different plots work best for different kinds of data. Let’s take a look at some.
ggplot(data = df.diamonds,
mapping = aes(x = cut,
fill = color)) +
geom_bar(color = "black")
This bar chart shows, for the different cuts (x-axis), the number of diamonds of different color. Stacked bar charts give a good general impression of the data. However, it’s difficult to precisely compare different proportions.
Figure 2.1: Finally a pie chart that makes sense.
Pie charts have a bad reputation. And there are indeed a number of problems with pie charts:
ggplot(data = df.diamonds,
mapping = aes(x = 1,
fill = cut)) +
geom_bar() +
coord_polar("y", start = 0) +
theme_void()
We can create a pie chart with ggplot2 by changing the coordinate system using coord_polar().
If we are interested in comparing proportions and we don’t have too many data points, then tables are a good alternative to showing figures.
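For example, the proportions that the pie chart above tries to convey fit into a very small table. Here is a minimal sketch using base R’s table() and prop.table() (the variable names are mine):

```r
library("ggplot2") # provides the diamonds dataset

# number of diamonds for each quality of cut
counts = table(diamonds$cut)

# turn the counts into proportions that sum to 1
props = prop.table(counts)

round(props, 3)
```

With only five numbers to compare, a table like this is much easier to read precisely than the slices of a pie.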
Often we want to compare the data from many different conditions. And sometimes, it’s also useful to get a sense for what the individual participant data look like. Here is a plot that achieves both.
ggplot(data = df.diamonds[1:150, ],
mapping = aes(x = color,
y = price)) +
# means with confidence intervals
stat_summary(fun.data = "mean_cl_boot",
geom = "pointrange",
color = "black",
fill = "yellow",
shape = 21,
size = 1) +
# individual data points (jittered horizontally)
geom_point(alpha = 0.2,
color = "blue",
position = position_jitter(width = 0.1, height = 0),
size = 2)
Figure 2.2: Price of differently colored diamonds. Large yellow circles are means, small blue circles are individual data points, and the error bars are 95% bootstrapped confidence intervals.
Note that I’m only plotting the first 150 entries of the data here by setting data = df.diamonds[1:150, ] in ggplot().
This plot shows means, bootstrapped confidence intervals, and individual data points. I’ve used three tricks to make the means and the individual data points easier to see.
1. I’ve set the alpha attribute to make the points somewhat transparent.
2. I’ve used the position_jitter() function to jitter the points horizontally.
3. I’ve used shape = 21 for displaying the mean. For this circle shape, we can set a color and fill (see Figure 2.3).
Figure 2.3: Different shapes that can be used for plotting.
Here is an example of an actual plot that I’ve made for a paper that I’m working on (using the same techniques).
Figure 2.4: Participants’ preference for the conjunctive (top) versus disjunctive (bottom) structure as a function of the explanation (abnormal cause vs. normal cause) and the type of norm (statistical vs. prescriptive). Note: Large circles are group means. Error bars are bootstrapped 95% confidence intervals. Small circles are individual participants’ judgments (jittered along the x-axis for visibility).
Another way to get a sense for the distribution of the data is to use box plots.
ggplot(data = df.diamonds[1:500,],
mapping = aes(x = color, y = price)) +
geom_boxplot()
What do boxplots show? Here is a description adapted from help(geom_boxplot):
The boxplots show the median as a horizontal black line. The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles) of the data. The whiskers (= black vertical lines) extend from the top or bottom of the hinge by at most 1.5 * IQR (where IQR is the inter-quartile range, or distance between the first and third quartiles). Data beyond the end of the whiskers are called “outlying” points and are plotted individually.
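To make this description concrete, here is a rough sketch that computes these quantities by hand for the E-colored diamonds among the first 500 rows. I’m using quantile() with its default settings, so the numbers may differ slightly from what geom_boxplot() computes internally; all variable names are my own.

```r
library("ggplot2") # provides the diamonds dataset

# prices of the E-colored diamonds among the first 500 rows
first_500 = diamonds[1:500, ]
price_e = first_500$price[first_500$color == "E"]

# first quartile, median, and third quartile
q = quantile(price_e, probs = c(0.25, 0.5, 0.75))

# inter-quartile range and the furthest the whiskers may reach
iqr = q[["75%"]] - q[["25%"]]
upper_limit = q[["75%"]] + 1.5 * iqr
lower_limit = q[["25%"]] - 1.5 * iqr

# the whiskers end at the most extreme data points within these limits
whisker_top = max(price_e[price_e <= upper_limit])
whisker_bottom = min(price_e[price_e >= lower_limit])

# everything beyond the limits is plotted as an individual point
outliers = price_e[price_e > upper_limit | price_e < lower_limit]
```

Note that the whiskers only extend to actual data points, which is why they are often shorter than 1.5 * IQR.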
Personally, I’m not a big fan of boxplots. Many data sets are consistent with the same boxplot.
Figure 2.5: Box plot distributions. Source: https://www.autodeskresearch.com/publications/samestats
Figure 2.5 shows three different distributions that each correspond to the same boxplot.
If there is not too much data, I recommend plotting jittered individual data points instead. If you do have a lot of data points, then violin plots can be helpful.
Figure 2.6: Boxplot distributions. Source: https://www.autodeskresearch.com/publications/samestats
Figure 2.6 shows the same raw data represented as jittered dots, boxplots, and violin plots.
We make violin plots like so:
ggplot(data = df.diamonds,
mapping = aes(x = color, y = price)) +
geom_violin()
Violin plots are good for detecting bimodal distributions. They work well when:
Violin plots don’t work well for Likert-scale data (e.g. ratings on a discrete scale from 1 to 7). Here is a simple example:
set.seed(1)
data = tibble(rating = sample(x = 1:7,
prob = c(0.1, 0.4, 0.1, 0.1, 0.2, 0, 0.1),
size = 500,
replace = T))
ggplot(data = data,
mapping = aes(x = "Likert", y = rating)) +
geom_violin() +
geom_point(position = position_jitter(width = 0.05,
height = 0.1),
alpha = 0.05)
This represents a vase much better than it represents the data.
We can also show the distributions along the x-axis using the geom_density_ridges() function from the ggridges package.
ggplot(data = df.diamonds,
mapping = aes(x = price, y = color)) +
ggridges::geom_density_ridges(scale = 1.5)
#> Picking joint bandwidth of 535
Figure 2.7: Practice plot 1.
Scatter plots are great for looking at the relationship between two continuous variables.
ggplot(data = df.diamonds,
mapping = aes(x = carat,
y = price,
color = color)) +
geom_point()
Tile plots (heat maps) are useful for looking at how a variable of interest varies as a function of two other variables. For example, when we are trying to fit a model with two parameters, we might be interested to see how well the model does for different combinations of these two parameters. Here, we’ll plot what carat values diamonds of different color and clarity have.
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = clarity,
z = carat)) +
stat_summary_2d(fun = "mean", geom = "tile")
Not too bad. Let’s add a few tweaks to make it look nicer.
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = clarity,
z = carat)) +
stat_summary_2d(fun = "mean",
geom = "tile",
color = "black") +
scale_fill_gradient(low = "white", high = "black") +
labs(fill = "carat")
I’ve added some outlines to the tiles by specifying color = "black" in stat_summary_2d(), and I’ve changed the scale for the fill gradient: I’ve defined the color for the low value to be “white”, and for the high value to be “black”. Finally, I’ve set the title of the legend via labs(fill = "carat"). Looks much better now! We see that diamonds with clarity I1 and color J tend to have the highest carat values on average.
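The numbers behind the tiles can also be computed directly. This is a small sketch using base R’s tapply() with two grouping variables, which returns a matrix of means with one row per clarity and one column per color (the variable name mean_carat is my own).

```r
library("ggplot2") # provides the diamonds dataset

# average carat value for each combination of clarity and color
mean_carat = tapply(X = diamonds$carat,
                    INDEX = list(clarity = diamonds$clarity,
                                 color = diamonds$color),
                    FUN = mean)

round(mean_carat, 2)
```

Each cell of this matrix corresponds to one tile in the plot, so we can use it to check which combination has the largest mean.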
Line plots are a good choice for temporal data. Here, I’ll use the txhousing data that comes with the ggplot2 package. The dataset contains information about housing sales in Texas.
# ignore this part for now (we'll learn about data wrangling soon)
df.plot = txhousing %>%
filter(city %in% c("Dallas", "Fort Worth", "San Antonio", "Houston")) %>%
mutate(city = factor(city, levels = c("Dallas", "Houston", "San Antonio", "Fort Worth")))
ggplot(data = df.plot,
mapping = aes(x = year,
y = median,
color = city,
fill = city)) +
stat_summary(fun.data = "mean_cl_boot",
geom = "ribbon",
alpha = 0.2,
linetype = 0) +
stat_summary(fun = "mean", geom = "line") +
stat_summary(fun = "mean", geom = "point")
Ignore the top part where I’m defining df.plot for now (we’ll look into this in the next class). The other part is fairly straightforward. I’ve used stat_summary() three times: first, to show the confidence interval as a geom = "ribbon"; second, to show the lines connecting the means; and third, to put the means as data points on top of the lines.
Let’s tweak the figure some more to make it look real good.
df.plot = txhousing %>%
filter(city %in% c("Dallas", "Fort Worth", "San Antonio", "Houston")) %>%
mutate(city = factor(city, levels = c("Dallas", "Houston", "San Antonio", "Fort Worth")))
df.text = df.plot %>%
filter(year == max(year)) %>%
group_by(city) %>%
summarize(year = mean(year) + 0.2,
median = mean(median))
ggplot(data = df.plot,
mapping = aes(x = year,
y = median,
color = city,
fill = city)) +
# draw dashed horizontal lines in the background
geom_hline(yintercept = seq(from = 100000,
to = 250000,
by = 50000),
linetype = 2,
alpha = 0.2) +
# draw ribbon
stat_summary(fun.data = mean_cl_boot,
geom = "ribbon",
alpha = 0.2,
linetype = 0) +
# draw lines connecting the means
stat_summary(fun = "mean", geom = "line") +
# draw means as points
stat_summary(fun = "mean", geom = "point") +
# add the city names
geom_text(data = df.text,
mapping = aes(label = city),
hjust = 0,
size = 5) +
# set the limits for the coordinates
coord_cartesian(xlim = c(1999, 2015),
clip = "off",
expand = F) +
# set the x-axis labels
scale_x_continuous(breaks = seq(from = 2000,
to = 2015,
by = 5)) +
# set the y-axis labels
scale_y_continuous(breaks = seq(from = 100000,
to = 250000,
by = 50000),
labels = str_c("$",
seq(from = 100,
to = 250,
by = 50),
"K")) +
# set the plot title and axes titles
labs(title = "Change of median house sale price in Texas",
x = "Year",
y = "Median house sale price",
fill = "",
color = "") +
theme(title = element_text(size = 16),
legend.position = "none",
plot.margin = margin(r = 1, unit = "in"))
So far, we’ve seen a number of different ways of plotting data. Now, let’s look into how to customize the plots. For example, we may want to change the axis labels, add a title, or increase the font size. ggplot2 lets you customize almost anything.
Let’s start simple.
ggplot(data = df.diamonds,
mapping = aes(x = cut, y = price)) +
stat_summary(fun = "mean",
geom = "bar",
color = "black") +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange")
This plot shows the average price for diamonds with a different quality of the cut, as well as the bootstrapped confidence intervals. Here are some things we can do to make it look nicer.
ggplot(data = df.diamonds,
mapping = aes(x = cut,
y = price)) +
# change color of the fill, make a little more space between bars by setting their width
stat_summary(fun = "mean",
geom = "bar",
color = "black",
fill = "lightblue",
width = 0.85) +
# adjust the range of both axes
coord_cartesian(xlim = c(0.25, 5.75),
ylim = c(0, 5000),
expand = F) +
# make error bars thicker
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange",
linewidth = 1.5) +
# adjust what to show on the y-axis
scale_y_continuous(breaks = seq(from = 0, to = 4000, by = 2000),
labels = seq(from = 0, to = 4000, by = 2000)) +
# add a title, subtitle, and changed axis labels
labs(title = "Price as a function of quality of cut",
subtitle = "Note: The price is in US dollars",
tag = "A",
x = "Quality of the cut",
y = "Price") +
theme(
# adjust the text size
text = element_text(size = 20),
# add some space at top of x-title
axis.title.x = element_text(margin = margin(t = 0.2, unit = "inch")),
# add some space to the right of y-title
axis.title.y = element_text(margin = margin(r = 0.1, unit = "inch")),
# add some space underneath the subtitle and make it gray
plot.subtitle = element_text(margin = margin(b = 0.3, unit = "inch"),
color = "gray70"),
# make the plot tag bold
plot.tag = element_text(face = "bold"),
# move the plot tag a little
plot.tag.position = c(0.05, 0.99)
)
I’ve tweaked quite a few things here (and I’ve added comments to explain what’s happening). Take a quick look at the theme() function to see all the things you can change.
I suggest using this general skeleton for creating a ggplot:
# ggplot call with global aesthetics
ggplot(data = data,
mapping = aes(x = cause,
y = effect)) +
# add geometric objects (geoms)
geom_point() +
stat_summary(fun = "mean", geom = "point") +
... +
# add text objects
geom_text() +
annotate() +
# adjust axes and coordinates
coord_cartesian() +
scale_x_continuous() +
scale_y_continuous() +
# define plot title, and axis titles
labs(title = "Title",
x = "Cause",
y = "Effect") +
# change global aspects of the plot
theme(text = element_text(size = 20),
plot.margin = margin(t = 1, b = 1, l = 0.5, r = 0.5, unit = "cm"))
# save the plot (called on its own line; ggsave() cannot be added to a plot with +)
ggsave(filename = "super_nice_plot.pdf",
width = 8,
height = 6)
Sometimes we don’t have a natural ordering of our independent variable. In that case, it’s nice to show the data ordered by the value of the dependent variable.
ggplot(data = df.diamonds,
mapping = aes(x = reorder(cut, price),
y = price)) +
# mapping = aes(x = cut, y = price)) +
stat_summary(fun = "mean",
geom = "bar",
color = "black",
fill = "lightblue",
width = 0.85) +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange",
size = 1.5) +
labs(x = "cut")
The reorder() function helps us do just that. Now, the results are ordered by mean price (reorder() uses the mean by default). To show the results in descending order, I would simply need to write reorder(cut, -price) instead.
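To see what reorder() does to the factor levels, here is a minimal sketch on a made-up toy data frame (df.toy and its values are my own illustration, not part of the diamonds data):

```r
# toy data: three groups with different mean values (made up for illustration)
df.toy = data.frame(group = factor(c("a", "a", "b", "b", "c", "c")),
                    value = c(5, 7, 1, 3, 9, 11))

# by default, the factor levels are in alphabetical order
levels_default = levels(df.toy$group)

# reorder() sorts the levels by the mean of value within each group
levels_ascending = levels(reorder(df.toy$group, df.toy$value))

# negating the values flips the order
levels_descending = levels(reorder(df.toy$group, -df.toy$value))

print(levels_default)     # "a" "b" "c"
print(levels_ascending)   # "b" "a" "c"
print(levels_descending)  # "c" "a" "b"
```

Since ggplot2 draws discrete axes in the order of the factor levels, changing the levels is all that's needed to change the order of the bars.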
Legends form an important part of many figures. However, it is often better to avoid legends if possible, and directly label the data. This way, the reader doesn’t have to look back and forth between the plot and the legend to understand what’s going on.
Here, we’ll look into a few aspects that come up quite often. There are two main functions to manipulate legends with ggplot2:
1. theme() (there are a number of arguments starting with legend.)
2. guide_legend()
Let’s make a plot with a legend.
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
color = clarity)) +
stat_summary(fun = "mean",
geom = "point")
Let’s move the legend to the bottom of the plot:
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
color = clarity)) +
stat_summary(fun = "mean",
geom = "point") +
theme(legend.position = "bottom")
Let’s change a few more things in the legend using the guides() function:
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
color = clarity)) +
stat_summary(fun = "mean",
geom = "point",
size = 2) +
theme(legend.position = "bottom") +
guides(color = guide_legend(nrow = 3, # 3 rows
reverse = TRUE, # reversed order
override.aes = list(size = 6))) # point size
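As mentioned at the start of this section, directly labeling the data is often preferable to a legend. Here is a rough sketch of one way to do that, assuming we compute the means ourselves first. The names df.means, df.labels, and p.direct are just placeholders I made up, and placing the labels at the right-most color ("J") is one choice among many:

```r
library("ggplot2")

# mean price for each combination of color and clarity
df.means = aggregate(price ~ color + clarity, data = diamonds, FUN = mean)

# keep only the right-most color ("J") so that we get one label per clarity level
df.labels = df.means[df.means$color == "J", ]

p.direct = ggplot(data = df.means,
                  mapping = aes(x = color,
                                y = price,
                                group = clarity,
                                color = clarity)) +
  geom_line() +
  # write the clarity level next to the end of each line
  geom_text(data = df.labels,
            mapping = aes(label = clarity),
            hjust = 0,
            nudge_x = 0.1) +
  # allow the labels to be drawn outside of the plotting area
  coord_cartesian(clip = "off") +
  # remove the legend and make some room for the labels on the right
  theme(legend.position = "none",
        plot.margin = margin(r = 0.5, unit = "in"))

print(p.direct)
```

Labels of lines that end close together will overlap, so this approach works best when the lines are reasonably well separated.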
Let’s load the diamonds data set:
df.diamonds = diamonds
Recreate the plot shown in Figure ??.
ggplot(data = df.diamonds,
mapping = aes(x = depth,
y = table)) +
geom_point()
Figure 3.1: Practice plot 1.
ggplot(data = df.diamonds,
mapping = aes(x = depth,
y = table)) +
geom_point(alpha = 0.1) +
labs(x = "Depth\n(total depth percentage)",
y = "Table\n(width of top of diamond relative to widest point)")
Figure 3.2: Practice plot 1 (advanced).
Recreate the plot shown in Figure ??.
ggplot(data = df.diamonds,
mapping = aes(x = clarity,
y = price)) +
stat_summary(fun = "mean",
geom = "bar") +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange")
Recreate the plot shown in Figure ??.
ggplot(data = df.diamonds,
mapping = aes(x = clarity,
y = price)) +
stat_summary(fun = "mean",
geom = "bar",
color = "black",
fill = "lightblue") +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange",
size = 1) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
Recreate the plot shown in Figure ??.
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
group = clarity,
color = clarity)) +
stat_summary(fun.data = "mean_cl_boot",
geom = "pointrange",
size = 1) +
stat_summary(fun = "mean",
geom = "line",
size = 2)
Recreate the plot shown in Figure ??.
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
group = clarity,
color = clarity)) +
stat_summary(fun = "mean",
geom = "line",
size = 2,
position = position_dodge(width = 0.9)) +
stat_summary(fun.data = "mean_cl_boot",
geom = "pointrange",
shape = 21,
fill = "white",
size = 1,
position = position_dodge(width = 0.9)) +
geom_vline(xintercept = seq(from = 1.5,
by = 1,
length.out = 6),
linetype = 2,
color = "gray20")
Recreate the plot shown in Figure ??.
ggplot(data = df.diamonds,
mapping = aes(x = color,
y = price,
fill = cut)) +
stat_summary(fun = "mean",
geom = "bar",
position = position_dodge(width = 0.9),
color = "black") +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange",
position = position_dodge(width = 0.9),
color = "black") +
facet_grid(rows = vars(clarity)) +
theme(axis.text.y = element_text(size = 8),
strip.text = element_text(size = 10))
Color brewer helps with finding colors that are colorblind safe and print-friendly. For more information on how to use color effectively see here.
For a given project, I often want all of my plots to share certain visual features such as the font type, font size, how the axes are displayed, etc. Instead of defining these for each individual plot, I can set a theme at the beginning of my project so that it applies to all the plots in this file. To do so, I use the theme_set() command:
theme_set(theme_classic() + # classic theme
theme(text = element_text(size = 20))) # text size
Here, I’ve just defined that I want to use theme_classic() for all my plots, and that the text size should be 20. For any individual plot, I can still override any of these defaults.
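Here is a small sketch of how setting and overriding a default theme plays out in practice. The variable names (theme.old, size_default, p.small) are placeholders of mine. The sketch relies on theme_set() invisibly returning the previously active theme, which makes it easy to restore later:

```r
library("ggplot2")

# theme_set() invisibly returns the theme that was active before
theme.old = theme_set(theme_classic() +
                        theme(text = element_text(size = 20)))

# the new default now applies to every plot created afterwards
size_default = theme_get()$text$size

# an individual plot can still override the default
p.small = ggplot(data = diamonds,
                 mapping = aes(x = carat,
                               y = price)) +
  geom_point(alpha = 0.1) +
  theme(text = element_text(size = 12))

# restore the previous theme (e.g. at the end of a script)
theme_set(theme.old)
```

Restoring the old theme is optional. I've only included it here so that the example doesn't change the look of any later plots.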
To save plots, use the ggsave() command. Personally, I prefer to save my plots as pdf files. This way, the plot looks good no matter what size you need it to be. This means it’ll look good both in presentations as well as in a paper. You can save the plot in any format that you like.
I strongly recommend using a relative path to specify where the figure should be saved. This way, if you are sharing the project with someone else via Stanford Box, Dropbox, or GitHub, they will be able to run the code without errors.
Here is an example for how to save one of the plots that we’ve created above.
p1 = ggplot(data = df.diamonds,
mapping = aes(x = cut, y = price)) +
stat_summary(fun = "mean",
geom = "bar",
color = "black",
fill = "lightblue") +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange",
size = 1)
print(p1)
p2 = ggplot(data = df.diamonds,
mapping = aes(x = cut, y = price)) +
stat_summary(fun = "mean",
geom = "bar",
color = "black",
fill = "lightblue") +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange",
size = 1)
ggsave(filename = "figures/diamond_plot.pdf",
plot = p1,
width = 8,
height = 6)
Here, I’m saving the plot in the figures folder and its name is diamond_plot.pdf. I also specify the width and height of the plot in inches (which is the default unit).
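One thing that can trip people up is that ggsave() fails if the target folder doesn't exist yet. Here is a minimal sketch that creates the folder first and states the units explicitly. The file name diamond_plot_example.pdf is just a made-up example:

```r
library("ggplot2")

p = ggplot(data = diamonds,
           mapping = aes(x = cut,
                         y = price)) +
  stat_summary(fun = "mean",
               geom = "bar")

# create the folder (relative to the working directory) if it doesn't exist yet
dir.create("figures", showWarnings = FALSE)

# build the relative path in a way that works across operating systems
path_out = file.path("figures", "diamond_plot_example.pdf")

ggsave(filename = path_out,
       plot = p,
       width = 8,
       height = 6,
       units = "in")
```

Passing the plot explicitly via the plot argument avoids accidentally saving whichever plot happened to be displayed last.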
Sometimes, we want to create a figure with several subfigures, each of which is labeled with a), b), etc. We have already learned how to make separate panels using facet_wrap() or facet_grid(). The R package patchwork makes it very easy to combine multiple plots. You can find out more about the package here.
Let’s combine a few plots that we’ve made above into one.
# first plot
p1 = ggplot(data = df.diamonds,
mapping = aes(x = y, fill = color)) +
geom_density(bw = 0.2,
show.legend = F) +
facet_grid(cols = vars(color)) +
labs(title = "Width of differently colored diamonds") +
coord_cartesian(xlim = c(3, 10),
expand = F) #setting expand to FALSE removes any padding on x and y axes
# second plot
p2 = ggplot(data = df.diamonds,
mapping = aes(x = color,
y = clarity,
z = carat)) +
stat_summary_2d(fun = "mean",
geom = "tile") +
labs(title = "Carat values",
subtitle = "For different color and clarity",
x = "Color")
# third plot
p3 = ggplot(data = df.diamonds,
mapping = aes(x = cut, y = price)) +
stat_summary(fun = "mean",
geom = "bar",
color = "black",
fill = "lightblue",
width = 0.85) +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange",
size = 1.5) +
scale_x_discrete(labels = c("fair", "good", "very\ngood", "premium", "ideal")) +
labs(title = "Price as a function of cut",
subtitle = "Note: The price is in US dollars",
x = "Quality of the cut",
y = "Price") +
coord_cartesian(xlim = c(0.25, 5.75),
ylim = c(0, 5000),
expand = F)
# combine the plots
p1 + (p2 + p3) +
plot_layout(ncol = 1) &
plot_annotation(tag_levels = "A") &
theme_classic() &
theme(plot.tag = element_text(face = "bold", size = 20))
# ggsave("figures/combined_plot.png", width = 10, height = 6)
Not a perfect plot yet, but you get the idea. To combine the plots, we used parentheses to specify that p2 and p3 should be displayed in the same row. And we specified that we only want one column via the plot_layout() function. We also applied the same theme_classic() to all the plots using the & operator, and formatted how the plot tags should be displayed. For more info on how to use patchwork, take a look at the readme on the GitHub page.
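patchwork also provides the | and / operators, which can make the intended layout a little more explicit. Here is a small sketch with three simple placeholder plots (p.a, p.b, and p.c are made up for this example):

```r
library("ggplot2")
library("patchwork")

# three simple plots to combine
p.a = ggplot(data = diamonds, mapping = aes(x = carat)) +
  geom_histogram(bins = 30)
p.b = ggplot(data = diamonds, mapping = aes(x = cut)) +
  geom_bar()
p.c = ggplot(data = diamonds, mapping = aes(x = color)) +
  geom_bar()

# | places plots side by side, / stacks them on top of each other
p.combined = p.a / (p.b | p.c) +
  plot_annotation(tag_levels = "A") &
  theme(plot.tag = element_text(face = "bold"))

print(p.combined)
```

This puts p.a in the top row, and p.b and p.c next to each other in the bottom row.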
Other packages that provide additional functionality for combining multiple plots into one are
An alternative way for making these plots is to use Adobe Illustrator, PowerPoint, or Keynote. However, you want to make changing plots as easy as possible. Adobe Illustrator has a feature that allows you to link to files. This way, if you change the plot, the plot within the Illustrator file gets updated automatically as well.
If possible, it’s much better to do everything in R though so that your plot can easily be reproduced by someone else.
Sometimes it can be helpful for debugging to take a look behind the scenes. Silently, ggplot() computes a data frame based on the information you pass to it. We can take a look at the data frame that’s underlying the plot.
p = ggplot(data = df.diamonds,
mapping = aes(x = color,
y = clarity,
z = carat)) +
stat_summary_2d(fun = "mean",
geom = "tile",
color = "black") +
scale_fill_gradient(low = "white", high = "black")
print(p)
build = ggplot_build(p)
df.plot_info = build$data[[1]]
dim(df.plot_info) # data frame dimensions
#> [1] 56 18
I’ve called the ggplot_build() function on the ggplot2 object that we saved as p, and extracted the data associated with the first layer of that plot object. The first thing we note about the data frame is how many rows it has: 56. That’s good. This means there is one value for each cell of the 7 x 8 grid. The columns tell us what color was used for the fill, the value associated with each row, where each row is being displayed (x and y), etc.
If a plot looks weird, it’s worth taking a look behind the scenes. For example, something we could have tried is the following (in fact, this is what I tried first):
p = ggplot(data = df.diamonds,
mapping = aes(x = color,
y = clarity,
fill = carat)) +
geom_tile(color = "black") +
scale_fill_gradient(low = "white", high = "black")
print(p)
build = ggplot_build(p)
df.plot_info = build$data[[1]]
dim(df.plot_info) # data frame dimensions
#> [1] 53940 15
Why does this plot look different from the one before? What went wrong here? Notice that the data frame associated with the ggplot2 object has 53940 rows. So instead of plotting means here, we plotted all the individual data points. So what we are seeing here is just the top layer of many, many layers.
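If we do want to use geom_tile() here, one way to fix the plot is to compute the means ourselves before plotting. The sketch below is just one possible approach (df.means and p.means are placeholder names). It uses layer_data(), which is a shortcut for pulling a single layer's data out of ggplot_build():

```r
library("ggplot2")

# compute the mean carat for each color and clarity combination first
df.means = aggregate(carat ~ color + clarity, data = diamonds, FUN = mean)

p.means = ggplot(data = df.means,
                 mapping = aes(x = color,
                               y = clarity,
                               fill = carat)) +
  geom_tile(color = "black") +
  scale_fill_gradient(low = "white", high = "black")

# extract the data frame underlying the first layer of the plot
df.layer = layer_data(p.means, i = 1)

# now there is one row per tile rather than one row per diamond
n_tiles = nrow(df.layer)
print(n_tiles)
```

With the data summarized beforehand, the number of rows should match the 56 tiles of the stat_summary_2d() version above.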
Animated plots can be a great way to illustrate your data in presentations. The R package gganimate lets you do just that.
Here is an example showing how to use it.
ggplot(data = gapminder,
mapping = aes(x = gdpPercap,
y = lifeExp,
size = pop,
colour = country)) +
geom_point(alpha = 0.7, show.legend = FALSE) +
geom_text(data = gapminder %>%
filter(country %in% c("United States", "China", "India")),
mapping = aes(label = country),
color = "black",
vjust = -0.75,
show.legend = FALSE) +
scale_colour_manual(values = country_colors) +
scale_size(range = c(2, 12)) +
scale_x_log10(breaks = c(1e3, 1e4, 1e5),
labels = c("1,000", "10,000", "100,000")) +
theme_classic() +
theme(text = element_text(size = 23)) +
# Here come the gganimate specific bits
labs(title = "Year: {frame_time}", x = "GDP per capita", y = "life expectancy") +
transition_time(year) +
ease_aes("linear")
#> Warning: No renderer available. Please install the gifski, av, or magick
#> package to create animated output
#> NULL
# anim_save(filename = "figures/life_gdp_animation.gif") # to save the animation
This takes a while to run but it’s worth the wait. The plot shows how the relationship between GDP per capita (on the x-axis) and life expectancy (on the y-axis) changes across years for different countries. The size of each dot represents the population size of the respective country, and different countries are shown in different colors. This animation is not super useful yet in that we don’t know which countries most of the dots represent. I’ve only added labels for the United States, China, and India.
Note how little is required to define the gganimate-specific information! The {frame_time} variable changes the title for each frame. The transition_time() function is set to year, and the kind of transition is set as ‘linear’ in ease_aes(). To save the animation as a gif in the figures folder, uncomment the anim_save() line.
We won’t have time to go into more detail here but I encourage you to play around with gganimate. It’s fun, looks cool, and (if done well) makes for a great slide in your next presentation!
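The warning in the output above shows that gganimate needs a renderer package to actually produce the animation. Here is a small sketch of rendering and saving explicitly, assuming the gifski package is installed. The toy data and the file name toy_animation.gif are made up for illustration:

```r
library("ggplot2")
library("gganimate")

# toy data: two points moving over five years (made up for illustration)
df.toy = data.frame(year = rep(2001:2005, times = 2),
                    id = rep(c("a", "b"), each = 5),
                    x = c(1:5, 5:1),
                    y = c(1:5, 1:5))

anim = ggplot(data = df.toy,
              mapping = aes(x = x,
                            y = y,
                            color = id)) +
  geom_point(size = 5) +
  labs(title = "Year: {frame_time}") +
  transition_time(year) +
  ease_aes("linear")

# only render if the gifski renderer is available
if (requireNamespace("gifski", quietly = TRUE)) {
  gif = animate(anim,
                nframes = 20,
                fps = 5,
                renderer = gifski_renderer())
  # save the rendered animation using a relative path
  dir.create("figures", showWarnings = FALSE)
  anim_save(filename = file.path("figures", "toy_animation.gif"),
            animation = gif)
}
```

Keeping nframes small while developing makes rendering much faster. You can increase it once you're happy with how the animation looks.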
The package shiny makes it relatively easy to create interactive plots that can be hosted online. Here is a gallery with some examples.
Often, we want to create similar plots over and over again. One way to achieve this is by finding the original plot, copying and pasting it, and changing the bits that need changing. Another, more flexible and faster way to do this is by using snippets. Snippets are short pieces of code that
Here are some snippets I use:
snippet sngg
ggplot(data = ${1:data},
mapping = aes(${2:aes})) +
${0}
snippet sndf
${1:data} = ${1:data} %>%
${0}
To make a bar plot, I now only need to type snbar and then hit TAB to activate the snippet. I can then cycle through the bits in the code that are marked with ${Number:word} by hitting TAB again.
In RStudio, you can change and add snippets by going to Tools –> Global Options… –> Code –> Edit Snippets. Make sure to set the tick mark in front of Enable Code Snippets (see Figure 3.3).
Figure 3.3: Enable code snippets.
To edit code snippets faster, run this command from the usethis package. Make sure to install the package first if you don’t have it yet.
# install.packages("usethis")
usethis::edit_rstudio_snippets()
This command opens up a separate tab in RStudio called r.snippets so that you can make new snippets and adapt old ones more quickly. Take a look at the snippets that RStudio already comes with. And then, make some new ones! By using snippets you will be able to avoid typing the same code over and over again, and you won’t have to memorize as much either.
Information about this R session including which version of R was used, and what packages were loaded.
sessionInfo()
#> R version 4.2.2 (2022-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22621)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] gapminder_0.3.0 gganimate_1.0.8 ggridges_0.5.4 patchwork_1.1.2
#> [5] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.0 purrr_1.0.1
#> [9] readr_2.1.3 tidyr_1.3.0 tibble_3.1.8 ggplot2_3.4.0
#> [13] tidyverse_1.3.2 knitr_1.42
#>
#> loaded via a namespace (and not attached):
#> [1] nlme_3.1-160 fs_1.6.0 lubridate_1.9.1
#> [4] progress_1.2.2 RColorBrewer_1.1-3 httr_1.4.4
#> [7] tools_4.2.2 backports_1.4.1 bslib_0.4.2
#> [10] utf8_1.2.2 R6_2.5.1 rpart_4.1.19
#> [13] mgcv_1.8-41 Hmisc_4.7-2 DBI_1.1.3
#> [16] colorspace_2.1-0 nnet_7.3-18 withr_2.5.0
#> [19] prettyunits_1.1.1 tidyselect_1.2.0 gridExtra_2.3
#> [22] compiler_4.2.2 textshaping_0.3.6 cli_3.6.0
#> [25] rvest_1.0.3 htmlTable_2.4.1 xml2_1.3.3
#> [28] labeling_0.4.2 bookdown_0.32 sass_0.4.5
#> [31] scales_1.2.1 checkmate_2.1.0 systemfonts_1.0.4
#> [34] digest_0.6.31 foreign_0.8-83 rmarkdown_2.20
#> [37] base64enc_0.1-3 jpeg_0.1-10 pkgconfig_2.0.3
#> [40] htmltools_0.5.4 dbplyr_2.3.0 fastmap_1.1.0
#> [43] highr_0.10 htmlwidgets_1.6.1 rlang_1.0.6
#> [46] readxl_1.4.1 rstudioapi_0.14 jquerylib_0.1.4
#> [49] farver_2.1.1 generics_0.1.3 jsonlite_1.8.4
#> [52] googlesheets4_1.0.1 magrittr_2.0.3 Formula_1.2-4
#> [55] interp_1.1-3 Matrix_1.5-1 Rcpp_1.0.10
#> [58] munsell_0.5.0 fansi_1.0.4 lifecycle_1.0.3
#> [61] stringi_1.7.12 yaml_2.3.7 grid_4.2.2
#> [64] crayon_1.5.2 deldir_1.0-6 lattice_0.20-45
#> [67] haven_2.5.1 splines_4.2.2 hms_1.1.2
#> [70] pillar_1.8.1 reprex_2.0.2 glue_1.6.2
#> [73] evaluate_0.20 latticeExtra_0.6-30 data.table_1.14.6
#> [76] modelr_0.1.10 tweenr_2.0.2 png_0.1-8
#> [79] vctrs_0.5.2 tzdb_0.3.0 cellranger_1.1.0
#> [82] gtable_0.3.1 assertthat_0.2.1 cachem_1.0.6
#> [85] xfun_0.36 broom_1.0.3 ragg_1.2.5
#> [88] survival_3.4-0 googledrive_2.0.0 viridisLite_0.4.1
#> [91] gargle_1.3.0 cluster_2.1.4 timechange_0.2.0
#> [94] ellipsis_0.3.2