Some exercises adapted from those found at http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html

Figures are important communication tools. Not only do they communicate your data, but sometimes they communicate more about you as a researcher than you may think.

Imagine that you are handed a figure that was obviously created in Excel, along with results that indicate statistics were done on the dataset. Excel is not a statistics package, so you might reasonably wonder: how careful could the researchers have been? How much control did they have over their statistical analyses? Can you trust their analysis to be correct?

Creating a figure that is uniquely your own - and looks nice - gives you an advantage.

ggplot2

We’ve seen plots from ggplot2 several times during this workshop. It is a plotting package that has some really nice features. That said, it has limitations, and there are lots of options out there for plotting in R.

Advantages of ggplot2:

Some limitations of ggplot2:

The “grammar of graphics” in brief

ggplot2 breaks down a graphic, or figure, into building blocks. If you understand these building blocks - or syntax - you can build nearly any graphic you like. The building blocks are:

The syntax of ggplot2 can take some getting used to. But, once you’ve figured it out, it becomes remarkably easy to control even the most minute details of your figures.

Today’s goal

By the end of this exercise, everyone should be able to recreate this figure using ggplot2. The data is from the mtcars dataset, which ships with R.


For something a bit more advanced, try to recreate this figure from The Economist:

From The Economist


Below is my attempt:


Note: The author of The Economist figure has done some extra data manipulations, and we don’t know their statistical model, so you won’t be able to reproduce it exactly. I’ve labeled a random subsample of points. More resources for this challenge are located throughout, and towards the end of, this RMarkdown document. Data available here.


Basic graphics

It’s important to know that the base graphic functions are always available to you - but they are often not as customizable as ggplot2. Let’s look at some mtcars data using base plotting functions.

hist(mtcars$mpg)


Now for the ggplot2 histogram:

library(ggplot2)
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


The base graphics version looks a bit better, doesn’t it?

We can try to make our ggplot look a bit better by changing the binwidth.

library(ggplot2)
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5)
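
The message printed earlier is ggplot2 pointing out that it fell back to 30 bins. As a minimal sketch, a histogram can be controlled either by the width of each bin or by the number of bins. The values 2.5 and 10 below are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

# Control the histogram by the width of each bin, in units of mpg
p_width <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2.5, colour = "white")   # white outlines separate the bars

# ...or by the total number of bins
p_bins <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, colour = "white")

p_width
p_bins
```

Specifying a width is usually easier to reason about, because it is expressed in the units of the variable being plotted.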


Let’s look at a more complex example. Let’s plot mpg by weight (wt) by transmission type using base and then ggplot:

plot(mpg ~ wt, 
     data = subset(mtcars, am == 1)) # Plot mpg by weight, use a subset of the data, where transmission == "manual" (==1)
points(mpg ~ wt, 
       data = subset(mtcars, am == 0), col = "red") # Ditto, but for automatic transmissions
legend(3,34, # Where to put the legend
       c("Manual", "Automatic"), # What to call the data keys
       col = c("black", "red"),  # What to colour them
       pch = (c(1,1))) # What symbol to use


Some of the data is cut off. Why do you think that is?
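
One way to investigate, sketched here with base R only, is to compare the range of each variable in the two subsets. The helper `in_range()` and the object names are made up for this illustration.

```r
# Split the data the same way the plotting code does
manual    <- subset(mtcars, am == 1)
automatic <- subset(mtcars, am == 0)

# Compare the spread of the x and y variables in each group
range(manual$wt)
range(automatic$wt)
range(manual$mpg)
range(automatic$mpg)

# Hypothetical helper: TRUE where x lies within the range of ref
in_range <- function(x, ref) x >= min(ref) & x <= max(ref)

# Automatic cars that fall outside the range spanned by the manual cars
hidden <- !(in_range(automatic$wt, manual$wt) &
            in_range(automatic$mpg, manual$mpg))
sum(hidden)
```

plot() chooses its axis limits from the data it is given, which here is only the manual subset (plus a small margin), so automatic cars far outside that range never appear when points() adds them.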

plot(mpg ~ wt, 
     data = subset(mtcars, am == 1), 
     xlim = c(1,5.5), 
     ylim = c(10,36))                         # Set the x and y axis limits wide enough for both groups (the heaviest car is over 5.4)
points(mpg ~ wt, 
       data = subset(mtcars, am == 0), 
       col = "red")
legend(3.5,34,
       c("Manual", "Automatic"), 
       col = c("black", "red"), 
       pch = (c(1,1)))


Or:

plot(mpg ~ wt, 
     data = mtcars, 
     type = "n")                              # Set up a plot using ALL the data, but don't actually plot anything (type = "n" draws nothing)
points(mpg ~ wt, 
       data = subset(mtcars, am == 1))
points(mpg ~ wt, 
       data = subset(mtcars, am == 0), 
       col = "red")
legend(4.2,34,
       c("Manual", "Automatic"), 
       col = c("black", "red"), 
       pch = (c(1,1)))


Now the ggplot2:

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg,
           colour = as.factor(am))) +         # Set the aesthetic mapping so that x = wt, y = mpg, and colour the points by am
  geom_point()                                # Make it a scatter plot


We need to do some fine-tuning, but I’d argue that the ggplot2 syntax is easier to use. In fact, we can quickly switch between plot types:

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg,
           colour = as.factor(am))) +         # Set the aesthetic mapping so that x = wt, y = mpg, and colour the points by am
  geom_smooth()                               # Make it a loess fit
## `geom_smooth()` using method = 'loess'

Or

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg,
           colour = as.factor(am))) +         # Set the aesthetic mapping so that x = wt, y = mpg, and colour the points by am
  geom_density2d()                                # Make it a 2D density plot

Just as examples…
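
Because a ggplot is built up by adding pieces, the shared part can be stored in an object and reused. The following is a minimal sketch of that idea; the object names `p` and `p_both` are arbitrary.

```r
library(ggplot2)

# Store the data and aesthetic mapping once...
p <- ggplot(mtcars,
            aes(x = wt,
                y = mpg,
                colour = as.factor(am)))

# ...then add whichever geom you want to look at
p + geom_point()
p + geom_smooth(method = "loess", formula = y ~ x)

# Layers stack, so points and smooths can share one plot
p_both <- p +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)
p_both
```

Nothing is drawn until a plot object is printed, so `p` on its own is just a specification waiting for a geom.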

What’s ggplot2 all about?

Aesthetic mapping

Included in the aesthetic mapping are all the things you can see, and that change according to your variables. These might include:

  • The position of points, lines, bars, etc.
  • Groupings
  • Colour (of lines, or edges)
  • Fill (colour inside lines, polygons, or edges)
  • Shape of points
  • Types of line (dashed, dotted, solid, etc.)
  • Size (of points, lines, etc.)

These are usually set within the aes() function.
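
The difference between mapping an aesthetic (inside aes()) and setting it to a fixed value (outside aes()) is easiest to see side by side. This is a small sketch; the colour "red" and size 3 are arbitrary.

```r
library(ggplot2)

# Mapped: colour is inside aes(), so it varies with the hp column and gets a legend
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = hp))

# Set: colour is outside aes(), so every point gets the same fixed colour and no legend
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "red", size = 3)

p_mapped
p_set
```

A common slip is writing aes(colour = "red"), which maps colour to a constant string and produces a legend with a single entry rather than red points.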

Geometric objects

These are the types of geometries, or plots, that we want to make out of our data. There are many, but some are:

  • Points (geom_point for scatter and dot plots)
  • Lines (geom_line for line plots, and also for functions or regressions)
  • Bars (geom_bar for bar graphs)
  • Boxplots (geom_boxplot)
  • Violin plots (geom_violin - these are fancy boxplots!)
  • “Smooths” (geom_smooth for complex trends and confidence intervals)
  • … There are many more!…
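
As a quick sketch of two of the geoms listed above, here is mpg by number of cylinders drawn as boxplots and as violins. The fill colour and alpha value are arbitrary choices.

```r
library(ggplot2)

# cyl is stored as a number, so convert it to a factor to get one group per value
p <- ggplot(mtcars, aes(x = as.factor(cyl), y = mpg))

p + geom_boxplot()
p + geom_violin()

# Geoms can be layered: a violin with the raw points drawn on top
p_layered <- p +
  geom_violin(fill = "grey90") +
  geom_point(alpha = 0.6)
p_layered
```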

Manipulating a scatterplot

Let’s start with the plot we started above, and change the colours. (Note: ggplot2’s author, Hadley Wickham, is from New Zealand, hence “colours”. The American spelling, as in scale_color_manual, works too.)

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg,
           colour = as.factor(am))) +         # Set the aesthetic mapping so that x = wt, y = mpg, and colour the points by am
  geom_point() +
  scale_colour_manual(values = c("black","red")) # Notice that the colours are assigned in the order of the factor levels (here 0, then 1)
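
Relying on level order is easy to get wrong. One alternative, sketched below, is to recode am into a labelled factor and pass a named vector of colours, which is matched by name rather than by position. The column name `transmission` and the copy `cars` are inventions for this example.

```r
library(ggplot2)

# Recode am (0/1) into a labelled factor; the labels are our own choice
cars <- mtcars
cars$transmission <- factor(cars$am,
                            levels = c(0, 1),
                            labels = c("Automatic", "Manual"))

p_named <- ggplot(cars, aes(x = wt, y = mpg, colour = transmission)) +
  geom_point() +
  # Named values are matched to factor levels by name, not by position
  scale_colour_manual(values = c(Manual = "black", Automatic = "red"),
                      name = "Transmission")
p_named
```

This also gives the legend readable entries instead of 0 and 1.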


Maybe we don’t want to colour the points by transmission type at all. Maybe instead we’re interested in the horsepower of the car.

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg)) +         
  geom_point(aes(colour = hp))


You see that since hp is a continuous variable, the points are now coloured using a gradient. We can change that gradient.

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg)) +         
  geom_point(aes(colour = hp)) +
  scale_colour_gradient(low = "yellow", high = "red")


ggplot2 also has some built-in palettes to use. Here is one of the ColorBrewer palettes, applied as a gradient.

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg)) +         
  geom_point(aes(colour = hp)) +
  scale_colour_distiller(palette = "Spectral")


While we’re at it, we can change the legend title.

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg)) +         
  geom_point(aes(colour = hp)) +
  scale_colour_distiller(palette = "Spectral", name = "Horsepower")  # name sets the legend title; it can also be passed unnamed
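
The legend title can also be set separately from the scale with labs(), which some people find easier to read. A minimal sketch:

```r
library(ggplot2)

p_labs <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = hp)) +
  scale_colour_distiller(palette = "Spectral") +
  labs(colour = "Horsepower")   # title for the legend of the colour aesthetic
p_labs
```

labs() takes one argument per aesthetic, so axis titles can be set in the same call with x = and y =.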


Adding another geom

You can add as many geoms to a plot as you like. For example, maybe we’d like to add a regression line to the current plot we’re working with.

First, let’s see what happens when we just add a geom_line.

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg)) +         
  geom_point(aes(colour = hp)) +
  scale_colour_distiller(palette = "Spectral", "Horsepower") +
  geom_line()


That doesn’t really serve our current purpose, but it would be useful, for example, in a time-series.

What about adding that regression line? Well, we could do the regression, and add the data to the graph:

mydata <- mtcars
mydata$pred <- predict(lm(mpg~wt,mtcars))     # Do the regression and save the predictions

ggplot(mydata,                                # Make a ggplot using "mydata"
       aes(x = wt,
           y = mpg)) +         
  geom_point(aes(colour = hp)) +
  scale_colour_distiller(palette = "Spectral", "Horsepower") +
  geom_line(aes(y = pred))                    # Plot the predictions as a line - the results of the linear regression.


Or, we could use geom_smooth and do the regression within the ggplot…

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg)) +         
  geom_point(aes(colour = hp)) +
  scale_colour_distiller(palette = "Spectral", "Horsepower") +
  geom_smooth(method = "lm")


Take away the 95% confidence interval and change the regression line’s colour…

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg)) +         
  geom_point(aes(colour = hp)) +
  scale_colour_distiller(palette = "Spectral", "Horsepower") +
  geom_smooth(method = "lm", se = FALSE, colour = "black")

Voilà! The same plot as the one we built from the predict() output.
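
A third route, sketched here as an aside, is to pull the intercept and slope out of the fitted model and draw the line with geom_abline(). The object names `fit` and `b` are arbitrary.

```r
library(ggplot2)

# Fit the same simple linear regression used above
fit <- lm(mpg ~ wt, data = mtcars)
b   <- coef(fit)   # named vector: "(Intercept)" and "wt"

p_abline <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = hp)) +
  # Draw the fitted line from its coefficients; unlike geom_smooth(),
  # geom_abline() extends across the whole panel
  geom_abline(intercept = b[["(Intercept)"]], slope = b[["wt"]])
p_abline
```

This only works for a straight line, which is why geom_smooth() or predict() is the more general tool.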


You can scale all sorts of things - like the size or alpha (transparency) of points…

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg)) +         
  geom_point(aes(colour = hp, size = qsec), alpha = 0.8) +            # qsec = 1/4 mile time;
  scale_colour_distiller(palette = "Spectral", "Horsepower") +
  geom_smooth(method = "lm", se = FALSE, colour = "black")

Challenge

Can you add a scale for alpha (transparency) by rear axle ratio (drat)?

Themes, or theme()

Within a ggplot, you can control virtually any part of your plot with options from the theme() command. You can get very specific, and some folks have written canned “themes” that you can use. Here are some examples:

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg)) +         
  geom_point(aes(colour = hp, size = qsec), alpha = 0.8) +            # qsec = 1/4 mile time;
  scale_colour_distiller(palette = "Spectral", "Horsepower") +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  theme_bw() # <------- this is a canned "black and white" theme

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg)) +         
  geom_point(aes(colour = hp, size = qsec), alpha = 0.8) +            # qsec = 1/4 mile time;
  scale_colour_distiller(palette = "Spectral", "Horsepower") +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  theme_linedraw() # <------- another canned theme

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg)) +         
  geom_point(aes(colour = hp, size = qsec), alpha = 0.8) +            # qsec = 1/4 mile time;
  scale_colour_distiller(palette = "Spectral", "Horsepower") +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  theme_light() # <------- another canned theme

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg)) +         
  geom_point(aes(colour = hp, size = qsec), alpha = 0.8) +            # qsec = 1/4 mile time;
  scale_colour_distiller(palette = "Spectral", "Horsepower") +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  theme_classic() # <------- another canned theme

ggplot(mtcars,                                # Make a ggplot using mtcars data
       aes(x = wt,
           y = mpg)) +         
  geom_point(aes(colour = hp, size = qsec), alpha = 0.8) +            # qsec = 1/4 mile time;
  scale_colour_distiller(palette = "Spectral", "Horsepower") +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  theme_minimal() # <------- another canned theme


The important thing to understand about these themes is that you could build them on your own. Check out this and this for some additional resources, but we will be breaking some of this down below.
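
To make "build them on your own" concrete, here is a rough sketch of a home-made theme: a function that starts from theme_bw() and overrides a few elements. The name `theme_workshop` and the particular settings are hypothetical, chosen only for illustration.

```r
library(ggplot2)

# A hypothetical house theme: start from theme_bw() and override a few elements
theme_workshop <- function(base_size = 12) {
  theme_bw(base_size = base_size) +
    theme(panel.grid.minor = element_blank(),          # drop minor grid lines
          legend.position  = "bottom",                 # legend under the panel
          axis.title       = element_text(face = "bold"))
}

p_themed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = hp)) +
  theme_workshop()
p_themed
```

Wrapping the settings in a function means every figure in a project can share one look, and a change in one place updates them all.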

Let’s begin with the plot below, which we’ve been working towards today. We’ve plotted mpg by wt, and we’re grouping the results by the number of cylinders (cyl), and scaling points by horsepower (hp).

ggplot(aes(x = wt, y = mpg, colour = as.factor(cyl), fill = as.factor(cyl)), data = mtcars) +
  geom_point(aes(size = hp), shape = 21) +
  geom_smooth(aes(linetype = as.factor(cyl)), fill = "grey", method = "lm", formula = y~x, alpha = 0.2) 


The first thing I’d like to do is change the colours that I’m using, and fix my legend titles.

ggplot(aes(x = wt, y = mpg, colour = as.factor(cyl), fill = as.factor(cyl)), data = mtcars) +
  geom_point(aes(size = hp), shape = 21) +
  geom_smooth(aes(linetype = as.factor(cyl)), fill = "grey", method = "lm", formula = y~x, alpha = 0.2) +
  scale_colour_manual(values = c("black", "grey", "black"), name = "Number of \ncylinders") + # Here you can specify EDGE colour, and the title of the legend.
  scale_fill_manual(values = c("black", "grey", NA), name = "Number of \ncylinders") # Ditto for point fill. Notice that I've used NO fill (NA) for 8 cylinder cars, the third factor level. Also, setting the name of the legend to be identical to the one above COMBINES them. What happens if you use a different name?


Next, I’d like to specify what kind of line to use for each regression, and control the range of the point size:

ggplot(aes(x = wt, y = mpg, colour = as.factor(cyl), fill = as.factor(cyl)), data = mtcars) +
  geom_point(aes(size = hp), shape = 21) +
  geom_smooth(aes(linetype = as.factor(cyl)), fill = "grey", method = "lm", formula = y~x, alpha = 0.2) +
  scale_colour_manual(values = c("black", "grey", "black"), name = "Number of \ncylinders") +
  scale_fill_manual(values = c("black", "grey", NA), name = "Number of \ncylinders") +
  scale_linetype_manual(values = c("solid", "solid", "twodash"), name = "Number of \ncylinders") + # I'm doing the same again for linetype.
  scale_size_continuous(breaks = c(100,200,300), range = c(.5,8),name = "Horsepower")  # and for the size of the points.


Add x and y labels.

ggplot(aes(x = wt, y = mpg, colour = as.factor(cyl), fill = as.factor(cyl)), data = mtcars) +
  geom_point(aes(size = hp), shape = 21) +
  geom_smooth(aes(linetype = as.factor(cyl)), fill = "grey", method = "lm", formula = y~x, alpha = 0.2) +
  scale_colour_manual(values = c("black", "grey", "black"), name = "Number of \ncylinders") +
  scale_fill_manual(values = c("black", "grey", NA), name = "Number of \ncylinders") +
  scale_linetype_manual(values = c("solid", "solid", "twodash"), name = "Number of \ncylinders") + # I'm doing the same again for linetype.
  scale_size_continuous(breaks = c(100,200,300), range = c(.5,8),name = "Horsepower") + # and for the size of the points.
  xlab("Weight (1000 lbs)") +
  ylab("MPG") 


Here is where we will start manipulating the theme(). The first thing I’d like to do is get rid of those minor and major gridlines.

ggplot(aes(x = wt, y = mpg, colour = as.factor(cyl), fill = as.factor(cyl)), data = mtcars) +
  geom_point(aes(size = hp), shape = 21) +
  geom_smooth(aes(linetype = as.factor(cyl)), fill = "grey", method = "lm", formula = y~x, alpha = 0.2) +
  scale_colour_manual(values = c("black", "grey", "black"), name = "Number of \ncylinders") +
  scale_fill_manual(values = c("black", "grey", NA), name = "Number of \ncylinders") +
  scale_linetype_manual(values = c("solid", "solid", "twodash"), name = "Number of \ncylinders") + # I'm doing the same again for linetype.
  scale_size_continuous(breaks = c(100,200,300), range = c(.5,8),name = "Horsepower") + # and for the size of the points.
  xlab("Weight (1000 lbs)") +
  ylab("MPG") +
  theme(panel.grid.major = element_blank(), # Gets rid of major grid lines
        panel.grid.minor = element_blank()) # Gets rid of minor grid lines


Next, let’s get rid of the background entirely. Or, if you prefer, try changing its colour instead…

ggplot(aes(x = wt, y = mpg, colour = as.factor(cyl), fill = as.factor(cyl)), data = mtcars) +
  geom_point(aes(size = hp), shape = 21) +
  geom_smooth(aes(linetype = as.factor(cyl)), fill = "grey", method = "lm", formula = y~x, alpha = 0.2) +
  scale_colour_manual(values = c("black", "grey", "black"), name = "Number of \ncylinders") +
  scale_fill_manual(values = c("black", "grey", NA), name = "Number of \ncylinders") +
  scale_linetype_manual(values = c("solid", "solid", "twodash"), name = "Number of \ncylinders") + # I'm doing the same again for linetype.
  scale_size_continuous(breaks = c(100,200,300), range = c(.5,8),name = "Horsepower") + # and for the size of the points.
  xlab("Weight (1000 lbs)") +
  ylab("MPG") +
  theme(panel.grid.major = element_blank(), # Gets rid of major grid lines
        panel.grid.minor = element_blank(), # Gets rid of minor grid lines
        panel.background = element_blank()) # Use element_rect() to alter the colour of the background. 


Let’s move that legend. I’d prefer it to be inside the plot…

ggplot(aes(x = wt, y = mpg, colour = as.factor(cyl), fill = as.factor(cyl)), data = mtcars) +
  geom_point(aes(size = hp), shape = 21) +
  geom_smooth(aes(linetype = as.factor(cyl)), fill = "grey", method = "lm", formula = y~x, alpha = 0.2) +
  scale_colour_manual(values = c("black", "grey", "black"), name = "Number of \ncylinders") +
  scale_fill_manual(values = c("black", "grey", NA), name = "Number of \ncylinders") +
  scale_linetype_manual(values = c("solid", "solid", "twodash"), name = "Number of \ncylinders") + # I'm doing the same again for linetype.
  scale_size_continuous(breaks = c(100,200,300), range = c(.5,8),name = "Horsepower") + # and for the size of the points.
  xlab("Weight (1000 lbs)") +
  ylab("MPG") +
  theme(panel.grid.major = element_blank(), # Gets rid of major grid lines
        panel.grid.minor = element_blank(), # Gets rid of minor grid lines
        panel.background = element_blank(),
        legend.position = c(.99,.99), # Sets the position of the legend to the upper right (within plot) (x,y)
        legend.justification = c(.99,.99)) # Sets the justification of the legend to upper right (x,y)

Try using different x,y values for the legend position and justification, so that you understand what’s happening…
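
As a starting point for that experiment, the sketch below pins the legend to the bottom-left corner instead. Both settings use a 0 to 1 scale: the position says where in the plotting area to pin the legend, and the justification says which point of the legend box is pinned there. The `corner` value is just an example. In ggplot2 3.5.0 and later, a numeric legend.position is deprecated in favour of legend.position = "inside" plus legend.position.inside, so the sketch checks the installed version.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = as.factor(cyl))) +
  geom_point()

# Example corner: bottom-left. Try c(0.5, 0.5) or c(0.99, 0.01) as well.
corner <- c(0.01, 0.01)

if (packageVersion("ggplot2") >= "3.5.0") {
  # Newer releases split the setting in two
  p_inside <- p + theme(legend.position = "inside",
                        legend.position.inside = corner,
                        legend.justification = corner)
} else {
  # Older releases take the coordinates directly
  p_inside <- p + theme(legend.position = corner,
                        legend.justification = corner)
}
p_inside
```

Setting the position and justification to different values is a good way to see what each one does: the legend box shifts relative to its anchor point.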


We deleted the background, but lost the border when we did so. Let’s put it back.

ggplot(aes(x = wt, y = mpg, colour = as.factor(cyl), fill = as.factor(cyl)), data = mtcars) +
  geom_point(aes(size = hp), shape = 21) +
  geom_smooth(aes(linetype = as.factor(cyl)), fill = "grey", method = "lm", formula = y~x, alpha = 0.2) +
  scale_colour_manual(values = c("black", "grey", "black"), name = "Number of \ncylinders") +
  scale_fill_manual(values = c("black", "grey", NA), name = "Number of \ncylinders") +
  scale_linetype_manual(values = c("solid", "solid", "twodash"), name = "Number of \ncylinders") + # I'm doing the same again for linetype.
  scale_size_continuous(breaks = c(100,200,300), range = c(.5,8),name = "Horsepower") + # and for the size of the points.
  xlab("Weight (1000 lbs)") +
  ylab("MPG") +
  theme(panel.grid.major = element_blank(), # Gets rid of major grid lines
        panel.grid.minor = element_blank(), # Gets rid of minor grid lines
        panel.background = element_blank(),
        legend.position = c(.99,.99), # Sets the position of the legend to the upper right (within plot) (x,y)
        legend.justification = c(.99,.99), # Sets the justification of the legend to upper right (x,y)
        panel.border = element_rect(colour = "black", fill = NA)) # Makes the bounding box of the plot black


Now I’d like to change the sizes of legend elements and text. This is easy to do within theme:

ggplot(aes(x = wt, y = mpg, colour = as.factor(cyl), fill = as.factor(cyl)), data = mtcars) +
  geom_point(aes(size = hp), shape = 21) +
  geom_smooth(aes(linetype = as.factor(cyl)), fill = "grey", method = "lm", formula = y~x, alpha = 0.2) +
  scale_colour_manual(values = c("black", "grey", "black"), name = "Number of \ncylinders") +
  scale_fill_manual(values = c("black", "grey", NA), name = "Number of \ncylinders") +
  scale_linetype_manual(values = c("solid", "solid", "twodash"), name = "Number of \ncylinders") + # I'm doing the same again for linetype.
  scale_size_continuous(breaks = c(100,200,300), range = c(.5,8),name = "Horsepower") + # and for the size of the points.
  xlab("Weight (1000 lbs)") +
  ylab("MPG") +
  theme(panel.grid.major = element_blank(), # Gets rid of major grid lines
        panel.grid.minor = element_blank(), # Gets rid of minor grid lines
        panel.background = element_blank(),
        legend.position = c(.99,.99), # Sets the position of the legend to the upper right (within plot) (x,y)
        legend.justification = c(.99,.99), # Sets the justification of the legend to upper right (x,y)
        panel.border = element_rect(colour = "black", fill = NA), # Makes the bounding box of the plot black
        legend.key.size = unit(.8,"cm"),
        axis.text.x = element_text(size = 13),
        axis.text.y = element_text(size = 13),
        legend.text = element_text(size = 10),
        axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 15),
        legend.title = element_text(size = 12))


And finally, I like my plots to be perfect squares. That’s easy to do by setting aspect.ratio.

ggplot(aes(x = wt, y = mpg, colour = as.factor(cyl), fill = as.factor(cyl)), data = mtcars) +
  geom_point(aes(size = hp), shape = 21) +
  geom_smooth(aes(linetype = as.factor(cyl)), fill = "grey", method = "lm", formula = y~x, alpha = 0.2) +
  scale_colour_manual(values = c("black", "grey", "black"), name = "Number of \ncylinders") +
  scale_fill_manual(values = c("black", "grey", NA), name = "Number of \ncylinders") +
  scale_linetype_manual(values = c("solid", "solid", "twodash"), name = "Number of \ncylinders") + # I'm doing the same again for linetype.
  scale_size_continuous(breaks = c(100,200,300), range = c(.5,8),name = "Horsepower") + # and for the size of the points.
  xlab("Weight (1000 lbs)") +
  ylab("MPG") +
  theme(panel.grid.major = element_blank(), # Gets rid of major grid lines
        panel.grid.minor = element_blank(), # Gets rid of minor grid lines
        panel.background = element_blank(),
        legend.position = c(.99,.99), # Sets the position of the legend to the upper right (within plot) (x,y)
        legend.justification = c(.99,.99), # Sets the justification of the legend to upper right (x,y)
        panel.border = element_rect(colour = "black", fill = NA), # Makes the bounding box of the plot black
        legend.key.size = unit(.8,"cm"),
        axis.text.x = element_text(size = 13),
        axis.text.y = element_text(size = 13),
        legend.text = element_text(size = 10),
        axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 15),
        legend.title = element_text(size = 12),
        aspect.ratio = 1) # Make plot square


Well, that’s a lot of code! But I’ve gotten my plot exactly how I want it, all within R.
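
Staying within R also covers writing the figure to disk. As a hedged sketch, ggsave() saves a plot at a chosen size and resolution. The plot here is a small stand-in, and the dimensions, resolution, and temporary file path are assumptions for the example, not requirements.

```r
library(ggplot2)

# A small stand-in plot; in practice this would be the finished figure above
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme(aspect.ratio = 1)

# Write to a temporary file so the sketch leaves nothing behind;
# swap in a real path such as "figure1.png" for your own work
out_file <- tempfile(fileext = ".png")
ggsave(out_file, plot = p, width = 12, height = 12, units = "cm", dpi = 300)

file.exists(out_file)
```

Because the size is fixed in code, the figure comes out the same every time the script is run.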

Challenge

Play around with these settings, or others, to get a sense of how theme() controls how your figure looks. If you feel up to it, try the more advanced challenge of trying to emulate the Economist figure (at the beginning of this session) using ggplot2 and theme(). You may need to use ggrepel (?ggrepel). If you’d like to see how I did it, let me know.


That’s it!