Learning Objectives

In this tutorial, you should learn how to:

  • Wrangle data extracted from a published figure using WebPlotDigitizer
  • Embed images in R Markdown
  • Use ggplot to replicate a published figure
  • Make fine adjustments to the ggplot figure to match an original figure
  • Save a ggplot figure as an image file
  • Access valuable sources for working with color in ggplot

Description of the Stolen Data

The data I’ll be stealing and the figure I will be reproducing are taken from the 10th edition of the textbook How Humans Evolved, by Rob Boyd, Joan Silk, and Kevin Langergraber (2023). The figure is a line graph of age-specific fertility rates (ASFR) by sex as observed in the !Kung San people of the Kalahari Desert. ASFR represents the probability of an individual of a given age having a child in a given year. In studies of smaller populations, such data can be quite noisy, as you might only have a few observations of individuals at certain ages. For this reason, it is customary to group people into five-year age groups and calculate an ASFR for each group. This was the case in the study of the !Kung. The text uses these data to illustrate an example of “natural fertility” in human populations that subsist by hunting and gathering and do not use modern contraceptives.

R Markdown and Images

R Markdown has great options for both generating your images on the fly (at knit-time) and embedding existing images. Here I’ll describe options for displaying existing, pre-rendered images such as charts, graphs, and illustrations. R can deal with many image types; I find myself most frequently using .PNG, .JPG, and .TIF images. It is good to save them at a resolution of around 300 dpi, but the code can handle any resolution you choose.

How to Embed Existing Images into RMarkdown

The working directory for locating images is the directory where your .Rmd document is saved. Place your images in that folder and life will be easy. If you are inclined to organize, you can instead save them in a subfolder of the folder that holds your .Rmd file, say, one named ‘figures’ or ‘images’. If you do that, you just have to add the folder name to each reference to an image file (for example, figures/asfr_kung.png). Either way, the search for images starts from the folder where the .Rmd file is saved. In this tutorial, that is where the images are located.

The easiest way to include images is by inserting text like the following:

![ASFR among the !Kung](asfr_kung.png)

Which produces this:

That is the simplest way to include an image. The ![]() construct is Markdown syntax, not R code. The ! tells the Markdown processor that this is an image, the [] holds an optional caption, and the () holds the path to the image file, relative to the location of the .Rmd file.

The way I typically include images is by using the knitr package. This package has a function called include_graphics that is invoked inside an R code chunk. The function is called with the path to the image file as an argument. Combined with chunk options, it also lets you scale the image, which is very useful for controlling the size of the image in the document.

Scaling an Image

When using the include_graphics function from the knitr package, we can set the out.width option in the chunk header to control the size of the image. The out.width option accepts a percentage of the available width. For example, to display the image at half the available width, you can do the following:
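The chunk that produced the scaled image is not echoed in the rendered tutorial. A minimal sketch of what such a chunk might look like follows. The echo=FALSE option is an assumption (it hides the code in the knitted output), and the file name is the one used in the Markdown example earlier.

```{r, echo=FALSE, out.width="50%"}
# out.width is set in the chunk header above, not inside the function call
knitr::include_graphics("asfr_kung.png")
```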

Which produces this:

You can also control where in the page the figure is placed. If you would like the image to be centered horizontally in the page, you can provide the chunk option fig.align="center":
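As a sketch, the centering option can sit alongside out.width in the same chunk header. Combining the two options here, and the use of echo=FALSE, are assumptions about the hidden chunk rather than the tutorial's confirmed code.

```{r, echo=FALSE, out.width="50%", fig.align="center"}
# fig.align accepts "left", "center", or "right"
knitr::include_graphics("asfr_kung.png")
```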

Which produces this:

Wrangling the Stolen Data (Rustling?)

The data I extracted from the figure using WebPlotDigitizer are shown below, as they appear when opened in Excel. Including an image like this is not something you need to do for your homework, but I wanted to show you how the data arrived in my case, so that the data processing steps below make sense.

There are some issues with the data that arrived from WebPlotDigitizer. As you can see, the data are “jagged”: the male and female series do not cover the same age groups, so their columns differ in length. There is also a weird first row that I will need to remove. Finally, the data are in a wide format, which is not ideal for plotting with ggplot2. My plan is to process the male and female columns separately, and then combine them in a way that is suited for plotting using ggplot2.

# Load the data
d <- read.csv("kung_asfr.csv")
# Rename the columns
names(d) <- c("male_age", "male_asfr", "female_age", "female_asfr")
# Remove the first row
d <- d[-1,]
# Convert the ages to numeric and round them to whole years
d$male_age <- round(as.numeric(d$male_age),0)
d$female_age <- round(as.numeric(d$female_age),0)
# Round the ASFR values to 2 decimal places
d$male_asfr <- round(as.numeric(d$male_asfr),2)
d$female_asfr <- round(as.numeric(d$female_asfr),2)
# Make the data tidy and narrow, which is what ggplot2 will need.
# This means there should not be two columns for age, but only one.
# The same goes for ASFR.
# We will also need to add a column for sex.
male_ages <- d$male_age
#remove NA values that arise from the jagged data frame
male_ages <- male_ages[!is.na(male_ages)]
female_ages <- d$female_age

male_asfr <- d$male_asfr
#remove NA values
male_asfr <- male_asfr[!is.na(male_asfr)]
female_asfr <- d$female_asfr

#combine the data
age <- c(male_ages, female_ages)
asfr <- c(male_asfr, female_asfr)
sex <- c(rep("Men",length(male_ages)), rep("Women",length(female_ages)))
tidy_d <- data.frame(age, asfr, sex)
#wrangling finished

Here is what the data look like in tidy format:

tidy_d[1:17,]
##    age asfr   sex
## 1   20 0.06   Men
## 2   25 0.11   Men
## 3   30 0.15   Men
## 4   35 0.14   Men
## 5   40 0.14   Men
## 6   45 0.09   Men
## 7   50 0.04   Men
## 8   55 0.00   Men
## 9   15 0.06 Women
## 10  20 0.21 Women
## 11  25 0.24 Women
## 12  30 0.18 Women
## 13  35 0.11 Women
## 14  40 0.04 Women
## 15  45 0.01 Women
## 16  50 0.00 Women
## 17  55 0.00 Women
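The same reshaping can be sketched more compactly by stacking one small data frame per sex and dropping the NA padding at the end. This is an alternative illustration, not the code used to produce the output above. The object names wide, men, women, and tidy_alt are hypothetical, and the wide table is rebuilt here from the tidy values printed above so that the example runs on its own.

```r
# Rebuild a "jagged" wide layout from the tidy values printed above:
# 8 male rows and 9 female rows, so the male columns are padded with NA.
wide <- data.frame(
  male_age    = c(20, 25, 30, 35, 40, 45, 50, 55, NA),
  male_asfr   = c(0.06, 0.11, 0.15, 0.14, 0.14, 0.09, 0.04, 0.00, NA),
  female_age  = c(15, 20, 25, 30, 35, 40, 45, 50, 55),
  female_asfr = c(0.06, 0.21, 0.24, 0.18, 0.11, 0.04, 0.01, 0.00, 0.00)
)

# One narrow data frame per sex, with identical column names
men   <- data.frame(age = wide$male_age,   asfr = wide$male_asfr,   sex = "Men")
women <- data.frame(age = wide$female_age, asfr = wide$female_asfr, sex = "Women")

# Stack the two, then drop the rows that are only NA padding
tidy_alt <- na.omit(rbind(men, women))
rownames(tidy_alt) <- NULL

tidy_alt
```

Because both pieces share the same column names, rbind() lines them up automatically, and na.omit() removes the padding in one step instead of filtering each vector separately.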

Recreating the Published Figure using ggplot2

We will first make a “quick and dirty” plot and then systematically work through the parts of the plotting code that need to be adjusted to match the original published figure.

A Quick and Dirty Figure using ggplot2

Now that the data are in a tidy format, we can plot them using ggplot2. I will first go for a quick and dirty plot, and then I will make it look better and get nit-picky with the details so that it matches the figure in the book. This is the quick and dirty plot:

# Load ggplot2 (needed for all of the plotting code below)
library(ggplot2)
ggplot(tidy_d, aes(x=age, y=asfr, color=sex)) + 
  geom_line() + 
  geom_point() + 
  labs(x="Age (years)", y="Age specific fertility rate") 

Making the Plot Look Better using (Mostly) Theme Elements

Theme elements are the way to control elements of the plot that are not linked to data, but represent artistic choices about how the plot should look. Most of the work we are doing here will be inside calls to the theme function.

I will remove the gridlines, change the axes so tick marks and labels appear correct, add a border to the plot, make the lines thicker, the points bigger, change the legend title and symbols, change the aspect ratio, change some font sizes, and change the colors. All this work might feel tedious. But trust me! This is a valuable lesson. When the time comes for you to get a figure of your own creation just right, you will also need to delve into this level of detail and learn the finer points of how ggplot2 works. You can use this tutorial as a guide for your homework and study of these issues.

Change the Axis Labels and Axis Limits

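Here we need tick labels at every multiple of 5, not just multiples of 10, and x-axis limits that run from 10 to 56. The code also switches to theme_minimal(), which removes the default gray background.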

ggplot(tidy_d, aes(x=age, y=asfr, color=sex)) + 
  geom_line() + 
  geom_point() + 
  labs(x="Age (years)", y="Age specific fertility rate") + 
  theme_minimal() + 
  scale_x_continuous(breaks = seq(10, 55, 5), limits = c(10, 56))

Remove the Grid Lines

ggplot(tidy_d, aes(x=age, y=asfr, color=sex)) + 
  geom_line() + 
  geom_point() + 
  labs(x="Age (years)", y="Age specific fertility rate") + 
  theme_minimal() + 
  scale_x_continuous(breaks = seq(10, 55, 5), limits = c(10, 56)) +
  theme(
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank()
  )

Make the Points and Lines Thicker

Because the thickness of the lines and the width of the points are related to the geom_line and geom_point elements, we make this adjustment by referencing arguments supplied to those geoms.

ggplot(tidy_d, aes(x=age, y=asfr, color=sex)) + 
  geom_line(linewidth=1.5) + 
  geom_point(size=3) + 
  labs(x="Age (years)", y="Age specific fertility rate") + 
  scale_x_continuous(breaks = seq(10, 55, 5), limits = c(10, 56)) +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank()
  )

Add a Line for Each Axis

We can do this by adding the following line to the theme function: axis.line = element_line(colour = "black", linewidth = 1).

ggplot(tidy_d, aes(x=age, y=asfr, color=sex)) + 
  geom_line(linewidth=1.5) + 
  geom_point(size=3) + 
  labs(x="Age (years)", y="Age specific fertility rate") + 
  scale_x_continuous(breaks = seq(10, 55, 5), limits = c(10, 56)) +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(),
    axis.line = element_line(colour = "black", linewidth = 1)  
  )

Add Tick Marks to the Axes and Remove Margins beyond Axes Limits

In the published figure, the tick marks are a bit unusual because they extend into the plot area (by default in ggplot2, they extend away from the plot area, towards the margins). We can reproduce this by supplying negative values to the axis.ticks.length.x and axis.ticks.length.y theme elements. We also need to remove any margins beyond the axis limits (e.g., any extra space on the x axis to the left of 10 or to the right of 56). This is achieved by setting the expand argument in the scale_x_continuous and scale_y_continuous functions.

ggplot(tidy_d, aes(x=age, y=asfr, color=sex)) + 
  geom_line(linewidth=1.5) + 
  geom_point(size=3) + 
  labs(x="Age (years)", y="Age specific fertility rate") + 
  scale_x_continuous(breaks = seq(10, 55, 5), limits = c(10, 56), expand = c(0, 0)) + 
  scale_y_continuous(breaks = seq(0, 0.25, .05), limits = c(0, .27), expand = c(0, 0)) + 
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(),
    axis.line = element_line(colour = "black", linewidth = 1),  
    axis.ticks.x = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.x = unit(-3, "mm"),  # Negative length: ticks point up, into the plot
    axis.ticks.y = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.y = unit(-3, "mm")   # Negative length: ticks point right, into the plot
  )

Remove the Legend Title

This is achieved by adding the following line to the theme function: legend.title = element_blank().

ggplot(tidy_d, aes(x=age, y=asfr, color=sex)) + 
  geom_line(linewidth=1.5) + 
  geom_point(size=3) + 
  labs(x="Age (years)", y="Age specific fertility rate") + 
  scale_x_continuous(breaks = seq(10, 55, 5), limits = c(10, 56), expand = c(0, 0)) + 
  scale_y_continuous(breaks = seq(0, 0.25, .05), limits = c(0, .27), expand = c(0, 0)) +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(),
    axis.line = element_line(colour = "black", linewidth = 1),  
    axis.ticks.x = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.x = unit(-3, "mm"),  # Negative length: ticks point up, into the plot
    axis.ticks.y = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.y = unit(-3, "mm"),  # Negative length: ticks point right, into the plot
    legend.title = element_blank()
  )

Remove the geom_line Symbol from the Legend

As you see in the legend, there is both a line and a point mapped to the variable “Sex”. We only want the point to be in the legend. To make this change, we need to add the argument show.legend = FALSE to the geom_line() function.

ggplot(tidy_d, aes(x=age, y=asfr, color=sex)) + 
  geom_line(linewidth=1.5, show.legend = FALSE) + 
  geom_point(size=3) + 
  labs(x="Age (years)", y="Age specific fertility rate") + 
  scale_x_continuous(breaks = seq(10, 55, 5), limits = c(10, 56), expand = c(0, 0)) + 
  scale_y_continuous(breaks = seq(0, 0.25, .05), limits = c(0, .27), expand = c(0, 0)) + 
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(),
    axis.line = element_line(colour = "black", linewidth = 1),  
    axis.ticks.x = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.x = unit(-3, "mm"),  # Negative length: ticks point up, into the plot
    axis.ticks.y = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.y = unit(-3, "mm"),  # Negative length: ticks point right, into the plot
    legend.title = element_blank()
  )

Changing the Margins between Axis Lines and Axis Tick Mark Labels

The axis tick mark labels are a bit too close to the axis lines. We can increase the margins between the axis lines and the axis tick mark labels by adjusting arguments to the axis.text theme elements. The arguments accept values for the margin of interest (t for top, r for right, b for bottom, and l for left), and the unit that we are using for the adjustments, which in this case is “points”, or pt. Points are, in this context, a unit of length in typography, equal to 1/72 of an inch.

ggplot(tidy_d, aes(x=age, y=asfr, color=sex)) + 
  geom_line(linewidth=1.5, show.legend = FALSE) + 
  geom_point(size=3) + 
  labs(x="Age (years)", y="Age specific fertility rate") + 
  scale_x_continuous(breaks = seq(10, 55, 5), limits = c(10, 56), expand = c(0, 0)) + 
  scale_y_continuous(breaks = seq(0, 0.25, .05), limits = c(0, .27), expand = c(0, 0)) + 
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(),
    axis.line = element_line(colour = "black", linewidth = 1),  
    axis.ticks.x = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.x = unit(-3, "mm"),  # Negative length: ticks point up, into the plot
    axis.ticks.y = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.y = unit(-3, "mm"),  # Negative length: ticks point right, into the plot
    legend.title = element_blank(),
    axis.text.x = element_text(margin = margin(t = 5, r = 0, b = 0, l = 0, unit = "pt")),
    axis.text.y = element_text(margin = margin(t = 0, r = 5, b = 0, l = 0, unit = "pt"))
  )

Changing the Legend Position

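The legend is currently in the middle right of the plot. We can move it inside the plotting area, near the top right corner, by adjusting the legend.position and legend.position.inside arguments in the theme() function. The two numbers given to legend.position.inside are relative x and y coordinates, each running from 0 to 1. Note that legend.position.inside requires ggplot2 version 3.5.0 or later.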

ggplot(tidy_d, aes(x=age, y=asfr, color=sex)) + 
  geom_line(linewidth=1.5, show.legend = FALSE) + 
  geom_point(size=3) + 
  labs(x="Age (years)", y="Age specific fertility rate") + 
  scale_x_continuous(breaks = seq(10, 55, 5), limits = c(10, 56), expand = c(0, 0)) + 
  scale_y_continuous(breaks = seq(0, 0.25, .05), limits = c(0, .27), expand = c(0, 0)) + 
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(),
    axis.line = element_line(colour = "black", linewidth = 1),  
    axis.ticks.x = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.x = unit(-3, "mm"),  # Negative length: ticks point up, into the plot
    axis.ticks.y = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.y = unit(-3, "mm"),  # Negative length: ticks point right, into the plot
    legend.title = element_blank(),
    axis.text.x = element_text(margin = margin(t = 5, r = 0, b = 0, l = 0, unit = "pt")),
    axis.text.y = element_text(margin = margin(t = 0, r = 5, b = 0, l = 0, unit = "pt")),
    legend.position = "inside",
    legend.position.inside=c(0.9, 0.9)
  )

Changing the Font Size of Tick Mark Labels

The font size of the tick mark labels is a bit too small. We can increase it by setting the size argument in the element_text() calls that define the axis.text.x and axis.text.y elements.

ggplot(tidy_d, aes(x=age, y=asfr, color=sex)) + 
  geom_line(linewidth=1.5, show.legend = FALSE) + 
  geom_point(size=3) + 
  labs(x="Age (years)", y="Age specific fertility rate") + 
  scale_x_continuous(breaks = seq(10, 55, 5), limits = c(10, 56), expand = c(0, 0)) + 
  scale_y_continuous(breaks = seq(0, 0.25, .05), limits = c(0, .27), expand = c(0, 0)) + 
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(),
    axis.line = element_line(colour = "black", linewidth = 1),  
    axis.ticks.x = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.x = unit(-3, "mm"),  # Negative length: ticks point up, into the plot
    axis.ticks.y = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.y = unit(-3, "mm"),  # Negative length: ticks point right, into the plot
    legend.title = element_blank(),
    axis.text.x = element_text(size=12, margin = margin(t = 5, r = 0, b = 0, l = 0, unit = "pt")),
    axis.text.y = element_text(size=12, margin = margin(t = 0, r = 5, b = 0, l = 0, unit = "pt")),
    legend.position = "inside",
    legend.position.inside=c(0.9, 0.9)
  )

Changing the Colors Associated with each Group in the geom_line, geom_point

We would like men to be associated with a greenish color, and women associated with a reddish color. To change these mappings between the groups in the data and their displayed colors, we can use the scale_color_manual() function. This function takes a named vector as an argument, where the names are the groups, and the values are the colors that we want to associate with each group.

ggplot(tidy_d, aes(x=age, y=asfr, color=sex)) + 
  geom_line(linewidth=1.5, show.legend = FALSE) + 
  geom_point(size=3) + 
  labs(x="Age (years)", y="Age specific fertility rate") + 
  scale_color_manual(values=c(Men="#06aca4", Women="#f04143"))+
  scale_x_continuous(breaks = seq(10, 55, 5), limits = c(10, 56), expand = c(0, 0)) + 
  scale_y_continuous(breaks = seq(0, 0.25, .05), limits = c(0, .27), expand = c(0, 0)) + 
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(),
    axis.line = element_line(colour = "black", linewidth = 1),  
    axis.ticks.x = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.x = unit(-3, "mm"),  # Negative length: ticks point up, into the plot
    axis.ticks.y = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.y = unit(-3, "mm"),  # Negative length: ticks point right, into the plot
    legend.title = element_blank(),
    # size=12 carried over from the previous step so the larger tick labels are kept
    axis.text.x = element_text(size=12, margin = margin(t = 5, r = 0, b = 0, l = 0, unit = "pt")),
    axis.text.y = element_text(size=12, margin = margin(t = 0, r = 5, b = 0, l = 0, unit = "pt")),
    legend.position = "inside",
    legend.position.inside=c(0.9, 0.9)
  )
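One adjustment listed at the start of this section, the aspect ratio, is not set explicitly in the steps above. A minimal sketch of how it could be done uses the aspect.ratio theme element. The value 2/3 is purely illustrative (it was not measured from the book), and demo_d and p_ratio are hypothetical names, with demo_d rebuilt from the tidy values printed earlier so that the example runs on its own.

```r
library(ggplot2)

# Values copied from the tidy data printed earlier in this tutorial
demo_d <- data.frame(
  age  = c(20, 25, 30, 35, 40, 45, 50, 55,
           15, 20, 25, 30, 35, 40, 45, 50, 55),
  asfr = c(0.06, 0.11, 0.15, 0.14, 0.14, 0.09, 0.04, 0.00,
           0.06, 0.21, 0.24, 0.18, 0.11, 0.04, 0.01, 0.00, 0.00),
  sex  = c(rep("Men", 8), rep("Women", 9))
)

p_ratio <- ggplot(demo_d, aes(x = age, y = asfr, color = sex)) +
  geom_line(linewidth = 1.5, show.legend = FALSE) +
  geom_point(size = 3) +
  theme_minimal() +
  # aspect.ratio is the panel height divided by the panel width;
  # 2/3 is an illustrative guess, not a value taken from the published figure
  theme(aspect.ratio = 2 / 3)

p_ratio
```

In practice, you would add the aspect.ratio line to the existing theme() call and adjust the number until the panel proportions match the original figure.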

Declaration of Victory

The plot is now very similar to the published figure. I am happy with the result, and I am ready to declare victory. A few final notes now, before we finish, about working with colors.

Working with Colors

In ggplot2, you can supply colors as RGB values (via the rgb() function), hexadecimal codes, or color names.
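As a small illustration using only base R, the sketch below writes the greenish color used for men in this tutorial in two equivalent ways and converts a named color to its RGB components. The variable names hex_code and rgb_code are hypothetical.

```r
# Two ways to write the greenish color used for men in this tutorial
hex_code <- "#06aca4"                               # hexadecimal string
rgb_code <- rgb(6, 172, 164, maxColorValue = 255)   # RGB components on a 0-255 scale

# rgb() returns a hexadecimal string (in upper case), so the two forms
# describe the same color
toupper(hex_code) == rgb_code

# A named color can be converted to its RGB components with col2rgb()
col2rgb("firebrick")
```

Any of these forms can be passed to the values argument of scale_color_manual().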

Matching Existing Colors

The website imagecolorpicker.com is a useful tool for finding the RGB or hex code of a color in an image.

Finding Pretty Color Palettes

You can find pleasing combinations of colors in the following places:

If you are looking for inspiration for color combinations, I recommend the website “ColorBrewer” [http://colorbrewer2.org/]. This website is a great resource for finding color schemes which are colorblind-friendly, print-friendly, and photocopy-friendly.

If you are looking for color combinations that simply look beautiful, I also recommend the website “coolors” [https://coolors.co/palettes/popular]. This website demonstrates beautiful color schemes that you can use in your plots.

The RColorBrewer package provides access to the ColorBrewer palettes. These are pre-defined combinations of colors that are easy to plug into your ggplot code.
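A minimal sketch of plugging a ColorBrewer palette into a plot follows. It assumes the RColorBrewer package is installed, picks the "Set1" palette arbitrarily for illustration, and uses a few rows copied from the tidy data above. The names demo_d, pal, and p_brewer are hypothetical.

```r
library(ggplot2)

# A few rows copied from the tidy data printed earlier in this tutorial
demo_d <- data.frame(
  age  = c(20, 30, 40, 20, 30, 40),
  asfr = c(0.06, 0.15, 0.14, 0.21, 0.18, 0.04),
  sex  = c("Men", "Men", "Men", "Women", "Women", "Women")
)

# Look at the hex codes in a palette; brewer.pal() needs at least 3 colors
pal <- RColorBrewer::brewer.pal(3, "Set1")
pal

# scale_color_brewer() applies a ColorBrewer palette to the color aesthetic
p_brewer <- ggplot(demo_d, aes(x = age, y = asfr, color = sex)) +
  geom_line() +
  geom_point() +
  scale_color_brewer(palette = "Set1")

p_brewer
```

If you want to control which palette color goes with which group, you can instead pass the hex codes from brewer.pal() to scale_color_manual() as a named vector.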

Named Colors in R

The figure below shows the named colors in R. You can use these names directly in your ggplot code.
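The names can also be listed from within R itself using the base function colors(). The sketch below is only an illustration of how to browse them. The search term "green" is arbitrary, and all_names and green_names are hypothetical variable names.

```r
# colors() returns the names of R's built-in colors as a character vector
all_names <- colors()
head(all_names)

# Search the names for a family of colors, here anything containing "green"
green_names <- grep("green", all_names, value = TRUE)
head(green_names)
```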

Saving Figures Created with ggplot

One of the great things about R is that it allows you to save figures with exact dimensions, resolutions, and file formats, making it easy to reproduce your figures or customize them for different uses, whether that be a PowerPoint presentation, a figure to be included in a report, or a figure that must precisely match a journal’s artwork guidelines. A high-resolution PNG file will usually work across all of these different scenarios.

Here, we use R’s png() function to save our figure. You place a call to this function before your ggplot code, specifying the name of the file, its dimensions, the units of its dimensions, and its resolution. After you call png(), you then include your ggplot code to draw the figure, and when that is finished, you finish your work with a call to dev.off(). This function tells R: “I am finished plotting now, so please write the file”.

png("recreated_figure.png", width=6, height=4, units="in", res=300)
ggplot(tidy_d, aes(x=age, y=asfr, color=sex)) + 
  geom_line(linewidth=1.5, show.legend = FALSE) + 
  geom_point(size=3) + 
  labs(x="Age (years)", y="Age specific fertility rate") + 
  scale_color_manual(values=c(Men="#06aca4", Women="#f04143"))+
  scale_x_continuous(breaks = seq(10, 55, 5), limits = c(10, 56), expand = c(0, 0)) + 
  scale_y_continuous(breaks = seq(0, 0.25, .05), limits = c(0, .27), expand = c(0, 0)) + 
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(),
    axis.line = element_line(colour = "black", linewidth = 1),  
    axis.ticks.x = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.x = unit(-3, "mm"),  # Negative length: ticks point up, into the plot
    axis.ticks.y = element_line(color = "black", linewidth = 0.5, linetype = "solid"),
    axis.ticks.length.y = unit(-3, "mm"),  # Negative length: ticks point right, into the plot
    legend.title = element_blank(),
    # Font size and legend position carried over from the earlier steps
    axis.text.x = element_text(size=12, margin = margin(t = 5, r = 0, b = 0, l = 0, unit = "pt")),
    axis.text.y = element_text(size=12, margin = margin(t = 0, r = 5, b = 0, l = 0, unit = "pt")),
    legend.position = "inside",
    legend.position.inside=c(0.9, 0.9)
  )
dev.off()
## quartz_off_screen 
##                 2
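An alternative worth knowing is ggplot2's own ggsave() function, sketched below. This is not the approach used above. The output file name is hypothetical (chosen so it does not overwrite recreated_figure.png), and the plot is a simplified stand-in built from a few rows of the tidy data, with demo_d, p, and out_file as hypothetical names.

```r
library(ggplot2)

# A few rows copied from the tidy data printed earlier in this tutorial
demo_d <- data.frame(
  age  = c(20, 30, 40, 20, 30, 40),
  asfr = c(0.06, 0.15, 0.14, 0.21, 0.18, 0.04),
  sex  = c("Men", "Men", "Men", "Women", "Women", "Women")
)

# Store the plot in an object instead of printing it
p <- ggplot(demo_d, aes(x = age, y = asfr, color = sex)) +
  geom_line(linewidth = 1.5, show.legend = FALSE) +
  geom_point(size = 3) +
  theme_minimal()

# Hypothetical output name; the dimensions and resolution mirror the png() call above
out_file <- "recreated_figure_ggsave.png"
ggsave(out_file, plot = p, width = 6, height = 4, units = "in", dpi = 300)
```

Because ggsave() receives the plot object directly, it does not depend on the plot being printed between png() and dev.off(). That matters when the code runs inside a function or a loop, where ggplot objects are not printed automatically and a png() and dev.off() pair would need an explicit print(p).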

References

Boyd, R., Silk, J., & Langergraber, K. (2023). How Humans Evolved. W.W. Norton & Company.