Figure Design Tutorial

Purpose

Creating effective data visuals is key for communicating science clearly. Here, I walk through the process of making one of the main figures for my paper in Ecology Letters. I’m very proud of this paper! It got accepted to the top jounal in my field, a distinction I largely attribute to the effort I put into conceptualizing and designing these figures.

Research Background

Climate change shifts plants’ and animals’ seasonal schedules, because events like breeding, migration, and metamorphosis are timed by climate cues. Here, I examine this response in populations of frogs in East Texas. 20 years ago, a collaborator in the USDA installed audio recorders in 8 ponds in East Texas. These recorders are programmed to record 6 times daily, capturing the breeding calls of frogs at the ponds. I’ve analyzed this data in R to measure if/how frog breeding timing is shifting and how this will affect frogs’ interactions with each other. I was the lead author on this project, meaning I conceptualized the questions, performed all data management and analysis, and interpreted and wrote up results for publication. This document provides a summary of the analysis for one of the two major questions for this project.

Process and Tidy Raw Data

We have 8 raw data files, each representing 1 pond. Each row represents one time point (6 per day from May 2000 through Dec. 2018) and includes the number of calls made by 12 species of frogs (in columns).

##   POND YEAR FROGTIME FROGDATE HV HC BV BW RCA RCL RS GC PC PT AC RP
## 1    1 2000    18:00 5/9/2000  2  0  0  0   0   0  0  0  0  0  0  0
## 2    1 2000    19:00 5/9/2000  3  0  0  0   0   0  0  0  0  0  0  0
## 3    1 2000    20:00 5/9/2000  1  0  0  0   0   0  0  0  0  0  0  0
## 4    1 2000    21:00 5/9/2000  0  0  0  0   0   0  0  0  0  0  0  0
## 5    1 2000    22:00 5/9/2000  0  0  0  0   0   0  0  0  0  0  0  0
## 6    1 2000    23:00 5/9/2000  0  0  0  0   0   0  0  0  0  0  0  0

Here, I merge the data frames for the eight ponds, melt to long format, sum the six daily observations, and adjust time & date format.

## Stack raw data into one df
all_ponds <- rbind(p1, p2, p3, p4, p5, p6, p7, p8)
all_ponds <- subset(all_ponds, subset = (YEAR != 2000))  # 2000 is an incomplete year, so we drop it

## Gather species to long format, adjust time/date formats, and rename columns
all_ponds_long <- all_ponds %>% 
  gather("HV", "HC", "BV", "BW", "RCA", "RCL", "RS", "GC", "PC", "PT", "AC", "RP",
         key = "sp", value = "calls") %>% 
  mutate(doy = yday(mdy(DATE))) %>% 
  dplyr::select(year = YEAR, doy = doy, time = TIME, pond = POND, sp = sp, calls = calls)

## Sum all 6 daily observations and convert to df
daily_calls <- all_ponds_long %>%
  group_by(year, doy, pond, sp) %>%
  summarize(dailysum = sum(calls))

The result is a tidy long format data set. Each row represents the number of calls in a day for a particular species at a particular pond.

## # A tibble: 8 x 5
## # Groups:   year, doy, pond [2]
##    year   doy  pond sp    dailysum
##   <int> <dbl> <int> <chr>    <int>
## 1  2001   157     6 RS           0
## 2  2001   157     7 AC           0
## 3  2001   157     7 BV           0
## 4  2001   157     7 BW           0
## 5  2001   157     7 GC           0
## 6  2001   157     7 HC           0
## 7  2001   157     7 HV          11
## 8  2001   157     7 PC           0

Visualize Raw Data

The above dataset gives time series for the calling behavior of each frog species. It’s a lot of data! Even just looking at one year and one pond, it’s difficult to make sense of the patterns of species abundance in this figure. We need a visualization that smoothes over some of the noise and highlights particular questions or comparisons needed for the planned analyses.

One thing that makes the above figure difficult to read is that the data jumps around a lot. Using lowess functions, I can transform the time series into smooth distributions. This makes it easier to visualize when the frogs are active through the year and also enables quantification.

Another problem is that 12 species is just too much to present in one figure. You can’t tell the species apart because the color scheme is bad and the data overlaps. The major questions of this project deals with pairwise comparisons between species. So let’s focus our visualizations on 2 species at a time.

The resulting figure (below) presents the same data as the figure above, but focused just on 2 of the 12 species. The result is much cleaner and, importantly, is closely related to the conceptual questions of the project. The major objective of this project was to calculate how the area of overlap (visible in the plot at the area where the orange and teal distributions intersect) changed across years for pairwise combinations of species. This would tell us if climate change-driven shifts in breeding timing influence the competition between species.

I made a function to allow me to quickly generate the figure below for any species, year, and pond. I use these to spotcheck analysis and provide a visual reference for my workflow.

Process Data

Now, I want to calculate the area of intersection between species calling distributions (depicted above as the area where the teal and orange distributions overlap), but for all the species and all the years and all the ponds. To do this, I wrote a custom function that 1) subsets the data by species, year, and pond 2) makes lowess smoothed distributions of calling activity for each species 3) calculates the integrated area of intersection between the two species’ lowess distributions. I then apply this function to all 132 pairings of species at each of the 8 sites and for each of the 17 years.

User-written functions like this are a really powerful element of R! In analyzing this dataset, I built a number of custom functions as pipelines to perform calculations, plotting, and analysis on iterative slices of data. Without functions, managing analysis was totally unwieldy. Any upstream change to the data or my analysis would render all subsequent parts broken and I’d have to tediously retweak everything. Discovering and learning to build functions was a lifesaver. What initially took place in several separate steps— subsetting data, processing data, calculating key metrics, statistical models, and plotting— was now integrated into the same reproducible pipeline that could be easily be adjusted. By mastering functions, I was able to do rigorous model selection quickly and efficiently because changing model formulation was as simple as changing 1 line in the function and re-applying the function to the data.

## The function take 4 inputs: year, pond, species 1 and species 2 
## I applied the function to all combinations of these parameters
overlapmatrix <- function(y, p, s1, s2) {
  
  ## Subset each species, year and pond from daily calls data
  of1 <- subset(daily_calls, subset = (sp == s1 & year == y & pond == p))
  of2 <- subset(daily_calls, subset = (sp == s2 & year == y & pond == p))
  
  ## Run lowess functions-- this smoothes the time series data into a distribution
  # f, iter, and delta are parameters that control the degree and method of smoothing
  lf1 <- lowess(of1$dailysum, f = 1/50, iter = 3, delta = 4)
  lf2 <- lowess(of2$dailysum, f = 1/50, iter = 3, delta = 4)
  
  ## Calculate overlap
  # make a dataframe with time as x and each species' lowess curve as a separate y
  d <- data.frame(day = lf1$x,    
                  frog1 = lf1$y,  
                  frog2 = lf2$y)
  # designate the lower lowess because this will be the ceiling of the integrated area
  d$min <- pmin(d$frog1, d$frog2)  
  # inegrated area of curves is time x the lower lowess curve
  inter <- integrate.xy(d$day, d$min) 
  # standardize overlap by making it a proportion of each species' full distribution 
  prop1 <- inter/integrate.xy(d$day, d$frog1) 
  prop2 <- inter/integrate.xy(d$day, d$frog2)
  
  ## Designate and return output
  out <- cbind(prop1, prop2)
  return(out)
}

After running the function through all species and years, the result is a dataframe with the proportional overlap for each species pair in each year and pond in which both species were present. There are two overlap values because this was standardized relative to the absolute size of each species’ distribution.

##   X year pond sp1 sp2   overlap  overlapf2      spsp
## 1 1 2002    1  HC  HV 0.2876729 0.14013007 HCi : HVe
## 2 2 2004    1  HC  HV 0.8335911 0.19707868 HCi : HVe
## 3 3 2005    1  HC  HV 0.0000000 0.00000000 HCi : HVe
## 4 4 2006    1  HC  HV 0.1953258 0.04375981 HCi : HVe
## 5 5 2008    1  HC  HV 0.2490100 0.14031609 HCi : HVe
## 6 6 2014    1  HC  HV 1.0000000 0.02264580 HCi : HVe

Figures

Now, we want to see how overlap between species temporal distributions changes across years for each species pair. This plot shows the regressions for each year, but it’s super hard to read and ugly. In this figure, each color represents a species pair. The slope of the line indicates how the overlap in calling between that species pair changed across the 17 year period of this data. Positive slopes indicate that overlap increased, negative slopes indicate that overlap decreased.

The information we really want from this figure is the slope of each line and the amount of error around that slope (from replicate ponds). The following figure gives exactly that and is much easier to interpret and better looking. ggplot is great because it allows you to build a plot from multiple data sources and offers endless customization ability. Here, each row represents a species pair. Black points represent the average regression slope with error from pond replicates. Colored points represent the regression slope for each pond. Points that are positive indicate overlap increased across years. Points that are negative indicate that overlap decreased across years. The marginal histogram at the top shows the distribution of the points.

## Reorder species based on the magnitude of the regression slope so it's easier to read
oy <- ggplot(over_year, aes(x = coef, y = reorder(spsp, coef))) + 
  
  ## Ponds shown independently
  geom_point(size = 3, alpha = 0.75, 
             aes(color = as.factor(pond))) + 
  
  ## Average of all 8 ponds
  geom_point(size = 4, colour = "black", shape = 18,
             data = over_year_err, aes(x = coef, y = spsp)) + 
  
  ## SE bars on averages
  geom_errorbarh(size = 1, height = 0, 
                 data = over_year_err, aes(xmin = coef - se, xmax = coef + se)) +
  
  ## Lines to visually separate species pairs
  geom_hline(yintercept = seq(1.5, length(over_year$spsp), 1), 
             color = "lightgray", linetype = 2) +
  
  ## Add a marker at 0 to show show the position where overlap stays constant across years
  ## Positive values indicate increasing overlap; negative indicates decreasing overlap
  geom_vline(xintercept = 0, size = 1, linetype = 2) +
 
  ## Style elements. Mytheme is a custom theme I apply to all ggplots
  theme(panel.border = element_rect(colour = "black", fill = NA, size = 2)) +
  mytheme +
  
  ## Labels for axes and legend
  labs(x = "Regression slope \n(numerical overlap ~ year)", 
       y = "Species pair",
       color = 'pond') +
  
  ## Add a Wes Anderson color palette for extra fun
  scale_color_manual(values = wes_palette(n = 8, name = "Zissou1", type = 'continuous'))

## Add a marginal histogram so it's easier to see the overall pattern
oy <- ggMarginal(oy, size = 7, margins = "x", type = "histogram",
                 col = "black", fill = "black", alpha = 0.75)

A lot of data and a lot of code went into this figure, and it was therefore a challenge to keep the design clean and elegant. I put a lot of thought into what the key messages were and simplified wherever possible. I’m very proud of the end result! The final paper had four figures that looked just like this, but measuring different trends and models. The consistent figure design helped readers process the information more quickly and make comparisons across different components of the paper. This paper got accepted to the top jounal in my field, a distinction I largely attribute to the effort and thought that went into these figures.

Here, we can see that most points are positive and the marginal histogram is right skewed. This means that it was common for frogs to increase their overlap in breeding through the 17 year period we measured. We suspect this is a result of climate change. Typically, frogs try to partition their breeding in time. But most species can only breed (and therefore they only call) when it’s raining. In this region, climate change has resulted in long periods of dry with rare but large rain events. This forces frogs to cluster their calling around these few rain opportunities. The outcome is greater competition for the resulting offspring, which should reduce survival over time.