3. Batch Application of Methods: Results and Figures

Aim

In this tutorial I will show how to apply the package functions in batch to all streams (and time points), mostly through the use of:

Packages from the tidyverse like tidyr, dplyr, purr, ggplot2, or tibble.
Some custom-made functions.

The results, tables, and figures, should already be useable for the associated manuscript (Frauendorf et al. in revision).

Data Processing

The nice thing about tibbles is that it allows you to organize dataframes where each entry need not be single elements only, but can be any type of object: a data.frame, the results of a model fit, etc. This makes it easy to batch-apply functions using package purr. We can load these functions with:

library(tidyverse)

One of the advantages of the tidyverse family of packages is the use of the pipe operator %>%. It essentially allows you to first indicate what dataset you want to do a bunch of modifications or operations, and then list the series of operations one after the other followed by %>%.

For example, imagine we want to do the following changes to the excretion data:

Select only the data for females
Express the ecretion rate in \(mg\) instead of \(g\) (i.e. \(\times 1000\))
Group the data by stream and year
Summarize it calculating the number of individuals, the mean individual size, and the average excretion rate.

load('../data/data_excretion.rda')

new_data <- data_excretion %>%
  filter(sex !='M') %>%
  mutate(N_rate = N_rate * 1000) %>%
  group_by(stream, year) %>%
  summarize(n_inds = length(size), 
            size_mean = mean(size, na.rm = TRUE),
            N_rate = mean(N_rate, na.rm = TRUE))

## `summarise()` regrouping output by 'stream' (override with `.groups` argument)

knitr::kable(new_data)

stream	year	n_inds	size_mean	N_rate
CA	2009	18	0.1272222	5494.0073
CA	2010	22	0.1418182	1870.1445
CA	2011	26	0.1726923	9785.8555
CA	2014	6	0.1666667	6908.4324
LL	2009	28	0.1357143	4138.9871
LL	2010	28	0.0996429	5542.1272
LL	2011	28	0.1196429	2791.9596
LL	2014	9	0.2177778	7298.9697
TY	2009	19	0.1811111	7270.3752
TY	2010	25	0.2652000	7917.7528
TY	2011	19	0.1015789	102.3317
UL	2009	33	0.2003030	9225.8923
UL	2010	27	0.1018519	6629.6677
UL	2011	25	0.1420000	8202.7205

TADAH! Now let’s forget about this example and use this magic with some real questions. Note that we made the table look prettier than the usual R format using function kable from package knitr

Population Descriptors

Figure 1. Population density

Figure 2. Size structure

Individual Excretion.

First of all, we want to group the excretion data by stream and year.

edata <- data_excretion %>%
  group_by(stream, year) %>%
  nest()

edata

## # A tibble: 14 x 3
## # Groups:   stream, year [14]
##    stream  year data             
##    <fct>  <int> <list>           
##  1 UL      2009 <tibble [42 × 8]>
##  2 LL      2009 <tibble [37 × 8]>
##  3 TY      2009 <tibble [24 × 8]>
##  4 CA      2009 <tibble [25 × 8]>
##  5 UL      2010 <tibble [36 × 8]>
##  6 LL      2010 <tibble [36 × 8]>
##  7 TY      2010 <tibble [36 × 8]>
##  8 CA      2010 <tibble [30 × 8]>
##  9 UL      2011 <tibble [36 × 8]>
## 10 LL      2011 <tibble [37 × 8]>
## 11 TY      2011 <tibble [24 × 8]>
## 12 CA      2011 <tibble [36 × 8]>
## 13 CA      2014 <tibble [15 × 8]>
## 14 LL      2014 <tibble [15 × 8]>

What this did is to create a dataframe whith one entry per stream and year only, and as a third column, the whole data for that stream and year. For example, to access the UL 2009 data:

subset(edata, stream == 'UL' & year == 2009)$data

## [[1]]
## # A tibble: 42 x 8
##     pair canopy year_intro month_cal month_intro sex    size N_rate
##    <int>  <int>      <dbl>     <int>       <int> <fct> <dbl>  <dbl>
##  1     1      1          1         5          14 F      0.55  18.2 
##  2     1      1          1         5          14 F      0.13   5.37
##  3     1      1          1         5          14 M      0.13   9.22
##  4     1      1          1         5          14 M      0.15   3.58
##  5     1      1          1         5          14 F      0.55   8.48
##  6     1      1          1         5          14 F      0.49  27.7 
##  7     1      1          1         5          14 F      0.16  10.5 
##  8     1      1          1         5          14 F      0.12   5.85
##  9     1      1          1         5          14 M      0.16   8.00
## 10     1      1          1         5          14 F      0.04   2.8 
## # … with 32 more rows

Now we can use a very powerful function called map from package purrr, for example, to apply the same function to all the nested dataframes at once.

edata <- edata %>%
  mutate(fit = map(.f = allo_fit, .x = data, N_rate, size))

edata

## # A tibble: 14 x 4
## # Groups:   stream, year [14]
##    stream  year data              fit             
##    <fct>  <int> <list>            <list>          
##  1 UL      2009 <tibble [42 × 8]> <named list [3]>
##  2 LL      2009 <tibble [37 × 8]> <named list [3]>
##  3 TY      2009 <tibble [24 × 8]> <named list [3]>
##  4 CA      2009 <tibble [25 × 8]> <named list [3]>
##  5 UL      2010 <tibble [36 × 8]> <named list [3]>
##  6 LL      2010 <tibble [36 × 8]> <named list [3]>
##  7 TY      2010 <tibble [36 × 8]> <named list [3]>
##  8 CA      2010 <tibble [30 × 8]> <named list [3]>
##  9 UL      2011 <tibble [36 × 8]> <named list [3]>
## 10 LL      2011 <tibble [37 × 8]> <named list [3]>
## 11 TY      2011 <tibble [24 × 8]> <named list [3]>
## 12 CA      2011 <tibble [36 × 8]> <named list [3]>
## 13 CA      2014 <tibble [15 × 8]> <named list [3]>
## 14 LL      2014 <tibble [15 × 8]> <named list [3]>

As you see, there is a new column coefs that stores a list of three elements, the three elements returned by our function allo_fit: the coefficients, and the variance covariance matrix. Let’s check out again, for example the UL 2009 data.

subset(edata, stream == 'UL' & year == 2009)$fit

## [[1]]
## [[1]]$coef
## (Intercept)   log(size) 
##   3.2254379   0.6202035 
## 
## [[1]]$vcov
##             (Intercept)   log(size)
## (Intercept) 0.015757805 0.006071351
## log(size)   0.006071351 0.002952926
## 
## [[1]]$fit
## 
## Call:
## lm(formula = log(response) ~ log(size))
## 
## Coefficients:
## (Intercept)    log(size)  
##      3.2254       0.6202

Therey you go! All models fit with one stroke! Now let’s carry on implementing allo_mc.

edata <- edata %>%
  mutate(mc = map(.f = allo_mc, .x = fit, predict = TRUE))

edata

## # A tibble: 14 x 5
## # Groups:   stream, year [14]
##    stream  year data              fit              mc              
##    <fct>  <int> <list>            <list>           <list>          
##  1 UL      2009 <tibble [42 × 8]> <named list [3]> <named list [3]>
##  2 LL      2009 <tibble [37 × 8]> <named list [3]> <named list [3]>
##  3 TY      2009 <tibble [24 × 8]> <named list [3]> <named list [3]>
##  4 CA      2009 <tibble [25 × 8]> <named list [3]> <named list [3]>
##  5 UL      2010 <tibble [36 × 8]> <named list [3]> <named list [3]>
##  6 LL      2010 <tibble [36 × 8]> <named list [3]> <named list [3]>
##  7 TY      2010 <tibble [36 × 8]> <named list [3]> <named list [3]>
##  8 CA      2010 <tibble [30 × 8]> <named list [3]> <named list [3]>
##  9 UL      2011 <tibble [36 × 8]> <named list [3]> <named list [3]>
## 10 LL      2011 <tibble [37 × 8]> <named list [3]> <named list [3]>
## 11 TY      2011 <tibble [24 × 8]> <named list [3]> <named list [3]>
## 12 CA      2011 <tibble [36 × 8]> <named list [3]> <named list [3]>
## 13 CA      2014 <tibble [15 × 8]> <named list [3]> <named list [3]>
## 14 LL      2014 <tibble [15 × 8]> <named list [3]> <named list [3]>

Let’s check again for UL 2009

str(subset(edata, stream == 'UL' & year == 2009)$mc)

## List of 1
##  $ :List of 3
##   ..$ coefs  : num [1:1000, 1:2] 3.12 3.2 3.24 3.21 3.16 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : chr [1:2] "(Intercept)" "log(size)"
##   ..$ xpred  : num [1:201] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
##   ..$ pred_ci: num [1:3, 1:201] 0 0 0 1.07 1.44 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:3] "2.5%" "50%" "97.5%"
##   .. .. ..$ : NULL

In reality, it can all be done at the same time.

edata <- data_excretion %>%
  group_by(stream, year) %>%
  nest() %>%
  arrange(stream) %>%
  mutate(fit = map(.f = allo_fit, .x = data, N_rate, size))

edata

## # A tibble: 14 x 4
## # Groups:   stream, year [14]
##    stream  year data              fit             
##    <fct>  <int> <list>            <list>          
##  1 CA      2009 <tibble [25 × 8]> <named list [3]>
##  2 CA      2010 <tibble [30 × 8]> <named list [3]>
##  3 CA      2011 <tibble [36 × 8]> <named list [3]>
##  4 CA      2014 <tibble [15 × 8]> <named list [3]>
##  5 LL      2009 <tibble [37 × 8]> <named list [3]>
##  6 LL      2010 <tibble [36 × 8]> <named list [3]>
##  7 LL      2011 <tibble [37 × 8]> <named list [3]>
##  8 LL      2014 <tibble [15 × 8]> <named list [3]>
##  9 TY      2009 <tibble [24 × 8]> <named list [3]>
## 10 TY      2010 <tibble [36 × 8]> <named list [3]>
## 11 TY      2011 <tibble [24 × 8]> <named list [3]>
## 12 UL      2009 <tibble [42 × 8]> <named list [3]>
## 13 UL      2010 <tibble [36 × 8]> <named list [3]>
## 14 UL      2011 <tibble [36 × 8]> <named list [3]>

Let’s plot the results, which corresponds to Figure 3

## `geom_smooth()` using formula 'y ~ x'

Using package stargazer, we can also automate the construction of a nice summary table for the model results of different streams and years. It may look like a lot of code, but anything beyond the first line is formating and customizing labels. Even if you left that out, the default would look quite decent.

I notice we don’t have this in the paper, and it should be at least a Supplementary Appendix Table S3

my_model_labels <- ifelse(edata$year == min(edata$year), paste(edata$stream, edata$year, sep = '<br>'), paste('<br>',edata$year, sep = ''))

stargazer::stargazer(lapply(edata$fit, with, fit), type = 'html',
                     column.labels = my_model_labels,
                     covariate.labels = c('<em>a</em>','<em>b</em>'),
                     dep.var.caption = "log(E) = <em>a</em> +  <em>b</em> log(size)",
                     dep.var.labels.include = FALSE,
                     model.numbers = FALSE, 
                     df = FALSE, omit.stat = 'adj.rsq')


	log(E) = a + b log(size)

	CA 2009	2010	2011	2014	LL 2009	2010	2011	2014	TY 2009	2010	2011	UL 2009	2010	2011

a	0.547^***	0.620^*	0.494^***	0.762^**	1.003^***	0.552^***	0.874^***	0.710^***	0.541^***	0.619^***	0.780	0.620^***	0.550^***	0.182
	(0.066)	(0.358)	(0.087)	(0.298)	(0.170)	(0.115)	(0.185)	(0.104)	(0.091)	(0.142)	(0.534)	(0.054)	(0.057)	(0.129)

b	2.946^***	0.673	3.173^***	3.203^***	3.557^***	3.026^***	2.479^***	3.017^***	3.011^***	2.701^***	1.169	3.225^***	3.292^***	2.294^***
	(0.174)	(0.957)	(0.212)	(0.696)	(0.385)	(0.313)	(0.497)	(0.210)	(0.223)	(0.328)	(0.954)	(0.126)	(0.172)	(0.338)


Observations	23	22	31	15	31	29	29	15	19	32	14	39	33	32
R²	0.764	0.131	0.525	0.335	0.546	0.459	0.453	0.782	0.675	0.388	0.151	0.779	0.748	0.062
Residual Std. Error	0.300	1.966	0.517	0.559	0.731	0.589	1.066	0.260	0.460	0.910	1.205	0.357	0.371	0.782
F Statistic	67.917^***	3.010^*	32.020^***	6.561^**	34.872^***	22.872^***	22.372^***	46.510^***	35.342^***	19.019^***	2.137	130.261^***	91.821^***	1.981

Note:	p<0.1; p<0.05; p<0.01

Population Parameters

load('../data/data_population.rda')

pdata <- data_population %>%
  group_by(stream, month_intro) %>%
  arrange(month_intro, .by_group = TRUE) %>%
  nest()

pdata

## # A tibble: 14 x 3
## # Groups:   stream, month_intro [14]
##    stream month_intro data                
##    <fct>        <dbl> <list>              
##  1 CA               2 <tibble [73 × 5]>   
##  2 CA              14 <tibble [352 × 5]>  
##  3 CA              25 <tibble [369 × 5]>  
##  4 CA              60 <tibble [351 × 5]>  
##  5 LL              14 <tibble [371 × 5]>  
##  6 LL              26 <tibble [890 × 5]>  
##  7 LL              37 <tibble [781 × 5]>  
##  8 LL              73 <tibble [822 × 5]>  
##  9 TY               2 <tibble [65 × 5]>   
## 10 TY              14 <tibble [732 × 5]>  
## 11 TY              25 <tibble [166 × 5]>  
## 12 UL              14 <tibble [748 × 5]>  
## 13 UL              26 <tibble [2,492 × 5]>
## 14 UL              37 <tibble [984 × 5]>

NOTE: Because these figures are already fine in the MS, I will leave this for the end, but I can re-draw them so that they all look consistent

##Population Recycling through Time

In order to scale up from individual to population, we need to use ind2pop for all the combinations of stream and month since introduction. These are essentially given in pdata, so we can simply use som purrr magic to apply the function to all those combinations.

N = 5 # Define nr of MC simulations throughout the vignette

pdata <- pdata %>%
  mutate(pop_params = purrr::map2(stream, month_intro, pop_bundle, nsims = N),
         pop_excretion = purrr::map(pop_params, ind2pop)) %>%
  mutate(pop_exc_mean = purrr::map_dbl(pop_excretion, mean),
         pop_exc_upp = purrr::map_dbl(pop_excretion, quantile, 0.975),
         pop_exc_low = purrr::map_dbl(pop_excretion, quantile, 0.025))
pdata

## # A tibble: 14 x 8
## # Groups:   stream, month_intro [14]
##    stream month_intro data  pop_params pop_excretion pop_exc_mean pop_exc_upp
##    <fct>        <dbl> <lis> <list>     <list>               <dbl>       <dbl>
##  1 CA               2 <tib… <named li… <dbl [5]>             3.24        3.84
##  2 CA              14 <tib… <named li… <dbl [5]>             1.23        1.81
##  3 CA              25 <tib… <named li… <dbl [5]>            25.8        32.6 
##  4 CA              60 <tib… <named li… <dbl [5]>            12.0        13.4 
##  5 LL              14 <tib… <named li… <dbl [5]>             6.13        8.89
##  6 LL              26 <tib… <named li… <dbl [5]>            13.8        16.5 
##  7 LL              37 <tib… <named li… <dbl [5]>             5.20        6.09
##  8 LL              73 <tib… <named li… <dbl [5]>            10.3        10.8 
##  9 TY               2 <tib… <named li… <dbl [5]>             4.42        4.97
## 10 TY              14 <tib… <named li… <dbl [5]>            28.4        35.2 
## 11 TY              25 <tib… <named li… <dbl [5]>             1.42        1.95
## 12 UL              14 <tib… <named li… <dbl [5]>            13.7        15.1 
## 13 UL              26 <tib… <named li… <dbl [5]>            48.4        54.7 
## 14 UL              37 <tib… <named li… <dbl [5]>            16.6        21.0 
## # … with 1 more variable: pop_exc_low <dbl>

Now that we have the 5 MC simulations of population-level excretion for each stream/month combination, we can plot its evolution through time to reconstruct Figure 4

# natural stream references
natural_ref <- rbind(c(3.174946,3.824626,4.480129),
                     c(12.33031,14.735918,17.10158))
rownames(natural_ref) <- c('HP','LP')

# range of x
max_x <- max(pdata$month_intro) + 2

 # plot
plot_popexc <- pdata %>%
  ggplot(aes(y = pop_exc_mean, x = month_intro, group = stream)) +
    geom_rect(aes(xmin =0, xmax = max_x,
                  ymin = natural_ref['HP',1],
                  ymax = natural_ref['HP',3]), 
              fill = 'gray90') +
    annotate('text', max_x-2, natural_ref['HP',2], 
             label = 'HP', colour = 'gray50') +
    geom_rect(aes(xmin =0, xmax = max_x,
                  ymin = natural_ref['LP',1],
                  ymax = natural_ref['LP',3]), 
              fill = 'gray90') + 
    annotate('text', max_x-2, natural_ref['LP',2], 
            label = 'LP', colour = 'gray50') +
    geom_line(aes(color = factor(stream))) +
    scale_linetype_manual(values = c('solid', 'solid', 
                                     'dashed', 'dashed')) +
    geom_point(aes(color = factor(stream), 
                   shape = factor(stream)),
               size = 2.5) +
    geom_errorbar(aes(ymax = pop_exc_upp, 
                      ymin = pdata$pop_exc_low,
                      color = factor(stream))) +
    scale_color_manual(values = c('black', 'black',
                                  'gray50', 'gray50')) +
    scale_shape_manual(values = c(16, 17, 16, 17)) +
    scale_x_continuous(breaks = seq(0, max_x, 12),
                     expand = c(0, 0)) +
    theme_bw() + theme(text = element_text(size=13),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      axis.text.x = element_text(size = 12),
      axis.text.y = element_text(size = 12),
      legend.title=element_blank(),
      legend.text = element_text(size = 12)) + 
    xlab('\n Months Since Introduction') +
    ylab(expression(Population ~ Excretion ~ Rate ~ (gN/h/m^2)))
  
plot_popexc

Contributions to Change in Recycling

Now finally, the decomposition into contributions to change. This is a bit trickier because it cannot be done independently for all stream-month combinations. Rather, for each stream, we need to compare one time step with the next. To do this, we need to devise a little function that finds what are the relevant comparisons: i.e. that finds the next time step for each month-stream sampling. We have implemented this in function next_sampling. For example:

next_sampling('UL', 14)

## [1] 26

We can check that this is true:

data_ul <- subset(data_excretion, stream == 'UL')

samplings_ul <- sort(unique(data_ul$month_intro))
samplings_ul

## [1] 14 26 37

Yes, the sampling after month 14 is at month 26.

Let’s apply this to all our stream samplings and get our decomposition. This will require three steps:

Apply next_sampling to find the next point to compare with
Calculate the pop_bundle of that next sampling
Perform the decomposition using both population parameter bundles using decompose

decomp_data <- pdata %>%
  mutate(next_month = purrr::map2_dbl(stream, month_intro, next_sampling),
         pop_params_next = purrr::map2(stream, next_month, pop_bundle, nsims = N),
         decomposition = purrr::map2(pop_params, pop_params_next, change_decomp))

decomp_data[, c('stream', 'month_intro', 'decomposition')]

## # A tibble: 14 x 3
## # Groups:   stream, month_intro [14]
##    stream month_intro decomposition   
##    <fct>        <dbl> <list>          
##  1 CA               2 <df[,5] [5 × 5]>
##  2 CA              14 <df[,5] [5 × 5]>
##  3 CA              25 <df[,5] [5 × 5]>
##  4 CA              60 <lgl [1]>       
##  5 LL              14 <df[,5] [5 × 5]>
##  6 LL              26 <df[,5] [5 × 5]>
##  7 LL              37 <df[,5] [5 × 5]>
##  8 LL              73 <lgl [1]>       
##  9 TY               2 <df[,5] [5 × 5]>
## 10 TY              14 <df[,5] [5 × 5]>
## 11 TY              25 <lgl [1]>       
## 12 UL              14 <df[,5] [5 × 5]>
## 13 UL              26 <df[,5] [5 × 5]>
## 14 UL              37 <lgl [1]>

DONE! Let’s have a look at one of them

head(decomp_data$decomposition[[1]])

##      change          A         S        D         I
## 1 0.5198320 0.12381828 0.9443387 4.821918 0.9219993
## 2 0.7194713 0.18209928 0.8600252 4.821918 0.9527397
## 3 0.3750426 0.09160362 0.9492725 4.821918 0.8944525
## 4 0.2842011 0.06716755 0.9549846 4.821918 0.9188616
## 5 0.6932695 0.16373332 0.7967160 4.821918 1.1021525

We just need now to make the plot. But first let’s define a function that will calculate mean, upper, and lower CIs for all columns of a data frame.

MC_summary <- function(data){
  
  temp <- data %>%
    #mutate_all(funs(log)) %>%
    summarise_all(funs (mean = mean, 
                        lo = quantile(.,probs = 0.025),
                        up = quantile(.,probs = 0.975))) 
  return(temp)
}

summary_decomp <- decomp_data %>%
  filter(!is.na(decomposition)) %>%
  mutate(MC_summary = purrr::map(decomposition, MC_summary)) %>%
  select(stream, month_intro, MC_summary) %>%
  unnest()

mean_decomp <- summary_decomp %>%
  gather(key, mean, A_mean, S_mean, D_mean, I_mean, factor_key =  TRUE) %>%
  mutate(key = forcats::fct_recode(key, A = "A_mean",
                                        S = "S_mean",
                                        D = "D_mean",
                                        I = "I_mean"))
up_decomp <- summary_decomp  %>%
  gather(key, up, A_up, S_up, D_up, I_up, factor_key =  TRUE) %>%
  mutate(key = forcats::fct_recode(key, A = "A_up",
                                        S = "S_up",
                                        D = "D_up",
                                        I = "I_up"))
low_decomp <- summary_decomp  %>%
  gather(key, low, A_lo, S_lo, D_lo, I_lo, factor_key =  TRUE) %>%
  mutate(key = forcats::fct_recode(key, A = "A_lo",
                                        S = "S_lo",
                                        D = "D_lo",
                                        I = "I_lo"))

merge_decomp <- left_join(up_decomp, low_decomp, by = c('stream', 'month_intro', 'key')) %>%
    left_join(., mean_decomp, by = c('stream', 'month_intro', 'key')) %>%
    select(stream, month_intro, key, mean, up, low)  
  
plot_decomp <- merge_decomp %>%  
  ggplot(aes(y = mean, x = month_intro)) +
    facet_grid(stream ~ key) +
    geom_ribbon(aes(ymin = low, ymax = up), fill = 'grey70') +
    geom_line() +
    scale_x_continuous(breaks = seq(0, max_x, 12)) +
    scale_y_log10() +
    theme_bw() + theme(text = element_text(size=13),
      axis.text.x = element_text(size = 12),
      axis.text.y = element_text(size = 12),
      legend.title=element_blank(),
      legend.text = element_text(size = 12),
      strip.background = element_rect(fill = 'white', colour = 'white'),
      strip.text.y = element_text(angle = 0, size = 14, vjust = .5),
      strip.text.x = element_text(size = 14)) +
    xlab('\n Months Since Introduction') +
    ylab('\n Contribution to Population Recycling')
  

plot_decomp