In this tutorial I will show how to apply the package functions in batch to all streams (and time points), mostly through the use of:
tidyverse
like tidyr
, dplyr
, purr
, ggplot2
, or tibble
.The results, tables, and figures, should already be useable for the associated manuscript (Frauendorf et al. in revision).
The nice thing about tibbles
is that it allows you to organize dataframes where each entry need not be single elements only, but can be any type of object: a data.frame, the results of a model fit, etc. This makes it easy to batch-apply functions using package purr
. We can load these functions with:
library(tidyverse)
One of the advantages of the tidyverse
family of packages is the use of the pipe operator %>%
. It essentially allows you to first indicate what dataset you want to do a bunch of modifications or operations, and then list the series of operations one after the other followed by %>%
.
For example, imagine we want to do the following changes to the excretion data:
load('../data/data_excretion.rda')
new_data <- data_excretion %>%
filter(sex !='M') %>%
mutate(N_rate = N_rate * 1000) %>%
group_by(stream, year) %>%
summarize(n_inds = length(size),
size_mean = mean(size, na.rm = TRUE),
N_rate = mean(N_rate, na.rm = TRUE))
## `summarise()` regrouping output by 'stream' (override with `.groups` argument)
knitr::kable(new_data)
stream | year | n_inds | size_mean | N_rate |
---|---|---|---|---|
CA | 2009 | 18 | 0.1272222 | 5494.0073 |
CA | 2010 | 22 | 0.1418182 | 1870.1445 |
CA | 2011 | 26 | 0.1726923 | 9785.8555 |
CA | 2014 | 6 | 0.1666667 | 6908.4324 |
LL | 2009 | 28 | 0.1357143 | 4138.9871 |
LL | 2010 | 28 | 0.0996429 | 5542.1272 |
LL | 2011 | 28 | 0.1196429 | 2791.9596 |
LL | 2014 | 9 | 0.2177778 | 7298.9697 |
TY | 2009 | 19 | 0.1811111 | 7270.3752 |
TY | 2010 | 25 | 0.2652000 | 7917.7528 |
TY | 2011 | 19 | 0.1015789 | 102.3317 |
UL | 2009 | 33 | 0.2003030 | 9225.8923 |
UL | 2010 | 27 | 0.1018519 | 6629.6677 |
UL | 2011 | 25 | 0.1420000 | 8202.7205 |
TADAH! Now let’s forget about this example and use this magic with some real questions. Note that we made the table look prettier than the usual R format using function kable
from package knitr
First of all, we want to group the excretion data by stream and year.
edata <- data_excretion %>%
group_by(stream, year) %>%
nest()
edata
## # A tibble: 14 x 3
## # Groups: stream, year [14]
## stream year data
## <fct> <int> <list>
## 1 UL 2009 <tibble [42 × 8]>
## 2 LL 2009 <tibble [37 × 8]>
## 3 TY 2009 <tibble [24 × 8]>
## 4 CA 2009 <tibble [25 × 8]>
## 5 UL 2010 <tibble [36 × 8]>
## 6 LL 2010 <tibble [36 × 8]>
## 7 TY 2010 <tibble [36 × 8]>
## 8 CA 2010 <tibble [30 × 8]>
## 9 UL 2011 <tibble [36 × 8]>
## 10 LL 2011 <tibble [37 × 8]>
## 11 TY 2011 <tibble [24 × 8]>
## 12 CA 2011 <tibble [36 × 8]>
## 13 CA 2014 <tibble [15 × 8]>
## 14 LL 2014 <tibble [15 × 8]>
What this did is to create a dataframe whith one entry per stream and year only, and as a third column, the whole data for that stream and year. For example, to access the UL 2009
data:
subset(edata, stream == 'UL' & year == 2009)$data
## [[1]]
## # A tibble: 42 x 8
## pair canopy year_intro month_cal month_intro sex size N_rate
## <int> <int> <dbl> <int> <int> <fct> <dbl> <dbl>
## 1 1 1 1 5 14 F 0.55 18.2
## 2 1 1 1 5 14 F 0.13 5.37
## 3 1 1 1 5 14 M 0.13 9.22
## 4 1 1 1 5 14 M 0.15 3.58
## 5 1 1 1 5 14 F 0.55 8.48
## 6 1 1 1 5 14 F 0.49 27.7
## 7 1 1 1 5 14 F 0.16 10.5
## 8 1 1 1 5 14 F 0.12 5.85
## 9 1 1 1 5 14 M 0.16 8.00
## 10 1 1 1 5 14 F 0.04 2.8
## # … with 32 more rows
Now we can use a very powerful function called map
from package purrr
, for example, to apply the same function to all the nested dataframes at once.
edata <- edata %>%
mutate(fit = map(.f = allo_fit, .x = data, N_rate, size))
edata
## # A tibble: 14 x 4
## # Groups: stream, year [14]
## stream year data fit
## <fct> <int> <list> <list>
## 1 UL 2009 <tibble [42 × 8]> <named list [3]>
## 2 LL 2009 <tibble [37 × 8]> <named list [3]>
## 3 TY 2009 <tibble [24 × 8]> <named list [3]>
## 4 CA 2009 <tibble [25 × 8]> <named list [3]>
## 5 UL 2010 <tibble [36 × 8]> <named list [3]>
## 6 LL 2010 <tibble [36 × 8]> <named list [3]>
## 7 TY 2010 <tibble [36 × 8]> <named list [3]>
## 8 CA 2010 <tibble [30 × 8]> <named list [3]>
## 9 UL 2011 <tibble [36 × 8]> <named list [3]>
## 10 LL 2011 <tibble [37 × 8]> <named list [3]>
## 11 TY 2011 <tibble [24 × 8]> <named list [3]>
## 12 CA 2011 <tibble [36 × 8]> <named list [3]>
## 13 CA 2014 <tibble [15 × 8]> <named list [3]>
## 14 LL 2014 <tibble [15 × 8]> <named list [3]>
As you see, there is a new column coefs
that stores a list of three elements, the three elements returned by our function allo_fit
: the coefficients, and the variance covariance matrix. Let’s check out again, for example the UL 2009
data.
subset(edata, stream == 'UL' & year == 2009)$fit
## [[1]]
## [[1]]$coef
## (Intercept) log(size)
## 3.2254379 0.6202035
##
## [[1]]$vcov
## (Intercept) log(size)
## (Intercept) 0.015757805 0.006071351
## log(size) 0.006071351 0.002952926
##
## [[1]]$fit
##
## Call:
## lm(formula = log(response) ~ log(size))
##
## Coefficients:
## (Intercept) log(size)
## 3.2254 0.6202
Therey you go! All models fit with one stroke! Now let’s carry on implementing allo_mc
.
edata <- edata %>%
mutate(mc = map(.f = allo_mc, .x = fit, predict = TRUE))
edata
## # A tibble: 14 x 5
## # Groups: stream, year [14]
## stream year data fit mc
## <fct> <int> <list> <list> <list>
## 1 UL 2009 <tibble [42 × 8]> <named list [3]> <named list [3]>
## 2 LL 2009 <tibble [37 × 8]> <named list [3]> <named list [3]>
## 3 TY 2009 <tibble [24 × 8]> <named list [3]> <named list [3]>
## 4 CA 2009 <tibble [25 × 8]> <named list [3]> <named list [3]>
## 5 UL 2010 <tibble [36 × 8]> <named list [3]> <named list [3]>
## 6 LL 2010 <tibble [36 × 8]> <named list [3]> <named list [3]>
## 7 TY 2010 <tibble [36 × 8]> <named list [3]> <named list [3]>
## 8 CA 2010 <tibble [30 × 8]> <named list [3]> <named list [3]>
## 9 UL 2011 <tibble [36 × 8]> <named list [3]> <named list [3]>
## 10 LL 2011 <tibble [37 × 8]> <named list [3]> <named list [3]>
## 11 TY 2011 <tibble [24 × 8]> <named list [3]> <named list [3]>
## 12 CA 2011 <tibble [36 × 8]> <named list [3]> <named list [3]>
## 13 CA 2014 <tibble [15 × 8]> <named list [3]> <named list [3]>
## 14 LL 2014 <tibble [15 × 8]> <named list [3]> <named list [3]>
Let’s check again for UL 2009
str(subset(edata, stream == 'UL' & year == 2009)$mc)
## List of 1
## $ :List of 3
## ..$ coefs : num [1:1000, 1:2] 3.12 3.2 3.24 3.21 3.16 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : NULL
## .. .. ..$ : chr [1:2] "(Intercept)" "log(size)"
## ..$ xpred : num [1:201] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
## ..$ pred_ci: num [1:3, 1:201] 0 0 0 1.07 1.44 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:3] "2.5%" "50%" "97.5%"
## .. .. ..$ : NULL
In reality, it can all be done at the same time.
edata <- data_excretion %>%
group_by(stream, year) %>%
nest() %>%
arrange(stream) %>%
mutate(fit = map(.f = allo_fit, .x = data, N_rate, size))
edata
## # A tibble: 14 x 4
## # Groups: stream, year [14]
## stream year data fit
## <fct> <int> <list> <list>
## 1 CA 2009 <tibble [25 × 8]> <named list [3]>
## 2 CA 2010 <tibble [30 × 8]> <named list [3]>
## 3 CA 2011 <tibble [36 × 8]> <named list [3]>
## 4 CA 2014 <tibble [15 × 8]> <named list [3]>
## 5 LL 2009 <tibble [37 × 8]> <named list [3]>
## 6 LL 2010 <tibble [36 × 8]> <named list [3]>
## 7 LL 2011 <tibble [37 × 8]> <named list [3]>
## 8 LL 2014 <tibble [15 × 8]> <named list [3]>
## 9 TY 2009 <tibble [24 × 8]> <named list [3]>
## 10 TY 2010 <tibble [36 × 8]> <named list [3]>
## 11 TY 2011 <tibble [24 × 8]> <named list [3]>
## 12 UL 2009 <tibble [42 × 8]> <named list [3]>
## 13 UL 2010 <tibble [36 × 8]> <named list [3]>
## 14 UL 2011 <tibble [36 × 8]> <named list [3]>
Let’s plot the results, which corresponds to Figure 3
## `geom_smooth()` using formula 'y ~ x'
Using package stargazer
, we can also automate the construction of a nice summary table for the model results of different streams and years. It may look like a lot of code, but anything beyond the first line is formating and customizing labels. Even if you left that out, the default would look quite decent.
I notice we don’t have this in the paper, and it should be at least a Supplementary Appendix Table S3
my_model_labels <- ifelse(edata$year == min(edata$year), paste(edata$stream, edata$year, sep = '<br>'), paste('<br>',edata$year, sep = ''))
stargazer::stargazer(lapply(edata$fit, with, fit), type = 'html',
column.labels = my_model_labels,
covariate.labels = c('<em>a</em>','<em>b</em>'),
dep.var.caption = "log(E) = <em>a</em> + <em>b</em> log(size)",
dep.var.labels.include = FALSE,
model.numbers = FALSE,
df = FALSE, omit.stat = 'adj.rsq')
log(E) = a + b log(size) | ||||||||||||||
CA 2009 |
2010 |
2011 |
2014 |
LL 2009 |
2010 |
2011 |
2014 |
TY 2009 |
2010 |
2011 |
UL 2009 |
2010 |
2011 |
|
a | 0.547*** | 0.620* | 0.494*** | 0.762** | 1.003*** | 0.552*** | 0.874*** | 0.710*** | 0.541*** | 0.619*** | 0.780 | 0.620*** | 0.550*** | 0.182 |
(0.066) | (0.358) | (0.087) | (0.298) | (0.170) | (0.115) | (0.185) | (0.104) | (0.091) | (0.142) | (0.534) | (0.054) | (0.057) | (0.129) | |
b | 2.946*** | 0.673 | 3.173*** | 3.203*** | 3.557*** | 3.026*** | 2.479*** | 3.017*** | 3.011*** | 2.701*** | 1.169 | 3.225*** | 3.292*** | 2.294*** |
(0.174) | (0.957) | (0.212) | (0.696) | (0.385) | (0.313) | (0.497) | (0.210) | (0.223) | (0.328) | (0.954) | (0.126) | (0.172) | (0.338) | |
Observations | 23 | 22 | 31 | 15 | 31 | 29 | 29 | 15 | 19 | 32 | 14 | 39 | 33 | 32 |
R2 | 0.764 | 0.131 | 0.525 | 0.335 | 0.546 | 0.459 | 0.453 | 0.782 | 0.675 | 0.388 | 0.151 | 0.779 | 0.748 | 0.062 |
Residual Std. Error | 0.300 | 1.966 | 0.517 | 0.559 | 0.731 | 0.589 | 1.066 | 0.260 | 0.460 | 0.910 | 1.205 | 0.357 | 0.371 | 0.782 |
F Statistic | 67.917*** | 3.010* | 32.020*** | 6.561** | 34.872*** | 22.872*** | 22.372*** | 46.510*** | 35.342*** | 19.019*** | 2.137 | 130.261*** | 91.821*** | 1.981 |
Note: | p<0.1; p<0.05; p<0.01 |
load('../data/data_population.rda')
pdata <- data_population %>%
group_by(stream, month_intro) %>%
arrange(month_intro, .by_group = TRUE) %>%
nest()
pdata
## # A tibble: 14 x 3
## # Groups: stream, month_intro [14]
## stream month_intro data
## <fct> <dbl> <list>
## 1 CA 2 <tibble [73 × 5]>
## 2 CA 14 <tibble [352 × 5]>
## 3 CA 25 <tibble [369 × 5]>
## 4 CA 60 <tibble [351 × 5]>
## 5 LL 14 <tibble [371 × 5]>
## 6 LL 26 <tibble [890 × 5]>
## 7 LL 37 <tibble [781 × 5]>
## 8 LL 73 <tibble [822 × 5]>
## 9 TY 2 <tibble [65 × 5]>
## 10 TY 14 <tibble [732 × 5]>
## 11 TY 25 <tibble [166 × 5]>
## 12 UL 14 <tibble [748 × 5]>
## 13 UL 26 <tibble [2,492 × 5]>
## 14 UL 37 <tibble [984 × 5]>
NOTE: Because these figures are already fine in the MS, I will leave this for the end, but I can re-draw them so that they all look consistent
##Population Recycling through Time
In order to scale up from individual to population, we need to use ind2pop
for all the combinations of stream and month since introduction. These are essentially given in pdata, so we can simply use som purrr
magic to apply the function to all those combinations.
N = 5 # Define nr of MC simulations throughout the vignette
pdata <- pdata %>%
mutate(pop_params = purrr::map2(stream, month_intro, pop_bundle, nsims = N),
pop_excretion = purrr::map(pop_params, ind2pop)) %>%
mutate(pop_exc_mean = purrr::map_dbl(pop_excretion, mean),
pop_exc_upp = purrr::map_dbl(pop_excretion, quantile, 0.975),
pop_exc_low = purrr::map_dbl(pop_excretion, quantile, 0.025))
pdata
## # A tibble: 14 x 8
## # Groups: stream, month_intro [14]
## stream month_intro data pop_params pop_excretion pop_exc_mean pop_exc_upp
## <fct> <dbl> <lis> <list> <list> <dbl> <dbl>
## 1 CA 2 <tib… <named li… <dbl [5]> 3.24 3.84
## 2 CA 14 <tib… <named li… <dbl [5]> 1.23 1.81
## 3 CA 25 <tib… <named li… <dbl [5]> 25.8 32.6
## 4 CA 60 <tib… <named li… <dbl [5]> 12.0 13.4
## 5 LL 14 <tib… <named li… <dbl [5]> 6.13 8.89
## 6 LL 26 <tib… <named li… <dbl [5]> 13.8 16.5
## 7 LL 37 <tib… <named li… <dbl [5]> 5.20 6.09
## 8 LL 73 <tib… <named li… <dbl [5]> 10.3 10.8
## 9 TY 2 <tib… <named li… <dbl [5]> 4.42 4.97
## 10 TY 14 <tib… <named li… <dbl [5]> 28.4 35.2
## 11 TY 25 <tib… <named li… <dbl [5]> 1.42 1.95
## 12 UL 14 <tib… <named li… <dbl [5]> 13.7 15.1
## 13 UL 26 <tib… <named li… <dbl [5]> 48.4 54.7
## 14 UL 37 <tib… <named li… <dbl [5]> 16.6 21.0
## # … with 1 more variable: pop_exc_low <dbl>
Now that we have the 5 MC simulations of population-level excretion for each stream/month combination, we can plot its evolution through time to reconstruct Figure 4
# natural stream references
natural_ref <- rbind(c(3.174946,3.824626,4.480129),
c(12.33031,14.735918,17.10158))
rownames(natural_ref) <- c('HP','LP')
# range of x
max_x <- max(pdata$month_intro) + 2
# plot
plot_popexc <- pdata %>%
ggplot(aes(y = pop_exc_mean, x = month_intro, group = stream)) +
geom_rect(aes(xmin =0, xmax = max_x,
ymin = natural_ref['HP',1],
ymax = natural_ref['HP',3]),
fill = 'gray90') +
annotate('text', max_x-2, natural_ref['HP',2],
label = 'HP', colour = 'gray50') +
geom_rect(aes(xmin =0, xmax = max_x,
ymin = natural_ref['LP',1],
ymax = natural_ref['LP',3]),
fill = 'gray90') +
annotate('text', max_x-2, natural_ref['LP',2],
label = 'LP', colour = 'gray50') +
geom_line(aes(color = factor(stream))) +
scale_linetype_manual(values = c('solid', 'solid',
'dashed', 'dashed')) +
geom_point(aes(color = factor(stream),
shape = factor(stream)),
size = 2.5) +
geom_errorbar(aes(ymax = pop_exc_upp,
ymin = pdata$pop_exc_low,
color = factor(stream))) +
scale_color_manual(values = c('black', 'black',
'gray50', 'gray50')) +
scale_shape_manual(values = c(16, 17, 16, 17)) +
scale_x_continuous(breaks = seq(0, max_x, 12),
expand = c(0, 0)) +
theme_bw() + theme(text = element_text(size=13),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
legend.title=element_blank(),
legend.text = element_text(size = 12)) +
xlab('\n Months Since Introduction') +
ylab(expression(Population ~ Excretion ~ Rate ~ (gN/h/m^2)))
plot_popexc
Now finally, the decomposition into contributions to change. This is a bit trickier because it cannot be done independently for all stream-month combinations. Rather, for each stream, we need to compare one time step with the next. To do this, we need to devise a little function that finds what are the relevant comparisons: i.e. that finds the next time step for each month-stream sampling. We have implemented this in function next_sampling
. For example:
next_sampling('UL', 14)
## [1] 26
We can check that this is true:
data_ul <- subset(data_excretion, stream == 'UL')
samplings_ul <- sort(unique(data_ul$month_intro))
samplings_ul
## [1] 14 26 37
Yes, the sampling after month 14 is at month 26.
Let’s apply this to all our stream samplings and get our decomposition. This will require three steps:
next_sampling
to find the next point to compare withpop_bundle
of that next samplingdecompose
decomp_data <- pdata %>%
mutate(next_month = purrr::map2_dbl(stream, month_intro, next_sampling),
pop_params_next = purrr::map2(stream, next_month, pop_bundle, nsims = N),
decomposition = purrr::map2(pop_params, pop_params_next, change_decomp))
decomp_data[, c('stream', 'month_intro', 'decomposition')]
## # A tibble: 14 x 3
## # Groups: stream, month_intro [14]
## stream month_intro decomposition
## <fct> <dbl> <list>
## 1 CA 2 <df[,5] [5 × 5]>
## 2 CA 14 <df[,5] [5 × 5]>
## 3 CA 25 <df[,5] [5 × 5]>
## 4 CA 60 <lgl [1]>
## 5 LL 14 <df[,5] [5 × 5]>
## 6 LL 26 <df[,5] [5 × 5]>
## 7 LL 37 <df[,5] [5 × 5]>
## 8 LL 73 <lgl [1]>
## 9 TY 2 <df[,5] [5 × 5]>
## 10 TY 14 <df[,5] [5 × 5]>
## 11 TY 25 <lgl [1]>
## 12 UL 14 <df[,5] [5 × 5]>
## 13 UL 26 <df[,5] [5 × 5]>
## 14 UL 37 <lgl [1]>
DONE! Let’s have a look at one of them
head(decomp_data$decomposition[[1]])
## change A S D I
## 1 0.5198320 0.12381828 0.9443387 4.821918 0.9219993
## 2 0.7194713 0.18209928 0.8600252 4.821918 0.9527397
## 3 0.3750426 0.09160362 0.9492725 4.821918 0.8944525
## 4 0.2842011 0.06716755 0.9549846 4.821918 0.9188616
## 5 0.6932695 0.16373332 0.7967160 4.821918 1.1021525
We just need now to make the plot. But first let’s define a function that will calculate mean, upper, and lower CIs for all columns of a data frame.
MC_summary <- function(data){
temp <- data %>%
#mutate_all(funs(log)) %>%
summarise_all(funs (mean = mean,
lo = quantile(.,probs = 0.025),
up = quantile(.,probs = 0.975)))
return(temp)
}
summary_decomp <- decomp_data %>%
filter(!is.na(decomposition)) %>%
mutate(MC_summary = purrr::map(decomposition, MC_summary)) %>%
select(stream, month_intro, MC_summary) %>%
unnest()
mean_decomp <- summary_decomp %>%
gather(key, mean, A_mean, S_mean, D_mean, I_mean, factor_key = TRUE) %>%
mutate(key = forcats::fct_recode(key, A = "A_mean",
S = "S_mean",
D = "D_mean",
I = "I_mean"))
up_decomp <- summary_decomp %>%
gather(key, up, A_up, S_up, D_up, I_up, factor_key = TRUE) %>%
mutate(key = forcats::fct_recode(key, A = "A_up",
S = "S_up",
D = "D_up",
I = "I_up"))
low_decomp <- summary_decomp %>%
gather(key, low, A_lo, S_lo, D_lo, I_lo, factor_key = TRUE) %>%
mutate(key = forcats::fct_recode(key, A = "A_lo",
S = "S_lo",
D = "D_lo",
I = "I_lo"))
merge_decomp <- left_join(up_decomp, low_decomp, by = c('stream', 'month_intro', 'key')) %>%
left_join(., mean_decomp, by = c('stream', 'month_intro', 'key')) %>%
select(stream, month_intro, key, mean, up, low)
plot_decomp <- merge_decomp %>%
ggplot(aes(y = mean, x = month_intro)) +
facet_grid(stream ~ key) +
geom_ribbon(aes(ymin = low, ymax = up), fill = 'grey70') +
geom_line() +
scale_x_continuous(breaks = seq(0, max_x, 12)) +
scale_y_log10() +
theme_bw() + theme(text = element_text(size=13),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
legend.title=element_blank(),
legend.text = element_text(size = 12),
strip.background = element_rect(fill = 'white', colour = 'white'),
strip.text.y = element_text(angle = 0, size = 14, vjust = .5),
strip.text.x = element_text(size = 14)) +
xlab('\n Months Since Introduction') +
ylab('\n Contribution to Population Recycling')
plot_decomp