Monkeyflower reciprocal transplant study

CLIENT

Avin Niknafs

DATE

August 19, 2025

From client:

I am conducting a reciprocal transplant study to test for local adaptation between two monkeyflower species in distinct microhabitats at a sympatric site on the central coast of California. My project involves analyzing germination rates, seedling survival of the two species, and reproductive success, and if possible relating those outcomes to weekly soil moisture data.

Seeded toothpicks were planted and surveyed weekly for roughly one year. Study design:

10 plots in each of 2 microhabitat environments (rock outcrop and seep)
2 monkeyflower species (annual and perennial)
9 lineages per species
6 replicates of each lineage were randomly allocated to positions in each plot

In total, this amounts to 108 toothpicks (6 reps x 9 lineages x 2 species) per plot and a total sample size of $n = 2160$ (108 per plot x 10 plots x 2 microhabitats). Primary interest is in species/environment interaction effects on fitness.

Goals set during initial meeting:

Formulate an analysis-ready rearrangment of seed plot data.
Recommend an approach for analysis of germination rates.

Note

Questions/issues needing client’s attention are shown throughout like this.

Seed plot data

Data provided are shown below. I removed plot index A4_10_3, which was missing identifying information for species and lineage. I suspect this may have been miscoded, since after removal $n = 2159$, which is one short of expected based on the study design.

Note

Check plot coding for index A4_10_3. Was this miscoded?

For the three individuals shown, values are missing in all weeks because the toothpicks never germinated.

Table: raw data (continued below)
species	mom	dad	envname	family	habit	color
LLP	9 G1	11 G1	LLP 9 G1 x LLP 11 G1	P9x11	P	WhitePink
LLP	6 G1	20 G1	LLP 6 G1 x LLP 20 G1	P6x20	P	WhiteBlue
LLP	15 G1	1 G1	LLP 15 G1 x LLP 1 G1	P15x1	P	WhiteBlack

Table continues below
location	replicate	plotind	plot	famabbrev	position	flower?
rock_outcrop	1	P1_1_1	1	P1	1	NA
rock_outcrop	1	P7_1_1	1	P7	2	NA
rock_outcrop	1	P5_1_1	1	P5	3	NA

Table continues below
week1	week2	week3	week4	week5	week6	week7	week8	week9	week10
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

Table continues below
week11	week12	week13	week14	week15	week16	week17	week18	week19
NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA

Table continues below
week20	week22	week24	week26	week28	week30	week33	week35	week39
NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA

Table continues below
week42	week44	week46	week48	week50	week52	week54	week56	week58
NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA

week60	week63	week65
NA	NA	NA
NA	NA	NA
NA	NA	NA

To verify that the data reflect the study design, the following tables show distinct combinations of study attributes and sample sizes per plot:

Table: Number of unique lineages per location/species
location	LLA	LLP
rock_outcrop	9	9
seep	9	9

Table: Sample sizes per plot
plot	rock_outcrop	seep
1	108	107
2	108	108
3	108	108
4	108	108
5	108	108
6	108	108
7	108	108
8	108	108
9	108	108
10	108	108

The values entered for each individual each week are:

missing if ungerminated, dead, or otherwise not present
1 if germinated
2 if flowering, with number of repetitions indicating number of flowers (e.g., 222 means three flowers)
3 if alive post-flowering
4 if fruiting, with number of repetitions indicating number of fruits (e.g., 44 means two fruits)
X if missing for reasons other than death
combinations as appropriate (e.g., 222 4 indicates three flowers and one fruit)

The unique encodings present in the data are shown below:

 [1] NA                         "X"                       
 [3] "1"                        "2"                       
 [5] "3"                        "22"                      
 [7] "222"                      "4"                       
 [9] "222222 (4)"               "3 (4)"                   
[11] "2222"                     "2 (4)"                   
[13] "222222"                   "22 (4)"                  
[15] " "                        "222222222"               
[17] "22222"                    "222 (4)"                 
[19] "22 (44)"                  "2222 (4444)"             
[21] "222222 (444444)"          "222 (444)"               
[23] "22222222222222"           "2222222"                 
[25] "2 (4444)"                 "222 (444444)"            
[27] "44"                       "22 (444)"                
[29] "2 (44)"                   "444"                     
[31] "22222222"                 "222222222222"            
[33] "22 (444444)"              "2 (444)"                 
[35] "4444"                     "2222222222 (44444444444)"
[37] "2222222222"

The decimals are artefacts of value encoding in excel when the entry can be parsed as numeric. So for example 2.22222222E8 is nine 2’s rewritten in scientific notation ($2.22222222 \times 10^8$).

Reformatting for analysis

I’ve cleaned up the encoding and data formatting for analysis as follows:

rearrange in ‘long’ format with a variable indicating week number
append sampling dates from the CensusDate sheet
derive a stage variable encoding life stage achieved by the time of observation:
ungerminated if no values were entered prior to the observation date
germinated if a ‘1’ is recorded and no other values were entered prior to the observation date
flowered if a ‘2’ was entered prior to or on the observation date
fruited if a ‘4’ was entered prior to or on the observation date
died if value is missing on the current observation date but at least one nonmissing value was entered prior to the observation date
NA if an ‘X’ was entered prior to or on the observation date
derive counts of numbers of fruits and flowers
add a unique identifier id per toothpick

Table: Example rows of analysis-ready data (continued below)
id	species	family	location	plot	replicate	week	date
1	LLP	P9x11	rock_outcrop	1	1	week1	2024-01-25
1	LLP	P9x11	rock_outcrop	1	1	week2	2024-02-01
1	LLP	P9x11	rock_outcrop	1	1	week3	2024-02-08
1	LLP	P9x11	rock_outcrop	1	1	week4	2024-02-15
1	LLP	P9x11	rock_outcrop	1	1	week5	2024-02-22
1	LLP	P9x11	rock_outcrop	1	1	week6	2024-02-29

stage	n.flower	n.fruit
ungerminated	0	0
ungerminated	0	0
ungerminated	0	0
ungerminated	0	0
ungerminated	0	0
ungerminated	0	0

I’ve provided the data in this form to client as seedplot.csv for direct import into R. I’d recommend explicitly encoding the life stage variable in the correct order, as shown below.

Code

seedplot <- read_csv('seedplot.csv') |>
  mutate(stage = factor(stage, 
                        levels = c('ungerminated', 'germinated', 
                                   'flowered', 'fruited', 'died'), 
                        ordered = T))

Note

I considered adding a curr.stage or curr.status variable that is the same as stage but indicates current life stage at the time of observation: ungerminated, germinated, flowering, fruiting, living, dead.

This information is captured implicitly by the combination of flower/fruit counts and the existing stage variable, but would it be useful to encode it explicitly?

Descriptive analysis

Comparing progressions through life stages shows:

almost all germination occurs by 2/15/2024
slightly higher germination rate for perennial compared with annual species
no survival at rock outcrop after 6/25/2024
perennials generally don’t flower/fruit except in seep environment after 6/25
one rock outcrop plot missing, but otherwise few missing values

Code

# life stage over time
seedplot |>
  count(date, stage, location, species) |>
  ggplot(aes(x = factor(date), y = n, fill = stage)) +
  facet_grid(location ~ species) +
  geom_bar(position = 'fill', stat = 'identity', alpha = 0.9) +
  scale_fill_grafify() +
  scale_y_continuous(expand = rep(0, 2)) +
  theme_bw() +
  theme(panel.grid = element_blank(), 
        panel.spacing = unit(0.15, 'in'),
        axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  labs(x = 'date', y = 'proportion of individuals')

Figure: Life stage progression by location and species (missing data removed)

Here’s the same figure but with normalization after removing missing data:

Code

# life stage over time
seedplot |>
  count(date, stage, location, species) |>
  drop_na(stage) |>
  ggplot(aes(x = factor(date), y = n, fill = stage)) +
  facet_grid(location ~ species) +
  geom_bar(position = 'fill', stat = 'identity', alpha = 0.9) +
  scale_fill_grafify() +
  scale_y_continuous(expand = rep(0, 2)) +
  theme_bw() +
  theme(panel.grid = element_blank(), 
        panel.spacing = unit(0.15, 'in'),
        axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  labs(x = 'date', y = 'proportion of individuals')

Analysis of rate data

Analysis of event occurrence rates aggregated over the study period models the probability that the individual germinated/flowered/fruited at any point during the study period, possibly conditional on prior life stages. For this, individuals are dropped if missing throughout the study, but if nonmissing observations were recorded for any period of time, the individual is retained through the last stage for which values are recorded.

The framework for this analysis is a generalized linear mixed model for the relevant event probability.

Germination rates

I’ve added an indicator of whether germination occurred at some time during the study. Here are some examples of the indicator encoding:

Code

# binary indicator for germination
germ <- seedplot |>
  drop_na(stage) |>
  group_by(id, species, family, location, plot) |>
  summarize(last.stage = max(stage), .groups = 'drop') |>
  mutate(germ = fct_collapse(last.stage, 
                             no = 'ungerminated', 
                             other_level = 'yes') |>
           factor(ordered = F))

Table: Example germination indicator encoding
id	species	family	location	plot	last.stage	germ
1	LLP	P9x11	rock_outcrop	1	ungerminated	no
927	LLP	P14x16	rock_outcrop	9	germinated	yes
993	LLP	P17x6	seep	10	flowered	yes
1299	LLP	P13x14	seep	13	fruited	yes
25	LLA	A30x3	rock_outcrop	1	died	yes

The figure below shows germination rates (aggregated over plots) by lineage, location, and species, with the location/species average overlaid as a vertical dashed line.

Figure: Germination rates by species, lineage, and location; average shown in blue dash.

My observations:

no apparent location effect on germination rates in aggregate, but possible location/lineage interactions
- usually slightly higher rates in seep environments, but not uniformly; e.g., P9x11 does better in rock outcrop)
- if there are location/lineage interactions, the location effect may not be significant; few differences in germination rates exceeding 5-10%
apparent species effect – higher germination rates for perennials

Interestingly, there is a bit of a plot effect and a possible plot/species interaction. This suggests that there may be unaccounted-for microhabitat features that influence fitness.

Figure: Plot effects and plot/species interactions

Analysis ignoring lineage

Let $Y_{ijkl}$ denote the whether germination occurred for the $l$th replicate of species $i$ in plot $k$ in microhabitat type $j$. A model for the data, ignoring any effect of lineage, is:

\[ \begin{cases} Y_{ijkl} \sim \text{Bernoulli}(p_{ijk}) \\ \log\left(\frac{p_{ijk}}{1 - p_{ijk}}\right) = \text{spp}_i + \text{mh}_j + \left(\text{spp}\times\text{mh}\right)_{ij} + \text{plot}_{k(j)} + \left(\text{plot}\times\text{spp}\right)_{ik(j)} \end{cases} \]

Fixed effects:

species $\text{spp}_i$
microhabitat $\text{mh}_j$
species/microhabitat interaction $\left(\text{spp}\times\text{mh}\right)_{ij}$

Random effects:

plot (nested within microhabitat) $\text{plot}_{k(j)}$
plot/species interaction (nested within microhabitat) $\left(\text{plot}\times\text{spp}\right)_{ik(j)}$

To fit the model, use lme4::glmer:

Code

# fit model
fit <- glmer(germ ~ species*location + (1 | plot:location/species),
             data = germ, family = binomial)

This produces the inference:

Code

# tests for fixed effects
car::Anova(fit, type = 'II') |> 
  pander(caption = 'Table: Type II Wald tests for fixed effects')

Table: Type II Wald tests for fixed effects
	Chisq	Df	Pr(>Chisq)
species	5.631	1	0.01765
location	0.002503	1	0.9601
species:location	0.0118	1	0.9135

Code

# test for plot effect
glm(germ ~ species*location,
                data = germ, family = binomial) %>% 
  anova(fit, ., type = 'LRT') |> 
  pander(caption = 'Table: Likelihood ratio test for plot effect')

Table: Likelihood ratio test for plot effect
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
.	4	2795	2818	-1394	2787	NA	NA	NA
fit	6	2631	2665	-1310	2619	168.1	2	3.146e-37

Code

# test for plot/species interaction
glmer(germ ~ species*location + (1 | plot:location),
                  data = germ, family = binomial) %>%
  anova(fit, ., type = 'LRT') |> 
  pander(caption = 'Table: Likelihood ratio test for plot/species interaction')

Table: Likelihood ratio test for plot/species interaction
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
.	5	2630	2658	-1310	2620	NA	NA	NA
fit	6	2631	2665	-1310	2619	0.2977	1	0.5853

In summary:

the data provide evidence of a species effect ($\chi^2_1 = 5.631, p = 0.01765$) and a plot effect ($\chi^2_2 = 168.1, p < 0.0001$)
the data do not provide evidence of a location effect or a species/plot interaction

Adjusting for lineage

Let $Y_{ijklm}$ denote the whether germination occurred for the $m$th replicate of lineage $l$ of species $i$ in plot $k$ in microhabitat type $j$. Agumenting the model above to account for lineage yields:

\[ \begin{cases} Y_{ijklm} \sim \text{Bernoulli}(p_{ijk}) \\ \log\left(\frac{p_{ijkl}}{1 - p_{ijkl}}\right) = \text{spp}_i + \text{mh}_j + \left(\text{spp}\times\text{mh}\right)_{ij} + \text{plot}_{k(j)} + \left(\text{plot}\times\text{spp}\right)_{ik(j)} + \text{lin}_{l(i)} \end{cases} \]

Fixed effects:

species $\text{spp}_i$
microhabitat $\text{mh}_j$
species/microhabitat interaction $\left(\text{spp}\times\text{mh}\right)_{ij}$

Random effects:

plot (nested within microhabitat) $\text{plot}_{k(j)}$
plot/species interaction (nested within microhabitat) $\left(\text{plot}\times\text{spp}\right)_{ik(j)}$
lineage (nested within species) $\text{lin}_{l(i)}$

To fit the model:

Code

# fit model
fit_lin <- glmer(germ ~ species*location + (1 | plot:location/species) + (1 | family:species),
                 data = germ, family = binomial)

By comparison with the initial model, there is evidence of an effect of lineage:

Code

# test for effect of lineage
anova(fit_lin, fit) |>
  pander(caption = 'Table: Likelihood ratio test for lineage effect')

Table: Likelihood ratio test for lineage effect
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
fit	6	2631	2665	-1310	2619	NA	NA	NA
fit_lin	7	2619	2659	-1303	2605	14.16	1	0.0001677

And interestingly, the effect of species vanishes when lineage is accounted for:

Code

# tests for fixed effects
car::Anova(fit_lin, type = 'II') |> 
  pander(caption = 'Table: Type II Wald tests for fixed effects')

Table: Type II Wald tests for fixed effects
	Chisq	Df	Pr(>Chisq)
species	2.139	1	0.1436
location	0.002618	1	0.9592
species:location	0.01313	1	0.9088

Inference on the other fixed effects – location and location/species interactions – are unchanged. Likewise for inferences on random effects – plot and plot/species interaction.

Table: Likelihood ratio test for plot effect
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
.	5	2787	2816	-1389	2777	NA	NA	NA
fit_lin	7	2619	2659	-1303	2605	172.2	2	3.985e-38

Table: Likelihood ratio test for plot/species interaction
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
.	6	2618	2651	-1303	2606	NA	NA	NA
fit_lin	7	2619	2659	-1303	2605	0.4603	1	0.4975

In summary:

the data provide evidence of a lineage effect ($\chi^2_1 = 14.16, p = 0.0001677$) and a plot effect ($\chi^2_2 = 172.2, p < 0.0001$)
the data do not provide evidence of a species effect, a location effect, or location/species or plot/species interactions

As an aside, I also tested for lineage interactions with plot and location (separately), and neither were significant. These are shown below.

Table: Likelihood ratio test for lineage/plot interaction
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
fit_lin	7	2619	2659	-1303	2605	NA	NA	NA
.	8	2621	2666	-1302	2605	0.3193	1	0.572

Table: Likelihood ratio test for lineage/location interaction
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
fit_lin	7	2619	2659	-1303	2605	NA	NA	NA
.	8	2620	2665	-1302	2604	0.7485	1	0.3869

My recommendation would be to omit interaction terms involving lineage for model parsimony, but a full model could include both interactions above. This may be desirable for interpretability, but won’t change the inferences about effects. I checked but suppressed the analyses – codes are shown below.

Code

# full model with all interactions
fit_full <- glmer(germ ~ species*location + (1 | plot:location/species) + (1 | family:species/location) + (1 | family:species:location:plot),
                 data = germ, family = binomial) 

# inference for fixed effects
car::Anova(fit_full, type = 'II')

# inference for random effects: lineage/plot interaction
glmer(germ ~ species*location + (1 | plot:location/species) + (1 | family:species/location),
                 data = germ, family = binomial) %>%
  anova(fit_full, .)

# inference for random effects: lineage/location interaction
glmer(germ ~ species*location + (1 | plot:location/species) + (1 | family:species) + (1 | family:species:location:plot),
                 data = germ, family = binomial) %>%
  anova(fit_full, .)

# inference for random effects: lineage effect
glmer(germ ~ species*location + (1 | plot:location/species),
                 data = germ, family = binomial) %>%
  anova(fit_full, .)

# inference for random effects: plot effect
glmer(germ ~ species*location + (1 | family:species/location),
                 data = germ, family = binomial) %>%
  anova(fit_full, .)

# inference for random effects: plot/species interaction
glmer(germ ~ species*location + (1 | plot:location) + (1 | family:species),
                 data = germ, family = binomial) %>%
  anova(fit_full, .)

Flowering rates

Flowering rates could be computed in one of two ways:

Unconditional on germination, i.e., rate at which all toothpicks flowered
Conditional on germination, i.e., rate at which germinated toothpicks flowered

I’ve carried out parallel analyses using both rates. Personally, I think the conditional rates make more sense for the analysis.

Unconditional flowering rates

Below are comparisons of the unconditional flowering rates.

Code

# binary indicator for flowering (unconditional)
flow <- seedplot |>
  drop_na(stage) |>
  group_by(id, species, family, location, plot) |>
  summarize(flow = factor(sum(n.flower) > 0, 
                          levels = c(F, T), 
                          labels = c('no', 'yes')),
            .groups = 'drop')

Figure: Unconditional flowering rates by species, lineage, and location; average shown in blue dash.

Since some families have unconditional flowering rates of zero, it will not be possible to estimate lineage interactions. The analysis below uses the same model specification as that employed in the analysis of germination rates. (Codes are also identical.)

Inference for fixed effects indicates there is a species effect on unconditional flowering rates:

Code

# fit model
fit_flow <- glmer(flow ~ species*location + (1 | plot:location/species) + (1 | family:species),
                  data = flow, family = binomial)

Table: Inference for fixed effects
	Chisq	Df	Pr(>Chisq)
species	15.8	1	7.025e-05
location	0.1033	1	0.7479
species:location	3.173	1	0.07489

Inference for random effects indicates a lineage effect, a plot effect, and a species/plot interaction. Codes for these analyses are identical to those shown in the previous subsection on analysis of germination rates.

Table: Likelihood ratio test for lineage effect on unconditional flowering rates
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
.	6	1416	1450	-701.9	1404	NA	NA	NA
fit_flow	7	1400	1439	-692.9	1386	17.98	1	2.236e-05

Table: Likelihood ratio test for plot effect on unconditional flowering rates
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
.	5	1528	1557	-759.2	1518	NA	NA	NA
fit_flow	7	1400	1439	-692.9	1386	132.6	2	1.637e-29

Table: Likelihood ratio test for species/plot interaction effect on unconditional flowering rates
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
.	6	1406	1440	-696.9	1394	NA	NA	NA
fit_flow	7	1400	1439	-692.9	1386	7.928	1	0.004868

Higher-order interactions produce numerical instability in the model fitting and so are omitted. Estimates of unconditional flowering rates for each species/location combination are given below.

Table: Estimated (unconditional) flowering probabilities and 95% confidence intervals
species	location	prob	std.error	conf.low	conf.high
LLA	rock_outcrop	0.1265	0.0613	0.04653	0.3005
LLP	rock_outcrop	0.02323	0.01351	0.00735	0.07097
LLA	seep	0.1163	0.05412	0.04476	0.2697
LLP	seep	0.04893	0.02549	0.01729	0.1308

Graphically:

Marginal flowering rates by species (averaged over location) and location (averaged over species) are shown below.

Table: Estimated marginal unconditional flowering rates by species (averaged over location)
species	prob	std.error	conf.low	conf.high
LLA	0.1213	0.04285	0.05905	0.2328
LLP	0.0338	0.01389	0.01497	0.0745

Table: Estimated marginal unconditional flowering rates by location (averaged over species)
location	prob	std.error	conf.low	conf.high
rock_outcrop	0.05543	0.02794	0.0202	0.1431
seep	0.07602	0.03512	0.02995	0.1798

Lastly, factorial contrasts yield comparisons of marginal flowering rates between (a) locations and (b) species.

Table: Factorial contrasts for marginal species and location effects on unconditional flowering rates
contrast	odds.ratio	SE	conf.low	conf.high
loc	1.402	0.9868	0.3529	5.57
spp	3.945	1.312	2.056	7.57

This quantifies the significant species effect as follows: with 95% confidence, the odds of flowering are estimated to be between 2.06 and 7.57 times higher among annuals compared with perennials (point estimate: 3.95).

Conditional flowering rates

Below are comparisons of the conditional flowering rates.

Code

# binary indicator for flowering (conditional on germination)
cflow <- seedplot |>
  drop_na(stage) |>
  filter(stage > 'ungerminated') |> # notice filter here for conditional rate
  group_by(id, species, family, location, plot) |>
  summarize(flow = factor(sum(n.flower) > 0, 
                          levels = c(F, T), 
                          labels = c('no', 'yes')),
            .groups = 'drop')

Figure: Conditional flowering rates by species, lineage, and location; average shown in blue dash.

Since the conditional rates are determined after filtering out ungerminated plants, the relevant sample sizes change. After filtering:

Table: Sample sizes for conditional flowering rates by species, lineage, and plot
species	family	location	1	2	3	4	5	6	7	8	9	10
LLA	A12x19	rock_outcrop	NA	5	NA	3	3	2	3	4	4	4
LLA	A12x9	rock_outcrop	2	3	NA	4	6	4	1	7	6	4
LLA	A19x12	rock_outcrop	2	3	NA	4	6	4	3	2	2	5
LLA	A24x30	rock_outcrop	NA	4	NA	1	2	3	NA	1	3	6
LLA	A30x3	rock_outcrop	2	5	NA	6	3	2	1	4	5	4
LLA	A33x24	rock_outcrop	NA	5	NA	7	3	2	4	1	5	6
LLA	A3x21	rock_outcrop	2	5	NA	4	4	6	1	3	4	4
LLA	A7x10	rock_outcrop	NA	4	NA	3	5	3	2	4	4	3
LLA	A9x33	rock_outcrop	1	5	NA	2	4	4	1	4	3	3
LLP	P10x7	rock_outcrop	1	3	NA	4	4	2	2	4	4	4
LLP	P13x14	rock_outcrop	1	4	NA	3	5	5	4	5	5	3
LLP	P14x16	rock_outcrop	NA	4	NA	5	7	2	4	4	3	3
LLP	P15x1	rock_outcrop	1	4	NA	5	5	5	5	2	3	3
LLP	P17x6	rock_outcrop	2	4	NA	6	5	1	6	5	6	5
LLP	P1x8	rock_outcrop	2	4	NA	5	4	6	4	5	5	3
LLP	P20x4	rock_outcrop	1	6	NA	4	5	2	1	5	5	5
LLP	P6x20	rock_outcrop	1	4	NA	2	3	2	4	3	1	2
LLP	P9x11	rock_outcrop	1	3	NA	3	5	4	3	3	5	4
LLA	A12x19	seep	2	4	4	4	2	1	4	2	3	5
LLA	A12x9	seep	2	3	4	5	6	2	5	5	4	5
LLA	A19x12	seep	3	3	4	2	1	1	3	4	2	4
LLA	A24x30	seep	3	3	3	4	7	NA	3	5	1	5
LLA	A30x3	seep	2	5	4	4	5	1	6	4	4	3
LLA	A33x24	seep	3	3	2	3	2	1	4	2	3	6
LLA	A3x21	seep	1	4	2	4	3	NA	2	6	5	5
LLA	A7x10	seep	2	5	6	5	2	1	6	3	3	5
LLA	A9x33	seep	1	3	1	3	3	2	3	4	2	5
LLP	P10x7	seep	2	3	5	2	4	3	4	3	3	4
LLP	P13x14	seep	1	4	6	3	4	NA	5	6	4	5
LLP	P14x16	seep	3	5	3	6	5	3	6	6	4	6
LLP	P15x1	seep	2	4	4	2	4	1	6	3	2	6
LLP	P17x6	seep	4	4	6	3	4	1	5	5	4	6
LLP	P1x8	seep	3	4	4	3	4	2	5	4	4	5
LLP	P20x4	seep	2	4	3	4	3	2	3	5	3	4
LLP	P6x20	seep	4	3	3	2	3	NA	4	4	3	6
LLP	P9x11	seep	1	5	4	2	5	2	NA	3	2	5

Because some plots have no germinated plants of particular lineages, it will not be possible to estimate the plot/lineage interaction. However, the row totals are sufficient that the lineage effect is still estimable.

Inference for fixed effects (again using the same model specification employed for analysis of germination rates and the same codes) indicates there is a species effect on conditional flowering rates (no change compared with analysis of unconditional rates):

Table: Inference for fixed effects on conditional flowering rates
	Chisq	Df	Pr(>Chisq)
species	20.61	1	5.625e-06
location	0.2234	1	0.6364
species:location	2.692	1	0.1008

Inference for random effects indicates a plot effect and a species/plot interaction and a lineage effect (no change compared with the analysis of unconditional flowering rates):

Table: Likelihood ratio test for lineage effect on conditional flowering rates
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
.	6	1110	1141	-549	1098	NA	NA	NA
fit_cflow	7	1093	1128	-539.3	1079	19.39	1	1.065e-05

Table: Likelihood ratio test for plot effect on conditional flowering rates
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
.	5	1174	1199	-581.8	1164	NA	NA	NA
fit_cflow	7	1093	1128	-539.3	1079	84.88	2	3.695e-19

Table: Likelihood ratio test for species/plot interaction effect on conditional flowering rates
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
.	6	1099	1130	-543.7	1087	NA	NA	NA
fit_cflow	7	1093	1128	-539.3	1079	8.755	1	0.003087

Estimates of conditional flowering rates for each species/location combination are given below.

Table: Estimated (conditional) flowering probabilities and 95% confidence intervals
species	location	prob	std.error	conf.low	conf.high
LLA	rock_outcrop	0.2623	0.09818	0.1162	0.4901
LLP	rock_outcrop	0.04167	0.02202	0.01454	0.1136
LLA	seep	0.253	0.09095	0.1165	0.4652
LLP	seep	0.08892	0.04078	0.03512	0.2075

Graphically:

Marginal flowering rates by species (averaged over location) and location (averaged over species) are shown below.

Table: Marginal conditional flowering rates by species (averaged over location)
species	prob	std.error	conf.low	conf.high
LLA	0.2576	0.07143	0.143	0.4191
LLP	0.06116	0.02293	0.02893	0.1247

Table: Marginal conditional flowering rates by location (averaged over species)
location	prob	std.error	conf.low	conf.high
rock_outcrop	0.1106	0.04691	0.04655	0.2405
seep	0.1538	0.05784	0.07072	0.3028

Lastly, factorial contrasts yield comparisons of marginal flowering rates between (a) locations and (b) species.

Table: Factorial contrasts for marginal species and location effects on conditional flowering rates
contrast	odds.ratio	SE	conf.low	conf.high
loc	1.462	0.9019	0.4365	4.898
spp	5.326	1.909	2.638	10.75

This quantifies the significant species effect as follows: with 95% confidence, the odds of flowering are estimated to be between 2.63 and 10.75 times higher among annuals compared with perennials (point estimate: 5.33).

Fruiting rates

Fruiting rates could be computed in one of three ways:

Unconditional on germination, i.e., rate at which all toothpicks produced fruit
Conditional on germination, i.e., rate at which germinated toothpicks produced fruit
Conditional on flowering, i.e., rate at which flowered plants produced fruit

Below I’ve only shown the analysis for option (3). To explore options 1-2, remove or change the filter on life stage in the codes shown below.

Code

# binary indicator for fruiting (conditional on flowering)
fruit <- seedplot |>
  drop_na(stage) |>
  filter(stage > 'germinated') |> # notice filter here for conditional rate
  group_by(id, species, family, location, plot) |>
  summarize(fruit = factor(sum(n.fruit) > 0, 
                          levels = c(F, T), 
                          labels = c('no', 'yes')),
            .groups = 'drop')

Figure: Conditional fruiting rates by species, lineage, and location; average shown in blue dash.

Since some families have conditional fruiting rates of zero, it will not be possible to estimate lineage interactions. The analysis below uses the same model specification as that employed in the analysis of germination rates. (Codes are also identical.)

Inference for fixed effects indicates there is a species effect on conditional fruiting rates:

Table: Inference for fixed effects on conditional fruiting rate
	Chisq	Df	Pr(>Chisq)
species	8.955	1	0.002767
location	0.1853	1	0.6668
species:location	2.163	1	0.1413

Inference for random effects indicates a plot effect and a species/plot interaction but no lineage effect:

Table: Likelihood ratio test for lineage effect on conditional fruiting rates
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
.	6	672.2	702.6	-330.1	660.2	NA	NA	NA
fit_fruit	7	672.6	708	-329.3	658.6	1.596	1	0.2064

Table: Likelihood ratio test for plot effect on conditional fruiting rates
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
.	5	707.6	732.9	-348.8	697.6	NA	NA	NA
fit_fruit	7	672.6	708	-329.3	658.6	39.01	2	3.377e-09

Table: Likelihood ratio test for species/plot interaction effect on conditional fruiting rates
	npar	AIC	BIC	logLik	-2*log(L)	Chisq	Df	Pr(>Chisq)
.	6	676.2	706.5	-332.1	664.2	NA	NA	NA
fit_fruit	7	672.6	708	-329.3	658.6	5.588	1	0.01809

Higher-order interactions produce numerical instability in the model fitting and so are omitted. Estimates of conditional fruiting rates for each species/location combination are given below.

Table: Estimated (conditional) fruiting probabilities and 95% confidence intervals
species	location	prob	std.error	conf.low	conf.high
LLA	rock_outcrop	0.1183	0.04827	0.05135	0.2494
LLP	rock_outcrop	0.02249	0.01223	0.007673	0.06406
LLA	seep	0.06554	0.02804	0.02779	0.1468
LLP	seep	0.03452	0.01773	0.01245	0.09207

Graphically:

Marginal fruiting rates by species (averaged over location) and location (averaged over species) are shown below.

Table: Estimated marginal conditional fruiting rates by species (averaged over location)
species	prob	std.error	conf.low	conf.high
LLA	0.08841	0.02763	0.0472	0.1596
LLP	0.02788	0.01113	0.01267	0.06026

Table: Estimated marginal conditional fruiting rates by location (averaged over species)
location	prob	std.error	conf.low	conf.high
rock_outcrop	0.05262	0.02147	0.02333	0.1144
seep	0.04769	0.01888	0.02169	0.1016

Lastly, factorial contrasts yield comparisons of marginal fruiting rates between (a) locations and (b) species.

Table: Factorial contrasts for marginal species and location effects on conditional fruiting rates
contrast	odds.ratio	SE	conf.low	conf.high
loc	0.9015	0.4997	0.3042	2.671
spp	3.382	1.364	1.534	7.455

This quantifies the significant species effect as follows: with 95% confidence, the odds of fruiting are estimated to be between 1.53 and 7.46 times higher among annuals compared with perennials (point estimate: 3.38).

Analysis of time-to-event data

If you’re interested in time patterns, it’s fairly easy to obtain first dates for each stage. Here’s what that looks like for one individual that saw all life stages:

Code

# earliest dates for each life stage
first_dates <- seedplot |>
  drop_na(stage) |>
  group_by(id, species, family, location, plot, replicate, stage) |>
  slice_min(date) |>
  select(-starts_with('n'), -week) |>
  ungroup()

# preview
first_dates |>
  filter(id == 140) |>
  pander(caption = 'Table: dates to life stage events for individual no. 140')

Table: dates to life stage events for individual no. 140
id	species	family	location	plot	replicate	date	stage
140	LLP	P15x1	rock_outcrop	2	4	2024-01-25	ungerminated
140	LLP	P15x1	rock_outcrop	2	4	2024-02-01	germinated
140	LLP	P15x1	rock_outcrop	2	4	2024-04-12	flowered
140	LLP	P15x1	rock_outcrop	2	4	2024-05-31	fruited
140	LLP	P15x1	rock_outcrop	2	4	2024-06-07	died

This can be rearranged to calculate times to each stage. Here are some examples for individuals who reached different life stages:

Code

# time to event data
durations <- first_dates |>
  pivot_wider(names_from = stage, values_from = date) |>
  rename(first.obs = ungerminated) |>
  mutate(time.to.germ = germinated - ymd("2024-01-17"),
         time.to.flower = flowered - ymd("2024-01-17"),
         time.to.fruit = fruited - ymd("2024-01-17"),
         time.to.death = died - ymd("2024-01-17"))

Table continues below
id	first.obs	germinated	flowered	fruited	died
25	2024-01-25	2024-02-08	NA	NA	2024-03-14
131	2024-01-25	2024-03-21	2024-05-03	NA	2024-05-09
140	2024-01-25	2024-02-01	2024-04-12	2024-05-31	2024-06-07

time.to.germ	time.to.flower	time.to.fruit	time.to.death
22 days	NA	NA	57 days
64 days	107 days	NA	113 days
15 days	86 days	135 days	142 days

Note

I manually input the start of the experiment as 1/17/25 because this was the earliest date that appeared in the metadata, and is consistent with the week 1 observation being 1/25/25. Client should confirm and adjust start date if needed so that durations are accurate.

Here are distributions of times to each event by species and location:

Figure: Distributions of times to life stage events by species and location

It’s perhaps not the most efficient comparison, but one can see:

approximate time ranges for each event
longer survival times and later flowering and fruiting in seep plots

Survival analysis methods can be used to estimate the (instantaneous) probability of germination/flowering/fruiting/death as a function of time. I focused on a simple approach using Kaplan-Meier estimators after grouping by location and species and ignoring plot and lineage effects. Mixed-effects Cox regression could be used for a proper analysis accounting for the experimental design with a specification analogous to that used above in modeling event rates in aggregate across the study period, but since this is a secondary analysis, I’ve just shown some simple exploratory work.

Missing durations are right-censored – all we know is that the event of interest occurred (or would have occurred) sometime after the end of the study. To format the data correctly, I’ve replaced missing durations with the number of days that the study persisted (442) and added an indicator of whether the corresponding observation is censored (i = 0) or not (i = 1).

Table: data formatted for analysis of time-to-event data
id	species	location	t.g	t.fl	t.fr	t.d	i.g	i.d
25	LLA	rock_outcrop	22	442	442	57	1	1
30	LLA	rock_outcrop	22	442	442	36	1	1
35	LLA	rock_outcrop	36	442	442	86	1	1
38	LLA	rock_outcrop	22	442	442	64	1	1
42	LLP	rock_outcrop	22	442	442	29	1	1
46	LLP	rock_outcrop	36	442	442	51	1	1

I’ve provided a copy of the data formatted in this way as seedplot-times.csv.

Code

# read in time to event data
tdata <- read_csv('seedplot-times.csv')

A quick caveat: it’s somewhat awkward to represent germination, flowering, and fruiting as right-censored since this implicitly assumes that these events would eventually occur, which is not true in this system. This is more conceptual than analytical – what will happen in the analysis is that after sufficient time the estimated instantaneous event probabilities will plateau at the long-run event rates.

Germination

Modeling the instantaneous probability of germination as a function of time since the start of the study shows relatively quick convergence to aggregate germination rates:

The bands show pointwise 95% confidence intervals, and the overlap is consistent with the prior analysis which found no effect of species or location on germination rates. This just displays that conclusion a bit differently. There doesn’t look to be much difference in the rate at which the germination probability increases with time, though the estimate for LLP does increase slightly faster in the rock outrcop microhabitat.

Flowering

For modeling time to flowering, I used:

Flowering occurrences conditional on germination, i.e., all ungerminated individuals were removed prior to analysis
The time since germination (rather than the time since the start of the experiment)

This yields the following estimates of the instantaneous probability of flowering as a function of time since germination:

There is a clear and persistent difference in flowering probability by species: annuals flower at higher rates, as one might expect. While there is no long-run apparent difference in flowering probability by microhabitat, the flowering probability increases faster in the rock outcrop microhabitat. This reflects the descriptive finding earlier that flowering occurs later in the seep microhabitat.

Fruiting

Similar to flowering, I estimated the instantaneous conditional probability of fruiting given germination, as a function of the time since germination.

Note

Client may prefer to model the conditional probability of fruiting given flowering as a function of time since flowering. If so, change Surv(t.fr - t.g, i.fr) to Surv(t.fr - t.fl, i.fr) in the model specification.

Interestingly, there appears to be an interaction effect here: depending on species, the rate of increase and plateau differs between environments.

Survival

Finally, an analysis of survival times after germination is given below. The figure shows the probability of survival as a function of time, and reflects the descriptive finding earlier that plants survived longer in the seep microhabitat.

Reproducibility

I’ve provided a copy of all codes used to generate analyses shown above starting from the client’s spreadsheet. In addition, I generated two derivative files with reformatted data:

seedplot.csv has data arranged for event occurrence rate analysis
seedplot-time.csv has data arranged for time-to-event analysis

These files are available (only client has access permissions) in a compressed folder here (you may need to right-click to open).

--- title: "Monkeyflower reciprocal transplant study" author: "Avin Niknafs" author-title: "CLIENT" date: "8/19/25" published-title: "DATE" format: html: code-fold: true code-tools: true toc: true execute: warning: false message: false echo: false --- ```{r} # setup library(tidyverse) library(lubridate) library(grafify) library(lme4) library(lmerTest) library(pander) library(emmeans) library(survival) ``` From client: > I am conducting a reciprocal transplant study to test for local adaptation between two monkeyflower species in distinct microhabitats at a sympatric site on the central coast of California. My project involves analyzing germination rates, seedling survival of the two species, and reproductive success, and if possible relating those outcomes to weekly soil moisture data. Seeded toothpicks were planted and surveyed weekly for roughly one year. Study design: - 10 plots in each of 2 microhabitat environments (rock outcrop and seep) - 2 monkeyflower species (annual and perennial) - 9 lineages per species - 6 replicates of each lineage were randomly allocated to positions in each plot In total, this amounts to 108 toothpicks (6 reps x 9 lineages x 2 species) per plot and a total sample size of $n = 2160$ (108 per plot x 10 plots x 2 microhabitats). Primary interest is in species/environment interaction effects on fitness. Goals set during initial meeting: 1. Formulate an analysis-ready rearrangment of seed plot data. 2. Recommend an approach for analysis of germination rates. ::: callout-note Questions/issues needing client's attention are shown throughout like this. ::: ## Seed plot data Data provided are shown below. I removed plot index A4_10_3, which was missing identifying information for species and lineage. I suspect this may have been miscoded, since after removal $n = 2159$, which is one short of expected based on the study design. ::: callout-note Check plot coding for index A4_10_3. Was this miscoded? ::: For the three individuals shown, values are missing in all weeks because the toothpicks never germinated. ```{r} # read in seed plot data raw <- here::here("25su-niknafs", "FieldData2024.xlsx") |> readxl::read_xlsx(sheet = "Seed_Plots", col_types = "text") |> rename_with(tolower) |> filter(plotind != "A4_10_3") # removed due to missing species/lineage/etc # preview head(raw, 3) |> pander(caption = 'Table: raw data') ``` To verify that the data reflect the study design, the following tables show distinct combinations of study attributes and sample sizes per plot: ```{r} # 36 distinct treatment combinations # (9 lineages x 2 species x 2 locations) raw |> distinct(species, family, location) |> arrange(species, family, location) |> count(location, species) |> spread(species, n) |> pander(caption = 'Table: Number of unique lineages per location/species') # sample sizes per plot raw |> count(plot, location) |> arrange(location) |> group_by(location) |> mutate(plot = consecutive_id(plot)) |> spread(location, n) |> pander(caption = 'Table: Sample sizes per plot') ``` The values entered for each individual each week are: - missing if ungerminated, dead, or otherwise not present - 1 if germinated - 2 if flowering, with number of repetitions indicating number of flowers (*e.g.*, 222 means three flowers) - 3 if alive post-flowering - 4 if fruiting, with number of repetitions indicating number of fruits (*e.g.*, 44 means two fruits) - X if missing for reasons other than death - combinations as appropriate (*e.g.*, 222 4 indicates three flowers and one fruit) The unique encodings present in the data are shown below: ```{r} # unique encodings raw |> select(starts_with('week')) |> gather() |> pull(value) |> unique() ``` The decimals are artefacts of value encoding in excel when the entry can be parsed as numeric. So for example `2.22222222E8` is nine 2's rewritten in scientific notation ($2.22222222 \times 10^8$). ## Reformatting for analysis I've cleaned up the encoding and data formatting for analysis as follows: - rearrange in 'long' format with a variable indicating week number - append sampling dates from the `CensusDate` sheet - derive a `stage` variable encoding life stage achieved by the time of observation: + `ungerminated` if no values were entered prior to the observation date + `germinated` if a '1' is recorded and no other values were entered prior to the observation date + `flowered` if a '2' was entered prior to or on the observation date + `fruited` if a '4' was entered prior to or on the observation date + `died` if value is missing on the current observation date but at least one nonmissing value was entered prior to the observation date + NA if an 'X' was entered prior to or on the observation date - derive counts of numbers of fruits and flowers - add a unique identifier `id` per toothpick ```{r} # read in sampling dates dates <- here::here("25su-niknafs", "FieldData2024.xlsx") |> readxl::read_xlsx(sheet = "CensusDate", col_types = c('text', 'date')) |> rename_with(tolower) |> mutate(week = paste('week', week, sep = ''), date = as_date(date)) # rearrange data and export raw |> select(species, family, location, plot, replicate, starts_with('week')) |> pivot_longer(starts_with('week'), names_to = 'week', values_to = 'code.raw') |> left_join(dates, join_by(week)) |> mutate(across(c(plot, replicate), as.numeric), id = consecutive_id(species, family, location, plot, replicate)) |> mutate(code.clean = str_remove(code.raw, 'E.*') |> str_remove('\\.') |> str_remove('0') |> str_remove_all('[:punct:]') |> str_squish()) |> mutate(flag1 = str_detect(code.clean, '1'), flag2 = str_detect(code.clean, '2'), flag3 = str_detect(code.clean, '3'), flag4 = str_detect(code.clean, '4'), flagx = str_detect(code.clean, 'X'), flagna = is.na(code.clean)) |> mutate(across(starts_with('flag'), ~replace_na(.x, F))) |> arrange(id, date) |> group_by(id) |> mutate(germ = cumsum(flag1) > 0, flow = cumsum(flag2) > 0, frui = cumsum(flag4) > 0, miss = cumsum(flagx) > 0, stage.num = germ + flow + frui, stage.desc = case_when(stage.num == 0 ~ 'ungerminated', (stage.num == 1) & (!flagna) ~ 'germinated', (stage.num == 2) & (!flagna) ~ 'flowered', (stage.num == 3) & (!flagna) ~ 'fruited', (stage.num > 0) & flagna ~ 'died'), stage = if_else(miss, NA, stage.desc), n.flower = str_count(code.clean, '2') |> replace_na(0), n.fruit = str_count(code.clean, '4') |> replace_na(0)) |> select(id, species, family, location, plot, replicate, week, date, stage, n.flower, n.fruit) |> write_csv(file = here::here('25su-niknafs', 'seedplot.csv')) # read in seedplot data seedplot <- here::here('25su-niknafs', 'seedplot.csv') |> read_csv() |> mutate(stage = factor(stage, levels = c('ungerminated', 'germinated', 'flowered', 'fruited', 'died'), ordered = T)) # preview head(seedplot) |> pander(caption = 'Table: Example rows of analysis-ready data') ``` I've provided the data in this form to client as `seedplot.csv` for direct import into R. I'd recommend explicitly encoding the life stage variable in the correct order, as shown below. ```{r, echo = T} seedplot <- read_csv('seedplot.csv') |> mutate(stage = factor(stage, levels = c('ungerminated', 'germinated', 'flowered', 'fruited', 'died'), ordered = T)) ``` ::: callout-note I considered adding a `curr.stage` or `curr.status` variable that is the same as `stage` but indicates *current* life stage at the time of observation: ungerminated, germinated, flowering, fruiting, living, dead. This information is captured implicitly by the combination of flower/fruit counts and the existing `stage` variable, but would it be useful to encode it explicitly? ::: ## Descriptive analysis Comparing progressions through life stages shows: - almost all germination occurs by 2/15/2024 - slightly higher germination rate for perennial compared with annual species - no survival at rock outcrop after 6/25/2024 - perennials generally don't flower/fruit except in seep environment after 6/25 - one rock outcrop plot missing, but otherwise few missing values ```{r, echo = T} #| fig-width: 10 #| fig.height: 6 #| fig-cap: "Figure: Life stage progression by location and species (missing data removed)" # life stage over time seedplot |> count(date, stage, location, species) |> ggplot(aes(x = factor(date), y = n, fill = stage)) + facet_grid(location ~ species) + geom_bar(position = 'fill', stat = 'identity', alpha = 0.9) + scale_fill_grafify() + scale_y_continuous(expand = rep(0, 2)) + theme_bw() + theme(panel.grid = element_blank(), panel.spacing = unit(0.15, 'in'), axis.text.x = element_text(angle = 90, vjust = 0.5)) + labs(x = 'date', y = 'proportion of individuals') ``` Here's the same figure but with normalization after removing missing data: ```{r, echo = T} #| fig-width: 10 #| fig.height: 6 #| fig-cap: "Figure: Life stage progression by location and species (missing data removed)" # life stage over time seedplot |> count(date, stage, location, species) |> drop_na(stage) |> ggplot(aes(x = factor(date), y = n, fill = stage)) + facet_grid(location ~ species) + geom_bar(position = 'fill', stat = 'identity', alpha = 0.9) + scale_fill_grafify() + scale_y_continuous(expand = rep(0, 2)) + theme_bw() + theme(panel.grid = element_blank(), panel.spacing = unit(0.15, 'in'), axis.text.x = element_text(angle = 90, vjust = 0.5)) + labs(x = 'date', y = 'proportion of individuals') ``` ## Analysis of rate data Analysis of event occurrence rates aggregated over the study period models the probability that the individual germinated/flowered/fruited at any point during the study period, possibly conditional on prior life stages. For this, individuals are dropped if missing *throughout* the study, but if nonmissing observations were recorded for any period of time, the individual is retained through the last stage for which values are recorded. The framework for this analysis is a generalized linear mixed model for the relevant event probability. ### Germination rates I've added an indicator of whether germination occurred at some time during the study. Here are some examples of the indicator encoding: ```{r, echo = T} # binary indicator for germination germ <- seedplot |> drop_na(stage) |> group_by(id, species, family, location, plot) |> summarize(last.stage = max(stage), .groups = 'drop') |> mutate(germ = fct_collapse(last.stage, no = 'ungerminated', other_level = 'yes') |> factor(ordered = F)) ``` ```{r} germ |> group_by(last.stage) |> slice_head(n = 1) |> pander(caption = 'Table: Example germination indicator encoding') ``` The figure below shows germination rates (aggregated over plots) by lineage, location, and species, with the location/species average overlaid as a vertical dashed line. ```{r} #| fig-cap: "Figure: Germination rates by species, lineage, and location; average shown in blue dash." # compute group means for primary study arms mean_grates <- germ |> group_by(species, location) |> summarize(prop.germ = mean(germ == 'yes'), .groups = 'drop') # plot germ |> group_by(species, location, family) |> summarize(prop.germ = mean(germ == 'yes'), .groups = 'drop') |> ggplot(aes(y = fct_reorder(family, .x = prop.germ), x = prop.germ)) + facet_wrap(species~location, scales = 'free_y') + geom_point() + geom_linerange(aes(xmin = 0, xmax = prop.germ)) + theme_bw() + theme(panel.grid.minor = element_blank(), panel.grid.major.y = element_blank()) + labs(x = 'germination rate', y = NULL) + geom_vline(aes(xintercept = prop.germ), linetype = 'dashed', color = 'blue', data = mean_grates) ``` My observations: - no apparent location effect on germination rates in aggregate, but possible location/lineage interactions + usually slightly higher rates in seep environments, but not uniformly; *e.g.*, P9x11 does better in rock outcrop) + if there are location/lineage interactions, the location effect may not be significant; few differences in germination rates exceeding 5-10% - apparent species effect -- higher germination rates for perennials Interestingly, there is a bit of a plot effect *and* a possible plot/species interaction. This suggests that there may be unaccounted-for microhabitat features that influence fitness. ```{r} #| fig-height: 3 #| fig-cap: "Figure: Plot effects and plot/species interactions" # look for plot effect germ |> group_by(species, plot, location) |> summarize(prop.germ = mean(germ == 'yes'), .groups = 'drop') |> ggplot(aes(x = species, y = prop.germ)) + facet_wrap(~location) + geom_point() + geom_path(aes(group = plot)) + theme_bw() + theme(panel.grid.minor = element_blank(), panel.grid.major.x = element_blank()) + labs(y = 'germination rate') ``` #### Analysis ignoring lineage Let $Y_{ijkl}$ denote the whether germination occurred for the $l$th replicate of species $i$ in plot $k$ in microhabitat type $j$. A model for the data, ignoring any effect of lineage, is: $$ \begin{cases} Y_{ijkl} \sim \text{Bernoulli}(p_{ijk}) \\ \log\left(\frac{p_{ijk}}{1 - p_{ijk}}\right) = \text{spp}_i + \text{mh}_j + \left(\text{spp}\times\text{mh}\right)_{ij} + \text{plot}_{k(j)} + \left(\text{plot}\times\text{spp}\right)_{ik(j)} \end{cases} $$ Fixed effects: - species $\text{spp}_i$ - microhabitat $\text{mh}_j$ - species/microhabitat interaction $\left(\text{spp}\times\text{mh}\right)_{ij}$ Random effects: - plot (nested within microhabitat) $\text{plot}_{k(j)}$ - plot/species interaction (nested within microhabitat) $\left(\text{plot}\times\text{spp}\right)_{ik(j)}$ To fit the model, use `lme4::glmer`: ```{r, echo = T} # fit model fit <- glmer(germ ~ species*location + (1 | plot:location/species), data = germ, family = binomial) ``` This produces the inference: ```{r, echo = T} # tests for fixed effects car::Anova(fit, type = 'II') |> pander(caption = 'Table: Type II Wald tests for fixed effects') # test for plot effect glm(germ ~ species*location, data = germ, family = binomial) %>% anova(fit, ., type = 'LRT') |> pander(caption = 'Table: Likelihood ratio test for plot effect') # test for plot/species interaction glmer(germ ~ species*location + (1 | plot:location), data = germ, family = binomial) %>% anova(fit, ., type = 'LRT') |> pander(caption = 'Table: Likelihood ratio test for plot/species interaction') ``` In summary: - the data provide evidence of a species effect ($\chi^2_1 = 5.631, p = 0.01765$) and a plot effect ($\chi^2_2 = 168.1, p < 0.0001$) - the data do not provide evidence of a location effect or a species/plot interaction #### Adjusting for lineage Let $Y_{ijklm}$ denote the whether germination occurred for the $m$th replicate of lineage $l$ of species $i$ in plot $k$ in microhabitat type $j$. Agumenting the model above to account for lineage yields: $$ \begin{cases} Y_{ijklm} \sim \text{Bernoulli}(p_{ijk}) \\ \log\left(\frac{p_{ijkl}}{1 - p_{ijkl}}\right) = \text{spp}_i + \text{mh}_j + \left(\text{spp}\times\text{mh}\right)_{ij} + \text{plot}_{k(j)} + \left(\text{plot}\times\text{spp}\right)_{ik(j)} + \text{lin}_{l(i)} \end{cases} $$ Fixed effects: - species $\text{spp}_i$ - microhabitat $\text{mh}_j$ - species/microhabitat interaction $\left(\text{spp}\times\text{mh}\right)_{ij}$ Random effects: - plot (nested within microhabitat) $\text{plot}_{k(j)}$ - plot/species interaction (nested within microhabitat) $\left(\text{plot}\times\text{spp}\right)_{ik(j)}$ - **lineage (nested within species) $\text{lin}_{l(i)}$** To fit the model: ```{r, echo = T} # fit model fit_lin <- glmer(germ ~ species*location + (1 | plot:location/species) + (1 | family:species), data = germ, family = binomial) ``` By comparison with the initial model, there is evidence of an effect of lineage: ```{r, echo = T} # test for effect of lineage anova(fit_lin, fit) |> pander(caption = 'Table: Likelihood ratio test for lineage effect') ``` And interestingly, the effect of species vanishes when lineage is accounted for: ```{r, echo = T} # tests for fixed effects car::Anova(fit_lin, type = 'II') |> pander(caption = 'Table: Type II Wald tests for fixed effects') ``` Inference on the other fixed effects -- location and location/species interactions -- are unchanged. Likewise for inferences on random effects -- plot and plot/species interaction. ```{r} # test for plot effect glmer(germ ~ species*location + (1 | family:species), data = germ, family = binomial) %>% anova(fit_lin, ., type = 'LRT') |> pander(caption = 'Table: Likelihood ratio test for plot effect') # test for plot/species interaction glmer(germ ~ species*location + (1 | plot:location) + (1 | family:species), data = germ, family = binomial) %>% anova(fit_lin, ., type = 'LRT') |> pander(caption = 'Table: Likelihood ratio test for plot/species interaction') ``` In summary: - the data provide evidence of a lineage effect ($\chi^2_1 = 14.16, p = 0.0001677$) and a plot effect ($\chi^2_2 = 172.2, p < 0.0001$) - the data do not provide evidence of a species effect, a location effect, or location/species or plot/species interactions As an aside, I also tested for lineage interactions with plot and location (separately), and neither were significant. These are shown below. ```{r} # test for lineage/plot interaction glmer(germ ~ species*location + (1 | plot:location/species) + (1 | family:species) + (1 | family:species:plot:location), data = germ, family = binomial) %>% anova(fit_lin) |> pander(caption = 'Table: Likelihood ratio test for lineage/plot interaction') # test for lineage/location interaction glmer(germ ~ species*location + (1 | plot:location/species) + (1 | family:species/location), data = germ, family = binomial) %>% anova(fit_lin) |> pander(caption = 'Table: Likelihood ratio test for lineage/location interaction') ``` My recommendation would be to omit interaction terms involving lineage for model parsimony, but a full model could include both interactions above. This may be desirable for interpretability, but won't change the inferences about effects. I checked but suppressed the analyses -- codes are shown below. ```{r, eval = F, echo = T} # full model with all interactions fit_full <- glmer(germ ~ species*location + (1 | plot:location/species) + (1 | family:species/location) + (1 | family:species:location:plot), data = germ, family = binomial) # inference for fixed effects car::Anova(fit_full, type = 'II') # inference for random effects: lineage/plot interaction glmer(germ ~ species*location + (1 | plot:location/species) + (1 | family:species/location), data = germ, family = binomial) %>% anova(fit_full, .) # inference for random effects: lineage/location interaction glmer(germ ~ species*location + (1 | plot:location/species) + (1 | family:species) + (1 | family:species:location:plot), data = germ, family = binomial) %>% anova(fit_full, .) # inference for random effects: lineage effect glmer(germ ~ species*location + (1 | plot:location/species), data = germ, family = binomial) %>% anova(fit_full, .) # inference for random effects: plot effect glmer(germ ~ species*location + (1 | family:species/location), data = germ, family = binomial) %>% anova(fit_full, .) # inference for random effects: plot/species interaction glmer(germ ~ species*location + (1 | plot:location) + (1 | family:species), data = germ, family = binomial) %>% anova(fit_full, .) ``` ### Flowering rates Flowering rates could be computed in one of two ways: 1. **Unconditional** on germination, *i.e.*, rate at which all toothpicks flowered 2. **Conditional** on germination, *i.e.*, rate at which germinated toothpicks flowered I've carried out parallel analyses using both rates. Personally, I think the conditional rates make more sense for the analysis. #### Unconditional flowering rates Below are comparisons of the *unconditional* flowering rates. ```{r, echo = T} # binary indicator for flowering (unconditional) flow <- seedplot |> drop_na(stage) |> group_by(id, species, family, location, plot) |> summarize(flow = factor(sum(n.flower) > 0, levels = c(F, T), labels = c('no', 'yes')), .groups = 'drop') ``` ```{r} #| fig-cap: "Figure: Unconditional flowering rates by species, lineage, and location; average shown in blue dash." # compute group means for primary study arms mean_frates <- flow |> group_by(species, location) |> summarize(prop.flow = mean(flow == 'yes'), .groups = 'drop') # plot flow |> group_by(species, location, family) |> summarize(prop.flow = mean(flow == 'yes'), .groups = 'drop') |> ggplot(aes(y = fct_reorder(family, .x = prop.flow), x = prop.flow)) + facet_wrap(species~location, scales = 'free_y') + geom_point() + geom_linerange(aes(xmin = 0, xmax = prop.flow)) + theme_bw() + theme(panel.grid.minor = element_blank(), panel.grid.major.y = element_blank()) + labs(x = 'unconditional flowering rate', y = NULL) + geom_vline(aes(xintercept = prop.flow), linetype = 'dashed', color = 'blue', data = mean_frates) ``` Since some families have unconditional flowering rates of zero, it will not be possible to estimate lineage interactions. The analysis below uses the same model specification as that employed in the analysis of germination rates. (Codes are also identical.) Inference for fixed effects indicates there is a species effect on unconditional flowering rates: ```{r, echo = T} # fit model fit_flow <- glmer(flow ~ species*location + (1 | plot:location/species) + (1 | family:species), data = flow, family = binomial) ``` ```{r} # anova table for fixed effects (ignore the pander part) car::Anova(fit_flow, type = 'II') |> pander(caption = 'Table: Inference for fixed effects') ``` Inference for random effects indicates a lineage effect, a plot effect, and a species/plot interaction. Codes for these analyses are identical to those shown in the previous subsection on analysis of germination rates. ```{r} # test for lineage effect glmer(flow ~ species*location + (1 | plot:location/species), data = flow, family = binomial) %>% anova(fit_flow) |> pander(caption = 'Table: Likelihood ratio test for lineage effect on unconditional flowering rates') # test for plot effect glmer(flow ~ species*location + (1 | family:species), data = flow, family = binomial) %>% anova(fit_flow) |> pander(caption = 'Table: Likelihood ratio test for plot effect on unconditional flowering rates') # test for species/plot interaction glmer(flow ~ species*location + (1 | plot:location) + (1 | family:species), data = flow, family = binomial) %>% anova(fit_flow) |> pander(caption = 'Table: Likelihood ratio test for species/plot interaction effect on unconditional flowering rates') ``` Higher-order interactions produce numerical instability in the model fitting and so are omitted. Estimates of unconditional flowering rates for each species/location combination are given below. ```{r} # estimated unconditional flowering probabilities emmeans(fit_flow, specs = ~species*location, type = 'response') |> confint() |> broom::tidy() |> select(-df) |> pander(caption = 'Table: Estimated (unconditional) flowering probabilities and 95% confidence intervals') ``` Graphically: ```{r} # plot estimated probabilities emmeans(fit_flow, specs = ~species*location, type = 'response') |> confint() |> broom::tidy() |> ggplot(aes(x = species, y = prob, color = location)) + geom_point(position = position_dodge(width = 0.5)) + geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.1, position = position_dodge(width = 0.5)) + theme_bw() + labs(x = NULL, y = 'estimated unconditional flowering probability') ``` Marginal flowering rates by species (averaged over location) and location (averaged over species) are shown below. ```{r} # marginal flowering rates by species emmeans(fit_flow, specs = ~ species, type = 'response') |> confint() |> broom::tidy() |> select(-df) |> pander(caption = 'Table: Estimated marginal unconditional flowering rates by species (averaged over location)') # marginal flowering rates by location emmeans(fit_flow, specs = ~ location, type = 'response') |> confint() |> broom::tidy() |> select(-df) |> pander(caption = 'Table: Estimated marginal unconditional flowering rates by location (averaged over species)') ``` Lastly, factorial contrasts yield comparisons of marginal flowering rates between (a) locations and (b) species. ```{r} # factorial contrasts fac.eff.emmc <- function(levels, ...){ modelr::data_grid(flow, location, species) |> bind_cols(loc.contr = 0.5*rep(c(-1, 1), each = 2), spp.contr = 0.5*rep(c(1, -1), 2)) |> select(ends_with('contr')) |> rename_with(~str_remove_all(.x, '.contr')) } emmeans(fit_flow, specs = ~ species*location, type = 'response') |> contrast('fac.eff') |> confint() |> as_tibble() |> rename(conf.low = asymp.LCL, conf.high = asymp.UCL) |> select(-df) |> pander(caption = 'Table: Factorial contrasts for marginal species and location effects on unconditional flowering rates') ``` This quantifies the significant species effect as follows: with 95% confidence, the odds of flowering are estimated to be between 2.06 and 7.57 times higher among annuals compared with perennials (point estimate: 3.95). #### Conditional flowering rates Below are comparisons of the *conditional* flowering rates. ```{r, echo = T} # binary indicator for flowering (conditional on germination) cflow <- seedplot |> drop_na(stage) |> filter(stage > 'ungerminated') |> # notice filter here for conditional rate group_by(id, species, family, location, plot) |> summarize(flow = factor(sum(n.flower) > 0, levels = c(F, T), labels = c('no', 'yes')), .groups = 'drop') ``` ```{r} #| fig-cap: "Figure: Conditional flowering rates by species, lineage, and location; average shown in blue dash." # compute group means for primary study arms mean_cfrates <- cflow |> group_by(species, location) |> summarize(prop.flow = mean(flow == 'yes'), .groups = 'drop') # plot cflow |> group_by(species, location, family) |> summarize(prop.flow = mean(flow == 'yes'), .groups = 'drop') |> ggplot(aes(y = fct_reorder(family, .x = prop.flow), x = prop.flow)) + facet_wrap(species~location, scales = 'free_y') + geom_point() + geom_linerange(aes(xmin = 0, xmax = prop.flow)) + theme_bw() + theme(panel.grid.minor = element_blank(), panel.grid.major.y = element_blank()) + labs(x = 'flowering rate (conditional on germination)', y = NULL) + geom_vline(aes(xintercept = prop.flow), linetype = 'dashed', color = 'blue', data = mean_cfrates) ``` Since the conditional rates are determined after filtering out ungerminated plants, the relevant sample sizes change. After filtering: ```{r} # for display purposes plot_recode <- seedplot |> distinct(plot, location) |> arrange(location) |> group_by(location) |> mutate(plot.num = row_number()) |> ungroup() |> select(plot, plot.num) # sample sizes by plot cflow |> left_join(plot_recode, join_by(plot)) |> count(species, family, location, plot.num) |> spread(plot.num, n) |> arrange(location, species) |> pander(caption = 'Table: Sample sizes for conditional flowering rates by species, lineage, and plot') ``` Because some plots have no germinated plants of particular lineages, it will not be possible to estimate the plot/lineage interaction. However, the row totals are sufficient that the lineage effect is still estimable. Inference for fixed effects (again using the same model specification employed for analysis of germination rates and the same codes) indicates there is a species effect on conditional flowering rates (no change compared with analysis of unconditional rates): ```{r} # fit model fit_cflow <- glmer(flow ~ species*location + (1 | plot:location/species) + (1 | family:species), data = cflow, family = binomial) # anova table for fixed effects car::Anova(fit_cflow, type = 'II') |> pander(caption = 'Table: Inference for fixed effects on conditional flowering rates') ``` Inference for random effects indicates a plot effect and a species/plot interaction and a lineage effect (no change compared with the analysis of unconditional flowering rates): ```{r} # test for lineage effect glmer(flow ~ species*location + (1 | plot:location/species), data = cflow, family = binomial) %>% anova(fit_cflow) |> pander(caption = 'Table: Likelihood ratio test for lineage effect on conditional flowering rates') # test for plot effect glmer(flow ~ species*location + (1 | family:species), data = cflow, family = binomial) %>% anova(fit_cflow) |> pander(caption = 'Table: Likelihood ratio test for plot effect on conditional flowering rates') # test for species/plot interaction glmer(flow ~ species*location + (1 | plot:location) + (1 | family:species), data = cflow, family = binomial) %>% anova(fit_cflow) |> pander(caption = 'Table: Likelihood ratio test for species/plot interaction effect on conditional flowering rates') ``` Estimates of conditional flowering rates for each species/location combination are given below. ```{r} # estimated conditional flowering probabilities emmeans(fit_cflow, specs = ~species*location, type = 'response') |> confint() |> broom::tidy() |> select(-df) |> pander(caption = 'Table: Estimated (conditional) flowering probabilities and 95% confidence intervals') ``` Graphically: ```{r} # plot estimated probabilities emmeans(fit_cflow, specs = ~species*location, type = 'response') |> confint() |> broom::tidy() |> ggplot(aes(x = species, y = prob, color = location)) + geom_point(position = position_dodge(width = 0.5)) + geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.1, position = position_dodge(width = 0.5)) + theme_bw() + labs(x = NULL, y = 'estimated conditional flowering probability') ``` Marginal flowering rates by species (averaged over location) and location (averaged over species) are shown below. ```{r} # marginal flowering rates by species emmeans(fit_cflow, specs = ~ species, type = 'response') |> confint() |> broom::tidy() |> select(-df) |> pander(caption = 'Table: Marginal conditional flowering rates by species (averaged over location)') # marginal flowering rates by location emmeans(fit_cflow, specs = ~ location, type = 'response') |> confint() |> broom::tidy() |> select(-df) |> pander(caption = 'Table: Marginal conditional flowering rates by location (averaged over species)') ``` Lastly, factorial contrasts yield comparisons of marginal flowering rates between (a) locations and (b) species. ```{r} # factorial contrasts fac.eff.emmc <- function(levels, ...){ modelr::data_grid(flow, location, species) |> bind_cols(loc.contr = 0.5*rep(c(-1, 1), each = 2), spp.contr = 0.5*rep(c(1, -1), 2)) |> select(ends_with('contr')) |> rename_with(~str_remove_all(.x, '.contr')) } emmeans(fit_cflow, specs = ~ species*location, type = 'response') |> contrast('fac.eff') |> confint() |> as_tibble() |> rename(conf.low = asymp.LCL, conf.high = asymp.UCL) |> select(-df) |> pander(caption = 'Table: Factorial contrasts for marginal species and location effects on conditional flowering rates') ``` This quantifies the significant species effect as follows: with 95% confidence, the odds of flowering are estimated to be between 2.63 and 10.75 times higher among annuals compared with perennials (point estimate: 5.33). ### Fruiting rates Fruiting rates could be computed in one of three ways: 1. **Unconditional** on germination, *i.e.*, rate at which all toothpicks produced fruit 2. **Conditional** on germination, *i.e.*, rate at which germinated toothpicks produced fruit 3. **Conditional** on flowering, *i.e.*, rate at which flowered plants produced fruit Below I've only shown the analysis for option (3). To explore options 1-2, remove or change the filter on life stage in the codes shown below. ```{r, echo = T} # binary indicator for fruiting (conditional on flowering) fruit <- seedplot |> drop_na(stage) |> filter(stage > 'germinated') |> # notice filter here for conditional rate group_by(id, species, family, location, plot) |> summarize(fruit = factor(sum(n.fruit) > 0, levels = c(F, T), labels = c('no', 'yes')), .groups = 'drop') ``` ```{r} #| fig-cap: "Figure: Conditional fruiting rates by species, lineage, and location; average shown in blue dash." # compute group means for primary study arms mean_ftrates <- fruit |> group_by(species, location) |> summarize(prop.fruit = mean(fruit == 'yes'), .groups = 'drop') # plot fruit |> group_by(species, location, family) |> summarize(prop.fruit = mean(fruit == 'yes'), .groups = 'drop') |> ggplot(aes(y = fct_reorder(family, .x = prop.fruit), x = prop.fruit)) + facet_wrap(species~location, scales = 'free_y') + geom_point() + geom_linerange(aes(xmin = 0, xmax = prop.fruit)) + theme_bw() + theme(panel.grid.minor = element_blank(), panel.grid.major.y = element_blank()) + labs(x = 'conditional fruiting rate', y = NULL) + geom_vline(aes(xintercept = prop.fruit), linetype = 'dashed', color = 'blue', data = mean_ftrates) ``` Since some families have conditional fruiting rates of zero, it will not be possible to estimate lineage interactions. The analysis below uses the same model specification as that employed in the analysis of germination rates. (Codes are also identical.) Inference for fixed effects indicates there is a species effect on conditional fruiting rates: ```{r} # fit model fit_fruit <- glmer(fruit ~ species*location + (1 | plot:location/species) + (1 | family:species), data = fruit, family = binomial) # anova table for fixed effects car::Anova(fit_fruit, type = 'II') |> pander(caption = 'Table: Inference for fixed effects on conditional fruiting rate') ``` Inference for random effects indicates a plot effect and a species/plot interaction but **no** lineage effect: ```{r} # test for lineage effect glmer(fruit ~ species*location + (1 | plot:location/species), data = fruit, family = binomial) %>% anova(fit_fruit) |> pander(caption = 'Table: Likelihood ratio test for lineage effect on conditional fruiting rates') # test for plot effect glmer(fruit ~ species*location + (1 | family:species), data = fruit, family = binomial) %>% anova(fit_fruit) |> pander(caption = 'Table: Likelihood ratio test for plot effect on conditional fruiting rates') # test for species/plot interaction glmer(fruit ~ species*location + (1 | plot:location) + (1 | family:species), data = fruit, family = binomial) %>% anova(fit_fruit) |> pander(caption = 'Table: Likelihood ratio test for species/plot interaction effect on conditional fruiting rates') ``` Higher-order interactions produce numerical instability in the model fitting and so are omitted. Estimates of conditional fruiting rates for each species/location combination are given below. ```{r} # estimates of conditional fruiting probabilities emmeans(fit_fruit, specs = ~species*location, type = 'response') |> confint() |> broom::tidy() |> select(-df) |> pander(caption = 'Table: Estimated (conditional) fruiting probabilities and 95% confidence intervals') ``` Graphically: ```{r} # plot estimated probabilities emmeans(fit_fruit, specs = ~species*location, type = 'response') |> confint() |> broom::tidy() |> ggplot(aes(x = species, y = prob, color = location)) + geom_point(position = position_dodge(width = 0.5)) + geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.1, position = position_dodge(width = 0.5)) + theme_bw() + labs(x = NULL, y = 'estimated conditional fruiting probability') ``` Marginal fruiting rates by species (averaged over location) and location (averaged over species) are shown below. ```{r} # marginal fruiting rates by species emmeans(fit_fruit, specs = ~ species, type = 'response') |> confint() |> broom::tidy() |> select(-df) |> pander(caption = 'Table: Estimated marginal conditional fruiting rates by species (averaged over location)') # marginal fruiting rates by location emmeans(fit_fruit, specs = ~ location, type = 'response') |> confint() |> broom::tidy() |> select(-df) |> pander(caption = 'Table: Estimated marginal conditional fruiting rates by location (averaged over species)') ``` Lastly, factorial contrasts yield comparisons of marginal fruiting rates between (a) locations and (b) species. ```{r} # factorial contrasts fac.eff.emmc <- function(levels, ...){ modelr::data_grid(fruit, location, species) |> bind_cols(loc.contr = 0.5*rep(c(-1, 1), each = 2), spp.contr = 0.5*rep(c(1, -1), 2)) |> select(ends_with('contr')) |> rename_with(~str_remove_all(.x, '.contr')) } emmeans(fit_fruit, specs = ~ species*location, type = 'response') |> contrast('fac.eff') |> confint() |> as_tibble() |> rename(conf.low = asymp.LCL, conf.high = asymp.UCL) |> select(-df) |> pander(caption = 'Table: Factorial contrasts for marginal species and location effects on conditional fruiting rates') ``` This quantifies the significant species effect as follows: with 95% confidence, the odds of fruiting are estimated to be between 1.53 and 7.46 times higher among annuals compared with perennials (point estimate: 3.38). ## Analysis of time-to-event data If you're interested in time patterns, it's fairly easy to obtain first dates for each stage. Here's what that looks like for one individual that saw all life stages: ```{r, echo = T} # earliest dates for each life stage first_dates <- seedplot |> drop_na(stage) |> group_by(id, species, family, location, plot, replicate, stage) |> slice_min(date) |> select(-starts_with('n'), -week) |> ungroup() # preview first_dates |> filter(id == 140) |> pander(caption = 'Table: dates to life stage events for individual no. 140') ``` This can be rearranged to calculate times to each stage. Here are some examples for individuals who reached different life stages: ```{r, echo = T} # time to event data durations <- first_dates |> pivot_wider(names_from = stage, values_from = date) |> rename(first.obs = ungerminated) |> mutate(time.to.germ = germinated - ymd("2024-01-17"), time.to.flower = flowered - ymd("2024-01-17"), time.to.fruit = fruited - ymd("2024-01-17"), time.to.death = died - ymd("2024-01-17")) ``` ```{r} # some example rows bind_rows(durations |> drop_na(time.to.germ) |> slice_head(n = 1), durations |> drop_na(time.to.flower) |> slice_head(n = 1), durations |> drop_na(time.to.fruit) |> slice_head(n = 1)) |> select(id, first.obs, germinated, flowered, fruited, died, time.to.germ, time.to.flower, time.to.fruit, time.to.death) |> pander() ``` ::: callout-note I manually input the start of the experiment as 1/17/25 because this was the earliest date that appeared in the metadata, and is consistent with the week 1 observation being 1/25/25. Client should confirm and adjust start date if needed so that durations are accurate. ::: Here are distributions of times to each event by species and location: ```{r} #| fig-height: 8 #| fig-width: 8 #| fig-cap: "Figure: Distributions of times to life stage events by species and location" durations |> pivot_longer(starts_with('time.to.')) |> ggplot(aes(x = value)) + facet_grid(species*location~name, scale = 'free_x') + geom_histogram(bins = 5) + theme_bw() + theme(panel.grid.minor = element_blank()) + labs(x = 'days') ``` It's perhaps not the most efficient comparison, but one can see: - approximate time ranges for each event - longer survival times and later flowering and fruiting in seep plots Survival analysis methods can be used to estimate the (instantaneous) probability of germination/flowering/fruiting/death as a function of time. I focused on a simple approach using Kaplan-Meier estimators after grouping by location and species **and ignoring plot and lineage effects**. Mixed-effects Cox regression could be used for a proper analysis accounting for the experimental design with a specification analogous to that used above in modeling event rates in aggregate across the study period, but since this is a secondary analysis, I've just shown some simple exploratory work. Missing durations are right-censored -- all we know is that the event of interest occurred (or would have occurred) sometime after the end of the study. To format the data correctly, I've replaced missing durations with the number of days that the study persisted (442) and added an indicator of whether the corresponding observation is censored (`i = 0`) or not (`i = 1`). ```{r} # number of days between first and last observation date max_t <- seedplot |> pull(date) |> range() |> diff() # compute time and status variables for survival analysis # replace NA (right-censored values) with maximum duration (442 days) tdata <- durations |> filter(!is.na(first.obs)) |> mutate(t.g = as.integer(time.to.germ) |> replace_na(as.integer(max_t)), i.g = 1 - is.na(time.to.germ), t.fl = as.integer(time.to.flower) |> replace_na(as.integer(max_t)), i.fl = 1 - is.na(time.to.flower), t.fr = as.integer(time.to.fruit) |> replace_na(as.integer(max_t)), i.fr = 1 - is.na(time.to.fruit), t.d = as.integer(time.to.death) |> replace_na(as.integer(max_t)), i.d = 1 - is.na(time.to.death)) |> select(id, species, location, starts_with('t.'), starts_with('i.')) # preview tdata |> filter(t.g < 442) |> slice_head(n = 6) |> pander(caption = 'Table: data formatted for analysis of time-to-event data') # export for client use write_csv(tdata, file = 'seedplot-times.csv') ``` I've provided a copy of the data formatted in this way as `seedplot-times.csv`. ```{r, echo = T} # read in time to event data tdata <- read_csv('seedplot-times.csv') ``` A quick caveat: it's somewhat awkward to represent germination, flowering, and fruiting as right-censored since this implicitly assumes that these events would eventually occur, which is not true in this system. This is more conceptual than analytical -- what will happen in the analysis is that after sufficient time the estimated instantaneous event probabilities will plateau at the long-run event rates. ### Germination Modeling the instantaneous probability of germination as a function of time since the start of the study shows relatively quick convergence to aggregate germination rates: ```{r} # kaplan-meier curves for germination probability fit_km_germ <- survfit(Surv(t.g, i.g) ~ interaction(location, species), data = tdata) # plot tibble(t = fit_km_germ$time |> c(rep(0, 4)), haz = fit_km_germ$surv |> c(rep(1, 4)), lwr = fit_km_germ$lower |> c(rep(NA, 4)), upr = fit_km_germ$upper |> c(rep(NA, 4)), group = rep(c('rock.LLA', 'seep.LLA', 'rock.LLP', 'seep.LLP'), fit_km_germ$strata) |> c(c('rock.LLA', 'seep.LLA', 'rock.LLP', 'seep.LLP'))) |> separate_wider_delim(cols = group, delim = '.', names = c('location', 'species')) |> ggplot(aes(x = t, y = 1 - haz)) + geom_step(aes(color = species, linetype = location)) + pammtools::geom_stepribbon(aes(ymin = 1 - lwr, ymax = 1 - upr, fill = species, linetype = location), alpha = 0.2) + theme_bw() + theme(panel.grid.minor = element_blank()) + guides(linetype = guide_legend(override.aes = list(fill = NA))) + labs(x = 'time (days)', y = 'Pr(germination)') ``` The bands show pointwise 95% confidence intervals, and the overlap is consistent with the prior analysis which found no effect of species or location on germination rates. This just displays that conclusion a bit differently. There doesn't look to be much difference in the rate at which the germination probability increases with time, though the estimate for LLP does increase slightly faster in the rock outrcop microhabitat. ### Flowering For modeling time to flowering, I used: 1. Flowering occurrences *conditional on* germination, *i.e.*, all ungerminated individuals were removed prior to analysis 2. The time since germination (rather than the time since the start of the experiment) This yields the following estimates of the instantaneous probability of flowering as a function of time since germination: ```{r} # kaplan-meier curves for (conditional) flowering probability fit_km_flow <- tdata |> filter(t.g < as.integer(max_t)) %>% survfit(Surv(t.fl - t.g, i.fl) ~ interaction(location, species), data = .) tibble(t = fit_km_flow$time |> c(rep(0, 4)), haz = fit_km_flow$surv |> c(rep(1, 4)), lwr = fit_km_flow$lower |> c(rep(NA, 4)), upr = fit_km_flow$upper |> c(rep(NA, 4)), group = rep(c('rock.LLA', 'seep.LLA', 'rock.LLP', 'seep.LLP'), fit_km_flow$strata) |> c(c('rock.LLA', 'seep.LLA', 'rock.LLP', 'seep.LLP'))) |> separate_wider_delim(cols = group, delim = '.', names = c('location', 'species')) |> ggplot(aes(x = t, y = 1 - haz)) + geom_step(aes(color = species, linetype = location)) + pammtools::geom_stepribbon(aes(ymin = 1 - lwr, ymax = 1 - upr, fill = species, linetype = location), alpha = 0.2) + theme_bw() + theme(panel.grid.minor = element_blank()) + guides(linetype = guide_legend(override.aes = list(fill = NA))) + labs(x = 'time (days since germination)', y = 'Pr(flower | germinated)') ``` There is a clear and persistent difference in flowering probability by species: annuals flower at higher rates, as one might expect. While there is no long-run apparent difference in flowering probability by microhabitat, the flowering probability increases faster in the rock outcrop microhabitat. This reflects the descriptive finding earlier that flowering occurs later in the seep microhabitat. ### Fruiting Similar to flowering, I estimated the instantaneous *conditional* probability of fruiting given germination, as a function of the time since germination. ::: callout-note Client may prefer to model the conditional probability of fruiting given flowering as a function of time since flowering. If so, change `Surv(t.fr - t.g, i.fr)` to `Surv(t.fr - t.fl, i.fr)` in the model specification. ::: ```{r} # kaplan-meier curves for (conditional) fruiting probability fit_km_fru <- tdata |> filter(t.g < as.integer(max_t)) %>% survfit(Surv(t.fr - t.g, i.fr) ~ interaction(location, species), data = .) tibble(t = fit_km_fru$time |> c(rep(0, 4)), haz = fit_km_fru$surv |> c(rep(1, 4)), lwr = fit_km_fru$lower |> c(rep(NA, 4)), upr = fit_km_fru$upper |> c(rep(NA, 4)), group = rep(c('rock.LLA', 'seep.LLA', 'rock.LLP', 'seep.LLP'), fit_km_fru$strata) |> c(c('rock.LLA', 'seep.LLA', 'rock.LLP', 'seep.LLP'))) |> separate_wider_delim(cols = group, delim = '.', names = c('location', 'species')) |> ggplot(aes(x = t, y = 1 - haz)) + geom_step(aes(color = species, linetype = location)) + pammtools::geom_stepribbon(aes(ymin = 1 - lwr, ymax = 1 - upr, fill = species, linetype = location), alpha = 0.2) + theme_bw() + theme(panel.grid.minor = element_blank()) + guides(linetype = guide_legend(override.aes = list(fill = NA))) + labs(x = 'time (days since germination)', y = 'Pr(fruit | germinated)') ``` Interestingly, there appears to be an interaction effect here: depending on species, the rate of increase and plateau differs between environments. ### Survival Finally, an analysis of survival times after germination is given below. The figure shows the probability of survival as a function of time, and reflects the descriptive finding earlier that plants survived longer in the seep microhabitat. ```{r} # kaplan-meier curves for survival after germination fit_km_surv <- tdata |> filter(t.g < as.integer(max_t)) %>% survfit(Surv(t.d - t.g, i.d) ~ interaction(location, species), data = .) # plot tibble(t = fit_km_surv$time |> c(rep(0, 4)), haz = fit_km_surv$surv |> c(rep(1, 4)), lwr = fit_km_surv$lower |> c(rep(NA, 4)), upr = fit_km_surv$upper |> c(rep(NA, 4)), group = rep(c('rock.LLA', 'seep.LLA', 'rock.LLP', 'seep.LLP'), fit_km_surv$strata) |> c(c('rock.LLA', 'seep.LLA', 'rock.LLP', 'seep.LLP'))) |> separate_wider_delim(cols = group, delim = '.', names = c('location', 'species')) |> ggplot(aes(x = t, y = haz)) + geom_step(aes(color = species, linetype = location)) + pammtools::geom_stepribbon(aes(ymin = lwr, ymax = upr, fill = species, linetype = location), alpha = 0.2) + theme_bw() + theme(panel.grid.minor = element_blank()) + guides(linetype = guide_legend(override.aes = list(fill = NA))) + labs(x = 'time (days since germination)', y = 'Pr(survival | germinated)') ``` ## Reproducibility I've provided a copy of all codes used to generate analyses shown above starting from the client's spreadsheet. In addition, I generated two derivative files with reformatted data: - `seedplot.csv` has data arranged for event occurrence rate analysis - `seedplot-time.csv` has data arranged for time-to-event analysis These files are available (only client has access permissions) in a compressed folder [here](https://cpslo-my.sharepoint.com/:u:/r/personal/truiz01_calpoly_edu/Documents/consulting/25su-niknafs/niknafs-consult-toclient.zip?csf=1&web=1&e=jporSr) (you may need to right-click to open).

week1	week2	week3	week4	week5	week6	week7	week8	week9	week10
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

plot	rock_outcrop	seep
1	108	107
2	108	108
3	108	108
4	108	108
5	108	108
6	108	108
7	108	108
8	108	108
9	108	108
10	108	108

week1	week2	week3	week4	week5	week6	week7	week8	week9	week10
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

plot	rock_outcrop	seep
1	108	107
2	108	108
3	108	108
4	108	108
5	108	108
6	108	108
7	108	108
8	108	108
9	108	108
10	108	108

week1	week2	week3	week4	week5	week6	week7	week8	week9	week10
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

plot	rock_outcrop	seep
1	108	107
2	108	108
3	108	108
4	108	108
5	108	108
6	108	108
7	108	108
8	108	108
9	108	108
10	108	108