Cultivar trial data

CLIENT

Colin Koubek

DATE

July 22, 2025

Data are repeated measures from an RCBD. In detail:

33 cultivars (treatments) are randomly allocated among plots arranged in 4 blocks
one replicate per cultivar per block
leaves are (sub)sampled from each plot at six dates over a four-month period
responses are counts of two-spot spider mites (“TSSM”) and predatory mites (“PM”)
average response across subsamples recorded for each plot

Goal is to compare each response among the 33 cultivars longitudinally across time points. Some plots were missed accidentally on certain sampling dates owing to the large number of samples to collect.

In addition, trichomes were counted on three subsamples from each plot at the end of the experiment.

Client questions:

How to handle missing data?
How to handle repeated measures?
How to do cultivar comparisons with so many treatments?
How to analyze trichome data?

Mite data

Raw mite data as provided is shown below.

date.sampled	date.counted	cultivar	plot	tssm	pm	block
2025-02-11	2025-02-14	Surfline	1001	0.2	0	10
2025-03-25	2025-03-26	Surfline	1001	0.2	0	10
2025-04-29	2025-04-30	Surfline	1001	2	0.2	10
2025-05-13	2025-05-19	Surfline	1001	0	0	10
2025-05-28	2025-05-28	Surfline	1001	0	0	10
2025-02-11	2025-02-14	Monarch	1002	0	0	10

Quality control

There are a few quality control issues to address that I identified as I was checking basic data attributes and identifying missing observations. Based on the design I expect either one or zero samples per cultivar/block/date combination. A handful of combinations saw two or more samples:

block	date.sampled	cultivar	n
11	2025-04-14	Albion (LC)	2
11	2025-02-11	BG-10.3181	2
12	2025-02-11	BG-10.3181	2
12	2025-03-25	BG-10.3181	2
12	2025-04-14	BG-10.3181	2
12	2025-04-29	BG-10.3181	2
12	2025-05-13	BG-10.3181	2
12	2025-05-28	BG-10.3181	2
11	2025-03-25	BG-4.367	2
11	2025-02-11	Eclipse	2
11	2025-02-11	Royal Royce	2
11	2025-03-25	Royal Royce	3
11	2025-04-14	Royal Royce	2
11	2025-04-29	Royal Royce	2
11	2025-05-13	Royal Royce	2
11	2025-05-28	Royal Royce	2
10	2025-03-25	UC_DN18	2

Additionally, a few blocks are missing in their entirety for certain cultivars and seem to correspond to blocks duplicated in their entirety for other cultivars:

block	date.sampled	BG-10.3169	BG-10.3181	Mojo	Royal Royce
10	2025-02-11	1	1	1	0
10	2025-03-25	1	1	1	1
10	2025-04-14	1	1	1	1
11	2025-02-11	1	2	0	2
11	2025-03-25	1	1	0	3
11	2025-04-14	1	1	0	2
11	2025-04-29	1	1	0	2
11	2025-05-13	1	1	0	2
11	2025-05-28	1	1	0	2
12	2025-02-11	0	2	1	1
12	2025-03-25	0	2	1	1
12	2025-04-14	0	2	1	1
12	2025-04-29	0	2	1	1
12	2025-05-13	0	2	1	1
12	2025-05-28	0	2	1	1
13	2025-04-14	1	1	1	1
13	2025-04-29	1	1	1	1
13	2025-05-13	1	1	1	1

Client confirmed plot mislabeling for the cultivars shown above. Here is the labeling in the original data:

plot	cultivar
1023	BG-10.3169
1111	BG-10.3169
1308	BG-10.3169
1033	BG-10.3181
1121	BG-10.3181
1217	BG-10.3181
1222	BG-10.3181
1305	BG-10.3181
1025	Mojo
1220	Mojo
1311	Mojo
1005	Royal Royce
1120	Royal Royce
1125	Royal Royce
1214	Royal Royce
1328	Royal Royce

Plots 1222, 1217, and 1120 are mislabeled. After recoding these, there are only a handful of “duplicated” observations:

plot	date.sampled	cultivar	tssm	pm
1011	2025-03-25	UC_DN18	1.2	0
1011	2025-03-25	UC_DN18	2.8	2.6
1121	2025-02-11	BG-10.3181	0.6	0
1121	2025-02-11	BG-10.3181	1.8	0
1122	2025-02-11	Eclipse	0.2	0
1122	2025-02-11	Eclipse	0.2	0
1126	2025-03-25	BG-4.367	3.4	0
1126	2025-03-25	BG-4.367	0.4	0
1125	2025-03-25	Royal Royce	13.8	0
1125	2025-03-25	Royal Royce	1.6	0
1123	2025-04-14	Albion (LC)	18.6	0.60000000000000009
1123	2025-04-14	Albion (LC)	NA	NA

Plot 1123 can be retained since one of the two observations is missing (though this empty row – 332 in the excel file – is perhaps better simply removed from the original data). Otherwise, response values are not in general identical for these plots. Client should verify plots above have correct cultivar labels. If there is no further mislabeling, and the “extra” observations cannot be otherwise accounted for, these can be removed from the data and treated as missing, as is done below.

In general, missing observations are not recorded explicitly in the data provided – this is a fine convention. However, there are three exceptions:

date.sampled	cultivar	plot	tssm	pm
2025-04-14	Albion (LC)	1123	NA	NA
2025-04-29	Monarch	1333	NA	NA
2025-05-28	Monarch	1333	NA	NA

It is also worth noting that there are a few instances where TSSM is recorded but PM is missing (there are none in the opposite configuration with PM recorded and TSSM missing). Client may wish to inspect these samples:

date.sampled	cultivar	plot	tssm	pm
2025-05-28	Albion (LC)	1321	0	NA
2025-02-11	Monarch	1333	5.8	NA
2025-03-25	Monarch	1333	5	NA

The pseudo-duplicates and samples with both TSSM and PM recorded as missing are removed. After a bit of renaming, the analysis-ready data look like so:

block	plot	date	cultivar	tssm	pm
10	1001	2025-02-11	Surfline	0.2	0
10	1001	2025-03-25	Surfline	0.2	0
10	1001	2025-04-29	Surfline	2	0.2
10	1001	2025-05-13	Surfline	0	0
10	1001	2025-05-28	Surfline	0	0
10	1002	2025-02-11	Monarch	0	0

Missing data

Relatively few observations are missing – only 8 date/block/cultivar combinations, not counting the pseudo-duplicates. Here is a list of missing observations and pseudo-duplicates:

cultivar	plot	block	date
Albion (LC)	1123	11	2025-03-25
BG-10.3181	1121	11	2025-02-11
BG-4.367	1126	11	2025-03-25
Eclipse	1122	11	2025-02-11
Fronteras	1225	12	2025-03-25
Keystone	1101	11	2025-03-25
Monarch	1333	13	2025-04-14
Monarch	1333	13	2025-04-29
Monarch	1333	13	2025-05-13
Monarch	1333	13	2025-05-28
Monterey	1226	12	2025-03-25
PS-14.762.054	1022	10	2025-02-11
PS-14.762.054	1332	13	2025-04-29
Royal Royce	1005	10	2025-02-11
Royal Royce	1125	11	2025-03-25
Surfline	1001	10	2025-04-14
UC_DN18	1011	10	2025-03-25
UC_SD16	1021	10	2025-02-11

It’s worth noting that:

Plot 1333 (Monarch) is entirely missing after March
PS-14.762.054 and Royal Royce are both missed twice, but on different plots at different dates

Otherwise, there is very little pattern to the missingness. It seems reasonable to assume data are missing completely at random. There are sufficient data from nonmissing blocks to impute treatment means by date for the missing observations and then proceed with the analysis as if data were complete.

TSSM

The group means averaged across blocks by date are shown below with smoothing across treatments shown by the solid blue line.

These values are imputed for the missing plots/dates.

Code

# impute means for missing observations
tssm <- mite |>
  select(-pm, -plot) |>
  spread(cultivar, tssm) |>
  pivot_longer(-c(block, date), names_to = 'cultivar', values_to = 'tssm') |>
  group_by(cultivar, date) |>
  mutate(tssm = if_else(is.na(tssm), mean(tssm, na.rm = T), tssm)) |>
  ungroup() |>
  arrange(cultivar, block, date)

After imputing averages, the distributions of TSSM (on a log scale) across cultivars by date are as follows:

By the last sampling date, TSSM is almost uniformly zero, so there is no longer any signal in the data – this date can be removed prior to analysis.

Visually, signal on 5/13 also looks negligible, but 25% of plots still have nonzero (though small) values:

I retained data from 5/13, but the signal is likely very weak. Client may wish to remove at their discretion.

To format for analysis, I’ve converted date to number of weeks since first sample, stored as a factor (weeks.f) and as an integer (weeks), and added an integer-valued plot identifier (id). While it won’t affect the analysis, client may wish to encode weeks relative to the date of treatment application for interpretability. Note that I’ve also log-transformed TSSM after adding a pseudo-count to replace zeroes.

cultivar	block	id	date	week	week.f	tssm
Albion (LC)	10	1	2025-02-11	0	0	0.6931
Albion (LC)	10	1	2025-03-25	6	6	3.258
Albion (LC)	10	1	2025-04-14	9	9	3.006
Albion (LC)	10	1	2025-04-29	11	11	1.649
Albion (LC)	10	1	2025-05-13	13	13	0
Albion (LC)	11	34	2025-02-11	0	0	0.5878

It does look as if there’s a modest block effect. Further, the block effect appears fairly uniform across time – this could be incorporated to simplify the analysis, if client so wishes.

Finally, we can look at the distributions of mite levels by cultivar. Across all time points:

And disaggregated by time:

Interestingly, mite levels seem to exhibit fairly different time trajectories depending on cultivar. It’s also pretty clear that variances are nonhomogeneous both by cultivar and over time.

Model specification

The usual statistical model at one time point would be:

\[ \log(\text{tssm}_{ij}) = \mu + \text{cultivar}_i + \text{block}_j + \epsilon_{ij} \]

To incorporate the repeated measures, we’ll add in week (as a factor) and its interactions. To treat the block effect as time-varying, include two-way interactions with both block and cultivar effects:

\[ \log(\text{tssm}_{ijk}) = \mu + \text{cult}_i + \text{blk}_j + \text{wk}_k + (\text{cult}\times\text{wk})_{ik} + (\text{blk}\times\text{wk})_{jk} + \epsilon_{ijk} \qquad\qquad(1) \]

If instead the block effect is assumed uniform over time, a slightly simpler model can be used:

\[ \log(\text{tssm}_{ijk}) = \mu + \text{cult}_i + \text{blk}_j + \text{wk}_k + (\text{cult}\times\text{wk})_{ik} + \epsilon_{ijk} \qquad\qquad(2) \]

I somewhat prefer this model, but it won’t make much of a difference downstream in analysis. I’ve tried it both ways and little changes. Here’s a comparison of estimated block medians under each model:

Qualitatively, we might ask whether it’s reasonable to expect that mite populations would exhibit different trajectories by block. Since all blocks have the same cultivars, this mainly has to do with experimental conditions and possible extraneous sources of variability that could affect entire blocks. Based on the data, after considering estimated sampling variability, it’s really only block 11 that differs considerably under the two models, and even then only at week 11. I’d be happy to overlook this in the analysis, but client may wish to use the more complex model.

Repeated measures

Repeated measures often induce within-unit correlations that can be modeled by adding (in this case) a plot random effect. However, if we look at residual correlations from model (2) across weeks, they aren’t especially strong:

            0          6           9          11          13
0  1.00000000  0.2066859  0.05955158  0.06314299  0.09003641
6  0.20668594  1.0000000  0.35129712  0.16580759 -0.10815681
9  0.05955158  0.3512971  1.00000000  0.50349111 -0.30260907
11 0.06314299  0.1658076  0.50349111  1.00000000 -0.09398731
13 0.09003641 -0.1081568 -0.30260907 -0.09398731  1.00000000

Only two stand out:

weeks 6 and 9
weeks 9 and 11

These are the nearest-in-time observations. In general, it looks as if observations are taken far enough apart in time that there’s not a strong within-plot correlation. This presents two options:

Assume independence as usual and work with the model above
Add a within-plot random effect having AR(1) correlation structure appropriately scaled by time distance

There’s also a third possibility that’s a little less common and involves treating time as continuous:

Estimate response curves by cultivar

(Option 1) Independent errors

For this option, we take model (2) and specify that for the error term one has:

\[ \log(\text{tssm}_{ijk}) = \mu + \text{cult}_i + \text{blk}_j + \text{wk}_k + (\text{cult}\times\text{wk})_{ik} + \epsilon_{ijk} \;,\qquad \epsilon_{ijk} \stackrel{iid}{\sim} N(0, \sigma^2) \]

The equal variance assumption is an oversimplification, but one that we’ll overlook for now. The model above gives the ANOVA:

Code

# fit anova model with independent errors
fit.aov <- aov(tssm ~ cultivar*week.f + block, data = tssm_analysis) 

# anova table
fit.aov |> summary() |> pander()

Analysis of Variance Model
	Df	Sum Sq	Mean Sq	F value	Pr(>F)
cultivar	32	102.3	3.198	5.223	7.156e-17
week.f	4	256.6	64.15	104.8	1.698e-64
block	3	46.32	15.44	25.22	3.437e-15
cultivar:week.f	128	123.9	0.9676	1.58	0.0003043
Residuals	492	301.2	0.6123	NA	NA

The estimated cultivar median TSSM values that result from this approach are:

Pairwise contrasts are not likely to be helpful, and as client noted, the study is not powered to detect differences after adjusting for 528 contrasts at each time point (2640 in total). My recommendation is to generate simultaneous intervals for the cultivar medians either by time or on a time-averaged basis, and then order them from best to worst or vice-versa.

Here are time-averaged estimates with simultaneous 80% confidence intervals:

Code

# plot cultivar medians
fit.aov |>
  emmeans(~cultivar) |>
  confint(adjust = 'mvt', level = 0.8) |>
  as_tibble() |>
  mutate(across(c(emmean, upper.CL, lower.CL), ~exp(.x) - 1)) |>
  ggplot(aes(x = emmean, y = fct_reorder(cultivar, emmean))) +
  geom_point() +
  geom_errorbarh(aes(xmin = lower.CL, xmax = upper.CL)) +
  theme_bw() +
  labs(y = NULL, x = 'median TSSM')

Note that the usual multiplicity adjustments do not account for comparisons between cultivars. If cultivar comparisons are of interest, I’d recommend adjusting intervals using Scheffe’s method, which accounts for all possible contrasts. However, this produces quite wide intervals, and, unsurprisingly, no comparisons are significantly different.

One could also look at estimated cultivar medians disaggregated by time:

The multiplicity adjustments again start to become inflated due to the number of estimates (165 here). I’d probably go with the time-averaged plot as a primary figure. However, it is notable that mite levels peak at different times for different cultivars, and this plot makes that a bit easier to see than the EDA version showing boxplots by time point – certainly worth including at least on a supplementary basis.

(Option 2) AR(1) correlation

If the time correlation is of particular interest, we can modify the model by adding a AR(1) covariance structure within plots. That is:

\[ \log(\text{tssm}_{ijk}) = \mu + \text{cult}_i + \text{blk}_j + \text{wk}_k + (\text{cult}\times\text{wk})_{ik} + \epsilon_{ijk} \qquad \begin{cases} \epsilon \sim N(0, \Sigma) \\ \text{var}(\epsilon_{ijk}) = \sigma^2 \\ \text{cov}(\epsilon_{ijk}, \epsilon_{ijk'}) = \rho^{|w(k) - w(k')|} \\ \Sigma_{mn} = 0 \;\text{otherwise} \end{cases} \]

The parameter $\rho$ represents the correlation between two measurements taken one week apart, and the covariance model specifies that this correlation decays geometrically with time – $w(k)$ denotes the numeric week for index $k$.

This model is fairly straightforward to fit in R using nlme::gls and gives the following ANOVA table:

Code

# format week for nlme
tssm_analysis_ar1 <- tssm_analysis |>
  mutate(week.i = as.integer(week))

# fit repeated measures anova with AR1 covariance
fit.aov.ar1 <- gls(tssm ~ week.f*cultivar + block,
                   corr = corAR1(value = 0.5, form = ~ week.i | id),
                   data = tssm_analysis_ar1)

# anova table (questionable)
anova(fit.aov.ar1) |>
  pander()

	numDF	F-value	p-value
(Intercept)	1	717.7	3.576e-98
week.f	4	112.2	6.234e-68
cultivar	32	3.605	5.726e-10
block	3	14.18	7.065e-09
week.f:cultivar	128	1.65	8.49e-05

There is some controversy around constructing and interpreting the F statistics in this context since sums of squares are no longer independent under the model. The estimated covariance structure can be summarized as shown below. I’ve included correlations at 2, 3, and 6 week spacing, since those are the unique spacings of sampling dates in the dataset.

Code

# inspect estimated covariance structure
tibble(sigma = fit.aov.ar1$sigma,
       rho = coef(fit.aov.ar1$modelStruct$corStruct, unconstrained = FALSE)) |>
  mutate(cor2wks  = rho^2,
         cor3wks = rho^3,
         cor6wks = rho^6) |>
  pander()

sigma	rho	cor2wks	cor3wks	cor6wks
0.7855	0.6133	0.3761	0.2306	0.0532

All the same plots as before can be constructed using estimates from this model, and they are essentially identical.

This approach will produce the same conclusions as option 1, and only differs in that you have an estimate for the time correlation. If the time correlation is not of specific interest, client may wish to use the simpler model (option 1) – considering the time correlation is not especially strong, this is an entirely reasonable choice.

(Option 3) Response curves

A less common but potentially helpful alternative is to estimate response curves as functions of time. This arises from taking the basic RCBD model, treating weeks since first sampling date as a continuous time variate, and adding a cultivar-dependent function of time:

\[ \log(\text{tssm}_{ij}(t)) = \mu + \text{cult}_i + \text{blk}_j + f_i(t) + \epsilon_{ij} \]

Notice I’m still assuming that the block effect is uniform over time – this can be changed if desired. I fit this model using a B-spline representation $f_i(t) = \sum_j B_j(t)$, which is fairly straightforward to do in R. This gives the ANOVA:

Code

# append splines to data
tssm_analysis_sp <- tssm_analysis |>
  mutate(week.bs = splines::bs(week, degree = 3))

# fit model
fit.aov.sp <- aov(tssm ~ cultivar*week.bs + block, 
                  data = tssm_analysis_sp)

# anova table
summary(fit.aov.sp) |>
  pander()

Analysis of Variance Model
	Df	Sum Sq	Mean Sq	F value	Pr(>F)
cultivar	32	102.3	3.198	5.216	5.206e-17
week.bs	3	245.3	81.76	133.3	3.245e-64
block	3	46.32	15.44	25.18	3.113e-15
cultivar:week.bs	96	114.5	1.193	1.945	2.19e-06
Residuals	525	321.9	0.6132	NA	NA

What’s nice about this model, although a little more complex than the usual ANOVA, is that estimates can be used to reconstruct response curves:

With a little clever effort, one could compute an integral over the time domain to estimate cumulative mite populations by cultivar and use this as a basis for cultivar comparisons.

As an aside, from a statistical perspective the errors in this model are still assumed independent, but per the discussion in Option 2, there is likely some time correlation within plot. As a consequence, the uncertainty bands on the response curves above are likely a bit too narrow. However, considering that in Option 2 the autocorrelation is fairly weak given the time spacing, I wouldn’t be too concerned about this issue. That said, if client wishes to adjust for that, there are options for incorporating a covariance model into the specification.

PM

For the PM response, there is noticeably more zero-inflation. As a result, standard ANOVA is not recommended.

There are three common strategies for dealing with this issue:

Convert data to presence/absence
Filter out zeroes and model the response conditional on mite presence
Fit a mixture model with a component accounting for the excess probability of zero

Since none of the counts are especially high (99.61% of obserations are under 5), there’s not a particularly strong information signal in the numeric values of the nonzero responses, so I think that the presence/absence approach is best here. Here is the distribution of nonzero values:

I’d expect that even if one did model the quantitative response, the scale of variation by cultivar would not be practically meaningful.

After imputing missing values using the same approach as for TSSM and converting the PM values to presence/absence, data look like so:

cultivar	block	id	date	week	week.f	pm	pm.f
Albion (LC)	10	1	2025-02-11	0	0	0	absent
Albion (LC)	10	1	2025-03-25	6	6	0	absent
Albion (LC)	10	1	2025-04-14	9	9	0.4	present
Albion (LC)	10	1	2025-04-29	11	11	1.4	present
Albion (LC)	10	1	2025-05-13	13	13	0.4	present
Albion (LC)	10	1	2025-05-28	16	16	0	absent

I’d fit a binomial GLM with the same predictor form as the standard anova model for an RCBD:

\[ \mathbb{I}(\text{pm}_{ij} > 0) \sim \text{bernoulli}(p_{ij}) \quad\text{where}\quad \text{logit}(p_{ij}) = \mu + \text{cultivar}_i + \text{block}_j \] This model is easy to fit in R and yields the following analysis of deviance table:

Code

# fit binomial glm
fit.glm <- glm(pm.f ~ cultivar + block + week.f + week.f:cultivar, 
               family = binomial(), 
               data = pm)

# analysis of deviance table
anova(fit.glm) |> pander()

Analysis of Deviance Table
	Df	Deviance	Resid. Df	Resid. Dev	Pr(>Chi)
NULL	NA	NA	791	1047	NA
cultivar	32	42.42	759	1004	0.103
block	3	3.598	756	1001	0.3083
week.f	5	178.3	751	822.5	1.21e-36
cultivar:week.f	160	181.5	591	641.1	0.1176

The $p$-values correspond to sequential likelihood ratio tests akin to $F$ tests using type I sums of squares in a standard ANOVA. Note that in the formula specification the terms are entered in the desired order in which the tests are conducted.

Interestingly, only time is significant; there do not appear to be cultivar differences in the probability of predatory mite occurrence. Nonetheless, here are the estimated predatory mite occurrence probabilities over time by cultivar with pointwise (not simultaneous) 80% confidence intervals and ordered from highest average probability (upper left) to lowest (bottom right):

The estimated aggregate probability of predatory mite occurrence over time (averaged over cultivars and blocks) is:

While there may be some weak autocorrelation within plots, I’d keep it simple and avoid attempting to model this directly. I think the effort to correctly specify and fit a GLMM is not worth the potential benefit, and if the TSSM data are any indication, the autocorrelation does not appear strong at the scale of the time spacing of observations.

Trichome data

For the trichome data, samples were collected on one date only – there are no repeated measurements. I would suggest a standard ANOVA here after averaging the subsamples (as was done for the other responses in the repeated measures data).

The variances are somewhat obviously different, but I wouldn’t be too concerned about that – it’s almost expected given the number of cultivars, and I don’t think it’s worth the effort of adding a covariance model to account for this. However, this could be done if desired. The standard RCBD model is:

\[ \text{trichomes}_{ij} = \mu + \text{cult}_i + \text{blk}_j + \epsilon_{ij} \]

This gives the ANOVA:

Code

trichome_raw |>
  group_by(plot, cultivar, date.sampled) |>
  summarize(tri = mean(tri)) |>
  mutate(block = str_trunc(plot, 2, side = 'right', ellipsis = '')) %>%
  aov(tri ~ cultivar + block, data = .) |>
  summary() |>
  pander()

Analysis of Variance Model
	Df	Sum Sq	Mean Sq	F value	Pr(>F)
cultivar	32	19037	594.9	2.691	0.0001112
block	3	2592	864.1	3.909	0.01112
Residuals	95	21000	221.1	NA	NA

The inhomogeneous variance is not terribly concerning. This may affect interval coverage for a few groups, but as noted above, I wouldn’t be too concerned.

The following gives estimated means ordered from largest to smallest:

It would be interesting to see whether there’s any correspondence between the mite and trichome levels by comparing the rankings of estimated means obtained from each model. Client will need to recode the cultivars so that they match across datasets to facilitate this comparison; Spearman rank correlation between the two sets of rankings aligned by cultivar would provide some indication of whether there’s any relationship.

Summary

How to handle missing data?

Impute averages by cultivar and time. This assumes observations are missing completely at random, which seems reasonable here since the researchers just missed a few plots here and there when sampling.

How to handle repeated measures?

While there are a few options for the TSSM data, I’d keep it simple and use a standard ANOVA and add a term for week and a week-cultivar interaction. Assuming independent errors is a bit of an oversimplification, but the correlation does not seem strong on the scale of the time spacing of observations in the data.

How to do cultivar comparisons with so many treatments?

Avoid pairwise comparisons or other contrasts here unless there are any comparisons that are a priori important to examine. Just estimate the means with multiple inference adjustments.

How to analyze trichome data?

Average the subsamples and fit an ANOVA model.

--- title: "Cultivar trial data" author: "Colin Koubek" author-title: "CLIENT" date: "7/22/25" published-title: "DATE" format: html: code-fold: true code-tools: true toc: true execute: warning: false message: false echo: false --- ```{r} library(tidyverse) library(readxl) library(nlme) library(emmeans) library(modelr) library(pander) library(patchwork) library(splines) options(contrasts = rep('contr.sum', 2)) # read in mite data mite_raw <- read_xlsx('mite-cultivar.xlsx', sheet = 'Mites') |> rename_with(~str_replace_all(.x, ' ', '.') |> tolower()) |> rename(plot = `plot.#`) |> select(starts_with('date'), cultivar, plot, total.tssm, total.pred.mites) |> rename(tssm = total.tssm, pm = total.pred.mites) |> mutate(block = str_trunc(plot, width = 2, side = 'right', ellipsis = '') |> factor()) # read in trichome data trichome_raw <- read_xlsx('mite-cultivar.xlsx', sheet = 'Trichomes') |> rename_with(~str_replace_all(.x, ' ', '.') |> tolower()) |> rename(tri = `#.trichomes`) |> mutate(block = str_trunc(plot, width = 2, side = 'right', ellipsis = '') |> factor()) |> select(starts_with('date'), cultivar, plot, subsample, tri) ``` Data are repeated measures from an RCBD. In detail: - 33 cultivars (treatments) are randomly allocated among plots arranged in 4 blocks - one replicate per cultivar per block - leaves are (sub)sampled from each plot at six dates over a four-month period - responses are counts of two-spot spider mites ("TSSM") and predatory mites ("PM") - average response across subsamples recorded for each plot Goal is to compare each response among the 33 cultivars longitudinally across time points. Some plots were missed accidentally on certain sampling dates owing to the large number of samples to collect. In addition, trichomes were counted on three subsamples from each plot at the end of the experiment. Client questions: 1. How to handle missing data? 2. How to handle repeated measures? 3. How to do cultivar comparisons with so many treatments? 4. How to analyze trichome data? # Mite data Raw mite data as provided is shown below. ```{r} # preview mite_raw |> head() |> pander() ``` ## Quality control There are a few quality control issues to address that I identified as I was checking basic data attributes and identifying missing observations. Based on the design I expect either one or zero samples per cultivar/block/date combination. A handful of combinations saw two or more samples: ```{r} # treatment/date combinations with more than two samples mite_raw |> count(cultivar, block, date.sampled) |> spread(cultivar, n) |> pivot_longer(-c(block, date.sampled), names_to = 'cultivar', values_to = 'n') |> mutate(n = replace_na(n, 0)) |> filter(n > 1) |> arrange(cultivar, block, date.sampled) |> mutate(date.sampled = date(date.sampled)) |> pander() ``` Additionally, a few blocks are missing in their entirety for certain cultivars and seem to correspond to blocks duplicated in their entirety for other cultivars: ```{r} # missing/oversampled blocks mite_raw |> count(cultivar, block, date.sampled) |> spread(cultivar, n) |> mutate(date.sampled = date(date.sampled)) |> mutate(across(where(is.numeric), ~replace_na(.x, 0)), total = rowSums(pick(where(is.numeric)) != 1)) |> filter(total > 0) |> select(where(~sum(.x != 1) > 4), -total) |> pander() ``` Client confirmed plot mislabeling for the cultivars shown above. Here is the labeling in the original data: ```{r} # inspect plot numbers and check cultivar labeling following client input mite_raw |> filter(cultivar %in% c("BG-10.3181", "BG-10.3169", "Mojo", "Royal Royce")) |> distinct(plot, cultivar) |> arrange(cultivar) |> pander() ``` Plots 1222, 1217, and 1120 are mislabeled. After recoding these, there are only a handful of "duplicated" observations: ```{r} # recode mislabeled plots mite_relabeled <- mite_raw |> mutate(cultivar.corrected = case_when(plot == 1222 ~ "BG-10.3181", plot == 1217 ~ "BG-10.3169", plot == 1120 ~ "Mojo", .default = cultivar)) # identify duplicates and inspect plots duplicated_obs <- mite_relabeled |> select(-cultivar) |> rename(cultivar = cultivar.corrected) |> count(cultivar, block, date.sampled) |> spread(cultivar, n) |> pivot_longer(-c(block, date.sampled), names_to = 'cultivar', values_to = 'n') |> mutate(n = replace_na(n, 0)) |> filter(n > 1) |> inner_join(mite_relabeled, join_by(block, date.sampled, cultivar == cultivar.corrected)) |> select(plot, date.sampled, cultivar, tssm, pm) # inspect duplicated_obs |> pander() ``` Plot 1123 can be retained since one of the two observations is missing (though this empty row -- 332 in the excel file -- is perhaps better simply removed from the original data). Otherwise, response values are not in general identical for these plots. **Client should verify plots above have correct cultivar labels.** If there is no further mislabeling, and the "extra" observations cannot be otherwise accounted for, these can be removed from the data and treated as missing, as is done below. In general, missing observations are not recorded explicitly in the data provided -- this is a fine convention. However, there are three exceptions: ```{r} # explicitly recorded as missing mite_relabeled |> select(date.sampled, cultivar, plot, tssm, pm) |> filter(is.na(tssm), is.na(pm)) |> pander() ``` It is also worth noting that there are a few instances where TSSM is recorded but PM is missing (there are none in the opposite configuration with PM recorded and TSSM missing). Client may wish to inspect these samples: ```{r} # missing only PM mite_relabeled |> select(date.sampled, cultivar, plot, tssm, pm) |> filter(is.na(tssm) + is.na(pm) == 1) |> pander() ``` The pseudo-duplicates and samples with both TSSM and PM recorded as missing are removed. After a bit of renaming, the analysis-ready data look like so: ```{r} # for illustration purposes, drop plots with duplicates and treat as missing obs_to_exclude <- duplicated_obs |> distinct(plot, date.sampled, cultivar) |> filter(plot != 1123) # analysis-ready data mite <- mite_relabeled |> drop_na(tssm) |> select(-cultivar) |> rename(cultivar = cultivar.corrected) |> anti_join(obs_to_exclude, join_by(plot, date.sampled, cultivar)) |> mutate(date = date(date.sampled), pm = as.numeric(pm)) |> select(block, plot, date, cultivar, tssm, pm) # preview mite |> head() |> pander() ``` ## Missing data Relatively few observations are missing -- only 8 date/block/cultivar combinations, not counting the pseudo-duplicates. Here is a list of missing observations and pseudo-duplicates: ```{r} # missing observations mite |> count(cultivar, block, date) |> spread(cultivar, n) |> pivot_longer(-c(block, date), names_to = 'cultivar', values_to = 'n') |> mutate(n = replace_na(n, 0)) |> filter(n == 0) |> left_join(distinct(mite, block, plot, cultivar), join_by(block, cultivar)) |> select(cultivar, plot, block, date) |> arrange(cultivar, block, plot, date) |> pander() ``` It's worth noting that: - Plot 1333 (Monarch) is entirely missing after March - PS-14.762.054 and Royal Royce are both missed twice, but on different plots at different dates Otherwise, there is very little pattern to the missingness. It seems reasonable to assume data are missing completely at random. There are sufficient data from nonmissing blocks to impute treatment means by date for the missing observations and then proceed with the analysis as if data were complete. ## TSSM The group means averaged across blocks by date are shown below with smoothing across treatments shown by the solid blue line. ```{r} # group means mite |> group_by(cultivar, date) |> summarize(across(c(tssm, pm), mean)) |> ggplot(aes(x = date, y = tssm)) + geom_path(aes(group = cultivar), alpha = 0.25) + scale_y_continuous(transform = "log1p") + geom_smooth(method = 'loess', formula = 'y ~ x', se = F) + theme_bw() + labs(y = 'tssm cultivar mean') ``` These values are imputed for the missing plots/dates. ```{r, echo = T} # impute means for missing observations tssm <- mite |> select(-pm, -plot) |> spread(cultivar, tssm) |> pivot_longer(-c(block, date), names_to = 'cultivar', values_to = 'tssm') |> group_by(cultivar, date) |> mutate(tssm = if_else(is.na(tssm), mean(tssm, na.rm = T), tssm)) |> ungroup() |> arrange(cultivar, block, date) ``` After imputing averages, the distributions of TSSM (on a log scale) across cultivars by date are as follows: ```{r} #| fig-height: 2.5 #| fig-width: 6 tssm |> ggplot(aes(x = tssm)) + geom_histogram(bins = 10) + facet_wrap(~date, nrow = 1) + scale_x_continuous(transform = 'log1p', breaks = c(0, 10, 100, 1000)) + theme_bw() + theme(axis.text.x = element_text(angle = 90)) ``` By the last sampling date, TSSM is almost uniformly zero, so there is no longer any signal in the data -- this date can be removed prior to analysis. Visually, signal on 5/13 also looks negligible, but 25% of plots still have nonzero (though small) values: ```{r} tssm |> filter(date == mdy("5/13/2025")) |> ggplot(aes(x = tssm)) + geom_histogram(bins = 4) + theme_bw() ``` I retained data from 5/13, but the signal is likely very weak. Client may wish to remove at their discretion. To format for analysis, I've converted date to number of weeks since first sample, stored as a factor (`weeks.f`) and as an integer (`weeks`), and added an integer-valued plot identifier (`id`). While it won't affect the analysis, **client may wish to encode weeks relative to the date of treatment application for interpretability.** Note that I've also log-transformed TSSM after adding a pseudo-count to replace zeroes. ```{r} # remove last sampling date (since 98% zeroes) tssm_analysis <- tssm |> filter(date < mdy("5-28-25")) |> group_by(block, cultivar) |> mutate(tssm = log(tssm + 1), id = cur_group_id(), week = week(date) - min(week(date)), week.f = factor(week)) |> ungroup() |> select(cultivar, block, id, date, week, week.f, tssm) # preview head(tssm_analysis) |> pander() ``` It does look as if there's a modest block effect. Further, the block effect appears fairly uniform across time -- this could be incorporated to simplify the analysis, if client so wishes. ```{r} # inspect block means (does look like a block effect) tssm_analysis |> group_by(block, date) |> summarize(tssm = mean(tssm)) |> ggplot(aes(x = date, y = tssm)) + geom_path(aes(group = block)) + geom_point() + theme_bw() + labs(y = "log(tssm) block mean") ``` Finally, we can look at the distributions of mite levels by cultivar. Across all time points: ```{r} tssm_analysis |> ggplot(aes(x = tssm, y = fct_reorder(cultivar, tssm))) + geom_boxplot() + theme_bw() + labs(x = 'log(tssm)', y = NULL) ``` And disaggregated by time: ```{r} tssm_analysis |> ggplot(aes(x = tssm, y = fct_reorder(cultivar, tssm))) + geom_boxplot() + facet_grid(~week.f) + theme_bw() + labs(x = 'log(tssm)', y = NULL) ``` Interestingly, mite levels seem to exhibit fairly different time trajectories depending on cultivar. It's also pretty clear that variances are nonhomogeneous both by cultivar and over time. ### Model specification The usual statistical model at one time point would be: $$ \log(\text{tssm}_{ij}) = \mu + \text{cultivar}_i + \text{block}_j + \epsilon_{ij} $$ To incorporate the repeated measures, we'll add in week (as a factor) and its interactions. To treat the block effect as time-varying, include two-way interactions with both block and cultivar effects: $$ \log(\text{tssm}_{ijk}) = \mu + \text{cult}_i + \text{blk}_j + \text{wk}_k + (\text{cult}\times\text{wk})_{ik} + (\text{blk}\times\text{wk})_{jk} + \epsilon_{ijk} \qquad\qquad(1) $$ If instead the block effect is assumed uniform over time, a slightly simpler model can be used: $$ \log(\text{tssm}_{ijk}) = \mu + \text{cult}_i + \text{blk}_j + \text{wk}_k + (\text{cult}\times\text{wk})_{ik} + \epsilon_{ijk} \qquad\qquad(2) $$ I somewhat prefer this model, but it won't make much of a difference downstream in analysis. I've tried it both ways and little changes. Here's a comparison of estimated block medians under each model: ```{r, echo = F} #| fig-width: 8 #| fig-height: 4 block_uniform <- aov(tssm ~ cultivar*week.f + block, data = tssm_analysis) |> emmeans(~block*week.f) |> confint(adjust = 'mvt') |> as_tibble() |> left_join(distinct(tssm_analysis, date, week.f, block), join_by(week.f, block)) |> ggplot(aes(x = date, y = exp(emmean) - 1)) + geom_path(aes(group = block, color = block)) + geom_ribbon(aes(ymin = exp(lower.CL) - 1, ymax = exp(upper.CL) - 1, fill = block, group = block), alpha = 0.1) + theme_bw() + labs(x = NULL, y = NULL, title = 'model (2)') + scale_y_continuous(limits = c(-0.5, 20), transform = 'log1p') block_varying <- aov(tssm ~ (cultivar + block)*week.f, data = tssm_analysis) |> emmeans(~block*week.f) |> confint(adjust = 'mvt') |> as_tibble() |> left_join(distinct(tssm_analysis, date, week.f, block), join_by(week.f, block)) |> ggplot(aes(x = date, y = exp(emmean) - 1)) + geom_path(aes(group = block, color = block)) + geom_ribbon(aes(ymin = exp(lower.CL) - 1, ymax = exp(upper.CL) - 1, fill = block, group = block), alpha = 0.1) + theme_bw() + guides(color = guide_none(), fill = guide_none()) + labs(x = NULL, y = 'block median TSSM', title = 'model (1)') + scale_y_continuous(limits = c(-0.5, 20), transform = 'log1p') block_varying + block_uniform + plot_layout(nrow = 1) ``` Qualitatively, we might ask whether it's reasonable to expect that mite populations would exhibit different trajectories by block. Since all blocks have the same cultivars, this mainly has to do with experimental conditions and possible extraneous sources of variability that could affect entire blocks. Based on the data, after considering estimated sampling variability, it's really only block 11 that differs considerably under the two models, and even then only at week 11. I'd be happy to overlook this in the analysis, but client may wish to use the more complex model. ### Repeated measures Repeated measures often induce within-unit correlations that can be modeled by adding (in this case) a plot random effect. However, if we look at residual correlations from model (2) across weeks, they aren't especially strong: ```{r} # check residual correlation under model 2 -- not that strong tssm_analysis |> bind_cols(resid = aov(tssm ~ cultivar*week.f + block, data = tssm_analysis)$residuals) |> select(id, week, resid) |> spread(week, resid) |> column_to_rownames('id') |> cor() ``` Only two stand out: - weeks 6 and 9 - weeks 9 and 11 These are the nearest-in-time observations. In general, it looks as if observations are taken far enough apart in time that there's not a strong within-plot correlation. This presents two options: 1. Assume independence as usual and work with the model above 2. Add a within-plot random effect having AR(1) correlation structure *appropriately scaled by time distance* There's also a third possibility that's a little less common and involves treating time as continuous: 3. Estimate response curves by cultivar #### (Option 1) Independent errors For this option, we take model (2) and specify that for the error term one has: $$ \log(\text{tssm}_{ijk}) = \mu + \text{cult}_i + \text{blk}_j + \text{wk}_k + (\text{cult}\times\text{wk})_{ik} + \epsilon_{ijk} \;,\qquad \epsilon_{ijk} \stackrel{iid}{\sim} N(0, \sigma^2) $$ The equal variance assumption is an oversimplification, but one that we'll overlook for now. The model above gives the ANOVA: ```{r, echo = T} # fit anova model with independent errors fit.aov <- aov(tssm ~ cultivar*week.f + block, data = tssm_analysis) # anova table fit.aov |> summary() |> pander() ``` The estimated cultivar median TSSM values that result from this approach are: ```{r} # plot estimated median counts by cultivar fit.aov |> emmeans(~cultivar*week.f) |> as_tibble() |> ggplot(aes(x = week.f, y = exp(emmean) - 1, group = cultivar)) + geom_path(alpha = 0.5) + geom_ribbon(aes(ymin = exp(lower.CL) - 1, ymax = exp(upper.CL) - 1), alpha = 0.05) + theme_bw() + scale_x_discrete(expand = expansion(mult = 0.01)) + scale_y_continuous(transform = 'log1p') + labs(y = 'cultivar median tssm', x = 'week') ``` Pairwise contrasts are not likely to be helpful, and as client noted, the study is not powered to detect differences after adjusting for 528 contrasts at each time point (2640 in total). My recommendation is to generate simultaneous intervals for the cultivar medians either by time or on a time-averaged basis, and then order them from best to worst or vice-versa. Here are time-averaged estimates with simultaneous 80% confidence intervals: ```{r, echo = T} # plot cultivar medians fit.aov |> emmeans(~cultivar) |> confint(adjust = 'mvt', level = 0.8) |> as_tibble() |> mutate(across(c(emmean, upper.CL, lower.CL), ~exp(.x) - 1)) |> ggplot(aes(x = emmean, y = fct_reorder(cultivar, emmean))) + geom_point() + geom_errorbarh(aes(xmin = lower.CL, xmax = upper.CL)) + theme_bw() + labs(y = NULL, x = 'median TSSM') ``` Note that the usual multiplicity adjustments do not account for comparisons between cultivars. If cultivar comparisons are of interest, I'd recommend adjusting intervals using Scheffe's method, which accounts for all possible contrasts. However, this produces quite wide intervals, and, unsurprisingly, no comparisons are significantly different. ```{r} # plot cultivar medians fit.aov |> emmeans(~cultivar) |> confint(adjust = 'scheffe', level = 0.8) |> as_tibble() |> mutate(across(c(emmean, upper.CL, lower.CL), ~exp(.x) - 1)) |> ggplot(aes(x = emmean, y = fct_reorder(cultivar, emmean))) + geom_point() + geom_errorbarh(aes(xmin = lower.CL, xmax = upper.CL)) + theme_bw() + labs(y = NULL, x = 'median TSSM') ``` One could also look at estimated cultivar medians disaggregated by time: ```{r} #| fig-width: 8 #| fig-height: 4 # order cultivars by overall median cultivar_order <- fit.aov |> emmeans(~cultivar) |> as_tibble() |> mutate(order = rank(emmean)) |> select(cultivar, order) fit.aov |> emmeans(~cultivar*week.f) |> confint(adjust = 'mvt', level = 0.8) |> as_tibble() |> left_join(cultivar_order, join_by(cultivar)) |> mutate(across(c(emmean, upper.CL, lower.CL), ~exp(.x) - 1)) |> ggplot(aes(x = emmean, y = fct_reorder(cultivar, order))) + geom_point() + geom_errorbarh(aes(xmin = lower.CL, xmax = upper.CL)) + theme_bw() + facet_grid(~week.f) + labs(y = NULL, x = 'median TSSM') ``` The multiplicity adjustments again start to become inflated due to the number of estimates (165 here). I'd probably go with the time-averaged plot as a primary figure. However, it is notable that mite levels peak at different times for different cultivars, and this plot makes that a bit easier to see than the EDA version showing boxplots by time point -- certainly worth including at least on a supplementary basis. #### (Option 2) AR(1) correlation If the time correlation is of particular interest, we can modify the model by adding a AR(1) covariance structure within plots. That is: $$ \log(\text{tssm}_{ijk}) = \mu + \text{cult}_i + \text{blk}_j + \text{wk}_k + (\text{cult}\times\text{wk})_{ik} + \epsilon_{ijk} \qquad \begin{cases} \epsilon \sim N(0, \Sigma) \\ \text{var}(\epsilon_{ijk}) = \sigma^2 \\ \text{cov}(\epsilon_{ijk}, \epsilon_{ijk'}) = \rho^{|w(k) - w(k')|} \\ \Sigma_{mn} = 0 \;\text{otherwise} \end{cases} $$ The parameter $\rho$ represents the correlation between two measurements taken one week apart, and the covariance model specifies that this correlation decays geometrically with time -- $w(k)$ denotes the numeric week for index $k$. This model is fairly straightforward to fit in R using `nlme::gls` and gives the following ANOVA table: ```{r, echo = T} # format week for nlme tssm_analysis_ar1 <- tssm_analysis |> mutate(week.i = as.integer(week)) # fit repeated measures anova with AR1 covariance fit.aov.ar1 <- gls(tssm ~ week.f*cultivar + block, corr = corAR1(value = 0.5, form = ~ week.i | id), data = tssm_analysis_ar1) # anova table (questionable) anova(fit.aov.ar1) |> pander() ``` There is some controversy around constructing and interpreting the F statistics in this context since sums of squares are no longer independent under the model. The estimated covariance structure can be summarized as shown below. I've included correlations at 2, 3, and 6 week spacing, since those are the unique spacings of sampling dates in the dataset. ```{r, echo = T} # inspect estimated covariance structure tibble(sigma = fit.aov.ar1$sigma, rho = coef(fit.aov.ar1$modelStruct$corStruct, unconstrained = FALSE)) |> mutate(cor2wks = rho^2, cor3wks = rho^3, cor6wks = rho^6) |> pander() ``` All the same plots as before can be constructed using estimates from this model, and they are essentially identical. ```{r} #| fig-height: 4 #| fig-width: 8 # plot estimated median counts by cultivar p1 <- fit.aov.ar1 |> emmeans(~cultivar*week.f) |> confint(adjust = 'mvt', level = 0.8) |> as_tibble() |> ggplot(aes(x = week.f, y = exp(emmean) - 1, group = cultivar)) + geom_path(alpha = 0.5) + geom_ribbon(aes(ymin = exp(lower.CL) - 1, ymax = exp(upper.CL) - 1), alpha = 0.05) + theme_bw() + scale_x_discrete(expand = expansion(mult = 0.01)) + scale_y_continuous(transform = 'log1p') + labs(y = 'cultivar median tssm', x = 'week') # plot cultivar medians p2 <- fit.aov.ar1 |> emmeans(~cultivar) |> confint(adjust = 'mvt', level = 0.8) |> as_tibble() |> mutate(across(c(emmean, upper.CL, lower.CL), ~exp(.x) - 1)) |> ggplot(aes(x = emmean, y = fct_reorder(cultivar, emmean))) + geom_point() + geom_errorbarh(aes(xmin = lower.CL, xmax = upper.CL)) + theme_bw() + labs(y = NULL, x = 'median TSSM') p1 + p2 + plot_layout(nrow = 1) ``` This approach will produce the same conclusions as option 1, and only differs in that you have an estimate for the time correlation. If the time correlation is not of specific interest, client may wish to use the simpler model (option 1) -- considering the time correlation is not especially strong, this is an entirely reasonable choice. #### (Option 3) Response curves A less common but potentially helpful alternative is to estimate response curves as functions of time. This arises from taking the basic RCBD model, treating weeks since first sampling date as a continuous time variate, and adding a cultivar-dependent function of time: $$ \log(\text{tssm}_{ij}(t)) = \mu + \text{cult}_i + \text{blk}_j + f_i(t) + \epsilon_{ij} $$ Notice I'm still assuming that the block effect is uniform over time -- this can be changed if desired. I fit this model using a B-spline representation $f_i(t) = \sum_j B_j(t)$, which is fairly straightforward to do in R. This gives the ANOVA: ```{r, echo = T} # append splines to data tssm_analysis_sp <- tssm_analysis |> mutate(week.bs = splines::bs(week, degree = 3)) # fit model fit.aov.sp <- aov(tssm ~ cultivar*week.bs + block, data = tssm_analysis_sp) # anova table summary(fit.aov.sp) |> pander() ``` What's nice about this model, although a little more complex than the usual ANOVA, is that estimates can be used to reconstruct response curves: ```{r} # data grid for plotting response curves tssm_grid <- tssm_analysis_sp |> data_grid(cultivar = cultivar, block = block, week = seq_range(week, n = 50)) |> mutate(week.bs = splines::bs(week, degree = 3)) # rank cultivars by median tssm cultivar_order <- mite |> group_by(cultivar) |> filter(date < mdy("5/28/25")) |> summarize(across(tssm, median)) |> arrange(cultivar) |> mutate(rank = rank(tssm, ties.method = 'first')) |> select(-tssm) |> arrange(rank) # medians to show sample values cultivar_medians <- mite |> group_by(cultivar, date) |> filter(date < mdy("5/28/25")) |> summarize(across(tssm, median)) |> left_join(cultivar_order, join_by(cultivar)) |> mutate(cultivar = factor(cultivar) |> fct_reorder(rank)) # visualize tssm_grid %>% bind_cols(., pred = predict(fit.aov.sp, newdata = ., se.fit = T)) |> group_by(cultivar, week) |> summarize(fit = mean(fit), se = sqrt(sum(se.fit^2)/n()^2), .groups = 'drop') |> left_join(cultivar_order, join_by(cultivar)) |> mutate(cultivar = factor(cultivar) |> fct_reorder(rank)) |> arrange(cultivar) |> ggplot(aes(x = week, y = exp(fit) - 1)) + facet_wrap(~cultivar) + geom_path() + geom_ribbon(aes(ymin = exp(fit - se), ymax = exp(fit + se)), alpha = 0.25) + geom_point(aes(x = week(date) - min(week(date)), y = tssm), data = cultivar_medians, alpha = 0.25) + geom_path(aes(x = week(date) - min(week(date)), y = tssm), data = cultivar_medians, alpha = 0.25) + theme_bw() + labs(y = 'median TSSM') ``` With a little clever effort, one could compute an integral over the time domain to estimate cumulative mite populations by cultivar and use this as a basis for cultivar comparisons. As an aside, from a statistical perspective the errors in this model are still assumed independent, but per the discussion in Option 2, there is likely some time correlation within plot. As a consequence, the uncertainty bands on the response curves above are likely a bit too narrow. However, considering that in Option 2 the autocorrelation is fairly weak given the time spacing, I wouldn't be too concerned about this issue. That said, if client wishes to adjust for that, there are options for incorporating a covariance model into the specification. ## PM For the PM response, there is noticeably more zero-inflation. As a result, standard ANOVA is not recommended. ```{r} mite |> ggplot(aes(x = pm)) + geom_histogram(bins = 15) + facet_wrap(~date) + scale_x_continuous(transform = 'log1p') + theme_bw() ``` There are three common strategies for dealing with this issue: 1. Convert data to presence/absence 2. Filter out zeroes and model the response conditional on mite presence 3. Fit a mixture model with a component accounting for the excess probability of zero Since none of the counts are especially high (`r round(100*mean(mite$pm < 5, na.rm = T), 2)`% of obserations are under 5), there's not a particularly strong information signal in the numeric values of the nonzero responses, so I think that the presence/absence approach is best here. Here is the distribution of nonzero values: ```{r} # distribution of nonzero values mite |> filter(pm > 0) |> ggplot(aes(x = pm)) + geom_histogram(bins = 15) + theme_bw() ``` I'd expect that even if one did model the quantitative response, the scale of variation by cultivar would not be practically meaningful. After imputing missing values using the same approach as for TSSM and converting the PM values to presence/absence, data look like so: ```{r} # impute missing values and convert to p/a pm <- mite |> select(-tssm, -plot) |> spread(cultivar, pm) |> pivot_longer(-c(block, date), names_to = 'cultivar', values_to = 'pm') |> group_by(cultivar, date) |> mutate(pm = if_else(is.na(pm), mean(pm, na.rm = T), pm)) |> ungroup() |> arrange(cultivar, block, date) |> mutate(pm.f = factor(pm > 0, labels = c('absent', 'present'))) |> ungroup() |> group_by(block, cultivar) |> mutate(id = cur_group_id(), week = week(date) - min(week(date)), week.f = factor(week)) |> ungroup() |> select(cultivar, block, id, date, week, week.f, pm, pm.f) # preview pm |> head() |> pander() ``` I'd fit a binomial GLM with the same predictor form as the standard anova model for an RCBD: $$ \mathbb{I}(\text{pm}_{ij} > 0) \sim \text{bernoulli}(p_{ij}) \quad\text{where}\quad \text{logit}(p_{ij}) = \mu + \text{cultivar}_i + \text{block}_j $$ This model is easy to fit in R and yields the following analysis of deviance table: ```{r, echo = T} # fit binomial glm fit.glm <- glm(pm.f ~ cultivar + block + week.f + week.f:cultivar, family = binomial(), data = pm) # analysis of deviance table anova(fit.glm) |> pander() ``` The $p$-values correspond to sequential likelihood ratio tests akin to $F$ tests using type I sums of squares in a standard ANOVA. Note that in the formula specification the terms are entered in the desired order in which the tests are conducted. Interestingly, only time is significant; there do not appear to be cultivar differences in the probability of predatory mite occurrence. Nonetheless, here are the estimated predatory mite occurrence probabilities over time by cultivar with pointwise (not simultaneous) 80% confidence intervals and ordered from highest average probability (upper left) to lowest (bottom right): ```{r} cultivar_order <- fit.glm |> emmeans(~cultivar, type = 'response') |> as_tibble() |> mutate(rank = rank(1 - prob)) |> select(cultivar, rank) fit.glm |> emmeans(~week.f*cultivar, type = 'response') |> confint(adjust = 'none', level = 0.8) |> as_tibble() |> left_join(distinct(pm, date, week.f), join_by(week.f)) |> left_join(cultivar_order, join_by(cultivar)) |> mutate(cultivar = fct_reorder(cultivar, rank)) |> ggplot(aes(x = date, y = prob)) + geom_path(aes(group = cultivar)) + geom_ribbon(aes(ymin = asymp.LCL, ymax = asymp.UCL), alpha = 0.2) + facet_wrap(~cultivar) + theme_bw() + labs(x = NULL, y = 'P(PM > 0)') ``` The estimated aggregate probability of predatory mite occurrence over time (averaged over cultivars and blocks) is: ```{r} glm(pm.f ~ week.f, family = binomial(), data = pm) |> emmeans(~week.f, type = 'response') |> confint(adjust = 'none', level = 0.8) |> as_tibble() |> left_join(distinct(pm, date, week.f), join_by(week.f)) |> ggplot(aes(x = date, y = prob)) + geom_path() + geom_ribbon(aes(ymin = asymp.LCL, ymax = asymp.UCL), alpha = 0.2) + theme_bw() + labs(x = NULL, y = 'P(PM > 0)') ``` While there may be some weak autocorrelation within plots, I'd keep it simple and avoid attempting to model this directly. I think the effort to correctly specify and fit a GLMM is not worth the potential benefit, and if the TSSM data are any indication, the autocorrelation does not appear strong at the scale of the time spacing of observations. # Trichome data For the trichome data, samples were collected on one date only -- there are no repeated measurements. I would suggest a standard ANOVA here after averaging the subsamples (as was done for the other responses in the repeated measures data). ```{r} trichome_raw |> group_by(plot, cultivar, date.sampled) |> summarize(tri = mean(tri)) |> ggplot(aes(y = fct_reorder(cultivar, tri), x = tri)) + geom_boxplot() + theme_bw() + labs(x = 'trichomes', y = NULL) ``` The variances are somewhat obviously different, but I wouldn't be too concerned about that -- it's almost expected given the number of cultivars, and I don't think it's worth the effort of adding a covariance model to account for this. However, this could be done if desired. The standard RCBD model is: $$ \text{trichomes}_{ij} = \mu + \text{cult}_i + \text{blk}_j + \epsilon_{ij} $$ This gives the ANOVA: ```{r, echo = T} trichome_raw |> group_by(plot, cultivar, date.sampled) |> summarize(tri = mean(tri)) |> mutate(block = str_trunc(plot, 2, side = 'right', ellipsis = '')) %>% aov(tri ~ cultivar + block, data = .) |> summary() |> pander() ``` The inhomogeneous variance is not terribly concerning. This may affect interval coverage for a few groups, but as noted above, I wouldn't be too concerned. ```{r} trichome_raw |> group_by(plot, cultivar, date.sampled) |> summarize(tri = mean(tri)) |> mutate(block = str_trunc(plot, 2, side = 'right', ellipsis = '')) %>% aov(tri ~ cultivar + block, data = .) |> broom::augment() |> ggplot(aes(x = .fitted, y = .std.resid)) + geom_point() + theme_bw() + scale_y_continuous(limits = c(-1, 1)*5) + labs(x = 'fitted values', y = 'studentized residuals') ``` The following gives estimated means ordered from largest to smallest: ```{r} trichome_raw |> group_by(plot, cultivar, date.sampled) |> summarize(tri = mean(tri)) |> mutate(block = str_trunc(plot, 2, side = 'right', ellipsis = '')) %>% aov(tri ~ cultivar + block, data = .) |> emmeans(~ cultivar) |> confint(method = 'holm', level = 0.9) |> as_tibble() |> ggplot(aes(x = emmean, y = fct_reorder(cultivar, emmean))) + geom_point() + geom_errorbarh(aes(xmin = lower.CL, xmax = upper.CL)) + labs(x = 'trichomes', y = NULL) + theme_bw() ``` It would be interesting to see whether there's any correspondence between the mite and trichome levels by comparing the rankings of estimated means obtained from each model. Client will need to recode the cultivars so that they match across datasets to facilitate this comparison; Spearman rank correlation between the two sets of rankings aligned by cultivar would provide some indication of whether there's any relationship. # Summary 1. How to handle missing data? > Impute averages by cultivar and time. This assumes observations are missing completely at random, which seems reasonable here since the researchers just missed a few plots here and there when sampling. 2. How to handle repeated measures? > While there are a few options for the TSSM data, I'd keep it simple and use a standard ANOVA and add a term for week and a week-cultivar interaction. Assuming independent errors is a bit of an oversimplification, but the correlation does not seem strong on the scale of the time spacing of observations in the data. 3. How to do cultivar comparisons with so many treatments? > Avoid pairwise comparisons or other contrasts here unless there are any comparisons that are *a priori* important to examine. Just estimate the means with multiple inference adjustments. 4. How to analyze trichome data? > Average the subsamples and fit an ANOVA model.