CLIENT

Colin Koubek

DATE

July 22, 2025

Data are repeated measures from an RCBD. In detail:

Goal is to compare each response among the 33 cultivars longitudinally across time points. Some plots were missed accidentally on certain sampling dates owing to the large number of samples to collect.

In addition, trichomes were counted on three subsamples from each plot at the end of the experiment.

Client questions:

  1. How to handle missing data?
  2. How to handle repeated measures?
  3. How to do cultivar comparisons with so many treatments?
  4. How to analyze trichome data?

Mite data

Raw mite data as provided is shown below.

date.sampled date.counted cultivar plot tssm pm block
2025-02-11 2025-02-14 Surfline 1001 0.2 0 10
2025-03-25 2025-03-26 Surfline 1001 0.2 0 10
2025-04-29 2025-04-30 Surfline 1001 2 0.2 10
2025-05-13 2025-05-19 Surfline 1001 0 0 10
2025-05-28 2025-05-28 Surfline 1001 0 0 10
2025-02-11 2025-02-14 Monarch 1002 0 0 10

Quality control

There are a few quality control issues to address that I identified as I was checking basic data attributes and identifying missing observations. Based on the design I expect either one or zero samples per cultivar/block/date combination. A handful of combinations saw two or more samples:

block date.sampled cultivar n
11 2025-04-14 Albion (LC) 2
11 2025-02-11 BG-10.3181 2
12 2025-02-11 BG-10.3181 2
12 2025-03-25 BG-10.3181 2
12 2025-04-14 BG-10.3181 2
12 2025-04-29 BG-10.3181 2
12 2025-05-13 BG-10.3181 2
12 2025-05-28 BG-10.3181 2
11 2025-03-25 BG-4.367 2
11 2025-02-11 Eclipse 2
11 2025-02-11 Royal Royce 2
11 2025-03-25 Royal Royce 3
11 2025-04-14 Royal Royce 2
11 2025-04-29 Royal Royce 2
11 2025-05-13 Royal Royce 2
11 2025-05-28 Royal Royce 2
10 2025-03-25 UC_DN18 2

Additionally, a few blocks are missing in their entirety for certain cultivars and seem to correspond to blocks duplicated in their entirety for other cultivars:

block date.sampled BG-10.3169 BG-10.3181 Mojo Royal Royce
10 2025-02-11 1 1 1 0
10 2025-03-25 1 1 1 1
10 2025-04-14 1 1 1 1
11 2025-02-11 1 2 0 2
11 2025-03-25 1 1 0 3
11 2025-04-14 1 1 0 2
11 2025-04-29 1 1 0 2
11 2025-05-13 1 1 0 2
11 2025-05-28 1 1 0 2
12 2025-02-11 0 2 1 1
12 2025-03-25 0 2 1 1
12 2025-04-14 0 2 1 1
12 2025-04-29 0 2 1 1
12 2025-05-13 0 2 1 1
12 2025-05-28 0 2 1 1
13 2025-04-14 1 1 1 1
13 2025-04-29 1 1 1 1
13 2025-05-13 1 1 1 1

Client confirmed plot mislabeling for the cultivars shown above. Here is the labeling in the original data:

plot cultivar
1023 BG-10.3169
1111 BG-10.3169
1308 BG-10.3169
1033 BG-10.3181
1121 BG-10.3181
1217 BG-10.3181
1222 BG-10.3181
1305 BG-10.3181
1025 Mojo
1220 Mojo
1311 Mojo
1005 Royal Royce
1120 Royal Royce
1125 Royal Royce
1214 Royal Royce
1328 Royal Royce

Plots 1222, 1217, and 1120 are mislabeled. After recoding these, there are only a handful of “duplicated” observations:

plot date.sampled cultivar tssm pm
1011 2025-03-25 UC_DN18 1.2 0
1011 2025-03-25 UC_DN18 2.8 2.6
1121 2025-02-11 BG-10.3181 0.6 0
1121 2025-02-11 BG-10.3181 1.8 0
1122 2025-02-11 Eclipse 0.2 0
1122 2025-02-11 Eclipse 0.2 0
1126 2025-03-25 BG-4.367 3.4 0
1126 2025-03-25 BG-4.367 0.4 0
1125 2025-03-25 Royal Royce 13.8 0
1125 2025-03-25 Royal Royce 1.6 0
1123 2025-04-14 Albion (LC) 18.6 0.60000000000000009
1123 2025-04-14 Albion (LC) NA NA

Plot 1123 can be retained since one of the two observations is missing (though this empty row – 332 in the excel file – is perhaps better simply removed from the original data). Otherwise, response values are not in general identical for these plots. Client should verify plots above have correct cultivar labels. If there is no further mislabeling, and the “extra” observations cannot be otherwise accounted for, these can be removed from the data and treated as missing, as is done below.

In general, missing observations are not recorded explicitly in the data provided – this is a fine convention. However, there are three exceptions:

date.sampled cultivar plot tssm pm
2025-04-14 Albion (LC) 1123 NA NA
2025-04-29 Monarch 1333 NA NA
2025-05-28 Monarch 1333 NA NA

It is also worth noting that there are a few instances where TSSM is recorded but PM is missing (there are none in the opposite configuration with PM recorded and TSSM missing). Client may wish to inspect these samples:

date.sampled cultivar plot tssm pm
2025-05-28 Albion (LC) 1321 0 NA
2025-02-11 Monarch 1333 5.8 NA
2025-03-25 Monarch 1333 5 NA

The pseudo-duplicates and samples with both TSSM and PM recorded as missing are removed. After a bit of renaming, the analysis-ready data look like so:

block plot date cultivar tssm pm
10 1001 2025-02-11 Surfline 0.2 0
10 1001 2025-03-25 Surfline 0.2 0
10 1001 2025-04-29 Surfline 2 0.2
10 1001 2025-05-13 Surfline 0 0
10 1001 2025-05-28 Surfline 0 0
10 1002 2025-02-11 Monarch 0 0

Missing data

Relatively few observations are missing – only 8 date/block/cultivar combinations, not counting the pseudo-duplicates. Here is a list of missing observations and pseudo-duplicates:

cultivar plot block date
Albion (LC) 1123 11 2025-03-25
BG-10.3181 1121 11 2025-02-11
BG-4.367 1126 11 2025-03-25
Eclipse 1122 11 2025-02-11
Fronteras 1225 12 2025-03-25
Keystone 1101 11 2025-03-25
Monarch 1333 13 2025-04-14
Monarch 1333 13 2025-04-29
Monarch 1333 13 2025-05-13
Monarch 1333 13 2025-05-28
Monterey 1226 12 2025-03-25
PS-14.762.054 1022 10 2025-02-11
PS-14.762.054 1332 13 2025-04-29
Royal Royce 1005 10 2025-02-11
Royal Royce 1125 11 2025-03-25
Surfline 1001 10 2025-04-14
UC_DN18 1011 10 2025-03-25
UC_SD16 1021 10 2025-02-11

It’s worth noting that:

  • Plot 1333 (Monarch) is entirely missing after March
  • PS-14.762.054 and Royal Royce are both missed twice, but on different plots at different dates

Otherwise, there is very little pattern to the missingness. It seems reasonable to assume data are missing completely at random. There are sufficient data from nonmissing blocks to impute treatment means by date for the missing observations and then proceed with the analysis as if data were complete.

TSSM

The group means averaged across blocks by date are shown below with smoothing across treatments shown by the solid blue line.

These values are imputed for the missing plots/dates.

Code
# impute means for missing observations
tssm <- mite |>
  select(-pm, -plot) |>
  spread(cultivar, tssm) |>
  pivot_longer(-c(block, date), names_to = 'cultivar', values_to = 'tssm') |>
  group_by(cultivar, date) |>
  mutate(tssm = if_else(is.na(tssm), mean(tssm, na.rm = T), tssm)) |>
  ungroup() |>
  arrange(cultivar, block, date)

After imputing averages, the distributions of TSSM (on a log scale) across cultivars by date are as follows:

By the last sampling date, TSSM is almost uniformly zero, so there is no longer any signal in the data – this date can be removed prior to analysis.

Visually, signal on 5/13 also looks negligible, but 25% of plots still have nonzero (though small) values:

I retained data from 5/13, but the signal is likely very weak. Client may wish to remove at their discretion.

To format for analysis, I’ve converted date to number of weeks since first sample, stored as a factor (weeks.f) and as an integer (weeks), and added an integer-valued plot identifier (id). While it won’t affect the analysis, client may wish to encode weeks relative to the date of treatment application for interpretability. Note that I’ve also log-transformed TSSM after adding a pseudo-count to replace zeroes.

cultivar block id date week week.f tssm
Albion (LC) 10 1 2025-02-11 0 0 0.6931
Albion (LC) 10 1 2025-03-25 6 6 3.258
Albion (LC) 10 1 2025-04-14 9 9 3.006
Albion (LC) 10 1 2025-04-29 11 11 1.649
Albion (LC) 10 1 2025-05-13 13 13 0
Albion (LC) 11 34 2025-02-11 0 0 0.5878

It does look as if there’s a modest block effect. Further, the block effect appears fairly uniform across time – this could be incorporated to simplify the analysis, if client so wishes.

Finally, we can look at the distributions of mite levels by cultivar. Across all time points:

And disaggregated by time:

Interestingly, mite levels seem to exhibit fairly different time trajectories depending on cultivar. It’s also pretty clear that variances are nonhomogeneous both by cultivar and over time.

Model specification

The usual statistical model at one time point would be:

\[ \log(\text{tssm}_{ij}) = \mu + \text{cultivar}_i + \text{block}_j + \epsilon_{ij} \]

To incorporate the repeated measures, we’ll add in week (as a factor) and its interactions. To treat the block effect as time-varying, include two-way interactions with both block and cultivar effects:

\[ \log(\text{tssm}_{ijk}) = \mu + \text{cult}_i + \text{blk}_j + \text{wk}_k + (\text{cult}\times\text{wk})_{ik} + (\text{blk}\times\text{wk})_{jk} + \epsilon_{ijk} \qquad\qquad(1) \]

If instead the block effect is assumed uniform over time, a slightly simpler model can be used:

\[ \log(\text{tssm}_{ijk}) = \mu + \text{cult}_i + \text{blk}_j + \text{wk}_k + (\text{cult}\times\text{wk})_{ik} + \epsilon_{ijk} \qquad\qquad(2) \]

I somewhat prefer this model, but it won’t make much of a difference downstream in analysis. I’ve tried it both ways and little changes. Here’s a comparison of estimated block medians under each model:

Qualitatively, we might ask whether it’s reasonable to expect that mite populations would exhibit different trajectories by block. Since all blocks have the same cultivars, this mainly has to do with experimental conditions and possible extraneous sources of variability that could affect entire blocks. Based on the data, after considering estimated sampling variability, it’s really only block 11 that differs considerably under the two models, and even then only at week 11. I’d be happy to overlook this in the analysis, but client may wish to use the more complex model.

Repeated measures

Repeated measures often induce within-unit correlations that can be modeled by adding (in this case) a plot random effect. However, if we look at residual correlations from model (2) across weeks, they aren’t especially strong:

            0          6           9          11          13
0  1.00000000  0.2066859  0.05955158  0.06314299  0.09003641
6  0.20668594  1.0000000  0.35129712  0.16580759 -0.10815681
9  0.05955158  0.3512971  1.00000000  0.50349111 -0.30260907
11 0.06314299  0.1658076  0.50349111  1.00000000 -0.09398731
13 0.09003641 -0.1081568 -0.30260907 -0.09398731  1.00000000

Only two stand out:

  • weeks 6 and 9
  • weeks 9 and 11

These are the nearest-in-time observations. In general, it looks as if observations are taken far enough apart in time that there’s not a strong within-plot correlation. This presents two options:

  1. Assume independence as usual and work with the model above
  2. Add a within-plot random effect having AR(1) correlation structure appropriately scaled by time distance

There’s also a third possibility that’s a little less common and involves treating time as continuous:

  1. Estimate response curves by cultivar

(Option 1) Independent errors

For this option, we take model (2) and specify that for the error term one has:

\[ \log(\text{tssm}_{ijk}) = \mu + \text{cult}_i + \text{blk}_j + \text{wk}_k + (\text{cult}\times\text{wk})_{ik} + \epsilon_{ijk} \;,\qquad \epsilon_{ijk} \stackrel{iid}{\sim} N(0, \sigma^2) \]

The equal variance assumption is an oversimplification, but one that we’ll overlook for now. The model above gives the ANOVA:

Code
# fit anova model with independent errors
fit.aov <- aov(tssm ~ cultivar*week.f + block, data = tssm_analysis) 

# anova table
fit.aov |> summary() |> pander()
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
cultivar 32 102.3 3.198 5.223 7.156e-17
week.f 4 256.6 64.15 104.8 1.698e-64
block 3 46.32 15.44 25.22 3.437e-15
cultivar:week.f 128 123.9 0.9676 1.58 0.0003043
Residuals 492 301.2 0.6123 NA NA

The estimated cultivar median TSSM values that result from this approach are:

Pairwise contrasts are not likely to be helpful, and as client noted, the study is not powered to detect differences after adjusting for 528 contrasts at each time point (2640 in total). My recommendation is to generate simultaneous intervals for the cultivar medians either by time or on a time-averaged basis, and then order them from best to worst or vice-versa.

Here are time-averaged estimates with simultaneous 80% confidence intervals:

Code
# plot cultivar medians
fit.aov |>
  emmeans(~cultivar) |>
  confint(adjust = 'mvt', level = 0.8) |>
  as_tibble() |>
  mutate(across(c(emmean, upper.CL, lower.CL), ~exp(.x) - 1)) |>
  ggplot(aes(x = emmean, y = fct_reorder(cultivar, emmean))) +
  geom_point() +
  geom_errorbarh(aes(xmin = lower.CL, xmax = upper.CL)) +
  theme_bw() +
  labs(y = NULL, x = 'median TSSM')

Note that the usual multiplicity adjustments do not account for comparisons between cultivars. If cultivar comparisons are of interest, I’d recommend adjusting intervals using Scheffe’s method, which accounts for all possible contrasts. However, this produces quite wide intervals, and, unsurprisingly, no comparisons are significantly different.

One could also look at estimated cultivar medians disaggregated by time:

The multiplicity adjustments again start to become inflated due to the number of estimates (165 here). I’d probably go with the time-averaged plot as a primary figure. However, it is notable that mite levels peak at different times for different cultivars, and this plot makes that a bit easier to see than the EDA version showing boxplots by time point – certainly worth including at least on a supplementary basis.

(Option 2) AR(1) correlation

If the time correlation is of particular interest, we can modify the model by adding a AR(1) covariance structure within plots. That is:

\[ \log(\text{tssm}_{ijk}) = \mu + \text{cult}_i + \text{blk}_j + \text{wk}_k + (\text{cult}\times\text{wk})_{ik} + \epsilon_{ijk} \qquad \begin{cases} \epsilon \sim N(0, \Sigma) \\ \text{var}(\epsilon_{ijk}) = \sigma^2 \\ \text{cov}(\epsilon_{ijk}, \epsilon_{ijk'}) = \rho^{|w(k) - w(k')|} \\ \Sigma_{mn} = 0 \;\text{otherwise} \end{cases} \]

The parameter \(\rho\) represents the correlation between two measurements taken one week apart, and the covariance model specifies that this correlation decays geometrically with time – \(w(k)\) denotes the numeric week for index \(k\).

This model is fairly straightforward to fit in R using nlme::gls and gives the following ANOVA table:

Code
# format week for nlme
tssm_analysis_ar1 <- tssm_analysis |>
  mutate(week.i = as.integer(week))

# fit repeated measures anova with AR1 covariance
fit.aov.ar1 <- gls(tssm ~ week.f*cultivar + block,
                   corr = corAR1(value = 0.5, form = ~ week.i | id),
                   data = tssm_analysis_ar1)

# anova table (questionable)
anova(fit.aov.ar1) |>
  pander()
  numDF F-value p-value
(Intercept) 1 717.7 3.576e-98
week.f 4 112.2 6.234e-68
cultivar 32 3.605 5.726e-10
block 3 14.18 7.065e-09
week.f:cultivar 128 1.65 8.49e-05

There is some controversy around constructing and interpreting the F statistics in this context since sums of squares are no longer independent under the model. The estimated covariance structure can be summarized as shown below. I’ve included correlations at 2, 3, and 6 week spacing, since those are the unique spacings of sampling dates in the dataset.

Code
# inspect estimated covariance structure
tibble(sigma = fit.aov.ar1$sigma,
       rho = coef(fit.aov.ar1$modelStruct$corStruct, unconstrained = FALSE)) |>
  mutate(cor2wks  = rho^2,
         cor3wks = rho^3,
         cor6wks = rho^6) |>
  pander()
sigma rho cor2wks cor3wks cor6wks
0.7855 0.6133 0.3761 0.2306 0.0532

All the same plots as before can be constructed using estimates from this model, and they are essentially identical.

This approach will produce the same conclusions as option 1, and only differs in that you have an estimate for the time correlation. If the time correlation is not of specific interest, client may wish to use the simpler model (option 1) – considering the time correlation is not especially strong, this is an entirely reasonable choice.

(Option 3) Response curves

A less common but potentially helpful alternative is to estimate response curves as functions of time. This arises from taking the basic RCBD model, treating weeks since first sampling date as a continuous time variate, and adding a cultivar-dependent function of time:

\[ \log(\text{tssm}_{ij}(t)) = \mu + \text{cult}_i + \text{blk}_j + f_i(t) + \epsilon_{ij} \]

Notice I’m still assuming that the block effect is uniform over time – this can be changed if desired. I fit this model using a B-spline representation \(f_i(t) = \sum_j B_j(t)\), which is fairly straightforward to do in R. This gives the ANOVA:

Code
# append splines to data
tssm_analysis_sp <- tssm_analysis |>
  mutate(week.bs = splines::bs(week, degree = 3))

# fit model
fit.aov.sp <- aov(tssm ~ cultivar*week.bs + block, 
                  data = tssm_analysis_sp)

# anova table
summary(fit.aov.sp) |>
  pander()
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
cultivar 32 102.3 3.198 5.216 5.206e-17
week.bs 3 245.3 81.76 133.3 3.245e-64
block 3 46.32 15.44 25.18 3.113e-15
cultivar:week.bs 96 114.5 1.193 1.945 2.19e-06
Residuals 525 321.9 0.6132 NA NA

What’s nice about this model, although a little more complex than the usual ANOVA, is that estimates can be used to reconstruct response curves:

With a little clever effort, one could compute an integral over the time domain to estimate cumulative mite populations by cultivar and use this as a basis for cultivar comparisons.

As an aside, from a statistical perspective the errors in this model are still assumed independent, but per the discussion in Option 2, there is likely some time correlation within plot. As a consequence, the uncertainty bands on the response curves above are likely a bit too narrow. However, considering that in Option 2 the autocorrelation is fairly weak given the time spacing, I wouldn’t be too concerned about this issue. That said, if client wishes to adjust for that, there are options for incorporating a covariance model into the specification.

PM

For the PM response, there is noticeably more zero-inflation. As a result, standard ANOVA is not recommended.

There are three common strategies for dealing with this issue:

  1. Convert data to presence/absence
  2. Filter out zeroes and model the response conditional on mite presence
  3. Fit a mixture model with a component accounting for the excess probability of zero

Since none of the counts are especially high (99.61% of obserations are under 5), there’s not a particularly strong information signal in the numeric values of the nonzero responses, so I think that the presence/absence approach is best here. Here is the distribution of nonzero values:

I’d expect that even if one did model the quantitative response, the scale of variation by cultivar would not be practically meaningful.

After imputing missing values using the same approach as for TSSM and converting the PM values to presence/absence, data look like so:

cultivar block id date week week.f pm pm.f
Albion (LC) 10 1 2025-02-11 0 0 0 absent
Albion (LC) 10 1 2025-03-25 6 6 0 absent
Albion (LC) 10 1 2025-04-14 9 9 0.4 present
Albion (LC) 10 1 2025-04-29 11 11 1.4 present
Albion (LC) 10 1 2025-05-13 13 13 0.4 present
Albion (LC) 10 1 2025-05-28 16 16 0 absent

I’d fit a binomial GLM with the same predictor form as the standard anova model for an RCBD:

\[ \mathbb{I}(\text{pm}_{ij} > 0) \sim \text{bernoulli}(p_{ij}) \quad\text{where}\quad \text{logit}(p_{ij}) = \mu + \text{cultivar}_i + \text{block}_j \] This model is easy to fit in R and yields the following analysis of deviance table:

Code
# fit binomial glm
fit.glm <- glm(pm.f ~ cultivar + block + week.f + week.f:cultivar, 
               family = binomial(), 
               data = pm)

# analysis of deviance table
anova(fit.glm) |> pander()
Analysis of Deviance Table
  Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL NA NA 791 1047 NA
cultivar 32 42.42 759 1004 0.103
block 3 3.598 756 1001 0.3083
week.f 5 178.3 751 822.5 1.21e-36
cultivar:week.f 160 181.5 591 641.1 0.1176

The \(p\)-values correspond to sequential likelihood ratio tests akin to \(F\) tests using type I sums of squares in a standard ANOVA. Note that in the formula specification the terms are entered in the desired order in which the tests are conducted.

Interestingly, only time is significant; there do not appear to be cultivar differences in the probability of predatory mite occurrence. Nonetheless, here are the estimated predatory mite occurrence probabilities over time by cultivar with pointwise (not simultaneous) 80% confidence intervals and ordered from highest average probability (upper left) to lowest (bottom right):

The estimated aggregate probability of predatory mite occurrence over time (averaged over cultivars and blocks) is:

While there may be some weak autocorrelation within plots, I’d keep it simple and avoid attempting to model this directly. I think the effort to correctly specify and fit a GLMM is not worth the potential benefit, and if the TSSM data are any indication, the autocorrelation does not appear strong at the scale of the time spacing of observations.

Trichome data

For the trichome data, samples were collected on one date only – there are no repeated measurements. I would suggest a standard ANOVA here after averaging the subsamples (as was done for the other responses in the repeated measures data).

The variances are somewhat obviously different, but I wouldn’t be too concerned about that – it’s almost expected given the number of cultivars, and I don’t think it’s worth the effort of adding a covariance model to account for this. However, this could be done if desired. The standard RCBD model is:

\[ \text{trichomes}_{ij} = \mu + \text{cult}_i + \text{blk}_j + \epsilon_{ij} \]

This gives the ANOVA:

Code
trichome_raw |>
  group_by(plot, cultivar, date.sampled) |>
  summarize(tri = mean(tri)) |>
  mutate(block = str_trunc(plot, 2, side = 'right', ellipsis = '')) %>%
  aov(tri ~ cultivar + block, data = .) |>
  summary() |>
  pander()
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
cultivar 32 19037 594.9 2.691 0.0001112
block 3 2592 864.1 3.909 0.01112
Residuals 95 21000 221.1 NA NA

The inhomogeneous variance is not terribly concerning. This may affect interval coverage for a few groups, but as noted above, I wouldn’t be too concerned.

The following gives estimated means ordered from largest to smallest:

It would be interesting to see whether there’s any correspondence between the mite and trichome levels by comparing the rankings of estimated means obtained from each model. Client will need to recode the cultivars so that they match across datasets to facilitate this comparison; Spearman rank correlation between the two sets of rankings aligned by cultivar would provide some indication of whether there’s any relationship.

Summary

  1. How to handle missing data?

Impute averages by cultivar and time. This assumes observations are missing completely at random, which seems reasonable here since the researchers just missed a few plots here and there when sampling.

  1. How to handle repeated measures?

While there are a few options for the TSSM data, I’d keep it simple and use a standard ANOVA and add a term for week and a week-cultivar interaction. Assuming independent errors is a bit of an oversimplification, but the correlation does not seem strong on the scale of the time spacing of observations in the data.

  1. How to do cultivar comparisons with so many treatments?

Avoid pairwise comparisons or other contrasts here unless there are any comparisons that are a priori important to examine. Just estimate the means with multiple inference adjustments.

  1. How to analyze trichome data?

Average the subsamples and fit an ANOVA model.