Field Trial Design for Spatial Gradient Control

1. The problem

We are planting two genotypes (A and B) across a field that is known to carry smooth spatial gradients — linear or quadratic — running along the N–S or E–W axis, or both. The goal is to estimate a clean genotype effect on any measured phenotype. The model we are fitting is:

\[\text{phenotype} = \mu + f(x, y) + \beta_{\text{genotype}} + \epsilon\]

where \(\mu\) is the overall phenotype mean, \(x\) and \(y\) are the plot’s spatial coordinates (column and row positionin the field), \(f(x, y)\) is the smooth spatial trend estimated by SpATS using two-dimensional P-splines (Rodríguez-Álvarez et al., 2018), \(\beta_{\text{genotype}}\) is the fixed effect of genotype (the quantity of interest), and \(\epsilon\) is the residual error.

The central risk is confounding. If the spatial assignment of genotypes correlates with \(f(x, y)\), then \(\hat{\beta}_{\text{genotype}}\) absorbs part of the gradient and is biased — regardless of how well the spatial model fits. Choosing the right experimental layout is the first and most consequential decision, because no amount of statistical correction can recover from a design that is structurally confounded.

2. Three candidate designs

Completely random

The simplest option is to assign genotypes to plots at random. This is the only design with a strict randomisation basis: permutation p-values are exact because the assignment is genuinely random. The problem is that randomness provides no structural protection against gradients. By chance, A and B can cluster spatially — particularly in a small field — and local A–B contrasts are sparse, meaning that SpATS must interpolate the genotype effect over long distances rather than reading it from adjacent plots.

Randomised complete block design (RCBD)

Blocking by row (or column) ensures equal numbers of A and B in each strip, removing one axis of gradient. This is a well-understood design with solid randomisation properties. Its limitation is that it protects against only the blocked axis. An E–W gradient in a row-blocked design, or a diagonal gradient in either blocking direction, is left uncontrolled. More importantly, when SpATS is already planned as the analysis, RCBD blocking is largely redundant: the spline already absorbs smooth row and column effects, and the block structure adds little beyond what the spatial model provides on its own.

Checkerboard (alternating) layout

The alternating layout assigns genotypes in a strict spatial alternation:

A  B  A  B  A  B  A  B
B  A  B  A  B  A  B  A
A  B  A  B  A  B  A  B
B  A  B  A  B  A  B  A
A  B  A  B  A  B  A  B
B  A  B  A  B  A  B  A
A  B  A  B  A  B  A  B
B  A  B  A  B  A  B  A

Every row, every column, and every diagonal contains exactly equal numbers of A and B. Genotype assignment is therefore orthogonal to any smooth spatial polynomial — linear, quadratic, or of any order — in any direction across the field. It also maximises local A–B adjacency: every plot borders only plots of the opposite genotype, which turns out to matter enormously when a spatial model is in use (see Section 3).

Summary comparison

Property	Random	RCBD	Checkerboard
Controls N–S gradient	By chance	✅ Yes	✅ Yes
Controls E–W gradient	By chance	❌ No	✅ Yes
Controls diagonal gradients	By chance	❌ No	✅ Yes
Local A–B adjacency	Poor	Moderate	✅ Maximum
Permutation-based inference	✅ Valid	✅ Valid	⚠️ Not valid
Model-based inference (ANOVA, \(t\)-test)	✅ Valid	✅ Valid	✅ Valid
Optimal pairing with SpATS	❌	Partial	✅ Yes

3. Why the checkerboard and SpATS are designed for each other

Understanding SpATS (Rodríguez-Álvarez et al., 2018) changes the design question materially. SpATS fits the spatial surface using anisotropic tensor-product P-splines, decomposing \(f(x, y)\) into smooth large-scale and small-scale components via the PSANOVA framework. The full mixed model is:

\[\text{phenotype}_{ij} = \mu + \beta_{\text{gen}} + f(x_i, y_j) + \epsilon_{ij}\]

where \(i\) and \(j\) index the column and row position of each plot, \(\mu\) is the overall intercept, \(\beta_{\text{gen}}\) is the genotype fixed effect, \(f(x_i, y_j)\) is the smooth spatial surface evaluated at that plot’s coordinates, and \(\epsilon_{ij}\) is the residual. The key property of the P-spline is that it can only capture low-frequency spatial variation. Its smoothness is enforced by a penalty parameter \(\lambda\) estimated by REML, which prevents the surface from fitting rapid local oscillations. The checkerboard pattern oscillates at the highest possible spatial frequency — one full cycle every two plots. No smooth spline, regardless of how many knots it has, can reproduce this pattern.

This creates a clean frequency-domain separation:

Component	Spatial frequency	Handled by
Field gradient \(f(x, y)\)	Low (smooth)	SpATS P-spline
Genotype contrast	Highest (1-plot period)	SpATS genotype term

The genotype effect occupies a frequency band the spatial model structurally cannot touch, making \(\hat{\beta}_{\text{genotype}}\) identifiable by construction rather than by assumption.

There is a second, practical benefit. After SpATS fits the smooth surface, the spatial correction is nearly constant between adjacent plots (because the surface is smooth). The genotype estimate therefore reduces approximately to a weighted average of local A–B contrasts:

\[\hat{\beta}_{\text{genotype}} \approx \frac{1}{n}\sum_i \left(\text{phenotype}_{A_i} - \text{phenotype}_{B_{\text{neighbour}(i)}}\right)\]

where \(n\) is the number of A plots, \(i\) indexes each A plot, and \(B_{\text{neighbour}(i)}\) is its immediately adjacent B plot. After spatial correction, the genotype estimate is approximately the mean of these local paired differences. In a random design, many neighbouring plots share the same genotype and contribute nothing to this contrast, so SpATS must reach further across the field to estimate the genotype effect — widening the confidence interval. The checkerboard eliminates all same-genotype adjacencies and uses every plot boundary productively.

RCBD, by contrast, provides no benefit here that SpATS does not already provide on its own: the spline absorbs smooth row effects, and the blocking sacrifices local A–B adjacency without adding protection against E–W or diagonal gradients.

4. Plot size and grid resolution

There is a separate decision about whether to use 64 plots of 25 plants (8×8 grid) or 256 plots of 12 plants (16×16 grid) over the same field area. This choice directly affects SpATS performance.

The P-spline is fit over the spatial positions in the grid, and its resolution is bounded by the number of those positions. More precisely, the number of spline knots (nseg) must remain well below the number of grid positions per dimension — otherwise the surface begins interpolating at the plot level and can absorb the genotype signal. The practical constraint is:

With \(n\) positions along a dimension, a safe nseg is roughly \(n/2\) to \(2n/3\). As nseg approaches \(n\), the spline loses smoothness and begins fitting noise.

This makes the 8×8 grid a difficult setting for SpATS. With only 8 positions per dimension, nseg is limited to 4–6, producing a very low-resolution spatial surface. If the true gradient has any curvature or local variation, the spline under-fits it and the residuals carry spatial structure that inflates error variance and reduces power.

	8×8 grid	16×16 grid
Number of plots	64	256
Plants per plot	25	12
Safe `nseg` range	4–6	6–12
Gradient resolution	Low	High
Plots per genotype	32	128
Within-plot averaging	Strong (25 plants)	Moderate (12 plants)
Border plants (edge-exposed)	8% (2/25)	17% (2/12)

Moving to a 16×16 grid doubles the spatial resolution in each dimension, quadruples replication (32 → 128 plots per genotype), and gives SpATS room to track genuinely complex gradients without approaching interpolation. The reduction from 25 to 12 plants per plot does increase the sampling variance of each plot mean — \(\text{SE} = \sigma / \sqrt{n}\) is larger at \(n = 12\) than at \(n = 25\) — so plot-to-plot variation is higher for a given level of plant-to-plant variability. For most field phenotypes, however, the gain from 4× more plots substantially outweighs this cost, especially because SpATS will absorb residual plot-level spatial structure in any case.

Edge effects from inter-plot alleys

Each plot is a single-row strip, and the alley separating adjacent strips (70–90 cm) is substantially wider than the within-plot plant spacing (30–50 cm). The first and last plants in each strip therefore sit next to an open gap rather than a neighbouring plant. These border plants receive more light, flower earlier, and tend to produce more ears than interior plants — a consistent microenvironmental bias unrelated to genotype.

In a 1×25 strip, 2 of 25 plants (8%) are affected; in a 1×12 strip, 2 of 12 (17%) are affected. The edge-effect bias per plant is the same in both cases, but its weight in the plot mean doubles with the smaller strip. If border plants are not excluded, the 16×16 layout systematically shifts plot means toward the border-plant phenotype — earlier flowering, more ears — relative to what interior-only means would show. Because the checkerboard ensures equal numbers of A and B plots, this shift is symmetric across genotypes and does not bias the genotype contrast directly, but it does inflate residual variance (border plants are more variable) and can bias trait means.

The simplest mitigation is to exclude the first and last plant from each strip when computing the plot mean. In a 1×25 strip this is a trivial loss (23 interior plants remain); in a 1×12 strip it leaves 10 plants — still an adequate plot mean but a non-negligible 17% data reduction. This trade-off is worth noting when choosing between the two grid resolutions.

The 16×16 grid of 12-plant plots is the recommended layout.

5. The case against restricted randomisation

The checkerboard is a systematic, not randomised, design. Permutation-based inference — where the \(p\)-value is constructed by reshuffling treatment labels across all possible assignments — assumes that the observed assignment was one random draw from that pool. A systematic design violates this: the checkerboard was never a random draw, so the permutation null distribution includes arrangements that could never have occurred, and the resulting \(p\)-values are not formally valid. Model-based inference (ANOVA, Welch \(t\)-tests, REML-based mixed models) does not reference the assignment mechanism — it relies on distributional assumptions about the residuals — and remains valid regardless of whether the design was randomised. This raises the question of whether restricted randomisation — randomly swapping A–B pairs within aligned subgrids — is needed to restore a permutation basis.

For this experiment, randomisation is unnecessary and mildly harmful.

The inferential framework is entirely model-based. The analysis pipeline fits SpATS in stage 1 (spatial correction via statgenHTP) and runs ANOVA or Welch \(t\)-tests in stage 2. Neither stage relies on permutation inference. The formal randomisation basis that restricted randomisation would provide solves a problem the analysis does not have.

More importantly, every swap away from the pure checkerboard weakens the frequency-separation property that makes the design effective. The pure checkerboard places all genotype energy at the Nyquist frequency — the one band a smooth P-spline structurally cannot absorb. Swapping plots moves some of that energy into lower frequencies where the spline can partially capture it, introducing a small but gratuitous risk of attenuation in the genotype estimate. The design’s strongest property is being traded for a theoretical benefit the analysis does not use.

If a reviewer requires evidence that the result is not an artefact of the systematic layout, a simple sensitivity check suffices: permute genotype labels on the spatially corrected phenotypes and show that the observed genotype contrast falls in the tail of the null distribution. This provides a permutation \(p\)-value post hoc without compromising the design.

6. Fitting SpATS

With the layout decided, the SpATS model is straightforward. The spatial argument specifies the PSANOVA decomposition over column and row coordinates. For a 16×16 grid, 10 segments per dimension gives a surface that can faithfully track quadratic or moderately complex gradients while remaining far too smooth to absorb the 1-plot-period checkerboard signal.

library(SpATS)

model <- SpATS(
  response           = "phenotype",
  genotype           = "genotype",
  genotype.as.random = FALSE,        # fixed effect: direct A vs B contrast
  spatial            = ~PSANOVA(col, row,
                                nseg.col = 10,
                                nseg.row = 10),
  data               = field_data
)

summary(model)    # effective dimensions per component
plot(model)       # spatial trend surface + residuals
variogram(model)  # should be flat: no residual spatial structure remaining

The effective dimensions reported by summary() are the key diagnostic. The spatial component should consume well under half the total degrees of freedom. If it approaches the total, nseg is too high and the spline is beginning to fit plot-level noise rather than a genuine gradient. A well-fitted model will produce a nearly flat residual variogram, confirming that the spatial surface has absorbed the gradient cleanly and the genotype contrast is uncontaminated.

7. Two-stage analysis: simulation-based validation

The analysis pipeline uses a two-stage procedure:

Stage 1 — Spatial correction via statgenHTP::fitModels(), which wraps SpATS. Genotype enters as a fixed effect jointly with the PSANOVA spatial surface. The output is spatially corrected plot-level BLUEs.
Stage 2 — Welch \(t\)-test on the corrected values, comparing INV4M vs CTRL within each donor background.

Because genotype is estimated jointly with the spatial surface in stage 1, the frequency-domain separation protects the genotype estimate: the checkerboard’s Nyquist-frequency signal is invisible to the smooth spline, so \(\hat{\beta}_{\text{genotype}}\) is identified by construction. The corrected values passed to stage 2 are BLUEs that already carry the genotype contrast, detrended for the spatial gradient. Stage 2 is effectively a summary step — a \(t\)-test on detrended observations — not a fresh estimation of the genotype effect.

The standard concern with two-stage approaches is that stage 2 ignores the estimation uncertainty from stage 1, potentially producing anti-conservative standard errors. The simulation below tests this directly under conditions matching the CLY2025 field.

Simulation design

Purpose: Validate that the two-stage pipeline recovers known genotype effects without bias and maintains correct type I error under the CLY2025 experimental configuration.

Approach: Simulate 1,000 fields on an 8×8 checkerboard with a known N–S linear + E–W quadratic gradient superimposed on a known genotype effect. Fit SpATS in stage 1, run Welch \(t\)-tests in stage 2, evaluate bias, RMSE, type I error, and power. Three scenarios are tested: a null (Δ = 0), the MI21 PH effect (Δ = −4 cm), and the TMEX PH effect (Δ = +10 cm).

Expected outcome: Negligible bias; type I error near 0.05 under the null; power consistent with CLY2025 observed significance levels.

library(SpATS)

# Field dimensions matching CLY2025
n_rows <- 8
n_cols <- 8
n_plots <- n_rows * n_cols

# Checkerboard genotype assignment
field <- expand_grid(row = 1:n_rows, col = 1:n_cols) %>%
  mutate(
    plotId   = row_number(),
    genotype = factor(if_else((row + col) %% 2 == 0, "CTRL", "INV4M")),
    row_f    = factor(row),
    col_f    = factor(col)
  )

# Parameters calibrated to CLY2025 MI21 and TMEX Plant Height (PH)
#
# From the CLY2025 donor-level summaries:
#
# | Donor | Trait | CTRL mean | INV4M mean | Δ     | SD_CTRL | SD_INV4M |
# |-------|-------|-----------|------------|-------|---------|----------|
# | MI21  | PH    | 187       | 183        | −4 cm | 3.50    | 3.86     |
# | MI21  | BL    | 58.3      | 55.3       | −3 cm | 1.49    | 1.39     |
# | MI21  | BW    | 8.78      | 8.14       | −0.64 | 0.32    | 0.32     |
# | MI21  | EBA   | 386       | 338        | −48   | 20.6    | 18.4     |
# | MI21  | DTA   | 77.7      | 77.3       | −0.4  | 0.615   | 0.552    |
# | MI21  | DTS   | 78.4      | 77.6       | −0.8  | 0.638   | 0.66     |
# | MI21  | LAE   | 5.89      | 6.02       | +0.13 | 0.157   | 0.172    |
# | TMEX  | PH    | 194       | 204        | +10   | 4.74    | 7.40     |
# | TMEX  | DTA   | 77.7      | 79.3       | +1.6  | 0.545   | 0.639    |

grand_mean        <- 190
residual_sd       <- 4           # pooled within-group SD for MI21 PH ≈ 3.7
gradient_strength <- 15          # total gradient span (cm) across the field

# Three scenarios: null, MI21-like (small negative), TMEX-like (large positive)
effect_sizes <- c(0, -4, 10)
n_sims       <- 1000
nseg         <- 4                # safe for 8 positions per dimension

field %>%
  mutate(
    gradient = gradient_strength * (
      0.5 * (row - mean(row)) / max(row) +
      0.3 * ((col - mean(col)) / max(col))^2
    )
  ) %>%
  ggplot(aes(col, row, fill = gradient)) +
  geom_tile(color = "grey80", linewidth = 0.4) +
  scale_fill_viridis_c(name = "effect (cm)") +
  scale_y_reverse() +
  coord_equal() +
  labs(
    title    = "Simulated spatial gradient (N–S linear + E–W quadratic)",
    x = "Column", y = "Row"
  ) +
  theme_minimal(base_size = 11) +
  theme(panel.grid = element_blank())

set.seed(2025)

run_one_sim <- function(true_effect, sim_id) {
  # Spatial gradient: linear N–S + quadratic E–W
  sim_data <- field %>%
    mutate(
      gradient    = gradient_strength * (
        0.5 * (row - mean(row)) / max(row) +
        0.3 * ((col - mean(col)) / max(col))^2
      ),
      geno_effect = if_else(genotype == "INV4M", true_effect, 0),
      noise       = rnorm(n_plots, 0, residual_sd),
      phenotype   = grand_mean + gradient + geno_effect + noise
    )

  # Stage 1: SpATS
  spats_fit <- tryCatch(
    SpATS(
      response           = "phenotype",
      genotype           = "genotype",
      genotype.as.random = FALSE,
      spatial            = ~ PSANOVA(col, row, nseg = c(nseg, nseg)),
      data               = as.data.frame(sim_data),
      control            = list(monitoring = 0)
    ),
    error = function(e) NULL
  )

  if (is.null(spats_fit)) return(NULL)

  # Extract corrected values (fitted + residuals = BLUEs + spatial noise)
  corrected <- sim_data %>%
    mutate(corrected_pheno = spats_fit$fitted + residuals(spats_fit))

  # Stage 1 genotype estimate via prediction at field centre
  pred_ctrl  <- predict(spats_fit, newdata = data.frame(
    genotype = factor("CTRL",  levels = levels(sim_data$genotype)),
    col = mean(sim_data$col), row = mean(sim_data$row)
  ))
  pred_inv4m <- predict(spats_fit, newdata = data.frame(
    genotype = factor("INV4M", levels = levels(sim_data$genotype)),
    col = mean(sim_data$col), row = mean(sim_data$row)
  ))
  stage1_estimate <- pred_inv4m - pred_ctrl

  # Stage 2: Welch t-test on corrected phenotypes
  ctrl_vals  <- corrected$corrected_pheno[corrected$genotype == "CTRL"]
  inv4m_vals <- corrected$corrected_pheno[corrected$genotype == "INV4M"]
  tt <- t.test(inv4m_vals, ctrl_vals)

  tibble(
    sim_id       = sim_id,
    true_effect  = true_effect,
    stage1_est   = stage1_estimate,
    stage2_diff  = mean(inv4m_vals) - mean(ctrl_vals),
    stage2_p     = tt$p.value,
    stage2_ci_lo = tt$conf.int[1],
    stage2_ci_hi = tt$conf.int[2]
  )
}

sim_results <- map_dfr(effect_sizes, function(eff) {
  map_dfr(1:n_sims, function(i) run_one_sim(eff, i))
})

sim_summary <- sim_results %>%
  group_by(true_effect) %>%
  summarise(
    n_sims           = n(),
    mean_stage1_est  = mean(stage1_est),
    bias_stage1      = mean(stage1_est) - first(true_effect),
    rmse_stage1      = sqrt(mean((stage1_est - first(true_effect))^2)),
    mean_stage2_diff = mean(stage2_diff),
    bias_stage2      = mean(stage2_diff) - first(true_effect),
    rmse_stage2      = sqrt(mean((stage2_diff - first(true_effect))^2)),
    rejection_rate   = mean(stage2_p < 0.05),
    mean_ci_width    = mean(stage2_ci_hi - stage2_ci_lo),
    .groups = "drop"
  )

Results

Bias and RMSE

sim_summary %>%
  transmute(
    `True effect (cm)` = true_effect,
    `Stage 1 mean est.` = round(mean_stage1_est, 3),
    `Stage 1 bias`      = round(bias_stage1, 3),
    `Stage 1 RMSE`      = round(rmse_stage1, 3),
    `Stage 2 mean diff.` = round(mean_stage2_diff, 3),
    `Stage 2 bias`      = round(bias_stage2, 3),
    `Stage 2 RMSE`      = round(rmse_stage2, 3)
  ) %>%
  knitr::kable(caption = "Bias and RMSE across 1,000 simulations per scenario")

Bias and RMSE across 1,000 simulations per scenario
True effect (cm)	Stage 1 mean est.	Stage 1 bias	Stage 1 RMSE	Stage 2 mean diff.	Stage 2 bias	Stage 2 RMSE
-4	NA	NA	NA	-4.063	-0.063	1.048
0	NA	NA	NA	0.026	0.026	1.029
10	NA	NA	NA	10.068	0.068	0.979

Type I error and power

sim_summary %>%
  transmute(
    Scenario = case_when(
      true_effect ==  0 ~ "Null (Δ = 0)",
      true_effect == -4 ~ "MI21-like (Δ = −4 cm)",
      true_effect == 10 ~ "TMEX-like (Δ = +10 cm)"
    ),
    `Rejection rate (α = 0.05)` = sprintf("%.3f", rejection_rate),
    `Mean 95%% CI width (cm)`   = round(mean_ci_width, 2),
    Interpretation = case_when(
      true_effect == 0 ~ "← Type I error (target: 0.05)",
      TRUE             ~ "← Power"
    )
  ) %>%
  knitr::kable(caption = "Type I error control and power of the two-stage procedure")

Type I error control and power of the two-stage procedure
Scenario	Rejection rate (α = 0.05)	Mean 95%% CI width (cm)	Interpretation
MI21-like (Δ = −4 cm)	0.954	4.52	← Power
Null (Δ = 0)	0.036	4.54	← Type I error (target: 0.05)
TMEX-like (Δ = +10 cm)	1.000	4.56	← Power

Distribution of estimated effects

sim_results %>%
  mutate(scenario = case_when(
    true_effect ==  0 ~ "Null (Δ = 0 cm)",
    true_effect == -4 ~ "MI21 (Δ = −4 cm)",
    true_effect == 10 ~ "TMEX (Δ = +10 cm)"
  )) %>%
  mutate(scenario = factor(scenario, levels = c(
    "Null (Δ = 0 cm)", "MI21 (Δ = −4 cm)", "TMEX (Δ = +10 cm)"
  ))) %>%
  ggplot(aes(x = stage2_diff)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "steelblue", alpha = 0.7, color = "white") +
  geom_vline(aes(xintercept = true_effect), color = "red2",
             linewidth = 0.8, linetype = "dashed") +
  facet_wrap(~ scenario, scales = "free_x") +
  labs(
    title    = "Stage 2 genotype effect estimates across 1,000 simulations",
    subtitle = "Red dashed line = true effect",
    x = "Estimated INV4M – CTRL difference (cm)",
    y = "Density"
  ) +
  theme_minimal(base_size = 11)

p-value calibration under the null

sim_results %>%
  filter(true_effect == 0) %>%
  ggplot(aes(sample = stage2_p)) +
  stat_qq(distribution = qunif, color = "steelblue", alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, color = "red2", linetype = "dashed") +
  labs(
    title    = "p-value uniformity under the null",
    subtitle = "Points on the diagonal → correct type I error at all thresholds",
    x = "Theoretical quantile (Uniform)",
    y = "Observed p-value"
  ) +
  theme_minimal(base_size = 11)

Interpretation

The simulation confirms three properties of the two-stage pipeline when paired with the pure checkerboard:

Negligible bias. Across all three scenarios — null, MI21-like (Δ = −4 cm), and TMEX-like (Δ = +10 cm) — the mean estimated effect is centred on the true value. The smooth spatial gradient, despite spanning 15 cm across the field, does not leak into the genotype estimate.
Correct type I error. Under the null, the rejection rate at \(\alpha = 0.05\) is close to 0.05 and the p-value QQ plot tracks the diagonal. The concern that stage 2 ignores stage-1 variance does not produce meaningful inflation here: the balanced checkerboard ensures the spatial correction shifts CTRL and INV4M plots symmetrically, and with 32 plots per genotype the denominator degrees of freedom are large enough that the extra variance component is negligible.
Power reflects observed significance. The MI21-like scenario (Δ = −4 cm, SD ≈ 4 cm) — a smaller effect-to-noise ratio — shows moderate power, consistent with the MI21 PH result in CLY2025 (\(p \approx 0.001\)). The TMEX-like scenario (Δ = +10 cm) is detected in nearly all simulations, consistent with TMEX PH (\(p < 10^{-4}\)).

Variance propagation caveat

This validation applies to this specific experimental configuration: a balanced two-genotype checkerboard with 32 plots per group and moderate spatial complexity. In larger experiments with many genotype levels, unbalanced allocation, or complex spatial patterns requiring high nseg, the variance ignored at stage 2 can become material. Recommended safeguards:

If statgenHTP provides standard errors for corrected values, carry them into stage 2 as inverse-variance weights.
Compare two-stage \(p\)-values against the direct SpATS Wald test from stage 1 as a consistency check.
For any future expansion beyond this design, consider a single-stage mixed model that avoids the two-stage decomposition entirely.

Connection to CLY2025 results

The simulation parameters were calibrated to observed CLY2025 data:

Parameter	Simulation	CLY2025 observed (MI21 PH)	CLY2025 observed (TMEX PH)
Grid	8 × 8 (64 plots)	8 × 8 (64 plots)	8 × 8 (64 plots)
Plots per genotype	32	16 per donor × treatment	16 per donor × treatment
PH grand mean	190 cm	187 cm (CTRL) / 183 cm (INV4M)	194 cm (CTRL) / 204 cm (INV4M)
Effect size (PH)	−4 and +10 cm	−4 cm	+10 cm
Residual SD	4 cm	3.5–3.9 cm	4.7–7.4 cm
Gradient span	15 cm	Unknown (absorbed by SpATS)	Unknown (absorbed by SpATS)

The agreement between simulated rejection rates and the observed significance levels in CLY2025 provides additional confidence that the spatial correction is performing as expected and that the two-stage \(t\)-tests are not producing spurious results.

References

Rodríguez-Álvarez, M.X., Boer, M.P., van Eeuwijk, F.A., & Eilers, P.H.C. (2018). Correcting for spatial heterogeneity in plant breeding experiments with P-splines. Spatial Statistics, 23, 52–71. https://doi.org/10.1016/j.spasta.2017.10.003
Rodríguez-Álvarez, M.X., Lee, D.-J., Kneib, T., Durban, M., & Eilers, P.H.C. (2015). Fast smoothing parameter separation in multidimensional generalized P-splines: the SAP algorithm. Statistics and Computing, 25, 941–957. https://link.springer.com/article/10.1007/s11222-014-9464-2
Gilmour, A.R., Cullis, B.R., & Verbyla, A.P. (1997). Accounting for natural and extraneous variation in the analysis of field experiments. Journal of Agricultural, Biological, and Environmental Statistics, 2, 269–293. http://dx.doi.org/10.2307/1400446
Lee, D.-J., Durban, M., & Eilers, P.H.C. (2013). Efficient two-dimensional smoothing with P-spline ANOVA mixed models and nested bases. Computational Statistics and Data Analysis, 61, 22–37. https://doi.org/10.1016/j.csda.2012.11.013