Code
library(MASS)
library(tidyverse)
library(nlme)
library(lme4)
library(lmerTest)
library(ez)
library(gt)
library(broom)
library(broom.mixed)
library(emmeans)
set.seed(2024)Misconceptions, Diagnostics, and the Linear Mixed Model Alternative
library(MASS)
library(tidyverse)
library(nlme)
library(lme4)
library(lmerTest)
library(ez)
library(gt)
library(broom)
library(broom.mixed)
library(emmeans)
set.seed(2024)Repeated-measures ANOVA is among the most widely used designs in clinical and behavioral research — tracking the same subjects over time or across conditions. Yet one of its central assumptions, sphericity, is routinely misunderstood, mechanically tested with an inadequate test, and patched with a correction that does not fix the underlying problem.
This document traces the full chain: what sphericity actually requires, why Mauchly’s test is unreliable, what Greenhouse-Geisser actually does (and does not do), and why linear mixed models (LMMs) are the correct modern replacement.
Repeated-measures ANOVA partitions within-subject variance across time points. For the F-ratio to follow an F-distribution under H₀, the variance–covariance matrix of the repeated observations must satisfy a specific structure.
The most restrictive form is compound symmetry (CS): all variances equal and all covariances equal. If the data follow CS, sphericity is automatically satisfied.
\[ \boldsymbol{\Sigma}_{CS} = \sigma^2 \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix} \]
Sphericity is weaker than compound symmetry but still demanding. It states that the variances of all pairwise difference scores are equal:
\[ \text{Var}(Y_{ij} - Y_{ik}) = \lambda \quad \text{for all } j \neq k \]
This must hold for every pair of time points. With \(k\) time points there are \(\binom{k}{2}\) pairs. As \(k\) grows, this becomes increasingly unlikely in practice — especially in longitudinal clinical data where measurements close in time tend to correlate more strongly than measurements far apart (an AR(1)-type structure).
Key insight: Most analysts think of sphericity as a variance homogeneity assumption — it is actually a constraint on the entire covariance structure of difference scores. Violation means the F-ratio in standard repeated-measures ANOVA is no longer exact, and inference on within-subject effects is anti-conservative (inflated Type I error).
We simulate a three-time-point study (n = 30 subjects) where sphericity is violated — the true covariance structure has an AR(1) pattern with stronger correlation between adjacent time points.
n <- 30
times <- 3
rho_ar1 <- 0.8 # AR(1) correlation parameter
# AR(1) covariance matrix (sigma^2 = 4)
sigma2 <- 4
Sigma <- sigma2 * rho_ar1^abs(outer(1:times, 1:times, "-"))
# True means: a genuine linear time trend
mu <- c(10, 12, 15)
# Simulate multivariate normal subject profiles
Y_mat <- MASS::mvrnorm(n, mu = mu, Sigma = Sigma)
colnames(Y_mat) <- paste0("T", 1:times)
df_wide <- as.data.frame(Y_mat) |>
mutate(subject = factor(1:n))
df_long <- df_wide |>
pivot_longer(cols = T1:T3, names_to = "time", values_to = "outcome") |>
mutate(time = factor(time, levels = c("T1", "T2", "T3")))
# Show first 6 rows
df_long |>
head(12) |>
gt() |>
tab_header(title = "Simulated Repeated-Measures Data (first 12 rows)") |>
cols_label(subject = "Subject", time = "Time", outcome = "Outcome") |>
fmt_number(columns = outcome, decimals = 2) |>
tab_options(table.font.size = 13)| Simulated Repeated-Measures Data (first 12 rows) | ||
| Subject | Time | Outcome |
|---|---|---|
| 1 | T1 | 7.54 |
| 1 | T2 | 9.16 |
| 1 | T3 | 14.99 |
| 2 | T1 | 9.17 |
| 2 | T2 | 10.59 |
| 2 | T3 | 14.72 |
| 3 | T1 | 9.94 |
| 3 | T2 | 11.88 |
| 3 | T3 | 15.80 |
| 4 | T1 | 11.46 |
| 4 | T2 | 12.30 |
| 4 | T3 | 14.41 |
# Spaghetti plot + mean trend
df_long |>
ggplot(aes(x = time, y = outcome)) +
geom_line(aes(group = subject), colour = "steelblue", alpha = 0.3, linewidth = 0.4) +
stat_summary(aes(group = 1), fun = mean, geom = "line",
colour = "firebrick", linewidth = 1.2) +
stat_summary(fun = mean, geom = "point",
colour = "firebrick", size = 3) +
labs(
title = "Individual Profiles and Group Mean Trend",
subtitle = "Red = group mean; blue = individual trajectories",
x = "Time Point", y = "Outcome"
) +
theme_bw(base_size = 13)Mauchly’s test evaluates whether the sample covariance matrix of the orthonormal contrasts (derived from the repeated measures) is proportional to the identity matrix — the algebraic equivalent of sphericity.
# Fit RM-ANOVA using ez package
rm_aov <- ez::ezANOVA(
data = df_long,
dv = outcome,
wid = subject,
within = time,
detailed = TRUE,
return_aov = TRUE
)
rm_aov$`Mauchly's Test for Sphericity` |>
as.data.frame() |>
gt() |>
tab_header(
title = "Mauchly's Test for Sphericity",
subtitle = "H₀: variance of all pairwise difference scores are equal"
) |>
fmt_number(columns = where(is.numeric), decimals = 4) |>
tab_options(table.font.size = 13)| Mauchly's Test for Sphericity | |||
| H₀: variance of all pairwise difference scores are equal | |||
| Effect | W | p | p<.05 |
|---|---|---|---|
| time | 0.8429 | 0.0913 | |
Mauchly’s test suffers the exact same pathology as Levene’s test for homoscedasticity:
| n (small) | n (large) |
|---|---|
| Low power — fails to detect real violations | High power — flags trivial, inconsequential violations |
| You most need the correction here | You reject sphericity even when ε is close to 1 |
Critical point: In small samples, Mauchly’s test is underpowered precisely when sphericity violations most distort the F-test. In large samples, it rejects for violations so minor that no correction is needed. It should not be used as a gatekeeping rule.
Demonstrate this with a simulation across sample sizes:
# Power of Mauchly's test as a function of n
# Under the same AR(1) violation used above
mauchly_power <- function(n_sim, n_rep = 500) {
rejected <- replicate(n_rep, {
Y <- MASS::mvrnorm(n_sim, mu = mu, Sigma = Sigma)
# Mauchly's test via base R
# Convert to matrix of orthonormal contrasts
k <- ncol(Y)
C <- contr.sum(k)[, -k, drop = FALSE] # (k x k-1) contrasts
Z <- Y %*% C
W_test <- tryCatch(mauchly.test(lm(Z ~ 1))$p.value, error = function(e) NA)
if (is.na(W_test)) return(FALSE)
W_test < 0.05
})
mean(rejected, na.rm = TRUE)
}
sample_sizes <- c(10, 20, 30, 50, 100, 200)
power_vals <- sapply(sample_sizes, mauchly_power)
tibble(n = sample_sizes, power = power_vals) |>
ggplot(aes(x = n, y = power)) +
geom_line(colour = "steelblue", linewidth = 1.2) +
geom_point(size = 3, colour = "steelblue") +
geom_hline(yintercept = 0.05, linetype = "dashed", colour = "grey50") +
scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
labs(
title = "Empirical Power of Mauchly's Test vs. Sample Size",
subtitle = "True violation: AR(1) structure, ρ = 0.8 | Dashed = nominal α = 0.05",
x = "Sample size (n)", y = "Rejection rate"
) +
theme_bw(base_size = 13)Interpretation: With n = 10, Mauchly’s test detects the violation only a fraction of the time — the analyst proceeds as if sphericity holds when it does not. With n ≥ 100, the test reliably detects the violation, but by then it may be clinically inconsequential (ε near 1). The test provides the least guidance precisely where it is most needed.
When Mauchly’s test rejects (or is ignored), the standard remedy is to multiply both numerator and denominator degrees of freedom by ε̂ (epsilon-hat), the Greenhouse-Geisser estimate of departure from sphericity.
\[ F^*_{GG} \sim F(\hat{\varepsilon}(k-1),\ \hat{\varepsilon}(k-1)(n-1)) \]
\(\hat{\varepsilon}\) ranges from \(\frac{1}{k-1}\) (maximum violation) to 1 (sphericity holds). The Huynh-Feldt correction uses a less conservative ε̂ that can exceed 1 (truncated to 1).
rm_aov$`Sphericity Corrections` |>
as.data.frame() |>
gt() |>
tab_header(
title = "Greenhouse-Geisser and Huynh-Feldt Corrections",
subtitle = "GGe = G-G epsilon; HFe = H-F epsilon"
) |>
fmt_number(columns = where(is.numeric), decimals = 4) |>
tab_options(table.font.size = 13)| Greenhouse-Geisser and Huynh-Feldt Corrections | ||||||
| GGe = G-G epsilon; HFe = H-F epsilon | ||||||
| Effect | GGe | p[GG] | p[GG]<.05 | HFe | p[HF] | p[HF]<.05 |
|---|---|---|---|---|---|---|
| time | 0.8642 | 0.0000 | * | 0.9140 | 0.0000 | * |
rm_aov$ANOVA |>
as.data.frame() |>
filter(Effect == "time") |>
gt() |>
tab_header(
title = "Repeated-Measures ANOVA: Within-Subject Effect of Time",
subtitle = "Uncorrected (assumes sphericity)"
) |>
fmt_number(columns = where(is.numeric), decimals = 4) |>
tab_options(table.font.size = 13)| Repeated-Measures ANOVA: Within-Subject Effect of Time | ||||||||
| Uncorrected (assumes sphericity) | ||||||||
| Effect | DFn | DFd | SSn | SSd | F | p | p<.05 | ges |
|---|---|---|---|---|---|---|---|---|
| time | 2.0000 | 58.0000 | 388.8905 | 51.0604 | 220.8724 | 0.0000 | * | 0.4752 |
The GG correction adjusts the reference F-distribution to approximately account for the inflation in the F-statistic caused by sphericity violation. It is a conservative post-hoc adjustment, not a re-specification of the model.
Critically, it:
The correction is best understood as a historical workaround from an era before accessible software for mixed models. It is no longer necessary.
A linear mixed model (LMM) directly models the correlation structure among repeated observations. It subsumes repeated-measures ANOVA as the special case where the covariance structure is compound symmetry.
| Feature | RM-ANOVA | LMM |
|---|---|---|
| Handles missing data (MCAR/MAR) | No (listwise deletion) | Yes (full likelihood) |
| Unbalanced time points | No | Yes |
| Flexible covariance structures | No | Yes (CS, AR(1), UN, …) |
| Subject-level covariates | Limited | Yes |
| Sphericity required | Yes | No |
| Time-varying covariates | No | Yes |
# 1. Compound symmetry (= RM-ANOVA with sphericity)
lmm_cs <- lme(outcome ~ time,
random = ~ 1 | subject,
correlation = corCompSymm(form = ~ 1 | subject),
data = df_long,
method = "REML")
# 2. AR(1) — correct generating structure
lmm_ar1 <- lme(outcome ~ time,
random = ~ 1 | subject,
correlation = corAR1(form = ~ 1 | subject),
data = df_long,
method = "REML")
# 3. Unstructured — most flexible, most parameters
lmm_un <- lme(outcome ~ time,
random = ~ 1 | subject,
correlation = corSymm(form = ~ 1 | subject),
weights = varIdent(form = ~ 1 | time),
data = df_long,
method = "REML")AIC and BIC penalise model complexity differently; here we use them to select the covariance structure.
aic_tbl <- AIC(lmm_cs, lmm_ar1, lmm_un) |>
rownames_to_column("Model") |>
left_join(
BIC(lmm_cs, lmm_ar1, lmm_un) |> rownames_to_column("Model") |> select(Model, BIC),
by = "Model"
) |>
mutate(
Model = recode(Model,
lmm_cs = "Compound Symmetry (CS)",
lmm_ar1 = "AR(1)",
lmm_un = "Unstructured (UN)"
),
`ΔAIC` = AIC - min(AIC),
`ΔBIC` = BIC - min(BIC)
) |>
arrange(AIC)
aic_tbl |>
gt() |>
tab_header(
title = "Model Comparison: Covariance Structures",
subtitle = "Lower AIC/BIC = better fit; ΔAIC > 2 suggests meaningful difference"
) |>
fmt_number(columns = where(is.numeric), decimals = 2) |>
tab_style(
style = cell_fill(color = "#d4edda"),
locations = cells_body(rows = `ΔAIC` == 0)
) |>
tab_options(table.font.size = 13)| Model Comparison: Covariance Structures | |||||
| Lower AIC/BIC = better fit; ΔAIC > 2 suggests meaningful difference | |||||
| Model | df | AIC | BIC | ΔAIC | ΔBIC |
|---|---|---|---|---|---|
| AR(1) | 6.00 | 332.96 | 347.75 | 0.00 | 0.00 |
| Compound Symmetry (CS) | 6.00 | 336.20 | 351.00 | 3.24 | 3.24 |
| Unstructured (UN) | 10.00 | 338.22 | 362.88 | 5.26 | 15.13 |
Interpretation: The AR(1) model has the lowest AIC/BIC (highlighted), correctly identifying the data-generating structure. The unstructured model, while flexible, is penalised for its additional parameters. Compound symmetry — the RM-ANOVA assumption — fits worse because the true correlation decays with time rather than remaining constant.
tidy(lmm_ar1, effects = "fixed", conf.int = TRUE) |>
select(term, estimate, std.error, statistic, p.value, conf.low, conf.high) |>
gt() |>
tab_header(
title = "Fixed Effects: AR(1) Linear Mixed Model",
subtitle = "Reference category = T1; estimates are mean differences vs. T1"
) |>
cols_label(
term = "Term",
estimate = "Estimate",
std.error = "SE",
statistic = "t",
p.value = "p",
conf.low = "95% CI (low)",
conf.high = "95% CI (high)"
) |>
fmt_number(columns = where(is.numeric), decimals = 3) |>
tab_options(table.font.size = 13)| Fixed Effects: AR(1) Linear Mixed Model | ||||||
| Reference category = T1; estimates are mean differences vs. T1 | ||||||
| Term | Estimate | SE | t | p | 95% CI (low) | 95% CI (high) |
|---|---|---|---|---|---|---|
| (Intercept) | 10.392 | 0.403 | 25.792 | 0.000 | 9.586 | 11.199 |
| timeT2 | 1.949 | 0.223 | 8.743 | 0.000 | 1.503 | 2.395 |
| timeT3 | 5.048 | 0.280 | 17.998 | 0.000 | 4.487 | 5.610 |
Interpretation:
emm <- emmeans(lmm_ar1, ~ time)
emm |>
as.data.frame() |>
gt() |>
tab_header(
title = "Estimated Marginal Means by Time Point",
subtitle = "AR(1) LMM; 95% confidence intervals"
) |>
fmt_number(columns = where(is.numeric), decimals = 2) |>
tab_options(table.font.size = 13)| Estimated Marginal Means by Time Point | |||||
| AR(1) LMM; 95% confidence intervals | |||||
| time | emmean | SE | df | lower.CL | upper.CL |
|---|---|---|---|---|---|
| T1 | 10.39 | 0.40 | 29.00 | 9.57 | 11.22 |
| T2 | 12.34 | 0.40 | 29.00 | 11.52 | 13.17 |
| T3 | 15.44 | 0.40 | 29.00 | 14.62 | 16.26 |
contrast(emm, method = "pairwise", adjust = "bonferroni") |>
as.data.frame() |>
gt() |>
tab_header(
title = "Pairwise Contrasts Between Time Points",
subtitle = "Bonferroni-adjusted; AR(1) LMM"
) |>
fmt_number(columns = where(is.numeric), decimals = 3) |>
tab_options(table.font.size = 13)| Pairwise Contrasts Between Time Points | |||||
| Bonferroni-adjusted; AR(1) LMM | |||||
| contrast | estimate | SE | df | t.ratio | p.value |
|---|---|---|---|---|---|
| T1 - T2 | −1.949 | 0.223 | 58.000 | −8.743 | 0.000 |
| T1 - T3 | −5.048 | 0.280 | 58.000 | −17.998 | 0.000 |
| T2 - T3 | −3.099 | 0.223 | 58.000 | −13.903 | 0.000 |
Interpretation: All three pairwise contrasts are statistically significant after Bonferroni adjustment, consistent with the monotone increasing trend built into the simulation. T3 vs. T1 shows the largest mean difference (~5 units), followed by T3 vs. T2 (~3 units) and T2 vs. T1 (~2 units).
pred_df <- as.data.frame(emm) |>
mutate(source = "LMM AR(1)")
obs_df <- df_long |>
group_by(time) |>
summarise(emmean = mean(outcome),
lower.CL = mean(outcome) - 1.96 * sd(outcome) / sqrt(n()),
upper.CL = mean(outcome) + 1.96 * sd(outcome) / sqrt(n()),
.groups = "drop") |>
mutate(source = "Observed mean")
bind_rows(pred_df |> select(time, emmean, lower.CL, upper.CL, source),
obs_df) |>
ggplot(aes(x = time, y = emmean, colour = source, group = source)) +
geom_point(size = 3, position = position_dodge(0.2)) +
geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL),
width = 0.15, position = position_dodge(0.2), linewidth = 0.8) +
geom_line(position = position_dodge(0.2), linewidth = 0.9) +
scale_colour_manual(values = c("LMM AR(1)" = "firebrick", "Observed mean" = "steelblue")) +
labs(
title = "LMM Estimated Marginal Means vs. Observed Means",
subtitle = "Error bars = 95% CI",
x = "Time", y = "Outcome", colour = NULL
) +
theme_bw(base_size = 13) +
theme(legend.position = "bottom")# RM-ANOVA time effect
rm_f <- rm_aov$ANOVA |> filter(Effect == "time") |> pull(F)
rm_p <- rm_aov$ANOVA |> filter(Effect == "time") |> pull(p)
rm_p_gg <- rm_aov$`Sphericity Corrections` |> filter(Effect == "time") |> pull(`p[GG]`)
# LMM overall time effect via anova
lmm_anova <- anova(lmm_ar1)
comparison <- tibble(
Method = c("RM-ANOVA (uncorrected)", "RM-ANOVA (GG corrected)", "LMM – CS", "LMM – AR(1)"),
`F / test stat` = c(round(rm_f, 3), round(rm_f, 3),
round(anova(lmm_cs)["time","F-value"], 3),
round(anova(lmm_ar1)["time","F-value"], 3)),
`p-value` = c(round(rm_p, 4), round(rm_p_gg, 4),
round(anova(lmm_cs)["time","p-value"], 4),
round(anova(lmm_ar1)["time","p-value"], 4)),
`Sphericity needed` = c("Yes", "Patch only", "No (CS assumed)", "No"),
`Handles missing` = c("No", "No", "Yes", "Yes")
)
comparison |>
gt() |>
tab_header(
title = "RM-ANOVA vs. LMM: Time Effect Results",
subtitle = "Same data, different models; AR(1) is the true generating structure"
) |>
tab_style(
style = cell_fill(color = "#d4edda"),
locations = cells_body(rows = Method == "LMM – AR(1)")
) |>
tab_options(table.font.size = 13)| RM-ANOVA vs. LMM: Time Effect Results | ||||
| Same data, different models; AR(1) is the true generating structure | ||||
| Method | F / test stat | p-value | Sphericity needed | Handles missing |
|---|---|---|---|---|
| RM-ANOVA (uncorrected) | 220.872 | 0 | Yes | No |
| RM-ANOVA (GG corrected) | 220.872 | 0 | Patch only | No |
| LMM – CS | 220.872 | 0 | No (CS assumed) | Yes |
| LMM – AR(1) | 167.468 | 0 | No | Yes |
Interpretation: All methods correctly detect a significant time effect given the strong true effect. The key distinction is not the p-value here but the quality of inference: the AR(1) LMM uses a correctly specified covariance model, produces accurate standard errors, and supports follow-up contrasts and predictions without the sphericity assumption. The GG-corrected RM-ANOVA narrows the degrees of freedom conservatively but still assumes a mis-specified error structure.
| When to Use RM-ANOVA vs. LMM | |
| Situation | Recommendation |
|---|---|
| Balanced design, CS covariance plausible, complete data | RM-ANOVA acceptable; verify with AIC vs. LMM-CS |
| Any doubt about covariance structure | LMM with AIC-guided covariance selection |
| Missing outcome data | LMM (REML/ML under MAR); RM-ANOVA invalid |
| Unequally spaced time points | LMM required; RM-ANOVA not defined |
| Time-varying covariates | LMM required |
| Interest in individual trajectories | LMM with random slopes |
nlme (Pinheiro & Bates), lme4/lmerTest, ez, emmeans, broom.mixed