Repeated-Measures ANOVA and the Sphericity Assumption

Misconceptions, Diagnostics, and the Linear Mixed Model Alternative

Author

Timothy Achala

Published

June 22, 2026

Code
library(MASS)
library(tidyverse)
library(nlme)
library(lme4)
library(lmerTest)
library(ez)
library(gt)
library(broom)
library(broom.mixed)
library(emmeans)


set.seed(2024)

1 Background and Motivation

Repeated-measures ANOVA is among the most widely used designs in clinical and behavioral research — tracking the same subjects over time or across conditions. Yet one of its central assumptions, sphericity, is routinely misunderstood, mechanically tested with an inadequate test, and patched with a correction that does not fix the underlying problem.

This document traces the full chain: what sphericity actually requires, why Mauchly’s test is unreliable, what Greenhouse-Geisser actually does (and does not do), and why linear mixed models (LMMs) are the correct modern replacement.


2 What Sphericity Actually Requires

Repeated-measures ANOVA partitions within-subject variance across time points. For the F-ratio to follow an F-distribution under H₀, the variance–covariance matrix of the repeated observations must satisfy a specific structure.

2.1 The Compound Symmetry Special Case

The most restrictive form is compound symmetry (CS): all variances equal and all covariances equal. If the data follow CS, sphericity is automatically satisfied.

\[ \boldsymbol{\Sigma}_{CS} = \sigma^2 \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix} \]

2.2 The Sphericity Condition Proper

Sphericity is weaker than compound symmetry but still demanding. It states that the variances of all pairwise difference scores are equal:

\[ \text{Var}(Y_{ij} - Y_{ik}) = \lambda \quad \text{for all } j \neq k \]

This must hold for every pair of time points. With \(k\) time points there are \(\binom{k}{2}\) pairs. As \(k\) grows, this becomes increasingly unlikely in practice — especially in longitudinal clinical data where measurements close in time tend to correlate more strongly than measurements far apart (an AR(1)-type structure).

Important

Key insight: Most analysts think of sphericity as a variance homogeneity assumption — it is actually a constraint on the entire covariance structure of difference scores. Violation means the F-ratio in standard repeated-measures ANOVA is no longer exact, and inference on within-subject effects is anti-conservative (inflated Type I error).


3 Simulating the Problem

We simulate a three-time-point study (n = 30 subjects) where sphericity is violated — the true covariance structure has an AR(1) pattern with stronger correlation between adjacent time points.

Code
n       <- 30
times   <- 3
rho_ar1 <- 0.8   # AR(1) correlation parameter

# AR(1) covariance matrix (sigma^2 = 4)
sigma2  <- 4
Sigma   <- sigma2 * rho_ar1^abs(outer(1:times, 1:times, "-"))

# True means: a genuine linear time trend
mu <- c(10, 12, 15)

# Simulate multivariate normal subject profiles
Y_mat <- MASS::mvrnorm(n, mu = mu, Sigma = Sigma)
colnames(Y_mat) <- paste0("T", 1:times)

df_wide <- as.data.frame(Y_mat) |>
  mutate(subject = factor(1:n))

df_long <- df_wide |>
  pivot_longer(cols = T1:T3, names_to = "time", values_to = "outcome") |>
  mutate(time = factor(time, levels = c("T1", "T2", "T3")))

# Show first 6 rows
df_long |>
  head(12) |>
  gt() |>
  tab_header(title = "Simulated Repeated-Measures Data (first 12 rows)") |>
  cols_label(subject = "Subject", time = "Time", outcome = "Outcome") |>
  fmt_number(columns = outcome, decimals = 2) |>
  tab_options(table.font.size = 13)
Simulated Repeated-Measures Data (first 12 rows)
Subject Time Outcome
1 T1 7.54
1 T2 9.16
1 T3 14.99
2 T1 9.17
2 T2 10.59
2 T3 14.72
3 T1 9.94
3 T2 11.88
3 T3 15.80
4 T1 11.46
4 T2 12.30
4 T3 14.41
Code
# Spaghetti plot + mean trend
df_long |>
  ggplot(aes(x = time, y = outcome)) +
  geom_line(aes(group = subject), colour = "steelblue", alpha = 0.3, linewidth = 0.4) +
  stat_summary(aes(group = 1), fun = mean, geom = "line",
               colour = "firebrick", linewidth = 1.2) +
  stat_summary(fun = mean, geom = "point",
               colour = "firebrick", size = 3) +
  labs(
    title    = "Individual Profiles and Group Mean Trend",
    subtitle = "Red = group mean; blue = individual trajectories",
    x = "Time Point", y = "Outcome"
  ) +
  theme_bw(base_size = 13)


4 Mauchly’s Test: The Right Idea, the Wrong Tool

Mauchly’s test evaluates whether the sample covariance matrix of the orthonormal contrasts (derived from the repeated measures) is proportional to the identity matrix — the algebraic equivalent of sphericity.

Code
# Fit RM-ANOVA using ez package
rm_aov <- ez::ezANOVA(
  data     = df_long,
  dv       = outcome,
  wid      = subject,
  within   = time,
  detailed = TRUE,
  return_aov = TRUE
)

rm_aov$`Mauchly's Test for Sphericity` |>
  as.data.frame() |>
  gt() |>
  tab_header(
    title    = "Mauchly's Test for Sphericity",
    subtitle = "H₀: variance of all pairwise difference scores are equal"
  ) |>
  fmt_number(columns = where(is.numeric), decimals = 4) |>
  tab_options(table.font.size = 13)
Mauchly's Test for Sphericity
H₀: variance of all pairwise difference scores are equal
Effect W p p<.05
time 0.8429 0.0913

4.1 The Sample-Size Pathology

Mauchly’s test suffers the exact same pathology as Levene’s test for homoscedasticity:

n (small) n (large)
Low power — fails to detect real violations High power — flags trivial, inconsequential violations
You most need the correction here You reject sphericity even when ε is close to 1
Warning

Critical point: In small samples, Mauchly’s test is underpowered precisely when sphericity violations most distort the F-test. In large samples, it rejects for violations so minor that no correction is needed. It should not be used as a gatekeeping rule.

Demonstrate this with a simulation across sample sizes:

Code
# Power of Mauchly's test as a function of n
# Under the same AR(1) violation used above

mauchly_power <- function(n_sim, n_rep = 500) {
  rejected <- replicate(n_rep, {
    Y  <- MASS::mvrnorm(n_sim, mu = mu, Sigma = Sigma)
    # Mauchly's test via base R
    # Convert to matrix of orthonormal contrasts
    k  <- ncol(Y)
    C  <- contr.sum(k)[, -k, drop = FALSE]  # (k x k-1) contrasts
    Z  <- Y %*% C
    W_test <- tryCatch(mauchly.test(lm(Z ~ 1))$p.value, error = function(e) NA)
    if (is.na(W_test)) return(FALSE)
    W_test < 0.05
  })
  mean(rejected, na.rm = TRUE)
}

sample_sizes <- c(10, 20, 30, 50, 100, 200)
power_vals   <- sapply(sample_sizes, mauchly_power)

tibble(n = sample_sizes, power = power_vals) |>
  ggplot(aes(x = n, y = power)) +
  geom_line(colour = "steelblue", linewidth = 1.2) +
  geom_point(size = 3, colour = "steelblue") +
  geom_hline(yintercept = 0.05, linetype = "dashed", colour = "grey50") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  labs(
    title    = "Empirical Power of Mauchly's Test vs. Sample Size",
    subtitle = "True violation: AR(1) structure, ρ = 0.8 | Dashed = nominal α = 0.05",
    x = "Sample size (n)", y = "Rejection rate"
  ) +
  theme_bw(base_size = 13)

Interpretation: With n = 10, Mauchly’s test detects the violation only a fraction of the time — the analyst proceeds as if sphericity holds when it does not. With n ≥ 100, the test reliably detects the violation, but by then it may be clinically inconsequential (ε near 1). The test provides the least guidance precisely where it is most needed.


5 The Greenhouse-Geisser Correction: A Patch, Not a Solution

When Mauchly’s test rejects (or is ignored), the standard remedy is to multiply both numerator and denominator degrees of freedom by ε̂ (epsilon-hat), the Greenhouse-Geisser estimate of departure from sphericity.

\[ F^*_{GG} \sim F(\hat{\varepsilon}(k-1),\ \hat{\varepsilon}(k-1)(n-1)) \]

\(\hat{\varepsilon}\) ranges from \(\frac{1}{k-1}\) (maximum violation) to 1 (sphericity holds). The Huynh-Feldt correction uses a less conservative ε̂ that can exceed 1 (truncated to 1).

Code
rm_aov$`Sphericity Corrections` |>
  as.data.frame() |>
  gt() |>
  tab_header(
    title    = "Greenhouse-Geisser and Huynh-Feldt Corrections",
    subtitle = "GGe = G-G epsilon; HFe = H-F epsilon"
  ) |>
  fmt_number(columns = where(is.numeric), decimals = 4) |>
  tab_options(table.font.size = 13)
Greenhouse-Geisser and Huynh-Feldt Corrections
GGe = G-G epsilon; HFe = H-F epsilon
Effect GGe p[GG] p[GG]<.05 HFe p[HF] p[HF]<.05
time 0.8642 0.0000 * 0.9140 0.0000 *
Code
rm_aov$ANOVA |>
  as.data.frame() |>
  filter(Effect == "time") |>
  gt() |>
  tab_header(
    title    = "Repeated-Measures ANOVA: Within-Subject Effect of Time",
    subtitle = "Uncorrected (assumes sphericity)"
  ) |>
  fmt_number(columns = where(is.numeric), decimals = 4) |>
  tab_options(table.font.size = 13)
Repeated-Measures ANOVA: Within-Subject Effect of Time
Uncorrected (assumes sphericity)
Effect DFn DFd SSn SSd F p p<.05 ges
time 2.0000 58.0000 388.8905 51.0604 220.8724 0.0000 * 0.4752

5.1 What GG Correction Does — and Doesn’t Do

The GG correction adjusts the reference F-distribution to approximately account for the inflation in the F-statistic caused by sphericity violation. It is a conservative post-hoc adjustment, not a re-specification of the model.

Critically, it:

  • Does not model the actual covariance structure of the data
  • Does not handle missing data or unbalanced time points
  • Does not allow subject-level random effects beyond a global intercept
  • Assumes all subjects are observed at the same times
  • Is only available for the overall within-subject F-test, not for contrasts or predictions

The correction is best understood as a historical workaround from an era before accessible software for mixed models. It is no longer necessary.


6 The Correct Modern Approach: Linear Mixed Models

A linear mixed model (LMM) directly models the correlation structure among repeated observations. It subsumes repeated-measures ANOVA as the special case where the covariance structure is compound symmetry.

6.1 Why LMMs Are Strictly Better

Feature RM-ANOVA LMM
Handles missing data (MCAR/MAR) No (listwise deletion) Yes (full likelihood)
Unbalanced time points No Yes
Flexible covariance structures No Yes (CS, AR(1), UN, …)
Subject-level covariates Limited Yes
Sphericity required Yes No
Time-varying covariates No Yes

6.2 Covariance Structures

Code
# 1. Compound symmetry (= RM-ANOVA with sphericity)
lmm_cs <- lme(outcome ~ time,
               random   = ~ 1 | subject,
               correlation = corCompSymm(form = ~ 1 | subject),
               data = df_long,
               method = "REML")

# 2. AR(1) — correct generating structure
lmm_ar1 <- lme(outcome ~ time,
                random      = ~ 1 | subject,
                correlation = corAR1(form = ~ 1 | subject),
                data = df_long,
                method = "REML")

# 3. Unstructured — most flexible, most parameters
lmm_un <- lme(outcome ~ time,
               random      = ~ 1 | subject,
               correlation = corSymm(form = ~ 1 | subject),
               weights     = varIdent(form = ~ 1 | time),
               data = df_long,
               method = "REML")

6.3 Model Comparison by AIC/BIC

AIC and BIC penalise model complexity differently; here we use them to select the covariance structure.

Code
aic_tbl <- AIC(lmm_cs, lmm_ar1, lmm_un) |>
  rownames_to_column("Model") |>
  left_join(
    BIC(lmm_cs, lmm_ar1, lmm_un) |> rownames_to_column("Model") |> select(Model, BIC),
    by = "Model"
  ) |>
  mutate(
    Model = recode(Model,
      lmm_cs  = "Compound Symmetry (CS)",
      lmm_ar1 = "AR(1)",
      lmm_un  = "Unstructured (UN)"
    ),
    `ΔAIC` = AIC - min(AIC),
    `ΔBIC` = BIC - min(BIC)
  ) |>
  arrange(AIC)

aic_tbl |>
  gt() |>
  tab_header(
    title    = "Model Comparison: Covariance Structures",
    subtitle = "Lower AIC/BIC = better fit; ΔAIC > 2 suggests meaningful difference"
  ) |>
  fmt_number(columns = where(is.numeric), decimals = 2) |>
  tab_style(
    style = cell_fill(color = "#d4edda"),
    locations = cells_body(rows = `ΔAIC` == 0)
  ) |>
  tab_options(table.font.size = 13)
Model Comparison: Covariance Structures
Lower AIC/BIC = better fit; ΔAIC > 2 suggests meaningful difference
Model df AIC BIC ΔAIC ΔBIC
AR(1) 6.00 332.96 347.75 0.00 0.00
Compound Symmetry (CS) 6.00 336.20 351.00 3.24 3.24
Unstructured (UN) 10.00 338.22 362.88 5.26 15.13

Interpretation: The AR(1) model has the lowest AIC/BIC (highlighted), correctly identifying the data-generating structure. The unstructured model, while flexible, is penalised for its additional parameters. Compound symmetry — the RM-ANOVA assumption — fits worse because the true correlation decays with time rather than remaining constant.

6.4 Fixed Effects from the Best-Fitting LMM (AR1)

Code
tidy(lmm_ar1, effects = "fixed", conf.int = TRUE) |>
  select(term, estimate, std.error, statistic, p.value, conf.low, conf.high) |>
  gt() |>
  tab_header(
    title    = "Fixed Effects: AR(1) Linear Mixed Model",
    subtitle = "Reference category = T1; estimates are mean differences vs. T1"
  ) |>
  cols_label(
    term      = "Term",
    estimate  = "Estimate",
    std.error = "SE",
    statistic = "t",
    p.value   = "p",
    conf.low  = "95% CI (low)",
    conf.high = "95% CI (high)"
  ) |>
  fmt_number(columns = where(is.numeric), decimals = 3) |>
  tab_options(table.font.size = 13)
Fixed Effects: AR(1) Linear Mixed Model
Reference category = T1; estimates are mean differences vs. T1
Term Estimate SE t p 95% CI (low) 95% CI (high)
(Intercept) 10.392 0.403 25.792 0.000 9.586 11.199
timeT2 1.949 0.223 8.743 0.000 1.503 2.395
timeT3 5.048 0.280 17.998 0.000 4.487 5.610

Interpretation:

  • Intercept: The estimated mean outcome at T1 is consistent with the true simulation value of 10.
  • timeT2: The mean increased from T1 to T2 by approximately 2 units (true value = 2). The confidence interval excludes zero, indicating a statistically significant increase.
  • timeT3: The mean at T3 is approximately 5 units above T1 (true value = 5), also significant.
  • Standard errors correctly account for the within-subject AR(1) correlation; under RM-ANOVA with unaddressed sphericity violation, these SEs would be understated.

6.5 Estimated Marginal Means and Contrasts

Code
emm <- emmeans(lmm_ar1, ~ time)

emm |>
  as.data.frame() |>
  gt() |>
  tab_header(
    title    = "Estimated Marginal Means by Time Point",
    subtitle = "AR(1) LMM; 95% confidence intervals"
  ) |>
  fmt_number(columns = where(is.numeric), decimals = 2) |>
  tab_options(table.font.size = 13)
Estimated Marginal Means by Time Point
AR(1) LMM; 95% confidence intervals
time emmean SE df lower.CL upper.CL
T1 10.39 0.40 29.00 9.57 11.22
T2 12.34 0.40 29.00 11.52 13.17
T3 15.44 0.40 29.00 14.62 16.26
Code
contrast(emm, method = "pairwise", adjust = "bonferroni") |>
  as.data.frame() |>
  gt() |>
  tab_header(
    title    = "Pairwise Contrasts Between Time Points",
    subtitle = "Bonferroni-adjusted; AR(1) LMM"
  ) |>
  fmt_number(columns = where(is.numeric), decimals = 3) |>
  tab_options(table.font.size = 13)
Pairwise Contrasts Between Time Points
Bonferroni-adjusted; AR(1) LMM
contrast estimate SE df t.ratio p.value
T1 - T2 −1.949 0.223 58.000 −8.743 0.000
T1 - T3 −5.048 0.280 58.000 −17.998 0.000
T2 - T3 −3.099 0.223 58.000 −13.903 0.000

Interpretation: All three pairwise contrasts are statistically significant after Bonferroni adjustment, consistent with the monotone increasing trend built into the simulation. T3 vs. T1 shows the largest mean difference (~5 units), followed by T3 vs. T2 (~3 units) and T2 vs. T1 (~2 units).

6.6 Visualising Model Predictions vs. Observed Means

Code
pred_df <- as.data.frame(emm) |>
  mutate(source = "LMM AR(1)")

obs_df <- df_long |>
  group_by(time) |>
  summarise(emmean = mean(outcome),
            lower.CL = mean(outcome) - 1.96 * sd(outcome) / sqrt(n()),
            upper.CL = mean(outcome) + 1.96 * sd(outcome) / sqrt(n()),
            .groups = "drop") |>
  mutate(source = "Observed mean")

bind_rows(pred_df |> select(time, emmean, lower.CL, upper.CL, source),
          obs_df) |>
  ggplot(aes(x = time, y = emmean, colour = source, group = source)) +
  geom_point(size = 3, position = position_dodge(0.2)) +
  geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL),
                width = 0.15, position = position_dodge(0.2), linewidth = 0.8) +
  geom_line(position = position_dodge(0.2), linewidth = 0.9) +
  scale_colour_manual(values = c("LMM AR(1)" = "firebrick", "Observed mean" = "steelblue")) +
  labs(
    title    = "LMM Estimated Marginal Means vs. Observed Means",
    subtitle = "Error bars = 95% CI",
    x = "Time", y = "Outcome", colour = NULL
  ) +
  theme_bw(base_size = 13) +
  theme(legend.position = "bottom")


7 Comparing RM-ANOVA vs. LMM Side by Side

Code
# RM-ANOVA time effect
rm_f    <- rm_aov$ANOVA |> filter(Effect == "time") |> pull(F)
rm_p    <- rm_aov$ANOVA |> filter(Effect == "time") |> pull(p)
rm_p_gg <- rm_aov$`Sphericity Corrections` |> filter(Effect == "time") |> pull(`p[GG]`)

# LMM overall time effect via anova
lmm_anova <- anova(lmm_ar1)

comparison <- tibble(
  Method              = c("RM-ANOVA (uncorrected)", "RM-ANOVA (GG corrected)", "LMM – CS", "LMM – AR(1)"),
  `F / test stat`     = c(round(rm_f, 3), round(rm_f, 3), 
                           round(anova(lmm_cs)["time","F-value"], 3),
                           round(anova(lmm_ar1)["time","F-value"], 3)),
  `p-value`           = c(round(rm_p, 4), round(rm_p_gg, 4),
                           round(anova(lmm_cs)["time","p-value"], 4),
                           round(anova(lmm_ar1)["time","p-value"], 4)),
  `Sphericity needed` = c("Yes", "Patch only", "No (CS assumed)", "No"),
  `Handles missing`   = c("No", "No", "Yes", "Yes")
)

comparison |>
  gt() |>
  tab_header(
    title    = "RM-ANOVA vs. LMM: Time Effect Results",
    subtitle = "Same data, different models; AR(1) is the true generating structure"
  ) |>
  tab_style(
    style = cell_fill(color = "#d4edda"),
    locations = cells_body(rows = Method == "LMM – AR(1)")
  ) |>
  tab_options(table.font.size = 13)
RM-ANOVA vs. LMM: Time Effect Results
Same data, different models; AR(1) is the true generating structure
Method F / test stat p-value Sphericity needed Handles missing
RM-ANOVA (uncorrected) 220.872 0 Yes No
RM-ANOVA (GG corrected) 220.872 0 Patch only No
LMM – CS 220.872 0 No (CS assumed) Yes
LMM – AR(1) 167.468 0 No Yes

Interpretation: All methods correctly detect a significant time effect given the strong true effect. The key distinction is not the p-value here but the quality of inference: the AR(1) LMM uses a correctly specified covariance model, produces accurate standard errors, and supports follow-up contrasts and predictions without the sphericity assumption. The GG-corrected RM-ANOVA narrows the degrees of freedom conservatively but still assumes a mis-specified error structure.


8 Practical Decision Guide

When to Use RM-ANOVA vs. LMM
Situation Recommendation
Balanced design, CS covariance plausible, complete data RM-ANOVA acceptable; verify with AIC vs. LMM-CS
Any doubt about covariance structure LMM with AIC-guided covariance selection
Missing outcome data LMM (REML/ML under MAR); RM-ANOVA invalid
Unequally spaced time points LMM required; RM-ANOVA not defined
Time-varying covariates LMM required
Interest in individual trajectories LMM with random slopes

9 Key Takeaways

Note

1. Sphericity is stricter than most analysts realise. It requires equal variance for all pairwise difference scores — not just equal variances at each time point. It fails whenever correlation decays with time (AR-type structures), which is common in longitudinal clinical data.

2. Mauchly’s test has the wrong power profile. Underpowered when n is small (you most need protection), and hypersensitive when n is large (flags inconsequential deviations). Using it as a binary decision rule is not sound practice.

3. Greenhouse-Geisser is a patch, not a cure. It deflates the degrees of freedom to approximate the correct reference distribution but leaves the model misspecified. It does not handle missing data, unbalanced designs, or time-varying covariates.

4. LMMs are the principled solution. Repeated-measures ANOVA is a special case of an LMM with a compound symmetry covariance structure. LMMs allow you to explicitly model and select the correct covariance structure, handle missing data under MAR, accommodate unequal spacing, and support richer fixed- and random-effect specifications.

5. Use AIC/BIC to select the covariance structure. Compare CS, AR(1), and unstructured (and others as appropriate) using information criteria rather than relying on Mauchly’s test.


10 Software and References

  • R packages: nlme (Pinheiro & Bates), lme4/lmerTest, ez, emmeans, broom.mixed
  • Huynh, H. & Feldt, L.S. (1976). Estimation of the Box correction for degrees of freedom from sample data in the randomized block and split-plot designs. Journal of Educational Statistics, 1, 69–82.
  • Greenhouse, S.W. & Geisser, S. (1959). On methods in the analysis of profile data. Psychometrika, 24, 95–112.
  • Mauchly, J.W. (1940). Significance test for sphericity of a normal n-variate distribution. Annals of Mathematical Statistics, 11, 204–209.
  • Pinheiro, J.C. & Bates, D.M. (2000). Mixed-Effects Models in S and S-PLUS. Springer.
  • Verbeke, G. & Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data. Springer.