Repeated-Measures ANOVA and the Sphericity Assumption

Misconceptions, Diagnostics, and the Linear Mixed Model Alternative

Author

Timothy Achala

Published

June 22, 2026

Code

library(MASS)
library(tidyverse)
library(nlme)
library(lme4)
library(lmerTest)
library(ez)
library(gt)
library(broom)
library(broom.mixed)
library(emmeans)


set.seed(2024)

1 Background and Motivation

Repeated-measures ANOVA is among the most widely used designs in clinical and behavioral research — tracking the same subjects over time or across conditions. Yet one of its central assumptions, sphericity, is routinely misunderstood, mechanically tested with an inadequate test, and patched with a correction that does not fix the underlying problem.

This document traces the full chain: what sphericity actually requires, why Mauchly’s test is unreliable, what Greenhouse-Geisser actually does (and does not do), and why linear mixed models (LMMs) are the correct modern replacement.

2 What Sphericity Actually Requires

Repeated-measures ANOVA partitions within-subject variance across time points. For the F-ratio to follow an F-distribution under H₀, the variance–covariance matrix of the repeated observations must satisfy a specific structure.

2.1 The Compound Symmetry Special Case

The most restrictive form is compound symmetry (CS): all variances equal and all covariances equal. If the data follow CS, sphericity is automatically satisfied.

\[ \boldsymbol{\Sigma}_{CS} = \sigma^2 \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix} \]

2.2 The Sphericity Condition Proper

Sphericity is weaker than compound symmetry but still demanding. It states that the variances of all pairwise difference scores are equal:

\[ \text{Var}(Y_{ij} - Y_{ik}) = \lambda \quad \text{for all } j \neq k \]

This must hold for every pair of time points. With \(k\) time points there are \(\binom{k}{2}\) pairs. As \(k\) grows, this becomes increasingly unlikely in practice — especially in longitudinal clinical data where measurements close in time tend to correlate more strongly than measurements far apart (an AR(1)-type structure).

Important

Key insight: Most analysts think of sphericity as a variance homogeneity assumption — it is actually a constraint on the entire covariance structure of difference scores. Violation means the F-ratio in standard repeated-measures ANOVA is no longer exact, and inference on within-subject effects is anti-conservative (inflated Type I error).

3 Simulating the Problem

We simulate a three-time-point study (n = 30 subjects) where sphericity is violated — the true covariance structure has an AR(1) pattern with stronger correlation between adjacent time points.

Code

n       <- 30
times   <- 3
rho_ar1 <- 0.8   # AR(1) correlation parameter

# AR(1) covariance matrix (sigma^2 = 4)
sigma2  <- 4
Sigma   <- sigma2 * rho_ar1^abs(outer(1:times, 1:times, "-"))

# True means: a genuine linear time trend
mu <- c(10, 12, 15)

# Simulate multivariate normal subject profiles
Y_mat <- MASS::mvrnorm(n, mu = mu, Sigma = Sigma)
colnames(Y_mat) <- paste0("T", 1:times)

df_wide <- as.data.frame(Y_mat) |>
  mutate(subject = factor(1:n))

df_long <- df_wide |>
  pivot_longer(cols = T1:T3, names_to = "time", values_to = "outcome") |>
  mutate(time = factor(time, levels = c("T1", "T2", "T3")))

# Show first 6 rows
df_long |>
  head(12) |>
  gt() |>
  tab_header(title = "Simulated Repeated-Measures Data (first 12 rows)") |>
  cols_label(subject = "Subject", time = "Time", outcome = "Outcome") |>
  fmt_number(columns = outcome, decimals = 2) |>
  tab_options(table.font.size = 13)

Subject	Time	Outcome
Simulated Repeated-Measures Data (first 12 rows)
1	T1	7.54
1	T2	9.16
1	T3	14.99
2	T1	9.17
2	T2	10.59
2	T3	14.72
3	T1	9.94
3	T2	11.88
3	T3	15.80
4	T1	11.46
4	T2	12.30
4	T3	14.41

Code

# Spaghetti plot + mean trend
df_long |>
  ggplot(aes(x = time, y = outcome)) +
  geom_line(aes(group = subject), colour = "steelblue", alpha = 0.3, linewidth = 0.4) +
  stat_summary(aes(group = 1), fun = mean, geom = "line",
               colour = "firebrick", linewidth = 1.2) +
  stat_summary(fun = mean, geom = "point",
               colour = "firebrick", size = 3) +
  labs(
    title    = "Individual Profiles and Group Mean Trend",
    subtitle = "Red = group mean; blue = individual trajectories",
    x = "Time Point", y = "Outcome"
  ) +
  theme_bw(base_size = 13)

4 Mauchly’s Test: The Right Idea, the Wrong Tool

Mauchly’s test evaluates whether the sample covariance matrix of the orthonormal contrasts (derived from the repeated measures) is proportional to the identity matrix — the algebraic equivalent of sphericity.

Code

# Fit RM-ANOVA using ez package
rm_aov <- ez::ezANOVA(
  data     = df_long,
  dv       = outcome,
  wid      = subject,
  within   = time,
  detailed = TRUE,
  return_aov = TRUE
)

rm_aov$`Mauchly's Test for Sphericity` |>
  as.data.frame() |>
  gt() |>
  tab_header(
    title    = "Mauchly's Test for Sphericity",
    subtitle = "H₀: variance of all pairwise difference scores are equal"
  ) |>
  fmt_number(columns = where(is.numeric), decimals = 4) |>
  tab_options(table.font.size = 13)

Effect	W	p
Mauchly's Test for Sphericity
H₀: variance of all pairwise difference scores are equal
time	0.8429	0.0913

4.1 The Sample-Size Pathology

Mauchly’s test suffers the exact same pathology as Levene’s test for homoscedasticity:

n (small)	n (large)
Low power — fails to detect real violations	High power — flags trivial, inconsequential violations
You most need the correction here	You reject sphericity even when ε is close to 1

Warning

Critical point: In small samples, Mauchly’s test is underpowered precisely when sphericity violations most distort the F-test. In large samples, it rejects for violations so minor that no correction is needed. It should not be used as a gatekeeping rule.

Demonstrate this with a simulation across sample sizes:

Code

# Power of Mauchly's test as a function of n
# Under the same AR(1) violation used above

mauchly_power <- function(n_sim, n_rep = 500) {
  rejected <- replicate(n_rep, {
    Y  <- MASS::mvrnorm(n_sim, mu = mu, Sigma = Sigma)
    # Mauchly's test via base R
    # Convert to matrix of orthonormal contrasts
    k  <- ncol(Y)
    C  <- contr.sum(k)[, -k, drop = FALSE]  # (k x k-1) contrasts
    Z  <- Y %*% C
    W_test <- tryCatch(mauchly.test(lm(Z ~ 1))$p.value, error = function(e) NA)
    if (is.na(W_test)) return(FALSE)
    W_test < 0.05
  })
  mean(rejected, na.rm = TRUE)
}

sample_sizes <- c(10, 20, 30, 50, 100, 200)
power_vals   <- sapply(sample_sizes, mauchly_power)

tibble(n = sample_sizes, power = power_vals) |>
  ggplot(aes(x = n, y = power)) +
  geom_line(colour = "steelblue", linewidth = 1.2) +
  geom_point(size = 3, colour = "steelblue") +
  geom_hline(yintercept = 0.05, linetype = "dashed", colour = "grey50") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  labs(
    title    = "Empirical Power of Mauchly's Test vs. Sample Size",
    subtitle = "True violation: AR(1) structure, ρ = 0.8 | Dashed = nominal α = 0.05",
    x = "Sample size (n)", y = "Rejection rate"
  ) +
  theme_bw(base_size = 13)

Interpretation: With n = 10, Mauchly’s test detects the violation only a fraction of the time — the analyst proceeds as if sphericity holds when it does not. With n ≥ 100, the test reliably detects the violation, but by then it may be clinically inconsequential (ε near 1). The test provides the least guidance precisely where it is most needed.

5 The Greenhouse-Geisser Correction: A Patch, Not a Solution

When Mauchly’s test rejects (or is ignored), the standard remedy is to multiply both numerator and denominator degrees of freedom by ε̂ (epsilon-hat), the Greenhouse-Geisser estimate of departure from sphericity.

\[ F^*_{GG} \sim F(\hat{\varepsilon}(k-1),\ \hat{\varepsilon}(k-1)(n-1)) \]

\(\hat{\varepsilon}\) ranges from \(\frac{1}{k-1}\) (maximum violation) to 1 (sphericity holds). The Huynh-Feldt correction uses a less conservative ε̂ that can exceed 1 (truncated to 1).

Code

rm_aov$`Sphericity Corrections` |>
  as.data.frame() |>
  gt() |>
  tab_header(
    title    = "Greenhouse-Geisser and Huynh-Feldt Corrections",
    subtitle = "GGe = G-G epsilon; HFe = H-F epsilon"
  ) |>
  fmt_number(columns = where(is.numeric), decimals = 4) |>
  tab_options(table.font.size = 13)

Effect	GGe	p[GG]	p[GG]<.05	HFe	p[HF]	p[HF]<.05
Greenhouse-Geisser and Huynh-Feldt Corrections
GGe = G-G epsilon; HFe = H-F epsilon
time	0.8642	0.0000	*	0.9140	0.0000	*

Code

rm_aov$ANOVA |>
  as.data.frame() |>
  filter(Effect == "time") |>
  gt() |>
  tab_header(
    title    = "Repeated-Measures ANOVA: Within-Subject Effect of Time",
    subtitle = "Uncorrected (assumes sphericity)"
  ) |>
  fmt_number(columns = where(is.numeric), decimals = 4) |>
  tab_options(table.font.size = 13)

Effect	DFn	DFd	SSn	SSd	F	p	p<.05	ges
Repeated-Measures ANOVA: Within-Subject Effect of Time
Uncorrected (assumes sphericity)
time	2.0000	58.0000	388.8905	51.0604	220.8724	0.0000	*	0.4752

5.1 What GG Correction Does — and Doesn’t Do

The GG correction adjusts the reference F-distribution to approximately account for the inflation in the F-statistic caused by sphericity violation. It is a conservative post-hoc adjustment, not a re-specification of the model.

Critically, it:

Does not model the actual covariance structure of the data
Does not handle missing data or unbalanced time points
Does not allow subject-level random effects beyond a global intercept
Assumes all subjects are observed at the same times
Is only available for the overall within-subject F-test, not for contrasts or predictions

The correction is best understood as a historical workaround from an era before accessible software for mixed models. It is no longer necessary.

6 The Correct Modern Approach: Linear Mixed Models

A linear mixed model (LMM) directly models the correlation structure among repeated observations. It subsumes repeated-measures ANOVA as the special case where the covariance structure is compound symmetry.

6.1 Why LMMs Are Strictly Better

Feature	RM-ANOVA	LMM
Handles missing data (MCAR/MAR)	No (listwise deletion)	Yes (full likelihood)
Unbalanced time points	No	Yes
Flexible covariance structures	No	Yes (CS, AR(1), UN, …)
Subject-level covariates	Limited	Yes
Sphericity required	Yes	No
Time-varying covariates	No	Yes

6.2 Covariance Structures

Code

# 1. Compound symmetry (= RM-ANOVA with sphericity)
lmm_cs <- lme(outcome ~ time,
               random   = ~ 1 | subject,
               correlation = corCompSymm(form = ~ 1 | subject),
               data = df_long,
               method = "REML")

# 2. AR(1) — correct generating structure
lmm_ar1 <- lme(outcome ~ time,
                random      = ~ 1 | subject,
                correlation = corAR1(form = ~ 1 | subject),
                data = df_long,
                method = "REML")

# 3. Unstructured — most flexible, most parameters
lmm_un <- lme(outcome ~ time,
               random      = ~ 1 | subject,
               correlation = corSymm(form = ~ 1 | subject),
               weights     = varIdent(form = ~ 1 | time),
               data = df_long,
               method = "REML")

6.3 Model Comparison by AIC/BIC

AIC and BIC penalise model complexity differently; here we use them to select the covariance structure.

Code

aic_tbl <- AIC(lmm_cs, lmm_ar1, lmm_un) |>
  rownames_to_column("Model") |>
  left_join(
    BIC(lmm_cs, lmm_ar1, lmm_un) |> rownames_to_column("Model") |> select(Model, BIC),
    by = "Model"
  ) |>
  mutate(
    Model = recode(Model,
      lmm_cs  = "Compound Symmetry (CS)",
      lmm_ar1 = "AR(1)",
      lmm_un  = "Unstructured (UN)"
    ),
    `ΔAIC` = AIC - min(AIC),
    `ΔBIC` = BIC - min(BIC)
  ) |>
  arrange(AIC)

aic_tbl |>
  gt() |>
  tab_header(
    title    = "Model Comparison: Covariance Structures",
    subtitle = "Lower AIC/BIC = better fit; ΔAIC > 2 suggests meaningful difference"
  ) |>
  fmt_number(columns = where(is.numeric), decimals = 2) |>
  tab_style(
    style = cell_fill(color = "#d4edda"),
    locations = cells_body(rows = `ΔAIC` == 0)
  ) |>
  tab_options(table.font.size = 13)

Model	df	AIC	BIC	ΔAIC	ΔBIC
Model Comparison: Covariance Structures
Lower AIC/BIC = better fit; ΔAIC > 2 suggests meaningful difference
AR(1)	6.00	332.96	347.75	0.00	0.00
Compound Symmetry (CS)	6.00	336.20	351.00	3.24	3.24
Unstructured (UN)	10.00	338.22	362.88	5.26	15.13

Interpretation: The AR(1) model has the lowest AIC/BIC (highlighted), correctly identifying the data-generating structure. The unstructured model, while flexible, is penalised for its additional parameters. Compound symmetry — the RM-ANOVA assumption — fits worse because the true correlation decays with time rather than remaining constant.

6.4 Fixed Effects from the Best-Fitting LMM (AR1)

Code

tidy(lmm_ar1, effects = "fixed", conf.int = TRUE) |>
  select(term, estimate, std.error, statistic, p.value, conf.low, conf.high) |>
  gt() |>
  tab_header(
    title    = "Fixed Effects: AR(1) Linear Mixed Model",
    subtitle = "Reference category = T1; estimates are mean differences vs. T1"
  ) |>
  cols_label(
    term      = "Term",
    estimate  = "Estimate",
    std.error = "SE",
    statistic = "t",
    p.value   = "p",
    conf.low  = "95% CI (low)",
    conf.high = "95% CI (high)"
  ) |>
  fmt_number(columns = where(is.numeric), decimals = 3) |>
  tab_options(table.font.size = 13)

Term	Estimate	SE	t	p	95% CI (low)	95% CI (high)
Fixed Effects: AR(1) Linear Mixed Model
Reference category = T1; estimates are mean differences vs. T1
(Intercept)	10.392	0.403	25.792	0.000	9.586	11.199
timeT2	1.949	0.223	8.743	0.000	1.503	2.395
timeT3	5.048	0.280	17.998	0.000	4.487	5.610

Interpretation:

Intercept: The estimated mean outcome at T1 is consistent with the true simulation value of 10.
timeT2: The mean increased from T1 to T2 by approximately 2 units (true value = 2). The confidence interval excludes zero, indicating a statistically significant increase.
timeT3: The mean at T3 is approximately 5 units above T1 (true value = 5), also significant.
Standard errors correctly account for the within-subject AR(1) correlation; under RM-ANOVA with unaddressed sphericity violation, these SEs would be understated.

6.5 Estimated Marginal Means and Contrasts

Code

emm <- emmeans(lmm_ar1, ~ time)

emm |>
  as.data.frame() |>
  gt() |>
  tab_header(
    title    = "Estimated Marginal Means by Time Point",
    subtitle = "AR(1) LMM; 95% confidence intervals"
  ) |>
  fmt_number(columns = where(is.numeric), decimals = 2) |>
  tab_options(table.font.size = 13)

time	emmean	SE	df	lower.CL	upper.CL
Estimated Marginal Means by Time Point
AR(1) LMM; 95% confidence intervals
T1	10.39	0.40	29.00	9.57	11.22
T2	12.34	0.40	29.00	11.52	13.17
T3	15.44	0.40	29.00	14.62	16.26

Code

contrast(emm, method = "pairwise", adjust = "bonferroni") |>
  as.data.frame() |>
  gt() |>
  tab_header(
    title    = "Pairwise Contrasts Between Time Points",
    subtitle = "Bonferroni-adjusted; AR(1) LMM"
  ) |>
  fmt_number(columns = where(is.numeric), decimals = 3) |>
  tab_options(table.font.size = 13)

contrast	estimate	SE	df	t.ratio	p.value
Pairwise Contrasts Between Time Points
Bonferroni-adjusted; AR(1) LMM
T1 - T2	−1.949	0.223	58.000	−8.743	0.000
T1 - T3	−5.048	0.280	58.000	−17.998	0.000
T2 - T3	−3.099	0.223	58.000	−13.903	0.000

Interpretation: All three pairwise contrasts are statistically significant after Bonferroni adjustment, consistent with the monotone increasing trend built into the simulation. T3 vs. T1 shows the largest mean difference (~5 units), followed by T3 vs. T2 (~3 units) and T2 vs. T1 (~2 units).

6.6 Visualising Model Predictions vs. Observed Means

Code

pred_df <- as.data.frame(emm) |>
  mutate(source = "LMM AR(1)")

obs_df <- df_long |>
  group_by(time) |>
  summarise(emmean = mean(outcome),
            lower.CL = mean(outcome) - 1.96 * sd(outcome) / sqrt(n()),
            upper.CL = mean(outcome) + 1.96 * sd(outcome) / sqrt(n()),
            .groups = "drop") |>
  mutate(source = "Observed mean")

bind_rows(pred_df |> select(time, emmean, lower.CL, upper.CL, source),
          obs_df) |>
  ggplot(aes(x = time, y = emmean, colour = source, group = source)) +
  geom_point(size = 3, position = position_dodge(0.2)) +
  geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL),
                width = 0.15, position = position_dodge(0.2), linewidth = 0.8) +
  geom_line(position = position_dodge(0.2), linewidth = 0.9) +
  scale_colour_manual(values = c("LMM AR(1)" = "firebrick", "Observed mean" = "steelblue")) +
  labs(
    title    = "LMM Estimated Marginal Means vs. Observed Means",
    subtitle = "Error bars = 95% CI",
    x = "Time", y = "Outcome", colour = NULL
  ) +
  theme_bw(base_size = 13) +
  theme(legend.position = "bottom")

7 Comparing RM-ANOVA vs. LMM Side by Side

Code

# RM-ANOVA time effect
rm_f    <- rm_aov$ANOVA |> filter(Effect == "time") |> pull(F)
rm_p    <- rm_aov$ANOVA |> filter(Effect == "time") |> pull(p)
rm_p_gg <- rm_aov$`Sphericity Corrections` |> filter(Effect == "time") |> pull(`p[GG]`)

# LMM overall time effect via anova
lmm_anova <- anova(lmm_ar1)

comparison <- tibble(
  Method              = c("RM-ANOVA (uncorrected)", "RM-ANOVA (GG corrected)", "LMM – CS", "LMM – AR(1)"),
  `F / test stat`     = c(round(rm_f, 3), round(rm_f, 3), 
                           round(anova(lmm_cs)["time","F-value"], 3),
                           round(anova(lmm_ar1)["time","F-value"], 3)),
  `p-value`           = c(round(rm_p, 4), round(rm_p_gg, 4),
                           round(anova(lmm_cs)["time","p-value"], 4),
                           round(anova(lmm_ar1)["time","p-value"], 4)),
  `Sphericity needed` = c("Yes", "Patch only", "No (CS assumed)", "No"),
  `Handles missing`   = c("No", "No", "Yes", "Yes")
)

comparison |>
  gt() |>
  tab_header(
    title    = "RM-ANOVA vs. LMM: Time Effect Results",
    subtitle = "Same data, different models; AR(1) is the true generating structure"
  ) |>
  tab_style(
    style = cell_fill(color = "#d4edda"),
    locations = cells_body(rows = Method == "LMM – AR(1)")
  ) |>
  tab_options(table.font.size = 13)

Method	F / test stat	p-value	Sphericity needed	Handles missing
RM-ANOVA vs. LMM: Time Effect Results
Same data, different models; AR(1) is the true generating structure
RM-ANOVA (uncorrected)	220.872	0	Yes	No
RM-ANOVA (GG corrected)	220.872	0	Patch only	No
LMM – CS	220.872	0	No (CS assumed)	Yes
LMM – AR(1)	167.468	0	No	Yes

Interpretation: All methods correctly detect a significant time effect given the strong true effect. The key distinction is not the p-value here but the quality of inference: the AR(1) LMM uses a correctly specified covariance model, produces accurate standard errors, and supports follow-up contrasts and predictions without the sphericity assumption. The GG-corrected RM-ANOVA narrows the degrees of freedom conservatively but still assumes a mis-specified error structure.

8 Practical Decision Guide

Situation	Recommendation
When to Use RM-ANOVA vs. LMM
Balanced design, CS covariance plausible, complete data	RM-ANOVA acceptable; verify with AIC vs. LMM-CS
Any doubt about covariance structure	LMM with AIC-guided covariance selection
Missing outcome data	LMM (REML/ML under MAR); RM-ANOVA invalid
Unequally spaced time points	LMM required; RM-ANOVA not defined
Time-varying covariates	LMM required
Interest in individual trajectories	LMM with random slopes

9 Key Takeaways

Note

1. Sphericity is stricter than most analysts realise. It requires equal variance for all pairwise difference scores — not just equal variances at each time point. It fails whenever correlation decays with time (AR-type structures), which is common in longitudinal clinical data.

2. Mauchly’s test has the wrong power profile. Underpowered when n is small (you most need protection), and hypersensitive when n is large (flags inconsequential deviations). Using it as a binary decision rule is not sound practice.

3. Greenhouse-Geisser is a patch, not a cure. It deflates the degrees of freedom to approximate the correct reference distribution but leaves the model misspecified. It does not handle missing data, unbalanced designs, or time-varying covariates.

4. LMMs are the principled solution. Repeated-measures ANOVA is a special case of an LMM with a compound symmetry covariance structure. LMMs allow you to explicitly model and select the correct covariance structure, handle missing data under MAR, accommodate unequal spacing, and support richer fixed- and random-effect specifications.

5. Use AIC/BIC to select the covariance structure. Compare CS, AR(1), and unstructured (and others as appropriate) using information criteria rather than relying on Mauchly’s test.

10 Software and References

R packages: nlme (Pinheiro & Bates), lme4/lmerTest, ez, emmeans, broom.mixed
Huynh, H. & Feldt, L.S. (1976). Estimation of the Box correction for degrees of freedom from sample data in the randomized block and split-plot designs. Journal of Educational Statistics, 1, 69–82.
Greenhouse, S.W. & Geisser, S. (1959). On methods in the analysis of profile data. Psychometrika, 24, 95–112.
Mauchly, J.W. (1940). Significance test for sphericity of a normal n-variate distribution. Annals of Mathematical Statistics, 11, 204–209.
Pinheiro, J.C. & Bates, D.M. (2000). Mixed-Effects Models in S and S-PLUS. Springer.
Verbeke, G. & Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data. Springer.