Change over time: caveats and recommendations

Worked examples - CYP waiting times and meaningful change in large-sample surveys

Author

Fede Andreis, Practice Lead - Quantitative Analytics & Statistics

Published

May 2026

Scope and intent

This note responds to two specific questions raised by Strategic Insight (Mental Health):

Question 1: Is month-on-month percentage change appropriate for CYP waiting times indicators when the published values are monthly medians and 3-month rolling medians? If not, what are defensible alternatives?

Question 2: Is Cohen’s h a useful way to frame meaningful change (as opposed to statistically detectable change) in large-sample proportion comparisons such as the NHS Staff Survey? Should the conventional thresholds (0.2 / 0.5 / 0.8) be used to define “meaningful”?

The document works through both cases with simulated data designed to mimic the structure of the real series. It is not a general guide to time-series analysis or effect sizes - it addresses these specific questions.

A note on the data

Synthetic data is used so that the underlying truth is known, which is what allows us to identify when an analytical method introduces artefacts. The values are calibrated to be plausible for the stated context but should not be read as representing actual CQC indicators.


Part 1 - CYP waiting times: % change on rolling medians

1.1 The setting

The MH team monitors waiting times to first appointment for CYP services. Each month there is a (large) sample of individual-level wait durations. Two summaries are typically reported:

  • the monthly median wait;
  • a 3-month rolling median (or rolling mean) for smoothing.

A natural-seeming next step is to compute the month-on-month percentage change of the rolling series and to interpret it as “how much has the wait changed this month”.

The argument below is that this combination is hard to defend, for two distinct reasons that are easy to confuse but worth separating.

1.2 Simulating a plausible series

We simulate 48 months of patient-level waiting times. Each month contains approximately 1,500 individual patient waits, following a right-skewed distribution (log-normal, as is typical for waiting-time data). The underlying monthly median has a mild upward drift and modest noise.
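
A minimal sketch of this simulation is below; the drift, noise level, and seed are illustrative assumptions rather than the exact parameters behind the table that follows.

```python
# Minimal sketch of the Part 1 simulation. Drift, noise level, and seed
# are illustrative assumptions, not the exact parameters behind the table.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_months, n_patients = 48, 1500

medians = []
for t in range(n_months):
    # Mild upward drift plus modest noise in the underlying monthly median
    true_median = 8.0 + 0.03 * t + rng.normal(0, 0.3)
    # For a log-normal, median = exp(mu), so set mu = log(true_median)
    waits = rng.lognormal(mean=np.log(true_median), sigma=0.6, size=n_patients)
    medians.append(np.median(waits))

df = pd.DataFrame({
    "month": pd.date_range("2022-01-01", periods=n_months, freq="MS"),
    "monthly_median": medians,
})
df["rolling_3m_median"] = df["monthly_median"].rolling(3).median()
df["pct_change_rolling"] = df["rolling_3m_median"].pct_change() * 100
df["pct_change_monthly"] = df["monthly_median"].pct_change() * 100
print(df.tail(8).round(2))
```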

The last eight months of the simulated series:

month        monthly_median  rolling_3m_median  pct_change_rolling  pct_change_monthly
2025-05-01             9.40               9.40               -4.84               -9.23
2025-06-01             9.70               9.70                3.14                3.14
2025-07-01             9.81               9.70                0.00                1.17
2025-08-01             8.64               9.70                0.00              -11.92
2025-09-01            10.48               9.81                1.17               21.29
2025-10-01             8.96               8.96               -8.64              -14.49
2025-11-01             9.78               9.78                9.08                9.08
2025-12-01             9.75               9.75               -0.23               -0.23
Figure 1: Simulated CYP monthly median waiting times and 3-month rolling median over 48 months.
Warning: Effect 1 - lag-1 autocorrelation by construction

1.3 The technical issue

A 3-month rolling statistic at time \(t\) shares two of its three input months with the value at time \(t-1\). Even if the underlying monthly series has no temporal dependence at all, the rolling series will exhibit substantial lag-1 autocorrelation simply because consecutive points are computed from heavily overlapping data.

To make this concrete, we simulate a series in which there is no real dynamic at all - each month’s median is independent white noise around a fixed level - and display the autocorrelation of the underlying vs the rolling series below.

The left panel is what you would expect from a series with no real dynamic: autocorrelations near zero at all lags. The right panel - the rolling series computed from the same data - shows substantial positive lag-1 autocorrelation, decaying over a few months. None of this reflects the underlying process. It is a property of the smoothing operation itself.
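
A minimal sketch of this check, with arbitrary parameter values. For a 3-month rolling mean of white noise the lag-1 autocorrelation is exactly 2/3 (two of three window terms are shared); the rolling median behaves similarly.

```python
# White-noise check: the underlying series has no temporal structure,
# yet its 3-month rolling version shows substantial lag-1 autocorrelation.
import numpy as np
import pandas as pd

def acf_lag1(s: pd.Series) -> float:
    """Lag-1 autocorrelation, dropping the NaNs the rolling window creates."""
    x = s.dropna().to_numpy()
    return np.corrcoef(x[:-1], x[1:])[0, 1]

rng = np.random.default_rng(0)
underlying = pd.Series(rng.normal(loc=9.0, scale=0.5, size=500))  # no real dynamic
rolling = underlying.rolling(3).median()

print(f"lag-1 ACF, underlying series: {acf_lag1(underlying):+.2f}")  # near 0
print(f"lag-1 ACF, rolling series:    {acf_lag1(rolling):+.2f}")     # strongly positive
```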

Implication. Any analysis that takes the rolling series and tests for trend, change-points, or serial dependence using methods that assume independent observations will mis-attribute structure that the smoother created. This includes naive significance tests on month-on-month differences.

Warning: Effect 2 - the “monthly change” isn’t a monthly change

1.4 The technical issue

This is the more practically damaging issue. Let \(X_t\) denote the underlying monthly median at time \(t\), and \(Y_t\) denote the 3-month rolling statistic at time \(t\) (computed from \(X_{t-2}, X_{t-1}, X_t\)).

For a 3-month rolling mean, the algebra is exact:

\[ Y_{t+1} - Y_t = \frac{X_{t+1} - X_{t-2}}{3} \]

The two shared months cancel. The “month-on-month” change in the rolling series is driven by the new month entering the window and the old month leaving it - and those two months are three months apart in the underlying data.

For a rolling median the algebra is not this clean (the median is non-linear), but the qualitative story holds: movement in the rolling median comes from which observations enter and leave the window.
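
The identity is easy to verify numerically; a quick sketch for the rolling mean, on arbitrary simulated input:

```python
# Numeric check of the identity for the 3-month rolling *mean*
# (it holds exactly for the mean; the median version is only qualitative).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = pd.Series(rng.normal(size=60))
y = x.rolling(3).mean()

lhs = y.diff()               # Y_t - Y_{t-1}
rhs = (x - x.shift(3)) / 3   # (X_t - X_{t-3}) / 3: only entering/leaving months matter
assert np.allclose(lhs.dropna(), rhs.dropna())
```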

Two practical consequences:

  • Timing. A movement in the rolling series can be driven by something that happened three months ago dropping out of the window, not by anything that happened in the most recent month.
  • Persistence. A single anomalous month produces apparent “changes” in the rolling series for as many months as the window length.

The simulation below injects a one-off spike in month 24 (say, a service disruption) into an otherwise stable underlying series.
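
A sketch of this simulation; the shock size and noise level are assumed values, chosen to make the effect visible:

```python
# One-off shock: a single anomalous month (index 24) in an otherwise
# stable series. Shock size and noise level are assumed values.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
median = pd.Series(8.0 + rng.normal(0, 0.15, size=48),
                   index=pd.date_range("2023-01-01", periods=48, freq="MS"))
median.iloc[24] += 2.5  # one-off service disruption

rolling = median.rolling(3).median()
out = pd.DataFrame({
    "% MoM (underlying)": median.pct_change() * 100,
    "% MoM (rolling)": rolling.pct_change() * 100,
})
# The underlying % change spikes once and corrects once; the rolling % change
# is smeared (and attenuated) across the months in which the shock enters,
# sits in, and leaves the window.
print(out.iloc[22:29].round(1))
```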

The bottom panel makes the issue visible. The underlying % change shows a large spike at the shock month and a single corresponding correction the month after - consistent with what actually happened. The rolling % change tells a different story: it shows changes spread across three months (when the shock enters, persists, and exits the window), with the magnitudes attenuated by the windowing. A reader of the rolling % change series would conclude that something was “changing” for three months. Nothing was; it was a single month-long event being smeared across the window.

month    Underlying median  Rolling median  % MoM (underlying)  % MoM (rolling)
2024-11               8.13            7.96                 2.1              0.7
2024-12              10.80            8.13                32.8              2.1
2025-01               7.88            8.13               -27.0              0.0
2025-02               8.11            8.11                 2.9             -0.3
2025-03               8.06            8.06                -0.6             -0.6
2025-04               8.37            8.11                 3.9              0.6

1.5 Why the combination is particularly unhelpful

Computing % MoM change on a rolling statistic is essentially trying to extract a high-frequency signal from a series that has been deliberately smoothed to remove it. You typically get the worst of both worlds: enough smoothing to distort timing and magnitude (Effect 2), and enough induced autocorrelation to invalidate naive inference (Effect 1) - but not enough smoothing to provide a clean trend signal.

If smoothing is being used because the monthly series is too noisy, the appropriate response is to interpret change on a longer timescale, not to compute short-horizon derivatives of the smoothed series.

1.6 Summary table - Part 1

Approach                          Defensible?        Notes
% MoM on monthly median           Yes, with caveats  Volatile; pair with SPC for signal vs noise
% MoM on 3-month rolling median   No as default      Conflates timing, smears one-off events across the window, induces autocorrelation
Same-month YoY (absolute or %)    Yes                Removes seasonality and window overlap
Rolling 12m vs prior rolling 12m  Yes                Non-overlapping windows
SPC chart on underlying monthly   Yes                Separates signal from noise; widely understood in NHS
Theil–Sen slope with CI           Yes                One-number summary of trend with uncertainty
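
For concreteness, minimal sketches of two of the alternatives above - a Theil–Sen trend estimate via SciPy's theilslopes, and XmR (individuals) SPC limits. The input series is a stand-in for the underlying monthly medians:

```python
# Theil-Sen trend estimate with CI, plus XmR (individuals) SPC limits.
# The simulated input is a stand-in for the underlying monthly medians.
import numpy as np
from scipy.stats import theilslopes

rng = np.random.default_rng(3)
monthly_median = 8.0 + 0.03 * np.arange(48) + rng.normal(0, 0.3, size=48)
months = np.arange(len(monthly_median))

# Theil-Sen: median of pairwise slopes, robust to outliers, with a CI
slope, intercept, lo, hi = theilslopes(monthly_median, months)
print(f"Theil-Sen slope: {slope:.3f} per month (95% CI {lo:.3f} to {hi:.3f})")

# XmR limits: centre line +/- 2.66 * mean moving range
moving_range = np.abs(np.diff(monthly_median))
centre = monthly_median.mean()
half_width = 2.66 * moving_range.mean()
print(f"XmR limits: {centre - half_width:.2f} to {centre + half_width:.2f}")
```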

Part 2 - Meaningful change vs statistical significance

2.1 The setting

The NHS Staff Survey reaches several hundred thousand respondents. With samples that large, almost any year-on-year movement in a proportion will be “statistically significant” in the conventional sense. The question is whether Cohen’s h can reframe the conversation around magnitude and meaningfulness rather than detectability, with reference to the conventional thresholds (0.2 / 0.5 / 0.8).

The argument here is: yes - partially. Effect sizes are an improvement over significance alone, but the conventional thresholds are conventions, not substantive benchmarks. Treating them as fixed cut-offs for “meaningful change” imports a numerical convention as a substantive threshold, without any grounding in the specific context.

What “meaningful” requires is a substantive anchor: a minimum important difference, an action threshold, a regulatory cut-off - agreed before the analysis, in dialogue with the team that will act on it. The effect size then provides a vocabulary for assessing against that anchor.

2.2 A worked example: significance, effect size, and meaningful change

Cohen’s h for two proportions \(p_1, p_2\) is:

\[ h = 2 \arcsin\sqrt{p_1} - 2 \arcsin\sqrt{p_2} \]

It is a measure of distance between proportions on a variance-stabilised scale and does not depend on sample size.

Example: Suppose the NHS Staff Survey (n ≈ 600,000 respondents per year) shows that the proportion of staff who “would recommend their organisation as a place to work” changed from 62.0% to 64.5% year-on-year. The team has agreed in advance that a change of 2 percentage points or more is action-relevant (the minimum important difference, MID).

We compute:

  • Absolute change: 2.5 percentage points.
  • Cohen’s h: \(2\arcsin\sqrt{0.645} - 2\arcsin\sqrt{0.620} \approx 0.052\).
  • Statistical significance: with n = 600,000 per year, the change is overwhelmingly significant (p < 0.0001).
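
The same calculation in code, using statsmodels for the two-proportion test. Counts are reconstructed from the rounded proportions, so the p-value is illustrative rather than exact:

```python
# Worked example in code: Cohen's h alongside a two-proportion z-test.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of proportions on the arcsine scale."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

n = 600_000                      # approximate respondents per year
p_prev, p_curr = 0.620, 0.645
mid = 0.02                       # agreed minimum important difference (2 pp)

h = cohens_h(p_curr, p_prev)
counts = np.array([round(p_curr * n), round(p_prev * n)])
stat, pval = proportions_ztest(counts, nobs=np.array([n, n]))

print(f"absolute change: {p_curr - p_prev:+.3f} (MID = {mid})")  # +0.025, above MID
print(f"Cohen's h:       {h:.3f}")                               # ~0.052, below 'small'
print(f"p-value:         {pval:.1e}")                            # vanishingly small
```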

How should this be interpreted?

  1. Statistical significance: p < 0.0001 tells us the change is not due to sampling noise. With n = 600,000, this is unsurprising - almost any non-zero change would be significant.

  2. Effect size: Cohen’s h ≈ 0.05 is well below the conventional “small” threshold of 0.2. By that standard, this would be dismissed as trivial.

  3. Substantive anchor: The change (2.5 pp) exceeds the agreed MID of 2 pp. The team has determined that a 2 pp shift in this indicator is action-relevant.

The defensible conclusion: This is a statistically significant and meaningful improvement. The MID does the work of defining “meaningful” - not the p-value, not Cohen’s conventional thresholds. The effect size (h = 0.05) characterises the magnitude on a standardised scale, which is useful for cross-question comparisons, but it does not determine whether the change matters.

Contrast: If the same question had changed from 62.0% to 62.3% (0.3 pp), it would still be statistically significant (p < 0.0001) with Cohen’s h ≈ 0.006, but it would be below the MID and therefore not action-relevant, despite being “significant.”

2.3 What the conventional thresholds are - and aren’t

Cohen introduced the 0.2 / 0.5 / 0.8 thresholds explicitly as conventions, offered as operating rules for use when no better basis for estimating the effect size is available. They were not derived from any feature of social or organisational reality.

Because h is computed on the arcsine scale, which stabilises the variance of proportions, it is comparable across different baseline rates in a way that raw percentage-point differences are not.

In a regulatory or public-facing context, treating them as fixed cut-offs has two problems:

  1. They lack substantive grounding. A Cohen’s h of 0.20 in one indicator might correspond to an action-relevant change in service quality; in another, it might be operationally trivial. The threshold doesn’t know.
  2. It repeats a familiar pattern. Importing convenient numerical conventions as substantive thresholds - e.g. p < 0.05 as “real”, or z > 2 as “concerning” - has caused us methodological grief elsewhere. Doing the same with effect sizes would be self-inflicted.

The structure of a defensible report on year-on-year change in a large-sample survey is then: the MID carries the weight of the meaningfulness judgement; the effect size characterises magnitude on a comparable scale; the significance test confirms the change isn’t sampling noise. The conventional Cohen thresholds are not needed.

2.4 Where Cohen’s h is genuinely useful

Despite the above, h is still worth having in the toolkit for specific purposes:

  • Standardised comparison across questions with different baselines. A 2 pp change from 0.05 to 0.07 and a 2 pp change from 0.50 to 0.52 are not the same kind of movement. Cohen’s h reflects this; absolute percentage point change does not (see the sketch after this list).
  • Cross-survey comparisons where absolute scales differ.
  • A sanity check on significance results in very large samples - a near-zero h with p < 0.001 is a clear signal that the result is statistically detectable but not substantively interesting.
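
A two-line illustration of the first bullet, using the same hypothetical proportions:

```python
# Same 2 pp change, different baselines: Cohen's h separates them,
# raw percentage-point change does not.
import numpy as np

def cohens_h(p1: float, p2: float) -> float:
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

print(f"0.05 -> 0.07: h = {cohens_h(0.07, 0.05):.3f}")  # ~0.084
print(f"0.50 -> 0.52: h = {cohens_h(0.52, 0.50):.3f}")  # ~0.040
```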

What it should not be used for: as a standalone benchmark of meaningfulness, with a fixed Cohen threshold, in regulatory or public-facing reporting.

2.5 Summary - Part 2

  • Statistical significance in large samples is uninformative on its own.
  • Cohen’s h is a useful descriptive statistic for the magnitude of difference between proportions, particularly across baselines.
  • The 0.2 / 0.5 / 0.8 thresholds are conventions, not substantive benchmarks. Treating them as the latter repeats the failure mode of treating p < 0.05 as “real change”.
  • “Meaningful” requires a substantive anchor: a minimum important difference, action threshold, or regulatory cut-off, agreed in dialogue with the team that will act on the result, before the analysis.
  • A defensible report combines: the absolute change (vs MID), an effect size for magnitude, and a significance test for ruling out sampling noise.

Part 3 - Implications and discussion points

The two cases in this note generalise. Both are instances of the same broader principle: statistical conventions are not substitutes for substantive judgement, and analytical methods that obscure the link between the two should be avoided.

Key themes for discussion:

  • Decision framing: What question are you answering? (signal vs noise, comparison across baselines, communicating to stakeholders). The choice of method should follow from the question, not from convention.
  • Worked examples matter: The CYP series demonstrates why SPC, YoY, and slope estimation are defensible alternatives to % change on rolling statistics.
  • Effect sizes need context: Cohen’s h and similar metrics are useful for characterising magnitude, but “meaningful” requires a substantive anchor (MID, action threshold) agreed with stakeholders before analysis.
  • Explicit warnings: % change of overlapping rolling statistics should be avoided as a routine indicator without caveats.

Suggested next steps:

  1. Discuss these cases with the MH team to confirm the recommendations align with their analytical needs and stakeholder expectations.
  2. Agree substantive anchors (MIDs, action thresholds) for the indicators in scope.
  3. Consider whether similar issues affect other SI indicators beyond CYP waiting times and Staff Survey proportions.

Notes and caveats on this document

  • All data is simulated; values are calibrated to be plausible but should not be read as representing actual NHS indicators.
  • The SPC illustration uses standard XmR conventions (mean ± 2.66 × mean moving range). For proportion-based indicators, P-charts with appropriately calculated limits should be used; for rates with varying denominators, funnel plots may be more appropriate.
  • The Theil–Sen slope is illustrated on the full simulated series for simplicity. In practice, the choice of window matters and should be guided by the question being asked.
  • Cohen’s h assumes independent samples. For repeated-measures or clustered designs (e.g. trust-level Staff Survey returns over time), more careful treatment of dependence is needed; the qualitative argument about meaningful change vs significance carries through.