Change over time: caveats and recommendations

Worked examples - CYP waiting times and meaningful change in large-sample surveys

Author

Fede Andreis, Practice Lead - Quantitative Analytics & Statistics

Published

May 2026

Scope and intent

This note responds to two specific questions raised by Strategic Insight (Mental Health):

Question 1: Is month-on-month percentage change appropriate for CYP waiting times indicators when the published values are monthly medians and 3-month rolling medians? If not, what are defensible alternatives?

Question 2: Is Cohen’s h a useful way to frame meaningful change (as opposed to statistically detectable change) in large-sample proportion comparisons such as the NHS Staff Survey? Should the conventional thresholds (0.2 / 0.5 / 0.8) be used to define “meaningful”?

The document works through both cases with simulated data designed to mimic the structure of the real series. It is not a general guide to time-series analysis or effect sizes - it addresses these specific questions.

A note on the data

Synthetic data is used so that the underlying truth is known, which is what allows us to identify when an analytical method introduces artefacts. The values are calibrated to be plausible for the stated context but should not be read as representing actual CQC indicators.


Part 1 - CYP waiting times: % change on rolling medians

1.1 The setting

The MH team monitors waiting times to first appointment for CYP services. Each month there is a (large) sample of individual-level wait durations. Two summaries are typically reported:

  • the monthly median wait;
  • a 3-month rolling median (or rolling mean) for smoothing.

A natural-seeming next step is to compute the month-on-month percentage change of the rolling series and to interpret it as “how much has the wait changed this month”.

The argument below is that this combination is hard to defend, for two distinct reasons that are easy to confuse but worth separating.

1.2 Simulating a plausible series

We simulate 48 months of patient-level waiting times. Each month contains approximately 1,500 individual patient waits, following a right-skewed distribution (log-normal, as is typical for waiting-time data). The underlying monthly median has a mild upward drift and modest noise.
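
A minimal sketch of this simulation is below; the drift, noise level, and seed are illustrative assumptions rather than the exact parameters behind the table that follows.

```python
# Minimal sketch of the Part 1 simulation. Drift, noise level, and seed
# are illustrative assumptions, not the exact parameters behind the table.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_months, n_patients = 48, 1500

medians = []
for t in range(n_months):
    # Mild upward drift plus modest noise in the underlying monthly median
    true_median = 8.0 + 0.03 * t + rng.normal(0, 0.3)
    # For a log-normal, median = exp(mu), so set mu = log(true_median)
    waits = rng.lognormal(mean=np.log(true_median), sigma=0.6, size=n_patients)
    medians.append(np.median(waits))

df = pd.DataFrame({
    "month": pd.date_range("2022-01-01", periods=n_months, freq="MS"),
    "monthly_median": medians,
})
df["rolling_3m_median"] = df["monthly_median"].rolling(3).median()
df["pct_change_rolling"] = df["rolling_3m_median"].pct_change() * 100
df["pct_change_monthly"] = df["monthly_median"].pct_change() * 100
print(df.tail(8).round(2))
```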

The last eight months of the simulated series:

month        monthly_median  rolling_3m_median  pct_change_rolling  pct_change_monthly
2025-05-01             9.40               9.40               -4.84               -9.23
2025-06-01             9.70               9.70                3.14                3.14
2025-07-01             9.81               9.70                0.00                1.17
2025-08-01             8.64               9.70                0.00              -11.92
2025-09-01            10.48               9.81                1.17               21.29
2025-10-01             8.96               8.96               -8.64              -14.49
2025-11-01             9.78               9.78                9.08                9.08
2025-12-01             9.75               9.75               -0.23               -0.23
Figure 1: Simulated CYP monthly median waiting times and 3-month rolling median over 48 months.
Warning: Effect 1 - lag-1 autocorrelation by construction

1.3 The technical issue

A 3-month rolling statistic at time \(t\) shares two of its three input months with the value at time \(t-1\). Even if the underlying monthly series has no temporal dependence at all, the rolling series will exhibit substantial lag-1 autocorrelation simply because consecutive points are computed from heavily overlapping data.

To make this concrete, we simulate a series in which there is no real dynamic at all - each month’s median is independent white noise around a fixed level - and display the autocorrelation of the underlying vs the rolling series below.

The left panel is what you would expect from a series with no real dynamic: autocorrelations near zero at all lags. The right panel - the rolling series computed from the same data - shows substantial positive lag-1 autocorrelation, decaying over a few months. None of this reflects the underlying process. It is a property of the smoothing operation itself.
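
A minimal sketch of this check, with arbitrary parameter values. For a 3-month rolling mean of white noise the lag-1 autocorrelation is exactly 2/3 (two of three window terms are shared); the rolling median behaves similarly.

```python
# White-noise check: the underlying series has no temporal structure,
# yet its 3-month rolling version shows substantial lag-1 autocorrelation.
import numpy as np
import pandas as pd

def acf_lag1(s: pd.Series) -> float:
    """Lag-1 autocorrelation, dropping the NaNs the rolling window creates."""
    x = s.dropna().to_numpy()
    return np.corrcoef(x[:-1], x[1:])[0, 1]

rng = np.random.default_rng(0)
underlying = pd.Series(rng.normal(loc=9.0, scale=0.5, size=500))  # no real dynamic
rolling = underlying.rolling(3).median()

print(f"lag-1 ACF, underlying series: {acf_lag1(underlying):+.2f}")  # near 0
print(f"lag-1 ACF, rolling series:    {acf_lag1(rolling):+.2f}")     # strongly positive
```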

Implication. Any analysis that takes the rolling series and tests for trend, change-points, or serial dependence using methods that assume independent observations will mis-attribute structure that the smoother created. This includes naive significance tests on month-on-month differences.

Warning: Effect 2 - the “monthly change” isn’t a monthly change

1.4 The technical issue

This is the more practically damaging issue. Let \(X_t\) denote the underlying monthly median at time \(t\), and \(Y_t\) denote the 3-month rolling statistic at time \(t\) (computed from \(X_{t-2}, X_{t-1}, X_t\)).

For a 3-month rolling mean, the algebra is exact:

\[ Y_{t+1} - Y_t = \frac{X_{t+1} - X_{t-2}}{3} \]

The two shared months cancel. The “month-on-month” change in the rolling series is driven by the new month entering the window and the old month leaving it - and those two months are three months apart in the underlying data.

For a rolling median the algebra is not this clean (the median is non-linear), but the qualitative story holds: movement in the rolling median comes from which observations enter and leave the window.
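
The identity is easy to verify numerically; a quick sketch for the rolling mean, on arbitrary simulated input:

```python
# Numeric check of the identity for the 3-month rolling *mean*
# (it holds exactly for the mean; the median version is only qualitative).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = pd.Series(rng.normal(size=60))
y = x.rolling(3).mean()

lhs = y.diff()               # Y_t - Y_{t-1}
rhs = (x - x.shift(3)) / 3   # (X_t - X_{t-3}) / 3: only entering/leaving months matter
assert np.allclose(lhs.dropna(), rhs.dropna())
```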

Two practical consequences:

  • Timing. A movement in the rolling series can be driven by something that happened three months ago dropping out of the window, not by anything that happened in the most recent month.
  • Persistence. A single anomalous month produces apparent “changes” in the rolling series for as many months as the window length.

The simulation below injects a one-off spike in month 24 (say, a service disruption) into an otherwise stable underlying series.
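
A sketch of this simulation; the shock size and noise level are assumed values, chosen to make the effect visible:

```python
# One-off shock: a single anomalous month (index 24) in an otherwise
# stable series. Shock size and noise level are assumed values.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
median = pd.Series(8.0 + rng.normal(0, 0.15, size=48),
                   index=pd.date_range("2023-01-01", periods=48, freq="MS"))
median.iloc[24] += 2.5  # one-off service disruption

rolling = median.rolling(3).median()
out = pd.DataFrame({
    "% MoM (underlying)": median.pct_change() * 100,
    "% MoM (rolling)": rolling.pct_change() * 100,
})
# The underlying % change spikes once and corrects once; the rolling % change
# is smeared (and attenuated) across the months in which the shock enters,
# sits in, and leaves the window.
print(out.iloc[22:29].round(1))
```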

The bottom panel makes the issue visible. The underlying % change shows a large spike at the shock month and a single corresponding correction the month after - consistent with what actually happened. The rolling % change tells a different story: it shows changes spread across three months (when the shock enters, persists, and exits the window), with the magnitudes attenuated by the windowing. A reader of the rolling % change series would conclude that something was “changing” for three months. Nothing was; it was a single month-long event being smeared across the window.

month    Underlying median  Rolling median  % MoM (underlying)  % MoM (rolling)
2024-11               8.13            7.96                 2.1              0.7
2024-12              10.80            8.13                32.8              2.1
2025-01               7.88            8.13               -27.0              0.0
2025-02               8.11            8.11                 2.9             -0.3
2025-03               8.06            8.06                -0.6             -0.6
2025-04               8.37            8.11                 3.9              0.6

1.5 Why the combination is particularly unhelpful

Computing % MoM change on a rolling statistic is essentially trying to extract a high-frequency signal from a series that has been deliberately smoothed to remove it. You typically get the worst of both worlds: enough smoothing to distort timing and magnitude (Effect 2), and enough induced autocorrelation to invalidate naive inference (Effect 1) - but not enough smoothing to provide a clean trend signal.

If smoothing is being used because the monthly series is too noisy, the appropriate response is to interpret change on a longer timescale, not to compute short-horizon derivatives of the smoothed series.

1.6 Summary table - Part 1

Approach                          Defensible?        Notes
% MoM on monthly median           Yes, with caveats  Volatile; pair with SPC for signal vs noise
% MoM on 3-month rolling median   No as default      Conflates timing, smears one-off events across the window, induces autocorrelation
Same-month YoY (absolute or %)    Yes                Removes seasonality and window overlap
Rolling 12m vs prior rolling 12m  Yes                Non-overlapping windows
SPC chart on underlying monthly   Yes                Separates signal from noise; widely understood in NHS
Theil–Sen slope with CI           Yes                One-number summary of trend with uncertainty
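
For concreteness, minimal sketches of two of the alternatives above - a Theil–Sen trend estimate via SciPy's theilslopes, and XmR (individuals) SPC limits. The input series is a stand-in for the underlying monthly medians:

```python
# Theil-Sen trend estimate with CI, plus XmR (individuals) SPC limits.
# The simulated input is a stand-in for the underlying monthly medians.
import numpy as np
from scipy.stats import theilslopes

rng = np.random.default_rng(3)
monthly_median = 8.0 + 0.03 * np.arange(48) + rng.normal(0, 0.3, size=48)
months = np.arange(len(monthly_median))

# Theil-Sen: median of pairwise slopes, robust to outliers, with a CI
slope, intercept, lo, hi = theilslopes(monthly_median, months)
print(f"Theil-Sen slope: {slope:.3f} per month (95% CI {lo:.3f} to {hi:.3f})")

# XmR limits: centre line +/- 2.66 * mean moving range
moving_range = np.abs(np.diff(monthly_median))
centre = monthly_median.mean()
half_width = 2.66 * moving_range.mean()
print(f"XmR limits: {centre - half_width:.2f} to {centre + half_width:.2f}")
```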

Part 2 - Meaningful change vs statistical significance

2.1 The setting

The NHS Staff Survey reaches several hundred thousand respondents. With samples that large, almost any year-on-year movement in a proportion will be “statistically significant” in the conventional sense. The question is whether Cohen’s h can reframe the conversation around magnitude and meaningfulness rather than detectability, with reference to the conventional thresholds (0.2 / 0.5 / 0.8).

The argument here is: yes - partially. Effect sizes are an improvement over significance alone, but the conventional thresholds are conventions, not substantive benchmarks. Treating them as fixed cut-offs for “meaningful change” imports a numerical convention as a substantive threshold, without any grounding in the specific context.

What “meaningful” requires is a substantive anchor: a minimum important difference, an action threshold, a regulatory cut-off - agreed before the analysis, in dialogue with the team that will act on it. The effect size then provides a vocabulary for assessing against that anchor.

2.2 A worked example: significance, effect size, and meaningful change

Cohen’s h for two proportions \(p_1, p_2\) is:

\[ h = 2 \arcsin\sqrt{p_1} - 2 \arcsin\sqrt{p_2} \]

It is a measure of distance between proportions on a variance-stabilised scale and does not depend on sample size.

Example: Suppose the NHS Staff Survey (n ≈ 600,000 respondents per year) shows that the proportion of staff who “would recommend their organisation as a place to work” changed from 62.0% to 64.5% year-on-year. The team has agreed in advance that a change of 2 percentage points or more is action-relevant (the minimum important difference, MID).

We compute:

  • Absolute change: 2.5 percentage points.
  • Cohen’s h: \(2\arcsin\sqrt{0.645} - 2\arcsin\sqrt{0.620} \approx 0.052\).
  • Statistical significance: with n = 600,000 per year, the change is overwhelmingly significant (p < 0.0001).
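
The same calculation in code, using statsmodels for the two-proportion test. Counts are reconstructed from the rounded proportions, so the p-value is illustrative rather than exact:

```python
# Worked example in code: Cohen's h alongside a two-proportion z-test.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of proportions on the arcsine scale."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

n = 600_000                      # approximate respondents per year
p_prev, p_curr = 0.620, 0.645
mid = 0.02                       # agreed minimum important difference (2 pp)

h = cohens_h(p_curr, p_prev)
counts = np.array([round(p_curr * n), round(p_prev * n)])
stat, pval = proportions_ztest(counts, nobs=np.array([n, n]))

print(f"absolute change: {p_curr - p_prev:+.3f} (MID = {mid})")  # +0.025, above MID
print(f"Cohen's h:       {h:.3f}")                               # ~0.052, below 'small'
print(f"p-value:         {pval:.1e}")                            # vanishingly small
```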

How should this be interpreted?

  1. Statistical significance: p < 0.0001 tells us the change is not due to sampling noise. With n = 600,000, this is unsurprising - almost any non-zero change would be significant.

  2. Effect size: Cohen’s h ≈ 0.05 is well below the conventional “small” threshold of 0.2. By that standard, this would be dismissed as trivial.

  3. Substantive anchor: The change (2.5 pp) exceeds the agreed MID of 2 pp. The team has determined that a 2 pp shift in this indicator is action-relevant.

The defensible conclusion: This is a statistically significant and meaningful improvement. The MID does the work of defining “meaningful” - not the p-value, not Cohen’s conventional thresholds. The effect size (h = 0.05) characterises the magnitude on a standardised scale, which is useful for cross-question comparisons, but it does not determine whether the change matters.

Contrast: If the same question had changed from 62.0% to 62.3% (0.3 pp), it would still be statistically significant (p < 0.0001) with Cohen’s h ≈ 0.006, but it would be below the MID and therefore not action-relevant, despite being “significant.”

2.3 What the conventional thresholds are - and aren’t

Cohen introduced the 0.2 / 0.5 / 0.8 thresholds explicitly as conventions, offered as operating rules for use when no better basis for estimating the effect size is available. They were not derived from any feature of social or organisational reality.

Because h is computed on the arcsine scale, which stabilises the variance of proportions, it is comparable across different baseline rates in a way that raw percentage-point differences are not.

In a regulatory or public-facing context, treating them as fixed cut-offs has two problems:

  1. They lack substantive grounding. A Cohen’s h of 0.20 in one indicator might correspond to an action-relevant change in service quality; in another, it might be operationally trivial. The threshold doesn’t know.
  2. It repeats a familiar pattern. Importing convenient numerical conventions as substantive thresholds - e.g. p < 0.05 as “real”, or z > 2 as “concerning” - has caused us methodological grief elsewhere. Doing the same with effect sizes would be self-inflicted.

The structure of a defensible report on year-on-year change in a large-sample survey is then: the MID carries the weight of the meaningfulness judgement; the effect size characterises magnitude on a comparable scale; the significance test confirms the change isn’t sampling noise. The conventional Cohen thresholds are not needed.

2.4 Where Cohen’s h is genuinely useful

Despite the above, h is still worth having in the toolkit for specific purposes:

  • Standardised comparison across questions with different baselines. A 2 pp change from 0.05 to 0.07 and a 2 pp change from 0.50 to 0.52 are not the same kind of movement. Cohen’s h reflects this; absolute percentage point change does not (see the sketch after this list).
  • Cross-survey comparisons where absolute scales differ.
  • A sanity check on significance results in very large samples - a near-zero h with p < 0.001 is a clear signal that the result is statistically detectable but not substantively interesting.
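
A two-line illustration of the first bullet, using the same hypothetical proportions:

```python
# Same 2 pp change, different baselines: Cohen's h separates them,
# raw percentage-point change does not.
import numpy as np

def cohens_h(p1: float, p2: float) -> float:
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

print(f"0.05 -> 0.07: h = {cohens_h(0.07, 0.05):.3f}")  # ~0.084
print(f"0.50 -> 0.52: h = {cohens_h(0.52, 0.50):.3f}")  # ~0.040
```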

What it should not be used for: as a standalone benchmark of meaningfulness, with a fixed Cohen threshold, in regulatory or public-facing reporting.

2.5 Summary - Part 2

  • Statistical significance in large samples is uninformative on its own.
  • Cohen’s h is a useful descriptive statistic for the magnitude of difference between proportions, particularly across baselines.
  • The 0.2 / 0.5 / 0.8 thresholds are conventions, not substantive benchmarks. Treating them as the latter repeats the failure mode of treating p < 0.05 as “real change”.
  • “Meaningful” requires a substantive anchor: a minimum important difference, action threshold, or regulatory cut-off, agreed in dialogue with the team that will act on the result, before the analysis.
  • A defensible report combines: the absolute change (vs MID), an effect size for magnitude, and a significance test for ruling out sampling noise.

Part 3 - Implications and discussion points

The two cases in this note generalise. Both are instances of the same broader principle: statistical conventions are not substitutes for substantive judgement, and analytical methods that obscure the link between the two should be avoided.

Key themes for discussion:

  • Decision framing: What question are you answering? (signal vs noise, comparison across baselines, communicating to stakeholders). The choice of method should follow from the question, not from convention.
  • Worked examples matter: The CYP series demonstrates why SPC, YoY, and slope estimation are defensible alternatives to % change on rolling statistics.
  • Effect sizes need context: Cohen’s h and similar metrics are useful for characterising magnitude, but “meaningful” requires a substantive anchor (MID, action threshold) agreed with stakeholders before analysis.
  • Explicit warnings: % change of overlapping rolling statistics should be avoided as a routine indicator without caveats.

Suggested next steps:

  1. Discuss these cases with the MH team to confirm the recommendations align with their analytical needs and stakeholder expectations.
  2. Agree substantive anchors (MIDs, action thresholds) for the indicators in scope.
  3. Consider whether similar issues affect other SI indicators beyond CYP waiting times and Staff Survey proportions.

Notes and caveats on this document

  • All data is simulated; values are calibrated to be plausible but should not be read as representing actual NHS indicators.
  • The SPC illustration uses standard XmR conventions (mean ± 2.66 × mean moving range). For proportion-based indicators, P-charts with appropriately calculated limits should be used; for rates with varying denominators, funnel plots may be more appropriate.
  • The Theil–Sen slope is illustrated on the full simulated series for simplicity. In practice, the choice of window matters and should be guided by the question being asked.
  • Cohen’s h assumes independent samples. For repeated-measures or clustered designs (e.g. trust-level Staff Survey returns over time), more careful treatment of dependence is needed; the qualitative argument about meaningful change vs significance carries through.