| | month | monthly_median (weeks) | rolling_3m_median (weeks) | pct_change_rolling (%) | pct_change_monthly (%) |
|---|---|---|---|---|---|
| 40 | 2025-05-01 | 9.40 | 9.40 | -4.84 | -9.23 |
| 41 | 2025-06-01 | 9.70 | 9.70 | 3.14 | 3.14 |
| 42 | 2025-07-01 | 9.81 | 9.70 | 0.00 | 1.17 |
| 43 | 2025-08-01 | 8.64 | 9.70 | 0.00 | -11.92 |
| 44 | 2025-09-01 | 10.48 | 9.81 | 1.17 | 21.29 |
| 45 | 2025-10-01 | 8.96 | 8.96 | -8.64 | -14.49 |
| 46 | 2025-11-01 | 9.78 | 9.78 | 9.08 | 9.08 |
| 47 | 2025-12-01 | 9.75 | 9.75 | -0.23 | -0.23 |
Change over time: caveats and recommendations
Worked examples - CYP waiting times and meaningful change in large-sample surveys
Scope and intent
This note responds to two specific questions raised by Strategic Insight (Mental Health):
Question 1: Is month-on-month percentage change appropriate for CYP waiting times indicators when the published values are monthly medians and 3-month rolling medians? If not, what are defensible alternatives?
Question 2: Is Cohen’s h a useful way to frame meaningful change (as opposed to statistically detectable change) in large-sample proportion comparisons such as the NHS Staff Survey? Should the conventional thresholds (0.2 / 0.5 / 0.8) be used to define “meaningful”?
The document works through both cases with simulated data designed to mimic the structure of the real series. It is not a general guide to time-series analysis or effect sizes - it addresses these specific questions.
Part 1 - CYP waiting times: % change on rolling medians
1.1 The setting
The MH team monitors waiting times to first appointment for CYP services. Each month there is a (large) sample of individual-level wait durations. Two summaries are typically reported:
- the monthly median wait;
- a 3-month rolling median (or rolling mean) for smoothing.
A natural-seeming next step is to compute the month-on-month percentage change of the rolling series and to interpret it as “how much has the wait changed this month”.
The argument below is that this combination is hard to defend, for two distinct reasons that are easy to confuse but worth separating.
1.2 Simulating a plausible series
We simulate 48 months of patient-level waiting times. Each month contains approximately 1,500 individual patient waits, following a right-skewed distribution (log-normal, as is typical for waiting-time data). The underlying monthly median has a mild upward drift and modest noise.
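A minimal sketch of such a simulation is below. The seed, drift, and noise parameters are illustrative, not the exact calibration used for the figures in this note; the median of a log-normal with parameters \((\mu, \sigma)\) is \(e^{\mu}\), so setting \(\mu = \log(\text{target median})\) gives each month the intended median.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # illustrative seed

n_months, n_patients = 48, 1500
months = pd.date_range("2022-01-01", periods=n_months, freq="MS")

# True monthly median (weeks): mild upward drift plus modest noise
true_median = 8.0 + 0.04 * np.arange(n_months) + rng.normal(0, 0.3, n_months)

# Right-skewed patient-level waits: log-normal with median = true_median
sigma = 0.6  # controls the skew; illustrative value
monthly_median = np.array([
    np.median(rng.lognormal(mean=np.log(m), sigma=sigma, size=n_patients))
    for m in true_median
])

series = pd.Series(monthly_median, index=months, name="monthly_median")
rolling_3m = series.rolling(3).median()
```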
1.5 Why the combination is particularly unhelpful
Computing % MoM change on a rolling statistic is essentially trying to extract a high-frequency signal from a series that has been deliberately smoothed to remove it. You typically get the worst of both worlds: enough smoothing to distort timing and magnitude (Effect 2), and enough induced autocorrelation to invalidate naive inference (Effect 1) - but not enough smoothing to provide a clean trend signal.
If smoothing is being used because the monthly series is too noisy, the appropriate response is to interpret change on a longer timescale, not to compute short-horizon derivatives of the smoothed series.
1.6 Recommended alternatives
The right alternative depends on the question being asked.
1.6.1 “Is the underlying trajectory going up or down?” - SPC on the monthly series
Statistical Process Control (SPC) charts are designed to separate signal from noise in time-series data. XmR charts (also called I-MR or individuals charts) plot individual measurements with control limits derived from the moving range between consecutive points. For proportion-based indicators, P-charts with appropriately calculated limits should be used.
SPC is widely adopted across NHS settings - the NHS England guide on Making Data Count provides practical implementation guidance and is the standard reference for SPC use in healthcare performance monitoring.
Construction of the XmR chart: Let \(X_t\) denote the monthly median at time \(t\). The moving range at time \(t\) is \(MR_t = |X_t - X_{t-1}|\). The chart consists of two panels:
- Individuals (X) chart: Plots \(X_t\) with centre line at \(\bar{X}\) (mean of all \(X_t\)) and control limits at \(\bar{X} \pm 2.66 \times \overline{MR}\), where \(\overline{MR}\) is the mean moving range.
- Moving range (MR) chart: Plots \(MR_t\) with centre line at \(\overline{MR}\) and upper control limit at \(3.27 \times \overline{MR}\). The lower control limit is zero.
The constants 2.66 and 3.27 are derived to approximate 3-sigma limits under normality assumptions. The illustration below applies this to the simulated monthly medians.
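In code, the limit calculation is short. A sketch, assuming the `series` of simulated monthly medians from section 1.2:

```python
import numpy as np

x = series.to_numpy()              # monthly medians
mr = np.abs(np.diff(x))            # moving ranges |X_t - X_{t-1}|
x_bar, mr_bar = x.mean(), mr.mean()

# Individuals (X) chart limits
ucl_x = x_bar + 2.66 * mr_bar
lcl_x = x_bar - 2.66 * mr_bar

# Moving range (MR) chart limit (lower limit is zero)
ucl_mr = 3.27 * mr_bar

outside = (x > ucl_x) | (x < lcl_x)   # simplest out-of-control rule
```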
Out-of-control rules (points outside the limits, runs of consecutive points on one side of the mean, trends) provide principled signals for “something has changed” without requiring distributional assumptions to hold across a smoothing window.
A note on XmR charts for summary statistics
The XmR chart above treats each monthly median as an individual observation. Strictly speaking, XmR charts are designed for individual measurements, not summary statistics. Each monthly median is computed from ~1,500 patient waits, so it has a much tighter sampling distribution than a single patient’s wait. This means the control limits shown are likely conservative (wider than they would be if we accounted for the reduced variability from aggregation).
In practice, this is often the only option: we may only have access to monthly summary statistics (medians or means), not patient-level data. The XmR approach is pragmatic and widely used across NHS performance monitoring for exactly this reason.
If patient-level data were available, more sophisticated approaches could be used - such as control charts with limits derived from the sampling distribution of the median, or hierarchical models that account for within-month and between-month variation. But for monitoring based on published summary statistics, the XmR chart remains a defensible and interpretable choice.
1.6.2 “How does this period compare with the same period a year ago?” - same-month YoY
Same-month year-on-year (YoY) comparison removes seasonality and avoids the overlap problem entirely. Using the notation from section 1.4, the YoY change at time \(t\) is simply:
\[ \text{YoY change} = X_t - X_{t-12} \]
where \(X_t\) is the underlying monthly median at time \(t\). The two months being compared (\(t\) and \(t-12\)) share no data - each is computed from a completely independent sample of patients. This is a comparison between two clean snapshots, twelve months apart, with no windowing artefacts.
Note on 3-month rolling YoY: If the published indicator is a 3-month rolling median \(Y_t\) (computed from \(X_{t-2}, X_{t-1}, X_t\)), the same-month YoY comparison \(Y_t - Y_{t-12}\) is also defensible. While the rolling windows at \(t\) and \(t-12\) each contain three months of data, the two windows share no overlapping months, so the comparison remains clean and interpretable.
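Both comparisons are one line each in pandas. A sketch, continuing with `series` from section 1.2:

```python
# Same-month YoY on the monthly median: t and t-12 share no data
yoy_abs = series - series.shift(12)                 # absolute change (weeks)
yoy_pct = 100 * (series / series.shift(12) - 1)     # percentage change

# Rolling YoY: the windows at t and t-12 do not overlap, so also defensible
rolling_3m = series.rolling(3).median()
yoy_rolling = rolling_3m - rolling_3m.shift(12)
```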
This is also a more honest framing for public-facing comms: “the median wait this March is X weeks longer than last March” is interpretable without methodological caveat.
1.6.3 “What’s the underlying trend?” - slope estimation with uncertainty
If a single number is needed for trend over a defined window, a robust slope estimator is more defensible than chained % changes. The Theil-Sen estimator computes the median of all pairwise slopes between points, making it resistant to outliers and violations of normality assumptions.
The slope is a single, well-defined estimand with a clear uncertainty quantification, and Theil-Sen is robust to the kind of single-month anomaly demonstrated above. The confidence interval shown below uses the method described in Sen (1968), implemented in SciPy’s theilslopes function.
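A sketch of the calculation, continuing with the simulated `series` (`alpha=0.95` requests the 95% confidence interval):

```python
import numpy as np
from scipy.stats import theilslopes

t = np.arange(len(series))    # time in months
slope, intercept, lo, hi = theilslopes(series.to_numpy(), t, alpha=0.95)

print(f"Theil–Sen slope: {slope:.3f} weeks/month (95% CI: {lo:.3f}, {hi:.3f})")
print(f"Implied annual change: {12 * slope:+.2f} weeks/year "
      f"(95% CI: {12 * lo:+.2f}, {12 * hi:+.2f})")
```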
Theil–Sen slope: 0.040 weeks/month (95% CI: 0.028, 0.052)
Implied annual change: +0.48 weeks/year (95% CI: +0.34, +0.62)
1.6.4 If percentage change must be reported
For public-facing comms, reporting a percentage change is sometimes unavoidable. The defensible options are:
- Percentage change of the underlying monthly median (not the rolling), with a clear acknowledgement of month-to-month volatility, ideally paired with an SPC view to indicate whether it is a signal or noise.
- Percentage change between non-overlapping periods: e.g. this rolling 12-month vs the prior rolling 12-month. The two windows do not share data, so the change is not a windowing artefact.
- Percentage change on the same month year-on-year.
What to avoid: % change between adjacent values of an overlapping rolling statistic, presented without caveats, as a routine indicator.
1.7 Summary table - Part 1
| Approach | Defensible? | Notes |
|---|---|---|
| % MoM on monthly median | Yes, with caveats | Volatile; pair with SPC for signal vs noise |
| % MoM on 3-month rolling median | No as default | Conflates timing, smears one-off events across window, induces autocorrelation |
| Same-month YoY (absolute or %) | Yes | Removes seasonality and overlap |
| Rolling 12m vs prior rolling 12m | Yes | Non-overlapping windows |
| SPC chart on underlying monthly | Yes | Separates signal from noise; widely understood in NHS |
| Theil–Sen slope with CI | Yes | One-number summary of trend with uncertainty |
Part 2 - Meaningful change vs statistical significance
2.1 The setting
The NHS Staff Survey reaches several hundred thousand respondents. With samples that large, almost any year-on-year movement in a proportion will be “statistically significant” in the conventional sense. The question is whether Cohen’s h can reframe the conversation around magnitude and meaningfulness rather than detectability, with reference to the conventional thresholds (0.2 / 0.5 / 0.8).
The argument here is: yes - partially. Effect sizes are an improvement over significance alone, but the conventional thresholds are conventions, not substantive benchmarks. Treating them as fixed cut-offs for “meaningful change” imports convenient numerical conventions as substantive thresholds without grounding in the specific context.
What “meaningful” requires is a substantive anchor: a minimum important difference, an action threshold, a regulatory cut-off - agreed before the analysis, in dialogue with the team that will act on it. The effect size then provides a vocabulary for assessing against that anchor.
2.2 A worked example: significance, effect size, and meaningful change
Cohen’s h for two proportions \(p_1, p_2\) is:
\[ h = 2 \arcsin\sqrt{p_1} - 2 \arcsin\sqrt{p_2} \]
It is a measure of distance between proportions on a variance-stabilised scale and does not depend on sample size.
Example: Suppose the NHS Staff Survey (n ≈ 600,000 respondents per year) shows that the proportion of staff who “would recommend their organisation as a place to work” changed from 62.0% to 64.5% year-on-year. The team has agreed in advance that a change of 2 percentage points or more is action-relevant (the minimum important difference, MID).
We compute:
- Absolute change: 2.5 percentage points
- Cohen’s h: \(2\arcsin\sqrt{0.645} - 2\arcsin\sqrt{0.620} \approx 0.051\)
- Statistical significance: with n = 600,000 per year, this change is overwhelmingly significant (p < 0.0001)
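These figures can be reproduced directly. A minimal sketch (the z-test below uses the standard pooled two-proportion form; the function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2))

p1, p2, n = 0.645, 0.620, 600_000

h = cohens_h(p1, p2)                        # ~0.051

# Two-proportion z-test with pooled variance, equal n per year
p_pool = (p1 + p2) / 2
se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (p1 - p2) / se                          # far beyond any usual threshold
p_value = 2 * norm.sf(abs(z))

print(f"h = {h:.3f}, z = {z:.1f}, p = {p_value:.1e}")
```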
How should this be interpreted?
Statistical significance: p < 0.0001 tells us the change is not due to sampling noise. With n = 600,000, this is unsurprising - almost any non-zero change would be significant.
Effect size: Cohen’s h ≈ 0.05 is well below the conventional “small” threshold of 0.2. By that standard, this would be dismissed as trivial.
Substantive anchor: The change (2.5 pp) exceeds the agreed MID of 2 pp. The team has determined that a 2 pp shift in this indicator is action-relevant.
The defensible conclusion: This is a statistically significant and meaningful improvement. The MID does the work of defining “meaningful” - not the p-value, not Cohen’s conventional thresholds. The effect size (h = 0.05) characterises the magnitude on a standardised scale, which is useful for cross-question comparisons, but it does not determine whether the change matters.
Contrast: If the same question had changed from 62.0% to 62.3% (0.3 pp), it would still be statistically significant (p < 0.0001) with Cohen’s h ≈ 0.006, but it would be below the MID and therefore not action-relevant, despite being “significant.”
2.3 What the conventional thresholds are - and aren’t
Cohen introduced the 0.2 / 0.5 / 0.8 thresholds explicitly as operating conventions - offered for use when no better basis for estimating the effect size is available. They were not derived from any feature of social or organisational reality.
As defined in section 2.2, h applies the arcsine transformation that stabilises the variance of proportions, which makes it comparable across different baseline rates in a way that raw percentage point differences are not.
In a regulatory or public-facing context, treating them as fixed cut-offs has two problems:
- They lack substantive grounding. A Cohen’s h of 0.20 in one indicator might correspond to an action-relevant change in service quality; in another, it might be operationally trivial. The threshold doesn’t know.
- It repeats a familiar pattern. Importing convenient numerical conventions as substantive thresholds - e.g. p < 0.05 as “real”, or z > 2 as “concerning” - has caused us methodological grief elsewhere. Doing the same with effect sizes would be self-inflicted.

The structure of a defensible report on year-on-year change in a large-sample survey: the MID carries the weight of the meaningfulness judgement; the effect size characterises magnitude on a comparable scale; the significance test confirms the change isn’t sampling noise. The conventional Cohen thresholds are not needed.
2.4 Where Cohen’s h is genuinely useful
Despite the above, h is still worth having in the toolkit for specific purposes:
- Standardised comparison across questions with different baselines. A 2 pp change from 0.05 to 0.07 and a 2 pp change from 0.50 to 0.52 are not the same kind of movement. Cohen’s h reflects this (see the sketch after this list); absolute percentage point change does not.
- Cross-survey comparisons where absolute scales differ.
- A sanity check on significance results in very large samples - a near-zero h with p < 0.001 is a clear signal that the result is statistically detectable but not substantively interesting.
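To illustrate the first bullet numerically, reusing the `cohens_h` sketch from section 2.2:

```python
# Same 2 pp absolute change, different standardised magnitudes
h_low  = cohens_h(0.07, 0.05)   # ~0.085 - change at a 5% baseline
h_high = cohens_h(0.52, 0.50)   # ~0.040 - change at a 50% baseline
# On the h scale, the low-baseline movement is roughly twice as large.
```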
What it should not be used for: as a standalone benchmark of meaningfulness, with a fixed Cohen threshold, in regulatory or public-facing reporting.
2.5 Summary - Part 2
- Statistical significance in large samples is uninformative on its own.
- Cohen’s h is a useful descriptive statistic for the magnitude of difference between proportions, particularly across baselines.
- The 0.2 / 0.5 / 0.8 thresholds are conventions, not substantive benchmarks. Treating them as the latter repeats the failure mode of treating p < 0.05 as “real change”.
- “Meaningful” requires a substantive anchor: a minimum important difference, action threshold, or regulatory cut-off, agreed in dialogue with the team that will act on the result, before the analysis.
- A defensible report combines: the absolute change (vs MID), an effect size for magnitude, and a significance test for ruling out sampling noise.
Notes and caveats on this document
- All data is simulated; values are calibrated to be plausible but should not be read as representing actual NHS indicators.
- The SPC illustration uses standard XmR conventions (mean ± 2.66 × mean moving range). For proportion-based indicators, P-charts with appropriately calculated limits should be used; for rates with varying denominators, funnel plots may be more appropriate.
- The Theil–Sen slope is illustrated on the full simulated series for simplicity. In practice, the choice of window matters and should be guided by the question being asked.
- Cohen’s h assumes independent samples. For repeated-measures or clustered designs (e.g. trust-level Staff Survey returns over time), more careful treatment of dependence is needed; the qualitative argument about meaningful change vs significance carries through.