The point in time series (as in the FPP2 Australian electricity example) is variance stabilization. When seasonal swings get bigger as the level of the series grows — winter peaks in 1990 that dwarf winter peaks in 1960 in absolute terms — the seasonality is multiplicative, but additive models (tslm with seasonal dummies, additive ETS, plain ARIMA) assume the noise and seasonal structure are roughly constant in width. Box-Cox is the lever that converts the multiplicative structure into something approximately additive so those models behave.
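For reference, the one-parameter Box-Cox family can be written as below. This is the form given in FPP2; the symbol \(w_t\) for the transformed series is notation adopted here for illustration.

```latex
\[
w_t =
\begin{cases}
\log y_t, & \lambda = 0, \\[4pt]
\dfrac{y_t^{\lambda} - 1}{\lambda}, & \lambda \neq 0,
\end{cases}
\qquad y_t > 0 .
\]
```

At \(\lambda = 1\) this reduces to \(y_t - 1\), a pure shift that leaves the shape of the series unchanged.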
\(\lambda = 1\) leaves the series alone.
\(\lambda = 0\) is the log transform — appropriate when SD scales linearly with the mean.
Intermediate values give intermediate compression.
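BoxCox.lambda() uses Guerrero’s method by default. It splits the series into within-period subseries and picks \(\lambda\) to minimize the coefficient of variation of \(s_i / \bar{y}_i^{\,1-\lambda}\), where \(\bar{y}_i\) and \(s_i\) are the mean and standard deviation of subseries \(i\).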
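Guerrero's criterion can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the forecast package's source: the function name guerrero_cv is hypothetical, the search interval \([-1, 2]\) mirrors the documented defaults of BoxCox.lambda(), and the training window matches the one used in Part 2.

```r
# Coefficient of variation of sd / mean^(1 - lambda) across seasonal blocks.
guerrero_cv <- function(lambda, x) {
  period   <- frequency(x)                 # 12 for monthly data
  vals     <- as.numeric(x)
  n_blocks <- floor(length(vals) / period)
  # Keep the most recent complete blocks, one column per block (year).
  vals   <- tail(vals, n_blocks * period)
  blocks <- matrix(vals, nrow = period, ncol = n_blocks)
  block_mean <- colMeans(blocks)
  block_sd   <- apply(blocks, 2, sd)
  ratio <- block_sd / block_mean^(1 - lambda)
  sd(ratio) / mean(ratio)
}

train <- window(AirPassengers, end = c(1958, 12))

# One-dimensional search for the lambda that makes the ratio most stable.
lam_hat <- optimize(guerrero_cv, interval = c(-1, 2), x = train)$minimum
lam_hat
```

If the forecast package is installed, BoxCox.lambda(train) is the natural cross-check; the hand-rolled value should land close to it but is not guaranteed to agree to the last digit.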
Advantages
Single tunable parameter with automated selection.
Stabilizes variance so additive seasonal models become appropriate.
Symmetrizes the residual distribution, which improves prediction-interval coverage.
Integrates cleanly into forecast/fable — pass lambda= and the back-transform is automatic.
Disadvantages
Requires strictly positive data.
Back-transformation introduces bias because \(E[g(X)] \neq g(E[X])\); set biasadj = TRUE to get forecasts of means rather than medians.
Interpretation degrades sharply — a coefficient on \(y^{-0.3}\) has no clean unit meaning.
\(\lambda\) is itself a fitted quantity and its uncertainty isn’t propagated into intervals.
Guerrero picks \(\lambda\) to minimize variance-of-variance, not forecast error, so the variance-optimal \(\lambda\) isn’t necessarily the forecast-optimal \(\lambda\).
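The back-transformation bias is easiest to see in the log case (\(\lambda = 0\)). The simulation below is a minimal sketch with made-up parameters (mean 5 and SD 0.5 on the log scale); it is not output from the AirPassengers models.

```r
set.seed(1)

# Pretend forecast distribution on the log scale (illustrative parameters).
z <- rnorm(1e5, mean = 5, sd = 0.5)
y <- exp(z)                              # the same draws on the original scale

naive_back <- exp(mean(z))               # back-transformed mean: tracks the median of y
adjusted   <- exp(mean(z) + var(z) / 2)  # lognormal mean correction: tracks the mean of y

round(c(median_y = median(y), naive = naive_back,
        mean_y = mean(y), adjusted = adjusted), 2)
```

The naive back-transform sits near the median and below the mean. biasadj = TRUE in forecast targets the same gap for general \(\lambda\), using the package's own approximation rather than this exact lognormal formula.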
Part 2 — Baseline methods on AirPassengers
Train on Jan 1949 – Dec 1958, hold out Jan 1959 – Dec 1960 (\(h=24\)).
library(forecast)  # tslm(), forecast(), rwf(), meanf(), snaive(), accuracy()
library(knitr)     # kable()

# Train/test split
y <- AirPassengers
train <- window(y, end = c(1958, 12))
test <- window(y, start = c(1959, 1))
h <- length(test)

# Baseline models
fit_lm <- tslm(train ~ trend + season)
fc_lm <- forecast(fit_lm, h = h)
fc_drift <- rwf(train, h = h, drift = TRUE)
fc_mean <- meanf(train, h = h)
fc_snaive <- snaive(train, h = h)

# Test-set accuracy, one row per model
cols <- c("ME", "RMSE", "MAE", "MPE", "MAPE")
results_part2 <- rbind(
  `tslm (trend+season)` = accuracy(fc_lm, test)["Test set", cols],
  Drift = accuracy(fc_drift, test)["Test set", cols],
  Mean = accuracy(fc_mean, test)["Test set", cols],
  sNaive = accuracy(fc_snaive, test)["Test set", cols]
)
kable(round(results_part2, 2))
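| Model               |     ME |   RMSE |    MAE |   MPE |  MAPE |
|:--------------------|-------:|-------:|-------:|------:|------:|
| tslm (trend+season) |  26.14 |  47.94 |  34.64 |  4.59 |  6.88 |
| Drift               |  91.62 | 115.70 |  91.62 | 18.41 | 18.41 |
| Mean                | 206.34 | 219.44 | 206.34 | 44.23 | 44.23 |
| sNaive              |  71.25 |  76.99 |  71.25 | 15.52 | 15.52 |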
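The sign convention matters when reading these columns: errors are actual minus forecast, so a positive ME means the forecasts ran too low. A base R sketch of the five measures, applied to a seasonal naive forecast built by hand (the helper name point_metrics is hypothetical):

```r
train <- window(AirPassengers, end = c(1958, 12))
test  <- window(AirPassengers, start = c(1959, 1))

# Seasonal naive by hand: repeat the last observed year across the horizon.
fc <- rep(tail(as.numeric(train), 12), length.out = length(test))

point_metrics <- function(actual, forecast) {
  e <- actual - forecast   # positive error = forecast too low
  p <- 100 * e / actual    # percentage error
  c(ME = mean(e), RMSE = sqrt(mean(e^2)), MAE = mean(abs(e)),
    MPE = mean(p), MAPE = mean(abs(p)))
}

m <- point_metrics(as.numeric(test), fc)
round(m, 2)
```

ME and RMSE come out at 71.25 and 76.99, matching the sNaive row. ME equals MAE because every holdout month exceeds the same month of 1958.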
tslm wins clearly because the series has both a strong trend and a strong seasonal pattern, and trend + season captures both. Note that every model has positive ME — they all systematically under-predict the holdout, which is the multiplicative-growth fingerprint: the series accelerates faster than any linear-trend extrapolation.
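The residual plots and findings that follow compare this untransformed fit with Box-Cox and log fits. A minimal sketch of how such fits could be produced, assuming the same split and the lambda argument of tslm(). Only fit_bc appears in the plotting code; lambda_g, fit_log, fc_bc and fc_log are hypothetical names introduced here.

```r
library(forecast)

train <- window(AirPassengers, end = c(1958, 12))
test  <- window(AirPassengers, start = c(1959, 1))
h     <- length(test)

# Guerrero-selected lambda on the training window only.
lambda_g <- BoxCox.lambda(train, method = "guerrero")

# Same trend + season specification, fitted on the transformed scale;
# forecast() back-transforms to the original scale automatically.
fit_bc  <- tslm(train ~ trend + season, lambda = lambda_g)
fc_bc   <- forecast(fit_bc, h = h)

fit_log <- tslm(train ~ trend + season, lambda = 0)
fc_log  <- forecast(fit_log, h = h)

cols <- c("ME", "RMSE", "MAE", "MPE", "MAPE")
round(rbind(
  `Box-Cox (Guerrero)` = accuracy(fc_bc, test)["Test set", cols],
  Log                  = accuracy(fc_log, test)["Test set", cols]
), 2)
```

Selecting \(\lambda\) from train rather than the full series keeps the holdout untouched, which is what makes the test-set comparison fair.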
par(mfrow = c(1, 2))

# Left panel: residuals of the untransformed fit
plot(as.numeric(fitted(fit_lm)), as.numeric(residuals(fit_lm)),
     pch = 16, col = "#00000080",
     main = "Untransformed residuals",
     xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2, col = "red")

# Right panel: residuals of the Box-Cox fit, on the transformed scale
plot(as.numeric(fitted(fit_bc)), as.numeric(residuals(fit_bc)),
     pch = 16, col = "#00000080",
     main = "Box-Cox residuals (transformed scale)",
     xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2, col = "red")
Findings
Three things stand out, and they’re not the textbook answer:
Guerrero-optimal \(\lambda \approx -0.31\) more than doubles test RMSE. With \(\lambda\) negative, the model fits a linear trend to a series that is linear in \(-y^{-0.31}\) and bounded above. Extrapolating that line and back-transforming implies faster-than-exponential growth on the original scale, which overstates how fast the series actually grows, so the model now systematically over-predicts. ME flips from \(+26\) to roughly \(-98\).
The untransformed model wasn’t actually badly specified. Its residuals show some fanning, but trend + season on the original scale was capturing most of the signal and the bias was modest (MPE \(\approx +5\%\)). Transformation had limited upside and meaningful downside.
Log is the safe middle. RMSE is essentially unchanged versus the untransformed model. ME flips negative (now over-predicting by \(\approx 10\%\)), but the residuals-vs-fitted plot is visibly more homoscedastic — so prediction intervals would be more honestly calibrated even though point forecasts aren’t better.
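To see how a straight trend line on the transformed scale behaves after back-transformation, here is a small base R sketch. The helper names and the anchor values 100 and 400 are illustrative assumptions, not numbers from the analysis.

```r
boxcox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
inv_boxcox <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}

# Draw a line through y = 100 and y = 400 on the transformed scale,
# extend it by one more equal step, and map the endpoint back.
extrapolate <- function(lambda) {
  w0 <- boxcox(100, lambda)
  w1 <- boxcox(400, lambda)
  inv_boxcox(w1 + (w1 - w0), lambda)
}

sapply(c(`lambda = 1` = 1, `lambda = 0` = 0, `lambda = -0.31` = -0.31),
       extrapolate)
```

The same linear step gives 700 at \(\lambda = 1\) and 1600 at \(\lambda = 0\), and a much larger value at \(\lambda = -0.31\): the more negative the \(\lambda\), the more steeply a linear trend on the transformed scale curves upward once back-transformed.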
Interpretation cost
In every transformed case:
Coefficients lose unit meaning: “December adds \(X\) passengers” becomes “December adds \(X\) to \(y^{\lambda}\).”
Back-transformed forecasts are medians unless bias-adjusted to approximate means.
\(\lambda\) is another reported parameter whose uncertainty doesn’t enter the intervals.
Takeaway
Box-Cox is worth trying whenever the residuals-vs-fitted plot shows a fan, but the right test is held-out forecast accuracy, not just “did Guerrero pick a non-1 \(\lambda\).” On AirPassengers the textbook expectation — Box-Cox helps, log especially — doesn’t survive contact with the holdout RMSE. The honest answer for this series is that the untransformed tslm is hard to beat at point forecasting; the transformation buys you better-calibrated intervals and cleaner residual diagnostics at a real cost in point accuracy and interpretability.