Module 4 Discussion: Box-Cox Transformation in Time Series

Part 1 — Box-Cox in a time series context

The Box-Cox family is a one-parameter power transformation:

\[ w_t = \begin{cases} \log(y_t) & \lambda = 0 \\ (y_t^{\lambda} - 1)/\lambda & \lambda \neq 0 \end{cases} \]

The point in time series (as in the FPP2 Australian electricity example) is variance stabilization. When seasonal swings get bigger as the level of the series grows (winter peaks in 1990 that dwarf winter peaks in 1960 in absolute terms), the seasonality is multiplicative, but additive models (tslm with seasonal dummies, additive ETS, plain ARIMA) assume the seasonal amplitude and the noise variance are roughly constant over time. Box-Cox is the lever that converts the multiplicative structure into something approximately additive so those models behave.
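In symbols, the log case (\(\lambda = 0\)) is the cleanest illustration of that conversion. The components \(T_t\) (trend), \(S_t\) (seasonal) and \(R_t\) (remainder) are generic labels I am introducing here, not notation from the example:

\[ y_t = T_t \times S_t \times R_t \quad \Longrightarrow \quad \log y_t = \log T_t + \log S_t + \log R_t \]

A seasonal factor that scales with the level on the original scale becomes a constant offset on the log scale, which is the structure an additive model expects.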

\(\lambda = 1\) leaves the shape of the series alone (\(w_t = y_t - 1\), a shift only).
\(\lambda = 0\) is the log transform — appropriate when SD scales linearly with the mean.
Intermediate values give intermediate compression.
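A minimal base-R sketch of the formula and the three cases above. The helper names box_cox and inv_box_cox are hypothetical (they are not the forecast package's functions), and the sketch assumes strictly positive input:

```r
# Hypothetical helpers: the textbook Box-Cox transform and its inverse.
# Assumes strictly positive y.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

inv_box_cox <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}

y <- as.numeric(AirPassengers)

w_identity <- box_cox(y, 1)     # y - 1: same shape, shifted down by 1
w_log      <- box_cox(y, 0)     # natural log
w_sqrt     <- box_cox(y, 0.5)   # 2 * (sqrt(y) - 1): milder compression than log

# Round trip back to the original scale
y_back <- inv_box_cox(w_sqrt, 0.5)
```

Plotting w_identity, w_sqrt and w_log against time is a quick way to see how much each value of \(\lambda\) flattens the growth in seasonal amplitude.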

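BoxCox.lambda() defaults to Guerrero’s method: split the series into blocks one seasonal period long, compute each block’s mean \(m_i\) and standard deviation \(s_i\), and pick \(\lambda\) to minimize the coefficient of variation of \(s_i / m_i^{1-\lambda}\) across blocks.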
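The criterion can be sketched in a few lines of base R. This is my reading of the method, written with a hypothetical function name (guerrero_cv) and a search range of \([-1, 2]\) that I am assuming matches the package default; it is not a copy of the forecast package's code, so small numerical differences from BoxCox.lambda() would not be surprising:

```r
# Sketch of the Guerrero criterion: coefficient of variation of
# sd_i / mean_i^(1 - lambda) across blocks one seasonal period long.
guerrero_cv <- function(lambda, x) {
  period <- frequency(x)            # read before dropping the ts attributes
  vals   <- as.numeric(x)
  n      <- length(vals)
  nblk   <- floor(n / period)
  # Keep the most recent nblk full periods, one column per block
  blocks <- matrix(vals[(n - nblk * period + 1):n], nrow = period, ncol = nblk)
  m   <- colMeans(blocks)
  s   <- apply(blocks, 2, sd)
  rat <- s / m^(1 - lambda)
  sd(rat) / mean(rat)
}

train <- window(AirPassengers, end = c(1958, 12))

# Assumed search interval; one-dimensional minimization over lambda
opt        <- optimize(guerrero_cv, interval = c(-1, 2), x = train)
lambda_hat <- opt$minimum
lambda_hat
```

Note that nothing in this objective looks at forecast error: it only asks how stable the scaled spread is from one year to the next.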

Advantages

Single tunable parameter with automated selection.
Stabilizes variance so additive seasonal models become appropriate.
Often makes the residual distribution more symmetric, which tends to improve prediction-interval coverage.
Integrates cleanly into forecast (pass lambda= and the back-transform is automatic) and fable (wrap the response in box_cox() inside the model formula).

Disadvantages

Requires strictly positive data.
Back-transformation introduces bias because \(E[g(X)] \neq g(E[X])\); set biasadj = TRUE to get forecasts of means rather than medians.
Interpretation degrades sharply — a coefficient on \(y^{-0.3}\) has no clean unit meaning.
\(\lambda\) is itself a fitted quantity and its uncertainty isn’t propagated into intervals.
Guerrero picks \(\lambda\) to make the spread as stable as possible across subseries, not to minimize forecast error, so the variance-optimal \(\lambda\) isn’t necessarily the forecast-optimal \(\lambda\).
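To make the median-versus-mean point concrete, here is a small sketch of the approximate bias adjustment as given in FPP2 (a second-order correction using the forecast variance on the transformed scale). The helper name back_transform and the numbers 6.0 and 0.3 are made up for illustration:

```r
# Hypothetical helper. w is a point forecast on the transformed scale;
# sigma2 is the forecast variance on the transformed scale.
back_transform <- function(w, lambda, sigma2 = NULL) {
  med <- if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
  if (is.null(sigma2)) return(med)            # plain inverse: a median
  adj <- if (lambda == 0) {
    1 + sigma2 / 2
  } else {
    1 + sigma2 * (1 - lambda) / (2 * (lambda * w + 1)^2)
  }
  med * adj                                   # approximate mean
}

# Illustrative values only: a log-scale forecast of 6.0 with sd 0.3
w <- 6.0
s <- 0.3

median_fc <- back_transform(w, lambda = 0)
mean_fc   <- back_transform(w, lambda = 0, sigma2 = s^2)

# Exact mean on the original scale if the log-scale forecast is Gaussian
exact_lognormal_mean <- exp(w + s^2 / 2)

c(median = median_fc, approx_mean = mean_fc, exact_mean = exact_lognormal_mean)
```

The adjusted forecast is always above the plain back-transform when \(\lambda < 1\), which is why switching biasadj on moves every point forecast upward.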

Part 2 — Baseline methods on AirPassengers

Train on Jan 1949 – Dec 1958, hold out Jan 1959 – Dec 1960 (\(h=24\)).

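library(forecast)  # tslm, rwf, meanf, snaive, forecast, accuracy, BoxCox.lambda
library(knitr)     # kable

y     <- AirPassengers
train <- window(y, end = c(1958, 12))    # Jan 1949 - Dec 1958
test  <- window(y, start = c(1959, 1))   # Jan 1959 - Dec 1960
h     <- length(test)                    # 24 months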

fit_lm    <- tslm(train ~ trend + season)
fc_lm     <- forecast(fit_lm, h = h)
fc_drift  <- rwf(train,  h = h, drift = TRUE)
fc_mean   <- meanf(train, h = h)
fc_snaive <- snaive(train, h = h)

cols <- c("ME", "RMSE", "MAE", "MPE", "MAPE")
results_part2 <- rbind(
  `tslm (trend+season)` = accuracy(fc_lm,     test)["Test set", cols],
  Drift                 = accuracy(fc_drift,  test)["Test set", cols],
  Mean                  = accuracy(fc_mean,   test)["Test set", cols],
  sNaive                = accuracy(fc_snaive, test)["Test set", cols]
)
kable(round(results_part2, 2))

Model	ME	RMSE	MAE	MPE	MAPE
tslm (trend+season)	26.14	47.94	34.64	4.59	6.88
Drift	91.62	115.70	91.62	18.41	18.41
Mean	206.34	219.44	206.34	44.23	44.23
sNaive	71.25	76.99	71.25	15.52	15.52

tslm wins clearly because the series has both a strong trend and a strong seasonal pattern, and trend + season captures both. Note that every model has positive ME (actual minus forecast), so they all systematically under-predict the holdout. That is the multiplicative-growth fingerprint: the mean and seasonal naive forecasts carry no trend at all, drift and tslm extrapolate a straight line, and the series grows faster than linearly.
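The three simple benchmarks are easy to reproduce by hand in base R, which makes it clear what each one extrapolates. The *_hand names are my own; the formulas are the standard definitions of the mean, drift and seasonal naive forecasts:

```r
y     <- AirPassengers
train <- window(y, end = c(1958, 12))
test  <- window(y, start = c(1959, 1))

tr <- as.numeric(train)
te <- as.numeric(test)
n  <- length(tr)
h  <- length(te)
m  <- frequency(train)   # 12 for monthly data

# Mean: every future value is the training mean
fc_mean_hand   <- rep(mean(tr), h)
# Drift: last value plus h times the average historical change
fc_drift_hand  <- tr[n] + seq_len(h) * (tr[n] - tr[1]) / (n - 1)
# Seasonal naive: repeat the last observed year
fc_snaive_hand <- rep(tr[(n - m + 1):n], length.out = h)

me <- function(fc) mean(te - fc)   # mean error = actual minus forecast
round(c(Drift  = me(fc_drift_hand),
        Mean   = me(fc_mean_hand),
        sNaive = me(fc_snaive_hand)), 2)
```

These should line up with the ME column for Drift, Mean and sNaive in the table above.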

Part 3 — Apply a transformation

lambda <- BoxCox.lambda(train)
lambda

[1] -0.3096628

fit_bc  <- tslm(train ~ trend + season, lambda = lambda)
fc_bc   <- forecast(fit_bc, h = h, biasadj = TRUE)

fit_log <- tslm(train ~ trend + season, lambda = 0)
fc_log_med  <- forecast(fit_log, h = h, biasadj = FALSE)
fc_log_madj <- forecast(fit_log, h = h, biasadj = TRUE)

results_part3 <- rbind(
  `tslm (no transform)`             = accuracy(fc_lm,       test)["Test set", cols],
  `tslm + log (median)`             = accuracy(fc_log_med,  test)["Test set", cols],
  `tslm + log (bias-adj)`           = accuracy(fc_log_madj, test)["Test set", cols],
  `tslm + BoxCox Guerrero (bias-adj)` = accuracy(fc_bc,     test)["Test set", cols]
)
kable(round(results_part3, 2))

Model	ME	RMSE	MAE	MPE	MAPE
tslm (no transform)	26.14	47.94	34.64	4.59	6.88
tslm + log (median)	-42.97	46.87	42.97	-9.94	9.94
tslm + log (bias-adj)	-43.83	47.66	43.83	-10.13	10.13
tslm + BoxCox Guerrero (bias-adj)	-97.50	104.26	97.50	-21.23	21.23
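For readers who want to see what the lambda = 0 fit is doing, here is a base-R sketch that fits the same kind of regression directly with lm() on the log scale and back-transforms with exp(). I am assuming tslm's trend is a simple 1, 2, 3, ... counter and season is a 12-level month factor; under that assumption the result should correspond to the "tslm + log (median)" row, but I have not checked it digit for digit:

```r
y     <- AirPassengers
train <- window(y, end = c(1958, 12))
test  <- window(y, start = c(1959, 1))
n     <- length(train)
h     <- length(test)

# Regressors for the training period (assumed equivalent to trend + season)
train_df <- data.frame(
  w      = log(as.numeric(train)),
  trend  = seq_len(n),
  season = factor(as.integer(cycle(train)), levels = 1:12)
)
# Regressors for the 24 holdout months
future_df <- data.frame(
  trend  = n + seq_len(h),
  season = factor(as.integer(cycle(test)), levels = 1:12)
)

fit_log_lm <- lm(w ~ trend + season, data = train_df)
w_hat      <- predict(fit_log_lm, newdata = future_df)

# Plain back-transform: a median forecast on the original scale
fc_median <- exp(w_hat)

err <- as.numeric(test) - fc_median
round(c(ME = mean(err), RMSE = sqrt(mean(err^2))), 2)
```

A straight line on the log scale is exponential growth on the original scale, so any overshoot in the fitted growth rate compounds over the 24-month horizon.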

Diagnostic — residuals vs fitted

par(mfrow = c(1, 2))

plot(as.numeric(fitted(fit_lm)), as.numeric(residuals(fit_lm)),
     pch = 16, col = "#00000080",
     main = "Untransformed residuals",
     xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2, col = "red")

plot(as.numeric(fitted(fit_bc)), as.numeric(residuals(fit_bc)),
     pch = 16, col = "#00000080",
     main = "Box-Cox residuals (transformed scale)",
     xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2, col = "red")
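As a rough numeric companion to the plots, one option is the rank correlation between the absolute residual and the fitted value. This is a crude check of my own choosing (the name fan_score is hypothetical), it only picks up spread that changes monotonically with the level, and it is run here on lm() fits for the untransformed and log cases rather than on the tslm objects above:

```r
train <- window(AirPassengers, end = c(1958, 12))
d <- data.frame(
  y      = as.numeric(train),
  trend  = seq_along(train),
  season = factor(as.integer(cycle(train)))
)

fit_raw <- lm(y ~ trend + season, data = d)
fit_ln  <- lm(log(y) ~ trend + season, data = d)

# Spearman correlation between |residual| and fitted value.
# Values near 0 suggest no monotone change in spread with the level;
# a bow-tie pattern (wide at both ends) would not show up here.
fan_score <- function(fit) {
  cor(abs(residuals(fit)), fitted(fit), method = "spearman")
}

scores <- c(raw = fan_score(fit_raw), log = fan_score(fit_ln))
round(scores, 3)
```

It should be read alongside the plots, not instead of them.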

Findings

Three things stand out, and they’re not the textbook answer:

Guerrero-optimal \(\lambda \approx -0.31\) more than doubles test RMSE (104.26 vs 47.94). With \(\lambda\) negative, the transformed series \((y_t^{\lambda} - 1)/\lambda\) is bounded above by \(-1/\lambda \approx 3.23\), so a straight-line trend fitted on that scale back-transforms to growth that is faster than exponential. Extrapolated 24 months ahead, that overstates how fast the original series grows, so the model now systematically over-predicts. ME flips from \(+26\) to roughly \(-98\).
The untransformed model wasn’t actually badly specified. Its residuals show some fanning, but trend + season on the original scale was capturing most of the signal and the bias was modest (MPE \(\approx +5\%\)). Transformation had limited upside and meaningful downside.
Log is the safe middle. RMSE is essentially unchanged versus the untransformed model (46.87 vs 47.94), although MAE and MAPE are worse (42.97 vs 34.64, and 9.94 vs 6.88). ME flips negative (now over-predicting by \(\approx 10\%\)), but a residuals-vs-fitted plot of fit_log (the panels above show fit_lm and fit_bc) is visibly more homoscedastic, so prediction intervals would likely be better calibrated even though point forecasts aren’t better.
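One way to see what a straight-line trend implies once it is back-transformed with a negative \(\lambda\). The intercept \(a\) and slope \(b > 0\) below are generic symbols for a fitted linear trend on the transformed scale, not values from the fit:

\[ w_t = \frac{y_t^{\lambda} - 1}{\lambda} < -\frac{1}{\lambda} \quad (\lambda < 0,\ y_t > 0), \qquad w_t = a + bt \ \Longrightarrow\ y_t = \bigl(1 + \lambda (a + bt)\bigr)^{1/\lambda} \]

Because \(1 + \lambda(a + bt)\) shrinks toward zero as \(t\) grows and the exponent \(1/\lambda\) is negative, \(y_t\) diverges at the finite time \(t^{*} = (-1/\lambda - a)/b\). For comparison, \(\lambda = 0\) gives \(y_t = e^{a + bt}\), ordinary exponential growth, and \(\lambda = 1\) gives a straight line.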

Interpretation cost

In every transformed case:

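Coefficients lose unit meaning: “December adds \(X\) passengers” becomes “December adds \(X\) to \((y^{\lambda} - 1)/\lambda\).” The log case (\(\lambda = 0\)) is the partial exception, since its coefficients read as approximate percentage effects.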
Back-transformed forecasts are medians unless bias-adjusted to approximate means.
\(\lambda\) is another reported parameter whose uncertainty doesn’t enter the intervals.
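For the log case specifically, the coefficients can be turned back into something readable. A minimal sketch, again using a plain lm() on the log scale with regressors I am assuming are equivalent to trend + season; each seasonal coefficient \(\beta\) becomes a percentage difference from January via \(100\,(e^{\beta} - 1)\):

```r
train <- window(AirPassengers, end = c(1958, 12))
d <- data.frame(
  y      = as.numeric(train),
  trend  = seq_along(train),
  season = factor(month.abb[as.integer(cycle(train))], levels = month.abb)
)

fit_ln <- lm(log(y) ~ trend + season, data = d)
b      <- coef(fit_ln)

# Percentage difference of each month relative to January (the baseline level)
season_pct <- 100 * (exp(b[grepl("^season", names(b))]) - 1)
# Percentage growth per month implied by the trend coefficient
trend_pct  <- 100 * (exp(b[["trend"]]) - 1)

round(season_pct, 1)
round(trend_pct, 2)
```

No comparably simple reading exists for a general \(\lambda\) such as \(-0.31\), which is the interpretability cost described above.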

Takeaway

Box-Cox is worth trying whenever the residuals-vs-fitted plot shows a fan, but the right test is held-out forecast accuracy, not just “did Guerrero pick a non-1 \(\lambda\).” On AirPassengers, with this train/test split, the textbook expectation (Box-Cox helps, log especially) doesn’t survive contact with the holdout: log only ties on RMSE and is worse on MAE and MAPE, and the Guerrero \(\lambda\) is far worse on every measure. The honest answer for this series is that the untransformed tslm is hard to beat at point forecasting; the transformation buys cleaner residual diagnostics and probably better-calibrated intervals (interval coverage was not measured here), at a real cost in point accuracy and interpretability.