Module 4 Discussion: Box-Cox Transformation in Time Series

Part 1 — Box-Cox in a time series context

The Box-Cox family is a one-parameter power transformation:

\[ w_t = \begin{cases} \log(y_t) & \lambda = 0 \\ (y_t^{\lambda} - 1)/\lambda & \lambda \neq 0 \end{cases} \]
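The log case is not ad hoc: it is the \(\lambda \to 0\) limit of the power case, which is why the family is written piecewise:

\[ \lim_{\lambda \to 0} \frac{y_t^{\lambda} - 1}{\lambda} = \log(y_t) \]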

The point in time series (as in the FPP2 Australian electricity example) is variance stabilization. When seasonal swings get bigger as the level of the series grows — winter peaks in 1990 that dwarf winter peaks in 1960 in absolute terms — the seasonality is multiplicative, but additive models (tslm with seasonal dummies, additive ETS, plain ARIMA) assume the noise and the seasonal pattern have roughly constant magnitude over time. Box-Cox is the lever that converts the multiplicative structure into something approximately additive so those models behave.

  • \(\lambda = 1\) leaves the series alone.
  • \(\lambda = 0\) is the log transform — appropriate when SD scales linearly with the mean.
  • Intermediate values give intermediate compression.

BoxCox.lambda() uses Guerrero’s method, which picks \(\lambda\) to minimize the coefficient of variation of within-period subseries.
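In the forecast package the whole round trip is three functions. A minimal sketch (AirPassengers stands in for any strictly positive series):

library(forecast)

lambda <- BoxCox.lambda(AirPassengers, method = "guerrero")  # automated selection
w      <- BoxCox(AirPassengers, lambda)      # transformed series, steadier swings
y_back <- InvBoxCox(w, lambda)               # exact inverse of the point transform
all.equal(as.numeric(y_back), as.numeric(AirPassengers))     # TRUE up to rounding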

Advantages

  • Single tunable parameter with automated selection.
  • Stabilizes variance so additive seasonal models become appropriate.
  • Symmetrizes the residual distribution, which improves prediction-interval coverage.
  • Integrates cleanly into forecast/fable — pass lambda= and the back-transform is automatic.

Disadvantages

  • Requires strictly positive data.
  • Back-transformation introduces bias because \(E[g(X)] \neq g(E[X])\); you need biasadj = TRUE to get forecasts of means rather than medians (worked log-scale example after this list).
  • Interpretation degrades sharply — a coefficient on \(y^{-0.3}\) has no clean unit meaning.
  • \(\lambda\) is itself a fitted quantity and its uncertainty isn’t propagated into intervals.
  • Guerrero picks \(\lambda\) to stabilize the variation of within-period subseries, not to minimize forecast error, so the variance-optimal \(\lambda\) isn’t necessarily the forecast-optimal \(\lambda\).
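To make the bias point concrete in the log case: if the forecast on the transformed scale is normal with mean \(\mu\) and variance \(\sigma^2\), the naive back-transform returns the median of the lognormal forecast distribution, while the mean carries a variance correction:

\[ \operatorname{median}(y) = e^{\mu}, \qquad E[y] = e^{\mu + \sigma^2/2} \]

biasadj = TRUE applies the analogous correction (a second-order Taylor approximation) for general \(\lambda\).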

Part 2 — Baseline methods on AirPassengers

Train on Jan 1949 – Dec 1958, hold out Jan 1959 – Dec 1960 (\(h=24\)).

library(forecast)  # tslm, rwf, meanf, snaive, forecast, accuracy, BoxCox.lambda
library(knitr)     # kable

y     <- AirPassengers
train <- window(y, end = c(1958, 12))    # Jan 1949 - Dec 1958
test  <- window(y, start = c(1959, 1))   # Jan 1959 - Dec 1960
h     <- length(test)                    # 24

# Baselines: regression with trend + seasonal dummies, drift, mean, seasonal naive
fit_lm    <- tslm(train ~ trend + season)
fc_lm     <- forecast(fit_lm, h = h)
fc_drift  <- rwf(train,  h = h, drift = TRUE)
fc_mean   <- meanf(train, h = h)
fc_snaive <- snaive(train, h = h)

# Test-set accuracy over the 24-month holdout
cols <- c("ME", "RMSE", "MAE", "MPE", "MAPE")
results_part2 <- rbind(
  `tslm (trend+season)` = accuracy(fc_lm,     test)["Test set", cols],
  Drift                 = accuracy(fc_drift,  test)["Test set", cols],
  Mean                  = accuracy(fc_mean,   test)["Test set", cols],
  sNaive                = accuracy(fc_snaive, test)["Test set", cols]
)
kable(round(results_part2, 2))
                          ME    RMSE     MAE    MPE   MAPE
tslm (trend+season)    26.14   47.94   34.64   4.59   6.88
Drift                  91.62  115.70   91.62  18.41  18.41
Mean                  206.34  219.44  206.34  44.23  44.23
sNaive                 71.25   76.99   71.25  15.52  15.52

tslm wins clearly because the series has both a strong trend and a strong seasonal pattern, and trend + season captures both. Note that every model has positive ME — they all systematically under-predict the holdout, which is the multiplicative-growth fingerprint: the series accelerates faster than any linear-trend extrapolation.
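A quick check of the sign convention, assuming the forecast objects above: accuracy() defines error as actual minus forecast, so positive ME means the forecasts sat below the holdout.

# All four values come out positive, confirming systematic under-prediction
sapply(list(tslm = fc_lm, drift = fc_drift, mean = fc_mean, snaive = fc_snaive),
       function(fc) mean(test - fc$mean))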

Part 3 — Apply a transformation

# Guerrero-selected lambda, estimated on the training data only
lambda <- BoxCox.lambda(train)
lambda
[1] -0.3096628

# Same regression fitted on the Box-Cox scale; biasadj = TRUE back-transforms to means
fit_bc  <- tslm(train ~ trend + season, lambda = lambda)
fc_bc   <- forecast(fit_bc, h = h, biasadj = TRUE)

# Log transform (lambda = 0), back-transformed both ways: medians vs bias-adjusted means
fit_log <- tslm(train ~ trend + season, lambda = 0)
fc_log_med  <- forecast(fit_log, h = h, biasadj = FALSE)
fc_log_madj <- forecast(fit_log, h = h, biasadj = TRUE)

results_part3 <- rbind(
  `tslm (no transform)`             = accuracy(fc_lm,       test)["Test set", cols],
  `tslm + log (median)`             = accuracy(fc_log_med,  test)["Test set", cols],
  `tslm + log (bias-adj)`           = accuracy(fc_log_madj, test)["Test set", cols],
  `tslm + BoxCox Guerrero (bias-adj)` = accuracy(fc_bc,     test)["Test set", cols]
)
kable(round(results_part3, 2))
                                        ME    RMSE     MAE     MPE    MAPE
tslm (no transform)                  26.14   47.94   34.64    4.59    6.88
tslm + log (median)                 -42.97   46.87   42.97   -9.94    9.94
tslm + log (bias-adj)               -43.83   47.66   43.83  -10.13   10.13
tslm + BoxCox Guerrero (bias-adj)   -97.50  104.26   97.50  -21.23   21.23

Diagnostic — residuals vs fitted

# Residuals vs fitted on each model's own scale: raw (left) vs Box-Cox (right)
par(mfrow = c(1, 2))

plot(as.numeric(fitted(fit_lm)), as.numeric(residuals(fit_lm)),
     pch = 16, col = "#00000080",
     main = "Untransformed residuals",
     xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2, col = "red")

plot(as.numeric(fitted(fit_bc)), as.numeric(residuals(fit_bc)),
     pch = 16, col = "#00000080",
     main = "Box-Cox residuals (transformed scale)",
     xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2, col = "red")

Findings

Three things stand out, and they’re not the textbook answer:

  1. Guerrero-optimal \(\lambda \approx -0.31\) more than doubles test RMSE. With \(\lambda\) this negative, the transformed series flattens toward its upper bound \(-1/\lambda \approx 3.23\), so a straight trend line fitted through it over-shoots near the end of the sample; the strongly convex back-transform \((1 + \lambda w)^{1/\lambda}\) then amplifies that over-shoot, and the model systematically over-predicts. ME flips from \(+26\) to roughly \(-98\). The sketch after this list shows the geometry.

  2. The untransformed model wasn’t actually badly specified. Its residuals show some fanning, but trend + season on the original scale was capturing most of the signal and the bias was modest (MPE \(\approx +5\%\)). Transformation had limited upside and meaningful downside.

  3. Log is the safe middle. RMSE is essentially unchanged versus the untransformed model. ME flips negative (now over-predicting by \(\approx 10\%\)), but the residuals-vs-fitted plot is visibly more homoscedastic — so prediction intervals would be more honestly calibrated even though point forecasts aren’t better.
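The sketch promised in finding 1, using forecast's InvBoxCox() with the fitted \(\lambda\): equal steps on the transformed scale back-transform to wildly accelerating steps on the original scale as \(w\) approaches the upper bound \(-1/\lambda \approx 3.23\).

library(forecast)

lambda <- -0.31
w <- seq(2.0, 3.0, by = 0.25)   # equal steps on the transformed scale

# (1 + lambda * w)^(1 / lambda): each step back-transforms to a much larger
# jump than the last, so a linear trend extrapolated on the transformed
# scale explodes once mapped back
round(InvBoxCox(w, lambda), 1)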

Interpretation cost

In every transformed case:

  • Coefficients lose unit meaning: “December adds \(X\) passengers” becomes “December adds \(X\) to \(y^{\lambda}\).” The log case is the partial exception; see the sketch after this list.
  • Back-transformed forecasts are medians unless bias-adjusted to approximate means.
  • \(\lambda\) is another reported parameter whose uncertainty doesn’t enter the intervals.
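On the log exception: log-scale coefficients read as multiplicative effects. A sketch assuming the fit_log object from Part 3, where season12 is the December dummy against a January baseline:

# exp(coefficient) is the multiplicative seasonal effect on the original scale;
# e.g. a value of 1.20 reads as December running about 20% above January
exp(coef(fit_log)["season12"])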

Takeaway

Box-Cox is worth trying whenever the residuals-vs-fitted plot shows a fan, but the right test is held-out forecast accuracy, not just “did Guerrero pick a non-1 \(\lambda\).” On AirPassengers the textbook expectation — Box-Cox helps, log especially — doesn’t survive contact with the holdout RMSE. The honest answer for this series is that the untransformed tslm is hard to beat at point forecasting; the transformation buys you better-calibrated intervals and cleaner residual diagnostics at a real cost in point accuracy and interpretability.