Module 2 — Accuracy Metrics & Decomposition

Author

Adrian Aziza

1. Setup

library(fpp3)        # tsibble + fable + feasts
library(fredr)       # FRED API
library(dplyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(readr)

# Get a free FRED API key at https://fred.stlouisfed.org/docs/api/api_key.html
# Store it in your .Renviron as FRED_API_KEY=xxxxxx  (recommended)
fredr_set_key(Sys.getenv("FRED_API_KEY"))
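
# Optional guard (my addition): fail fast with a clear message if the key
# is missing, rather than erroring inside the first fredr() call.
if (!nzchar(Sys.getenv("FRED_API_KEY"))) {
  stop("FRED_API_KEY is not set; add it to ~/.Renviron and restart R.")
}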

2. Data — FRED series RSXFS

I’m using U.S. Advance Retail Sales: Retail Trade (FRED ID RSXFS), monthly. The series combines a strong trend with a clear seasonal pattern, which makes the four benchmark models behave visibly differently and gives the additive-vs-multiplicative decomposition something to argue about.

raw <- fredr(
  series_id         = "RSXFS",
  observation_start = as.Date("2000-01-01"),
  observation_end   = as.Date("2024-12-01")
)

retail <- raw |>
  mutate(Month = yearmonth(date)) |>
  select(Month, Sales = value) |>
  as_tsibble(index = Month)

retail |>
  autoplot(Sales) +
  labs(title = "U.S. Advance Retail Sales (RSXFS), 2000–2024",
       y = "Millions of USD", x = NULL)

3. Train / Test split

I hold out the last 24 months as the test set.

train <- retail |> filter(year(Month) <= 2022)
test  <- retail |> filter(year(Month) >= 2023)
h     <- nrow(test)
h
[1] 24

4. Fit benchmark models

Four standard benchmarks: Mean (the average of the training data), Naive (the last observation carried forward), SNaive (the same month one year earlier), and Drift (the last observation plus the average historical change).

fits <- train |>
  model(
    Mean   = MEAN(Sales),
    Naive  = NAIVE(Sales),
    SNaive = SNAIVE(Sales ~ lag("year")),
    Drift  = RW(Sales ~ drift())
  )

fc <- fits |> forecast(h = h)

5. Forecast plot — point estimates with 80% / 95% prediction intervals

fc |>
  autoplot(retail |> filter(year(Month) >= 2018), level = c(80, 95)) +
  facet_wrap(~ .model, scales = "free_y", ncol = 2) +
  labs(title = "Forecasts vs. actuals — RSXFS, 24-month horizon",
       y = "Sales (millions $)", x = NULL) +
  guides(colour = "none", fill = guide_legend(title = "Level"))
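
The plot shows the intervals graphically; to pull the bounds out as numbers, hilo() unpacks the forecast distributions. A quick sketch (SNaive chosen arbitrarily):

fc |>
  filter(.model == "SNaive") |>
  hilo(level = c(80, 95)) |>
  select(Month, .mean, `80%`, `95%`) |>
  head()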

6. Accuracy metrics on the test set

acc <- fc |>
  accuracy(retail) |>
  select(.model, ME, RMSE, MAE, MPE, MAPE) |>
  arrange(RMSE)

acc |>
  kbl(digits = 2,
      caption = "Test-set accuracy metrics — 2023–2024 holdout",
      col.names = c("Model", "ME", "RMSE", "MAE", "MPE (%)", "MAPE (%)")) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, position = "left") |>
  row_spec(0, bold = TRUE, background = "#f5f5f5")

Test-set accuracy metrics — 2023–2024 holdout

Model        ME          RMSE        MAE         MPE (%)   MAPE (%)
Drift         12712.56    13600.89    12712.56      2.12       2.12
SNaive        20632.29    23717.06    20632.29      3.43       3.43
Naive         27784.29    29609.61    27784.29      4.63       4.63
Mean         239277.45   239496.26   239277.45     40.07      40.07

How the five metrics differ

Let \(e_t = y_t - \hat y_t\) be the forecast error on observation \(t\), with \(n\) test points.

  • ME (Mean Error) \(= \frac{1}{n}\sum e_t\). Scale-dependent. Measures bias only — positive and negative errors cancel, so a model can have ME ≈ 0 and still be terrible. Useful for spotting systematic over- or under-forecasting.
  • RMSE (Root Mean Squared Error) \(= \sqrt{\frac{1}{n}\sum e_t^2}\). Scale-dependent. Penalizes large errors quadratically, so it is sensitive to outliers. Optimal point forecast under squared-error loss is the conditional mean.
  • MAE (Mean Absolute Error) \(= \frac{1}{n}\sum |e_t|\). Scale-dependent. Linear penalty, robust to outliers. Optimal point forecast under absolute-error loss is the conditional median.
  • MPE (Mean Percentage Error) \(= \frac{100}{n}\sum \frac{e_t}{y_t}\). Scale-free. Like ME but in percent — also a bias measure, signs cancel.
  • MAPE (Mean Absolute Percentage Error) \(= \frac{100}{n}\sum \left|\frac{e_t}{y_t}\right|\). Scale-free, so you can compare across series with different units. Two known weaknesses: undefined when \(y_t = 0\), and asymmetric — over-forecasts can exceed 100% but under-forecasts are capped at 100%.

Pairing the bias metric (ME, MPE) with the matching magnitude metric (MAE, MAPE) is the standard read: if MAE is large but ME is near zero, errors are big but unbiased; if both are large and same sign, the model is systematically off.
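
Before opening Excel, the five formulas can be checked directly in R against accuracy(). A quick sketch reusing fc and test from above (SNaive chosen; any model works the same way):

snaive_err <- fc |>
  filter(.model == "SNaive") |>
  as_tibble() |>
  select(Month, Forecast = .mean) |>
  left_join(as_tibble(test), by = "Month") |>
  mutate(e = Sales - Forecast)   # e_t = y_t - yhat_t

snaive_err |>
  summarise(ME   = mean(e),
            RMSE = sqrt(mean(e^2)),
            MAE  = mean(abs(e)),
            MPE  = mean(100 * e / Sales),
            MAPE = mean(abs(100 * e / Sales)))

These should reproduce the SNaive row of the table above to within rounding.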

7. Excel cross-check

Export the SNaive forecast plus the held-out actuals so we can recompute the five metrics by hand in Excel:

verify <- fc |>
  filter(.model == "SNaive") |>
  as_tibble() |>
  select(Month, Forecast = .mean) |>
  left_join(as_tibble(test), by = "Month") |>
  rename(Actual = Sales) |>
  mutate(Error       = Actual - Forecast,
         AbsError    = abs(Error),
         SqError     = Error^2,
         PctError    = 100 * Error / Actual,
         AbsPctError = abs(PctError))

write_csv(verify, "snaive_accuracy_check.csv")

verify |>
  kbl(digits = 3, caption = "SNaive errors — exported to snaive_accuracy_check.csv") |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)

SNaive errors — exported to snaive_accuracy_check.csv

Month       Forecast    Actual   Error   AbsError       SqError   PctError   AbsPctError
2023 Jan      560257    591964   31707      31707    1005333849      5.356         5.356
2023 Feb      563777    586165   22388      22388     501222544      3.819         3.819
2023 Mar      574485    578812    4327       4327      18722929      0.748         0.748
2023 Apr      580943    583874    2931       2931       8590761      0.502         0.502
2023 May      579844    586286    6442       6442      41499364      1.099         1.099
2023 Jun      585556    588670    3114       3114       9696996      0.529         0.529
2023 Jul      579267    589257    9990       9990      99800100      1.695         1.695
2023 Aug      581196    595191   13995      13995     195860025      2.351         2.351
2023 Sep      579446    598956   19510      19510     380640100      3.257         3.257
2023 Oct      585006    595517   10511      10511     110481121      1.765         1.765
2023 Nov      576356    593567   17211      17211     296218521      2.900         2.900
2023 Dec      569119    594256   25137      25137     631868769      4.230         4.230
2024 Jan      560257    588944   28687      28687     822943969      4.871         4.871
2024 Feb      563777    593206   29429      29429     866066041      4.961         4.961
2024 Mar      574485    595969   21484      21484     461562256      3.605         3.605
2024 Apr      580943    595676   14733      14733     217061289      2.473         2.473
2024 May      579844    600372   20528      20528     421398784      3.419         3.419
2024 Jun      585556    599979   14423      14423     208022929      2.404         2.404
2024 Jul      579267    605814   26547      26547     704743209      4.382         4.382
2024 Aug      581196    603601   22405      22405     501984025      3.712         3.712
2024 Sep      579446    608774   29328      29328     860131584      4.818         4.818
2024 Oct      585006    612772   27766      27766     770950756      4.531         4.531
2024 Nov      576356    616344   39988      39988    1599040144      6.488         6.488
2024 Dec      569119    621713   52594      52594    2766128836      8.460         8.460

Open snaive_accuracy_check.csv in Excel. Columns land as: A = Month, B = Forecast, C = Actual, D = Error, E = AbsError, F = SqError, G = PctError, H = AbsPctError. With 24 forecast months you have rows 2:25. Add five summary cells:

Metric   Excel formula
ME       =AVERAGE(D2:D25)
RMSE     =SQRT(AVERAGE(F2:F25))
MAE      =AVERAGE(E2:E25)
MPE      =AVERAGE(G2:G25)
MAPE     =AVERAGE(H2:H25)

These should match the SNaive row of the R table to within rounding. Take a screenshot of the Excel cells next to the R table and paste both into the discussion post.

8. Decomposition — additive vs. multiplicative

This series can be decomposed seasonally because it is observed within the year (monthly, so the seasonal period is 12). Below I fit both an STL decomposition (additive by construction, the usual workhorse) and a classical multiplicative decomposition.
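
For reference, the two structural assumptions being compared are \(y_t = T_t + S_t + R_t\) (additive: seasonal swings are a fixed number of dollars, whatever the level) and \(y_t = T_t \times S_t \times R_t\) (multiplicative: seasonal swings are a fixed fraction of the level).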

8.1 Additive (STL)

add_dcmp <- train |>
  model(STL(Sales ~ trend(window = 13) +     # roughly one-year trend window
              season(window = "periodic"),   # seasonal pattern held fixed over time
            robust = TRUE)) |>               # downweight outliers (COVID months)
  components()

autoplot(add_dcmp) +
  labs(title = "Additive STL decomposition of RSXFS")

8.2 Multiplicative (classical)

mult_dcmp <- train |>
  model(classical_decomposition(Sales, type = "multiplicative")) |>
  components()

autoplot(mult_dcmp) +
  labs(title = "Classical multiplicative decomposition of RSXFS")

8.3 Which worked better, and how do we tell?

The diagnostic is the remainder (residual) panel. After we strip out trend and season, the remainder should look like white noise — no structure, roughly constant variance.

  • In the additive decomposition, the remainder’s amplitude grows over time and is much larger near the COVID shock and during the recent high-sales years. That’s the classic signature that the seasonal swings scale with the level — additivity is the wrong assumption.
  • In the multiplicative decomposition, the remainder is centred near 1, much more stable across years, and visibly closer to noise.

Numeric check — the standard deviation of the remainder, smaller is better. The two remainders live on different scales (the additive one is in millions of dollars, the multiplicative one is a unitless ratio around 1), so the comparable column is each remainder's spread relative to its own level:

add_rem  <- add_dcmp  |> pull(remainder)
mult_rem <- mult_dcmp |> pull(random) |> na.omit()

tibble(
  decomposition = c("Additive (STL)", "Multiplicative (classical)"),
  remainder_sd  = c(sd(add_rem), sd(mult_rem)),
  rem_over_level = c(sd(add_rem) / mean(train$Sales),
                     sd(mult_rem))   # mult random is already a ratio
) |>
  kbl(digits = 4, caption = "Remainder dispersion — lower is better") |>
  kable_styling(full_width = FALSE)

Remainder dispersion — lower is better

decomposition                remainder_sd   rem_over_level
Additive (STL)                  7455.2109           0.0208
Multiplicative (classical)         0.0166           0.0166
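
Beyond dispersion, the remainders can be tested for leftover autocorrelation. A sketch using the Ljung–Box feature (lag = 24, two seasonal periods, is my choice):

add_dcmp |>
  features(remainder, ljung_box, lag = 24)

mult_dcmp |>
  filter(!is.na(random)) |>   # classical decomposition loses 6 months at each end
  features(random, ljung_box, lag = 24)

Small p-values flag structure the decomposition failed to absorb; on economic data neither remainder will be pure noise, so compare the two statistics rather than expecting a clean pass.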

For this series multiplicative wins — the seasonal index is a constant percentage-of-trend rather than a constant dollar amount, which matches the economic intuition for retail sales.
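
If you want STL's outlier robustness together with multiplicative structure, the usual trick is to run STL on the log scale, since a decomposition that is additive in logs is multiplicative in levels. A sketch:

log_dcmp <- train |>
  model(STL(log(Sales) ~ trend(window = 13) + season(window = "periodic"),
            robust = TRUE)) |>
  components()

autoplot(log_dcmp) +
  labs(title = "STL decomposition of log(Sales): multiplicative in levels")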

8.4 Using decomposition for forecasting

You can absolutely forecast off a decomposition — that’s what decomposition_model() does. The idea: forecast the seasonally adjusted series with any model (ETS, ARIMA, RW with drift), forecast the seasonal component naively (SNAIVE on the season), and recombine.

dcmp_fit <- train |>
  model(stl_ets = decomposition_model(
    STL(Sales ~ trend(window = 13) + season(window = "periodic"),
        robust = TRUE),                # split into trend + season + remainder
    ETS(season_adjust ~ season("N")),  # ETS (no season) on the adjusted series
    SNAIVE(season_year)                # seasonal naive on the seasonal component
  ))

dcmp_fc <- dcmp_fit |> forecast(h = h)

bind_rows(fc, dcmp_fc) |>
  accuracy(retail) |>
  select(.model, ME, RMSE, MAE, MPE, MAPE) |>
  arrange(RMSE) |>
  kbl(digits = 2,
      caption = "Adding a decomposition-based forecast to the comparison") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Adding a decomposition-based forecast to the comparison

.model        ME          RMSE        MAE        MPE     MAPE
stl_ets        7010.17     8615.80     7062.89    1.17    1.18
Drift         12712.56    13600.89    12712.56    2.12    2.12
SNaive        20632.29    23717.06    20632.29    3.43    3.43
Naive         27784.29    29609.61    27784.29    4.63    4.63
Mean         239277.45   239496.26   239277.45   40.07   40.07

dcmp_fc |>
  autoplot(retail |> filter(year(Month) >= 2018), level = c(80, 95)) +
  labs(title = "STL + ETS decomposition forecast",
       y = "Sales (millions $)", x = NULL)

Would I use it for a final forecast? On this series, yes. The accuracy table tells a clean story: Mean and Naive handle neither trend nor seasonality and are hopeless; SNaive captures the seasonal pattern but not the trend; Drift captures the trend but not the season, and it actually beats SNaive here, so the trend is doing more of the work over this holdout. The STL+ETS pipeline handles both and cuts the RMSE of the best benchmark (Drift) by more than a third. For monthly seasonal economic data, this kind of decomposition forecast is the right benchmark to beat before reaching for anything heavier.
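
One caveat: a single 24-month holdout is one draw. Rolling-origin cross-validation gives firmer evidence before crowning the decomposition model. A sketch; the fold settings (.init = 120 months, annual steps, 12-month horizon) are my choices:

cv_acc <- train |>
  stretch_tsibble(.init = 120, .step = 12) |>   # growing training windows
  model(
    SNaive  = SNAIVE(Sales ~ lag("year")),
    stl_ets = decomposition_model(
      STL(Sales ~ trend(window = 13) + season(window = "periodic"),
          robust = TRUE),
      ETS(season_adjust ~ season("N")),
      SNAIVE(season_year)
    )
  ) |>
  forecast(h = 12) |>
  accuracy(retail)   # averages the errors across folds, per model

cv_acc |> select(.model, RMSE, MAE, MAPE) |> arrange(RMSE)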