library(fpp3)
library(dplyr)
library(ggplot2)
library(knitr)Module 10 Discussion
Part I: Concepts
HTS vs GTS
A hierarchical time series (HTS) has a single tree structure. Each bottom-level series rolls up through one aggregation path to the grand total. For example, retail sales may be aggregated from stores to regions, from regions to states, and then to the national total.
A grouped time series (GTS) is different because the same observations can be grouped in more than one way. For example, tourism can be grouped by state or by travel purpose. In fpp3, HTS is usually written with /, while GTS uses *. GTS is more flexible, but reconciliation is more complicated because forecasts must satisfy more than one set of aggregation constraints.
Forecast methods vs reconciliation methods
Forecast methods and reconciliation methods are not the same thing. Forecast methods such as ETS, ARIMA, and TSLM produce base forecasts for each series.
The problem is that these base forecasts usually do not add up correctly across the hierarchy. Reconciliation adjusts them so that they become coherent. Bottom-up, top-down, middle-out, and MinT are four ways of doing this. The forecasting model and the reconciliation method are separate choices.
Comparison of the four reconciliation methods
Bottom-up. Forecast the bottom-level series first and then sum them upward. This preserves detailed information and is automatically coherent, but it can perform poorly if bottom-level data are noisy.
Top-down. Forecast the total series first and then distribute it downward using proportions. It is simple and often works well at the top level, but lower-level accuracy is usually weaker. It is mainly suited to strict hierarchies.
Middle-out. Start at an intermediate level, then aggregate upward and disaggregate downward. It is a compromise between bottom-up and top-down. It can work well when the middle level is the most stable, but it depends on choosing the right middle level.
MinT (minimum trace). Use forecasts from all levels and reconcile them using the covariance structure of forecast errors. It often gives strong overall accuracy, but it is the most computationally demanding method and depends on estimating the covariance matrix well.
Summary
| Method | Works on | Main strength | Main weakness | Best used when |
|---|---|---|---|---|
| Bottom-up | HTS + GTS | Uses detailed data | Sensitive to noisy bottom series | Bottom-level data is reliable |
| Top-down | HTS only | Simple and strong at the top level | Weak at lower levels | Aggregate series is stable |
| Middle-out | HTS with 3+ levels | Useful compromise | Depends on the chosen middle level | Middle level is the most stable |
| MinT | HTS + GTS | Often strong overall | More complex and data-demanding | Accuracy is the priority |
Part II, Option I: Hierarchical Retail Forecasting
Dataset
For this exercise, I use the aus_retail dataset from tsibbledata, which contains monthly retail turnover (in A$ millions) for Australian retail industries by state. To keep the analysis manageable, I filter the data to four states and four retail categories. This gives a simple three-level hierarchy: Total -> State -> Industry within State. In total, this produces 1 + 4 + 16 = 21 series. A three-level hierarchy is also the minimum needed for middle-out reconciliation to be applicable.
This dataset is different from the class example, which used tsibble::tourism.
Setup
Load and subset the data
df <- tsibbledata::aus_retail
selected_states <- c("New South Wales", "Victoria",
"Queensland", "Western Australia")
selected_industries <- c(
"Food retailing",
"Clothing, footwear and personal accessory retailing",
"Household goods retailing",
"Department stores"
)
retail <- df %>%
filter(State %in% selected_states,
Industry %in% selected_industries,
Month >= yearmonth("2005 Jan"),
Month <= yearmonth("2018 Dec"))
retail %>%
as_tibble() %>%
summarise(series = n_distinct(paste(State, Industry)),
months = n_distinct(Month),
start_month = min(Month),
end_month = max(Month)) %>%
kable(caption = "Filtered dataset summary")| series | months | start_month | end_month |
|---|---|---|---|
| 16 | 168 | 2005 1月 | 2018 12月 |
Build the hierarchy
retail_hts <- retail %>%
aggregate_key(State / Industry, Turnover = sum(Turnover))
retail_hts %>%
as_tibble() %>%
mutate(level = case_when(
is_aggregated(State) & is_aggregated(Industry) ~ "1. Total",
!is_aggregated(State) & is_aggregated(Industry) ~ "2. State",
TRUE ~ "3. State-Industry"
)) %>%
distinct(State, Industry, level) %>%
count(level, name = "num_series") %>%
kable(caption = "Number of series by level")| level | num_series |
|---|---|
| 1. Total | 1 |
| 2. State | 4 |
| 3. State-Industry | 16 |
Top-level time series
retail_hts %>%
filter(is_aggregated(State) & is_aggregated(Industry)) %>%
autoplot(Turnover) +
labs(title = "Total monthly retail turnover (4 states, 4 industries)",
y = "Turnover ($M AUD)", x = "") +
theme_minimal()The top-level series shows a clear upward trend and strong annual seasonality, with pronounced December peaks. This pattern makes ETS a reasonable choice for the base forecasts.
Train-test split
The final 24 months are reserved as the test period, while the remaining observations are used for training.
all_months <- sort(unique(retail$Month))
n_test <- 24
train_months <- head(all_months, -n_test)
test_months <- tail(all_months, n_test)
cat("Training:", format(min(train_months)), "to", format(max(train_months)), "\n")Training: 2005 1月 to 2016 12月
cat("Testing: ", format(min(test_months)), "to", format(max(test_months)), "\n")Testing: 2017 1月 to 2018 12月
train <- retail %>% filter(Month %in% train_months)
test <- retail %>% filter(Month %in% test_months)
train_hts <- train %>% aggregate_key(State / Industry, Turnover = sum(Turnover))
test_hts <- test %>% aggregate_key(State / Industry, Turnover = sum(Turnover))Fit base model and reconcile
ETS is used as the base forecasting model, and the resulting forecasts are reconciled using three methods: top-down, middle-out, and MinT with shrinkage.
# split = 1 sets the middle-out anchor at the State level
fit <- train_hts %>%
model(ets = ETS(Turnover)) %>%
reconcile(
td = top_down(ets, method = "forecast_proportions"),
mo = middle_out(ets, split = 1),
mint = min_trace(ets, method = "mint_shrink")
)Generate forecasts for the test horizon
fc <- fit %>% forecast(h = n_test)Training-period fitted values
Total level
fitted_vals <- augment(fit)
fitted_vals %>%
filter(is_aggregated(State) & is_aggregated(Industry)) %>%
ggplot(aes(x = Month)) +
geom_line(aes(y = Turnover), color = "black", linewidth = 0.6) +
geom_line(aes(y = .fitted, color = .model), alpha = 0.85, linewidth = 0.6) +
facet_wrap(~ .model, ncol = 2) +
labs(
title = "Observed and fitted values at the total level (training period)",
subtitle = "Black line shows observed turnover; coloured line shows fitted values",
y = "Turnover (A$ million)", x = ""
) +
theme_minimal() +
theme(legend.position = "none")State level
fitted_vals %>%
filter(.model == "mint",
!is_aggregated(State), is_aggregated(Industry)) %>%
ggplot(aes(x = Month)) +
geom_line(aes(y = Turnover), color = "black", linewidth = 0.5) +
geom_line(aes(y = .fitted), color = "steelblue", linewidth = 0.6, alpha = 0.85) +
facet_wrap(~ State, scales = "free_y") +
labs(
title = "Observed and fitted values at the state level (training period)",
subtitle = "Black line shows observed turnover; blue line shows MinT fitted values",
y = "Turnover (A$ million)", x = ""
) +
theme_minimal()Forecasts in the test period
Total level
fc %>%
filter(is_aggregated(State), is_aggregated(Industry)) %>%
autoplot(
retail_hts %>%
filter(is_aggregated(State), is_aggregated(Industry),
Month >= yearmonth("2014 Jan")),
level = NULL, linewidth = 0.8
) +
labs(
title = "Forecasts and observed values at the total level (test period)",
subtitle = "The coloured lines show forecasts from the four methods over the 24-month test horizon",
y = "Turnover (A$ million)", x = ""
) +
theme_minimal()State level
fc %>%
filter(!is_aggregated(State), is_aggregated(Industry)) %>%
autoplot(
retail_hts %>%
filter(!is_aggregated(State), is_aggregated(Industry),
Month >= yearmonth("2014 Jan")),
level = NULL, linewidth = 0.7
) +
facet_wrap(~ State, scales = "free_y") +
labs(
title = "Forecasts and observed values at the state level (test period)",
subtitle = "The coloured lines show forecasts from the four methods over the 24-month test horizon",
y = "Turnover (A$ million)", x = ""
) +
theme_minimal()Bottom level
fc %>%
filter(State == "New South Wales", !is_aggregated(Industry)) %>%
autoplot(
retail_hts %>%
filter(State == "New South Wales", !is_aggregated(Industry),
Month >= yearmonth("2014 Jan")),
level = NULL, linewidth = 0.7
) +
facet_wrap(~ Industry, scales = "free_y", labeller = label_wrap_gen(width = 30)) +
labs(
title = "Forecasts and observed values at the bottom level: New South Wales industries (test period)",
subtitle = "The coloured lines show forecasts from the four methods over the 24-month test horizon",
y = "Turnover (A$ million)", x = ""
) +
theme_minimal()Accuracy evaluation
The 21 series in this hierarchy are on very different scales. The total series is much larger than any state series, and the state series are larger than the state-by-industry series below them. For that reason, RMSE and MAE are reported separately by hierarchy level rather than averaged across all series. MAPE is also reported as a scale-free measure to provide a limited cross-level comparison.
Accuracy by hierarchy level
acc_all <- fc %>%
accuracy(test_hts, measures = list(RMSE = RMSE, MAE = MAE, MAPE = MAPE))
acc_by_level <- acc_all %>%
mutate(level = case_when(
is_aggregated(State) & is_aggregated(Industry) ~ "Total",
!is_aggregated(State) & is_aggregated(Industry) ~ "State",
TRUE ~ "State × Industry"
)) %>%
mutate(level = factor(level, levels = c("Total", "State", "State × Industry"))) %>%
group_by(level, .model) %>%
summarise(
RMSE = mean(RMSE, na.rm = TRUE),
MAE = mean(MAE, na.rm = TRUE),
MAPE = mean(MAPE, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(level, RMSE)
kable(
acc_by_level,
digits = 2,
caption = "Accuracy by hierarchy level and reconciliation method"
)| level | .model | RMSE | MAE | MAPE |
|---|---|---|---|---|
| Total | ets | 176.10 | 145.70 | 0.89 |
| Total | td | 176.10 | 145.70 | 0.89 |
| Total | mo | 204.08 | 161.25 | 0.97 |
| Total | mint | 244.48 | 190.38 | 1.12 |
| State | td | 60.98 | 48.26 | 1.11 |
| State | ets | 64.81 | 50.88 | 1.18 |
| State | mo | 64.81 | 50.88 | 1.18 |
| State | mint | 78.96 | 61.00 | 1.48 |
| State × Industry | td | 31.17 | 25.47 | 3.05 |
| State × Industry | mint | 31.81 | 25.27 | 3.05 |
| State × Industry | mo | 32.08 | 26.11 | 3.11 |
| State × Industry | ets | 34.40 | 28.20 | 3.44 |
Cross-level MAPE summary
acc_all %>%
group_by(.model) %>%
summarise(mean_MAPE = mean(MAPE, na.rm = TRUE), .groups = "drop") %>%
arrange(mean_MAPE) %>%
kable(digits = 2,
caption = "Mean MAPE across all series by reconciliation method")| .model | mean_MAPE |
|---|---|
| td | 2.58 |
| mo | 2.64 |
| mint | 2.66 |
| ets | 2.89 |
Which reconciliation method performed best?
Looking at the two tables above, the overall pattern is fairly clear. At the total level, the lowest error was shared by the unreconciled ETS forecasts and top-down, both with an RMSE of 176.10 and a MAPE of 0.89. At the state level, top-down performed best, with an average RMSE of 60.98 and an average MAPE of 1.11. At the state × industry level, top-down again gave the lowest average RMSE at 31.17, while its average MAPE was 3.05, essentially tied with MinT on that measure. Using mean MAPE across all 21 series as a scale-free cross-level summary, top-down also performed best overall with 2.58.
This result is slightly different from the common expectation that MinT will usually perform best overall. In this exercise, a possible explanation is that the aggregate and state-level retail patterns were relatively stable over the training and test periods, so the historical allocation structure used by top-down remained useful over the 24-month forecast horizon. That would help top-down at the higher levels and keep its bottom-level performance competitive. By contrast, MinT depends on estimating the forecast error covariance structure, and in a relatively small hierarchy like this one, that extra complexity does not necessarily lead to better test-set accuracy.