Module 10 Discussion

Author

Jincheng Xie

Part I: Concepts

HTS vs GTS

A hierarchical time series (HTS) has a single tree structure. Each bottom-level series rolls up through one aggregation path to the grand total. For example, retail sales may be aggregated from stores to regions, from regions to states, and then to the national total.

A grouped time series (GTS) is different because the same observations can be grouped in more than one way. For example, tourism can be grouped by state or by travel purpose. In fpp3, HTS is usually written with /, while GTS uses *. GTS is more flexible, but reconciliation is more complicated because forecasts must satisfy more than one set of aggregation constraints.

Forecast methods vs reconciliation methods

Forecast methods and reconciliation methods are not the same thing. Forecast methods such as ETS, ARIMA, and TSLM produce base forecasts for each series.

The problem is that these base forecasts usually do not add up correctly across the hierarchy. Reconciliation adjusts them so that they become coherent. Bottom-up, top-down, middle-out, and MinT are four ways of doing this. The forecasting model and the reconciliation method are separate choices.

Comparison of the four reconciliation methods

Bottom-up. Forecast the bottom-level series first and then sum them upward. This preserves detailed information and is automatically coherent, but it can perform poorly if bottom-level data are noisy.

Top-down. Forecast the total series first and then distribute it downward using proportions. It is simple and often works well at the top level, but lower-level accuracy is usually weaker. It is mainly suited to strict hierarchies.

Middle-out. Start at an intermediate level, then aggregate upward and disaggregate downward. It is a compromise between bottom-up and top-down. It can work well when the middle level is the most stable, but it depends on choosing the right middle level.

MinT (minimum trace). Use forecasts from all levels and reconcile them using the covariance structure of forecast errors. It often gives strong overall accuracy, but it is the most computationally demanding method and depends on estimating the covariance matrix well.

Summary

Method	Works on	Main strength	Main weakness	Best used when
Bottom-up	HTS + GTS	Uses detailed data	Sensitive to noisy bottom series	Bottom-level data is reliable
Top-down	HTS only	Simple and strong at the top level	Weak at lower levels	Aggregate series is stable
Middle-out	HTS with 3+ levels	Useful compromise	Depends on the chosen middle level	Middle level is the most stable
MinT	HTS + GTS	Often strong overall	More complex and data-demanding	Accuracy is the priority

Part II, Option I: Hierarchical Retail Forecasting

Dataset

For this exercise, I use the aus_retail dataset from tsibbledata, which contains monthly retail turnover (in A$ millions) for Australian retail industries by state. To keep the analysis manageable, I filter the data to four states and four retail categories. This gives a simple three-level hierarchy: Total -> State -> Industry within State. In total, this produces 1 + 4 + 16 = 21 series. A three-level hierarchy is also the minimum needed for middle-out reconciliation to be applicable.

This dataset is different from the class example, which used tsibble::tourism.

Setup

library(fpp3)
library(dplyr)
library(ggplot2)
library(knitr)

Load and subset the data

df <- tsibbledata::aus_retail

selected_states <- c("New South Wales", "Victoria",
                     "Queensland", "Western Australia")

selected_industries <- c(
  "Food retailing",
  "Clothing, footwear and personal accessory retailing",
  "Household goods retailing",
  "Department stores"
)

retail <- df %>%
  filter(State %in% selected_states,
         Industry %in% selected_industries,
         Month >= yearmonth("2005 Jan"),
         Month <= yearmonth("2018 Dec"))

retail %>%
  as_tibble() %>%
  summarise(series       = n_distinct(paste(State, Industry)),
            months       = n_distinct(Month),
            start_month  = min(Month),
            end_month    = max(Month)) %>%
  kable(caption = "Filtered dataset summary")

Filtered dataset summary
series	months	start_month	end_month
16	168	2005 1月	2018 12月

Build the hierarchy

retail_hts <- retail %>%
  aggregate_key(State / Industry, Turnover = sum(Turnover))

retail_hts %>%
  as_tibble() %>%
  mutate(level = case_when(
    is_aggregated(State) & is_aggregated(Industry) ~ "1. Total",
    !is_aggregated(State) & is_aggregated(Industry) ~ "2. State",
    TRUE ~ "3. State-Industry"
  )) %>%
  distinct(State, Industry, level) %>%
  count(level, name = "num_series") %>%
  kable(caption = "Number of series by level")

Number of series by level
level	num_series
1. Total	1
2. State	4
3. State-Industry	16

Top-level time series

retail_hts %>%
  filter(is_aggregated(State) & is_aggregated(Industry)) %>%
  autoplot(Turnover) +
  labs(title = "Total monthly retail turnover (4 states, 4 industries)",
       y = "Turnover ($M AUD)", x = "") +
  theme_minimal()

The top-level series shows a clear upward trend and strong annual seasonality, with pronounced December peaks. This pattern makes ETS a reasonable choice for the base forecasts.

Train-test split

The final 24 months are reserved as the test period, while the remaining observations are used for training.

all_months    <- sort(unique(retail$Month))
n_test        <- 24
train_months  <- head(all_months, -n_test)
test_months   <- tail(all_months, n_test)

cat("Training:", format(min(train_months)), "to", format(max(train_months)), "\n")

Training: 2005 1月 to 2016 12月

cat("Testing: ", format(min(test_months)),  "to", format(max(test_months)),  "\n")

Testing:  2017 1月 to 2018 12月

train <- retail %>% filter(Month %in% train_months)
test  <- retail %>% filter(Month %in% test_months)

train_hts <- train %>% aggregate_key(State / Industry, Turnover = sum(Turnover))
test_hts  <- test  %>% aggregate_key(State / Industry, Turnover = sum(Turnover))

Fit base model and reconcile

ETS is used as the base forecasting model, and the resulting forecasts are reconciled using three methods: top-down, middle-out, and MinT with shrinkage.

# split = 1 sets the middle-out anchor at the State level
fit <- train_hts %>%
  model(ets = ETS(Turnover)) %>%
  reconcile(
    td   = top_down(ets,   method = "forecast_proportions"),
    mo   = middle_out(ets, split = 1),
    mint = min_trace(ets,  method = "mint_shrink")
  )

Generate forecasts for the test horizon

fc <- fit %>% forecast(h = n_test)

Training-period fitted values

Total level

fitted_vals <- augment(fit)

fitted_vals %>%
  filter(is_aggregated(State) & is_aggregated(Industry)) %>%
  ggplot(aes(x = Month)) +
  geom_line(aes(y = Turnover), color = "black", linewidth = 0.6) +
  geom_line(aes(y = .fitted, color = .model), alpha = 0.85, linewidth = 0.6) +
  facet_wrap(~ .model, ncol = 2) +
  labs(
    title = "Observed and fitted values at the total level (training period)",
    subtitle = "Black line shows observed turnover; coloured line shows fitted values",
    y = "Turnover (A$ million)", x = ""
  ) +
  theme_minimal() +
  theme(legend.position = "none")

State level

fitted_vals %>%
  filter(.model == "mint",
         !is_aggregated(State), is_aggregated(Industry)) %>%
  ggplot(aes(x = Month)) +
  geom_line(aes(y = Turnover), color = "black", linewidth = 0.5) +
  geom_line(aes(y = .fitted), color = "steelblue", linewidth = 0.6, alpha = 0.85) +
  facet_wrap(~ State, scales = "free_y") +
  labs(
    title = "Observed and fitted values at the state level (training period)",
    subtitle = "Black line shows observed turnover; blue line shows MinT fitted values",
    y = "Turnover (A$ million)", x = ""
  ) +
  theme_minimal()

Forecasts in the test period

Total level

fc %>%
  filter(is_aggregated(State), is_aggregated(Industry)) %>%
  autoplot(
    retail_hts %>%
      filter(is_aggregated(State), is_aggregated(Industry),
             Month >= yearmonth("2014 Jan")),
    level = NULL, linewidth = 0.8
  ) +
  labs(
    title = "Forecasts and observed values at the total level (test period)",
    subtitle = "The coloured lines show forecasts from the four methods over the 24-month test horizon",
    y = "Turnover (A$ million)", x = ""
  ) +
  theme_minimal()

State level

fc %>%
  filter(!is_aggregated(State), is_aggregated(Industry)) %>%
  autoplot(
    retail_hts %>%
      filter(!is_aggregated(State), is_aggregated(Industry),
             Month >= yearmonth("2014 Jan")),
    level = NULL, linewidth = 0.7
  ) +
  facet_wrap(~ State, scales = "free_y") +
  labs(
    title = "Forecasts and observed values at the state level (test period)",
    subtitle = "The coloured lines show forecasts from the four methods over the 24-month test horizon",
    y = "Turnover (A$ million)", x = ""
  ) +
  theme_minimal()

Bottom level

fc %>%
  filter(State == "New South Wales", !is_aggregated(Industry)) %>%
  autoplot(
    retail_hts %>%
      filter(State == "New South Wales", !is_aggregated(Industry),
             Month >= yearmonth("2014 Jan")),
    level = NULL, linewidth = 0.7
  ) +
  facet_wrap(~ Industry, scales = "free_y", labeller = label_wrap_gen(width = 30)) +
  labs(
    title = "Forecasts and observed values at the bottom level: New South Wales industries (test period)",
    subtitle = "The coloured lines show forecasts from the four methods over the 24-month test horizon",
    y = "Turnover (A$ million)", x = ""
  ) +
  theme_minimal()

Accuracy evaluation

The 21 series in this hierarchy are on very different scales. The total series is much larger than any state series, and the state series are larger than the state-by-industry series below them. For that reason, RMSE and MAE are reported separately by hierarchy level rather than averaged across all series. MAPE is also reported as a scale-free measure to provide a limited cross-level comparison.

Accuracy by hierarchy level

acc_all <- fc %>%
  accuracy(test_hts, measures = list(RMSE = RMSE, MAE = MAE, MAPE = MAPE))

acc_by_level <- acc_all %>%
  mutate(level = case_when(
    is_aggregated(State) & is_aggregated(Industry) ~ "Total",
    !is_aggregated(State) & is_aggregated(Industry) ~ "State",
    TRUE ~ "State × Industry"
  )) %>%
  mutate(level = factor(level, levels = c("Total", "State", "State × Industry"))) %>%
  group_by(level, .model) %>%
  summarise(
    RMSE = mean(RMSE, na.rm = TRUE),
    MAE  = mean(MAE, na.rm = TRUE),
    MAPE = mean(MAPE, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(level, RMSE)

kable(
  acc_by_level,
  digits = 2,
  caption = "Accuracy by hierarchy level and reconciliation method"
)

Accuracy by hierarchy level and reconciliation method
level	.model	RMSE	MAE	MAPE
Total	ets	176.10	145.70	0.89
Total	td	176.10	145.70	0.89
Total	mo	204.08	161.25	0.97
Total	mint	244.48	190.38	1.12
State	td	60.98	48.26	1.11
State	ets	64.81	50.88	1.18
State	mo	64.81	50.88	1.18
State	mint	78.96	61.00	1.48
State × Industry	td	31.17	25.47	3.05
State × Industry	mint	31.81	25.27	3.05
State × Industry	mo	32.08	26.11	3.11
State × Industry	ets	34.40	28.20	3.44

Cross-level MAPE summary

acc_all %>%
  group_by(.model) %>%
  summarise(mean_MAPE = mean(MAPE, na.rm = TRUE), .groups = "drop") %>%
  arrange(mean_MAPE) %>%
  kable(digits = 2,
        caption = "Mean MAPE across all series by reconciliation method")

Mean MAPE across all series by reconciliation method
.model	mean_MAPE
td	2.58
mo	2.64
mint	2.66
ets	2.89

Which reconciliation method performed best?

Looking at the two tables above, the overall pattern is fairly clear. At the total level, the lowest error was shared by the unreconciled ETS forecasts and top-down, both with an RMSE of 176.10 and a MAPE of 0.89. At the state level, top-down performed best, with an average RMSE of 60.98 and an average MAPE of 1.11. At the state × industry level, top-down again gave the lowest average RMSE at 31.17, while its average MAPE was 3.05, essentially tied with MinT on that measure. Using mean MAPE across all 21 series as a scale-free cross-level summary, top-down also performed best overall with 2.58.

This result is slightly different from the common expectation that MinT will usually perform best overall. In this exercise, a possible explanation is that the aggregate and state-level retail patterns were relatively stable over the training and test periods, so the historical allocation structure used by top-down remained useful over the 24-month forecast horizon. That would help top-down at the higher levels and keep its bottom-level performance competitive. By contrast, MinT depends on estimating the forecast error covariance structure, and in a relatively small hierarchy like this one, that extra complexity does not necessarily lead to better test-set accuracy.