Module 10 Discussion

Author

Jincheng Xie

Part I: Concepts

HTS vs GTS

A hierarchical time series (HTS) has a single tree structure. Each bottom-level series rolls up to one parent, and each parent rolls up again until we reach the grand total. In a retail setting, sales may be aggregated from stores to regions, from regions to states, and then to the national total. Because there is only one aggregation path, the structure is relatively straightforward.

A grouped time series (GTS) is different because the same observations can be grouped in more than one way. A common example is Australian tourism, where trips can be grouped by state or by purpose of travel. Neither dimension is strictly above the other, but both describe the same underlying data. In the fpp3 framework, HTS is usually written with slashes, such as aggregate_key(State / Region), while GTS uses an asterisk, such as aggregate_key(State * Purpose). This matters because grouped structures have to satisfy multiple aggregation constraints at the same time, which makes reconciliation more complicated.

Forecast methods vs reconciliation methods

Forecast methods and reconciliation methods are related, but they are not the same thing. Forecast methods such as ETS, ARIMA, and TSLM are used to produce base forecasts for each series. If a hierarchy contains many nodes, this means many separate models may be fitted across the structure.

The issue is that these base forecasts usually do not add up properly. For example, the forecast for a state may not equal the sum of the forecasts for the series below it. Reconciliation is used to adjust those base forecasts so that they become coherent, meaning that the forecasts follow the aggregation structure correctly. Bottom-up, top-down, middle-out, and MinT are different ways of doing this.

An important point is that the forecasting model and the reconciliation method are separate choices. For example, ETS forecasts can be reconciled using MinT, or ARIMA forecasts can be reconciled using bottom-up. Choosing one does not automatically determine the other.

Comparison of the four reconciliation methods

Bottom-up. This method forecasts the bottom-level series first and then sums them upward to obtain forecasts for higher levels. Its main advantage is that it preserves the detailed information in the most disaggregated series, and the resulting forecasts are automatically coherent. The weakness is that bottom-level data are often noisy, so this noise can accumulate when the series are aggregated. Bottom-up is usually most useful when the bottom-level data are reasonably reliable.

Top-down. This method starts by forecasting the total series and then distributes that forecast down the hierarchy using historical or forecast proportions. It is simple, computationally light, and often performs reasonably well at the top level. However, lower-level accuracy is often weaker because the method imposes the aggregate pattern on all disaggregated series. It is also mainly suitable for strict hierarchies rather than grouped structures.

Middle-out. This method begins at an intermediate level, then aggregates upward and disaggregates downward from that level. It can be seen as a compromise between bottom-up and top-down. Its advantage is that it may work well when the middle level is more stable or more interpretable than either the top or the bottom. Its limitation is that the results depend on whether the chosen middle level is actually the right place to start, and it only works when the hierarchy has at least three levels.

MinT (minimum trace). This method uses forecasts from all levels and reconciles them by taking account of the covariance structure of the forecast errors. It is generally regarded as the most statistically efficient approach, and it often gives the best overall accuracy. However, it is also the most computationally demanding method, because it requires estimation of an error covariance matrix. When there are many series or only a short history, that estimation can be difficult or unstable.

Summary

Method Works on Needs Accuracy tendency Use when
Bottom-up HTS + GTS Reliable bottom-level series Often strong at lower levels, but can suffer if bottom series are noisy Detailed data is reliable
Top-down HTS only Stable aggregate pattern Usually good at the top level, but weaker at lower levels Top-level series is more reliable than bottom-level data
Middle-out HTS with 3+ levels A stable intermediate level A compromise between top-down and bottom-up An intermediate level is the most stable or interpretable
MinT HTS + GTS Enough data to estimate the covariance matrix reliably Often best overall Accuracy is the priority and computation is available

Part II, Option I: Hierarchical Retail Forecasting

Dataset

For this exercise, I use the aus_retail dataset from tsibbledata, which contains monthly retail turnover (in A$ millions) for Australian retail industries by state. To keep the analysis manageable, I filter the data to four states and four retail categories. This gives a simple three-level hierarchy: Total -> State -> Industry within State. In total, this produces 1 + 4 + 16 = 21 series. A three-level hierarchy is also the minimum needed for middle-out reconciliation to be applicable.

This dataset is different from the class example, which used tsibble::tourism.

Setup

library(fpp3)
library(dplyr)
library(ggplot2)
library(knitr)

Load and subset the data

df <- tsibbledata::aus_retail

selected_states <- c("New South Wales", "Victoria",
                     "Queensland", "Western Australia")

selected_industries <- c(
  "Food retailing",
  "Clothing, footwear and personal accessory retailing",
  "Household goods retailing",
  "Department stores"
)

retail <- df %>%
  filter(State %in% selected_states,
         Industry %in% selected_industries,
         Month >= yearmonth("2005 Jan"),
         Month <= yearmonth("2018 Dec"))

retail %>%
  as_tibble() %>%
  summarise(series       = n_distinct(paste(State, Industry)),
            months       = n_distinct(Month),
            start_month  = min(Month),
            end_month    = max(Month)) %>%
  kable(caption = "Filtered dataset summary")
Filtered dataset summary
series months start_month end_month
16 168 2005 1月 2018 12月

Build the hierarchy

retail_hts <- retail %>%
  aggregate_key(State / Industry, Turnover = sum(Turnover))

retail_hts %>%
  as_tibble() %>%
  mutate(level = case_when(
    is_aggregated(State) & is_aggregated(Industry) ~ "1. Total",
    !is_aggregated(State) & is_aggregated(Industry) ~ "2. State",
    TRUE ~ "3. State-Industry"
  )) %>%
  distinct(State, Industry, level) %>%
  count(level, name = "num_series") %>%
  kable(caption = "Number of series by level")
Number of series by level
level num_series
1. Total 1
2. State 4
3. State-Industry 16

Top-level time series

retail_hts %>%
  filter(is_aggregated(State) & is_aggregated(Industry)) %>%
  autoplot(Turnover) +
  labs(title = "Total monthly retail turnover (4 states, 4 industries)",
       y = "Turnover ($M AUD)", x = "") +
  theme_minimal()

The top-level series shows a clear upward trend and strong annual seasonality, with pronounced December peaks. This pattern makes ETS a reasonable choice for the base forecasts.

Train-test split

The final 24 months are reserved as the test period, while the remaining observations are used for training.

all_months    <- sort(unique(retail$Month))
n_test        <- 24
train_months  <- head(all_months, -n_test)
test_months   <- tail(all_months, n_test)

cat("Training:", format(min(train_months)), "to", format(max(train_months)), "\n")
Training: 2005 1月 to 2016 12月 
cat("Testing: ", format(min(test_months)),  "to", format(max(test_months)),  "\n")
Testing:  2017 1月 to 2018 12月 
train <- retail %>% filter(Month %in% train_months)
test  <- retail %>% filter(Month %in% test_months)

train_hts <- train %>% aggregate_key(State / Industry, Turnover = sum(Turnover))
test_hts  <- test  %>% aggregate_key(State / Industry, Turnover = sum(Turnover))

Fit base model and reconcile

ETS is used as the base forecasting model, and the resulting forecasts are reconciled using three methods: top-down, middle-out, and MinT with shrinkage.

# split = 1 sets the middle-out anchor at the State level
fit <- train_hts %>%
  model(ets = ETS(Turnover)) %>%
  reconcile(
    td   = top_down(ets,   method = "forecast_proportions"),
    mo   = middle_out(ets, split = 1),
    mint = min_trace(ets,  method = "mint_shrink")
  )

Generate forecasts for the test horizon

fc <- fit %>% forecast(h = n_test)

Training-period fitted values

Total level

fitted_vals <- augment(fit)

fitted_vals %>%
  filter(is_aggregated(State) & is_aggregated(Industry)) %>%
  ggplot(aes(x = Month)) +
  geom_line(aes(y = Turnover), color = "black", linewidth = 0.6) +
  geom_line(aes(y = .fitted, color = .model), alpha = 0.85, linewidth = 0.6) +
  facet_wrap(~ .model, ncol = 2) +
  labs(
    title = "Observed and fitted values at the total level (training period)",
    subtitle = "Black line shows observed turnover; coloured line shows fitted values",
    y = "Turnover (A$ million)", x = ""
  ) +
  theme_minimal() +
  theme(legend.position = "none")

State level

fitted_vals %>%
  filter(.model == "mint",
         !is_aggregated(State), is_aggregated(Industry)) %>%
  ggplot(aes(x = Month)) +
  geom_line(aes(y = Turnover), color = "black", linewidth = 0.5) +
  geom_line(aes(y = .fitted), color = "steelblue", linewidth = 0.6, alpha = 0.85) +
  facet_wrap(~ State, scales = "free_y") +
  labs(
    title = "Observed and fitted values at the state level (training period)",
    subtitle = "Black line shows observed turnover; blue line shows MinT fitted values",
    y = "Turnover (A$ million)", x = ""
  ) +
  theme_minimal()

Forecasts in the test period

Total level

fc %>%
  filter(is_aggregated(State), is_aggregated(Industry)) %>%
  autoplot(
    retail_hts %>%
      filter(is_aggregated(State), is_aggregated(Industry),
             Month >= yearmonth("2014 Jan")),
    level = NULL, linewidth = 0.8
  ) +
  labs(
    title = "Forecasts and observed values at the total level (test period)",
    subtitle = "The coloured lines show forecasts from the four methods over the 24-month test horizon",
    y = "Turnover (A$ million)", x = ""
  ) +
  theme_minimal()

State level

fc %>%
  filter(!is_aggregated(State), is_aggregated(Industry)) %>%
  autoplot(
    retail_hts %>%
      filter(!is_aggregated(State), is_aggregated(Industry),
             Month >= yearmonth("2014 Jan")),
    level = NULL, linewidth = 0.7
  ) +
  facet_wrap(~ State, scales = "free_y") +
  labs(
    title = "Forecasts and observed values at the state level (test period)",
    subtitle = "The coloured lines show forecasts from the four methods over the 24-month test horizon",
    y = "Turnover (A$ million)", x = ""
  ) +
  theme_minimal()

Bottom level

fc %>%
  filter(State == "New South Wales", !is_aggregated(Industry)) %>%
  autoplot(
    retail_hts %>%
      filter(State == "New South Wales", !is_aggregated(Industry),
             Month >= yearmonth("2014 Jan")),
    level = NULL, linewidth = 0.7
  ) +
  facet_wrap(~ Industry, scales = "free_y", labeller = label_wrap_gen(width = 30)) +
  labs(
    title = "Forecasts and observed values at the bottom level: New South Wales industries (test period)",
    subtitle = "The coloured lines show forecasts from the four methods over the 24-month test horizon",
    y = "Turnover (A$ million)", x = ""
  ) +
  theme_minimal()

Accuracy evaluation

The 21 series in this hierarchy are on very different scales. The total series is much larger than any state series, and the state series are larger than the state-by-industry series below them. For that reason, RMSE and MAE are reported separately by hierarchy level rather than averaged across all series. MAPE is also reported as a scale-free measure to provide a limited cross-level comparison.

Accuracy by hierarchy level

acc_all <- fc %>%
  accuracy(test_hts, measures = list(RMSE = RMSE, MAE = MAE, MAPE = MAPE))

acc_by_level <- acc_all %>%
  mutate(level = case_when(
    is_aggregated(State) & is_aggregated(Industry) ~ "Total",
    !is_aggregated(State) & is_aggregated(Industry) ~ "State",
    TRUE ~ "State × Industry"
  )) %>%
  mutate(level = factor(level, levels = c("Total", "State", "State × Industry"))) %>%
  group_by(level, .model) %>%
  summarise(
    RMSE = mean(RMSE, na.rm = TRUE),
    MAE  = mean(MAE, na.rm = TRUE),
    MAPE = mean(MAPE, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(level, RMSE)

kable(
  acc_by_level,
  digits = 2,
  caption = "Accuracy by hierarchy level and reconciliation method"
)
Accuracy by hierarchy level and reconciliation method
level .model RMSE MAE MAPE
Total ets 176.10 145.70 0.89
Total td 176.10 145.70 0.89
Total mo 204.08 161.25 0.97
Total mint 244.48 190.38 1.12
State td 60.98 48.26 1.11
State ets 64.81 50.88 1.18
State mo 64.81 50.88 1.18
State mint 78.96 61.00 1.48
State × Industry td 31.17 25.47 3.05
State × Industry mint 31.81 25.27 3.05
State × Industry mo 32.08 26.11 3.11
State × Industry ets 34.40 28.20 3.44

Cross-level MAPE summary

acc_all %>%
  group_by(.model) %>%
  summarise(mean_MAPE = mean(MAPE, na.rm = TRUE), .groups = "drop") %>%
  arrange(mean_MAPE) %>%
  kable(digits = 2,
        caption = "Mean MAPE across all series by reconciliation method")
Mean MAPE across all series by reconciliation method
.model mean_MAPE
td 2.58
mo 2.64
mint 2.66
ets 2.89

Which reconciliation method performed best?

Looking at the two tables above, the overall pattern is fairly clear. At the total level, the lowest error was shared by the unreconciled ETS forecasts and top-down, both with an RMSE of 176.10 and a MAPE of 0.89. At the state level, top-down performed best, with an average RMSE of 60.98 and an average MAPE of 1.11. At the state × industry level, top-down again gave the lowest average RMSE at 31.17, while its average MAPE was 3.05, essentially tied with MinT on that measure. Using mean MAPE across all 21 series as a scale-free cross-level summary, top-down also performed best overall with 2.58.

This result is slightly different from the common expectation that MinT will usually perform best overall. In this exercise, a possible explanation is that the aggregate and state-level retail patterns were relatively stable over the training and test periods, so the historical allocation structure used by top-down remained useful over the 24-month forecast horizon. That would help top-down at the higher levels and keep its bottom-level performance competitive. By contrast, MinT depends on estimating the forecast error covariance structure, and in a relatively small hierarchy like this one, that extra complexity does not necessarily lead to better test-set accuracy.