ADEC7406 Module 10 Discussion

Author

Fabian Yang

Part I

Hierarchical Time Series (HTS)

In an HTS structure, the data follow a clear tree-like hierarchy. For example, total sales can be broken down into regions, and each region can then be broken down into stores. Every lower level adds up naturally to the level above it. The key idea is that the series are nested in a strict parent-child structure.

Grouped Time Series (GTS)

GTS is more flexible. Instead of a single hierarchy, the same data can be grouped by more than one dimension at the same time. For example, sales can be grouped by region and also by product category. These groupings overlap rather than forming one simple tree. So HTS is a special case with one clear hierarchy, while GTS allows multiple crossing classification structures.
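The difference between the two structures can be made concrete with a summing matrix, which maps the bottom-level series to every series in the structure. The sketch below uses base R and a made-up toy example (two regions A and B, and either two stores per region or two product categories X and Y); the names and numbers are illustrative assumptions, not part of the retail data used later.

```r
# Hypothetical bottom-level values, in the order of the columns of S.
bottom <- c(10, 20, 30, 40)

# Hierarchical: Total -> Region (A, B) -> Store (A1, A2, B1, B2).
S_hier <- rbind(
  c(1, 1, 1, 1),  # Total
  c(1, 1, 0, 0),  # Region A = A1 + A2
  c(0, 0, 1, 1),  # Region B = B1 + B2
  diag(4)         # bottom-level series themselves
)
rownames(S_hier) <- c("Total", "A", "B", "A1", "A2", "B1", "B2")

# Grouped: the same bottom series (AX, AY, BX, BY) are also summed
# by product (X, Y), which crosses the regional grouping.
S_grp <- rbind(
  c(1, 1, 1, 1),  # Total
  c(1, 1, 0, 0),  # Region A
  c(0, 0, 1, 1),  # Region B
  c(1, 0, 1, 0),  # Product X = AX + BX
  c(0, 1, 0, 1),  # Product Y = AY + BY
  diag(4)
)
rownames(S_grp) <- c("Total", "A", "B", "X", "Y", "AX", "AY", "BX", "BY")

y_hier <- drop(S_hier %*% bottom)
y_grp  <- drop(S_grp %*% bottom)
```

The grouped matrix has extra rows (X and Y) that cannot be placed in a single tree alongside A and B, which is the sense in which the groupings cross.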

Forecast vs Reconciliation Methods

Forecast methods are the models used to generate forecasts for each series. These could be methods such as ETS, ARIMA, or TSLM. They produce what are called base forecasts. However, when each series is forecast separately, the forecasts may not add up correctly across the hierarchy or groups. This is where reconciliation methods come in. They adjust the base forecasts so that they are coherent with the hierarchical or grouped structure. For example, if the total sales forecast does not equal the sum of the regional sales forecasts, a reconciliation method adjusts them to remove this inconsistency.
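As a small illustration of what an incoherent set of base forecasts looks like, the sketch below uses hypothetical numbers for one forecast horizon (they do not come from any fitted model) and measures the gap between the total forecast and the sum of its components.

```r
# Hypothetical base forecasts, each produced by a separate model.
base_fc <- c(Total = 105, A = 60, B = 40)

# Coherence requires Total = A + B; the gap measures the violation.
gap <- base_fc[["Total"]] - sum(base_fc[c("A", "B")])
is_coherent <- isTRUE(all.equal(gap, 0))
```

Here the gap is nonzero, so a reconciliation step is needed before the forecasts can be reported together.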

Four Reconciliation Methods

The first is the bottom-up method. This approach first forecasts only the most disaggregated series, then aggregates them upward. Its main strength is that it achieves coherence in a simple and intuitive way and loses no information through aggregation. It also works well when the bottom-level series contain strong and reliable information. However, its weakness is that bottom-level series are often noisy, sparse, or volatile, so forecast accuracy can suffer at every level if those disaggregated series are difficult to model.
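A minimal base R sketch of the bottom-up calculation, assuming a hypothetical hierarchy with two regions of two stores each and made-up bottom-level forecasts:

```r
# Summing matrix for Total -> Region (A, B) -> Store (A1, A2, B1, B2).
S <- rbind(
  c(1, 1, 1, 1),
  c(1, 1, 0, 0),
  c(0, 0, 1, 1),
  diag(4)
)
rownames(S) <- c("Total", "A", "B", "A1", "A2", "B1", "B2")

# Hypothetical forecasts for the bottom-level series only.
bottom_fc <- c(A1 = 12, A2 = 8, B1 = 25, B2 = 15)

# Bottom-up: every higher-level forecast is a sum of bottom forecasts.
bu_fc <- drop(S %*% bottom_fc)
```

Any error in the store-level forecasts is carried straight into the regional and total forecasts, which is why noisy bottom-level series hurt this method.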

The second is the top-down method. This method first forecasts the total series at the top level and then distributes that forecast down to lower levels according to historical proportions or some allocation rule. Its main strength is simplicity, and it can perform reasonably well when the aggregate series is much more stable than the detailed series. Its weakness is that it ignores much of the information in the lower-level series. As a result, it may produce poor forecasts for individual components, especially when the composition of the total changes over time.
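One common allocation rule is to use the average of the historical proportions. The base R sketch below assumes a hypothetical two-series history and a made-up total forecast; other proportion rules exist and would give different splits.

```r
# Hypothetical history: rows are months, columns are bottom-level series.
hist_bottom <- cbind(A = c(30, 32, 28), B = c(70, 68, 72))
hist_total  <- rowSums(hist_bottom)

# Average of each series' historical share of the total.
p <- colMeans(hist_bottom / hist_total)

# Hypothetical forecast of the total, distributed by those shares.
total_fc <- 110
td_fc <- total_fc * p
```

The shares are fixed by the past, so if the mix between A and B shifts during the forecast period, the component forecasts will be off even when the total is accurate.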

The third is the middle-out method. This method starts from a middle level of the hierarchy. Forecasts at that middle level are aggregated upward and distributed downward. Its strength is that it provides a compromise between top-down and bottom-up, balancing stability and detail. It can be useful when the middle level is the most meaningful level for decision-making. Its weakness is that it still loses some information, because lower-level series are not modeled directly in a full way, and the results depend heavily on the choice of the middle level.
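A minimal base R sketch of the middle-out idea, assuming hypothetical region-level forecasts and fixed within-region store shares (both made up for illustration):

```r
# Hypothetical forecasts at the middle (region) level.
region_fc <- c(A = 20, B = 40)

# Hypothetical share of each store within its own region.
store_share <- list(
  A = c(A1 = 0.6,  A2 = 0.4),
  B = c(B1 = 0.25, B2 = 0.75)
)

# Upward: the total is the sum of the region forecasts (bottom-up step).
total_fc <- sum(region_fc)

# Downward: each region forecast is split by its shares (top-down step).
store_fc <- unlist(lapply(names(region_fc), function(r) {
  region_fc[[r]] * store_share[[r]]
}))
```

Choosing a different middle level changes which series are modeled directly, so the results can differ noticeably between choices.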

The fourth is optimal reconciliation, which in fpp3 is usually associated with MinT-type approaches. This method first produces base forecasts for all series, then reconciles them statistically using information about the forecast error structure. Its major strength is that it uses information from all levels at once and is often more accurate than the simpler methods. It is also more flexible for both hierarchical and grouped time series. Its weakness is that it is more computationally demanding and depends on estimating the forecast error covariance structure, which can be difficult or unstable in small samples.
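The reconciliation step can be written as a projection of the base forecasts: the reconciled forecasts are S G ŷ, with G = (S' W⁻¹ S)⁻¹ S' W⁻¹, where S is the summing matrix and W is the covariance matrix of the base forecast errors. The base R sketch below is only an illustration with hypothetical base forecasts, and it assumes W is the identity matrix (the OLS special case). In practice W would be estimated from residuals, for example with shrinkage, which this sketch does not attempt.

```r
# Summing matrix for a tiny hierarchy: Total = A + B.
S <- rbind(c(1, 1), diag(2))
rownames(S) <- c("Total", "A", "B")

# Hypothetical incoherent base forecasts (105 != 60 + 40).
y_hat <- c(Total = 105, A = 60, B = 40)

# Assumption: identity error covariance (OLS reconciliation).
W <- diag(3)
W_inv <- solve(W)

# G maps all base forecasts to reconciled bottom-level forecasts.
G <- solve(t(S) %*% W_inv %*% S) %*% t(S) %*% W_inv

# Reconciled forecasts for every series in the hierarchy.
y_tilde <- drop(S %*% G %*% y_hat)
```

With an identity W, the gap of 5 between the total and the sum of its parts is spread equally across the three series: the total is lowered and each component is raised. A different W would spread the adjustment according to how uncertain each base forecast is.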

Part II

Preparation

I use the aus_retail data from fpp3 and build a hierarchy of Total -> State -> Industry. To keep the charts readable, I keep only three industries, while still carrying out reconciliation over the full hierarchy for those series.

I use all observations up to December 2017 as the training set and keep the 12 months of 2018 as the test set.
This holdout design is simple and makes it easy to compare forecast performance across reconciliation methods.

library(fpp3)
library(dplyr)
library(ggplot2)

# Keep three industries so the plots stay readable
retail_hier <- aus_retail |>
  filter(Industry %in% c(
    "Department stores",
    "Clothing, footwear and personal accessory retailing",
    "Food retailing"
  )) |>
  # One Turnover series per State-Industry pair
  group_by(State, Industry) |>
  summarise(Turnover = sum(Turnover)) |>
  # Add the aggregated series: Total -> State -> Industry within State
  aggregate_key(State / Industry, Turnover = sum(Turnover))

# Train-test split: train through December 2017,
# hold out the 12 months of 2018 as the forecast period
train <- retail_hier |> filter(Month <= yearmonth("2017 Dec"))
test  <- retail_hier |> filter(Month > yearmonth("2017 Dec"))

tibble(
  Sample = c("Train", "Test"),
  Start = c(min(pull(train, Month)), min(pull(test, Month))),
  End = c(max(pull(train, Month)), max(pull(test, Month)))
)

# A tibble: 2 × 3
  Sample    Start      End
  <chr>     <mth>    <mth>
1 Train  1982 Apr 2017 Dec
2 Test   2018 Jan 2018 Dec

Base Model (ETS)

I use an ETS model as the base forecasting model because monthly retail turnover typically shows level, trend, and seasonality. After fitting the base model, I reconcile the forecasts using TopDown, MiddleOut, and MinT so that forecasts remain coherent across the hierarchy.

fit <- train |>
  model(ETS = ETS(Turnover)) |>
  reconcile(
    TopDown = top_down(ETS),
    MiddleOut = middle_out(ETS, split = "State"),
    MinT = min_trace(ETS, method = "mint_shrink")
  )

fc <- forecast(fit, new_data = test)

To keep the figure readable, I show the fitted values only at the state level, which is the middle layer of the hierarchy. The black line shows the observed turnover, and the colored lines show the fitted values from the base and reconciled models.

train_actual_state <- train |>
  filter(!is_aggregated(State), is_aggregated(Industry), Month >= yearmonth("2013 Jan"))

train_fitted_state <- augment(fit) |>
  filter(!is_aggregated(State), is_aggregated(Industry), Month >= yearmonth("2013 Jan"))

ggplot() +
  geom_line(
    data = train_actual_state,
    aes(x = Month, y = Turnover),
    colour = "black",
    linewidth = 0.5
  ) +
  geom_line(
    data = train_fitted_state,
    aes(x = Month, y = .fitted, colour = .model),
    linewidth = 0.6
  ) +
  facet_wrap(vars(State), scales = "free_y") +
  labs(
    title = "Training Data: Observed vs Fitted Values",
    x = NULL,
    y = "Turnover",
    colour = "Model"
  ) +
  theme_minimal(base_size = 11) + 
  theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "horizontal")

The reconciled fitted values at the state level are visually very similar, so the lines largely overlap in the training plot. This suggests that the different reconciliation methods produce very similar in-sample fitted values at this level of aggregation. The differences are likely to be more visible at the bottom level or in out-of-sample forecast accuracy on the test set.

Predicted Value on the Test Data

Next, I compare the forecasts for the 2018 test period against the actual observations, again at the state level for readability. This makes it easier to see whether the reconciled forecasts track the held-out data better than the unreconciled ETS forecasts.

test_actual_state <- test |>
  filter(!is_aggregated(State), is_aggregated(Industry))

test_fc_state <- fc |>
  filter(!is_aggregated(State), is_aggregated(Industry)) |>
  as_tibble()

ggplot() +
  geom_line(
    data = test_actual_state,
    aes(x = Month, y = Turnover),
    colour = "black",
    linewidth = 0.6
  ) +
  geom_line(
    data = test_fc_state,
    aes(x = Month, y = .mean, colour = .model),
    linewidth = 0.7
  ) +
  facet_wrap(vars(State), scales = "free_y") +
  labs(
    title = "Test Data: Actual vs Predicted Values",
    x = NULL,
    y = "Turnover",
    colour = "Model"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.box = "horizontal") + 
  scale_x_yearmonth(date_breaks = "6 month")

The test-period plot shows that all four models produce very similar forecasts at the state level. In most states, the predicted lines follow the general movement of the actual series quite well, especially the strong upward jump at the end of the period. At the same time, the actual values are a little more volatile than the forecasts, so the model lines look smoother than the black line. Since the differences between ETS, MiddleOut, MinT, and TopDown are fairly small in the figure, it is difficult to identify the best method from the chart alone. A more reliable conclusion should therefore be based on the test-set accuracy metrics.

Evaluate Test-Set Performance

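I evaluate the models on the test period using the average RMSE and MAPE across all series in the hierarchy.
RMSE captures overall forecast error in the units of the data, while MAPE provides a scale-free percentage comparison. Because RMSE is scale-dependent, its average across levels is influenced most by the larger aggregate series, whereas the average MAPE weights every series equally.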
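For reference, both metrics can be computed by hand for a single series. The base R sketch below uses made-up actual and predicted values purely to show the formulas.

```r
# Hypothetical actual and predicted values for one series.
actual <- c(100, 110, 120)
pred   <- c(98, 113, 114)

err <- actual - pred

# RMSE: square root of the mean squared error (same units as the data).
rmse <- sqrt(mean(err^2))

# MAPE: mean absolute error as a percentage of the actual values.
mape <- mean(abs(err / actual)) * 100
```

Squaring the errors means the single large miss in the last period has more influence on RMSE than on MAPE, which is one reason the two metrics can rank methods differently.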

accuracy_summary <- accuracy(fc, test) |>
  as_tibble() |>
  group_by(.model) |>
  summarise(
    mean_RMSE = mean(RMSE, na.rm = TRUE),
    mean_MAPE = mean(MAPE, na.rm = TRUE)
  ) |>
  arrange(mean_RMSE)

accuracy_summary

# A tibble: 4 × 3
  .model    mean_RMSE mean_MAPE
  <chr>         <dbl>     <dbl>
1 MinT           25.1      2.88
2 MiddleOut      26.8      2.88
3 ETS            27.2      2.83
4 TopDown        28.5      2.95

Based on the test-set results, MinT appears to be the best-performing method overall because it has the lowest mean RMSE among all four models. This means it produced the smallest overall forecast error on average across the hierarchy. MiddleOut also performed reasonably well, while TopDown had the weakest performance, with both the highest RMSE and the highest MAPE.

ETS had the lowest mean MAPE, so it performed slightly better in terms of percentage error. Still, the gap in MAPE between ETS, MinT, and MiddleOut is quite small. Since RMSE usually gives a stronger sense of overall forecast accuracy by placing more weight on large errors, I would still consider MinT the best method in this comparison. Overall, the results suggest that the statistically based MinT reconciliation method provided the most reliable forecasts for this dataset.