Hierarchical and Grouped Time Series Discussion

Author

Teddy Kelly

Part I

For hierarchical time series, there is a clear parent-child relationship between the levels of aggregation, which results in only one possible aggregation structure. For grouped time series, the attributes are crossed rather than nested, so there is no unique parent-child relationship and the data can be broken up along multiple paths that arrive at the same bottom-level combinations. For example, breaking up US inflation into inflation by state and then inflation by county within each state is a clear hierarchical time series because there is a single disaggregation path (country → state → county). A grouped time series, by contrast, is one where the aggregation can be performed in multiple equivalent ways. For example, for my MBTA project, I split total subway ridership by day type (weekday or weekend) and then broke those series up into ridership by individual subway line. However, I could have first split the time series into ridership by line and then broken it up into weekday and weekend and obtained the same combinations.
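As a minimal sketch of the grouped case, the base R snippet below uses made-up ridership numbers (they are not the MBTA data, and the line names are placeholders) to show that splitting by day type first or by line first leads to the same bottom-level series and the same total.

```r
# Hypothetical ridership counts for illustration only (not MBTA data).
rides <- data.frame(
  daytype = c("weekday", "weekday", "weekend", "weekend"),
  line    = c("Red", "Blue", "Red", "Blue"),
  riders  = c(100, 80, 40, 30)
)

# Path 1: total -> day type (-> line)
by_daytype <- tapply(rides$riders, rides$daytype, sum)

# Path 2: total -> line (-> day type)
by_line <- tapply(rides$riders, rides$line, sum)

# Both paths share the same four bottom-level series and the same total.
total <- sum(rides$riders)
print(by_daytype)
print(by_line)
print(total)
```

Because neither attribute is nested inside the other, both intermediate levels are valid aggregates of the same bottom-level data, which is what makes the structure grouped rather than hierarchical.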

Top Down:

Top-down reconciliation fits a forecasting model at the top level of aggregation and then disaggregates those forecasts to each of the lower levels, typically using historical or forecast proportions. The top-down approach works best for less complex hierarchical structures and is generally not adequate for complicated mixed hierarchical and grouped structures. This approach is also better suited to cases where accurate top-level forecasts matter more than accurate bottom-level forecasts, and where the lower-level data are quite noisy.
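A minimal base R sketch of the top-down idea is shown below. It assumes a made-up two-series hierarchy (Total = A + B), an assumed top-level forecast of 120, and disaggregation by average historical proportions, which is only one of several ways the proportions can be chosen.

```r
# Hypothetical history for two bottom-level series (rows are time periods).
history <- cbind(A = c(60, 66, 54), B = c(40, 44, 36))
total_history <- rowSums(history)

# Average historical share of the total held by each bottom-level series.
props <- colMeans(history / total_history)

# Assumed top-level forecast, split downward using the proportions.
total_forecast <- 120
top_down <- total_forecast * props
print(top_down)
```

The bottom-level forecasts add back up to the top-level forecast by construction, so coherence is guaranteed, but any information in the bottom-level dynamics is ignored.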

Bottom Up:

Bottom-up reconciliation generates forecasts at the most disaggregated level of the structure and then sums them upward to provide forecasts at each of the middle and higher levels of aggregation.
This approach can handle more complicated structures than top-down (including mixed hierarchical and grouped structures), and is typically preferred when accurate bottom-level forecasts are desired and the bottom-level data are stable.
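The summation step can be sketched with a summing matrix. The base R example below assumes a hypothetical grouped structure of two states crossed with two animal types and made-up bottom-level forecasts; the row labels are placeholders, not series from any real data set.

```r
# Summing matrix for 2 states x 2 animals (4 bottom-level series).
# Column order: (State1, AnimalA), (State1, AnimalB), (State2, AnimalA), (State2, AnimalB)
S <- rbind(
  Total   = c(1, 1, 1, 1),
  State1  = c(1, 1, 0, 0),
  State2  = c(0, 0, 1, 1),
  AnimalA = c(1, 0, 1, 0),
  AnimalB = c(0, 1, 0, 1),
  diag(4)                    # the bottom-level series themselves
)

# Hypothetical bottom-level forecasts.
bottom_fc <- c(10, 20, 30, 40)

# Bottom-up forecasts for every series in the structure.
bu <- drop(S %*% bottom_fc)
print(bu)
```

Note that both the state totals and the animal totals are produced from the same bottom-level forecasts, which is why bottom-up works for grouped structures without modification.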

Middle Out:

This approach combines the top-down and bottom-up approaches described above: forecasts are generated at a chosen middle level, summed upward to obtain the higher levels, and disaggregated downward to obtain the lower levels. It is best used when the middle level of the aggregation structure is the most stable.
One limitation of the middle-out approach is that there must be at least three levels within the hierarchy; otherwise there is no middle level.
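A minimal base R sketch of middle-out is given below, assuming a hypothetical three-level hierarchy (total, two regions, two units per region) with made-up middle-level forecasts and made-up within-region proportions.

```r
# Hypothetical middle-level (regional) forecasts.
middle_fc <- c(North = 50, South = 30)

# Upward step (bottom-up from the middle level): sum to the total.
total_fc <- sum(middle_fc)

# Downward step (top-down from the middle level): split each region
# using assumed within-region proportions.
within_props <- list(
  North = c(N1 = 0.6, N2 = 0.4),
  South = c(S1 = 0.5, S2 = 0.5)
)
bottom_fc <- unlist(lapply(names(middle_fc), function(region) {
  middle_fc[[region]] * within_props[[region]]
}))

print(total_fc)
print(bottom_fc)
```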

Minimum trace:

Minimum trace (MinT) reconciliation is generally the preferred reconciliation method for hierarchical and grouped structures because it can handle complex structures while typically delivering accurate forecasts at each aggregation level.
This method works by fitting the forecasting model at all levels of aggregation and then combining the resulting base forecasts with weights chosen to minimize the total variance of the reconciled forecast errors (the trace of their covariance matrix), subject to the forecasts being coherent at each level. Basically, it exploits the relationships between all the levels of aggregation to improve the forecasts.
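The combination step can be sketched in base R using the general form of the reconciled forecasts, S (S' W⁻¹ S)⁻¹ S' W⁻¹ ŷ, where S is the summing matrix, ŷ holds the base forecasts, and W is the covariance matrix of the base forecast errors. The example below is a simplification: it assumes a made-up hierarchy (Total = A + B), made-up base forecasts, and a diagonal W with assumed variances, whereas a shrinkage variant would estimate a full covariance matrix from the residuals.

```r
# Summing matrix for the hypothetical hierarchy Total = A + B.
S <- rbind(Total = c(1, 1), A = c(1, 0), B = c(0, 1))

# Hypothetical base forecasts; they are incoherent because 60 + 40 != 105.
base_fc <- c(Total = 105, A = 60, B = 40)

# Assumed base forecast error variances (diagonal W for simplicity).
W <- diag(c(4, 1, 1))
W_inv <- solve(W)

# Matrix mapping all base forecasts to reconciled bottom-level forecasts.
G <- solve(t(S) %*% W_inv %*% S) %*% t(S) %*% W_inv

# Reconciled forecasts for every series in the hierarchy.
reconciled <- drop(S %*% G %*% base_fc)
print(reconciled)
```

Unlike bottom-up, the reconciled bottom-level forecasts here are adjusted using the top-level base forecast, with less weight placed on series whose base forecasts are assumed to be noisier.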

Part II

For this discussion, I have decided to attempt option I using the aus_livestock time series from the tsibbledata package. This is a monthly grouped time series that is disaggregated by animal type and Australian state. The time series records the number of animals slaughtered for each type of livestock in each state.

Below, I have loaded in the time series and used the aggregate_key() function to specify the proper group structure of the time series.

library(fpp3)
library(tsibbledata)
library(kableExtra)

df <- aus_livestock

df_gts <- df |>
  aggregate_key(State * Animal,
                Count = sum(Count))

# Train-test split
df_gts_train <- df_gts |>
  filter_index(~'2009 Sep')

df_gts_test <- df_gts |>
  filter_index('2009 Oct'~.)

Now that I have specified the grouped structure of the aus_livestock tsibble and split the data into training and testing sets, the next step is to fit an ARIMA model on the training data and reconcile the forecasts. I had planned to use the top-down, bottom-up, middle-out, and MinT methods. Unfortunately, since the time series follows a grouped structure, I am unable to generate forecasts using top-down and middle-out reconciliation, which require a strictly hierarchical structure.

fit <- df_gts_train |>
  model(
    arima = ARIMA(Count)
  ) |>
  reconcile(
    bu_arima = bottom_up(arima),
    mint_arima = min_trace(arima, method = 'mint_shrink')
  )

nrow(fit)

[1] 70

As we can see, there are 70 unique time series within this grouped structure (the bottom-level State-by-Animal series plus the State totals, the Animal totals, and the national total), which is far too many to compare individually. Therefore, I will only compare the fitted values and forecasts of each reconciliation method for the most aggregated level of the time series: the total number of livestock slaughtered across all of Australia. I have not included top-down or middle-out reconciliation because this is not a hierarchical structure. When I previously attempted to perform top-down and middle-out reconciliation, I received an error when generating forecasts saying that they “strictly require hierarchical structures”. This highlights a major limitation of top-down and middle-out reconciliation: they cannot be applied to every aggregation structure.

The fitted values of these methods are seen below:

# Obtaining the fitted values for the most aggregated level
augment_fit <- fit |>
  filter(State == '<aggregated>' &
         Animal == '<aggregated>') |>
  augment()

# Graphing the fitted values
augment_fit |>
  ggplot(aes(x = Month)) +
  geom_line(aes(y = Count, color = 'Actual Values')) +
  geom_line(aes(y = .fitted, color = 'Fitted Values')) +
  scale_color_manual(
    values = c('black', 'blue'),
    breaks = c('Actual Values', 'Fitted Values')
  ) +
  facet_wrap(~.model, nrow = 2) +
  theme(legend.position = 'bottom') +
  labs(title = 'Fitted Value Comparison of Reconciliation Methods with ARIMA Model')

The graphs above show very similar fitted values across all reconciliation methods, and it is difficult to see any differences because the long training period compresses the plots.
We will now use these methods to produce forecasts for the testing set and focus specifically on forecasting accuracy at the most aggregated level.

# 111 observations for each time series in the testing set
fc <- fit |>
  forecast(h = 111)

# plotting forecasts for only most aggregated level

fc |>
  filter(State == '<aggregated>' & Animal == '<aggregated>') |>
  autoplot(df_gts_test |> filter(State == '<aggregated>' & Animal == '<aggregated>')) +
  facet_wrap(~.model, nrow = 2) +
  theme(legend.position = 'bottom') +
  labs(title = 'Test Set Forecasting Comparison of Reconciliation Methods',
       color = 'Model',
       fill = 'Model')

From the graph above, we can see that the bottom-up and minimum trace ARIMA forecasts have much narrower prediction intervals on the test set than the top-level ARIMA forecast without any reconciliation.
This illustrates one benefit of reconciliation methods: in addition to making the forecasts coherent across levels, they can lead to more stable forecasts at higher levels of aggregation.

# Metric Table
agg_metric <- accuracy(fc, df_gts_test) |>
  filter(State == '<aggregated>' & Animal == '<aggregated>') |>
  select(.model, ME, RMSE, MAE, MAPE)

agg_metric <- agg_metric |> rename('Model' = '.model')

agg_metric |> kable(digits = 2)

Model	ME	RMSE	MAE	MAPE
arima	104015.7	520129.9	438318.1	10.48
bu_arima	-113989.4	444810.7	363604.0	9.18
mint_arima	-191010.1	452359.6	368665.2	9.41

The metric table above shows that the best-performing model at the top level is the bottom-up ARIMA model, which has the lowest root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). The one exception is the mean error (ME), where the unreconciled ARIMA model is slightly closer to zero (104015.7 versus -113989.4).
It is expected that the reconciled ARIMA models outperform the baseline ARIMA model. However, it is surprising that the minimum trace method did not outperform the bottom-up method, considering that these forecasts are for the most aggregated level of the series (all livestock slaughtered across the entire country of Australia).
Bottom-up reconciliation typically excels at providing accurate forecasts for the bottom level of a hierarchical or grouped time series structure, while the minimum trace method is generally expected to perform well at every level of aggregation.
These results show that even when certain methods are “supposed” to outperform others, this is not always the case, possibly because noisy data at some levels of aggregation negatively affect more complex reconciliation methods like minimum trace.
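For reference, the four accuracy measures reported in the metric table can be sketched in base R as follows. The numbers are made up for illustration, and the sketch assumes the error is defined as actual minus forecast, so a negative ME indicates over-forecasting on average.

```r
# Hypothetical actual and forecast values (not from aus_livestock).
actual    <- c(100, 120, 80)
predicted <- c(110, 115, 90)

# Forecast errors, defined here as actual minus forecast.
err <- actual - predicted

ME   <- mean(err)                       # mean error (signed bias)
RMSE <- sqrt(mean(err^2))               # root mean squared error
MAE  <- mean(abs(err))                  # mean absolute error
MAPE <- mean(abs(err / actual)) * 100   # mean absolute percentage error

print(round(c(ME = ME, RMSE = RMSE, MAE = MAE, MAPE = MAPE), 2))
```

Because ME lets positive and negative errors cancel, a model can have an ME close to zero while still having larger RMSE and MAE, which is why the measures do not always rank models in the same order.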