Module 10 - Hierarchy & Groups

Author

Kevin Rusu

Part 2

Data Import & Exploration

Chosen data set: Australian Livestock (Monthly) by state.

library(fpp3)

Registered S3 method overwritten by 'tsibble':
  method               from 
  as_tibble.grouped_df dplyr

── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──

✔ tibble      3.3.0     ✔ tsibble     1.1.6
✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
✔ tidyr       1.3.1     ✔ feasts      0.4.2
✔ lubridate   1.9.4     ✔ fable       0.4.1
✔ ggplot2     4.0.0

── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
✖ lubridate::date()    masks base::date()
✖ dplyr::filter()      masks stats::filter()
✖ tsibble::intersect() masks base::intersect()
✖ tsibble::interval()  masks lubridate::interval()
✖ dplyr::lag()         masks stats::lag()
✖ tsibble::setdiff()   masks base::setdiff()
✖ tsibble::union()     masks base::union()

aus_livestock

# A tsibble: 29,364 x 4 [1M]
# Key:       Animal, State [54]
      Month Animal                     State                        Count
      <mth> <fct>                      <fct>                        <dbl>
 1 1976 Jul Bulls, bullocks and steers Australian Capital Territory  2300
 2 1976 Aug Bulls, bullocks and steers Australian Capital Territory  2100
 3 1976 Sep Bulls, bullocks and steers Australian Capital Territory  2100
 4 1976 Oct Bulls, bullocks and steers Australian Capital Territory  1900
 5 1976 Nov Bulls, bullocks and steers Australian Capital Territory  2100
 6 1976 Dec Bulls, bullocks and steers Australian Capital Territory  1800
 7 1977 Jan Bulls, bullocks and steers Australian Capital Territory  1800
 8 1977 Feb Bulls, bullocks and steers Australian Capital Territory  1900
 9 1977 Mar Bulls, bullocks and steers Australian Capital Territory  2700
10 1977 Apr Bulls, bullocks and steers Australian Capital Territory  2300
# ℹ 29,354 more rows

Unique Values

aus_livestock |> distinct(Animal)

# A tibble: 7 × 1
  Animal                    
  <fct>                     
1 Bulls, bullocks and steers
2 Calves                    
3 Cattle (excl. calves)     
4 Cows and heifers          
5 Lambs                     
6 Pigs                      
7 Sheep

aus_livestock |> distinct(State)

# A tibble: 8 × 1
  State                       
  <fct>                       
1 Australian Capital Territory
2 New South Wales             
3 Northern Territory          
4 Queensland                  
5 South Australia             
6 Tasmania                    
7 Victoria                    
8 Western Australia

7 different animals, and 8 different states. Not every combination exists (7 × 8 = 56 possible, but only 54 are actually in the data), because some animals aren’t commercially produced in every state.

Nested Hierarchy

livestock_agg <- aus_livestock |>
  aggregate_key(State / Animal, Count = sum(Count))

livestock_agg

# A tsibble: 34,386 x 4 [1M]
# Key:       State, Animal [63]
      Month State        Animal         Count
      <mth> <fct*>       <fct*>         <dbl>
 1 1972 Jul <aggregated> <aggregated> 4837300
 2 1972 Aug <aggregated> <aggregated> 4881500
 3 1972 Sep <aggregated> <aggregated> 4439000
 4 1972 Oct <aggregated> <aggregated> 5189300
 5 1972 Nov <aggregated> <aggregated> 5497500
 6 1972 Dec <aggregated> <aggregated> 4926700
 7 1973 Jan <aggregated> <aggregated> 5258700
 8 1973 Feb <aggregated> <aggregated> 4566600
 9 1973 Mar <aggregated> <aggregated> 4936800
10 1973 Apr <aggregated> <aggregated> 3321300
# ℹ 34,376 more rows

The / operator creates a nested tree: Total sits on top, State totals on the next level, and (State, Animal) combinations at the bottom. I chose State / Animal (geography on top) because top-down reconciliation requires a strict nested tree with a single parent path. Count = sum(Count) tells R how to aggregate upward, which is the right rule for counts and volumes.

The alternative * operator would have created a grouped structure with two parent paths per bottom cell, which is valid for bottom-up and MinT but not for top-down or middle-out.

This creates the following structure:

The data runs from July 1972 to December 2018, which is 558 months per series.

Train and test split

train <- livestock_agg |>
  filter_index(. ~ "2015 Dec")

test <- livestock_agg |>
  filter_index("2016 Jan" ~ .)

train |> as_tibble() |> summarise(n = n_distinct(Month))

# A tibble: 1 × 1
      n
  <int>
1   522

test  |> as_tibble() |> summarise(n = n_distinct(Month))

# A tibble: 1 × 1
      n
  <int>
1    36

Splits the data into a training period (July 1972 through December 2015) and a test period (January 2016 through December 2018). The test window is 36 months, which is three full seasonal cycles.

Model fitting

Fits one ARIMA model to every series in the training data. Because the hierarchy has 63 series (1 grand total + 8 state totals + 54 bottom-level combinations), this single line fits 63 separate ARIMAs with orders selected automatically by AICc.

fit <- train |>
  model(arima = ARIMA(Count))

Warning in sqrt(diag(best$var.coef)): NaNs produced
Warning in sqrt(diag(best$var.coef)): NaNs produced
Warning in sqrt(diag(best$var.coef)): NaNs produced

fit

# A mable: 63 x 3
# Key:     State, Animal [63]
   State                        Animal                                     arima
   <fct*>                       <fct*>                                   <model>
 1 Australian Capital Territory Bulls, bullocks and s…            <ARIMA(0,1,2)>
 2 Australian Capital Territory Calves               … <ARIMA(0,1,1)(1,0,2)[12]>
 3 Australian Capital Territory Cattle (excl. calves)…            <ARIMA(0,1,2)>
 4 Australian Capital Territory Lambs                … <ARIMA(0,1,2)(0,0,1)[12]>
 5 Australian Capital Territory Pigs                 … <ARIMA(1,1,2)(0,0,1)[12]>
 6 Australian Capital Territory Sheep                …            <ARIMA(0,1,4)>
 7 Australian Capital Territory <aggregated>           <ARIMA(0,1,1)(0,0,2)[12]>
 8 New South Wales              Bulls, bullocks and s… <ARIMA(0,1,4)(2,0,0)[12]>
 9 New South Wales              Calves               … <ARIMA(4,1,1)(0,0,1)[12]>
10 New South Wales              Cattle (excl. calves)… <ARIMA(4,1,1)(1,0,0)[12]>
# ℹ 53 more rows

Prints the mable (model table). Each row is one series, and the arima column contains the selected ARIMA order for that series. Orders vary across series (some non-seasonal, some with drift, some with strong seasonal AR terms), which is a good sign that auto-ARIMA is actually responding to each series’ statistical properties rather than forcing a one-size-fits-all model.

Reconciliation

Applies four reconciliation methods to the base ARIMA forecasts. Reconciliation does not refit any model; it adjusts the forecasts so they are coherent across levels (bottom-level forecasts sum to aggregate forecasts).

Bottom-up sums only the bottom-level forecasts upward. Top-down starts from the top forecast and splits it down using historical proportions. Middle-out uses a middle level and propagates both directions. MinT with shrinkage uses all levels weighted by the forecast error covariance matrix, which typically gives the most statistically efficient combination. Top-down only works here because we have a strict hierarchy; it would fail on a grouped structure.

fit_reconciled <- fit |>
  reconcile(
    bu = bottom_up(arima),
    td = top_down(arima),
    mo = middle_out(arima),
    mint = min_trace(arima, method = "mint_shrink")
  )

fit_reconciled

# A mable: 63 x 7
# Key:     State, Animal [63]
   State          Animal                         arima bu                       
   <fct*>         <fct*>                       <model> <model>                  
 1 Australian Ca… Bulls, bu…            <ARIMA(0,1,2)> <ARIMA(0,1,2)>           
 2 Australian Ca… Calves   … <ARIMA(0,1,1)(1,0,2)[12]> <ARIMA(0,1,1)(1,0,2)[12]>
 3 Australian Ca… Cattle (e…            <ARIMA(0,1,2)> <ARIMA(0,1,2)>           
 4 Australian Ca… Lambs    … <ARIMA(0,1,2)(0,0,1)[12]> <ARIMA(0,1,2)(0,0,1)[12]>
 5 Australian Ca… Pigs     … <ARIMA(1,1,2)(0,0,1)[12]> <ARIMA(1,1,2)(0,0,1)[12]>
 6 Australian Ca… Sheep    …            <ARIMA(0,1,4)> <ARIMA(0,1,4)>           
 7 Australian Ca… <aggregat… <ARIMA(0,1,1)(0,0,2)[12]> <ARIMA(0,1,1)(0,0,2)[12]>
 8 New South Wal… Bulls, bu… <ARIMA(0,1,4)(2,0,0)[12]> <ARIMA(0,1,4)(2,0,0)[12]>
 9 New South Wal… Calves   … <ARIMA(4,1,1)(0,0,1)[12]> <ARIMA(4,1,1)(0,0,1)[12]>
10 New South Wal… Cattle (e… <ARIMA(4,1,1)(1,0,0)[12]> <ARIMA(4,1,1)(1,0,0)[12]>
# ℹ 53 more rows
# ℹ 3 more variables: td <model>, mo <model>, mint <model>

fc <- fit_reconciled |>
  forecast(h = 36)

fc

# A fable: 11,340 x 6 [1M]
# Key:     State, Animal, .model [315]
   State                        Animal                     .model    Month
   <fct*>                       <fct*>                     <chr>     <mth>
 1 Australian Capital Territory Bulls, bullocks and steers arima  2016 Jan
 2 Australian Capital Territory Bulls, bullocks and steers arima  2016 Feb
 3 Australian Capital Territory Bulls, bullocks and steers arima  2016 Mar
 4 Australian Capital Territory Bulls, bullocks and steers arima  2016 Apr
 5 Australian Capital Territory Bulls, bullocks and steers arima  2016 May
 6 Australian Capital Territory Bulls, bullocks and steers arima  2016 Jun
 7 Australian Capital Territory Bulls, bullocks and steers arima  2016 Jul
 8 Australian Capital Territory Bulls, bullocks and steers arima  2016 Aug
 9 Australian Capital Territory Bulls, bullocks and steers arima  2016 Sep
10 Australian Capital Territory Bulls, bullocks and steers arima  2016 Oct
# ℹ 11,330 more rows
# ℹ 2 more variables: Count <dist>, .mean <dbl>

Accuracy

accuracy(fc, livestock_agg) |>
  group_by(.model) |>
  summarise(
    RMSE = mean(RMSE, na.rm = TRUE),
    MAE  = mean(MAE,  na.rm = TRUE),
    MASE = mean(MASE, na.rm = TRUE)
  ) |>
  arrange(RMSE)

# A tibble: 5 × 4
  .model   RMSE    MAE  MASE
  <chr>   <dbl>  <dbl> <dbl>
1 mint   29761. 23898. 0.926
2 bu     30774. 24719. 0.918
3 arima  32156. 25733. 0.910
4 mo     32545. 26259. 0.908
5 td     34058. 27062. 0.907

Computes error metrics for every (series, model) combination via accuracy(fc, livestock_agg), then averages them across series within each reconciliation method to get one summary row per method. Passing the full livestock_agg rather than just the test set is what lets MASE compute its in-sample scale, which needs training data. The final arrange(RMSE) sorts the summary so the best-performing method appears first. MASE is the most meaningful metric here because it is scale-free (RMSE and MAE are dominated by the largest series, which distorts the mean across a hierarchy spanning very different volumes).

Forecast plot

fc |>
  filter(is_aggregated(State), is_aggregated(Animal)) |>
  autoplot(
    train |>
      filter(is_aggregated(State), is_aggregated(Animal)) |>
      filter_index("2010 Jan" ~ .),
    level = NULL
  ) +
  labs(
    title = "Reconciled forecasts: Australian livestock total",
    subtitle = "Trained through Dec 2015, forecast Jan 2016 - Dec 2018",
    y = "Head of livestock",
    x = NULL
  ) +
  theme_minimal()