library(fpp3)
-- Attaching packages ---------------------------------- fpp3 0.3 --
v tibble      3.0.6     v feasts      0.1.6
v tsibbledata 0.2.0     v fable       0.2.1
-- Conflicts ------------------------------------- fpp3_conflicts --
x lubridate::date()     masks base::date()
x dplyr::filter()       masks stats::filter()
x zoo::index()          masks tsibble::index()
x lubridate::interval() masks tsibble::interval()
x dplyr::lag()          masks stats::lag()

https://robjhyndman.com/seminars/uwa2017/

https://robjhyndman.com/seminars/isi2019workshop/

https://github.com/robjhyndman/ETC3550Slides/tree/fable

https://github.com/rstudio-conf-2020/time-series-forecasting/blob/master/materials/labs.R

http://www.gradaanwr.net/

1 ARIMA models

While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.

1.1 Stationarity and differencing

A stationary time series is one whose statistical properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time.

A time series with cyclic behaviour (but with no trend or seasonality) is stationary. This is because the cycles are not of a fixed length, so before we observe the series we cannot be sure where the peaks and troughs of the cycles will be.

In general, a stationary time series will have no predictable patterns in the long-term.

1.1.1 Differencing

Transformations such as logarithms can help to stabilise the variance of a time series. Differencing can help stabilise the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.

For a stationary time series, the ACF will drop to zero relatively quickly, while the ACF of non-stationary data decreases slowly. Also, for non-stationary data, the value of \(r_1\) is often large and positive.

# Re-index based on trading days
google_stock <- gafa_stock %>%
  filter(Symbol == "GOOG", year(Date) >= 2015) %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)
# Filter the year of interest
google_2015 <- google_stock %>% filter(year(Date) == 2015)

google_2015 %>%
  mutate(diff_close = difference(Close)) %>%
  features(diff_close, ljung_box, lag = 10)
p1 <- google_2015 %>% 
  ACF(Close) %>% 
  autoplot()
p2 <- google_2015 %>% 
  ACF(difference(Close)) %>% 
  autoplot()
library(cowplot)

Attaching package: 㤼㸱cowplot㤼㸲

The following object is masked from 㤼㸱package:lubridate㤼㸲:

    stamp
plot_grid(p1, p2, cols=2)
Argument 'cols' is deprecated. Use 'ncol' instead.

The ACF of the differenced Google stock price looks just like that of a white noise series. Only one autocorrelation is outside of the 95% limits, and the Ljung-Box Q^∗ statistic has a p-value of 0.637 (for \(h=10\)). This suggests that the daily change in the Google stock price is essentially a random amount which is uncorrelated with that of previous days.

1.1.2 Random walk model

Random walk models are widely used for non-stationary data, particularly financial and economic data. Random walks typically have:

  • long periods of apparent trends up or down
  • sudden and unpredictable changes in direction.

The forecasts from a random walk model are equal to the last observation, as future movements are unpredictable, and are equally likely to be up or down. Thus, the random walk model underpins naïve forecasts.

1.1.3 Second-order differencing

Occasionally the differenced data will not appear to be stationary and it may be necessary to difference the data a second time to obtain a stationary series.

1.1.4 Seasonal differencing

A seasonal difference is the difference between an observation and the previous observation from the same season.

Forecasts from this model are equal to the last observation from the relevant season. That is, this model gives seasonal naïve forecasts.

Sometimes it is necessary to take both a seasonal difference and a first difference to obtain stationary data.

PBS %>%
  filter(ATC2 == "H02") %>%
  summarise(Cost = sum(Cost)/1e6) %>%
  transmute(
    `Sales ($million)` = Cost,
    `Log sales` = log(Cost),
    `Annual change in log sales` = difference(log(Cost), 12),
    `Doubly differenced log sales` = difference(difference(log(Cost), 12), 1)
  ) %>%
  pivot_longer(-Month, names_to="Type", values_to="Sales") %>%
  mutate(
    Type = factor(Type, levels = c(
      "Sales ($million)",
      "Log sales",
      "Annual change in log sales",
      "Doubly differenced log sales"))
  ) %>%
  ggplot(aes(x = Month, y = Sales)) +
  geom_line() +
  facet_grid(vars(Type), scales = "free_y") +
  labs(title = "Corticosteroid drug sales", y = NULL)

If the data have a strong seasonal pattern, we recommend that seasonal differencing be done first, because the resulting series will sometimes be stationary and there will be no need for a further first difference.

First differences are the change between one observation and the next. Seasonal differences are the change between one year to the next. Other lags are unlikely to make much interpretable sense and should be avoided.

1.1.5 Unit root tests

One way to determine more objectively whether differencing is required is to use a unit root test. These are statistical hypothesis tests of stationarity that are designed for determining whether differencing is required.

we use the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. In this test, the null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false. Consequently, small p-values (e.g., less than 0.05) suggest that differencing is required. The test can be computed using the unitroot_kpss() function.

google_2015 %>%
  features(Close, unitroot_kpss)

The reported p-value is set to 0.01 if it is less than that, or to 0.1 if it is greater than that. In this case, the test statistic (3.56) is much bigger than the 1% critical value, so the p-value is less than 0.01, indicating that the null hypothesis is rejected. That is, the data are not stationary. We can difference the data, and apply the test again.

google_2015 %>%
  mutate(diff_close = difference(Close)) %>%
  features(diff_close, unitroot_kpss)

This time, the test statistic is tiny, and well within the range we would expect for stationary data, so the p-value is greater than 0.1. We can conclude that the differenced data appear stationary.

This process of using a sequence of KPSS tests to determine the appropriate number of first differences is carried out using the unitroot_ndiffs() feature.

google_2015 %>%
  features(Close, unitroot_ndiffs)

As we saw from the KPSS tests above, one difference is required to make the google_2015 data stationary.

A similar feature for determining whether seasonal differencing is required is unitroot_nsdiffs(), which uses the measure of seasonal strength introduced in Section 4.3 to determine the appropriate number of seasonal differences required. No seasonal differences are suggested if \(F_S<0.64\), otherwise one seasonal difference is suggested.

aus_total_retail <- aus_retail %>%
  summarise(Turnover = sum(Turnover))
aus_total_retail %>%
  mutate(log_turnover = log(Turnover)) %>%
  features(log_turnover, unitroot_nsdiffs)
aus_total_retail %>%
  mutate(log_turnover = difference(log(Turnover), 12)) %>%
  features(log_turnover, unitroot_ndiffs)

Because unitroot_nsdiffs() returns 1 (indicating one seasonal difference is required), we apply the unitroot_ndiffs() function to the seasonally differenced data. These functions suggest we should do both a seasonal difference and a first difference.

1.3 Autoregressive models

In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable. The term autoregression indicates that it is a regression of the variable against itself.

\(y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \dots + \phi_{p}y_{t-p} + \varepsilon_{t}\)

We refer to this as an AR(p) model, an autoregressive model of order p.

1.4 Moving average models

Rather than using past values of the forecast variable in a regression, a moving average model uses past forecast errors in a regression-like model.

\(y_{t} = c + \varepsilon_t + \theta_{1}\varepsilon_{t-1} + \theta_{2}\varepsilon_{t-2} + \dots + \theta_{q}\varepsilon_{t-q}\)

We refer to this as an MA(q) model, a moving average model of order
q. Of course, we do not observe the values of \(ε_t\), so it is not really a regression in the usual sense.

Notice that each value of \(y_t\) can be thought of as a weighted moving average of the past few forecast errors. However, moving average models should not be confused with the moving average smoothing. A moving average model is used for forecasting future values, while moving average smoothing is used for estimating the trend-cycle of past values.

1.5 Non-seasonal ARIMA models

If we combine differencing with autoregression and a moving average model, we obtain a non-seasonal ARIMA model. ARIMA is an acronym for AutoRegressive Integrated Moving Average (in this context, “integration” is the reverse of differencing). The full model can be written as

\[\begin{equation} y'_{t} = c + \phi_{1}y'_{t-1} + \cdots + \phi_{p}y'_{t-p} + \theta_{1}\varepsilon_{t-1} + \cdots + \theta_{q}\varepsilon_{t-q} + \varepsilon_{t}, \end{equation}\]

We call this an ARIMA(p,d,q) model, where:

p=order of the autoregressive part; d=degree of first differencing involved; q=order of the moving average part.

Special cases of ARIMA models:

White noise ARIMA(0,0,0)
Random walk ARIMA(0,1,0) with no constant
Random walk with drift ARIMA(0,1,0) with a constant
Autoregression ARIMA(p,0,0)
Moving average ARIMA(0,0,q)

Backshift notation:

\[\begin{equation} \begin{array}{c c c c} (1-\phi_1B - \cdots - \phi_p B^p) & (1-B)^d y_{t} &= &c + (1 + \theta_1 B + \cdots + \theta_q B^q)\varepsilon_t\\ {\uparrow} & {\uparrow} & &{\uparrow}\\ \text{AR($p$)} & \text{$d$ differences} & & \text{MA($q$)}\\ \end{array} \end{equation}\]

1.5.1 Egyptian exports

Egyptian exports as a percentage of GDP from 1960 to 2017.

global_economy %>%
  filter(Code == "EGY") %>%
  autoplot(Exports) +
  labs(y = "Percentage of GDP", title = "Egyptian Exports")

The following R code selects a non-seasonal ARIMA model automatically.

fit <- global_economy %>%
  filter(Code == "EGY") %>%
  model(ARIMA(Exports))
report(fit)
Series: Exports 
Model: ARIMA(2,0,1) w/ mean 

Coefficients:
         ar1      ar2      ma1  constant
      1.6764  -0.8034  -0.6896    2.5623
s.e.  0.1111   0.0928   0.1492    0.1161

sigma^2 estimated as 8.046:  log likelihood=-141.57
AIC=293.13   AICc=294.29   BIC=303.43

This is an ARIMA(2,0,1) model:

\(y_t = 2.56 + 1.68 y_{t-1} -0.80 y_{t-2} -0.69 \varepsilon_{t-1} + \varepsilon_{t}\)

Forecasts from the model are shown in Figure 9.8. Notice how they have picked up the cycles evident in the Egyptian economy over the last few decades.

fit %>% forecast(h=10) %>% autoplot(global_economy)

1.5.2 ACF and PACF plots

It is usually not possible to tell, simply from a time plot, what values of p and q are appropriate for the data. However, it is sometimes possible to use the ACF plot, and the closely related PACF plot, to determine appropriate values for p and q.

global_economy %>% filter(Code == "EGY") %>% ACF(Exports) %>% autoplot()

global_economy %>% filter(Code == "EGY") %>% PACF(Exports) %>% autoplot()

If the data are from an ARIMA(p,d,0) or ARIMA(0,d,q) model, then the ACF and PACF plots can be helpful in determining the value of p or q. If p and q are both positive, then the plots do not help in finding suitable values of p and q.

The data may follow an ARIMA(p,d,0) model if the ACF and PACF plots of the differenced data show the following patterns:

  • the ACF is exponentially decaying or sinusoidal;
  • there is a significant spike at lag p in the PACF, but none beyond lag p.

The data may follow an ARIMA(0,d,q) model if the ACF and PACF plots of the differenced data show the following patterns:

  • the PACF is exponentially decaying or sinusoidal;
  • there is a significant spike at lag q in the ACF, but none beyond lag q.

we see that there is a decaying sinusoidal pattern in the ACF, and the PACF shows the last significant spike at lag 4. This is what you would expect from an ARIMA(4,0,0) model.

fit2 <- global_economy %>%
  filter(Code == "EGY") %>%
  model(ARIMA(Exports ~ pdq(4,0,0)))
report(fit2)
Series: Exports 
Model: ARIMA(4,0,0) w/ mean 

Coefficients:
         ar1      ar2     ar3      ar4  constant
      0.9861  -0.1715  0.1807  -0.3283    6.6922
s.e.  0.1247   0.1865  0.1865   0.1273    0.3562

sigma^2 estimated as 7.885:  log likelihood=-140.53
AIC=293.05   AICc=294.7   BIC=305.41

This model is only slightly worse than the ARIMA(2,0,1) model identified by ARIMA() (with an AICc value of 294.70 compared to 294.29).

We can also specify particular values of pdq() that ARIMA can search for. For example, to find the best ARIMA model with \(p\in\{1,2,3\}\), \(q\in\{0,1,2\}\) and d=1, you could use ARIMA(y ~ pdq(p=1:3, d=1, q=0:2)).

1.6 Estimation and order selection

1.6.1 Maximum likelihood estimation

When fable estimates the ARIMA model, it uses maximum likelihood estimation (MLE). This technique finds the values of the parameters which maximise the probability of obtaining the data that we have observed. For ARIMA models, MLE is similar to the least squares estimates that would be obtained by minimising

\(\sum_{t=1}^T\varepsilon_t^2.\)

1.6.2 Information Criteria

Akaike’s Information Criterion (AIC), which was useful in selecting predictors for regression, is also useful for determining the order of an ARIMA model.

Good models are obtained by minimising the AIC, AICc or BIC. Our preference is to use the AICc.

t is important to note that these information criteria tend not to be good guides to selecting the appropriate order of differencing (d) of a model.

1.7 ARIMA modelling in fable

https://otexts.com/fpp3/arima-r.html

1.7.1 Modelling procedure

When fitting an ARIMA model to a set of (non-seasonal) time series data, the following procedure provides a useful general approach.

  1. Plot the data and identify any unusual observations.
  2. If necessary, transform the data (using a Box-Cox transformation) to stabilise the variance.
  3. If the data are non-stationary, take first differences of the data until the data are stationary.
  4. Examine the ACF/PACF: Is an ARIMA(p,d,0) or ARIMA(0,d,q) model appropriate?
  5. Try your chosen model(s), and use the AICc to search for a better model.
  6. Check the residuals from your chosen model by plotting the ACF of the residuals, and doing a portmanteau test of the residuals. If they do not look like white noise, try a modified model.
  7. Once the residuals look like white noise, calculate forecasts.

###Example: Central African Republic exports

global_economy %>%
  filter(Code == "CAF") %>%
  autoplot(Exports)

  1. The time plot shows some non-stationarity, with an overall decline. The improvement in 1994 was due to a new government which overthrew the military junta and had some initial success, before unrest caused further economic decline.

  2. There is no evidence of changing variance, so we will not do a Box-Cox transformation.

  3. To address the non-stationarity, we will take a first difference of the data

global_economy %>%
  filter(Code == "CAF") %>%
  gg_tsdisplay(difference(Exports), plot_type='partial')

  1. The PACF shown in Figure 9.14 is suggestive of an AR(2) model; so an initial candidate model is an ARIMA(2,1,0). The ACF suggests an MA(3) model; so an alternative candidate is an ARIMA(0,1,3).

  2. We fit both an ARIMA(2,1,0) and an ARIMA(0,1,3) model along with two automated model selections, one using the default stepwise procedure, and one working harder to search a larger model space.

caf_fit <- global_economy %>%
  filter(Code == "CAF") %>%
  model(
    arima210 = ARIMA(Exports ~ pdq(2,1,0)),
    arima013 = ARIMA(Exports ~ pdq(0,1,3)),
    stepwise = ARIMA(Exports),
    search = ARIMA(Exports, stepwise=FALSE)
  )
caf_fit
glance(caf_fit) %>% arrange(AICc)

The four models have almost the same AICc values. Of the models fitted, the full search has found that an ARIMA(3,1,0) gives the lowest AICc value, closely followed by the ARIMA(2,1,0) and ARIMA(0,1,3) — the latter two being the models that we guessed from the ACF and PACF plots. The automated stepwise selection has identified an ARIMA(2,1,2) model, which has the highest AICc value of the four models.

  1. The ACF plot of the residuals from the ARIMA(3,1,0) model shows that all autocorrelations are within the threshold limits, indicating that the residuals are behaving like white noise.
caf_fit %>%
  select(search) %>%
  gg_tsresiduals()

A portmanteau test returns a large p-value, also suggesting that the residuals are white noise.

augment(caf_fit) %>%
  filter(.model=='search') %>%
  features(.innov, ljung_box, lag = 10, dof = 3)
  1. Forecasts
caf_fit %>%
  forecast(h=5) %>%
  filter(.model=='search') %>%
  autoplot(global_economy)

Note that the mean forecasts look very similar to what we would get with a random walk (equivalent to an ARIMA(0,1,0)). The extra work to include AR and MA terms has made little difference to the point forecasts in this example, although the prediction intervals are much narrower than for a random walk model.

1.9 Seasonal ARIMA models

A seasonal ARIMA model is formed by including additional seasonal terms in the ARIMA models we have seen so far. It is written as follows:

\(\underbrace{(p, d, q)}\) \(\underbrace{(P, D, Q)_{m}}\)

1.9.1 ACF/PACF

The seasonal part of an AR or MA model will be seen in the seasonal lags of the PACF and ACF. For example, an \(ARIMA(0,0,0)(0,0,1)_12\) model will show:

  • a spike at lag 12 in the ACF but no other significant spikes;
  • exponential decay in the seasonal lags of the PACF (i.e., at lags 12, 24, 36, …).

Similarly, an \(ARIMA(0,0,0)(1,0,0)_12\) model will show:

  • exponential decay in the seasonal lags of the ACF;
  • a single significant spike at lag 12 in the PACF.

1.9.2 Example: Monthly US leisure and hospitality employment

monthly US employment data for leisure and hospitality jobs from January 2000 to September 2019:

leisure <- us_employment %>%
  filter(Title == "Leisure and Hospitality", year(Month) > 2000) %>%
  mutate(Employed = Employed/1000) %>%
  select(Month, Employed)
autoplot(leisure, Employed) +
  labs(y="Millions of people")

We will first take a seasonal difference

leisure %>%
  gg_tsdisplay(difference(Employed, 12), plot_type='partial', lag=36) +
  labs(y="Seasonally differenced")

leisure %>%
  gg_tsdisplay(difference(Employed, 12) %>% difference(), plot_type='partial', lag=36) +
  labs(y="Double differenced")

The significant spike at lag 2 in the ACF suggests a non-seasonal MA(2) component. The significant spike at lag 12 in the ACF suggests a seasonal MA(1) component. Consequently, we begin with an \(ARIMA(0,1,2)(0,1,1)_12\) model, indicating a first difference, a seasonal difference, and non-seasonal MA(2) and seasonal MA(1) component. If we had started with the PACF, we may have selected an \(ARIMA(2,1,0)(0,1,1)_12\) model.

fit <- leisure %>%
  model(
    arima012011 = ARIMA(Employed ~ pdq(0,1,2) + PDQ(0,1,1)),
    arima210011 = ARIMA(Employed ~ pdq(2,1,0) + PDQ(0,1,1)),
    auto = ARIMA(Employed, stepwise=FALSE, approximation=FALSE)
  )
fit
glance(fit) %>% arrange(AICc)

The three fitted models have similar AICc values, with the automatically selected model being a little better. Our second “guess” of \(ARIMA(2,1,0)(0,1,1)_12\) turned out to be very close to the automatically selected model of \(ARIMA(2,1,0)(1,1,1)_12\).

fit %>% select(auto) %>% gg_tsresiduals(lag=36)

One small but significant spike (at lag 11) out of 36 is still consistent with white noise. To be sure, we use a Ljung-Box test, which has a large p-value, confirming that the residuals are similar to white noise. Note that the alternative models also pass this test.

augment(fit) %>% features(.innov, ljung_box, lag=24, dof=4)

Thus, we now have a seasonal ARIMA model that passes the required checks and is ready for forecasting. Forecasts from the model for the next three years

fit %>% forecast(h=36) %>% filter(.model=='auto') %>% autoplot(leisure)

1.9.3 Example: Corticosteroid drug sales in Australia

For our second example, we will try to forecast monthly corticosteroid drug sales in Australia. These are known as H02 drugs under the Anatomical Therapeutic Chemical classification scheme.

h02 <- PBS %>%
  filter(ATC2 == "H02") %>%
  summarise(Cost = sum(Cost)/1e6)
h02 %>%
  mutate(log(Cost)) %>%
  pivot_longer(-Month) %>%
  ggplot(aes(x = Month, y = value)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y") +
  labs(y="", title="Cortecosteroid drug scripts (H02)")

There is a small increase in the variance with the level, so we take logarithms to stabilise the variance.

The data are strongly seasonal and obviously non-stationary, so seasonal differencing will be used.

It is not clear at this point whether we should do another difference or not. We decide not to, but the choice is not obvious.

The last few observations appear to be different (more variable) from the earlier data. This may be due to the fact that data are sometimes revised when earlier sales are reported late.

h02 %>% gg_tsdisplay(difference(log(Cost), 12), plot_type='partial', lag_max = 24)

In the plots of the seasonally differenced data, there are spikes in the PACF at lags 12 and 24, but nothing at seasonal lags in the ACF. This may be suggestive of a seasonal AR(2) term. In the non-seasonal lags, there are three significant spikes in the PACF, suggesting a possible AR(3) term. The pattern in the ACF is not indicative of any simple model.

Consequently, this initial analysis suggests that a possible model for these data is an \(ARIMA(3,0,0)(2,1,0)_12\).

fit <- h02 %>%
  model(ARIMA(log(Cost) ~ 0 + pdq(3,0,1) + PDQ(0,1,2)))
report(fit)
Series: Cost 
Model: ARIMA(3,0,1)(0,1,2)[12] 
Transformation: log(Cost) 

Coefficients:
          ar1     ar2     ar3     ma1     sma1     sma2
      -0.1603  0.5481  0.5678  0.3827  -0.5222  -0.1768
s.e.   0.1636  0.0878  0.0942  0.1895   0.0861   0.0872

sigma^2 estimated as 0.004278:  log likelihood=250.04
AIC=-486.08   AICc=-485.48   BIC=-463.28
fit %>% gg_tsresiduals(lag_max=36)

augment(fit) %>%
  features(.innov, ljung_box, lag = 36, dof = 6)

The innovation residuals from this model are shown. There are a few significant spikes in the ACF, and the model fails the Ljung-Box test. The model can still be used for forecasting, but the prediction intervals may not be accurate due to the correlated residuals.

Sometimes it is just not possible to find a model that passes all of the tests.

None of the models considered here pass all of the residual tests. In practice, we would normally use the best model we could find, even if it did not pass all of the tests.

Forecasts from the \(ARIMA(3,0,1)(0,1,2)_12\) model (which has the second lowest RMSE value on the test set, and the best AICc value amongst models with only seasonal differencing)

h02 %>%
  model(ARIMA(log(Cost) ~ 0 + pdq(3,0,1) + PDQ(0,1,2))) %>%
  forecast() %>%
  autoplot(h02) +
    labs(y="H02 sales (million scripts)")

1.10 ARIMA vs ETS

https://otexts.com/fpp3/arima-ets.html

1.10.1 Comparing ARIMA() and ETS() on non-seasonal data

We can use time series cross-validation to compare an ARIMA model and an ETS model.

aus_economy <- global_economy %>% filter(Code == "AUS") %>%
  mutate(Population = Population/1e6)

aus_economy %>%
  slice(-n()) %>%
  stretch_tsibble(.init = 10) %>%
  model(
    ETS(Population),
    ARIMA(Population)
  ) %>%
  forecast(h = 1) %>%
  accuracy(aus_economy)

In this case the ETS model has higher accuracy on the cross-validated performance measures. Below we generate and plot forecasts for the next 5 years generated from an ETS model.

aus_economy %>%
  model(ETS(Population)) %>%
  forecast(h = "5 years") %>%
  autoplot(aus_economy)

1.10.2 Comparing ARIMA() and ETS() on seasonal data

we want to compare seasonal ARIMA and ETS models applied to the quarterly cement production data (from aus_production). Because the series is relatively long, we can afford to use a training and a test set rather than time series cross-validation. The advantage is that this is much faster. We create a training set from the beginning of 1988 to the end of 2007 and select an ARIMA and an ETS model using the ARIMA() and ETS() functions.

cement <- aus_production %>% filter_index("1988 Q1" ~ .)
train <- cement %>% filter_index(. ~ "2007 Q4")
fit_arima <- train %>% model(ARIMA(Cement))
report(fit_arima)
Series: Cement 
Model: ARIMA(1,0,1)(2,1,1)[4] w/ drift 

Coefficients:
         ar1      ma1   sar1     sar2     sma1  constant
      0.8886  -0.2366  0.081  -0.2345  -0.8979    5.3884
s.e.  0.0842   0.1334  0.157   0.1392   0.1780    1.4844

sigma^2 estimated as 11456:  log likelihood=-463.52
AIC=941.03   AICc=942.68   BIC=957.35

The ARIMA model does well in capturing all the dynamics in the data as the residuals seem to be white noise.

gg_tsresiduals(fit_arima, lag_max = 16)

augment(fit_arima) %>%
  features(.innov, ljung_box, lag = 16, dof = 6)

The output below also shows the ETS model selected and estimated by ETS(). This model also does well in capturing all the dynamics in the data, as the residuals similarly appear to be white noise.

fit_ets <- train %>% model(ETS(Cement))
report(fit_ets)
Series: Cement 
Model: ETS(M,N,M) 
  Smoothing parameters:
    alpha = 0.7533714 
    gamma = 0.0001000093 

  Initial states:
        l       s1       s2       s3        s4
 1694.712 1.031179 1.045209 1.011424 0.9121874

  sigma^2:  0.0034

     AIC     AICc      BIC 
1104.095 1105.650 1120.769 
fit_ets %>% gg_tsresiduals(lag_max = 16)

augment(fit_ets) %>%
  features(.innov, ljung_box, lag = 16, dof = 6)

The output below evaluates the forecasting performance of the two competing models over the test set. In this case the ARIMA model seems to be the slightly more accurate model based on the test set RMSE, MAPE and MASE.

# Generate forecasts and compare accuracy over the test set
bind_rows(
  fit_arima %>% accuracy(),
  fit_ets %>% accuracy(),
  fit_arima %>% forecast(h = "2 years 6 months") %>%
    accuracy(cement),
  fit_ets %>% forecast(h = "2 years 6 months") %>%
    accuracy(cement)
)

Below we generate and plot forecasts from the ARIMA model for the next 3 years.

cement %>%
  model(ARIMA(Cement)) %>%
  forecast(h="3 years") %>%
  autoplot(cement)

2 Dynamic regression models

In this chapter, we consider how to extend ARIMA models in order to allow other information to be included in the models.

In this chapter, we will allow the errors from a regression to contain autocorrelation. To emphasise this change in perspective, we will replace \(ε_t\) with \(η_t\) in the equation. The error series \(η_t\) is assumed to follow an ARIMA model. For example, if \(η_t\) follows an ARIMA(1,1,1) model, we can write

\[\begin{align*} y_t &= \beta_0 + \beta_1 x_{1,t} + \dots + \beta_k x_{k,t} + \eta_t,\\ & (1-\phi_1B)(1-B)\eta_t = (1+\theta_1B)\varepsilon_t, \end{align*}\]

where \(ε_t\) is a white noise series.

Notice that the model has two error terms here — the error from the regression model, which we denote by \(η_t\), and the error from the ARIMA model, which we denote by \(ε_t\). Only the ARIMA model errors are assumed to be white noise.

2.1 Estimation

When we estimate the parameters from the model, we need to minimise the sum of squared \(ε_t\) values.

An important consideration when estimating a regression with ARMA errors is that all of the variables in the model must first be stationary.

We therefore first difference the non-stationary variables in the model. It is often desirable to maintain the form of the relationship between \(y_t\) and the predictors, and consequently it is common to difference all of the variables if any of them need differencing. The resulting model is then called a “model in differences,” as distinct from a “model in levels,” which is what is obtained when the original data are used without differencing.

2.2 Regression with ARIMA errors using fable

The fable function ARIMA() will fit a regression model with ARIMA errors if exogenous regressors are included in the formula.

If differencing is specified, then the differencing is applied to all variables in the regression model before the model is estimated.

ARIMA(y ~ x + pdq(1,1,0))

The ARIMA() function can also be used to select the best ARIMA model for the errors. This is done by not specifying the pdq() special. If differencing is required, then all variables are differenced during the estimation process, although the final model will be expressed in terms of the original variables.

The AICc is calculated for the final model, and this value can be used to determine the best predictors. That is, the procedure should be repeated for all subsets of predictors to be considered, and the model with the lowest AICc value selected.

2.2.1 Example: US Personal Consumption and Income

Quarterly changes in personal consumption expenditure and personal disposable income from 1970 to 2016 Q3. We would like to forecast changes in expenditure based on changes in income.

us_change %>%
  gather("var", "value", Consumption, Income) %>%
  ggplot(aes(x = Quarter, y = value)) +
  geom_line() +
  facet_grid(vars(var), scales = "free_y") +
  labs(y = "",
       title = "Quarterly changes in US consumption and personal income")

fit <- us_change %>%
  model(ARIMA(Consumption ~ Income))
report(fit)
Series: Consumption 
Model: LM w/ ARIMA(1,0,2) errors 

Coefficients:
         ar1      ma1     ma2  Income  intercept
      0.7070  -0.6172  0.2066  0.1976     0.5949
s.e.  0.1068   0.1218  0.0741  0.0462     0.0850

sigma^2 estimated as 0.3113:  log likelihood=-163.04
AIC=338.07   AICc=338.51   BIC=357.8

The data are clearly already stationary (as we are considering percentage changes rather than raw expenditure and income), so there is no need for any differencing. The fitted model is

\[\begin{align*} y_t &= 0.595 + 0.198 x_t + \eta_t, \\ \eta_t &= 0.707 \eta_{t-1} + \varepsilon_t -0.617 \varepsilon_{t-1} + 0.207 \varepsilon_{t-2},\\ \varepsilon_t &\sim \text{NID}(0,0.311). \end{align*}\]

We can recover estimates of both the \(η_t\) and \(ε_t\) series using the residuals() function.

bind_rows(
    `Regression Errors` = as_tibble(residuals(fit, type = "regression")),
    `ARIMA Errors` = as_tibble(residuals(fit, type = "innovation")),
    .id = "type"
  ) %>%
  ggplot(aes(x = Quarter, y = .resid)) +
  geom_line() +
  facet_grid(vars(type), scales = "free_y") +
  labs(y = "")

It is the ARIMA errors (the innovations) that should resemble a white noise series.

fit %>% gg_tsresiduals()

augment(fit) %>%
  features(.innov, ljung_box, dof = 5, lag = 8)

The residuals (i.e., the ARIMA errors) are not significantly different from white noise.

2.3 Forecasting

To forecast using a regression model with ARIMA errors, we need to forecast the regression part of the model and the ARIMA part of the model, and combine the results.

As with ordinary regression models, in order to obtain forecasts we first need to forecast the predictors. When the predictors are known into the future (e.g., calendar-related variables such as time, day-of-week, etc.), this is straightforward. But when the predictors are themselves unknown, we must either model them separately, or use assumed future values for each predictor.

2.3.1 US Personal Consumption and Income

We will calculate forecasts for the next eight quarters assuming that the future percentage changes in personal disposable income will be equal to the mean percentage change from the last forty years.

us_change_future <- new_data(us_change, 8) %>%
  mutate(Income = mean(us_change$Income))
forecast(fit, new_data = us_change_future) %>%
  autoplot(us_change) +
  labs(y = "Percentage change")

The prediction intervals for this model are narrower than if we had fitted an ARIMA model without covariates, because we are now able to explain some of the variation in the data using the income predictor.

It is important to realise that the prediction intervals from regression models (with or without ARIMA errors) do not take into account the uncertainty in the forecasts of the predictors. So they should be interpreted as being conditional on the assumed (or estimated) future values of the predictor variables.

2.3.2 Forecasting electricity demand

Daily electricity demand can be modelled as a function of temperature. The higher demand on cold and hot days is reflected in the U-shape

vic_elec_daily <- vic_elec %>%
  filter(year(Time) == 2014) %>%
  index_by(Date = date(Time)) %>%
  summarise(
    Demand = sum(Demand) / 1e3,
    Temperature = max(Temperature),
    Holiday = any(Holiday)
  ) %>%
  mutate(Day_Type = case_when(
    Holiday ~ "Holiday",
    wday(Date) %in% 2:6 ~ "Weekday",
    TRUE ~ "Weekend"
  ))

vic_elec_daily %>%
  ggplot(aes(x = Temperature, y = Demand, colour = Day_Type)) +
  geom_point() +
  labs(y = "Electricity demand (GW)", x = "Maximum daily temperature")

The data stored as vic_elec_daily includes total daily demand, daily maximum temperatures, and an indicator variable for if that day is a public holiday.

vic_elec_daily %>%
  select(Date:Temperature) %>% 
  pivot_longer(-Date, names_to = "type", values_to = "data") %>% 
  ggplot(aes(x = Date, y = data)) +
  geom_line() +
  facet_grid(vars(type), scales = "free_y") +
  labs(y = "Value")

we fit a quadratic regression model with ARMA errors using the ARIMA() function. The model also includes an indicator variable for if the day was a working day or not.

fit <- vic_elec_daily %>%
  model(ARIMA(Demand ~ Temperature + I(Temperature^2) + (Day_Type == "Weekday")))
fit %>% gg_tsresiduals()

augment(fit) %>%
  features(.innov, ljung_box, dof = 8, lag = 14)

There is clear heteroskedasticity in the residuals, with higher variance in January and February, and lower variance in May. The model also has some significant autocorrelation in the residuals, and the histogram of the residuals shows long tails. All of these issues with the residuals may affect the coverage of the prediction intervals, but the point forecasts should still be ok.

Using the estimated model we forecast 14 days ahead starting from Thursday 1 January 2015 (a non-work-day being a public holiday for New Years Day). In this case, we could obtain weather forecasts from the weather bureau for the next 14 days. But for the sake of illustration, we will use scenario based forecasting where we set the temperature for the next 14 days to a constant 26 degrees.

vic_elec_future <- new_data(vic_elec_daily, 14) %>%
  mutate(
    Temperature = 26,
    Holiday = c(TRUE, rep(FALSE, 13)),
    Day_Type = case_when(
      Holiday ~ "Holiday",
      wday(Date) %in% 2:6 ~ "Weekday",
      TRUE ~ "Weekend"
    )
  )
forecast(fit, vic_elec_future) %>%
  autoplot(vic_elec_daily) + labs(y = "Electricity demand (GW)")

The point forecasts look reasonable for the first two weeks of 2015. The slow down in electricity demand at the end of 2014 (due to many people taking summer vacations) has caused the forecasts for the next two weeks to show similarly low demand values.

2.5 Lagged predictors

Sometimes, the impact of a predictor that is included in a regression model will not be simple and immediate.

In these situations, we need to allow for lagged effects of the predictor. Suppose that we have only one predictor in our model. Then a model which allows for lagged effects can be written as

\(y_t = \beta_0 + \gamma_0x_t + \gamma_1 x_{t-1} + \dots + \gamma_k x_{t-k} + \eta_t,\)

where \(η_t\) is an ARIMA process. The value of k can be selected using the AICc, along with thevalues of p and q for the ARIMA error.

2.5.1 Example: TV advertising and insurance quotations

A US insurance company advertises on national television in an attempt to increase the number of insurance quotations provided (and consequently the number of new policies).

Number of quotations and the expenditure on television advertising for the company each month from January 2002 to April 2005:

load(file = "datasets/insurance.rda")

insurance %>%
  pivot_longer(Quotes:TVadverts) %>%
  ggplot(aes(x = Month, y = value)) +
  geom_line() +
  facet_grid(vars(name), scales = "free_y") +
  labs(y = "", title = "Insurance advertising and quotations")

We will consider including advertising expenditure for up to four months; that is, the model may include advertising expenditure in the current month, and the three months before that. When comparing models, it is important that they all use the same training set. In the following code, we exclude the first three months in order to make fair comparisons.

fit <- insurance %>%
  # Restrict data so models use same fitting period
  mutate(Quotes = c(NA, NA, NA, Quotes[4:40])) %>%
  # Estimate models
  model(
    lag0 = ARIMA(Quotes ~ pdq(d = 0) + TVadverts),
    lag1 = ARIMA(Quotes ~ pdq(d = 0) + TVadverts + lag(TVadverts)),
    lag2 = ARIMA(Quotes ~ pdq(d = 0) + TVadverts + lag(TVadverts) + lag(TVadverts, 2)),
    lag3 = ARIMA(Quotes ~ pdq(d = 0) + TVadverts + lag(TVadverts) + lag(TVadverts, 2) + lag(TVadverts, 3))
  )

Next we choose the optimal lag length for advertising based on the AICc.

glance(fit)

The best model (with the smallest AICc value) is lag1 with two predictors; that is, it includes advertising only in the current month and the previous month. So we now re-estimate that model, but using all the available data.

fit_best <- insurance %>%
  model(ARIMA(Quotes ~ pdq(d = 0) + TVadverts + lag(TVadverts)))
report(fit_best)
Series: Quotes 
Model: LM w/ ARIMA(1,0,2) errors 

Coefficients:
         ar1     ma1     ma2  TVadverts  lag(TVadverts)  intercept
      0.5123  0.9169  0.4591     1.2527          0.1464     2.1554
s.e.  0.1849  0.2051  0.1895     0.0588          0.0531     0.8595

sigma^2 estimated as 0.2166:  log likelihood=-23.94
AIC=61.88   AICc=65.38   BIC=73.7

The chosen model has ARIMA(1,0,2) errors. The model can be written as

\(y_t = 2.155 + 1.253 x_t + 0.146 x_{t-1} + \eta_t\)

where \(y_t\) is the number of quotations provided in month t, \(x_t\) is the advertising expenditure in month t,

\(\eta_t = 0.512 \eta_{t-1} + \varepsilon_t + 0.917 \varepsilon_{t-1} + 0.459 \varepsilon_{t-2}\)

We can calculate forecasts using this model if we assume future values for the advertising variable. If we set the future monthly advertising to 8 units, we get the forecasts:

insurance_future <- new_data(insurance, 20) %>%
  mutate(TVadverts = 8)
fit_best %>%
  forecast(insurance_future) %>%
  autoplot(insurance) +
  labs(y = "Quotes",
       title = "Forecast quotes with future advertising set to 8")

3 Forecasting hierarchical and grouped time series

Time series can often be naturally disaggregated by various attributes of interest.

Hierarchical time series often arise due to geographic divisions.

Alternative aggregation structures arise when attributes of interest are crossed rather than nested. We refer to the resulting time series of crossed attributes as “grouped time series.”

More complex structures arise when attributes of interest are both nested and crossed.

Forecasts are often required for all disaggregate and aggregate series, and it is natural to want the forecasts to add up in the same way as the data.

3.1 Hierarchical and grouped time series

3.1.1 Hierarchical time series

https://otexts.com/fpp3/hts.html

For any time t, the observations at the bottom level of the hierarchy will sum to the observations of the series above. For example,

\[\begin{equation} y_{t}=\y{AA}{t}+\y{AB}{t}+\y{AC}{t}+\y{BA}{t}+\y{BB}{t}, \tag{11.1} \end{equation}\]

\[\begin{equation} \y{A}{t}=\y{AA}{t}+\y{AB}{t}+\y{AC}{t}\qquad \text{and} \qquad \y{B}{t}=\y{BA}{t}+\y{BB}{t}. \tag{11.2} \end{equation}\]

3.1.2 Example: Australian tourism hierarchy

Australia is divided into six states and two territories, with each one having its own government and some economic and administrative autonomy. Each of these states can be further subdivided into regions.

The tourism tsibble contains data on quarterly domestic tourism demand, measured as the number of overnight trips Australians spend away from home. The key variables State and Region denote the geographical areas, while a further key Purpose describes the purpose of travel.

#recode State to use abbreviations
tourism <- tsibble::tourism %>%
  mutate(State = recode(State,
    `New South Wales` = "NSW",
    `Northern Territory` = "NT",
    `Queensland` = "QLD",
    `South Australia` = "SA",
    `Tasmania` = "TAS",
    `Victoria` = "VIC",
    `Western Australia` = "WA"
  ))

Using the aggregate_key() function, we can create the hierarchical time series with overnight trips in regions at the bottom level of the hierarchy, aggregated to states, which are aggregated to the national total. A hierarchical time series corresponding to the nested structure is created using a parent/child specification.

tourism_hts <- tourism %>%
  aggregate_key(State / Region, Trips = sum(Trips))
tourism_hts

The new tsibble now has some additional rows corresponding to state and national aggregations for each quarter. Figure 11.3 shows the aggregate total overnight trips for the whole of Australia as well as the states, revealing diverse and rich dynamics. For example, there is noticeable national growth since 2010 and for some states such as the ACT, New South Wales, Queensland, South Australia, and Victoria. There seems to be a significant jump for Western Australia in 2014.

tourism_hts %>%
  filter(is_aggregated(Region)) %>%
  autoplot(Trips) +
  labs(y = "Trips ('000)",
       title = "Australian tourism: national total and states") +
  facet_wrap(vars(State), scales = "free_y", ncol = 3) +
  theme(legend.position = "none")

The seasonal pattern of the northern states, such as Queensland and the Northern Territory, leads to peak visits in winter (corresponding to Q3) due to the tropical climate and rainy summer months. In contrast, the southern states tend to peak in summer (corresponding to Q1).

3.1.3 Grouped time series

With grouped time series, the data structure does not naturally disaggregate in a unique hierarchical manner.

This example shows that there are alternative aggregation paths for grouped structures. For any time t, as with the hierarchical structure,

\[\begin{equation*} y_{t}=y_{AX}{t}+y_{AY,t}+y_{BX,t}+y_{BY,t}. \end{equation*}\]

However, for the first level of the grouped structure,

\[\begin{equation} \y{A}{t}=\y{AX}{t}+\y{AY}{t}\quad \quad \y{B}{t}=\y{BX}{t}+\y{BY}{t} \tag{11.3} \end{equation}\]

but also

\[\begin{equation} \y{X}{t}=\y{AX}{t}+\y{BX}{t}\quad \quad \y{Y}{t}=\y{AY}{t}+\y{BY}{t} \tag{11.4}. \end{equation}\]

Grouped time series can sometimes be thought of as hierarchical time series that do not impose a unique hierarchical structure, in the sense that the order by which the series can be grouped is not unique.

3.1.4 Example: Australian prison population

In this example we consider the Australia prison population.

prison <- readr::read_csv("https://OTexts.com/fpp3/extrafiles/prison_population.csv") %>%
  mutate(Quarter = yearquarter(Date)) %>%
  select(-Date)  %>%
  as_tsibble(key = c(Gender, Legal, State, Indigenous), index = Quarter) %>%
  relocate(Quarter)

-- Column specification ------------------------------------------------------------------
cols(
  Date = col_date(format = ""),
  State = col_character(),
  Gender = col_character(),
  Legal = col_character(),
  Indigenous = col_character(),
  Count = col_double()
)

We create a grouped time series using aggregate_key() with attributes or groupings of interest now being crossed using the syntax attribute1*attribute2 (in contrast to the parent/child syntax used for hierarchical time series). The following code builds a grouped tsibble for the prison data with crossed attributes: gender, legal status and state.

prison_gts <- prison %>%
  aggregate_key(Gender * Legal * State, Count = sum(Count)/1e3)

Using is_aggregated() within filter() is helpful for exploring or plotting the main groups shown in the bottom panels. For example, the following code plots the total numbers of female and male prisoners across Australia.

prison_gts %>%
  filter(!is_aggregated(Gender), is_aggregated(Legal), is_aggregated(State)) %>%
  autoplot(Count) +
  labs(y = "Number of prisoners ('000)")

Plots of other group combinations can be obtained:

prison_gts %>%
  filter(!is_aggregated(Gender), !is_aggregated(Legal), !is_aggregated(State)) %>%
  mutate(Gender = as.character(Gender)) %>%
  ggplot(aes(x = Quarter, y = Count, group = Gender, colour=Gender)) +
  stat_summary(fun = sum, geom = "line") +
  labs(title = "Prison population by state and gender",
       y = "Number of prisoners ('000)") +
  facet_wrap(~ as.character(State), nrow = 1, scales = "free_y") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

3.1.5 Mixed hierarchical and grouped structure

Often disaggregating factors are both nested and crossed. For example, the Australian tourism data can also be disaggregated by the four purposes of travel: holiday, business, visiting friends and relatives, and other. This grouping variable does not nest within any of the geographical variables. In fact, we could consider overnight trips split by purpose of travel for the whole of Australia, and for each state, and for each region. We describe such a structure as a “nested” geographic hierarchy “crossed” with the purpose of travel. Using aggregate_key this can be specified by simply combining the factors as follows.

tourism_full <- tourism %>%
  aggregate_key((State / Region) * Purpose, Trips = sum(Trips))

The tourism_full tsibble contains 425 series, including the 85 series from the hierarchical structure, as well as another 340 series obtained when each series of the hierarchical structure is crossed with the purpose of travel.

3.2 Single level approaches

Traditionally, forecasts of hierarchical or grouped time series involved selecting one level of aggregation and generating forecasts for that level. These are then either aggregated for higher levels, or disaggregated for lower levels, to obtain a set of coherent forecasts for the rest of the structure.

3.2.1 The bottom-up approach

A simple method for generating coherent forecasts is the “bottom-up” approach. This approach involves first generating forecasts for each series at the bottom level, and then summing these to produce forecasts for all the series in the structure.

An advantage of this approach is that we are forecasting at the bottom level of a structure, and therefore no information is lost due to aggregation. On the other hand, bottom-level data can be quite noisy and more challenging to model and forecast.

3.2.1.1 Example: generating bottom-up forecasts

Suppose we want national and state forecasts for the Australian tourism data, but we aren’t interested in disaggregations using regions or the purpose of travel. So we first create a simple tsibble object containing only state and national trip totals for each quarter.

tourism_states <- tourism %>%
  aggregate_key(State, Trips = sum(Trips))

We could generate the bottom-level state forecasts first, and then sum them to obtain the national forecasts as follows.

fcasts_state <- tourism_states %>%
  filter(!is_aggregated(State)) %>%
  model(ets = ETS(Trips)) %>%
  forecast()
# Sum bottom-level forecasts to get top-level forecasts
fcasts_national <- fcasts_state %>%
  summarise(value = sum(Trips), .mean = mean(value))

However, we want a more general approach that will work with all the forecasting methods discussed in this chapter. So we will use the reconcile() function to specify how we want to compute coherent forecasts.

tourism_states %>%
  model(ets = ETS(Trips)) %>%
  reconcile(bu = bottom_up(ets)) %>%
  forecast()

The reconcile() step has created a new “model” to produce bottom-up forecasts. The fable object contains the ets forecasts as well as the coherent bu forecasts, for the 8 states and the national aggregate. At the state level, these forecasts are identical, but the national ets forecasts will be different from the national bu forecasts.

For bottom-up forecasting, this is rather inefficient as we are not interested in the ETS model for the national total, and the resulting fable contains a lot of duplicates.

The above code illustrates the general workflow for hierarchical and grouped forecasts. We use the following pipeline of functions.

data %>% aggregate_key() %>% model() %>% reconcile() %>% forecast()
  1. Begin with a tsibble object (here labelled data) containing the individual bottom-level series.
  2. Define in aggregate_key() the aggregation structure and build a tsibble object that also contains the aggregate series.
  3. Identify a model() for each series, at all levels of aggregation.
  4. Specify in reconcile() how the coherent forecasts are to be generated from the selected models.
  5. Use the forecast() function to generate forecasts for the whole aggregation structure.

3.2.2 Top-down approaches

Top-down approaches involve first generating forecasts for the Total series \(y_t\), and then disaggregating these down the hierarchy. This only works with strictly hierarchical data, not with grouped or mixed aggregation structures.

Top-down forecasts can be generated using top_down() within the reconcile() function.

3.2.3 Average historical proportions

\(p_j=\frac{1}{T}\sum_{t=1}^{T}\frac{y_{j,t}}{{y_t}}\)

This approach is implemented in the top_down() function by setting method = “average_proportions”.

3.2.4 Proportions of the historical averages

\(p_j={\sum_{t=1}^{T}\frac{y_{j,t}}{T}}\Big/{\sum_{t=1}^{T}\frac{y_t}{T}}\)

This approach is implemented in the top_down() function by setting method = “proportion_averages”.

In general, these approaches seem to produce quite reliable forecasts for the aggregate levels and they are useful with low count data. On the other hand, one disadvantage is the loss of information due to aggregation. Using such top-down approaches, we are unable to capture and take advantage of individual series characteristics such as time dynamics, special events, different seasonal patterns, etc.

3.2.5 Forecast proportions

Because historical proportions used for disaggregation do not take account of how those proportions may change over time, top-down approaches based on historical proportions tend to produce less accurate forecasts at lower levels of the hierarchy than bottom-up approaches. To address this issue, proportions based on forecasts rather than historical data can be used.

\(p_j=\prod^{K-1}_{\ell=0}\frac{\hat{y}_{j,h}^{(\ell)}}{\hat{S}_{j,h}^{(\ell+1)}}\)

This approach is implemented in the top_down() function by setting method = “forecast_proportions”. Because this approach tends to work better than other top-down methods, it is the default choice in the top_down() function when no method argument is specified.

One disadvantage of all top-down approaches, is that they do not produce unbiased coherent forecasts even if the base forecasts are unbiased.

3.2.6 Middle-out approach

The middle-out approach combines bottom-up and top-down approaches. Again, it can only be used for strictly hierarchical aggregation structures.

This approach is implemented in the middle_out() function by specifying the appropriate middle level via the level argument and selecting the top-down approach with the method argument.

3.3 Forecast reconciliation

https://otexts.com/fpp3/reconciliation.html

3.3.1 The MinT optimal reconciliation approach

We refer to this as the MinT (or Minimum Trace) optimal reconciliation approach . MinT is implemented by min_trace() within the reconcile() function.

To use this in practice, we need to estimate \(W_h\). Four simplifying approximations:

  1. min_trace() by setting method = "ols"
  2. min_trace() by setting method = "wls_var"
  3. min_trace() by setting method = "wls_struct"
  4. min_trace() by setting method = "mint_cov"; for cases where the number of bottom-level series \(m\) is large compared to the length of the series \(T\), this is not a good estimator. Instead we use a shrinkage estimator which shrinks the sample covariance to a diagonal matrix. method = "mint_shrink".

the optimal reconciliation forecasts are generated using all the information available within a hierarchical or a grouped structure. This is important, as particular aggregation levels or groupings may reveal features of the data that are of interest to the user and are important to be modelled. These features may be completely hidden or not easily identifiable at other levels.

3.4 Forecasting Australian domestic tourism

We use the data up to the end of 2015 as a training set, withholding the final two years (eight quarters, 2016Q1–2017Q4) as a test set for evaluation. The code below demonstrates the full workflow for generating coherent forecasts using the bottom-up, OLS and MinT methods.

tourism_full <- tourism %>%
  aggregate_key((State / Region) * Purpose, Trips = sum(Trips))
fit <- tourism_full %>%
  filter(year(Quarter) <= 2015) %>%
  model(base = ETS(Trips)) %>%
  reconcile(
    bu = bottom_up(base),
    ols = min_trace(base, method = "ols"),
    mint = min_trace(base, method = "mint_shrink"),
  )

Here, fit contains the base ETS model for each series in tourism_full, along with the three methods for producing coherent forecasts as specified in the reconcile() function.

fc <- fit %>% forecast(h = "2 years")

Passing fit into forecast() generates base and coherent forecasts across all the series in the aggregation structure.

We plot the four point forecasts for the overnight trips for the Australian total, the states, and the purposes of travel, along with the actual observations of the test set. To make it easier to see the differences, we have included only the last five years of the training data, and have omitted the prediction intervals. In most panels, the increase in overnight trips, especially in the second half of the test set, is higher than what is predicted by the point forecasts. This is particularly noticeable for the mainland eastern states of ACT, New South Wales, Queensland and Victoria, and across all purposes of travel.

fc %>%
  filter(is_aggregated(Region), is_aggregated(Purpose)) %>%
  autoplot(
    tourism_full %>% filter(year(Quarter) >= 2011),
    level = NULL
  ) +
  labs(y = "Trips ('000)") +
  facet_wrap(vars(State), scales = "free_y")

fc %>%
  filter(is_aggregated(State), !is_aggregated(Purpose)) %>%
  autoplot(
    tourism_full %>% filter(year(Quarter) >= 2011),
    level = NULL
  ) +
  labs(y = "Trips ('000)") +
  facet_wrap(vars(Purpose), scales = "free_y")

The accuracy of the forecasts over the test set can be evaluated using the accuracy() function.

fc %>%
  filter(is_aggregated(State), is_aggregated(Purpose)) %>%
  accuracy(
    data = tourism_full,
    measures = list(rmse = RMSE, mase = MASE)
  ) %>%
  group_by(.model) %>%
  summarise(rmse = mean(rmse), mase = mean(mase))
fc %>%
  filter(is_aggregated(State), !is_aggregated(Purpose)) %>%
  accuracy(
    data = tourism_full,
    measures = list(rmse = RMSE, mase = MASE)
  ) %>%
  group_by(.model) %>%
  summarise(rmse = mean(rmse), mase = mean(mase))

https://otexts.com/fpp3/tourism.html

Reconciling the base forecasts using OLS and MinT results in more accurate forecasts compared to the bottom-up approach. This result is commonly observed in applications as reconciliation approaches use information from all levels of the structure, resulting in more accurate coherent forecasts compared to the older methods which use limited information. Furthermore, reconciliation usually improves the incoherent base forecasts for almost all levels.

3.5 Reconciled distributional forecasts

https://otexts.com/fpp3/rec-prob.html

3.5.1 Forecasting Australian prison population

Returning to the Australian prison population data, we will compare the forecasts from bottom-up and MinT methods applied to base ETS models, using a test set comprising the final two years or eight quarters 2015Q1–2016Q4 of the available data.

prison <- readr::read_csv("https://OTexts.com/fpp3/extrafiles/prison_population.csv") %>%
  mutate(Quarter = yearquarter(Date)) %>%
  select(-Date)  %>%
  as_tsibble(key = c(Gender, Legal, State, Indigenous), index = Quarter) %>%
  relocate(Quarter)

-- Column specification --------------------------------------------
cols(
  Date = col_date(format = ""),
  State = col_character(),
  Gender = col_character(),
  Legal = col_character(),
  Indigenous = col_character(),
  Count = col_double()
)
prison_gts <- prison %>%
  aggregate_key(Gender * Legal * State, Count = sum(Count)/1e3)

fit <- prison_gts %>%
  filter(year(Quarter) <= 2014) %>%
  model(base = ETS(Count)) %>%
  reconcile(
    bottom_up = bottom_up(base),
    MinT = min_trace(base, method = "mint_shrink")
  )
fc <- fit %>% forecast(h = 8)
fc %>%
  filter(is_aggregated(State), is_aggregated(Gender), is_aggregated(Legal)) %>%
  autoplot(prison_gts, alpha = 0.7, level = 90) +
  labs(y = "Number of prisoners ('000)",
       title = "Australian prison population (total)")

The three sets of forecasts for the aggregate Australian prison population. The base and bottom-up forecasts from the ETS models seem to underestimate the trend over the test period. The MinT approach combines information from all the base forecasts in the aggregation structure; in this case, the base forecasts at the top level are adjusted upwards.

The MinT reconciled prediction intervals are much tighter than the base forecasts, due to MinT being based on an estimator that minimizes variances. The base forecast distributions are also incoherent, and therefore carry with them the extra uncertainty of the incoherency error.

MinT and base forecasts at various levels of aggregation. To make it easier to see the effect, we only show the last five years of training data. In general, MinT adjusts the base forecasts in the direction of the test set, hence improving the forecast accuracy. There is no guarantee that MinT reconciled forecasts will be more accurate than the base forecasts for every series, but they will be more accurate on average

fc %>%
  filter(
    .model %in% c("base", "MinT"),
    !is_aggregated(State), is_aggregated(Legal), is_aggregated(Gender)
  ) %>%
  autoplot(
    prison_gts %>% filter(year(Quarter) >= 2010),
    alpha = 0.7, level = 90
  ) +
  labs(title = "Prison population (by state)",
       y = "Number of prisoners ('000)") +
  facet_wrap(vars(State), scales = "free_y", ncol = 4) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Using the accuracy() function, we evaluate the forecast accuracy across the grouped structure. The code below evaluates the forecast accuracy for only the top-level national aggregate of the Australian prison population time series.

fc %>%
  filter(is_aggregated(State), is_aggregated(Gender), is_aggregated(Legal)) %>%
  accuracy(data = prison_gts, measures = list(
    mase = MASE,
    ss = skill_score(CRPS)
  )) %>%
  group_by(.model) %>%
  summarise(mase = mean(mase), sspc = mean(ss) * 100)

A low value of MASE indicates a good forecast, while a high value of the skill score indicates a good forecast.

The results show that the MinT reconciled forecasts improve on the accuracy of the base forecasts and are also more accurate than the bottom-up forecasts. As the MinT optimal reconciliation approach uses information from all levels in the structure, it generates more accurate forecasts than the traditional approaches (such as bottom-up) which use limited information.

---
title: "Forecasting: Principles and Practice 3"
author: "ARIMA, Dynamic regression models and Forecasting hierarchical and grouped time series"
output:
  html_notebook:
    toc: true
    toc_float: true
    toc_collapsed: false
    number_sections: true
    
toc_depth: 3
---

```{r}
library(fpp3)
```

<https://robjhyndman.com/seminars/uwa2017/>

<https://robjhyndman.com/seminars/isi2019workshop/>

<https://github.com/robjhyndman/ETC3550Slides/tree/fable>

<https://github.com/rstudio-conf-2020/time-series-forecasting/blob/master/materials/labs.R>

<http://www.gradaanwr.net/>

# ARIMA models

While exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data.

## Stationarity and differencing

A stationary time series is one whose statistical properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary --- the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary --- it does not matter when you observe it, it should look much the same at any point in time.

A time series with cyclic behaviour (but with no trend or seasonality) is stationary. This is because the cycles are not of a fixed length, so before we observe the series we cannot be sure where the peaks and troughs of the cycles will be.

In general, a stationary time series will have no predictable patterns in the long-term.

### Differencing

Transformations such as logarithms can help to stabilise the variance of a time series. Differencing can help stabilise the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.

For a stationary time series, the ACF will drop to zero relatively quickly, while the ACF of non-stationary data decreases slowly. Also, for non-stationary data, the value of $r_1$ is often large and positive.

```{r}
# Re-index based on trading days
google_stock <- gafa_stock %>%
  filter(Symbol == "GOOG", year(Date) >= 2015) %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)
# Filter the year of interest
google_2015 <- google_stock %>% filter(year(Date) == 2015)

google_2015 %>%
  mutate(diff_close = difference(Close)) %>%
  features(diff_close, ljung_box, lag = 10)
```

```{r}
p1 <- google_2015 %>% 
  ACF(Close) %>% 
  autoplot()
p2 <- google_2015 %>% 
  ACF(difference(Close)) %>% 
  autoplot()
```

```{r}
library(cowplot)
plot_grid(p1, p2, cols=2)
```

The ACF of the differenced Google stock price looks just like that of a white noise series. Only one autocorrelation is outside of the 95% limits, and the Ljung-Box Q\^∗ statistic has a p-value of 0.637 (for $h=10$). This suggests that the daily change in the Google stock price is essentially a random amount which is uncorrelated with that of previous days.

### Random walk model

Random walk models are widely used for non-stationary data, particularly financial and economic data. Random walks typically have:

-   long periods of apparent trends up or down
-   sudden and unpredictable changes in direction.

The forecasts from a random walk model are equal to the last observation, as future movements are unpredictable, and are equally likely to be up or down. Thus, the random walk model underpins naïve forecasts.

### Second-order differencing

Occasionally the differenced data will not appear to be stationary and it may be necessary to difference the data a second time to obtain a stationary series.

### Seasonal differencing

A seasonal difference is the difference between an observation and the previous observation from the same season.

Forecasts from this model are equal to the last observation from the relevant season. That is, this model gives seasonal naïve forecasts.

Sometimes it is necessary to take both a seasonal difference and a first difference to obtain stationary data.

```{r}
PBS %>%
  filter(ATC2 == "H02") %>%
  summarise(Cost = sum(Cost)/1e6) %>%
  transmute(
    `Sales ($million)` = Cost,
    `Log sales` = log(Cost),
    `Annual change in log sales` = difference(log(Cost), 12),
    `Doubly differenced log sales` = difference(difference(log(Cost), 12), 1)
  ) %>%
  pivot_longer(-Month, names_to="Type", values_to="Sales") %>%
  mutate(
    Type = factor(Type, levels = c(
      "Sales ($million)",
      "Log sales",
      "Annual change in log sales",
      "Doubly differenced log sales"))
  ) %>%
  ggplot(aes(x = Month, y = Sales)) +
  geom_line() +
  facet_grid(vars(Type), scales = "free_y") +
  labs(title = "Corticosteroid drug sales", y = NULL)
```

If the data have a strong seasonal pattern, we recommend that seasonal differencing be done first, because the resulting series will sometimes be stationary and there will be no need for a further first difference.

First differences are the change between one observation and the next. Seasonal differences are the change between one year to the next. Other lags are unlikely to make much interpretable sense and should be avoided.

### Unit root tests

One way to determine more objectively whether differencing is required is to use a unit root test. These are statistical hypothesis tests of stationarity that are designed for determining whether differencing is required.

we use the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. In this test, the null hypothesis is that the data are stationary, and we look for evidence that the null hypothesis is false. Consequently, small p-values (e.g., less than 0.05) suggest that differencing is required. The test can be computed using the unitroot_kpss() function.

```{r}
google_2015 %>%
  features(Close, unitroot_kpss)
```

The reported p-value is set to 0.01 if it is less than that, or to 0.1 if it is greater than that. In this case, the test statistic (3.56) is much bigger than the 1% critical value, so the p-value is less than 0.01, indicating that the null hypothesis is rejected. That is, the data are not stationary. We can difference the data, and apply the test again.

```{r}
google_2015 %>%
  mutate(diff_close = difference(Close)) %>%
  features(diff_close, unitroot_kpss)
```

This time, the test statistic is tiny, and well within the range we would expect for stationary data, so the p-value is greater than 0.1. We can conclude that the differenced data appear stationary.

This process of using a sequence of KPSS tests to determine the appropriate number of first differences is carried out using the unitroot_ndiffs() feature.

```{r}
google_2015 %>%
  features(Close, unitroot_ndiffs)
```

As we saw from the KPSS tests above, one difference is required to make the google_2015 data stationary.

A similar feature for determining whether seasonal differencing is required is unitroot_nsdiffs(), which uses the measure of seasonal strength introduced in Section 4.3 to determine the appropriate number of seasonal differences required. No seasonal differences are suggested if $F_S<0.64$, otherwise one seasonal difference is suggested.

```{r}
aus_total_retail <- aus_retail %>%
  summarise(Turnover = sum(Turnover))
aus_total_retail %>%
  mutate(log_turnover = log(Turnover)) %>%
  features(log_turnover, unitroot_nsdiffs)
```

```{r}
aus_total_retail %>%
  mutate(log_turnover = difference(log(Turnover), 12)) %>%
  features(log_turnover, unitroot_ndiffs)
```

Because unitroot_nsdiffs() returns 1 (indicating one seasonal difference is required), we apply the unitroot_ndiffs() function to the seasonally differenced data. These functions suggest we should do both a seasonal difference and a first difference.

## Backshift notation

<https://otexts.com/fpp3/backshift.html>

## Autoregressive models

In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable. The term autoregression indicates that it is a regression of the variable against itself.

$y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + \dots + \phi_{p}y_{t-p} + \varepsilon_{t}$

We refer to this as an AR(p) model, an autoregressive model of order p.

## Moving average models

Rather than using past values of the forecast variable in a regression, a moving average model uses past forecast errors in a regression-like model.

$y_{t} = c + \varepsilon_t + \theta_{1}\varepsilon_{t-1} + \theta_{2}\varepsilon_{t-2} + \dots + \theta_{q}\varepsilon_{t-q}$

We refer to this as an MA(q) model, a moving average model of order\
q. Of course, we do not observe the values of $ε_t$, so it is not really a regression in the usual sense.

Notice that each value of $y_t$ can be thought of as a weighted moving average of the past few forecast errors. However, moving average models should not be confused with the moving average smoothing. A moving average model is used for forecasting future values, while moving average smoothing is used for estimating the trend-cycle of past values.

## Non-seasonal ARIMA models

If we combine differencing with autoregression and a moving average model, we obtain a non-seasonal ARIMA model. ARIMA is an acronym for AutoRegressive Integrated Moving Average (in this context, "integration" is the reverse of differencing). The full model can be written as

```{=tex}
\begin{equation}
  y'_{t} = c + \phi_{1}y'_{t-1} + \cdots + \phi_{p}y'_{t-p}
     + \theta_{1}\varepsilon_{t-1} + \cdots + \theta_{q}\varepsilon_{t-q} + \varepsilon_{t},
\end{equation}
```
We call this an ARIMA(p,d,q) model, where:

p=order of the autoregressive part; d=degree of first differencing involved; q=order of the moving average part.

Special cases of ARIMA models:

|                        |                               |
|:-----------------------|:------------------------------|
| White noise            | ARIMA(0,0,0)                  |
| Random walk            | ARIMA(0,1,0) with no constant |
| Random walk with drift | ARIMA(0,1,0) with a constant  |
| Autoregression         | ARIMA(p,0,0)                  |
| Moving average         | ARIMA(0,0,q)                  |

Backshift notation:

\begin{equation}
  \begin{array}{c c c c}
    (1-\phi_1B - \cdots - \phi_p B^p) & (1-B)^d y_{t} &= &c + (1 + \theta_1 B + \cdots + \theta_q B^q)\varepsilon_t\\
    {\uparrow} & {\uparrow} & &{\uparrow}\\
    \text{AR($p$)} & \text{$d$ differences} & & \text{MA($q$)}\\
  \end{array}
\end{equation}

### Egyptian exports

Egyptian exports as a percentage of GDP from 1960 to 2017.
```{r}
global_economy %>%
  filter(Code == "EGY") %>%
  autoplot(Exports) +
  labs(y = "Percentage of GDP", title = "Egyptian Exports")
```
The following R code selects a non-seasonal ARIMA model automatically.
```{r}
fit <- global_economy %>%
  filter(Code == "EGY") %>%
  model(ARIMA(Exports))
report(fit)
```
This is an ARIMA(2,0,1) model:

$y_t = 2.56
         + 1.68 y_{t-1}
          -0.80 y_{t-2}
          -0.69 \varepsilon_{t-1}
          + \varepsilon_{t}$

Forecasts from the model are shown in Figure 9.8. Notice how they have picked up the cycles evident in the Egyptian economy over the last few decades.
```{r}
fit %>% forecast(h=10) %>% autoplot(global_economy)
```
### ACF and PACF plots

It is usually not possible to tell, simply from a time plot, what values of p and q are appropriate for the data. However, it is sometimes possible to use the ACF plot, and the closely related PACF plot, to determine appropriate values for p and q.

```{r}
global_economy %>% filter(Code == "EGY") %>% ACF(Exports) %>% autoplot()
```
```{r}
global_economy %>% filter(Code == "EGY") %>% PACF(Exports) %>% autoplot()
```
If the data are from an ARIMA(p,d,0) or ARIMA(0,d,q) model, then the ACF and PACF plots can be helpful in determining the value of p or q. If p and q are both positive, then the plots do not help in finding suitable values of p and q.

The data may follow an ARIMA(p,d,0) model if the ACF and PACF plots of the differenced data show the following patterns:

- the ACF is exponentially decaying or sinusoidal;
- there is a significant spike at lag p in the PACF, but none beyond lag p.

The data may follow an ARIMA(0,d,q) model if the ACF and PACF plots of the differenced data show the following patterns:

- the PACF is exponentially decaying or sinusoidal;
- there is a significant spike at lag q in the ACF, but none beyond lag q.

we see that there is a decaying sinusoidal pattern in the ACF, and the PACF shows the last significant spike at lag 4. This is what you would expect from an ARIMA(4,0,0) model.
```{r}
fit2 <- global_economy %>%
  filter(Code == "EGY") %>%
  model(ARIMA(Exports ~ pdq(4,0,0)))
report(fit2)
```
This model is only slightly worse than the ARIMA(2,0,1) model identified by ARIMA() (with an AICc value of 294.70 compared to 294.29).

We can also specify particular values of pdq() that ARIMA can search for. For example, to find the best ARIMA model with $p\in\{1,2,3\}$, $q\in\{0,1,2\}$ and d=1, you could use ARIMA(y ~ pdq(p=1:3, d=1, q=0:2)).

## Estimation and order selection


### Maximum likelihood estimation

When fable estimates the ARIMA model, it uses maximum likelihood estimation (MLE). This technique finds the values of the parameters which maximise the probability of obtaining the data that we have observed. For ARIMA models, MLE is similar to the least squares estimates that would be obtained by minimising

$\sum_{t=1}^T\varepsilon_t^2.$

### Information Criteria

Akaike’s Information Criterion (AIC), which was useful in selecting predictors for regression, is also useful for determining the order of an ARIMA model.

Good models are obtained by minimising the AIC, AICc or BIC. Our preference is to use the AICc.

t is important to note that these information criteria tend not to be good guides to selecting the appropriate order of differencing (d) of a model.

## ARIMA modelling in fable

https://otexts.com/fpp3/arima-r.html

### Modelling procedure

When fitting an ARIMA model to a set of (non-seasonal) time series data, the following procedure provides a useful general approach.

1. Plot the data and identify any unusual observations.
2. If necessary, transform the data (using a Box-Cox transformation) to stabilise the variance.
3. If the data are non-stationary, take first differences of the data until the data are stationary.
4. Examine the ACF/PACF: Is an ARIMA(p,d,0) or ARIMA(0,d,q) model appropriate?
5. Try your chosen model(s), and use the AICc to search for a better model.
6. Check the residuals from your chosen model by plotting the ACF of the residuals, and doing a portmanteau test of the residuals. If they do not look like white noise, try a modified model.
7. Once the residuals look like white noise, calculate forecasts.

###Example: Central African Republic exports

```{r}
global_economy %>%
  filter(Code == "CAF") %>%
  autoplot(Exports)
```
1. The time plot shows some non-stationarity, with an overall decline. The improvement in 1994 was due to a new government which overthrew the military junta and had some initial success, before unrest caused further economic decline.

2. There is no evidence of changing variance, so we will not do a Box-Cox transformation.

3. To address the non-stationarity, we will take a first difference of the data

```{r}
global_economy %>%
  filter(Code == "CAF") %>%
  gg_tsdisplay(difference(Exports), plot_type='partial')
```
4. The PACF shown in Figure 9.14 is suggestive of an AR(2) model; so an initial candidate model is an ARIMA(2,1,0). The ACF suggests an MA(3) model; so an alternative candidate is an ARIMA(0,1,3).

5. We fit both an ARIMA(2,1,0) and an ARIMA(0,1,3) model along with two automated model selections, one using the default stepwise procedure, and one working harder to search a larger model space.
```{r}
caf_fit <- global_economy %>%
  filter(Code == "CAF") %>%
  model(
    arima210 = ARIMA(Exports ~ pdq(2,1,0)),
    arima013 = ARIMA(Exports ~ pdq(0,1,3)),
    stepwise = ARIMA(Exports),
    search = ARIMA(Exports, stepwise=FALSE)
  )
caf_fit
```
```{r}
glance(caf_fit) %>% arrange(AICc)
```
The four models have almost the same AICc values. Of the models fitted, the full search has found that an ARIMA(3,1,0) gives the lowest AICc value, closely followed by the ARIMA(2,1,0) and ARIMA(0,1,3) — the latter two being the models that we guessed from the ACF and PACF plots. The automated stepwise selection has identified an ARIMA(2,1,2) model, which has the highest AICc value of the four models.

6. The ACF plot of the residuals from the ARIMA(3,1,0) model shows that all autocorrelations are within the threshold limits, indicating that the residuals are behaving like white noise.
```{r}
caf_fit %>%
  select(search) %>%
  gg_tsresiduals()
```
A portmanteau test returns a large p-value, also suggesting that the residuals are white noise.
```{r}
augment(caf_fit) %>%
  filter(.model=='search') %>%
  features(.innov, ljung_box, lag = 10, dof = 3)
```
7. Forecasts
```{r}
caf_fit %>%
  forecast(h=5) %>%
  filter(.model=='search') %>%
  autoplot(global_economy)
```
Note that the mean forecasts look very similar to what we would get with a random walk (equivalent to an ARIMA(0,1,0)). The extra work to include AR and MA terms has made little difference to the point forecasts in this example, although the prediction intervals are much narrower than for a random walk model.

## Forecasting

https://otexts.com/fpp3/arima-forecasting.html

## Seasonal ARIMA models

A seasonal ARIMA model is formed by including additional seasonal terms in the ARIMA models we have seen so far. It is written as follows:

$\underbrace{(p, d, q)}$ $\underbrace{(P, D, Q)_{m}}$

### ACF/PACF

The seasonal part of an AR or MA model will be seen in the seasonal lags of the PACF and ACF. For example, an $ARIMA(0,0,0)(0,0,1)_12$  model will show:

- a spike at lag 12 in the ACF but no other significant spikes;
- exponential decay in the seasonal lags of the PACF (i.e., at lags 12, 24, 36, …).

Similarly, an $ARIMA(0,0,0)(1,0,0)_12$ model will show:

- exponential decay in the seasonal lags of the ACF;
- a single significant spike at lag 12 in the PACF.

### Example: Monthly US leisure and hospitality employment

monthly US employment data for leisure and hospitality jobs from January 2000 to September 2019:
```{r}
leisure <- us_employment %>%
  filter(Title == "Leisure and Hospitality", year(Month) > 2000) %>%
  mutate(Employed = Employed/1000) %>%
  select(Month, Employed)
autoplot(leisure, Employed) +
  labs(y="Millions of people")
```
 We will first take a seasonal difference
```{r}
leisure %>%
  gg_tsdisplay(difference(Employed, 12), plot_type='partial', lag=36) +
  labs(y="Seasonally differenced")
```
```{r}
leisure %>%
  gg_tsdisplay(difference(Employed, 12) %>% difference(), plot_type='partial', lag=36) +
  labs(y="Double differenced")
```
The significant spike at lag 2 in the ACF suggests a non-seasonal MA(2) component. The significant spike at lag 12 in the ACF suggests a seasonal MA(1) component. Consequently, we begin with an $ARIMA(0,1,2)(0,1,1)_12$ model, indicating a first difference, a seasonal difference, and non-seasonal MA(2) and seasonal MA(1) component.  If we had started with the PACF, we may have selected an $ARIMA(2,1,0)(0,1,1)_12$ model.
```{r}
fit <- leisure %>%
  model(
    arima012011 = ARIMA(Employed ~ pdq(0,1,2) + PDQ(0,1,1)),
    arima210011 = ARIMA(Employed ~ pdq(2,1,0) + PDQ(0,1,1)),
    auto = ARIMA(Employed, stepwise=FALSE, approximation=FALSE)
  )
fit
```
```{r}
glance(fit) %>% arrange(AICc)
```
The three fitted models have similar AICc values, with the automatically selected model being a little better. Our second “guess” of $ARIMA(2,1,0)(0,1,1)_12$   turned out to be very close to the automatically selected model of $ARIMA(2,1,0)(1,1,1)_12$.

```{r}
fit %>% select(auto) %>% gg_tsresiduals(lag=36)
```
One small but significant spike (at lag 11) out of 36 is still consistent with white noise. To be sure, we use a Ljung-Box test, which has a large p-value, confirming that the residuals are similar to white noise. Note that the alternative models also pass this test.
```{r}
augment(fit) %>% features(.innov, ljung_box, lag=24, dof=4)
```
Thus, we now have a seasonal ARIMA model that passes the required checks and is ready for forecasting. Forecasts from the model for the next three years 
```{r}
fit %>% forecast(h=36) %>% filter(.model=='auto') %>% autoplot(leisure)
```
### Example: Corticosteroid drug sales in Australia

For our second example, we will try to forecast monthly corticosteroid drug sales in Australia. These are known as H02 drugs under the Anatomical Therapeutic Chemical classification scheme.
```{r}
h02 <- PBS %>%
  filter(ATC2 == "H02") %>%
  summarise(Cost = sum(Cost)/1e6)
h02 %>%
  mutate(log(Cost)) %>%
  pivot_longer(-Month) %>%
  ggplot(aes(x = Month, y = value)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y") +
  labs(y="", title="Cortecosteroid drug scripts (H02)")
```
There is a small increase in the variance with the level, so we take logarithms to stabilise the variance.

The data are strongly seasonal and obviously non-stationary, so seasonal differencing will be used. 

 It is not clear at this point whether we should do another difference or not. We decide not to, but the choice is not obvious.

The last few observations appear to be different (more variable) from the earlier data. This may be due to the fact that data are sometimes revised when earlier sales are reported late.
```{r}
h02 %>% gg_tsdisplay(difference(log(Cost), 12), plot_type='partial', lag_max = 24)
```
In the plots of the seasonally differenced data, there are spikes in the PACF at lags 12 and 24, but nothing at seasonal lags in the ACF. This may be suggestive of a seasonal AR(2) term. In the non-seasonal lags, there are three significant spikes in the PACF, suggesting a possible AR(3) term. The pattern in the ACF is not indicative of any simple model.

Consequently, this initial analysis suggests that a possible model for these data is an $ARIMA(3,0,0)(2,1,0)_12$. 
```{r}
fit <- h02 %>%
  model(ARIMA(log(Cost) ~ 0 + pdq(3,0,1) + PDQ(0,1,2)))
report(fit)
```
```{r}
fit %>% gg_tsresiduals(lag_max=36)
```
```{r}
augment(fit) %>%
  features(.innov, ljung_box, lag = 36, dof = 6)
```
The innovation residuals from this model are shown. There are a few significant spikes in the ACF, and the model fails the Ljung-Box test. The model can still be used for forecasting, but the prediction intervals may not be accurate due to the correlated residuals.

Sometimes it is just not possible to find a model that passes all of the tests.

None of the models considered here pass all of the residual tests. In practice, we would normally use the best model we could find, even if it did not pass all of the tests.

Forecasts from the $ARIMA(3,0,1)(0,1,2)_12$ model (which has the second lowest RMSE value on the test set, and the best AICc value amongst models with only seasonal differencing) 

```{r}
h02 %>%
  model(ARIMA(log(Cost) ~ 0 + pdq(3,0,1) + PDQ(0,1,2))) %>%
  forecast() %>%
  autoplot(h02) +
    labs(y="H02 sales (million scripts)")
```
## ARIMA vs ETS

https://otexts.com/fpp3/arima-ets.html


### Comparing ARIMA() and ETS() on non-seasonal data

We can use time series cross-validation to compare an ARIMA model and an ETS model.
```{r}
aus_economy <- global_economy %>% filter(Code == "AUS") %>%
  mutate(Population = Population/1e6)

aus_economy %>%
  slice(-n()) %>%
  stretch_tsibble(.init = 10) %>%
  model(
    ETS(Population),
    ARIMA(Population)
  ) %>%
  forecast(h = 1) %>%
  accuracy(aus_economy)
```
In this case the ETS model has higher accuracy on the cross-validated performance measures. Below we generate and plot forecasts for the next 5 years generated from an ETS model.
```{r}
aus_economy %>%
  model(ETS(Population)) %>%
  forecast(h = "5 years") %>%
  autoplot(aus_economy)
```
### Comparing ARIMA() and ETS() on seasonal data

we want to compare seasonal ARIMA and ETS models applied to the quarterly cement production data (from aus_production). Because the series is relatively long, we can afford to use a training and a test set rather than time series cross-validation. The advantage is that this is much faster. We create a training set from the beginning of 1988 to the end of 2007 and select an ARIMA and an ETS model using the ARIMA() and ETS() functions.
```{r}
cement <- aus_production %>% filter_index("1988 Q1" ~ .)
train <- cement %>% filter_index(. ~ "2007 Q4")
```
```{r}
fit_arima <- train %>% model(ARIMA(Cement))
report(fit_arima)
```
The ARIMA model does well in capturing all the dynamics in the data as the residuals seem to be white noise.
```{r}
gg_tsresiduals(fit_arima, lag_max = 16)
```
```{r}
augment(fit_arima) %>%
  features(.innov, ljung_box, lag = 16, dof = 6)
```
The output below also shows the ETS model selected and estimated by ETS(). This model also does well in capturing all the dynamics in the data, as the residuals similarly appear to be white noise.

```{r}
fit_ets <- train %>% model(ETS(Cement))
report(fit_ets)
fit_ets %>% gg_tsresiduals(lag_max = 16)
```
```{r}
augment(fit_ets) %>%
  features(.innov, ljung_box, lag = 16, dof = 6)
```
The output below evaluates the forecasting performance of the two competing models over the test set. In this case the ARIMA model seems to be the slightly more accurate model based on the test set RMSE, MAPE and MASE.
```{r}
# Generate forecasts and compare accuracy over the test set
bind_rows(
  fit_arima %>% accuracy(),
  fit_ets %>% accuracy(),
  fit_arima %>% forecast(h = "2 years 6 months") %>%
    accuracy(cement),
  fit_ets %>% forecast(h = "2 years 6 months") %>%
    accuracy(cement)
)
```
Below we generate and plot forecasts from the ARIMA model for the next 3 years.
```{r}
cement %>%
  model(ARIMA(Cement)) %>%
  forecast(h="3 years") %>%
  autoplot(cement)
```
# Dynamic regression models

In this chapter, we consider how to extend ARIMA models in order to allow other information to be included in the models.
 
In this chapter, we will allow the errors from a regression to contain autocorrelation. To emphasise this change in perspective, we will replace $ε_t$ with $η_t$ in the equation. The error series $η_t$ is assumed to follow an ARIMA model. For example, if $η_t$ follows an ARIMA(1,1,1) model, we can write

\begin{align*}
  y_t &= \beta_0 + \beta_1 x_{1,t} + \dots + \beta_k x_{k,t} + \eta_t,\\
      & (1-\phi_1B)(1-B)\eta_t = (1+\theta_1B)\varepsilon_t,
\end{align*}

where $ε_t$ is a white noise series.

Notice that the model has two error terms here — the error from the regression model, which we denote by $η_t$, and the error from the ARIMA model, which we denote by $ε_t$. Only the ARIMA model errors are assumed to be white noise.

## Estimation

When we estimate the parameters from the model, we need to minimise the sum of squared $ε_t$ values. 

An important consideration when estimating a regression with ARMA errors is that all of the variables in the model must first be stationary. 

We therefore first difference the non-stationary variables in the model. It is often desirable to maintain the form of the relationship between $y_t$ and the predictors, and consequently it is common to difference all of the variables if any of them need differencing. The resulting model is then called a “model in differences,” as distinct from a “model in levels,” which is what is obtained when the original data are used without differencing.

## Regression with ARIMA errors using fable

The fable function ARIMA() will fit a regression model with ARIMA errors if exogenous regressors are included in the formula.

If differencing is specified, then the differencing is applied to all variables in the regression model before the model is estimated. 

    ARIMA(y ~ x + pdq(1,1,0))
    
The ARIMA() function can also be used to select the best ARIMA model for the errors. This is done by not specifying the pdq() special. If differencing is required, then all variables are differenced during the estimation process, although the final model will be expressed in terms of the original variables.

The AICc is calculated for the final model, and this value can be used to determine the best predictors. That is, the procedure should be repeated for all subsets of predictors to be considered, and the model with the lowest AICc value selected.

### Example: US Personal Consumption and Income

Quarterly changes in personal consumption expenditure and personal disposable income from 1970 to 2016 Q3. We would like to forecast changes in expenditure based on changes in income.

```{r}
us_change %>%
  gather("var", "value", Consumption, Income) %>%
  ggplot(aes(x = Quarter, y = value)) +
  geom_line() +
  facet_grid(vars(var), scales = "free_y") +
  labs(y = "",
       title = "Quarterly changes in US consumption and personal income")
```
```{r}
fit <- us_change %>%
  model(ARIMA(Consumption ~ Income))
report(fit)
```
The data are clearly already stationary (as we are considering percentage changes rather than raw expenditure and income), so there is no need for any differencing. The fitted model is

\begin{align*}
  y_t &= 0.595 +
         0.198 x_t + \eta_t, \\
  \eta_t &= 0.707 \eta_{t-1} + \varepsilon_t
        -0.617 \varepsilon_{t-1} +
        0.207 \varepsilon_{t-2},\\
  \varepsilon_t &\sim \text{NID}(0,0.311).
\end{align*}

We can recover estimates of both the $η_t$ and $ε_t$ series using the residuals() function.
```{r}
bind_rows(
    `Regression Errors` = as_tibble(residuals(fit, type = "regression")),
    `ARIMA Errors` = as_tibble(residuals(fit, type = "innovation")),
    .id = "type"
  ) %>%
  ggplot(aes(x = Quarter, y = .resid)) +
  geom_line() +
  facet_grid(vars(type), scales = "free_y") +
  labs(y = "")
```
It is the ARIMA errors (the innovations) that should resemble a white noise series.
```{r}
fit %>% gg_tsresiduals()
```
```{r}
augment(fit) %>%
  features(.innov, ljung_box, dof = 5, lag = 8)
```
The residuals (i.e., the ARIMA errors) are not significantly different from white noise.

## Forecasting

To forecast using a regression model with ARIMA errors, we need to forecast the regression part of the model and the ARIMA part of the model, and combine the results.

As with ordinary regression models, in order to obtain forecasts we first need to forecast the predictors. When the predictors are known into the future (e.g., calendar-related variables such as time, day-of-week, etc.), this is straightforward. But when the predictors are themselves unknown, we must either model them separately, or use assumed future values for each predictor.

### US Personal Consumption and Income

We will calculate forecasts for the next eight quarters assuming that the future percentage changes in personal disposable income will be equal to the mean percentage change from the last forty years.
```{r}
us_change_future <- new_data(us_change, 8) %>%
  mutate(Income = mean(us_change$Income))
forecast(fit, new_data = us_change_future) %>%
  autoplot(us_change) +
  labs(y = "Percentage change")
```
The prediction intervals for this model are narrower than if we had fitted an ARIMA model without covariates, because we are now able to explain some of the variation in the data using the income predictor.

It is important to realise that the prediction intervals from regression models (with or without ARIMA errors) do not take into account the uncertainty in the forecasts of the predictors. So they should be interpreted as being conditional on the assumed (or estimated) future values of the predictor variables.

### Forecasting electricity demand

Daily electricity demand can be modelled as a function of temperature. The higher demand on cold and hot days is reflected in the U-shape
```{r}
vic_elec_daily <- vic_elec %>%
  filter(year(Time) == 2014) %>%
  index_by(Date = date(Time)) %>%
  summarise(
    Demand = sum(Demand) / 1e3,
    Temperature = max(Temperature),
    Holiday = any(Holiday)
  ) %>%
  mutate(Day_Type = case_when(
    Holiday ~ "Holiday",
    wday(Date) %in% 2:6 ~ "Weekday",
    TRUE ~ "Weekend"
  ))

vic_elec_daily %>%
  ggplot(aes(x = Temperature, y = Demand, colour = Day_Type)) +
  geom_point() +
  labs(y = "Electricity demand (GW)", x = "Maximum daily temperature")
```
The data stored as vic_elec_daily includes total daily demand, daily maximum temperatures, and an indicator variable for if that day is a public holiday. 
```{r}
vic_elec_daily %>%
  select(Date:Temperature) %>% 
  pivot_longer(-Date, names_to = "type", values_to = "data") %>% 
  ggplot(aes(x = Date, y = data)) +
  geom_line() +
  facet_grid(vars(type), scales = "free_y") +
  labs(y = "Value")
```
we fit a quadratic regression model with ARMA errors using the ARIMA() function. The model also includes an indicator variable for if the day was a working day or not.
```{r}
fit <- vic_elec_daily %>%
  model(ARIMA(Demand ~ Temperature + I(Temperature^2) + (Day_Type == "Weekday")))
fit %>% gg_tsresiduals()
```
```{r}
augment(fit) %>%
  features(.innov, ljung_box, dof = 8, lag = 14)
```
There is clear heteroskedasticity in the residuals, with higher variance in January and February, and lower variance in May. The model also has some significant autocorrelation in the residuals, and the histogram of the residuals shows long tails. All of these issues with the residuals may affect the coverage of the prediction intervals, but the point forecasts should still be ok.

Using the estimated model we forecast 14 days ahead starting from Thursday 1 January 2015 (a non-work-day being a public holiday for New Years Day). In this case, we could obtain weather forecasts from the weather bureau for the next 14 days. But for the sake of illustration, we will use scenario based forecasting where we set the temperature for the next 14 days to a constant 26 degrees.
```{r}
vic_elec_future <- new_data(vic_elec_daily, 14) %>%
  mutate(
    Temperature = 26,
    Holiday = c(TRUE, rep(FALSE, 13)),
    Day_Type = case_when(
      Holiday ~ "Holiday",
      wday(Date) %in% 2:6 ~ "Weekday",
      TRUE ~ "Weekend"
    )
  )
forecast(fit, vic_elec_future) %>%
  autoplot(vic_elec_daily) + labs(y = "Electricity demand (GW)")
```
The point forecasts look reasonable for the first two weeks of 2015. The slow down in electricity demand at the end of 2014 (due to many people taking summer vacations) has caused the forecasts for the next two weeks to show similarly low demand values.

## Stochastic and deterministic trends

There are two different ways of modelling a linear trend. A deterministic trend is obtained using the regression model

$y_t = \beta_0 + \beta_1 t + \eta_t$

where $η_t$ is an ARMA process. A stochastic trend is obtained using the model

$y_t = \beta_0 + \beta_1 t + \eta_t$

where $η_t$ is an ARIMA process with $d=1$. In the latter case, we can difference both sides so that $y_t' = \beta_1 + \eta_t'$, where $\eta_t′$ is an ARMA process

$y_t = y_{t-1} + \beta_1 + \eta_t'.$

### Example: Air transport passengers Australia

```{r}
aus_airpassengers %>%
  autoplot(Passengers) +
  labs(x = "Year", y = "millions of people",
       title = "Total annual air passengers")
```
It shows the total number of passengers for Australian air carriers each year from 1970 to 2016. We will fit both a deterministic and a stochastic trend model to these data.

The deterministic trend model is obtained as follows:
```{r}
fit_deterministic <- aus_airpassengers %>%
  model(deterministic = ARIMA(Passengers ~ 1 + trend() + pdq(d = 0)))
report(fit_deterministic)
```
This model can be written as

\begin{align*}
  y_t &= 1.09 + 1.337 t + \eta_t \\
  \eta_t &= 0.943 \eta_{t-1}  + \varepsilon_t\\
  \varepsilon_t &\sim \text{NID}(0,4.713).
\end{align*}

The estimated growth in visitor numbers is 1.34 million people per year.

Alternatively, the stochastic trend model can be estimated.
```{r}
fit_stochastic <- aus_airpassengers %>%
  model(stochastic = ARIMA(Passengers ~ pdq(d = 1)))
report(fit_stochastic)
```
This model can be written as $y_t-y_{t-1} = 1.367 + \varepsilon_t$, or equivalently

\begin{align*}
  y_t &= y_0 + 1.367 t + \eta_t \\
  \eta_t &= \eta_{t-1} + \varepsilon_{t}\\
  \varepsilon_t &\sim \text{NID}(0,4.629).
\end{align*}

In this case, the estimated growth in visitor numbers is also 1.37 million people per year.

Although the growth estimates are similar, the prediction intervals are not. In particular, stochastic trends have much wider prediction intervals because the errors are non-stationary.
```{r}
aus_airpassengers %>%
  autoplot(Passengers) +
  autolayer(fit_stochastic %>% forecast(h = 20),
    colour = "blue", alpha = 0.5, level = 95) +
  autolayer(fit_deterministic %>% forecast(h = 20),
    colour = "red", alpha = 0.5, level = 95) +
  labs(x = "Year", y = "Air passengers (millions)",
       title = "Forecasts from trend models")
```
There is an implicit assumption with deterministic trends that the slope of the trend is not going to change over time. On the other hand, stochastic trends can change, and the estimated growth is only assumed to be the average growth over the historical period, not necessarily the rate of growth that will be observed into the future. Consequently, it is safer to forecast with stochastic trends, especially for longer forecast horizons, as the prediction intervals allow for greater uncertainty in future growth.

### Dynamic harmonic regression

When there are long seasonal periods, a dynamic regression with Fourier terms is often better.

So for such time series, we prefer a harmonic regression approach where the seasonal pattern is modelled using Fourier terms with short-term time series dynamics handled by an ARMA error.

The advantages of this approach are:

- it allows any length seasonality;
- for data with more than one seasonal period, Fourier terms of different frequencies can be included;
- the smoothness of the seasonal pattern can be controlled by K, the number of Fourier sin and cos pairs – the seasonal pattern is smoother for smaller values of K;
- the short-term dynamics are easily handled with a simple ARMA error.

The only real disadvantage (compared to a seasonal ARIMA model) is that the seasonality is assumed to be fixed — the seasonal pattern is not allowed to change over time. 

### Australian eating out expenditure

we demonstrate combining Fourier terms for capturing seasonality with ARIMA errors capturing other dynamics in the data. 

We use the total monthly expenditure on cafes, restaurants and takeaway food services in Australia ($billion) from 2004 up to 2018 and forecast 24 months ahead. We vary K, the number of Fourier sin and cos pairs, from K=1 to K=6 (which is equivalent to including seasonal dummies).

Notice that as  
K
  increases the Fourier terms capture and project a more “wiggly” seasonal pattern and simpler ARIMA models are required to capture other dynamics. The AICc value is minimised for K=6, with a significant jump going from  K=4 to K=5, hence the forecasts generated from this model would be the ones used.

```{r}
aus_cafe <- aus_retail %>%
  filter(
    Industry == "Cafes, restaurants and takeaway food services",
    year(Month) %in% 2004:2018
  ) %>%
  summarise(Turnover = sum(Turnover))

fit <- aus_cafe %>%
  model(
    `K = 1` = ARIMA(log(Turnover) ~ fourier(K = 1) + PDQ(0, 0, 0)),
    `K = 2` = ARIMA(log(Turnover) ~ fourier(K = 2) + PDQ(0, 0, 0)),
    `K = 3` = ARIMA(log(Turnover) ~ fourier(K = 3) + PDQ(0, 0, 0)),
    `K = 4` = ARIMA(log(Turnover) ~ fourier(K = 4) + PDQ(0, 0, 0)),
    `K = 5` = ARIMA(log(Turnover) ~ fourier(K = 5) + PDQ(0, 0, 0)),
    `K = 6` = ARIMA(log(Turnover) ~ fourier(K = 6) + PDQ(0, 0, 0))
  )

fit %>%
  forecast(h = "2 years") %>%
  autoplot(aus_cafe, level = 95) +
  facet_wrap(vars(.model), ncol = 2) +
  guides(colour = FALSE, fill = FALSE, level = FALSE) +
  geom_label(
    aes(x = yearmonth("2007 Jan"), y = 4250, label = paste0("AICc = ", format(AICc))),
    data = glance(fit)
  )
```
## Lagged predictors

Sometimes, the impact of a predictor that is included in a regression model will not be simple and immediate. 

In these situations, we need to allow for lagged effects of the predictor. Suppose that we have only one predictor in our model. Then a model which allows for lagged effects can be written as

$y_t = \beta_0 + \gamma_0x_t + \gamma_1 x_{t-1} + \dots + \gamma_k x_{t-k} + \eta_t,$

where $η_t$ is an ARIMA process. The value of k can be selected using the AICc, along with thevalues of p and q for the ARIMA error.

### Example: TV advertising and insurance quotations

A US insurance company advertises on national television in an attempt to increase the number of insurance quotations provided (and consequently the number of new policies).

Number of quotations and the expenditure on television advertising for the company each month from January 2002 to April 2005:
```{r}
load(file = "datasets/insurance.rda")

insurance %>%
  pivot_longer(Quotes:TVadverts) %>%
  ggplot(aes(x = Month, y = value)) +
  geom_line() +
  facet_grid(vars(name), scales = "free_y") +
  labs(y = "", title = "Insurance advertising and quotations")
```
We will consider including advertising expenditure for up to four months; that is, the model may include advertising expenditure in the current month, and the three months before that. When comparing models, it is important that they all use the same training set. In the following code, we exclude the first three months in order to make fair comparisons.
```{r}
fit <- insurance %>%
  # Restrict data so models use same fitting period
  mutate(Quotes = c(NA, NA, NA, Quotes[4:40])) %>%
  # Estimate models
  model(
    lag0 = ARIMA(Quotes ~ pdq(d = 0) + TVadverts),
    lag1 = ARIMA(Quotes ~ pdq(d = 0) + TVadverts + lag(TVadverts)),
    lag2 = ARIMA(Quotes ~ pdq(d = 0) + TVadverts + lag(TVadverts) + lag(TVadverts, 2)),
    lag3 = ARIMA(Quotes ~ pdq(d = 0) + TVadverts + lag(TVadverts) + lag(TVadverts, 2) + lag(TVadverts, 3))
  )
```
Next we choose the optimal lag length for advertising based on the AICc.
```{r}
glance(fit)
```
The best model (with the smallest AICc value) is lag1 with two predictors; that is, it includes advertising only in the current month and the previous month. So we now re-estimate that model, but using all the available data.
```{r}
fit_best <- insurance %>%
  model(ARIMA(Quotes ~ pdq(d = 0) + TVadverts + lag(TVadverts)))
report(fit_best)
```
The chosen model has ARIMA(1,0,2) errors. The model can be written as

$y_t = 2.155 + 1.253 x_t + 0.146 x_{t-1} + \eta_t$

where $y_t$ is the number of quotations provided in month t, $x_t$ is the advertising expenditure in month t,

$\eta_t = 0.512 \eta_{t-1} + \varepsilon_t + 0.917 \varepsilon_{t-1} + 0.459 \varepsilon_{t-2}$

We can calculate forecasts using this model if we assume future values for the advertising variable. If we set the future monthly advertising to 8 units, we get the forecasts:
```{r}
insurance_future <- new_data(insurance, 20) %>%
  mutate(TVadverts = 8)
fit_best %>%
  forecast(insurance_future) %>%
  autoplot(insurance) +
  labs(y = "Quotes",
       title = "Forecast quotes with future advertising set to 8")
```
# Forecasting hierarchical and grouped time series

Time series can often be naturally disaggregated by various attributes of interest. 

Hierarchical time series often arise due to geographic divisions.

Alternative aggregation structures arise when attributes of interest are crossed rather than nested. We refer to the resulting time series of crossed attributes as “grouped time series.”

More complex structures arise when attributes of interest are both nested and crossed.

Forecasts are often required for all disaggregate and aggregate series, and it is natural to want the forecasts to add up in the same way as the data.

## Hierarchical and grouped time series

### Hierarchical time series

https://otexts.com/fpp3/hts.html

For any time t, the observations at the bottom level of the hierarchy will sum to the observations of the series above. For example,
 
\begin{equation}
  y_{t}=\y{AA}{t}+\y{AB}{t}+\y{AC}{t}+\y{BA}{t}+\y{BB}{t},
  \tag{11.1}
\end{equation}

\begin{equation}
  \y{A}{t}=\y{AA}{t}+\y{AB}{t}+\y{AC}{t}\qquad \text{and} \qquad  \y{B}{t}=\y{BA}{t}+\y{BB}{t}.
  \tag{11.2}
\end{equation}

### Example: Australian tourism hierarchy

Australia is divided into six states and two territories, with each one having its own government and some economic and administrative autonomy. Each of these states can be further subdivided into regions.

The tourism tsibble contains data on quarterly domestic tourism demand, measured as the number of overnight trips Australians spend away from home. The key variables State and Region denote the geographical areas, while a further key Purpose describes the purpose of travel.
```{r}
#recode State to use abbreviations
tourism <- tsibble::tourism %>%
  mutate(State = recode(State,
    `New South Wales` = "NSW",
    `Northern Territory` = "NT",
    `Queensland` = "QLD",
    `South Australia` = "SA",
    `Tasmania` = "TAS",
    `Victoria` = "VIC",
    `Western Australia` = "WA"
  ))
```

Using the aggregate_key() function, we can create the hierarchical time series with overnight trips in regions at the bottom level of the hierarchy, aggregated to states, which are aggregated to the national total. A hierarchical time series corresponding to the nested structure is created using a parent/child specification.
```{r}
tourism_hts <- tourism %>%
  aggregate_key(State / Region, Trips = sum(Trips))
tourism_hts
```
The new tsibble now has some additional rows corresponding to state and national aggregations for each quarter. Figure 11.3 shows the aggregate total overnight trips for the whole of Australia as well as the states, revealing diverse and rich dynamics. For example, there is noticeable national growth since 2010 and for some states such as the ACT, New South Wales, Queensland, South Australia, and Victoria. There seems to be a significant jump for Western Australia in 2014.
```{r}
tourism_hts %>%
  filter(is_aggregated(Region)) %>%
  autoplot(Trips) +
  labs(y = "Trips ('000)",
       title = "Australian tourism: national total and states") +
  facet_wrap(vars(State), scales = "free_y", ncol = 3) +
  theme(legend.position = "none")
```
The seasonal pattern of the northern states, such as Queensland and the Northern Territory, leads to peak visits in winter (corresponding to Q3) due to the tropical climate and rainy summer months. In contrast, the southern states tend to peak in summer (corresponding to Q1).

### Grouped time series

With grouped time series, the data structure does not naturally disaggregate in a unique hierarchical manner.

This example shows that there are alternative aggregation paths for grouped structures. For any time  t, as with the hierarchical structure,
 
\begin{equation*}
y_{t}=y_{AX}{t}+y_{AY,t}+y_{BX,t}+y_{BY,t}.
\end{equation*}

However, for the first level of the grouped structure,

\begin{equation}
\y{A}{t}=\y{AX}{t}+\y{AY}{t}\quad \quad \y{B}{t}=\y{BX}{t}+\y{BY}{t}
\tag{11.3}
\end{equation}

but also

\begin{equation}
\y{X}{t}=\y{AX}{t}+\y{BX}{t}\quad \quad \y{Y}{t}=\y{AY}{t}+\y{BY}{t}
\tag{11.4}.
\end{equation}

Grouped time series can sometimes be thought of as hierarchical time series that do not impose a unique hierarchical structure, in the sense that the order by which the series can be grouped is not unique.

### Example: Australian prison population

In this example we consider the Australia prison population. 
```{r}
prison <- readr::read_csv("https://OTexts.com/fpp3/extrafiles/prison_population.csv") %>%
  mutate(Quarter = yearquarter(Date)) %>%
  select(-Date)  %>%
  as_tsibble(key = c(Gender, Legal, State, Indigenous), index = Quarter) %>%
  relocate(Quarter)
```

We create a grouped time series using aggregate_key() with attributes or groupings of interest now being crossed using the syntax attribute1*attribute2 (in contrast to the parent/child syntax used for hierarchical time series). The following code builds a grouped tsibble for the prison data with crossed attributes: gender, legal status and state.
```{r}
prison_gts <- prison %>%
  aggregate_key(Gender * Legal * State, Count = sum(Count)/1e3)
```
Using is_aggregated() within filter() is helpful for exploring or plotting the main groups shown in the bottom panels. For example, the following code plots the total numbers of female and male prisoners across Australia.
```{r}
prison_gts %>%
  filter(!is_aggregated(Gender), is_aggregated(Legal), is_aggregated(State)) %>%
  autoplot(Count) +
  labs(y = "Number of prisoners ('000)")
```
Plots of other group combinations can be obtained:
```{r}
prison_gts %>%
  filter(!is_aggregated(Gender), !is_aggregated(Legal), !is_aggregated(State)) %>%
  mutate(Gender = as.character(Gender)) %>%
  ggplot(aes(x = Quarter, y = Count, group = Gender, colour=Gender)) +
  stat_summary(fun = sum, geom = "line") +
  labs(title = "Prison population by state and gender",
       y = "Number of prisoners ('000)") +
  facet_wrap(~ as.character(State), nrow = 1, scales = "free_y") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
### Mixed hierarchical and grouped structure

Often disaggregating factors are both nested and crossed. For example, the Australian tourism data can also be disaggregated by the four purposes of travel: holiday, business, visiting friends and relatives, and other. This grouping variable does not nest within any of the geographical variables. In fact, we could consider overnight trips split by purpose of travel for the whole of Australia, and for each state, and for each region. We describe such a structure as a “nested” geographic hierarchy “crossed” with the purpose of travel. Using aggregate_key this can be specified by simply combining the factors as follows.
```{r}
tourism_full <- tourism %>%
  aggregate_key((State / Region) * Purpose, Trips = sum(Trips))
```
The tourism_full tsibble contains 425 series, including the 85 series from the hierarchical structure, as well as another 340 series obtained when each series of the hierarchical structure is crossed with the purpose of travel.

##  Single level approaches

Traditionally, forecasts of hierarchical or grouped time series involved selecting one level of aggregation and generating forecasts for that level. These are then either aggregated for higher levels, or disaggregated for lower levels, to obtain a set of coherent forecasts for the rest of the structure.

### The bottom-up approach

A simple method for generating coherent forecasts is the “bottom-up” approach. This approach involves first generating forecasts for each series at the bottom level, and then summing these to produce forecasts for all the series in the structure.

An advantage of this approach is that we are forecasting at the bottom level of a structure, and therefore no information is lost due to aggregation. On the other hand, bottom-level data can be quite noisy and more challenging to model and forecast.

#### Example: generating bottom-up forecasts

Suppose we want national and state forecasts for the Australian tourism data, but we aren’t interested in disaggregations using regions or the purpose of travel. So we first create a simple tsibble object containing only state and national trip totals for each quarter.
```{r}
tourism_states <- tourism %>%
  aggregate_key(State, Trips = sum(Trips))
```
We could generate the bottom-level state forecasts first, and then sum them to obtain the national forecasts as follows.
```{r}
fcasts_state <- tourism_states %>%
  filter(!is_aggregated(State)) %>%
  model(ets = ETS(Trips)) %>%
  forecast()
# Sum bottom-level forecasts to get top-level forecasts
fcasts_national <- fcasts_state %>%
  summarise(value = sum(Trips), .mean = mean(value))
```
However, we want a more general approach that will work with all the forecasting methods discussed in this chapter. So we will use the reconcile() function to specify how we want to compute coherent forecasts.
```{r}
tourism_states %>%
  model(ets = ETS(Trips)) %>%
  reconcile(bu = bottom_up(ets)) %>%
  forecast()
```
The reconcile() step has created a new “model” to produce bottom-up forecasts. The fable object contains the ets forecasts as well as the coherent bu forecasts, for the 8 states and the national aggregate. At the state level, these forecasts are identical, but the national ets forecasts will be different from the national bu forecasts.

For bottom-up forecasting, this is rather inefficient as we are not interested in the ETS model for the national total, and the resulting fable contains a lot of duplicates.

The above code illustrates the general workflow for hierarchical and grouped forecasts. We use the following pipeline of functions.

    data %>% aggregate_key() %>% model() %>% reconcile() %>% forecast()

1. Begin with a tsibble object (here labelled data) containing the individual bottom-level series.
2. Define in aggregate_key() the aggregation structure and build a tsibble object that also contains the aggregate series.
3. Identify a model() for each series, at all levels of aggregation.
4. Specify in reconcile() how the coherent forecasts are to be generated from the selected models.
5. Use the forecast() function to generate forecasts for the whole aggregation structure.

### Top-down approaches

Top-down approaches involve first generating forecasts for the Total series  $y_t$, and then disaggregating these down the hierarchy. This only works with strictly hierarchical data, not with grouped or mixed aggregation structures.

Top-down forecasts can be generated using top_down() within the reconcile() function.

### Average historical proportions

$p_j=\frac{1}{T}\sum_{t=1}^{T}\frac{y_{j,t}}{{y_t}}$

This approach is implemented in the top_down() function by setting method = "average_proportions".

### Proportions of the historical averages

$p_j={\sum_{t=1}^{T}\frac{y_{j,t}}{T}}\Big/{\sum_{t=1}^{T}\frac{y_t}{T}}$

This approach is implemented in the top_down() function by setting method = "proportion_averages".

In general, these approaches seem to produce quite reliable forecasts for the aggregate levels and they are useful with low count data. On the other hand, one disadvantage is the loss of information due to aggregation. Using such top-down approaches, we are unable to capture and take advantage of individual series characteristics such as time dynamics, special events, different seasonal patterns, etc.

### Forecast proportions

Because historical proportions used for disaggregation do not take account of how those proportions may change over time, top-down approaches based on historical proportions tend to produce less accurate forecasts at lower levels of the hierarchy than bottom-up approaches. To address this issue, proportions based on forecasts rather than historical data can be used.

$p_j=\prod^{K-1}_{\ell=0}\frac{\hat{y}_{j,h}^{(\ell)}}{\hat{S}_{j,h}^{(\ell+1)}}$

This approach is implemented in the top_down() function by setting method = "forecast_proportions". Because this approach tends to work better than other top-down methods, it is the default choice in the top_down() function when no method argument is specified.

One disadvantage of all top-down approaches, is that they do not produce unbiased coherent forecasts even if the base forecasts are unbiased.

### Middle-out approach

The middle-out approach combines bottom-up and top-down approaches. Again, it can only be used for strictly hierarchical aggregation structures.

This approach is implemented in the middle_out() function by specifying the appropriate middle level via the level argument and selecting the top-down approach with the method argument.

## Forecast reconciliation

https://otexts.com/fpp3/reconciliation.html

### The MinT optimal reconciliation approach

We refer to this as the MinT (or Minimum Trace) optimal reconciliation approach . MinT is implemented by min_trace() within the reconcile() function.

To use this in practice, we need to estimate $W_h$. Four simplifying approximations:

1. `min_trace()` by setting `method = "ols"`
2. `min_trace()` by setting `method = "wls_var"`
3. `min_trace()` by setting `method = "wls_struct"`
4. `min_trace()` by setting `method = "mint_cov"`; for cases where the number of bottom-level series $m$ is large compared to the length of the series $T$, this is not a good estimator. Instead we use a shrinkage estimator which shrinks the sample covariance to a diagonal matrix. `method = "mint_shrink"`.

the optimal reconciliation forecasts are generated using all the information available within a hierarchical or a grouped structure. This is important, as particular aggregation levels or groupings may reveal features of the data that are of interest to the user and are important to be modelled. These features may be completely hidden or not easily identifiable at other levels.

## Forecasting Australian domestic tourism

We use the data up to the end of 2015 as a training set, withholding the final two years (eight quarters, 2016Q1–2017Q4) as a test set for evaluation. The code below demonstrates the full workflow for generating coherent forecasts using the bottom-up, OLS and MinT methods.
```{r}
tourism_full <- tourism %>%
  aggregate_key((State / Region) * Purpose, Trips = sum(Trips))
fit <- tourism_full %>%
  filter(year(Quarter) <= 2015) %>%
  model(base = ETS(Trips)) %>%
  reconcile(
    bu = bottom_up(base),
    ols = min_trace(base, method = "ols"),
    mint = min_trace(base, method = "mint_shrink"),
  )
```
Here, fit contains the base ETS model for each series in tourism_full, along with the three methods for producing coherent forecasts as specified in the reconcile() function.
```{r}
fc <- fit %>% forecast(h = "2 years")
```
Passing fit into forecast() generates base and coherent forecasts across all the series in the aggregation structure.

We plot the four point forecasts for the overnight trips for the Australian total, the states, and the purposes of travel, along with the actual observations of the test set. To make it easier to see the differences, we have included only the last five years of the training data, and have omitted the prediction intervals. In most panels, the increase in overnight trips, especially in the second half of the test set, is higher than what is predicted by the point forecasts. This is particularly noticeable for the mainland eastern states of ACT, New South Wales, Queensland and Victoria, and across all purposes of travel.
```{r}
fc %>%
  filter(is_aggregated(Region), is_aggregated(Purpose)) %>%
  autoplot(
    tourism_full %>% filter(year(Quarter) >= 2011),
    level = NULL
  ) +
  labs(y = "Trips ('000)") +
  facet_wrap(vars(State), scales = "free_y")
```
```{r}
fc %>%
  filter(is_aggregated(State), !is_aggregated(Purpose)) %>%
  autoplot(
    tourism_full %>% filter(year(Quarter) >= 2011),
    level = NULL
  ) +
  labs(y = "Trips ('000)") +
  facet_wrap(vars(Purpose), scales = "free_y")
```
The accuracy of the forecasts over the test set can be evaluated using the accuracy() function.

```{r}
fc %>%
  filter(is_aggregated(State), is_aggregated(Purpose)) %>%
  accuracy(
    data = tourism_full,
    measures = list(rmse = RMSE, mase = MASE)
  ) %>%
  group_by(.model) %>%
  summarise(rmse = mean(rmse), mase = mean(mase))
```
```{r}
fc %>%
  filter(is_aggregated(State), !is_aggregated(Purpose)) %>%
  accuracy(
    data = tourism_full,
    measures = list(rmse = RMSE, mase = MASE)
  ) %>%
  group_by(.model) %>%
  summarise(rmse = mean(rmse), mase = mean(mase))
```
https://otexts.com/fpp3/tourism.html

Reconciling the base forecasts using OLS and MinT results in more accurate forecasts compared to the bottom-up approach. This result is commonly observed in applications as reconciliation approaches use information from all levels of the structure, resulting in more accurate coherent forecasts compared to the older methods which use limited information. Furthermore, reconciliation usually improves the incoherent base forecasts for almost all levels.

## Reconciled distributional forecasts

https://otexts.com/fpp3/rec-prob.html

### Forecasting Australian prison population

Returning to the Australian prison population data, we will compare the forecasts from bottom-up and MinT methods applied to base ETS models, using a test set comprising the final two years or eight quarters 2015Q1–2016Q4 of the available data.
```{r}
prison <- readr::read_csv("https://OTexts.com/fpp3/extrafiles/prison_population.csv") %>%
  mutate(Quarter = yearquarter(Date)) %>%
  select(-Date)  %>%
  as_tsibble(key = c(Gender, Legal, State, Indigenous), index = Quarter) %>%
  relocate(Quarter)

prison_gts <- prison %>%
  aggregate_key(Gender * Legal * State, Count = sum(Count)/1e3)

fit <- prison_gts %>%
  filter(year(Quarter) <= 2014) %>%
  model(base = ETS(Count)) %>%
  reconcile(
    bottom_up = bottom_up(base),
    MinT = min_trace(base, method = "mint_shrink")
  )
fc <- fit %>% forecast(h = 8)
fc %>%
  filter(is_aggregated(State), is_aggregated(Gender), is_aggregated(Legal)) %>%
  autoplot(prison_gts, alpha = 0.7, level = 90) +
  labs(y = "Number of prisoners ('000)",
       title = "Australian prison population (total)")
```
The three sets of forecasts for the aggregate Australian prison population. The base and bottom-up forecasts from the ETS models seem to underestimate the trend over the test period. The MinT approach combines information from all the base forecasts in the aggregation structure; in this case, the base forecasts at the top level are adjusted upwards.

The MinT reconciled prediction intervals are much tighter than the base forecasts, due to MinT being based on an estimator that minimizes variances. The base forecast distributions are also incoherent, and therefore carry with them the extra uncertainty of the incoherency error.

 MinT and base forecasts at various levels of aggregation. To make it easier to see the effect, we only show the last five years of training data. In general, MinT adjusts the base forecasts in the direction of the test set, hence improving the forecast accuracy. There is no guarantee that MinT reconciled forecasts will be more accurate than the base forecasts for every series, but they will be more accurate on average
 
```{r}
fc %>%
  filter(
    .model %in% c("base", "MinT"),
    !is_aggregated(State), is_aggregated(Legal), is_aggregated(Gender)
  ) %>%
  autoplot(
    prison_gts %>% filter(year(Quarter) >= 2010),
    alpha = 0.7, level = 90
  ) +
  labs(title = "Prison population (by state)",
       y = "Number of prisoners ('000)") +
  facet_wrap(vars(State), scales = "free_y", ncol = 4) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Using the accuracy() function, we evaluate the forecast accuracy across the grouped structure. The code below evaluates the forecast accuracy for only the top-level national aggregate of the Australian prison population time series. 
```{r}
fc %>%
  filter(is_aggregated(State), is_aggregated(Gender), is_aggregated(Legal)) %>%
  accuracy(data = prison_gts, measures = list(
    mase = MASE,
    ss = skill_score(CRPS)
  )) %>%
  group_by(.model) %>%
  summarise(mase = mean(mase), sspc = mean(ss) * 100)
```
A low value of MASE indicates a good forecast, while a high value of the skill score indicates a good forecast.

The results show that the MinT reconciled forecasts improve on the accuracy of the base forecasts and are also more accurate than the bottom-up forecasts. As the MinT optimal reconciliation approach uses information from all levels in the structure, it generates more accurate forecasts than the traditional approaches (such as bottom-up) which use limited information.

