Exponential Smoothing

2022-08-12

Exponential smoothing

Historical perspective

Combine a “level”, “trend” (slope) and “seasonal” component to describe a time series.
The rate of change of the components are controlled by “smoothing parameters”: $\alpha$, $\beta$ and $\gamma$ respectively.
Need to choose best values for the smoothing parameters (and initial states).
This methodology weighted averages of past observations, with the weights decaying exponentially as the observations get older !!

Big idea: control the rate of change

$\alpha$ controls the flexibility of the level

If $\alpha = 0$, the level never updates (mean)
If $\alpha = 1$, the level updates completely (naive)

$\beta$ controls the flexibility of the trend

If $\beta = 0$, the trend is linear
If $\beta = 1$, the trend changes suddenly every observation

$\gamma$ controls the flexibility of the seasonality

If $\gamma = 0$, the seasonality is fixed (seasonal means)
If $\gamma = 1$, the seasonality updates completely (seasonal naive)

Simple Exponential Smoothing

Time series $y_1,y_2,\dots,y_T$.

Random walk forecasts (Naive) - most recent observation is the only important one

$\hat{y}_{T+h|T}= y_T$

Average forecasts - all observations are of equal importance

$\hat{y}_{T+h|T}= \frac1T\sum_{t=1}^T y_t$

Want something in between these methods.
Most recent data should have more weight.

Simple exponential smoothing uses a weighted moving average with weights that decrease exponentially

Simple Exponential Smoothing

\[\hat{y}_{T+h|T} = \alpha y_T + \alpha(1-\alpha) y_{T-1} + \alpha(1-\alpha)^2 y_{T-2}+ \cdots\]

where $0 \le \alpha \le1$.

where the weights decrease exponentially as observations come from further in the past
The rate at which the weights decrease is controlled by the parameter $\alpha$.

Simple Exponential Smoothing

Component Form

For the simple exponential smoothing, the only component included is the level ($\ell_{t}$)
It is defined by two equations:

\[\text{Forecast equation} - \hat{y}_{t+h|t} = \ell_{t}\]

\[\text{Smoothing equation} - \ell_{t} = \alpha y_{t} + (1 - \alpha)\ell_{t-1}\]

$\ell_t$ is the level (or the smoothed value) of the series at time t.
$\hat{y}_{t+1|t} = \alpha y_t + (1-\alpha) \hat{y}_{t-1|t}$
Iterate to get exponentially weighted moving average form:

Weighted average form

\[\hat{y}_{t+1|T}=\sum_{j=0}^{T-1} \alpha(1-\alpha)^j y_{T-j}+(1-\alpha)^T \ell_{0}\] - the last term goes to zero as T increases!

Simple Exponential Smoothing - Forecast

Simple exponential smoothing has a “flat” forecast function:

\[\hat{y}_{T+h|T} = \hat{y}_{T+1|T} = \ell_{T}\]

All forecasts take the same value, equal to the last level component.
Remember that these forecasts will only be suitable if the time series has no trend or seasonal component.

Optimizing smoothing parameters

Need to choose best values for $\alpha$ and $\ell_0$ (initial values).
All forecasts can be computed from the data once we know those values.
We can estimate both parameters
Similarly to regression, choose optimal parameters by minimizing SSE: \[ \text{SSE}=\sum_{t=1}^T(y_t -\hat{y}_{t|t-1})^2 = \sum_{t=1}^Te_{t}^2 \]
Unlike regression there is no closed form solution — use numerical optimization.

Example

algeria_economy <- global_economy %>%
  filter(Country == "Algeria")

fit <- algeria_economy %>%
  model(ETS(Exports ~ error("A") + trend("N") + season("N")))

algeria_economy%>% autoplot(Exports)+
  geom_line(aes(y =.fitted), col="#D55E00", data = augment(fit))

Example

The simple exponential smoothing model was fitted (automatically) to minimize the SSE over the sample period ($t=1,2, ..., 58$) to the one-step training errors subject to the restriction that $0 \le \alpha \le1$

Fitted Model

For Algerian Exports example:
$\hat\alpha = 0.84$
$\hat\ell_0 = 39.54$

When t=1

Observation $y_t$ = $y_1 = 39.4$ - Real data

Level $\hat\ell_t$ = $\hat\ell_1 = 0.84 \times 39.4 + 0.16 \times 39.54 = 39.12$

Forecast $\hat{y}_{t|t-1}$ = $\hat{y}_{1|0} = \hat\ell_0 = 39.54$

Fitted Model

For Algerian Exports example:
$\hat\alpha = 0.84$
$\hat\ell_1 = 39.12$

When $t=2

$y_2 = 46.24$ - Real data

$\hat\ell_2 = 0.84 \times 46.24 + 0.16 \times 39.12 = 45.1$

$\hat{y}_{2|1} = \hat\ell_1 = 39.12$

Fitted Model

For Algerian Exports example:
$\hat\alpha = 0.84$
$\hat\ell_2 = 45.1$

When t=3

$y_3 = 19.79$ - Real data

$\hat\ell_3 = 0.84 \times 19.79 + 0.16 \times 45.1 = 23.84$

$\hat{y}_{3|2} = \hat\ell_2 = 45.1$

Forecast

When t=58 - End of the sample

$y_{58} = 22.64$ - Real data

$\hat\ell_{58} = 0.84 \times 22.64 + 0.16 \times 21.43 = 22.44$

$\hat{y}_{58|57} = \hat\ell_{57} = 21.43$

Forecast (T+H)

$\hat{y}_{59|58} = \hat\ell_{58} = 22.44$ $(h=1)$

$\hat{y}_{60|58} = \hat\ell_{58} = 22.44$ $(h=2)$

$\hat{y}_{61|58} = \hat\ell_{58} = 22.44$ $(h=3)$

Forecast

fit %>%
  forecast(h = 3)%>%
  autoplot(algeria_economy) +
  geom_line(aes(y = .fitted), col="#D55E00",
            data = augment(fit)) +
  labs(y="% of GDP", title="Exports: Algeria") +
  guides(colour = "none")

Forecast

Holt’s linear trend

Holt (1957) expanded the simple exponential smoothing to allow forecasting of data with a trend.
This method involves a forecast equation and two smoothing equations

\[\text{Forecast equation} - \hat{y}_{t+h|t} = \ell_{t} + hb_t\]

\[\text{Level equation} - \ell_{t} = \alpha y_{t} + (1 - \alpha)(\ell_{t-1}+b_{t-1})\]

\[\text{Trend equation} - b_{t} = \beta^*(\ell_{t} - \ell_{t-1}) + (1 -\beta^*)b_{t-1}\]

Two smoothing parameters $\alpha$(level) and $\beta^*$(trend) ($0\le\alpha,\beta^*\le1$).
$\ell_t$ level: weighted average between $y_t$ and one-step ahead forecast for time $t$, $(\ell_{t-1} + b_{t-1}=\hat{y}_{t|t-1})$
$b_t$ slope: weighted average of $(\ell_{t} - \ell_{t-1})$ and $b_{t-1}$, current and previous estimate of slope.
Choose $\alpha, \beta^*, \ell_0, b_0$ to minimize SSE.

Holt’s linear trend

The forecast function is no longer flat but trending.
The h-step-ahead forecast is equal to the last estimated level plus h times the last estimated trend value.
Hence the forecasts are a linear function of h.

Example

aus_economy <- global_economy %>%
  filter(Code == "AUS") %>%
  mutate(Pop = Population / 1e6)
autoplot(aus_economy, Pop) +
  labs(y = "Millions", title = "Australian population")

Example

The smoothing parameters $\alpha, \beta^*, \ell_0, b_0$ are estimated by minimising the SSE for the one-step training errors over the sample period ($t=1,2, ..., 58$)

fit <- aus_economy %>%
  model(
    AAN = ETS(Pop ~ error("A") + trend("A") + season("N"))
  )
fc <- fit %>% forecast(h = 10)

Fitted Model

For Australian Population:
$\hat\alpha = 0.9999$; $\beta^* = 0.3267$
$\hat\ell_0 = 10.05$; $b_0 = 0.22$

When t=1 and h=0

$y_1 = 10.28$ - Real data

$\hat\ell_1 = 0.999 \times 10.28 + 0.001 \times (10.05 + 0.22)= 10.28$

$\hat{b}_1 = 0.3267*(10.28-10.05)+(1-0.3267) \times 0.22 = 0.223$

$\hat{y}_{1|0} = \hat\ell_0 + hb_0 = 10.05 + 1 \times 0.22 = 10.28$

Fitted Model

$\hat\alpha = 0.9999$; $\beta^* = 0.3267$
$\hat\ell_1 = 10.28$; $\hat{b_1} = 0.223$

When $t=2

$y_2 = 10.48$ - Real data

$\hat\ell_2 = 0.999 \times 10.48 + 0.001 \times (10.28 + 0.223)= 10.48$

$\hat{b}_2 = 0.3267*(10.48-10.28)+(1-0.3267) \times 0.223 = 0.225$

$\hat{y}_{2|1} = \hat\ell_1 + h\hat{b}_1 = 10.28 + 1 \times 0.225 = 10.50$

Forecast

When t=58 - End of the sample

$y_{58} = 24.60$ - Real data

$\hat\ell_{58} = 0.999 \times 24.60 + 0.001 \times (24.21 + 0.36)= 24.60$

$\hat{b}_{58} = 0.3267*(24.60-24.21)+(1-0.3267) \times 0.36 = 0.37$

$\hat{y}_{58|57} = \hat\ell_{57} + h\hat{b}_{57} = 24.21 + 1 \times 0.36 = 24.57$

Forecast (T+H)

$\hat{y}_{59|58} = \hat\ell_{58} + h\hat{b}_{58} = 24.60 + 1 \times 0.37 = 24.97$ $(h=1)$

$\hat{y}_{60|58} = \hat\ell_{58} + h\hat{b}_{58} = 24.60 + 2 \times 0.37 = 25.34$ $(h=2)$

$\hat{y}_{60|58} = \hat\ell_{58} + h\hat{b}_{58} = 24.60 + 3 \times 0.37 = 25.71$ $(h=3)$

Forecast

Damped trend methods

The forecasts generated by Holt’s linear method display a constant trend (increasing or decreasing) indefinitely into the future.
It might overforecast for longer periods.

$\hat{y}_{t+h|t}= \ell_{t} + (\phi+\phi^2 + \dots + \phi^{h})b_{t}$

$\ell_{t} = \alpha y_{t} + (1 - \alpha)(\ell_{t-1} + \phi b_{t-1})$

$b_{t} = \beta^*(\ell_{t} - \ell_{t-1}) + (1 -\beta^*)\phi b_{t-1}$

Damping parameter $0<\phi<1$.
If $\phi=1$, identical to Holt’s linear trend.
As $h\rightarrow\infty$, $\hat{y}_{T+h|T}\rightarrow \ell_T+\phi b_T/(1-\phi)$.
Short-run forecasts trended, but long-run forecasts constant.
$\phi$ is usually between 0.8 and a maximum of 0.98.

Example: Australian population

aus_economy %>%
  model(
    `Holt's method` = ETS(Pop ~ error("A") +  trend("A") + season("N")),
    `Damped Holt's method` = ETS(Pop ~ error("A") +
                       trend("Ad", phi = 0.9) + season("N"))) %>%
  forecast(h = 30) %>%
  autoplot(aus_economy, level = NULL) +
  labs(title = "Australian population",
       y = "Millions") +
  guides(colour = guide_legend(title = "Forecast"))

Let’s Compare the models so far

aus_economy %>%
  autoplot(Pop)

We will use time series cross-validation to compare the one-step forecast accuracy of the three methods.

Let’s Compare the models so far

aus_economy %>%
  stretch_tsibble(.init = 10) %>%
  model(
    SES = ETS(Pop ~ error("A") + trend("N") + season("N")),
    Holt = ETS(Pop ~ error("A") + trend("A") + season("N")),
    Damped = ETS(Pop ~ error("A") + trend("Ad") +
                   season("N"))
  ) %>%
  forecast(h = 1) %>%
  accuracy(aus_economy)

## # A tibble: 3 × 11
##   .model Country .type     ME   RMSE    MAE    MPE  MAPE  MASE RMSSE   ACF1
##   <chr>  <fct>   <chr>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 Damped Austra… Test  0.0215 0.0783 0.0552 0.114  0.333 0.220 0.298 0.0565
## 2 Holt   Austra… Test  0.0145 0.0768 0.0534 0.0779 0.325 0.213 0.292 0.439 
## 3 SES    Austra… Test  0.257  0.270  0.257  1.44   1.44  1.02  1.03  0.521

Let’s Compare the models so far

Holt’s method is best whether you compare MAE or RMSE values.

fit<-aus_economy %>%
  model(Holt = ETS(Pop ~ error("A") + trend("A") + season("N")))
tidy(fit)

## # A tibble: 4 × 4
##   Country   .model term  estimate
##   <fct>     <chr>  <chr>    <dbl>
## 1 Australia Holt   alpha    1.00 
## 2 Australia Holt   beta     0.327
## 3 Australia Holt   l[0]    10.1  
## 4 Australia Holt   b[0]     0.222

Example: Australian population

fit%>%forecast(h = 10) %>%
  autoplot(aus_economy)

Methods with seasonality

Holt and Winters extended Holt’s method to capture seasonality.
The Holt-Winters seasonal method comprises the forecast equation and three smoothing equations
1. the level ($\ell_{t}$)
2. the trend ($b_{t}$)
3. the seasonal component ($s_{t}$)
There are two variations to this method that differ in the nature of the seasonal component.

The additive method is preferred when the seasonal variations are roughly constant through the series.
the Multiplicative method is preferred when the seasonal variations are changing proportional to the level of the series.

Methods with seasonality

Holt-Winters’ additive method

\[ \begin{align} \hat{y}_{t+h|h} &= \ell_{t} + hb _{t} + s_{t+h-m(k+1)} \\ \ell_{t} &= \alpha(y_{t} - s_{t-m}) + (1 - \alpha)(\ell_{t-1} + b_{t-1})\\ b_{t} &= \beta^*(\ell_{t} - \ell_{t-1}) + (1 - \beta^*)b_{t-1}\\ s_{t} &= \gamma (y_{t}-\ell_{t-1}-b_{t-1}) + (1-\gamma)s_{t-m} \end{align} \]

$k=$ integer part of $(h-1)/m$. Ensures estimates from the final year are used for forecasting.
Parameters:$0\le \alpha\le 1$, $0\le \beta^*\le 1$, $0\le \gamma\le 1-\alpha$ and $m=$ period of seasonality (e.g. $m=4$ for quarterly data).

Holt-Winters’ additive method

The level equation shows a weighted average between the seasonally adjusted observation $(y_{t} - s_{t-m})$ and the
the non-seasonal forecast $(\ell_{t-1} + b_{t-1})$
The trend equation is identical to Holt’s linear method.
The seasonal equation shows a weighted average between the current seasonal index, $(y_{t}-\ell_{t-1}-b_{t-1})$, and the seasonal index of the same season last year (i.e., m time periods ago).

Holt-Winters’ additive method

Seasonal component is usually expressed as $s_{t} = \gamma^* (y_{t}-\ell_{t})+ (1-\gamma^*)s_{t-m}.$
- Substitute in for $\ell_t$: $s_{t} = \gamma^*(1-\alpha) (y_{t}-\ell_{t-1}-b_{t-1})+ [1-\gamma^*(1-\alpha)]s_{t-m}$
- We set $\gamma=\gamma^*(1-\alpha)$.
- The usual parameter restriction is $0\le\gamma^*\le1$, which translates to $0\le\gamma\le(1-\alpha)$.

Holt-Winters multiplicative method

Seasonal variations change in proportion to the level of the series.

\[ \begin{align*} \hat{y}_{t+h|t} &= (\ell_{t} + hb_{t})s_{t+h-m(k+1)} \\ \ell_{t} &= \alpha \frac{y_{t}}{s_{t-m}} + (1 - \alpha)(\ell_{t-1} + b_{t-1})\\ b_{t} &= \beta^*(\ell_{t}-\ell_{t-1}) + (1 - \beta^*)b_{t-1} \\ s_{t} &= \gamma \frac{y_{t}}{(\ell_{t-1} + b_{t-1})} + (1 - \gamma)s_{t-m} \end{align*} \]

Additive method: $s_t$ in absolute terms — within each year $\sum_i s_i \approx 0$.
Multiplicative method: $s_t$ in relative terms — within each year $\sum_i s_i \approx m$.

Example: Australian holiday tourism

aus_holidays <- tourism %>%
  filter(Purpose == "Holiday") %>%
  summarise(Trips = sum(Trips))
fit <- aus_holidays %>%
  model(
    additive = ETS(Trips ~ error("A") + trend("A") + season("A")),
    multiplicative = ETS(Trips ~ error("M") + trend("A") + season("M"))
  )
fc <- fit %>% forecast()

Example: Australian holiday tourism

fc %>%
  autoplot(aus_holidays, level = NULL) +
  labs(y = "Thousands", title = "Overnight trips")

Estimated components

tidy(fit)%>%
  filter(.model == "additive")

## # A tibble: 9 × 3
##   .model   term     estimate
##   <chr>    <chr>       <dbl>
## 1 additive alpha    0.236   
## 2 additive beta     0.0298  
## 3 additive gamma    0.000100
## 4 additive l[0]  9899.      
## 5 additive b[0]   -37.4     
## 6 additive s[0]  -538.      
## 7 additive s[-1] -684.      
## 8 additive s[-2] -290.      
## 9 additive s[-3] 1512.

Estimated components

tidy(fit)%>%
  filter(.model == "multiplicative")

## # A tibble: 9 × 3
##   .model         term     estimate
##   <chr>          <chr>       <dbl>
## 1 multiplicative alpha    0.186   
## 2 multiplicative beta     0.0248  
## 3 multiplicative gamma    0.000100
## 4 multiplicative l[0]  9853.      
## 5 multiplicative b[0]   -33.4     
## 6 multiplicative s[0]     0.943   
## 7 multiplicative s[-1]    0.926   
## 8 multiplicative s[-2]    0.970   
## 9 multiplicative s[-3]    1.16

Estimated components

Holt-Winters damped method

Damping is possible with both additive and multiplicative Holt-Winters’ methods. A
One of the most accurate forecasting method for seasonal data is the Holt-Winters method with a damped trend and multiplicative seasonality:

\[ \begin{align*} \hat{y}_{t+h|t} &= [\ell_{t} + (\phi+\phi^2 + \dots + \phi^{h})b_{t}]s_{t+h-m(k+1)} \\ \ell_{t} &= \alpha(y_{t} / s_{t-m}) + (1 - \alpha)(\ell_{t-1} + \phi b_{t-1})\\ b_{t} &= \beta^*(\ell_{t} - \ell_{t-1}) + (1 - \beta^*)\phi b_{t-1} \\ s_{t} &= \gamma \frac{y_{t}}{(\ell_{t-1} + \phi b_{t-1})} + (1 - \gamma)s_{t-m} \end{align*} \]

Holt-Winters with daily data

sth_cross_ped <- pedestrian %>%
  filter(Date >= "2016-07-01",
         Sensor == "Southern Cross Station") %>%
  index_by(Date) %>%
  summarise(Count = sum(Count)/1000)

sth_cross_ped %>%
  filter(Date <= "2016-07-31") %>%
  model(hw = ETS(Count ~ error("M") + trend("Ad") + season("M"))) %>%
  forecast(h = "2 weeks") %>%
  autoplot(sth_cross_ped %>% filter(Date <= "2016-08-14")) +
  labs(title = "Daily traffic: Southern Cross",
       y="Pedestrians ('000)")

Holt-Winters with daily data

Innovations state space models

Exponential smoothing methods

So far we have looked at the methods (algorithms) of exponential smoothing for point forecasts - not at statistical models !!!
So let’s focus on the statistical models that underlie these algorithms
These statistical models generate the same point forecasts, but can also generate prediction (or forecast) intervals.
As we have seen before, it embodies a set of statistical assumptions concerning the generation of sample data (idealizing the true data-generating). process.
As such, a statistical model is a mathematical relationship between one or more random (stochastic) variables and other non-random variables.

State space models

Each model consists of two main parts:
1. Measurement equation that describes the observed data;
2. Some state equations that describe how the unobserved components or states (level, trend, seasonal) change over time.
Additionally, for each method there exist two models:
1. additive errors
2. multiplicative errors.

The point forecasts produced by the models are identical if they use the same smoothing parameter values. They will, however, generate different prediction intervals.

State space models

To correctly classify models with additive and multiplicative errors, and distiguish the models from the methods, for now on we will label the models as ETS() - Error, Trend, Seasonal.
Using the previous notation from the methods, the possibilities for each component (or state) are:

Error ={A,M}

Trend = {N,A,$A_d$}

Seasonal = {N,A,M}

ETS(A,N,N): simple exponential smoothing with additive errors

Recall the component form of simple exponential smoothing:

\[\text{Forecast equation} - \hat{y}_{t+h|t} = \ell_{t}\]

\[\text{Smoothing equation} - \ell_{t} = \alpha y_{t} + (1 - \alpha)\ell_{t-1}\]

If we re-arrange the smoothing equation for the level, we get the error correction form:

\[ \begin{align} \ell_{t} = \ell_{t-1}+\alpha(y_{t} -\ell_{t-1})\\ = \ell_{t-1} +\alpha e_{t} \end{align} \] Where $e_{t} = y_t - \ell_{t-1} = y_t - \hat{y}_{t|t-1} \text{is ther residual at time t}$

ETS(A,N,N)

The training data errors lead to the adjustment of the estimated level throughout the smoothing process for $t=1,..., T$
For example, if the error at time t is negative, then $y_t < \hat{y}_{t|t-1}$ and so the level at time $t-1$ has been over-estimated.
The new level $\ell_{t}$ is then the previous level $\ell_{t-1}$ adjusted downwards.
The closer $\alpha$ is to one, the “rougher” the estimate of the level, but smaller $\alpha$, the “smoother” the level.

ETS(A,N,N)

$y_t = \ell_{t-1} +e_t \rightarrow$ each observation can be represented by the previous level plus an error.

To make this into an innovations state space model, all we need to do is specify the probability distribution for $e_t$
For a model with additive errors, we assume that residuals (the one-step training errors) are $e_t = \varepsilon_t\sim\text{NID}(0,\sigma^2)$. (NID stands for “normally and independently distributed”)

\[ \begin{align*} \text{Measurement equation}&& y_t &= \ell_{t-1} + \varepsilon_t\\ \text{State equation}&& \ell_t&=\ell_{t-1}+\alpha \varepsilon_t \end{align*} \]

ETS(A,N,N)

\[ \begin{align*} \text{Measurement equation}&& y_t &= \ell_{t-1} + \varepsilon_t\\ \text{State equation}&& \ell_t&=\ell_{t-1}+\alpha \varepsilon_t \end{align*} \]

These two equations, together with the statistical distribution of the errors, form a fully specified statistical model.
Specifically, these constitute an innovations state space model underlying simple exponential smoothing.
“innovations” or “single source of error” because equations have the same error process, $\varepsilon_t$.
- Measurement equation: relationship between observations and states.
- State equation(s): evolution of the state(s) through time.
- $\alpha$ controls the changes in the level - rapid ($\uparrow \alpha$) or slow ($\downarrow \alpha$)

ETS(M,N,N): simple exponential smoothing with multiplicative errors

Specify relative errors $\varepsilon_t=\frac{y_t- \hat{y}_{t|t-1}}{\hat{y}_{t|t-1}}\sim \text{NID}(0,\sigma^2)$
- Substituting $\hat{y}_{t|t-1}=\ell_{t-1}$ gives:
  - $y_t = \ell_{t-1}+\ell_{t-1}\varepsilon_t$
  - $e_t = y_t - \hat{y}_{t|t-1} = \ell_{t-1}\varepsilon_t$

Therefore, the multiplicative form of the state space model as: \[ \begin{align*} \text{Measurement equation}&& y_t &= \ell_{t-1}(1 + \varepsilon_t)\\ \text{State equation}&& \ell_t&=\ell_{t-1}(1+\alpha \varepsilon_t) \end{align*} \]

Additive and multiplicative errors with the same parameters generate the same point forecasts but different prediction intervals.

ETS(A,A,N): Holt’s linear method with additive errors

Holt’s linear method with additive errors.

Assume $\varepsilon_t=y_t-\ell_{t-1}-b_{t-1} \sim \text{NID}(0,\sigma^2)$.
Substituting into the error correction equations for Holt’s linear method

\[ \begin{align*} y_t&=\ell_{t-1}+b_{t-1}+\varepsilon_t\\ \ell_t&=\ell_{t-1}+b_{t-1}+\alpha \varepsilon_t\\ b_t&=b_{t-1}+\alpha\beta^* \varepsilon_t \end{align*} \] * For simplicity, set $\beta=\alpha \beta^*$.

ETS(M,A,N): Holt’s linear method with multiplicative errors

Holt’s linear method with multiplicative errors.

Assume $\varepsilon_t=\frac{y_t-(\ell_{t-1}+b_{t-1})}{(\ell_{t-1}+b_{t-1})}$
Following a similar approach as previously, the innovations state space model underlying Holt’s linear method with multiplicative errors is specified as: \[ \begin{align*} y_t&=(\ell_{t-1}+b_{t-1})(1+\varepsilon_t)\\ \ell_t&=(\ell_{t-1}+b_{t-1})(1+\alpha \varepsilon_t)\\ b_t&=b_{t-1}+\beta(\ell_{t-1}+b_{t-1}) \varepsilon_t \end{align*} \]

where again $\beta=\alpha \beta^*$ and $\varepsilon_t \sim \text{NID}(0,\sigma^2)$.

ETS Models

Additive error models

Multiplicative error models

Estimation and model selection

An alternative to estimating the smoothing parameters $\alpha$, $\beta$, $\gamma$ and $\phi$, and the initial states $\ell_0$, $b_0$, $s_0,s_{-1},\dots,s_{-m+1}$ by minimizing the sum of squared errors is to maximize the “likelihood”.
The likelihood is the probability of the data arising from the specified model. Thus, a large likelihood is associated with a good model.
For an additive error model, maximizing the likelihood (assuming normally distributed errors) gives the same results as minimizing the sum of squared errors.
However, different results will be obtained for multiplicative error models.

Estimation

Let $x_t = (\ell_t, b_t, s_t, s_{t-1}, \dots, s_{t-m+1})$ and $\varepsilon_t{\sim}\mbox{N}(0,\sigma^2)$.

\[ \begin{align*} L^*(\theta,x_0) &= T\log\!\bigg(\sum_{t=1}^T \varepsilon^2_t\!\bigg) + 2\sum_{t=1}^T \log|k(x_{t-1})|\\ &= -2\log(\text{Likelihood}) + \mbox{constant} \end{align*} \]

Estimate parameters $\theta = (\alpha,\beta,\gamma,\phi)$ and initial states $x_0 = (\ell_0,b_0,s_0,s_{-1},\dots,s_{-m+1})$ by minimizing $L^*$.

Parameter restrictions

Usual region

Traditional restrictions in the methods $0< \alpha,\beta^*,\gamma^*,\phi<1$(as weighted averages).
In models we set $\beta=\alpha\beta^*$ and $\gamma=(1-\alpha)\gamma^*$
Therefore $0< \alpha <1$; $0 < \beta < \alpha$; and $0< \gamma < 1-\alpha$.
$0.8<\phi<0.98$ — to prevent numerical difficulties.

Admissible region

To prevent observations in the distant past having a continuing effect on current forecasts.
Usually (but not always) less restrictive than traditional region.
For example for ETS(A,N,N): traditional $0< \alpha <1$ while admissible $0< \alpha <2$.

Model selection

A great advantage of the ETS statistical framework is that information criteria can be used for model selection.

Akaike's Information Criterion \[\text{AIC} = -2\log(\text{L}) + 2k\] where $L$ is the likelihood and $k$ is the total number of parameters and initial states that have been estimated (including the residual variance).

Corrected AIC

\[\text{AIC}_{\text{c}} = \text{AIC} + \frac{2k(k+1)}{T-k-1}\]

which is the AIC corrected (for small sample bias).

Bayesian Information Criterion \[ \text{BIC} = \text{AIC} + k[\log(T)-2]\]

AIC and cross-validation

Minimizing the AIC assuming Gaussian residuals is asymptotically equivalent to minimizing one-step time series cross validation MSE.

Automatic forecasting

From Hyndman et al. (IJF, 2002):

Apply each model that is appropriate to the data. Optimize parameters and initial values using MLE (or some other criterion).
Select best method using AICc:
Produce forecasts using best method.
Obtain forecast intervals using underlying state space model.

Method performed very well in M3 competition.

Example: National populations

fit <- global_economy %>%
  mutate(Pop = Population / 1e6) %>%
  model(ets = ETS(Pop))
fit

## # A mable: 263 x 2
## # Key:     Country [263]
##    Country                      ets
##    <fct>                    <model>
##  1 Afghanistan         <ETS(A,A,N)>
##  2 Albania             <ETS(M,A,N)>
##  3 Algeria             <ETS(M,A,N)>
##  4 American Samoa      <ETS(M,A,N)>
##  5 Andorra             <ETS(M,A,N)>
##  6 Angola              <ETS(M,A,N)>
##  7 Antigua and Barbuda <ETS(M,A,N)>
##  8 Arab World          <ETS(M,A,N)>
##  9 Argentina           <ETS(A,A,N)>
## 10 Armenia             <ETS(M,A,N)>
## # … with 253 more rows

Example: National populations

fit %>%
  forecast(h = 5)

## # A fable: 1,315 x 5 [1Y]
## # Key:     Country, .model [263]
##    Country     .model  Year             Pop .mean
##    <fct>       <chr>  <dbl>          <dist> <dbl>
##  1 Afghanistan ets     2018    N(36, 0.012) 36.4 
##  2 Afghanistan ets     2019    N(37, 0.059) 37.3 
##  3 Afghanistan ets     2020     N(38, 0.16) 38.2 
##  4 Afghanistan ets     2021     N(39, 0.35) 39.0 
##  5 Afghanistan ets     2022     N(40, 0.64) 39.9 
##  6 Albania     ets     2018 N(2.9, 0.00012)  2.87
##  7 Albania     ets     2019   N(2.9, 6e-04)  2.87
##  8 Albania     ets     2020  N(2.9, 0.0017)  2.87
##  9 Albania     ets     2021  N(2.9, 0.0036)  2.86
## 10 Albania     ets     2022  N(2.9, 0.0066)  2.86
## # … with 1,305 more rows

Example: Australian holiday tourism

holidays <- tourism %>%
  filter(Purpose == "Holiday")
fit <- holidays %>% model(ets = ETS(Trips))
fit

## # A mable: 76 x 4
## # Key:     Region, State, Purpose [76]
##    Region                     State              Purpose          ets
##    <chr>                      <chr>              <chr>        <model>
##  1 Adelaide                   South Australia    Holiday <ETS(A,N,A)>
##  2 Adelaide Hills             South Australia    Holiday <ETS(A,A,N)>
##  3 Alice Springs              Northern Territory Holiday <ETS(M,N,A)>
##  4 Australia's Coral Coast    Western Australia  Holiday <ETS(M,N,A)>
##  5 Australia's Golden Outback Western Australia  Holiday <ETS(M,N,M)>
##  6 Australia's North West     Western Australia  Holiday <ETS(A,N,A)>
##  7 Australia's South West     Western Australia  Holiday <ETS(M,N,M)>
##  8 Ballarat                   Victoria           Holiday <ETS(M,N,A)>
##  9 Barkly                     Northern Territory Holiday <ETS(A,N,A)>
## 10 Barossa                    South Australia    Holiday <ETS(A,N,N)>
## # … with 66 more rows

Example: Australian holiday tourism

fit %>%
  filter(Region == "Snowy Mountains") %>%
  report()

## Series: Trips 
## Model: ETS(M,N,A) 
##   Smoothing parameters:
##     alpha = 0.157 
##     gamma = 1e-04 
## 
##   Initial states:
##  l[0] s[0] s[-1] s[-2] s[-3]
##   142  -61   131 -42.2 -27.7
## 
##   sigma^2:  0.0388
## 
##  AIC AICc  BIC 
##  852  854  869

Example: Australian holiday tourism

fit %>%
  filter(Region == "Snowy Mountains") %>%
  components(fit)

## # A dable: 84 x 9 [1Q]
## # Key:     Region, State, Purpose, .model [1]
## # :        Trips = (lag(level, 1) + lag(season, 4)) * (1 + remainder)
##    Region         State Purpose .model Quarter Trips level season remainder
##    <chr>          <chr> <chr>   <chr>    <qtr> <dbl> <dbl>  <dbl>     <dbl>
##  1 Snowy Mountai… New … Holiday ets    1997 Q1  NA     NA   -27.7   NA     
##  2 Snowy Mountai… New … Holiday ets    1997 Q2  NA     NA   -42.2   NA     
##  3 Snowy Mountai… New … Holiday ets    1997 Q3  NA     NA   131.    NA     
##  4 Snowy Mountai… New … Holiday ets    1997 Q4  NA    142.  -61.0   NA     
##  5 Snowy Mountai… New … Holiday ets    1998 Q1 101.   140.  -27.7   -0.113 
##  6 Snowy Mountai… New … Holiday ets    1998 Q2 112.   142.  -42.2    0.154 
##  7 Snowy Mountai… New … Holiday ets    1998 Q3 310.   148.  131.     0.137 
##  8 Snowy Mountai… New … Holiday ets    1998 Q4  89.8  148.  -61.0    0.0335
##  9 Snowy Mountai… New … Holiday ets    1999 Q1 112.   147.  -27.7   -0.0687
## 10 Snowy Mountai… New … Holiday ets    1999 Q2 103.   147.  -42.2   -0.0199
## # … with 74 more rows

Example: Australian holiday tourism

fit %>%
  filter(Region == "Snowy Mountains") %>%
  components(fit) %>%
  autoplot()

Example: Australian holiday tourism

fit %>% forecast()

## # A fable: 608 x 7 [1Q]
## # Key:     Region, State, Purpose, .model [76]
##    Region         State           Purpose .model Quarter       Trips .mean
##    <chr>          <chr>           <chr>   <chr>    <qtr>      <dist> <dbl>
##  1 Adelaide       South Australia Holiday ets    2018 Q1 N(210, 457) 210. 
##  2 Adelaide       South Australia Holiday ets    2018 Q2 N(173, 473) 173. 
##  3 Adelaide       South Australia Holiday ets    2018 Q3 N(169, 489) 169. 
##  4 Adelaide       South Australia Holiday ets    2018 Q4 N(186, 505) 186. 
##  5 Adelaide       South Australia Holiday ets    2019 Q1 N(210, 521) 210. 
##  6 Adelaide       South Australia Holiday ets    2019 Q2 N(173, 537) 173. 
##  7 Adelaide       South Australia Holiday ets    2019 Q3 N(169, 553) 169. 
##  8 Adelaide       South Australia Holiday ets    2019 Q4 N(186, 569) 186. 
##  9 Adelaide Hills South Australia Holiday ets    2018 Q1   N(19, 36)  19.4
## 10 Adelaide Hills South Australia Holiday ets    2018 Q2   N(20, 36)  19.6
## # … with 598 more rows

Example: Australian holiday tourism

fit %>% forecast() %>%
  filter(Region == "Snowy Mountains") %>%
  autoplot(holidays) +
  labs(y = "Thousands", title = "Overnight trips")

Residuals

Response residuals \[\hat{e}_t = y_t - \hat{y}_{t|t-1}\]

Innovation residuals

Additive error model: \[\hat\varepsilon_t = y_t - \hat{y}_{t|t-1}\]

Multiplicative error model: \[\hat\varepsilon_t = \frac{y_t - \hat{y}_{t|t-1}}{\hat{y}_{t|t-1}}\]

Example: Australian holiday tourism

aus_holidays <- tourism %>%
  filter(Purpose == "Holiday") %>%
  summarise(Trips = sum(Trips))
fit <- aus_holidays %>%
  model(ets = ETS(Trips)) %>%
  report()

## Series: Trips 
## Model: ETS(M,N,M) 
##   Smoothing parameters:
##     alpha = 0.358 
##     gamma = 0.000969 
## 
##   Initial states:
##  l[0]  s[0] s[-1] s[-2] s[-3]
##  9667 0.943 0.927 0.968  1.16
## 
##   sigma^2:  0.0022
## 
##  AIC AICc  BIC 
## 1331 1333 1348

Example: Australian holiday tourism

residuals(fit)
residuals(fit, type = "response")

Example: Australian holiday tourism

fit %>%
  augment()

## # A tsibble: 80 x 6 [1Q]
## # Key:       .model [1]
##    .model Quarter  Trips .fitted .resid   .innov
##    <chr>    <qtr>  <dbl>   <dbl>  <dbl>    <dbl>
##  1 ets    1998 Q1 11806.  11230.  576.   0.0513 
##  2 ets    1998 Q2  9276.   9532. -257.  -0.0269 
##  3 ets    1998 Q3  8642.   9036. -393.  -0.0435 
##  4 ets    1998 Q4  9300.   9050.  249.   0.0275 
##  5 ets    1999 Q1 11172.  11260.  -88.0 -0.00781
##  6 ets    1999 Q2  9608.   9358.  249.   0.0266 
##  7 ets    1999 Q3  8914.   9042. -129.  -0.0142 
##  8 ets    1999 Q4  9026.   9154. -129.  -0.0140 
##  9 ets    2000 Q1 11071.  11221. -150.  -0.0134 
## 10 ets    2000 Q2  9196.   9308. -111.  -0.0120 
## # … with 70 more rows

Innovation residuals (.innov) are given by $\hat{\varepsilon}_t$ while regular residuals (.resid) are $y_t - \hat{y}_{t-1}$. They are different when the model has multiplicative errors.

Some unstable models

Some of the combinations of (Error, Trend, Seasonal) can lead to numerical difficulties; see equations with division by a state.
These are: $ETS(A,N,M)$, $ETS(A,A,M)$, $ETS(A,A_d,M)$
Models with multiplicative errors are useful for strictly positive data, but are not numerically stable with data containing zeros or negative values.
In that case only the six fully additive models will be applied.

Forecasting with exponential smoothing

Forecasting with ETS models

Traditional point forecasts: iterate the equations for $t=T+1,T+2,\dots,T+h$ and set all $\varepsilon_t=0$ for $t>T$.

Example: ETS(A,A,N) \[ \begin{align*} y_{T+1} &= \ell_T + b_T + \varepsilon_{T+1}\\ \hat{y}_{T+1|T} & = \ell_{T}+b_{T}\\ y_{T+2} & = \ell_{T+1} + b_{T+1} + \varepsilon_{T+2}\\ & = (\ell_T + b_T + \alpha\varepsilon_{T+1}) + (b_T + \beta \varepsilon_{T+1}) + \varepsilon_{T+2} \\ \hat{y}_{T+2|T} &= \ell_{T}+2b_{T} \end{align*}\]

These are identical to the forecasts from Holt’s linear method, and also to those from model ETS(M,A,N).
Thus, the point forecasts obtained from the method and from the two models that underlie the method are identical (assuming that the same parameter values are used).

Example: ETS(M,A,N)

\[ \begin{align*} y_{T+1} &= (\ell_T + b_T )(1+ \varepsilon_{T+1})\\ \hat{y}_{T+1|T} & = \ell_{T}+b_{T}.\\ y_{T+2} & = (\ell_{T+1} + b_{T+1})(1 + \varepsilon_{T+2})\\ & = \left\{ (\ell_T + b_T) (1+ \alpha\varepsilon_{T+1}) + \left[b_T + \beta (\ell_T + b_T)\varepsilon_{T+1}\right] \right\} (1 + \varepsilon_{T+2}) \\ \hat{y}_{T+2|T} &= \ell_{T}+2b_{T} \end{align*} \]

etc.

Forecasting with ETS models

ETS point forecasts constructed in this way are equal to the means of the forecast distributions ($\text{E}(y_{t+h} | x_t)$), except for the models with multiplicative seasonality
fable uses $\text{E}(y_{t+h} | x_t)$.
Point forecasts for ETS(A,*,*) are identical to ETS(M,*,*) if the parameters are the same.

Forecasting with ETS models

Prediction intervals: can only be generated using the models.

The prediction intervals will differ between models with additive and multiplicative errors.
Exact formula for some models.
More general to simulate future sample paths, conditional on the last estimate of the states, and to obtain prediction intervals from the percentiles of these simulated future paths.

Prediction intervals

PI for most ETS models: $\hat{y}_{T+h|T} \pm c \sigma_h$, where $c$ depends on coverage probability and $\sigma_h$ is forecast standard deviation.

Example: Corticosteroid drug sales

h02 <- PBS %>%
  filter(ATC2 == "H02") %>%
  summarise(Cost = sum(Cost))
h02 %>% autoplot(Cost)

Example: Corticosteroid drug sales

h02 %>%
  model(ETS(Cost)) %>%
  report()

## Series: Cost 
## Model: ETS(M,Ad,M) 
##   Smoothing parameters:
##     alpha = 0.307 
##     beta  = 0.000101 
##     gamma = 0.000101 
##     phi   = 0.978 
## 
##   Initial states:
##    l[0] b[0]  s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6] s[-7] s[-8] s[-9]
##  417269 8206 0.872 0.826 0.756 0.773 0.687  1.28  1.32  1.18  1.16   1.1
##  s[-10] s[-11]
##    1.05  0.981
## 
##   sigma^2:  0.0046
## 
##  AIC AICc  BIC 
## 5515 5519 5575

Example: Corticosteroid drug sales

h02 %>%
  model(ETS(Cost ~ error("A") + trend("A") + season("A"))) %>%
  report()

## Series: Cost 
## Model: ETS(A,A,A) 
##   Smoothing parameters:
##     alpha = 0.17 
##     beta  = 0.00631 
##     gamma = 0.455 
## 
##   Initial states:
##    l[0] b[0]   s[0]   s[-1]   s[-2]   s[-3]   s[-4]  s[-5]  s[-6]  s[-7]
##  409706 9097 -99075 -136602 -191496 -174531 -241437 210644 244644 145368
##   s[-8] s[-9] s[-10] s[-11]
##  130570 84458  39132 -11674
## 
##   sigma^2:  3.5e+09
## 
##  AIC AICc  BIC 
## 5585 5589 5642

Example: Corticosteroid drug sales

h02 %>%
  model(ETS(Cost)) %>%
  forecast() %>%
  autoplot(h02)

Example: Corticosteroid drug sales

h02 %>%
  model(
    auto = ETS(Cost),
    AAA = ETS(Cost ~ error("A") + trend("A") + season("A"))
  ) %>%
  accuracy()

Model	MAE	RMSE	MAPE	MASE	RMSSE
auto	38649	51102	4.99	0.638	0.689
AAA	43378	56784	6.05	0.716	0.766