Exponential Smoothing in R

Salma Elshahawy

2021-02-22

Exponential Smoothing and Innovation State Space Model (ISSM)

Exponential smoothing (ETS, which stands for Error, Trend, and Seasonality) is a family of very successful forecasting methods based on the key property that forecasts are weighted combinations of past observations (Hyndman et al., 2008). For example, in simple exponential smoothing, the forecast \(\hat{z}_{T+1}\) for time step \(T+1\) is written as

\[\hat{z}_{T+1} = \hat{z}_T + \alpha (z_T - \hat{z}_T) = \alpha\cdot z_T + (1 - \alpha)\cdot \hat{z}_T,\] where \(0 < \alpha < 1\) is a smoothing parameter that controls the weight given to each observation. Note that recent observations are given more weight than older observations. In fact, the weight given to a past observation decreases exponentially with its age, hence the name exponential smoothing.
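As a quick illustration, simple exponential smoothing can be run with the ses() function from the forecast package (loaded by fpp2). This is a minimal sketch: the series and the fixed value of alpha are arbitrary choices for demonstration, not taken from the text.

library(forecast)

# simple exponential smoothing with a fixed smoothing parameter
fit.ses <- ses(AirPassengers, alpha = 0.2, h = 1)

# the one-step-ahead forecast is the weighted combination described above
fit.ses$mean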

Whether we use additive or multiplicative errors, the point forecasts are the same, but the prediction intervals differ. To distinguish models, we add an extra letter to the method notation: the triplet ETS(⋅,⋅,⋅) refers to error, trend, and seasonality (in that order).

General exponential smoothing methods extend simple exponential smoothing to include time series patterns such as a (linear) trend and various periodic seasonal effects. All ETS methods fall under the category of forecasting methods, as the predictions are point forecasts (a single value is predicted for each future time step). A statistical model, on the other hand, describes the underlying data generation process and has the advantage that it can produce an entire probability distribution for each future time step. Innovation state space models (ISSMs) are an example of such models; they offer considerable flexibility in representing commonly occurring time series patterns and underlie the exponential smoothing methods.

The idea behind ISSMs is to maintain a latent state vector \(l_{t}\) with recent information about level, trend, and seasonality factors. The state vector \(l_{t}\) evolves over time, adding a small innovation (i.e., Gaussian noise) at each time step. The observations are then a linear combination of the components of the current state.

Mathematically, an ISSM is specified by two equations: a state-transition equation and an observation equation,

\[l_t = F l_{t-1} + g_t\epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, 1),\]

\[z_t = a_{t}^{\top}l_{t-1} + b_t + \nu_t, \quad \nu_t \sim \mathcal{N}(0, \sigma_t^2).\]

Here we allow for an additional term \(b_t\), which can model any deterministic component (exogenous variables). This describes a fairly generic model, allowing the user to encode specific time series patterns using the coefficients \(F, a_t\), which are thus problem dependent.

The innovation vector \(g_t\) contains the parameters to be learned (the innovation strengths). Moreover, the initial state \(l_0\) has to be specified. We do so by specifying a Gaussian prior distribution \(P(l_0)\), whose parameters (mean, standard deviation) are learned from data as well.

The parameters of the ISSM are typically learned using the maximum likelihood principle. This requires the computation of the log-likelihood of the given observations, i.e., computing the probability of the data under the model, \(P(z_1, \ldots, z_T)\).


Specifying the model type

So when specifying the model type you always specify the error, trend, then seasonality (hence “ets”). The options you can specify for each component are as follows:

error: additive (“A”), multiplicative (“M”), unknown (“Z”)

trend: none (“N”), additive (“A”), multiplicative (“M”), unknown (“Z”)

seasonality: none (“N”), additive (“A”), multiplicative (“M”), unknown (“Z”)

Consequently, if we wanted to apply a Holt’s model where the error and trend were additive and no seasonality exists we would select model = "AAN".

If you want to apply a Holt-Winters model where there is additive error, an exponential (multiplicative) trend, and additive seasonality you would select model = "AMA".

If you are uncertain of the type of component then you use “Z”. So if you were uncertain of the components or if you want the model to select the best option, you could use model = "ZZZ" and the “optimal” model will be selected.
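For instance, here is a toy sketch of how these options translate into ets() calls; the example series (USAccDeaths, a base R dataset) is just a stand-in and is not used elsewhere in this post.

library(forecast)

# Holt's linear trend: additive error and trend, no seasonality
fit.aan <- ets(USAccDeaths, model = "AAN")

# let ets() choose every component itself
fit.auto <- ets(USAccDeaths, model = "ZZZ")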

Example

We will use the qcement dataset from the fpp2 package for illustration. The first step is to split qcement into a training set and a test set so we can compare performance. This data has seasonality and trend; however, it is unclear whether the seasonality is additive or multiplicative.
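The code below sketches the split and the first fit. The split point is an assumption, chosen so the test set starts in 2013 Q1 to match the forecasts shown later; the object names are likewise illustrative.

library(fpp2)

# split into a training set (through 2012 Q4) and a test set (2013 Q1 onward)
qcement.train <- window(qcement, end = c(2012, 4))
qcement.test  <- window(qcement, start = c(2013, 1))

# fit an ETS model with additive error, trend, and seasonality
qcement.hw <- ets(qcement.train, model = "AAA")
summary(qcement.hw)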

#> ETS(A,A,A) 
#> 
#> Call:
#>  ets(y = qcement.train, model = "AAA") 
#> 
#>   Smoothing parameters:
#>     alpha = 0.6418 
#>     beta  = 1e-04 
#>     gamma = 0.1988 
#> 
#>   Initial states:
#>     l = 0.4511 
#>     b = 0.0075 
#>     s = 0.0049 0.0307 9e-04 -0.0365
#> 
#>   sigma:  0.0854
#> 
#>      AIC     AICc      BIC 
#> 126.0419 126.8676 156.9060 
#> 
#> Training set error measures:
#>                       ME       RMSE       MAE          MPE     MAPE      MASE
#> Training set 0.001463693 0.08393279 0.0597683 -0.003454533 3.922727 0.5912949
#>                    ACF1
#> Training set 0.02150539

If we assess our additive model we can see that \(\alpha = 0.6418\), \(\beta = 0.0001\), and \(\gamma = 0.1988\).
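If needed, the estimated smoothing parameters and initial states can also be pulled straight out of the fitted object (a sketch, using the object name assumed above):

# named vector of smoothing parameters and initial state estimates
coef(qcement.hw)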

The important thing to understand about the ets() model is how to select the model = parameter. In total we have 36 model options to choose from.

With the ets() function, the default estimation method is maximum likelihood rather than minimum sum of squares.

If we check our residuals, we see that the residuals grow larger over time. This may suggest that a multiplicative error may be more appropriate.
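The diagnostics below (including the Ljung-Box test) can be produced with checkresiduals() from the forecast package (a sketch, using the object name assumed above):

# residual plots and Ljung-Box test for the ETS(A,A,A) fit
checkresiduals(qcement.hw)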

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from ETS(A,A,A)
#> Q* = 20.288, df = 3, p-value = 0.0001479
#> 
#> Model df: 8.   Total lags used: 11

Comparing predictive accuracy

To compare the predictive accuracy of our models, let's compare four different models, as sketched below. We see that the first model (additive error, trend and seasonality) results in the lowest RMSE and MAPE on the test data set.
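The comparison below is a sketch. The first and fourth specifications ("AAA" and "MAM") can be read off the model outputs in this document; the middle two ("MAA" and "AAM") are assumptions about which other combinations were tried.

# four candidate ETS specifications
qcement.hw1 <- ets(qcement.train, model = "AAA")                    # additive error, trend, seasonality
qcement.hw2 <- ets(qcement.train, model = "MAA")                    # multiplicative error (assumed)
qcement.hw3 <- ets(qcement.train, model = "AAM", restrict = FALSE)  # multiplicative seasonality (assumed)
qcement.hw4 <- ets(qcement.train, model = "MAM")                    # multiplicative error and seasonality

# forecast the 5 test-set quarters and compute accuracy for each model
accuracy(forecast(qcement.hw1, h = 5), qcement.test)
accuracy(forecast(qcement.hw2, h = 5), qcement.test)
accuracy(forecast(qcement.hw3, h = 5), qcement.test)
accuracy(forecast(qcement.hw4, h = 5), qcement.test)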

#>                       ME       RMSE        MAE          MPE     MAPE      MASE
#> Training set 0.001463693 0.08393279 0.05976830 -0.003454533 3.922727 0.5912949
#> Test set     0.031362775 0.07144211 0.06791904  1.115342984 2.899446 0.6719311
#>                     ACF1 Theil's U
#> Training set  0.02150539        NA
#> Test set     -0.31290496 0.2112428
#>                        ME       RMSE        MAE        MPE     MAPE      MASE
#> Training set -0.002700842 0.08387446 0.05912640 -0.3689856 3.811958 0.5849446
#> Test set      0.018046198 0.06438073 0.05763575  0.5614035 2.493544 0.5701974
#>                      ACF1 Theil's U
#> Training set  0.001490172        NA
#> Test set     -0.344335948 0.1722734
#>                       ME       RMSE        MAE       MPE     MAPE      MASE
#> Training set 0.007542179 0.07818508 0.05640602 0.3985457 3.693964 0.5580315
#> Test set     0.046793468 0.09456029 0.09241462 1.6559245 3.882433 0.9142688
#>                     ACF1 Theil's U
#> Training set -0.01243122        NA
#> Test set     -0.05070111 0.2737314
#>                        ME       RMSE        MAE         MPE     MAPE      MASE
#> Training set 0.0009645886 0.07807481 0.05595930 -0.02540554 3.657772 0.5536120
#> Test set     0.0164979314 0.08297241 0.07162921  0.37381187 3.064455 0.7086363
#>                     ACF1 Theil's U
#> Training set -0.03038457        NA
#> Test set     -0.06185361 0.1930288

If we were to compare this to an unspecified model where we let ets select the optimal model, we see that ets selects a model specification of multiplicative error, additive trend, and multiplicative seasonality (“MAM”). This is equivalent to our fourth model above. This model is assumed “optimal” because it minimizes RMSE, AIC, and BIC on the training data set, but does not necessarily minimize prediction errors on the test set.
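Letting ets() pick the components itself only requires the "ZZZ" option (the call matches the output below; the object name is an assumption):

# automatic selection of error, trend, and seasonality
qcement.auto <- ets(qcement.train, model = "ZZZ")
summary(qcement.auto)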

#> ETS(M,A,M) 
#> 
#> Call:
#>  ets(y = qcement.train, model = "ZZZ") 
#> 
#>   Smoothing parameters:
#>     alpha = 0.7664 
#>     beta  = 0.0131 
#>     gamma = 1e-04 
#> 
#>   Initial states:
#>     l = 0.49 
#>     b = 0.0064 
#>     s = 1.0298 1.0488 1.0151 0.9063
#> 
#>   sigma:  0.0477
#> 
#>        AIC       AICc        BIC 
#>  0.3161397  1.1418277 31.1802503 
#> 
#> Training set error measures:
#>                        ME       RMSE       MAE         MPE     MAPE     MASE
#> Training set 0.0009645886 0.07807481 0.0559593 -0.02540554 3.657772 0.553612
#>                     ACF1
#> Training set -0.03038457

We can optimize the \(\gamma\) parameter in our Holt-Winters model (if we don’t want to use the triple z option). Here, we use the additive error, trend and seasonality model that minimized our prediction errors above and identify the \(\gamma\) parameter that minimizes forecast errors. In this case we see that \(\gamma = 0.21\) minimizes the error rate.
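One way to run this search is sketched below: hold the model at ETS(A,A,A), fix \(\gamma\) at each value on a grid, and record the test-set RMSE. The grid range and object names are assumptions for illustration.

# grid of candidate gamma values
gamma <- seq(0.01, 0.85, by = 0.01)
RMSE  <- numeric(length(gamma))

for (i in seq_along(gamma)) {
  fit     <- ets(qcement.train, model = "AAA", gamma = gamma[i])
  fc      <- forecast(fit, h = 5)
  RMSE[i] <- accuracy(fc, qcement.test)["Test set", "RMSE"]
}

# gamma value giving the smallest test-set RMSE
gamma[which.min(RMSE)]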

If we update our model with this “optimal” \(\gamma\) parameter we see that we bring our forecasting error rate (test MAPE) down from about 2.90% to 2.88%. This is a small improvement, but often small improvements can have large business implications.
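A sketch of that comparison (the value 0.21 comes from the text above; object names are assumptions):

# original additive model vs. the same model with gamma fixed at 0.21
qcement.hw1 <- ets(qcement.train, model = "AAA")
qcement.opt <- ets(qcement.train, model = "AAA", gamma = 0.21)

accuracy(forecast(qcement.hw1, h = 5), qcement.test)
accuracy(forecast(qcement.opt, h = 5), qcement.test)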

#>                       ME       RMSE        MAE          MPE     MAPE      MASE
#> Training set 0.001463693 0.08393279 0.05976830 -0.003454533 3.922727 0.5912949
#> Test set     0.031362775 0.07144211 0.06791904  1.115342984 2.899446 0.6719311
#>                     ACF1 Theil's U
#> Training set  0.02150539        NA
#> Test set     -0.31290496 0.2112428
#>                        ME       RMSE        MAE        MPE     MAPE      MASE
#> Training set -0.001312025 0.08377557 0.05905971 -0.2684606 3.834134 0.5842847
#> Test set      0.033492771 0.07148708 0.06775269  1.2096488 2.881680 0.6702854
#>                     ACF1 Theil's U
#> Training set  0.04832198        NA
#> Test set     -0.35877010 0.2202448

With this new optimal model we can get our predicted values:
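For example (a sketch, using the gamma-tuned object assumed above):

# 5-quarter-ahead forecasts with 80% and 95% prediction intervals
forecast(qcement.opt, h = 5)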

#>         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
#> 2013 Q1       2.134650 2.025352 2.243947 1.967494 2.301806
#> 2013 Q2       2.427828 2.299602 2.556055 2.231723 2.623934
#> 2013 Q3       2.601989 2.457284 2.746694 2.380681 2.823296
#> 2013 Q4       2.505001 2.345506 2.664496 2.261075 2.748927
#> 2014 Q1       2.171068 1.987914 2.354223 1.890958 2.451179