We have looked at time series decomposition.
Decomposition methods are mainly meant for descriptive purposes.
Decomposition reveals the existence of the main components:
- trend
- cycle, and
- seasonality.
In addition, visual inspection of the series may reveal outliers (unusual values), or breaks in the trend or in seasonal patterns.
Once we have a firm grasp of the components that make up the time series, we are in a better position to select the best model to both describe the data and produce the best forecasts.
Simple models are often used for benchmarking. For example, we can use naive models, or seasonally naive models, for forecasting.
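As a minimal sketch of these benchmarks in R (using the forecast package and the built-in AirPassengers series purely for illustration; the object names are ours):

library(forecast)
# naive: the forecast equals the last observation
fc_n <- naive(AirPassengers, h = 12)
# seasonally naive: the forecast equals the observation from the same month one year earlier
fc_sn <- snaive(AirPassengers, h = 12)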
Even though the main aim of decomposition methods is description of the time series and unraveling its components, they can be used for forecasting. Once we have established that the trend in demand for our product is linear, and that demand in the month of February is on average 10% lower than usual, then it is straightforward to make a forecast by extrapolating the trend, and adjusting it for the lower demand in February.
As we have seen, decomposition methods do have some drawbacks.
The main drawback is that the trend and seasonal patterns are fixed over the length of the time series (although variations of the method offer some flexibility). In general, it is more realistic to assume that patterns change, and that more recent observations provide more information about the future than observations from long ago.
That is, we can probably do a better job in forecasting by putting more weight on recent observations.
Here is where exponential smoothing (ES) comes in. ES is by itself a family of techniques, rather than one specific technique.
In this chapter, we will discuss the most common ES-methods, and see how they compare to simple methods.
The figure below presents the workflow for forecasting. In this chapter, we will go through all the steps, one by one.
Reading and preparing data is a challenge, especially because of the specific nature of time series (with time indices, or time stamps).
If you have worked with date and time data before, for example in Excel, you may have noticed that it is not always obvious how the data is displayed on your screen, and stored internally.
Dates can be displayed in a variety of formats, like “MM/DD/YYYY” in the US, or “DD/MM/YYYY”, which is common in Europe. The string “2/3/2023” is the 3rd of February in the US, but the 2nd of March in the Netherlands. Dates can also be displayed across software packages in a variety of forms, like “Monday, 27 March, 2023 03:18:27 PM EST”.
It is strongly recommended to describe and visualize your data before modeling. The graph of the time series will give you an idea of the trend, cycle and seasonal patterns in the data. If seasonal patterns are absent in the data (because you have aggregated your monthly data to annual data), then there is no point in modeling them.
For a time series with, say, a monthly seasonal pattern, you have a choice of models to select from. Going from simple to more advanced, you can use a seasonally naive model, one out of several decomposition models, a model from the family of exponential smoothing models (weighted average model, or Holt-Winters), or a model using ARIMA (or here: SARIMA) specifications. In general, more advanced models better describe, or fit, your data. And often, but not always, these models also produce better forecasts! Complex models have a tendency to overfit, implying that they contain parameters with little predictive power.
The proof of the pudding is in the eating! Rather than models with a great fit between the actual in sample data and the data reproduced by the model, we are looking for models that better forecast out of sample future data. The challenge is of course that we do not know the future values (that is the reason that we ventured into forecasting to start with).
In order to assess the predictive power of the model, we partition our data into a training set and a test set:
- We use the training data to train the model, and
- check to what extent the model is able to predict the known values in the test set.
For this assessment, we can use several criteria, like the Root Mean Squared Error (RMSE) and the Mean Absolute Percentage Error (MAPE). We can use these criteria to assess our model against one or more benchmarks. If our advanced model does not outperform simpler alternatives, then there is no reason to prefer it - as overfitting seems to cancel out the benefits of the more advanced model.
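As a sketch of how two of these criteria are computed (the vectors below are hypothetical, for illustration only):

# hypothetical actual values and forecasts for a 4-period test set
actual <- c(100, 110, 105, 120)
pred   <- c(102, 108, 109, 115)
sqrt(mean((actual - pred)^2))              # RMSE
mean(abs((actual - pred) / actual)) * 100  # MAPE, in percent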
Once we have decided on the best model to use, we can use the full data set to make real forecasts. Of course, even this model is not set in stone. The forecasts have to be evaluated continuously, and alternative models can be benchmarked against the current model.
We will discuss five types of ES-methods: the simple moving average (SMA), the weighted moving average (WMA), simple exponential smoothing (SES), the Holt model, and the Holt-Winters model.
We have already used the concept of moving averages, in our discussion of decomposition methods. In the case of quarterly data, for example, the average of four consecutive quarters filters out the seasonality in the data, and estimates the trend (or: trend plus cycle) in the data.
In that approach, we centered the moving average. The moving average of, say, the first quarter of 2023, is based on previous quarters and next quarters. That is, we are using a two-sided rolling window.
For forecasting using SMA, we take a slightly different approach. If we have a time series of t observations, then the forecast for t+1 is based on the preceding m observations. We are using a so-called left tail rolling window.
Moreover, we are assuming that the time series has neither a trend nor a seasonal pattern, and the choice of the length of the rolling window (m) is based on the number of past observations that are informative for forecasting future periods.
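In formula form, the SMA(m) forecast for period t+1 is the unweighted average of the last m observations:

$$\hat{Y}_{t+1} = \frac{1}{m}\sum_{i=0}^{m-1} Y_{t-i}$$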
In the simplest case, the length of our left tail rolling window (m) is equal to 1. The forecast for period t+1 is equal to the last observation. In other words, SMA(1) is equal to the naive model!
Since we do not have observations for t+1, the forecast for t+2 will make use of the forecast for t+1. Like in the naive model, the forecasts for the next periods are a horizontal line, at the level of the latest observation in period t.
We can base our forecasts on a longer rolling window (m>1). The choice of m can be based on intuition and experience, or on performance of the model. In order to evaluate SMA-models with different values for m, we divide the time series in a training set and a test set.
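A minimal R sketch of this procedure (the helper function and object names are ours; AirPassengers is used only to have data at hand):

sma_forecast <- function(y, m, h) {
  # the forecast is the average of the last m observations, repeated h times
  rep(mean(tail(y, m)), h)
}
y_train <- window(AirPassengers, end = c(1958, 12))
y_test  <- window(AirPassengers, start = c(1959, 1))
for (m in c(1, 6, 12)) {
  fc <- sma_forecast(y_train, m, length(y_test))
  cat("m =", m, "MAPE =", round(mean(abs((y_test - fc) / y_test)) * 100, 1), "\n")
}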
For example, let us use the data set on coffee prices. The data set is included in the TSstudio package, in R.
The data set contains two time series (hence, the mts format, for multiple time series).
We will use the time series on the prices of Robusta coffee, in the first column of the data set.
The graph of the time series shows no clear trend. There are cycles in the time series, of varying duration. Since it is hard to predict if the current cycle will continue, and in the absence of trend and seasonality, the choice of an SMA-model is justified.
library(TSstudio)
data("Coffee_Prices")
ts_info(Coffee_Prices)
## The Coffee_Prices series is a mts object with 2 variables and 701 observations
## Frequency: 12
## Start time: 1960 1
## End time: 2018 5
head(Coffee_Prices)
## Robusta Arabica
## [1,] 0.6968643 0.9409
## [2,] 0.6887074 0.9469
## [3,] 0.6887074 0.9281
## [4,] 0.6845187 0.9303
## [5,] 0.6906915 0.9200
## [6,] 0.6968643 0.9123
robusta <- Coffee_Prices[,1]; head(robusta)
## [1] 0.6968643 0.6887074 0.6887074 0.6845187 0.6906915 0.6968643
library(plotly)
ts_plot(robusta,
        title = "The Robusta Coffee Monthly Prices",
        Ytitle = "Price in USD",
        Xtitle = "Year")
Since the SMA-model is based on simple arithmetic computations, there is no need to use sophisticated software.
We will use an Excel-spreadsheet to illustrate the working of the SMA-model.
We have written the Robusta data to a CSV-file, and opened that file in Excel.
Let us use various SMA-models (m = 1; 6; 12; 24; 36) to generate 24-month forecasts.
In order to evaluate these models, we split the time series in:
- A training set of all observations apart from the most recent 24 observations, and
- A test set containing the last 24 observations.
The file can be downloaded from https://github.com/ssmresearch/forecasting/blob/main/robusta_df.xlsx
The cells in the Excel-sheet contain simple formulas. For example, the SMA(6) forecast is just the AVERAGE() of the six preceding observations.
Following the same logic for all 24 months in the test set, and using the various values for m, we can graph the forecasts against the actual observations.
The following conclusions can be drawn:
The WMA model is an extension of the SMA model.
While the SMA model uses the simple (unweighted) average of the m lags of the series, the WMA allows you to assign different weights.
The sum of the weights must be equal to one.
One use of the WMA model would be a time series with a seasonal pattern. In the case of monthly data, we could put a heavy weight (say, 0.7 or even higher) on the 12th lag, and the rest of the weight (1 - 0.7 = 0.3) on the last observation.
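A sketch of such a one-step-ahead WMA forecast in R (the function name and weights are ours):

wma_forecast <- function(y, w1 = 0.3, w12 = 0.7) {
  n <- length(y)
  # weight w12 on the seasonal (12th) lag, w1 on the last observation
  w1 * y[n] + w12 * y[n - 11]
}
wma_forecast(AirPassengers)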
But as we have seen, the time series on Robusta prices does not show a clear seasonal pattern. If the lengths of the cycles were stable over time, then we could put more weight on the cth lag, where c is the length of the cycle.
For the purpose of illustration:
- In the part of the time series that starts in 2011 (64 months, excluding the test set), we observe three peaks and troughs; on closer inspection, recent full cycles have a length of around 26 months.
- If we also assume a weak seasonal pattern, then we can set the weights at 0.2 for the first lag, 0.2 for the seasonal (12th) lag, and 0.6 for the cyclical (26th) lag.
Following a similar procedure as for SMA, we then obtain the following forecasts:
You can download the Excel sheet and use different weights. The Excel-sheet allows you to enter weights for the 1st and the 12th lag. Since the weights have to add up to 1, the weight for the 26th lag is computed automatically.
After downloading the file, you can add another column with different sets of lags, and see which model works best!
As we will see in the next exponential smoothing models, the trick is in finding the optimal set of weights. For most of the more advanced models, there are algorithms in the R-packages that will suggest the best weights or parameters. In order to find a best set, the algorithms have to make use of a criterion. Since our main aim is to make accurate forecasts, logical criteria are accuracy measures applied to the test set. We will come back to that!
Unfortunately, the algorithms for finding the best set of weights do not always work the way we would expect. As an alternative, we can use a so-called grid search. In that approach, we examine how the accuracy of our forecasts is affected by changing the weights or parameters in our model. The exercise above, using the Excel-file, is essentially a manual grid search: by changing the set of weights, we directly observe in the graph how accurate our forecasts are in the short and longer run.
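A sketch of such a grid search over the WMA weights (reusing the y_train and y_test objects from the SMA sketch above; the setup is ours):

w1_grid <- seq(0.1, 0.9, by = 0.1)   # weight on the first lag; the rest goes to the 12th lag
n <- length(y_train)
mape <- sapply(w1_grid, function(w1) {
  fc <- w1 * y_train[n] + (1 - w1) * y_train[n - 11]  # one-step-ahead forecast
  abs((y_test[1] - fc) / y_test[1]) * 100
})
data.frame(w1 = w1_grid, mape = mape)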
Exponential smoothing models are among the most popular forecasting techniques. This is due to the relative ease of model estimation and use, and because their flexibility generates accurate forecasts in many cases.
Simple Exponential Smoothing (SES), as the name implies, is the simplest version of the family of exponential smoothing (ES) models. SES models can be applied to stable time series, that is, time series without trend and seasonality.
ES models differ from the simple and weighted moving average (MA) models, in two ways.
First of all, while MA models take the average of the last m observations, ES models take all past observations into consideration.
Secondly, while the weights attached to past observations are either equal for all m observations (in SMA) or can be set by the user (in WMA), the weights attached to past observations are determined by so-called smoothing parameters. The SES model has only one smoothing parameter, α.
The parameter α (0 < α < 1) indicates the rate of decay. A value of 1 puts all the weight on the last observation (and zero weight on all other past observations), while a value close to zero more evenly distributes the weights to past observations. In all cases, the weights of past observations decay exponentially: older observations have a lesser weight than more recent observations.
In the SES model, the forecast for T+1, in a series of T observations, is:
$$\hat{Y}_{T+1} = \alpha Y_T + (1-\alpha)\,\hat{Y}_T$$
In this formula:
- $\hat{Y}_{T+1}$ is the forecast for period T+1,
- $Y_T$ is the actual value in period T,
- $\hat{Y}_T$ is the (earlier) forecast for period T, and
- $\alpha$ is the smoothing parameter.
Since it is assumed that the level of the model does not change over time, the forecast for h (the planning horizon) periods ahead ŶT+h is the same as the forecast for ŶT+1.
The interesting thing to note about this formula is that it describes a so-called recursive model. Why is that?
In the model, the forecasted value for T+1 is a function of the actual and forecasted values at period T. The forecasted value at period T, in turn, is a function of the actual and forecasted values at period T-1. The latter value is a function of the actual and forecasted values at T-2, and so on and so forth.
By expanding the formula above, it can be shown that:
$$\hat{Y}_{T+1} = \alpha Y_T + \alpha(1-\alpha)Y_{T-1} + \alpha(1-\alpha)^2 Y_{T-2} + \dots$$
Since α is a positive number between 0 and 1, (1-α) is also a positive number between 0 and 1. The weight of the observation at any period T-k is equal to the weight of the observation at T-k+1, multiplied by (1-α). For example, if α = 0.8, then the weight of the observation at lag 3 is 0.2 times the weight of the observation at lag 2.
The rate of decay at various levels of α can be visualized using the Excel-sheet SES_Weight_Decay
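The same decay pattern is easy to reproduce in base R (a sketch; the weight of the observation at lag k is α(1-α)^k):

k <- 0:10
plot(k, 0.8 * (1 - 0.8)^k, type = "b",
     xlab = "Lag (k)", ylab = "Weight", main = "Decay of weights for two values of alpha")
lines(k, 0.2 * (1 - 0.2)^k, type = "b", col = "red")
legend("topright", legend = c("alpha = 0.8", "alpha = 0.2"),
       col = c("black", "red"), lty = 1)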
Let’s go through the forecasting process using the stepwise procedure described above.
So far, we have used the monthly prices of Robusta coffee.
Now, we want to focus on predicting the quarterly prices of Arabica coffee.
The prices of Arabica coffee are stored in the Coffee_Prices data set, which comes with the TSstudio package in R.
Let’s load, describe and visualize the data.
library(TSstudio)
data("Coffee_Prices")
head(Coffee_Prices)
## Robusta Arabica
## [1,] 0.6968643 0.9409
## [2,] 0.6887074 0.9469
## [3,] 0.6887074 0.9281
## [4,] 0.6845187 0.9303
## [5,] 0.6906915 0.9200
## [6,] 0.6968643 0.9123
arabica <- Coffee_Prices[,2]; head(arabica)
## [1] 0.9409 0.9469 0.9281 0.9303 0.9200 0.9123
ts_info(Coffee_Prices)
## The Coffee_Prices series is a mts object with 2 variables and 701 observations
## Frequency: 12
## Start time: 1960 1
## End time: 2018 5
ts_plot(Coffee_Prices)
From the graph, it’s clear that the prices of Robusta and Arabica are highly correlated, and at least until the early 1980s the two time series were almost identical. In the late 1980s, there was a break in the trend, and since then the prices have diverged, with the prices of Arabica well above the prices of Robusta. Moreover, since 2005, the peaks and troughs of the two series no longer coincide. The peaks in Arabica prices are very high, and the series is more volatile.
However, our interest is in quarterly prices of Arabica. We prepare the data using the aggregate() function of the stats package. Since we have monthly data, with a frequency of 12 (there are twelve months in a year), we will set the frequency at 4 instead.
The aggregate() function groups our data into slices of 3 months, and then computes a summary statistic, specified in the FUN= argument. For price levels, we can use the mean of the monthly prices in each quarter.
arabicaQ <- aggregate(arabica, nfrequency=4, FUN = "mean")
head(arabica,12)
## [1] 0.9409 0.9469 0.9281 0.9303 0.9200 0.9123 0.9160 0.9292 0.9226 0.9237
## [11] 0.9116 0.9006
head(arabicaQ,4)
## [1] 0.9386333 0.9208667 0.9226000 0.9119667
ts_plot(arabicaQ,
        title = "The Arabica Coffee Quarterly Prices",
        Ytitle = "Price in USD",
        Xtitle = "Year")
Recall that the SES model applies to a series without trend and seasonality.
In order to justify the model selection, we can decompose the time series into components, and visualize the results.
ts_decompose(arabicaQ)
Since the trend has the same shape as the time series, we can conclude that there is no clear upward or downward trend in the data apart from the cyclical variations that occur since the late 1970s.
After accounting for the trend and the cycle, what remains is seasonality and random variation. It seems that there is some seasonal (quarterly) variation in the data, but for a correct interpretation, we have to look at the scales. The seasonal fluctuation is in the range from -0.05 to +0.05, which is very low compared to the level of the series (around 4, since 2010).
We can visualize the seasonal component using the ggsubseriesplot() function.
From the graph below, we see that indeed the price levels differ only marginally between the quarters.
library(forecast)
ggsubseriesplot(arabicaQ)
What we do observe is that the random fluctuations are higher in recent years than they were in the past. It may help to analyze the causes of the spikes in the time series over recent years. If these causes can be predicted, then we can include them in, for example, a regression model. If the causes of high prices are related to unpredictable events (weather, drought, or political unrest in certain parts of the world), then a regression model might add to our understanding of how prices are affected, but be of limited use in predicting future prices!
For now, we will continue using the SES-model for forecasting.
For estimating and evaluating the model, we will split the data set into a training set and a test set.
When splitting the data, we have to decide on the length of the test set. If our forecasting horizon is 2 years (8 quarters), then it makes sense to use a test set of the same length. Of course, we can use test sets of different lengths (1 year and 2 years), which would allow us to assess the forecasting accuracy at different forecasting horizons. It is conceivable that some models generate better forecasts for the short run, while others perform better over longer forecasting horizons!
arabicaQ_split <- ts_split(arabicaQ, sample.out = 8)
train <- arabicaQ_split$train
test <- arabicaQ_split$test
ts_info(train)
## The train series is a ts object with 1 variable and 225 observations
## Frequency: 4
## Start time: 1960 1
## End time: 2016 1
ts_info(test)
## The test series is a ts object with 1 variable and 8 observations
## Frequency: 4
## Start time: 2016 2
## End time: 2018 1
We can combine the training and test sets (which are both defined as single time series) in one data set, as multiple time series, and plot them in one graph. By using the slider, we can zoom in on the most recent years, and have a better idea of the forecasting challenge that lies ahead of us!
tr1 <- as.data.frame(train) # define as data frames
te1 <- as.data.frame(test)
names(tr1) <- "train" # label the variables in the data frames
names(te1) <- "test"
tr1$d <- 1:nrow(tr1) # add an index (d), from 1 to 225 for the training set
te1$d <- (1+nrow(tr1)):(nrow(tr1)+nrow(te1)) # idem, from 226 to 233 for the test set
trte <- merge(tr1,te1,by="d",all=TRUE) # merge the data
trte_mts <- ts(trte[,c("train","test")],start=c(1960,1),frequency=4) # define as multiple time series
ts_info(trte_mts)
## The trte_mts series is a mts object with 2 variables and 233 observations
## Frequency: 4
## Start time: 1960 1
## End time: 2018 1
ts_plot(trte_mts, type = "single", Xgrid = TRUE, Ygrid = TRUE,
        title = "Arabica Training and Test Set",
        Xtitle = "Year", Ytitle = "Average Quarterly Price",
        slider = TRUE)
The SES model can be estimated using the ses() function of the forecast package in R.
In using this function, three arguments are of importance: the forecast horizon (h), the method for estimating the initial state (initial), and the smoothing parameter (alpha).
Leaving out the alpha argument implies that an optimal value will be estimated. However, this estimate is not necessarily the best value for forecasting!
Since our main aim is to produce accurate forecasts, we will first look at what the ses() function suggests as an estimate of α, and then - in the next step - compare the accuracy of the forecast to alternative models.
The model estimation is straightforward:
library(forecast)
fc_ses_default <- ses(train, h = 8, initial = "optimal")
fc_ses_default$model
## Simple exponential smoothing
##
## Call:
## ses(y = train, h = 8, initial = "optimal")
##
## Smoothing parameters:
## alpha = 0.9999
##
## Initial states:
## l = 0.6328
##
## sigma: 0.4233
##
## AIC AICc BIC
## 835.7267 835.8353 845.9750
The output shows that α as estimated by ses() is very close to 1.
Note that, since the forecast one period ahead (T+1) is then almost completely determined by the last observation in period T, this model is practically identical to the naive model!
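We can check this equivalence directly (a quick sketch, reusing the objects estimated above):

# with alpha close to 1, the SES point forecast is (almost) the last observation
tail(train, 1)
fc_ses_default$mean[1]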
The initial state is estimated. We would expect the initial state to be close to the price of Arabica coffee at the start of the series in 1960 (around 0.90). But the output shows that the initial state is estimated as 0.63, which probably takes into consideration the long run upward trend in the coffee prices.
To be on the safe side, we can estimate the model with initial = “simple”. Indeed, the initial state is now close to the first observations (actually, it’s identical to the first observation).
library(forecast)
fc_ses_simple <- ses(train, h = 24, initial = "simple")
fc_ses_simple$model
## Simple exponential smoothing
##
## Call:
## ses(y = train, h = 24, initial = "simple")
##
## Smoothing parameters:
## alpha = 1
##
## Initial states:
## l = 0.9386
##
## sigma: 0.4209
cat("First 4 observations:",arabicaQ[1:4],"\n")
## First 4 observations: 0.9386333 0.9208667 0.9226 0.9119667
cat("Initial state",fc_ses_simple$model$par[2],"\n")
## Initial state 0.9386333
We can plot the actual and forecasted data using the test_forecast() function from the TSstudio package in R.
But, as expected, the implications for the forecasts are negligible, given the high α and the length (almost 60 years of data) of the series.
Now that we have estimated the model, we can visualize the actual observations in both the training and the test set, and show the 24-month (eight-quarter) forecasts. The forecasts of the SES model, as explained, lie on a horizontal line.
Try to change the code below to plot the graph for the fc_ses_simple model, and see that the forecasted prices in the test quarters are the same as for fc_ses_default, namely 4.87.
test_forecast(actual = arabicaQ,
forecast.obj = fc_ses_default,
test = test) %>%
layout(title = "Arabica Coffee Prices Forecast versus Actual",
xaxis = list(range = c(2005, max(time(arabicaQ)))),
yaxis = list(range = c(2, 7)))
What we see from the graph is that the forecasts using the default SES model (which is almost identical to the naive model) are systematically higher than the actual values in the test set.
This is despite the fact that the red line (the fitted values) is very close to the actual values in the training set.
But that is exactly what we would expect in an SES model with α≈1: the fitted values are identical to the actual values with a lag of 1 period!
So, let’s have a closer look at the accuracy of our models!
For the evaluation of our model, we need the following:
- one or more benchmark models (here, the naive model), and
- one or more accuracy measures, computed on both the training and the test set.
In our example, the default SES model with an estimated α≈1 is almost identical to the simple naive model, which forecasts the value at T+1 as the last observation at T. That is, using the naive model as the benchmark would lead us to simply adopt the naive model for forecasting.
More formally, we can assess the accuracy of our forecasts using a variety of measures.
Below, we first train the naive model, with a forecasting horizon of 24 periods, and then compare the accuracy of the naive and SES models:
fc_naive <- naive(train,h=24)
accuracy(fc_naive, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0105803 0.4218174 0.2506363 -0.2608367 8.689122 0.3840583
## Test set 0.1216767 0.3189048 0.2632316 2.8297994 7.457427 0.4033586
## ACF1 Theil's U
## Training set 0.2495545 NA
## Test set 0.6548135 1.603047
accuracy(fc_ses_default,test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01189372 0.4213831 0.2508906 -0.1149186 8.795616 0.3844480
## Test set 0.12167690 0.3189049 0.2632316 2.8298067 7.457427 0.4033586
## ACF1 Theil's U
## Training set 0.2488819 NA
## Test set 0.6548135 1.603047
The Holt model is an expanded version of the SES model. The model is based on double smoothing, with two parameters rather than one. In addition to the α that has the same function as in SES, the Holt model introduces a second parameter, β, which is intended to smooth the slope of the linear trend.
Many decades ago, Holt recognized the shortcomings of using simple exponential smoothing for time series with a trend, and modified the model to account for a linear trend. This became known as the Holt model. It is more complicated than SES, but essentially works in the same way. Saying that the second parameter (β) simply “accounts for the trend” is misleading: like in SES, we estimate the initial state (here, the initial slope of the linear trend, i.e. the increase per unit of time), and β smooths the variations in the slope over time. For a close to perfect linear trend, these variations are quite small, and hence β will be low!
Like in SES, the default settings will estimate values for α and β, but these values are not necessarily the best for forecasting! So, we have to go through all the steps of the forecasting process: we have to use a training and a test set; we have to assess the forecasting performance based on the accuracy of the forecast in the test set; and we have to think of alternative models to minimize the forecasting error.
Alternative models can be scanned in a grid search, with a range of values for both smoothing parameters and their impact on a relevant measure of accuracy, like MAPE. It is always good to visualize the results, in order to make an informed choice for the best model.
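For the Holt model, such a grid search over β can be sketched as follows (y_train and y_test stand for any training/test split, such as the objects from the earlier sketches; holt() may reject some parameter combinations, hence the tryCatch):

library(forecast)
betas <- seq(0.1, 0.9, by = 0.1)
mape <- sapply(betas, function(b) {
  tryCatch({
    fc <- holt(y_train, h = length(y_test), beta = b)
    accuracy(fc, y_test)["Test set", "MAPE"]
  }, error = function(e) NA)
})
data.frame(beta = betas, MAPE = mape)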
Although we have introduced Holt as the model for a linear trend, the parameters of the model allow us to deal with short run forecasts for series with nonlinear trends, or more complex time series with breaks in trends, as the example below will show.
The example is based on demand for a product that was affected by the Covid-19 pandemic. After 2020, demand went down sharply, but picked up even before the pandemic ended, at an accelerated pace.
Let’s read and visualize the time series.
demand <- read.csv("demand.csv")
demand <- ts(demand, end=c(2023,4), frequency=4)
ts_info(demand)
## The demand series is a ts object with 1 variable and 48 observations
## Frequency: 4
## Start time: 2012 1
## End time: 2023 4
ts_plot(demand, Xgrid = TRUE, Ygrid = TRUE,
        title = "Demand for Product",
        Xtitle = "Year", Ytitle = "Demand in Units",
        slider = TRUE)
We will split the time series in a training set and a test set. If our forecasting horizon is two years (eight quarters), then it makes sense to set the length of the test set at eight quarters.
Using the default settings in the holt() function of the forecast package, we would get the following. The optimal initial values of the level and the slope are computed by the model. Alternatively, we can use a simple method based on the first observations. Like in the SES-model, in a time series of sufficient length, the method chosen has a negligible impact on the parameters, and hence on the forecasted values.
demand_par <- ts_split(demand, sample.out = 8)
traind <- demand_par$train
testd <- demand_par$test
fc_demand_holt <- holt(traind, h = 8, initial = "optimal")
fc_demand_holt$model
## Holt's method
##
## Call:
## holt(y = traind, h = 8, initial = "optimal")
##
## Smoothing parameters:
## alpha = 0.9985
## beta = 0.0941
##
## Initial states:
## l = 14603.0487
## b = 161.9411
##
## sigma: 80.9869
##
## AIC AICc BIC
## 504.8838 506.6485 513.3282
accuracy(fc_demand_holt, testd)
## ME RMSE MAE MPE MAPE MASE
## Training set 10.99164 76.83093 61.7626 0.05241543 0.3460477 0.08790283
## Test set -665.87355 1141.19097 854.8069 -3.29699460 4.0874667 1.21659297
## ACF1 Theil's U
## Training set 0.0192117 NA
## Test set 0.2975704 1.08055
test_forecast(demand, forecast.obj = fc_demand_holt, test = testd)
Both from the accuracy measures and from the graph, it is obvious that this model performs poorly, for the simple reason that the sharp decrease in demand occurs within the quarters of the test set.
Even though there’s a strong linear trend in the training data, the β is quite low. That does not deny the existence of the strong linear trend; it just confirms that the variations in the slope of the linear trend are minor.
The MAPE confirms what we can see from the graph: even though the fitted values (the red line) match the actual observations very well, the forecasts for the 8 quarters of the test set are very poor. The MAPE in the test set is almost 12 times the MAPE in the training set!
Since the forecasts in the test set turn out to be inaccurate forecasts of the actual observations, we have little confidence in the capability of this model to generate accurate forecasts. It makes sense to look for better models.
Given the uncertainty in the data caused by a unique event (COVID-19), it is probably better to choose a shorter forecasting horizon, e.g., 4 quarters rather than 8 quarters. In addition, we can try different values for β, since it is likely that the changes in the slope of the curve are strongly linked to recent changes in the slope.
Let’s put these two changes into practice:
In addition, we will use the option of dampening the trend, which prevents sudden peaks or troughs in recent periods from inflating or deflating the forecasts.
For a good comparison, we will estimate a model with the default settings (4.a) as well.
# Training and test set; test set consists of 4 quarters, equal to the forecast horizon
demand_par_4 <- ts_split(demand, sample.out = 4)
traind4 <- demand_par_4$train
testd4 <- demand_par_4$test
# Default settings
fc_demand_holt_4_a <- holt(traind4, h = 4, initial = "optimal")
fc_demand_holt_4_a$model
## Holt's method
##
## Call:
## holt(y = traind4, h = 4, initial = "optimal")
##
## Smoothing parameters:
## alpha = 0.4323
## beta = 1e-04
##
## Initial states:
## l = 14644.2311
## b = 151.4496
##
## sigma: 391.4726
##
## AIC AICc BIC
## 697.6633 699.2422 706.5842
test_forecast(demand, forecast.obj = fc_demand_holt_4_a, test = testd4)
accuracy(fc_demand_holt_4_a, testd4)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.05613861 373.2545 151.563 -0.01178639 0.7849943 0.215217
## Test set 1306.14875103 1414.0990 1306.149 5.61325289 5.6132529 1.854709
## ACF1 Theil's U
## Training set 0.1035931 NA
## Test set 0.1770183 2.379496
# Our own settings for beta, and a dampened trend
fc_demand_holt_4_b <- holt(traind4, h = 4, initial = "optimal", beta=.70, damped=TRUE)
fc_demand_holt_4_b$model
## Damped Holt's method
##
## Call:
##  holt(y = traind4, h = 4, damped = TRUE, initial = "optimal",
##      beta = 0.7)
##
## Smoothing parameters:
## alpha = 0.7
## beta = 0.7
## phi = 0.8
##
## Initial states:
## l = 14642.7038
## b = 172.036
##
## sigma: 512.9011
##
## AIC AICc BIC
## 720.324 721.903 729.245
accuracy(fc_demand_holt_4_b, testd4)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 59.3026 482.8805 194.9087 0.2980881 1.000961 0.276767 -0.22306674
## Test set 419.7282 493.1221 419.7282 1.7940209 1.794021 0.596007 0.04825039
## Theil's U
## Training set NA
## Test set 0.8351803
test_forecast(demand, forecast.obj = fc_demand_holt_4_b, test = testd4)
The results indicate that model 4.b performs quite well. The MAPE is much lower than in model 4.a. The MAPE in the test set is still higher than in the training set but, firstly, this is common, since the data in the training set are used in estimating the model; and, secondly, especially for this time series, a unique event (COVID-19) has an impact on the latest observations, including the observations in the test set.
Probably, the MAPE for the test set would be even lower if we had left out the option to dampen the trend. However, our ultimate aim is to forecast future demand rather than the test set! The dampened trend, in this example, produces a more conservative forecast of future demand, implicitly assuming that the exponential increase of catch-up demand after COVID-19 will slow down. Whether this assumption holds true, we obviously do not know. The quest for a stable forecasting model is reflected in the feedback loop of the forecasting process!
Again, we see that it’s not recommended to use the default settings!
The best ways to assess the validity of the forecasting model are to:
1. Use training and test sets of an appropriate length. Whatever is appropriate depends on the length of the time series, our forecasting horizon, and the volatility of the series.
2. Use a grid search for the parameter values that minimize the accuracy statistics in the test set. Often, parameter values in a range (e.g., 0.40 < α, β < 0.70) tend to perform satisfactorily, and, hence, there is no need to hunt for the very best parameter with four-decimal precision (e.g., α = 0.6487). The optimal parameter is bound to change slightly as time proceeds, and approximations (α = 0.65) that are kept constant for some time often produce robust forecasts.
3. Visualize patterns in the forecasts. In our last model, there is a pattern in the forecasts, in the sense that the differences between the actual observations and the forecasts in the test set are widening. Preferably, the differences should be random, rather than systematic. The pattern in our example is partly a consequence of our choice of a dampened trend. If we expect the accelerating trend to be permanent, we can opt for a different model, like an exponential trend.
The Holt-Winters model is an extended version of the Holt model (which is in its turn an extension of the SES model). The Holt-Winters model is able to handle time series with both a trend and a seasonal component. For forecasting the seasonal component, the Holt-Winters model introduces a third parameter (γ, pronounced as gamma).
With the introduction of a third parameter, the mathematical formulas become even more complex, but it’s important to realize that the principle behind these parameters remains the same.
For an intuitive understanding, the easiest is to keep in mind the interpretation of the α parameter of the SES model. The parameter adjusts the level of the time series at T+1, if the actual value at T differs from the forecast for T. If the actual value is higher than the forecast, then it’s interpreted as a signal that something is happening. A high α uses most or all (α=1) of the difference to adjust the level of the series. A low α virtually ignores the signal.
The same principle is used in Holt and in Holt-Winters models. For Holt-Winters, a low γ assumes a steady seasonal pattern, while a high γ adjusts the seasonal component if, for example, the most recent observation for January was well above what we observed in January in earlier years.
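For reference, this is the standard additive form of the Holt-Winters equations (a textbook formulation, not reproduced from this chapter), with ℓ the level, b the trend, s the seasonal component, and m the number of seasons per year:

$$\begin{aligned}
\ell_t &= \alpha\,(Y_t - s_{t-m}) + (1-\alpha)\,(\ell_{t-1} + b_{t-1})\\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1}\\
s_t &= \gamma\,(Y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)\,s_{t-m}\\
\hat{Y}_{t+h} &= \ell_t + h\,b_t + s_{t+h-m(\lfloor (h-1)/m\rfloor + 1)}
\end{aligned}$$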
There are several packages in R that have functions for forecasting with the popular Holt-Winters model. In this section, we will mainly use the HoltWinters() function from the stats package. Alternatively, we can make use of the hw() function of the forecast package, which has options for exponential trends and dampened trends.
We will take the following steps: read and visualize the data; check for trend and seasonality to justify the model selection; split the data into a training and a test set; estimate and evaluate the model; and refine the model using a grid search.
We will use quarterly demand for a product, in the years 2003-2022. The data can be found in demand_hw.csv.
demand_hw <- read.csv("demand_hw.csv")
head(demand_hw)
## demand_hw
## 1 6891.8
## 2 4929.3
## 3 4653.7
## 4 6063.8
## 5 7233.1
## 6 4774.0
Since the data file only contains the demand values and not the time index, we have to define the data as a time series, with the starting quarter (2003, Q1), as follows.
demand_hw_ts <- ts(demand_hw, start=c(2003,1),frequency=4)
We can inspect and plot the time series, and see if there is a seasonal (quarterly) component.
ts_info(demand_hw_ts)
## The demand_hw_ts series is a ts object with 1 variable and 79 observations
## Frequency: 4
## Start time: 2003 1
## End time: 2022 3
ts_plot(demand_hw_ts)
plot(decompose(demand_hw_ts))
ts_seasonal(demand_hw_ts)
The plots suggest an upward trend, which is accelerating since around 2010.
The seasonal plot confirms that there’s a strong and systematic seasonal component. In each of the years 2003-2021, demand in Q1 is around two thousand units higher than in Q2 and Q3. Demand in Q4 is in between the peaks in Q1 and the troughs in Q2/Q3.
Note also that demand in all quarters of any year is higher than demand in the same quarters of the previous year, with some exceptions where lines cross. This pattern indicates an upward trend, and, since the gaps between succeeding years are increasing, the trend is not strictly linear.
With that information, we can justify the use of the Holt-Winters model. Since we want to train the model, and assess its performance with the data that we have at our disposal, we partition the data set in a training and a test set. In the case of seasonal data, we always want to make sure that the test set includes all seasons. Since we have quarterly data, the minimum length of the test set is therefore 4. If our forecasting horizon is one year, we can pick 4 as the length of the test set, but since the last year of observation may be idiosyncratic, longer test sets (8, or even 12 quarters) are also good choices. Using both short and longer test sets may shed light on the forecasting accuracy of various models in the short, medium and longer run!
For the sake of ease, we will stick to a test set length and forecasting horizon of h=4.
d_hw_par <- ts_split(demand_hw_ts, 4)
train_hw <- d_hw_par$train
test_hw <- d_hw_par$test
Next, we use the training set to estimate the model, and use the model to fit and forecast the observations in the training and test set.
The output shows the estimates of the parameters α, β and γ. The sprintf() function is used to force R to print the output in 4 decimals.
d_hw <- HoltWinters(train_hw)
cat("alfa =",sprintf("%.4f",d_hw$alfa),"\n")
## alfa =
cat("beta =",sprintf("%.4f",d_hw$beta),"\n")
## beta = 0.0917
cat("gamma =",sprintf("%.4f",d_hw$gamma),"\n")
## gamma = 0.2025
We can use the forecast() function to forecast values for the test set (4 quarters, 2021Q4 to 2022Q3), and beyond (for example the next 8 quarters on which we don’t have data, 2022Q4 to 2024Q3).
The forecasts come with both point estimates and 80% and 95% confidence intervals. You will notice that the confidence intervals are in quite a broad range, and the range increases with the number of quarters forecasted ahead.
In many cases, the ranges are so broad that they hardly give any guidance about what the future may bring!
fc1 <- forecast(d_hw, 12)
plot_forecast(fc1, title="Fit & Eight Quarter Forecast", Xtitle = "Year", Ytitle = "Demand")
From the fitted values, it is obvious that the model fits the data quite well. The same holds true for the forecasts of the test set. But sometimes, the devil is in the details. On careful inspection, it seems that the model underestimates the peaks that occur in the actual data (the blue line). In the same vein, the forecast for the peak quarter in the test set is also an underestimation of actual demand.
It would help to analyze and explain which factors cause this systematic error. And maybe we can find a model that better accounts for these errors, and is thereby likely to produce more accurate forecasts!
One possible solution is to search for parameters for α, β and γ that give better fits and forecasts in the training and test set.
One way to do it is to manually try a range of values. But that is time consuming, and the outcome may be arbitrary, based on only those values that you have actually tried.
We can use a more systematic approach, called a grid search, in which we loop over a range of values for the three parameters, and see how they perform in terms of one of the available criteria, like MAPE.
Since a grid search may take considerable computing time, even on our modern fast devices, we will normally start with big steps.
The three parameters should have values in the range from 0 to 1. We can try different values in steps of 0.10, for each parameter, and record which combinations produce good forecasts in the test set.
Once we have narrowed down the ranges of the three parameters, we can repeat the process using smaller steps (e.g., we search for the best α in the range from 0.20 to 0.40, in steps of 0.02).
Of course, we can keep on refining, but normally there’s a range of values for the three parameters that produce robust forecasts. Looking time and again for the very best set of parameter values at a specific point in time, does not necessarily lead to better forecasts, and comes at the cost of continuously updating the model (although that can be automated). A transparent, easy and robust approach, would lead us to fix the parameters - although the feedback loop in the forecasting process suggests regular model reviews and updates.
Let’s try the grid search for our model.
In the ts_grid() function, we can set the ranges for the three parameters. Here, we have picked ranges from 0 to 1, in steps of 0.10. Since the parameter values in the Holt-Winters specification have to be larger than 0, we actually start with a very small value (0.001) instead of 0.
The advantage of the grid search using ts_grid is that it makes use of an advanced method for backtesting that is conceptually close to cross-validation. In a nutshell, the series is divided into many training and test partitions. By default, the function uses expanding windows, as explained in the graph below.
The window_space and window_test arguments allow us to set the lengths of the training partitions and of the test partitions, respectively. The periods argument determines how far we go back in time. In our example, we use 4 backtesting periods, each with a test window of 4 quarters.
gridsearch <- ts_grid(train_hw,
                      model = "HoltWinters",
                      periods = 4,
                      window_space = 4,
                      window_test = 4,
                      hyper_params = list(alpha = seq(0.001, 1, 0.1),
                                          beta  = seq(0.001, 1, 0.1),
                                          gamma = seq(0.001, 1, 0.1)),
                      parallel = FALSE)
plot_grid(gridsearch)
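After inspecting the plot, we can refine the search with smaller steps around the most promising region, and plug the winning parameters back into HoltWinters(). A sketch (the narrowed ranges are illustrative; to our understanding, ts_grid() returns the optimal values in the $alpha, $beta and $gamma components):

gridsearch_deep <- ts_grid(train_hw,
                           model = "HoltWinters",
                           periods = 4,
                           window_space = 4,
                           window_test = 4,
                           hyper_params = list(alpha = seq(0.1, 0.5, 0.02),
                                               beta  = seq(0.001, 0.2, 0.02),
                                               gamma = seq(0.1, 0.5, 0.02)),
                           parallel = FALSE)
# fit the final model with the best parameter set found in the refined search
d_hw_best <- HoltWinters(train_hw,
                         alpha = gridsearch_deep$alpha,
                         beta  = gridsearch_deep$beta,
                         gamma = gridsearch_deep$gamma)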