Introduction

We have looked at time series decomposition.

Decomposition is mainly meant for descriptive purposes. It reveals the existence of the main components: the trend, the seasonal pattern, and the remainder (the irregular component).

In addition, the graph of the series may reveal irregularities, such as outliers or structural breaks.

Once we have a firm grasp of the time series, its components and any irregularities, we are in a better position to select the best model to both describe the data and produce the best forecasts.

Simple models are often used for benchmarking. For example, we can use naive models, or seasonally naive models, for forecasting.

Even though the main aim of decomposition methods is description of the time series, they can be used for forecasting. If, for example, we know that the trend in demand for our product is linear, and that demand in February is on average 10% lower than usual, then it’s straightforward to make a forecast by extrapolating the trend, and adjusting it for the lower demand in February.
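As a back-of-the-envelope sketch in R (with made-up numbers for the trend and the February adjustment), such a decomposition-based forecast boils down to simple arithmetic:

# Hypothetical numbers: a linear trend of +2 units per month, currently at 100,
# and a seasonal index of 0.90 for February (10% below the average month)
trend_now <- 100    # assumed trend level in the last observed month
slope     <- 2      # assumed monthly increase of the trend
h         <- 3      # suppose February is 3 months ahead
seas_feb  <- 0.90   # seasonal index for February
(trend_now + h * slope) * seas_feb   # forecast for February
## [1] 95.4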

As we have seen, decomposition methods do have some drawbacks. The main drawback is that the trend and seasonal patterns are assumed to be fixed over the length of the time series (although variations of the decomposition method offer some flexibility in this regard).

In general it is more realistic to assume that patterns change, and that more recent observations provide more information about the future than observations from long ago.

In general, we can probably do a better job in forecasting by putting more weight on recent observations.

Here is where exponential smoothing (ES) comes in. ES is a family of techniques, rather than one specific technique.

In this chapter, we will discuss the most common ES-methods, and see how they compare to simple methods.

Workflow for Forecasting

The figure below presents the workflow for forecasting.

Reading and preparing data is a challenge, especially because of the specific nature of time series (with time indices, or time stamps). If you have worked with date and time data before, for example in Excel, you may have noticed that it's not always obvious how dates are displayed on your screen, as opposed to how they are stored internally.

Dates can be displayed in a variety of formats, like “MM/DD/YYYY” in the US, or “DD/MM/YYYY” which is common in Europe.

For checking whether your data have been read properly by your system and make sense, it is always good to describe and visualize your data. The graph of the time series will also give you an idea of any trend, cyclical and seasonal patterns in the data. If seasonal patterns are absent from the data (for example, because you have aggregated your monthly data to annual data), then there's no point in modeling them.

For a time series with a seasonal pattern, you have a choice of models to select from. Going from simple to more advanced, you can use:

  • the seasonally naive model,
  • forecasting based on decomposition, or
  • the Holt-Winters model, discussed later in this chapter.

In general, more advanced models are better able to describe, or fit, your data. And often, but not always, they also give better forecasts! Complex models have a tendency to overfit, implying that they contain parameters with little predictive power.

At the end of the day, our aim is to make a good forecast. The proof of the pudding is in the eating! Our main interest is not in models that produce a great fit between the actual in-sample data and the data reproduced by the model, but in models that better forecast out-of-sample future data. The challenge is that we don't know the future values (that's the reason why we ventured into forecasting to start with).

In order to still assess the predictive power of the model, we partition (split) our data into a training and a test set.

Once we have decided on the best model to use, we can deploy that model using the full data set to make forecasts. But even this model is not set in stone. The forecasts have to be evaluated, and it pays off to check - especially in case of systematically poor forecasts - if the forecasting model needs revision. Alternative models can then be benchmarked against the current model.

We will discuss five types of smoothing methods (the first two are moving average methods, which serve as stepping stones to the exponential smoothing methods proper):

  1. Simple Moving Average (SMA)
  2. Weighted Moving Average (WMA)
  3. Simple Exponential Smoothing (SES)
  4. The Holt Model (ES for trended data)
  5. The Holt-Winters Model (ES for time series with trend and seasonality).

1. Simple Moving Average (SMA)

We have already used the concept of moving averages, in our discussion of decomposition methods. In the case of quarterly data, for example, the average of four consecutive quarters filters out the seasonality in the data, and estimates the trend (or: trend plus cycle) in the data.

In doing so, we centered the moving average. The moving average of, say, the first quarter of 2024, is based on previous quarters and next quarters. We call that a two-sided rolling window.

For forecasting using SMA, we take a slightly different approach.

If we have a time series of t observations, then the forecast for t+1 is based on the preceding m observations. We are using a so-called left tail rolling window.

For SMA, we assume that the time series has neither a trend nor a seasonal pattern, and the choice of the length of the rolling window (m) is based on the number of past observations that are informative for forecasting future periods. For quarterly data, m does not have to be 4, or a multiple thereof!

In the simplest case, the length of our left tail rolling window (m) is equal to 1. The forecast for period t+1 is equal to the last observation. Note that SMA(1) is equivalent to the naive model!

Since we don’t have observations for t+1, the forecast for t+2 will make use of the forecast for t+1. Like in the naive model, the forecasts for the next periods will converge to a horizontal line, at the level of the latest observation in period t.

We can base our forecasts on a longer rolling window (m>1). The choice of m can be based on intuition and experience, or on performance of the model. In order to evaluate SMA-models with different values for m, we divide the time series in a training set and a test set.

For example, let’s use the data set on coffee prices. The data set is included in the TSstudio package, in R.

The data set contains two time series (hence, the mts format, for multiple time series). We will use the time series on the prices of Robusta coffee, in the first column of the data set.

The graph of the time series shows no clear trend. There are cycles in the time series, of varying duration. Since it’s hard to predict if the current cycle will continue, and in the absence of trend and seasonality, the choice of an SMA-model is justified.

library(TSstudio)
data("Coffee_Prices")
ts_info(Coffee_Prices)
##  The Coffee_Prices series is a mts object with 2 variables and 701 observations
##  Frequency: 12 
##  Start time: 1960 1 
##  End time: 2018 5
head(Coffee_Prices)
##        Robusta Arabica
## [1,] 0.6968643  0.9409
## [2,] 0.6887074  0.9469
## [3,] 0.6887074  0.9281
## [4,] 0.6845187  0.9303
## [5,] 0.6906915  0.9200
## [6,] 0.6968643  0.9123
robusta <- Coffee_Prices[,1]; head(robusta)
## [1] 0.6968643 0.6887074 0.6887074 0.6845187 0.6906915 0.6968643
library(plotly)
ts_plot(robusta,
        title = "The Robusta Coffee Monthly Prices",
        Ytitle = "Price in USD",
        Xtitle = "Year")

The SMA-model is based on simple arithmetic computations, and there’s no need to use sophisticated software.

We will use an Excel-spreadsheet to illustrate the working of the SMA-model.

  • We have written the Robusta data from the TSstudio package to a CSV-file, and opened that file in Excel.
  • Let’s use various SMA-models (m = 1; 6; 12; 24; 36) to generate 24-month forecasts.
  • In order to evaluate these models, we use a training set of all observations apart from the most recent 24 observations, and a test set containing the last 24 observations.

The file can be downloaded here.

The cells in the Excel-sheet contain simple formulas. For example:

  • The SMA(m=12) forecast for February 2017 is based on (observed and forecast) data from the 12 preceding months.
  • The cells (24 rows, or months) marked yellow are part of the test set.
  • For February 2017, we use 4 actual observations (February to May 2016), and 8 forecasts (June 2016 to January 2017).

Following the same logic for all 24 months in the test set, and using the various values for m, we can graph the forecast values against the actual observations.

The following conclusions can be drawn:

  • Whatever the value of m, all models perform poorly, especially if the planning horizon gets longer.
  • If m is small (1; 6; or 12), the short-term forecasts are lower than the corresponding observations in the test set. The short-term forecasts with higher values for m seem to do better. However, this all has to do with the upswing in the cycle in June 2016: the models with small values for m are based on the low level of the time series in the preceding months, while the models with high values for m perform better as they benefit from including the higher levels from a previous cycle.
  • Consequently, the SMA-model can be used for short-term forecasts using either m=1 (equivalent to the naive model), or with large values for m. The cyclical character of the time series makes it hard to use the SMA-model for medium or long run forecasts.
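For those who prefer R over Excel, here is a minimal sketch of the same recursive SMA logic. The function sma_forecast is our own helper, not part of a package:

# Recursive SMA: the forecast for each new period is the mean of the m most
# recent values, where earlier forecasts fill in for missing observations
sma_forecast <- function(y, m, h) {
  x <- as.numeric(y)
  for (i in 1:h) x <- c(x, mean(tail(x, m)))
  tail(x, h)                                        # return the h forecasts
}
train_rob <- head(robusta, length(robusta) - 24)    # all but the last 24 months
fc_sma12  <- sma_forecast(train_rob, m = 12, h = 24)

Comparing fc_sma12 to the last 24 observations of robusta should reproduce the SMA(12) forecasts from the spreadsheet.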

2. Weighted Moving Average

The WMA model is an extension of the SMA model.

While the SMA model uses a simple (unweighted) average of the m lags of the series, the WMA allows you to assign different weights.

The sum of the weights must be equal to 1 (or 100%).

One use of the WMA model would be in case of time series with a seasonal pattern. In the case of monthly data, we could put a heavy weight (say, 0.7) on the 12th lag, and the remainder (1 - 0.7 = 0.3) on the last observation.

But as we have seen, the time series on Robusta prices does not show a clear seasonal pattern. If the length of the cycles were more or less the same over time, we could put weight on the cth lag, where c is the length of the cycle.

For the purpose of illustration:

  • In the part of the time series that starts in 2011 (64 months, excluding the test set), we observe three peaks and troughs; on close inspection, the most recent full cycle has a length of around 26 months.
  • If we would also assume a weak seasonal pattern, then we can set weights at 0.2 for the first lag; 0.2 for an assumed seasonal (12th) lag; and 0.6 for the cyclical (26th) lag.

Following a similar procedure as for SMA, we then obtain the following forecasts:

In the WMA-tab of the spreadsheet, you can play with the weights for two of the three lags. Since the weights have to sum up to 1, the weight of the third lag is computed automatically.

Try different sets of weights, and see which model works best for short-run forecasting!

The trick is in finding the optimal set of weights! For most of the more advanced models, the algorithms in the R-packages that we use will suggest the best weights or parameters.

In order to find the best set, the algorithms have to make use of a criterion. Since our main aim is to make accurate forecasts, the criteria are accuracy measures applied to the test set. We will come back to that!

Unfortunately, the algorithms for finding the best set of weights do not always work the way we would expect. As an alternative, we can use a so-called grid search. In that approach, we examine how the accuracy of our forecasts is affected by changing the weights or parameters in our model. The exercise above, using the Excel-file for WMA, is essentially a kind of (manual) grid search: by changing the set of weights, we automatically observe in the graph the change in accuracy of our forecasts - in the short and longer run.
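The same kind of grid search can be scripted in R. Below is a rough sketch for the three lags used above (1, 12 and 26); the helper wma_forecast is our own, and for simplicity we only evaluate the one-step-ahead error against the first observation of the test set:

# One-step WMA forecast with weights on lags 1, 12 and 26
wma_forecast <- function(y, w1, w12, w26) {
  n <- length(y)
  w1 * y[n] + w12 * y[n - 11] + w26 * y[n - 25]
}
train_rob <- as.numeric(head(robusta, length(robusta) - 24))
test_rob  <- as.numeric(tail(robusta, 24))
# Coarse grid: weights for lags 1 and 26 in steps of 0.2; lag 12 gets the rest
for (w1 in seq(0, 1, 0.2)) {
  for (w26 in seq(0, 1 - w1, 0.2)) {
    w12 <- 1 - w1 - w26
    fc  <- wma_forecast(train_rob, w1, w12, w26)
    cat(sprintf("w1=%.1f w12=%.1f w26=%.1f error=%7.3f\n",
                w1, w12, w26, fc - test_rob[1]))
  }
}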

3. Simple Exponential Smoothing (SES)

Exponential Smoothing models are very popular among the forecasting techniques. This is due to the relative ease of model estimation and use, and to their flexibility in generating accurate forecasts in many cases.

Simple Exponential Smoothing (SES), as the name implies, is the simplest version of the family of exponential smoothing (ES) models. SES models are applied to stable time series, that is, time series without trend and seasonality.

ES models differ from the simple and weighted moving average (MA) models, in two ways.

  • First of all, while MA models take the average of the last m observations, ES models take all past observations into consideration.
  • Secondly, while the weights attached to past observations are either equal for all m observations (in SMA) or can be set by the user (in WMA), the weights attached to past observations are determined by so-called smoothing parameters. The SES model has only one smoothing parameter, α.

The parameter α (0 < α < 1) indicates the rate of decay.

  • A value of 1 puts all the weight on the last observation (and zero weight on all other past observations), while a value close to zero more evenly distributes the weights to past observations.
  • In all cases, the weights of past observations decay exponentially: older observations have a lesser weight than more recent observations.

In the SES model, the forecast for T+1, in a series of T observations, is:

Ŷ_{T+1} = α∗Y_T + (1-α)∗Ŷ_T

In this formula:

  • Ŷ_{T+1} is the forecasted value for period T+1, for a series of n observations (T = 1, …, n)
  • Ŷ_T is the forecasted value of observation T, made at T-1
  • Y_T is the Tth observation of the series
  • α is the smoothing parameter

Since it is assumed that the level of the model does not change over time, the forecast for h (the planning horizon) periods ahead, Ŷ_{T+h}, is the same as the forecast Ŷ_{T+1}.

The interesting thing to note about this formula is that it describes a so-called recursive model. Why is that?

In the model, the forecasted value for T+1 is a function of the actual and forecasted values at period T. The forecasted value at period T, in turn, is a function of the actual and forecasted values at period T-1. The latter value is a function of the actual and forecasted values at T-2, and so on and so forth.

By expanding the formula above, it can be shown that:

Ŷ_{T+1} = α∗Y_T + α∗(1-α)∗Y_{T-1} + α∗(1-α)²∗Y_{T-2} + …

Since α is a positive number between 0 and 1, (1-α) is also a positive number between 0 and 1. The weight of an observation at any period is equal to the weight of an observation one period later, times (1-α). For example, if α = 0.8, then the weight of the observation at lag 3 is 0.2 times the weight of the observation at lag 2.

The rate of decay at various levels of α can be visualized using the weight decay Excel-file that can be downloaded here.
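For a quick impression without Excel, the weights can also be computed directly in R:

# Weight of lag k in SES: alpha * (1 - alpha)^(k - 1)
k <- 1:10
weights <- sapply(c(0.2, 0.5, 0.8), function(a) a * (1 - a)^(k - 1))
colnames(weights) <- c("alpha=0.2", "alpha=0.5", "alpha=0.8")
round(weights, 4)   # fast decay for high alpha, slow decay for low alpha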

Forecasting with SES

Let’s go through the forecasting process using the stepwise procedure described above.

So far, we have used the monthly prices of Robusta coffee.

Now, we want to focus on predicting the quarterly prices of Arabica coffee.

The prices of Arabica coffee are stored in the Coffee_Prices data set, which comes with the TSstudio package in R.

Let’s load, describe and visualize the data.

library(TSstudio)
data("Coffee_Prices")
head(Coffee_Prices)
##        Robusta Arabica
## [1,] 0.6968643  0.9409
## [2,] 0.6887074  0.9469
## [3,] 0.6887074  0.9281
## [4,] 0.6845187  0.9303
## [5,] 0.6906915  0.9200
## [6,] 0.6968643  0.9123
arabica <- Coffee_Prices[,2]; head(arabica)
## [1] 0.9409 0.9469 0.9281 0.9303 0.9200 0.9123
ts_info(Coffee_Prices)
##  The Coffee_Prices series is a mts object with 2 variables and 701 observations
##  Frequency: 12 
##  Start time: 1960 1 
##  End time: 2018 5
ts_plot(Coffee_Prices)

From the graph, it’s clear that the prices of Robusta and Arabica are highly correlated, and at least until the early 1980s the two time series were almost identical.

In the late 1980s, there was a break in the trend, and ever since, the prices have diverged, with the prices of Arabica well above the prices of Robusta. Moreover, since 2005 the peaks and troughs of the two series no longer coincide. The peaks in Arabica prices are higher, and the series is more volatile.

However, our interest is in quarterly prices of Arabica. We prepare the data using the aggregate() function of the stats package. Since we have monthly data, with a frequency of 12 (there are twelve months in a year), we will set the frequency at 4 instead.

The aggregate() function groups our data in slices of 3 months, and then computes a summary statistic, specified in the FUN argument. For price levels, we can use the mean of the monthly prices in each quarter.

arabicaQ <- aggregate(arabica, nfrequency=4, FUN = "mean")
head(arabica,12)
##  [1] 0.9409 0.9469 0.9281 0.9303 0.9200 0.9123 0.9160 0.9292 0.9226 0.9237
## [11] 0.9116 0.9006
head(arabicaQ,4)
## [1] 0.9386333 0.9208667 0.9226000 0.9119667
ts_plot(arabicaQ,
        title = "The Arabica Coffee Quarterly Prices",
        Ytitle = "Price in USD",
        Xtitle = "Year")

Recall that the SES model applies to a series without trend and seasonality.

In order to justify the model selection, we can decompose the time series into components, and visualize the results.

ts_decompose(arabicaQ)

Since the trend component has the same shape as the time series itself, we can conclude that there is no clear upward or downward trend in the data, apart from the cyclical variations that have occurred since the late 1970s.

After accounting for the trend and the cycle, what remains is seasonality and random variation. It seems that there is some seasonal (quarterly) variation in the data, but for a correct interpretation, we have to look at the scales. The seasonal fluctuation is in the range from -0.05 to +0.05, which is very low compared to the level of the series (around 4, since 2010).

We can visualize the seasonal component using the ggsubseriesplot() function from the forecast package.

From the graph below, we see that indeed the price levels differ only marginally between the quarters.

library(forecast)
ggsubseriesplot(arabicaQ)

What we do observe is that the random fluctuations are higher in recent years than they were in the old days. It may help to analyze the causes of especially the spikes in the time series, over recent years. If these causes can be predicted, then we can include these causes in, for example, a regression model. If the causes of high prices are related to unpredictable events (weather, drought, or political unrest in certain parts of the world), then a regression model might add to our understanding of how prices are affected, but be of limited use in predicting future prices!

For now, we will continue using the SES-model for forecasting.

For estimating and evaluating the model, we will split the data set into a training set and a test set.

When splitting the data, we have to decide on the length of the test set. If our forecasting horizon is 2 years (8 quarters), then it makes sense to use a test set of the same length. Alternatively, we can use test sets of different lengths (1 year and 2 years), which would allow us to assess the forecasting accuracy at different forecasting horizons. It is conceivable that some models generate better forecasts for the short run, while others perform better over longer forecasting horizons!

arabicaQ_split <- ts_split(arabicaQ, sample.out = 8)
train <- arabicaQ_split$train
test  <- arabicaQ_split$test
ts_info(train)
##  The train series is a ts object with 1 variable and 225 observations
##  Frequency: 4 
##  Start time: 1960 1 
##  End time: 2016 1
ts_info(test)
##  The test series is a ts object with 1 variable and 8 observations
##  Frequency: 4 
##  Start time: 2016 2 
##  End time: 2018 1

We can combine the training and test sets (which are both defined as single time series) in one data set, as multiple time series, and plot them in one graph. By using the slider, we can zoom in on the most recent years, and have a better idea of the forecasting challenge that lies ahead of us!

tr1 <- as.data.frame(train)  # define as data frames
te1 <- as.data.frame(test)
names(tr1) <- "train"        # label the variables in the data frames
names(te1) <- "test"
tr1$d <- 1:nrow(tr1)         # add an index (d), from 1 to 225 for the training set
te1$d <- (1+nrow(tr1)):(nrow(tr1)+nrow(te1)) # id, from 226 to 233 for the test set
trte <- merge(tr1,te1,by="d",all=TRUE)       # merge the data 
trte_mts <- ts(trte[,c("train","test")],start=c(1960,1),frequency=4) # define as multiple time series
ts_info(trte_mts)
##  The trte_mts series is a mts object with 2 variables and 233 observations
##  Frequency: 4 
##  Start time: 1960 1 
##  End time: 2018 1
ts_plot(trte_mts,type="single",Xgrid=TRUE,Ygrid=TRUE,title="Arabica Training and Test Set",Xtitle="Year",Ytitle="Average Quarterly Price",slider=TRUE)

The SES model can be estimated using the ses() function of the forecast package in R.

In using this function, three arguments are of importance:

  • The initial argument can be set at either “simple” or “optimal”. Setting this argument has to do with the recursive nature of the model. Forecasts for any period are based on forecasted and actual values in the previous period. But for the first observation in the series, that information is not available. In order to start up the process, we have to set the initial state at some value. In the simple method, the initial state is based on simple calculations using the first observations. In the optimal method, a complex optimization method is used. The initial state is shown as l (for level) in the output. In long time series and/or in time series with large values for α, the choice hardly matters.
  • The alpha argument sets the value for α. If the argument is not used, then α will be estimated.
  • The argument h sets the forecasting horizon.

Leaving out the alpha argument means that an “optimal” value will be estimated. However, other values of alpha may work better for forecasting!

Since our main aim is to produce accurate forecasts, we will first look at what the ses() function suggests as an estimate of α, and then - in the next step - compare the accuracy of the forecast to alternative models.

The model estimation is straightforward:

library(forecast)
fc_ses_default <- ses(train, h = 8, initial = "optimal")
fc_ses_default$model
## Simple exponential smoothing 
## 
## Call:
##  ses(y = train, h = 8, initial = "optimal") 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
## 
##   Initial states:
##     l = 0.6328 
## 
##   sigma:  0.4233
## 
##      AIC     AICc      BIC 
## 835.7267 835.8353 845.9750

The output shows that α as estimated by ses() is very close to 1.

Note that, since the forecast one period ahead (T+1) is almost completely determined by the last observation in period T, this model is identical to the naive model!

The initial state is estimated. We would expect the initial state to be close to the price of Arabica coffee at the start of the series in 1960 (around 0.90). But the output shows that the initial state is estimated as 0.63, which probably takes into consideration the long run upward trend in the coffee prices.

To be on the safe side, we can estimate the model with initial = “simple”. Indeed, the initial state is close to the first observations (actually, it’s identical to the first observation).

library(forecast)
fc_ses_simple <- ses(train, h = 24, initial = "simple")
fc_ses_simple$model
## Simple exponential smoothing 
## 
## Call:
##  ses(y = train, h = 24, initial = "simple") 
## 
##   Smoothing parameters:
##     alpha = 1 
## 
##   Initial states:
##     l = 0.9386 
## 
##   sigma:  0.4209
cat("First 4 observations:",arabicaQ[1:4],"\n")
## First 4 observations: 0.9386333 0.9208667 0.9226 0.9119667
cat("Initial state",fc_ses_simple$model$par[2],"\n")
## Initial state 0.9386333

We can plot the actual and forecasted data using the test_forecast() function from the TSstudio package in R.

But, as expected, the implications for the forecasts are negligible, given the high α and the length (almost 60 years of data) of the series.

Now that we have estimated the model, we can visualize the actual observations in both the training and the test set, and show the forecasts for the eight quarters (24 months) ahead. The forecasts of the SES model, as explained, are on a horizontal line.

Try to change the code below, to plot the graph for the fc_ses_simple model, and see that the forecasts of the prices in the quarters are the same as for fc_ses_default, namely 4.87.

test_forecast(actual = arabicaQ,
              forecast.obj = fc_ses_default,
              test = test) %>%
  layout(title = "Arabica Coffee Prices Forecast versus Actual",
         xaxis = list(range = c(2005, max(time(arabicaQ)))),
         yaxis = list(range = c(2, 7)))

What we see from the graph is that the forecasts using the default SES model (which is almost identical to the naive model) are systematically different from the actual values in the test set.

This is despite the fact that the red line (the fitted values) is very close to the actual values in the training set. But that is exactly what we would expect in an SES model with α≈1: the fitted values are identical to the actual values with a lag of 1 period!

So, let’s have a closer look at the accuracy of our models!

For the evaluation of our model, we need the following:

  • A test set, containing data that have not been used in estimating the model. In our case, we have set apart the last 8 quarterly observations (two years) of the data set, to see if these can be accurately predicted from the estimated model.
  • A statistic that enables us to assess how close our forecasts from the training set are to the actual values in the test set.
  • A benchmark. Normally, we would compare our model to a simple model. The idea is that using complex models only pays off if they systematically generate superior forecasts.

In our example, the default SES model, with an estimated α≈1, is almost identical to the simple naive model that forecasts the value at T+1 as the last observation at T. With the naive model as the benchmark, the SES model offers no improvement, so we might as well use the naive model for forecasting.

More formally, we can assess the accuracy of our forecasts using a variety of measures.

Below, we first train the naive model, with a forecasting horizon of 24 periods, and then compare its accuracy to that of the default SES model.

fc_naive <- naive(train,h=24)
accuracy(fc_naive, test)
##                     ME      RMSE       MAE        MPE     MAPE      MASE
## Training set 0.0105803 0.4218174 0.2506363 -0.2608367 8.689122 0.3840583
## Test set     0.1216767 0.3189048 0.2632316  2.8297994 7.457427 0.4033586
##                   ACF1 Theil's U
## Training set 0.2495545        NA
## Test set     0.6548135  1.603047
accuracy(fc_ses_default,test)
##                      ME      RMSE       MAE        MPE     MAPE      MASE
## Training set 0.01189372 0.4213831 0.2508906 -0.1149186 8.795616 0.3844480
## Test set     0.12167690 0.3189049 0.2632316  2.8298067 7.457427 0.4033586
##                   ACF1 Theil's U
## Training set 0.2488819        NA
## Test set     0.6548135  1.603047
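The two models produce virtually the same accuracy in the test set. If we want to check whether other (fixed) values of α forecast better, we can run a small grid search, in the spirit of the WMA exercise. A minimal sketch:

# Grid search over fixed values of alpha for the SES model;
# the test-set MAPE is in row 2, column "MAPE" of the accuracy matrix
for (a in seq(0.1, 0.9, by = 0.2)) {
  fc <- ses(train, h = 8, alpha = a)
  cat("alpha =", a,
      "test MAPE =", round(accuracy(fc, test)[2, "MAPE"], 3), "\n")
}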

4. The Holt Model

The Holt model is an expanded version of the SES model. The model is based on double smoothing, with two parameters rather than one. In addition to the α that has the same function as in SES, the Holt model introduces a second parameter β. The β parameter is intended to smooth the slope of the linear trend.

Holt recognized the shortcomings of using simple exponential smoothing for time series with a trend, and modified the SES model to account for a linear trend. The Holt model is more complex than SES but essentially works in the same way. The focus is on changes in the slope of the linear trend (the increase per unit of time). For a close to perfect linear trend, these variations are quite small, and hence β will be low! That is, β is not comparable to a regression coefficient, in regression analysis!
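For reference, a standard textbook formulation of the Holt model, with level l and slope b, uses the same notation as the SES formula above:

Level: l_T = α∗Y_T + (1-α)∗(l_{T-1} + b_{T-1})
Slope: b_T = β∗(l_T - l_{T-1}) + (1-β)∗b_{T-1}
Forecast: Ŷ_{T+h} = l_T + h∗b_T

The slope equation makes the point above concrete: β only smooths the period-to-period changes in the slope, which is why a near-perfect linear trend goes together with a low β.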

Like in SES, the default settings will estimate values for α and β, but these values are not necessarily the best for forecasting! So, we have to go through all the steps of the forecasting process:

  • We will use a training and a test set.
  • We will assess the forecasting performance based on the accuracy of the forecasts in the test set.
  • We have to think of alternative models to minimize the forecasting error.

Alternative models can be scanned in a grid search, with a range of values for both smoothing parameters and their impact on a relevant measure of accuracy, like MAPE.

Although we have introduced Holt as the model for a linear trend, the parameters of the model allow us to deal with short run forecasts for models with nonlinear trends, or more complex times series with breaks in trends.

The example is based on demand for a product that was affected by the Covid-19 pandemic. After 2020, demand went down sharply, but picked up even before the pandemic ended, at an accelerated pace.

Let’s read and visualize the time series.

demand <- read.csv("demand.csv")
demand <- ts(demand, end=c(2023,4), frequency=4)
ts_info(demand)
##  The demand series is a ts object with 1 variable and 48 observations
##  Frequency: 4 
##  Start time: 2012 1 
##  End time: 2023 4
ts_plot(demand, Xgrid=TRUE,Ygrid=TRUE,title="Demand for Product",Xtitle="Year",Ytitle="Demand in Units",slider=TRUE)

We will split the time series in a training set and a test set. If our forecasting horizon is two years (eight quarters), then it makes sense to set the length of the test set at eight quarters.

Using the default settings, in the holt() function of the forecast package, we would get the following.

The optimal initial values of the level and the slope are computed by the model. Alternatively we can use a simple method based on the first observations.

demand_par <- ts_split(demand, sample.out = 8)
traind <- demand_par$train
testd  <- demand_par$test
fc_demand_holt <- holt(traind,  h = 8, initial = "optimal") 
fc_demand_holt$model
## Holt's method 
## 
## Call:
##  holt(y = traind, h = 8, initial = "optimal") 
## 
##   Smoothing parameters:
##     alpha = 0.9985 
##     beta  = 0.0941 
## 
##   Initial states:
##     l = 14603.0487 
##     b = 161.9411 
## 
##   sigma:  80.9869
## 
##      AIC     AICc      BIC 
## 504.8838 506.6485 513.3282
accuracy(fc_demand_holt, testd)
##                      ME       RMSE      MAE         MPE      MAPE       MASE
## Training set   10.99164   76.83093  61.7626  0.05241543 0.3460477 0.08790283
## Test set     -665.87355 1141.19097 854.8069 -3.29699460 4.0874667 1.21659297
##                   ACF1 Theil's U
## Training set 0.0192117        NA
## Test set     0.2975704   1.08055
test_forecast(demand, forecast.obj = fc_demand_holt, test = testd)

Both from the accuracy measures and from the graph, it is obvious that this model performs poorly, for the simple reason that the sharp decrease in demand occurs within the quarters of the test set.

Even though there’s a strong linear trend in the training data, the β is quite low. That does not deny the existence of the strong linear trend; it just confirms that the variations in the slope of the linear trend are minor!

The MAPE confirms what we can see from the graph: even though the fitted values (the red line) match the actual observations very well, the forecasted values in the 8 quarters of the test set are very poor. The MAPE in the test set is more than ten times the MAPE in the training set!

Since the forecasts in the test set turn out to be inaccurate forecasts of the actual observations, we have little confidence in the capability of this model to generate accurate forecasts. It makes sense to look for better models.

Given the uncertainty in the data caused by a unique event (COVID-19), it is probably better to choose a shorter forecasting horizon, e.g., 4 quarters rather than 8 quarters. In addition, we can try different values for β, since it is likely that the changes in the slope of the curve are strongly linked to recent changes in the slope.

Let’s put these two changes into practice:

  • A short forecasting horizon, of 4 quarters, and
  • Setting the value for β at 0.70.

In addition, we will use the option of dampening the trend, which prevents sudden peaks or troughs in recent periods from leading to overestimated or underestimated forecasts.
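In the common formulation of the damped trend, a dampening parameter φ (0 < φ < 1) shrinks the contribution of the slope as the forecasting horizon grows:

Ŷ_{T+h} = l_T + (φ + φ² + … + φ^h)∗b_T

For long horizons, the forecasts then level off, instead of increasing linearly forever. This corresponds to the phi parameter in the output of the holt() function below.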

For a good comparison, we will estimate a model with the default settings (4.a) as well.

# Training and test set; test set consists of 4 quarters, equal to the forecast horizon
demand_par_4 <- ts_split(demand, sample.out = 4)
traind4 <- demand_par_4$train
testd4  <- demand_par_4$test
# Default settings
fc_demand_holt_4_a <- holt(traind4,  h = 4, initial = "optimal") 
fc_demand_holt_4_a$model
## Holt's method 
## 
## Call:
##  holt(y = traind4, h = 4, initial = "optimal") 
## 
##   Smoothing parameters:
##     alpha = 0.4323 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 14644.2311 
##     b = 151.4496 
## 
##   sigma:  391.4726
## 
##      AIC     AICc      BIC 
## 697.6633 699.2422 706.5842
test_forecast(demand, forecast.obj = fc_demand_holt_4_a, test = testd4)
accuracy(fc_demand_holt_4_a, testd4)
##                         ME      RMSE      MAE         MPE      MAPE     MASE
## Training set   -0.05613861  373.2545  151.563 -0.01178639 0.7849943 0.215217
## Test set     1306.14875103 1414.0990 1306.149  5.61325289 5.6132529 1.854709
##                   ACF1 Theil's U
## Training set 0.1035931        NA
## Test set     0.1770183  2.379496
# Our own settings for beta, and a dampened trend
fc_demand_holt_4_b <- holt(traind4,  h = 4, initial = "optimal", beta=.70, damped=TRUE) 
fc_demand_holt_4_b$model
## Damped Holt's method 
## 
## Call:
##  holt(y = traind4, h = 4, damped = TRUE, initial = "optimal", beta = 0.7) 
## 
##   Smoothing parameters:
##     alpha = 0.7 
##     beta  = 0.7 
##     phi   = 0.8 
## 
##   Initial states:
##     l = 14642.7038 
##     b = 172.036 
## 
##   sigma:  512.9011
## 
##     AIC    AICc     BIC 
## 720.324 721.903 729.245
accuracy(fc_demand_holt_4_b, testd4)
##                    ME     RMSE      MAE       MPE     MAPE     MASE        ACF1
## Training set  59.3026 482.8805 194.9087 0.2980881 1.000961 0.276767 -0.22306674
## Test set     419.7282 493.1221 419.7282 1.7940209 1.794021 0.596007  0.04825039
##              Theil's U
## Training set        NA
## Test set     0.8351803
test_forecast(demand, forecast.obj = fc_demand_holt_4_b, test = testd4)

The results indicate that model 4.b performs quite well. The MAPE is much lower than in model 4.a. The MAPE in the test set is still higher than in the training set, but, firstly, this is common, since the data in the training set are used in estimating the model, and, secondly, especially for this time series, a unique event (COVID-19) has an impact on the latest observations, including those in the test set.

Probably, the MAPE would be even lower for the test set if we had left out the option to dampen the trend. However, our ultimate aim is to forecast future demand rather than the test set! The dampened trend, in this example, produces a more conservative forecast of future demand, implicitly assuming (or reasoning) that the exponential increase of catch-up demand after COVID-19 will slow down.

Obviously, whether this assumption holds true, we do not know. The quest for a stable forecasting model is reflected in the feedback loop of the forecasting process!

We see that it’s not recommended to use the default settings! The best ways to assess the validity of the forecasting model are to:

  1. Use training and test sets of an appropriate length.

Whatever is appropriate depends on:

  • The length of the time series,
  • Our forecasting horizon, and
  • The volatility of the series.

  2. Use a grid search for the parameter values that minimize the accuracy statistics in the test set.

Often, parameter values in a range (e.g., .30 < α < .70) tend to perform robustly and satisfactorily, and, hence, there’s no need to hunt for the very best parameter with four-decimal precision (e.g., α = 0.6487). The optimal parameter is bound to change slightly as time goes by, and approximations (α = 0.65) that are kept constant for some time often produce robust forecasts.

  3. Visualize patterns in the forecasts (see the sketch below). In our last model, there is a pattern in the forecasts, in the sense that the differences between the actual observations and the forecasts in the test set are widening. Preferably, the differences should be random, rather than systematic. The pattern in our example is partly caused by our choice of a dampened trend. If we expect the accelerating trend to be permanent, we can opt for a different model, like an exponential trend.
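As a sketch of such a visual check, we can plot the forecast errors in the test set, and run the residual diagnostics that the forecast package provides:

# Forecast errors in the test set: ideally random around zero, not widening
err <- as.numeric(testd4) - as.numeric(fc_demand_holt_4_b$mean)
plot(err, type = "b", xlab = "Quarter in test set", ylab = "Actual - Forecast")
abline(h = 0, lty = 2)
# Ljung-Box test and residual plots for the fitted model
checkresiduals(fc_demand_holt_4_b)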

5. The Holt-Winters Model

The Holt-Winters model is an extended version of the Holt model (which is in its turn an extension of the SES model).

The Holt-Winters model is able to handle time series with both a trend and a seasonal component. For forecasting the seasonal component, the Holt-Winters model introduces a third parameter (γ, pronounced as gamma).

With the introduction of a third parameter, the mathematical formulas become even more complex, but it’s important to realize that the principle behind these parameters remains the same.

For an intuitive understanding, the easiest is to keep in mind the interpretation of the α parameter of the SES model.

  • The parameter adjusts the level of the time series at T+1, if the actual value at T differs from the forecast for T. If the actual value is higher than the forecast, then it’s interpreted as a signal that something is happening. A high α uses most or all (α=1) of the difference to adjust the level of the series. A low α virtually ignores the signal.
  • The same principle is used in Holt and in Holt-Winters models. For Holt-Winters, a low γ assumes a steady seasonal pattern, while a high γ adjusts the seasonal component if, for example, the most recent observation for January was well above what we observed in January in earlier years.
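For reference, a standard formulation of the additive Holt-Winters model, with seasonal period m (m = 4 for quarterly data, m = 12 for monthly data), extends the Holt equations with a seasonal component s:

Level: l_T = α∗(Y_T - s_{T-m}) + (1-α)∗(l_{T-1} + b_{T-1})
Slope: b_T = β∗(l_T - l_{T-1}) + (1-β)∗b_{T-1}
Seasonal: s_T = γ∗(Y_T - l_{T-1} - b_{T-1}) + (1-γ)∗s_{T-m}
Forecast: Ŷ_{T+h} = l_T + h∗b_T + s_{T+h-m} (for h ≤ m; for longer horizons, the most recent estimate of the corresponding season is used)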

There are several packages in R that have functions for forecasting with the popular Holt-Winters model. In this section, we will mainly use the HoltWinters() function from the stats package. Alternatively, we can make use of the hw() function from the forecast package, which has options to use exponential trends, and dampened trends.

We will go through the usual steps:

  1. Read and explore the data
  2. Estimate a Holt-Winters model, using a training set
  3. Assess the model, using a test set
  4. Look for a better set of parameter values (for our three parameters, α, β, and γ), using a grid search
  5. Use these parameters for generating forecasts, using linear and exponential trends, and dampened trends.

We will use quarterly demand for a product, in the years 2003-2022. The data can be found in demand_hw.csv.

demand_hw <- read.csv("demand_hw.csv")
head(demand_hw)
##   demand_hw
## 1    6891.8
## 2    4929.3
## 3    4653.7
## 4    6063.8
## 5    7233.1
## 6    4774.0

Since the file only contains the demand values and not the time index, we have to define the data as a time series, with the starting quarter (2003, Q1), as follows.

demand_hw_ts <- ts(demand_hw, start=c(2003,1),frequency=4)

We can inspect and plot the time series, and see if there is a seasonal (quarterly) component.

ts_info(demand_hw_ts)
##  The demand_hw_ts series is a ts object with 1 variable and 79 observations
##  Frequency: 4 
##  Start time: 2003 1 
##  End time: 2022 3
ts_plot(demand_hw_ts)
plot(decompose(demand_hw_ts))

ts_seasonal(demand_hw_ts)

The plots suggest an upward trend, which is accelerating as of 2010.

The seasonal plot confirms that there’s a strong and systematic seasonal component. In each of the years 2003-2021, demand in Q1 is around two thousand units higher than in Q2 and Q3. Demand in Q4 is in between the peaks in Q1 and the troughs in Q2/Q3.

Note also that the demand in all quarters of any year, is higher than demand in the same quarters of the previous year - with some exceptions where lines cross. These patterns indicate an upward trend, and, since the gaps between the succeeding years are increasing, the trend is not strictly linear.

This information justifies the use of the Holt-Winters model. Since we want to train the model, and assess its performance with data that we have at our disposal, we partition the data set in a training and test set. In case of seasonal data, we want to make sure that the test set includes all seasons. Since we have quarterly data, the minimum length of the test set should therefore be 4. If our forecasting horizon is one year, we can pick 4 as the length of the test set, but longer periods (8, or even 12 quarters) are also good choices.

Using both short and longer test sets, may shed light on the forecasting accuracy of various models in the short, medium and longer run!

For the sake of ease, we will stick to a test set length and forecasting horizon of h=4.

d_hw_par <- ts_split(demand_hw_ts, 4)
train_hw <- d_hw_par$train
test_hw  <- d_hw_par$test

Next, we use the training set to estimate the model, and use the model to fit and forecast the observations in the training and test set.

The output shows the estimates of the parameters α, β and γ. The sprintf() function is used to force R to print the output with 4 decimals.

d_hw <- HoltWinters(train_hw)
cat("alpha =",sprintf("%.4f",d_hw$alpha),"\n")
## alpha = 0.1863
cat("beta  =",sprintf("%.4f",d_hw$beta),"\n")
## beta  = 0.0917
cat("gamma =",sprintf("%.4f",d_hw$gamma),"\n")
## gamma = 0.2025

We can use the forecast() function to forecast values for the test set (4 quarters, 2021Q4 to 2022Q3), and beyond (for example, the next 8 quarters for which we don’t have data, 2022Q4 to 2024Q3).

The forecasts come with both point estimates and 80% and 95% prediction intervals. You will notice that the prediction intervals span quite a broad range, and the range increases with the number of quarters forecasted ahead.

In many cases, the ranges are so broad that they hardly give guidance about what the future may bring!

fc1 <- forecast(d_hw, 12)
plot_forecast(fc1, title="Fit & Eight Quarter Forecast", Xtitle = "Year", Ytitle = "Demand")

From the fitted values, it is obvious that the model fits the data quite well.

The estimates of the parameters α, β and γ are not necessarily the ones that perform best in forecasting. An option is to search for parameters for α, β and γ that give better fits and forecasts in the training and test set.

One way to do it is to manually try different combinations of values for the three parameters α, β and γ. But that is time consuming, and the outcome may be arbitrary, based on only those values that you have actually tried.

We can use a more systematic approach, called a grid search, in which we loop over a range of values for the three parameters, and see how they perform in terms of one of the available criteria, like MAPE.

Since an extensive grid search may take computer time, even on our modern fast devices, we will normally start with big steps.

The three parameters should have values in the range from 0 to 1. We can try different values in steps of 0.10, for each parameter, and find out which combinations produce good forecasts in the test set.

Once we have narrowed down the ranges of the three parameters, we can repeat the process using smaller steps (e.g., we search for the best α in the range from 0.20 to 0.40, in steps of 0.02).

Of course, we can keep on refining, but normally there’s a range of values for the three parameters that produces robust forecasts. Looking time and again for the very best set of parameter values at a specific point in time, does not necessarily lead to better forecasts, and comes at the cost of a continuous need to update the model (although that can be automated). A transparent, easy and robust approach, would lead us to keep the parameters constant - although the feedback loop in the forecasting process suggests regular model reviews and updates.

Let’s apply the grid search for our model.

# Determine the range of values for the three parameters, and the step size
# We start with steps of .10, in the range 0.05 to .95 for each of the three parameters
aa <- seq(.05,.95,.10)
bb <- seq(.05,.95,.10)
gg <- seq(.05,.95,.10)
# We initiate a matrix with all combinations in rows, and the information in columns
# The parameters take on 10 possible values (.05, .15, .25 ... .95), so we will have 10*10*10 = 1,000 rows
# In the 5 columns we have a sequence number (1..1000), the values of the three parameters, and the accuracy (we pick the MAPE in the test set)
xm <- matrix(NA, nrow = length(aa)*length(bb)*length(gg), ncol = 5)

# We initiate the variable count, for the sequence of combinations
count=0
# We loop over the range of values for alpha, beta and gamma, and store the info in our matrix xm
# The MAPE of the test set, is the 10th element in the accuracy matrix
for (p in aa){
  for (q in bb){
    for (r in gg){
      count = count+1
      m   <- HoltWinters(train_hw, alpha = p, beta = q, gamma = r)
      fcm <- forecast(m, h=4)
      am  <- accuracy(fcm,test_hw)
      xm[count,1] = count
      xm[count,2] = p
      xm[count,3] = q
      xm[count,4] = r
      xm[count,5] = am[10]
    }
  }
}

# OK, we have our data, in a matrix (named xm). We convert it to a data frame (named xmd)
# We label the 5 columns of the data frame xmd
# And sort the data by MAPE, from low (accurate) to high (not accurate)
xmd <- as.data.frame(xm)
names(xmd) <- c("Rec","Alpha","Beta","Gamma","MAPE")
xmd <- xmd[order(xmd$MAPE),]

# The 10 combinations with the lowest MAPE
head(xmd,10)
##     Rec Alpha Beta Gamma     MAPE
## 469 469  0.45 0.65  0.85 1.725393
## 220 220  0.25 0.15  0.95 1.742235
## 470 470  0.45 0.65  0.95 1.797934
## 320 320  0.35 0.15  0.95 1.895602
## 478 478  0.45 0.75  0.75 1.953290
## 460 460  0.45 0.55  0.95 2.022947
## 219 219  0.25 0.15  0.85 2.029206
## 495 495  0.45 0.95  0.45 2.035051
## 130 130  0.15 0.25  0.95 2.061845
## 120 120  0.15 0.15  0.95 2.063485
# Let's zoom in on our "optimal" solution (of course, we can refine)
library(forecast)
m   <- HoltWinters(train_hw, alpha = .45, beta = .65, gamma = .85)
fcm <- forecast(m, h=4)
accuracy(fcm,test_hw)
##                     ME     RMSE      MAE         MPE     MAPE      MASE
## Training set  3.379679 443.8979 345.7638 -0.03696704 5.722628 1.2129547
## Test set     16.861674 186.3120 135.1783 -0.09037914 1.725393 0.4742113
##                    ACF1 Theil's U
## Training set  0.2293851        NA
## Test set     -0.4866051 0.1230794
test_forecast(actual=demand_hw_ts,forecast.obj = fcm, test=test_hw)