library(fpp3)
-- Attaching packages ----------------------------------------- fpp3 0.3 --
v tibble 3.0.6 v feasts 0.1.6
v tsibbledata 0.2.0 v fable 0.2.1
-- Conflicts -------------------------------------------- fpp3_conflicts --
x lubridate::date() masks base::date()
x dplyr::filter() masks stats::filter()
x lubridate::interval() masks tsibble::interval()
x dplyr::lag() masks stats::lag()
https://robjhyndman.com/seminars/uwa2017/
https://robjhyndman.com/seminars/isi2019workshop/
https://github.com/robjhyndman/ETC3550Slides/tree/fable
https://github.com/rstudio-conf-2020/time-series-forecasting/blob/master/materials/labs.R
https://otexts.com/fpp3/judgmental.html
In many cases, judgmental forecasting is the only option, such as when there is a complete lack of historical data, or when a new product is being launched, or when a new competitor enters the market, or during completely new and unique market conditions.
Research in this area has shown that the accuracy of judgmental forecasting improves when the forecaster has (i) important domain knowledge, and (ii) more timely, up-to-date information. A judgmental approach can be quick to adjust to such changes, information or events.
There are three general settings in which judgmental forecasting is used: (i) there are no available data, so that statistical methods are not applicable and judgmental forecasting is the only feasible approach; (ii) data are available, statistical forecasts are generated, and these are then adjusted using judgment; and (iii) data are available and statistical and judgmental forecasts are generated independently and then combined.
\(y_t = \beta_0 + \beta_1 x_t + \varepsilon_t.\)
The coefficients \(β_0\) and \(β_1\) denote the intercept and the slope of the line respectively.
We can think of each observation \(y_t\) as consisting of the systematic or explained part of the model, \(β_0+β_1x_t\), and the random “error,” \(ε_t\). The “error” term does not imply a mistake, but a deviation from the underlying straight line model. It captures anything that may affect \(y_t\) other than \(x_t\).
us_change %>%
pivot_longer(c(Consumption, Income), names_to = "Series") %>%
autoplot(value) +
labs(y = "% change")
us_change %>%
ggplot(aes(x = Income, y = Consumption)) +
labs(y = "Consumption (quarterly % change)", x = "Income (quarterly % change)") +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
\(\hat{y}_t=0.54 + 0.27x_t.\)
The equation is estimated using the TSLM() function:
us_change %>%
model(TSLM(Consumption ~ Income)) %>%
report()
Series: Consumption
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-2.58236 -0.27777 0.01862 0.32330 1.42229
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.54454 0.05403 10.079 < 2e-16 ***
Income 0.27183 0.04673 5.817 2.4e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5905 on 196 degrees of freedom
Multiple R-squared: 0.1472, Adjusted R-squared: 0.1429
F-statistic: 33.84 on 1 and 196 DF, p-value: 2.4022e-08
The fitted line has a positive slope, reflecting the positive relationship between income and consumption. The slope coefficient shows that a one unit increase in \(x\) (a 1 percentage point increase in personal disposable income) results on average in 0.27 units increase in \(y\) (an average increase of 0.27 percentage points in personal consumption expenditure).
When there are two or more predictor variables, the model is called a multiple regression model. The general form of a multiple regression model is
\[\begin{equation} y_t = \beta_{0} + \beta_{1} x_{1,t} + \beta_{2} x_{2,t} + \cdots + \beta_{k} x_{k,t} + \varepsilon_t \end{equation}\]
The coefficients \(β_1,…,β_k\) measure the effect of each predictor after taking into account the effects of all the other predictors in the model. Thus, the coefficients measure the marginal effects of the predictor variables.
The scatterplots show positive relationships with income and industrial production, and negative relationships with savings and unemployment. The strength of these relationships is shown by the correlation coefficients across the first row. The remaining scatterplots and correlation coefficients show the relationships between the predictors.
us_change %>%
GGally::ggpairs(columns = 2:6)
First, we assume that the model is a reasonable approximation to reality; that is, the relationship between the forecast variable and the predictor variables satisfies this linear equation.
Second, we make the following assumptions about the errors:
- they have mean zero; otherwise the forecasts will be systematically biased.
- they are not autocorrelated; otherwise the forecasts will be inefficient, as there is more information in the data that can be exploited.
- they are unrelated to the predictor variables; otherwise there would be more information that should be included in the systematic part of the model.
It is also useful to have the errors being normally distributed with a constant variance \(σ^2\) in order to easily produce prediction intervals.
Another important assumption in the linear regression model is that each predictor \(x\) is not a random variable.
The least squares principle provides a way of choosing the coefficients effectively by minimising the sum of the squared errors.
\[\begin{equation} \sum_{t=1}^T \varepsilon_t^2 = \sum_{t=1}^T (y_t - \beta_{0} - \beta_{1} x_{1,t} - \beta_{2} x_{2,t} - \cdots - \beta_{k} x_{k,t})^2. \end{equation}\]This is called least squares estimation because it gives the least value for the sum of squared errors.
The TSLM() function fits a linear regression model to time series data. It is similar to the lm() function which is widely used for linear models, but TSLM() provides additional facilities for handling time series.
The following output provides information about the fitted model. The first column of Coefficients gives an estimate of each \(β\) coefficient and the second column gives its standard error (i.e., the standard deviation which would be obtained from repeatedly estimating the \(β\) coefficients on similar data sets). The standard error gives a measure of the uncertainty in the estimated \(β\) coefficient.
fit.consMR <- us_change %>%
model(tslm = TSLM(Consumption ~ Income + Production + Unemployment + Savings))
report(fit.consMR)
Series: Consumption
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-0.90555 -0.15821 -0.03608 0.13618 1.15471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.253105 0.034470 7.343 5.71e-12 ***
Income 0.740583 0.040115 18.461 < 2e-16 ***
Production 0.047173 0.023142 2.038 0.0429 *
Unemployment -0.174685 0.095511 -1.829 0.0689 .
Savings -0.052890 0.002924 -18.088 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3102 on 193 degrees of freedom
Multiple R-squared: 0.7683, Adjusted R-squared: 0.7635
F-statistic: 160 on 4 and 193 DF, p-value: < 2.22e-16
The “t value” is the ratio of an estimated \(β\) coefficient to its standard error and the last column gives the p-value: the probability of the estimated \(β\) coefficient being as large as it is if there was no real relationship between consumption and the corresponding predictor. This is useful when studying the effect of each predictor, but is not particularly useful for forecasting.
The following plots show the actual values compared to the fitted values for the percentage change in the US consumption expenditure series.
augment(fit.consMR) %>%
ggplot(aes(x = Quarter)) +
geom_line(aes(y = Consumption, colour = "Data")) +
geom_line(aes(y = .fitted, colour = "Fitted")) +
labs(y = NULL, title = "Percent change in US consumption expenditure") +
scale_color_manual(values=c(Data="black",Fitted="red")) +
guides(colour = guide_legend(title = NULL))
augment(fit.consMR) %>%
ggplot(aes(x = Consumption, y = .fitted)) +
geom_point() +
labs(y = "Fitted (predicted values)",
x = "Data (actual values)",
title = "Percent change in US consumption expenditure") +
geom_abline(intercept = 0, slope = 1)
A common way to summarise how well a linear regression model fits the data is via the coefficient of determination, or \(R^2\).
If the predictions are close to the actual values, we would expect \(R^2\) to be close to 1. On the other hand, if the predictions are unrelated to the actual values, then \(R^2 = 0\) (again, assuming there is an intercept). In all cases, \(R^2\) lies between 0 and 1.
The value of \(R^2\) will never decrease when adding an extra predictor to the model and this can lead to over-fitting. There are no set rules for what is a good \(R^2\) value, and typical values of \(R^2\) depend on the type of data used. Validating a model’s forecasting performance on the test data is much better than measuring the \(R^2\) value on the training data.
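The \(R^2\) and adjusted \(R^2\) of a fitted model can be extracted with glance(). A minimal sketch, reusing the simple regression of consumption on income from above:
# Sketch: extracting R^2 and adjusted R^2 from a fitted TSLM model
us_change %>%
  model(TSLM(Consumption ~ Income)) %>%
  glance() %>%
  select(r_squared, adj_r_squared)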
Another measure of how well the model has fitted the data is the standard deviation of the residuals, which is often known as the “residual standard error.”
The differences between the observed \(y\) values and the corresponding fitted \(\hat{y}\) values are the training-set errors or “residuals.”
Properties of the residuals:
- The average of the residuals is zero.
- The correlation between the residuals and the observations for each predictor variable is also zero.
It is always a good idea to check whether the residuals are normally distributed. As we explained earlier, this is not essential for forecasting, but it does make the calculation of prediction intervals much easier.
fit.consMR %>% gg_tsresiduals()
augment(fit.consMR) %>% features(.innov, ljung_box, lag = 10, dof = 5)
The time plot shows some changing variation over time, but is otherwise relatively unremarkable. This heteroscedasticity will potentially make the prediction interval coverage inaccurate.
The histogram shows that the residuals seem to be slightly skewed, which may also affect the coverage probability of the prediction intervals.
The autocorrelation plot shows a significant spike at lag 7, and a significant Ljung-Box test at the 5% level. However, the autocorrelation is not particularly large, and at lag 7 it is unlikely to have any noticeable impact on the forecasts or the prediction intervals.
We would expect the residuals to be randomly scattered without showing any systematic patterns. A simple and quick way to check this is to examine scatterplots of the residuals against each of the predictor variables.
It is also necessary to plot the residuals against any predictors that are not in the model. If any of these show a pattern, then the corresponding predictor may need to be added to the model.
#residuals from the multiple regression model for forecasting US consumption
df <- left_join(us_change, residuals(fit.consMR), by = "Quarter")
p1 <- ggplot(df, aes(x = Income, y = .resid)) +
geom_point() + labs(y = "Residuals")
p2 <- ggplot(df, aes(x = Production, y = .resid)) +
geom_point() + labs(y = "Residuals")
p3 <- ggplot(df, aes(x = Savings, y = .resid)) +
geom_point() + labs(y = "Residuals")
p4 <- ggplot(df, aes(x = Unemployment, y = .resid)) +
geom_point() + labs(y = "Residuals")
library(cowplot)
Attaching package: ‘cowplot’
The following object is masked from ‘package:lubridate’:
    stamp
plot_grid(p1, p2, p3, p4, ncol = 2)
A plot of the residuals against the fitted values should also show no pattern. If a pattern is observed, there may be “heteroscedasticity” in the errors which means that the variance of the residuals may not be constant. If this problem occurs, a transformation of the forecast variable such as a logarithm or square root may be required.
augment(fit.consMR) %>%
ggplot(aes(x = .fitted, y = .resid)) +
geom_point() + labs(x = "Fitted", y = "Residuals")
Observations that take extreme values compared to the majority of the data are called outliers. Observations that have a large influence on the estimated coefficients of a regression model are called influential observations. Usually, influential observations are also outliers that are extreme in the \(x\) direction.
A scatter plot of \(y\) against each \(x\) is always a useful starting point in regression analysis, and often helps to identify unusual observations.
Incorrect data entry should be corrected or removed from the sample immediately.
If an observation has been identified as a likely outlier, it is important to study it and analyse the possible reasons behind it. The decision to remove or retain an observation can be a challenging one (especially when outliers are influential observations). It is wise to report results both with and without the removal of such observations.
More often than not, time series data are “non-stationary”; that is, the values of the time series do not fluctuate around a constant mean or with a constant variance.
Regressing non-stationary time series can lead to spurious regressions. The output of regressing Australian air passengers on rice production in Guinea is shown in Figure 7.13. High \(R^2\) and high residual autocorrelation can be signs of spurious regression. Notice these features in the output below.
fit <- aus_airpassengers %>%
filter(Year <= 2011) %>%
left_join(guinea_rice, by = "Year") %>%
model(TSLM(Passengers ~ Production))
report(fit)
Series: Passengers
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-5.9448 -1.8917 -0.3272 1.8620 10.4210
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.493 1.203 -6.229 2.25e-07 ***
Production 40.288 1.337 30.135 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.239 on 40 degrees of freedom
Multiple R-squared: 0.9578, Adjusted R-squared: 0.9568
F-statistic: 908.1 on 1 and 40 DF, p-value: < 2.22e-16
fit %>% gg_tsresiduals()
A trend variable can be specified in the TSLM() function using the trend() special.
A predictor may be a categorical variable taking only two values, like “yes” or “no.” This situation can still be handled within the framework of multiple regression models by creating a “dummy variable” which takes value 1 corresponding to “yes” and 0 corresponding to “no.” A dummy variable is also known as an “indicator variable.”
A dummy variable can also be used to account for an outlier in the data. Rather than omit the outlier, a dummy variable removes its effect. In this case, the dummy variable takes value 1 for that observation and 0 everywhere else.
TSLM() will automatically handle this case if you specify a factor variable as a predictor. There is usually no need to manually create the corresponding dummy variables.
The general rule is to use one fewer dummy variables than categories. So for quarterly data, use three dummy variables; for monthly data, use 11 dummy variables; and for daily data, use six dummy variables, and so on.
The interpretation of each of the coefficients associated with the dummy variables is that it is a measure of the effect of that category relative to the omitted category.
The TSLM() function will automatically handle this situation if you specify the special season().
recent_production <- aus_production %>%
filter(year(Quarter) >= 1992)
recent_production %>%
autoplot(Beer) +
labs(y = "Megalitres")
fit_beer <- recent_production %>%
model(TSLM(Beer ~ trend() + season()))
report(fit_beer)
Series: Beer
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-42.9029 -7.5995 -0.4594 7.9908 21.7895
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 441.80044 3.73353 118.333 < 2e-16 ***
trend() -0.34027 0.06657 -5.111 2.73e-06 ***
season()year2 -34.65973 3.96832 -8.734 9.10e-13 ***
season()year3 -17.82164 4.02249 -4.430 3.45e-05 ***
season()year4 72.79641 4.02305 18.095 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.23 on 69 degrees of freedom
Multiple R-squared: 0.9243, Adjusted R-squared: 0.9199
F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16
There is an average downward trend of 0.34 megalitres per quarter. On average, the second quarter has production 34.7 megalitres lower than the first quarter, the third quarter has production 17.8 megalitres lower than the first quarter, and the fourth quarter has production 72.8 megalitres higher than the first quarter.
augment(fit_beer) %>%
ggplot(aes(x = Quarter)) +
geom_line(aes(y = Beer, colour = "Data")) +
geom_line(aes(y = .fitted, colour = "Fitted")) +
scale_color_manual(values = c(Data = "black", Fitted = "red")) +
labs(y = "Megalitres", title = "Quarterly Beer Production") +
guides(colour = guide_legend(title = "Series"))
augment(fit_beer) %>%
ggplot(aes(x = Beer, y = .fitted, colour = factor(quarter(Quarter)))) +
geom_point() +
labs(y = "Fitted", x = "Actual values",
title = "Quarterly beer production") +
geom_abline(intercept = 0, slope = 1) +
guides(colour = guide_legend(title = "Quarter"))
It is often necessary to model interventions that may have affected the variable to be forecast. For example, competitor activity, advertising expenditure, industrial action, and so on, can all have an effect.
When the effect lasts only for one period, we use a “spike” variable. This is a dummy variable that takes value one in the period of the intervention and zero elsewhere. A spike variable is equivalent to a dummy variable for handling an outlier.
Other interventions have an immediate and permanent effect. A step variable takes value zero before the intervention and one from the time of intervention onward. Another form of permanent effect is a change of slope.
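As an illustrative sketch (not from the text), both types of dummy can be constructed directly in the tsibble and passed to TSLM(); the intervention date 2008 Q3 here is an arbitrary choice:
# Sketch: hand-made intervention dummies for a hypothetical event in 2008 Q3
us_change %>%
  mutate(
    spike = as.integer(Quarter == yearquarter("2008 Q3")),  # one-period effect
    step  = as.integer(Quarter >= yearquarter("2008 Q3"))   # permanent level shift
  ) %>%
  model(TSLM(Consumption ~ Income + spike + step)) %>%
  report()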
The number of trading days in a month can vary considerably and can have a substantial effect on sales data. To allow for this, the number of trading days in each month can be included as a predictor.
It is often useful to include advertising expenditure as a predictor. However, since the effect of advertising can last beyond the actual campaign, we need to include lagged values of advertising expenditure.
Easter differs from most holidays because it is not held on the same date each year, and its effect can last for several days. In this case, a dummy variable can be used with value one where the holiday falls in the particular time period and zero otherwise.
An alternative to using seasonal dummy variables, especially for long seasonal periods, is to use Fourier terms.
If \(m\) is the seasonal period, then the first few Fourier terms are given by
\[\begin{align*} x_{1,t} &= \sin\left(\textstyle\frac{2\pi t}{m}\right), & x_{2,t} &= \cos\left(\textstyle\frac{2\pi t}{m}\right), & x_{3,t} &= \sin\left(\textstyle\frac{4\pi t}{m}\right), \\ x_{4,t} &= \cos\left(\textstyle\frac{4\pi t}{m}\right), & x_{5,t} &= \sin\left(\textstyle\frac{6\pi t}{m}\right), & x_{6,t} &= \cos\left(\textstyle\frac{6\pi t}{m}\right), \end{align*}\]
and so on. With Fourier terms, we often need fewer predictors than with dummy variables, especially when \(m\) is large. This makes them useful for weekly data, for example, where \(m ≈ 52\). For short seasonal periods (e.g., quarterly data), there is little advantage in using Fourier terms over seasonal dummy variables.
fourier_beer <- recent_production %>%
model(TSLM(Beer ~ trend() + fourier(K = 2)))
report(fourier_beer)
Series: Beer
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-42.9029 -7.5995 -0.4594 7.9908 21.7895
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 446.87920 2.87321 155.533 < 2e-16 ***
trend() -0.34027 0.06657 -5.111 2.73e-06 ***
fourier(K = 2)C1_4 8.91082 2.01125 4.430 3.45e-05 ***
fourier(K = 2)S1_4 -53.72807 2.01125 -26.714 < 2e-16 ***
fourier(K = 2)C2_4 -13.98958 1.42256 -9.834 9.26e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.23 on 69 degrees of freedom
Multiple R-squared: 0.9243, Adjusted R-squared: 0.9199
F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16
The K argument to fourier() specifies how many pairs of sin and cos terms to include. The maximum allowed is \(K = m/2\) where \(m\) is the seasonal period. Because we have used the maximum here, the results are identical to those obtained when using seasonal dummy variables.
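A quick check of this equivalence (a sketch reusing fit_beer and fourier_beer from above) is to compare the fitted values of the two models:
# Sketch: the Fourier and seasonal-dummy models should produce identical fits
all.equal(fitted(fit_beer)$.fitted, fitted(fourier_beer)$.fitted)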
A regression model containing Fourier terms is often called a harmonic regression because the successive Fourier terms represent harmonics of the first two Fourier terms.
We will use measures of predictive accuracy. They can be computed using the glance() function, here applied to the model for US consumption:
glance(fit.consMR) %>% select(adj_r_squared, CV, AIC, AICc, BIC)
We compare these values against the corresponding values from other models. For the CV, AIC, AICc and BIC measures, we want to find the model with the lowest value; for Adjusted \(R^2\), we seek the model with the highest value.
The adjusted \(R^2\) (denoted \(\bar{R}^2\)) is an improvement on \(R^2\), as it will no longer increase with each added predictor. Using this measure, the best model will be the one with the largest value of \(\bar{R}^2\). Maximising \(\bar{R}^2\) is equivalent to minimising the standard error \(\hat{\sigma}_e\).
Maximising \(\bar{R}^2\) works quite well as a method of selecting predictors, although it does tend to err on the side of selecting too many predictors.
For regression models, it is also possible to use classical leave-one-out cross-validation to select predictors:
Under this criterion, the best model is the one with the smallest value of CV.
The model with the minimum value of the AIC is often the best model for forecasting. For large values of \(T\), minimising the AIC is equivalent to minimising the CV value.
For small values of \(T\), the AIC tends to select too many predictors, and so a bias-corrected version of the AIC (the AICc) has been developed. Like the AIC, the AICc should be minimised.
Another criterion is Schwarz’s Bayesian Information Criterion (usually abbreviated to BIC, SBIC or SC). Minimising the BIC is intended to give the best model.
We recommend that one of the AICc, AIC, or CV statistics be used, each of which has forecasting as their objective. If the value of T is large enough, they will all lead to the same model.
Where possible, all potential regression models should be fitted (as was done in the example above) and the best model should be selected based on one of the measures discussed. This is known as “best subsets” regression or “all possible subsets” regression.
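With only a handful of predictors, this can be done by fitting the candidate models side by side in a single model() call and comparing them with glance(). A sketch showing a few of the \(2^4=16\) possible subsets for the US consumption data:
# Sketch: comparing several candidate regressions with glance()
us_change %>%
  model(
    all_four    = TSLM(Consumption ~ Income + Production + Unemployment + Savings),
    no_prod     = TSLM(Consumption ~ Income + Unemployment + Savings),
    income_only = TSLM(Consumption ~ Income)
  ) %>%
  glance() %>%
  select(.model, adj_r_squared, CV, AIC, AICc, BIC)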
An approach that works quite well is backwards stepwise regression:
If the number of potential predictors is too large, then the backwards stepwise regression will not work and forward stepwise regression can be used instead. This procedure starts with a model that includes only the intercept. Predictors are added one at a time, and the one that most improves the measure of predictive accuracy is retained in the model. The procedure is repeated until no further improvement can be achieved.
Ex-ante forecasts are those that are made using only the information that is available in advance. They might require forecasts of the predictors.
Ex-post forecasts are those that are made using later information on the predictors.
The model from which ex-post forecasts are produced should not be estimated using data from the forecast period. That is, ex-post forecasts can assume knowledge of the predictor variables (the x variables), but should not assume knowledge of the data that are to be forecast (the y variable).
A comparative evaluation of ex-ante forecasts and ex-post forecasts can help to separate out the sources of forecast uncertainty.
For predictors that are known in advance, such as calendar-related variables or deterministic functions of time, there is no difference between ex-ante and ex-post forecasts.
recent_production <- aus_production %>%
filter(year(Quarter) >= 1992)
fit_beer <- recent_production %>%
model(TSLM(Beer ~ trend() + season()))
fc_beer <- forecast(fit_beer)
fc_beer %>%
autoplot(recent_production) +
labs(title = "Forecasts of beer production using regression",
y = "megalitres")
In this setting, the forecaster assumes possible scenarios for the predictor variables that are of interest. For example, a US policy maker may be interested in comparing the predicted change in consumption when there is a constant growth of 1% and 0.5% respectively for income and savings with no change in the employment rate, versus a respective decline of 1% and 0.5%, for each of the four quarters following the end of the sample.
fit_consBest <- us_change %>%
model(lm = TSLM(Consumption ~ Income + Savings + Unemployment))
future_scenarios <- scenarios(
Increase = new_data(us_change, 4) %>%
mutate(Income=1, Savings=0.5, Unemployment=0),
Decrease = new_data(us_change, 4) %>%
mutate(Income=-1, Savings=-0.5, Unemployment=0),
names_to = "Scenario")
fc <- forecast(fit_consBest, new_data = future_scenarios)
us_change %>%
autoplot(Consumption) +
autolayer(fc) +
labs(y = "% change in US consumption")
The great advantage of regression models is that they can be used to capture important relationships between the forecast variable of interest and the predictor variables. A major challenge however, is that in order to generate ex-ante forecasts, the model requires future values of each predictor. If scenario based forecasting is of interest then these models are extremely useful. However, if ex-ante forecasting is the main focus, obtaining forecasts of the predictors can be challenging (in many cases generating forecasts for the predictor variables can be more challenging than forecasting directly the forecast variable without using predictors).
An alternative formulation is to use lagged values of the predictors. Including lagged values of the predictors not only makes the model operational for easily generating forecasts, it also makes it intuitively appealing.
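A minimal sketch of this idea (not an example from the text): create a lagged copy of the predictor, so that a one-quarter-ahead forecast needs only values that have already been observed.
# Sketch: using last quarter's income change as the predictor
us_change %>%
  mutate(IncomeLag1 = lag(Income)) %>%   # previous quarter's value
  filter(!is.na(IncomeLag1)) %>%         # drop the first quarter, which has no lag
  model(TSLM(Consumption ~ IncomeLag1)) %>%
  report()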
The estimated simple regression line in the US consumption example is
\(\hat{y}_t=0.54+0.27x_t\)
# Scenario 1: personal income increases by its historical mean
# Scenario 2: an extreme increase of 12% in income
fit_cons <- us_change %>%
model(TSLM(Consumption ~ Income))
new_cons <- scenarios(
"Average increase" = new_data(us_change, 4) %>% mutate(Income = mean(us_change$Income)),
"Extreme increase" = new_data(us_change, 4) %>% mutate(Income = 12),
names_to = "Scenario"
)
fcast <- forecast(fit_cons, new_cons)
us_change %>%
autoplot(Consumption) +
autolayer(fcast) +
labs(y = "% change in US consumption")
The simplest way of modelling a nonlinear relationship is to transform the forecast variable y and/or the predictor variable x before estimating a regression model. While this provides a non-linear functional form, the model is still linear in the parameters.
\(\log y=\beta_0+\beta_1 \log x +\varepsilon\)
In this model, the slope \(β_1\) can be interpreted as an elasticity: \(β_1\) is the average percentage change in \(y\) resulting from a 1% increase in \(x\). Other useful forms can also be specified. The log-linear form is specified by only transforming the forecast variable, and the linear-log form is obtained by transforming the predictor.
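As a hedged sketch (not an example from the text), a log-log regression can be fitted with TSLM() by transforming both sides, provided the series are strictly positive; here the slope is the elasticity of Australian GDP with respect to population:
# Sketch: log-log form, where the slope is interpreted as an elasticity
global_economy %>%
  filter(Code == "AUS") %>%
  model(TSLM(log(GDP) ~ log(Population))) %>%
  report()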
A better approach to modelling a nonlinear trend is often to use a piecewise specification and fit a piecewise linear trend which bends at some point in time.
boston_men <- boston_marathon %>%
filter(Year >= 1924) %>%
filter(Event == "Men's open division") %>%
mutate(Minutes = as.numeric(Time)/60)
The plot of winning times reveals three different periods. There is a lot of volatility in the winning times up to about 1950, with the winning times barely declining. After 1950 there is a clear decrease in times, followed by a flattening out after the 1980s, with the suggestion of an upturn towards the end of the sample. To account for these changes, we specify the years 1950 and 1980 as knots. We should warn here that subjective identification of knots can lead to over-fitting, which can be detrimental to the forecast performance of a model, and should be performed with caution.
fit_trends <- boston_men %>%
model(
linear = TSLM(Minutes ~ trend()),
exponential = TSLM(log(Minutes) ~ trend()),
piecewise = TSLM(Minutes ~ trend(knots = c(1950, 1980)))
)
fc_trends <- fit_trends %>% forecast(h = 10)
boston_men %>%
autoplot(Minutes) +
geom_line(aes(y = .fitted, colour = .model), data = fitted(fit_trends)) +
autolayer(fc_trends, alpha = 0.5, level = 95) +
labs(y = "Winning times in minutes",
title = "Boston Marathon")
A variable x may be useful for forecasting a variable y, but that does not mean x is causing y. It is possible that x is causing y, but it may be that y is causing x, or that the relationship between them is more complicated than simple causality.
A confounder is a variable not included in the model that influences both the response variable and at least one predictor variable. Confounding makes it difficult to determine what variables are causing changes in other variables, but it does not necessarily make forecasting more difficult.
Often a better model is possible if a causal mechanism can be determined.
A closely related issue is multicollinearity, which occurs when similar information is provided by two or more of the predictor variables in a multiple regression.
When multicollinearity is present, the uncertainty associated with individual regression coefficients will be large. This is because they are difficult to estimate. Consequently, statistical tests (e.g., t-tests) on regression coefficients are unreliable.
Forecasts will be unreliable if the values of the future predictors are outside the range of the historical values of the predictors.
Forecasts produced using exponential smoothing methods are weighted averages of past observations, with the weights decaying exponentially as the observations get older.
This method is suitable for forecasting data with no clear trend or seasonal pattern.
algeria_economy <- global_economy %>%
filter(Country == "Algeria")
algeria_economy %>%
autoplot(Exports) +
labs(y="Exports (% of GDP)")
Forecasts are calculated using weighted averages, where the weights decrease exponentially as observations come from further in the past — the smallest weights are associated with the oldest observations:
\[\begin{equation} \hat{y}_{T+1|T} = \alpha y_T + \alpha(1-\alpha) y_{T-1} + \alpha(1-\alpha)^2 y_{T-2}+ \cdots, \tag{8.1} \end{equation}\]
where \(0 \le \alpha \le 1\) is the smoothing parameter.
The forecast at time \(T+1\) is equal to a weighted average between the most recent observation \(y_T\) and the previous forecast \(\hat{y}_{T|T-1}\):
\(\hat{y}_{T+1|T} = \alpha y_T + (1-\alpha) \hat{y}_{T|T-1},\)
An alternative representation is the component form. For simple exponential smoothing, the only component included is the level, \(ℓ_t\). Component form representations of exponential smoothing methods comprise a forecast equation and a smoothing equation for each of the components included in the method. The component form of simple exponential smoothing is given by:
\[\begin{align*} \text{Forecast equation} && \hat{y}_{t+h|t} & = \ell_{t}\\ \text{Smoothing equation} && \ell_{t} & = \alpha y_{t} + (1 - \alpha)\ell_{t-1}, \end{align*}\]Simple exponential smoothing has a “flat” forecast function:
\(\hat{y}_{T+h|T} = \hat{y}_{T+1|T}=\ell_T, \qquad h=2,3,\dots.\)
The application of every exponential smoothing method requires the smoothing parameters and the initial values to be chosen. In particular, for simple exponential smoothing, we need to select the values of \(α\) and \(ℓ_0\). All forecasts can be computed from the data once we know those values. For the methods that follow there is usually more than one smoothing parameter and more than one initial component to be chosen.
A more reliable and objective way to obtain values for the unknown parameters is to estimate them from the observed data. The unknown parameters and the initial values for any exponential smoothing method can be estimated by minimising the SSE.
# Estimate parameters
fit <- algeria_economy %>%
model(ETS(Exports ~ error("A") + trend("N") + season("N")))
fc <- fit %>%
forecast(h = 5)
report(fit)
Series: Exports
Model: ETS(A,N,N)
Smoothing parameters:
alpha = 0.8399875
Initial states:
l
39.539
sigma^2: 35.6301
AIC AICc BIC
446.7154 447.1599 452.8968
This gives parameter estimates \(\hat\alpha = 0.84\) and \(\hat\ell_0 = 39.5\).
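As a check on the component form given earlier, the smoothing recursion can be applied by hand using rounded versions of these estimates (a sketch); the final level equals the flat forecast for every horizon:
# Sketch: applying l_t = alpha*y_t + (1 - alpha)*l_{t-1} by hand
alpha <- 0.84
level <- 39.54
for (y in algeria_economy$Exports) {
  level <- alpha * y + (1 - alpha) * level
}
level  # approximately equal to the point forecasts in fc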
autoplot(fc) +
geom_line(aes(y = Exports, col="Data"), data = algeria_economy) +
geom_line(aes(y = .fitted, col="Fitted"), data = augment(fit)) +
labs(y="Exports (% of GDP)") +
scale_color_manual(values=c(Data="black",Fitted="red")) +
guides(colour = guide_legend("Series"))
The prediction intervals shown here are calculated using the methods described in Section 8.7. The prediction intervals show that there is considerable uncertainty in the future exports over the five-year forecast period. So interpreting the point forecasts without accounting for the large uncertainty can be very misleading.
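The numerical intervals behind these shaded regions can be extracted with hilo(); a short sketch using the fc object created above:
# Sketch: extracting 80% and 95% prediction intervals from the forecast object
fc %>% hilo(level = c(80, 95))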
This method involves a forecast equation and two smoothing equations (one for the level and one for the trend):
\[\begin{align*} \text{Forecast equation}&& \hat{y}_{t+h|t} &= \ell_{t} + hb_{t} \\ \text{Level equation} && \ell_{t} &= \alpha y_{t} + (1 - \alpha)(\ell_{t-1} + b_{t-1})\\ \text{Trend equation} && b_{t} &= \beta^*(\ell_{t} - \ell_{t-1}) + (1 -\beta^*)b_{t-1}, \end{align*}\]
aus_economy <- global_economy %>%
filter(Code == "AUS") %>%
mutate(Pop = Population / 1e6)
autoplot(aus_economy, Pop) + labs(y = "Millions")
fit <- aus_economy %>%
model(AAN = ETS(Pop ~ error("A") + trend("A") + season("N")))
fc <- fit %>% forecast(h = 10)
report(fit)
Series: Pop
Model: ETS(A,A,N)
Smoothing parameters:
alpha = 0.9999
beta = 0.3266366
Initial states:
l b
10.05414 0.2224818
sigma^2: 0.0041
AIC AICc BIC
-76.98569 -75.83184 -66.68347
The estimated smoothing coefficient for the level is \(\hat\alpha = 1.00\). The very high value shows that the level changes rapidly in order to capture the highly trended series. The estimated smoothing coefficient for the slope is \(\hat\beta^* = 0.33\). This is relatively large, suggesting that the trend also changes often (even if the changes are slight).
The damped trend method “dampens” the trend to a flat line some time in the future. Methods that include a damped trend have proven to be very successful, and are arguably the most popular individual methods when forecasts are required automatically for many series.
In conjunction with the smoothing parameters \(α\) and \(β^*\), this method also includes a damping parameter \(0<ϕ<1\):
\[\begin{align*} \hat{y}_{t+h|t} &= \ell_{t} + (\phi+\phi^2 + \dots + \phi^{h})b_{t} \\ \ell_{t} &= \alpha y_{t} + (1 - \alpha)(\ell_{t-1} + \phi b_{t-1})\\ b_{t} &= \beta^*(\ell_{t} - \ell_{t-1}) + (1 -\beta^*)\phi b_{t-1}. \end{align*}\]In practice, \(ϕ\) is rarely less than 0.8 as the damping has a very strong effect for smaller values. Values of \(ϕ\) close to 1 will mean that a damped model is not able to be distinguished from a non-damped model. For these reasons, we usually restrict \(ϕ\) to a minimum of 0.8 and a maximum of 0.98.
aus_economy %>%
model(
`Holt's method` = ETS(Pop ~ error("A") + trend("A") + season("N")),
`Damped Holt's method` = ETS(Pop ~ error("A") + trend("Ad", phi = 0.9) + season("N"))
) %>%
forecast(h = 15) %>%
autoplot(aus_economy, level = NULL) +
labs(title = "Forecasts from Holt's method",
y = "Population of Australia (millions)") +
guides(colour = guide_legend(title = "Forecast"))
We have set the damping parameter to a relatively low number (\(ϕ=0.90\)) to exaggerate the effect of damping for comparison. Usually, we would estimate \(ϕ\) along with the other parameters. We have also used a rather large forecast horizon (\(h=15\)) to highlight the difference between a damped trend and a linear trend.
Here we compare the forecasting performance of the three exponential smoothing methods that we have considered so far in forecasting the number of users connected to the internet via a server. The data are observed over 100 minutes:
www_usage <- as_tsibble(WWWusage)
www_usage %>% autoplot(value) +
labs(x="Minute", y="Number of users")
We will use time series cross-validation to compare the one-step forecast accuracy of the three methods.
www_usage %>%
stretch_tsibble(.init = 10) %>%
model(
SES = ETS(value ~ error("A") + trend("N") + season("N")),
Holt = ETS(value ~ error("A") + trend("A") + season("N")),
Damped = ETS(value ~ error("A") + trend("Ad") + season("N"))
) %>%
forecast(h = 1) %>%
accuracy(www_usage)
The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
1 observation is missing at 101
Damped Holt’s method is best whether you compare MAE or RMSE values. So we will proceed with using the damped Holt’s method and apply it to the whole data set to get forecasts for future years.
fit <- www_usage %>%
model(Damped = ETS(value ~ error("A") + trend("Ad") + season("N")))
tidy(fit)
The smoothing parameter for the slope is estimated to be almost one, indicating that the trend changes to mostly reflect the slope between the last two minutes of internet usage. The value of \(α\) is very close to one, showing that the level reacts strongly to each new observation.
fit %>%
forecast(h = 10) %>%
autoplot(www_usage) +
labs(x="Minute", y="Number of users")
The resulting forecasts look sensible with decreasing trend, and relatively wide prediction intervals reflecting the variation in the historical data.
The Holt-Winters seasonal method comprises the forecast equation and three smoothing equations — one for the level \(ℓ_t\), one for the trend \(b_t\), and one for the seasonal component \(s_t\), with corresponding smoothing parameters \(α\), \(β^∗\) and \(γ\). We use \(m\) to denote the frequency of the seasonality, i.e., the number of seasons in a year. For example, for quarterly data \(m=4\), and for monthly data \(m=12\).
The additive method is preferred when the seasonal variations are roughly constant through the series, while the multiplicative method is preferred when the seasonal variations are changing proportional to the level of the series.
We apply Holt-Winters’ method with both additive and multiplicative seasonality to forecast quarterly visitor nights in Australia spent by domestic tourists.
aus_holidays <- tourism %>%
filter(Purpose == "Holiday") %>%
summarise(Trips = sum(Trips)/1e3)
fit <- aus_holidays %>%
model(
additive = ETS(Trips ~ error("A") + trend("A") + season("A")),
multiplicative = ETS(Trips ~ error("M") + trend("A") + season("M"))
)
fc <- fit %>% forecast(h = "3 years")
fc %>%
autoplot(aus_holidays, level = NULL) +
labs(y="Overnight trips (millions)")
tidy(fit)
Because both methods have exactly the same number of parameters to estimate, we can compare the training RMSE from both models. In this case, the method with multiplicative seasonality fits the data slightly better.
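The training accuracy of the two fitted models can be obtained with accuracy() applied to the mable (a sketch using the fit object above):
# Sketch: comparing training accuracy of the additive and multiplicative fits
accuracy(fit) %>%
  select(.model, RMSE, MAE, MAPE)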
Damping is possible with both additive and multiplicative Holt-Winters’ methods. A method that often provides accurate and robust forecasts for seasonal data is the Holt-Winters method with a damped trend and multiplicative seasonality:
\[\begin{align*} \hat{y}_{t+h|t} &= \left[\ell_{t} + (\phi+\phi^2 + \dots + \phi^{h})b_{t}\right]s_{t+h-m(k+1)}. \\ \ell_{t} &= \alpha(y_{t} / s_{t-m}) + (1 - \alpha)(\ell_{t-1} + \phi b_{t-1})\\ b_{t} &= \beta^*(\ell_{t} - \ell_{t-1}) + (1 - \beta^*)\phi b_{t-1} \\ s_{t} &= \gamma \frac{y_{t}}{(\ell_{t-1} + \phi b_{t-1})} + (1 - \gamma)s_{t-m}. \end{align*}\]
ETS(y ~ error("M") + trend("Ad") + season("M"))
The Holt-Winters method can also be used for daily type of data, where the seasonal period is \(m=7\), and the appropriate unit of time for \(h\) is in days. Here we forecast pedestrian traffic at a busy Melbourne train station in July 2016.
sth_cross_ped <- pedestrian %>%
filter(Date >= "2016-07-01", Sensor == "Southern Cross Station") %>%
index_by(Date) %>%
summarise(Count = sum(Count))
sth_cross_ped %>%
filter(Date <= "2016-07-31") %>%
model(hw = ETS(Count ~ error("M") + trend("Ad") + season("M"))) %>%
forecast(h = "2 weeks") %>%
autoplot(sth_cross_ped %>% filter(Date <= "2016-08-14"))
Clearly the model has identified the weekly seasonal pattern and the increasing trend at the end of the data, and the forecasts are a close match to the test data.
By considering variations in the combinations of the trend and seasonal components, nine exponential smoothing methods are possible. Each method is labelled by a pair of letters (T,S) defining the type of ‘Trend’ and ‘Seasonal’ components. For example, (A,M) is the method with an additive trend and multiplicative seasonality; (\(A_d\), N) is the method with damped trend and no seasonality; and so on.
Trend Component \ Seasonal Component | N (None) | A (Additive) | M (Multiplicative) |
---|---|---|---|
N (None) | (N,N) | (N,A) | (N,M) |
A (Additive) | (A,N) | (A,A) | (A,M) |
\(A_d\) (Additive damped) | (\(A_d\),N) | (\(A_d\),A) | (\(A_d\),M) |
Short hand | Method |
---|---|
(N,N) | Simple exponential smoothing |
(A,N) | Holt’s linear method |
(\(A_d\),N) | Additive damped trend method |
(A,A) | Additive Holt-Winters’ method |
(A,M) | Multiplicative Holt-Winters’ method |
(\(A_d\),M) | Holt-Winters’ damped method |
The exponential smoothing methods presented in Table 8.6 are algorithms which generate point forecasts. The statistical models in this section generate the same point forecasts, but can also generate prediction (or forecast) intervals. A statistical model is a stochastic (or random) data generating process that can produce an entire forecast distribution.
Each model consists of a measurement equation that describes the observed data, and some state equations that describe how the unobserved components or states (level, trend, seasonal) change over time. Hence, these are referred to as state space models.
For each method there exist two models: one with additive errors and one with multiplicative errors. To distinguish between a model with additive errors and one with multiplicative errors (and also to distinguish the models from the methods), we add a third letter to the classification. We label each state space model as ETS(⋅,⋅,⋅) for (Error, Trend, Seasonal).
An alternative to estimating the parameters by minimising the sum of squared errors is to maximise the “likelihood.” The likelihood is the probability of the data arising from the specified model. Thus, a large likelihood is associated with a good model. For an additive error model, maximising the likelihood (assuming normally distributed errors) gives the same results as minimising the sum of squared errors. However, different results will be obtained for multiplicative error models.
A great advantage of the ETS statistical framework is that information criteria can be used for model selection. The AIC, \(AIC_c\) and BIC can be used here to determine which of the ETS models is most appropriate for a given time series.
Models with multiplicative errors are useful when the data are strictly positive, but are not numerically stable when the data contain zeros or negative values.
We now employ the ETS statistical framework to forecast Australian holiday tourism over the period 2016–2019. We let the ETS() function select the model by minimising the AICc.
aus_holidays <- tourism %>%
filter(Purpose == "Holiday") %>%
summarise(Trips = sum(Trips)/1e3)
fit <- aus_holidays %>%
model(ETS(Trips))
report(fit)
Series: Trips
Model: ETS(M,N,A)
Smoothing parameters:
alpha = 0.3484054
gamma = 0.0001000018
Initial states:
l s1 s2 s3 s4
9.727072 -0.5376106 -0.6884343 -0.2933663 1.519411
sigma^2: 0.0022
AIC AICc BIC
226.2289 227.7845 242.9031
The model selected is ETS(M,N,A).
The small values of \(γ\) indicate that the seasonal components change very little over time. The narrow prediction intervals indicate that the series is relatively easy to forecast due to the strong trend and seasonality.
components(fit) %>%
autoplot() +
labs(title = "ETS(M,N,A) components")
Because this model has multiplicative errors, the innovation residuals are not equivalent to the regular residuals (i.e., the one-step training errors). The innovation residuals are given by \(\hat{\varepsilon}_t\), while the regular residuals are defined as \(y_t − \hat{y}_{t|t−1}\).
augment(fit) %>%
autoplot(.resid)
augment(fit) %>%
autoplot(.innov)
ETS point forecasts are equal to the medians of the forecast distributions. For models with only additive components, the forecast distributions are normal, so the medians and means are equal. For ETS models with multiplicative errors, or with multiplicative seasonality, the point forecasts will not be equal to the means of the forecast distributions.
fit %>%
forecast(h = 8) %>%
autoplot(aus_holidays) +
labs(y="Domestic holiday visitors in Australia (thousands)")
The prediction intervals will differ between models with additive and multiplicative methods.
For most ETS models, a prediction interval can be written as
\(\hat{y}_{T+h|T} \pm c \sigma_h\)
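As a worked sketch of this formula (using the ETS(A,N,N) model for Algerian exports from earlier; for that model the one-step standard deviation \(σ_1\) is just the residual standard deviation, and \(c ≈ 1.96\) for a 95% interval):
# Sketch: reproducing the one-step 95% interval y_hat +/- c*sigma_1 by hand
fit_ses <- algeria_economy %>%
  model(ETS(Exports ~ error("A") + trend("N") + season("N")))
sigma1 <- sqrt(glance(fit_ses)$sigma2)
fc1 <- fit_ses %>% forecast(h = 1)
fc1$.mean + c(-1, 1) * 1.96 * sigma1  # compare with hilo(fc1, level = 95)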