Stat 142 (Time Series Analysis)

Lesson 2.3 (Time series regression models)

Author

Norberto E. Milla, Jr.

Published

April 13, 2023

1 Introduction

Suppose we are interested in forecasting a time series y_t, assuming that it has a linear relationship with another time series x_t. For example, we might wish to forecast monthly sales using total advertising spending as a predictor. Or, we might forecast daily electricity demand using temperature and the day of the week as predictors.

The forecast variable y_t is sometimes also called the regressand, dependent or explained variable. The predictor variables x_t are sometimes also called the regressors, independent or explanatory variables.

2 The linear model

2.1 Simple linear regression model

In the simplest case, the regression model allows for a linear relationship between the forecast variable y_t and a single predictor variable x_t: y_t = \beta_0 + \beta_1 x_t + \epsilon_t

The coefficients \beta_0 and \beta_1 denote the intercept and the slope of the regression line, respectively. The slope \beta_1 represents the average predicted change in y_t resulting from a one-unit increase in x_t.

Notice that the observations do not lie on the straight line but are scattered around it. We can think of each observation y_t as consisting of the systematic or explained part of the model, \beta_0 + \beta_1 x_t, and the random “error”, \epsilon_t. The “error” term does not imply a mistake, but a deviation from the underlying straight line model. It captures anything that may affect y_t other than x_t.

Code
library(fpp2)   # attaches the forecast and ggplot2 packages and provides the uschange data

uschange %>%
  as.data.frame() %>%
  ggplot(aes(x=Income, y=Consumption)) +
    ylab("Consumption (quarterly % change)") +
    xlab("Income (quarterly % change)") +
    geom_point() +
    geom_smooth(method="lm", se=FALSE)

The equation is estimated in R using the tslm() function:

Code
lm1 <- tslm(Consumption ~ Income, data=uschange)
summary(lm1)

Call:
tslm(formula = Consumption ~ Income, data = uschange)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.40845 -0.31816  0.02558  0.29978  1.45157 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.54510    0.05569   9.789  < 2e-16 ***
Income       0.28060    0.04744   5.915 1.58e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6026 on 185 degrees of freedom
Multiple R-squared:  0.159, Adjusted R-squared:  0.1545 
F-statistic: 34.98 on 1 and 185 DF,  p-value: 1.577e-08

The fitted line has a positive slope, reflecting the positive relationship between income and consumption. The slope coefficient shows that a 1 percentage point increase in personal disposable income results, on average, in an increase of 0.28 percentage points in personal consumption expenditure. Alternatively, the estimated equation shows that a value of 1 for x (the percentage increase in personal disposable income) will result in a forecast value of 0.55 + 0.28 \times 1 = 0.83 for y (the percentage increase in personal consumption expenditure).

The interpretation of the intercept requires that a value of x = 0 makes sense. In this case when x = 0 (i.e., when there is no change in personal disposable income since the last quarter) the predicted value of y is 0.55 (i.e., an average increase in personal consumption expenditure of 0.55%). Even when x = 0 does not make sense, the intercept is an important part of the model. Without it, the slope coefficient can be distorted unnecessarily. The intercept should always be included unless the requirement is to force the regression line “through the origin”. In what follows we assume that an intercept is always included in the model.
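
As a quick check of this calculation (a minimal sketch, assuming the lm1 object fitted above), the same point forecast can be produced with the forecast() function:

Code
# Forecast consumption growth when income grows by 1%;
# the point forecast reproduces 0.55 + 0.28 x 1 = 0.83 (up to rounding)
forecast(lm1, newdata = data.frame(Income = 1))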

2.2 Multiple linear regression model

When there are two or more predictor variables, the model is called a multiple regression model. The general form of a multiple regression model is

y_t = \beta_0 + \beta_1x_{1,t} + \beta_2x_{2,t} + \cdots + \beta_kx_{k,t} + \epsilon_t

where y is the variable to be forecast and x_1, x_2, \cdots, x_k are the k predictor variables. Each of the predictor variables must be numerical. The coefficients \beta_1, \beta_2, \cdots, \beta_k measure the effect of each predictor after taking into account the effects of all the other predictors in the model. Thus, the coefficients measure the marginal effects of the predictor variables.

For instance, additional predictors that may be useful for forecasting US consumption expenditure include quarterly percentage changes in industrial production and personal savings, and quarterly changes in the unemployment rate (as this is already a percentage). Building a multiple linear regression model can potentially generate more accurate forecasts as we expect consumption expenditure to not only depend on personal income but on other predictors as well.

Below is a scatterplot matrix of five variables. The first column shows the relationships between the forecast variable (consumption) and each of the predictors. The scatterplots show positive relationships with income and industrial production, and negative relationships with savings and unemployment. The strength of these relationships is shown by the correlation coefficients across the first row. The remaining scatterplots and correlation coefficients show the relationships between the predictors.

Code
uschange %>%
  as.data.frame() %>%
  GGally::ggpairs()

2.3 Assumptions

When we use a linear regression model, we are implicitly making some assumptions about the variables. First, we assume that the model is a reasonable approximation to reality; that is, the relationship between the forecast variable and the predictor variables satisfies this linear equation.

Second, we make the following assumptions about the errors \epsilon_1, \epsilon_2, \cdots, \epsilon_T:

  • they have mean zero; otherwise the forecasts will be systematically biased.
  • they are not autocorrelated; otherwise the forecasts will be inefficient, as there is more information in the data that can be exploited.
  • they are unrelated to the predictor variables; otherwise there would be more information that should be included in the systematic part of the model.

It is also useful to have the errors being normally distributed with a constant variance \sigma^2 in order to easily produce prediction intervals.

Another important assumption in the linear regression model is that each predictor x is not a random variable. If we were performing a controlled experiment in a laboratory, we could control the values of each x (so they would not be random) and observe the resulting values of y. With observational data (including most data in business and economics), it is not possible to control the value of x; we simply observe it. Hence we make this an assumption.

3 Least squares estimation

In practice, of course, we have a collection of observations but we do not know the values of the coefficients \beta_0, \beta_1, \beta_2, \cdots, \beta_k. These need to be estimated from the data.

The least squares principle provides a way of choosing the coefficients effectively by minimizing the sum of the squared errors. That is, we choose the values of \beta_0, \beta_1, \beta_2, \cdots, \beta_k that minimize

\sum_{t=1}^T \epsilon_t^2 = \sum_{t=1}^T (y_t - \beta_0 - \beta_1x_{1,t} - \beta_2x_{2,t} - \cdots - \beta_kx_{k,t})^2

This is called least squares estimation because it gives the least value for the sum of squared errors. Finding the best estimates of the coefficients is often called “fitting” the model to the data, or sometimes “learning” or “training” the model.

When we refer to the estimated coefficients, we will use the notation \hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \cdots, \hat{\beta}_k.

The tslm() function fits a linear regression model to time series data. It is similar to the lm() function which is widely used for linear models, but tslm() provides additional facilities for handling time series.

3.1 Example

A multiple linear regression model for US consumption is y_t = \beta_0 + \beta_1x_{1,t} + \beta_2x_{2,t} + \beta_3x_{3,t} + \beta_4x_{4,t} + \epsilon_t

where y is the percentage change in real personal consumption expenditure, x_1 is the percentage change in real personal disposable income, x_2 is the percentage change in industrial production, x_3 is the percentage change in personal savings and x_4 is the change in the unemployment rate.

Code
fit1 <- tslm(Consumption ~ Income + Production + Unemployment + Savings,
     data=uschange)
summary(fit1)

Call:
tslm(formula = Consumption ~ Income + Production + Unemployment + 
    Savings, data = uschange)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.88296 -0.17638 -0.03679  0.15251  1.20553 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.26729    0.03721   7.184 1.68e-11 ***
Income        0.71449    0.04219  16.934  < 2e-16 ***
Production    0.04589    0.02588   1.773   0.0778 .  
Unemployment -0.20477    0.10550  -1.941   0.0538 .  
Savings      -0.04527    0.00278 -16.287  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3286 on 182 degrees of freedom
Multiple R-squared:  0.754, Adjusted R-squared:  0.7486 
F-statistic: 139.5 on 4 and 182 DF,  p-value: < 2.2e-16

For forecasting purposes, the final two columns are of limited interest. The “t value” is the ratio of an estimated \beta coefficient to its standard error and the last column gives the p-value: the probability of the estimated \beta coefficient being as large as it is if there was no real relationship between consumption and the corresponding predictor. This is useful when studying the effect of each predictor, but is not particularly useful for forecasting.

3.2 Fitted values

Predictions of y can be obtained by using the estimated coefficients in the regression equation and setting the error term to zero. In general, we write

\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1x_{1,t} + \hat{\beta}_2x_{2,t} + \cdots + \hat{\beta}_kx_{k,t}

Plugging in the values of x_{1,t}, x_{2,t}, \cdots, x_{k,t} for t=1,2, \cdots, T returns predictions of y_t within the training-sample, referred to as fitted values. Note that these are predictions of the data used to estimate the model, not genuine forecasts of future values of y.

The following plots show the actual values compared to the fitted values for the percentage change in the US consumption expenditure series. The time plot shows that the fitted values follow the actual data fairly closely. This is verified by the strong positive relationship shown by the scatterplot.

Code
cbind(Data = uschange[,"Consumption"],
      Fitted = fitted(fit1)) %>% 
  autoplot() + 
  labs(x = "Year", y = " ") +
  ggtitle("Percent change in US consumption expenditure") +
  guides(colour=guide_legend(title=" "))

Code
cbind(Data = uschange[,"Consumption"],
      Fitted = fitted(fit1)) %>%
  as.data.frame() %>%
  ggplot(aes(x=Data, y=Fitted)) +
    geom_point() +
    ylab("Fitted (predicted values)") +
    xlab("Data (actual values)") +
    ggtitle("Percent change in US consumption expenditure") +
    geom_abline(intercept=0, slope=1)

3.3 Goodness-of-fit

A common way to summarize how well a linear regression model fits the data is via the coefficient of determination, or R^2. This can be calculated as the square of the correlation between the observed y and the predicted \hat{y} values. Alternatively, it can also be calculated as,

R^2 = \frac{\sum_{t=1}^T (\hat{y}_t - \overline{y})^2}{\sum_{t=1}^T (y_t - \overline{y})^2}

Thus, it reflects the proportion of variation in the forecast variable that is accounted for (or explained) by the regression model.

In the case of multiple linear regression, we use the “adjusted R^2” instead of the ordinary R^2.

In the previous example we have adjusted R^2 = 0.7486. In this case the model does an excellent job, as it explains 74.86% of the variation in the consumption data.

Another measure of how well the model has fitted the data is the standard deviation of the residuals, which is often known as the residual standard error. In the example above, the residual standard error is 0.3286, which is quite small and also indicates a good fit to the data.
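
Both quantities can be recomputed directly from the fitted model (a minimal sketch, assuming fit1 from above, with T = 187 observations and k = 4 predictors):

Code
# R^2 as the squared correlation between fitted and observed values
cor(fitted(fit1), uschange[, "Consumption"])^2

# Residual standard error: sqrt(SSE / (T - k - 1))
sqrt(sum(residuals(fit1)^2) / (187 - 4 - 1))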

4 Evaluating the regression model

The differences between the observed y values and the corresponding fitted \hat{y} values are the training-set errors or “residuals” defined as

\begin{align} e_t &= y_t - \hat{y}_t \notag \\ &= y_t - \hat{\beta}_0 - \hat{\beta}_1x_{1,t} - \hat{\beta}_2x_{2,t} - \cdots - \hat{\beta}_kx_{k,t} \notag \end{align}

for t = 1, 2, \cdots, T. Each residual is the unpredictable component of the associated observation.

The residuals have some useful properties including the following two:

\sum_{t=1}^T e_t =0

and

\sum_{t=1}^T x_{k,t}e_t = 0, \forall k

As a result of these properties, it is clear that the average of the residuals is zero, and that the correlation between the residuals and the observations for the predictor variable is also zero. (This is not necessarily true when the intercept is omitted from the model.)
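
These properties are easy to verify numerically (a minimal sketch, assuming fit1 from above); both results should be zero up to rounding error:

Code
# Sum of the residuals
sum(residuals(fit1))

# Cross-products of each column of the design matrix with the residuals
t(model.matrix(fit1)) %*% as.numeric(residuals(fit1))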

After selecting the regression variables and fitting a regression model, it is necessary to plot the residuals to check that the assumptions of the model have been satisfied. There are a series of plots that should be produced in order to check different aspects of the fitted model and the underlying assumptions.

4.1 ACF plot of residuals

With time series data, it is highly likely that the value of a variable observed in the current time period will be similar to its value in the previous period, or even the period before that, and so on. Therefore when fitting a regression model to time series data, it is common to find autocorrelation in the residuals. In this case, the estimated model violates the assumption of no autocorrelation in the errors, and our forecasts may be inefficient — there is some information left over which should be accounted for in the model in order to obtain better forecasts. The forecasts from a model with autocorrelated errors are still unbiased, and so are not “wrong”, but they will usually have larger prediction intervals than they need to. Therefore we should always look at an ACF plot of the residuals.

Another useful test of autocorrelation in the residuals, designed to take account of the regression model, is the Breusch-Godfrey test, also referred to as the LM (Lagrange Multiplier) test for serial correlation. It is used to test the joint hypothesis that there is no autocorrelation in the residuals up to a certain specified order. A small p-value indicates that there is significant autocorrelation remaining in the residuals.

The Breusch-Godfrey test is similar to the Ljung-Box test, but it is specifically designed for use with regression models.
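
For a rough comparison only (the Breusch-Godfrey test reported by checkresiduals() below remains the preferred choice for regression residuals), a Ljung-Box test can be applied to the same residuals; a minimal sketch, assuming fit1 from above:

Code
# Ljung-Box test on the regression residuals, jointly testing lags 1 to 8
Box.test(residuals(fit1), lag = 8, type = "Ljung-Box")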

It is always a good idea to check whether the residuals are normally distributed. As we explained earlier, this is not essential for forecasting, but it does make the calculation of prediction intervals much easier.

Code
checkresiduals(fit1)


    Breusch-Godfrey test for serial correlation of order up to 8

data:  Residuals from Linear regression model
LM test = 14.874, df = 8, p-value = 0.06163

The checkresiduals() function generates a time plot, the ACF, and the histogram of the residuals from the multiple regression model, as well as the Breusch-Godfrey test for jointly testing up to 8th order autocorrelation. (The checkresiduals() function will use the Breusch-Godfrey test for regression models, but the Ljung-Box test otherwise.)

The time plot shows some changing variation over time, but is otherwise relatively unremarkable. This heteroscedasticity will potentially make the prediction interval coverage inaccurate.

The histogram shows that the residuals seem to be slightly skewed, which may also affect the coverage probability of the prediction intervals.

The autocorrelation plot shows a significant spike at lag 7, but it is not quite enough for the Breusch-Godfrey test to be significant at the 5% level. In any case, the autocorrelation is not particularly large, and at lag 7 it is unlikely to have any noticeable impact on the forecasts or the prediction intervals.

4.2 Other useful plots of residuals

We would expect the residuals to be randomly scattered without showing any systematic patterns. A simple and quick way to check this is to examine scatterplots of the residuals against each of the predictor variables. If these scatterplots show a pattern, then the relationship may be nonlinear and the model will need to be modified accordingly.

Code
df <- as.data.frame(uschange)
df[,"Residuals"]  <- as.numeric(residuals(fit1))
p1 <- ggplot(df, aes(x=Income, y=Residuals)) +
  geom_point()
p2 <- ggplot(df, aes(x=Production, y=Residuals)) +
  geom_point()
p3 <- ggplot(df, aes(x=Savings, y=Residuals)) +
  geom_point()
p4 <- ggplot(df, aes(x=Unemployment, y=Residuals)) +
  geom_point()
gridExtra::grid.arrange(p1, p2, p3, p4, nrow=2)

A plot of the residuals against the fitted values should also show no pattern. If a pattern is observed, there may be “heteroscedasticity” in the errors which means that the variance of the residuals may not be constant. If this problem occurs, a transformation of the forecast variable such as a logarithm or square root may be required.

Code
cbind(Fitted = fitted(fit1),
      Residuals=residuals(fit1)) %>%
  as.data.frame() %>%
  ggplot(aes(x=Fitted, y=Residuals)) + geom_point()

4.3 Outliers and influential observations

Observations that take extreme values compared to the majority of the data are called outliers. Observations that have a large influence on the estimated coefficients of a regression model are called influential observations. Usually, influential observations are also outliers that are extreme in the x direction.

A scatter plot of y against each x is always a useful starting point in regression analysis, and often helps to identify unusual observations.

One source of outliers is incorrect data entry. Simple descriptive statistics of your data can identify minima and maxima that are not sensible. If such an observation is identified, and it has been recorded incorrectly, it should be corrected or removed from the sample immediately.

Outliers also occur when some observations are simply different. In this case it may not be wise for these observations to be removed. If an observation has been identified as a likely outlier, it is important to study it and analyse the possible reasons behind it. The decision to remove or retain an observation can be a challenging one (especially when outliers are influential observations). It is wise to report results both with and without the removal of such observations.

Code
knitr::include_graphics("pic12.png")

The above figure highlights the effect of a single outlier when regressing US consumption on income. In the left panel the outlier is only extreme in the direction of y, as the percentage change in consumption has been incorrectly recorded as -4%. The red line is the regression line fitted to the data which includes the outlier, compared to the black line which is the line fitted to the data without the outlier.

In the right panel the outlier now is also extreme in the direction of x with the 4% decrease in consumption corresponding to a 6% increase in income. In this case the outlier is extremely influential as the red line now deviates substantially from the black line.

5 Some useful predictors

There are several useful predictors that occur frequently when using regression for time series data.

Trend

It is common for time series data to be trending. A linear trend can be modelled by simply using x_{1,t} = t as a predictor:

y_t = \beta_0 + \beta_1t + \epsilon_t

where t = 1, 2, \cdots, T. A trend variable can be specified in the tslm() function using the trend predictor, as sketched below.
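
As a minimal sketch (using the ausbeer series from the fpp2 package, which also appears in the example later in this lesson), a linear trend model can be fitted as follows:

Code
# The trend term (t = 1, 2, ..., T) is created automatically by tslm()
fit.trend <- tslm(ausbeer ~ trend)
summary(fit.trend)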

Dummy variables

So far, we have assumed that each predictor takes numerical values. But what about when a predictor is a categorical variable taking only two values (e.g., “yes” and “no”)? Such a variable might arise, for example, when forecasting daily sales and you want to take account of whether the day is a public holiday or not. So the predictor takes value “yes” on a public holiday, and “no” otherwise.

This situation can still be handled within the framework of multiple regression models by creating a “dummy variable” which takes value 1 corresponding to “yes” and 0 corresponding to “no”. A dummy variable is also known as an “indicator variable”.

A dummy variable can also be used to account for an outlier in the data. Rather than omit the outlier, a dummy variable removes its effect. In this case, the dummy variable takes value 1 for that observation and 0 everywhere else.

If there are more than two categories, then the variable can be coded using several dummy variables (one fewer than the total number of categories). tslm() will automatically handle this case if you specify a factor variable as a predictor. There is usually no need to manually create the corresponding dummy variables.

Seasonal dummy variables

Suppose that we are forecasting daily data and we want to account for the day of the week as a predictor. Then the following dummy variables can be created.

            d_{1,t}  d_{2,t}  d_{3,t}  d_{4,t}  d_{5,t}  d_{6,t}
Monday         1        0        0        0        0        0
Tuesday        0        1        0        0        0        0
Wednesday      0        0        1        0        0        0
Thursday       0        0        0        1        0        0
Friday         0        0        0        0        1        0
Saturday       0        0        0        0        0        1
Sunday         0        0        0        0        0        0

Notice that only six dummy variables are needed to code seven categories. That is because the seventh category (in this case Sunday) is captured by the intercept, and is specified when the dummy variables are all set to zero.

Many beginners will try to add a seventh dummy variable for the seventh category. This is known as the “dummy variable trap”, because it will cause the regression to fail. There will be one too many parameters to estimate when an intercept is also included. The general rule is to use one fewer dummy variables than categories. So for quarterly data, use three dummy variables; for monthly data, use 11 dummy variables; and for daily data, use six dummy variables, and so on.

The interpretation of each of the coefficients associated with the dummy variables is that it is a measure of the effect of that category relative to the omitted category. In the above example, the coefficient of d_{1,t} associated with Monday will measure the effect of Monday on the forecast variable compared to the effect of Sunday.

The tslm() function will automatically handle this situation if you specify the predictor season.
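
If you want to inspect the dummy variables themselves, the seasonaldummy() function in the forecast package generates them; a minimal sketch for quarterly data, assuming the ausbeer series used in the next example:

Code
# A matrix with frequency - 1 = 3 columns of quarterly indicator variables
head(forecast::seasonaldummy(ausbeer))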

5.1 Example

Let us use the Australian quarterly beer production data to illustrate the above concepts.

Code
beer <- window(ausbeer, start=1992) 
autoplot(beer) +
  labs(x = "Year",
       y = "Megalitres")

We want to forecast the value of future beer production. We can model this data using a regression model with a linear trend and quarterly dummy variables

y_t = \beta_0 + \beta_1t + \beta_2d_{2,t} + \beta_3d_{3,t} + \beta_4d_{4,t} + \epsilon_t

where d_{i,t} = 1 if t is in quarter i, and 0 otherwise. The first quarter variable has been omitted, so the coefficients associated with the other quarters are measures of the difference between those quarters and the first quarter.

Code
fit.beer <- tslm(beer ~ trend + season)
summary(fit.beer)

Call:
tslm(formula = beer ~ trend + season)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.903  -7.599  -0.459   7.991  21.789 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 441.80044    3.73353 118.333  < 2e-16 ***
trend        -0.34027    0.06657  -5.111 2.73e-06 ***
season2     -34.65973    3.96832  -8.734 9.10e-13 ***
season3     -17.82164    4.02249  -4.430 3.45e-05 ***
season4      72.79641    4.02305  18.095  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.23 on 69 degrees of freedom
Multiple R-squared:  0.9243,    Adjusted R-squared:  0.9199 
F-statistic: 210.7 on 4 and 69 DF,  p-value: < 2.2e-16

Note that trend and season are not objects in the R workspace; they are created automatically by tslm() when specified in this way.

There is an average downward trend of 0.34 megalitres per quarter. On average, the second quarter has production of 34.7 megalitres lower than the first quarter, the third quarter has production of 17.8 megalitres lower than the first quarter, and the fourth quarter has production of 72.8 megalitres higher than the first quarter.

Code
cbind(Data=beer, Fitted=fitted(fit.beer)) %>%
  autoplot() +
  labs(x = "Year",
        y = "Megalitres") +
  ggtitle("Quarterly Beer Production") +
  guides(colour=guide_legend(title=" "))

Code
cbind(Data=beer, Fitted=fitted(fit.beer)) %>%
  as.data.frame() %>%
  ggplot(aes(x = Data, y = Fitted,
             colour = as.factor(cycle(beer)))) +
    geom_point() +
    ylab("Fitted") + xlab("Actual values") +
    ggtitle("Quarterly beer production") +
    scale_colour_brewer(palette="Dark2", name="Quarter") +
    geom_abline(intercept=0, slope=1)

6 Selecting predictors

When there are many possible predictors, we need some strategy for selecting the best predictors to use in a regression model.

A common approach that is not recommended is to plot the forecast variable against a particular predictor and if there is no noticeable relationship, drop that predictor from the model. This is invalid because it is not always possible to see the relationship from a scatterplot, especially when the effects of other predictors have not been accounted for.

Another common approach which is also invalid is to do a multiple linear regression on all the predictors and disregard all variables whose p-values are greater than 0.05. To start with, statistical significance does not always indicate predictive value. Even if forecasting is not the goal, this is not a good strategy because the p-values can be misleading when two or more predictors are correlated with each other.

Instead, we will use a measure of predictive accuracy. Five such measures are: CV, AIC, AIC_c, BIC and \text{adjusted}\, R^2. They can be calculated using the CV() function.

Code
CV(fit1)
          CV          AIC         AICc          BIC        AdjR2 
   0.1163477 -409.2980298 -408.8313631 -389.9113781    0.7485856 

We compare these values against the corresponding values from other models. For the CV, AIC, AIC_c and BIC measures, we want to find the model with the lowest value; for \text{Adjusted}\,R^2, we seek the model with the highest value.

6.1 Adjusted R^2

Computer output for a regression will always give the R^2 value. However, it is not a good measure of the predictive ability of a model. It measures how well the model fits the historical data, but not how well the model will forecast future data.

In addition, R^2 does not allow for “degrees of freedom”. Adding any variable tends to increase the value of R^2, even if that variable is irrelevant. For these reasons, forecasters should not use R^2 to determine whether a model will give good predictions, as it will lead to overfitting.

An equivalent idea is to select the model which gives the minimum sum of squared errors (SSE), given by

SSE = \sum_{t=1}^T e_t^2

Minimizing the SSE is equivalent to maximizing R^2 and will always choose the model with the most variables, and so is not a valid way of selecting predictors.

An alternative which is designed to overcome these problems is the adjusted R^2, which we denote here as \overline{R}^2:

\overline{R}^2 = 1 - (1-R^2)\frac{T-1}{T-k-1}

where T is the number of observations and k is the number of predictors. This is an improvement on R^2, as it will no longer increase with each added predictor. Using this measure, the best model will be the one with the largest value of \overline{R}^2. Maximizing \overline{R}^2 is equivalent to minimizing the standard error of the residuals (\hat{\sigma}_{\epsilon}).
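
Using the rounded values from the earlier output (R^2 = 0.754, T = 187 and k = 4), this formula reproduces the reported adjusted R^2 of 0.7486 (a quick check, assuming the fit1 model from above):

Code
# Adjusted R^2 = 1 - (1 - R^2) * (T - 1) / (T - k - 1)
1 - (1 - 0.754) * (187 - 1) / (187 - 4 - 1)   # approximately 0.7486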

6.2 Cross-validation

Time series cross-validation is a general tool for determining the predictive ability of a model. For regression models, it is also possible to use classical leave-one-out cross-validation to select predictors. This is faster and makes more efficient use of the data. The procedure uses the following steps:

  1. Remove observation t from the data set, and fit the model using the remaining data. Then compute the error e_t^{*} = y_t - \hat{y}_t for the omitted observation.

  2. Repeat step 1 for t= 1, 2, \cdots, T.

  3. Compute the MSE from e_1^{*}, e_2^{*}, \cdots, e_T^{*}. We shall call this the CV.

Under this criterion, the best model is the one with the smallest value of CV.
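
For a linear regression model there is a well-known shortcut that avoids refitting the model T times: CV = \frac{1}{T}\sum_{t=1}^T \left[e_t/(1-h_t)\right]^2, where e_t are the residuals and h_t is the t-th diagonal element of the hat matrix. A minimal sketch of this shortcut (assuming fit1 from above) should reproduce the CV value reported earlier by CV(fit1):

Code
# Leave-one-out CV via the hat-matrix shortcut (no refitting required)
e <- residuals(fit1)
h <- hatvalues(fit1)
mean((e / (1 - h))^2)   # compare with CV(fit1)["CV"]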

6.3 Akaike’s Information Criterion

A closely-related method is Akaike’s Information Criterion, which we define as

AIC = T\log\left(\frac{SSE}{T}\right) + 2(k+2)

where T is the number of observations used for estimation and k is the number of predictors in the model. Different computer packages use slightly different definitions for the AIC, although they should all lead to the same model being selected. The k+2 part of the equation occurs because there are k+2 parameters in the model: the k coefficients for the predictors, the intercept and the variance of the residuals. The idea here is to penalize the fit of the model (SSE) with the number of parameters that need to be estimated.

The model with the minimum value of the AIC is often the best model for forecasting. For large values of T, minimizing the AIC is equivalent to minimizing the CV value.

6.4 Corrected Akaike’s Information Criterion

For small values of T, the AIC tends to select too many predictors, and so a bias-corrected version of the AIC has been developed,

AIC_c = AIC + \frac{2(k+2)(k+3)}{T-k-3}

As with the AIC, the AICc should be minimized.

6.5 Schwarz’s Bayesian Information Criterion

A related measure is Schwarz’s Bayesian Information Criterion (usually abbreviated to BIC, SBIC or SC):

BIC = T\log\left(\frac{SSE}{T}\right) + (k+2)\log(T)

As with the AIC, minimising the BIC is intended to give the best model. The model chosen by the BIC is either the same as that chosen by the AIC, or one with fewer terms. This is because the BIC penalizes the number of parameters more heavily than the AIC.
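
All three criteria can be computed directly from the SSE of the fitted model and compared with the CV(fit1) output shown earlier; a minimal sketch, assuming fit1 from above, whose results should match up to rounding:

Code
T.obs <- length(residuals(fit1))   # T = 187 observations
k     <- 4                         # number of predictors
SSE   <- sum(residuals(fit1)^2)

aic  <- T.obs * log(SSE / T.obs) + 2 * (k + 2)
aicc <- aic + 2 * (k + 2) * (k + 3) / (T.obs - k - 3)
bic  <- T.obs * log(SSE / T.obs) + (k + 2) * log(T.obs)
c(AIC = aic, AICc = aicc, BIC = bic)   # compare with CV(fit1)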

6.6 Which measure should we use?

While \overline{R}^2 is widely used, and has been around longer than the other measures, its tendency to select too many predictor variables makes it less suitable for forecasting.

Many statisticians like to use the BIC because it has the feature that if there is a true underlying model, the BIC will select that model given enough data. However, in reality, there is rarely, if ever, a true underlying model, and even if there was a true underlying model, selecting that model will not necessarily give the best forecasts (because the parameter estimates may not be accurate).

Consequently, it is recommended that one of the AICc, AIC, or CV statistics be used, each of which has forecasting as its objective. If the value of T is large enough, they will all lead to the same model.

6.7 Example: US consumption

In the multiple regression example for forecasting US consumption we considered four predictors. With four predictors, there are 2^4 = 16 possible models. Now we can check if all four predictors are actually useful, or whether we can drop one or more of them. All 16 models were fitted and the results are summarized in the following table. A “1” indicates that the predictor was included in the model, and a “0” means that the predictor was not included in the model. Hence the first row shows the measures of predictive accuracy for a model including all four predictors.

The results have been sorted according to the AICc. Therefore the best models are given at the top of the table, and the worst at the bottom of the table.

The best model contains all four predictors. However, a closer look at the results reveals some interesting features. There is clear separation between the models in the first four rows and the ones below. This indicates that Income and Savings are both more important variables than Production and Unemployment. Also, the first two rows have almost identical values of CV, AIC and AICc. So we could possibly drop the Production variable and get similar forecasts. Note that Production and Unemployment are highly (negatively) correlated, so most of the predictive information in Production is also contained in the Unemployment variable.
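
As an illustration (a minimal sketch, assuming fit1 and uschange from above; the object name fit2 is introduced here), the model without Production can be fitted and its accuracy measures compared with those of the full model:

Code
# Drop Production and compare the predictive accuracy measures
fit2 <- tslm(Consumption ~ Income + Unemployment + Savings, data = uschange)
CV(fit1)
CV(fit2)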

7 Best subset regression

Where possible, all potential regression models should be fitted (as was done in the example above) and the best model should be selected based on one of the measures discussed. This is known as “best subsets” regression or “all possible subsets” regression.

7.1 Stepwise regression

If there are a large number of predictors, it is not possible to fit all possible models. For example, 40 predictors leads to 2^{40} > 1 trillion possible models! Consequently, a strategy is required to limit the number of models to be explored.

An approach that works quite well is backwards stepwise regression:

  1. Start with the model containing all potential predictors.

  2. Remove one predictor at a time. Keep the model if it improves the measure of predictive accuracy.

  3. Iterate until no further improvement.

If the number of potential predictors is too large, then the backwards stepwise regression will not work and forward stepwise regression can be used instead. This procedure starts with a model that includes only the intercept. Predictors are added one at a time, and the one that most improves the measure of predictive accuracy is retained in the model. The procedure is repeated until no further improvement can be achieved.

Alternatively for either the backward or forward direction, a starting model can be one that includes a subset of potential predictors. In this case, an extra step needs to be included. For the backwards procedure we should also consider adding a predictor with each step, and for the forward procedure we should also consider dropping a predictor with each step. These are referred to as hybrid procedures.

It is important to realise that any stepwise approach is not guaranteed to lead to the best possible model, but it almost always leads to a good model.

8 Beware of inference after selecting predictors

The procedures we recommend for selecting predictors are helpful when the model is used for forecasting; they are not helpful if you wish to study the effect of any predictor on the forecast variable.

If you do wish to look at the statistical significance of the predictors, beware that any procedure involving selecting predictors first will invalidate the assumptions behind the p-values.