2022-10-03

Time series regression models

  • We will be looking at regression models (you have seen, or will see, this topic in Econometrics)

  • We will focus on regression models for forecasting!

  • The basic concept is that we want to forecast the time series of interest (y) assuming that it has a linear relationship with other time series (x)

  • \(y_t\) is the variable we want to predict: the “response” variable

  • Each \(x_{j,t}\) is numerical and is called a “predictor”.

  • Ex: Forecast monthly sales (y) using total advertising spend (x)

Simple linear regression

  • It is a statistical method for fitting a line to data to represent the relationship between two variables.

  • It is modeled by a straight line with some error (Why?)

\[ y_t = \beta_0 + \beta_1x_t + \varepsilon_t \]

  • \(\beta_0\) - intercept - the predicted value of \(y_t\) when \(x = 0\)

  • \(\beta_1\) - slope - average predicted change in y from a one unit change in x

  • \(\varepsilon_t\) - error - changes in \(y\) that are not explained by \(x\) (a white noise error term)

  • \(\beta_1\) measures the marginal effect

Example

  • Let’s see if percentage changes in real personal disposable income explain percentage changes in real personal consumption expenditure.
us_change %>%
  pivot_longer(c(Consumption, Income), names_to="Series") %>%
  autoplot(value) +
  labs(y = "% change")

Example

  • Lets see if there is a pattern between the variables (scatterplot)
us_change %>% ggplot(aes(x = Income, y = Consumption)) +
  labs(y = "Consumption (quarterly % change)",
       x = "Income (quarterly % change)")+geom_point()

  • The relationship appears to be positive - a % change in income is associated with a positive % change in consumption

Example

us_change %>% model(TSLM(Consumption ~ Income)) %>% report()
## Series: Consumption 
## Model: TSLM 
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.582 -0.278  0.019  0.323  1.422 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.5445     0.0540   10.08  < 2e-16 ***
## Income        0.2718     0.0467    5.82  2.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.591 on 196 degrees of freedom
## Multiple R-squared: 0.147,   Adjusted R-squared: 0.143
## F-statistic: 33.8 on 1 and 196 DF, p-value: 2e-08

Example

  • We will talk about how to estimate a linear model in the next few slides.

  • \(\hat{\beta}_0 = 0.5445\) - when there is no change in personal disposable income (\(x = 0\)), personal consumption expenditure increases, on average, by 0.54 percentage points.

  • \(\hat{\beta}_1 = 0.2718\) - a 1 percentage point increase in personal disposable income results, on average, in a 0.27 percentage point increase in personal consumption expenditure.

  • \(\hat{\beta}_0 + \hat{\beta}_1 = 0.82\) - in total, a 1 percentage point increase in personal disposable income results, on average, in a forecast increase of 0.82 percentage points in personal consumption expenditure.

  • The intercept should always be included unless the requirement is to force the regression line “through the origin”.

  • Without it, the slope coefficient can be distorted unnecessarily.

Example

The fitted line !!!

us_change %>% ggplot(aes(x = Income, y = Consumption)) +
  labs(y = "Consumption (quarterly % change)",
       x = "Income (quarterly % change)")+geom_point()+
       xlim(0,5)+
    geom_smooth(method = "lm", se = FALSE)

Multiple regression and forecasting

\[y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_kx_{k,t} + \varepsilon_t\]

  • Each predictor \(x_{j,t}\) is numerical and is usually assumed to be known for all past and future times.

  • The coefficients \(\beta_1,\dots,\beta_k\) measure the effect of each predictor after taking account of the effect of all other predictors in the model.

  • That is, the coefficients measure the marginal effects.

Example

  • Let’s see if additional predictors are helpful to forecast % change in consumption.

Example:

Least Squares Estimation

  • How do we estimate a linear regression? We will use Ordinary Least Squares (OLS) estimation.

\[y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_kx_{k,t} + \varepsilon_t\]

  • As we did earlier, we are trying to explain the variation in the response variable via predictors by fitting a linear model.

  • We do not know the true values of \(\beta_0, \beta_1, ..., \beta_k\), so OLS estimation chooses these values to minimize the sum of squared errors.

\[ \sum_{t=1}^{T}\varepsilon_t^{2} = \sum_{t=1}^{T}(y_t - \beta_0 - \beta_1 x_{1,t} - \beta_2 x_{2,t} - \cdots - \beta_kx_{k,t})^{2} \]

OLS

  • This is called least squares estimation because it gives the least value for the sum of squared errors.

  • Finding the best estimates of the coefficients is often called “fitting” the model to the data.

  • When we refer to the estimated coefficients, we will use the notation \(\hat{\beta}_k\).

  • The TSLM() function fits a linear regression model to time series data (similar to lm(), but with some shortcuts for time series).

  • The math (for both univariate and multiple OLS) is available at the end of the slides; a small illustration is sketched below.
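As a small illustration of what TSLM() computes, the OLS estimates can be obtained directly from the matrix formula \(\hat{\beta} = (X'X)^{-1}X'y\). This is only a sketch, assuming the fpp3 package and the us_change data are loaded; in practice we let TSLM() do the work.

# Sketch: OLS for Consumption ~ Income computed by hand
y <- us_change$Consumption
X <- cbind(Intercept = 1, Income = us_change$Income)  # design matrix
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y          # (X'X)^{-1} X'y
beta_hat  # should match the TSLM() estimates reported earlier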

Assumptions

  1. We assume that the model is a reasonable approximation to reality (linear).

  2. No multicollinearity - no highly (or perfectly) correlated predictors.

  • We make the following assumptions about the errors \(\varepsilon_1, ..., \varepsilon_T\)
  1. Errors have mean zero, \(E(\varepsilon_t) = 0\); otherwise the forecasts will be systematically biased.

  2. Errors have a constant variance \(\sigma^2\) (homoscedasticity); otherwise the standard errors of the coefficients are biased.

  3. They are not autocorrelated; otherwise the forecasts will be inefficient, as there is more information in the data that can be exploited.

  4. Errors are unrelated to the predictor variables; otherwise there would be more information that should be included in the systematic part of the model.

  5. Errors follow a normal distribution - \(\varepsilon_t \sim N(0,\sigma^2)\)

Assumptions

  • Given the assumptions, the OLS estimators \(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k\) are the Best Linear Unbiased Estimators (BLUE): they have the lowest variance among all linear unbiased estimators.

  • Unbiased estimator - \(E(\hat{\beta}) = \beta\)

  • Consistency - \(\lim_{n \rightarrow \infty}\hat{\beta} = \beta\)

  • Efficiency - \(\hat{\beta}\) has the lowest variance among all other linear unbiased estimators

Example

fit_consMR <- us_change %>%
  model(lm = TSLM(Consumption ~ Income + Production + Unemployment + Savings))
report(fit_consMR)
## Series: Consumption 
## Model: TSLM 
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.906 -0.158 -0.036  0.136  1.155 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.25311    0.03447    7.34  5.7e-12 ***
## Income        0.74058    0.04012   18.46  < 2e-16 ***
## Production    0.04717    0.02314    2.04    0.043 *  
## Unemployment -0.17469    0.09551   -1.83    0.069 .  
## Savings      -0.05289    0.00292  -18.09  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.31 on 193 degrees of freedom
## Multiple R-squared: 0.768,   Adjusted R-squared: 0.763
## F-statistic:  160 on 4 and 193 DF, p-value: <2e-16

TSLM() output

  • Coefficients - Gives an estimation of each \(\beta\)

  • Standard Error (the standard deviation of an estimate) - measures the precision of the estimated coefficient.

  • t-value - the ratio of the estimate to its standard error.

  • p-value - the probability of obtaining a t-value at least as extreme as the one observed, assuming the true coefficient is zero (the null hypothesis). A small p-value is evidence against the null hypothesis that \(\beta = 0\).

Fitted values

\[\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 x_{1,t} + \hat{\beta_2} x_{2,t} + \cdots + \hat{\beta}_kx_{k,t}\]

  • After estimating the values of the \(\hat{\beta}\)s, we can plug in the values of each \(x\) and calculate the fitted value of \(y\).

  • These are predictions of the data used to estimate the model, not genuine forecasts of future values of \(y\)

  • Let’s compare the true values of \(y\) with the fitted values.

Example
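A sketch of how the fitted and actual values could be compared over time, assuming the fit_consMR model from the earlier slide:

augment(fit_consMR) %>%
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Consumption, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  labs(y = "% change", title = "Percent change in US consumption expenditure") +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00"))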

Example

Goodness-of-fit

  • A common way to summarize how well a linear regression model fits the data is via the coefficient of determination (\(R^2\)).

  • It is the squared correlation between the observed \(y\) and the fitted \(\hat{y}\) values, or equivalently:

\[ R^2 = \frac{\sum(\hat{y}_t-\bar{y})^2}{\sum(y_t-\bar{y})^2} \]

  • Thus, it reflects the proportion of variation in the forecast variable that is explained by the regression model.

  • \(R^2\) lies between 0 and 1. A larger \(R^2\) “equals” a better model (but see the caveats below).

  • \(R^2\) never decreases when an extra predictor is added to the model, which can encourage over-fitting.

  • There are no set rules for what is a good \(R^2\) (depends on the type of data)

  • Validating a model’s forecasting performance on test data is much better than measuring the \(R^2\) value on the training data.
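As a side note, \(R^2\) and adjusted \(R^2\) can be pulled from a fitted model with glance(); a quick sketch using the fit_consMR model from earlier:

glance(fit_consMR) %>% select(r_squared, adj_r_squared)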

Multiple regression and forecasting

Again, for forecasting purposes, we require the following assumptions for the Error:

  • \(\varepsilon_t\) are uncorrelated and have zero mean

  • \(\varepsilon_t\) are uncorrelated with each \(x_{j,t}\).

It is useful to also have \(\varepsilon_t \sim \text{N}(0,\sigma^2)\) when producing prediction intervals or doing statistical tests.

Residuals

  • Residuals are the leftover variation in the data after accounting for the model fit:

\[e_t = y_t - \hat{y_t}\]

  • Errors are related to the true data generating process (DGP).

\[y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \cdots + \beta_kx_{k,t} + \varepsilon_t\]

  • Residuals are what is left over after the model has been estimated.

\[\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 x_{1,t} + \hat{\beta}_2 x_{2,t} + \cdots + \hat{\beta}_kx_{k,t}\]

  • The differences between the observed \(y\) values and the corresponding fitted \(\hat{y}\) values are the training-set errors, or residuals:

\[ e_t = y_t - \hat{\beta}_0 - \hat{\beta}_1 x_{1,t} - \hat{\beta}_2 x_{2,t} - \cdots - \hat{\beta}_kx_{k,t}\]

  • The assumptions made earlier are related to the error term (like normality, homoscedasticity, and independence).

  • After estimating the model, we can test if the residuals have these properties.

Residuals

The residuals have some useful properties including the following two:

\[\sum_{t=1}^{T}e_t = 0\]

\[\sum_{t=1}^{T}x_{k,t}e_t = 0\] for all k

  • From the residuals, we can check whether the assumptions are satisfied (homoscedasticity, autocorrelation, normality, etc)

  • The math is available at the end of the slides; a quick numerical check is sketched below.
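A quick numerical check of these two properties, as a sketch using the fit_consMR residuals:

e <- augment(fit_consMR) %>% pull(.resid)
sum(e)                     # property 1: approximately zero
sum(us_change$Income * e)  # property 2 (for the Income predictor): approximately zero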

Standard error of the regression

  • Another measure of how well the model has fitted the data is the standard deviation of the residuals - residual standard error.

\[ \hat{\sigma}_e = \sqrt{\frac{1}{T-K-1}\sum_{t=1}^{T}e_{t}^{2}} \]

  • \(K\) is the number of predictors in the model (we subtract one more because of the intercept).

  • The standard error is related to the size of the average error that the model produces.

  • By comparing it to the standard deviation of \(y\), we can assess the accuracy of the model; a sketch of the calculation follows.
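A sketch of this calculation for the multiple regression fitted earlier (here \(T = 198\) observations and \(K = 4\) predictors), which should reproduce the 0.31 shown by report():

e <- augment(fit_consMR) %>% pull(.resid)
sqrt(sum(e^2) / (198 - 4 - 1))  # residual standard error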

ACF - Autocorrelation

  • It is common to find autocorrelation in the residuals of a time series regression.

  • Our forecasts may be inefficient — there is some information left over which should be accounted for in the model in order to obtain better forecasts.

  • The forecasts from a model with autocorrelated errors are still unbiased, and so are not “wrong”.

  • However, the forecasts will usually have larger prediction intervals than necessary. Therefore, we should always look at an ACF plot of the residuals.

  • Additionally, we can apply the Ljung-Box test.

Example

fit_consMR %>% gg_tsresiduals()

Example

 augment(fit_consMR) %>%
  features(.innov, ljung_box, lag = 10, dof = 5)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 lm        18.9   0.00204

Example

  • The time plot of the residuals might indicate some heteroscedasticity since the variation appears to change over time - It will potentially make the prediction interval inaccurate.

  • The histogram shows that the residuals seem to be slightly skewed, which may also affect the coverage probability of the prediction intervals.

  • The autocorrelation plot shows a significant spike at lag 7, and a significant Ljung-Box test at the 5% level.

  • However, the autocorrelation is not particularly large, and at lag 7 it is unlikely to have any noticeable impact on the forecasts or the prediction intervals.

Residual plots

Useful for spotting outliers and whether the linear model was appropriate.

  • Scatterplot of residuals \(e_t\) against each predictor \(x_{j,t}\).

  • Scatterplot residuals against the fitted values \(\hat y_t\)

  • Expect to see scatterplots resembling a horizontal band with no values too far from the band and no patterns such as curvature or increasing spread.

Residual patterns

  • If a plot of the residuals vs any predictor in the model shows a pattern, then the relationship is nonlinear.

  • If a plot of the residuals vs any predictor not in the model shows a pattern, then the predictor should be added to the model.

  • If a plot of the residuals vs fitted values shows a pattern, then there is heteroscedasticity in the errors. (Could try a transformation.)
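A minimal sketch of the residuals-versus-fitted plot for fit_consMR (the residuals-versus-predictors version is shown on the next slide):

augment(fit_consMR) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(x = "Fitted values", y = "Residuals")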

Residual patterns

us_change %>%
  left_join(residuals(fit_consMR), by = "Quarter") %>%
  pivot_longer(Income:Unemployment,
               names_to = "regressor", values_to = "x") %>%
  ggplot(aes(x = x, y = .resid)) +
  geom_point() +
  facet_wrap(. ~ regressor, scales = "free_x") +
  labs(y = "Residuals", x = "")

Outliers

  • Observations that take extreme values compared to the majority of the data.

  • If an observation has a large influence on the estimated coefficients, it is called an influential observation.

  • One source of outliers is incorrect data entry (okay to be removed or fixed)

  • Outliers also occur when some observations are simply different (not okay to be removed)

  • It is hard to know which is the case!

  • The decision to remove or retain an observation can be a challenging one (especially when outliers are influential observations).

  • It is wise to report results both with and without the removal of such observations.

Some useful predictors for linear models

Linear Trend

\[x_t = t\]

\[ y_t = \beta_0 + \beta_1t + \varepsilon_t. \]

  • \(t=1,2,\dots,T\)

  • Strong assumption that trend will continue.

Non-linear Trend

  • There are cases where we have to assume a non-linear functional form - as long as the model is still linear in the parameters, we are good to go.

  • We can transform the data by using log (log-log, lin-log, or log-lin)

\[ log({y_t}) = \beta_0 + \beta_1\log x_{t} +\varepsilon_t \]

We can also assume a Piecewise linear trend with bend at \(\tau\)

\[ y_t = \beta_0 + \beta_1x_{1,t} + \beta_2x_{2,t} + \varepsilon_t. \] Where:

\(x_{1,t} = t\)

\(x_{2,t} = 0\) if \(t < \tau\), and

\(x_{2,t} = (t-\tau)\) if \(t \geq \tau\)

Non-linear regression

marathon <- boston_marathon %>%
  filter(Event == "Men's open division") %>%
  select(-Event) %>%
  mutate(Minutes = as.numeric(Time)/60)
marathon %>% autoplot(Minutes) + labs(y="Winning times in minutes")

Piecewise linear trend

\[ y_t = \beta_0 + \beta_1x_{1,t} + \beta_2x_{2,t} + \beta_3x_{3,t} + \varepsilon_t. \]

\(x_{1,t} = t\)

\(x_{2,t} = 0\) if \(t < 1940\), and

\(x_{2,t} = (t - 1940)\) if \(t \geq 1940\)

\(x_{3,t} = 0\) if \(t < 1980\), and

\(x_{3,t} = (t - 1980)\) if \(t \geq 1980\)

Piecewise linear trend

fit_trends <- marathon %>% model(piecewise = TSLM(Minutes ~ trend(knots = c(1940, 1980))))

marathon %>% autoplot(Minutes) +
  geom_line(data = fitted(fit_trends),
            aes(y = .fitted, colour = .model)) +
  labs(y = "Minutes",
       title = "Boston marathon winning times")

Example: Other models

fit_trends <- marathon %>%
  model(
    # Linear trend
    linear = TSLM(Minutes ~ trend()),
    # Exponential trend
    exponential = TSLM(log(Minutes) ~ trend()),
    # Piecewise linear trend
    piecewise = TSLM(Minutes ~ trend(knots = c(1940, 1980)))
  )
fit_trends
## # A mable: 1 x 3
##    linear exponential piecewise
##   <model>     <model>   <model>
## 1  <TSLM>      <TSLM>    <TSLM>

Example: Boston marathon winning times

fit_trends %>% forecast(h=10) %>% autoplot(marathon)

Example

fit_trends %>%
  select(linear) %>%
  gg_tsresiduals()

Example

fit_trends %>%
  select(piecewise) %>%
  gg_tsresiduals()

Example

fit_trends %>%
  select(exponential) %>%
  gg_tsresiduals()

Dummy variables

  • If a categorical variable takes only two values (e.g., Yes or No), then an equivalent numerical variable can be constructed taking value 1 if yes and 0 if no. This is called a dummy variable

Outliers

  • If there is an outlier, you can use a dummy variable to remove its effect.

Public holidays

  • For daily data: if it is a public holiday, dummy=1, otherwise dummy=0.

Dummy variables

If there are more than two categories, then the variable can be coded using several dummy variables - one fewer than the total number of categories.

Beware of the dummy variable trap!

  • Using one dummy for each category gives too many dummy variables!

  • The regression will then be singular and inestimable

  • Either omit the constant, or omit the dummy for one category.

  • The coefficients of the dummies are relative to the omitted category

Uses of dummy variables

Seasonal dummies

  • For quarterly data: use 3 dummies
  • For monthly data: use 11 dummies
  • For daily data: use 6 dummies
  • What to do with weekly data? (See Fourier terms later.)

Beer production

Beer Production

  • We want to forecast the value of future beer production

\[y_t = \beta_0 + \beta_1 t + \beta_2d_{2,t} + \beta_3 d_{3,t} + \beta_4 d_{4,t} + \varepsilon_t\]

  • \(d_{i,t} = 1\) if \(t\) is quarter \(i\) and 0 otherwise.

  • t is a linear trend

The first quarter dummy variable is omitted, so all the coefficients are relative to the first quarter

Beer production

fit_beer <- recent_production %>% model(TSLM(Beer ~ trend() + season()))
report(fit_beer)
## Series: Beer 
## Model: TSLM 
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -42.9   -7.6   -0.5    8.0   21.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   441.8004     3.7335  118.33  < 2e-16 ***
## trend()        -0.3403     0.0666   -5.11  2.7e-06 ***
## season()year2 -34.6597     3.9683   -8.73  9.1e-13 ***
## season()year3 -17.8216     4.0225   -4.43  3.4e-05 ***
## season()year4  72.7964     4.0230   18.09  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.2 on 69 degrees of freedom
## Multiple R-squared: 0.924,   Adjusted R-squared: 0.92
## F-statistic:  211 on 4 and 69 DF, p-value: <2e-16

Beer production

  • Note that trend() and season() are not standard functions; they are “special” functions that work within TSLM().

  • We could easily create the trend and seasonal dummy variables ourselves:

t <- nrow(recent_production)
trend <- seq_len(t)  # 1, 2, ..., T

recent_production2<-recent_production%>%
  select(Quarter, Beer)%>%
  mutate(q2 = ifelse(quarter(Quarter) == 2,1,0))%>%
  mutate(q3 = ifelse(quarter(Quarter) == 3,1,0))%>%
  mutate(q4 = ifelse(quarter(Quarter) == 4,1,0))%>%
  mutate(td = trend)
recent_production2
## # A tsibble: 74 x 6 [1Q]
##    Quarter  Beer    q2    q3    q4    td
##      <qtr> <dbl> <dbl> <dbl> <dbl> <int>
##  1 1992 Q1   443     0     0     0     1
##  2 1992 Q2   410     1     0     0     2
##  3 1992 Q3   420     0     1     0     3
##  4 1992 Q4   532     0     0     1     4
##  5 1993 Q1   433     0     0     0     5
##  6 1993 Q2   421     1     0     0     6
##  7 1993 Q3   410     0     1     0     7
##  8 1993 Q4   512     0     0     1     8
##  9 1994 Q1   449     0     0     0     9
## 10 1994 Q2   381     1     0     0    10
## # … with 64 more rows

Beer production

fit_beer2 <- recent_production2 %>% model(TSLM(Beer ~ q2+q3+q4+td))
report(fit_beer2)
## Series: Beer 
## Model: TSLM 
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -42.9   -7.6   -0.5    8.0   21.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 441.8004     3.7335  118.33  < 2e-16 ***
## q2          -34.6597     3.9683   -8.73  9.1e-13 ***
## q3          -17.8216     4.0225   -4.43  3.4e-05 ***
## q4           72.7964     4.0230   18.09  < 2e-16 ***
## td           -0.3403     0.0666   -5.11  2.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.2 on 69 degrees of freedom
## Multiple R-squared: 0.924,   Adjusted R-squared: 0.92
## F-statistic:  211 on 4 and 69 DF, p-value: <2e-16

Beer production

Trend = -0.34: there is an average downward trend of 0.34 megalitres per quarter.

q2 = -34.7: on average, second-quarter production is 34.7 megalitres lower than in the first quarter.

q4 = 72.8: on average, fourth-quarter production is 72.8 megalitres higher than in the first quarter.

Beer production - Fitted

augment(fit_beer) %>%
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Beer, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  labs(y="Megalitres",title ="Australian quarterly beer production") +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00"))

Beer production - fitted vs actual values

augment(fit_beer) %>%
  ggplot(aes(x=Beer, y=.fitted, colour=factor(quarter(Quarter)))) +
    geom_point() +
    labs(y="Fitted", x="Actual values", title = "Quarterly beer production") +
    scale_colour_brewer(palette="Dark2", name="Quarter") +
    geom_abline(intercept=0, slope=1)

Beer production

fit_beer %>% gg_tsresiduals()

Beer production - forecast

Ex-ante forecast!

\[\hat{y}_{t+h} = \hat{\beta}_0 + \hat{\beta}_1(t+h) + \hat{\beta}_2 d_{2,t+h} + \hat{\beta}_3 d_{3,t+h} + \hat{\beta}_4 d_{4,t+h}\]

fit_beer %>% forecast(h=10) %>% autoplot(recent_production)

Fourier series

  • Periodic seasonality (long seasonal periods) can be handled using pairs of Fourier terms.

  • Fourier terms approximate periodic functions using a series of sine and cosine terms of the right frequencies.

\[ s_{k}(t) = \sin\left(\frac{2\pi k t}{m}\right)\qquad c_{k}(t) = \cos\left(\frac{2\pi k t}{m}\right) \]

If \(k = 1\):

\[ s_{1}(t) = \sin\left(\frac{2\pi t}{m}\right)\qquad c_{1}(t) = \cos\left(\frac{2\pi t}{m}\right) \]

If \(k = 2\):

\[ s_{2}(t) = \sin\left(\frac{4\pi t}{m}\right)\qquad c_{2}(t) = \cos\left(\frac{4\pi t}{m}\right) \]

Fourier series

  • With Fourier terms, we often need fewer predictors than with dummy variables (when \(m\) is large).

  • Example: weekly data, where \(m \approx 52\).

  • For short seasonal periods (e.g., quarterly data), there is little advantage in using Fourier terms over seasonal dummy variables.

\[ y_t = a + bt + \sum_{k=1}^K \left[\alpha_k s_k(t) + \beta_k c_k(t)\right] + \varepsilon_t \]

Fourier series

fourier_beer <- recent_production %>%
  model(TSLM(Beer ~ trend() + fourier(K = 2)))
report(fourier_beer)
## Series: Beer 
## Model: TSLM 
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -42.9   -7.6   -0.5    8.0   21.8 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        446.8792     2.8732  155.53  < 2e-16 ***
## trend()             -0.3403     0.0666   -5.11  2.7e-06 ***
## fourier(K = 2)C1_4   8.9108     2.0112    4.43  3.4e-05 ***
## fourier(K = 2)S1_4 -53.7281     2.0112  -26.71  < 2e-16 ***
## fourier(K = 2)C2_4 -13.9896     1.4226   -9.83  9.3e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.2 on 69 degrees of freedom
## Multiple R-squared: 0.924,   Adjusted R-squared: 0.92
## F-statistic:  211 on 4 and 69 DF, p-value: <2e-16

Fourier Terms

augment(fourier_beer) %>%
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Beer, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  labs(y="Megalitres",title ="Australian quarterly beer production") +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00"))

Fourier Terms

augment(fourier_beer) %>%
  ggplot(aes(x=Beer, y=.fitted, colour=factor(quarter(Quarter)))) +
    geom_point() +
    labs(y="Fitted", x="Actual values", title = "Quarterly beer production") +
    scale_colour_brewer(palette="Dark2", name="Quarter") +
    geom_abline(intercept=0, slope=1)

Fourier Terms

fourier_beer %>% gg_tsresiduals()

Fourier Terms

fourier_beer %>% forecast(h=10) %>% autoplot(recent_production)

Fourier terms

  • The K argument specifies how many pairs of sine and cosine terms to include (choose \(K\) by minimizing the AICc).

  • The maximum allowed is \(K = m/2\).

  • Because we have used the maximum here, the results are identical to those obtained when using seasonal dummy variables.

  • Called “harmonic regression”

Harmonic regression: eating-out expenditure

aus_cafe <- aus_retail %>% filter(
    Industry == "Cafes, restaurants and takeaway food services",
    year(Month) %in% 2004:2018
  ) %>% summarise(Turnover = sum(Turnover))
aus_cafe %>% autoplot(Turnover)

Harmonic regression: eating-out expenditure

fit <- aus_cafe %>%
  model(K1 = TSLM(log(Turnover) ~ trend() + fourier(K = 1)),
        K2 = TSLM(log(Turnover) ~ trend() + fourier(K = 2)),
        K3 = TSLM(log(Turnover) ~ trend() + fourier(K = 3)),
        K4 = TSLM(log(Turnover) ~ trend() + fourier(K = 4)),
        K5 = TSLM(log(Turnover) ~ trend() + fourier(K = 5)),
        K6 = TSLM(log(Turnover) ~ trend() + fourier(K = 6)))
glance(fit) %>% select(.model, r_squared, adj_r_squared, AICc)
## # A tibble: 6 × 4
##   .model r_squared adj_r_squared   AICc
##   <chr>      <dbl>         <dbl>  <dbl>
## 1 K1         0.962         0.962 -1085.
## 2 K2         0.966         0.965 -1099.
## 3 K3         0.976         0.975 -1160.
## 4 K4         0.980         0.979 -1183.
## 5 K5         0.985         0.984 -1234.
## 6 K6         0.985         0.984 -1232.

Harmonic regression: eating-out expenditure

Intervention variables

Spikes

  • Equivalent to a dummy variable for handling an outlier.

Steps

  • Variable takes value 0 before the intervention and 1 afterwards - shifts the level.

Change of slope

  • Variable takes value 0 before the intervention and values \(\{1,2,3,\dots\}\) afterwards - shifts the slope (see the construction sketch below).
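A hedged sketch of how the three types of intervention variable could be constructed by hand; df, its monthly index Month, and the January 2010 intervention date are all hypothetical:

# Hypothetical monthly tsibble df with index Month and a single series
df <- df %>%
  mutate(
    spike = as.integer(Month == yearmonth("2010 Jan")),  # handles one outlier
    step  = as.integer(Month >= yearmonth("2010 Jan")),  # level shift: 0 before, 1 after
    slope = cumsum(step)                                 # slope change: 0, ..., 0, 1, 2, 3, ...
  )
# These columns can then be used as ordinary predictors in TSLM()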

Holidays

For monthly data

  • Christmas: always in December so part of monthly seasonal effect
  • Easter: use a dummy variable \(v_t=1\) if any part of Easter is in that month, \(v_t=0\) otherwise.
  • Ramadan and Chinese New Year can be treated similarly.

Distributed lags

Lagged values of a predictor.

Example: \(x\) is advertising which has a delayed effect

\[ x_{1} = \text{advertising for previous month}\\ x_{2} = \text{advertising for two months previously}\\ \vdots\\ x_{m} = \text{advertising for m months previously} \]
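A sketch of how such lagged terms might enter a TSLM model; the tsibble ad_data and its columns sales and adverts are hypothetical names used for illustration:

ad_data %>%
  mutate(adverts_lag1 = lag(adverts, 1),     # advertising one month earlier
         adverts_lag2 = lag(adverts, 2)) %>% # advertising two months earlier
  filter(!is.na(adverts_lag2)) %>%           # drop rows with no lagged value
  model(TSLM(sales ~ adverts + adverts_lag1 + adverts_lag2))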

Selecting predictors

  • How to choose predictors for our forecasting model?

  • Scatterplots do not always tell us whether there is a useful relationship between the forecast variable and a potential predictor.

  • A common (but not recommended) strategy is to include all predictors and drop those that are not statistically significant (p-value > 5%).

  • This does not work well: statistical significance does not always indicate predictive value, and p-values can be misleading when predictors are correlated with each other.

  • Instead, we use measures of predictive accuracy: adjusted \(R^2\), CV, AIC, AICc, and BIC.

Adjusted \(R^2\)

  • Looking only at \(R^2\) is not a good measure of predictive ability because it measures how well the model fits the historical data, not how well it will forecast future data.

Additionally \(\dots\)

  • \(R^2\) does not allow for “degrees of freedom”.

  • Adding any variable tends to increase the value of \(R^2\), even if that variable is irrelevant.

To overcome this problem, we can use adjusted \(R^2\):

\[ \bar{R}^2 = 1-(1-R^2)\frac{T-1}{T-k-1} \]

where \(k=\) no. predictors and \(T=\) no. observations.
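As a quick numerical check, using the multiple regression reported earlier (\(R^2 = 0.768\), \(T = 198\), \(k = 4\)):

\[ \bar{R}^2 = 1 - (1 - 0.768)\frac{198 - 1}{198 - 4 - 1} \approx 0.763, \]

which matches the adjusted \(R^2\) in the report() output.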

Adjusted \(R^2\)

Maximizing \(\bar{R}^2\) is equivalent to minimizing \(\hat{\sigma}^2\):

\[ \hat{\sigma}^2 = \frac{1}{T-k-1}\sum_{t=1}^T e_t^2 \]

  • Maximizing \(\bar{R}^2\) works quite well as a method of selecting predictors.

Leave-one-out cross-validation

For regression, leave-one-out cross-validation is faster and more efficient than time-series cross-validation.

  • Select one observation for test set, and use remaining observations in training set.
  • Fit the model on the training set.
  • Use the model to predict the observation left out.
  • Compute the “error” on the test observation - \(e_{t}^*\)
  • Repeat using each possible observation as the test set.
  • Compute the mean of the squared errors: \(\text{CV} = \frac{1}{T}\sum_{t=1}^{T}(e_{t}^*)^2\)
  • The best model is the one with the smallest value of CV.

Cross-validation

Traditional evaluation

Time series cross-validation

Cross-validation

Traditional evaluation

Leave-one-out cross-validation

Akaike’s Information Criterion

  • AIC is an estimator of out-of-sample prediction error.
  • AIC scores are only useful in comparison with other AIC scores for the same dataset.
  • A lower AIC score is better.

\[\text{AIC} = T\log\left(\frac{\text{SSE}}{T}\right) + 2(k+2)\]

where \(k\) is the number of predictors in the model.

  • The idea here is to penalise the fit of the model (SSE) with the number of parameters that need to be estimated.

  • Minimizing the AIC is asymptotically equivalent to minimizing MSE via leave-one-out cross-validation (for any linear regression).

Corrected AIC

For small values of \(T\), the AIC tends to select too many predictors, and so a bias-corrected version of the AIC has been developed.

\[ \text{AIC}_{\text{C}} = \text{AIC} + \frac{2(k+2)(k+3)}{T-k-3} \]

  • As with the AIC, the AIC\(_{\text{C}}\) should be minimized.

Bayesian Information Criterion

\[\text{BIC} = T\log\left(\frac{\text{SSE}}{T}\right) + (k+2)\log(T) \]

where \(k\) is the number of predictors in the model.

  • BIC penalizes terms more heavily than AIC

  • Also called SBIC and SC.

Which measure should we use??

\(\bar{R}^2\) - tends to select too many predictor variables, which makes it less suitable for forecasting.

BIC - has the feature that, if there is a true underlying model, the BIC will select that model given enough data. However:

There is rarely, if ever, a true underlying model; and even if there were one, selecting that model will not necessarily give the best forecasts (because the parameter estimates may not be accurate).

AICc, AIC, and CV provide better accuracy measures for forecasting. If the value of \(T\) is large enough, they will all lead to the same model.
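All of these measures can be obtained for a fitted TSLM model with glance(); a sketch for the consumption regression fitted earlier:

glance(fit_consMR) %>%
  select(adj_r_squared, CV, AIC, AICc, BIC)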

Example

  • The best model contains all four predictors.

  • Income and Savings are both more important variables than Production and Unemployment.
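A sketch of how a few candidate predictor subsets could be compared on these measures (the subsets shown are just examples, not an exhaustive search):

us_change %>%
  model(
    all_four    = TSLM(Consumption ~ Income + Production + Unemployment + Savings),
    drop_prod   = TSLM(Consumption ~ Income + Unemployment + Savings),
    income_only = TSLM(Consumption ~ Income)
  ) %>%
  glance() %>%
  select(.model, adj_r_squared, CV, AIC, AICc, BIC)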

Choosing regression variables

Best subsets regression

  • Fit all possible regression models using one or more of the predictors.

  • Choose the best model based on one of the measures of predictive ability (CV, AIC, AICc).

Warning!

  • If there are a large number of predictors, this is not possible.

  • For example, 44 predictors lead to about 18 trillion possible models!

Choosing regression variables

Backwards stepwise regression

  • Start with a model containing all variables.

  • Try subtracting one variable at a time. Keep the model if it has lower CV or AICc.

  • Iterate until no further improvement.

Notes

  • Stepwise regression is not guaranteed to lead to the best possible model.

  • Inference on the coefficients of the final model will be invalid - the selection process makes the usual p-values unreliable.

Beware !!!

The procedures we recommend for selecting predictors are helpful when the model is used for forecasting; they are not helpful if you wish to study the effect of any predictor on the forecast variable (causal effect)

Forecasting with regression

Ex-ante versus ex-post forecasts

  • Ex ante forecasts are made using only information available in advance (true forecast)
    • require forecasts of predictors
  • Ex post forecasts are made using later information on the predictors.
    • useful for studying behavior of forecasting models.
  • Trend, seasonal, and calendar variables are all known in advance, so these do not need to be forecast.

Scenario based forecasting

  • Assumes possible scenarios for the predictor variables

  • Prediction intervals for scenario based forecasts do not include the uncertainty associated with the future values of the predictor variables - only the variable of interest.

  • Example: US policy maker may be interested in comparing the predicted change in consumption in two scenarios:

  1. there is constant growth of 1% and 0.5%, respectively, for income and savings, with no change in the unemployment rate;
  2. versus a respective decline of 1% and 0.5%.

Scenario based forecasting

fit_consBest <- us_change %>%
  model(
    lm = TSLM(Consumption ~ Income + Savings + Unemployment))


future_scenarios <- scenarios(
  Increase = new_data(us_change, 4) %>%
    mutate(Income=1, Savings=0.5, Unemployment=0),
  Decrease = new_data(us_change, 4) %>%
    mutate(Income= -1, Savings=-0.5, Unemployment=0),
  names_to = "Scenario")

Scenario based forecasting

fc <- forecast(fit_consBest, new_data = future_scenarios)

us_change %>%
  autoplot(Consumption) +
  autolayer(fc) +
  labs(title = "US consumption", y = "% change")

Building a predictive regression model

  • If getting forecasts of predictors is difficult, you can use lagged predictors instead.

\[y_{t}=\beta_0+\beta_1x_{1,t-h}+\dots+\beta_kx_{k,t-h}+\varepsilon_{t}\]

  • A different model for each forecast horizon \(h\).
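For example, a sketch of a model to forecast consumption four quarters ahead using only currently available predictor values (the choice of \(h = 4\) and of the two predictors is purely illustrative):

us_change %>%
  mutate(Income_lag4  = lag(Income, 4),
         Savings_lag4 = lag(Savings, 4)) %>%
  filter(!is.na(Income_lag4)) %>%  # drop the first four quarters with no lag available
  model(TSLM(Consumption ~ Income_lag4 + Savings_lag4))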

Correlation, causation and forecasting

Correlation is not causation

  • When \(x\) is useful for predicting \(y\), it is not necessarily causing \(y\).

  • e.g., predict number of drownings \(y\) using number of ice-creams sold \(x\).

  • Correlations are useful for forecasting, even when there is no causality.

  • Better models usually involve causal relationships (e.g., temperature \(x\) and people \(z\) to predict drownings \(y\)).

Multicollinearity

In regression analysis, multicollinearity occurs when:

  • Two predictors are highly correlated (i.e., the correlation between them is close to \(\pm1\)).

  • A linear combination of some of the predictors is highly correlated with another predictor.

  • For forecasting, as long as the future values of your predictor variables are within their historical ranges, there is nothing to worry about — multicollinearity is not a problem except when there is perfect correlation.
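A simple screen for the first situation is the correlation matrix of the predictors; a sketch for us_change:

us_change %>%
  as_tibble() %>%
  select(Income, Production, Savings, Unemployment) %>%
  cor()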

Math

Matrix Formulation

Residual properties proofs (1)

Residual properties proofs (2)