load("workspace.RData")
library(fpp2)

eg. monthly sales y depening upon ad spend x as predictor eg. quarterly percentage change of real personal consumption expenditure y, and real personal disposable income x, for the US from 1970 Q1 to 2016 Q3

autoplot(uschange[,c("Consumption","Income")]) +
ylab("% change") + xlab("Year")

Lets see the scatter plot and the regression line:

uschange %>%
as.data.frame() %>%
ggplot(aes(x=Income, y=Consumption)) +
ylab("Consumption (quarterly % change)") +
xlab("Income (quarterly % change)") +
geom_point() +
geom_smooth(method="lm", se=FALSE)

We can estimate this equation using tslm() function

tslm(Consumption ~ Income, data=uschange)
## 
## Call:
## tslm(formula = Consumption ~ Income, data = uschange)
## 
## Coefficients:
## (Intercept)       Income  
##      0.5451       0.2806

So every 1% increase in income consumption is increased by 0.54+0.28= 0.82%

Multiple Linear Regression

Let us say consumption also depends upon production, saving and unemployment. Lets create a scatterplot matrix of these five variables:

uschange %>%
as.data.frame() %>%
GGally::ggpairs()
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

tslm() function fits a linear regression model to time series data. Lets fit this model

fit.consMR <- tslm(
Consumption ~ Income + Production + Unemployment + Savings,
data=uschange)
summary(fit.consMR)
## 
## Call:
## tslm(formula = Consumption ~ Income + Production + Unemployment + 
##     Savings, data = uschange)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88296 -0.17638 -0.03679  0.15251  1.20553 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.26729    0.03721   7.184 1.68e-11 ***
## Income        0.71449    0.04219  16.934  < 2e-16 ***
## Production    0.04589    0.02588   1.773   0.0778 .  
## Unemployment -0.20477    0.10550  -1.941   0.0538 .  
## Savings      -0.04527    0.00278 -16.287  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3286 on 182 degrees of freedom
## Multiple R-squared:  0.754,  Adjusted R-squared:  0.7486 
## F-statistic: 139.5 on 4 and 182 DF,  p-value: < 2.2e-16

The value of 75.2% of R-sq suggests that the model can explain 75.4% of the variations. Lets overlap the fitted value over the actual change and see visually the similarity

autoplot(uschange[,'Consumption'], series="Data") +
autolayer(fitted(fit.consMR), series="Fitted") +
xlab("Year") + ylab("") +
ggtitle("Percent change in US consumption expenditure") +
guides(colour=guide_legend(title=" "))

lets see the scatterplot between between actual and fitted

cbind(Data = uschange[,"Consumption"],
Fitted = fitted(fit.consMR)) %>%
as.data.frame() %>%
ggplot(aes(x=Data, y=Fitted)) +
geom_point() +
xlab("Fitted (predicted values)") +
ylab("Data (actual values)") +
ggtitle("Percent change in US consumption expenditure") +
geom_abline(intercept=0, slope=1)

Evaluating the Regression Model

We can check the residuals

checkresiduals(fit.consMR)

## 
##  Breusch-Godfrey test for serial correlation of order up to 8
## 
## data:  Residuals from Linear regression model
## LM test = 14.874, df = 8, p-value = 0.06163

Timeplot: shows some changing variation over time, otherwise it is quite unremarkable Autocorrelation plot along with LM test: No autocorrelation Histogram: Slightly skewed- affect the prediction interval probability Residual Plots against predictors: Examine the scatter plot of the residuals against each of the predictor variables, if any pattern then the relationship is non linear.Modify the model Also scatter plot of the residuals against predictors not in the model, if there is a pattern, introduce them. Residual Plot against the fitted Value: if a pattern there may be a heteroscedasticity in the errors means the variance of the residuals may not be constant. Transformation needed

df <- as.data.frame(uschange)
df[,"Residuals"] <- as.numeric(residuals(fit.consMR))
p1 <- ggplot(df, aes(x=Income, y=Residuals)) +
geom_point()
p2 <- ggplot(df, aes(x=Production, y=Residuals)) +
geom_point()
p3 <- ggplot(df, aes(x=Savings, y=Residuals)) +
geom_point()
p4 <- ggplot(df, aes(x=Unemployment, y=Residuals)) +
geom_point()
gridExtra::grid.arrange(p1, p2, p3, p4, nrow=2)

Residual Plot against the fitted Value: if a pattern there may be a heteroscedasticity in the errors means the variance of the residuals may not be constant. Transformation needed

cbind(Fitted = fitted(fit.consMR),
Residuals=residuals(fit.consMR)) %>%
as.data.frame() %>%
ggplot(aes(x=Fitted, y=Residuals)) + geom_point()

There is no pattern.

Some Useful Variables 1. Trend: Can be specified in tslm() function as trend 2. Dummy Variables: tslm() autometically handles 3. Seasonal Dummy Variable: Can be specified in tslm() function as season

Australian Beer Data

beer2 <- window(ausbeer, start=1992)
autoplot(beer2) + xlab("Year") + ylab("Megalitres")

We want to forecast the value of beer using a regression model with a linear trend and quarterly dummy variables

fit.beer <- tslm(beer2 ~ trend + season)
summary(fit.beer)
## 
## Call:
## tslm(formula = beer2 ~ trend + season)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.903  -7.599  -0.459   7.991  21.789 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 441.80044    3.73353 118.333  < 2e-16 ***
## trend        -0.34027    0.06657  -5.111 2.73e-06 ***
## season2     -34.65973    3.96832  -8.734 9.10e-13 ***
## season3     -17.82164    4.02249  -4.430 3.45e-05 ***
## season4      72.79641    4.02305  18.095  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.23 on 69 degrees of freedom
## Multiple R-squared:  0.9243, Adjusted R-squared:  0.9199 
## F-statistic: 210.7 on 4 and 69 DF,  p-value: < 2.2e-16

Note that trend and season are created automatically by tslm. Let us visually see the fit

autoplot(beer2, series="Data") +
autolayer(fitted(fit.beer), series="Fitted") +
xlab("Year") + ylab("Megalitres") +
ggtitle("Quarterly Beer Production")

Let us see the scatter plot of actual and fitted values:

cbind(Data=beer2, Fitted=fitted(fit.beer)) %>%
as.data.frame() %>%
ggplot(aes(x = Data, y = Fitted,
colour = as.factor(cycle(beer2)))) +
geom_point() +
ylab("Fitted") + xlab("Actual values") +
ggtitle("Quarterly beer production") +
scale_colour_brewer(palette="Dark2", name="Quarter") +
geom_abline(intercept=0, slope=1)

###Intervention Variables

When the effect lasts only for one period, we use a “spike” variable. this is a dummy variable that takes value one in the period of the intervention and zero else where.

If an intervention causes a level shift , the value changes permanently after introduction, then we use a step variable. It takes a value 0 before intervention and 1 after.

if there is a change in slope we are talking about nonlinear trend, we discuss it later.

Trading Days

For monthly or quarterly data, the bizdays() will compute the number of trading days each period.

Distributed lags

Effect of advertising lasts beyond the current period. Out of Syllabus.

Selecting Predictors

Don’do

  1. plot forecast varible against predictor- if no relationship drop that
  2. Do an MLR, drop variables whose p values are >0.95

You should do

Use CV function

CV(fit.consMR)
##           CV          AIC         AICc          BIC        AdjR2 
##    0.1163477 -409.2980298 -408.8313631 -389.9113781    0.7485856

Compare this with other models, take the value with lowest of first four and highest of last. eg. if there are 4 predictors,2^4= 16 models are possible.

Use Stepwise Regression - Forward,backward or hybrid

Forecasting with Regression

eg beer forecast using trend & Season alone

beer2 <- window(ausbeer, start=1992)
fit.beer <- tslm(beer2 ~ trend + season)
fcast <- forecast(fit.beer)
autoplot(fcast) +
ggtitle("Forecasts of beer production using regression") +
xlab("Year") + ylab("megalitres")