Autocorrelation, or the correlation of a sequence with a lagged (offset) version of itself, is very common in time series datasets. What is also common is attempting to use linear regression techniques on time series data. This can be problematic because it can lead to a violation of one of the four basic assumptions of simple linear regression. The four assumptions of linear regression are:

  1. Linearity: The relationship between X and the mean of Y is linear.
  2. Homoscedasticity: The variance of residual is the same for any value of X.
  3. Independence: Observations are independent of each other.
  4. Normality: For any fixed value of X, Y is normally distributed.

http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R5_Correlation-Regression/R5_Correlation-Regression4.html

The reason that observations must be independent for ordinary least squares regression to work has to do with residuals. More precisely, it is the residuals that must not be autocorrelated in order for inference based on the regression model to be reliable. If the residuals are positively autocorrelated, the usual OLS formulas underestimate the true variance of the coefficient estimates. This causes the reported standard errors of the regression coefficients to be smaller than they should be. If the standard errors of the regression coefficients are too small, then the confidence intervals for the coefficients will be narrower than they should be and the t-statistics for the coefficients will be artificially inflated. In other words, our model will appear to be better and more reliable than it really is.
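
This effect can be illustrated with a small simulation that is not part of the original example. The sketch below generates many series whose errors follow an AR(1) process, fits an OLS regression to each, and compares the average standard error reported by OLS to the actual spread of the slope estimates; under positive autocorrelation the reported standard errors tend to be too small.

# Illustrative simulation (not part of the original example): fit OLS to many
# series whose errors follow an AR(1) process, then compare the average
# standard error reported by OLS to the actual spread of the slope estimates.
set.seed(42)
n <- 100
x <- 1:n
sims <- replicate(1000, {
   e   <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))  # autocorrelated errors
   y   <- 2 + 0.5 * x + e
   fit <- summary(lm(y ~ x))
   c(estimate = fit$coefficients["x", "Estimate"],
     std_err  = fit$coefficients["x", "Std. Error"])
})
mean(sims["std_err", ])   # average standard error reported by OLS
sd(sims["estimate", ])    # actual spread of the slope estimates (noticeably larger)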

One possible way to “fix” autocorrelated residuals in a time series regression model is to include additional predictor variables in the model. This document will explore a specific instance of this scenario in a business context. The example has been adapted from chapter 7 of “Understanding Econometrics” by Dennis Halcoussis, published by Thomson/South-Western, 2005 (ISBN 0030348064, 9780030348068).

The Data

The data for this example is based on Microsoft, Inc. marketing, revenue, and R&D expenditures from 1987-2000. The response variable in this example is REVENUES, and the predictors are MARKETING (money spent on marketing) and the season in which that money was spent. Two models will be developed: one with only MARKETING and SEASON, and a second that adds R&D expenditure (lagged by four quarters) to the model.

First we load the data…

rm(list=ls())
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.3     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
set.seed(10101)
df <- read_csv('MS_REVENUE_MARKETING_R&D.csv')
## Parsed with column specification:
## cols(
##   Year = col_double(),
##   Quarter = col_double(),
##   REVENUES = col_double(),
##   MARKETING = col_double(),
##   SEASON = col_character(),
##   SUMMER = col_double(),
##   FALL = col_double(),
##   WINTER = col_double(),
##   RD1YEAR = col_double()
## )
df <- df[c('REVENUES','MARKETING','SEASON','RD1YEAR')]
df$SEASON <- as.factor(df$SEASON)
summary(df)
##     REVENUES         MARKETING         SEASON      RD1YEAR      
##  Min.   :  60.02   Min.   : 13.44   FALL  :13   Min.   :  3.66  
##  1st Qu.: 234.94   1st Qu.: 57.48   SPRING:14   1st Qu.: 21.74  
##  Median : 678.94   Median :190.72   SUMMER:14   Median : 71.88  
##  Mean   :1088.59   Mean   :224.36   WINTER:14   Mean   :125.17  
##  3rd Qu.:1630.01   3rd Qu.:372.18               3rd Qu.:187.10  
##  Max.   :3559.01   Max.   :601.96               Max.   :450.70

The following shows the first few rows of the dataset. Values are in millions of US dollars.

df
## # A tibble: 55 x 4
##    REVENUES MARKETING SEASON RD1YEAR
##       <dbl>     <dbl> <fct>    <dbl>
##  1     60.0      13.4 WINTER    3.66
##  2     71.6      16.8 SPRING    4.59
##  3     85.7      18.4 SUMMER    4.55
##  4     86.7      22.5 FALL      6.34
##  5     88.7      23.3 WINTER    6.27
##  6    133.       31.5 SPRING    7.07
##  7    136.       32.8 SUMMER    9.62
##  8    141.       40.7 FALL     10.4 
##  9    145.       37.8 WINTER   12.9 
## 10    170.       46.9 SPRING   13.6 
## # … with 45 more rows

An Initial Model

The code below creates a simple linear model with REVENUES as the target variable and MARKETING and SEASON as predictor variables. A summary of the model is then displayed.

mdl1 <- lm(REVENUES~MARKETING+SEASON, data=df)
summary(mdl1)
## 
## Call:
## lm(formula = REVENUES ~ MARKETING + SEASON, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -391.90 -163.76   36.54  144.41  593.99 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -161.3982    74.7375  -2.160   0.0356 *  
## MARKETING       5.7713     0.1797  32.111   <2e-16 ***
## SEASONSPRING  -64.0364    89.3735  -0.717   0.4770    
## SEASONSUMMER  -75.9197    89.3889  -0.849   0.3998    
## SEASONWINTER  -36.2570    89.2578  -0.406   0.6863    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 231.7 on 50 degrees of freedom
## Multiple R-squared:  0.9539, Adjusted R-squared:  0.9503 
## F-statistic: 258.9 on 4 and 50 DF,  p-value: < 2.2e-16

The output above shows the estimated coefficients for the model, along with their standard errors, t-statistics, and p-values. The adjusted R-squared value for the model is 0.9503.

Visualizing Modeling Assumptions

Several graphs can be used to assess if any modeling assumptions have been violated. Those graphs are shown below.

par(mfrow=c(2,2))
plot(mdl1)

The residuals vs fitted values graph shows points that do not appear to be randomly distributed across the x-axis. This can be an indication of correlation in the error terms. For time series data, further evidence of this can be seen by plotting the residuals in the order in which they occur in the series.

x <- seq_along(mdl1$residuals)   # residual index, in time order
p <- ggplot() + 
   geom_line(aes(x = x, y = mdl1$residuals), color = "red") +
   geom_hline(yintercept = 0, linetype = "dashed", color = "blue")
print(p)

In the above graph, the value of each point tends to be associated with the values of its neighbors to the left and right. If the residuals were not autocorrelated, their values would jump around randomly as we move from left to right.
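
Another diagnostic that can be applied here, though it is not part of the original example, is the sample autocorrelation function of the residuals; a large spike at lag 1 would point to first-order autocorrelation.

# Additional check (not in the original example): sample ACF of the residuals.
acf(mdl1$residuals, main = "ACF of model 1 residuals")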

Durbin Watson Test

One way to statistically quantify autocorrelation in the residuals of a linear model is the Durbin Watson test. This test produces a statistic that ranges from 0 to 4. Values below 2 indicate positive autocorrelation, values above 2 indicate negative autocorrelation, and a value of 2 indicates no autocorrelation. The following code performs the Durbin Watson test on the residuals of our initial model. It also produces a p-value that can be used to test the null hypothesis of no autocorrelation at a given significance level.
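
For reference, the Durbin Watson statistic is computed from successive differences of the residuals \(\hat\epsilon_i\):

\[
d = \frac{\sum_{i=2}^{n} \left(\hat\epsilon_i - \hat\epsilon_{i-1}\right)^2}{\sum_{i=1}^{n} \hat\epsilon_i^2}
\]

When neighboring residuals tend to have similar values, the numerator is small relative to the denominator and \(d\) falls below 2; when neighboring residuals tend to alternate in sign, \(d\) rises above 2.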

durbinWatsonTest(mdl1)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.7751583       0.43725       0
##  Alternative hypothesis: rho != 0

The null hypothesis of the Durbin Watson test (\(H_o\)) is that there is no correlation among residuals, i.e., they are independent. The alternative hypothesis (\(H_a\)) is that the residuals are autocorrelated. In the case of the Microsoft data we see that the resulting p-value is very low, and so we reject the null hypothesis. The D-W statistic of 0.43725 for this model indicates likely positive autocorrelation in the residuals. This means that the residual \(\hat\epsilon_i\) is very likely to have a value similar to \(\hat\epsilon_{i-1}\). This is a violation of the modeling assumptions, and it means that inference based on our model is most likely unreliable.

Fixing Autocorrelation by Adding Missing Predictors

One way to address autocorrelated errors, particularly with time series data, is to include additional explanatory variables in the model. The inclusion of these variables can help to remove the autocorrelation present in the residuals. In the case of the Microsoft, Inc. data, adding a predictor that represents the amount spent on R&D in the corresponding quarter of the previous year reduces the autocorrelation seen in the residuals. In this dataset that lagged value is already provided as the RD1YEAR column; a sketch of how such a lag might be constructed is shown below, followed by a new linear model that predicts REVENUES based on MARKETING, SEASON, and RD1YEAR.
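
A minimal sketch of building a four-quarter lag with dplyr, assuming a hypothetical data frame df_raw containing the Year and Quarter columns from the original CSV plus an unlagged quarterly R&D column named RD (neither df_raw nor RD is part of the example data):

# Hypothetical sketch only: the example CSV already ships with the lag as RD1YEAR.
# df_raw and RD are assumed names for illustration, not objects in this analysis.
df_lagged <- df_raw %>%
   arrange(Year, Quarter) %>%
   mutate(RD1YEAR = dplyr::lag(RD, n = 4))   # value from the same quarter one year earlier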

mdl2 <- lm(REVENUES~MARKETING+SEASON+RD1YEAR, data=df)
summary(mdl2)
## 
## Call:
## lm(formula = REVENUES ~ MARKETING + SEASON + RD1YEAR, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -201.853  -20.519   -0.129   23.763  191.596 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -8.0144    22.4160  -0.358    0.722    
## MARKETING      1.9671     0.1691  11.630 1.06e-15 ***
## SEASONSPRING  22.8215    25.9182   0.881    0.383    
## SEASONSUMMER  12.2734    25.9307   0.473    0.638    
## SEASONWINTER -20.6629    25.6312  -0.806    0.424    
## RD1YEAR        5.2059     0.2204  23.617  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.52 on 49 degrees of freedom
## Multiple R-squared:  0.9963, Adjusted R-squared:  0.9959 
## F-statistic:  2625 on 5 and 49 DF,  p-value: < 2.2e-16

The following plot shows the residuals of the new model in time series order.

x <- seq_along(mdl2$residuals)   # residual index, in time order
p <- ggplot() + 
   geom_line(aes(x = x, y = mdl2$residuals), color = "red") +
   geom_hline(yintercept = 0, linetype = "dashed", color = "blue")
print(p)

The residuals now tend to bounce back and forth across the x axis. There may be issues with constant variance in this model, but at least the issues with autocorrelation may have been corrected. This can be confirmed by running the Durbin Watson test again.

durbinWatsonTest(mdl2)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.07676205      2.144235   0.696
##  Alternative hypothesis: rho != 0

The p-value is now fairly large. As a result, we are not able to reject the null hypothesis (\(H_o\)) that there is no correlation among residuals.

Confidence Intervals

The following code displays the confidence intervals for the coefficients for the two models developed above.

Model 1 Confidence Intervals

confint(mdl1)
##                    2.5 %     97.5 %
## (Intercept)  -311.512888 -11.283432
## MARKETING       5.410319   6.132328
## SEASONSPRING -243.548291 115.475469
## SEASONSUMMER -255.462577 103.623193
## SEASONWINTER -215.536617 143.022567

Model 2 Confidence Intervals

confint(mdl2)
##                  2.5 %    97.5 %
## (Intercept)  -53.06097 37.032196
## MARKETING      1.62717  2.306984
## SEASONSPRING -29.26310 74.906070
## SEASONSUMMER -39.83628 64.383001
## SEASONWINTER -72.17080 30.845034
## RD1YEAR        4.76289  5.648834

One of the effects of autocorrelated residuals is that it produces confidence intervals for the regression coefficients that are narrower than they should be. The confidence intervals for the two models we developed are displayed above. The width of the 95% confidence interval for the MARKETING coefficient went from 0.722009 in the first model to 0.679814 in the second. While that may seem to contradict the statement that the first confidence interval is likely too narrow, it is important to remember that we are dealing with two different models with different numbers of predictors. Similarly, it does not make sense to compare the standard errors of the coefficients across the two models, because they are different models with different predictors.
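
The interval widths mentioned above can be computed directly from the confint() output; for example:

# Width of the 95% confidence interval for the MARKETING coefficient in each model.
diff(confint(mdl1)["MARKETING", ])   # model 1: roughly 0.722
diff(confint(mdl2)["MARKETING", ])   # model 2: roughly 0.680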

Conclusion

This document explored the impact of autocorrelated errors on regression models, and how the inclusion of missing predictors can help to remove autocorrelation from the residuals. Autocorrelation is a common feature of time series data, as shown in the example of Microsoft revenue vs. marketing data.