Our first step is to import the dataset into R. The dataset used here is IPM (Indeks Pembangunan Manusia; English: Human Development Index) for DKI Jakarta from 2010 to 2021.
df <- readxl::read_xlsx("C:/Users/Haci/Desktop/Metode Peramalan Deret Waktu/ipmdki.xlsx")
Now let's have a look at our data:
library(knitr)
kable(df, caption="IPM DKI Jakarta (2010-2021)")
| Year | IPM   |
|------|-------|
| 2010 | 76.31 |
| 2011 | 76.98 |
| 2012 | 77.53 |
| 2013 | 78.08 |
| 2014 | 78.39 |
| 2015 | 78.99 |
| 2016 | 79.60 |
| 2017 | 80.06 |
| 2018 | 80.47 |
| 2019 | 80.76 |
| 2020 | 80.77 |
| 2021 | 81.11 |
data.ts <- ts(df$IPM)
ts.plot(data.ts, xlab="Time Period", ylab="IPM",
        main="IPM DKI Jakarta over time (Year)")
Now let's perform a Double Moving Average (DMA) on the IPM data.
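As a refresher, the Double Moving Average applies a second moving average, S''t, on top of the single moving average S't, and then builds a level and a trend component:

At = 2*S't - S''t
Bt = (2/(N-1)) * (S't - S''t)

The forecast m periods ahead is F(t+m) = At + Bt*m. The code below implements exactly this with N = 3.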
library(TTR)
# Single moving average with window N = 3
data.sma <- SMA(data.ts, n = 3)
# Shift by one period: the SMA at time t is the forecast for t+1
data.fc <- c(NA, data.sma)
data.gab <- cbind(actual = c(data.ts, rep(NA,3)),
                  smoothing = c(data.sma, rep(NA,3)),
                  forecast = c(data.fc, rep(data.fc[length(data.fc)], 2)))
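Before moving on to the DMA, we could visualize the single moving average with a quick sketch like the following (the styling here is an assumption, mirroring the DMA plot further below):

ts.plot(data.gab[,1], xlab="Time Period", ylab="IPM", main="SMA N=3 IPM",
        ylim=c(75,85))
points(data.gab[,1])
lines(data.gab[,2], col="green", lwd=2)   # SMA smoothing
lines(data.gab[,3], col="red", lwd=2)     # flat SMA forecast
legend("topleft", c("actual","smoothed","forecasted"), lty=1,
       col=c("black","green","red"), cex=0.8)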
# Double moving average: a second SMA applied to the first
dma <- SMA(data.sma, n = 3)
# Level (At) and trend (Bt) components
At <- 2*data.sma - dma
Bt <- 2/(3-1)*(data.sma - dma)
data.dma <- At + Bt
data.fc2 <- c(NA, data.dma)
# Forecast 4 periods ahead: F(t+m) = At + Bt*m
t <- 1:4
f <- c()
for (i in t) {
  f[i] <- At[length(At)] + Bt[length(Bt)]*(i)
}
# aktual = actual, pemulusan = smoothing, fcan = forecast
data.gab2 <- cbind(aktual = c(data.ts, rep(NA,4)),
                   pemulusan1 = c(data.sma, rep(NA,4)),
                   pemulusan2 = c(data.dma, rep(NA,4)),
                   At = c(At, rep(NA,4)),
                   Bt = c(Bt, rep(NA,4)),
                   fcan = c(data.fc2, f[-1]))
data.gab2
## aktual pemulusan1 pemulusan2 At Bt fcan
## [1,] 76.31 NA NA NA NA NA
## [2,] 76.98 NA NA NA NA NA
## [3,] 77.53 76.94000 NA NA NA NA
## [4,] 78.08 77.53000 NA NA NA NA
## [5,] 78.39 78.00000 79.02000 78.51000 0.5100000 NA
## [6,] 78.99 78.48667 79.44889 78.96778 0.4811111 79.02000
## [7,] 79.60 78.99333 79.99333 79.49333 0.5000000 79.44889
## [8,] 80.06 79.55000 80.63000 80.09000 0.5400000 79.99333
## [9,] 80.47 80.04333 81.07222 80.55778 0.5144444 80.63000
## [10,] 80.76 80.43000 81.27444 80.85222 0.4222222 81.07222
## [11,] 80.77 80.66667 81.24000 80.95333 0.2866667 81.27444
## [12,] 81.11 80.88000 81.32222 81.10111 0.2211111 81.24000
## [13,] NA NA NA NA NA 81.32222
## [14,] NA NA NA NA NA 81.54333
## [15,] NA NA NA NA NA 81.76444
## [16,] NA NA NA NA NA 81.98556
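Rows 13 to 16 of the fcan column are the forecasts for the next four periods, each computed as F(t+m) = At + Bt*m using the last available At (81.10111) and Bt (0.2211111).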
ts.plot(data.gab2[,1], xlab="Time Period", ylab="IPM", main="DMA N=3 IPM",
        ylim=c(75,85))
points(data.gab2[,1])
lines(data.gab2[,3], col="green", lwd=2)
lines(data.gab2[,6], col="red", lwd=2)
legend("topleft", c("actual","smoothed","forecasted"), lty=1,
       col=c("black","green","red"), cex=0.8)
We have just forecast the IPM for 4 periods ahead using the Double Moving Average. Let's check the error.
error.dma = data.ts - data.fc2[1:length(data.ts)]
# Evaluate the errors from period 8 onward
SSE.dma = sum(error.dma[8:length(data.ts)]^2)
MSE.dma = mean(error.dma[8:length(data.ts)]^2)
MAPE.dma = mean(abs((error.dma[8:length(data.ts)]/data.ts[8:length(data.ts)])*100))
accuracy.dma <- matrix(c(SSE.dma, MSE.dma, MAPE.dma))
row.names(accuracy.dma) <- c("SSE", "MSE", "MAPE")
colnames(accuracy.dma) <- c("Accuracy of m = 3")
accuracy.dma
## Accuracy of m = 3
## SSE 0.39889136
## MSE 0.07977827
## MAPE 0.29070565
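A MAPE of roughly 0.29% is very low (a MAPE under 10% is commonly taken to indicate highly accurate forecasting), so the DMA follows this series closely.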
What is the correlation between Year and IPM?
cor(df$Year,df$IPM)
## [1] 0.9879386
From the correlation coefficient above (r ≈ 0.988), we can see that Year and IPM are highly correlated.
model <- lm(df$IPM~df$Year)
summary(model)
##
## Call:
## lm(formula = df$IPM ~ df$Year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42154 -0.16017 0.05061 0.16141 0.30594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -816.54150 44.39144 -18.39 4.86e-09 ***
## df$Year 0.44437 0.02202 20.18 1.97e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2634 on 10 degrees of freedom
## Multiple R-squared: 0.976, Adjusted R-squared: 0.9736
## F-statistic: 407.1 on 1 and 10 DF, p-value: 1.97e-09
Let's plot our regression model.
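A simple way to do this is a scatter plot of the data with the fitted line drawn on top (a minimal sketch):

plot(df$Year, df$IPM, xlab="Year", ylab="IPM",
     main="IPM DKI Jakarta with fitted regression line")
abline(model, col="red", lwd=2)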
From the summarized result of the model and the plot, we can say this is a good linear regression fit: the residual standard error is low (0.2634), the R-squared is high (0.976), and the p-value is very small (1.97 x 10^-9).
But is there any autocorrelation in our model? Let's investigate.
We have two ways to check whether there is autocorrelation in our regression model:
1. Visualization: We can analyze the residual plots and use ACF/PACF plots
2. Formal test: We can use both the Durbin-Watson and the Breusch-Godfrey test
Let's start with visualization by analyzing the residual plot.
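A minimal sketch of such a residual plot, assuming we plot the residuals against the fitted values:

plot(model$fitted.values, model$residuals,
     xlab="Fitted values", ylab="Residuals", main="Residual plot")
abline(h = 0, col = "red")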
As we can see, the residuals are not homoscedastic; they form a pattern. This may be an indication of autocorrelation in our model.
How about ACF and PACF plots?
par(mfrow = c(1,2))  # show the ACF and PACF side by side
acf(model$residuals)
pacf(model$residuals)
From the ACF/PACF plots, we don't see any significant spikes, that is, spikes that go above the blue dashed confidence bounds. So the plots alone give no strong indication of autocorrelation in the model.
Note that the spike at lag = 0 in the ACF is 1.0 (above the blue dashed line) because at lag 0 the series is compared with itself, so the autocorrelation is always exactly 1.0.
Now that we have explored the model visually, let's try the formal tests we have at our disposal.
library(lmtest)
#Durbin Watson Test
#H0: autocorrelation does not exist
#H1: autocorrelation does exist
dwtest(model)
##
## Durbin-Watson test
##
## data: model
## DW = 0.53015, p-value = 6.42e-05
## alternative hypothesis: true autocorrelation is greater than 0
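Recall that a DW statistic close to 2 indicates no autocorrelation, while values toward 0 indicate positive autocorrelation. Here DW = 0.53015 with a p-value of 6.42e-05, so we reject H0: there is evidence of positive autocorrelation.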
#Breusch-Godfrey Test
#H0: autocorrelation does not exist
#H1: autocorrelation does exist
bgtest(df$IPM~df$Year, order=1)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: df$IPM ~ df$Year
## LM test = 5.5773, df = 1, p-value = 0.01819
The results of the formal Durbin-Watson and Breusch-Godfrey tests show that there is autocorrelation in our model: both p-values are below 0.05, so we reject H0.
Now we know that our model has autocorrelation, but how do we take care of it?
In this part, we will use Cochrane-Orcutt and Hildreth-Lu to transform our data so that the autocorrelation problem disappears from our model. Both methods are based on the same idea, but Cochrane-Orcutt uses an iterative procedure to find the best rho (the one with the lowest error), while Hildreth-Lu searches over a grid of candidate rho values. Let's begin!
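Concretely, for a chosen rho both methods fit an ordinary least squares regression on the transformed variables

y*t = yt - rho*y(t-1)
x*t = xt - rho*x(t-1)

The slope carries over to the original scale directly, while the original-scale intercept is recovered as beta0 = beta0*/(1 - rho).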
Cochrane-Orcutt
Let's use the cochrane.orcutt function from the orcutt package. We will use convergence = 5 and max.iter = 1000 as the parameters.
Note: I tried convergence = 8, but the estimate did not converge. With max.iter = 100, the procedure only converges for convergence up to 3. We might be able to use convergence = 8 by raising max.iter further, but that takes more time and memory, so let's not do it for now.
library(orcutt)
modelco<-cochrane.orcutt(model, convergence = 5, max.iter=1000)
modelco
## Cochrane-orcutt estimation for first order autocorrelation
##
## Call:
## lm(formula = df$IPM ~ df$Year)
##
## number of interaction: 720
## rho 0.898765
##
## Durbin-Watson statistic
## (original): 0.53015 , p-value: 6.42e-05
## (transformed): 1.89065 , p-value: 2.796e-01
##
## coefficients:
## (Intercept) df$Year
## -39.430552 0.060569
Now that we’ve got our optimum rho (0.898765), let’s transform our data.
rho <- modelco$rho
# Apply the transformation y*_t = y_t - rho*y_(t-1), x*_t = x_t - rho*x_(t-1)
y.trans <- df$IPM[-1] - df$IPM[-12]*rho
x.trans <- df$Year[-1] - df$Year[-12]*rho
modelcorho <- lm(y.trans ~ x.trans)
summary(modelcorho)
##
## Call:
## lm(formula = y.trans ~ x.trans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.262962 -0.031802 0.001728 0.083219 0.182379
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.99175 27.02154 -0.148 0.886
## x.trans 0.06057 0.13182 0.459 0.657
##
## Residual standard error: 0.14 on 9 degrees of freedom
## Multiple R-squared: 0.02292, Adjusted R-squared: -0.08564
## F-statistic: 0.2111 on 1 and 9 DF, p-value: 0.6568
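As a sanity check, the slope of this transformed fit (0.06057) matches the cochrane.orcutt output above, and dividing the transformed intercept by (1 - rho) should recover (approximately) the original-scale intercept it reported:

# beta0 = beta0* / (1 - rho); expected to be roughly -39.43,
# the intercept reported by modelco above
coef(modelcorho)[1] / (1 - rho)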
Let us now run the formal tests to check whether autocorrelation still exists in our model.
dwtest(modelcorho)
##
## Durbin-Watson test
##
## data: modelcorho
## DW = 1.8906, p-value = 0.2796
## alternative hypothesis: true autocorrelation is greater than 0
bgtest(y.trans~x.trans, order=1)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: y.trans ~ x.trans
## LM test = 0.021769, df = 1, p-value = 0.8827
As we can see, both p-values are higher than 0.05, so we fail to reject H0. We can conclude that autocorrelation no longer exists in our model. Hurray, problem solved!
Now let's try Hildreth-Lu.
Hildreth-Lu
First, we need a helper function that applies the transformation for a given rho; we will then search over a grid of rho values for the one that minimizes the SSE.
hildreth.lu.func <- function(r, model){
  x <- model.matrix(model)[,-1]
  y <- model.response(model.frame(model))
  n <- length(y)
  t <- 2:n
  # transform both variables with the candidate rho r
  y <- y[t] - r*y[t-1]
  x <- x[t] - r*x[t-1]
  return(lm(y~x))
}
Now that we've created the function, let's run the search!
# Coarse grid of candidate rho values, finer near 1
r <- c(seq(0.1,0.8, by= 0.1), seq(0.9,0.99, by= 0.01))
tab <- data.frame("rho" = r,
                  "SSE" = sapply(r, function(i){deviance(hildreth.lu.func(i, model))}))
optrho <- which.min(round(tab, 4)[,2])
round(tab, 4)[optrho,]
## rho SSE
## 9 0.9 0.1763
From the grid search, we got 0.9 as our optimum rho. It is quite close to the optimum rho we got from Cochrane-Orcutt (0.898765).
Let's make a plot so we can have a better view of the SSE at each candidate rho.
plot(tab$SSE ~ tab$rho , type = "l")
abline(v = tab[tab$SSE==min(tab$SSE),"rho"], lty = 3)
Let's have a closer look at the plot by restricting it to the interval 0.8 < rho < 1.0.
r <- seq(0.8,1.0, by= 0.01)
tab <- data.frame("rho" = r, "SSE" = sapply(r,function(i){deviance(hildreth.lu.func(i, model))}))
plot(tab$SSE ~ tab$rho , type = "l")
abline(v = tab[tab$SSE==min(tab$SSE),"rho"], lty = 3)
From the search, we can conclude that our best rho is 0.9. Let's transform our model using that rho.
modelhl <- hildreth.lu.func(0.9, model)
summary(modelhl)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.262945 -0.031786 0.002045 0.082941 0.182273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.92123 27.02291 -0.108 0.916
## x 0.05555 0.13345 0.416 0.687
##
## Residual standard error: 0.14 on 9 degrees of freedom
## Multiple R-squared: 0.01889, Adjusted R-squared: -0.09013
## F-statistic: 0.1733 on 1 and 9 DF, p-value: 0.687
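Notice that the slope here (0.05555) is very close to the one from the Cochrane-Orcutt fit (0.06057), which is expected since the two rho values are nearly identical.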
Finally, let's check for autocorrelation in the new Hildreth-Lu model using the Durbin-Watson test.
dwtest(modelhl)
##
## Durbin-Watson test
##
## data: modelhl
## DW = 1.8932, p-value = 0.2811
## alternative hypothesis: true autocorrelation is greater than 0
From the result (p-value = 0.2811 > 0.05), we can see that our new model also does not have any autocorrelation. Another problem solved!
Let's compare the MSE of the regression models we've created so far.
mse.reg <- mean(model$residuals^2)
mse.co <- mean(modelcorho$residuals^2)
mse.hl <- mean(modelhl$residuals^2)
error <- data.frame(model=c("original", "cochraneorcutt", "hildrethlu"),
mse=c(mse.reg, mse.co, mse.hl))
error
## model mse
## 1 original 0.05780778
## 2 cochraneorcutt 0.01602742
## 3 hildrethlu 0.01602712
So, we can conclude that the regression models created with Cochrane-Orcutt and Hildreth-Lu both have a lower MSE than the original model. The difference between the two methods is tiny, so either one is a fine choice here.