Our first step is to import the dataset into R. The dataset used here is IPM (Indeks Pembangunan Manusia; English: Human Development Index) for DKI Jakarta from 2010 to 2021.
df <- readxl::read_xlsx("C:/Users/Haci/Desktop/Metode Peramalan Deret Waktu/ipmdki.xlsx")
Now let's have a look at our data:
library(knitr)
kable(df, caption="IPM DKI Jakarta (2010-2021)")
| Year | IPM   |
|------|-------|
| 2010 | 76.31 |
| 2011 | 76.98 |
| 2012 | 77.53 |
| 2013 | 78.08 |
| 2014 | 78.39 |
| 2015 | 78.99 |
| 2016 | 79.60 |
| 2017 | 80.06 |
| 2018 | 80.47 |
| 2019 | 80.76 |
| 2020 | 80.77 |
| 2021 | 81.11 |
data.ts <- ts(df$IPM)
ts.plot(data.ts, xlab="Time Period", ylab="IPM",
        main="IPM DKI Jakarta over time (Year)")
Now let's perform a Double Moving Average (DMA) on the IPM data.
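As a refresher, the Double Moving Average applies a second moving average, S''t, on top of the single moving average S't, and then builds a level and a trend component:

At = 2*S't - S''t
Bt = (2/(N-1)) * (S't - S''t)

The forecast m periods ahead is F(t+m) = At + Bt*m. The code below implements exactly this with N = 3.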
library(TTR)
# Single moving average with window N = 3
data.sma <- SMA(data.ts, n = 3)
# Shift by one period: the SMA at time t is the forecast for t+1
data.fc <- c(NA, data.sma)
data.gab <- cbind(actual = c(data.ts, rep(NA,3)),
                  smoothing = c(data.sma, rep(NA,3)),
                  forecast = c(data.fc, rep(data.fc[length(data.fc)], 2)))
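Before moving on to the DMA, we could visualize the single moving average with a quick sketch like the following (the styling here is an assumption, mirroring the DMA plot further below):

ts.plot(data.gab[,1], xlab="Time Period", ylab="IPM", main="SMA N=3 IPM",
        ylim=c(75,85))
points(data.gab[,1])
lines(data.gab[,2], col="green", lwd=2)   # SMA smoothing
lines(data.gab[,3], col="red", lwd=2)     # flat SMA forecast
legend("topleft", c("actual","smoothed","forecasted"), lty=1,
       col=c("black","green","red"), cex=0.8)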
# Double moving average: a second SMA applied to the first
dma <- SMA(data.sma, n = 3)
# Level (At) and trend (Bt) components
At <- 2*data.sma - dma
Bt <- 2/(3-1)*(data.sma - dma)
data.dma <- At + Bt
data.fc2 <- c(NA, data.dma)
# Forecast 4 periods ahead: F(t+m) = At + Bt*m
t <- 1:4
f <- c()
for (i in t) {
  f[i] <- At[length(At)] + Bt[length(Bt)]*(i)
}
# aktual = actual, pemulusan = smoothing, fcan = forecast
data.gab2 <- cbind(aktual = c(data.ts, rep(NA,4)),
                   pemulusan1 = c(data.sma, rep(NA,4)),
                   pemulusan2 = c(data.dma, rep(NA,4)),
                   At = c(At, rep(NA,4)),
                   Bt = c(Bt, rep(NA,4)),
                   fcan = c(data.fc2, f[-1]))
data.gab2
## aktual pemulusan1 pemulusan2 At Bt fcan
## [1,] 76.31 NA NA NA NA NA
## [2,] 76.98 NA NA NA NA NA
## [3,] 77.53 76.94000 NA NA NA NA
## [4,] 78.08 77.53000 NA NA NA NA
## [5,] 78.39 78.00000 79.02000 78.51000 0.5100000 NA
## [6,] 78.99 78.48667 79.44889 78.96778 0.4811111 79.02000
## [7,] 79.60 78.99333 79.99333 79.49333 0.5000000 79.44889
## [8,] 80.06 79.55000 80.63000 80.09000 0.5400000 79.99333
## [9,] 80.47 80.04333 81.07222 80.55778 0.5144444 80.63000
## [10,] 80.76 80.43000 81.27444 80.85222 0.4222222 81.07222
## [11,] 80.77 80.66667 81.24000 80.95333 0.2866667 81.27444
## [12,] 81.11 80.88000 81.32222 81.10111 0.2211111 81.24000
## [13,] NA NA NA NA NA 81.32222
## [14,] NA NA NA NA NA 81.54333
## [15,] NA NA NA NA NA 81.76444
## [16,] NA NA NA NA NA 81.98556
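Rows 13 to 16 of the fcan column are the forecasts for the next four periods, each computed as F(t+m) = At + Bt*m using the last available At (81.10111) and Bt (0.2211111).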
ts.plot(data.gab2[,1], xlab="Time Period", ylab="IPM", main="DMA N=3 IPM",
        ylim=c(75,85))
points(data.gab2[,1])
lines(data.gab2[,3], col="green", lwd=2)
lines(data.gab2[,6], col="red", lwd=2)
legend("topleft", c("actual","smoothed","forecasted"), lty=1,
       col=c("black","green","red"), cex=0.8)
We have just forecast the IPM for 4 periods ahead using the Double Moving Average. Let's check the error.
error.dma = data.ts - data.fc2[1:length(data.ts)]
# Evaluate the errors from period 8 onward
SSE.dma = sum(error.dma[8:length(data.ts)]^2)
MSE.dma = mean(error.dma[8:length(data.ts)]^2)
MAPE.dma = mean(abs((error.dma[8:length(data.ts)]/data.ts[8:length(data.ts)])*100))
accuracy.dma <- matrix(c(SSE.dma, MSE.dma, MAPE.dma))
row.names(accuracy.dma) <- c("SSE", "MSE", "MAPE")
colnames(accuracy.dma) <- c("Accuracy of m = 3")
accuracy.dma
## Accuracy of m = 3
## SSE 0.39889136
## MSE 0.07977827
## MAPE 0.29070565
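A MAPE of roughly 0.29% is very low (a MAPE under 10% is commonly taken to indicate highly accurate forecasting), so the DMA follows this series closely.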
What is the correlation between Year and IPM?
cor(df$Year,df$IPM)
## [1] 0.9879386
From the correlation coefficient above (r ≈ 0.988), we can see that Year and IPM are highly correlated.
model <- lm(df$IPM~df$Year)
summary(model)
##
## Call:
## lm(formula = df$IPM ~ df$Year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42154 -0.16017 0.05061 0.16141 0.30594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -816.54150 44.39144 -18.39 4.86e-09 ***
## df$Year 0.44437 0.02202 20.18 1.97e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2634 on 10 degrees of freedom
## Multiple R-squared: 0.976, Adjusted R-squared: 0.9736
## F-statistic: 407.1 on 1 and 10 DF, p-value: 1.97e-09
Let's plot our regression model.
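A simple way to do this is a scatter plot of the data with the fitted line drawn on top (a minimal sketch):

plot(df$Year, df$IPM, xlab="Year", ylab="IPM",
     main="IPM DKI Jakarta with fitted regression line")
abline(model, col="red", lwd=2)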
From the summarized result of the model and the plot, we can say this is a good linear regression fit: the residual standard error is low (0.2634), the R-squared is high (0.976), and the p-value is very small (1.97 x 10^-9).
But is there any autocorrelation in our model? Let's investigate.
We have two ways to check whether there is autocorrelation in our regression model:
1. Visualization: We can analyze the residual plots and use ACF/PACF plots
2. Formal test: We can use both the Durbin-Watson and the Breusch-Godfrey test
Let's start with visualization by analyzing the residual plot.
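A minimal sketch of such a residual plot, assuming we plot the residuals against the fitted values:

plot(model$fitted.values, model$residuals,
     xlab="Fitted values", ylab="Residuals", main="Residual plot")
abline(h = 0, col = "red")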
As we can see, the residuals are not homoscedastic; they form a pattern. This may be an indication of autocorrelation in our model.
How about ACF and PACF plots?
par(mfrow = c(1,2))  # show the ACF and PACF side by side
acf(model$residuals)
pacf(model$residuals)
From the ACF/PACF plots, we don't see any significant spikes, that is, spikes that go above the blue dashed confidence bounds. So the plots alone give no strong indication of autocorrelation in the model.
Note that the spike at lag = 0 in the ACF is 1.0 (above the blue dashed line) because at lag 0 the series is compared with itself, so the autocorrelation is always exactly 1.0.
Now that we have explored the model visually, let's try the formal tests we have at our disposal.
library(lmtest)
#Durbin Watson Test
#H0: autocorrelation does not exist
#H1: autocorrelation does exist
dwtest(model)
##
## Durbin-Watson test
##
## data: model
## DW = 0.53015, p-value = 6.42e-05
## alternative hypothesis: true autocorrelation is greater than 0
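Recall that a DW statistic close to 2 indicates no autocorrelation, while values toward 0 indicate positive autocorrelation. Here DW = 0.53015 with a p-value of 6.42e-05, so we reject H0: there is evidence of positive autocorrelation.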
#Breusch-Godfrey Test
#H0: autocorrelation does not exist
#H1: autocorrelation does exist
bgtest(df$IPM~df$Year, order=1)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: df$IPM ~ df$Year
## LM test = 5.5773, df = 1, p-value = 0.01819
The results of the formal Durbin-Watson and Breusch-Godfrey tests show that there is autocorrelation in our model: both p-values are below 0.05, so we reject H0.
Now we know that our model has autocorrelation, but how do we take care of it?
In this part, we will use Cochrane-Orcutt and Hildreth-Lu to transform our data so that the autocorrelation problem disappears from our model. Both methods are based on the same idea, but Cochrane-Orcutt uses an iterative procedure to find the best rho (the one with the lowest error), while Hildreth-Lu searches over a grid of candidate rho values. Let's begin!
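Concretely, for a chosen rho both methods fit an ordinary least squares regression on the transformed variables

y*t = yt - rho*y(t-1)
x*t = xt - rho*x(t-1)

The slope carries over to the original scale directly, while the original-scale intercept is recovered as beta0 = beta0*/(1 - rho).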
Cochrane-Orcutt
Let's use the cochrane.orcutt function from the orcutt package. We will use convergence = 5 and max.iter = 1000 as the parameters.
Note: I tried convergence = 8, but the estimate did not converge. With max.iter = 100, the procedure only converges for convergence up to 3. We might be able to use convergence = 8 by raising max.iter further, but that takes more time and memory, so let's not do it for now.
library(orcutt)
modelco<-cochrane.orcutt(model, convergence = 5, max.iter=1000)
modelco
## Cochrane-orcutt estimation for first order autocorrelation
##
## Call:
## lm(formula = df$IPM ~ df$Year)
##
## number of interaction: 720
## rho 0.898765
##
## Durbin-Watson statistic
## (original): 0.53015 , p-value: 6.42e-05
## (transformed): 1.89065 , p-value: 2.796e-01
##
## coefficients:
## (Intercept) df$Year
## -39.430552 0.060569
Now that we’ve got our optimum rho (0.898765), let’s transform our data.
rho <- modelco$rho
# Apply the transformation y*_t = y_t - rho*y_(t-1), x*_t = x_t - rho*x_(t-1)
y.trans <- df$IPM[-1] - df$IPM[-12]*rho
x.trans <- df$Year[-1] - df$Year[-12]*rho
modelcorho <- lm(y.trans ~ x.trans)
summary(modelcorho)
##
## Call:
## lm(formula = y.trans ~ x.trans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.262962 -0.031802 0.001728 0.083219 0.182379
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.99175 27.02154 -0.148 0.886
## x.trans 0.06057 0.13182 0.459 0.657
##
## Residual standard error: 0.14 on 9 degrees of freedom
## Multiple R-squared: 0.02292, Adjusted R-squared: -0.08564
## F-statistic: 0.2111 on 1 and 9 DF, p-value: 0.6568
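As a sanity check, the slope of this transformed fit (0.06057) matches the cochrane.orcutt output above, and dividing the transformed intercept by (1 - rho) should recover (approximately) the original-scale intercept it reported:

# beta0 = beta0* / (1 - rho); expected to be roughly -39.43,
# the intercept reported by modelco above
coef(modelcorho)[1] / (1 - rho)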
Let us now run the formal tests to check whether autocorrelation still exists in our model.
dwtest(modelcorho)
##
## Durbin-Watson test
##
## data: modelcorho
## DW = 1.8906, p-value = 0.2796
## alternative hypothesis: true autocorrelation is greater than 0
bgtest(y.trans~x.trans, order=1)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: y.trans ~ x.trans
## LM test = 0.021769, df = 1, p-value = 0.8827
As we can see, both p-values are higher than 0.05, so we fail to reject H0. We can conclude that autocorrelation no longer exists in our model. Hurray, problem solved!
Now let's try Hildreth-Lu.
Hildreth-Lu
First, we need a helper function that applies the transformation for a given rho; we will then search over a grid of rho values for the one that minimizes the SSE.
hildreth.lu.func <- function(r, model){
  x <- model.matrix(model)[,-1]
  y <- model.response(model.frame(model))
  n <- length(y)
  t <- 2:n
  # transform both variables with the candidate rho r
  y <- y[t] - r*y[t-1]
  x <- x[t] - r*x[t-1]
  return(lm(y~x))
}
Now that we've created the function, let's run the search!
# Coarse grid of candidate rho values, finer near 1
r <- c(seq(0.1,0.8, by= 0.1), seq(0.9,0.99, by= 0.01))
tab <- data.frame("rho" = r,
                  "SSE" = sapply(r, function(i){deviance(hildreth.lu.func(i, model))}))
optrho <- which.min(round(tab, 4)[,2])
round(tab, 4)[optrho,]
## rho SSE
## 9 0.9 0.1763
From the grid search, we got 0.9 as our optimum rho. It is quite close to the optimum rho we got from Cochrane-Orcutt (0.898765).
Let's make a plot so we can have a better view of the SSE at each candidate rho.
plot(tab$SSE ~ tab$rho , type = "l")
abline(v = tab[tab$SSE==min(tab$SSE),"rho"], lty = 3)
Let's have a closer look at the plot by restricting it to the interval 0.8 < rho < 1.0.
r <- seq(0.8,1.0, by= 0.01)
tab <- data.frame("rho" = r, "SSE" = sapply(r,function(i){deviance(hildreth.lu.func(i, model))}))
plot(tab$SSE ~ tab$rho , type = "l")
abline(v = tab[tab$SSE==min(tab$SSE),"rho"], lty = 3)
From the search, we can conclude that our best rho is 0.9. Let's transform our model using that rho.
modelhl <- hildreth.lu.func(0.9, model)
summary(modelhl)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.262945 -0.031786 0.002045 0.082941 0.182273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.92123 27.02291 -0.108 0.916
## x 0.05555 0.13345 0.416 0.687
##
## Residual standard error: 0.14 on 9 degrees of freedom
## Multiple R-squared: 0.01889, Adjusted R-squared: -0.09013
## F-statistic: 0.1733 on 1 and 9 DF, p-value: 0.687
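Notice that the slope here (0.05555) is very close to the one from the Cochrane-Orcutt fit (0.06057), which is expected since the two rho values are nearly identical.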
Finally, let's check for autocorrelation in the new Hildreth-Lu model using the Durbin-Watson test.
dwtest(modelhl)
##
## Durbin-Watson test
##
## data: modelhl
## DW = 1.8932, p-value = 0.2811
## alternative hypothesis: true autocorrelation is greater than 0
From the result (p-value = 0.2811 > 0.05), we can see that our new model also does not have any autocorrelation. Another problem solved!
Let's compare the MSE of the regression models we've created so far.
mse.reg <- mean(model$residuals^2)
mse.co <- mean(modelcorho$residuals^2)
mse.hl <- mean(modelhl$residuals^2)
error <- data.frame(model=c("original", "cochraneorcutt", "hildrethlu"),
mse=c(mse.reg, mse.co, mse.hl))
error
## model mse
## 1 original 0.05780778
## 2 cochraneorcutt 0.01602742
## 3 hildrethlu 0.01602712
So, we can conclude that the regression models created with Cochrane-Orcutt and Hildreth-Lu both have a lower MSE than the original model. The difference between the two methods is tiny, so either one is a fine choice here.