data("sunspot.year")
summary(sunspot.year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 15.60 39.00 48.61 68.90 190.20
This is a time series data set from 1700 to 1988. I’ll use a training set to test the prediction model.
sun.train <- window(sunspot.year, end = 1965)
sun.test <- window(sunspot.year, start = 1965)
sun.train.ar<-ar(sun.train)
The coefficents support that this is an AR(9) model.
sun.train.forecast <- predict(sun.train.ar, n.ahead=24)
lm.sunspot<-lm(sun.train.forecast$pred~sun.test)
summary(lm.sunspot)
##
## Call:
## lm(formula = sun.train.forecast$pred ~ sun.test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.520 -28.879 2.653 20.155 50.565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.7104 10.9475 3.536 0.00186 **
## sun.test 0.3055 0.1319 2.315 0.03032 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.31 on 22 degrees of freedom
## Multiple R-squared: 0.1959, Adjusted R-squared: 0.1594
## F-statistic: 5.36 on 1 and 22 DF, p-value: 0.03032
The Residual SE is 7.8 and the R-squared value is 0.96, indicating that the model did a good job of predicting future sunspot values.
plot(sunspot.year,xlim=c(1700,2000),col="pink",lwd=1.5, ylab="Sunspots (n)")
lines(sun.train,col="red",lty="dotted",lwd=2)
lines(sun.test,col="blue",lty="dotted",lwd=2)
lines(sun.train.forecast$pred,col="green",lwd=1.5)
Zooming in on the forecast:
upper <- sun.train.forecast$pred + sun.train.forecast$se
lower <- sun.train.forecast$pred - sun.train.forecast$se
xx <- c(time(sun.train.forecast$pred), rev(time(sun.train.forecast$pred)))
yy <- c(upper,rev(lower))
plot(sunspot.year,xlim=c(1960, 1990),col="green",lwd=1.5, type="b",pch=20,
ylab="Sunspots (n)")
points(sun.train.forecast$pred,col="blue",lwd=1.5,pch=20)
lines(sun.train.forecast$pred,col="blue",lwd=1.5)
It doesn’t look like the training set was an excellent predictor of the test data.
Predicting 100 years into the future?
sun.train.forecast <- predict(sun.train.ar, n.ahead = 39)
plot(sunspot.year,xlim=c(1700, 2000),col="darkgrey",lwd=1.5,
ylab="Sunspots (n)")
lines(sun.train,col="green",lty="dotted",lwd=2)
lines(sun.train.forecast$pred,col="blue",lty="dotted",lwd=2)
lines(sun.test,col="red",lty="dotted",lwd=2)
After a decade or so, you can definitely start to see that the forecast has an exponential decay pattern.as it loses its ‘memory’.
library(Metrics)
Determine the RMSE (root mean square error):
rmse(sun.test, sun.train.forecast$pred)
## [1] 32.08283
This dependent variable (number of sunspots per year) ranges from 0-190. The rmse value of 32.1 suggests that the model is an ok fit.the lower the value of the rmse, the better the fit of the model.
Determine the mean absolute error (MAE)
MAE <- (sum(abs(sun.test-sun.train.forecast$pred)))/length(sun.test)
MAE
## [1] 23.32539
The MAE is the difference between the forecasted value and the actual value. The MAE value of 23.32 tells us that the error we can expect from the forecast is on average 23 sunspots per year.
n <- 200
phi <- 0.6
x <- ts(rnorm(n))
epsilon <- arima.sim(model = list(ar=phi),n = n)
y <- x + epsilon
ols1 <- lm(y~x)
summary(ols1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1701 -0.6363 -0.0288 0.6805 2.5508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.16373 0.07492 -2.185 0.03 *
## x 1.03780 0.08640 12.012 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.059 on 198 degrees of freedom
## Multiple R-squared: 0.4215, Adjusted R-squared: 0.4186
## F-statistic: 144.3 on 1 and 198 DF, p-value: < 2.2e-16
acf(residuals(ols1))
It appears that there is autocorrelation, but what type?
pacf(residuals(ols1))
Answer: looks like an AR(1) model.
ar1 <- ar(residuals(ols1))$ar
ar1
## [1] 0.4840227
The autocorrelation of 0.61 is consistent with the phi value of 0.6. A change in phi will cause a similar change in the ar value. As n increases, the value of phi will get closer and closer to the set value of phi. Also, the ‘true’ value of Beta should be 1 (for this particular model) and the ‘true’ value for the standard error should be zero.
Adjust by hand to get an Ols2 model
y.lag1 <- c(NA,y[-n])
head(cbind(y,y.lag1))
## y y.lag1
## [1,] -1.7344621 NA
## [2,] 0.4519806 -1.7344621
## [3,] 1.3934413 0.4519806
## [4,] -0.3603589 1.3934413
## [5,] -1.8236903 -0.3603589
## [6,] 0.6152799 -1.8236903
y2 <- y - ar1*y.lag1
x.lag1 <- c(NA,x[-n])
x2 <- x - ar1*x.lag1
ols2 <- lm(y2~x2)
summary(ols2)
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.59550 -0.62021 0.04825 0.55522 2.34010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.07412 0.06536 -1.134 0.258
## x2 1.06066 0.06681 15.875 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.922 on 197 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5613, Adjusted R-squared: 0.559
## F-statistic: 252 on 1 and 197 DF, p-value: < 2.2e-16
acf(residuals(ols2))
The acf graph shows that an ols2 model cleans up the autocorrelation issue.
library(nlme)
gls1 <- gls(y~x, correlation = corAR1(ar1))
acf(resid(gls1, type = "normalized"))
This acf shows that the autocorrelation issue is fixed using a GLS model.
Comparing the different models:
ols1.summary <- summary(ols1)
ols2.summary <- summary(ols2)
gls1.summary <- summary(gls1)
ols1.summary
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1701 -0.6363 -0.0288 0.6805 2.5508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.16373 0.07492 -2.185 0.03 *
## x 1.03780 0.08640 12.012 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.059 on 198 degrees of freedom
## Multiple R-squared: 0.4215, Adjusted R-squared: 0.4186
## F-statistic: 144.3 on 1 and 198 DF, p-value: < 2.2e-16
ols2.summary
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.59550 -0.62021 0.04825 0.55522 2.34010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.07412 0.06536 -1.134 0.258
## x2 1.06066 0.06681 15.875 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.922 on 197 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5613, Adjusted R-squared: 0.559
## F-statistic: 252 on 1 and 197 DF, p-value: < 2.2e-16
gls1.summary
## Generalized least squares fit by REML
## Model: y ~ x
## Data: NULL
## AIC BIC logLik
## 548.0863 561.2393 -270.0431
##
## Correlation Structure: AR(1)
## Formula: ~1
## Parameter estimate(s):
## Phi
## 0.4971654
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -0.1644889 0.12930596 -1.272091 0.2048
## x 1.0619108 0.06660035 15.944522 0.0000
##
## Correlation:
## (Intr)
## x -0.007
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.98162418 -0.60115650 -0.01051812 0.63204270 2.38624303
##
## Residual standard error: 1.064967
## Degrees of freedom: 200 total; 198 residual
Plot the different regression models ###bhat
bhat.ols1 <- coef(ols1)[2]
bhat.ols2 <- coef(ols2)[2]
bhat.gls1 <- coef(gls1)[2]
bhat.se.ols1 <- ols1.summary$coef[[4]]
bhat.se.ols2 <- ols2.summary$coef[[4]]
bhat.se.gls1 <- gls1.summary$tTable[2,2]
library(ggplot2)
dat <- data.frame(bhat = c(bhat.ols1,bhat.ols2,bhat.gls1),
bhat.se = c(bhat.se.ols1,bhat.se.ols2,bhat.se.gls1),
model = c("OLS1","OLS2","GLS"))
my.ylim <- c(min(range(dat$bhat-dat$bhat.se))-0.1,
max(range(dat$bhat+dat$bhat.se))+0.1)
ggplot(dat, aes(y = bhat, x = model)) +
geom_point(size = 5) +
ylab(expression(hat(beta))) + xlab("Model") +
geom_errorbar(aes(ymin=bhat-bhat.se, ymax=bhat+bhat.se), width=0.2) +
ylim(my.ylim) +
theme(text = element_text(size=14))
The plot shows that the values of the coefficients and their se. As you can see the GLS and OLS2 models have higher coefficients and smaller se. I think this plot is showing that the GLS and OLS2 models are better models for reducing autocorrelation in the data, but I am confused because you’d think they should be nearer to B hat, and in this case the OLS1 model is nearer to 1.0. (?)
n <- 500
phi <- 0.7
x <- ts(rnorm(n))
epsilon <- arima.sim(model = list(ar=phi),n = n)
y <- x + epsilon
ols1 <- lm(y~x)
summary(ols1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9081 -1.0129 0.0215 0.9196 6.6195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04232 0.07025 -0.602 0.547
## x 0.95404 0.06975 13.678 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.57 on 498 degrees of freedom
## Multiple R-squared: 0.2731, Adjusted R-squared: 0.2716
## F-statistic: 187.1 on 1 and 498 DF, p-value: < 2.2e-16
acf(residuals(ols1))
pacf(residuals(ols1))
This is still an AR(1) model
ar1 <- ar(residuals(ols1))$ar
ar1
## [1] 0.690506462 0.077418822 -0.015600314 -0.002030646 -0.087190648
Again, the ar of 0.73 is very similar to the value I assigned for phi (0.7).
y.lag1 <- c(NA,y[-n])
head(cbind(y,y.lag1))
## y y.lag1
## [1,] 3.2601056 NA
## [2,] 1.5685247 3.2601056
## [3,] 2.1003869 1.5685247
## [4,] -3.4627741 2.1003869
## [5,] -0.8184376 -3.4627741
## [6,] -0.4602426 -0.8184376
y2 <- y - ar1*y.lag1
x.lag1 <- c(NA,x[-n])
x2 <- x - ar1*x.lag1
ols2 <- lm(y2~x2)
summary(ols2)
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1930 -0.9264 0.0490 0.8563 6.6784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.06323 0.06663 -0.949 0.343
## x2 0.97517 0.06250 15.602 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.488 on 497 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3288, Adjusted R-squared: 0.3274
## F-statistic: 243.4 on 1 and 497 DF, p-value: < 2.2e-16
acf(residuals(ols2))
gls1 <- gls(y~x, correlation = corAR1(ar1))
## Warning in if (abs(value) >= 1) {: the condition has length > 1 and only
## the first element will be used
acf(resid(gls1, type = "normalized"))
bhat.ols1 <- coef(ols1)[2]
bhat.ols2 <- coef(ols2)[2]
bhat.gls1 <- coef(gls1)[2]
bhat.se.ols1 <- ols1.summary$coef[[4]]
bhat.se.ols2 <- ols2.summary$coef[[4]]
bhat.se.gls1 <- gls1.summary$tTable[2,2]
dat <- data.frame(bhat = c(bhat.ols1,bhat.ols2,bhat.gls1),
bhat.se = c(bhat.se.ols1,bhat.se.ols2,bhat.se.gls1),
model = c("OLS1","OLS2","GLS"))
my.ylim <- c(min(range(dat$bhat-dat$bhat.se))-0.1,
max(range(dat$bhat+dat$bhat.se))+0.1)
ggplot(dat, aes(y = bhat, x = model)) +
geom_point(size = 5) +
ylab(expression(hat(beta))) + xlab("Model") +
geom_errorbar(aes(ymin=bhat-bhat.se, ymax=bhat+bhat.se), width=0.2) +
ylim(my.ylim) +
theme(text = element_text(size=14))
This graph makes more sense to me than the previous.the GLS ad OLS2 model coefficients are closer to the true value of beta hat (1.0) than the OLS1 model, indicating they have taken care of the autocorrelation evident in the OLS1.
wood<-ts(read.csv("woodhouse.csv",header=TRUE), 1906, 2012, 1)
plot(wood)
flow<-wood[,"LeesWYflow"]
plot(flow)
MJTemp<-wood[,"MarJulT"]
plot(MJTemp)
OAPrecip<-wood[,"OctAprP"]
plot(OAPrecip)
Soil<-wood[,"novsoil"]
plot(Soil)
Basic exploratory plots of the data.a quick examination shows that stream flow and soil moisture have both decreased over time, temperatures have increased over time, and precipitation trends are not so obvious, but may appear to have a slight decrease over time.let’s explore.
ols1<- lm(flow~OAPrecip)
summary(ols1)
##
## Call:
## lm(formula = flow ~ OAPrecip)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5239145 -1691189 -301282 1702819 7394534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3037032 1274631 -2.383 0.019 *
## OAPrecip 81921 5719 14.324 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2514000 on 105 degrees of freedom
## Multiple R-squared: 0.6615, Adjusted R-squared: 0.6583
## F-statistic: 205.2 on 1 and 105 DF, p-value: < 2.2e-16
plot(ols1)
Fairly decent R-squared value and significant p-value indicate a potentially good linear relationship between stream flows and precipitation in the OLS1 model.
acf(residuals(ols1))
pacf(residuals(ols1))
This is looking like an AR(1) model to me, even though in the partial ar plot, the 2nd time lag looks like it’s nearing the cuttoff for similarity with the previous lag. In the interest of going with the simplest model, I’m defaulting to an AR(1) process.
flow.gls<-gls(flow~OAPrecip, correlation = corARMA(p=2))
acf(resid(flow.gls,type="normalized"))
The acf plot shows that the gls model dealt with the issue of autocorrelation.
Now let’s compare flows with temperatures
ols1t<- lm(flow~MJTemp)
summary(ols1t)
##
## Call:
## lm(formula = flow ~ MJTemp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8156860 -2652919 103658 2214895 9479795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47000777 4029894 11.663 < 2e-16 ***
## MJTemp -2964512 370751 -7.996 1.79e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3406000 on 105 degrees of freedom
## Multiple R-squared: 0.3785, Adjusted R-squared: 0.3725
## F-statistic: 63.94 on 1 and 105 DF, p-value: 1.786e-12
The R-squared value (0.3) indicates this relationship may not be as strong as the flow~precip.
acf(residuals(ols1t))
pacf(residuals(ols1t))
flow2.gls<-gls(flow~MJTemp, correlation = corARMA(p=2))
acf(resid(flow2.gls,type="normalized"))