data("sunspot.year")
summary(sunspot.year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 15.60 39.00 48.61 68.90 190.20
This is a time series data set from 1700 to 1988.
sun.train <- window(sunspot.year, end = 1965)
sun.test <- window(sunspot.year, start = 1965)
sun.train.ar<-ar(sun.train)
The coefficents support that this is an AR(9) model…
sun.train.forecast <- predict(sun.train.ar, n.ahead=24)
lm.sunspot<-lm(sun.train.forecast$pred~sun.test)
summary(lm.sunspot)
##
## Call:
## lm(formula = sun.train.forecast$pred ~ sun.test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.520 -28.879 2.653 20.155 50.565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.7104 10.9475 3.536 0.00186 **
## sun.test 0.3055 0.1319 2.315 0.03032 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.31 on 22 degrees of freedom
## Multiple R-squared: 0.1959, Adjusted R-squared: 0.1594
## F-statistic: 5.36 on 1 and 22 DF, p-value: 0.03032
The Residual SE is 7.8 and the R-squared value is 0.96, indicating that the model did a good job of predicting future sunspot values.
plot(sunspot.year,xlim=c(1700,2000),col="pink",lwd=1.5, ylab="Sunspots (n)")
lines(sun.train,col="red",lty="dotted",lwd=2)
lines(sun.test,col="blue",lty="dotted",lwd=2)
lines(sun.train.forecast$pred,col="green",lwd=1.5)
Zooming in on the forecast:
upper <- sun.train.forecast$pred + sun.train.forecast$se
lower <- sun.train.forecast$pred - sun.train.forecast$se
xx <- c(time(sun.train.forecast$pred), rev(time(sun.train.forecast$pred)))
yy <- c(upper,rev(lower))
plot(sunspot.year,xlim=c(1960, 1990),col="green",lwd=1.5, type="b",pch=20,
ylab="Sunspots (n)")
points(sun.train.forecast$pred,col="blue",lwd=1.5,pch=20)
lines(sun.train.forecast$pred,col="blue",lwd=1.5)
It doesn’t look like the training set was an excellent predictor of the test data.
sun.train.forecast <- predict(sun.train.ar, n.ahead = 39)
plot(sunspot.year,xlim=c(1700, 2000),col="darkgrey",lwd=1.5,
ylab="Sunspots (n)")
lines(sun.train,col="green",lty="dotted",lwd=2)
lines(sun.train.forecast$pred,col="blue",lty="dotted",lwd=2)
lines(sun.test,col="red",lty="dotted",lwd=2)
After a decade or so, you can definitely start to see that the forecast has an exponential decay pattern…as it loses its ‘memory’.
Determine the RMSE (root mean square error):
library(Metrics)
## Warning: package 'Metrics' was built under R version 3.2.3
rmse(sun.test, sun.train.forecast$pred)
## [1] 32.08283
This dependent variable (number of sunspots per year) ranges from 0-190. The rmse value of 32.1 suggests that the model is an ok fit…the lower the value of the rmse, the better the fit of the model.
Determine the mean absolute error (MAE)
MAE <- (sum(abs(sun.test-sun.train.forecast$pred)))/length(sun.test)
MAE
## [1] 23.32539
The MAE is the difference between the forecasted value and the actual value. The MAE value of 31.12 tells us that the error we can expect from the forecast is on average 31 sunspots per year.
n <- 200
phi <- 0.6
x <- ts(rnorm(n))
epsilon <- arima.sim(model = list(ar=phi),n = n)
y <- x + epsilon
ols1 <- lm(y~x)
summary(ols1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3877 -0.7528 0.0580 0.6755 3.2550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.007793 0.084510 0.092 0.927
## x 0.996446 0.092616 10.759 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.191 on 198 degrees of freedom
## Multiple R-squared: 0.3689, Adjusted R-squared: 0.3657
## F-statistic: 115.8 on 1 and 198 DF, p-value: < 2.2e-16
acf(residuals(ols1))
It appears that there is autocorrelation, but what type?
pacf(residuals(ols1))
Answer: looks like an AR(1) model.
ar1 <- ar(residuals(ols1))$ar
ar1
## [1] 0.5663379
The autocorrelation of 0.61 is consistent with the phi value of 0.6. A change in phi will cause a similar change in the ar value. As n increases, the value of phi will get closer and closer to the set value of phi. Also, the ‘true’ value of Beta should be 1 (for this particular model) and the ‘true’ value for the standard error should be zero.
y.lag1 <- c(NA,y[-n])
head(cbind(y,y.lag1))
## y y.lag1
## [1,] -1.42192984 NA
## [2,] -1.09922550 -1.4219298
## [3,] -1.53350575 -1.0992255
## [4,] 0.17265622 -1.5335058
## [5,] 2.84391204 0.1726562
## [6,] 0.08474672 2.8439120
y2 <- y - ar1*y.lag1
x.lag1 <- c(NA,x[-n])
x2 <- x - ar1*x.lag1
ols2 <- lm(y2~x2)
summary(ols2)
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4675 -0.6621 -0.0507 0.7297 3.3570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.00282 0.06968 0.04 0.968
## x2 1.00605 0.06685 15.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9823 on 197 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5348, Adjusted R-squared: 0.5324
## F-statistic: 226.5 on 1 and 197 DF, p-value: < 2.2e-16
acf(residuals(ols2))
The acf graph shows that an ols2 model cleans up the autocorrelation issue.
library(nlme)
## Warning: package 'nlme' was built under R version 3.2.4
gls1 <- gls(y~x, correlation = corAR1(ar1))
acf(resid(gls1, type = "normalized"))
This acf shows that the autocorrelation issue is fixed using a GLS model.
ols1.summary <- summary(ols1)
ols2.summary <- summary(ols2)
gls1.summary <- summary(gls1)
ols1.summary
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3877 -0.7528 0.0580 0.6755 3.2550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.007793 0.084510 0.092 0.927
## x 0.996446 0.092616 10.759 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.191 on 198 degrees of freedom
## Multiple R-squared: 0.3689, Adjusted R-squared: 0.3657
## F-statistic: 115.8 on 1 and 198 DF, p-value: < 2.2e-16
ols2.summary
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4675 -0.6621 -0.0507 0.7297 3.3570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.00282 0.06968 0.04 0.968
## x2 1.00605 0.06685 15.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9823 on 197 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5348, Adjusted R-squared: 0.5324
## F-statistic: 226.5 on 1 and 197 DF, p-value: < 2.2e-16
gls1.summary
## Generalized least squares fit by REML
## Model: y ~ x
## Data: NULL
## AIC BIC logLik
## 571.4533 584.6064 -281.7266
##
## Correlation Structure: AR(1)
## Formula: ~1
## Parameter estimate(s):
## Phi
## 0.5732285
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -0.0042012 0.16142668 -0.026025 0.9793
## x 1.0077771 0.06643902 15.168452 0.0000
##
## Correlation:
## (Intr)
## x -0.029
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.82570791 -0.62021318 0.04781553 0.57368962 2.73923863
##
## Residual standard error: 1.196497
## Degrees of freedom: 200 total; 198 residual
#bhat
bhat.ols1 <- coef(ols1)[2]
bhat.ols2 <- coef(ols2)[2]
bhat.gls1 <- coef(gls1)[2]
#se of bhat
bhat.se.ols1 <- ols1.summary$coef[[4]]
bhat.se.ols2 <- ols2.summary$coef[[4]]
bhat.se.gls1 <- gls1.summary$tTable[2,2]
require(ggplot2)
## Loading required package: ggplot2
dat <- data.frame(bhat = c(bhat.ols1,bhat.ols2,bhat.gls1),
bhat.se = c(bhat.se.ols1,bhat.se.ols2,bhat.se.gls1),
model = c("OLS1","OLS2","GLS"))
my.ylim <- c(min(range(dat$bhat-dat$bhat.se))-0.1,
max(range(dat$bhat+dat$bhat.se))+0.1)
ggplot(dat, aes(y = bhat, x = model)) +
geom_point(size = 5) +
ylab(expression(hat(beta))) + xlab("Model") +
geom_errorbar(aes(ymin=bhat-bhat.se, ymax=bhat+bhat.se), width=0.2) +
ylim(my.ylim) +
theme(text = element_text(size=14))
The plot shows that the values of the coefficients and their se. As you can see the GLS and OLS2 models have higher coefficients and smaller se. I think this plot is showing that the GLS and OLS2 models are better models for reducing autocorrelation in the data, but I am confused because you’d think they should be nearer to B hat, and in this case the OLS1 model is nearer to 1.0. (?)
n <- 500
phi <- 0.7
x <- ts(rnorm(n))
epsilon <- arima.sim(model = list(ar=phi),n = n)
y <- x + epsilon
ols1 <- lm(y~x)
summary(ols1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5219 -0.8991 -0.0453 1.0287 4.1267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.10772 0.06223 -1.731 0.0841 .
## x 1.01669 0.06303 16.131 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.391 on 498 degrees of freedom
## Multiple R-squared: 0.3432, Adjusted R-squared: 0.3419
## F-statistic: 260.2 on 1 and 498 DF, p-value: < 2.2e-16
acf(residuals(ols1))
pacf(residuals(ols1))
This is still an AR(1) model
ar1 <- ar(residuals(ols1))$ar
ar1
## [1] 0.64626869 0.05950728 0.09435704 -0.08194842
Again, the ar of 0.73 is very similar to the value I assigned for phi (0.7).
y.lag1 <- c(NA,y[-n])
head(cbind(y,y.lag1))
## y y.lag1
## [1,] 0.3884819 NA
## [2,] 0.3729465 0.3884819
## [3,] -3.7644740 0.3729465
## [4,] -1.1562496 -3.7644740
## [5,] -2.6959352 -1.1562496
## [6,] -1.7783640 -2.6959352
y2 <- y - ar1*y.lag1
x.lag1 <- c(NA,x[-n])
x2 <- x - ar1*x.lag1
ols2 <- lm(y2~x2)
summary(ols2)
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2761 -0.8527 -0.0195 0.9669 3.8734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.09081 0.05771 -1.574 0.116
## x2 1.00315 0.05409 18.546 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.289 on 497 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.409, Adjusted R-squared: 0.4078
## F-statistic: 343.9 on 1 and 497 DF, p-value: < 2.2e-16
acf(residuals(ols2))
gls1 <- gls(y~x, correlation = corAR1(ar1))
## Warning in if (abs(value) >= 1) {: the condition has length > 1 and only
## the first element will be used
acf(resid(gls1, type = "normalized"))
bhat.ols1 <- coef(ols1)[2]
bhat.ols2 <- coef(ols2)[2]
bhat.gls1 <- coef(gls1)[2]
bhat.se.ols1 <- ols1.summary$coef[[4]]
bhat.se.ols2 <- ols2.summary$coef[[4]]
bhat.se.gls1 <- gls1.summary$tTable[2,2]
require(ggplot2)
dat <- data.frame(bhat = c(bhat.ols1,bhat.ols2,bhat.gls1),
bhat.se = c(bhat.se.ols1,bhat.se.ols2,bhat.se.gls1),
model = c("OLS1","OLS2","GLS"))
my.ylim <- c(min(range(dat$bhat-dat$bhat.se))-0.1,
max(range(dat$bhat+dat$bhat.se))+0.1)
ggplot(dat, aes(y = bhat, x = model)) +
geom_point(size = 5) +
ylab(expression(hat(beta))) + xlab("Model") +
geom_errorbar(aes(ymin=bhat-bhat.se, ymax=bhat+bhat.se), width=0.2) +
ylim(my.ylim) +
theme(text = element_text(size=14))
This graph makes more sense to me than the previous…the GLS ad OLS2 model coefficients are closer to the true value of beta hat (1.0) than the OLS1 model, indicating they have taken care of the autocorrelation evident in the OLS1.
setwd("D:/timeseries")
wood<-ts(read.csv("woodhouse.csv",header=TRUE), 1906, 2012, 1)
plot(wood)
flow<-wood[,"LeesWYflow"]
plot(flow)
MJTemp<-wood[,"MarJulT"]
plot(MJTemp)
OAPrecip<-wood[,"OctAprP"]
plot(OAPrecip)
Soil<-wood[,"novsoil"]
plot(Soil)
Basic exploratory plots of the data…I quick examination shows that stream flow and soil moisture have both decreased over time, temperatures have increased over time, and precipitation trends are not so obvious, but may appear to have a slight decrease over time…let’s explore.
ols1<- lm(flow~OAPrecip)
summary(ols1)
##
## Call:
## lm(formula = flow ~ OAPrecip)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5239145 -1691189 -301282 1702819 7394534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3037032 1274631 -2.383 0.019 *
## OAPrecip 81921 5719 14.324 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2514000 on 105 degrees of freedom
## Multiple R-squared: 0.6615, Adjusted R-squared: 0.6583
## F-statistic: 205.2 on 1 and 105 DF, p-value: < 2.2e-16
plot(ols1)
Fairly decent R-squared value and significant p-value indicate a potentially good linear relationship between stream flows and precipitation in the OLS1 model.
acf(residuals(ols1))
pacf(residuals(ols1))
This is looking like an AR(1) model to me, even though in the partial ar plot, the 2nd time lag looks like it’s nearing the cuttoff for similarity with the previous lag. In the interest of going with the simplest model, I’m defaulting to an AR(1) process.
library(nlme)
flow.gls<-gls(flow~OAPrecip, correlation = corARMA(p=2))
acf(resid(flow.gls,type="normalized"))
The acf plot shows that the gls model dealt with the issue of autocorrelation.
ols1t<- lm(flow~MJTemp)
summary(ols1t)
##
## Call:
## lm(formula = flow ~ MJTemp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8156860 -2652919 103658 2214895 9479795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47000777 4029894 11.663 < 2e-16 ***
## MJTemp -2964512 370751 -7.996 1.79e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3406000 on 105 degrees of freedom
## Multiple R-squared: 0.3785, Adjusted R-squared: 0.3725
## F-statistic: 63.94 on 1 and 105 DF, p-value: 1.786e-12
The R-squared value (0.3) indicates this relationship may not be as strong as the flow~precip…
acf(residuals(ols1t))
pacf(residuals(ols1t))
library(nlme)
flow2.gls<-gls(flow~MJTemp, correlation = corARMA(p=2))
acf(resid(flow2.gls,type="normalized"))