Week03

data("sunspot.year")

summary(sunspot.year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   15.60   39.00   48.61   68.90  190.20

This is a time series data set from 1700 to 1988.

sun.train <- window(sunspot.year, end = 1965) 
sun.test <- window(sunspot.year, start = 1965)
sun.train.ar<-ar(sun.train)

The coefficents support that this is an AR(9) model…

sun.train.forecast <- predict(sun.train.ar, n.ahead=24)

lm.sunspot<-lm(sun.train.forecast$pred~sun.test)
summary(lm.sunspot)

## 
## Call:
## lm(formula = sun.train.forecast$pred ~ sun.test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.520 -28.879   2.653  20.155  50.565 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  38.7104    10.9475   3.536  0.00186 **
## sun.test      0.3055     0.1319   2.315  0.03032 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.31 on 22 degrees of freedom
## Multiple R-squared:  0.1959, Adjusted R-squared:  0.1594 
## F-statistic:  5.36 on 1 and 22 DF,  p-value: 0.03032

The Residual SE is 7.8 and the R-squared value is 0.96, indicating that the model did a good job of predicting future sunspot values.

plot(sunspot.year,xlim=c(1700,2000),col="pink",lwd=1.5, ylab="Sunspots (n)")
lines(sun.train,col="red",lty="dotted",lwd=2)
lines(sun.test,col="blue",lty="dotted",lwd=2)
lines(sun.train.forecast$pred,col="green",lwd=1.5)

Zooming in on the forecast:

upper <- sun.train.forecast$pred + sun.train.forecast$se
lower <- sun.train.forecast$pred - sun.train.forecast$se
xx <- c(time(sun.train.forecast$pred), rev(time(sun.train.forecast$pred)))
yy <- c(upper,rev(lower))
plot(sunspot.year,xlim=c(1960, 1990),col="green",lwd=1.5, type="b",pch=20,
     ylab="Sunspots (n)")
points(sun.train.forecast$pred,col="blue",lwd=1.5,pch=20)
lines(sun.train.forecast$pred,col="blue",lwd=1.5)

It doesn’t look like the training set was an excellent predictor of the test data.

Predicting 100 years into the future?

sun.train.forecast <- predict(sun.train.ar, n.ahead = 39)
plot(sunspot.year,xlim=c(1700, 2000),col="darkgrey",lwd=1.5, 
     ylab="Sunspots (n)")
lines(sun.train,col="green",lty="dotted",lwd=2)
lines(sun.train.forecast$pred,col="blue",lty="dotted",lwd=2)
lines(sun.test,col="red",lty="dotted",lwd=2)

After a decade or so, you can definitely start to see that the forecast has an exponential decay pattern…as it loses its ‘memory’.

Determine the RMSE (root mean square error):

library(Metrics)

## Warning: package 'Metrics' was built under R version 3.2.3

rmse(sun.test, sun.train.forecast$pred)

## [1] 32.08283

This dependent variable (number of sunspots per year) ranges from 0-190. The rmse value of 32.1 suggests that the model is an ok fit…the lower the value of the rmse, the better the fit of the model.

Determine the mean absolute error (MAE)

MAE <- (sum(abs(sun.test-sun.train.forecast$pred)))/length(sun.test)
MAE

## [1] 23.32539

The MAE is the difference between the forecasted value and the actual value. The MAE value of 31.12 tells us that the error we can expect from the forecast is on average 31 sunspots per year.

Part 2:Regression simulation–ols1–(with n=200, phi=0.6)

n <- 200
phi <- 0.6
x <- ts(rnorm(n))
epsilon <- arima.sim(model = list(ar=phi),n = n)
y <- x + epsilon
ols1 <- lm(y~x)
summary(ols1)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3877 -0.7528  0.0580  0.6755  3.2550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.007793   0.084510   0.092    0.927    
## x           0.996446   0.092616  10.759   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.191 on 198 degrees of freedom
## Multiple R-squared:  0.3689, Adjusted R-squared:  0.3657 
## F-statistic: 115.8 on 1 and 198 DF,  p-value: < 2.2e-16

acf(residuals(ols1))

It appears that there is autocorrelation, but what type?

pacf(residuals(ols1))

Answer: looks like an AR(1) model.

ar1 <- ar(residuals(ols1))$ar
ar1

## [1] 0.5663379

The autocorrelation of 0.61 is consistent with the phi value of 0.6. A change in phi will cause a similar change in the ar value. As n increases, the value of phi will get closer and closer to the set value of phi. Also, the ‘true’ value of Beta should be 1 (for this particular model) and the ‘true’ value for the standard error should be zero.

Adjust by hand to get an Ols2 model

y.lag1 <- c(NA,y[-n]) 
head(cbind(y,y.lag1))

##                y     y.lag1
## [1,] -1.42192984         NA
## [2,] -1.09922550 -1.4219298
## [3,] -1.53350575 -1.0992255
## [4,]  0.17265622 -1.5335058
## [5,]  2.84391204  0.1726562
## [6,]  0.08474672  2.8439120

y2 <- y - ar1*y.lag1 
x.lag1 <- c(NA,x[-n]) 
x2 <- x - ar1*x.lag1 
ols2 <- lm(y2~x2)
summary(ols2)

## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4675 -0.6621 -0.0507  0.7297  3.3570 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.00282    0.06968    0.04    0.968    
## x2           1.00605    0.06685   15.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9823 on 197 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5348, Adjusted R-squared:  0.5324 
## F-statistic: 226.5 on 1 and 197 DF,  p-value: < 2.2e-16

acf(residuals(ols2))

The acf graph shows that an ols2 model cleans up the autocorrelation issue.

GLS model

library(nlme)

## Warning: package 'nlme' was built under R version 3.2.4

gls1 <- gls(y~x, correlation = corAR1(ar1)) 
acf(resid(gls1, type = "normalized"))

This acf shows that the autocorrelation issue is fixed using a GLS model.

Comparing the different models:

ols1.summary <- summary(ols1)
ols2.summary <- summary(ols2)
gls1.summary <- summary(gls1)
ols1.summary

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3877 -0.7528  0.0580  0.6755  3.2550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.007793   0.084510   0.092    0.927    
## x           0.996446   0.092616  10.759   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.191 on 198 degrees of freedom
## Multiple R-squared:  0.3689, Adjusted R-squared:  0.3657 
## F-statistic: 115.8 on 1 and 198 DF,  p-value: < 2.2e-16

ols2.summary

## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4675 -0.6621 -0.0507  0.7297  3.3570 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.00282    0.06968    0.04    0.968    
## x2           1.00605    0.06685   15.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9823 on 197 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5348, Adjusted R-squared:  0.5324 
## F-statistic: 226.5 on 1 and 197 DF,  p-value: < 2.2e-16

gls1.summary

## Generalized least squares fit by REML
##   Model: y ~ x 
##   Data: NULL 
##        AIC      BIC    logLik
##   571.4533 584.6064 -281.7266
## 
## Correlation Structure: AR(1)
##  Formula: ~1 
##  Parameter estimate(s):
##       Phi 
## 0.5732285 
## 
## Coefficients:
##                  Value  Std.Error   t-value p-value
## (Intercept) -0.0042012 0.16142668 -0.026025  0.9793
## x            1.0077771 0.06643902 15.168452  0.0000
## 
##  Correlation: 
##   (Intr)
## x -0.029
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -2.82570791 -0.62021318  0.04781553  0.57368962  2.73923863 
## 
## Residual standard error: 1.196497 
## Degrees of freedom: 200 total; 198 residual

Plot the different regression models

#bhat
bhat.ols1 <- coef(ols1)[2] 
bhat.ols2 <- coef(ols2)[2] 
bhat.gls1 <- coef(gls1)[2] 
#se of bhat
bhat.se.ols1 <- ols1.summary$coef[[4]]
bhat.se.ols2 <- ols2.summary$coef[[4]]
bhat.se.gls1 <- gls1.summary$tTable[2,2]
require(ggplot2)

## Loading required package: ggplot2

dat <- data.frame(bhat = c(bhat.ols1,bhat.ols2,bhat.gls1), 
                  bhat.se = c(bhat.se.ols1,bhat.se.ols2,bhat.se.gls1),
                  model = c("OLS1","OLS2","GLS"))
my.ylim <- c(min(range(dat$bhat-dat$bhat.se))-0.1, 
             max(range(dat$bhat+dat$bhat.se))+0.1)
ggplot(dat, aes(y = bhat, x = model)) + 
  geom_point(size = 5) +
  ylab(expression(hat(beta))) + xlab("Model") +
  geom_errorbar(aes(ymin=bhat-bhat.se, ymax=bhat+bhat.se), width=0.2) +
  ylim(my.ylim) + 
  theme(text = element_text(size=14))

The plot shows that the values of the coefficients and their se. As you can see the GLS and OLS2 models have higher coefficients and smaller se. I think this plot is showing that the GLS and OLS2 models are better models for reducing autocorrelation in the data, but I am confused because you’d think they should be nearer to B hat, and in this case the OLS1 model is nearer to 1.0. (?)

Regression with n=500, phi=0.7

n <- 500
phi <- 0.7
x <- ts(rnorm(n))
epsilon <- arima.sim(model = list(ar=phi),n = n)
y <- x + epsilon

ols1 <- lm(y~x)
summary(ols1)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5219 -0.8991 -0.0453  1.0287  4.1267 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.10772    0.06223  -1.731   0.0841 .  
## x            1.01669    0.06303  16.131   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.391 on 498 degrees of freedom
## Multiple R-squared:  0.3432, Adjusted R-squared:  0.3419 
## F-statistic: 260.2 on 1 and 498 DF,  p-value: < 2.2e-16

acf(residuals(ols1))

pacf(residuals(ols1))

This is still an AR(1) model

ar1 <- ar(residuals(ols1))$ar
ar1

## [1]  0.64626869  0.05950728  0.09435704 -0.08194842

Again, the ar of 0.73 is very similar to the value I assigned for phi (0.7).

y.lag1 <- c(NA,y[-n]) 
head(cbind(y,y.lag1))

##               y     y.lag1
## [1,]  0.3884819         NA
## [2,]  0.3729465  0.3884819
## [3,] -3.7644740  0.3729465
## [4,] -1.1562496 -3.7644740
## [5,] -2.6959352 -1.1562496
## [6,] -1.7783640 -2.6959352

y2 <- y - ar1*y.lag1 
x.lag1 <- c(NA,x[-n]) 
x2 <- x - ar1*x.lag1

ols2 <- lm(y2~x2)
summary(ols2)

## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2761 -0.8527 -0.0195  0.9669  3.8734 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.09081    0.05771  -1.574    0.116    
## x2           1.00315    0.05409  18.546   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.289 on 497 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.409,  Adjusted R-squared:  0.4078 
## F-statistic: 343.9 on 1 and 497 DF,  p-value: < 2.2e-16

acf(residuals(ols2))

gls1 <- gls(y~x, correlation = corAR1(ar1))

## Warning in if (abs(value) >= 1) {: the condition has length > 1 and only
## the first element will be used

acf(resid(gls1, type = "normalized"))

bhat.ols1 <- coef(ols1)[2] 
bhat.ols2 <- coef(ols2)[2] 
bhat.gls1 <- coef(gls1)[2] 

bhat.se.ols1 <- ols1.summary$coef[[4]]
bhat.se.ols2 <- ols2.summary$coef[[4]]
bhat.se.gls1 <- gls1.summary$tTable[2,2]

require(ggplot2)
dat <- data.frame(bhat = c(bhat.ols1,bhat.ols2,bhat.gls1), 
                  bhat.se = c(bhat.se.ols1,bhat.se.ols2,bhat.se.gls1),
                  model = c("OLS1","OLS2","GLS"))
my.ylim <- c(min(range(dat$bhat-dat$bhat.se))-0.1, 
             max(range(dat$bhat+dat$bhat.se))+0.1)
ggplot(dat, aes(y = bhat, x = model)) + 
  geom_point(size = 5) +
  ylab(expression(hat(beta))) + xlab("Model") +
  geom_errorbar(aes(ymin=bhat-bhat.se, ymax=bhat+bhat.se), width=0.2) +
  ylim(my.ylim) + 
  theme(text = element_text(size=14))

This graph makes more sense to me than the previous…the GLS ad OLS2 model coefficients are closer to the true value of beta hat (1.0) than the OLS1 model, indicating they have taken care of the autocorrelation evident in the OLS1.

Regressions for real

setwd("D:/timeseries")
wood<-ts(read.csv("woodhouse.csv",header=TRUE), 1906, 2012, 1)

plot(wood)

flow<-wood[,"LeesWYflow"]
plot(flow)

MJTemp<-wood[,"MarJulT"]
plot(MJTemp)

OAPrecip<-wood[,"OctAprP"]
plot(OAPrecip)

Soil<-wood[,"novsoil"]
plot(Soil)

Basic exploratory plots of the data…I quick examination shows that stream flow and soil moisture have both decreased over time, temperatures have increased over time, and precipitation trends are not so obvious, but may appear to have a slight decrease over time…let’s explore.

ols1<- lm(flow~OAPrecip)
summary(ols1)

## 
## Call:
## lm(formula = flow ~ OAPrecip)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5239145 -1691189  -301282  1702819  7394534 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3037032    1274631  -2.383    0.019 *  
## OAPrecip       81921       5719  14.324   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2514000 on 105 degrees of freedom
## Multiple R-squared:  0.6615, Adjusted R-squared:  0.6583 
## F-statistic: 205.2 on 1 and 105 DF,  p-value: < 2.2e-16

plot(ols1)

Fairly decent R-squared value and significant p-value indicate a potentially good linear relationship between stream flows and precipitation in the OLS1 model.

acf(residuals(ols1))

pacf(residuals(ols1))

This is looking like an AR(1) model to me, even though in the partial ar plot, the 2nd time lag looks like it’s nearing the cuttoff for similarity with the previous lag. In the interest of going with the simplest model, I’m defaulting to an AR(1) process.

library(nlme)
flow.gls<-gls(flow~OAPrecip, correlation = corARMA(p=2))
acf(resid(flow.gls,type="normalized"))

The acf plot shows that the gls model dealt with the issue of autocorrelation.

Now let’s compare flows with temperatures

ols1t<- lm(flow~MJTemp)
summary(ols1t)

## 
## Call:
## lm(formula = flow ~ MJTemp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8156860 -2652919   103658  2214895  9479795 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47000777    4029894  11.663  < 2e-16 ***
## MJTemp      -2964512     370751  -7.996 1.79e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3406000 on 105 degrees of freedom
## Multiple R-squared:  0.3785, Adjusted R-squared:  0.3725 
## F-statistic: 63.94 on 1 and 105 DF,  p-value: 1.786e-12

The R-squared value (0.3) indicates this relationship may not be as strong as the flow~precip…

acf(residuals(ols1t))

pacf(residuals(ols1t))

library(nlme)
flow2.gls<-gls(flow~MJTemp, correlation = corARMA(p=2))
acf(resid(flow2.gls,type="normalized"))