Part 1–Predicting sunspot data

data("sunspot.year")
summary(sunspot.year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   15.60   39.00   48.61   68.90  190.20

This is a time series data set from 1700 to 1988. I’ll use a training set to test the prediction model.

sun.train <- window(sunspot.year, end = 1965) 
sun.test <- window(sunspot.year, start = 1965)
sun.train.ar<-ar(sun.train)

The coefficents support that this is an AR(9) model.

sun.train.forecast <- predict(sun.train.ar, n.ahead=24)
lm.sunspot<-lm(sun.train.forecast$pred~sun.test)
summary(lm.sunspot)
## 
## Call:
## lm(formula = sun.train.forecast$pred ~ sun.test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.520 -28.879   2.653  20.155  50.565 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  38.7104    10.9475   3.536  0.00186 **
## sun.test      0.3055     0.1319   2.315  0.03032 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.31 on 22 degrees of freedom
## Multiple R-squared:  0.1959, Adjusted R-squared:  0.1594 
## F-statistic:  5.36 on 1 and 22 DF,  p-value: 0.03032

The Residual SE is 7.8 and the R-squared value is 0.96, indicating that the model did a good job of predicting future sunspot values.

plot(sunspot.year,xlim=c(1700,2000),col="pink",lwd=1.5, ylab="Sunspots (n)")
lines(sun.train,col="red",lty="dotted",lwd=2)
lines(sun.test,col="blue",lty="dotted",lwd=2)
lines(sun.train.forecast$pred,col="green",lwd=1.5)

Zooming in on the forecast:

upper <- sun.train.forecast$pred + sun.train.forecast$se
lower <- sun.train.forecast$pred - sun.train.forecast$se
xx <- c(time(sun.train.forecast$pred), rev(time(sun.train.forecast$pred)))
yy <- c(upper,rev(lower))
plot(sunspot.year,xlim=c(1960, 1990),col="green",lwd=1.5, type="b",pch=20,
     ylab="Sunspots (n)")
points(sun.train.forecast$pred,col="blue",lwd=1.5,pch=20)
lines(sun.train.forecast$pred,col="blue",lwd=1.5)

It doesn’t look like the training set was an excellent predictor of the test data.

Predicting 100 years into the future?

sun.train.forecast <- predict(sun.train.ar, n.ahead = 39)
plot(sunspot.year,xlim=c(1700, 2000),col="darkgrey",lwd=1.5, 
     ylab="Sunspots (n)")
lines(sun.train,col="green",lty="dotted",lwd=2)
lines(sun.train.forecast$pred,col="blue",lty="dotted",lwd=2)
lines(sun.test,col="red",lty="dotted",lwd=2)

After a decade or so, you can definitely start to see that the forecast has an exponential decay pattern.as it loses its ‘memory’.

library(Metrics)

Determine the RMSE (root mean square error):

rmse(sun.test, sun.train.forecast$pred)
## [1] 32.08283

This dependent variable (number of sunspots per year) ranges from 0-190. The rmse value of 32.1 suggests that the model is an ok fit.the lower the value of the rmse, the better the fit of the model.

Determine the mean absolute error (MAE)

MAE <- (sum(abs(sun.test-sun.train.forecast$pred)))/length(sun.test)
MAE
## [1] 23.32539

The MAE is the difference between the forecasted value and the actual value. The MAE value of 23.32 tells us that the error we can expect from the forecast is on average 23 sunspots per year.

Part 2:Regression simulation-ols1-(with n=200, phi=0.6)

n <- 200
phi <- 0.6
x <- ts(rnorm(n))
epsilon <- arima.sim(model = list(ar=phi),n = n)
y <- x + epsilon
ols1 <- lm(y~x)
summary(ols1)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1701 -0.6363 -0.0288  0.6805  2.5508 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.16373    0.07492  -2.185     0.03 *  
## x            1.03780    0.08640  12.012   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.059 on 198 degrees of freedom
## Multiple R-squared:  0.4215, Adjusted R-squared:  0.4186 
## F-statistic: 144.3 on 1 and 198 DF,  p-value: < 2.2e-16
acf(residuals(ols1)) 

It appears that there is autocorrelation, but what type?

pacf(residuals(ols1))

Answer: looks like an AR(1) model.

ar1 <- ar(residuals(ols1))$ar
ar1
## [1] 0.4840227

The autocorrelation of 0.61 is consistent with the phi value of 0.6. A change in phi will cause a similar change in the ar value. As n increases, the value of phi will get closer and closer to the set value of phi. Also, the ‘true’ value of Beta should be 1 (for this particular model) and the ‘true’ value for the standard error should be zero.

Adjust by hand to get an Ols2 model

y.lag1 <- c(NA,y[-n]) 
head(cbind(y,y.lag1))
##               y     y.lag1
## [1,] -1.7344621         NA
## [2,]  0.4519806 -1.7344621
## [3,]  1.3934413  0.4519806
## [4,] -0.3603589  1.3934413
## [5,] -1.8236903 -0.3603589
## [6,]  0.6152799 -1.8236903
y2 <- y - ar1*y.lag1 
x.lag1 <- c(NA,x[-n]) 
x2 <- x - ar1*x.lag1 
ols2 <- lm(y2~x2)
summary(ols2)
## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.59550 -0.62021  0.04825  0.55522  2.34010 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.07412    0.06536  -1.134    0.258    
## x2           1.06066    0.06681  15.875   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.922 on 197 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5613, Adjusted R-squared:  0.559 
## F-statistic:   252 on 1 and 197 DF,  p-value: < 2.2e-16
acf(residuals(ols2)) 

The acf graph shows that an ols2 model cleans up the autocorrelation issue.

GLS model

library(nlme)
gls1 <- gls(y~x, correlation = corAR1(ar1)) 
acf(resid(gls1, type = "normalized"))

This acf shows that the autocorrelation issue is fixed using a GLS model.

Comparing the different models:

ols1.summary <- summary(ols1)
ols2.summary <- summary(ols2)
gls1.summary <- summary(gls1)
ols1.summary
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1701 -0.6363 -0.0288  0.6805  2.5508 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.16373    0.07492  -2.185     0.03 *  
## x            1.03780    0.08640  12.012   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.059 on 198 degrees of freedom
## Multiple R-squared:  0.4215, Adjusted R-squared:  0.4186 
## F-statistic: 144.3 on 1 and 198 DF,  p-value: < 2.2e-16
ols2.summary
## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.59550 -0.62021  0.04825  0.55522  2.34010 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.07412    0.06536  -1.134    0.258    
## x2           1.06066    0.06681  15.875   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.922 on 197 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5613, Adjusted R-squared:  0.559 
## F-statistic:   252 on 1 and 197 DF,  p-value: < 2.2e-16
gls1.summary
## Generalized least squares fit by REML
##   Model: y ~ x 
##   Data: NULL 
##        AIC      BIC    logLik
##   548.0863 561.2393 -270.0431
## 
## Correlation Structure: AR(1)
##  Formula: ~1 
##  Parameter estimate(s):
##       Phi 
## 0.4971654 
## 
## Coefficients:
##                  Value  Std.Error   t-value p-value
## (Intercept) -0.1644889 0.12930596 -1.272091  0.2048
## x            1.0619108 0.06660035 15.944522  0.0000
## 
##  Correlation: 
##   (Intr)
## x -0.007
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -2.98162418 -0.60115650 -0.01051812  0.63204270  2.38624303 
## 
## Residual standard error: 1.064967 
## Degrees of freedom: 200 total; 198 residual

Plot the different regression models ###bhat

bhat.ols1 <- coef(ols1)[2] 
bhat.ols2 <- coef(ols2)[2] 
bhat.gls1 <- coef(gls1)[2] 

se of bhat

bhat.se.ols1 <- ols1.summary$coef[[4]]
bhat.se.ols2 <- ols2.summary$coef[[4]]
bhat.se.gls1 <- gls1.summary$tTable[2,2]
library(ggplot2)
dat <- data.frame(bhat = c(bhat.ols1,bhat.ols2,bhat.gls1), 
                  bhat.se = c(bhat.se.ols1,bhat.se.ols2,bhat.se.gls1),
                  model = c("OLS1","OLS2","GLS"))
my.ylim <- c(min(range(dat$bhat-dat$bhat.se))-0.1, 
             max(range(dat$bhat+dat$bhat.se))+0.1)
ggplot(dat, aes(y = bhat, x = model)) + 
  geom_point(size = 5) +
  ylab(expression(hat(beta))) + xlab("Model") +
  geom_errorbar(aes(ymin=bhat-bhat.se, ymax=bhat+bhat.se), width=0.2) +
  ylim(my.ylim) + 
  theme(text = element_text(size=14))

The plot shows that the values of the coefficients and their se. As you can see the GLS and OLS2 models have higher coefficients and smaller se. I think this plot is showing that the GLS and OLS2 models are better models for reducing autocorrelation in the data, but I am confused because you’d think they should be nearer to B hat, and in this case the OLS1 model is nearer to 1.0. (?)

Regression with n=500, phi=0.7

n <- 500
phi <- 0.7
x <- ts(rnorm(n))
epsilon <- arima.sim(model = list(ar=phi),n = n)
y <- x + epsilon
ols1 <- lm(y~x)
summary(ols1)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9081 -1.0129  0.0215  0.9196  6.6195 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.04232    0.07025  -0.602    0.547    
## x            0.95404    0.06975  13.678   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.57 on 498 degrees of freedom
## Multiple R-squared:  0.2731, Adjusted R-squared:  0.2716 
## F-statistic: 187.1 on 1 and 498 DF,  p-value: < 2.2e-16
acf(residuals(ols1)) 

pacf(residuals(ols1))

This is still an AR(1) model

ar1 <- ar(residuals(ols1))$ar
ar1
## [1]  0.690506462  0.077418822 -0.015600314 -0.002030646 -0.087190648

Again, the ar of 0.73 is very similar to the value I assigned for phi (0.7).

y.lag1 <- c(NA,y[-n]) 
head(cbind(y,y.lag1))
##               y     y.lag1
## [1,]  3.2601056         NA
## [2,]  1.5685247  3.2601056
## [3,]  2.1003869  1.5685247
## [4,] -3.4627741  2.1003869
## [5,] -0.8184376 -3.4627741
## [6,] -0.4602426 -0.8184376
y2 <- y - ar1*y.lag1 
x.lag1 <- c(NA,x[-n]) 
x2 <- x - ar1*x.lag1 
ols2 <- lm(y2~x2)
summary(ols2)
## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1930 -0.9264  0.0490  0.8563  6.6784 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.06323    0.06663  -0.949    0.343    
## x2           0.97517    0.06250  15.602   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.488 on 497 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3288, Adjusted R-squared:  0.3274 
## F-statistic: 243.4 on 1 and 497 DF,  p-value: < 2.2e-16
acf(residuals(ols2)) 

gls1 <- gls(y~x, correlation = corAR1(ar1)) 
## Warning in if (abs(value) >= 1) {: the condition has length > 1 and only
## the first element will be used
acf(resid(gls1, type = "normalized"))

bhat.ols1 <- coef(ols1)[2] 
bhat.ols2 <- coef(ols2)[2] 
bhat.gls1 <- coef(gls1)[2] 
bhat.se.ols1 <- ols1.summary$coef[[4]]
bhat.se.ols2 <- ols2.summary$coef[[4]]
bhat.se.gls1 <- gls1.summary$tTable[2,2]
dat <- data.frame(bhat = c(bhat.ols1,bhat.ols2,bhat.gls1), 
                  bhat.se = c(bhat.se.ols1,bhat.se.ols2,bhat.se.gls1),
                  model = c("OLS1","OLS2","GLS"))
my.ylim <- c(min(range(dat$bhat-dat$bhat.se))-0.1, 
             max(range(dat$bhat+dat$bhat.se))+0.1)
ggplot(dat, aes(y = bhat, x = model)) + 
  geom_point(size = 5) +
  ylab(expression(hat(beta))) + xlab("Model") +
  geom_errorbar(aes(ymin=bhat-bhat.se, ymax=bhat+bhat.se), width=0.2) +
  ylim(my.ylim) + 
  theme(text = element_text(size=14))

This graph makes more sense to me than the previous.the GLS ad OLS2 model coefficients are closer to the true value of beta hat (1.0) than the OLS1 model, indicating they have taken care of the autocorrelation evident in the OLS1.

Part 3–Regressions with real data

wood<-ts(read.csv("woodhouse.csv",header=TRUE), 1906, 2012, 1)
plot(wood)

flow<-wood[,"LeesWYflow"]
plot(flow)

MJTemp<-wood[,"MarJulT"]
plot(MJTemp)

OAPrecip<-wood[,"OctAprP"]
plot(OAPrecip)

Soil<-wood[,"novsoil"]
plot(Soil)

Basic exploratory plots of the data.a quick examination shows that stream flow and soil moisture have both decreased over time, temperatures have increased over time, and precipitation trends are not so obvious, but may appear to have a slight decrease over time.let’s explore.

ols1<- lm(flow~OAPrecip)
summary(ols1)
## 
## Call:
## lm(formula = flow ~ OAPrecip)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5239145 -1691189  -301282  1702819  7394534 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3037032    1274631  -2.383    0.019 *  
## OAPrecip       81921       5719  14.324   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2514000 on 105 degrees of freedom
## Multiple R-squared:  0.6615, Adjusted R-squared:  0.6583 
## F-statistic: 205.2 on 1 and 105 DF,  p-value: < 2.2e-16
plot(ols1)

Fairly decent R-squared value and significant p-value indicate a potentially good linear relationship between stream flows and precipitation in the OLS1 model.

acf(residuals(ols1))

pacf(residuals(ols1))

This is looking like an AR(1) model to me, even though in the partial ar plot, the 2nd time lag looks like it’s nearing the cuttoff for similarity with the previous lag. In the interest of going with the simplest model, I’m defaulting to an AR(1) process.

flow.gls<-gls(flow~OAPrecip, correlation = corARMA(p=2))
acf(resid(flow.gls,type="normalized"))

The acf plot shows that the gls model dealt with the issue of autocorrelation.

Now let’s compare flows with temperatures

ols1t<- lm(flow~MJTemp)
summary(ols1t)
## 
## Call:
## lm(formula = flow ~ MJTemp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8156860 -2652919   103658  2214895  9479795 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 47000777    4029894  11.663  < 2e-16 ***
## MJTemp      -2964512     370751  -7.996 1.79e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3406000 on 105 degrees of freedom
## Multiple R-squared:  0.3785, Adjusted R-squared:  0.3725 
## F-statistic: 63.94 on 1 and 105 DF,  p-value: 1.786e-12

The R-squared value (0.3) indicates this relationship may not be as strong as the flow~precip.

acf(residuals(ols1t))

pacf(residuals(ols1t))

flow2.gls<-gls(flow~MJTemp, correlation = corARMA(p=2))
acf(resid(flow2.gls,type="normalized"))