Part 1 - Sunspot predictions

I chose to predict the sunspot data from 1941 and again from 1967. The

data("sunspot.year")
summary(sunspot.year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   15.60   39.00   48.61   68.90  190.20
sun.train1 <- window(sunspot.year, end = 1940)  
sun.test1 <- window(sunspot.year, start = 1941)
sun.train1.ar <- ar(sun.train1)
sun.train1.ar
## 
## Call:
## ar(x = sun.train1)
## 
## Coefficients:
##       1        2        3        4        5        6        7        8  
##  1.2144  -0.4658  -0.1255   0.1439  -0.1520   0.0954  -0.0818   0.0772  
##       9  
##  0.1057  
## 
## Order selected 9  sigma^2 estimated as  207.1
sun.train1.forecast <- predict(sun.train1.ar, n.ahead = 48)
plot(sunspot.year,col="grey",ylab="Sunspots (n)")
lines(sun.train1,col="red",lty="dotted",lwd=2)
lines(sun.test1,col="olivedrab1",lty="dotted",lwd=2)
lines(sun.train1.forecast$pred,col="olivedrab4",lwd=1.5)

test.1<-lm(sun.train1.forecast$pred~sun.test1)
summary(test.1)
## 
## Call:
## lm(formula = sun.train1.forecast$pred ~ sun.test1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.6836  -6.4470   0.4537   5.7300  22.2657 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.98562    2.36851  11.816 1.56e-15 ***
## sun.test1    0.22330    0.02641   8.456 6.35e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.668 on 46 degrees of freedom
## Multiple R-squared:  0.6085, Adjusted R-squared:    0.6 
## F-statistic:  71.5 on 1 and 46 DF,  p-value: 6.348e-11
op = par(mfrow=c(2,2))
plot(test.1)
library(qpcR)
## Warning: package 'qpcR' was built under R version 3.2.5
## Loading required package: MASS
## Loading required package: minpack.lm
## Warning: package 'minpack.lm' was built under R version 3.2.5
## Loading required package: rgl
## Warning: package 'rgl' was built under R version 3.2.5
## Loading required package: robustbase
## Warning: package 'robustbase' was built under R version 3.2.5
## Loading required package: Matrix

RMSE(test.1, which = NULL)
## [1] 9.464802

The first few prediced values seem to match. The regression was significant(p<0.001) and the R-sqrd was 0.6. The risiduals plot has a bit of an arch to it, but I do not think that it is much of a problem. The RMSE was 9.5, that is much lower than the other subset I used to train and test (see below).

sun.train2 <- window(sunspot.year, end = 1966)  
sun.test2 <- window(sunspot.year, start = 1967)
sun.train2.ar <- ar(sun.train2)
sun.train2.ar
## 
## Call:
## ar(x = sun.train2)
## 
## Coefficients:
##       1        2        3        4        5        6        7        8  
##  1.2144  -0.4612  -0.1292   0.1497  -0.1545   0.1055  -0.0629   0.0389  
##       9  
##  0.1464  
## 
## Order selected 9  sigma^2 estimated as  236.1
sun.train2.forecast <- predict(sun.train2.ar, n.ahead = 22)
plot(sunspot.year,col="grey",ylab="Sunspots (n)")
lines(sun.train2,col="red",lty="dotted",lwd=2)
lines(sun.test2,col="olivedrab1",lty="dotted",lwd=2)
lines(sun.train2.forecast$pred,col="olivedrab4",lwd=2)

test.2<-lm(sun.train2.forecast$pred~sun.test2)
summary(test.2)
## 
## Call:
## lm(formula = sun.train2.forecast$pred ~ sun.test2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.089 -14.775  -8.803  14.895  43.049 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.29734    8.50825   2.386   0.0271 *  
## sun.test2    0.52784    0.09891   5.337 3.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.13 on 20 degrees of freedom
## Multiple R-squared:  0.5874, Adjusted R-squared:  0.5668 
## F-statistic: 28.48 on 1 and 20 DF,  p-value: 3.187e-05
op = par(mfrow=c(2,2))
plot(test.2)

library(qpcR)
RMSE(test.2, which = NULL)
## [1] 20.14253

To me there seems to be marginal changes between the first model I built that used data up to 1940 and this one which used data up to 1966. This one has a larger p value and the r-squred is not as strong. Additionally, the RMSE is double what it was in the first set.This seems counterintuitive, I would think that with more training data, a better regression model would be built.I think the larger sample size contributed to the lower p-value observed in my first model.

Take a look at this nifty code I found online.

library(ggplot2)
ggplotRegression <- function (fit) {
  
  require(ggplot2)
  
  ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) + 
    geom_point() +
    stat_smooth(method = "lm", col = "red") +
    labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 4),
                       "Intercept =",signif(fit$coef[[1]],4 ),
                       " Slope =",signif(fit$coef[[2]], 4),
                       " p =",signif(summary(fit)$coef[2,4], 4)))
}
ggplotRegression(test.1)
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous

ggplotRegression(test.2)
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous

Part 2 - Regression Simulation

n <-333 
phi <-0.333
x <-ts(rnorm(n))
epsilon <- arima.sim(model = list(ar=phi),n = n)
y <- x + epsilon
ols1 <- lm(y~x)
summary(ols1)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2497 -0.6360  0.0243  0.6903  3.5368 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.12546    0.05773   2.173   0.0305 *  
## x            1.02469    0.05491  18.660   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.053 on 331 degrees of freedom
## Multiple R-squared:  0.5127, Adjusted R-squared:  0.5112 
## F-statistic: 348.2 on 1 and 331 DF,  p-value: < 2.2e-16
acf(residuals(ols1))##looks like there is some autocorrelation going on

pacf(residuals(ols1))

ar1 <- ar(residuals(ols1))$ar
ar1
## [1] 0.3523663 0.1060402
library(nlme)
gls1 <- gls(y~x, correlation = corAR1(ar1))## Here I correct for the correlation.
## Warning in if (abs(value) >= 1) {: the condition has length > 1 and only
## the first element will be used
acf(resid(gls1, type = "normalized"))## looks good now. 

ols1.summary <- summary(ols1)
gls1.summary <- summary(gls1)
ols1.summary
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2497 -0.6360  0.0243  0.6903  3.5368 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.12546    0.05773   2.173   0.0305 *  
## x            1.02469    0.05491  18.660   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.053 on 331 degrees of freedom
## Multiple R-squared:  0.5127, Adjusted R-squared:  0.5112 
## F-statistic: 348.2 on 1 and 331 DF,  p-value: < 2.2e-16
gls1.summary
## Generalized least squares fit by REML
##   Model: y ~ x 
##   Data: NULL 
##        AIC      BIC    logLik
##   938.2783 957.2889 -464.1392
## 
## Correlation Structure: AR(1)
##  Formula: ~1 
##  Parameter estimate(s):
##       Phi      <NA> 
## 0.4004347 0.1060402 
## 
## Coefficients:
##                 Value  Std.Error   t-value p-value
## (Intercept) 0.1270118 0.08822985  1.439556  0.1509
## x           0.9995082 0.04723730 21.159302  0.0000
## 
##  Correlation: 
##   (Intr)
## x 0.009 
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -3.04647162 -0.61174391  0.01774796  0.66639007  3.34623121 
## 
## Residual standard error: 1.055546 
## Degrees of freedom: 333 total; 331 residual
bhat.ols1 <- coef(ols1)[2]
bhat.gls1 <- coef(gls1)[2]
bhat.se.ols1 <- ols1.summary$coef[[4]]
bhat.se.gls1 <- gls1.summary$tTable[2,2]
require(ggplot2)
dat <- data.frame(bhat = c(bhat.ols1,bhat.gls1), 
                  bhat.se = c(bhat.se.ols1,bhat.se.gls1),
                  model = c("OLS1","GLS"))
my.ylim <- c(min(range(dat$bhat-dat$bhat.se))-0.1, 
             max(range(dat$bhat+dat$bhat.se))+0.1)
ggplot(dat, aes(y = bhat, x = model)) + 
  geom_point(size = 5) +
  ylab(expression(hat(beta))) + xlab("Model") +
  geom_errorbar(aes(ymin=bhat-bhat.se, ymax=bhat+bhat.se), width=0.2) +
  ylim(my.ylim) + 
  theme(text = element_text(size=14))

If beta hat is the slope of the residuals, then I would expect the “true” value to be zero. The residuals should be evenly distributed. As for the “true” standard error, I think it is one because that is what the summary output is stating.

Part 3 - Woodhouse data analysis

First I load the data and make them ts class. I first take a look at the overall pattern of the flow and precipitation data, then make some histograms and get into the regressions.

setwd("E:/spring16")
WH<-read.csv("woodhouse.csv",header=TRUE)
wh.flow<-ts(WH$LeesWYflow,start=c(1906,1),end=c(2012,1),frequency=1)
wh.precip<-ts(WH$OctAprP,start=c(1906,1),end=c(2012,1),frequency=1)
wh.temp<-ts(WH$MarJulT,start=c(1906,1),end=c(2012,1),frequency=1)
wh.soil<-ts(WH$novsoil,start=c(1906,1),end=c(2012,1),frequency=1)
layout(matrix(c(1,1,2,2),ncol=2,nrow=2,byrow=T))
plot(wh.flow)
plot(wh.precip)

summary(wh.flow)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  5418000 11670000 14640000 14890000 18070000 24180000
my.xlim <- range(wh.flow)
h<-hist(wh.flow, breaks=10, col="chartreuse3", xlab="Colorado river flow",main="",xlim=my.xlim)
xfit<-seq(min(wh.flow),max(wh.flow),length=18762000) 
yfit<-dnorm(xfit,mean=mean(wh.flow),sd=sd(wh.flow)) 
yfit <- yfit*diff(h$mids[1:2])*length(wh.flow)
lines(xfit, yfit, col="navy", lwd=2)
boxplot(wh.flow, horizontal=TRUE,  outline=TRUE,  axes=FALSE, ylim=my.xlim, col = "lightblue4", add = TRUE, boxwex=3)


my.xlim <- range(wh.precip)
h<-hist(wh.precip, breaks=10, col="chartreuse3", xlab="Precipitation",main="",xlim=my.xlim)
xfit<-seq(min(wh.precip),max(wh.precip),length=230) 
yfit<-dnorm(xfit,mean=mean(wh.precip),sd=sd(wh.precip)) 
yfit <- yfit*diff(h$mids[1:2])*length(wh.precip)
lines(xfit, yfit, col="navy", lwd=2)
boxplot(wh.precip, horizontal=TRUE,  outline=TRUE,  axes=FALSE, ylim=my.xlim, col = "lightblue4", add = TRUE, boxwex=3)

ols.flow<-lm(wh.flow~wh.precip)
ggplotRegression(ols.flow)

acf(residuals(ols.flow))##looks like an ar1 or2##
pacf(residuals(ols.flow))## feeling more like an ar1 here.

flow.ar <- ar(residuals(ols.flow))$ar
flow.ar
## [1] 0.3659556 0.1572489

Since the ar output gave me two phi values I am assuming it means that these data follow an ar2 model. For this reason, in the gls correction, corARMA is used instead of corAR1.

library(nlme)
flow.gls<- gls(wh.flow~wh.precip, correlation = corARMA(p=2)) #give an initial value 
acf(resid(flow.gls, type = "normalized"))

summary(flow.gls)
## Generalized least squares fit by REML
##   Model: wh.flow ~ wh.precip 
##   Data: NULL 
##        AIC     BIC    logLik
##   3393.011 3406.28 -1691.505
## 
## Correlation Structure: ARMA(2,0)
##  Formula: ~1 
##  Parameter estimate(s):
##      Phi1      Phi2 
## 0.3927175 0.1529462 
## 
## Coefficients:
##                  Value Std.Error  t-value p-value
## (Intercept) -2109433.3 1112007.0 -1.89696  0.0606
## wh.precip      77566.2    4605.2 16.84305  0.0000
## 
##  Correlation: 
##           (Intr)
## wh.precip -0.908
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -2.0561293 -0.7052324 -0.1192535  0.6746901  2.9279674 
## 
## Residual standard error: 2541745 
## Degrees of freedom: 107 total; 105 residual
plot(flow.gls)

This is pretty niffty that the flow of the Colorado river and the Precipitation from October to April create a good regression model. The ACF and PACF graphs displayed that there were correlations in the residuals of the OLS model. To correct for this we extracted the AR coeficients and used a generalised least squares method to build a model that would account for the autocorrelation in the residuals. Again we used the ACF to check for autocorrelation and were please to find none. Yes, I think a GLS approach changes my interpretation of the OLS approach.The oLS assume there is no correlation in the residuals, GlS may take that information and use it to build a better model.