I chose to predict the sunspot data from 1941 and again from 1967. The
data("sunspot.year")
summary(sunspot.year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 15.60 39.00 48.61 68.90 190.20
sun.train1 <- window(sunspot.year, end = 1940)
sun.test1 <- window(sunspot.year, start = 1941)
sun.train1.ar <- ar(sun.train1)
sun.train1.ar
##
## Call:
## ar(x = sun.train1)
##
## Coefficients:
## 1 2 3 4 5 6 7 8
## 1.2144 -0.4658 -0.1255 0.1439 -0.1520 0.0954 -0.0818 0.0772
## 9
## 0.1057
##
## Order selected 9 sigma^2 estimated as 207.1
sun.train1.forecast <- predict(sun.train1.ar, n.ahead = 48)
plot(sunspot.year,col="grey",ylab="Sunspots (n)")
lines(sun.train1,col="red",lty="dotted",lwd=2)
lines(sun.test1,col="olivedrab1",lty="dotted",lwd=2)
lines(sun.train1.forecast$pred,col="olivedrab4",lwd=1.5)
test.1<-lm(sun.train1.forecast$pred~sun.test1)
summary(test.1)
##
## Call:
## lm(formula = sun.train1.forecast$pred ~ sun.test1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.6836 -6.4470 0.4537 5.7300 22.2657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.98562 2.36851 11.816 1.56e-15 ***
## sun.test1 0.22330 0.02641 8.456 6.35e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.668 on 46 degrees of freedom
## Multiple R-squared: 0.6085, Adjusted R-squared: 0.6
## F-statistic: 71.5 on 1 and 46 DF, p-value: 6.348e-11
op = par(mfrow=c(2,2))
plot(test.1)
library(qpcR)
## Warning: package 'qpcR' was built under R version 3.2.5
## Loading required package: MASS
## Loading required package: minpack.lm
## Warning: package 'minpack.lm' was built under R version 3.2.5
## Loading required package: rgl
## Warning: package 'rgl' was built under R version 3.2.5
## Loading required package: robustbase
## Warning: package 'robustbase' was built under R version 3.2.5
## Loading required package: Matrix
RMSE(test.1, which = NULL)
## [1] 9.464802
The first few prediced values seem to match. The regression was significant(p<0.001) and the R-sqrd was 0.6. The risiduals plot has a bit of an arch to it, but I do not think that it is much of a problem. The RMSE was 9.5, that is much lower than the other subset I used to train and test (see below).
sun.train2 <- window(sunspot.year, end = 1966)
sun.test2 <- window(sunspot.year, start = 1967)
sun.train2.ar <- ar(sun.train2)
sun.train2.ar
##
## Call:
## ar(x = sun.train2)
##
## Coefficients:
## 1 2 3 4 5 6 7 8
## 1.2144 -0.4612 -0.1292 0.1497 -0.1545 0.1055 -0.0629 0.0389
## 9
## 0.1464
##
## Order selected 9 sigma^2 estimated as 236.1
sun.train2.forecast <- predict(sun.train2.ar, n.ahead = 22)
plot(sunspot.year,col="grey",ylab="Sunspots (n)")
lines(sun.train2,col="red",lty="dotted",lwd=2)
lines(sun.test2,col="olivedrab1",lty="dotted",lwd=2)
lines(sun.train2.forecast$pred,col="olivedrab4",lwd=2)
test.2<-lm(sun.train2.forecast$pred~sun.test2)
summary(test.2)
##
## Call:
## lm(formula = sun.train2.forecast$pred ~ sun.test2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.089 -14.775 -8.803 14.895 43.049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.29734 8.50825 2.386 0.0271 *
## sun.test2 0.52784 0.09891 5.337 3.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.13 on 20 degrees of freedom
## Multiple R-squared: 0.5874, Adjusted R-squared: 0.5668
## F-statistic: 28.48 on 1 and 20 DF, p-value: 3.187e-05
op = par(mfrow=c(2,2))
plot(test.2)
library(qpcR)
RMSE(test.2, which = NULL)
## [1] 20.14253
To me there seems to be marginal changes between the first model I built that used data up to 1940 and this one which used data up to 1966. This one has a larger p value and the r-squred is not as strong. Additionally, the RMSE is double what it was in the first set.This seems counterintuitive, I would think that with more training data, a better regression model would be built.I think the larger sample size contributed to the lower p-value observed in my first model.
Take a look at this nifty code I found online.
library(ggplot2)
ggplotRegression <- function (fit) {
require(ggplot2)
ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
geom_point() +
stat_smooth(method = "lm", col = "red") +
labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 4),
"Intercept =",signif(fit$coef[[1]],4 ),
" Slope =",signif(fit$coef[[2]], 4),
" p =",signif(summary(fit)$coef[2,4], 4)))
}
ggplotRegression(test.1)
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous
ggplotRegression(test.2)
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous
n <-333
phi <-0.333
x <-ts(rnorm(n))
epsilon <- arima.sim(model = list(ar=phi),n = n)
y <- x + epsilon
ols1 <- lm(y~x)
summary(ols1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2497 -0.6360 0.0243 0.6903 3.5368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.12546 0.05773 2.173 0.0305 *
## x 1.02469 0.05491 18.660 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.053 on 331 degrees of freedom
## Multiple R-squared: 0.5127, Adjusted R-squared: 0.5112
## F-statistic: 348.2 on 1 and 331 DF, p-value: < 2.2e-16
acf(residuals(ols1))##looks like there is some autocorrelation going on
pacf(residuals(ols1))
ar1 <- ar(residuals(ols1))$ar
ar1
## [1] 0.3523663 0.1060402
library(nlme)
gls1 <- gls(y~x, correlation = corAR1(ar1))## Here I correct for the correlation.
## Warning in if (abs(value) >= 1) {: the condition has length > 1 and only
## the first element will be used
acf(resid(gls1, type = "normalized"))## looks good now.
ols1.summary <- summary(ols1)
gls1.summary <- summary(gls1)
ols1.summary
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2497 -0.6360 0.0243 0.6903 3.5368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.12546 0.05773 2.173 0.0305 *
## x 1.02469 0.05491 18.660 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.053 on 331 degrees of freedom
## Multiple R-squared: 0.5127, Adjusted R-squared: 0.5112
## F-statistic: 348.2 on 1 and 331 DF, p-value: < 2.2e-16
gls1.summary
## Generalized least squares fit by REML
## Model: y ~ x
## Data: NULL
## AIC BIC logLik
## 938.2783 957.2889 -464.1392
##
## Correlation Structure: AR(1)
## Formula: ~1
## Parameter estimate(s):
## Phi <NA>
## 0.4004347 0.1060402
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 0.1270118 0.08822985 1.439556 0.1509
## x 0.9995082 0.04723730 21.159302 0.0000
##
## Correlation:
## (Intr)
## x 0.009
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -3.04647162 -0.61174391 0.01774796 0.66639007 3.34623121
##
## Residual standard error: 1.055546
## Degrees of freedom: 333 total; 331 residual
bhat.ols1 <- coef(ols1)[2]
bhat.gls1 <- coef(gls1)[2]
bhat.se.ols1 <- ols1.summary$coef[[4]]
bhat.se.gls1 <- gls1.summary$tTable[2,2]
require(ggplot2)
dat <- data.frame(bhat = c(bhat.ols1,bhat.gls1),
bhat.se = c(bhat.se.ols1,bhat.se.gls1),
model = c("OLS1","GLS"))
my.ylim <- c(min(range(dat$bhat-dat$bhat.se))-0.1,
max(range(dat$bhat+dat$bhat.se))+0.1)
ggplot(dat, aes(y = bhat, x = model)) +
geom_point(size = 5) +
ylab(expression(hat(beta))) + xlab("Model") +
geom_errorbar(aes(ymin=bhat-bhat.se, ymax=bhat+bhat.se), width=0.2) +
ylim(my.ylim) +
theme(text = element_text(size=14))
If beta hat is the slope of the residuals, then I would expect the “true” value to be zero. The residuals should be evenly distributed. As for the “true” standard error, I think it is one because that is what the summary output is stating.
First I load the data and make them ts class. I first take a look at the overall pattern of the flow and precipitation data, then make some histograms and get into the regressions.
setwd("E:/spring16")
WH<-read.csv("woodhouse.csv",header=TRUE)
wh.flow<-ts(WH$LeesWYflow,start=c(1906,1),end=c(2012,1),frequency=1)
wh.precip<-ts(WH$OctAprP,start=c(1906,1),end=c(2012,1),frequency=1)
wh.temp<-ts(WH$MarJulT,start=c(1906,1),end=c(2012,1),frequency=1)
wh.soil<-ts(WH$novsoil,start=c(1906,1),end=c(2012,1),frequency=1)
layout(matrix(c(1,1,2,2),ncol=2,nrow=2,byrow=T))
plot(wh.flow)
plot(wh.precip)
summary(wh.flow)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5418000 11670000 14640000 14890000 18070000 24180000
my.xlim <- range(wh.flow)
h<-hist(wh.flow, breaks=10, col="chartreuse3", xlab="Colorado river flow",main="",xlim=my.xlim)
xfit<-seq(min(wh.flow),max(wh.flow),length=18762000)
yfit<-dnorm(xfit,mean=mean(wh.flow),sd=sd(wh.flow))
yfit <- yfit*diff(h$mids[1:2])*length(wh.flow)
lines(xfit, yfit, col="navy", lwd=2)
boxplot(wh.flow, horizontal=TRUE, outline=TRUE, axes=FALSE, ylim=my.xlim, col = "lightblue4", add = TRUE, boxwex=3)
my.xlim <- range(wh.precip)
h<-hist(wh.precip, breaks=10, col="chartreuse3", xlab="Precipitation",main="",xlim=my.xlim)
xfit<-seq(min(wh.precip),max(wh.precip),length=230)
yfit<-dnorm(xfit,mean=mean(wh.precip),sd=sd(wh.precip))
yfit <- yfit*diff(h$mids[1:2])*length(wh.precip)
lines(xfit, yfit, col="navy", lwd=2)
boxplot(wh.precip, horizontal=TRUE, outline=TRUE, axes=FALSE, ylim=my.xlim, col = "lightblue4", add = TRUE, boxwex=3)
ols.flow<-lm(wh.flow~wh.precip)
ggplotRegression(ols.flow)
acf(residuals(ols.flow))##looks like an ar1 or2##
pacf(residuals(ols.flow))## feeling more like an ar1 here.
flow.ar <- ar(residuals(ols.flow))$ar
flow.ar
## [1] 0.3659556 0.1572489
Since the ar output gave me two phi values I am assuming it means that these data follow an ar2 model. For this reason, in the gls correction, corARMA is used instead of corAR1.
library(nlme)
flow.gls<- gls(wh.flow~wh.precip, correlation = corARMA(p=2)) #give an initial value
acf(resid(flow.gls, type = "normalized"))
summary(flow.gls)
## Generalized least squares fit by REML
## Model: wh.flow ~ wh.precip
## Data: NULL
## AIC BIC logLik
## 3393.011 3406.28 -1691.505
##
## Correlation Structure: ARMA(2,0)
## Formula: ~1
## Parameter estimate(s):
## Phi1 Phi2
## 0.3927175 0.1529462
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -2109433.3 1112007.0 -1.89696 0.0606
## wh.precip 77566.2 4605.2 16.84305 0.0000
##
## Correlation:
## (Intr)
## wh.precip -0.908
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.0561293 -0.7052324 -0.1192535 0.6746901 2.9279674
##
## Residual standard error: 2541745
## Degrees of freedom: 107 total; 105 residual
plot(flow.gls)
This is pretty niffty that the flow of the Colorado river and the Precipitation from October to April create a good regression model. The ACF and PACF graphs displayed that there were correlations in the residuals of the OLS model. To correct for this we extracted the AR coeficients and used a generalised least squares method to build a model that would account for the autocorrelation in the residuals. Again we used the ACF to check for autocorrelation and were please to find none. Yes, I think a GLS approach changes my interpretation of the OLS approach.The oLS assume there is no correlation in the residuals, GlS may take that information and use it to build a better model.