library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nlme)
## 
## Attaching package: 'nlme'
## 
## The following object is masked from 'package:dplyr':
## 
##     collapse
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## 
## Attaching package: 'forecast'
## 
## The following object is masked from 'package:nlme':
## 
##     getResponse
library(ggplot2)
library(shiny)

Big Idea

OLS regression can be problematic with time series because the assumption of independent errors is easily violated. But there are other approaches, such as generalized least squares (GLS), that handle this.

4. Your Work

4.1 What Does Generalized Mean?

Below I’m going to ask you to compare GLS and OLS using some real data. But first, write a paragraph that explains the difference between OLS and GLS. What is a generalized model?

A generalized linear model differs mainly in that it employs a link function connecting the mean of the response to the linear predictor, which lets us start with a non-normal response distribution without violating the assumptions of an ordinary linear model. The choice of link function depends on the distribution of our data. A generalized linear model is also fit by maximum likelihood estimation rather than ordinary least squares. In the past this could be computationally intensive, but as computing power has grown, MLE has become trivial to perform.
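For example, a count response might use a Poisson GLM with a log link. Here is a minimal sketch with made-up data (every name in it is hypothetical, just for illustration):

set.seed(1)
x_demo <- rnorm(50)
counts <- rpois(50, lambda = exp(0.3 + 0.7 * x_demo)) # true model lives on the log scale
glm_demo <- glm(counts ~ x_demo, family = poisson(link = "log")) # fit by MLE
coef(glm_demo) # estimates should land near 0.3 and 0.7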

It is my impression that for our time series data it is generalized least squares (GLS), rather than a GLM, that accounts for non-independent residuals: GLS generalizes OLS by allowing the errors to be correlated or to have unequal variances. In this tutorial, we explore the autocorrelation of the residuals, find a model to fit them, and then “update” our model to account for it. I do not fully understand how this is done, but I know I wish I did, and that it involves weighting by the error covariance matrix (not matrix convolution).
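To make that concrete: the GLS estimator is beta-hat = (X' Sigma^-1 X)^-1 X' Sigma^-1 y, where Sigma is the covariance matrix of the errors, and OLS is the special case Sigma = sigma^2 * I. Below is a minimal hand-rolled sketch for AR(1) errors, with made-up data and names; in practice gls() must also estimate phi, which it does by maximum likelihood, but the weighting idea is the same.

# For a stationary AR(1), cor(e_i, e_j) = phi^|i - j|; the overall error
# variance scales Sigma but cancels out of the estimator.
set.seed(1)
n_demo <- 100; phi_demo <- 0.8
x_demo <- rnorm(n_demo)
y_demo <- 0.5 * x_demo + as.numeric(arima.sim(list(ar = phi_demo), n_demo))
X <- cbind(1, x_demo)                                 # design matrix with intercept
Sigma <- phi_demo^abs(outer(1:n_demo, 1:n_demo, "-")) # AR(1) correlation matrix
beta_gls <- solve(t(X) %*% solve(Sigma) %*% X, t(X) %*% solve(Sigma) %*% y_demo)
beta_gls                  # weighted (GLS) estimate
coef(lm(y_demo ~ x_demo)) # OLS for comparison (Sigma = identity)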

4.2 Regression Simulation

Now, revisit the code above and adjust n, B1, and phi to see how the results of an OLS model change. Is OLS or GLS better at capturing the true value of B1 when there are non-independent errors? Or is OLS robust to moderate autocorrelation? You can do this in a full-blown simulation, of course (see the sketch at the end of this section), but you can also just mess around with the values a bit and try to get a feel for it.

shinyAppDir(appDir = 'GLM_demo') # interactive demo; run the document locally to launch it
set.seed(47) # for reproducibility
n <- 100
phi <- 0.8 # AR(1) coefficient for the errors
x <- ts(rnorm(n))
epsilon <- arima.sim(model = list(ar = phi), n = n) # autocorrelated errors
epsilon <- epsilon - mean(epsilon) # demean
B0 <- 0 # intercept
B1 <- 0.5 # slope
y <- B0 + B1*x + epsilon
# step 1 - fit the model, ignoring the autocorrelation for now
gls1 <- gls(y~x) 
# step 2 - inspect the residuals
Acf(residuals(gls1)) # problem with independence

gls1
## Generalized least squares fit by REML
##   Model: y ~ x 
##   Data: NULL 
##   Log-restricted-likelihood: -194.9446
## 
## Coefficients:
## (Intercept)           x 
##  0.01618039  0.18575084 
## 
## Degrees of freedom: 100 total; 98 residual
## Residual standard error: 1.687976
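An aside worth verifying: with no correlation argument, gls() reduces to OLS, so these point estimates match an ordinary lm() fit. Notice how far the slope estimate (about 0.19) is from the true B1 = 0.5.

coef(lm(y ~ x)) # identical point estimates to gls1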
# step 3 - update the model with an appropriate correlation structure
cs1 <- corARMA(p=1)
gls2 <- update(gls1,correlation=cs1)
gls2
## Generalized least squares fit by REML
##   Model: y ~ x 
##   Data: NULL 
##   Log-restricted-likelihood: -149.63
## 
## Coefficients:
## (Intercept)           x 
## -0.06975389  0.44966531 
## 
## Correlation Structure: AR(1)
##  Formula: ~1 
##  Parameter estimate(s):
##       Phi 
## 0.7946776 
## Degrees of freedom: 100 total; 98 residual
## Residual standard error: 1.767387
# step 4 - and inspect the residuals. Note type="normalized"
# in the calls to `residuals`. See help page
Acf(residuals(gls2,type="normalized")) # clean!
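And here is the full-blown simulation promised above: a minimal sketch (the helper one_rep is my own name, not part of the tutorial) that repeats the data-generating process many times and compares OLS and GLS estimates of the slope.

# Repeat the data-generating process above and compare OLS vs GLS
# estimates of B1. With positively autocorrelated errors, both should
# be roughly unbiased; the spreads show which is more efficient.
one_rep <- function(n = 100, B1 = 0.5, phi = 0.8) {
  x <- rnorm(n)
  y <- B1 * x + as.numeric(arima.sim(model = list(ar = phi), n = n))
  c(ols = unname(coef(lm(y ~ x))["x"]),
    gls = unname(coef(gls(y ~ x, correlation = corARMA(p = 1)))["x"]))
}
set.seed(47)
sims <- replicate(200, one_rep())
rowMeans(sims)     # how close is each estimator to B1 = 0.5 on average?
apply(sims, 1, sd) # and how variable is each?

(If gls() occasionally fails to converge during the replicates, wrapping the call in try() and dropping failed runs is an easy fix.)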

4.3 Regression for Real

Finally, let’s look at some data. A great paper by Connie Woodhouse and colleagues looks at how annual flows on the Colorado River are affected by precipitation, soil moisture, and temperature. Read the paper. It’s great and short.

dat <- readRDS("woodhouse.rds")
dat <- dat %>% mutate(MAF = LeesWYflow / 1e6) # acre feet -> millions of acre feet
# add a 5-yr centered moving average (note stats::filter, not dplyr::filter)
dat <- dat %>% mutate(MAF5yr = c(stats::filter(MAF, filter = rep(0.2, 5))))
ggplot(data=dat, aes(x=Year)) +
  geom_line(aes(y=MAF),color="lightgreen") + 
  geom_line(aes(y=MAF5yr),color="darkgreen") + 
  labs(y = "Millions of Acre Feet", title = "Colorado River at Lee's Ferry",
       subtitle = "Total Flow for Water Year") + 
  scale_x_continuous(expand=c(0,0))
## Warning: Removed 4 rows containing missing values (`geom_line()`).

The main gist of the paper is that precipitation explains most of the variability in the river’s flow but that temperature is important under certain conditions. E.g., “Different combinations of temperature, precipitation, and soil moisture can result in flow deficits of similar magnitude, but recent droughts have been amplified by warmer temperatures that exacerbate the effects of relatively modest precipitation deficits.” This makes sense: transpiration and evaporation are greater when it’s warmer, right? The cool thing that Woodhouse et al. have done is tease out the relationship between precipitation and temperature in a new way.

The data they used are online and I’ve grabbed the relevant info and put it in the file woodhouse.rds as a data.frame. Explore the data (go back to week one for ideas of what to do – plot your data!) and eventually make a model of river flow (LeesWYflow) as a function of October to April precipitation (OctAprP). The units on flow are cumulative flow over the water year in acre feet and cumulative October to April precipitation is in mm.

Is it a good model? How do the residuals look in a time series context? Interpret. Does a GLS approach change your interpretation from an OLS approach? Given your work with the simulated data above do you think Woodhouse et al. are on safe footing with OLS?

plot(dat$Year, dat$LeesWYflow) # quick look at the raw flow series

ols_river <- lm(LeesWYflow ~ OctAprP, data = dat)
Acf(residuals(ols_river)) # problem with independence

Pacf(residuals(ols_river)) # partial ACF points to a low-order AR structure

ar_river <- ar(residuals(ols_river))$ar # ar() selects an AR order by AIC; informs p below
# pull the variables into the workspace so gls() and update() can find them
LeesWYflow <- dat$LeesWYflow
OctAprP <- dat$OctAprP
gls_river <- gls(LeesWYflow ~ OctAprP) 
Acf(residuals(gls_river)) # problem with independence

Pacf(residuals(gls_river)) # same problem as the OLS fit, as expected

# step 3 - update the model with an appropriate correlation structure
cs_riv <- corARMA(p = 2) # AR(2), per the PACF and the ar() fit above
gls_riv2 <- update(gls_river, correlation = cs_riv)
summary(gls_riv2)
## Generalized least squares fit by REML
##   Model: LeesWYflow ~ OctAprP 
##   Data: NULL 
##        AIC     BIC    logLik
##   3393.011 3406.28 -1691.505
## 
## Correlation Structure: ARMA(2,0)
##  Formula: ~1 
##  Parameter estimate(s):
##      Phi1      Phi2 
## 0.3927175 0.1529462 
## 
## Coefficients:
##                  Value Std.Error  t-value p-value
## (Intercept) -2109433.3 1112007.0 -1.89696  0.0606
## OctAprP        77566.2    4605.2 16.84305  0.0000
## 
##  Correlation: 
##         (Intr)
## OctAprP -0.908
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -2.0561293 -0.7052324 -0.1192535  0.6746901  2.9279674 
## 
## Residual standard error: 2541745 
## Degrees of freedom: 107 total; 105 residual
# step 4 - and inspect the residuals. Note type="normalized"
# in the calls to `residuals`. See help page
Acf(residuals(gls_riv2,type="normalized")) # clean!
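To answer the “does GLS change your interpretation” question directly, it helps to put the two coefficient tables side by side:

summary(ols_river)$coefficients # OLS table: SEs assume independent errors
summary(gls_riv2)$tTable        # GLS table: SEs account for the AR(2) errors

In the GLS fit above, the OctAprP slope is still enormous relative to its standard error (t of about 16.8), so modeling the autocorrelation does not overturn the substantive conclusion. That suggests Woodhouse et al. are on reasonably safe footing with OLS here, even if its standard errors are somewhat optimistic.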