library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nlme)
##
## Attaching package: 'nlme'
##
## The following object is masked from 'package:dplyr':
##
## collapse
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
##
## Attaching package: 'forecast'
##
## The following object is masked from 'package:nlme':
##
## getResponse
library(ggplot2)
library(shiny)
OLS regression can be problematic with time series because the assumption of independent errors is easily violated. But there are other approaches that are useful.
Below I’m going to ask you to compare GLS and OLS using some real data. But first, write a paragraph that explains the difference between OLS and GLS. What is a generalized model?
A generalized linear model differs mainly in that it employs a link function, which connects the mean of the response to the linear predictor, allowing us to start with a non-normal distribution while avoiding violations of the assumptions of linear models. The link function will depend on the distribution of our data. A generalized linear model is also fit with maximum likelihood estimation rather than ordinary least squares. In the past, this could be computationally intensive, but as computing power has grown, MLE has become trivial to perform.
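To make the link-function idea concrete, here is a minimal sketch of a Poisson GLM on made-up count data (all names here are mine); the log link connects the mean of the counts to the linear predictor, and glm() fits by maximum likelihood:
set.seed(1)
x_glm <- rnorm(50)
counts <- rpois(50, lambda = exp(0.2 + 0.7 * x_glm)) # log-mean linear in x
fit_glm <- glm(counts ~ x_glm, family = poisson(link = "log"))
coef(fit_glm) # should land near the true values 0.2 and 0.7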
It is my impression that in the case of our time series data, it is GLS (rather than a GLM) that helps account for non-independent residuals. In this tutorial, we explore the autocorrelation of the residuals, find a model to fit them, and then “update” our model to take this into account. Mechanically, this amounts to weighting the least-squares estimator by the inverse of the error covariance matrix, as sketched below.
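Here is a self-contained sketch of that weighting with AR(1) errors. All of the names are mine, and I plug in the true phi directly; gls() would instead estimate it by REML:
set.seed(1)
n_demo <- 50
phi_demo <- 0.8
x_demo <- rnorm(n_demo)
e_demo <- as.numeric(arima.sim(model = list(ar = phi_demo), n = n_demo))
y_demo <- 0.5 * x_demo + e_demo
X <- cbind(1, x_demo)
# AR(1) error covariance up to a constant scale (the scale cancels):
Sigma <- phi_demo^abs(outer(1:n_demo, 1:n_demo, "-"))
Si <- solve(Sigma)
# the GLS estimator: (X' Sigma^-1 X)^-1 X' Sigma^-1 y
beta_gls <- solve(t(X) %*% Si %*% X, t(X) %*% Si %*% y_demo)
beta_gls # compare with coef(lm(y_demo ~ x_demo))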
Now, revisit the simulation code and adjust n, B1, and phi to see how the results of an OLS model change. Is OLS or GLS better at capturing the true value of B1 when there are non-independent errors? Or is OLS robust to moderate autocorrelation? You could run a full-blown simulation, of course (one sketch follows below), but you can also just mess around with the values a bit and try to get a feel for it.
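A bare-bones version of that simulation might look like this (the names and settings are mine; corAR1() is equivalent to the corARMA(p = 1) used elsewhere in this tutorial, and gls() can occasionally fail to converge on a particular draw):
set.seed(47)
nrep <- 50
b1 <- replicate(nrep, {
  x_s <- rnorm(100)
  e_s <- arima.sim(model = list(ar = 0.8), n = 100) # AR(1) errors
  y_s <- 0.5 * x_s + e_s                            # true B1 = 0.5
  c(ols = coef(lm(y_s ~ x_s))[2],
    gls = coef(gls(y_s ~ x_s, correlation = corAR1()))[2])
})
rowMeans(b1)     # both should be near 0.5 (OLS is unbiased, just inefficient)
apply(b1, 1, sd) # GLS should show less spread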
shinyAppDir(appDir = 'GLM_demo')
set.seed(47) # for reproducibility
n <- 100
phi <- 0.8
x <- ts(rnorm(n))
epsilon <- arima.sim(model = list(ar=phi),n = n)
epsilon <- epsilon - mean(epsilon) # demean
B0 <- 0 # intercept
B1 <- 0.5 # slope
y <- B0 + B1*x + epsilon
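For reference, here is the plain OLS fit of the same simulated series (the object name ols1 is mine):
ols1 <- lm(y ~ x)
# with autocorrelated errors the OLS slope is still unbiased, but its
# reported standard error assumes independence and is not trustworthy
summary(ols1)$coefficients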
# step 1
gls1 <- gls(y~x)
# step 2 - inspect the residuals
Acf(residuals(gls1)) # problem with independence
gls1
## Generalized least squares fit by REML
## Model: y ~ x
## Data: NULL
## Log-restricted-likelihood: -194.9446
##
## Coefficients:
## (Intercept) x
## 0.01618039 0.18575084
##
## Degrees of freedom: 100 total; 98 residual
## Residual standard error: 1.687976
# step 3 - update the model with an appropriate correlation structure
cs1 <- corARMA(p=1)
gls2 <- update(gls1,correlation=cs1)
gls2
## Generalized least squares fit by REML
## Model: y ~ x
## Data: NULL
## Log-restricted-likelihood: -149.63
##
## Coefficients:
## (Intercept) x
## -0.06975389 0.44966531
##
## Correlation Structure: AR(1)
## Formula: ~1
## Parameter estimate(s):
## Phi
## 0.7946776
## Degrees of freedom: 100 total; 98 residual
## Residual standard error: 1.767387
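Because gls1 and gls2 were both fit by REML with the same fixed effects, their likelihoods are directly comparable; a quick check that the AR(1) structure earns its keep:
anova(gls1, gls2) # lower AIC/BIC and a significant LR test favor gls2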
# step 4 - and inspect the residuals. Note type="normalized"
# in the calls to `residuals`. See help page
Acf(residuals(gls2,type="normalized")) # clean!
Finally, let’s look at some data. A great paper by Connie Woodhouse and colleagues looks at how annual flows on the Colorado River are affected by precipitation, soil moisture, and temperature. Read the paper. It’s great and short.
dat <- readRDS("woodhouse.rds")
dat <- dat %>% mutate(MAF = LeesWYflow / 1e6)
# add a 5-yr moving average
dat <- dat %>% mutate(MAF5yr = c(stats::filter(MAF,filter = rep(0.2,5))))
ggplot(data=dat, aes(x=Year)) +
geom_line(aes(y=MAF),color="lightgreen") +
geom_line(aes(y=MAF5yr),color="darkgreen") +
labs(y= "Millions of Acre Feet", title="Colorado River at Lee's Ferry",
sub = "Total Flow for Water Year") +
scale_x_continuous(expand=c(0,0))## Warning: Removed 4 rows containing missing values (`geom_line()`).
The main gist of the paper is that precipitation explains most of the variability in the river’s flow, but that temperature is important under certain conditions. E.g., “Different combinations of temperature, precipitation, and soil moisture can result in flow deficits of similar magnitude, but recent droughts have been amplified by warmer temperatures that exacerbate the effects of relatively modest precipitation deficits.” This makes intuitive sense: transpiration and evaporation are greater when it’s warmer, right? The cool thing Woodhouse et al. have done is tease out the relationship between precipitation and temperature in a new way.
The data they used are online and I’ve grabbed the relevant info and put it in the file woodhouse.rds as a data.frame. Explore the data (go back to week one for ideas of what to do – plot your data!) and eventually make a model of river flow (LeesWYflow) as a function of October to April precipitation (OctAprP). Flow is cumulative flow over the water year, in acre feet; October to April precipitation is cumulative, in mm.
Is it a good model? How do the residuals look in a time series context? Interpret. Does a GLS approach change your interpretation from an OLS approach? Given your work with the simulated data above do you think Woodhouse et al. are on safe footing with OLS?
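One possible first look before fitting anything, reusing the MAF conversion from above:
# flow (in MAF) against winter precipitation, with an OLS line
# overlaid as a first impression of the relationship
ggplot(dat, aes(x = OctAprP, y = LeesWYflow / 1e6)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm") +
  labs(x = "Oct-Apr precipitation (mm)", y = "Water-year flow (MAF)")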
plot(dat$Year,dat$LeesWYflow)
ols_river <- lm(dat$LeesWYflow~dat$OctAprP)
Acf(residuals(ols_river)) # problem with independence
Pacf(residuals(ols_river)) # problem with independence
ar_river <- ar(residuals(ols_river))$ar # ar() picks the order by AIC; $ar holds the coefficients
LeesWYflow <- dat$LeesWYflow
OctAprP <- dat$OctAprP
gls_river <- gls(LeesWYflow~OctAprP)
Acf(residuals(gls_river)) # problem with independence
Pacf(residuals(gls_river)) # problem with independence
# step 3 - update the model with an appropriate correlation structure
cs_riv <- corARMA(p=2) # AR(2), consistent with the PACF of the OLS residuals
gls_riv2 <- update(gls_river,correlation=cs_riv)
summary(gls_riv2)
## Generalized least squares fit by REML
## Model: LeesWYflow ~ OctAprP
## Data: NULL
## AIC BIC logLik
## 3393.011 3406.28 -1691.505
##
## Correlation Structure: ARMA(2,0)
## Formula: ~1
## Parameter estimate(s):
## Phi1 Phi2
## 0.3927175 0.1529462
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -2109433.3 1112007.0 -1.89696 0.0606
## OctAprP 77566.2 4605.2 16.84305 0.0000
##
## Correlation:
## (Intr)
## OctAprP -0.908
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.0561293 -0.7052324 -0.1192535 0.6746901 2.9279674
##
## Residual standard error: 2541745
## Degrees of freedom: 107 total; 105 residual
# step 4 - and inspect the residuals. Note type="normalized"
# in the calls to `residuals`. See help page
Acf(residuals(gls_riv2,type="normalized")) # clean!
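To answer the OLS-versus-GLS question directly, it helps to put the two coefficient tables side by side; summary(gls_riv2)$tTable holds the GLS version:
coef(summary(ols_river)) # OLS slope and (independence-assuming) std. error
summary(gls_riv2)$tTable # GLS slope and std. error under the AR(2) structure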