#knitr::purl("wing_gls.Rmd", output = "wing_gls.R") # convert Rmd to R script
The gls() function fits a linear model using generalized least squares. The errors are allowed to be correlated and/or have unequal variances. This function is used in time series analyses.
From the wing_timeseries.Rmd script, the following was concluded:
We have very few points (10). Ideal would be 15-25 observations. Because of the few points, none of the spikes in the ACF or PACF are significant even though the Augmented Dickey-Fuller Test keeps testing for non-stationarity. In turn, this is a short time series that can only be evaluated for temporal dependencies in another modeling framework (e.g. GLMM, GAMM, or LOESS). Otherwise, in AR, these time series would all fall under white noise.
Let’s then try a linear model using GLS.
source_path = "~/Desktop/git_repositories/SBB-dispersal/avbernat/Rsrc/"
script_names = c("compare_models.R",
"regression_output.R",
"clean_morph_data.R",
"AICprobabilities.R",
"get_Akaike_weights.R",
"remove_torn_wings.R")
for (script in script_names) {
path = paste0(source_path, script)
source(path)
}
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
source("~/Desktop/git_repositories/SBB-dispersal/avbernat/RTsrc/ts_auxiliary_funs.R")
data_list <- read_morph_data("data/allmorphology04.26.21-coors2.csv")
## morph types: L S NA LS L/S SL
## recoding missing morph types...
## S if wing2thorax <=2.2, L if wing2thorax >=2.5
##
## ambiguous wing morph bug count: 32
raw_data = data_list[[1]]
all_bugs = nrow(raw_data)
data_long = data_list[[2]]
# Remove individuals with torn wings first
raw_data = remove_torn_wings(raw_data) # **don't need to remove torn wings for wing morph analysis
##
## number of bugs with torn wings: 147
data_long = remove_torn_wings(data_long) # **don't need to remove torn wings for wing morph analysis
##
## number of bugs with torn wings: 144
clean_bugs = nrow(raw_data)
cat("number of bugs with torn wings:", all_bugs - clean_bugs, "\n\n")
## number of bugs with torn wings: 147
# remove NA dates
d = raw_data %>%
filter(!is.na(wing), !is.na(datetime))
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing", cat_var="datetime", func="mean")
wing_avg = ts_cols[[1]]
date = ts_cols[[2]]
beak_avg = tapply(X=raw_data[,"beak"], INDEX=raw_data[,"datetime"], FUN=mean, na.rm=T)
df = as.data.frame(cbind(wing_avg, beak_avg, date))
m1 <- gls(wing_avg ~ beak_avg,
data = df,
na.action = na.omit,
method = "REML")
summary(m1)
## Generalized least squares fit by REML
## Model: wing_avg ~ beak_avg
## Data: df
## AIC BIC logLik
## 7.449552 7.687877 -0.7247762
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -2.311756 2.8202658 -0.819694 0.4361
## beak_avg 1.572614 0.4341785 3.622045 0.0068
##
## Correlation:
## (Intr)
## beak_avg -1
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -1.9738138 -0.3664493 0.1030877 0.6563266 1.2651954
##
## Residual standard error: 0.2462595
## Degrees of freedom: 10 total; 8 residual
Check for temporal dependency. See wing_timseries.Rmd script. Conclusion: there is no temporal dependency.
If there is temporal dependency, then define the correlation structure in the gls() function. correlation in an optional corStruct object describing the within-group correlation structure.
For instance, below I’m incorporating AR1 (first order autoregressive structure) into the model. To do so, we have to switch to gls() and use the correlation=argument. In addition to the model incorporating an AR1, we can refit the model with no correlation structure with the gls() function so that we can directly compare the fits of the two models. Since there was no temporal dependency, the parameters will not change below and so the models are identical.
# Model with an AR1 correlation
# Figure 3.7
m2 <- gls(wing_avg ~ beak_avg,
correlation = corAR1(form=~date),
data = df,
na.action = na.omit)
summary(m2)
## Generalized least squares fit by REML
## Model: wing_avg ~ beak_avg
## Data: df
## AIC BIC logLik
## 9.449552 9.767318 -0.7247762
##
## Correlation Structure: ARMA(1,0)
## Formula: ~date
## Parameter estimate(s):
## Phi1
## 0
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -2.311756 2.8202658 -0.819694 0.4361
## beak_avg 1.572614 0.4341785 3.622045 0.0068
##
## Correlation:
## (Intr)
## beak_avg -1
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -1.9738138 -0.3664493 0.1030877 0.6563266 1.2651954
##
## Residual standard error: 0.2462595
## Degrees of freedom: 10 total; 8 residual
# Create a grid of covariate values
x = range(df$beak_avg, na.rm = TRUE)
data <- data.frame(beak_avg = seq(from = min(x),
to = max(x),
length = 10))
# Model with an AR1 correlation
par(mfrow = c(1,1), mar = c(5,5,2,2), cex.lab = 1.5)
plot(x = df$beak_avg,
y = df$wing_avg,
xlab = "wing_avg (mm)",
ylab = "beak_avg (mm)",
pch = 16,
type = "n")
#Add the fitted values
pred1 <- predict(m1, newdata = data)
pred2 <- predict(m2, newdata = data)
lines(x = data$beak_avg, # without a correlation
y = pred1,
lwd = 1,
lty=2,
col = "black")
lines(x = data$beak_avg, # with a correlation
y = pred2,
col = "red",
lty=2,
lwd = 3)
text(x = df$beak_avg,y = df$wing_avg, df$date, cex = 1)
# Get only bugs with long wings
data_long<-raw_data[raw_data$w_morph=="L",]
# Calculate wing2body ratio for bugs with long wings
data_long$wing2body <- data_long$wing/as.numeric(data_long$body)
# remove NA dates
d = data_long %>%
filter(!is.na(wing2body), !is.na(datetime))
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing2body", cat_var="datetime", func="mean")
ratio_avg = ts_cols[[1]]
ratio_avg = ratio_avg[!is.na(ratio_avg)]
date = ts_cols[[2]]
beak_avg = beak_avg[c(1,3:10)]
df = as.data.frame(cbind(ratio_avg, beak_avg, date))
m3 <- gls(ratio_avg ~ beak_avg,
data = df,
na.action = na.omit,
method = "REML")
summary(m3)
## Generalized least squares fit by REML
## Model: ratio_avg ~ beak_avg
## Data: df
## AIC BIC logLik
## -50.85001 -51.01228 28.425
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 0.6689254 0.04694160 14.250162 0.0000
## beak_avg 0.0094287 0.00724794 1.300887 0.2345
##
## Correlation:
## (Intr)
## beak_avg -1
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -1.3501574 -0.9290108 0.3976261 0.5985678 1.4377538
##
## Residual standard error: 0.003895618
## Degrees of freedom: 9 total; 7 residual
# Model with an AR1 correlation
# Figure 3.7
m4 <- gls(ratio_avg ~ beak_avg,
correlation = corAR1(form=~date),
data = df,
na.action = na.omit)
summary(m4)
## Generalized least squares fit by REML
## Model: ratio_avg ~ beak_avg
## Data: df
## AIC BIC logLik
## -48.85001 -49.06637 28.425
##
## Correlation Structure: ARMA(1,0)
## Formula: ~date
## Parameter estimate(s):
## Phi1
## 0
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 0.6689254 0.04694160 14.250162 0.0000
## beak_avg 0.0094287 0.00724794 1.300887 0.2345
##
## Correlation:
## (Intr)
## beak_avg -1
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -1.3501574 -0.9290108 0.3976261 0.5985678 1.4377538
##
## Residual standard error: 0.003895618
## Degrees of freedom: 9 total; 7 residual
Same thing. Exactly the same.
# remove NA dates and wing morph (S=0, L=1)
d = raw_data %>%
filter(!is.na(wing_morph_b), !is.na(datetime))
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing_morph_b", cat_var="datetime", func="mean")
freq_avg = ts_cols[[1]]
date = ts_cols[[2]]
beak_avg = tapply(X=raw_data[,"beak"], INDEX=raw_data[,"datetime"], FUN=mean, na.rm=T)
df = as.data.frame(cbind(freq_avg, beak_avg, date))
m5 <- gls(freq_avg ~ beak_avg,
data = df,
na.action = na.omit,
method = "REML")
summary(m5)
## Generalized least squares fit by REML
## Model: freq_avg ~ beak_avg
## Data: df
## AIC BIC logLik
## -14.8975 -14.65918 10.44875
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -1.1302489 0.6977741 -1.619792 0.1439
## beak_avg 0.2942132 0.1074220 2.738855 0.0255
##
## Correlation:
## (Intr)
## beak_avg -1
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -1.7714116 -0.3972292 -0.1116195 0.7191137 1.3219052
##
## Residual standard error: 0.06092813
## Degrees of freedom: 10 total; 8 residual
# Model with an AR1 correlation
# Figure 3.7
m6 <- gls(freq_avg ~ beak_avg,
correlation = corAR1(form=~date),
data = df,
na.action = na.omit)
summary(m6)
## Generalized least squares fit by REML
## Model: freq_avg ~ beak_avg
## Data: df
## AIC BIC logLik
## -12.8975 -12.57974 10.44875
##
## Correlation Structure: ARMA(1,0)
## Formula: ~date
## Parameter estimate(s):
## Phi1
## 0
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -1.1302489 0.6977741 -1.619792 0.1439
## beak_avg 0.2942132 0.1074220 2.738855 0.0255
##
## Correlation:
## (Intr)
## beak_avg -1
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -1.7714116 -0.3972292 -0.1116195 0.7191137 1.3219052
##
## Residual standard error: 0.06092813
## Degrees of freedom: 10 total; 8 residual
Again the same thing - as we saw in the wing_timeseries.Rmd, there were no spatial dependencies. Let’s continue then only with wing length for the remainder of this script.
females = raw_data[raw_data$sex=="F",]
# remove NA dates
d = females %>%
filter(!is.na(wing), !is.na(datetime))
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing", cat_var="datetime", func="mean")
wing_avg = ts_cols[[1]]
date = ts_cols[[2]]
beak_avg = tapply(X=d[,"beak"], INDEX=d[,"datetime"], FUN=mean, na.rm=T)
df = as.data.frame(cbind(wing_avg, beak_avg, date))
m7 <- gls(wing_avg ~ beak_avg,
data = df,
na.action = na.omit,
method = "REML")
summary(m7)
## Generalized least squares fit by REML
## Model: wing_avg ~ beak_avg
## Data: df
## AIC BIC logLik
## 5.950043 6.188368 0.02497837
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -0.5907377 1.4303907 -0.412990 0.6905
## beak_avg 1.2173388 0.1953003 6.233163 0.0003
##
## Correlation:
## (Intr)
## beak_avg -0.999
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -1.28039056 -0.40089750 -0.08521194 0.09081897 2.18717051
##
## Residual standard error: 0.2073299
## Degrees of freedom: 10 total; 8 residual
# Model with an AR1 correlation
# Figure 3.7
m8 <- gls(wing_avg ~ beak_avg,
correlation = corAR1(form=~date),
data = df,
na.action = na.omit)
summary(m8)
## Generalized least squares fit by REML
## Model: wing_avg ~ beak_avg
## Data: df
## AIC BIC logLik
## 7.950043 8.267809 0.02497837
##
## Correlation Structure: ARMA(1,0)
## Formula: ~date
## Parameter estimate(s):
## Phi1
## 0
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) -0.5907377 1.4303907 -0.412990 0.6905
## beak_avg 1.2173388 0.1953003 6.233163 0.0003
##
## Correlation:
## (Intr)
## beak_avg -0.999
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -1.28039056 -0.40089750 -0.08521194 0.09081897 2.18717051
##
## Residual standard error: 0.2073299
## Degrees of freedom: 10 total; 8 residual
# Create a grid of covariate values
x = range(df$beak_avg, na.rm = TRUE)
data <- data.frame(beak_avg = seq(from = min(x),
to = max(x),
length = 10))
# Model with an AR1 correlation
par(mfrow = c(1,1), mar = c(5,5,2,2), cex.lab = 1.5)
plot(x = df$beak_avg,
y = df$wing_avg,
xlab = "wing_avg (mm)",
ylab = "beak_avg (mm)",
pch = 16,
type = "n")
#Add the fitted values
pred1 <- predict(m7, newdata = data)
pred2 <- predict(m8, newdata = data)
lines(x = data$beak_avg, # without a correlation
y = pred1,
lwd = 1,
lty=2,
col = "black")
lines(x = data$beak_avg, # with a correlation
y = pred2,
col = "red",
lty=2,
lwd = 3)
text(x = df$beak_avg, y = df$wing_avg, df$date, cex = 1)
One thing I did above was assume an AR(1) model for the data because the data actually has no temporal dependencies. However, if the data did, then we would need to choose a proper ARMA model.
First, create you xts object from you data.
# remove NA dates
d = raw_data %>%
filter(!is.na(wing), !is.na(datetime))
# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing", cat_var="datetime", func="mean")
wing_avg = ts_cols[[1]]
date = ts_cols[[2]]
wing_mm = xts(wing_avg, date)
colnames(wing_mm) <- "wing"
wing_mm
## wing
## 2013-04-01 8.401154
## 2013-12-01 7.687829
## 2014-04-01 8.178410
## 2015-04-01 7.586518
## 2016-12-01 8.014340
## 2017-08-01 8.481303
## 2018-09-01 7.584976
## 2019-05-01 7.870708
## 2019-10-01 7.302041
## 2020-02-01 7.887514
We can to fit a series of AR(p) models and check whether the model captures the dependence structure of the noise terms. Our aim is that the residuals of a fitted model exhibit white noise. I will not do any formal hypothesis testing to determine if one model is more appropriate compared to another here. For now, I’m using the data and identifying a model that appears to be most appropriate. Then, I’ll justify our findings.
When we use the arima
function to fit the model, there are two key parameters we need: x
, which is our dataset, and order
which tells us how many lags to use. For now, only toggle the first parameter to choose an appropriate AR(p) model (i.e., you will have order=c(x,y,z)
, where x
is replaced by an appropriate number of lags for your model). The y
and z
are differncing and MA order, respectively.
Summary: order=c(AR order, differencing, MA order)
# Example AR(1) model
# x has to be a "ts" object
mod.ar1 <- arima(x=wing_mm, order=c(1,0,0)) # change the 1 to other numbers to get it to be white noise
Acf(residuals(mod.ar1), main='')
You can also use auto.arima() to choose an ARMIA model automatically.It uses a variation of the Hyndman and Khandakar algorithm which combines unit root tests, minimization of the AICc, and MLE to obtain an ARIMA model. The function takes some short-cuts in order to speed up the computation and will not always yield the best model. Setting stepwise and approximation to FALSE prevents the function from taking short-cuts.
#auto.arima(wing_mm)
auto.arima(wing_mm, stepwise=FALSE, approximation=FALSE) # ours yield the same result
## Series: wing_mm
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 7.8995
## s.e. 0.1132
##
## sigma^2 estimated as 0.1423: log likelihood=-3.91
## AIC=11.83 AICc=13.54 BIC=12.43
An ARIMA(0,0,0) model with zero mean is white noise, so it means that the errors are uncorrelated across time.
Since we have no temporal dependencies to begin with, the ACF of the mod.ar1
residuals are already showing up as noise (no spikes above the dotted blue significance lines). As you increase your x
value you will see that those spikes will get smaller and smaller too. That is how you would eyeball your arma model without formal hypothesis testing comparing models.