All Morphology: Linear Model Using GLS

GLS
AR Conclusion
Sourcing Scripts and Reading the Data
- Clean the Data
Create GLS and Compare GLS models
Females (only ones that fits an autoregressive model)
Fiting an ARMA model and making ACF plots of residuals
Citation

#knitr::purl("wing_gls.Rmd", output = "wing_gls.R") # convert Rmd to R script

GLS

The gls() function fits a linear model using generalized least squares. The errors are allowed to be correlated and/or have unequal variances. This function is used in time series analyses.

AR Conclusion

From the wing_timeseries.Rmd script, the following was concluded:

We have very few points (10). Ideal would be 15-25 observations. Because of the few points, none of the spikes in the ACF or PACF are significant even though the Augmented Dickey-Fuller Test keeps testing for non-stationarity. In turn, this is a short time series that can only be evaluated for temporal dependencies in another modeling framework (e.g. GLMM, GAMM, or LOESS). Otherwise, in AR, these time series would all fall under white noise.

Let’s then try a linear model using GLS.

Sourcing Scripts and Reading the Data

source_path = "~/Desktop/git_repositories/SBB-dispersal/avbernat/Rsrc/"

script_names = c("compare_models.R",
                 "regression_output.R",
                 "clean_morph_data.R", 
                 "AICprobabilities.R",
                 "get_Akaike_weights.R",
                 "remove_torn_wings.R")

for (script in script_names) { 
  path = paste0(source_path, script)
  source(path) 
}

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

source("~/Desktop/git_repositories/SBB-dispersal/avbernat/RTsrc/ts_auxiliary_funs.R")

Clean the Data

data_list <- read_morph_data("data/allmorphology04.26.21-coors2.csv")

## morph types: L S  NA LS L/S SL 
##    recoding missing morph types...
##    S if wing2thorax <=2.2, L if wing2thorax >=2.5
## 
## ambiguous wing morph bug count:  32

raw_data = data_list[[1]]
all_bugs = nrow(raw_data)
data_long = data_list[[2]] 

# Remove individuals with torn wings first
raw_data = remove_torn_wings(raw_data) # **don't need to remove torn wings for wing morph analysis

## 
## number of bugs with torn wings: 147

data_long = remove_torn_wings(data_long) # **don't need to remove torn wings for wing morph analysis

## 
## number of bugs with torn wings: 144

clean_bugs = nrow(raw_data)

cat("number of bugs with torn wings:", all_bugs - clean_bugs, "\n\n")

## number of bugs with torn wings: 147

# remove NA dates
d = raw_data %>%
  filter(!is.na(wing), !is.na(datetime))

# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing", cat_var="datetime", func="mean")
wing_avg = ts_cols[[1]]
date = ts_cols[[2]]

beak_avg = tapply(X=raw_data[,"beak"], INDEX=raw_data[,"datetime"], FUN=mean, na.rm=T)

df = as.data.frame(cbind(wing_avg, beak_avg, date))

Create GLS and Compare GLS models

Wing Length

w/o time dependencies

m1 <- gls(wing_avg ~ beak_avg, 
          data = df,
          na.action = na.omit,
          method = "REML") 
summary(m1)

## Generalized least squares fit by REML
##   Model: wing_avg ~ beak_avg 
##   Data: df 
##        AIC      BIC     logLik
##   7.449552 7.687877 -0.7247762
## 
## Coefficients:
##                 Value Std.Error   t-value p-value
## (Intercept) -2.311756 2.8202658 -0.819694  0.4361
## beak_avg     1.572614 0.4341785  3.622045  0.0068
## 
##  Correlation: 
##          (Intr)
## beak_avg -1    
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -1.9738138 -0.3664493  0.1030877  0.6563266  1.2651954 
## 
## Residual standard error: 0.2462595 
## Degrees of freedom: 10 total; 8 residual

Check for temporal dependency. See wing_timseries.Rmd script. Conclusion: there is no temporal dependency.
If there is temporal dependency, then define the correlation structure in the gls() function. correlation in an optional corStruct object describing the within-group correlation structure.

For instance, below I’m incorporating AR1 (first order autoregressive structure) into the model. To do so, we have to switch to gls() and use the correlation=argument. In addition to the model incorporating an AR1, we can refit the model with no correlation structure with the gls() function so that we can directly compare the fits of the two models. Since there was no temporal dependency, the parameters will not change below and so the models are identical.

w/ time dependencies

# Model with an AR1 correlation
# Figure 3.7
m2 <- gls(wing_avg ~ beak_avg, 
          correlation = corAR1(form=~date),
          data = df,
          na.action = na.omit)
summary(m2)

## Generalized least squares fit by REML
##   Model: wing_avg ~ beak_avg 
##   Data: df 
##        AIC      BIC     logLik
##   9.449552 9.767318 -0.7247762
## 
## Correlation Structure: ARMA(1,0)
##  Formula: ~date 
##  Parameter estimate(s):
## Phi1 
##    0 
## 
## Coefficients:
##                 Value Std.Error   t-value p-value
## (Intercept) -2.311756 2.8202658 -0.819694  0.4361
## beak_avg     1.572614 0.4341785  3.622045  0.0068
## 
##  Correlation: 
##          (Intr)
## beak_avg -1    
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -1.9738138 -0.3664493  0.1030877  0.6563266  1.2651954 
## 
## Residual standard error: 0.2462595 
## Degrees of freedom: 10 total; 8 residual

Compare gls w/o and w/o Temporal Correlation

Now compare each gls model in a summary plot. In this case, the lines will be exactly the same because there was no temporal dependency.

# Create a grid of covariate values
x = range(df$beak_avg, na.rm = TRUE)
data <- data.frame(beak_avg = seq(from = min(x), 
                               to = max(x), 
                               length = 10))
# Model with an AR1 correlation
par(mfrow = c(1,1), mar = c(5,5,2,2), cex.lab = 1.5)
plot(x = df$beak_avg,
     y = df$wing_avg,
     xlab = "wing_avg (mm)",
     ylab = "beak_avg (mm)",
     pch = 16,
     type = "n")
         
#Add the fitted values     
pred1 <- predict(m1, newdata = data)
pred2 <- predict(m2, newdata = data)

lines(x = data$beak_avg, # without a correlation
      y = pred1,
      lwd = 1,
      lty=2,
      col = "black")
lines(x = data$beak_avg, # with a correlation
      y = pred2,
      col = "red",
      lty=2,
      lwd = 3)
text(x = df$beak_avg,y = df$wing_avg, df$date, cex = 1)

Wing2body

# Get only bugs with long wings
data_long<-raw_data[raw_data$w_morph=="L",]

# Calculate wing2body ratio for bugs with long wings 
data_long$wing2body <- data_long$wing/as.numeric(data_long$body)

# remove NA dates
d = data_long %>%
  filter(!is.na(wing2body), !is.na(datetime))

# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing2body", cat_var="datetime", func="mean")
ratio_avg = ts_cols[[1]]
ratio_avg = ratio_avg[!is.na(ratio_avg)]
date = ts_cols[[2]]

beak_avg = beak_avg[c(1,3:10)]

df = as.data.frame(cbind(ratio_avg, beak_avg, date))

m3 <- gls(ratio_avg ~ beak_avg, 
          data = df,
          na.action = na.omit,
          method = "REML") 
summary(m3)

## Generalized least squares fit by REML
##   Model: ratio_avg ~ beak_avg 
##   Data: df 
##         AIC       BIC logLik
##   -50.85001 -51.01228 28.425
## 
## Coefficients:
##                 Value  Std.Error   t-value p-value
## (Intercept) 0.6689254 0.04694160 14.250162  0.0000
## beak_avg    0.0094287 0.00724794  1.300887  0.2345
## 
##  Correlation: 
##          (Intr)
## beak_avg -1    
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -1.3501574 -0.9290108  0.3976261  0.5985678  1.4377538 
## 
## Residual standard error: 0.003895618 
## Degrees of freedom: 9 total; 7 residual

# Model with an AR1 correlation
# Figure 3.7
m4 <- gls(ratio_avg ~ beak_avg, 
          correlation = corAR1(form=~date),
          data = df,
          na.action = na.omit)

summary(m4)

## Generalized least squares fit by REML
##   Model: ratio_avg ~ beak_avg 
##   Data: df 
##         AIC       BIC logLik
##   -48.85001 -49.06637 28.425
## 
## Correlation Structure: ARMA(1,0)
##  Formula: ~date 
##  Parameter estimate(s):
## Phi1 
##    0 
## 
## Coefficients:
##                 Value  Std.Error   t-value p-value
## (Intercept) 0.6689254 0.04694160 14.250162  0.0000
## beak_avg    0.0094287 0.00724794  1.300887  0.2345
## 
##  Correlation: 
##          (Intr)
## beak_avg -1    
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -1.3501574 -0.9290108  0.3976261  0.5985678  1.4377538 
## 
## Residual standard error: 0.003895618 
## Degrees of freedom: 9 total; 7 residual

Same thing. Exactly the same.

Wing Morph Frequency

# remove NA dates and wing morph (S=0, L=1)
d = raw_data %>%
  filter(!is.na(wing_morph_b), !is.na(datetime))

# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing_morph_b", cat_var="datetime", func="mean")
freq_avg = ts_cols[[1]]
date = ts_cols[[2]]

beak_avg = tapply(X=raw_data[,"beak"], INDEX=raw_data[,"datetime"], FUN=mean, na.rm=T)

df = as.data.frame(cbind(freq_avg, beak_avg, date))

m5 <- gls(freq_avg ~ beak_avg, 
          data = df,
          na.action = na.omit,
          method = "REML") 
summary(m5)

## Generalized least squares fit by REML
##   Model: freq_avg ~ beak_avg 
##   Data: df 
##        AIC       BIC   logLik
##   -14.8975 -14.65918 10.44875
## 
## Coefficients:
##                  Value Std.Error   t-value p-value
## (Intercept) -1.1302489 0.6977741 -1.619792  0.1439
## beak_avg     0.2942132 0.1074220  2.738855  0.0255
## 
##  Correlation: 
##          (Intr)
## beak_avg -1    
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -1.7714116 -0.3972292 -0.1116195  0.7191137  1.3219052 
## 
## Residual standard error: 0.06092813 
## Degrees of freedom: 10 total; 8 residual

# Model with an AR1 correlation
# Figure 3.7
m6 <- gls(freq_avg ~ beak_avg, 
          correlation = corAR1(form=~date),
          data = df,
          na.action = na.omit)

summary(m6)

## Generalized least squares fit by REML
##   Model: freq_avg ~ beak_avg 
##   Data: df 
##        AIC       BIC   logLik
##   -12.8975 -12.57974 10.44875
## 
## Correlation Structure: ARMA(1,0)
##  Formula: ~date 
##  Parameter estimate(s):
## Phi1 
##    0 
## 
## Coefficients:
##                  Value Std.Error   t-value p-value
## (Intercept) -1.1302489 0.6977741 -1.619792  0.1439
## beak_avg     0.2942132 0.1074220  2.738855  0.0255
## 
##  Correlation: 
##          (Intr)
## beak_avg -1    
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -1.7714116 -0.3972292 -0.1116195  0.7191137  1.3219052 
## 
## Residual standard error: 0.06092813 
## Degrees of freedom: 10 total; 8 residual

Again the same thing - as we saw in the wing_timeseries.Rmd, there were no spatial dependencies. Let’s continue then only with wing length for the remainder of this script.

Females (only ones that fits an autoregressive model)

females = raw_data[raw_data$sex=="F",]

# remove NA dates
d = females %>%
  filter(!is.na(wing), !is.na(datetime))

# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing", cat_var="datetime", func="mean")
wing_avg = ts_cols[[1]]
date = ts_cols[[2]]

beak_avg = tapply(X=d[,"beak"], INDEX=d[,"datetime"], FUN=mean, na.rm=T)

df = as.data.frame(cbind(wing_avg, beak_avg, date))

m7 <- gls(wing_avg ~ beak_avg, 
          data = df,
          na.action = na.omit,
          method = "REML") 
summary(m7)

## Generalized least squares fit by REML
##   Model: wing_avg ~ beak_avg 
##   Data: df 
##        AIC      BIC     logLik
##   5.950043 6.188368 0.02497837
## 
## Coefficients:
##                  Value Std.Error   t-value p-value
## (Intercept) -0.5907377 1.4303907 -0.412990  0.6905
## beak_avg     1.2173388 0.1953003  6.233163  0.0003
## 
##  Correlation: 
##          (Intr)
## beak_avg -0.999
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -1.28039056 -0.40089750 -0.08521194  0.09081897  2.18717051 
## 
## Residual standard error: 0.2073299 
## Degrees of freedom: 10 total; 8 residual

# Model with an AR1 correlation
# Figure 3.7
m8 <- gls(wing_avg ~ beak_avg, 
          correlation = corAR1(form=~date),
          data = df,
          na.action = na.omit)

summary(m8)

## Generalized least squares fit by REML
##   Model: wing_avg ~ beak_avg 
##   Data: df 
##        AIC      BIC     logLik
##   7.950043 8.267809 0.02497837
## 
## Correlation Structure: ARMA(1,0)
##  Formula: ~date 
##  Parameter estimate(s):
## Phi1 
##    0 
## 
## Coefficients:
##                  Value Std.Error   t-value p-value
## (Intercept) -0.5907377 1.4303907 -0.412990  0.6905
## beak_avg     1.2173388 0.1953003  6.233163  0.0003
## 
##  Correlation: 
##          (Intr)
## beak_avg -0.999
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -1.28039056 -0.40089750 -0.08521194  0.09081897  2.18717051 
## 
## Residual standard error: 0.2073299 
## Degrees of freedom: 10 total; 8 residual

# Create a grid of covariate values
x = range(df$beak_avg, na.rm = TRUE)
data <- data.frame(beak_avg = seq(from = min(x), 
                               to = max(x), 
                               length = 10))
# Model with an AR1 correlation
par(mfrow = c(1,1), mar = c(5,5,2,2), cex.lab = 1.5)
plot(x = df$beak_avg,
     y = df$wing_avg,
     xlab = "wing_avg (mm)",
     ylab = "beak_avg (mm)",
     pch = 16,
     type = "n")
         
#Add the fitted values     
pred1 <- predict(m7, newdata = data)
pred2 <- predict(m8, newdata = data)

lines(x = data$beak_avg, # without a correlation
      y = pred1,
      lwd = 1,
      lty=2,
      col = "black")
lines(x = data$beak_avg, # with a correlation
      y = pred2,
      col = "red",
      lty=2,
      lwd = 3)
text(x = df$beak_avg, y = df$wing_avg, df$date, cex = 1)

Fiting an ARMA model and making ACF plots of residuals

One thing I did above was assume an AR(1) model for the data because the data actually has no temporal dependencies. However, if the data did, then we would need to choose a proper ARMA model.
First, create you xts object from you data.

# remove NA dates
d = raw_data %>%
  filter(!is.na(wing), !is.na(datetime))

# prep dataset for xts()
ts_cols = clean_for_ts(d, contin_var="wing", cat_var="datetime", func="mean")
wing_avg = ts_cols[[1]]
date = ts_cols[[2]]

wing_mm = xts(wing_avg, date)
colnames(wing_mm) <- "wing"
wing_mm

##                wing
## 2013-04-01 8.401154
## 2013-12-01 7.687829
## 2014-04-01 8.178410
## 2015-04-01 7.586518
## 2016-12-01 8.014340
## 2017-08-01 8.481303
## 2018-09-01 7.584976
## 2019-05-01 7.870708
## 2019-10-01 7.302041
## 2020-02-01 7.887514

Fit your AR(p) model.

We can to fit a series of AR(p) models and check whether the model captures the dependence structure of the noise terms. Our aim is that the residuals of a fitted model exhibit white noise. I will not do any formal hypothesis testing to determine if one model is more appropriate compared to another here. For now, I’m using the data and identifying a model that appears to be most appropriate. Then, I’ll justify our findings.

When we use the arima function to fit the model, there are two key parameters we need: x, which is our dataset, and order which tells us how many lags to use. For now, only toggle the first parameter to choose an appropriate AR(p) model (i.e., you will have order=c(x,y,z), where x is replaced by an appropriate number of lags for your model). The y and z are differncing and MA order, respectively.

Summary: order=c(AR order, differencing, MA order)

Then, plot the ACF of the residuals of the AR model. The forecasts from a model with autocorrelated errors are still unbiased, and so are not “wrong,” but they will usually have larger prediction intervals than they need to. Therefore we should always look at an ACF plot of the residuals.

# Example AR(1) model
# x has to be a "ts" object
mod.ar1 <- arima(x=wing_mm, order=c(1,0,0)) # change the 1 to other numbers to get it to be white noise 
Acf(residuals(mod.ar1), main='')

You can also use auto.arima() to choose an ARMIA model automatically.It uses a variation of the Hyndman and Khandakar algorithm which combines unit root tests, minimization of the AICc, and MLE to obtain an ARIMA model. The function takes some short-cuts in order to speed up the computation and will not always yield the best model. Setting stepwise and approximation to FALSE prevents the function from taking short-cuts.

#auto.arima(wing_mm)
auto.arima(wing_mm, stepwise=FALSE, approximation=FALSE) # ours yield the same result

## Series: wing_mm 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##         mean
##       7.8995
## s.e.  0.1132
## 
## sigma^2 estimated as 0.1423:  log likelihood=-3.91
## AIC=11.83   AICc=13.54   BIC=12.43

An ARIMA(0,0,0) model with zero mean is white noise, so it means that the errors are uncorrelated across time.

Since we have no temporal dependencies to begin with, the ACF of the mod.ar1 residuals are already showing up as noise (no spikes above the dotted blue significance lines). As you increase your x value you will see that those spikes will get smaller and smaller too. That is how you would eyeball your arma model without formal hypothesis testing comparing models.

Citation

https://www.flutterbys.com.au/stats/tut/tut8.3a.html