We will use the bioluminescence data that I also showed in class. These data are reported in Chapter 17 of Zuur et al.'s text, “Mixed Effects Models and Extensions in Ecology with R”. I will deviate from their analysis at times, however. I encourage you to read that chapter for a richer description of the data and the problem than is given here.
You can get the data from this folder: https://www.dropbox.com/sh/za8bxeor5fdruye/wzA88i84ey (you are looking for the file ISIT.txt).
The data were gathered during four cruises of a research vessel over 2001 and 2002 in the Atlantic Ocean west of Ireland. The cruises were organized in Spring and late Summer to collect samples before and after the downward flux of particulate organic content that occurs in June and July. The ship is busy and used for many different purposes, however, and the timing of the samples is not quite the same in 2001 and 2002.
The data were collected by dropping a device overboard. As it falls, it stirs up water, inducing bioluminescence. A camera is periodically turned on and is used to measure the amount of light emitted at various depths.
First, create a graphic that shows the data as a function of depth, separated by month.
ISIT <- read.table("../Data/ISIT.txt", header = TRUE)
library(ggplot2)
# First plot, without separation by month:
ggp <- ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = Sources, group = Station))
ggp
# Add separation by month
ggp <- ggp + facet_wrap(~Month)
ggp
# Clean up the axes
ggp <- ggp + scale_y_continuous("Source") + scale_x_continuous("Depth (m)")
ggp
# Give the months more helpful names
ISIT$fMonth <- factor(ISIT$Month, levels = c(3, 4, 8, 10), labels = c("March",
"April", "August", "October"))
ggp <- ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = Sources, group = Station)) +
facet_wrap(~fMonth) + scale_y_continuous("Source") + scale_x_continuous("Depth (m)")
ggp
A quick inspection shows a few noteworthy points:
Zuur et al. also produce a plot like this:
ggp <- ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = Sources, group = Station)) +
facet_grid(Year ~ fMonth) + scale_y_continuous("Source") + scale_x_continuous("Depth (m)")
ggp
# Mine had ugly tick marks and labels - fix those
ggp <- ggp + scale_x_continuous("Depth (1000m)", breaks = c(1000, 2000, 3000,
4000, 5000), labels = c(1:5))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
ggp
I'm still astonished at how little noise there is in April.
There are a few ways to deal with the nonlinearity between Sources and Depth, and a few ways to deal with the heteroskedasticity. We will work through several of them below.
First, we'll fit a linear mixed-effects model. I don't think that a linear model is right in this case, but I want to show how the lme function works. Since the relationship isn't linear, let's try transforming the dependent variable. There are zeros, so a straight-up log transform won't work. The smallest nonzero value is 0.11 (I used the sort function), so let's try log(Sources + 0.1). Besides, previous literature has suggested a logarithmic relationship between Sources and depth; you might start with this model just to show that it is inadequate.
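As a quick check of that claim, you can confirm the smallest nonzero value directly (a minimal sketch; it just inspects the Sources column):
# Smallest nonzero value of Sources -- this is what motivates the +0.1 offset
min(ISIT$Sources[ISIT$Sources > 0])
head(sort(unique(ISIT$Sources)))  # or eyeball the low end of the sorted values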
library(nlme)
lme.mod <- lme(log(Sources + 0.1) ~ fMonth * log(SampleDepth), random = ~1 |
Station, data = ISIT)
ISIT$lme.fit <- fitted(lme.mod, level = 0)  # population-level (fixed-effects-only) fitted values
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = lme.fit, group = Station)) +
facet_wrap(~fMonth)
ISIT$lme.resid <- resid(lme.mod)
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = lme.resid, group = Station)) +
facet_wrap(~fMonth)
That's pretty terrible. The biggest problem is perhaps that we need to include an additive (nonlinear) component somehow, if we are ever to model that hump in the Fall months.
Let's now try a GAM, using a quasi-Poisson family to deal with the heteroskedasticity. Note that this will not deal with the fact that April has a different amount of heteroskedasticity than the other months.
April is the cruelest month.
library(mgcv)
gam.mod <- gam(Sources ~ s(SampleDepth) + fMonth, family = quasipoisson, data = ISIT)
## Warning: the matrix is either rank-deficient or indefinite
summary(gam.mod)
##
## Family: quasipoisson
## Link function: log
##
## Formula:
## Sources ~ s(SampleDepth) + fMonth
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3771 0.0467 29.48 <2e-16 ***
## fMonthApril -0.1542 0.0610 -2.53 0.012 *
## fMonthAugust 0.5406 0.0433 12.50 <2e-16 ***
## fMonthOctober 0.5434 0.0431 12.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(SampleDepth) 7.25 8.04 233 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.7 Deviance explained = 82.9%
## GCV score = 2.0754 Scale est. = 2.0458 n = 789
# It appears that the months are significant. But you might compare with
# the model without months to verify
anova(gam.mod) # gives you results for dropping either the smooth component or the Month factor
##
## Family: quasipoisson
## Link function: log
##
## Formula:
## Sources ~ s(SampleDepth) + fMonth
##
## Parametric Terms:
## df F p-value
## fMonth 3 102 <2e-16
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(SampleDepth) 7.25 8.04 233 <2e-16
The ANOVA results suggest that both the overall month effect and the smooth are significant.
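If you want that explicit comparison, here is a minimal sketch (the reduced-model name gam.mod.0 is mine; with a quasi family, an F test is the appropriate choice):
# Refit without the month factor and compare the two models
gam.mod.0 <- gam(Sources ~ s(SampleDepth), family = quasipoisson, data = ISIT)
anova(gam.mod.0, gam.mod, test = "F")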
We could plot either the residuals (y-axis) against the fitted values (x-axis), or the residuals against the individual regressors. Normally there might be too many regressors to look at every individual plot, but here we have only one regressor, Depth, so it isn't difficult at all. So this is what we will do.
ISIT$resid.gam.mod <- residuals(gam.mod, type = "pearson")
ISIT$fit.gam.mod <- predict(gam.mod, type = "link")  # fitted values on the link scale
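# A quick base-graphics look at residuals vs fitted values (the first of the
# two diagnostics described above); the residuals-vs-depth ggplots follow
plot(ISIT$fit.gam.mod, ISIT$resid.gam.mod)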
# First I tried this:
ggplot(data = ISIT) + geom_point(aes(x = SampleDepth, y = resid.gam.mod, group = Station)) +
facet_wrap(~fMonth)
# As if that wasn't ugly enough, I wanted to see how the different
# stations look
ggplot(data = ISIT) + geom_point(aes(x = SampleDepth, y = resid.gam.mod, group = Station,
color = factor(Station))) + facet_wrap(~fMonth)
# Not a good plot, but it shows a lot of grouping by station: so a line
# plot might make sense
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = resid.gam.mod, group = Station)) +
facet_wrap(~fMonth)
That's really ugly. There is no way that this is an acceptable model. Normally, a line plot of residuals shouldn't even make sense - it should look like spaghetti. There should be no clustering of residuals and no trends. The fact that we can draw a line and have it be more smooth than erratic suggests very strong clustering within stations and unaccounted-for patterns.
So, where to? We might try random effects for each station, but I don't think that this will solve all our problems. There appear to be trends that are common across months. We might fit a separate GAM for each month, or even for each station.
# This will take a while, even on a fast computer.
gam.mod.1 <- gam(Sources ~ s(SampleDepth, by = factor(Station)), family = quasipoisson,
data = ISIT)
ISIT$resid.gam.mod <- residuals(gam.mod.1, type = "pearson")
ISIT$fit.gam.mod <- predict(gam.mod.1, type = "link")
plot(ISIT$fit.gam.mod, ISIT$resid.gam.mod)
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = resid.gam.mod, group = Station)) +
facet_wrap(~fMonth)
That's not perfect, but it's better. April and October are a little screwy. But isn't there any way we can be more parsimonious? We haven't learned much about the patterns.
What if, instead of one smooth per Station, we had one smooth per month?
gam.mod.2 <- gam(Sources ~ s(SampleDepth, by = fMonth), family = quasipoisson,
data = ISIT)
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
ISIT$resid.gam.mod <- residuals(gam.mod.2, type = "pearson")
ISIT$fit.gam.mod <- predict(gam.mod.2, type = "link")
ISIT$fit2.gam.mod <- predict(gam.mod.2, type = "response")
plot(ISIT$fit.gam.mod, ISIT$resid.gam.mod) # Not great at all
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = resid.gam.mod, group = Station)) +
facet_wrap(~fMonth)
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = fit2.gam.mod, group = Station)) +
facet_wrap(~fMonth)
Not great. There is still too much variance at the high values. The quasi-Poisson family assumes that the variance is proportional to the mean, i.e. a linear mean-variance relationship. Maybe it is quadratic, more like a negative binomial. For this there is an extra family, the quasi family, which offers a few different options for the relation between mean and variance.
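For reference, here is a sketch of the family we are about to specify (the list of allowed variance strings is taken from help(quasi); double-check it in your version of R):
# Per help(quasi), the variance argument accepts "constant", "mu(1-mu)",
# "mu", "mu^2" and "mu^3". variance = "mu" mimics quasi-Poisson, while
# "mu^2" gives the quadratic mean-variance relationship we want to try.
quasi(link = "log", variance = "mu^2")  # print the family object to confirm the call is accepted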
# help(quasi) # The allowed variance functions are buried in there.
gam.mod.3 <- gam(Sources ~ s(SampleDepth, by = fMonth), family = quasi(link = "log",
variance = "mu^2"), data = ISIT)
ISIT$resid.gam.mod <- residuals(gam.mod.3, type = "pearson")
ISIT$fit.gam.mod <- predict(gam.mod.3, type = "link") # Predict the link function (before applying the inverse link)
ISIT$fit2.gam.mod <- predict(gam.mod.3, type = "response") # Predict the response value (after applying the inverse link)
plot(ISIT$fit.gam.mod, ISIT$resid.gam.mod)
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = resid.gam.mod, group = Station)) +
facet_wrap(~fMonth)
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = fit2.gam.mod, group = Station)) +
facet_wrap(~fMonth)
That seems to fix March, but break August. April is still screwy, but it's screwy down at the deep depths where the estimate is practically zero anyway, so I'm less concerned there. Remember that these are scaled (Pearson) residuals, so they are big relative to the estimated variance, not big in absolute terms.
Here's a different way to visualize the residuals. It throws away the depth dimension, but it shows the differences between stations and months nicely.
ggplot(data = ISIT) + geom_boxplot(aes(y = resid.gam.mod, x = factor(Station),
fill = fMonth))
There are still some systematic differences between stations. The next thing to try might be a station-specific random effect. We are now talking about a generalized additive mixed model, and that could get hard to estimate.
ISIT$fStation <- factor(ISIT$Station)
gam.mod.4 <- gamm(Sources ~ s(SampleDepth, by = fMonth), family = quasi(link = "log",
variance = "mu^2"), random = list(fStation = ~1), data = ISIT)
##
## Maximum number of PQL iterations: 20
## iteration 1
## iteration 2
## iteration 3
## iteration 4
## iteration 5
## iteration 6
# Note that random effects have to enter as a list.
summary(gam.mod.4$gam)  # The output is screwy too: it outputs a gam object...
##
## Family: quasi
## Link function: log
##
## Formula:
## Sources ~ s(SampleDepth, by = fMonth)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5245 0.0899 16.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(SampleDepth):fMonthMarch 6.13 6.13 189 <2e-16 ***
## s(SampleDepth):fMonthApril 1.00 1.00 1326 <2e-16 ***
## s(SampleDepth):fMonthAugust 8.59 8.59 217 <2e-16 ***
## s(SampleDepth):fMonthOctober 7.28 7.28 276 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.67 Scale est. = 0.16783 n = 789
summary(gam.mod.4$lme)  # ...and an lme object
## Linear mixed-effects model fit by maximum likelihood
## Data: data
## AIC BIC logLik
## 1010 1062 -494.2
##
## Random effects:
## Formula: ~Xr - 1 | g
## Structure: pdIdnot
## Xr1 Xr2 Xr3 Xr4 Xr5 Xr6 Xr7 Xr8
## StdDev: 28.56 28.56 28.56 28.56 28.56 28.56 28.56 28.56
##
## Formula: ~Xr.0 - 1 | g.0 %in% g
## Structure: pdIdnot
## Xr.01 Xr.02 Xr.03 Xr.04 Xr.05 Xr.06
## StdDev: 0.0003958 0.0003958 0.0003958 0.0003958 0.0003958 0.0003958
## Xr.07 Xr.08
## StdDev: 0.0003958 0.0003958
##
## Formula: ~Xr.1 - 1 | g.1 %in% g.0 %in% g
## Structure: pdIdnot
## Xr.11 Xr.12 Xr.13 Xr.14 Xr.15 Xr.16 Xr.17 Xr.18
## StdDev: 66.38 66.38 66.38 66.38 66.38 66.38 66.38 66.38
##
## Formula: ~Xr.2 - 1 | g.2 %in% g.1 %in% g.0 %in% g
## Structure: pdIdnot
## Xr.21 Xr.22 Xr.23 Xr.24 Xr.25 Xr.26 Xr.27 Xr.28
## StdDev: 23.45 23.45 23.45 23.45 23.45 23.45 23.45 23.45
##
## Formula: ~1 | fStation %in% g.2 %in% g.1 %in% g.0 %in% g
## (Intercept) Residual
## StdDev: 0.3763 0.4097
##
## Variance function:
## Structure: fixed weights
## Formula: ~invwt
## Fixed effects: list(fixed)
## Value Std.Error DF t-value p-value
## X(Intercept) 1.525 0.0902 766 16.91 0.0000
## Xs(SampleDepth):fMonthMarchFx1 -1.829 1.4685 766 -1.25 0.2134
## Xs(SampleDepth):fMonthAprilFx1 -1.333 0.0367 766 -36.32 0.0000
## Xs(SampleDepth):fMonthAugustFx1 -7.379 0.9904 766 -7.45 0.0000
## Xs(SampleDepth):fMonthOctoberFx1 -1.069 0.4172 766 -2.56 0.0106
## Correlation:
## X(Int) X(SD):MM Xs(SmplDpth):fMnthApF1
## Xs(SampleDepth):fMonthMarchFx1 0.129
## Xs(SampleDepth):fMonthAprilFx1 0.025 0.003
## Xs(SampleDepth):fMonthAugustFx1 -0.005 -0.001 0.000
## Xs(SampleDepth):fMonthOctoberFx1 -0.001 0.000 0.000
## Xs(SmplDpth):fMnthAgF1
## Xs(SampleDepth):fMonthMarchFx1
## Xs(SampleDepth):fMonthAprilFx1
## Xs(SampleDepth):fMonthAugustFx1
## Xs(SampleDepth):fMonthOctoberFx1 0.000
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -2.45340 -0.53241 -0.06104 0.41734 6.63648
##
## Number of Observations: 789
## Number of Groups:
## g
## 1
## g.0 %in% g
## 1
## g.1 %in% g.0 %in% g
## 1
## g.2 %in% g.1 %in% g.0 %in% g
## 1
## fStation %in% g.2 %in% g.1 %in% g.0 %in% g
## 19
A GAMM is fit by an iterative procedure. Essentially, a GLM is already fit iteratively, with each iteration involving a simpler weighted linear regression; in a GAMM (as in a GLMM fit by PQL), each iteration is instead a linear mixed-effects model fit. The gam object is the final output for the fixed effects and the smooth parts; the lme object is the final output of the inner lme step. If you want predictions or fixed effects, then you need to inspect the gam object. If you want random effects, then you need to inspect the lme object.
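If you want the station-level random intercepts themselves, for example, pull them out of the lme component (a sketch; gamm() wraps the smooths in extra artificial grouping levels, so the station intercepts sit at the deepest level of the list that ranef returns):
# Random effects come from the lme component of the gamm fit
re <- ranef(gam.mod.4$lme)  # a list of data frames, one per grouping level
str(re, max.level = 1)  # the deepest level holds the fStation intercepts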
The lme object is a bit cryptic, I admit. But I will point your attention to the within-station correlation, which is about 46% (computed below from the random-intercept and residual standard deviations reported above). It's probably even higher than that; the random intercept isn't entirely adequate. But we don't know that yet. More on that later.
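Computed from the standard deviations reported in the lme summary above:
# Within-station (intraclass) correlation: ratio of variances, not of SDs
0.3763^2 / (0.3763^2 + 0.4097^2)  # roughly 0.46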
The package developers are working on a new package, gamm4, which should be more robust and user-friendly, but it's not quite ready.
ISIT$resid.gam.mod <- residuals(gam.mod.4$lme, type = "pearson")
ISIT$fit.gam.mod <- predict(gam.mod.4$gam, type = "link")
ISIT$fit2.gam.mod <- predict(gam.mod.4$gam, type = "response")
plot(ISIT$fit.gam.mod, ISIT$resid.gam.mod)
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = resid.gam.mod, group = Station)) +
facet_wrap(~fMonth)
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = fit2.gam.mod, group = Station)) +
facet_wrap(~fMonth)
ggplot(data = ISIT) + geom_boxplot(aes(y = resid.gam.mod, x = factor(Station),
fill = fMonth))
Look at those residuals! Isn't that beautiful?
# Create a new dataset for prediction. The dataset should have a month
# column, and depths, maybe from 500m to 5000 m
pred.depths <- seq(from = 500, to = 5000, by = 100)
ndepths <- length(pred.depths)
ISIT.pred <- data.frame(SampleDepth = rep(pred.depths, times = 4), Station = 0,
Month = rep(c(3, 4, 8, 10), each = ndepths))
head(ISIT.pred) # See what we've done.
## SampleDepth Station Month
## 1 500 0 3
## 2 600 0 3
## 3 700 0 3
## 4 800 0 3
## 5 900 0 3
## 6 1000 0 3
ISIT.pred$fMonth <- factor(ISIT.pred$Month, levels = c(3, 4, 8, 10), labels = c("March",
"April", "August", "October"))
ISIT.pred <- cbind(ISIT.pred, as.data.frame(predict(gam.mod.4$gam, ISIT.pred,
se.fit = TRUE)))
ISIT.pred$fit.l <- exp(ISIT.pred$fit - 2 * ISIT.pred$se.fit) # prediction - 2 s.e.
ISIT.pred$fit.h <- exp(ISIT.pred$fit + 2 * ISIT.pred$se.fit) # prediction + 2 s.e.
ISIT.pred$fit.m <- exp(ISIT.pred$fit) # prediction.
ggplot(data = ISIT.pred) + geom_line(aes(x = SampleDepth, y = fit.m)) + facet_wrap(~fMonth) +
geom_ribbon(aes(x = SampleDepth, ymax = fit.h, ymin = fit.l), alpha = 0.5)
Here are the predictions. Note the erratic behavior at great depths in March. But there isn't any data there; this is pure extrapolation. For a publication I would probably delete the prediction points that are extrapolations.
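One way to trim the extrapolated points before plotting (a sketch; it simply drops prediction depths beyond the deepest observation in each month, and the names obs.max and ISIT.pred.trim are mine):
# Find the deepest observed sample per month, then drop deeper predictions
obs.max <- aggregate(SampleDepth ~ fMonth, data = ISIT, FUN = max)
names(obs.max)[2] <- "maxDepth"
ISIT.pred.trim <- merge(ISIT.pred, obs.max, by = "fMonth")
ISIT.pred.trim <- ISIT.pred.trim[ISIT.pred.trim$SampleDepth <= ISIT.pred.trim$maxDepth, ]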
There is at least one thing we have not done: we have not looked for residual autocorrelation. There are actually three spatial dimensions: latitude, longitude, and depth. Ideally we should check for correlation along each of these.
Here it is for Depth:
plot(Variogram(gam.mod.4$lme, robust = TRUE, data = ISIT, form = ~SampleDepth |
fStation))
This should be a flat line. It's not. Crap. What this means is that when a station's curve is above (below) its monthly prediction, then neighboring depths are also likely to be above (below) their monthly prediction. This deviation from the monthly mean persists up to about 1000 meters. I had hoped that the random intercept would fix this. It didn't. Not entirely, anyway.
The practical effect of this is that the confidence intervals are still too narrow.
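One possible remedy, which I won't pursue here, is to hand gamm() a continuous correlation structure along depth within each station (a sketch, not run; corExp is only one of several nlme options, and adding it can make the fit slow or unstable):
# Not run: within-station exponential correlation along depth
# gam.mod.6 <- gamm(Sources ~ s(SampleDepth, by = fMonth),
#     family = quasi(link = "log", variance = "mu^2"),
#     random = list(fStation = ~1),
#     correlation = corExp(form = ~SampleDepth | fStation), data = ISIT)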
There are a couple of possible sources for this autocorrelation, but we'll save that battle for another day. And we still haven't looked at spatial autocorrelation.
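When we do get to the spatial part, the same Variogram trick should work in the horizontal plane (a sketch, not run; it assumes the data carry coordinate columns such as Latitude and Longitude, so check names(ISIT) first):
# Not run: an empirical variogram over horizontal distance, by month,
# so that the distances are between stations sampled on the same cruise
# plot(Variogram(gam.mod.4$lme, robust = TRUE, data = ISIT,
#     form = ~Latitude + Longitude | fMonth))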
# We might try this as a model that accounts for monthly averages (which are
# of interest) and station-specific deviations (which are not of interest).
# But this model isn't easy to solve...
# gam.mod.5 <- gam(Sources ~ s(SampleDepth, by = fMonth) +
#     s(SampleDepth, by = factor(Station)),
#     family = quasi(link = "log", variance = "mu^2"), data = ISIT)