We will use the bioluminescence data that I also showed in class. These data are reported in Chapter 17 of Zuur et al.'s text, “Mixed Effects Models and Extensions in Ecology with R”. I will deviate from their analysis at times, however. I encourage you to read that chapter for a richer description of the data and the problem than is given here.
You can get the data from this folder: https://www.dropbox.com/sh/za8bxeor5fdruye/wzA88i84ey (you are looking for the file ISIT.txt).
The data were gathered during four cruises of a research vessel over 2001 and 2002 in the Atlantic Ocean west of Ireland. The cruises were organized in Spring and late Summer to collect samples before and after the downward flux of particulate organic content that occurs in June and July. The ship is busy and used for many different purposes, however, and the timing of the samples is not quite the same in 2001 and 2002.
The data were collected by dropping a device overboard. As it falls, it stirs up water, inducing bioluminescence. A camera is periodically turned on and is used to measure the amount of light emitted at various depths.
First, create a graphic that shows the data as a function of depth, separated by month.
ISIT <- read.table("../Data/ISIT.txt", header = TRUE)
library(ggplot2)
# First plot, without separation by month:
ggp <- ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = Sources, group = Station))
ggp
# Add separation by month
ggp <- ggp + facet_wrap(~Month)
ggp
# Clean up the axes
ggp <- ggp + scale_y_continuous("Source") + scale_x_continuous("Depth (m)")
ggp
# Give the months more helpful names
ISIT$fMonth <- factor(ISIT$Month, levels = c(3, 4, 8, 10), labels = c("March",
"April", "August", "October"))
ggp <- ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = Sources, group = Station)) +
facet_wrap(~fMonth) + scale_y_continuous("Source") + scale_x_continuous("Depth (m)")
ggp
A quick inspection shows a few noteworthy points:
Zuur et al. also produce a plot like this:
ggp <- ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = Sources, group = Station)) +
facet_grid(Year ~ fMonth) + scale_y_continuous("Source") + scale_x_continuous("Depth (m)")
ggp
# Mine had ugly tick marks and labels - fix those
ggp <- ggp + scale_x_continuous("Depth (1000m)", breaks = c(1000, 2000, 3000,
4000, 5000), labels = c(1:5))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
ggp
I'm still astonished at how little noise there is in April.
There are a few ways to deal with the nonlinearity between Sources and Depth, and a few ways to deal with the heteroskedasticity. We will work through several of them below.
First, we'll fit a linear mixed-effects model. I don't think that a linear model is right in this case, but I want to show how the lme function works. Since the relationship isn't linear, let's try transforming the dependent variable. There are zeros, so a straight-up log transform won't work. The smallest nonzero value is 0.11 (I used the sort function), so let's try log(Sources + 0.1). Besides, previous literature has suggested a logarithmic relationship between Sources and depth; you might start with this model just to show that it is inadequate.
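As a quick check of that claim, you can confirm the smallest nonzero value directly (a minimal sketch; it just inspects the Sources column):
# Smallest nonzero value of Sources -- this is what motivates the +0.1 offset
min(ISIT$Sources[ISIT$Sources > 0])
head(sort(unique(ISIT$Sources)))  # or eyeball the low end of the sorted values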
library(nlme)
lme.mod <- lme(log(Sources + 0.1) ~ fMonth * log(SampleDepth), random = ~1 |
Station, data = ISIT)
ISIT$lme.fit <- fitted(lme.mod, level = 0)  # population-level (fixed-effects-only) fitted values
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = lme.fit, group = Station)) +
facet_wrap(~fMonth)
ISIT$lme.resid <- resid(lme.mod)
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = lme.resid, group = Station)) +
facet_wrap(~fMonth)
That's pretty terrible. The biggest problem is perhaps that we need to include an additive (nonlinear) component somehow, if we are ever to model that hump in the Fall months.
Let's now try a GAM, using a quasi-Poisson family to deal with the heteroskedasticity. Note that this will not deal with the fact that April has a different amount of heteroskedasticity than the other months.
April is the cruelest month.
library(mgcv)
gam.mod <- gam(Sources ~ s(SampleDepth) + fMonth, family = quasipoisson, data = ISIT)
## Warning: the matrix is either rank-deficient or indefinite
summary(gam.mod)
##
## Family: quasipoisson
## Link function: log
##
## Formula:
## Sources ~ s(SampleDepth) + fMonth
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3771 0.0467 29.48 <2e-16 ***
## fMonthApril -0.1542 0.0610 -2.53 0.012 *
## fMonthAugust 0.5406 0.0433 12.50 <2e-16 ***
## fMonthOctober 0.5434 0.0431 12.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(SampleDepth) 7.25 8.04 233 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.7 Deviance explained = 82.9%
## GCV score = 2.0754 Scale est. = 2.0458 n = 789
# It appears that the months are significant. But you might compare with
# the model without months to verify
anova(gam.mod) # gives you results for dropping either the smooth component or the Month factor
##
## Family: quasipoisson
## Link function: log
##
## Formula:
## Sources ~ s(SampleDepth) + fMonth
##
## Parametric Terms:
## df F p-value
## fMonth 3 102 <2e-16
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(SampleDepth) 7.25 8.04 233 <2e-16
The ANOVA results suggest that both the overall month effect and the smooth are significant.
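If you want that explicit comparison, here is a minimal sketch (the reduced-model name gam.mod.0 is mine; with a quasi family, an F test is the appropriate choice):
# Refit without the month factor and compare the two models
gam.mod.0 <- gam(Sources ~ s(SampleDepth), family = quasipoisson, data = ISIT)
anova(gam.mod.0, gam.mod, test = "F")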
We could plot either the residuals (y-axis) against the fitted values (x-axis), or the residuals against the individual regressors. Normally there might be too many regressors to look at every individual plot, but here we have only one regressor, Depth, so it isn't difficult at all. So this is what we will do.
ISIT$resid.gam.mod <- residuals(gam.mod, type = "pearson")
ISIT$fit.gam.mod <- predict(gam.mod, type = "link")  # fitted values on the link scale
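# A quick base-graphics look at residuals vs fitted values (the first of the
# two diagnostics described above); the residuals-vs-depth ggplots follow
plot(ISIT$fit.gam.mod, ISIT$resid.gam.mod)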
# First I tried this:
ggplot(data = ISIT) + geom_point(aes(x = SampleDepth, y = resid.gam.mod, group = Station)) +
facet_wrap(~fMonth)
# As if that wasn't ugly enough, I wanted to see how the different
# stations look
ggplot(data = ISIT) + geom_point(aes(x = SampleDepth, y = resid.gam.mod, group = Station,
color = factor(Station))) + facet_wrap(~fMonth)
# Not a good plot, but it shows a lot of grouping by station: so a line
# plot might make sense
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = resid.gam.mod, group = Station)) +
facet_wrap(~fMonth)
That's really ugly. There is no way that this is an acceptable model. Normally, a line plot of residuals shouldn't even make sense - it should look like spaghetti. There should be no clustering of residuals and no trends. The fact that we can draw a line and have it be more smooth than erratic suggests very strong clustering within stations and unaccounted-for patterns.
So, where to? We might try random effects for each station, but I don't think that this will solve all our problems. There appear to be trends that are common across months. We might fit a separate GAM for each month, or even for each station.
# This will take a while, even on a fast computer.
gam.mod.1 <- gam(Sources ~ s(SampleDepth, by = factor(Station)), family = quasipoisson,
data = ISIT)
ISIT$resid.gam.mod <- residuals(gam.mod.1, type = "pearson")
ISIT$fit.gam.mod <- predict(gam.mod.1, type = "link")
plot(ISIT$fit.gam.mod, ISIT$resid.gam.mod)
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = resid.gam.mod, group = Station)) +
facet_wrap(~fMonth)
That's not perfect, but it's better. April and October are a little screwy. But isn't there any way we can be more parsimonious? We haven't learned much about the patterns.
What if, instead of one smooth per Station, we had one smooth per month?
gam.mod.2 <- gam(Sources ~ s(SampleDepth, by = fMonth), family = quasipoisson,
data = ISIT)
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
ISIT$resid.gam.mod <- residuals(gam.mod.2, type = "pearson")
ISIT$fit.gam.mod <- predict(gam.mod.2, type = "link")
ISIT$fit2.gam.mod <- predict(gam.mod.2, type = "response")
plot(ISIT$fit.gam.mod, ISIT$resid.gam.mod) # Not great at all
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = resid.gam.mod, group = Station)) +
facet_wrap(~fMonth)
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = fit2.gam.mod, group = Station)) +
facet_wrap(~fMonth)
Not great. There is still too much variance at the high values. The quasi-Poisson family assumes that the variance is proportional to the mean, i.e. a linear mean-variance relationship. Maybe it is quadratic, more like a negative binomial. For this there is an extra family, the quasi family, which offers a few different options for the relation between mean and variance.
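For reference, here is a sketch of the family we are about to specify (the list of allowed variance strings is taken from help(quasi); double-check it in your version of R):
# Per help(quasi), the variance argument accepts "constant", "mu(1-mu)",
# "mu", "mu^2" and "mu^3". variance = "mu" mimics quasi-Poisson, while
# "mu^2" gives the quadratic mean-variance relationship we want to try.
quasi(link = "log", variance = "mu^2")  # print the family object to confirm the call is accepted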
# help(quasi) # The allowed variance functions are buried in there.
gam.mod.3 <- gam(Sources ~ s(SampleDepth, by = fMonth), family = quasi(link = "log",
variance = "mu^2"), data = ISIT)
ISIT$resid.gam.mod <- residuals(gam.mod.3, type = "pearson")
ISIT$fit.gam.mod <- predict(gam.mod.3, type = "link") # Predict the link function (before applying the inverse link)
ISIT$fit2.gam.mod <- predict(gam.mod.3, type = "response") # Predict the response value (after applying the inverse link)
plot(ISIT$fit.gam.mod, ISIT$resid.gam.mod)
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = resid.gam.mod, group = Station)) +
facet_wrap(~fMonth)
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = fit2.gam.mod, group = Station)) +
facet_wrap(~fMonth)
That seems to fix March, but break August. April is still screwy, but it's screwy down at the deep depths where the estimate is practically zero anyway, so I'm less concerned there. Remember that these are scaled (Pearson) residuals, so they are big relative to the estimated variance, not big in absolute terms.
Here's a different way to visualize the residuals. It throws away the depth dimension, but it shows the differences between stations and months nicely.
ggplot(data = ISIT) + geom_boxplot(aes(y = resid.gam.mod, x = factor(Station),
fill = fMonth))
There are still some systematic differences between stations. The next thing to try might be a station-specific random effect. We are now talking about a generalized additive mixed model, and that could get hard to estimate.
ISIT$fStation <- factor(ISIT$Station)
gam.mod.4 <- gamm(Sources ~ s(SampleDepth, by = fMonth), family = quasi(link = "log",
variance = "mu^2"), random = list(fStation = ~1), data = ISIT)
##
## Maximum number of PQL iterations: 20
## iteration 1
## iteration 2
## iteration 3
## iteration 4
## iteration 5
## iteration 6
# Note that random effects have to enter as a list.
summary(gam.mod.4$gam)  # The output is screwy too: it outputs a gam object...
##
## Family: quasi
## Link function: log
##
## Formula:
## Sources ~ s(SampleDepth, by = fMonth)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5245 0.0899 16.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(SampleDepth):fMonthMarch 6.13 6.13 189 <2e-16 ***
## s(SampleDepth):fMonthApril 1.00 1.00 1326 <2e-16 ***
## s(SampleDepth):fMonthAugust 8.59 8.59 217 <2e-16 ***
## s(SampleDepth):fMonthOctober 7.28 7.28 276 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.67 Scale est. = 0.16783 n = 789
summary(gam.mod.4$lme)  # ...and an lme object
## Linear mixed-effects model fit by maximum likelihood
## Data: data
## AIC BIC logLik
## 1010 1062 -494.2
##
## Random effects:
## Formula: ~Xr - 1 | g
## Structure: pdIdnot
## Xr1 Xr2 Xr3 Xr4 Xr5 Xr6 Xr7 Xr8
## StdDev: 28.56 28.56 28.56 28.56 28.56 28.56 28.56 28.56
##
## Formula: ~Xr.0 - 1 | g.0 %in% g
## Structure: pdIdnot
## Xr.01 Xr.02 Xr.03 Xr.04 Xr.05 Xr.06
## StdDev: 0.0003958 0.0003958 0.0003958 0.0003958 0.0003958 0.0003958
## Xr.07 Xr.08
## StdDev: 0.0003958 0.0003958
##
## Formula: ~Xr.1 - 1 | g.1 %in% g.0 %in% g
## Structure: pdIdnot
## Xr.11 Xr.12 Xr.13 Xr.14 Xr.15 Xr.16 Xr.17 Xr.18
## StdDev: 66.38 66.38 66.38 66.38 66.38 66.38 66.38 66.38
##
## Formula: ~Xr.2 - 1 | g.2 %in% g.1 %in% g.0 %in% g
## Structure: pdIdnot
## Xr.21 Xr.22 Xr.23 Xr.24 Xr.25 Xr.26 Xr.27 Xr.28
## StdDev: 23.45 23.45 23.45 23.45 23.45 23.45 23.45 23.45
##
## Formula: ~1 | fStation %in% g.2 %in% g.1 %in% g.0 %in% g
## (Intercept) Residual
## StdDev: 0.3763 0.4097
##
## Variance function:
## Structure: fixed weights
## Formula: ~invwt
## Fixed effects: list(fixed)
## Value Std.Error DF t-value p-value
## X(Intercept) 1.525 0.0902 766 16.91 0.0000
## Xs(SampleDepth):fMonthMarchFx1 -1.829 1.4685 766 -1.25 0.2134
## Xs(SampleDepth):fMonthAprilFx1 -1.333 0.0367 766 -36.32 0.0000
## Xs(SampleDepth):fMonthAugustFx1 -7.379 0.9904 766 -7.45 0.0000
## Xs(SampleDepth):fMonthOctoberFx1 -1.069 0.4172 766 -2.56 0.0106
## Correlation:
## X(Int) X(SD):MM Xs(SmplDpth):fMnthApF1
## Xs(SampleDepth):fMonthMarchFx1 0.129
## Xs(SampleDepth):fMonthAprilFx1 0.025 0.003
## Xs(SampleDepth):fMonthAugustFx1 -0.005 -0.001 0.000
## Xs(SampleDepth):fMonthOctoberFx1 -0.001 0.000 0.000
## Xs(SmplDpth):fMnthAgF1
## Xs(SampleDepth):fMonthMarchFx1
## Xs(SampleDepth):fMonthAprilFx1
## Xs(SampleDepth):fMonthAugustFx1
## Xs(SampleDepth):fMonthOctoberFx1 0.000
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -2.45340 -0.53241 -0.06104 0.41734 6.63648
##
## Number of Observations: 789
## Number of Groups:
## g
## 1
## g.0 %in% g
## 1
## g.1 %in% g.0 %in% g
## 1
## g.2 %in% g.1 %in% g.0 %in% g
## 1
## fStation %in% g.2 %in% g.1 %in% g.0 %in% g
## 19
A GAMM is fit by an iterative procedure. Essentially, a GLM is already fit iteratively, with each iteration involving a simpler weighted linear regression; in a GAMM (as in a GLMM fit by PQL), each iteration is instead a linear mixed-effects model fit. The gam object is the final output for the fixed effects and the smooth parts; the lme object is the final output of the inner lme step. If you want predictions or fixed effects, then you need to inspect the gam object. If you want random effects, then you need to inspect the lme object.
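If you want the station-level random intercepts themselves, for example, pull them out of the lme component (a sketch; gamm() wraps the smooths in extra artificial grouping levels, so the station intercepts sit at the deepest level of the list that ranef returns):
# Random effects come from the lme component of the gamm fit
re <- ranef(gam.mod.4$lme)  # a list of data frames, one per grouping level
str(re, max.level = 1)  # the deepest level holds the fStation intercepts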
The lme object is a bit cryptic, I admit. But I will point your attention to the within-station correlation, which is about 46% (computed below from the random-intercept and residual standard deviations reported above). It's probably even higher than that; the random intercept isn't entirely adequate. But we don't know that yet. More on that later.
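Computed from the standard deviations reported in the lme summary above:
# Within-station (intraclass) correlation: ratio of variances, not of SDs
0.3763^2 / (0.3763^2 + 0.4097^2)  # roughly 0.46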
The package developers are working on a new package, gamm4, which should be more robust and user-friendly, but it's not quite ready.
ISIT$resid.gam.mod <- residuals(gam.mod.4$lme, type = "pearson")
ISIT$fit.gam.mod <- predict(gam.mod.4$gam, type = "link")
ISIT$fit2.gam.mod <- predict(gam.mod.4$gam, type = "response")
plot(ISIT$fit.gam.mod, ISIT$resid.gam.mod)
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = resid.gam.mod, group = Station)) +
facet_wrap(~fMonth)
ggplot(data = ISIT) + geom_line(aes(x = SampleDepth, y = fit2.gam.mod, group = Station)) +
facet_wrap(~fMonth)
ggplot(data = ISIT) + geom_boxplot(aes(y = resid.gam.mod, x = factor(Station),
fill = fMonth))
Look at those residuals! Isn't that beautiful?
# Create a new dataset for prediction. The dataset should have a month
# column, and depths, maybe from 500m to 5000 m
pred.depths <- seq(from = 500, to = 5000, by = 100)
ndepths <- length(pred.depths)
ISIT.pred <- data.frame(SampleDepth = rep(pred.depths, times = 4), Station = 0,
Month = rep(c(3, 4, 8, 10), each = ndepths))
head(ISIT.pred) # See what we've done.
## SampleDepth Station Month
## 1 500 0 3
## 2 600 0 3
## 3 700 0 3
## 4 800 0 3
## 5 900 0 3
## 6 1000 0 3
ISIT.pred$fMonth <- factor(ISIT.pred$Month, levels = c(3, 4, 8, 10), labels = c("March",
"April", "August", "October"))
ISIT.pred <- cbind(ISIT.pred, as.data.frame(predict(gam.mod.4$gam, ISIT.pred,
se.fit = TRUE)))
ISIT.pred$fit.l <- exp(ISIT.pred$fit - 2 * ISIT.pred$se.fit) # prediction - 2 s.e.
ISIT.pred$fit.h <- exp(ISIT.pred$fit + 2 * ISIT.pred$se.fit) # prediction + 2 s.e.
ISIT.pred$fit.m <- exp(ISIT.pred$fit) # prediction.
ggplot(data = ISIT.pred) + geom_line(aes(x = SampleDepth, y = fit.m)) + facet_wrap(~fMonth) +
geom_ribbon(aes(x = SampleDepth, ymax = fit.h, ymin = fit.l), alpha = 0.5)
Here are the predictions. Note the erratic behavior at great depths in March. But there isn't any data there; this is pure extrapolation. For a publication I would probably delete the prediction points that are extrapolations.
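One way to trim the extrapolated points before plotting (a sketch; it simply drops prediction depths beyond the deepest observation in each month, and the names obs.max and ISIT.pred.trim are mine):
# Find the deepest observed sample per month, then drop deeper predictions
obs.max <- aggregate(SampleDepth ~ fMonth, data = ISIT, FUN = max)
names(obs.max)[2] <- "maxDepth"
ISIT.pred.trim <- merge(ISIT.pred, obs.max, by = "fMonth")
ISIT.pred.trim <- ISIT.pred.trim[ISIT.pred.trim$SampleDepth <= ISIT.pred.trim$maxDepth, ]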
There is at least one thing we have not done: we have not looked for residual autocorrelation. There are actually three spatial dimensions: latitude, longitude, and depth. Ideally we should check for correlation along each of these.
Here it is for Depth:
plot(Variogram(gam.mod.4$lme, robust = TRUE, data = ISIT, form = ~SampleDepth |
fStation))
This should be a flat line. It's not. Crap. What this means is that when a station's curve is above (below) its monthly prediction, then neighboring depths are also likely to be above (below) their monthly prediction. This deviation from the monthly mean persists up to about 1000 meters. I had hoped that the random intercept would fix this. It didn't. Not entirely, anyway.
The practical effect of this is that the confidence intervals are still too narrow.
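One possible remedy, which I won't pursue here, is to hand gamm() a continuous correlation structure along depth within each station (a sketch, not run; corExp is only one of several nlme options, and adding it can make the fit slow or unstable):
# Not run: within-station exponential correlation along depth
# gam.mod.6 <- gamm(Sources ~ s(SampleDepth, by = fMonth),
#     family = quasi(link = "log", variance = "mu^2"),
#     random = list(fStation = ~1),
#     correlation = corExp(form = ~SampleDepth | fStation), data = ISIT)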
There are a couple of possible sources for this autocorrelation, but we'll save that battle for another day. And we still haven't looked at spatial autocorrelation.
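When we do get to the spatial part, the same Variogram trick should work in the horizontal plane (a sketch, not run; it assumes the data carry coordinate columns such as Latitude and Longitude, so check names(ISIT) first):
# Not run: an empirical variogram over horizontal distance, by month,
# so that the distances are between stations sampled on the same cruise
# plot(Variogram(gam.mod.4$lme, robust = TRUE, data = ISIT,
#     form = ~Latitude + Longitude | fMonth))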
# We might try this as a model that accounts for monthly averages (which are
# of interest) and station-specific deviations (which are not of interest).
# But this model isn't easy to solve...
# gam.mod.5 <- gam(Sources ~ s(SampleDepth, by = fMonth) +
#     s(SampleDepth, by = factor(Station)),
#     family = quasi(link = "log", variance = "mu^2"), data = ISIT)