Salamander GLMM

The “salamander mating” data set is a classic example in generalized linear mixed models (e.g. see this example from Charlie Geyer, or this one from the SAS PROC GLIMMIX documentation). Geyer points out that this example has been discussed by:

McCullagh and Nelder (1989), Schall (1991), Karim and Zeger (1992), Breslow and Clayton (1993), Wolfinger and O’Connell (1993), and Shun (1997).

More recently, it has been analyzed by Booth and Hobert (1999) and by Sung and Geyer (2007).

This document does a quick review of available GLMM packages from R …

Preliminaries

library("plyr")
library("lme4")
library("glmmADMB")
library("hglm")
## recommended (i.e. built-in)
library("MASS")
library("nlme")

The bernor package isn’t on CRAN, needs to be installed via:

install.packages("bernor",
                 repos="http://www.math.mcmaster.ca/bolker/R",
                 type="source")

(you’ll need to have development tools, e.g. compilers, installed)

library("bernor")

Estimates from the literature/elsewhere

Estimates of female and male variance parameters from SAS PROC GLIMMIX:

##      var_est var_se
## fnum 1.40990 0.8871
## mnum 0.08963 0.4102

Estimates of all parameters (from Booth and Hobert (1999) by way of Geyer’s example).

##     R/R     R/W     W/R     W/W sigma_f sigma_m 
##   1.030   0.320  -1.950   0.990   1.183   1.118

Geyer considers these results to be the gold standard (he gets close to them with a sufficiently thorough Monte Carlo maximum likelihood fit, and says he has independently confirmed with an MCMC fit). As shown below, it seems as though most available algorithms estimate the among-female variance by quite a lot (orders of magnitude, although surprisingly (to me) SAS PROC GLIMMIX is one of the closest …

For convenience, I’ve copied the data from the SAS online documentation example:

uu <- url("http://www.math.mcmaster.ca/bolker/R/misc/salamander.txt")
sdat <- read.table(uu,header=TRUE,
                   colClasses=c(rep("factor",5),"numeric"))

Check experimental design:

with(sdat,table(fpop,mpop))

##     mpop
## fpop rb ws
##   rb 30 30
##   ws 30 30

with(sdat,table(fnum,mnum))

##     mnum
## fnum 1 10 2 3 4 5 6 7 8 9
##   1  2  2 0 0 2 2 0 1 1 2
##   10 0  1 2 2 1 1 1 2 2 0
##   2  1  1 1 2 0 2 2 2 0 1
##   3  2  0 1 2 1 0 2 2 1 1
##   4  0  2 2 0 2 2 0 0 2 2
##   5  1  1 2 2 1 0 2 1 2 0
##   6  1  1 2 1 1 1 1 1 1 2
##   7  2  0 1 0 1 2 1 1 2 2
##   8  1  2 1 1 1 2 2 1 0 1
##   9  2  2 0 2 2 0 1 1 1 1

Methods

lme4

library(lme4)
LM_t_0 <- system.time(
  LM_fit_0 <- glmer(mating~0+fpop:mpop+
        (1|mnum)+(1|fnum),
      sdat,family=binomial))
  
print(LM_fit_0,cor=FALSE)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: mating ~ 0 + fpop:mpop + (1 | mnum) + (1 | fnum)
##    Data: sdat
##      AIC      BIC   logLik deviance df.resid 
## 148.8094 165.5344 -68.4047 136.8094      114 
## Random effects:
##  Groups Name        Std.Dev.
##  mnum   (Intercept) 0.8572  
##  fnum   (Intercept) 0.0000  
## Number of obs: 120, groups:  mnum, 10; fnum, 10
## Fixed Effects:
## fpoprb:mpoprb  fpopws:mpoprb  fpoprb:mpopws  fpopws:mpopws  
##        1.1654        -1.3712         0.8004         0.9777

glmmADMB

library(glmmADMB)
GA_t_0 <-
  system.time(
    GA_fit_0 <- glmmadmb(mating~0+fpop:mpop+
        (1|mnum)+(1|fnum),
      sdat,family="binomial"))

Importance sampling

Using importance sampling to try to do better (see D. Fournier’s post):

## 25 points, 50 samples, random number seed=121:
GA_t_is1 <- system.time(
  GA_fit_is1 <- glmmadmb(mating~0+fpop:mpop+
        (1|mnum)+(1|fnum),
      sdat,family="binomial",
          extra.args="-is 25  121  50"))
GA_t_is2 <- system.time(
  GA_fit_is2 <- glmmadmb(mating~0+fpop:mpop+
        (1|mnum)+(1|fnum),
      sdat,family="binomial",
          extra.args="-is 50  121  100"))

##                glmer      GA_0 GA_is_50 GA_is_100
## fpoprb:mpoprb  1.165  1.17e+00  1.16680  1.16e+00
## fpopws:mpoprb -1.371 -1.37e+00 -1.36838 -1.37e+00
## fpoprb:mpopws  0.800  8.01e-01  0.80102  7.99e-01
## fpopws:mpopws  0.978  9.78e-01  0.97803  9.76e-01
## mnum           0.735  7.36e-01  0.75562  7.50e-01
## fnum           0.000  3.53e-07  0.00148  4.20e-09

Conclusion: Results are similar from both approaches; Laplace approximation doesn’t work that well. Importance sampling only helps a bit.

glmmADMB with MCMC

GA_t_mcmc <- system.time(
  GA_fit_mcmc <- glmmadmb(mating~0+fpop:mpop+
        (1|mnum)+(1|fnum),
      sdat,family="binomial",
      mcmc=TRUE))

## Loading required package: coda
## Loading required package: lattice

Diagnose (the tmpL variables represent the variance parameters, on a standard deviation scale).

library(coda)

## Loading required package: lattice

xyplot(GA_fit_mcmc$mcmc[,1:6],layout=c(2,3),aspect="fill")

A close-up of the female-sd parameter:

Bottom line, these chains aren’t mixing very well. We would have to run it for a lot longer, or fuss with tuning parameters like mcgrope, to get a good answer this way.

glmmPQL

glmmPQL should in principle work, but various people have tried and failed

## http://tolstoy.newcastle.edu.au/R/help/05/01/10005.html
sdat$dummy <- factor(rep(1,nrow(sdat)))
sdatG <- groupedData(mating~1|dummy,sdat)
glmmPQL(mating~0+fpop:mpop,
        data=sdatG,
  random=pdBlocked(list(pdIdent(~mnum-1),
                        pdIdent(~fnum-1))),
        family=binomial)

## Error in getGroups.data.frame(dataMix, groups): invalid formula for groups

## this one works, but only with groupedData
lme(mating~0+fpop:mpop,data=sdatG,
        random=pdBlocked(list(pdIdent(~mnum-1),pdIdent(~fnum-1))))

## Linear mixed-effects model fit by REML
##   Data: sdatG 
##   Log-restricted-likelihood: -77.45596
##   Fixed: mating ~ 0 + fpop:mpop 
## fpoprb:mpoprb fpopws:mpoprb fpoprb:mpopws fpopws:mpopws 
##     0.7333333     0.2333333     0.6666667     0.7000000 
## 
## Random effects:
##  Composite Structure: Blocked
## 
##  Block 1: mnum1, mnum10, mnum2, mnum3, mnum4, mnum5, mnum6, mnum7, mnum8, mnum9
##  Formula: ~mnum - 1 | dummy
##  Structure: Multiple of an Identity
##            mnum1   mnum10    mnum2    mnum3    mnum4    mnum5    mnum6
## StdDev: 0.167323 0.167323 0.167323 0.167323 0.167323 0.167323 0.167323
##            mnum7    mnum8    mnum9
## StdDev: 0.167323 0.167323 0.167323
## 
##  Block 2: fnum1, fnum10, fnum2, fnum3, fnum4, fnum5, fnum6, fnum7, fnum8, fnum9
##  Formula: ~fnum - 1 | dummy
##  Structure: Multiple of an Identity
##                fnum1       fnum10        fnum2        fnum3        fnum4
## StdDev: 4.901473e-05 4.901473e-05 4.901473e-05 4.901473e-05 4.901473e-05
##                fnum5        fnum6        fnum7        fnum8        fnum9
## StdDev: 4.901473e-05 4.901473e-05 4.901473e-05 4.901473e-05 4.901473e-05
##         Residual
## StdDev: 0.427264
## 
## Number of Observations: 120
## Number of Groups: 1

data(Assay)  ## from PB p 165
Assay <- as.data.frame(Assay)
fm1Assay <- lme(logDens~sample*dilut,Assay,
         random=pdBlocked(list(pdIdent(~1),pdIdent(~dilut-1),pdIdent(~sample-1))))

## Error in getGroups.data.frame(dataMix, groups): invalid formula for groups

I think the proximal problem is that glmmPQL constructs a model frame from the data, which loses the groupedData class (and the formula attribute, which seems necessary – I’m not sure why). It might be hackable …

`bernor` package

Geyer and Sung’s bernor package is probably the best available (=most accurate) R choice at the moment. However, it needs some hacking to make it run with modern R. I have built bernor v. 0.3-8 (see installation instructions above)

The following code is reproduced from Geyer’s example (this also demonstrates that Geyer’s package, while functional, does not package everything nicely for the user as we have come to expect)

set.seed(101)  ## initialize .Random.seed
objfun <- function(theta) {
   if (!is.numeric(theta))
   stop("objfun: theta not numeric")
   if (length(theta) != nparm)
      stop("objfun: theta wrong length")
   mu <- theta[seq(1, nfix)]
   sigma <- theta[-seq(1, nfix)]
   .Random.seed <<- .save.Random.seed
   bnlogl(y, mu, sigma, nmiss, x, z, i, moo)$value
}
objgrd <- function(theta) {
   if (!is.numeric(theta)) 
     stop("objfun: theta not numeric")
   if (length(theta) != nparm)
     stop("objfun: theta wrong length")
   mu <- theta[seq(1, nfix)]
   sigma <- theta[-seq(1, nfix)]
   .Random.seed <<- .save.Random.seed
  bnlogl(y, mu, sigma, nmiss, x, z, i, moo, deriv = 1)$gradient
}
## FIXME: should memoize!
library(bernor)
data(salam)
attach(salam)  ## would prefer not to do this ...
nparm <- ncol(x) + length(unique(i))
nfix <- ncol(x)
moo <- model("gaussian", length(i), 1)
.save.Random.seed <- .Random.seed
nobs <- ncol(y)
nmiss <- 100
theta.start <- rep(0, nparm)
names(theta.start) <- c(dimnames(x)[[2]], paste("sigma", c("f",
"m"), sep = "_"))
lower <- rep(0, nparm)
lower[1:ncol(x)] <- (-Inf)
trust <- 1
lowert <- pmax(lower, theta.start - trust)
uppert <- theta.start + trust
control <- list(fnscale = -10)
tout <- system.time(out <- optim(theta.start, objfun, objgrd,
method = "L-BFGS-B", lower = lowert, upper = uppert, control = control))
cat("elapsed time", tout[1], "seconds\n")

## elapsed time 0.2 seconds

Retry with larger trust region (slow)

nmiss <- 10000
theta.start <- out$par
lowert <- pmax(lower, theta.start - trust)
uppert <- theta.start + trust
control <- list(fnscale = signif(out$value, 1))

tout <- system.time(out <- optim(theta.start, objfun, objgrd,
  method = "L-BFGS-B", lower = lowert, upper = uppert, control = control))
cat("elapsed time", tout[1], "seconds\n")

## elapsed time 33.59 seconds

Read the linked example by Geyer for more details: they go on to try nmiss=1e7 (number of samples of “missing data” [=latent variables]?, which they say is very slow, but they also say their results match Booth and Hobert (1999) , and an independent MCMC estimate, to 3 significant digits).

hglm

hglm_fit_0 <- hglm2(meanmodel = mating ~ 0 + fpop:mpop  + 
                       (1 | as.numeric(fnum)) + (1 | as.numeric(mnum)), 
                    family = binomial(), data = sdat,
                    conv = 1e-08, maxit = 40)

The example from the hglm vignette:

data(salamander)
hglm.salam <- hglm2(meanmodel = Mate ~ TypeF + TypeM + TypeF*TypeM + (1 | Female) + (1 | Male), family = binomial(), data = salamander,conv = 1e-08, maxit = 40)

Note hglm has a different version of the salamander data set – 360 rather than 120 values … (‘summer’ data set is what we’ve entered above, I think, this is both summer and fall … …) (Even for the data as included with hglm I don’t seem to be getting exactly the same structure as in the hglm vignette … ???)

To do/Other possibilities

get crossed GHQ to work in lme4 … ??? Laplace diagnostics ???
SAS, AS-REML
INLA?
WinBUGS/JAGS/Stan? McElreath’s RStan package?
print package numbers etc.
pictures of estimates, timings, confidence intervals …

References

Booth, James G., and James P. Hobert. 1999. “Maximizing Generalized Linear Mixed Model Likelihoods with an Automated Monte Carlo EM Algorithm.” Journal of the Royal Statistical Society. Series B 61 (1): 265–85. doi:10.1111/1467-9868.00176.

McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. London: Chapman; Hall.

Sung, Yun Ju, and Charles J. Geyer. 2007. “Monte Carlo Likelihood Inference for Missing Data Models.” The Annals of Statistics 35 (3): 990–1011. doi:10.1214/009053606000001389.