Logistic regression, accounting for differences in exposure

Version 2014-09-30 19:33:20

IN PROGRESS

Introduction

This is a general document, originally inspired by one question about nest survival and another about tree survival, about analyzing survival data, possibly in discrete time, possibly with random effects and/or “frailty” (individual-level variation in survival probability). In particular, I’m interested in doing it efficiently (both in terms of computation time and in terms of compact code) in R, or at least from R (e.g. in AD Model Builder or in WinBUGS, or maybe Stan?).

  • more here: difference from standard survival analysis (highly discretized/interval-censored; interested in covariates; maybe willing to use parametric models; maybe interested in density-dependent mortality? typically have one or more cross-sectional observations of a population, rather than a single longitudinal survey of a cohort)

A very common situation in ecology (and elsewhere) is a survival/binary-outcome model where individuals (each measured once) differ in their exposure. The classical approach to this problem is to use a complementary log-log (“cloglog”) link. The cloglog function is \(C(\mu)=\log(-\log(1-\mu))\); its inverse is \(\mu=C^{-1}(\eta) = 1-\exp(-\exp(\eta))\). Thus if we expect mortality \(\mu_0\) over a period \(\Delta t=1\), so that the linear predictor is \(\eta=C(\mu_0)\), then \[ C^{-1}(\eta+\log \Delta t)=1-\exp(-\exp(\eta) \cdot \Delta t). \] Since \(\exp(-\exp(\eta))=1-\mu_0\), this equals \(1-(1-\mu_0)^{\Delta t}\), which is what we want.
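
A quick numerical check of this identity, with arbitrary values \(\mu_0=0.1\) and \(\Delta t=3\):

mu0 <- 0.1; dt <- 3
eta <- log(-log(1 - mu0))      ## eta = C(mu0)
1 - exp(-exp(eta + log(dt)))   ## inverse cloglog of (eta + log dt): 0.271
1 - (1 - mu0)^dt               ## cumulative mortality over dt periods: 0.271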

The function \(\exp(-\exp(x))\) is called the Gompertz function (up to a sign change in its argument, it is also the CDF of the extreme-value (Gumbel) distribution), so fitting a model with this inverse-link function (i.e. fitting a cloglog link to the survival, rather than the mortality, probability) is called a gompit (or extreme-value) regression.

To use this approach in R, add a term of the form offset(log(A)), where A is the exposure, to the model formula (some modeling functions instead take offset as a separate argument).

The bottom line, though, is that if we change the form of the dependence of survival on covariates from logistic to complementary log-log, we can sneak the exposure variable in via an offset.
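
For concreteness, here is a minimal sketch of the cloglog-plus-offset recipe with glm; the data frame and variable names (mydata, dead, x, exposure) are hypothetical:

## 'dead' is a 0/1 mortality indicator, 'x' a covariate,
## 'exposure' the length of each observation interval
m_cloglog <- glm(dead ~ x + offset(log(exposure)),
                 family = binomial(link = "cloglog"), data = mydata)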

Suppose we don’t want to do this, and instead want to stick with the logistic form, but with the inverse-link function \[ \mu = \left( \text{Logistic}(\eta) \right)^e, \] where \(e\) is the exposure (typically, but not necessarily, an integer). How do we do this?
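
In R terms, the inverse link we want is just plogis(eta)^e; written out as a plain function (the name is only for illustration):

## the target inverse-link function
invlink_powlogis <- function(eta, e) plogis(eta)^e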

Data

Some example data, from Lucy Meng:

Package versions:

##    bbmle     lme4     plyr   R2admb reshape2 
##   1.0.17    1.1.8    1.8.1   0.7.10      1.4

Parameters:

  • Exposure = exposure days
  • surv = survival of the nest contents (1 = nest has contents, 0 = nest empty)
  • survive = outcome of the nest (1 = successful, 0 = failure)
  • nestheight = height of nest
  • averageSPL = sound pressure level near the nest
  • distroad = distance from nest to road
  • waterdepth = water depth surrounding the nest
dat <- read.csv("brant_survive.csv")
## orig_names <- scan("brant_survive.csv",what=character(),nlines=1,sep=",")
dat2 <- subset(dat,
    select=c(Nest.ID,Exposure,Date,Surv,
             NestHeight,AverageSPL,DistEdge,
             DistWater,DistRoad))
dat2S <- subset(dat2,Exposure>0)

Total samples: 174 observations, 36 nests.
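
(These totals can be recomputed directly from the data objects defined above:)

nrow(dat2)                    ## observations (before dropping zero-exposure rows)
length(unique(dat2$Nest.ID))  ## nests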

Marginal plots of survival prob vs predictors:

library(reshape2)  ## for melt()
library(ggplot2)
mdat <- melt(dat2,id.vars=1:4)
ggplot(mdat,aes(x=value,y=Surv))+
  geom_point(alpha=0.5,aes(size=Exposure))+geom_smooth(method="loess")+
  facet_wrap(~variable,scales="free")+
  coord_cartesian(ylim=c(-0.05,1.05))+xlab("")+ylab("Survival")

Methods

Method 1: bbmle

The bbmle package offers a formula interface that lets us do general nonlinear MLE fitting in a reasonably convenient way.

library(bbmle)
m_mle2 <- mle2(Surv~dbinom(plogis(mu)^Exposure,size=1),
     parameters=list(mu~NestHeight),
     start=list(mu=2),data=dat2S)
summary(m_mle2)
## Maximum likelihood estimation
## 
## Call:
## mle2(minuslogl = Surv ~ dbinom(plogis(mu)^Exposure, size = 1), 
##     start = list(mu = 2), data = dat2S, parameters = list(mu ~ 
##         NestHeight))
## 
## Coefficients:
##                Estimate Std. Error z value   Pr(z)  
## mu.(Intercept)  1.81505    1.09315  1.6604 0.09684 .
## mu.NestHeight   1.09549    0.96283  1.1378 0.25521  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -2 log L: 136.0476
  • Advantages: simple, flexible.
  • Disadvantages: need to specify starting values; probably slow.

Method 2: glm

Generalized linear modeling code can be hacked to do this problem by providing a custom link function. A SAS macro that does this is available from Terry Shaffer’s web site; equivalent R code, by Mark Herzog, is available here.

This is a version of Mark Herzog’s code, updated to run in current (development) versions of R.

library(MASS)
logexp <- function(exposure = 1)
{
    linkfun <- function(mu) qlogis(mu^(1/exposure))
    ## FIXME: is there some trick we can play here to allow
    ##   evaluation in the context of the 'data' argument?
    linkinv <- function(eta)  plogis(eta)^exposure
    logit_mu_eta <- function(eta) {
        ifelse(abs(eta)>30,.Machine$double.eps,
               exp(eta)/(1+exp(eta))^2)
        ## OR .Call(stats:::C_logit_mu_eta, eta, PACKAGE = "stats")
    }
    mu.eta <- function(eta) {       
        exposure * plogis(eta)^(exposure-1) *
            logit_mu_eta(eta)
    }
    valideta <- function(eta) TRUE
    link <- paste("logexp(", deparse(substitute(exposure)), ")",
                   sep="")
    structure(list(linkfun = linkfun, linkinv = linkinv,
                   mu.eta = mu.eta, valideta = valideta, 
                   name = link),
              class = "link-glm")
}

This is basically a modified version of the logit link function produced by binomial(). The original version used .Call(stats:::C_logit_mu_eta, eta, PACKAGE = "stats"), but I decided it was safer (if perhaps slower) to write a pure-R version of C_logit_mu_eta and include it in the function.
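
As a quick sanity check (my addition, not part of the original code), the analytic derivative mu.eta should agree with a finite-difference derivative of linkinv; the exposure and eta values below are arbitrary:

L3 <- logexp(exposure = 3)
eta0 <- 0.7; h <- 1e-6
L3$mu.eta(eta0)                                       ## analytic derivative
(L3$linkinv(eta0 + h) - L3$linkinv(eta0 - h))/(2*h)   ## finite-difference check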

Now use it:

m_glm <- glm(Surv~NestHeight,
         family=binomial(link=logexp(dat2S$Exposure)),
         data=dat2S,start=c(1,0))
summary(m_glm)
## 
## Call:
## glm(formula = Surv ~ NestHeight, family = binomial(link = logexp(dat2S$Exposure)), 
##     data = dat2S, start = c(1, 0))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6417   0.3256   0.4605   0.5852   0.8539  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.8150     1.1141   1.629    0.103
## NestHeight    1.0955     0.9791   1.119    0.263
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 122.65  on 167  degrees of freedom
## Residual deviance: 136.05  on 166  degrees of freedom
## AIC: 140.05
## 
## Number of Fisher Scoring iterations: 5
  • Advantages: probably faster, no need for starting values, more compact syntax, maybe extendable to GLMMs.
  • Disadvantages: less flexible. Because of the way glm is set up, the exposure variable can’t be drawn from inside the data argument, which makes it harder to use methods such as predict on the results of the fit (a workaround is sketched below).
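
One partial workaround (a sketch, not part of Herzog’s code) is to predict on the link scale, which does not involve the stored exposure vector, and then apply the desired exposure by hand; the new nest heights and the 30-day exposure below are arbitrary illustration values:

nd <- data.frame(NestHeight = c(0.5, 1.0, 1.5))
eta <- predict(m_glm, newdata = nd, type = "link")  ## linear predictor only
daily_surv <- plogis(eta)    ## survival for a single day of exposure
surv_30d <- plogis(eta)^30   ## survival over a hypothetical 30-day exposure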

AD Model Builder

It’s not too hard to get the equivalent for AD Model Builder.

  • Due to lack of familiarity, it took me several tries; the main stumbling blocks were
      – computing the likelihood instead of the negative log-likelihood (!);
      – the lack of a vectorized pow function (undefined reference to pow(dvar_vector const&, dvector const&)).
  • The binomial log-likelihood should be substituted for greater compactness.
  • I used R2admb’s features to provide only a bare-bones TPL file – the data and most of the parameter sections are auto-generated. This should be tweaked a bit so we only compile once.
library("R2admb")
setup_admb()
## [1] "/usr/local/admb"
tplString <- "
PARAMETER_SECTION
   vector eta(1,nobs)
   vector prob(1,nobs)

PROCEDURE_SECTION
   dvariable fpen=0;
   eta = X*beta;
   prob = pow(1/(1+exp(-eta)),Exposure);
   // constrain probability to be positive (maybe unnecessary?)
   prob = posfun(prob,1e-8,fpen);
   f = -sum(log(elem_prod(Surv,prob)+elem_prod(1-Surv,1-prob)));
"

If we didn’t want to, or couldn’t, vectorize the pow function, we would have to write out the following (fairly simple) for loop instead; in ADMB this would have no performance cost, it would just be slightly more code to read.

 for (int i=1; i<=nobs; i++)
   prob[i] = pow(1/(1+exp(-eta[i])),Exposure[i]);

In order to use the vectorized power function above, I have to define the following functions in the TPL file; they’re scary-looking, but you can ignore them. (Presumably these vectorized power functions will make their way into ADMB at some future date; in fact, they may already be in the most recent version, but I haven’t checked.)

## n.b. CANNOT put the #include statement on a line by itself,
## must use \n !
extraStr <- "
GLOBALS_SECTION\n #include <df1b2fun.h>

 dvar_vector pow(const dvar_vector& v1, const dvector& v2)// ***
 {
   RETURN_ARRAYS_INCREMENT();
   dvar_vector tmp(v1.indexmin(),v1.indexmax());
   for (int i=v1.indexmin();i<=v1.indexmax();i++)
   {
     tmp.elem(i)=pow(v1.elem(i),v2.elem(i));             // ***
   }
   RETURN_ARRAYS_DECREMENT();
   return(tmp);
 }
 df1b2vector pow(df1b2vector const& _x,dvector const& v)
 {
  ADUNCONST(df1b2vector,x);
  int mmin=v.indexmin();
  int mmax=v.indexmax();
  df1b2vector tmp;
  tmp.noallocate(mmin,mmax);
  for (int i=mmin;i<=mmax;i++) tmp(i)=pow(x(i),v(i));
  return tmp;
 }
"
writeLines(c(tplString,extraStr),"brant.tpl")
X <- model.matrix(~NestHeight,data=dat2S)
admbData <- with(dat2S,list(X=X,
                          Exposure=Exposure,Surv=Surv,nobs=nrow(dat2S)))

Test: compute the negative log-likelihood at the starting parameters and compare it with the R result:

d1 <- do_admb("brant",data=admbData,
        params=list(beta=c(1,1)),
        run.opts=run.control(checkparam="write",checkdata="write"),
        extra.args="-maxfn 0")
(nll_admb <- -logLik(d1))
## 'log Lik.' 78.8121 (df=2)
m_mle2@minuslogl(1,1)
## [1] 78.81212
(nll_R <- with(dat2S,
      -sum(dbinom(Surv,prob=plogis(1+NestHeight)^Exposure,size=1,
                  log=TRUE))))
## [1] 78.81212

Now run the model:

mfit_admb <- do_admb("brant",data=admbData,
        params=list(beta=c(0,0)),
        run.opts=run.control(checkparam="write",
                            checkdata="write"))
summary(mfit_admb)
## Model file: brant_gen 
## Negative log-likelihood:  68.0    AIC:  140.0 
## Coefficients:
##        Estimate Std. Error z value Pr(>|z|)  
## beta.1   1.8150     1.0931   1.660   0.0968 .
## beta.2   1.0955     0.9628   1.138   0.2552  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Advantages: probably fastest, flexible, maybe extendable to GLMMs.
  • Disadvantages: most complex coding; have to install AD Model Builder.

Random effects

In principle, we should be able to implement this model with random effects either in lme4 (recent versions can handle families with user-defined link functions) or in ADMB. (Doing it in bbmle is essentially impossible.)

Random effect of individuals:

lme4

library(lme4)
packageVersion("lme4")
## [1] '1.1.8'
mfit_glmer <- glmer(Surv~NestHeight+(1|Nest.ID),
         family=binomial(link=logexp(dat2S$Exposure)),
         data=dat2S,start=list(theta=1,fixef=c(1,0)))
summary(mfit_glmer)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logexp(dat2S$Exposure) )
## Formula: Surv ~ NestHeight + (1 | Nest.ID)
##    Data: dat2S
## 
##      AIC      BIC   logLik deviance df.resid 
##    138.5    147.9    -66.2    132.5      165 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.0725  0.1819  0.2670  0.3420  0.7222 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  Nest.ID (Intercept) 1.229    1.108   
## Number of obs: 168, groups:  Nest.ID, 30
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    1.541      1.539   1.001    0.317
## NestHeight     1.436      1.341   1.071    0.284
## 
## Correlation of Fixed Effects:
##            (Intr)
## NestHeight -0.974

ADMB

(To do)

Survival model

If we have multiple observations per individual (i.e. each nest is checked repeatedly over the season), we should really not be treating each observation separately. Adding a random effect of individual might help a little, but is papering over the cracks; instead, we ought to model the time to failure as a (possibly censored) geometric or exponential distribution.

Check the geometric/exponential equivalence (a per-interval event probability \(p\) corresponds to an exponential rate of \(-\log(1-p)\), here \(-\log(0.5)\)):

dgeom(2,0.5)
## [1] 0.125
diff(pexp(2:3,-log(0.5)))
## [1] 0.125
library("plyr")
dat_agg <- ddply(dat2S,"Nest.ID",
          summarize,Exposure=sum(Exposure),
                    Surv=tail(Surv,1),
                   NestHeight=NestHeight[1],AverageSPL=AverageSPL[1],
                 DistEdge=DistEdge[1],
             DistWater=DistWater[1],DistRoad=DistRoad[1])
dcgeom <- function(x,prob,cens,log=FALSE) {
  ## censored observations (nest still surviving at the last check) contribute
  ## the upper tail probability rather than the density
  ifelse(!cens,dgeom(x,prob,log=log),
         pgeom(x,prob,lower.tail=FALSE,log.p=log))
}
dcexp <- function(x,rate,cens,log=FALSE) {
  ifelse(!cens,dexp(x,rate,log=log),
         pexp(x,rate,lower.tail=FALSE,log.p=log))
}
with(dat_agg,-sum(dcgeom(Exposure,prob=0.5,cens=Surv,log=TRUE)))
mle2(Exposure~dcgeom(prob=plogis(mu),cens=Surv),
     parameters=list(mu~NestHeight),
     start=list(mu=0),data=dat_agg)

fixme: use exponential instead of geometric distribution
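
A minimal sketch of that exponential version, using the dcexp function defined above (the log-hazard parameterization and the starting value are my choices, not part of the original analysis):

## continuous-time version: exponential time-to-failure with right-censoring
## for nests that still had contents at the last check (Surv==1)
m_exp <- mle2(Exposure~dcexp(rate=exp(lograte),cens=Surv),
     parameters=list(lograte~NestHeight),
     start=list(lograte=-3),data=dat_agg)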

To do

  • benchmark?
  • WinBUGS/Stan solutions?
  • fix formatting etc.
  • mention evaluation problem with exposure / fragility for predictions etc.
  • survival model examples: interval censoring, etc. Time-dependent covariates? Time-dependent hazard?
  • discuss individual-level frailty: identifiability issues? Are they overcome by exposure-dependence?
  • comparison of cloglog/Gompertz approach?
  • check and describe bird literature: Shaffer, Nur, Heisey, White, etc.