This is a general document, originally inspired by one question about nest survival and another about tree survival, about analyzing survival data, possibly in discrete time, possibly with random effects and/or “frailty” (individual-level variation in survival probability). In particular, I’m interested in doing it efficiently (both in terms of computation time and in terms of compact code) in R, or at least from R (e.g. in AD Model Builder or in WinBUGS or TMB or Stan …)
A very common situation in ecology (and elsewhere) is a survival (binary outcome) model where individuals (each measured once) differ in their exposure, typically because they have been observed for different lengths of time. The classic statistical approach to this problem is to use a complementary log-log (“cloglog”) link. The cloglog
function is $C(\mu)=\log(-\log(1-\mu))$; its inverse is $\mu=C^{-1}(\eta)=1-\exp(-\exp(\eta))$. Thus if we expect mortality $\mu_0$ over a period $\Delta t=1$, i.e. the linear predictor is $\eta=C(\mu_0)$, then $C^{-1}(\eta+\log \Delta t)=1-\exp(-\exp(\eta)\cdot \Delta t)$. Some algebra shows that this is equal to $1-(1-\mu_0)^{\Delta t}$, which is what we want.
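A quick numerical check of this identity (with arbitrary values of $\mu_0$ and $\Delta t$, not taken from any data set):
mu0 <- 0.2                       ## arbitrary per-unit-time mortality
dt <- c(0.5,1,3)                 ## arbitrary exposure periods
eta <- log(-log(1-mu0))          ## eta = C(mu0)
cbind(cloglog=1-exp(-exp(eta)*dt), direct=1-(1-mu0)^dt)  ## columns should match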
The function exp(−exp(x)) is called the Gompertz function (it is also the CDF of the extreme value distribution), so fitting a model with this inverse-link function (i.e. fitting a cloglog
link to the survival, rather than the mortality, probability) is called a gompit (or extreme value) regression.
The bottom line is that if we change the form of the dependence of survival on covariates from logistic to complementary log-log, we can sneak the exposure variable in via an offset.
To use this approach in R, use family=binomial(link="cloglog")
and add a term of the form offset(log(A))
to the formula (some modeling functions take offset
as a separate argument).
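For concreteness, here is a minimal sketch (not part of the original analysis), using a small simulated data set with hypothetical names: a binary mortality indicator dead, a covariate x, and an exposure time A:
set.seed(101)
dd <- data.frame(A=rpois(100,5)+1, x=rnorm(100))
dd$dead <- rbinom(100, size=1, prob=1-0.9^dd$A)   ## roughly 10% mortality per day
m_cll <- glm(dead~x+offset(log(A)),
             family=binomial(link="cloglog"), data=dd)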
Suppose we don’t want to do this, and instead want to stick with the logistic form, but want to have the link function be $\mu=(\text{logistic}(\eta))^e$, where $e$ is the exposure (typically, but not necessarily, an integer). I will call this the power-logistic link approach; it is popular in analyses of nest survival. The code below explores both the complementary log-log and power-logistic approaches. In my opinion the complementary log-log approach seems much more statistically coherent and computationally convenient than the power-logistic (it can be used in any GLM-based framework that provides a cloglog link option for binomial models). I get e-mail about this document a few times a year, implying that people are still using the power-logistic approach. I would be interested in hearing arguments or explanations from biostatisticians or ecologists about why the power-logistic might be preferred: is it just historical, does it fit the data better, or does it have advantages I haven’t thought of?
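One way to see the relationship between the two approaches: both raise a per-unit-time survival probability to the power of the exposure, but the per-unit-time probability is logistic(η) in one case and exp(−exp(η)) (the inverse cloglog applied on the survival scale) in the other. A small illustration with arbitrary values (the numbers differ because η sits on a different scale in each case):
eta <- 0.5; e <- 4        ## arbitrary linear predictor and exposure
plogis(eta)^e             ## power-logistic: per-day survival plogis(eta), raised to e
exp(-exp(eta))^e          ## cloglog-with-offset: per-day survival exp(-exp(eta)), raised to e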
Some example data on red-winged blackbird nest survivorship in Brant County, Ontario from Lucy Meng (part of her undergraduate thesis with Dr. Pat Chow-Fraser of McMaster University):
Package versions:
##        bbmle        broom  broom.mixed   dotwhisker      effects      emmeans
##  1.0.21.9004        0.5.2        0.2.4   0.5.0.9000        4.1.2        1.4.1
##      ggplot2         lme4         plyr       R2admb     reshape2       sjPlot
##        3.2.1       1.1.21        1.8.4       0.7.16        1.4.3        2.7.1
##       visreg
##        2.5.1
Parameters:
Exposure = exposure days
surv = survival of the nest contents (1=nest has contents, 0=nest empty)
survive = outcome of the nest (1=successful, 0=failure)
nestheight = height of nest
averageSPL = sound pressure level near the nest
distroad = distance from nest to road
waterdepth = water depth surrounding the nest
dat <- read.csv("brant_survive.csv")
## orig_names <- scan("brant_survive.csv",what=character(),nlines=1,sep=",")
dat2 <- subset(dat,
               select=c(Nest.ID,Exposure,Date,Surv,
                        NestHeight,AverageSPL,DistEdge,
                        DistWater,DistRoad))
dat2S <- subset(dat2,Exposure>0)
Total samples: 174 observations, 36 nests.
Marginal plots of survival prob vs predictors:
mdat <- melt(dat2,id.var=1:4)
ggplot(mdat,aes(x=value,y=Surv))+
    geom_point(alpha=0.5,aes(size=Exposure))+geom_smooth(method="loess")+
    facet_wrap(~variable,scale="free_x")+
    coord_cartesian(ylim=c(-0.05,1.05))+xlab("")+ylab("Survival")
In what follows I’m only going to model the effect of nest height on survivorship, but the example should generalize to any number of continuous or categorical predictors …
bbmle
The bbmle
package offers a formula interface that lets us do general nonlinear MLE fitting in a reasonably convenient way.
library(bbmle)
m_mle2 <- mle2(Surv~dbinom(plogis(mu)^Exposure,size=1),
               parameters=list(mu~NestHeight),
               start=list(mu=2),data=dat2S)
summary(m_mle2)
## Maximum likelihood estimation
##
## Call:
## mle2(minuslogl = Surv ~ dbinom(plogis(mu)^Exposure, size = 1),
## start = list(mu = 2), data = dat2S, parameters = list(mu ~
## NestHeight))
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## mu.(Intercept) 1.81505 1.09315 1.6604 0.09684 .
## mu.NestHeight 1.09549 0.96283 1.1378 0.25521
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 136.0476
glm
Generalized linear modeling code can be hacked to fit the power-logistic link, by providing a custom link function. A SAS macro to do this was available from Terry Shaffer’s web site at http://www.npwrc.usgs.gov/resource/birds/nestsurv/index.htm; equivalent R code, by Mark Herzog, was available from http://www.npwrc.usgs.gov/resource/birds/nestsurv/download/CreateLogisticExposureFamily.R; I can’t find either of these resources as of September 2019.
This is an updated version of Mark Herzog’s code (it should run in “recent” R versions, e.g. at least >= 3.5.0 [??]).
library(MASS)
logexp <- function(exposure = 1) {
    ## hack to help with visualization, post-prediction etc etc
    get_exposure <- function() {
        if (exists("..exposure", env=.GlobalEnv))
            return(get("..exposure", envir=.GlobalEnv))
        exposure
    }
    linkfun <- function(mu) qlogis(mu^(1/get_exposure()))
    ## FIXME: is there some trick we can play here to allow
    ## evaluation in the context of the 'data' argument?
    linkinv <- function(eta) plogis(eta)^get_exposure()
    logit_mu_eta <- function(eta) {
        ifelse(abs(eta)>30, .Machine$double.eps,
               exp(eta)/(1+exp(eta))^2)
    }
    mu.eta <- function(eta) {
        get_exposure() * plogis(eta)^(get_exposure()-1) *
            logit_mu_eta(eta)
    }
    valideta <- function(eta) TRUE
    link <- paste("logexp(", deparse(substitute(exposure)), ")",
                  sep="")
    structure(list(linkfun = linkfun, linkinv = linkinv,
                   mu.eta = mu.eta, valideta = valideta,
                   name = link),
              class = "link-glm")
}
This is basically a modified version of the logit link function produced by binomial()
. The original version used .Call(stats:::C_logit_mu_eta, eta, PACKAGE = "stats")
, but I decided it was safer (if perhaps slower) to write a pure-R version of C_logit_mu_eta
and include it in the function.
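As a quick sanity check (not in the original code), the closed-form expression used in logit_mu_eta agrees with the mu.eta function of the standard logit link:
eta <- seq(-5,5,by=1)
all.equal(exp(eta)/(1+exp(eta))^2, binomial()$mu.eta(eta))  ## should be TRUE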
Now use it:
m_glm <- glm(Surv~NestHeight,
             family=binomial(link=logexp(dat2S$Exposure)),
             data=dat2S,start=c(1,0))
summary(m_glm)
##
## Call:
## glm(formula = Surv ~ NestHeight, family = binomial(link = logexp(dat2S$Exposure)),
## data = dat2S, start = c(1, 0))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6417 0.3256 0.4605 0.5852 0.8539
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.8150 1.1141 1.629 0.103
## NestHeight 1.0955 0.9791 1.119 0.263
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 122.65 on 167 degrees of freedom
## Residual deviance: 136.05 on 166 degrees of freedom
## AIC: 140.05
##
## Number of Fisher Scoring iterations: 5
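As a check, the point estimates should agree (up to optimizer tolerance) with the bbmle fit above:
cbind(glm=coef(m_glm), mle2=unname(coef(m_mle2)))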
Because of the way glm is set up, the exposure variable can’t be drawn from inside the data argument, which makes it harder to use methods such as predict on the results of the fit. Setting ..exposure globally provides a hack around this (see below).
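For example, a minimal sketch of the hack (the plotting examples further below use the same trick):
..exposure <- mean(dat2S$Exposure)        ## pick a 'typical' exposure
head(predict(m_glm, type="response"))     ## linkinv now uses ..exposure
rm(..exposure)                            ## clean up!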
It’s not too hard to get the equivalent for AD Model Builder. A couple of notes:
- why do I need to define my own pow function? (undefined reference to pow(dvar_vector const&, dvector const&))
- I use R2admb’s features to give only a bare-bones TPL file – data and most of the parameter sections are auto-generated. Should tweak this a bit so we only compile once.
library("R2admb")
tt <- try(setup_admb(),silent=TRUE)
admb_ok <- !inherits(tt,"try-error")
tplString <- "
PARAMETER_SECTION
  vector eta(1,nobs)
  vector prob(1,nobs)
PROCEDURE_SECTION
  dvariable fpen=0;
  eta = X*beta;
  prob = pow(1/(1+exp(-eta)),Exposure);
  // constrain probability to be positive (maybe unnecessary?)
  prob = posfun(prob,1e-8,fpen);
  f = -sum(log(elem_prod(Surv,prob)+elem_prod(1-Surv,1-prob)));
"
If we didn’t want to or couldn’t vectorize the pow
function, we would have to write out the following (fairly simple) for
loop instead – in ADMB it would have no performance cost, it would just be slightly more code to read …
for (int i=1; i<=nobs; i++)
prob[i] = pow(1/(1+exp(-eta[i])),Exposure[i]);
In order to use the vectorized power function above, I have to define these functions in the TPL file; they’re scary looking, but you can ignore them (presumably at some future date these vectorized power functions will make their way into ADMB – in fact, they may already be in the most recent version, I haven’t checked …)
## n.b. CANNOT put the #include statement on a line by itself,
## must use \n !
extraStr <- "
GLOBALS_SECTION\n #include <df1b2fun.h>
  dvar_vector pow(const dvar_vector& v1, const dvector& v2)// ***
  {
    RETURN_ARRAYS_INCREMENT();
    dvar_vector tmp(v1.indexmin(),v1.indexmax());
    for (int i=v1.indexmin();i<=v1.indexmax();i++)
    {
      tmp.elem(i)=pow(v1.elem(i),v2.elem(i)); // ***
    }
    RETURN_ARRAYS_DECREMENT();
    return(tmp);
  }
  df1b2vector pow(df1b2vector const& _x,dvector const& v)
  {
    ADUNCONST(df1b2vector,x);
    int mmin=v.indexmin();
    int mmax=v.indexmax();
    df1b2vector tmp;
    tmp.noallocate(mmin,mmax);
    for (int i=mmin;i<=mmax;i++) tmp(i)=pow(x(i),v(i));
    return tmp;
  }
"
writeLines(c(tplString,extraStr),"brant.tpl")
X <- model.matrix(~NestHeight,data=dat2S)
admbData <- with(dat2S,list(X=X,
                            Exposure=Exposure,Surv=Surv,nobs=nrow(dat2S)))
Test: compute NLL for starting parameters, compare with R solution
if (admb_ok) {
    d1 <- do_admb("brant",data=admbData,
                  params=list(beta=c(1,1)),
                  run.opts=run.control(checkparam="write",checkdata="write"),
                  extra.args="-maxfn 0")
    (nll_admb <- -logLik(d1))
    m_mle2@minuslogl(1,1)
    (nll_R <- with(dat2S,
                   -sum(dbinom(Surv,prob=plogis(1+NestHeight)^Exposure,size=1,
                               log=TRUE))))
}
Now run the model:
if (admb_ok) {
    m_admb <- do_admb("brant",data=admbData,
                      params=list(beta=c(0,0)),
                      run.opts=run.control(checkparam="write",
                                           checkdata="write"))
    summary(m_admb)
}
In principle, we should be able to implement this model with random effects in either lme4
(recent versions handle user-defined families) or in ADMB. (Doing it in bbmle
is essentially impossible.)
Random effect of individuals:
library(lme4)
m_glmer <- glmer(Surv~NestHeight+(1|Nest.ID),
                 family=binomial(link=logexp(dat2S$Exposure)),
                 data=dat2S,start=list(theta=1,fixef=c(1,0)))
summary(m_glmer)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logexp(dat2S$Exposure) )
## Formula: Surv ~ NestHeight + (1 | Nest.ID)
## Data: dat2S
##
## AIC BIC logLik deviance df.resid
## 138.5 147.9 -66.2 132.5 165
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.0725 0.1819 0.2670 0.3420 0.7222
##
## Random effects:
## Groups Name Variance Std.Dev.
## Nest.ID (Intercept) 1.229 1.108
## Number of obs: 168, groups: Nest.ID, 30
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.541 1.539 1.001 0.317
## NestHeight 1.436 1.341 1.071 0.284
##
## Correlation of Fixed Effects:
## (Intr)
## NestHeight -0.974
This “works” (but I don’t know if it’s correct):
If we use cloglog we need to make sure the offset is handled properly by whatever prediction/plotting machinery we are using downstream; if we use link=logexp and are plotting on the response (probability) scale, we need to set a ..exposure variable in the global environment to allow the hack we defined above to work. Make sure to remove ..exposure afterwards to avoid confusion …
..exposure <- mean(dat2S$Exposure)
visreg(m_glmer, "NestHeight", scale="response")
rm(..exposure) ## clean up!
..exposure <- mean(dat2S$Exposure)
sjPlot::plot_model(m_glmer,type="pred")
## $NestHeight
rm(..exposure) ## clean up!
plot(emmeans(m_glmer, ~NestHeight, at=list(NestHeight=1:10))) ## "works"
..exposure <- mean(dat2S$Exposure)
plot(allEffects(m_glmer, type="response"))
rm(..exposure)
(To do)
If we have multiple observations per individual (i.e. each observed individual is measured more than once), we should really not be treating each observation separately. Adding a random effect of individual might help a little but is papering over the cracks; instead, we ought to model the time to failure (exposure time) as a (possibly censored) geometric or exponential distribution.
Check geometric/exponential equivalence (a geometric distribution with per-step event probability p corresponds to an exponential distribution with rate −log(1−p), binned into unit time intervals):
dgeom(2,0.5)
## [1] 0.125
diff(pexp(2:3,-log(0.5)))
## [1] 0.125
library("plyr")
dat_agg <- ddply(dat2S,"Nest.ID",
                 summarize,Exposure=sum(Exposure),
                 Surv=tail(Surv,1),
                 NestHeight=NestHeight[1],AverageSPL=AverageSPL[1],
                 DistEdge=DistEdge[1],
                 DistWater=DistWater[1],DistRoad=DistRoad[1])
dcgeom <- function(x,prob,cens,log=FALSE) {
    ifelse(!cens,dgeom(x,prob,log=log),
           pgeom(x,prob,lower.tail=FALSE,log.p=log))
}
dcexp <- function(x,rate,cens,log=FALSE) {
    ifelse(!cens,dexp(x,rate,log=log),
           pexp(x,rate,lower.tail=FALSE,log.p=log))
}
with(dat_agg,-sum(dcgeom(Exposure,prob=0.5,cens=Surv,log=TRUE)))
mle2(Exposure~dcgeom(prob=plogis(mu),cens=Surv),
     parameters=list(mu~NestHeight),
     start=list(mu=0),data=dat_agg)
fixme: use exponential instead of geometric distribution
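A possible sketch of the exponential version (not run here; the rate is written as exp(mu) so that it stays positive):
mle2(Exposure~dcexp(rate=exp(mu),cens=Surv),
     parameters=list(mu~NestHeight),
     start=list(mu=0),data=dat_agg)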