Small tutorial on mixed effect models for count data

Short introduction on mixed effect models for count data, including models from the glmmTMB and gamlss libraries from R.

glmmTMB

Is an R package built on the Template Model Builder automatic differentiation engine, for fitting generalized linear mixed models and extensions.

General framework for fitting regression type models where the distribution of the response variable does not have to belong to the exponential family. It allows all the parameters of the distribution of the response variable to be modeled as linear/non-linear or smooth functions of the explanatory variables.

library(dplyr)
library(gamlss)
library(glmmTMB)
library(DHARMa)
library(ggplot2)
library(plotrix)

simulate data: Repeated-counts of birds in transects of different area, in two sites along 4 years (n=200).

dat <- data.frame(y =c(rZINBI(100, mu = 10, sigma = .6, nu=0.1),rZINBI(100, mu = 5, sigma = .3, nu=.5)),sites =c(rep("a", 100), rep("b", 100)),year = rep(1:4, each = 10, times = 5),trans = rep(1:40, each = 5, times = 1), area=rNO(200,20))
head(dat)

##    y sites year trans     area
## 1  6     a    1     1 21.03874
## 2  3     a    1     1 20.84432
## 3 29     a    1     1 21.74034
## 4  1     a    1     1 19.00565
## 5 18     a    1     1 19.78353
## 6  2     a    1     2 19.40191

#see the data structure 
sizetree(dat[2:4])

#plot the data 
p <- ggplot(dat, aes(sites,y))
p + geom_jitter(aes(colour = year, size=area), width = 0.2, alpha=0.4)+stat_summary(fun.data = "mean_cl_boot",colour="purple")+
  ylab("abundance")+xlab("sites")

1. Modeling counts with the library glmmTMB

Two distributions available the nbinom1 and the nbinom2, in which the variance increases linearly and quadratically with the mean (respectively).

nbinom1: sigma = 𝜇(1 + a), nbinom2: sigma = 𝜇(1 + 𝜇/b)

a, b = dispersion parameter

A description of the dispersion parameter for each distribution can be accessed by typing ?sigma.glmmTMB in R.

Model 1: Negative binomial model

Fixed effect= site (dummy variable), random effect= year, zero-inflation= not included (zi). So we are interested - modeling the differences in the mean of the abundance between both sites, while adjusting for variations-correlation between transects (variance parameter).

In glmmTMB random effects are assumed to be Gaussian

m1 <- glmmTMB(y ~ sites + (1|trans),
               zi=~0,
               family=nbinom1, data=dat)
summary(m1)

##  Family: nbinom1  ( log )
## Formula:          y ~ sites + (1 | trans)
## Data: dat
## 
##      AIC      BIC   logLik deviance df.resid 
##   1047.6   1060.8   -519.8   1039.6      196 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  trans  (Intercept) 0.06409  0.2532  
## Number of obs: 200, groups:  trans, 40
## 
## Dispersion parameter for nbinom1 family (): 7.74 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.2494     0.1099  20.467  < 2e-16 ***
## sitesb       -1.4311     0.1889  -7.577 3.55e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary of the model showed the std and the variance of the trans. The rest of the information is provided in a similar way to linear models and GLM in R.

# for model evaluation 

m1_resi <- simulateResiduals(m1)

plot(m1_resi)

Model 2: Offset

Same as model 1 but includes an offset to account varying levels of “effort” (area). Under a typical count model (Poisson or negative binomial):

log(𝜇) = 𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑝𝑥𝑝

An offset variable (𝐴) is a variable whose 𝛽 coefficient is constrained to 1, allowing to model a rate (density) instead of the count:

log(𝜇) = 1 ∙ log(𝐴) + 𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑝𝑥𝑝

which is equivalent to,

log(𝜇/𝐴) = 𝛽0 + 𝛽1𝑥1 + ⋯ + 𝛽𝑝𝑥𝑝

We need to specify a log function in the offset because the link of the nbinom1 is log

m2 <- glmmTMB(y ~ sites + (1|trans)+offset(log(area)),
              zi=~0,
              family=nbinom1, data=dat)
summary(m2)

##  Family: nbinom1  ( log )
## Formula:          y ~ sites + (1 | trans) + offset(log(area))
## Data: dat
## 
##      AIC      BIC   logLik deviance df.resid 
##   1048.0   1061.2   -520.0   1040.0      196 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  trans  (Intercept) 0.06654  0.258   
## Number of obs: 200, groups:  trans, 40
## 
## Dispersion parameter for nbinom1 family (): 7.74 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7549     0.1105  -6.834 8.24e-12 ***
## sitesb       -1.4298     0.1896  -7.542 4.64e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 3: Nested random effects

fixed effect =site, random effects= transect nested within year, zero-inflation= not included (zi). In this model we account for variations between years and for transects within years (see summary= trans:year).

m3 <- glmmTMB(y ~ sites + (1|year/trans)+offset(log(area)),
                             zi=~0,
                             family=nbinom1, data=dat)
summary(m3)

##  Family: nbinom1  ( log )
## Formula:          y ~ sites + (1 | year/trans) + offset(log(area))
## Data: dat
## 
##      AIC      BIC   logLik deviance df.resid 
##   1050.0   1066.5   -520.0   1040.0      195 
## 
## Random effects:
## 
## Conditional model:
##  Groups     Name        Variance  Std.Dev. 
##  trans:year (Intercept) 6.654e-02 2.580e-01
##  year       (Intercept) 9.952e-10 3.155e-05
## Number of obs: 200, groups:  trans:year, 40; year, 4
## 
## Dispersion parameter for nbinom1 family (): 7.74 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7549     0.1105  -6.834 8.24e-12 ***
## sitesb       -1.4298     0.1896  -7.542 4.64e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 4: zero inflation

We include include a parameter for the zero inflation.

m4 <- glmmTMB(y ~ sites + (1|year/trans)+offset(log(area)),
              zi=~1,
              family=nbinom1, data=dat)

Model 5: zero inflation ~ predictor

We include the variable sites as predictor for the zero inflation and for the dispersion parameters.

m5 <- glmmTMB(y ~ sites + (1|trans)+offset(log(area)),
              zi=~sites,
              disp=~sites,
              family=nbinom1, data=dat)
summary(m5)

##  Family: nbinom1  ( log )
## Formula:          y ~ sites + (1 | trans) + offset(log(area))
## Zero inflation:     ~sites
## Dispersion:         ~sites
## Data: dat
## 
##      AIC      BIC   logLik deviance df.resid 
##   1023.8   1046.9   -504.9   1009.8      193 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  trans  (Intercept) 0.1478   0.3844  
## Number of obs: 200, groups:  trans, 40
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7338     0.1305  -5.623 1.87e-08 ***
## sitesb       -0.6728     0.1727  -3.897 9.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Zero-inflation model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.1117     0.9022  -3.449 0.000562 ***
## sitesb        3.3197     0.9250   3.589 0.000332 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Dispersion model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.8668     0.2241   8.331   <2e-16 ***
## sitesb       -6.6273    45.0699  -0.147    0.883    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Site b has lower average and dispersion of birds/area and higher zero-inflation (i.e., lower frequency of occurrence), while adjusting for varitions between transects.

For more tutorials

Tutoria by Ben Bolker glmmTMB paper glmmTMB models evaluation glmmTMB simulate Covariance structures with glmmTMB e.g., spatial spatial covariance

1. Modeling with the library gamlss

#a lot of families beyond exponential distributions
?gamlss.family

## starting httpd help server ... done

Model 1 gamlss

Two negative binomial distributions available in gamlss are the NBI and the NBII, in which the variance increases quadratically and linearly (respectively) with the mean.

Similar notation, but for random effect we include: re(random=~1|“factor”).

m1_gamlss <- gamlss(y ~ sites + re(random=~1|year),family=NBII, data=dat, trace = FALSE)

summary(m1_gamlss)

## ******************************************************************
## Family:  c("NBII", "Negative Binomial type II") 
## 
## Call:  gamlss(formula = y ~ sites + re(random = ~1 | year),  
##     family = NBII, data = dat, trace = FALSE) 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  log
## Mu Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.27486    0.09425  24.135  < 2e-16 ***
## sitesb      -1.40231    0.17160  -8.172 3.55e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## Sigma link function:  log
## Sigma Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.126      0.161   13.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## NOTE: Additive smoothing terms exist in the formulas: 
##  i) Std. Error for smoothers are for the linear effect only. 
## ii) Std. Error for the linear terms maybe are not accurate. 
## ------------------------------------------------------------------
## No. of observations in the fit:  200 
## Degrees of Freedom for the fit:  2
##       Residual Deg. of Freedom:  198 
##                       at cycle:  3 
##  
## Global Deviance:     1041.6 
##             AIC:     1045.6 
##             SBC:     1052.197 
## ******************************************************************

The summary of a gamlss model includes the coefficients for the mean (mu) and for the dispersion parameter (sigma) (just an intercept for sigma in this model). To extract the random component of a gamlss model we need the mu.coefSmo function.

m1_gamlss$mu.coefSmo

## [[1]]
## Linear mixed-effects model fit by maximum likelihood
##   Data: Data 
##   Log-likelihood: -366.6986
##   Fixed: fix.formula 
##  (Intercept) 
## 5.993402e-10 
## 
## Random effects:
##  Formula: ~1 | year
##          (Intercept)  Residual
## StdDev: 3.305458e-05 0.9999994
## 
## Variance function:
##  Structure: fixed weights
##  Formula: ~W.var 
## Number of Observations: 200
## Number of Groups: 4

Plot Residual Diagnostics for an GAMLSS Object

plot(m1_gamlss)

## ******************************************************************
##   Summary of the Randomised Quantile Residuals
##                            mean   =  -0.01471342 
##                        variance   =  1.104109 
##                coef. of skewness  =  -0.4097675 
##                coef. of kurtosis  =  2.691559 
## Filliben correlation coefficient  =  0.9897367 
## ******************************************************************

Model 2 gamlss: Offset

m2_gamlss <- gamlss(y ~ sites + re(random=~1|year)+offset(log(area)),family=NBII, data=dat,trace = FALSE)
#summary(m2_gamlss)

Model 3 gamlss: Nested random effects

m3_gamlss <- gamlss(y ~ sites + re(random=~1|year/trans)+offset(log(area)),family=NBII, data=dat, trace = FALSE)
#summary(m3_gamlss)

Model 4 gamlss: we can incorporate predictor for the dispersion parameters

m4_gamlss <- gamlss(y ~ sites + re(random=~1|year/trans)+offset(log(area)),
                    sigma.formula = ~sites,
                    family=NBII, data=dat, trace = FALSE)
#summary(m4_gamlss)

Model 4,5 gamlss: for zero inflation in gamlss we need to include the ZINBI family (3 parameters, mu, sigma and nu)

ZINBI()

## 
## GAMLSS Family: ZINBI Zero inflated negative binomial type I 
## Link function for mu   : log 
## Link function for sigma: log 
## Link function for nu   : logit

m4_gamlss <- gamlss(y ~ sites + re(random=~1|year)+offset(log(area)),family=ZINBI, data=dat,trace = FALSE)

# include a predictor to the zero inflation (nu) and sigma parameters 
m5_gamlss <- gamlss(y ~ sites + re(random=~1|year)+offset(log(area)),
                    nu.fo=~sites,
                    sigma.fo=~sites,
                    family=ZINBI, data=dat,n.cyc = 100, trace = FALSE)
summary(m5_gamlss)

## ******************************************************************
## Family:  c("ZINBI", "Zero inflated negative binomial type I") 
## 
## Call:  gamlss(formula = y ~ sites + re(random = ~1 | year) +  
##     offset(log(area)), sigma.formula = ~sites, nu.formula = ~sites,  
##     family = ZINBI, data = dat, n.cyc = 100, trace = FALSE) 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  log
## Mu Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.63953    0.09313  -6.867 8.54e-11 ***
## sitesb      -0.74113    0.13831  -5.358 2.36e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## Sigma link function:  log
## Sigma Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -0.4536     0.2211  -2.051   0.0416 *
## sitesb       -1.0433     0.5234  -1.993   0.0476 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## Nu link function:  logit 
## Nu Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.5504     0.5511  -4.628 6.72e-06 ***
## sitesb        2.7289     0.5899   4.626 6.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## NOTE: Additive smoothing terms exist in the formulas: 
##  i) Std. Error for smoothers are for the linear effect only. 
## ii) Std. Error for the linear terms maybe are not accurate. 
## ------------------------------------------------------------------
## No. of observations in the fit:  200 
## Degrees of Freedom for the fit:  5
##       Residual Deg. of Freedom:  195 
##                       at cycle:  7 
##  
## Global Deviance:     1015.286 
##             AIC:     1025.286 
##             SBC:     1041.777 
## ******************************************************************

#compare with model 5 from glmmTMB
summary(m5)

##  Family: nbinom1  ( log )
## Formula:          y ~ sites + (1 | trans) + offset(log(area))
## Zero inflation:     ~sites
## Dispersion:         ~sites
## Data: dat
## 
##      AIC      BIC   logLik deviance df.resid 
##   1023.8   1046.9   -504.9   1009.8      193 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance Std.Dev.
##  trans  (Intercept) 0.1478   0.3844  
## Number of obs: 200, groups:  trans, 40
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7338     0.1305  -5.623 1.87e-08 ***
## sitesb       -0.6728     0.1727  -3.897 9.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Zero-inflation model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.1117     0.9022  -3.449 0.000562 ***
## sitesb        3.3197     0.9250   3.589 0.000332 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Dispersion model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.8668     0.2241   8.331   <2e-16 ***
## sitesb       -6.6273    45.0699  -0.147    0.883    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3. Modeling density with the library glmmTMB

In cases that we have densities as a response variable we cant use discrete probability distributions (Poisson, negative binomial, etc).

One possible alternative is to use the Tweedie distribution. This is a a very flexible continuous distribution, allowing to adjust variables with high dispersion and a peak at zero.

densi<-dat$y/dat$area

m1_tweedie <- glmmTMB(densi ~ sites + (1|trans),
              family=tweedie, data=dat)
summary(m1_tweedie)

##  Family: tweedie  ( log )
## Formula:          densi ~ sites + (1 | trans)
## Data: dat
## 
##      AIC      BIC   logLik deviance df.resid 
##    236.1    252.6   -113.0    226.1      195 
## 
## Random effects:
## 
## Conditional model:
##  Groups Name        Variance  Std.Dev. 
##  trans  (Intercept) 2.159e-09 4.647e-05
## Number of obs: 200, groups:  trans, 40
## 
## Dispersion parameter for tweedie family (): 0.443 
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7141     0.0860  -8.303   <2e-16 ***
## sitesb       -1.4515     0.1685  -8.615   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

m1_tweedie_resi <- simulateResiduals(m1_tweedie)

plot(m1_tweedie_resi)

Not bad at all

An interesting aspect of glmmTMB its that it has the capability to simulate from a fitted model.

#simulate from the tweedie model 
simo=simulate(m1_tweedie, seed=1,nsim =10)

data_sim<-cbind(dat,simo )
data_sim[1,]

##   y sites year trans     area     sim_1 sim_2     sim_3 sim_4     sim_5
## 1 6     a    1     1 21.03874 0.2991762     0 0.6064395     0 0.1660954
##       sim_6     sim_7    sim_8 sim_9    sim_10
## 1 0.7302457 0.1390962 1.084313     0 0.5485966

p <- ggplot(data_sim, aes(sites,y/area))
p + stat_summary(fun.data = "mean_cl_boot",colour="purple")+ylab("observed-density")+xlab("sites")

psim <- ggplot(data_sim, aes(sites,sim_1))
psim + stat_summary(fun.data = "mean_cl_boot",colour="purple")+ylab("sim-density")+xlab("sites")