Poisson GLMMs
In this module we will go over a variety of more advanced models. We will start with Poisson GLMMs. Let’s first load the libraries we will need:
library(tidyverse)
library(emmeans)
library(car)
library(agridat)
library(lme4)
library(lmerTest)
library(glmmTMB)
library(DHARMa)
library(performance)
library(MuMIn)
library(bbmle)
library(aods3)
Let’s load in and read about the beall.webworms dataset. The variables of interest are the y-count of webworms, spray- spray treatment, and lead-lead treatment. Don’t worry about the block or other variables for now.
data("beall.webworms")
d1 <- beall.webworms
?beall.webworms ## info about the beall.webworms dataset
head(d1) ## view data set
## row col y block trt spray lead
## 1 1 1 1 B1 T1 N N
## 2 2 1 0 B1 T1 N N
## 3 3 1 1 B1 T1 N N
## 4 4 1 3 B1 T1 N N
## 5 5 1 6 B1 T1 N N
## 6 6 1 0 B2 T1 N N
Let’s examine a plot of data:
ggplot(d1, aes(x=spray, y=y, fill=lead)) +
geom_violin(scale="width", adjust=2) +
geom_point(position = position_jitterdodge(jitter.width=.5,
jitter.height=.1,
dodge.width = 1),
alpha=.1)+
facet_wrap(~block)
Here we create two models, r1 and r2. They were previously used in module 4C:
r1 <- glm(y ~ spray * lead, data=d1, family="poisson")
summary(r1)
##
## Call:
## glm(formula = y ~ spray * lead, family = "poisson", data = d1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6733 -1.0046 -0.9081 0.6141 4.2771
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.33647 0.04688 7.177 7.12e-13 ***
## sprayY -1.02043 0.09108 -11.204 < 2e-16 ***
## leadY -0.49628 0.07621 -6.512 7.41e-11 ***
## sprayY:leadY 0.29425 0.13917 2.114 0.0345 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1955.9 on 1299 degrees of freedom
## Residual deviance: 1720.4 on 1296 degrees of freedom
## AIC: 3125.5
##
## Number of Fisher Scoring iterations: 6
Anova(r1)
## Analysis of Deviance Table (Type II tests)
##
## Response: y
## LR Chisq Df Pr(>Chisq)
## spray 188.707 1 < 2.2e-16 ***
## lead 42.294 1 7.853e-11 ***
## spray:lead 4.452 1 0.03485 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(r1, ~spray:lead, type='response')
## spray lead rate SE df asymp.LCL asymp.UCL
## N N 1.400 0.0656 Inf 1.277 1.535
## Y N 0.505 0.0394 Inf 0.433 0.588
## N Y 0.852 0.0512 Inf 0.758 0.959
## Y Y 0.412 0.0356 Inf 0.348 0.488
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
We check overdispersion in the model:
check_overdispersion(r1) # overdispersion ratio calculator from performance
## # Overdispersion test
##
## dispersion ratio = 1.355
## Pearson's Chi-Squared = 1755.717
## p-value = < 0.001
Now let’s implment the r2 model with a negative binomial error distribution:
r2 <- glmmTMB(y ~ spray * lead, data=d1, family="nbinom2")
Anova(r2)
## Analysis of Deviance Table (Type II Wald chisquare tests)
##
## Response: y
## Chisq Df Pr(>Chisq)
## spray 125.5047 1 < 2.2e-16 ***
## lead 26.8005 1 2.256e-07 ***
## spray:lead 3.3942 1 0.06542 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(r2, ~spray:lead, type='response')
## spray lead response SE df lower.CL upper.CL
## N N 1.400 0.0855 1295 1.242 1.578
## Y N 0.505 0.0441 1295 0.425 0.599
## N Y 0.852 0.0611 1295 0.740 0.981
## Y Y 0.412 0.0391 1295 0.342 0.497
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
Now let’s simulate residuals for poisson and negative binomial models:
plot(simulateResiduals(r1))
hist(simulateResiduals(r1)) ## histogram should be flat
plot(simulateResiduals(r2))
hist(simulateResiduals(r2)) ## histogram should be flat
Improving the models: What’s next?
What’s next? Any aspects of the experimental design missing from the model?
We can construct models that includes any missing factors.
We can also make an individual-level random effect to use for overdispersed poisson:
d1$obs <- 1:length(d1$y) ## makes a unique number for each row in dataset
We can also create other models that include different random effects that account for blocks:
r0 <- glmmTMB(y ~ spray * lead, data=d1, family="gaussian") ### gaussian glm just to see how bad it really is
r1 <- glmmTMB(y ~ spray * lead, data=d1, family="poisson") ### poisson glm
r2 <- glmmTMB(y ~ spray * lead, data=d1, family="nbinom2") ### nb glm
r3 <- glmmTMB(y ~ spray * lead + (1|obs), data=d1, family="poisson") ### overdispersed poisson glm
r4 <- glmmTMB(y ~ spray * lead + (1|block), data=d1, family="nbinom2") ### nb w/ block
r5 <- glmmTMB(y ~ spray * lead + (1|obs) + (1|block), data=d1, family="poisson") ### OD poisson w/ block
Six models above differ only distribution and random effects. Fixed effects are the same. We can now use AIC for selecting the most appropriate distribution and random effects:
#### for model selection, use AIC or likihood ratio test
model.sel(r0,r1,r2,r3,r4,r5) ## from MuMIn
## Model selection table
## cnd((Int)) dsp((Int)) cnd(led) cnd(spr) cnd(led:spr) family
## r4 0.2835 + + + + nbinom2(log)
## r5 0.1399 + + + + poisson(log)
## r3 0.1042 + + + + poisson(log)
## r2 0.3365 + + + + nbinom2(log)
## r1 0.3365 + + + + poisson(log)
## r0 1.4000 + + + + gaussian(identity)
## random df logLik AICc delta weight
## r4 c(b) 6 -1497.706 3007.5 0.00 0.903
## r5 c(o)+c(b) 6 -1499.931 3011.9 4.45 0.097
## r3 c(o) 5 -1518.031 3046.1 38.63 0.000
## r2 5 -1521.484 3053.0 45.54 0.000
## r1 4 -1558.739 3125.5 118.03 0.000
## r0 5 -1929.308 3868.7 861.18 0.000
## Models ranked by AICc(x)
## Random terms:
## c(b) = 'cond(1 | block)'
## c(o) = 'cond(1 | obs)'
AICtab(r0,r1,r2,r3,r4,r5,base=T,logLik=T,weights=T) ## from bbmle
## logLik AIC dLogLik dAIC df weight
## r4 -1497.7 3007.4 431.6 0.0 6 0.903
## r5 -1499.9 3011.9 429.4 4.5 6 0.097
## r3 -1518.0 3046.1 411.3 38.7 5 <0.001
## r2 -1521.5 3053.0 407.8 45.6 5 <0.001
## r1 -1558.7 3125.5 370.6 118.1 4 <0.001
## r0 -1929.3 3868.6 0.0 861.2 5 <0.001
Now for further model interpretation you can examine residuals, fixed effects using Anova(), emmeans, etc.
Binomial and Beta GLMMs
Binomial
In this section we look at the binomial distribution. Let’s load in the data we will use:
data("wheatley.carrot")
?wheatley.carrot
Let’s clean it up a bit:
dat1 <- wheatley.carrot %>% filter(insecticide!='nil')
dat1$propdmg <- dat1$damaged/dat1$total
head(dat1)
## treatment insecticide depth rep damaged total propdmg
## 1 T01 diazinon 1.0 R1 120 187 0.64171123
## 2 T02 diazinon 2.5 R1 60 184 0.32608696
## 3 T03 diazinon 5.0 R1 35 179 0.19553073
## 4 T04 diazinon 10.0 R1 5 178 0.02808989
## 5 T05 diazinon 25.0 R1 66 182 0.36263736
## 6 T06 disulfoton 1.0 R1 97 187 0.51871658
hist(dat1$propdmg)
Now let’s plot out the data. Note that for plotting we can use the “beta_family” although for analysis we use “binomial”.
ggplot(dat1, aes(x=log(depth),
y=(damaged/total), color=insecticide)) +
geom_point(size=3) +
theme_bw(base_size = 16)
ggplot(dat1, aes(x=log(depth), y=(damaged/total), color=insecticide)) +
geom_point(size=3) +
geom_smooth(method='glm',
method.args=list(family="beta_family"),
formula = y~x) +
theme_bw(base_size = 16)
ggplot(dat1, aes(x=log(depth), y=(damaged/total),
color=insecticide)) + geom_point() +
geom_smooth(method='glm',
method.args=list(family="beta_family"),
formula = y~x+I(x^2))
Now let’s build two different binomical GLMMs with different set ups:
mod1 <- glmmTMB(cbind(damaged,total-damaged) ~ insecticide * depth + (1|rep), data=dat1, family='binomial')
mod2 <- glmmTMB(cbind(damaged,total-damaged) ~ insecticide * depth + I(depth^2) + (1|rep), data=dat1, family='binomial')
Note above that mod2 is applying a non-linear modification where we include depth and depth squared. The “^” is another way of saying raise depth to the 2nd power. The I() isolate this and tells the R interpreter that the model is including a second order fixed effect. For more info check ?formula.
We can plot residuals:
plot(simulateResiduals(mod1))
plot(simulateResiduals(mod2))
Which of the two is better?
We can also run ANOVAs on the models:
Anova(mod1)
## Analysis of Deviance Table (Type II Wald chisquare tests)
##
## Response: cbind(damaged, total - damaged)
## Chisq Df Pr(>Chisq)
## insecticide 91.449 1 < 2.2e-16 ***
## depth 2.543 1 0.1107873
## insecticide:depth 14.525 1 0.0001383 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(mod2)
## Analysis of Deviance Table (Type II Wald chisquare tests)
##
## Response: cbind(damaged, total - damaged)
## Chisq Df Pr(>Chisq)
## insecticide 111.62 1 < 2.2e-16 ***
## depth 563.71 1 < 2.2e-16 ***
## I(depth^2) 594.17 1 < 2.2e-16 ***
## insecticide:depth 15.85 1 6.858e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##Beta
In this section we show an example of applying the beta family using true proportions (or percentages). First let’s load data:
b1 <-read_csv("LynnETAL_2022_Ecography_sodCaseStudy.csv") %>%
filter(species=='SALA')
head(b1)
## # A tibble: 6 × 18
## study family speci…¹ species meas_…² site latit…³ longi…⁴ o_lat o_long
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 RivkinETAL_… Alism… Sagitt… SALA 2015 ACT-… 44.5 -77.3 NA NA
## 2 RivkinETAL_… Alism… Sagitt… SALA 2015 BPV-… 45.7 -77.6 NA NA
## 3 RivkinETAL_… Alism… Sagitt… SALA 2015 BSM-… 44.8 -76.1 NA NA
## 4 RivkinETAL_… Alism… Sagitt… SALA 2015 BTC-… 44.2 -76.8 NA NA
## 5 RivkinETAL_… Alism… Sagitt… SALA 2015 CBT-… 47.4 -79.7 NA NA
## 6 RivkinETAL_… Alism… Sagitt… SALA 2015 CFR-… 44.7 -77.1 NA NA
## # … with 8 more variables: mean_herb <dbl>, na_wdcpc <dbl>, mat <dbl>,
## # iso <dbl>, dtr <dbl>, precip <dbl>, aet <dbl>, notes <chr>, and abbreviated
## # variable names ¹species_name, ²meas_year, ³latitude, ⁴longitude
Let’s examine a histogram:
hist(b1$mean_herb) ## examine histogram. Data are entered as percentage (0-100%)
Let’s plot out the data. Note: data was collected at 42 unique sites (one data point per site).
ggplot(b1 , aes(x=latitude, y=mean_herb)) +
geom_point()
Now we need to convert to proportion and then do a small transformation to remove 0 and 1:
b1$mean_herb1 <- (b1$mean_herb/100)
b1$mean_herb1 <- (b1$mean_herb1*(length(b1$mean_herb1)-1)+.5)/length(b1$mean_herb1)
Let’s plot this:
ggplot(b1, aes(x=latitude, y=mean_herb1)) + geom_point() +
geom_smooth(method='glm',
method.args=list(family="beta_family"),
formula = y~x)
Now let’s build a beta GLMM:
bm1 <- glmmTMB(mean_herb1 ~ latitude, data=b1, family='beta_family')
Anova(bm1)
## Analysis of Deviance Table (Type II Wald chisquare tests)
##
## Response: mean_herb1
## Chisq Df Pr(>Chisq)
## latitude 9.1383 1 0.002503 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Check the summary:
summary(bm1)
## Family: beta ( logit )
## Formula: mean_herb1 ~ latitude
## Data: b1
##
## AIC BIC logLik deviance df.resid
## -126.9 -121.7 66.5 -132.9 39
##
##
## Dispersion parameter for beta family (): 30.5
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.69631 2.58221 2.206 0.0274 *
## latitude -0.17525 0.05797 -3.023 0.0025 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Plot and check residuals:
plot(simulateResiduals(bm1)) ## QQ plot is good. Resid vs pred is wiggle, probably ok.
Problem set
- Problem set with real data (Rivkin et al. 2018 Am J Bot)
Herbivory data was collected on 43 populations of an aquatic plant across a latitudinal gradient in Canada. At each population, many plants (~5-15) were examined for herbivory damage. Some additional covariates were recorded, such as Competition around the plant (1-3 from less to more) and plant height (cm).
Read in data:
d1 <-read.csv("ajb21098-sup-0002-appendixs2.csv")
head(d1)
## Population Latitude Longitude Competition Individual PlantHeight LeafDamage
## 1 ACT-ON 44.54829 -77.32384 2 4 41.0 0.02000000
## 2 ACT-ON 44.54829 -77.32384 2 6 43.0 0.03000000
## 3 ACT-ON 44.54829 -77.32384 2 18 17.5 0.04000000
## 4 ACT-ON 44.54829 -77.32384 2 2 37.0 0.06666667
## 5 ACT-ON 44.54829 -77.32384 2 3 48.0 0.07000000
## 6 ACT-ON 44.54829 -77.32384 2 12 39.5 0.08000000
## LarvalAbundance
## 1 0
## 2 0
## 3 0
## 4 1
## 5 0
## 6 0
hist(d1$LeafDamage)
Let’s remove 0’s and 1’s (see Smithson & Verkuilen 2006 or Douma
& Weedon 2018):
d1$LeafDamage <- (d1$LeafDamage*(length(d1$LeafDamage)-1)+.5)/length(d1$LeafDamage)
Now plot data:
ggplot(d1 , aes(x=Latitude, y=LeafDamage)) +
geom_point() +
geom_smooth(method='glm',
method.args=list(family="beta_family"),
formula = y~x)
Question: Does herbivory increase towards the equator? How do residuals look? Any way to improve them?
- Problem set with real data (Rivkin et al. 2018 Am J Bot), Rats dataset with only females. Half treated with a drug and the other half were untreated. Then they checked for tumors.Does the drug reduce probability of developing tumor? After 50 days? After 100 days?
Load data:
library(survival)
rats <- rats %>% filter(sex=="f")
?rats
## Help on topic 'rats' was found in the following packages:
##
## Package Library
## aods3 /Users/leoohyama/Library/R/x86_64/4.1/library
## survival /Library/Frameworks/R.framework/Versions/4.1/Resources/library
##
##
## Using the first match ...
head(rats)
## litter rx time status sex
## 1 1 1 101 0 f
## 2 1 0 49 1 f
## 3 1 0 104 0 f
## 4 3 1 104 0 f
## 5 3 0 102 0 f
## 6 3 0 104 0 f
Plot data:
ggplot(rats , aes(x=time, y=status)) +
geom_point() +
geom_smooth(method='glm',
method.args=list(family="binomial"),
formula = y~x) +
facet_wrap(~rx)
Zero-Inflated Models
In this section we will go over applying zero-inflated models. First let’s load the data (wheatley carrot infection by carrot fly larvae):
data("ridout.appleshoots")
Let’s clean up the data and check out the distribution of the data in question:
dat1 <- ridout.appleshoots %>% mutate(photo=as.factor(photo))
hist(dat1$roots)
Now let’s run several models: a Poisson, a negative binomial, and a negative binomial with a general intercept for zero inflation:
## poisson model
mod1 <- glmmTMB(roots ~ photo * bap , data=dat1, family='poisson')
plot(simulateResiduals(mod1))
## negative binomial model
mod2 <- glmmTMB(roots ~ photo * bap , data=dat1, family='nbinom2')
plot(simulateResiduals(mod2))
## negative binomial w/ a general intercept for zero inflation (ie. ZI equal for all observations)
mod3 <- glmmTMB(roots ~ photo * bap , data=dat1, family='nbinom2', zi=~1)
plot(simulateResiduals(mod3))
summary(mod3) ## back-transform ZI intercept (exp(p)/(1+exp(p)))
## Family: nbinom2 ( log )
## Formula: roots ~ photo * bap
## Zero inflation: ~1
## Data: dat1
##
## AIC BIC logLik deviance df.resid
## 1342.4 1364.0 -665.2 1330.4 264
##
##
## Dispersion parameter for nbinom2 family (): 12.6
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.895659 0.072535 26.134 <2e-16 ***
## photo16 -0.029863 0.135890 -0.220 0.8261
## bap 0.008760 0.006724 1.303 0.1927
## photo16:bap -0.032083 0.013065 -2.456 0.0141 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.226 0.151 -8.125 4.49e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What do we expect to cause the zero inflation? We can check out further details in the data’s metadata:
??ridout.appleshoots
ggplot(dat1, aes(x=roots)) +
geom_histogram() +
facet_grid(bap~photo) +
theme_bw(base_size=16)
Now let’s run a negative binomial with zero inflation specified by treatment:
mod4 <- glmmTMB(roots ~ photo * bap , data=dat1, family='nbinom2', zi=~photo)
summary(mod4)
## Family: nbinom2 ( log )
## Formula: roots ~ photo * bap
## Zero inflation: ~photo
## Data: dat1
##
## AIC BIC logLik deviance df.resid
## 1252.3 1277.5 -619.2 1238.3 263
##
##
## Dispersion parameter for nbinom2 family (): 13.8
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.898347 0.071568 26.525 <2e-16 ***
## photo16 -0.023525 0.130488 -0.180 0.8569
## bap 0.008222 0.006689 1.229 0.2190
## photo16:bap -0.029492 0.012427 -2.373 0.0176 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.4000 0.8446 -5.210 1.89e-07 ***
## photo16 4.2831 0.8627 4.965 6.87e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Plot residuals:
plot(simulateResiduals(mod4))
Now let’s compare all models with AIC. Which one was the best? Why do you think?
### check with AIC
AIC(mod1,mod2,mod3,mod4)
## df AIC
## mod1 4 1575.916
## mod2 5 1412.415
## mod3 6 1342.435
## mod4 7 1252.333