Guides

Bolker et al. 2009. Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution.
Bolker. GLMM FAQ
Bolker et al. 2011. GLMMs in action: gene-by-environment interaction in total fruit production of wild populations of Arabidopsis thaliana. Part 1, Part 2
Bolker. In: Ecological Statistics: Contemporary Theory and Application
- GLMM:worked examples
- Reporting GLMM results
Bolker et al. 2013. Strategies for fitting nonlinear ecological models in R, AD Model Builder, and BUGS
- Bolker et al. 2012. Owls example: a zero-inflated, generalized linear mixed model for count data
- other examples from Non-Linear Model Working Group
A Practical Guide to Mixed Models in R
Schielzeth and Nakagawa 2012. Nested by design: model fitting and interpretation in a mixed model era. Methods in Ecology and Evolution.
Stroup. Rethinking the Analysis of Non-Normal Data in Plant and Soil Science
Elston et al. 2001. Analysis of aggregation, a worked example: Numbers of ticks on red grouse chicks
Annika Nelson. Ant GLMM Example
R package pscl vignette: Regression Models for Count Data in R

Purpose

Identify reproductive barriers between two sympatric moth-pollinated plant species, Schiedea kaalae and S. hookeri by fitting a generalized linear mixed model (GLMM).

In the experimental design, the following crosstypes were made:

within species, between population (may show outbreeding depression or heterosis)
within species, within populations (may show inbreeding depression)
hybrids between species (indicates species barrier from pollination to seed production)

In this analysis the response variable is the number of seeds produced by each cross. Other barriers (hybrid survival, biomass, flowering) could be analyzed in a similar framework, with appropriate changes to the underlying distribution. For count data, we cannot use a normal distribution, even after transformation. Instead we use discrete distributions available in generalized models - Poisson, negative binomial, and variants. In addition, the data are zero-inflated, so we model this as a separate process.

Fixed effects:

crosstype - hybrids, within population, between populations
species - species of the maternal plant that produced the seeds

Potential random effects:

mompop - maternal plant population
mompid - maternal plant, specified by its population and ID
dadpop - paternal plant population

Data Import

sds <- read.table("hybridseeds.csv", header=T, sep="\t")

#treat populations as factors
sds$mompop <- as.factor(sds$mompop)
sds$dadpop <- as.factor(sds$dadpop)
sds$linenum <- as.factor(sds$linenum)

#exclude "closed" dadpop 
sds$dadpop[sds$dadpop=="closed"] <- NA
sds$dadpop<- droplevels(sds$dadpop)
sds <- na.omit(sds)

#rename crosstype codes
levels(sds$crosstype) <- list(between="e",within="a",hybrid="h")
#made "between" the first reference level to facilitate comparison between outcrossing populations and hybridizing species 

#define interactions
sds <- within(sds, sxc <- interaction(species,crosstype))
sds <- within(sds, sxcxm <- interaction(species,crosstype,mompop,momid))
sds <- within(sds, mompid <- as.factor(paste(mompop,momid,sep=".")))
sds <- within(sds, smompop <- as.factor(paste(species,mompop,sep="")))
sds <- within(sds, yesno <- as.numeric(seednum>0))
#check final structure
str(sds)

   'data.frame':    533 obs. of  15 variables:
    $ linenum   : Factor w/ 534 levels "1","2","3","4",..: 22 23 24 25 26 27 28 29 30 31 ...
    $ labelid   : int  18 19 22 23 24 25 32 35 36 37 ...
    $ species   : Factor w/ 2 levels "hook","kaal": 1 1 1 1 1 1 1 2 2 2 ...
    $ mompop    : Factor w/ 5 levels "794","879","892",..: 1 1 1 1 1 1 1 3 3 3 ...
    $ momid     : Factor w/ 18 levels "1","10","10-1",..: 6 6 11 11 11 11 10 11 1 1 ...
    $ dadpop    : Factor w/ 7 levels "3587","794","866",..: 2 4 2 2 5 4 4 4 4 4 ...
    $ crosstype : Factor w/ 3 levels "between","within",..: 2 1 2 2 3 1 1 3 3 3 ...
    $ dadspecies: Factor w/ 2 levels "H","K": 1 1 1 1 2 1 1 1 1 1 ...
    $ cross     : Factor w/ 6 levels "HHbetween","HHwithin",..: 2 1 2 2 3 1 1 4 4 4 ...
    $ seednum   : int  20 18 0 11 0 1 7 4 14 2 ...
    $ sxc       : Factor w/ 6 levels "hook.between",..: 3 1 3 3 5 1 1 6 6 6 ...
    $ sxcxm     : Factor w/ 540 levels "hook.between.794.1",..: 153 151 303 303 305 301 271 318 18 18 ...
    $ mompid    : Factor w/ 26 levels "3587.10","3587.10-1",..: 8 8 10 10 10 10 9 23 18 18 ...
    $ smompop   : Factor w/ 5 levels "hook794","hook879",..: 1 1 1 1 1 1 1 4 4 4 ...
    $ yesno     : num  1 1 0 1 0 1 1 1 1 1 ...
    - attr(*, "na.action")=Class 'omit'  Named int 427
     .. ..- attr(*, "names")= chr "427"

Data Inspection

Replication

The sample sizes are unbalanced at all levels, including maternal population:

reptab <- with(sds, table(smompop,crosstype))
mosaic(reptab, pop=F)
labeling_cells(text = reptab, margin = 0)(reptab)

Replication is low for some within-population crosses. The replication is even lower for each maternal plant, so we need to be wary of estimates when subsetting at this level:

with(sds, kable(table(mompid,crosstype)))

	between	within	hybrid
3587.10	3	2	7
3587.10-1	0	0	1
3587.14	5	2	18
3587.15	3	0	7
3587.7	4	2	20
3587.A	1	2	6
3587.C	6	3	11
794.2	18	13	33
794.4	7	6	14
794.5	10	5	20
794.6	0	1	7
879.10-1	15	6	29
879.2-2	3	6	18
879.3-8	2	1	6
879.G-2	3	3	9
879.H-2	0	2	4
879.N-5	6	4	16
892.1	6	3	12
892.10	1	0	2
892.2	1	0	0
892.3	1	1	9
892.4	2	1	11
892.5	5	4	12
904.2	6	3	17
904.3	1	5	12
904.5	10	6	32

Overall data distribution

To identify the best-fitting distribution, we make quantile-quantile plots of the raw data against various distributions. The more points within the confidence interval envelopes, the better the fit. Later, we present quantile-quantile plots of the model residuals to assess model fit.

Based on these fits (none of which account for zero-inflation), the negative binomial distribution best fits the overall data. We will also formally test the Poisson and normal distributions. Zero-inflation is evident from the hockey-stick shapes:

#QQ plots against various distributions
set.seed(1)
par(mfrow=c(2,3))
normal <- fitdistr(sds$seednum+1, "normal")
qqp(sds$seednum+1, "norm", main="Normal")
lognormal <- fitdistr(sds$seednum+1, "lognormal")
qqp(sds$seednum+1, "lnorm", main="Log Normal")
pois <- fitdistr(sds$seednum+1, "Poisson")
qqp(sds$seednum, "pois", pois$estimate, main="Poisson")
nbinom <- fitdistr(sds$seednum+1, "Negative Binomial") 
qqp(sds$seednum+1, "nbinom", size = nbinom$estimate[[1]], mu=nbinom$estimate[[2]], main="Negative Binomial")
gamma <- fitdistr(sds$seednum+1, "gamma")
qqp(sds$seednum+1, "gamma", shape = gamma$estimate[[1]], rate = gamma$estimate[[2]], main="Gamma")

Another visual technique to identify an underlying distribution is the rootogram:

poisf <- goodfit(sds$seednum, type="poisson") 
vcd::rootogram(poisf, main="Raw data Poisson fit", xlab="Seeds")

nbinomf <- goodfit(sds$seednum, type="nbinom") 
vcd::rootogram(nbinomf, main="Raw data negative binomial fit", xlab="Seeds")

sc.c.nb.hurd <- countreg::hurdle(seednum~ species*crosstype | crosstype , data=sds, dist="negbin", zero="binom")
countreg::rootogram(sc.c.nb.hurd, main="Fixed effects only hurdle model (negative binomial + binomial) fit")

Distributions by fixed factors

ggplot(sds, aes(x = seednum, fill=species)) +
  geom_histogram(data=subset(sds,species == "hook"), aes(y=-..density..),binwidth=1)+
  geom_histogram(data=subset(sds,species == "kaal"), aes(y= ..density..),binwidth=1)+
  coord_flip() + facet_grid(~crosstype) + labs(y="Histogram", x="Seeds")

Distributions by random factors

ggplot(aes(y=seednum, x=mompid, color=crosstype), data=sds) + geom_count(alpha=0.8) + coord_flip() + labs(x="Maternal plant", y="Seeds")

Homogeneity of variances across subsets

Our mixed model uses one parameter to capture random effect variance, which is assumed to be homogeneous. Plotting on a log scale should uncouple variances from means to assess this visually. Subsets are species * crosstype * maternal plant.

Subset variances are not homogeneous:

ggplot(aes(y=log(seednum+1), x=sxcxm, color=crosstype), data=sds) + geom_boxplot() + coord_flip() + labs(y="ln(seeds + 1)",x="Subsets")

Subset mean-variance relationship

Various distributions make different assumptions about the mean-variance (µ-Var) ratio. If the variance is greater than the mean (Var > µ, region above the black line representing the Poisson distribution), the data are overdispersed.

Distribution	Variance-mean model
Poisson	Var = µ
quasi-Poisson	Var = φµ
Negative binomial	Var = µ(1 + αµ)

We can see that the data are overdispersed, but both the negative binomial and quasi-Poisson seem to fit better than the Poisson:

grpVars <- with(sds, tapply(seednum, list(sxcxm), var))
grpMeans <- with(sds, tapply(seednum, list(sxcxm), mean))
grpCounts <- with(sds, tapply(seednum, list(sxcxm), length))
#set weight=grpCounts to weight loess by sample sizes
ggplot(na.omit(data.frame(grpMeans,grpVars,grpCounts)),
       aes(x=grpMeans,y=grpVars, weight=1))+geom_point(aes(size=grpCounts))+
  geom_abline(intercept = 0, slope = 1, aes(color="Poisson"))+
  geom_smooth(method="loess", se=F, aes(color="Loess"))+
  geom_smooth(method="lm",formula=y~x-1,aes(color="Quasi-Poisson"))+
  geom_smooth(method="lm",formula=y~I(x^2)+offset(x)-1, aes(color="Negative Binomial"))+
  guides(colour=guide_legend(title="Fit"),size=guide_legend(title="Sample size")) + labs(x="Subset Mean", y="Subset Variance") + labs(subtitle="Subset: species*crosstype*mompid")

Fixed effects

Effects and interactions in these plots are simply given by the mean, which may be unduly influenced by high values.

intplot <- ggplot(sds,aes(x=crosstype,y=seednum))+
  geom_count(aes(size = ..prop.., group=sxc),alpha=0.5)+
  stat_summary(aes(x=as.numeric(crosstype)),fun.y=mean,geom="line")+ facet_grid(~species)
intplot + aes(group=species, color=species)

Random effects

Maternal population

intplot + aes(group=mompop, color=mompop)

Maternal plant

intplot + aes(group=mompid, color=mompop)

Paternal population

intplot + aes(group=dadpop, color=dadpop)

Zeroes by category

ggplot(sds,aes(x=crosstype,fill=factor(seednum>0))) + geom_bar(position="fill") + facet_grid(~species) + labs(y="Proportion") + guides(fill=guide_legend(title="Seeds produced"))

Run models on subsets

Run many generalized linear models on subsets of the data defined by crosstype | mompid to see if effects estimates are consistent within maternal plants.

Most maternal plant subsets agree, but some are problematic outliers. Three plants (plus two missing) can be picked out visually from the random effects interaction plot above, the estimated parameters of each subset model, and the QQ plot of the estimated parameters:

 #had to get rid of species or mompid since mompid is nested inside species. dadpop also works
glm.lis <- lmList(seednum~crosstype|mompid,data=sds, family="poisson")
plot.lmList(glm.lis,scale=list(x=list(relation="free")))

   Using grp as id variables

qqmath.lmList(glm.lis)

   Using  as id variables

Diagnosis of each problematic maternal plant:

3587.10-1: only crossed to 1 hybrid
794.6: only crossed to 1 within, with no seeds
879.3-8: did not produce any seeds for any crosstype
892.2: only crossed to 1 between
892.4: only crossed to 1 within, with no seeds

We should consider rerunning model without these to see if random effect variance changes.

xtreme <- c("794.6","892.4","879.3-8","892.2","3587.10-1") 
with(droplevels(sds[sds$mompid %in% xtreme,]), kable(table(mompid,crosstype)))

	between	within	hybrid
3587.10-1	0	0	1
794.6	0	1	7
879.3-8	2	1	6
892.2	1	0	0
892.4	2	1	11

Models

We constructed the following models with the package glmmADMB. They all have the same fixed effects, species x crosstype

X = standard GL(M)M
ZI = zero-inflation mixture model
HR = hurdle model

Distribution, Random Effects:	None	Maternal plant	Maternal population
normal (norm)	X	X	X
Poisson (poi)	X, ZI	X, ZI, HR
quasi-Poisson (qpoi)	X, ZI	X, ZI, HR	ZI
negative binomial (nb)	X, ZI	X, ZI, HR	ZI

#Normal (Gaussian) distribution, identity link
sc.norm          <- glmmadmb(seednum~species*crosstype, data=sds, family="gaussian")
sc.mix           <- glmmadmb(seednum~species*crosstype + (1|mompid), data=sds, family="gaussian")
sc.mix.mompop    <- glmmadmb(seednum~species*crosstype + (1|mompop), data=sds, family="gaussian")

#Poissson distribution, log link
sc.poi           <- glmmadmb(seednum~species*crosstype, data=sds, family="poisson")
sc.poi.zi        <- glmmadmb(seednum~species*crosstype, data=sds, family="poisson", zeroInflation=T)
sc.mix.poi       <- glmmadmb(seednum~species*crosstype + (1|mompid), data=sds, family="poisson")
sc.mix.poi.zi    <- glmmadmb(seednum~species*crosstype + (1|mompid), data=sds, family="poisson", zeroInflation=T)

#Negative binomial distribution, log link
sc.nb            <- glmmadmb(seednum~species*crosstype, data=sds, family="nbinom")
sc.nb.zi         <- glmmadmb(seednum~species*crosstype, data=sds, family="nbinom", zeroInflation=T)
sc.mix.nb        <- glmmadmb(seednum~species*crosstype + (1|mompid), data=sds, family="nbinom")
sc.mix.nb.zi     <- glmmadmb(seednum~species*crosstype + (1|mompid), data=sds, family="nbinom", zeroInflation=T)
sc.mix.nb.zi.mompop    <- glmmadmb(seednum~species*crosstype + (1|mompop), data=sds, family="nbinom", zeroInflation=T)

#Quasi-Poisson (NB1), log link
sc.qpoi            <- glmmadmb(seednum~species*crosstype, data=sds, family="nbinom1")
sc.qpoi.zi         <- glmmadmb(seednum~species*crosstype, data=sds, family="nbinom1", zeroInflation=T)
sc.mix.qpoi        <- glmmadmb(seednum~species*crosstype + (1|mompid), data=sds, family="nbinom1")
sc.mix.qpoi.zi     <- glmmadmb(seednum~species*crosstype + (1|mompid), data=sds, family="nbinom1", zeroInflation=T)
sc.mix.qpoi.zi.mompop    <- glmmadmb(seednum~species*crosstype + (1|mompop), data=sds, family="nbinom1", zeroInflation=T)

#Hurdle models - split into binary and count models
#binary models
intercept.hr       <- glmmadmb(yesno~1, data=sds, family="binom")
c.hr               <- glmmadmb(yesno~crosstype, data=sds, family="binom")
c.mix.hr           <- glmmadmb(yesno~crosstype + (1|mompid), data=sds, family="binom")
sc.hr              <- glmmadmb(yesno~species*crosstype, data=sds, family="binom")
sc.mix.hr          <- glmmadmb(yesno~species*crosstype + (1|mompid), data=sds, family="binom")

#truncated count models
sc.mix.poi.tr      <- glmmadmb(seednum~species*crosstype + (1|mompid), data=subset(sds,seednum>0), family="truncpoiss")
sc.mix.nb.tr       <- glmmadmb(seednum~species*crosstype + (1|mompid), data=subset(sds,seednum>0), family="truncnbinom")
sc.mix.qpoi.tr     <- glmmadmb(seednum~species*crosstype + (1|mompid), data=subset(sds,seednum>0), family="truncnbinom1", 
                               admb.opts = admbControl(shess = FALSE, noinit = FALSE))

Model comparison

AIC

We will use the Aikake Information Criterion to pick the model the best fits the data, penalized by the number of parameters. Differences of 2 units are significant.

Zero inflation is present, so fit is improved with zero-inflated and hurdle models.

sc.names <- c("sc.norm","sc.mix","sc.mix.mompop","sc.poi","sc.poi.zi","sc.mix.poi","sc.mix.poi.zi","sc.nb","sc.nb.zi","sc.mix.nb","sc.mix.nb.zi","sc.mix.nb.zi.mompop","sc.qpoi","sc.qpoi.zi","sc.mix.qpoi","sc.mix.qpoi.zi","sc.mix.qpoi.zi.mompop")
sc.list <- sapply(sc.names, get, USE.NAMES=T)
sc.AIC <- ICtab(sc.list,mnames=sc.names,type="AIC", base=T, delta=F) # for AICc, nobs=nobs(sc.list[[1]])
class(sc.AIC)<-"data.frame"
hr <- AIC(intercept.hr, c.hr, c.mix.hr, sc.hr, sc.mix.hr)
trunc <- AIC(sc.mix.poi.tr,sc.mix.nb.tr,sc.mix.qpoi.tr)
combined <- trunc + hr[c(5,5,5),] #TODO: check df incorporation
all.names <- c(sc.names, "sc.mix.poi.tr","sc.mix.nb.tr","sc.mix.qpoi.tr")
all.list <- sapply(all.names, get, USE.NAMES=T)
all.AIC <- dfun(rbind(sc.AIC,combined))
all.AIC <- all.AIC[order(all.AIC$dAIC),]
kable(all.AIC, format.arg=list(digits=3))

	dAIC	df
sc.mix.qpoi.tr	0.0	15
sc.mix.nb.tr	22.7	15
sc.mix.qpoi.zi	43.4	9
sc.mix.nb.zi	54.3	9
sc.nb.zi	102.4	8
sc.mix.nb.zi.mompop	103.5	9
sc.mix.qpoi.zi.mompop	109.5	9
sc.qpoi.zi	111.7	8
sc.mix.qpoi	144.5	8
sc.mix.poi.tr	187.0	14
sc.mix.nb	204.9	8
sc.mix.poi.zi	221.1	8
sc.nb	252.5	7
sc.qpoi	256.8	7
sc.poi.zi	424.0	7
sc.mix	525.4	8
sc.mix.mompop	603.3	8
sc.norm	615.1	7
sc.mix.poi	1075.4	7
sc.poi	1730.3	6

The best-fiting model is a hurdle model with the following components:

binary component (c.mix.hr):
- response: yesno, logit link, binomial distribution
- fixed effects: species, crosstype, species x crosstype
- random effect: mompid
count component (sc.mix.qpoi.tr)
- response: seednum>0, log link, truncated quasi-Poisson distribution
- fixed effects: species, crosstype, species x crosstype
- random effect: mompid

Compare hurdle models

We can individually select the binary and count component of the hurdle models. The AIC of the combined model (sum of the AICs of the binary and count models) was computed above.

dfun(hr)[order(dfun(hr)$dAIC),]

                df   dAIC
   c.mix.hr      4  0.000
   sc.mix.hr     7  3.508
   c.hr          3 37.492
   sc.hr         6 40.060
   intercept.hr  1 40.516

dfun(trunc)[order(dfun(trunc)$dAIC),]

                  df   dAIC
   sc.mix.qpoi.tr  8   0.00
   sc.mix.nb.tr    8  22.68
   sc.mix.poi.tr   7 186.96

Hurdle models are not easily comparable to regular GLMMs except by AIC, so they are omitted until inference below.

Overdispersion

Looking at the normal, fixed effects model, we see that the residuals are not normal:

shapiro.test(sc.norm$residuals)#raw residuals!

   
    Shapiro-Wilk normality test
   
   data:  sc.norm$residuals
   W = 0.95667, p-value = 2.082e-11

Furthermore, we examined the overall data distribution for overdispersion above. This is a formal test of each model, with the null hypothesis that there is no overdispersion.

The mixed negative binomial and mixed quasi-Poisson are not over-dispersed:

sc.overdisp.names <- c("sc.mix.poi","sc.mix.poi.zi","sc.mix.nb","sc.mix.nb.zi.mompop", "sc.mix.nb.zi", "sc.mix.qpoi", "sc.mix.qpoi.zi.mompop","sc.mix.qpoi.zi")
sc.overdisp <- sapply(sc.overdisp.names, get, USE.NAMES=T)
kable(t(sapply(sc.overdisp, overdisp)))

	chisq	ratio	rdf	p
sc.mix.poi	1990.6421	3.784491	526	0.0000000
sc.mix.poi.zi	1778.4340	3.381053	526	0.0000000
sc.mix.nb	372.5411	0.708253	526	0.9999999
sc.mix.nb.zi.mompop	708.9644	1.347841	526	0.0000002
sc.mix.nb.zi	762.2621	1.449167	526	0.0000000
sc.mix.qpoi	1226.2945	2.331358	526	0.0000000
sc.mix.qpoi.zi.mompop	652.8878	1.241232	526	0.0001282
sc.mix.qpoi.zi	512.1089	0.973591	526	0.6596505

Coefficients

The coefficients estimated for each model agree qualitatively (excluding the normal distribution models, not shown, which has a different link function). It’s hard to interpret main effects in the presence of a significant interaction, but here are the non-rigorous results (wait for the lsmeans Tukey test):

S. kaalae produces fewer seeds than S. hookeri
Plants crossed within populations produce fewer seeds than between populations, the reference level
Plants crossed to another species (hybrids) produce fewer seeds than between populations
S. kaalae produces about the same number of seeds whether crossed within or between populations.

sc.loglink.names <- c("sc.poi","sc.poi.zi","sc.mix.poi","sc.mix.poi.tr","sc.mix.poi.zi","sc.nb","sc.nb.zi","sc.mix.nb","sc.mix.nb.zi", "sc.mix.nb.zi.mompop","sc.mix.nb.tr","sc.qpoi","sc.qpoi.zi", "sc.mix.qpoi", "sc.mix.qpoi.zi.mompop","sc.mix.qpoi.zi","sc.mix.qpoi.tr")
sc.loglink <- sapply(sc.loglink.names, get, USE.NAMES=T)
coefplot2(sc.loglink, legend.x="topright",legend=T,legend.args=list(cex=0.8, xpd=T, inset=c(-0.1,0)), col.pts=sample(gg_color_hue(length(sc.loglink.names))), spacing=0.05, lwd.2=2, lwd.1=4, intercept=T)

Residuals-fitted plot

These are plots of residuals (y) against fitted values (x) for each model. Ideally, the black spline is horizontal and residuals are equally spread. Otherwise, there are non-linear relationships beteeen the predictors and outcome. Zeroes in the original data are marked in red.

#cannot use the normal models since they do not have sd.est to compute Pearson residuals
par(mfrow=c(3,5),mar=c(0,0,1,0))
for(i in 4:length(sc.list)){ 
  if(i %in% c(2,3,8)) plot.new()
  plot(fitted(sc.list[[i]]), residuals(sc.list[[i]]), axes=F, frame= T, xlab = "Fitted Values", ylab = "Residuals", col=1+(sds$seednum==0), main=sc.names[i])
  abline(h = 0, lty = 2)
  lines(smooth.spline(fitted(sc.list[[i]]), residuals(sc.list[[i]])))
}

Residuals QQ plot

Model residuals are normally distributed if they follow the fitted line. For large sample sizes, residuals of Poisson and negative binomial GLMs tend towards normality. Zeroes in the original data are marked in red.

par(mfrow=c(3,5),mar=c(2,0,1,0))
for(i in 4:length(sc.list)){
  if(i %in% c(2,3,8)) plot.new()
  qqp(residuals(sc.list[[i]]), "norm", main=sc.names[i], lwd=1, col.lines=3, col=1+(sds$seednum==0))  
}

Location-scale plot

These are plots of sqrt(abs(residuals)) (y) against fitted values (x) for each model. Ideally, the loess fit should be flat. This tests the assumption of equal variance (homoscedasticity). Zeroes in the original data are marked in red.

par(mfrow=c(3,5),mar=c(0,0,1,0))
for(i in 4:length(sc.list)){
  if(i==8) plot.new()
  locscaleplot(sc.list[[i]],col=1+(sds$seednum==0))
  title(main=sc.names[i])
}

Inference

We chose the model with nearly the best (lowest) AIC, sc.mix.qpoi.tr+sc.mix.hr, to carry out inference tests and parameter estimation. This hurdle model is split into a binary model (data reduced to =0 or >0) and a count model (truncated distribution fit to observations >0), with the same fixed and random terms. Each model has its own parameter estimates.

Description

Count

mod <- sc.mix.qpoi.tr
print(mod)

   
   GLMM's in R powered by AD Model Builder:
   
     Family: truncnbinom 
     alpha = 1.9153 
     link = log 
   
   Fixed effects:
     Log-likelihood: -1079.12 
     AIC: 2174.24 
     Formula: seednum ~ species * crosstype + (1 | mompid) 
                   (Intercept)                 specieskaal 
                    2.57178857                 -0.73781370 
               crosstypewithin             crosstypehybrid 
                   -0.12982426                 -0.76278096 
   specieskaal:crosstypewithin specieskaal:crosstypehybrid 
                    0.06588293                  0.52063268 
   
   Random effects:
   Structure: Diagonal matrix
   Group=mompid
               Variance StdDev
   (Intercept)   0.1078 0.3283
   
   Number of observations: total=387, mompid=25

Binary

mod.hr <- sc.mix.hr
print(mod.hr)

   
   GLMM's in R powered by AD Model Builder:
   
     Family: binom 
     link = logit 
   
   Fixed effects:
     Log-likelihood: -288.431 
     AIC: 590.862 
     Formula: yesno ~ species * crosstype + (1 | mompid) 
                   (Intercept)                 specieskaal 
                    1.33667166                  0.20635116 
               crosstypewithin             crosstypehybrid 
                   -0.02307112                 -0.89466439 
   specieskaal:crosstypewithin specieskaal:crosstypehybrid 
                   -0.30533293                  0.43281810 
   
   Random effects:
   Structure: Diagonal matrix
   Group=mompid
               Variance StdDev
   (Intercept)    1.084  1.041
   
   Number of observations: total=533, mompid=26

Test significance of random effects

Using a likelihood ratio test, with a null hypothesis of zero variance, the random effect (maternal plant) is significant for both model parts:

Count

anova(sc.qpoi.zi, sc.mix.qpoi.zi) #double this p-value. or simulate null by permuting data.

   Analysis of Deviance Table
   
   Model 1: seednum ~ species * crosstype
   Model 2: seednum ~ species * crosstype
     NoPar  LogLik Df Deviance  Pr(>Chi)    
   1     8 -1430.4                          
   2     9 -1395.2  1    70.32 < 2.2e-16 ***
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Binary

anova(sc.hr, sc.mix.hr)

   Analysis of Deviance Table
   
   Model 1: yesno ~ species * crosstype
   Model 2: yesno ~ species * crosstype
     NoPar  LogLik Df Deviance  Pr(>Chi)    
   1     6 -307.71                          
   2     7 -288.43  1   38.552 5.331e-10 ***
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Test significance of interaction

By dropping it from the model and performing a likelihood-ratio test, we see that the species x crosstype interaction is significant for the count model but not the binary model:

Count

#sxc.chisq <- drop1(mod, test="Chisq") #load from file
dfun(sxc.chisq)

   Single term deletions
   
   Model:
   seednum ~ species * crosstype + (1 | mompid)
                     Df  dAIC   LRT Pr(>Chi)    
   <none>                0.00                   
   species:crosstype  2 11.56 15.56 0.000418 ***
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Binary

# sxc.chisq.hr <- drop1(sc.mix.hr, test="Chisq") #load from file
dfun(sxc.chisq.hr)

   Single term deletions
   
   Model:
   yesno ~ species * crosstype + (1 | mompid)
                     Df  dAIC   LRT Pr(>Chi)
   <none>               2.394               
   species:crosstype  2 0.000 1.606    0.448

Model summary

The model estimated the following parameters, with individual parameter significance determined by the Wald z-test, and fixed effect significance determined by analysis of deviance Wald test.

Count

summary(mod)

   
   Call:
   glmmadmb(formula = seednum ~ species * crosstype + (1 | mompid), 
       data = subset(sds, seednum > 0), family = "truncnbinom1", 
       admb.opts = admbControl(shess = FALSE, noinit = FALSE))
   
   AIC: 2174.2 
   
   Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
   (Intercept)                   2.5718     0.1286   20.00  < 2e-16 ***
   specieskaal                  -0.7378     0.1769   -4.17  3.0e-05 ***
   crosstypewithin              -0.1298     0.0825   -1.57     0.12    
   crosstypehybrid              -0.7628     0.0784   -9.73  < 2e-16 ***
   specieskaal:crosstypewithin   0.0659     0.1562    0.42     0.67    
   specieskaal:crosstypehybrid   0.5206     0.1240    4.20  2.7e-05 ***
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
   
   Number of observations: total=387, mompid=25 
   Random effect variance(s):
   Group=mompid
               Variance StdDev
   (Intercept)   0.1078 0.3283
   
   Negative binomial dispersion parameter: 1.9153 (std. err.: 0.15177)
   
   Log-likelihood: -1079.12

Anova(mod, type=3)

   Analysis of Deviance Table (Type III tests)
   
   Response: seednum
                     Df   Chisq Pr(>Chisq)    
   (Intercept)        1 399.872  < 2.2e-16 ***
   species            1  17.395  3.035e-05 ***
   crosstype          2 101.604  < 2.2e-16 ***
   species:crosstype  2  18.171  0.0001133 ***
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Binary

summary(mod.hr)

   
   Call:
   glmmadmb(formula = yesno ~ species * crosstype + (1 | mompid), 
       data = sds, family = "binom")
   
   AIC: 590.9 
   
   Coefficients:
                               Estimate Std. Error z value Pr(>|z|)   
   (Intercept)                   1.3367     0.4875    2.74   0.0061 **
   specieskaal                   0.2064     0.6745    0.31   0.7596   
   crosstypewithin              -0.0231     0.5167   -0.04   0.9644   
   crosstypehybrid              -0.8947     0.3793   -2.36   0.0184 * 
   specieskaal:crosstypewithin  -0.3053     0.7722   -0.40   0.6925   
   specieskaal:crosstypehybrid   0.4328     0.5540    0.78   0.4347   
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
   
   Number of observations: total=533, mompid=26 
   Random effect variance(s):
   Group=mompid
               Variance StdDev
   (Intercept)    1.084  1.041
   
   
   Log-likelihood: -288.431

Anova(mod.hr, type=3)

   Analysis of Deviance Table (Type III tests)
   
   Response: yesno
                     Df  Chisq Pr(>Chisq)   
   (Intercept)        1 7.5177    0.00611 **
   species            1 0.0936    0.75964   
   crosstype          2 5.5648    0.06189 . 
   species:crosstype  2 0.7898    0.67375   
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predicted random effects

These are box and QQ (to check normality) plots of the estimated random effect of each maternal plant. Note that this is NOT an assuption of the model, but they should approach normality for high sample sizes in a well-fitted model.

Count

predre <- setNames(data.frame(RE=ranef(mod)$mompid,SD=ranef(mod, sd=T)$`1`),c("RE","SD"))
ggplot(predre, aes(x = rownames(predre),y=RE)) +
  geom_point(size = 2) + coord_flip()+
  geom_errorbar(aes(ymin = RE-SD, ymax = RE+SD)) + labs(x="Maternal plants", y="Predicted random effects")

#Count
reStack <- ldply(ranef(mod))
print( qqmath( ~`(Intercept)`|.id, data=reStack, scales=list(relation="free"),
                 prepanel = prepanel.qqmathline,
                 panel = function(x, ...) {
                   panel.qqmathline(x, ...)
                   panel.qqmath(x, ...)
                 },
                 layout=c(1,1)))

Binary

#Binary
reStack <- ldply(ranef(mod.hr))
print( qqmath( ~`(Intercept)`|.id, data=reStack, scales=list(relation="free"),
                 prepanel = prepanel.qqmathline,
                 panel = function(x, ...) {
                   panel.qqmathline(x, ...)
                   panel.qqmath(x, ...)
                 },
                 layout=c(1,1)))

Predicted fixed effects, Count

Confidence intervals

These predictions and confidence intervals are only based on fixed effects, conditional on the estimates of the random-effect variances.

newdat <- expand.grid(crosstype=levels(sds$crosstype), species=levels(sds$species)) 
mm <- model.matrix(delete.response(terms(mod)),newdat) ## design matrix (fixed effects)
newdat$pred <- mm %*% fixef(mod) ## matrix-multiply X by the parameter vector β to get the predictions (or linear predictor in the case of GLM(M)s). could also use predict(mod, newdata=newdat)
newdat$seednum <- exp(newdat$pred) #For GLMMs, back-transform this with the inverse link function (e.g. plogis() for binomial, beta;  exp() for Poisson, negative binomial
predvar <- diag(mm %*% vcov(mod) %*% t(mm)) #extract the variance-covariance matrix of the parameters V, compute XVX′ to get the variance-covariance matrix of the predictions; extract the diagonal of this matrix to get variances of predictions; 

pdl <- position_dodge(width=0.1)
pd <- position_dodge(width=0.4)
g0 <- ggplot(sds,aes(x=crosstype,y=seednum,color=species)) + geom_count(aes(size = ..prop..),position=pd) + scale_size_area() 
g1 <- g0 + geom_line(aes(group=species),data=newdat, position=pd) 
#confidence intervals
newdat$SE <- sqrt(predvar) #take the square-root of the variances to get the standard deviations (errors) of the predictions
#confidence intervals = based on fixed-effected uncertainty only
#compute confidence intervals based on a Normal approximation;
#for GL(M)Ms, run the confidence interval boundaries (not the standard errors) through the inverse-link function.
g1 + geom_errorbar(data=newdat, aes(ymin=exp(pred-1.96*SE),ymax=exp(pred+1.96*SE)), position=pd)

Prediction intervals

These confidence intervals are based on fixed effects uncertainty and random effects variance.

#if computing prediction rather than confidence intervals, add the residual variance
newdat$SE2 <- sqrt(predvar+mod$alpha^2) 
g1 + geom_errorbar(data=newdat, aes(ymin=pred-1.96*SE2,ymax=pred+1.96*SE2), position=pd)

Least square means

The least square means procedure can generate predictor estimates of each type, and give their significance groupings with a post-hoc Tukey test. S. hookeri-produced hybrids produce less seeds than either crosses between or within S. hookeri populations. The other differences are not significant, but remember that the fixed effect of hybrid (vs. between) was significant (model summary).

Count

#Count
rg <- ref.grid(mod)
#summary(rg)
sxc.lsm <- lsmeans(rg, ~ crosstype*species)
plot(sxc.lsm)

cld(sxc.lsm, Letters=letters) #tukey letterings

    crosstype species   lsmean        SE df asymp.LCL asymp.UCL .group
    hybrid    kaal    1.591827 0.2689134 NA  1.064766  2.118887  a    
    within    kaal    1.770034 0.2813601 NA  1.218578  2.321489  a    
    hybrid    hook    1.809008 0.1557897 NA  1.503665  2.114350  a    
    between   kaal    1.833975 0.2058160 NA  1.430583  2.237367  a    
    within    hook    2.441964 0.1505596 NA  2.146873  2.737056   b   
    between   hook    2.571789 0.1286100 NA  2.319718  2.823860   b   
   
   Results are given on the log (not the response) scale. 
   Confidence level used: 0.95 
   P value adjustment: tukey method for comparing a family of 6 estimates 
   significance level used: alpha = 0.05

Binary

#Binary
rg.hr <- ref.grid(mod.hr)
sc.lsm.hr <- lsmeans(rg.hr, ~ crosstype*species)
plot(sc.lsm.hr)

options(digits=4)
cld.mod <- cld(sxc.lsm, Letters=letters) #tukey letterings
cld.mod[rev(order(cld.mod$species, cld.mod$crosstype)),]

    crosstype species lsmean     SE df asymp.LCL asymp.UCL .group
    hybrid    kaal     1.592 0.2689 NA     1.065     2.119  a    
    within    kaal     1.770 0.2814 NA     1.219     2.321  a    
    between   kaal     1.834 0.2058 NA     1.431     2.237  a    
    hybrid    hook     1.809 0.1558 NA     1.504     2.114  a    
    within    hook     2.442 0.1506 NA     2.147     2.737   b   
    between   hook     2.572 0.1286 NA     2.320     2.824   b   
   
   Results are given on the log (not the response) scale. 
   Confidence level used: 0.95 
   P value adjustment: tukey method for comparing a family of 6 estimates 
   significance level used: alpha = 0.05

Hybrids Seeds GLMM

John Powers

April 4, 2017

Guides

Purpose

Data Import

Data Inspection

Replication

Overall data distribution

Distributions by fixed factors

Distributions by random factors

Homogeneity of variances across subsets

Subset mean-variance relationship

Fixed effects

Random effects

Maternal population

Maternal plant

Paternal population

Zeroes by category

Run models on subsets

Models

Model comparison

AIC

Compare hurdle models

Overdispersion

Coefficients

Residuals-fitted plot

Residuals QQ plot

Location-scale plot

Inference

Description

Count

Binary

Test significance of random effects

Count

Binary

Test significance of interaction

Count

Binary

Model summary

Count

Binary

Predicted random effects

Count

Binary

Predicted fixed effects, Count

Confidence intervals

Prediction intervals

Least square means

Count

Binary