The source of this document is available at https://github.com/inbo/Rpubs. Pull requests with corrections or addition are welcome.

library(ggplot2)
library(scales)
library(pscl)
library(MASS)
library(tidyr)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## 
## Attaching package: 'lme4'
## The following object is masked from 'package:stats':
## 
##     sigma
set.seed(1234)
n <- 1e4
n.sim <- 1e3
mean.poisson <- 0.05
mean.zeroinfl <- 10000
prop.zeroinfl <- 0.1

Lots of zero’s doesn’t imply zero-inflation

dataset <- data.frame(
  Poisson = rpois(n, lambda = mean.poisson)
)
ggplot(dataset, aes(x = Poisson)) + 
  geom_histogram(binwidth = 1)

table(dataset$Poisson)
## 
##    0    1    2 
## 9533  462    5

The example above generates values from a Poisson distribution with mean 0.05. Although it has no zero-inflation, 95.3% of the values are zero.

Zero-inflation can also occur with low number of zero’s

dataset$ZeroInflatedPoisson <- rbinom(n, size = 1, prob = 1 - prop.zeroinfl) * 
  rpois(n, lambda = mean.zeroinfl)
ggplot(dataset, aes(x = ZeroInflatedPoisson)) + 
  geom_histogram(binwidth = 1)

table(dataset$ZeroInflatedPoisson == 0)
## 
## FALSE  TRUE 
##  9007   993

The second example generates a data from a zero-inflated Poisson distribution with mean 10^{4} and 10, % excest zero’s. The actual proportion of zero’s is 9.9%.

How to test for zero-inflation

dataset <- expand.grid(
  Mean = c(mean.poisson, mean.zeroinfl),
  Rep = seq_len(n)
)
dataset$Poisson <- rpois(nrow(dataset), lambda = dataset$Mean)
dataset$ZeroInflatedPoisson <- rbinom(nrow(dataset), size = 1, prob = 1 - prop.zeroinfl) * 
  rpois(nrow(dataset), lambda = dataset$Mean)
ggplot(dataset, aes(x = Poisson)) + 
  geom_histogram(binwidth = 1)

ggplot(dataset, aes(x = ZeroInflatedPoisson)) + 
  geom_histogram(binwidth = 1)

For this example we generate a new dataset with two levels of Mean: 0.05, 10000.00. We generate both a Poisson and a zero-inflated Poisson variable. The latter has 10, % excess zero’s. The proportion of zero’s the variables are respectively 47.3% and 52.7%.

Let’s assume that our hypothesis is that all observations share the same mean. Note that is clearly not the case. If this hypothesis would hold, then we fit a simple intercept-only model.

m.pois.simple <- glm(Poisson ~ 1, data = dataset, family = poisson)
m.zero.simple <- glm(ZeroInflatedPoisson ~ 1, data = dataset, family = poisson)
summary(m.pois.simple)
## 
## Call:
## glm(formula = Poisson ~ 1, family = poisson, data = dataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -100.00  -100.00   -20.85    62.15    66.33  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   8.5172     0.0001   85172   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 138627905  on 19999  degrees of freedom
## Residual deviance: 138627905  on 19999  degrees of freedom
## AIC: 138739486
## 
## Number of Fisher Scoring iterations: 6
summary(m.zero.simple)
## 
## Call:
## glm(formula = ZeroInflatedPoisson ~ 1, family = poisson, data = dataset)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -94.96  -94.96  -94.96   70.19   74.27  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 8.4137957  0.0001053   79899   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 143653204  on 19999  degrees of freedom
## Residual deviance: 143653204  on 19999  degrees of freedom
## AIC: 143753725
## 
## Number of Fisher Scoring iterations: 6

After fitting the model we generate at random a new response variable based on the models and count the number of zero’s. This is repeated several times so that we can get a distribion of zero’s based on the model. Finally we compare the number of zero’s in the original dataset with this distribution.

simulated <- data.frame(
  PoissonSimple = apply(
    simulate(m.pois.simple, nsim = n.sim) == 0, 
    2, 
    mean
  ),
  ZeroInflatedSimple = apply(
    simulate(m.zero.simple, nsim = n.sim) == 0, 
    2, 
    mean
  )
)
ggplot(simulated, aes(x = PoissonSimple)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(xintercept = mean(dataset$Poisson == 0), colour = "red") +
  scale_x_continuous(label = percent)

ggplot(simulated, aes(x = ZeroInflatedSimple)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(xintercept = mean(dataset$Poisson == 0), colour = "red") +
  scale_x_continuous(label = percent)

In both cases none of the simulated values contains zero’s. The red line indicates the observed proportion of zero’s with is clear greater. This indicates that we have either a poor model fit or zero-inflation.

Improving the model

In the case of this example we have two groups with very different means: one group with low mean generating lots of zero’s and one group with large mean generating no zero’s. Adding this grouping improves the model.

m.pois.complex <- glm(Poisson ~ factor(Mean), data = dataset, family = poisson)
m.zero.complex <- glm(ZeroInflatedPoisson ~ factor(Mean), data = dataset, family = poisson)
summary(m.pois.complex)
## 
## Call:
## glm(formula = Poisson ~ factor(Mean), family = poisson, data = dataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6720  -0.3353  -0.3353   0.1405   4.2399  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -2.87884    0.04218  -68.25   <2e-16 ***
## factor(Mean)10000 12.08917    0.04218  286.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 138627905  on 19999  degrees of freedom
## Residual deviance:     13118  on 19998  degrees of freedom
## AIC: 124702
## 
## Number of Fisher Scoring iterations: 6
summary(m.zero.complex)
## 
## Call:
## glm(formula = ZeroInflatedPoisson ~ factor(Mean), family = poisson, 
##     data = dataset)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -134.295    -0.301    -0.301    10.023    13.695  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -3.09224    0.04691  -65.93   <2e-16 ***
## factor(Mean)10000 12.19918    0.04691  260.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 143653204  on 19999  degrees of freedom
## Residual deviance:  18653539  on 19998  degrees of freedom
## AIC: 18754062
## 
## Number of Fisher Scoring iterations: 5
simulated$PoissonComplex <- apply(
  simulate(m.pois.complex, nsim = n.sim) == 0, 
  2, 
  mean
)
simulated$ZeroInflatedComplex <- apply(
  simulate(m.zero.complex, nsim = n.sim) == 0, 
  2, 
  mean
)
ggplot(simulated, aes(x = PoissonComplex)) +
  geom_histogram(binwidth = 0.0005) +
  geom_vline(xintercept = mean(dataset$Poisson == 0), colour = "red") +
  scale_x_continuous(label = percent)

ggplot(simulated, aes(x = ZeroInflatedComplex)) +
  geom_histogram(binwidth = 0.0005) +
  geom_vline(xintercept = mean(dataset$ZeroInflatedPoisson == 0), colour = "red") +
  scale_x_continuous(label = percent)

In case of the Poisson variable, 45.1% of the simulated dataset from the complex model has more zero’s than the observed dataset. So this model can captures the zero’s and there is not zero-inflation. In case of the zero-inflated Poisson variable 0.0% of the simulated dataset from the complex model has more zero’s than the observed dataset. Hence we conclude that the zero’s are not model properly. We can’t improve the model further with covariates, hence we’ll have to treat is a a zero-inflated distribution.

m.zero.zi <- zeroinfl(
  ZeroInflatedPoisson ~ factor(Mean) | 1, 
  data = dataset, 
  dist = "poisson"
)
summary(m.zero.zi)
## 
## Call:
## zeroinfl(formula = ZeroInflatedPoisson ~ factor(Mean) | 1, data = dataset, 
##     dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0282 -0.2126 -0.2126  0.3297  9.1506 
## 
## Count model coefficients (poisson with log link):
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -2.98879    0.04716  -63.37   <2e-16 ***
## factor(Mean)10000 12.19909    0.04716  258.65   <2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.2170     0.0336  -65.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 14 
## Log-likelihood: -5.938e+04 on 3 Df
sim.zero.zi <- function(model){
  eta <- predict(model, type = "count")
  prob <- predict(model, type = "zero")
  new.value <- 
    rbinom(length(eta), size = 1, prob = 1 - prob) *
    rpois(length(eta), lambda = eta)
  mean(new.value == 0)
}
simulated$ZeroInflatedZI <- replicate(n.sim, sim.zero.zi(m.zero.zi))
ggplot(simulated, aes(x = ZeroInflatedZI)) +
  geom_histogram(binwidth = 0.0005) +
  geom_vline(xintercept = mean(dataset$ZeroInflatedPoisson == 0), colour = "red") +
  scale_x_continuous(label = percent)

When we fit a proper zero-inflated model to the zero-inflated Poisson variable, then 45.5% of the simulated datasets have more zero’s than the observed dataset. So the zero-inflation is proper handled.

Modelling zero-inflation by overdispersion

m.zero.nb <- glm.nb(ZeroInflatedPoisson ~ factor(Mean), data = dataset)
summary(m.zero.nb)
## 
## Call:
## glm.nb(formula = ZeroInflatedPoisson ~ factor(Mean), data = dataset, 
##     init.theta = 0.6535490762, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5298  -0.2963  -0.2963   0.0850   2.8393  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -3.09224    0.04854  -63.71   <2e-16 ***
## factor(Mean)10000 12.19918    0.05009  243.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.6535) family taken to be 1)
## 
##     Null deviance: 130864  on 19999  degrees of freedom
## Residual deviance:  14676  on 19998  degrees of freedom
## AIC: 204749
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.65355 
##           Std. Err.:  0.00868 
## 
##  2 x log-likelihood:  -204742.76400
dataset$ID <- seq_along(dataset$Mean)
m.zero.olre <- glmer(
  ZeroInflatedPoisson ~ factor(Mean) + (1 | ID), 
  data = dataset,
  family = poisson
)
summary(m.zero.olre)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: ZeroInflatedPoisson ~ factor(Mean) + (1 | ID)
##    Data: dataset
## 
##       AIC       BIC    logLik  deviance  df.resid 
##  218344.4  218368.1 -109169.2  218338.4     19997 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -0.99401 -0.06651 -0.06651  0.00113  0.81490 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  ID     (Intercept) 8.377    2.894   
## Number of obs: 20000, groups:  ID, 20000
## 
## Fixed effects:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -5.38381    0.06933  -77.66   <2e-16 ***
## factor(Mean)10000 13.64882    0.07465  182.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## fct(M)10000 -0.920
simulated$ZeroInflatedNB <- apply(
  simulate(m.zero.nb, nsim = n.sim) == 0, 
  2, 
  mean
)
ggplot(simulated, aes(x = ZeroInflatedNB)) +
  geom_histogram(binwidth = 0.0005) +
  geom_vline(xintercept = mean(dataset$ZeroInflatedPoisson == 0), colour = "red") +
  scale_x_continuous(label = percent)

simulated$ZeroInflatedOLRE <- apply(
  simulate(m.zero.olre, nsim = n.sim) == 0, 
  2, 
  mean
)
## Warning in rpois(nsim * length(ftd), ftd): NAs produced
ggplot(simulated, aes(x = ZeroInflatedOLRE)) +
  geom_histogram(binwidth = 0.0005) +
  geom_vline(xintercept = mean(dataset$ZeroInflatedPoisson == 0), colour = "red") +
  scale_x_continuous(label = percent)
## Warning: Removed 19 rows containing non-finite values (stat_bin).

mean.2 <- 5
dataset2 <- expand.grid(
  Mean = mean.2,
  Rep = seq_len(100)
)
dataset2$ZeroInflatedPoisson <- 
  rbinom(nrow(dataset2), size = 1, prob = 1 - prop.zeroinfl) * 
  rpois(nrow(dataset2), lambda = dataset2$Mean)
ggplot(dataset2, aes(x = ZeroInflatedPoisson)) + 
  geom_histogram(binwidth = 1)

m.zero2 <- glm(ZeroInflatedPoisson ~ 1, data = dataset2, family = poisson)
summary(m.zero2)
## 
## Call:
## glm(formula = ZeroInflatedPoisson ~ 1, family = poisson, data = dataset2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0100  -0.7664   0.2172   0.6576   3.2357  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.51072    0.04698   32.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 198.17  on 99  degrees of freedom
## Residual deviance: 198.17  on 99  degrees of freedom
## AIC: 500.78
## 
## Number of Fisher Scoring iterations: 5
simulated$ZeroInflated2 <- apply(
  simulate(m.zero2, nsim = n.sim) == 0, 
  2, 
  mean
)
ggplot(simulated, aes(x = ZeroInflated2)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(xintercept = mean(dataset2$ZeroInflatedPoisson == 0), colour = "red") +
  scale_x_continuous(label = percent)

m.zero.nb2 <- glm.nb(ZeroInflatedPoisson ~ 1, data = dataset2)
summary(m.zero.nb2)
## 
## Call:
## glm.nb(formula = ZeroInflatedPoisson ~ 1, data = dataset2, init.theta = 5.897146834, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5927  -0.5924   0.1621   0.4841   2.2220  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.51072    0.06248   24.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(5.8971) family taken to be 1)
## 
##     Null deviance: 129.39  on 99  degrees of freedom
## Residual deviance: 129.39  on 99  degrees of freedom
## AIC: 488.65
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  5.90 
##           Std. Err.:  2.26 
## 
##  2 x log-likelihood:  -484.647
simulated$ZeroInflatedNB2 <- apply(
  simulate(m.zero.nb2, nsim = n.sim) == 0, 
  2, 
  mean
)
ggplot(simulated, aes(x = ZeroInflatedNB2)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(xintercept = mean(dataset2$ZeroInflatedPoisson == 0), colour = "red") +
  scale_x_continuous(label = percent)

dataset2$ID <- seq_along(dataset2$Mean)
m.zero.olre2 <- glmer(
  ZeroInflatedPoisson ~ (1 | ID), 
  data = dataset2,
  family = poisson
)
simulated$ZeroInflatedOLRE2 <- apply(
  simulate(m.zero.olre2, nsim = n.sim) == 0, 
  2, 
  mean
)
ggplot(simulated, aes(x = ZeroInflatedOLRE2)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(xintercept = mean(dataset2$ZeroInflatedPoisson == 0), colour = "red") +
  scale_x_continuous(label = percent)

m.zero.zi2 <- zeroinfl(
  ZeroInflatedPoisson ~ 1, 
  data = dataset2, 
  dist = "poisson"
)
summary(m.zero.zi2)
## 
## Call:
## zeroinfl(formula = ZeroInflatedPoisson ~ 1, data = dataset2, dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6896 -0.5706  0.1753  0.5483  3.1591 
## 
## Count model coefficients (poisson with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.63254    0.04773   34.21   <2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.0437     0.3222  -6.342 2.26e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 6 
## Log-likelihood: -227.7 on 2 Df
simulated$ZeroInflatedZI2 <- replicate(n.sim, sim.zero.zi(m.zero.zi2))
ggplot(simulated, aes(x = ZeroInflatedZI2)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(xintercept = mean(dataset2$ZeroInflatedPoisson == 0), colour = "red") +
  scale_x_continuous(label = percent)

A negative binomial distribution can capture some of the zero-inflation by overdispersion. Especially in cases with low mean and few data. When the mean is low, a reasonal portion of the zero origination from the poisson distribution. However, the shape of the distribution is different.

distribution <- data.frame(
  Count = 0:20
)
distribution$Truth <- dpois(
  distribution$Count, 
  lambda = mean.2
) + ifelse(
  distribution$Count == 0,
  prop.zeroinfl,
  0
)
distribution$Poisson <- dpois(
  distribution$Count, 
  lambda = exp(coef(m.zero2))
)
distribution$ZIPoisson <- dpois(
  distribution$Count, 
  lambda = exp(coef(m.zero.zi2)[1])
) + ifelse(
  distribution$Count == 0,
  plogis(coef(m.zero.zi2)[2]),
  0
)

se <- sqrt(VarCorr(m.zero.olre2)$ID)
z <- seq(qnorm(.001, sd = se), qnorm(.999, sd = se), length = 101)
dz <- dnorm(z, sd = se)
dz <- dz / sum(dz)
delta <- outer(
  distribution$Count,
  exp(z + fixef(m.zero.olre2)),
  FUN = dpois
)
distribution$OLRE <- delta %*% dz

distribution$NegBin <- dnbinom(
  distribution$Count, 
  size = m.zero.nb2$theta,
  mu = exp(coef(m.zero.nb2))
)
long <- gather(distribution, key = "Model", value = "Density", -Count)
## Warning: attributes are not identical across measure variables; they will
## be dropped
long$Truth <- long$Model == "Truth"
ggplot(long, aes(x = Count, y = Density, colour = Model, linetype = Truth)) + 
  geom_line()