Zero-Inflated Modeling Made Workable

Objective
Understanding AIC and Model Choice
Part A — Simulated Cargo Like Data
Part B — Real Data: bioChemists
Part C — Bayesian models using JAGS
Takeaways

Objective

In this project we tell a clear story. We start with simple models and check how they fail when zeros pile up. We then add models that split the zero process from the count process. Finally we view the same story in a Bayesian light. This helps us see how uncertainty and model choice change the conclusions.

We model count data with excess zeros. We explore:

Standard models: Poisson, NB
Zero inflated: ZIP, ZINB
Hurdle (NB based)
Bayesian ZIP and Poisson via JAGS

We apply this to:

1. simulated cargo like data
1. the bioChemists dataset loaded from a CSV you provide

Understanding AIC and Model Choice

The Akaike Information Criterion (AIC) is a tool for comparing models fitted to the same dataset. It balances goodness of fit (via the likelihood) with model complexity (via a penalty for extra parameters).

Lower AIC means the model achieves a better trade-off between fit and simplicity.

A difference of < 2 suggests models are essentially equivalent.

Differences of 4–7 provide moderate evidence that the lower AIC model is better.

Differences of > 10 provide strong evidence in favor of the lower AIC model.

What does it mean if Poisson has a lower AIC than ZIP?

This implies that the additional parameters in the Zero-Inflated Poisson (ZIP) model do not improve the fit enough to justify their complexity. In other words, the data may not have a “true” structural zero process, and the simpler Poisson describes the counts sufficiently well.

When to adopt specialized zero-inflated or hurdle models

Adopt Poisson when counts are not overdispersed (variance ≈ mean) and the number of zeros is consistent with the Poisson expectation.

Adopt Negative Binomial (NB) when counts are overdispersed (variance > mean) but zeros are not excessive.

Adopt Zero-Inflated Poisson/Negative Binomial (ZIP/ZINB) when there are more zeros than expected, and you believe some of those zeros are structural (i.e., generated by a different process, such as non-participation).

Adopt Hurdle models when the decision to be zero vs. nonzero is qualitatively different from the decision of how many counts (e.g., whether a person publishes at all vs. how many papers they publish once active).

In short, we should not automatically prefer the “fancier” model. AIC (and in Bayesian settings, DIC, WAIC, or LOO) helps decide whether the complexity of a zero-inflated or hurdle model is justified, or if a simpler Poisson or NB already captures the data-generating process.

Part A — Simulated Cargo Like Data

We create a small toy example that behaves like daily shipments. Many days have no shipments at all. We add a simple traffic variable so models have a predictor. This controlled example shows the mechanics before we use real data.

n <- 365
traffic <- rnorm(n, mean = 10, sd = 5)
zprob <- plogis(-1 + 0.05 * traffic)   # structural zero probability
is_zero <- rbinom(n, 1, zprob)
lambda <- exp(0.3 + 0.1 * traffic)     # Poisson mean when active
shipments <- ifelse(is_zero == 1, 0, rpois(n, lambda))
cargo <- data.frame(day = 1:n, traffic, shipments)

# quick checks
summary(cargo$shipments)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   2.000   2.378   4.000  15.000

mean_zero <- mean(cargo$shipments == 0)
mean_zero

## [1] 0.4027397

Visual Checks

We first look at the distribution and then at shipments versus traffic. The histogram should show a spike at zero.

ggplot(cargo, aes(shipments)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "white") +
  labs(title = "Simulated Shipments Distribution", x = "Shipments", y = "Count")

ggplot(cargo, aes(traffic, shipments)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Shipments vs Traffic (Simulated)")

Fit Models with glmmTMB

Now we fit a ladder of models. The baseline Poisson is first. Negative binomial lets variance exceed the mean. Zero inflated and hurdle models let some zeros be structural or model zero vs positive as a separate stage. We compare AIC to pick a frequentist winner.

fit_pois  <- glm(shipments ~ traffic, family = poisson, data = cargo)
fit_nb    <- MASS::glm.nb(shipments ~ traffic, data = cargo)

fit_zip   <- glmmTMB(shipments ~ traffic, ziformula = ~ traffic, family = poisson, data = cargo)
fit_zinb  <- glmmTMB(shipments ~ traffic, ziformula = ~ traffic, family = nbinom2, data = cargo)

# Hurdle style: truncated negative binomial family via glmmTMB
fit_hurd  <- glmmTMB(shipments ~ traffic, ziformula = ~ traffic, family = truncated_nbinom2, data = cargo)

AIC_tab <- AIC(fit_pois, fit_nb, fit_zip, fit_zinb, fit_hurd)
knitr::kable(as.data.frame(AIC_tab), caption = "AIC Comparison: Simulated Data")

AIC Comparison: Simulated Data
	df	AIC
fit_pois	2	1825.549
fit_nb	3	1479.301
fit_zip	4	1331.888
fit_zinb	5	1333.888
fit_hurd	5	1336.855

Brief note: lower AIC is better. Small differences (<2) are weak. Differences of 4 or more are meaningful.

Part B — Real Data: `bioChemists`

We now move to a small real dataset of publications by young biochemists. The response art counts articles and often contains many zeros. Load the CSV you keep locally and inspect it.

# Replace the path below with the path to your local CSV if needed
bioChem <- read.csv("~/CODE/ANALYSIS/zero inflated models/bioChemists.csv",
                    stringsAsFactors = TRUE)

# keep the columns we need
d <- bioChem %>% select(art, fem, mar, kid5, phd, ment)

# quick checks
str(d)

## 'data.frame':    915 obs. of  6 variables:
##  $ art : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ fem : Factor w/ 2 levels "Men","Women": 1 2 2 1 2 2 2 1 1 2 ...
##  $ mar : Factor w/ 2 levels "Married","Single": 1 2 2 1 2 1 2 1 2 1 ...
##  $ kid5: int  0 0 0 1 0 2 0 2 0 0 ...
##  $ phd : num  2.52 2.05 3.75 1.18 3.75 ...
##  $ ment: int  7 6 6 3 26 2 3 4 6 0 ...

summary(d$art)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.693   2.000  19.000

mean(d$art == 0)

## [1] 0.3005464

Visual Checks

A histogram confirms the heavy mass at zero. The scatter versus mentor publications shows the relation for positive counts.

ggplot(d, aes(art)) +
  geom_histogram(binwidth = 1, fill = "lightgreen", color = "white") +
  labs(title = "Articles per Biochemist", x = "Number of Articles", y = "Count")

ggplot(d, aes(ment, art)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Articles vs Mentor Publications")

Model Fitting: glmmTMB

We fit the same ladder here. We use glmmTMB for ZIP, ZINB and hurdle. For baseline Poisson and NB we use glm and glm.nb.

m_pois   <- glm(art ~ fem + mar + kid5 + phd + ment, family = poisson, data = d)
m_nb     <- MASS::glm.nb(art ~ fem + mar + kid5 + phd + ment, data = d)

m_zip    <- glmmTMB(art ~ fem + mar + kid5 + phd + ment,
                    ziformula = ~ fem + mar + kid5,
                    family = poisson, data = d)

m_zinb   <- glmmTMB(art ~ fem + mar + kid5 + phd + ment,
                    ziformula = ~ fem + mar + kid5,
                    family = nbinom2, data = d)

m_hurd   <- glmmTMB(art ~ fem + mar + kid5 + phd + ment,
                    ziformula = ~ fem + mar + kid5,
                    family = truncated_nbinom2, data = d)

AIC_bio <- AIC(m_pois, m_nb, m_zip, m_zinb, m_hurd)
knitr::kable(as.data.frame(AIC_bio), caption = "AIC Comparison: bioChemists Data")

AIC Comparison: bioChemists Data
	df	AIC
m_pois	6	3314.113
m_nb	7	3135.917
m_zip	10	3260.631
m_zinb	11	NA
m_hurd	11	3185.727

Brief note: if NB beats Poisson then overdispersion is present. If ZIP or Hurdle has much lower AIC then a separate zero process matters.

Part C — Bayesian models using JAGS

We now retell the same story from a Bayesian view. We fit two compact Bayesian models with rjags:

Bayesian Poisson with predictor ment.
Bayesian Zero Inflated Poisson with ment in both log mean and logit zero probability.

We keep models small so sampling runs quickly. We compute DIC for each model and compare. In DIC lower is better. A DIC difference near 2 is small. A difference of 4 to 7 is meaningful. Ten or more is strong evidence.

# Prepare data for JAGS using only art and ment
d_bayes <- d %>%
  select(art, ment) %>%
  filter(!is.na(art), !is.na(ment))

y <- as.numeric(d_bayes$art)
x <- as.numeric(scale(d_bayes$ment))  # center and scale mentor
N <- length(y)

jdat <- list(y = y, x = x, N = N)

Bayesian Poisson (JAGS)

poisson_model_string <- "
model {
  for (i in 1:N) {
    y[i] ~ dpois(mu[i])
    log(mu[i]) <- beta0 + beta1 * x[i]
  }

  beta0 ~ dnorm(0, 0.001)
  beta1 ~ dnorm(0, 0.001)
}
"
writeLines(poisson_model_string, con = "poisson_model.jags")

jm_pois <- jags.model(file = "poisson_model.jags", data = jdat, n.chains = 3, n.adapt = 1000)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 915
##    Unobserved stochastic nodes: 2
##    Total graph size: 1982
## 
## Initializing model

update(jm_pois, 2000)  # burn in
post_pois <- coda.samples(jm_pois, variable.names = c("beta0", "beta1"), n.iter = 4000)

# DIC for Poisson
dic_pois <- dic.samples(jm_pois, n.iter = 2000, type = "pD")
# compute DIC as mean deviance + mean penalty
dic_pois_val <- mean(dic_pois$deviance) + mean(dic_pois$penalty)

# Summaries
print(summary(post_pois))

## 
## Iterations = 3001:7000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 4000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##         Mean      SD  Naive SE Time-series SE
## beta0 0.4869 0.02632 0.0002402      0.0003165
## beta1 0.2465 0.01817 0.0001659      0.0002168
## 
## 2. Quantiles for each variable:
## 
##         2.5%    25%    50%    75%  97.5%
## beta0 0.4356 0.4693 0.4870 0.5043 0.5388
## beta1 0.2107 0.2344 0.2467 0.2587 0.2819

cat('Poisson DIC =', round(dic_pois_val, 2), '\n')

## Poisson DIC = 3.65

Bayesian Zero Inflated Poisson (JAGS)

zip_model_string <- "
model {
  for (i in 1:N) {
    # structural zero indicator
    z[i] ~ dbern(psi[i])
    logit(psi[i]) <- alpha0 + alpha1 * x[i]

    # count mean when not structural zero
    lambda[i] <- exp(beta0 + beta1 * x[i])

    # observed outcome: if z==1 then zero, else Poisson(lambda)
    y[i] ~ dpois(mu[i])
    mu[i] <- (1 - z[i]) * lambda[i]
  }

  # priors
  alpha0 ~ dnorm(0, 0.001)
  alpha1 ~ dnorm(0, 0.001)
  beta0  ~ dnorm(0, 0.001)
  beta1  ~ dnorm(0, 0.001)
}
"
writeLines(zip_model_string, con = "zip_model.jags")

jm_zip <- jags.model(file = "zip_model.jags", data = jdat, n.chains = 3, n.adapt = 1500)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 915
##    Unobserved stochastic nodes: 919
##    Total graph size: 4877
## 
## Initializing model

update(jm_zip, 2500)
post_zip <- coda.samples(jm_zip, variable.names = c("alpha0","alpha1","beta0","beta1"), n.iter = 5000)

# DIC for ZIP
dic_zip <- dic.samples(jm_zip, n.iter = 2000, type = "pD")
dic_zip_val <- mean(dic_zip$deviance) + mean(dic_zip$penalty)

# Summaries
print(summary(post_zip))

## 
## Iterations = 4001:9000
## Thinning interval = 1 
## Number of chains = 3 
## Sample size per chain = 5000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean      SD  Naive SE Time-series SE
## alpha0 -1.8400 0.25032 0.0020438      0.0107408
## alpha1 -1.3229 0.37806 0.0030868      0.0156043
## beta0   0.6958 0.03229 0.0002636      0.0006458
## beta1   0.1715 0.02070 0.0001690      0.0002787
## 
## 2. Quantiles for each variable:
## 
##           2.5%     25%     50%     75%   97.5%
## alpha0 -2.4241 -1.9832 -1.8050 -1.6637 -1.4451
## alpha1 -2.1755 -1.5465 -1.2834 -1.0526 -0.7019
## beta0   0.6315  0.6741  0.6961  0.7176  0.7588
## beta1   0.1309  0.1577  0.1715  0.1855  0.2110

cat('ZIP DIC =', round(dic_zip_val, 2), '\n')

## ZIP DIC = NaN

Compare Bayesian models by DIC

Below we show a small table that ranks the two Bayesian models by DIC.

library(tibble)
bayes_dic <- tibble(
  Model = c("Bayesian Poisson", "Bayesian ZIP"),
  DIC   = c(round(dic_pois_val,2), round(dic_zip_val,2))
) %>% arrange(DIC)

knitr::kable(bayes_dic, caption = "Bayesian model comparison via DIC (lower is better)")

Bayesian model comparison via DIC (lower is better)
Model	DIC
Bayesian Poisson	3.65
Bayesian ZIP	NaN

A note on interpretation. Use DIC differences the same way you use AIC differences. Small differences under 2 suggest models are similar. Large differences favor the lower DIC model.

Takeaways

We followed a simple narrative. We used a simulated example so the mechanics are clear. We used the real biochemists data to see how the methods work in practice. Frequentist AIC and Bayesian DIC both guide model choice. If zero inflated or hurdle models show much better criteria then a separate zero process is supported. The Bayesian fits add posterior uncertainty that tells us how sure we are about the parameters that control zeros and counts.

##Addendum: On NaN AIC/DIC Results

In some of the model outputs, the AIC (frequentist) and DIC (Bayesian) values returned as NaN (Not a Number). This can occur for several reasons:

Numerical instability: Zero-inflated and hurdle models can suffer from likelihood evaluation problems when parameter estimates are extreme or when chains do not mix well in MCMC.

Overparameterization: Adding too many predictors to both the count and zero components may result in weakly identified parameters, making the information criteria undefined.

Convergence issues: For Bayesian models, poor convergence or highly correlated parameters can cause the deviance or penalty term to fail, leading to undefined DIC.

Data structure limits: In small or sparse datasets, the model may not have enough information to estimate both processes (zero vs. count) reliably, producing unstable IC values.

Recommended fixes:

Simplify the zero-inflated part (e.g., fewer predictors).

Use stronger or more informative priors in Bayesian models to stabilize estimation.

Increase iterations or adjust adaptation/burn-in to ensure convergence.

Try alternative criteria such as WAIC or LOO (for Bayesian models) that are often more stable than DIC.

Important note: These NaN results do not invalidate the main purpose of this exercise. The project aimed to demonstrate how to handle inflated zero data through a hierarchy of models (Poisson → NB → ZIP/ZINB → Hurdle → Bayesian). The NaN values highlight practical challenges in fitting more complex models rather than undermining the overall modeling approach.

Zero-Inflated Modeling Made Workable

Etini Akpayang

Objective

Understanding AIC and Model Choice

What does it mean if Poisson has a lower AIC than ZIP?

When to adopt specialized zero-inflated or hurdle models

Part A — Simulated Cargo Like Data

Visual Checks

Fit Models with glmmTMB

Part B — Real Data: bioChemists

Visual Checks

Model Fitting: glmmTMB

Part C — Bayesian models using JAGS

Bayesian Poisson (JAGS)

Bayesian Zero Inflated Poisson (JAGS)

Compare Bayesian models by DIC

Takeaways

Recommended fixes:

Part B — Real Data: `bioChemists`