In this project we tell a clear story. We start with simple models and check how they fail when zeros pile up. We then add models that split the zero process from the count process. Finally we view the same story in a Bayesian light. This helps us see how uncertainty and model choice change the conclusions.
We model count data with excess zeros. We explore:
We apply this to:
The Akaike Information Criterion (AIC) is a tool for comparing models fitted to the same dataset. It balances goodness of fit (via the likelihood) with model complexity (via a penalty for extra parameters).
Lower AIC means the model achieves a better trade-off between fit and simplicity.
A difference of < 2 suggests models are essentially equivalent.
Differences of 4–7 provide moderate evidence that the lower AIC model is better.
Differences of > 10 provide strong evidence in favor of the lower AIC model.
This implies that the additional parameters in the Zero-Inflated Poisson (ZIP) model do not improve the fit enough to justify their complexity. In other words, the data may not have a “true” structural zero process, and the simpler Poisson describes the counts sufficiently well.
Adopt Poisson when counts are not overdispersed (variance ≈ mean) and the number of zeros is consistent with the Poisson expectation.
Adopt Negative Binomial (NB) when counts are overdispersed (variance > mean) but zeros are not excessive.
Adopt Zero-Inflated Poisson/Negative Binomial (ZIP/ZINB) when there are more zeros than expected, and you believe some of those zeros are structural (i.e., generated by a different process, such as non-participation).
Adopt Hurdle models when the decision to be zero vs. nonzero is qualitatively different from the decision of how many counts (e.g., whether a person publishes at all vs. how many papers they publish once active).
In short, we should not automatically prefer the “fancier” model. AIC (and in Bayesian settings, DIC, WAIC, or LOO) helps decide whether the complexity of a zero-inflated or hurdle model is justified, or if a simpler Poisson or NB already captures the data-generating process.
We create a small toy example that behaves like daily shipments. Many days have no shipments at all. We add a simple traffic variable so models have a predictor. This controlled example shows the mechanics before we use real data.
n <- 365
traffic <- rnorm(n, mean = 10, sd = 5)
zprob <- plogis(-1 + 0.05 * traffic) # structural zero probability
is_zero <- rbinom(n, 1, zprob)
lambda <- exp(0.3 + 0.1 * traffic) # Poisson mean when active
shipments <- ifelse(is_zero == 1, 0, rpois(n, lambda))
cargo <- data.frame(day = 1:n, traffic, shipments)
# quick checks
summary(cargo$shipments)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 2.000 2.378 4.000 15.000
mean_zero <- mean(cargo$shipments == 0)
mean_zero
## [1] 0.4027397
We first look at the distribution and then at shipments versus traffic. The histogram should show a spike at zero.
ggplot(cargo, aes(shipments)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "white") +
labs(title = "Simulated Shipments Distribution", x = "Shipments", y = "Count")
ggplot(cargo, aes(traffic, shipments)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Shipments vs Traffic (Simulated)")
Now we fit a ladder of models. The baseline Poisson is first. Negative binomial lets variance exceed the mean. Zero inflated and hurdle models let some zeros be structural or model zero vs positive as a separate stage. We compare AIC to pick a frequentist winner.
fit_pois <- glm(shipments ~ traffic, family = poisson, data = cargo)
fit_nb <- MASS::glm.nb(shipments ~ traffic, data = cargo)
fit_zip <- glmmTMB(shipments ~ traffic, ziformula = ~ traffic, family = poisson, data = cargo)
fit_zinb <- glmmTMB(shipments ~ traffic, ziformula = ~ traffic, family = nbinom2, data = cargo)
# Hurdle style: truncated negative binomial family via glmmTMB
fit_hurd <- glmmTMB(shipments ~ traffic, ziformula = ~ traffic, family = truncated_nbinom2, data = cargo)
AIC_tab <- AIC(fit_pois, fit_nb, fit_zip, fit_zinb, fit_hurd)
knitr::kable(as.data.frame(AIC_tab), caption = "AIC Comparison: Simulated Data")
df | AIC | |
---|---|---|
fit_pois | 2 | 1825.549 |
fit_nb | 3 | 1479.301 |
fit_zip | 4 | 1331.888 |
fit_zinb | 5 | 1333.888 |
fit_hurd | 5 | 1336.855 |
Brief note: lower AIC is better. Small differences (<2) are weak. Differences of 4 or more are meaningful.
bioChemists
We now move to a small real dataset of publications by young
biochemists. The response art
counts articles and often
contains many zeros. Load the CSV you keep locally and inspect it.
# Replace the path below with the path to your local CSV if needed
bioChem <- read.csv("~/CODE/ANALYSIS/zero inflated models/bioChemists.csv",
stringsAsFactors = TRUE)
# keep the columns we need
d <- bioChem %>% select(art, fem, mar, kid5, phd, ment)
# quick checks
str(d)
## 'data.frame': 915 obs. of 6 variables:
## $ art : int 0 0 0 0 0 0 0 0 0 0 ...
## $ fem : Factor w/ 2 levels "Men","Women": 1 2 2 1 2 2 2 1 1 2 ...
## $ mar : Factor w/ 2 levels "Married","Single": 1 2 2 1 2 1 2 1 2 1 ...
## $ kid5: int 0 0 0 1 0 2 0 2 0 0 ...
## $ phd : num 2.52 2.05 3.75 1.18 3.75 ...
## $ ment: int 7 6 6 3 26 2 3 4 6 0 ...
summary(d$art)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.693 2.000 19.000
mean(d$art == 0)
## [1] 0.3005464
A histogram confirms the heavy mass at zero. The scatter versus mentor publications shows the relation for positive counts.
ggplot(d, aes(art)) +
geom_histogram(binwidth = 1, fill = "lightgreen", color = "white") +
labs(title = "Articles per Biochemist", x = "Number of Articles", y = "Count")
ggplot(d, aes(ment, art)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Articles vs Mentor Publications")
We fit the same ladder here. We use glmmTMB
for ZIP,
ZINB and hurdle. For baseline Poisson and NB we use glm
and
glm.nb
.
m_pois <- glm(art ~ fem + mar + kid5 + phd + ment, family = poisson, data = d)
m_nb <- MASS::glm.nb(art ~ fem + mar + kid5 + phd + ment, data = d)
m_zip <- glmmTMB(art ~ fem + mar + kid5 + phd + ment,
ziformula = ~ fem + mar + kid5,
family = poisson, data = d)
m_zinb <- glmmTMB(art ~ fem + mar + kid5 + phd + ment,
ziformula = ~ fem + mar + kid5,
family = nbinom2, data = d)
m_hurd <- glmmTMB(art ~ fem + mar + kid5 + phd + ment,
ziformula = ~ fem + mar + kid5,
family = truncated_nbinom2, data = d)
AIC_bio <- AIC(m_pois, m_nb, m_zip, m_zinb, m_hurd)
knitr::kable(as.data.frame(AIC_bio), caption = "AIC Comparison: bioChemists Data")
df | AIC | |
---|---|---|
m_pois | 6 | 3314.113 |
m_nb | 7 | 3135.917 |
m_zip | 10 | 3260.631 |
m_zinb | 11 | NA |
m_hurd | 11 | 3185.727 |
Brief note: if NB beats Poisson then overdispersion is present. If ZIP or Hurdle has much lower AIC then a separate zero process matters.
We now retell the same story from a Bayesian view. We fit two compact
Bayesian models with rjags
:
ment
.ment
in both log
mean and logit zero probability.We keep models small so sampling runs quickly. We compute DIC for each model and compare. In DIC lower is better. A DIC difference near 2 is small. A difference of 4 to 7 is meaningful. Ten or more is strong evidence.
# Prepare data for JAGS using only art and ment
d_bayes <- d %>%
select(art, ment) %>%
filter(!is.na(art), !is.na(ment))
y <- as.numeric(d_bayes$art)
x <- as.numeric(scale(d_bayes$ment)) # center and scale mentor
N <- length(y)
jdat <- list(y = y, x = x, N = N)
poisson_model_string <- "
model {
for (i in 1:N) {
y[i] ~ dpois(mu[i])
log(mu[i]) <- beta0 + beta1 * x[i]
}
beta0 ~ dnorm(0, 0.001)
beta1 ~ dnorm(0, 0.001)
}
"
writeLines(poisson_model_string, con = "poisson_model.jags")
jm_pois <- jags.model(file = "poisson_model.jags", data = jdat, n.chains = 3, n.adapt = 1000)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 915
## Unobserved stochastic nodes: 2
## Total graph size: 1982
##
## Initializing model
update(jm_pois, 2000) # burn in
post_pois <- coda.samples(jm_pois, variable.names = c("beta0", "beta1"), n.iter = 4000)
# DIC for Poisson
dic_pois <- dic.samples(jm_pois, n.iter = 2000, type = "pD")
# compute DIC as mean deviance + mean penalty
dic_pois_val <- mean(dic_pois$deviance) + mean(dic_pois$penalty)
# Summaries
print(summary(post_pois))
##
## Iterations = 3001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 4000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## beta0 0.4869 0.02632 0.0002402 0.0003165
## beta1 0.2465 0.01817 0.0001659 0.0002168
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## beta0 0.4356 0.4693 0.4870 0.5043 0.5388
## beta1 0.2107 0.2344 0.2467 0.2587 0.2819
cat('Poisson DIC =', round(dic_pois_val, 2), '\n')
## Poisson DIC = 3.65
zip_model_string <- "
model {
for (i in 1:N) {
# structural zero indicator
z[i] ~ dbern(psi[i])
logit(psi[i]) <- alpha0 + alpha1 * x[i]
# count mean when not structural zero
lambda[i] <- exp(beta0 + beta1 * x[i])
# observed outcome: if z==1 then zero, else Poisson(lambda)
y[i] ~ dpois(mu[i])
mu[i] <- (1 - z[i]) * lambda[i]
}
# priors
alpha0 ~ dnorm(0, 0.001)
alpha1 ~ dnorm(0, 0.001)
beta0 ~ dnorm(0, 0.001)
beta1 ~ dnorm(0, 0.001)
}
"
writeLines(zip_model_string, con = "zip_model.jags")
jm_zip <- jags.model(file = "zip_model.jags", data = jdat, n.chains = 3, n.adapt = 1500)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 915
## Unobserved stochastic nodes: 919
## Total graph size: 4877
##
## Initializing model
update(jm_zip, 2500)
post_zip <- coda.samples(jm_zip, variable.names = c("alpha0","alpha1","beta0","beta1"), n.iter = 5000)
# DIC for ZIP
dic_zip <- dic.samples(jm_zip, n.iter = 2000, type = "pD")
dic_zip_val <- mean(dic_zip$deviance) + mean(dic_zip$penalty)
# Summaries
print(summary(post_zip))
##
## Iterations = 4001:9000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## alpha0 -1.8400 0.25032 0.0020438 0.0107408
## alpha1 -1.3229 0.37806 0.0030868 0.0156043
## beta0 0.6958 0.03229 0.0002636 0.0006458
## beta1 0.1715 0.02070 0.0001690 0.0002787
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## alpha0 -2.4241 -1.9832 -1.8050 -1.6637 -1.4451
## alpha1 -2.1755 -1.5465 -1.2834 -1.0526 -0.7019
## beta0 0.6315 0.6741 0.6961 0.7176 0.7588
## beta1 0.1309 0.1577 0.1715 0.1855 0.2110
cat('ZIP DIC =', round(dic_zip_val, 2), '\n')
## ZIP DIC = NaN
Below we show a small table that ranks the two Bayesian models by DIC.
library(tibble)
bayes_dic <- tibble(
Model = c("Bayesian Poisson", "Bayesian ZIP"),
DIC = c(round(dic_pois_val,2), round(dic_zip_val,2))
) %>% arrange(DIC)
knitr::kable(bayes_dic, caption = "Bayesian model comparison via DIC (lower is better)")
Model | DIC |
---|---|
Bayesian Poisson | 3.65 |
Bayesian ZIP | NaN |
A note on interpretation. Use DIC differences the same way you use AIC differences. Small differences under 2 suggest models are similar. Large differences favor the lower DIC model.
We followed a simple narrative. We used a simulated example so the mechanics are clear. We used the real biochemists data to see how the methods work in practice. Frequentist AIC and Bayesian DIC both guide model choice. If zero inflated or hurdle models show much better criteria then a separate zero process is supported. The Bayesian fits add posterior uncertainty that tells us how sure we are about the parameters that control zeros and counts.
##Addendum: On NaN AIC/DIC Results
In some of the model outputs, the AIC (frequentist) and DIC (Bayesian) values returned as NaN (Not a Number). This can occur for several reasons:
Numerical instability: Zero-inflated and hurdle models can suffer from likelihood evaluation problems when parameter estimates are extreme or when chains do not mix well in MCMC.
Overparameterization: Adding too many predictors to both the count and zero components may result in weakly identified parameters, making the information criteria undefined.
Convergence issues: For Bayesian models, poor convergence or highly correlated parameters can cause the deviance or penalty term to fail, leading to undefined DIC.
Data structure limits: In small or sparse datasets, the model may not have enough information to estimate both processes (zero vs. count) reliably, producing unstable IC values.
Simplify the zero-inflated part (e.g., fewer predictors).
Use stronger or more informative priors in Bayesian models to stabilize estimation.
Increase iterations or adjust adaptation/burn-in to ensure convergence.
Try alternative criteria such as WAIC or LOO (for Bayesian models) that are often more stable than DIC.
Important note: These NaN results do not invalidate the main purpose of this exercise. The project aimed to demonstrate how to handle inflated zero data through a hierarchy of models (Poisson → NB → ZIP/ZINB → Hurdle → Bayesian). The NaN values highlight practical challenges in fitting more complex models rather than undermining the overall modeling approach.