- Introduction to zero-inflated count models
- Identifying zero inflation
- Implementation of zero inflated count model in \(R\)
- Interpreting model coefficients
- Generating predictions from a zero-inflated count model
# Load affair data affairs_df <- read_csv( "https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/affairs.csv") # Show data preview glimpse(affairs_df)
Rows: 601 Columns: 9 $ affairs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… $ gender <chr> "male", "female", "female", "male", "male", "female", "f… $ age <dbl> 37, 27, 32, 57, 22, 32, 22, 57, 32, 22, 37, 27, 47, 22, … $ yearsmarried <dbl> 10.00, 4.00, 15.00, 15.00, 0.75, 1.50, 0.75, 15.00, 15.0… $ children <chr> "no", "no", "yes", "yes", "no", "no", "no", "yes", "yes"… $ religiousness <dbl> 3, 4, 1, 5, 2, 2, 2, 2, 4, 4, 2, 4, 5, 2, 4, 1, 2, 3, 2,… $ education <dbl> 18, 14, 12, 18, 17, 17, 12, 14, 16, 14, 20, 18, 17, 17, … $ occupation <dbl> 7, 6, 1, 6, 6, 5, 1, 4, 1, 4, 7, 6, 6, 5, 5, 5, 4, 5, 5,… $ rating <dbl> 4, 4, 4, 5, 3, 5, 3, 4, 2, 5, 2, 4, 4, 4, 4, 5, 3, 4, 5,…
Most people don’t have affairs (at least most of the time) – we want to know more about those who do have affairs under which conditions people have affairs.
Number of affairs as coming from a Poisson or a negative binomial distribution (for overdispersion):
\[ \begin{align} \text{affairs}_i & \sim \text{Poisson}(\lambda)\\ \log(\lambda) & = \beta_0\\ \end{align} \]
m_poisson <- glm(affairs ~ 1, data = affairs_df, family = poisson(link = "log")) exp(coef(m_poisson))
(Intercept) 1.455907
What’s the average number of affairs according to the following negative binomial model?
\[ \begin{align} \text{affairs}_i &\sim \text{NegBinomial}(\mu, r)\\ \log(\mu) &= \beta_0 \end{align} \]
We can test how many many zeros we would expect for a Poisson model with an average \(\lambda\) of
exp(coef(m_poisson))
(Intercept) 1.455907
# Proportion of zeros in a Poisson dist with lambda = 1.4559 dpois(0, lambda = 1.4559)
[1] 0.2331904
So a Poisson distribution with a \(\lambda\) of 1.4559 would expect that 23% observations are 0.
What’s the probability to observe 0 in a negative binomial distribution with an average \(\mu\)
exp(coef(m_nb))
(Intercept) 1.455907
and a dispersion (size) \(r\) of
m_nb$theta
[1] 0.11196
dnbinom(0, mu = ..., size = ...)
How many zeros did we actually observe?
Average number of affairs:
affairs_df <- mutate(affairs_df, has_no_affairs = affairs == 0)
Number of people with and without affairs:
count(affairs_df, has_no_affairs)
# A tibble: 2 × 2 has_no_affairs n <lgl> <int> 1 FALSE 150 2 TRUE 451
We (you) can estimate the proportion of zero values (no affairs) by implementing a binary logistic model of the form
\[ \begin{align} \text{has_no_affairs}_i \sim \text{Bernoulli}(\theta)\\ \log\left(\frac{\theta}{1-\theta}\right) = \beta_0 \end{align} \] What’s the probability \(\theta\) of not having an affair?
Relative number of people without affairs is \(\hat\theta\) = 0.7504 and with affairs is \(1 - \hat\theta\) = 0.2496.
The Poisson model underestimated the number of zero observations; the negative binomial model was close.
Both are mixing up two different process that generate the data!
Two processes rather than one!
Not only the average of the latter can change in response to a set of characteristics / predictors (age, education) but also the relative size of the zero and non-zero distribution.
Averaging across people with and without affairs; most data come from people without affairs:
library(psyntur)
describe(affairs_df,
avg_n_affairs = mean(affairs),
prop_of_no_affairs = mean(has_no_affairs))
# A tibble: 1 × 2
avg_n_affairs prop_of_no_affairs
<dbl> <dbl>
1 1.46 0.750
Separating the data into people with and without affairs:
describe(affairs_df, avg_n_affairs = mean(affairs), by = has_no_affairs)
# A tibble: 2 × 2 has_no_affairs avg_n_affairs <lgl> <dbl> 1 FALSE 5.83 2 TRUE 0
Zero inflation: superimposing distribution of zeros over non-zeros
There is a certain probability that people are in one of those two groups:
Characteristics of the individuals (predictors) can change
\[ \begin{align} \text{affairs} &\sim \begin{cases} \text{Poisson}(\lambda) & \text{if } z = 0 \\ 0 & \text{if } z = 1\\ \end{cases}\\ z &\sim \text{Bernoulli}(\theta) \end{align} \]
where \(\lambda\) is modelled as a Poisson and \(\theta\) is modelled as in binary logistic regression, using the usual log and logit link functions, respectively, that allow us to add predictors (later)
\[ \log(\lambda) = \beta_0 \]
and
\[ \log\left(\frac{\theta}{1 - \theta}\right) = \gamma_0 \]
We use \(\gamma\) (gamma) to indicate that regression coefficients in the Poisson and binary logistic regression are different from each other.
library(pscl) m_zip <- zeroinfl(affairs ~ 1, dist = 'poisson', data = affairs_df)
You can run zero-inflated negative binomial by changing the value of dist to 'negbin'.
(coefs <- coef(m_zip))
count_(Intercept) zero_(Intercept)
1.760605 1.096853
Given these results, whats the average number \(\lambda\) of affairs (for people who have affairs) and what’s the probability \(\theta\) of zero values (non affairs).
The average of the Poisson distribution \(\lambda\), the number of affairs is 5.816. This value is very different from our earlier Poisson and negative binomial models.
The average probability of non-affairs (zeros) \(\theta\) is 0.7497. This value is similar to our binary logistic regression from earlier.
Popular confusion: \(\theta\) does NOT represents the proportion of people with affairs but the proportion of zeros, so people without affairs. The proportion of people with affairs is \(1-\) 0.7497, so 0.2503!
Is it worth assuming a zero inflation? Is the zero inflated Poisson model better than the ordinary Poisson model?
The Vuong test is designed for non-nested models like Poisson vs zero-inflated Poisson, where we can’t use a likelihood‑ratio test.
It evaluates whether one model gives systematically higher predicted probability for each observation.
vuong(m_poisson, m_zip)
Vuong Non-Nested Hypothesis Test-Statistic:
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
-------------------------------------------------------------
Vuong z-statistic H_A p-value
Raw -12.00601 model2 > model1 < 2.22e-16
AIC-corrected -11.99233 model2 > model1 < 2.22e-16
BIC-corrected -11.96223 model2 > model1 < 2.22e-16
Null hypothesis: “the models are indistinguishable”
The raw Vuong z-statistic is associated with a p-value below 0.05, so we reject the null hypothesis that the models are indistinguishable; the zero-inflated Poisson model is statistically supported.
Use the raw Vuong z-statistic when comparing Poisson vs zero-inflated Poisson: models differ in structure, not just parameter count. AIC / BIC corrections can over-penalize the zero-inflation component.
Adding yearsmarried as a predictor to regressions of both the count distribution and the probability of non-affairs.
\[ \log(\lambda) = \beta_0 + \beta_1 \times \text{yearsmarried}_i \]
and
\[ \log\left(\frac{\theta}{1 - \theta}\right) = \gamma_0 + \gamma_1 \times \text{yearsmarried}_i \]
m_zip_1 <- zeroinfl(affairs ~ yearsmarried, dist = 'poisson', data = affairs_df)
Model coefficients are identical to what you know from Poisson regression and binary logistic regression.
m_zip_1
Call:
zeroinfl(formula = affairs ~ yearsmarried, data = affairs_df, dist = "poisson")
Count model coefficients (poisson with log link):
(Intercept) yearsmarried
1.35927 0.03976
Zero-inflation model coefficients (binomial with logit link):
(Intercept) yearsmarried
1.58854 -0.05749
\[ \log(\lambda) = 1.35927 + 0.03976 \times \text{yearsmarried}_i \]
and
\[ \log\left(\frac{\theta}{1 - \theta}\right) = 1.58854 + -0.05749 \times \text{yearsmarried}_i \]
What do the slope coefficients tell us about the effect that years in a marriage has on the average of the distribution of number of affairs and the probability of not having an affair?
Inferential results:
summary(m_zip_1)$coef
$count
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.35927459 0.081371510 16.704552 1.214372e-62
yearsmarried 0.03975665 0.006934872 5.732861 9.875084e-09
$zero
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.58853581 0.18401224 8.632773 5.988234e-18
yearsmarried -0.05748937 0.01729087 -3.324840 8.846937e-04
The number of affairs is distributed according to a zero-inflated Poisson distribution
\[
\begin{align}
\text{affairs}_i &\sim
\begin{cases}
\text{Poisson}(\lambda_i) & \text{if } z_i = 0 \\
0 & \text{if } z_i = 1\\
\end{cases}\\
z_i &\sim \text{Bernoulli}(\theta_i)
\end{align}
\] where the average number of affairs varies as a function of yearsmarried and children
\[ \log(\lambda_i) = \beta_0 + \beta_1 \times \text{yearsmarried}_i + \beta_2 \times \text{children}_i \]
and the probability of not having affairs varies as a function of yearsmarried and children
\[ \log\left(\frac{\theta_i}{1 - \theta_i}\right) = \gamma_0 + \gamma_1 \times \text{yearsmarried}_i + \gamma_2 \times \text{children}_i \]
(On paper if you prefer) obtain the model coefficients and complete the regression equation that describes how the average number of affairs
\[ \log(\lambda_i) = \ldots + \ldots \times \text{yearsmarried}_i + \ldots \times \text{children}_i \]
and the probability of not having affairs
\[ \log\left(\frac{\theta_i}{1 - \theta_i}\right) = \ldots + \ldots \times \text{yearsmarried}_i + \ldots \times \text{children}_i \]
varies as a function of yearsmarried and children.
There are three possible types of predictions for zero-inflated models:
(There are also the log count / log-odds transformations)
Access the coefficients of the binary logistic regression using 'zero':
coef(m_zip_2, 'zero')
(Intercept) yearsmarried childrenyes 1.78686783 -0.03705913 -0.49526965
Calculate the predicted probability of not having an affair.
plogis(1.78686783 + -0.49526965 * 1) # with children
[1] 0.7844176
plogis(1.78686783 + -0.49526965 * 0) # without children
[1] 0.8565428
Probability of not having an affair is higher for individuals without children.
Translated to English: People with children are more likely to have affairs.
1 - plogis(1.78686783 + -0.49526965 * 1) # with children
[1] 0.2155824
1 - plogis(1.78686783 + -0.49526965 * 0) # without children
[1] 0.1434572
Access the coefficients of, here, the Poisson regression:
coef(m_zip_2, 'count')
(Intercept) yearsmarried childrenyes 1.4845430 0.0467285 -0.2379274
Calculate the predicted number of affairs (for people that do have affairs)
exp(1.4845430 + -0.2379274 * 1) # with children
[1] 3.47855
exp(1.4845430 + -0.2379274 * 0) # without children
[1] 4.412948
Taken together, people with kids are more likely to have affairs but are also having less affairs than people without kids. Are we surprised?
Repeat for yearsmarried.
(For individuals without children) what is the predicted probability \(\theta\) and the average count \(\lambda\) for individuals that are married 1 year vs 10 years? Also change the probability \(\theta\) to represent the probability to have an affair.
Probability of affairs \(1 - \hat\theta\) for 1 year married is 0.148; for 10 years married, it’s 0.195.
Average number of affairs \(\hat\lambda\) for 1 year married is 4.624; for 10 years married, it’s 7.042.
What do these results mean? Are the results for the probability and average number of affairs surprising us?
add_predictions# Data frame for hypothetical participants
affairs_df_new <- tibble(yearsmarried = c(0, 0, 1, 1, 10, 10),
children = c('yes', 'no', 'yes', 'no', 'yes', 'no'))
affairs_df_new
# A tibble: 6 × 2
yearsmarried children
<dbl> <chr>
1 0 yes
2 0 no
3 1 yes
4 1 no
5 10 yes
6 10 no
add_predictionsProbability that an individual has no affairs.
library(modelr) add_predictions(data = affairs_df_new, model = m_zip_2, type = 'zero')
# A tibble: 6 × 3
yearsmarried children pred
<dbl> <chr> <dbl>
1 0 yes 0.784
2 0 no 0.857
3 1 yes 0.778
4 1 no 0.852
5 10 yes 0.715
6 10 no 0.805
add_predictionsAverage number of affairs for individuals with affairs.
library(modelr) add_predictions(data = affairs_df_new, model = m_zip_2, type = 'count')
# A tibble: 6 × 3
yearsmarried children pred
<dbl> <chr> <dbl>
1 0 yes 3.48
2 0 no 4.41
3 1 yes 3.64
4 1 no 4.62
5 10 yes 5.55
6 10 no 7.04
add_predictionsAverage number of affairs for an individual who might have affairs or might not have affairs.
Expected value for count of affairs in a zero-inflated Poisson model calculated as
\[ \tilde{y} = \hat\lambda_i \times (1 - \hat\theta_i) \]
which is the weighted \((1 - \hat\theta_i)\) (probability of having affairs) average number of affairs \(\hat\lambda_i\) for a hypothetical participant with a set of characteristics.
add_predictions(data = affairs_df_new, model = m_zip_2, type = 'response')
# A tibble: 6 × 3
yearsmarried children pred
<dbl> <chr> <dbl>
1 0 yes 0.750
2 0 no 0.633
3 1 yes 0.809
4 1 no 0.685
5 10 yes 1.58
6 10 no 1.37
add_predictionsSo response in
# A tibble: 6 × 5
yearsmarried children lambda theta response
<dbl> <chr> <dbl> <dbl> <dbl>
1 0 yes 3.48 0.784 0.750
2 0 no 4.41 0.857 0.633
3 1 yes 3.64 0.778 0.809
4 1 no 4.62 0.852 0.685
5 10 yes 5.55 0.715 1.58
6 10 no 7.04 0.805 1.37
is calculate using
\[ \tilde{y} = \hat\lambda_i \times (1 - \hat\theta_i) \]
For example, 3.478 * (1 - 0.784) = 0.75 i.e. the average number of affairs for people that are married 0 years with children.
For such hypothetical people the probability of having affairs is 1 - 0.784, so 0.216 which weighs down the average count 3.479 to an average number of affairs of 0.749.
For the other group (\(\hat\theta\) = 0.784), the average number of affairs is (always) 0.
Without add_predictions or predict calculate the following predictions. You don’t need to run the model, just calculate the predictions using the regression formulae.
coef(m_zip_3, "zero")
(Intercept) gendermale 1.2136990 -0.2387054
coef(m_zip_3, "count")
(Intercept) gendermale 1.8238602 -0.1256744