Session objectives

  • Introduction to zero-inflated count models
  • Identifying zero inflation
  • Implementation of zero inflated count model in \(R\)
  • Interpreting model coefficients
  • Generating predictions from a zero-inflated count model

The problem

# Load affair data
affairs_df <- read_csv(
  "https://raw.githubusercontent.com/mark-andrews/ntupsychology-data/main/data-sets/affairs.csv")

# Show data preview
glimpse(affairs_df)
Rows: 601
Columns: 9
$ affairs       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ gender        <chr> "male", "female", "female", "male", "male", "female", "f…
$ age           <dbl> 37, 27, 32, 57, 22, 32, 22, 57, 32, 22, 37, 27, 47, 22, …
$ yearsmarried  <dbl> 10.00, 4.00, 15.00, 15.00, 0.75, 1.50, 0.75, 15.00, 15.0…
$ children      <chr> "no", "no", "yes", "yes", "no", "no", "no", "yes", "yes"…
$ religiousness <dbl> 3, 4, 1, 5, 2, 2, 2, 2, 4, 4, 2, 4, 5, 2, 4, 1, 2, 3, 2,…
$ education     <dbl> 18, 14, 12, 18, 17, 17, 12, 14, 16, 14, 20, 18, 17, 17, …
$ occupation    <dbl> 7, 6, 1, 6, 6, 5, 1, 4, 1, 4, 7, 6, 6, 5, 5, 5, 4, 5, 5,…
$ rating        <dbl> 4, 4, 4, 5, 3, 5, 3, 4, 2, 5, 2, 4, 4, 4, 4, 5, 3, 4, 5,…

The problem

Most people don’t have affairs (at least most of the time) – we want to know more about those who do have affairs under which conditions people have affairs.

Previous models

Number of affairs as coming from a Poisson or a negative binomial distribution (for overdispersion):

\[ \begin{align} \text{affairs}_i & \sim \text{Poisson}(\lambda)\\ \log(\lambda) & = \beta_0\\ \end{align} \]

m_poisson <- glm(affairs ~ 1, data = affairs_df, family = poisson(link = "log"))
exp(coef(m_poisson))
(Intercept) 
   1.455907 

What’s the average number of affairs according to the following negative binomial model?

\[ \begin{align} \text{affairs}_i &\sim \text{NegBinomial}(\mu, r)\\ \log(\mu) &= \beta_0 \end{align} \]

Are zero counts inflated?

We can test how many many zeros we would expect for a Poisson model with an average \(\lambda\) of

exp(coef(m_poisson))
(Intercept) 
   1.455907 
# Proportion of zeros in a Poisson dist with lambda = 1.4559
dpois(0,  lambda = 1.4559)
[1] 0.2331904

So a Poisson distribution with a \(\lambda\) of 1.4559 would expect that 23% observations are 0.

Are zero counts inflated?

What’s the probability to observe 0 in a negative binomial distribution with an average \(\mu\)

exp(coef(m_nb))
(Intercept) 
   1.455907 

and a dispersion (size) \(r\) of

m_nb$theta
[1] 0.11196
dnbinom(0, mu = ..., size = ...)

Are zero counts inflated?

How many zeros did we actually observe?

Average number of affairs:

affairs_df <- mutate(affairs_df, has_no_affairs = affairs == 0)

Number of people with and without affairs:

count(affairs_df, has_no_affairs)
# A tibble: 2 × 2
  has_no_affairs     n
  <lgl>          <int>
1 FALSE            150
2 TRUE             451

Are zero counts inflated?

We (you) can estimate the proportion of zero values (no affairs) by implementing a binary logistic model of the form

\[ \begin{align} \text{has_no_affairs}_i \sim \text{Bernoulli}(\theta)\\ \log\left(\frac{\theta}{1-\theta}\right) = \beta_0 \end{align} \] What’s the probability \(\theta\) of not having an affair?

Are zero counts inflated?

Relative number of people without affairs is \(\hat\theta\) = 0.7504 and with affairs is \(1 - \hat\theta\) = 0.2496.

The Poisson model underestimated the number of zero observations; the negative binomial model was close.

Both are mixing up two different process that generate the data!

Zero inflated models

  • Some data require a special count distribution because of their low possible number or frequency of zero counts:
    • number of times you check Instagram in one hour
    • number of affairs people have in a year
    • number of cigarettes smoked in a day
  • Excessive number of zeros violates assumptions of both Poisson and negative binomial.
  • Poisson and negative binomial models are relatively similar.
  • Zero inflated models are conceptually a little bit trickier.
  • They are a mixture of two generalised linear models.

Mixture of two processes

Two processes rather than one!

  1. No affairs: Excessive numbers of zeros represent absence of the phenomenon of interest. The average number of affairs of this distribution is always zero.
  2. Affairs: Non-zero observations represent the presence of the phenomenon. The average of the other distribution (having affairs) varies relative to other variables.

Not only the average of the latter can change in response to a set of characteristics / predictors (age, education) but also the relative size of the zero and non-zero distribution.

Mixture of two processes

Averaging across people with and without affairs; most data come from people without affairs:

library(psyntur)
describe(affairs_df, 
         avg_n_affairs = mean(affairs),
         prop_of_no_affairs = mean(has_no_affairs))
# A tibble: 1 × 2
  avg_n_affairs prop_of_no_affairs
          <dbl>              <dbl>
1          1.46              0.750

Separating the data into people with and without affairs:

describe(affairs_df, avg_n_affairs = mean(affairs), by = has_no_affairs)
# A tibble: 2 × 2
  has_no_affairs avg_n_affairs
  <lgl>                  <dbl>
1 FALSE                   5.83
2 TRUE                    0   

Mixture of two processes

Zero inflation: superimposing distribution of zeros over non-zeros

  • One distribution is a count model (Poisson, negative binomial)
  • The other is a distribution that always has a count of 0.

There is a certain probability that people are in one of those two groups:

  • \(\theta\) to be in the zero model group
  • \(1 - \theta\) to be in the non-zero group

Characteristics of the individuals (predictors) can change

  • the average number of affairs
  • the probability of non-affairs

Zero inflated Poisson

  • People can change between distributions (at times) and move into the distribution of affair-having people.
  • E.g. when you count affairs per year this could lead to a zero count for people that had affairs at other times.
  • We aren’t just cleaving the dataset in two (the zero values and non-zero values) but we are saying that there are two distributions behind the data.
  • There is a latent variable that controls in which distribution someone is (at a certain time).
  • Latent variable: a construct we assume to exist but we don’t know the value of it (not observed).

Zero inflated Poisson

\[ \begin{align} \text{affairs} &\sim \begin{cases} \text{Poisson}(\lambda) & \text{if } z = 0 \\ 0 & \text{if } z = 1\\ \end{cases}\\ z &\sim \text{Bernoulli}(\theta) \end{align} \]

where \(\lambda\) is modelled as a Poisson and \(\theta\) is modelled as in binary logistic regression, using the usual log and logit link functions, respectively, that allow us to add predictors (later)

\[ \log(\lambda) = \beta_0 \]

and

\[ \log\left(\frac{\theta}{1 - \theta}\right) = \gamma_0 \]

We use \(\gamma\) (gamma) to indicate that regression coefficients in the Poisson and binary logistic regression are different from each other.

Zero inflated Poisson

library(pscl)
m_zip <- zeroinfl(affairs ~ 1, dist = 'poisson', data = affairs_df)

You can run zero-inflated negative binomial by changing the value of dist to 'negbin'.

(coefs <- coef(m_zip))
count_(Intercept)  zero_(Intercept) 
         1.760605          1.096853 

Given these results, whats the average number \(\lambda\) of affairs (for people who have affairs) and what’s the probability \(\theta\) of zero values (non affairs).

Zero inflated Poisson

The average of the Poisson distribution \(\lambda\), the number of affairs is 5.816. This value is very different from our earlier Poisson and negative binomial models.

The average probability of non-affairs (zeros) \(\theta\) is 0.7497. This value is similar to our binary logistic regression from earlier.

Popular confusion: \(\theta\) does NOT represents the proportion of people with affairs but the proportion of zeros, so people without affairs. The proportion of people with affairs is \(1-\) 0.7497, so 0.2503!

The Vuong test

Is it worth assuming a zero inflation? Is the zero inflated Poisson model better than the ordinary Poisson model?

The Vuong test is designed for non-nested models like Poisson vs zero-inflated Poisson, where we can’t use a likelihood‑ratio test.

It evaluates whether one model gives systematically higher predicted probability for each observation.

The Vuong test

vuong(m_poisson, m_zip)
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A    p-value
Raw                   -12.00601 model2 > model1 < 2.22e-16
AIC-corrected         -11.99233 model2 > model1 < 2.22e-16
BIC-corrected         -11.96223 model2 > model1 < 2.22e-16

Null hypothesis: “the models are indistinguishable”

The raw Vuong z-statistic is associated with a p-value below 0.05, so we reject the null hypothesis that the models are indistinguishable; the zero-inflated Poisson model is statistically supported.

Use the raw Vuong z-statistic when comparing Poisson vs zero-inflated Poisson: models differ in structure, not just parameter count. AIC / BIC corrections can over-penalize the zero-inflation component.

Zero inflated Poisson

Adding yearsmarried as a predictor to regressions of both the count distribution and the probability of non-affairs.

\[ \log(\lambda) = \beta_0 + \beta_1 \times \text{yearsmarried}_i \]

and

\[ \log\left(\frac{\theta}{1 - \theta}\right) = \gamma_0 + \gamma_1 \times \text{yearsmarried}_i \]

m_zip_1 <- zeroinfl(affairs ~ yearsmarried, dist = 'poisson', data = affairs_df)

Zero inflated Poisson

Model coefficients are identical to what you know from Poisson regression and binary logistic regression.

m_zip_1
Call:
zeroinfl(formula = affairs ~ yearsmarried, data = affairs_df, dist = "poisson")

Count model coefficients (poisson with log link):
 (Intercept)  yearsmarried  
     1.35927       0.03976  

Zero-inflation model coefficients (binomial with logit link):
 (Intercept)  yearsmarried  
     1.58854      -0.05749  

Zero inflated Poisson

\[ \log(\lambda) = 1.35927 + 0.03976 \times \text{yearsmarried}_i \]

and

\[ \log\left(\frac{\theta}{1 - \theta}\right) = 1.58854 + -0.05749 \times \text{yearsmarried}_i \]

What do the slope coefficients tell us about the effect that years in a marriage has on the average of the distribution of number of affairs and the probability of not having an affair?

Zero inflated Poisson

Inferential results:

summary(m_zip_1)$coef
$count
               Estimate  Std. Error   z value     Pr(>|z|)
(Intercept)  1.35927459 0.081371510 16.704552 1.214372e-62
yearsmarried 0.03975665 0.006934872  5.732861 9.875084e-09

$zero
                Estimate Std. Error   z value     Pr(>|z|)
(Intercept)   1.58853581 0.18401224  8.632773 5.988234e-18
yearsmarried -0.05748937 0.01729087 -3.324840 8.846937e-04

Zero inflated Poisson

  • Predictors potentially work differently for binary logistic regression and the count model.
  • E.g. as age increases, the probability of being in the zero (non-affair) group could decrease or, in fact, increases.
  • The effect on the two components – count model and binary logistic regression – is not always the same in terms of what they’re saying about the relationship between the predictor variables and outcome.

Implement the following model

The number of affairs is distributed according to a zero-inflated Poisson distribution

\[ \begin{align} \text{affairs}_i &\sim \begin{cases} \text{Poisson}(\lambda_i) & \text{if } z_i = 0 \\ 0 & \text{if } z_i = 1\\ \end{cases}\\ z_i &\sim \text{Bernoulli}(\theta_i) \end{align} \] where the average number of affairs varies as a function of yearsmarried and children

\[ \log(\lambda_i) = \beta_0 + \beta_1 \times \text{yearsmarried}_i + \beta_2 \times \text{children}_i \]

and the probability of not having affairs varies as a function of yearsmarried and children

\[ \log\left(\frac{\theta_i}{1 - \theta_i}\right) = \gamma_0 + \gamma_1 \times \text{yearsmarried}_i + \gamma_2 \times \text{children}_i \]

Model coefficients

(On paper if you prefer) obtain the model coefficients and complete the regression equation that describes how the average number of affairs

\[ \log(\lambda_i) = \ldots + \ldots \times \text{yearsmarried}_i + \ldots \times \text{children}_i \]

and the probability of not having affairs

\[ \log\left(\frac{\theta_i}{1 - \theta_i}\right) = \ldots + \ldots \times \text{yearsmarried}_i + \ldots \times \text{children}_i \]

varies as a function of yearsmarried and children.

Predictions

There are three possible types of predictions for zero-inflated models:

  1. Probability \(\theta\) of being part of the zero-distribution (no affair).
  2. Average number of counts \(\lambda\) (e.g. affairs) for being part of the non-zero / count distribution.
  3. Average number of counts \(\lambda\) (e.g. affairs) generally.

(There are also the log count / log-odds transformations)

Predictions: probability

Access the coefficients of the binary logistic regression using 'zero':

coef(m_zip_2, 'zero')
 (Intercept) yearsmarried  childrenyes 
  1.78686783  -0.03705913  -0.49526965 

Calculate the predicted probability of not having an affair.

plogis(1.78686783 + -0.49526965 * 1) # with children
[1] 0.7844176
plogis(1.78686783 + -0.49526965 * 0) # without children
[1] 0.8565428

Probability of not having an affair is higher for individuals without children.

Predictions: probability

Translated to English: People with children are more likely to have affairs.

1 - plogis(1.78686783 + -0.49526965 * 1) # with children
[1] 0.2155824
1 - plogis(1.78686783 + -0.49526965 * 0) # without children
[1] 0.1434572

Predictions: average number of affairs

Access the coefficients of, here, the Poisson regression:

coef(m_zip_2, 'count')
 (Intercept) yearsmarried  childrenyes 
   1.4845430    0.0467285   -0.2379274 

Calculate the predicted number of affairs (for people that do have affairs)

exp(1.4845430 + -0.2379274 * 1) # with children
[1] 3.47855
exp(1.4845430 + -0.2379274 * 0) # without children
[1] 4.412948

Taken together, people with kids are more likely to have affairs but are also having less affairs than people without kids. Are we surprised?

Predictions: Exercise

Repeat for yearsmarried.

(For individuals without children) what is the predicted probability \(\theta\) and the average count \(\lambda\) for individuals that are married 1 year vs 10 years? Also change the probability \(\theta\) to represent the probability to have an affair.

Predictions: Exercise

Probability of affairs \(1 - \hat\theta\) for 1 year married is 0.148; for 10 years married, it’s 0.195.

Average number of affairs \(\hat\lambda\) for 1 year married is 4.624; for 10 years married, it’s 7.042.

What do these results mean? Are the results for the probability and average number of affairs surprising us?

Predictions: using add_predictions

# Data frame for hypothetical participants
affairs_df_new <- tibble(yearsmarried = c(0, 0, 1, 1, 10, 10), 
                         children = c('yes', 'no', 'yes', 'no', 'yes', 'no'))

affairs_df_new
# A tibble: 6 × 2
  yearsmarried children
         <dbl> <chr>   
1            0 yes     
2            0 no      
3            1 yes     
4            1 no      
5           10 yes     
6           10 no      

Predicting probability: using add_predictions

Probability that an individual has no affairs.

library(modelr)
add_predictions(data = affairs_df_new, model = m_zip_2, type = 'zero')
# A tibble: 6 × 3
  yearsmarried children  pred
         <dbl> <chr>    <dbl>
1            0 yes      0.784
2            0 no       0.857
3            1 yes      0.778
4            1 no       0.852
5           10 yes      0.715
6           10 no       0.805

Predicting counts: using add_predictions

Average number of affairs for individuals with affairs.

library(modelr)
add_predictions(data = affairs_df_new, model = m_zip_2, type = 'count')
# A tibble: 6 × 3
  yearsmarried children  pred
         <dbl> <chr>    <dbl>
1            0 yes       3.48
2            0 no        4.41
3            1 yes       3.64
4            1 no        4.62
5           10 yes       5.55
6           10 no        7.04

Predicting counts: using add_predictions

Average number of affairs for an individual who might have affairs or might not have affairs.

Expected value for count of affairs in a zero-inflated Poisson model calculated as

\[ \tilde{y} = \hat\lambda_i \times (1 - \hat\theta_i) \]

which is the weighted \((1 - \hat\theta_i)\) (probability of having affairs) average number of affairs \(\hat\lambda_i\) for a hypothetical participant with a set of characteristics.

add_predictions(data = affairs_df_new, model = m_zip_2, type = 'response')
# A tibble: 6 × 3
  yearsmarried children  pred
         <dbl> <chr>    <dbl>
1            0 yes      0.750
2            0 no       0.633
3            1 yes      0.809
4            1 no       0.685
5           10 yes      1.58 
6           10 no       1.37 

Predictions: using add_predictions

So response in

# A tibble: 6 × 5
  yearsmarried children lambda theta response
         <dbl> <chr>     <dbl> <dbl>    <dbl>
1            0 yes        3.48 0.784    0.750
2            0 no         4.41 0.857    0.633
3            1 yes        3.64 0.778    0.809
4            1 no         4.62 0.852    0.685
5           10 yes        5.55 0.715    1.58 
6           10 no         7.04 0.805    1.37 

is calculate using

\[ \tilde{y} = \hat\lambda_i \times (1 - \hat\theta_i) \]


For example, 3.478 * (1 - 0.784) = 0.75 i.e. the average number of affairs for people that are married 0 years with children.

For such hypothetical people the probability of having affairs is 1 - 0.784, so 0.216 which weighs down the average count 3.479 to an average number of affairs of 0.749.

For the other group (\(\hat\theta\) = 0.784), the average number of affairs is (always) 0.

Predictions: Exercise

Without add_predictions or predict calculate the following predictions. You don’t need to run the model, just calculate the predictions using the regression formulae.

coef(m_zip_3, "zero")
(Intercept)  gendermale 
  1.2136990  -0.2387054 
coef(m_zip_3, "count")
(Intercept)  gendermale 
  1.8238602  -0.1256744 
  1. Log-odds and probability of not having an affair for men and women.
  2. Average number of affairs for men and women who have affairs.
  3. Average number of affairs for men and women generally; you’ll need \(\lambda \times (1 - \theta)\).