Why Use Zero-Inflated Models?

Up until now, I have discussed the presence of overdispersion in these data, but have not addressed the issue analytically. To refresh, when estimating a Poisson regression, we assume that the mean is equal to the variance. However, when considering the count of officer-involved shootings in each county, there is a high frequency of zero-count observations. This characteristic of the data causes the variance to exceed the mean in the observed distribution, which is referred to as “overdispersion.” A few analytical methods have been developed to address overdispersion in social science inquiry, and in this report I explore one such method, the zero-inflated model.
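A quick simulation can make the link between excess zeros and overdispersion concrete. The sketch below uses made-up values (a Poisson mean of 2 and a 70% share of extra zeros, neither taken from the shooting data) and compares the variance-to-mean ratio with and without the extra zeros.

```r
# Illustrative simulation only: the mean (2) and the share of extra zeros (0.7)
# are assumptions, not estimates from the officer-involved shooting data.
set.seed(1)
n <- 10000
pois_only  <- rpois(n, lambda = 2)                 # plain Poisson counts
extra_zero <- rbinom(n, size = 1, prob = 0.7)      # 1 = observation forced to zero
zero_heavy <- ifelse(extra_zero == 1, 0L, pois_only)

# For Poisson counts the variance-to-mean ratio is close to 1;
# the extra zeros push the ratio above 1 (overdispersion).
ratio_pois <- var(pois_only) / mean(pois_only)
ratio_zero <- var(zero_heavy) / mean(zero_heavy)
print(c(poisson = ratio_pois, zero_heavy = ratio_zero))
```

A ratio well above 1 in the second series is the pattern that motivates the zero-inflated models discussed next.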

Zero-inflated models, in which the outcome can assume either a Poisson or negative binomial distribution, have been developed to express two different origins of zero observations: “structural” and “sampling” (Hu, Pavlicova, & Nunes 2011). The figure below shows a zero-inflated Poisson model with the zero observations split between structural zeros and sampling zeros.

The sampling zeros are modeled with a Poisson or negative binomial distribution, and we assume that those zero observations happened by chance. In the context of this study, we assume that the sampling zeros can be explained by the fact that in certain counties, there is a natural fluctuation to the rate of officer-involved shootings from year to year. Randolph County, in North Carolina, had one shooting in 2015 and zero in 2016. This is an example of a sampling zero.

A logistic model is used to model whether a zero is a sampling zero or a structural zero; it contains an intercept and regressors for camera, race, and the interaction term between them (Zeileis, Kleiber, & Jackman 2008). In the context of this study, a structural zero comes from a county for which there has never been (and presumably, never will be) an officer-involved shooting. A structural characteristic of the county, imaginably the size, police culture, or crime rate, creates an environment such that the rate of officer-involved shootings will always be zero.
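The two origins of zeros can be illustrated with a small simulation. This is only a sketch under assumed values (40% of units are structural zeros and at-risk units have a Poisson mean of 0.5); in real data the origin of a zero is not observed, which is why it has to be modeled.

```r
# Hypothetical values: 40% structural zeros, Poisson mean 0.5 for at-risk units.
# Neither number comes from the county data.
set.seed(2)
n <- 5000
structural <- rbinom(n, size = 1, prob = 0.4) == 1   # TRUE = never at risk
counts <- ifelse(structural, 0L, rpois(n, lambda = 0.5))

# Zeros from at-risk units are sampling zeros; zeros from the rest are structural.
zero_origin <- table(
  origin  = ifelse(structural, "structural", "sampling"),
  is_zero = counts == 0
)
print(zero_origin)
```

Every structural unit has a zero count, while at-risk units produce a mix of zeros and positive counts, so the observed pile of zeros blends both sources.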

The following plots use the officer-involved shooting data, aggregated to the number of shootings for each county, to illustrate the appropriateness of a zero-inflated model. Figure 1 demonstrates the high number of counties with zero shootings in 2015 and 2016.

Figures 2a and 2b focus on the number of shootings broken down by county and race. At first glance, Figure 2a suggests that more Whites were shot than Blacks and Hispanics, since the dark gray part, which highlights non-zero shootings, is relatively larger than for the other racial groups. However, this chart does not account for the population breakdown by race for each county. Figure 2b highlights the spread of shootings by race in the absence of zero-count observations. When excluding counties with no shootings, the average number of Hispanics shot per county is 2.4, which exceeds the average positive number of Blacks shot (2.1), which in turn exceeds the average positive number of Whites shot per county (1.64).

Figures 3a and 3b illustrate similar relationships with a focus on camera status. Figure 3a demonstrates that although there are more non-zero counts in the absence of a body camera, the vast majority of counties, whatever the camera status, saw zero officer-involved shootings. Figure 3b highlights the spread of shootings by camera status in the absence of zero counts. When excluding counties with no shootings, the average number of individuals shot per county in the absence of a body camera is 2.42. This exceeds the average positive number of individuals shot in the presence of a body camera, 1.44.

Zero-Inflated Poisson (ZIP)

Theory

We start with a model similar to one seen in the state-level analysis,

\[Y_{crb/n} \sim \text{Poisson} (n_{rc}e^{\alpha_r + \beta_b + \eta_{r,b} + \epsilon_{brc}}).\]

As a reminder:

  • The units are \(c\) counties, \(r\) racial groups, and \(b\) body camera status.
  • The outcome \(Y\) is the rate of individuals fatally shot by an officer with or without a camera, of racial group \(r\), within one county.
  • The exposure \(n\) is the number of people of racial group \(r\) in county \(c\) in the year as recorded by the US census.
  • The term \(\alpha_{r}\) estimates the effect of race on the shot rate.
  • The term \(\beta_{b}\) estimates the effect of a body camera on the shot rate.
  • The interaction term \(\eta_{r,b}\) estimates the interaction between race and body camera on the shot rate.
  • The error term \(\epsilon_{brc}\) allows for variation in the data.

We introduce the ZIP model to explain the notion that some counties truly host a zero rate of officer-involved shootings, whereas other counties have zero counts from the discreteness of the data. The ZIP model translates the model above into a mixture:

\[Y_{crb/n} = \begin{cases} 0, \text{ if } S_{crb/n} = 0 \\ \text{Poisson} (n_{rc}e^{\alpha_r + \beta_b + \eta_{r,b} + \epsilon_{brc}}), \text{ if } S_{crb/n} = 1. \end{cases}\]

\(S_{crb/n}\) is a latent indicator of whether unit \({crb}\) (a county, race, and camera-status combination) could experience any shootings at all in 2015-16, and this is modeled using a logistic regression:

\[\Pr (S_{crb/n} = 1) = \text{logit}^{-1}(X_{crb} \theta),\]

where \(X_{crb}\) contains the same predictors as the count part (race, camera status, and their interaction) and \(\theta\) is a completely different set of regression coefficients for the zero-count part of the model. Using a function to fit a zero-inflated model, we are able to estimate the probability that a zero observation corresponds to \(S_{crb/n} = 0\) rather than to an outcome of the Poisson distribution (Gelman & Hill 2007). Note that the zeroinfl function used below reports the logistic part in terms of the probability of an excess zero, \(\Pr (S_{crb/n} = 0)\), so its zero-inflation coefficients carry the opposite sign to \(\theta\) as defined here.
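Combining the two parts gives the marginal distribution of the outcome. As a sketch, dropping subscripts and writing \(\pi = \Pr(S = 1)\) and \(\mu\) for the Poisson mean (shorthand introduced here for illustration, not notation from the analysis above), zeros arise from both components while positive counts arise only from the Poisson part:

\[\Pr(Y = 0) = (1 - \pi) + \pi e^{-\mu}, \qquad \Pr(Y = k) = \pi \frac{\mu^{k} e^{-\mu}}{k!}, \quad k = 1, 2, \dots\]

The first term of \(\Pr(Y = 0)\) is the share of structural zeros and the second is the share of sampling zeros.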

Model

library(pscl)  # provides zeroinfl() and vuong()
summary(zeroinf2 <- zeroinfl(camShot ~ race + cam + race*cam, offset=log(racePop+1), data = countyPoisson))
## 
## Call:
## zeroinfl(formula = camShot ~ race + cam + race * cam, data = countyPoisson, 
##     offset = log(racePop + 1))
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6565 -0.1564 -0.0632 -0.0223 60.6645 
## 
## Count model coefficients (poisson with log link):
##                       Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)          -11.52116    0.06335 -181.879  < 2e-16 ***
## racehispanic          -0.43163    0.09710   -4.445 8.78e-06 ***
## racewhite             -1.01042    0.07873  -12.835  < 2e-16 ***
## camTrue               -1.71169    0.29649   -5.773 7.78e-09 ***
## racehispanic:camTrue   0.03076    0.41025    0.075    0.940    
## racewhite:camTrue      0.59622    0.38178    1.562    0.118    
## 
## Zero-inflation model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.2718     0.6255  -3.632 0.000281 ***
## racehispanic           1.1090     0.7105   1.561 0.118549    
## racewhite              0.1763     0.7087   0.249 0.803517    
## camTrue                2.1209     0.8849   2.397 0.016535 *  
## racehispanic:camTrue  -0.4657     1.0817  -0.431 0.666811    
## racewhite:camTrue      0.8474     0.9962   0.851 0.394967    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 25 
## Log-likelihood: -3035 on 12 Df

The count model coefficients are as expected based on what we saw from the state-level Poisson model. The count model can be written as:

\[Y_{cbr/n} \sim \text{Poisson } (\text{exp}({-11.52116 -0.43163 \alpha_h -1.01042\alpha_w -1.71169 \beta_b + 0.03076\eta_{h,b} + 0.59622\eta_{w,b} }))\]

The table below presents predictions of shooting rates by race and camera status per million people.

Term                   Estimate (log rate)   Exp(estimate)*10^6 (shootings per million)
Black and no cam       -11.52116             9.917993
Black and cam          -13.23285             1.790797
White and no cam       -12.53158             3.610804
White and cam          -13.64705             1.183482
Hispanic and no cam    -11.95279             6.441237
Hispanic and cam       -13.63372             1.199363

As we can see, the predicted shooting rate for Blacks in the absence of a camera far exceeds that of any other race/camera combination. The descending order of predicted shot rates is:

  • Black and no camera
  • Hispanic and no camera
  • White and no camera
  • Black and camera
  • Hispanic and camera
  • White and camera.
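The predictions and their ranking can be reproduced from the count-model coefficients in the summary output. The sketch below hard-codes those coefficients rather than calling predict() on the fitted object, so it runs without the data; the vector and group names are my own labels. These are rates from the count component alone and do not fold in the zero-inflation probabilities.

```r
# Count-model coefficients copied from the zeroinfl() summary above.
b <- c(intercept = -11.52116, hispanic = -0.43163, white = -1.01042,
       cam = -1.71169, hispanic_cam = 0.03076, white_cam = 0.59622)

# Linear predictor (log rate) for each race/camera combination;
# Black with no camera is the reference group.
log_rate <- c(
  "Black, no camera"    = b[["intercept"]],
  "Black, camera"       = b[["intercept"]] + b[["cam"]],
  "White, no camera"    = b[["intercept"]] + b[["white"]],
  "White, camera"       = b[["intercept"]] + b[["white"]] + b[["cam"]] + b[["white_cam"]],
  "Hispanic, no camera" = b[["intercept"]] + b[["hispanic"]],
  "Hispanic, camera"    = b[["intercept"]] + b[["hispanic"]] + b[["cam"]] + b[["hispanic_cam"]]
)

# Predicted shootings per million residents, sorted from highest to lowest.
per_million <- sort(exp(log_rate) * 1e6, decreasing = TRUE)
print(round(per_million, 3))
```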

These predictions should be received with caution. At this point, we have not implemented a random effect for county to account for multiple observations of the same county within these data. This is an important next step. We are currently reviewing how a random effect may be implemented in a zero-inflated count model setting.

Moving on to the second part of the model, the coefficients for the zero-inflation part are quite interesting. As we would expect to see, the chances of an excess zero are quite high in cases where an officer is wearing a camera. This trend can likely be explained by the small number of body cameras worn by police officers in 2015-16. Limiting the observed shootings to instances where the officer was wearing a camera multiplies the estimated odds of observing an excess zero by exp(2.1209) = 8.3, and this is a statistically significant effect (p = .0165). Alternately, we can say that the odds of membership in the ‘always zero’ group (i.e., the camera status within a county that would never see any officer-involved shootings) are estimated to be 8.3 times higher when a camera is present than in the absence of a camera (Karazsia and van Dulmen 2008).
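The odds ratio and its p-value can be recovered from the estimate and standard error reported for camTrue in the zero-inflation block. The sketch below also adds a Wald 95% confidence interval, which is not part of the original output; it is computed here under the usual normal approximation.

```r
# Estimate and standard error for camTrue, copied from the
# zero-inflation block of the summary output.
est <- 2.1209
se  <- 0.8849

odds_ratio <- exp(est)                                # about 8.3
wald_ci    <- exp(est + c(-1, 1) * qnorm(0.975) * se) # normal-approximation interval
p_value    <- 2 * pnorm(-abs(est / se))               # two-sided Wald test

print(round(c(odds_ratio = odds_ratio, lower = wald_ci[1],
              upper = wald_ci[2], p_value = p_value), 4))
```

The width of the interval is a reminder that, with so few camera-present shootings, the size of this effect is estimated imprecisely.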

The table below presents estimates for the groups most likely to have excess zeros in each county.

Term                   Estimate (log odds)   Exp(estimate)*10^3 (odds of an excess zero, x 1,000)
Black and no cam       -2.2718               103.1264
Black and cam          -0.1509               859.9337
White and no cam       -2.0955               123.0087
White and cam           0.8728               2393.6036
Hispanic and no cam    -1.1628               312.6096
Hispanic and cam        0.4924               1636.2385

The ordering of the groups changes when moving from the count part to the zero-inflated part of the model. The descending order of the groups most likely to have zero shootings is:

  • White and camera
  • Hispanic and camera
  • Black and camera
  • Hispanic and no camera
  • White and no camera
  • Black and no camera.
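Odds can be hard to read, so it may help to convert the zero-inflation estimates into probabilities of an excess zero. The sketch below hard-codes the zero-inflation coefficients from the summary output and applies the inverse logit (plogis in base R); the vector and group names are my own labels.

```r
# Zero-inflation coefficients copied from the zeroinfl() summary above.
g <- c(intercept = -2.2718, hispanic = 1.1090, white = 0.1763,
       cam = 2.1209, hispanic_cam = -0.4657, white_cam = 0.8474)

# Log odds of an excess zero for each race/camera combination.
logit_zero <- c(
  "Black, no camera"    = g[["intercept"]],
  "Black, camera"       = g[["intercept"]] + g[["cam"]],
  "White, no camera"    = g[["intercept"]] + g[["white"]],
  "White, camera"       = g[["intercept"]] + g[["white"]] + g[["cam"]] + g[["white_cam"]],
  "Hispanic, no camera" = g[["intercept"]] + g[["hispanic"]],
  "Hispanic, camera"    = g[["intercept"]] + g[["hispanic"]] + g[["cam"]] + g[["hispanic_cam"]]
)

odds_zero <- exp(logit_zero)      # odds of an excess zero
prob_zero <- plogis(logit_zero)   # probability of an excess zero

print(round(sort(prob_zero, decreasing = TRUE), 3))
```

Because the inverse logit is monotone, the ranking by probability matches the ranking by odds listed above.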

One subtle point from the results above: in the absence of a camera, the predicted shooting rate for Hispanics exceeds the predicted rate for Whites. However, the odds of membership in the ‘always zero’ group are also estimated to be higher for Hispanics than for Whites, again in the absence of a camera. This indicates that ..(?)

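Next we can ask whether this model is an improvement upon a Poisson model without a zero-inflated component. Because the two models are compared with a Vuong test, the evidence comes from the test's z-statistics rather than from AIC values directly. The raw and AIC-corrected statistics both favor the zero-inflated Poisson model (p = 0.0002 and p = 0.0016), although the BIC-corrected statistic, which penalizes the extra parameters more heavily, does not reach significance (p = 0.267). On balance, the zero-inflated Poisson model is preferred over the Poisson without a zero-inflated part.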

p1 <- glm(camShot ~ race + cam + race*cam, family = poisson, data = countyPoisson, offset = log(racePop + 1))
vuong(p1, zeroinf2)
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                  -3.5319747 model2 > model1 0.00020623
## AIC-corrected        -2.9400299 model2 > model1 0.00164090
## BIC-corrected        -0.6215319 model2 > model1 0.26712484
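As a check on how to read this output, the reported p-values are consistent with one-sided lower-tail normal probabilities of the z-statistics, since negative values favor the second model (the zero-inflated Poisson). The sketch below recomputes them from the printed statistics; the vector names are my own.

```r
# z-statistics copied from the vuong() output above.
z <- c(raw = -3.5319747, aic_corrected = -2.9400299, bic_corrected = -0.6215319)

# One-sided p-values: lower tail of the standard normal distribution.
p <- pnorm(z)
print(signif(p, 5))
```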

Zero-Inflated Negative Binomial

Theory

It is possible that these data exhibit more variability than is predicted by the mean of the Poisson distribution modeled above, even after accounting for excess zeros. To mitigate this issue, we can model overdispersed zero-inflated count data by specifying a zero-inflated negative binomial (ZINB) distribution:

\[Y_{crb/n} = \begin{cases} 0, \text{ if } S_{crb/n} = 0 \\ \text{negative-binomial} (\text{mean} = n_{rc}e^{\alpha_r + \beta_b + \eta_{r,b} + \epsilon_{brc}}, \text{overdispersion} = \upsilon_c), \text{ if } S_{crb/n} = 1. \end{cases}\]

This specification departs from the Poisson with the incorporation of \(\upsilon_c\), an overdispersion parameter. Setting \(\upsilon_c = 1\) corresponds to setting the shape parameter in the gamma distribution equal to \(\infty\), implying that the model has no overdispersion. Higher values of \(\upsilon_c\) indicate more variation in the distribution of shootings among groups involving county \(c\) than would be expected under the Poisson regression, that is, overdispersion (Gelman & Hill 2007).
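One practical caveat is that R's negative binomial functions use a different scale from \(\upsilon_c\): they report a size parameter, theta, under which the variance is \(\mu + \mu^2/\theta\), so smaller theta means more overdispersion and theta approaching infinity recovers the Poisson. The simulation below is a sketch with assumed values (mean 2, theta 2) chosen only for illustration.

```r
# Hypothetical mean and size parameter, chosen for illustration only.
set.seed(3)
mu    <- 2
theta <- 2

y_nb   <- rnbinom(100000, mu = mu, size = theta)  # negative binomial draws
y_pois <- rpois(100000, lambda = mu)              # Poisson draws, same mean

# Under this parameterization the NB variance is mu + mu^2 / theta (here 4),
# compared with a Poisson variance equal to the mean (here 2).
nb_var_theory <- mu + mu^2 / theta
print(c(theory_nb = nb_var_theory, sample_nb = var(y_nb), sample_pois = var(y_pois)))
```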

As in the zero-inflated Poisson model, \(S_{crb/n}\) is a latent indicator of whether unit \({crb}\) could experience any shootings at all in 2015-16, and this is modeled using a logistic regression:

\[\Pr (S_{crb/n} = 1) = \text{logit}^{-1}(X_{crb} \theta),\]

where \(X_{crb}\) contains the same predictors as the count part (race, camera status, and their interaction) and \(\theta\) is a completely different set of regression coefficients for the zero-count part of the model. Using a function to fit a zero-inflated model, we are able to estimate the probability that a zero observation corresponds to \(S_{crb/n} = 0\) rather than to an outcome of the negative binomial distribution.

Model

summary(ZIPNB2 <- zeroinfl(camShot ~ race + cam + race*cam, offset=log(racePop+1), data = countyPoisson, 
                           dist = "negbin"))
## 
## Call:
## zeroinfl(formula = camShot ~ race + cam + race * cam, data = countyPoisson, 
##     offset = log(racePop + 1), dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20948 -0.15820 -0.06380 -0.02244 60.02050 
## 
## Count model coefficients (negbin with log link):
##                       Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)          -11.62922    0.06417 -181.216  < 2e-16 ***
## racehispanic          -0.46504    0.13601   -3.419 0.000628 ***
## racewhite             -0.93666    0.07841  -11.946  < 2e-16 ***
## camTrue               -1.77445    0.34326   -5.169 2.35e-07 ***
## racehispanic:camTrue  -0.00896    0.51376   -0.017 0.986086    
## racewhite:camTrue      0.55340    0.45642    1.212 0.225332    
## Log(theta)             0.77416    0.16849    4.595 4.33e-06 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)          -12.8296   134.7787  -0.095    0.924
## racehispanic          10.6745   134.7814   0.079    0.937
## racewhite              1.6726    47.0766   0.036    0.972
## camTrue               12.2041   134.7817   0.091    0.928
## racehispanic:camTrue  -9.9977   134.7866  -0.074    0.941
## racewhite:camTrue     -0.4206    47.0835  -0.009    0.993
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 2.1688 
## Number of iterations in BFGS optimization: 55 
## Log-likelihood: -2993 on 13 Df

The count model can be written as:

\[Y_{cbr/n} \sim \text{negative binomial} (\text{exp}({ -11.62922 -0.46504 \alpha_h -0.93666 \alpha_w -1.77445 \beta_b -0.00896\eta_{h,b} + 0.55340\eta_{w,b} }))\]

These coefficient estimates do not stray far from those seen in the zero-inflated Poisson model.

Term                   Estimate (log rate)   Exp(estimate)*10^6 (shootings per million)
Black and no cam       -11.62922             8.902130
Black and cam          -13.40367             1.509594
White and no cam       -12.56588             3.489054
White and cam          -13.78693             1.028993
Hispanic and no cam    -12.09426             5.591517
Hispanic and cam       -13.87767             0.939733

The predicted shooting rate for Blacks in the absence of a camera again far exceeds that of any other race/camera combination. The descending order of predicted shot rates is:

  • Black and no camera
  • Hispanic and no camera
  • White and no camera
  • Black and camera
  • White and camera
  • Hispanic and camera.
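To compare the two sets of predictions side by side, the sketch below multiplies a small design matrix by the count-model coefficients from each summary. The matrix and its row labels are my own construction for illustration; only the coefficients come from the output above.

```r
# Count-model coefficients copied from the two zeroinfl() summaries.
coefs <- rbind(
  zip  = c(-11.52116, -0.43163, -1.01042, -1.71169,  0.03076, 0.59622),
  zinb = c(-11.62922, -0.46504, -0.93666, -1.77445, -0.00896, 0.55340)
)
colnames(coefs) <- c("intercept", "hispanic", "white", "cam",
                     "hispanic_cam", "white_cam")

# Design rows (1 = term switched on) for each race/camera group.
design <- rbind(
  "Black, no camera"    = c(1, 0, 0, 0, 0, 0),
  "Black, camera"       = c(1, 0, 0, 1, 0, 0),
  "White, no camera"    = c(1, 0, 1, 0, 0, 0),
  "White, camera"       = c(1, 0, 1, 1, 0, 1),
  "Hispanic, no camera" = c(1, 1, 0, 0, 0, 0),
  "Hispanic, camera"    = c(1, 1, 0, 1, 1, 0)
)

# Predicted shootings per million under each model (groups in rows).
per_million <- exp(design %*% t(coefs)) * 1e6
print(round(per_million, 3))
```

The two columns differ mainly in the last two camera groups, whose predicted rates are close enough that a small change in the interaction estimates swaps their order.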

The predicted shooting rates closely align with those seen in the zero-inflated Poisson regression, except for the last two groups. In the zero-inflated Poisson model, the group predicted to have the smallest rate of shootings is Whites shot by an officer wearing a camera. With the negative binomial specification, the group predicted to have the smallest rate of shootings within a county is Hispanics shot by an officer wearing a camera. The interaction terms between race and camera status are not statistically significant, however, so this difference in ordering should be interpreted with caution.

At this stage, a random effect is not implemented to account for multiple observations of the same county in these data. This is an important component that must be addressed for increased reliability of predictions.

We can formally compare the zero-inflated negative binomial model to a basic negative binomial to test if the added complexity of the zero-inflated part significantly improves estimations. The negative binomial is estimated below, using the “glm.nb” function from the MASS package.

library(MASS)  # provides glm.nb()
# The exposure enters through an offset() term in the formula.
summary(NB1 <- glm.nb(camShot ~ race + cam + race*cam + offset(log(racePop+1)), data = countyPoisson))
## 
## Call:
## glm.nb(formula = camShot ~ race + cam + race * cam + offset(log(racePop + 
##     1)), data = countyPoisson, init.theta = 1.78726106, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1247  -0.2252  -0.0902  -0.0323   4.0218  
## 
## Coefficients:
##                       Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)          -11.63370    0.06663 -174.598  < 2e-16 ***
## racehispanic          -0.56620    0.10809   -5.238 1.62e-07 ***
## racewhite             -0.92398    0.07989  -11.566  < 2e-16 ***
## camTrue               -2.17660    0.17498  -12.439  < 2e-16 ***
## racehispanic:camTrue  -0.11344    0.28915   -0.392    0.695    
## racewhite:camTrue     -0.17941    0.21530   -0.833    0.405    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.7873) family taken to be 1)
## 
##     Null deviance: 4404.9  on 18647  degrees of freedom
## Residual deviance: 3324.7  on 18642  degrees of freedom
##   (98 observations deleted due to missingness)
## AIC: 6017.4
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.787 
##           Std. Err.:  0.256 
## 
##  2 x log-likelihood:  -6003.424

We use a likelihood ratio test to determine whether the added complexity of the zero-inflated negative binomial model improves the fit enough to justify its use. The null hypothesis, \(H_0\), is that the simpler model (the negative binomial) is adequate. The alternative hypothesis, \(H_A\), is that the more complex model (the zero-inflated NB) is better. The log-likelihood of the more complex model, the zero-inflated one, is -2992.655. The log-likelihood of the simpler model, the negative binomial, is -3001.712. We take two times the difference of these log-likelihoods to find the test statistic, 18.114, which is compared to a chi-squared distribution. To find the p-value, we take \(1-\text{pchisq}(\text{test statistic, difference in df})\), where the difference in degrees of freedom is 13 - 7 = 6. The p-value thus equals 0.00595, and we conclude that the zero-inflated NB model is a significant improvement upon the negative binomial without zero-inflation.

2 * (-2992.655 - (-3001.712))
## [1] 18.114
1-pchisq(18.114, 6)
## [1] 0.00595353
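The same calculation can be written with named quantities, which makes the degrees of freedom explicit. This is a sketch that hard-codes the log-likelihoods and parameter counts reported above (7 for the negative binomial: six coefficients plus theta; 13 for the zero-inflated model); the variable names are my own.

```r
# Log-likelihoods and parameter counts taken from the model output above.
ll_nb   <- -3001.712   # negative binomial: 6 coefficients + theta = 7 parameters
ll_zinb <- -2992.655   # zero-inflated negative binomial: 13 parameters

lr_stat <- 2 * (ll_zinb - ll_nb)   # likelihood ratio statistic
df_diff <- 13 - 7                  # difference in number of parameters

# Upper-tail chi-squared probability, equivalent to 1 - pchisq(lr_stat, df_diff).
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
print(c(statistic = lr_stat, df = df_diff, p_value = p_value))
```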

Compare Poisson and Negative Binomial

The question might now be raised, which is better: the zero-inflated Poisson model, or the zero-inflated negative binomial model? We can perform another likelihood ratio test on these two models, with the formal hypotheses to test:

  • \(H_0\): The simpler model, the zero-inflated Poisson, should be selected in favor of the zero-inflated negative binomial model.
  • \(H_A\): The zero-inflated negative binomial model (with an extra parameter for overdispersion) should be selected in favor of the zero-inflated Poisson model.

2 * (-2992.655 - (-3035.039))
## [1] 84.768
1 - pchisq(84.768, 1)
## [1] 0

The test statistic is equal to 84.768, and the p-value is indistinguishable from zero at the printed precision. With a p-value far smaller than 0.05, we reject the zero-inflated Poisson and select the zero-inflated negative binomial model as the best option.
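The printed value of 0 is a floating-point artifact of subtracting from 1 rather than a true zero. As a sketch, asking pchisq for the upper tail directly keeps the tiny probability; the variable names below are my own.

```r
# Likelihood ratio statistic for ZINB versus ZIP, from the log-likelihoods above.
lr_stat <- 2 * (-2992.655 - (-3035.039))

p_subtract <- 1 - pchisq(lr_stat, df = 1)                  # loses the tail to rounding
p_upper    <- pchisq(lr_stat, df = 1, lower.tail = FALSE)  # computes the tail directly

print(c(p_subtract = p_subtract, p_upper = p_upper))
```

Either way the conclusion is unchanged; the upper-tail form simply reports how small the p-value is.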

References

Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. https://doi.org/10.1017/CBO9781107415324.004

Hu, M.-C., Pavlicova, M., & Nunes, E. V. (2011). Zero-inflated and Hurdle Models of Count Data with Extra Zeros: Examples from an HIV-Risk Reduction Intervention Trial. The American Journal of Drug and Alcohol Abuse, 37(5), 367–375.

Karazsia, B. T., & Van Dulmen, M. H. M. (2008). Regression models for count data: Illustrations using longitudinal predictors of childhood injury. Journal of Pediatric Psychology, 33, 1076–1084. https://doi.org/10.1093/jpepsy/jsn055

Loeys, T., Moerkerke, B., de Smet, O., & Buysse, A. (2012). The analysis of zero-inflated count data: Beyond zero-inflated Poisson regression. British Journal of Mathematical and Statistical Psychology, 65(1), 163–180. https://doi.org/10.1111/j.2044-8317.2011.02031.x

Washington Post. (2016). Fatal Police Shootings Database.

Zeileis, A., Kleiber, C., & Jackman, S. (2008). Regression Models for Count Data in R. Journal of Statistical Software, 27(8), 1–25. https://doi.org/10.18637/jss.v027.i08