Methods

I began by running Poisson regression models using the MASS package (Venables and Ripley 2002) in R (R Core Team 2019). The Poisson regression model is a type of model that deals explicitly with count outcomes, by using a Poisson distribution to calculate the probability of a count. The Poisson regression model assumes that the conditional mean of the outcome is equal to the conditional variance (Long 1997). Therefore, the dispersion is fixed to 1(φ = 1) and the variance function is V (μ) = μ (Zeileis, Kleiber, and Jackman 2007). Overdispersion occurs if the conditional variance is greater than the conditional mean (Long 1997).

As you will see below, overdispersion was an issue. So, I then ran a negative binomial model. A Negative Binomial (NB) model adds a parameter to the Poisson model that accounts for the difference between the conditional variances and the conditional mean (UCLA n.d.). I used the Vuong (1989) test implemented by the vuong () function in the pscl (Jackman et al. 2015) package to compare count models. The Vuong test compares the predicted probabilities of the two models to the saturated model (Friendly and Meyer 2016).

Poisson and NB models may not fit because of overdispersion, zero-inflation or a mixture of overdispersion and zero-inflation (Deng and Paul 2005). A failure to address zero-inflation can lead to biased parameter estimates (Harrison 2015). To address the 0’s, I employed a zero-inflated negative binomial model (ZINB) using the pscl package. ZINB model the zero counts along with the non-zero counts. Therefore, it contains two components: the zero-inflation component and the negative binomial component. The zero-inflation component is a logistic regression for the probability that the observation is a member of a class of observations whose count is “necessarily” zero ((Friendly and Meyer 2016, p.452). The negative binomial component is used for the other class of observations that may be 0 or positive (Friendly and Meyer 2016). Different predictors may be used for each component of the model. I decided to simplify the zero-inflation component by running it without predictors. After running the ZINB, I compared them to the NB models using the Vuong test in pscl (Jackman et al. 2015).

The coefficients of a categorical variable in the negative binomial part of the ZINB model tell us the expected difference in log counts of the DV between the level and the reference variable (UCLA n.d.). To aid in the interpretability of any coefficients from the NB component that reached statistical significance at p<0.05, I exponentiated the coefficients to provide the incidence rate-ratio (IRR) (UCLA n.d.). Once exponentiated the coefficients tell us the multiplicative effect (Friendly and Meyer 2016).

For our variable selection process, we used step-wise, backward elimination to determine which variable to include in our final model. We used backward elimination because we conducted a literature review on prosecutorial discretion and identified the variables that have been shown to influence prosecutorial discretion. We included these in the full initial model. Then, we removed variables, one-by-one and compared the full (saturated) model with the reduced model in an effort to find a reduced model that best explains the data. We compared the saturated model with the reduced model using the Voung test.

Descriptive Statistics

Number of Unintentional Opioid Overdose Deaths from 2015-2019

variable: OD_Ttl Source: oap_data_wide_all_nc_2020.08.23 Range: 38-654

This variable represents the total number of unintentional opioid overdose deaths from 2015-2019. We used this date range because it matches the date ranges used for DIH charges.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    38.0   107.5   140.0   180.6   202.5   654.0

DIH Charges

DIH represents the number of DIH charges that received online media mentions. It is the best available data for estimating the total number of DIH charges by county.

Source: Health in Justice Action Lab Years: 2015-2019 *Justification: 2014 DIH law enacted in N.C. Range: 0-6

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.744   3.000   6.000

Number of Prosecutors

Number of Prosecutors per Prosectural District Source: NC Statute

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    9.00   12.00   14.49   15.00   58.00

Rural v. Non-rural

The base category is Non-rural. If a district has a rural county, then it is included in the rural category (even if the other counties are not rural)(which is 1). If a district has only suburban and urban counties, it is non-rural(which is 0).

Source: https://www.ncruralcenter.org/about-us/

##  0  1 
## 13 30

So, there are 13 non-rural districts in the dataset and 30 rural districts.

##District Attorney Party Affliation

“DA party” is the District Attorney of that prosecutorial district’s party affiliation. 20 prosecutorial districts are led by Democrats and 23 are led by Republicans.

Source: Ballotpedia and news reports

##  D  R 
## 20 23

I want to check to make sure that Rural_2 and DA_party are not measuring the same construct.

##    
##      D  R
##   0  6  7
##   1 14 16

There are 6 non-rural districts with a Democrat as a DA and 14 rural districts with a Democrat as a DA. This indicates that non-rural is not a 1:1 measure of whether or not the district is run by a Democrat.

##TTl_Crim

“TTl_Crim” represents the number of criminal charges filed by prosecutors in each district in 2018-2019. Range: 15-470

Source: https://www.nccourts.gov/documents/publications/criminalinfraction-case-activity-report-by-prosecutorial-district

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    15.0    81.0   129.0   169.2   220.0   470.0

Population

We decided not to include population as the offset. For reasons discussed below. 2009 Census data source: https://files.nc.gov/ncosbm/demog/countygrowth_2009.html

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   73553  135459  174961  220249  233746  906473

#Overdispersion

When dealing with count data, it is important to consider the number of 0’s in the dependent variable (DV), because too many zero’s can lead to overdispersion. In this case, we are concerned with the number of districts that have 0 DIH charges from 2015-2019.

Percent of Prosecutorial Districts with 0 DIH charges = 34.9%, which means we will likely need to use a model that accounts for zero-dispersion, but we will test that furter below.

## [1] 34.88372

#Model Type The DV is the number of DIH charges mentioned in media mentions from 2015-2019 by N.C. Prosecutorial District (n=43). It is a count variable. The percentage of DIH charges that = 0 is 34.88%, which suggests that the distribution may be zero-inflated. Before we test this, let’s plot some of the relationships between the IV and the DVs. Below is the full equation. Do any of these predictors explain the variance in DIH prosecutions by District?

DIH = Pros + Rural_2 + Ttl_Crim + DA party + offset(log(OD_Ttl))

We offset the model by the log of OD_Ttl, because there is a greater likelihood that DIH charges will occur in places that have more opioid overdose deaths, because an overdose death is needed for a DIH charge.

##OD_19 Let’s start by looking at the relationship between the number unintentional opioid overdose deaths (opioid ODs) and the number of DIH charges. Do districts that tend to have more opioid ODs have more DIH charges?

## Warning: package 'dplyr' was built under R version 3.6.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Based on this graph it seems like the number of OD deaths may predict the number of charges in some districts but not others. This graph doesn’t show the whole picture though, because it doesn’t tell us the number of OD’s for districts that had 0 DIH charges, so let’s adjust.

The relationship between OD_Ttl may affect the number of DIH charges, but it does not do so systematically.

This graph even suggests that the prosecutorial districts with the greatest number of overdoses are the ones that have 0 DIH charges.

This suggests that we might not need to include OD_Ttl as an exposure.

##Pros Other studies have found that as the number of prosecutors in a district increase so do the number of criminal charges. Let’s see if this is the case in NC. The relationship between Pros and DIH is not clear. It seems to increase and then decrease, which suggests something else is going on here.

And this graph suggests that districts with the most prosecutors have 0 DIH charges.

##Ttl_Crim

The literature also suggests that the total number of crimes charged should predict prosecutorial activity. The graph suggests the opposite relationship; the districts with the most crimes charged have 0 DIH. This suggests that maybe these prosecutors are too busy to be bothered with DIH prosecutions. (Or that it is explained by the urban/rural division)

##Rural_2

Let’s take a look at urban/rural differences in DIH charges filed.

This graph shows that districts with rural counties have a large share of the DIH charges.

##    Category  x
## 1 Non-Rural 21
## 2     Rural 54

More specifially, the prosecutorial districts with a rural county have 54 DIH charges, while those without one have 21 DIH charges. Now, it is important to note here, that approximately 80% of NC counties are rural, but only about 20% of the population in NC lives in these rural counties.

#Population We decided not to include population as an offset because we reasoned that the number of OD deaths was a better exposure variable. This is supported by the following…

Population and DIH charges

Districts with great population seem be more likely to have 0 DIH charges.

Population and OD_Ttl

As we would explect the number of overdose deaths tends to increase as the population increases.

So, OD_19 will account for population differences and population should not be included in the model.

##Population & Pros

The number of prosecutors increase as the population increases. But, it isn’t relevant for this article as to why the number of prosecutors vary by office. So, again this supports removing population.

##Population & Rural_2

Surprisingly, the districts with the rural counties do not have smaller populations than the districts that do not have rural counties. This is likely because we are dealing with districts and not counties.

So, while we would likely need to offset by population if prosecutorial districts were divided by county, in N.C. the legislature has created prosecutorial districts by combining counties, in an effort to smooth the population differences between counties.

#Models

Time to start modeling. So, we need to figure out two things (1) Should we use a poisson model, negative binomial model, or a zero-inflated negative binomial model? (2) Which variables should we include? (3) Are there interaction variables we should include? Let’s start with figuring out which model we should use.

##Poisson When possible, you want to use the simplest model for the data (the Poisson). I’m going to go ahead and start with a model with all of the IVs (independent variables).

I’m not going to include OD_Ttl as an exposure variable. Given the descriptive graphs presented above, the total number of ODs does not seem to increase the likelihood that a DIH charge will occur. Rather than put OD_Ttl as an offset variable (which would give it a coefficient of 1), it will be better to put OD_Ttl into the model and allow the data to estimate the co-efficient.

p_model<- glm (DIH ~ Pros + Rural_2 + Ttl_Crim + DA_party+ OD_Ttl,family=poisson(link="log"), data=NC_DIH_data)
summary(p_model)
## 
## Call:
## glm(formula = DIH ~ Pros + Rural_2 + Ttl_Crim + DA_party + OD_Ttl, 
##     family = poisson(link = "log"), data = NC_DIH_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4999  -1.5012  -0.2432   0.7161   2.1506  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -0.8253150  0.4675904  -1.765   0.0776 .
## Pros         -0.0190659  0.0264482  -0.721   0.4710  
## Rural_2Rural  0.6390458  0.3328665   1.920   0.0549 .
## Ttl_Crim      0.0012543  0.0009405   1.334   0.1823  
## DA_partyR     0.0762458  0.2388106   0.319   0.7495  
## OD_Ttl        0.0048031  0.0022777   2.109   0.0350 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 91.11  on 42  degrees of freedom
## Residual deviance: 77.12  on 37  degrees of freedom
## AIC: 165.28
## 
## Number of Fisher Scoring iterations: 5

Before we worry about interpretation lets test if there is overdispersion.

library('AER')
## Warning: package 'AER' was built under R version 3.6.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.6.3
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 3.6.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.6.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.6.3
## Loading required package: survival
dispersiontest(p_model) #AER package
## 
##  Overdispersion test
## 
## data:  p_model
## z = 2.2319, p-value = 0.01281
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   1.502397

Overdispersion occurs when dispersion stat is greater than 1. Here the disperson statistic is 1.50, so we do have some dispersion (p=0.01).

So, we can try to see if negative binomial model will account for this dispersion. Though I have read somewhere that NB models don’t preform well with small sample sizes..this article suggests otherwise: https://watermark.silverchair.com/kxm030.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAArUwggKxBgkqhkiG9w0BBwagggKiMIICngIBADCCApcGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMvK3MnHBhYGaUD0pYAgEQgIICaJdZEL6x7O3RAkbziaQJIZfdeIuqVVxl_Yy-3oiZjLN2WK01TZxYvazhk_wDQyubA2MEN3QOLhjC36w1abXX3_hTHDW1omEINX9LUVp-tsT-2CighJWshIpWAiVoHemwsvzaTPvi9l98vjrgElHjGR7ci21T6-Mnkyo1rxBXyZuIKLlnOzz1RhGUDEZdc6lIoE2gkca6siiuV0N81NtdA3-eovCK2C4N6tNb2srVfOUBh8zFn8YBQ-x1TmGtX9FN8nuczmI051vLMpi2wmNAm5j3o9oS8SYN7dufgEhSCzU0u7rmScA4_DZkFVl78CvkkKkE8Zta09O3Ud2xspAKWs_HVJBbKPS9Xdbl1Vr3h35zrE5wvynktl1tuUpqKkq_KHsTskBNxgr3t3CWJ2ZqRoMbbRA7sF-_sk-WzZSuYWXBxUo6oajgC4PB-XT4SkEk8KPr_E38wJNmD7vPXM0Z4YsJea_s_puSSw2mOlnLALnEpQpJ1FDNTb2Mwb0L7WqQRsbYiky9OSADbgERSknJ2OAlb4k6Ftr9UmT4Dq1cTqvA9RO-RRxGBqHvAMDFudETSsZ_Jxnk-dsW8vWm9EkY0GG4Tp9qgH5wBToGryAHjredhaw2ay9QMRlBCwqS4BLO67xELdt6Xrnbizz7q41y_rGbmQbpyU94LF8ThcohryEoBmc9f77Gv69UMCHc-DB5iNoz15r4--Lzo33cmEpNgG7M-lX8Q4xYLBwJffSSjruYt5T1sXyjI9a3ASnDE32-JT7eylN_yaltyAnA-Vh9cgJ1VfQHt6zyiiL_Zkm2r3RVnt-COpTEg_M

##Negative Binomal Model

library('MASS')
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
n_model<- glm.nb(DIH ~ DA_party + Pros + Rural_2 + Ttl_Crim + OD_Ttl, data=NC_DIH_data)
summary(n_model)
## 
## Call:
## glm.nb(formula = DIH ~ DA_party + Pros + Rural_2 + Ttl_Crim + 
##     OD_Ttl, data = NC_DIH_data, init.theta = 2.205620082, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0439  -1.3429  -0.1531   0.5640   1.4439  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -0.920573   0.612269  -1.504   0.1327  
## DA_partyR     0.134460   0.326934   0.411   0.6809  
## Pros         -0.027731   0.037384  -0.742   0.4582  
## Rural_2Rural  0.679000   0.435187   1.560   0.1187  
## Ttl_Crim      0.001368   0.001312   1.043   0.2970  
## OD_Ttl        0.005554   0.003187   1.743   0.0813 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.2056) family taken to be 1)
## 
##     Null deviance: 57.080  on 42  degrees of freedom
## Residual deviance: 48.992  on 37  degrees of freedom
## AIC: 160.99
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.21 
##           Std. Err.:  1.31 
## 
##  2 x log-likelihood:  -146.991

Ratio of deviance and df is 2.2 which is greater than one so that suggests overdispersion.

Let’s compare it with a zero-inflated negative binomial.

library(pscl)
## Warning: package 'pscl' was built under R version 3.6.3
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:survival':
## 
##     aml
## The following object is masked from 'package:car':
## 
##     logit
library(MASS)
m1 <- zeroinfl(DIH ~ Ttl_Crim + DA_party + Pros + Rural_2 + OD_Ttl| 1, data = NC_DIH_data)
summary(m1)
## 
## Call:
## zeroinfl(formula = DIH ~ Ttl_Crim + DA_party + Pros + Rural_2 + OD_Ttl | 
##     1, data = NC_DIH_data)
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36792 -0.89505 -0.05426  0.66353  2.24923 
## 
## Count model coefficients (poisson with log link):
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -0.686817   0.527253  -1.303   0.1927  
## Ttl_Crim      0.002507   0.000998   2.512   0.0120 *
## DA_partyR     0.170029   0.254129   0.669   0.5035  
## Pros         -0.021513   0.030134  -0.714   0.4753  
## Rural_2Rural  0.618787   0.346071   1.788   0.0738 .
## OD_Ttl        0.004363   0.002557   1.706   0.0880 .
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -1.1775     0.4831  -2.437   0.0148 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 13 
## Log-likelihood: -70.9 on 7 Df
vuong(n_model, m1)
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A p-value
## Raw                  -1.1843694 model2 > model1 0.11813
## AIC-corrected        -0.7273818 model2 > model1 0.23350
## BIC-corrected        -0.3249585 model2 > model1 0.37261

The Voung test suggests that the zero-inflated model is not significantly different than the NB model (p=0.12). This suggests that we go with the simpler model, NB model.

vuong(p_model, n_model)
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A p-value
## Raw                   -1.271212 model2 > model1 0.10183
## AIC-corrected         -1.271212 model2 > model1 0.10183
## BIC-corrected         -1.271212 model2 > model1 0.10183

When comparing the NB model to the Poisson model, the Voung test suggests that the NB model is not significantly better than the Poisson model (p=0.1). This suggests that the poisson is the best model for our full, saturated model.

#IVs: Variable Inclusion: Backward deletion

So we have already run the saturated model with all the IVs of interest. Now we delete the variables one by one and see if there is model improvement.

Remove # of Pros.

Let’s start by removing the number of prosecutors.We start with a poisson model because that is what we decided to go with above.

m2 <-glm (DIH ~ Rural_2 + Ttl_Crim + DA_party+ OD_Ttl,family=poisson(link="log"), data=NC_DIH_data)
summary(m2)
## 
## Call:
## glm(formula = DIH ~ Rural_2 + Ttl_Crim + DA_party + OD_Ttl, family = poisson(link = "log"), 
##     data = NC_DIH_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3841  -1.5153  -0.2885   0.8631   2.2234  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.8559753  0.4714872  -1.815 0.069450 .  
## Rural_2Rural  0.6890860  0.3345375   2.060 0.039416 *  
## Ttl_Crim      0.0012143  0.0009357   1.298 0.194344    
## DA_partyR     0.0742172  0.2379753   0.312 0.755140    
## OD_Ttl        0.0033188  0.0009848   3.370 0.000752 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 91.11  on 42  degrees of freedom
## Residual deviance: 77.63  on 38  degrees of freedom
## AIC: 163.79
## 
## Number of Fisher Scoring iterations: 5
vuong(m1, m2)
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A  p-value
## Raw                   1.7593262 model1 > model2 0.039261
## AIC-corrected         1.1719054 model1 > model2 0.120618
## BIC-corrected         0.6546227 model1 > model2 0.256355

Using the Raw score, the Voung z-statistic is 1.76 (p=0.04). However, once the AIC or BIC are corrected, there does not appear to be a stastically significnat difference between the two model fits.

Prosecutors awas not significant in the saturated model and based on that and the Voung raw score, we can choose to omit it and opt for the more parsimonious model

Remove DA_Party

Let’s move DA_Party next, because the literature suggests that it won’t be relevant.

m3 <-glm (DIH ~ Rural_2 + Ttl_Crim + OD_Ttl,family=poisson(link="log"), data=NC_DIH_data)
summary(m3)
## 
## Call:
## glm(formula = DIH ~ Rural_2 + Ttl_Crim + OD_Ttl, family = poisson(link = "log"), 
##     data = NC_DIH_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3600  -1.5120  -0.3257   0.8548   2.2886  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.8266595  0.4617244  -1.790 0.073394 .  
## Rural_2Rural  0.6894277  0.3339515   2.064 0.038975 *  
## Ttl_Crim      0.0012703  0.0009216   1.378 0.168076    
## OD_Ttl        0.0033258  0.0009878   3.367 0.000761 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 91.110  on 42  degrees of freedom
## Residual deviance: 77.728  on 39  degrees of freedom
## AIC: 161.89
## 
## Number of Fisher Scoring iterations: 5
vuong(m2, m3)
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                   0.1228457 model1 > model2  0.4511146
## AIC-corrected        -2.3958183 model2 > model1  0.0082917
## BIC-corrected        -4.6137539 model2 > model1 1.9773e-06

Both the AIC and BIC corrected scores are statistically signficant (p <0.01), which means the model without party is better than the model without party.

Total Criminal Charges

Ttl_Crim has not been significant in any of the models, so let’s remove it next.

m4 <- glm (DIH ~ Rural_2 + OD_Ttl,family=poisson(link="log"), data=NC_DIH_data)
summary(m4)
## 
## Call:
## glm(formula = DIH ~ Rural_2 + OD_Ttl, family = poisson(link = "log"), 
##     data = NC_DIH_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0512  -1.6133  -0.4826   0.9211   2.3432  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.6080851  0.4202810  -1.447 0.147938    
## Rural_2Rural  0.6828865  0.3287788   2.077 0.037798 *  
## OD_Ttl        0.0034137  0.0009638   3.542 0.000397 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 91.11  on 42  degrees of freedom
## Residual deviance: 79.54  on 40  degrees of freedom
## AIC: 161.7
## 
## Number of Fisher Scoring iterations: 5
vuong(m3, m4)
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A p-value
## Raw                  0.45662693 model1 > model2 0.32397
## AIC-corrected       -0.04745881 model2 > model1 0.48107
## BIC-corrected       -0.49135675 model2 > model1 0.31159

The voung statistic suggests that there is no significant difference between the two models, so we will go with the simpler model and keep “Ttl_Crim” out.

We are left with two variables, both with statistically significant predictors, so we can stop (http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/154-stepwise-regression-essentials-in-r/)

Re-testing for Overdispersion

library('AER')
dispersiontest(m4) #AER package
## 
##  Overdispersion test
## 
## data:  m4
## z = 2.5739, p-value = 0.005028
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   1.609865

m4 is overdispersed, but the overdispersion is not much. We can address the overdispersion by trying a different distribution (the negative binomial) or see if we ommitted a variable we should not have.

n_4<- glm.nb(DIH ~ Rural_2 + OD_Ttl, data=NC_DIH_data)
summary(n_4)
## 
## Call:
## glm.nb(formula = DIH ~ Rural_2 + OD_Ttl, data = NC_DIH_data, 
##     init.theta = 2.036737436, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7150  -1.4066  -0.3503   0.6900   1.5091  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -0.651351   0.558811  -1.166   0.2438  
## Rural_2Rural  0.695625   0.435606   1.597   0.1103  
## OD_Ttl        0.003585   0.001466   2.445   0.0145 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.0367) family taken to be 1)
## 
##     Null deviance: 55.574  on 42  degrees of freedom
## Residual deviance: 49.434  on 40  degrees of freedom
## AIC: 156.69
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.04 
##           Std. Err.:  1.18 
## 
##  2 x log-likelihood:  -148.694

The dispersion parameter (which should be around 1), is around 2. So this model is likely not a better fit. But, let’s compare using Vuong.

vuong(m4, n_4)
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A  p-value
## Raw                   -1.433781 model2 > model1 0.075817
## AIC-corrected         -1.433781 model2 > model1 0.075817
## BIC-corrected         -1.433781 model2 > model1 0.075817

If we use p=0.05 as the cut off, the NB model is not significantly different than the poisson model. And, the dispersion parameter is 2. So, the poisson model is a better fit.

“According to Anderson et al. (1994), once one has found an adequate model structure, overdispersion, values between one and three are typical. Sophisticated modeling of overdispersion may well be unnecessary at these low levels.” (See https://pdfs.semanticscholar.org/b482/2959df3eada62e37a33262a4c638fe3e870c.pdf)

Our poisson model Overdispersion is 1.6, so we should be ok, but we should acknowledge that any dispersion increases the likelihood of a Type 1 error.

Since it seems that our model structure is correct (because the model fit appears better when we consult the Voung statistic), overdispersion could be caused by an ommitted variable. Of the variables, Ttl_Crime was the closest to being statistically significant and in one model, it was significant. So, let’s add it back in and see if that gets rid of the overdispersion.

m6 <- glm (DIH ~ Rural_2 + OD_Ttl + Ttl_Crim,family=poisson(link="log"), data=NC_DIH_data)
summary(m6)
## 
## Call:
## glm(formula = DIH ~ Rural_2 + OD_Ttl + Ttl_Crim, family = poisson(link = "log"), 
##     data = NC_DIH_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3600  -1.5120  -0.3257   0.8548   2.2886  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.8266595  0.4617244  -1.790 0.073394 .  
## Rural_2Rural  0.6894277  0.3339515   2.064 0.038975 *  
## OD_Ttl        0.0033258  0.0009878   3.367 0.000761 ***
## Ttl_Crim      0.0012703  0.0009216   1.378 0.168076    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 91.110  on 42  degrees of freedom
## Residual deviance: 77.728  on 39  degrees of freedom
## AIC: 161.89
## 
## Number of Fisher Scoring iterations: 5
library('AER')
dispersiontest(m6) #AER package
## 
##  Overdispersion test
## 
## data:  m6
## z = 2.2045, p-value = 0.01374
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   1.513574

We still the same amount of dispersion.

So, MODEL 4 is the best fit and has only a small amount of overdispersion.

Interpreting Co-efficients

To interpret the co-efficients it helps to exponentiate them and this code also helps create robust standard errors.

library(sandwich)
library(msm)
## Warning: package 'msm' was built under R version 3.6.3
## 
## Attaching package: 'msm'
## The following object is masked from 'package:boot':
## 
##     cav
#robust standard errors
cov.mpd3 <- vcovHC(m4, type="HC0")
std.err3 <- sqrt(diag(cov.mpd3))
r.est3 <- cbind(Estimate= coef(m4), "Robust SE" = std.err3,
"Pr(>|z|)" = 2 * pnorm(abs(coef(m4)/std.err3), lower.tail=FALSE),
LL = coef(m4) - 1.96 * std.err3,
UL = coef(m4) + 1.96 * std.err3)

#chisquare
with(m4, cbind(res.deviance = deviance, df = df.residual,
  p = pchisq(deviance, df.residual, lower.tail=FALSE))) 
##      res.deviance df            p
## [1,]     79.53955 40 0.0001997928
#IRR
spd3 <- deltamethod(list(~ exp(x1), ~ exp(x2), ~ exp(x3)), 
                                                coef(m4), cov.mpd3)

## exponentiate old estimates dropping the p values
rexp.est3 <- exp(r.est3[, -3])
## replace SEs with estimates for exponentiated coefficients
rexp.est3[, "Robust SE"] <- spd3

rexp.est3
##               Estimate    Robust SE        LL       UL
## (Intercept)  0.5443923 0.2160076225 0.2501269 1.184851
## Rural_2Rural 1.9795836 0.6997623098 0.9900962 3.957950
## OD_Ttl       1.0034195 0.0007932326 1.0018660 1.004975

Exponentiating gives us the incident rate ration (irr).

Findings

  1. Holding the number of opioid overdose deaths constant, prosecutorial districts that include at least one rural county are expected to have a “DIH media mention” rate of 1.98 times greater than prosecutorial districts with only suburban and urban counties.
  2. For every 1 unit increase in opioid overdose deaths, there is a 0.03% increase in the number of DIH charges mentioned in the media.