Applications of count data modeling

Below I will load the 2017-2018 Area Resource File from github and create rename some variables. The data have 7,277 variables for 3,230 county geographies.

arf<-"https://github.com/coreysparks/data/blob/master/arf2018.Rdata?raw=true"
load(url(arf))

1. Define a count outcome for the dataset of your choosing

For this example, I will use the number of suicide deaths for the three year period, 2014 to 2016.

For my analysis, I will use the number of suicides and the total number of deaths as my numerator and denominator.

I also rename the variable that I will use as predictor for the analysis: the mental healthcare shortage area code from 2010. Then I filter to have non-missing cases, which reduces the number of counties.

1.a State a research question about your outcome.

How does a shortage of mental health care doctors in a county affect the suicide rate?

1.b Is an offset term necessary? why or why not?

Yes. Since the suicides are only a portion of the population of deaths. The outcome is a rate of suicides per capita deaths.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
arf2018dr<-arf2018%>%
  mutate(cofips=f00004, 
         coname=f00010,
         state = f00011,
         deaths1416=f1194114,
         sui1416=f1316614,
         hpsa10mh= as.factor(f1249210))%>%
  dplyr::select(deaths1416, sui1416, state, cofips, coname, hpsa10mh)%>%
  filter(complete.cases(.))%>%
  as.data.frame()


head(arf2018dr)
summary(arf2018dr)
##    deaths1416       sui1416          state              cofips         
##  Min.   :   53   Min.   : 10.00   Length:907         Length:907        
##  1st Qu.:  811   1st Qu.: 14.00   Class :character   Class :character  
##  Median : 1253   Median : 22.00   Mode  :character   Mode  :character  
##  Mean   : 2428   Mean   : 40.24                                        
##  3rd Qu.: 2512   3rd Qu.: 43.00                                        
##  Max.   :61460   Max.   :828.00                                        
##     coname          hpsa10mh
##  Length:907          :  0   
##  Class :character   0:174   
##  Mode  :character   1:375   
##                     2:358   
##                             
## 

Here, I do a basic map of the outcome variable, and see the highest rates of suicide deaths in California.

library(tigris)
## To enable 
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(ggplot2)

options(tigris_class="sf")
usco<-counties(cb=T, year=2015, state = "CA")
usco$cofips<-usco$GEOID
sts<-states(cb = T, year=2015)
sts<-st_boundary(sts)%>%
  filter(!STATEFP %in% c("02", "15", "60", "66", "69", "72", "78"))

arf2018dr_m<-geo_join(usco, arf2018dr, by_sp="cofips", by_df="cofips",how="left" )
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
arf2018dr_m%>%
  filter(!STATEFP %in% c("02", "15", "60", "66", "69", "72", "78"))%>%
  mutate(suirate=sui1416/deaths1416)%>%
  mutate(suic_group = cut(suirate, breaks=quantile(suirate, p=seq(0,1,length.out = 4), na.rm=T ), include.lowest=T ))%>%
  ggplot()+
  geom_sf(aes(fill=suic_group, color=NA))+
  scale_color_brewer(palette = "Reds")+
  scale_fill_brewer(palette = "Reds",na.value = "grey50")+
  geom_sf(data=sts["STATEFP"=="06",], color="black")+
  coord_sf(crs =3083)+
  ggtitle(label = "Proportion of deaths that were suicides, 2014-2016")

2. Consider a Poisson regression model for the outcome

First, I will us the Poisson model. To include the offset term for the number of deaths in each county, I use the offset() function within the model.

In this model, I use the hpsa10mh variable as the independent variable. This variable indicates whether a county has a shortage of mental health care doctors, it has three levels, 0 = not a shortage area, 1= the whole county is a shortage area, and 2= part of the county is a shortage area.

The model indicates that counties that are either partial or total shortage areas have lower suicide rates.

arf2018drsub<-filter(arf2018dr, is.na(sui1416)==F)


fit_pois<- glm(sui1416 ~ offset(log(deaths1416)) + hpsa10mh, 
               family=poisson, 
               data=arf2018drsub)
summary(fit_pois)
## 
## Call:
## glm(formula = sui1416 ~ offset(log(deaths1416)) + hpsa10mh, family = poisson, 
##     data = arf2018drsub)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -10.0779   -0.7058    0.2595    1.1901   11.2499  
## 
## Coefficients:
##             Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -4.05635    0.01428 -284.032  < 2e-16 ***
## hpsa10mh1   -0.02417    0.01679   -1.440     0.15    
## hpsa10mh2   -0.06775    0.01604   -4.224  2.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3481.2  on 906  degrees of freedom
## Residual deviance: 3456.1  on 904  degrees of freedom
## AIC: 8108.8
## 
## Number of Fisher Scoring iterations: 4

If I exponentiate the coefficients from the model, I can get the risk ratios:

exp(coef(fit_pois))
## (Intercept)   hpsa10mh1   hpsa10mh2 
##  0.01731206  0.97611682  0.93449120

In this case, the hpsa10mh, 1, or whole shortage area counties, have a mean suicide rate that is 2.4% lower than counties that are not shortage areas, and counties that are partial shortage areas, hpsa10mh, 2 have a mean suicide rate that is 6.6% lower.

2.a Evaluate the level of dispersion in the outcome

A check of this is to examine the ratio of the residual deviance and the residual degrees of freedom:

scale<-sqrt(fit_pois$deviance/fit_pois$df.residual)
scale
## [1] 1.955289

This value should be 1 if the mean and variance are equal, but in this case it is 1.96. This suggests that the variance is twice as large as the mean.

There is also a test statistic that can be used, referred to as a goodness of fit statistic. This compares the model deviance to a \(\chi^2\) distribution with degrees of freedom equal to the residual degrees of freedom. Small p-values for this test, suggest the model does not fit the data.

1-pchisq(fit_pois$deviance, df = fit_pois$df.residual)
## [1] 0

2.b Is the Poisson model a good choice?

No.In this case, the statistic has a p-value of < .001, indicating the model does not fit the data well and suffers from overdispersion.

3. Consider a Negative binomial model

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
fit_nb<- glm.nb(sui1416 ~ offset(log(deaths1416)) + hpsa10mh, 
               data=arf2018drsub)
summary(fit_nb)
## 
## Call:
## glm.nb(formula = sui1416 ~ offset(log(deaths1416)) + hpsa10mh, 
##     data = arf2018drsub, init.theta = 16.72732263, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2377  -0.6647  -0.0655   0.5412   4.9180  
## 
## Coefficients:
##             Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -4.01460    0.02462 -163.051   <2e-16 ***
## hpsa10mh1    0.03946    0.02999    1.316   0.1883    
## hpsa10mh2   -0.06510    0.02930   -2.222   0.0263 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(16.7273) family taken to be 1)
## 
##     Null deviance: 890.36  on 906  degrees of freedom
## Residual deviance: 870.35  on 904  degrees of freedom
## AIC: 6453.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  16.73 
##           Std. Err.:  1.17 
## 
##  2 x log-likelihood:  -6445.121

Again, I see the same overall interpretation of the model effects, but the risk ratios are different compared to the Poisson model:

exp(coef(fit_nb))
## (Intercept)   hpsa10mh1   hpsa10mh2 
##  0.01805011  1.04024374  0.93697422

Compare to the poisson’s estimates of risk

exp(coef(fit_pois))
## (Intercept)   hpsa10mh1   hpsa10mh2 
##  0.01731206  0.97611682  0.93449120

4. Compare the model fits of the alternative models using AIC

AIC(fit_pois)
## [1] 8108.769
AIC(fit_nb)
## [1] 6453.121