Below I will load the 2017-2018 Area Resource File from github and create rename some variables. The data have 7,277 variables for 3,230 county geographies.
arf<-"https://github.com/coreysparks/data/blob/master/arf2018.Rdata?raw=true"
load(url(arf))
For this example, I will use the number of suicide deaths for the three year period, 2014 to 2016.
For my analysis, I will use the number of suicides and the total number of deaths as my numerator and denominator.
I also rename the variable that I will use as predictor for the analysis: the mental healthcare shortage area code from 2010. Then I filter to have non-missing cases, which reduces the number of counties.
How does a shortage of mental health care doctors in a county affect the suicide rate?
Yes. Since the suicides are only a portion of the population of deaths. The outcome is a rate of suicides per capita deaths.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
arf2018dr<-arf2018%>%
mutate(cofips=f00004,
coname=f00010,
state = f00011,
deaths1416=f1194114,
sui1416=f1316614,
hpsa10mh= as.factor(f1249210))%>%
dplyr::select(deaths1416, sui1416, state, cofips, coname, hpsa10mh)%>%
filter(complete.cases(.))%>%
as.data.frame()
head(arf2018dr)
summary(arf2018dr)
## deaths1416 sui1416 state cofips
## Min. : 53 Min. : 10.00 Length:907 Length:907
## 1st Qu.: 811 1st Qu.: 14.00 Class :character Class :character
## Median : 1253 Median : 22.00 Mode :character Mode :character
## Mean : 2428 Mean : 40.24
## 3rd Qu.: 2512 3rd Qu.: 43.00
## Max. :61460 Max. :828.00
## coname hpsa10mh
## Length:907 : 0
## Class :character 0:174
## Mode :character 1:375
## 2:358
##
##
Here, I do a basic map of the outcome variable, and see the highest rates of suicide deaths in California.
library(tigris)
## To enable
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(ggplot2)
options(tigris_class="sf")
usco<-counties(cb=T, year=2015, state = "CA")
usco$cofips<-usco$GEOID
sts<-states(cb = T, year=2015)
sts<-st_boundary(sts)%>%
filter(!STATEFP %in% c("02", "15", "60", "66", "69", "72", "78"))
arf2018dr_m<-geo_join(usco, arf2018dr, by_sp="cofips", by_df="cofips",how="left" )
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
arf2018dr_m%>%
filter(!STATEFP %in% c("02", "15", "60", "66", "69", "72", "78"))%>%
mutate(suirate=sui1416/deaths1416)%>%
mutate(suic_group = cut(suirate, breaks=quantile(suirate, p=seq(0,1,length.out = 4), na.rm=T ), include.lowest=T ))%>%
ggplot()+
geom_sf(aes(fill=suic_group, color=NA))+
scale_color_brewer(palette = "Reds")+
scale_fill_brewer(palette = "Reds",na.value = "grey50")+
geom_sf(data=sts["STATEFP"=="06",], color="black")+
coord_sf(crs =3083)+
ggtitle(label = "Proportion of deaths that were suicides, 2014-2016")
First, I will us the Poisson model. To include the offset term for the number of deaths in each county, I use the offset() function within the model.
In this model, I use the hpsa10mh variable as the independent variable. This variable indicates whether a county has a shortage of mental health care doctors, it has three levels, 0 = not a shortage area, 1= the whole county is a shortage area, and 2= part of the county is a shortage area.
The model indicates that counties that are either partial or total shortage areas have lower suicide rates.
arf2018drsub<-filter(arf2018dr, is.na(sui1416)==F)
fit_pois<- glm(sui1416 ~ offset(log(deaths1416)) + hpsa10mh,
family=poisson,
data=arf2018drsub)
summary(fit_pois)
##
## Call:
## glm(formula = sui1416 ~ offset(log(deaths1416)) + hpsa10mh, family = poisson,
## data = arf2018drsub)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -10.0779 -0.7058 0.2595 1.1901 11.2499
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.05635 0.01428 -284.032 < 2e-16 ***
## hpsa10mh1 -0.02417 0.01679 -1.440 0.15
## hpsa10mh2 -0.06775 0.01604 -4.224 2.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3481.2 on 906 degrees of freedom
## Residual deviance: 3456.1 on 904 degrees of freedom
## AIC: 8108.8
##
## Number of Fisher Scoring iterations: 4
If I exponentiate the coefficients from the model, I can get the risk ratios:
exp(coef(fit_pois))
## (Intercept) hpsa10mh1 hpsa10mh2
## 0.01731206 0.97611682 0.93449120
In this case, the hpsa10mh, 1, or whole shortage area counties, have a mean suicide rate that is 2.4% lower than counties that are not shortage areas, and counties that are partial shortage areas, hpsa10mh, 2 have a mean suicide rate that is 6.6% lower.
A check of this is to examine the ratio of the residual deviance and the residual degrees of freedom:
scale<-sqrt(fit_pois$deviance/fit_pois$df.residual)
scale
## [1] 1.955289
This value should be 1 if the mean and variance are equal, but in this case it is 1.96. This suggests that the variance is twice as large as the mean.
There is also a test statistic that can be used, referred to as a goodness of fit statistic. This compares the model deviance to a \(\chi^2\) distribution with degrees of freedom equal to the residual degrees of freedom. Small p-values for this test, suggest the model does not fit the data.
1-pchisq(fit_pois$deviance, df = fit_pois$df.residual)
## [1] 0
No.In this case, the statistic has a p-value of < .001, indicating the model does not fit the data well and suffers from overdispersion.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
fit_nb<- glm.nb(sui1416 ~ offset(log(deaths1416)) + hpsa10mh,
data=arf2018drsub)
summary(fit_nb)
##
## Call:
## glm.nb(formula = sui1416 ~ offset(log(deaths1416)) + hpsa10mh,
## data = arf2018drsub, init.theta = 16.72732263, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2377 -0.6647 -0.0655 0.5412 4.9180
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.01460 0.02462 -163.051 <2e-16 ***
## hpsa10mh1 0.03946 0.02999 1.316 0.1883
## hpsa10mh2 -0.06510 0.02930 -2.222 0.0263 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(16.7273) family taken to be 1)
##
## Null deviance: 890.36 on 906 degrees of freedom
## Residual deviance: 870.35 on 904 degrees of freedom
## AIC: 6453.1
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 16.73
## Std. Err.: 1.17
##
## 2 x log-likelihood: -6445.121
Again, I see the same overall interpretation of the model effects, but the risk ratios are different compared to the Poisson model:
exp(coef(fit_nb))
## (Intercept) hpsa10mh1 hpsa10mh2
## 0.01805011 1.04024374 0.93697422
Compare to the poisson’s estimates of risk
exp(coef(fit_pois))
## (Intercept) hpsa10mh1 hpsa10mh2
## 0.01731206 0.97611682 0.93449120
AIC(fit_pois)
## [1] 8108.769
AIC(fit_nb)
## [1] 6453.121