For this analysis I am going to use Area Health Resource Files from the Health Resource and Service Administration (HRSA). The main outcome variable that I am going to use is Crude Death Rate which is calculated from the numerator Total Deaths and denominator Total Population (offset) in 2016 in the United States.
There are four predictors in this analysis: poverty rate in 2016, the rural urban continuum code (RUCC) from USDA, Food Stamp/SNAP Recipients, and per capital number of OB-GYN’s in the county. The RUCC is categorized from 0 to 9, with 0 defined as Central counties of metropolitan areas greater than 1 million population and gradually decresing upto category 9 which represents completely rural or less than 2500 urban population and not adjacent to a metropolitcan area.
arf<-"https://github.com/coreysparks/data/blob/master/arf2018.Rdata?raw=true"
load(url(arf))
arf18<-arf2018%>%
mutate(cofips=f00004,
coname=f00010,
state = f00011,
pop = f1198416,
death = f1255816,
pov = f1332116,
snap = f1408415,
rucc= as.factor(f0002013),
obgyn_pc=1000*(f1168416/f1198416))%>%
dplyr::select(pop, death, state, cofips, coname, rucc, pov, snap, rucc, obgyn_pc)%>%
filter(complete.cases(.))
#as.data.frame()
summary(arf18)
## pop death state
## Min. : 113 Min. : 0.0 Length:3141
## 1st Qu.: 10986 1st Qu.: 118.0 Class :character
## Median : 25775 Median : 273.0 Mode :character
## Mean : 102874 Mean : 873.6
## 3rd Qu.: 67560 3rd Qu.: 677.0
## Max. :10137915 Max. :67251.0
##
## cofips coname rucc pov
## Length:3141 Length:3141 06 :593 Min. : 3.40
## Class :character Class :character 07 :433 1st Qu.:11.40
## Mode :character Mode :character 01 :432 Median :14.90
## 09 :424 Mean :15.88
## 02 :378 3rd Qu.:19.10
## 03 :355 Max. :48.60
## (Other):526
## snap obgyn_pc
## Min. : 1 Min. :0.00000
## 1st Qu.: 1473 1st Qu.:0.00000
## Median : 3875 Median :0.02323
## Mean : 14189 Mean :0.05095
## 3rd Qu.: 9849 3rd Qu.:0.08509
## Max. :1154669 Max. :0.68213
##
arf18_m%>%
filter(!STATE %in% c("02", "15", "60", "66", "69", "72", "78"))%>%
mutate(cdr=(death/pop)*1000)%>% #Calculating Crude Death Rate
mutate(cdr_range = cut(cdr, breaks=quantile(cdr, p=seq(0,1,length.out = 6), na.rm=T ), include.lowest=T ))%>% #Making quantile map
ggplot()+
geom_sf(aes(fill=cdr_range, color=NA))+
scale_color_brewer(palette = "Blues")+
scale_fill_brewer(palette = "Blues",na.value = "grey50")+
geom_sf(data=sts, color="black")+
coord_sf(crs = 102008)+
ggtitle(label = "Crude Death Rates, United States, 2016")
arf18sub<-filter(arf18_m, is.na(death)==F)
fit_pois<- glm(death ~ offset(log(pop)) + pov + snap + obgyn_pc + rucc,
family=poisson,
data=arf18sub)
summary(fit_pois)
##
## Call:
## glm(formula = death ~ offset(log(pop)) + pov + snap + obgyn_pc +
## rucc, family = poisson, data = arf18sub)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -46.970 -1.853 0.271 2.373 51.081
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.896e+00 2.155e-03 -2271.33 <2e-16 ***
## pov 8.700e-03 1.317e-04 66.08 <2e-16 ***
## snap -2.000e-07 3.120e-09 -64.11 <2e-16 ***
## obgyn_pc -2.799e-01 8.912e-03 -31.41 <2e-16 ***
## rucc02 7.477e-02 1.727e-03 43.28 <2e-16 ***
## rucc03 1.281e-01 2.315e-03 55.33 <2e-16 ***
## rucc04 2.036e-01 3.054e-03 66.66 <2e-16 ***
## rucc05 1.373e-01 4.764e-03 28.83 <2e-16 ***
## rucc06 2.547e-01 3.009e-03 84.66 <2e-16 ***
## rucc07 2.457e-01 3.750e-03 65.51 <2e-16 ***
## rucc08 2.864e-01 6.650e-03 43.06 <2e-16 ***
## rucc09 2.555e-01 6.247e-03 40.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 169842 on 3073 degrees of freedom
## Residual deviance: 117606 on 3062 degrees of freedom
## AIC: 140681
##
## Number of Fisher Scoring iterations: 4
We can see significant effects on the death rates from the predictors with the RUCC categories showing similar levels of results, except the counties in Metropolitan areas of 250,000 to 1 million population.
scale<-sqrt(fit_pois$deviance/fit_pois$df.residual)
scale
## [1] 6.19743
This value should be 1 if the mean and variance are equal, but in this case it is 6.19 which suggests that the variance is quite large compared to the mean. We can be certain there’s an over-dispersion problem.
1-pchisq(fit_pois$deviance, df = fit_pois$df.residual)
## [1] 0
The Small p-values for this test suggest the model does not fit the data well at all.
Overdispersion doesn’t allow us to trust the model results because the test statistics of the model rely on the assumptions to be correct.
fit_q_pois<- glm(death ~ offset(log(pop)) + pov + snap + obgyn_pc + rucc,
family=quasipoisson,
data=arf18sub)
summary(fit_q_pois)
##
## Call:
## glm(formula = death ~ offset(log(pop)) + pov + snap + obgyn_pc +
## rucc, family = quasipoisson, data = arf18sub)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -46.970 -1.853 0.271 2.373 51.081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.896e+00 1.330e-02 -368.111 < 2e-16 ***
## pov 8.700e-03 8.124e-04 10.709 < 2e-16 ***
## snap -2.000e-07 1.925e-08 -10.390 < 2e-16 ***
## obgyn_pc -2.799e-01 5.499e-02 -5.090 3.79e-07 ***
## rucc02 7.477e-02 1.066e-02 7.015 2.82e-12 ***
## rucc03 1.281e-01 1.428e-02 8.968 < 2e-16 ***
## rucc04 2.036e-01 1.885e-02 10.804 < 2e-16 ***
## rucc05 1.373e-01 2.940e-02 4.672 3.11e-06 ***
## rucc06 2.547e-01 1.856e-02 13.720 < 2e-16 ***
## rucc07 2.457e-01 2.314e-02 10.617 < 2e-16 ***
## rucc08 2.864e-01 4.103e-02 6.979 3.62e-12 ***
## rucc09 2.555e-01 3.854e-02 6.630 3.96e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 38.07171)
##
## Null deviance: 169842 on 3073 degrees of freedom
## Residual deviance: 117606 on 3062 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
This output shows that the new test statistics are significant lower as they were under the Poisson assumptions, but the values are still quite big. The quasi-Poisson model should be the model of choice in this case.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
fit_nb<- glm.nb(death ~ offset(log(pop)) + pov + snap + obgyn_pc + rucc,
data=arf18sub)
summary(fit_nb)
##
## Call:
## glm.nb(formula = death ~ offset(log(pop)) + pov + snap + obgyn_pc +
## rucc, data = arf18sub, init.theta = 22.35543558, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.3316 -0.6101 0.0336 0.5697 3.2037
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.833e+00 1.484e-02 -325.599 < 2e-16 ***
## pov 1.043e-02 7.012e-04 14.880 < 2e-16 ***
## snap -3.537e-07 8.868e-08 -3.988 6.65e-05 ***
## obgyn_pc -4.475e-01 6.724e-02 -6.655 2.82e-11 ***
## rucc02 6.203e-02 1.585e-02 3.914 9.08e-05 ***
## rucc03 7.620e-02 1.657e-02 4.599 4.25e-06 ***
## rucc04 1.196e-01 1.902e-02 6.288 3.22e-10 ***
## rucc05 6.457e-02 2.580e-02 2.503 0.0123 *
## rucc06 1.620e-01 1.571e-02 10.316 < 2e-16 ***
## rucc07 1.491e-01 1.671e-02 8.924 < 2e-16 ***
## rucc08 1.745e-01 2.097e-02 8.324 < 2e-16 ***
## rucc09 1.299e-01 1.842e-02 7.052 1.77e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(22.3554) family taken to be 1)
##
## Null deviance: 4065.0 on 3073 degrees of freedom
## Residual deviance: 3257.2 on 3062 degrees of freedom
## AIC: 34715
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 22.355
## Std. Err.: 0.657
##
## 2 x log-likelihood: -34689.327
The model interpretations do not differ significantly from the Quassi-Poisson model
exp(coef(fit_nb))
## (Intercept) pov snap obgyn_pc rucc02 rucc03
## 0.00796074 1.01048796 0.99999965 0.63923101 1.06399177 1.07918130
## rucc04 rucc05 rucc06 rucc07 rucc08 rucc09
## 1.12707031 1.06670262 1.17588410 1.16076535 1.19068434 1.13872165
The Negative Binomial model shows even higher levels of relative risk in the undeserved areas, with a 6.35% increase in risk for counties are totally under-served, and 5.16% difference for counties that are partially under-served. This actually makes more sense that the Poisson model results:
exp(coef(fit_pois))
## (Intercept) pov snap obgyn_pc rucc02 rucc03
## 0.007479576 1.008738357 0.999999800 0.755843824 1.077635656 1.136638818
## rucc04 rucc05 rucc06 rucc07 rucc08 rucc09
## 1.225831207 1.147213324 1.290095892 1.278462952 1.331559344 1.291149513
The counties that have higher poverty, receive snap benefits, and lower per capital obgyn, show greater risk of death.
fit_bin<-glm(cbind(death, pop)~ pov + snap + obgyn_pc + rucc,
family=binomial,
data=arf18sub)
summary(fit_bin)
##
## Call:
## glm(formula = cbind(death, pop) ~ pov + snap + obgyn_pc + rucc,
## family = binomial, data = arf18sub)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -46.790 -1.843 0.270 2.360 50.841
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.896e+00 2.165e-03 -2261.31 <2e-16 ***
## pov 8.708e-03 1.323e-04 65.81 <2e-16 ***
## snap -2.000e-07 3.131e-09 -63.88 <2e-16 ***
## obgyn_pc -2.798e-01 8.948e-03 -31.27 <2e-16 ***
## rucc02 7.477e-02 1.735e-03 43.10 <2e-16 ***
## rucc03 1.281e-01 2.325e-03 55.08 <2e-16 ***
## rucc04 2.036e-01 3.069e-03 66.34 <2e-16 ***
## rucc05 1.373e-01 4.787e-03 28.68 <2e-16 ***
## rucc06 2.547e-01 3.024e-03 84.23 <2e-16 ***
## rucc07 2.456e-01 3.770e-03 65.16 <2e-16 ***
## rucc08 2.864e-01 6.687e-03 42.82 <2e-16 ***
## rucc09 2.555e-01 6.281e-03 40.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 168375 on 3073 degrees of freedom
## Residual deviance: 116593 on 3062 degrees of freedom
## AIC: 139637
##
## Number of Fisher Scoring iterations: 4
The binomial model show similar results compared to the Poisson model. The only difference is the AIC for the binomial model is slightly lower, suggesting that the binomial distribution is preferred in this analysis.
library(spdep)
nbs<-knearneigh(coordinates(as(arf18sub, "Spatial")), k = 4, longlat = T)
nbs<-knn2nb(nbs, sym = T)
us.wt4<-nb2listw(nbs, style = "W")
lm.morantest(fit_pois, listw = us.wt4)
##
## Global Moran I for regression residuals
##
## data:
## model: glm(formula = death ~ offset(log(pop)) + pov + snap +
## obgyn_pc + rucc, family = poisson, data = arf18sub)
## weights: us.wt4
##
## Moran I statistic standard deviate = 30.131, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 3.628627e-01 -3.896273e-06 1.450372e-04
lm.morantest(fit_nb, listw = us.wt4)
##
## Global Moran I for regression residuals
##
## data:
## model: glm.nb(formula = death ~ offset(log(pop)) + pov + snap +
## obgyn_pc + rucc, data = arf18sub, init.theta = 22.35543558, link =
## log)
## weights: us.wt4
##
## Moran I statistic standard deviate = 32.813, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 3.949772e-01 -5.768103e-05 1.449382e-04
Moran’s I is pretty close for both the poisson and negative binomial models.
#we have to provide the theta value for the NB family, I got this from the model fit above (fit_nb)
me.fit<-ME(death ~ offset(log(pop)) + pov + snap + obgyn_pc + rucc, family=negative.binomial(22.355),
data=arf18sub, listw = us.wt4, verbose=T,alpha=.2 )
## eV[,2], I: 0.1215512 ZI: NA, pr(ZI): 0.01
## eV[,3], I: 0.1020494 ZI: NA, pr(ZI): 0.01
## eV[,53], I: 0.08416332 ZI: NA, pr(ZI): 0.01
## eV[,2864], I: 0.07226024 ZI: NA, pr(ZI): 0.01
## eV[,83], I: 0.06020284 ZI: NA, pr(ZI): 0.02
## eV[,2915], I: 0.04901331 ZI: NA, pr(ZI): 0.01
## eV[,1], I: 0.03904555 ZI: NA, pr(ZI): 0.01
## eV[,2734], I: 0.02826626 ZI: NA, pr(ZI): 0.05
## eV[,105], I: 0.01750495 ZI: NA, pr(ZI): 0.04
## eV[,1994], I: 0.007031315 ZI: NA, pr(ZI): 0.3
me.fit
## Eigenvector ZI pr(ZI)
## 0 NA NA 0.01
## 1 2 NA 0.01
## 2 3 NA 0.01
## 3 53 NA 0.01
## 4 2864 NA 0.01
## 5 83 NA 0.02
## 6 2915 NA 0.01
## 7 1 NA 0.01
## 8 2734 NA 0.05
## 9 105 NA 0.04
## 10 1994 NA 0.30
From this above output we can we that the level of autocorrelation is no longer significant.
fits<-data.frame(fitted(me.fit))
arf18sub$me2<-fits$vec2
arf18sub$me3<-fits$vec3
arf18sub%>%
filter(!STATE %in% c("02", "15", "60", "66", "69", "72", "78"))%>%
ggplot()+
geom_sf(aes(fill=me2, color=me2))+
scale_color_viridis_c()+
scale_fill_viridis_c(na.value = "grey50")+
geom_sf(data=sts, color="black")+
coord_sf(crs = 102008)+
ggtitle(label = "Spatial Distribution Moran Eigenvector 2")
arf18sub%>%
filter(!STATE %in% c("02", "15", "60", "66", "69", "72", "78"))%>%
ggplot()+
geom_sf(aes(fill=me3, color=me3))+
scale_color_viridis_c()+
scale_fill_viridis_c(na.value = "grey50")+
geom_sf(data=sts, color="black")+
coord_sf(crs = 102008)+
ggtitle(label = "Spatial Distribution Moran Eigenvector 3")
clean.nb<-glm.nb(death ~ offset(log(pop)) + pov + snap + obgyn_pc + rucc + fitted(me.fit), arf18sub)
summary(clean.nb)
##
## Call:
## glm.nb(formula = death ~ offset(log(pop)) + pov + snap + obgyn_pc +
## rucc + fitted(me.fit), data = arf18sub, init.theta = 24.44975212,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.2914 -0.6254 0.0166 0.6010 3.8065
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.845e+00 1.444e-02 -335.574 < 2e-16 ***
## pov 1.053e-02 7.051e-04 14.934 < 2e-16 ***
## snap -3.722e-07 8.526e-08 -4.366 1.27e-05 ***
## obgyn_pc -5.540e-01 6.544e-02 -8.465 < 2e-16 ***
## rucc02 6.638e-02 1.525e-02 4.353 1.34e-05 ***
## rucc03 8.251e-02 1.607e-02 5.134 2.84e-07 ***
## rucc04 1.314e-01 1.829e-02 7.187 6.64e-13 ***
## rucc05 1.027e-01 2.508e-02 4.097 4.19e-05 ***
## rucc06 1.752e-01 1.522e-02 11.513 < 2e-16 ***
## rucc07 1.716e-01 1.641e-02 10.458 < 2e-16 ***
## rucc08 1.849e-01 2.037e-02 9.075 < 2e-16 ***
## rucc09 1.546e-01 1.821e-02 8.491 < 2e-16 ***
## fitted(me.fit)vec2 -7.550e-01 2.153e-01 -3.506 0.000454 ***
## fitted(me.fit)vec3 -2.174e+00 2.214e-01 -9.821 < 2e-16 ***
## fitted(me.fit)vec53 -7.788e-01 2.197e-01 -3.544 0.000394 ***
## fitted(me.fit)vec2864 -3.460e-01 2.190e-01 -1.580 0.114170
## fitted(me.fit)vec83 8.001e-01 2.173e-01 3.682 0.000232 ***
## fitted(me.fit)vec2915 -2.466e-01 2.152e-01 -1.146 0.251886
## fitted(me.fit)vec1 -2.099e+00 2.272e-01 -9.236 < 2e-16 ***
## fitted(me.fit)vec2734 2.275e-01 2.286e-01 0.995 0.319623
## fitted(me.fit)vec105 -6.067e-01 2.179e-01 -2.785 0.005354 **
## fitted(me.fit)vec1994 -4.232e-01 2.218e-01 -1.908 0.056371 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(24.4498) family taken to be 1)
##
## Null deviance: 4387.8 on 3073 degrees of freedom
## Residual deviance: 3272.7 on 3052 degrees of freedom
## AIC: 34510
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 24.450
## Std. Err.: 0.728
##
## 2 x log-likelihood: -34463.896
lm.morantest(clean.nb, listw=us.wt4)
##
## Global Moran I for regression residuals
##
## data:
## model: glm.nb(formula = death ~ offset(log(pop)) + pov + snap +
## obgyn_pc + rucc + fitted(me.fit), data = arf18sub, init.theta =
## 24.44975212, link = log)
## weights: us.wt4
##
## Moran I statistic standard deviate = 28.795, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.3475461550 -0.0001083410 0.0001457659
The Moran’s I value is still higher, which means the cleaning did not really help, that means it did not filter out much of spatial autocorrelation.
AIC(fit_pois)
## [1] 140681.3
AIC(fit_bin)
## [1] 139637.4
AIC(fit_nb)
## [1] 34715.33
AIC(clean.nb)
## [1] 34509.9
The clean negative binomial model has the lowest AIC, which means we have been able to decrease the autocorrelation in the model, but only marginally.