For this analysis I am going to use Area Health Resource Files from the Health Resource and Service Administration (HRSA). The main outcome variable that I am going to use is Crude Death Rate which is calculated from the numerator Total Deaths and denominator Total Population (offset) in 2016 in the United States.

There are four predictors in this analysis: poverty rate in 2016, the rural urban continuum code (RUCC) from USDA, Food Stamp/SNAP Recipients, and per capital number of OB-GYN’s in the county. The RUCC is categorized from 0 to 9, with 0 defined as Central counties of metropolitan areas greater than 1 million population and gradually decresing upto category 9 which represents completely rural or less than 2500 urban population and not adjacent to a metropolitcan area.

Getting the Data

arf<-"https://github.com/coreysparks/data/blob/master/arf2018.Rdata?raw=true"
load(url(arf))

Recoding variables

arf18<-arf2018%>%
  mutate(cofips=f00004, 
         coname=f00010,
         state = f00011,
         pop = f1198416,
         death = f1255816,
         pov = f1332116,
         snap = f1408415,
         rucc= as.factor(f0002013),
         obgyn_pc=1000*(f1168416/f1198416))%>%
dplyr::select(pop, death, state, cofips, coname, rucc, pov, snap, rucc, obgyn_pc)%>%
  filter(complete.cases(.))
  #as.data.frame()

summary(arf18)

##       pop               death            state          
##  Min.   :     113   Min.   :    0.0   Length:3141       
##  1st Qu.:   10986   1st Qu.:  118.0   Class :character  
##  Median :   25775   Median :  273.0   Mode  :character  
##  Mean   :  102874   Mean   :  873.6                     
##  3rd Qu.:   67560   3rd Qu.:  677.0                     
##  Max.   :10137915   Max.   :67251.0                     
##                                                         
##     cofips             coname               rucc          pov       
##  Length:3141        Length:3141        06     :593   Min.   : 3.40  
##  Class :character   Class :character   07     :433   1st Qu.:11.40  
##  Mode  :character   Mode  :character   01     :432   Median :14.90  
##                                        09     :424   Mean   :15.88  
##                                        02     :378   3rd Qu.:19.10  
##                                        03     :355   Max.   :48.60  
##                                        (Other):526                  
##       snap            obgyn_pc      
##  Min.   :      1   Min.   :0.00000  
##  1st Qu.:   1473   1st Qu.:0.00000  
##  Median :   3875   Median :0.02323  
##  Mean   :  14189   Mean   :0.05095  
##  3rd Qu.:   9849   3rd Qu.:0.08509  
##  Max.   :1154669   Max.   :0.68213  
##

Mapping the Crude Death Rate in the United States, 2016

arf18_m%>%
  filter(!STATE %in% c("02", "15", "60", "66", "69", "72", "78"))%>%
  mutate(cdr=(death/pop)*1000)%>% #Calculating Crude Death Rate
  mutate(cdr_range = cut(cdr, breaks=quantile(cdr, p=seq(0,1,length.out = 6), na.rm=T ), include.lowest=T ))%>% #Making quantile map
  ggplot()+
  geom_sf(aes(fill=cdr_range, color=NA))+
  scale_color_brewer(palette = "Blues")+
  scale_fill_brewer(palette = "Blues",na.value = "grey50")+
  geom_sf(data=sts, color="black")+
  coord_sf(crs = 102008)+
  ggtitle(label = "Crude Death Rates, United States, 2016")

Applications of count data models

Poisson Model

arf18sub<-filter(arf18_m, is.na(death)==F)

fit_pois<- glm(death ~ offset(log(pop)) + pov + snap + obgyn_pc + rucc, 
               family=poisson, 
               data=arf18sub)
summary(fit_pois)

## 
## Call:
## glm(formula = death ~ offset(log(pop)) + pov + snap + obgyn_pc + 
##     rucc, family = poisson, data = arf18sub)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -46.970   -1.853    0.271    2.373   51.081  
## 
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -4.896e+00  2.155e-03 -2271.33   <2e-16 ***
## pov          8.700e-03  1.317e-04    66.08   <2e-16 ***
## snap        -2.000e-07  3.120e-09   -64.11   <2e-16 ***
## obgyn_pc    -2.799e-01  8.912e-03   -31.41   <2e-16 ***
## rucc02       7.477e-02  1.727e-03    43.28   <2e-16 ***
## rucc03       1.281e-01  2.315e-03    55.33   <2e-16 ***
## rucc04       2.036e-01  3.054e-03    66.66   <2e-16 ***
## rucc05       1.373e-01  4.764e-03    28.83   <2e-16 ***
## rucc06       2.547e-01  3.009e-03    84.66   <2e-16 ***
## rucc07       2.457e-01  3.750e-03    65.51   <2e-16 ***
## rucc08       2.864e-01  6.650e-03    43.06   <2e-16 ***
## rucc09       2.555e-01  6.247e-03    40.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 169842  on 3073  degrees of freedom
## Residual deviance: 117606  on 3062  degrees of freedom
## AIC: 140681
## 
## Number of Fisher Scoring iterations: 4

We can see significant effects on the death rates from the predictors with the RUCC categories showing similar levels of results, except the counties in Metropolitan areas of 250,000 to 1 million population.

Checking out over-dispersion

scale<-sqrt(fit_pois$deviance/fit_pois$df.residual)
scale

## [1] 6.19743

This value should be 1 if the mean and variance are equal, but in this case it is 6.19 which suggests that the variance is quite large compared to the mean. We can be certain there’s an over-dispersion problem.

Goodness of fit statistic

1-pchisq(fit_pois$deviance, df = fit_pois$df.residual)

## [1] 0

The Small p-values for this test suggest the model does not fit the data well at all.

Overdispersion doesn’t allow us to trust the model results because the test statistics of the model rely on the assumptions to be correct.

Alternatives to the Poisson: Quasi-Poisson model

fit_q_pois<- glm(death ~ offset(log(pop)) + pov + snap + obgyn_pc + rucc, 
               family=quasipoisson, 
               data=arf18sub)
summary(fit_q_pois)

## 
## Call:
## glm(formula = death ~ offset(log(pop)) + pov + snap + obgyn_pc + 
##     rucc, family = quasipoisson, data = arf18sub)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -46.970   -1.853    0.271    2.373   51.081  
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -4.896e+00  1.330e-02 -368.111  < 2e-16 ***
## pov          8.700e-03  8.124e-04   10.709  < 2e-16 ***
## snap        -2.000e-07  1.925e-08  -10.390  < 2e-16 ***
## obgyn_pc    -2.799e-01  5.499e-02   -5.090 3.79e-07 ***
## rucc02       7.477e-02  1.066e-02    7.015 2.82e-12 ***
## rucc03       1.281e-01  1.428e-02    8.968  < 2e-16 ***
## rucc04       2.036e-01  1.885e-02   10.804  < 2e-16 ***
## rucc05       1.373e-01  2.940e-02    4.672 3.11e-06 ***
## rucc06       2.547e-01  1.856e-02   13.720  < 2e-16 ***
## rucc07       2.457e-01  2.314e-02   10.617  < 2e-16 ***
## rucc08       2.864e-01  4.103e-02    6.979 3.62e-12 ***
## rucc09       2.555e-01  3.854e-02    6.630 3.96e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 38.07171)
## 
##     Null deviance: 169842  on 3073  degrees of freedom
## Residual deviance: 117606  on 3062  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

This output shows that the new test statistics are significant lower as they were under the Poisson assumptions, but the values are still quite big. The quasi-Poisson model should be the model of choice in this case.

Alternatives to the Poisson: Negative Binomial

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

fit_nb<- glm.nb(death ~ offset(log(pop)) + pov + snap + obgyn_pc + rucc, 
               data=arf18sub)
summary(fit_nb)

## 
## Call:
## glm.nb(formula = death ~ offset(log(pop)) + pov + snap + obgyn_pc + 
##     rucc, data = arf18sub, init.theta = 22.35543558, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.3316  -0.6101   0.0336   0.5697   3.2037  
## 
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -4.833e+00  1.484e-02 -325.599  < 2e-16 ***
## pov          1.043e-02  7.012e-04   14.880  < 2e-16 ***
## snap        -3.537e-07  8.868e-08   -3.988 6.65e-05 ***
## obgyn_pc    -4.475e-01  6.724e-02   -6.655 2.82e-11 ***
## rucc02       6.203e-02  1.585e-02    3.914 9.08e-05 ***
## rucc03       7.620e-02  1.657e-02    4.599 4.25e-06 ***
## rucc04       1.196e-01  1.902e-02    6.288 3.22e-10 ***
## rucc05       6.457e-02  2.580e-02    2.503   0.0123 *  
## rucc06       1.620e-01  1.571e-02   10.316  < 2e-16 ***
## rucc07       1.491e-01  1.671e-02    8.924  < 2e-16 ***
## rucc08       1.745e-01  2.097e-02    8.324  < 2e-16 ***
## rucc09       1.299e-01  1.842e-02    7.052 1.77e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(22.3554) family taken to be 1)
## 
##     Null deviance: 4065.0  on 3073  degrees of freedom
## Residual deviance: 3257.2  on 3062  degrees of freedom
## AIC: 34715
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  22.355 
##           Std. Err.:  0.657 
## 
##  2 x log-likelihood:  -34689.327

The model interpretations do not differ significantly from the Quassi-Poisson model

exp(coef(fit_nb))

## (Intercept)         pov        snap    obgyn_pc      rucc02      rucc03 
##  0.00796074  1.01048796  0.99999965  0.63923101  1.06399177  1.07918130 
##      rucc04      rucc05      rucc06      rucc07      rucc08      rucc09 
##  1.12707031  1.06670262  1.17588410  1.16076535  1.19068434  1.13872165

The Negative Binomial model shows even higher levels of relative risk in the undeserved areas, with a 6.35% increase in risk for counties are totally under-served, and 5.16% difference for counties that are partially under-served. This actually makes more sense that the Poisson model results:

Risk Ratios

exp(coef(fit_pois))

## (Intercept)         pov        snap    obgyn_pc      rucc02      rucc03 
## 0.007479576 1.008738357 0.999999800 0.755843824 1.077635656 1.136638818 
##      rucc04      rucc05      rucc06      rucc07      rucc08      rucc09 
## 1.225831207 1.147213324 1.290095892 1.278462952 1.331559344 1.291149513

The counties that have higher poverty, receive snap benefits, and lower per capital obgyn, show greater risk of death.

Binomial count model

fit_bin<-glm(cbind(death, pop)~ pov + snap + obgyn_pc + rucc,  
               family=binomial, 
               data=arf18sub)
summary(fit_bin)

## 
## Call:
## glm(formula = cbind(death, pop) ~ pov + snap + obgyn_pc + rucc, 
##     family = binomial, data = arf18sub)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -46.790   -1.843    0.270    2.360   50.841  
## 
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -4.896e+00  2.165e-03 -2261.31   <2e-16 ***
## pov          8.708e-03  1.323e-04    65.81   <2e-16 ***
## snap        -2.000e-07  3.131e-09   -63.88   <2e-16 ***
## obgyn_pc    -2.798e-01  8.948e-03   -31.27   <2e-16 ***
## rucc02       7.477e-02  1.735e-03    43.10   <2e-16 ***
## rucc03       1.281e-01  2.325e-03    55.08   <2e-16 ***
## rucc04       2.036e-01  3.069e-03    66.34   <2e-16 ***
## rucc05       1.373e-01  4.787e-03    28.68   <2e-16 ***
## rucc06       2.547e-01  3.024e-03    84.23   <2e-16 ***
## rucc07       2.456e-01  3.770e-03    65.16   <2e-16 ***
## rucc08       2.864e-01  6.687e-03    42.82   <2e-16 ***
## rucc09       2.555e-01  6.281e-03    40.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 168375  on 3073  degrees of freedom
## Residual deviance: 116593  on 3062  degrees of freedom
## AIC: 139637
## 
## Number of Fisher Scoring iterations: 4

The binomial model show similar results compared to the Poisson model. The only difference is the AIC for the binomial model is slightly lower, suggesting that the binomial distribution is preferred in this analysis.

Filtering GLMs for Spatial Autocorrelation

Autocorrelation in GLMs

library(spdep)
nbs<-knearneigh(coordinates(as(arf18sub, "Spatial")), k = 4, longlat = T)
nbs<-knn2nb(nbs, sym = T)
us.wt4<-nb2listw(nbs, style = "W")
lm.morantest(fit_pois, listw = us.wt4)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm(formula = death ~ offset(log(pop)) + pov + snap +
## obgyn_pc + rucc, family = poisson, data = arf18sub)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 30.131, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     3.628627e-01    -3.896273e-06     1.450372e-04

lm.morantest(fit_nb, listw = us.wt4)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm.nb(formula = death ~ offset(log(pop)) + pov + snap +
## obgyn_pc + rucc, data = arf18sub, init.theta = 22.35543558, link =
## log)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 32.813, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     3.949772e-01    -5.768103e-05     1.449382e-04

Moran’s I is pretty close for both the poisson and negative binomial models.

Spatial nuisance and GLMs

#we have to provide the theta value for the NB family, I got this from the model fit above (fit_nb)
me.fit<-ME(death ~ offset(log(pop)) + pov + snap + obgyn_pc + rucc, family=negative.binomial(22.355),
           data=arf18sub, listw = us.wt4, verbose=T,alpha=.2 )

## eV[,2], I: 0.1215512 ZI: NA, pr(ZI): 0.01
## eV[,3], I: 0.1020494 ZI: NA, pr(ZI): 0.01
## eV[,53], I: 0.08416332 ZI: NA, pr(ZI): 0.01
## eV[,2864], I: 0.07226024 ZI: NA, pr(ZI): 0.01
## eV[,83], I: 0.06020284 ZI: NA, pr(ZI): 0.02
## eV[,2915], I: 0.04901331 ZI: NA, pr(ZI): 0.01
## eV[,1], I: 0.03904555 ZI: NA, pr(ZI): 0.01
## eV[,2734], I: 0.02826626 ZI: NA, pr(ZI): 0.05
## eV[,105], I: 0.01750495 ZI: NA, pr(ZI): 0.04
## eV[,1994], I: 0.007031315 ZI: NA, pr(ZI): 0.3

me.fit

##    Eigenvector ZI pr(ZI)
## 0           NA NA   0.01
## 1            2 NA   0.01
## 2            3 NA   0.01
## 3           53 NA   0.01
## 4         2864 NA   0.01
## 5           83 NA   0.02
## 6         2915 NA   0.01
## 7            1 NA   0.01
## 8         2734 NA   0.05
## 9          105 NA   0.04
## 10        1994 NA   0.30

From this above output we can we that the level of autocorrelation is no longer significant.

fits<-data.frame(fitted(me.fit))
arf18sub$me2<-fits$vec2
arf18sub$me3<-fits$vec3

arf18sub%>%
  filter(!STATE %in% c("02", "15", "60", "66", "69", "72", "78"))%>%
  ggplot()+
  geom_sf(aes(fill=me2, color=me2))+
  scale_color_viridis_c()+
  scale_fill_viridis_c(na.value = "grey50")+
  geom_sf(data=sts, color="black")+
  coord_sf(crs = 102008)+
  ggtitle(label = "Spatial Distribution Moran Eigenvector 2")

arf18sub%>%
  filter(!STATE %in% c("02", "15", "60", "66", "69", "72", "78"))%>%
  ggplot()+
  geom_sf(aes(fill=me3, color=me3))+
  scale_color_viridis_c()+
  scale_fill_viridis_c(na.value = "grey50")+
  geom_sf(data=sts, color="black")+
  coord_sf(crs = 102008)+
  ggtitle(label = "Spatial Distribution Moran Eigenvector 3")

clean.nb<-glm.nb(death ~ offset(log(pop)) + pov + snap + obgyn_pc + rucc + fitted(me.fit), arf18sub)
summary(clean.nb)

## 
## Call:
## glm.nb(formula = death ~ offset(log(pop)) + pov + snap + obgyn_pc + 
##     rucc + fitted(me.fit), data = arf18sub, init.theta = 24.44975212, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.2914  -0.6254   0.0166   0.6010   3.8065  
## 
## Coefficients:
##                         Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)           -4.845e+00  1.444e-02 -335.574  < 2e-16 ***
## pov                    1.053e-02  7.051e-04   14.934  < 2e-16 ***
## snap                  -3.722e-07  8.526e-08   -4.366 1.27e-05 ***
## obgyn_pc              -5.540e-01  6.544e-02   -8.465  < 2e-16 ***
## rucc02                 6.638e-02  1.525e-02    4.353 1.34e-05 ***
## rucc03                 8.251e-02  1.607e-02    5.134 2.84e-07 ***
## rucc04                 1.314e-01  1.829e-02    7.187 6.64e-13 ***
## rucc05                 1.027e-01  2.508e-02    4.097 4.19e-05 ***
## rucc06                 1.752e-01  1.522e-02   11.513  < 2e-16 ***
## rucc07                 1.716e-01  1.641e-02   10.458  < 2e-16 ***
## rucc08                 1.849e-01  2.037e-02    9.075  < 2e-16 ***
## rucc09                 1.546e-01  1.821e-02    8.491  < 2e-16 ***
## fitted(me.fit)vec2    -7.550e-01  2.153e-01   -3.506 0.000454 ***
## fitted(me.fit)vec3    -2.174e+00  2.214e-01   -9.821  < 2e-16 ***
## fitted(me.fit)vec53   -7.788e-01  2.197e-01   -3.544 0.000394 ***
## fitted(me.fit)vec2864 -3.460e-01  2.190e-01   -1.580 0.114170    
## fitted(me.fit)vec83    8.001e-01  2.173e-01    3.682 0.000232 ***
## fitted(me.fit)vec2915 -2.466e-01  2.152e-01   -1.146 0.251886    
## fitted(me.fit)vec1    -2.099e+00  2.272e-01   -9.236  < 2e-16 ***
## fitted(me.fit)vec2734  2.275e-01  2.286e-01    0.995 0.319623    
## fitted(me.fit)vec105  -6.067e-01  2.179e-01   -2.785 0.005354 ** 
## fitted(me.fit)vec1994 -4.232e-01  2.218e-01   -1.908 0.056371 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(24.4498) family taken to be 1)
## 
##     Null deviance: 4387.8  on 3073  degrees of freedom
## Residual deviance: 3272.7  on 3052  degrees of freedom
## AIC: 34510
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  24.450 
##           Std. Err.:  0.728 
## 
##  2 x log-likelihood:  -34463.896

Test for residual autocorrelation

lm.morantest(clean.nb, listw=us.wt4)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm.nb(formula = death ~ offset(log(pop)) + pov + snap +
## obgyn_pc + rucc + fitted(me.fit), data = arf18sub, init.theta =
## 24.44975212, link = log)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 28.795, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.3475461550    -0.0001083410     0.0001457659

The Moran’s I value is still higher, which means the cleaning did not really help, that means it did not filter out much of spatial autocorrelation.

AIC(fit_pois)

## [1] 140681.3

AIC(fit_bin)

## [1] 139637.4

AIC(fit_nb)

## [1] 34715.33

AIC(clean.nb)

## [1] 34509.9

The clean negative binomial model has the lowest AIC, which means we have been able to decrease the autocorrelation in the model, but only marginally.

Dem7263 Homework 3

Muntasir

February 20, 2019