library(dplyr)
library(spatialreg)
library(spdep)
library(MASS)
library(ggplot2)
library(sf)
library(tigris)
df <- haven::read_sas("AHRF_2020-2021_SAS/AHRF2021.sas7bdat")

# Build a lookup table of variable names and their SAS labels
labels <- sapply(1:ncol(df), function(x){attr(df[[x]], "label")}) %>% 
  cbind("Variable_Name" = names(df), "Label" = .)

I want to see if the number of doctors in a county is influenced by the number of older people. We can examine the variable labels to explore what is available in the dataset.
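For example, a quick way to search the label table for physician-related variables (a small sketch that assumes the labels matrix built above):

# Convert the label matrix to a data frame and search the Label column
labels %>%
  as.data.frame() %>%
  filter(grepl("M.D", Label, fixed = TRUE))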

I have selected the following variables to model active M.D.s:

Dependent variable
  f1212919  Tot Active M.D.s Non-Fed and Fed 2019

Independent variables
  f1318219  Population (Persons) 2019
  f0954519  Inpatient Days Incl Nurs Home;Total Hosp 2019
  f1547119  Persons <65 Yrs 2019
  f1551919  Persons 40-64 Yrs 2019
  f1434515  Median Household Income 2015-19
  f1332119  Percent Persons in Poverty 2019
  f1547419  % <65 without Health Insurance 2019

County identifier (used only for joining to the spatial data)
  f00002    Header - FIPS St and Cty Code
df.subset <- df %>% 
  dplyr::select(f00002, f1212919, f1318219, f0954519, f1551919, f1434515, f1332119, f1547419, f1547119) %>% 
  rename(FIPS = f00002,
         Total_MD = f1212919,
         Pop_2019 = f1318219,
         Inpatient_Days = f0954519,
         Pop_Old = f1547119,
         Pop_Adult = f1551919,
         Median_Income = f1434515,
         Poverty_Percent = f1332119,
         Old_No_Health = f1547419) %>% 
  filter(complete.cases(.))
summary(df.subset)
##      FIPS              Total_MD          Pop_2019        Inpatient_Days   
##  Length:3113        Min.   :    0.0   Min.   :     169   Min.   :      0  
##  Class :character   1st Qu.:    5.0   1st Qu.:   10907   1st Qu.:    843  
##  Mode  :character   Median :   18.0   Median :   26108   Median :   6089  
##                     Mean   :  309.4   Mean   :  105442   Mean   :  71190  
##                     3rd Qu.:   95.0   3rd Qu.:   69830   3rd Qu.:  36017  
##                     Max.   :31377.0   Max.   :10039107   Max.   :5825855  
##    Pop_Adult       Median_Income    Poverty_Percent Old_No_Health  
##  Min.   :     41   Min.   : 21504   Min.   : 2.70   Min.   : 2.40  
##  1st Qu.:   3384   1st Qu.: 44226   1st Qu.:10.40   1st Qu.: 8.00  
##  Median :   8097   Median : 51775   Median :13.40   Median :11.10  
##  Mean   :  32861   Mean   : 53462   Mean   :14.44   Mean   :11.97  
##  3rd Qu.:  21514   3rd Qu.: 59868   3rd Qu.:17.40   3rd Qu.:15.10  
##  Max.   :3177577   Max.   :142299   Max.   :47.70   Max.   :35.80  
##     Pop_Old       
##  Min.   :    149  
##  1st Qu.:   8301  
##  Median :  19890  
##  Mean   :  85717  
##  3rd Qu.:  54592  
##  Max.   :8464644
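As a quick sanity check on the complete-case filter, we can count how many counties were dropped (a one-line sketch using only the objects created above):

# Counties in the full AHRF file minus counties kept after filtering
nrow(df) - nrow(df.subset)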

Spatial Info

Grabbing the county spatial information from tigris (cb = TRUE requests the generalized cartographic boundary files).

spatial <- counties(year = 2019, cb = TRUE)

Merging the county geometries with the AHRF subset and scaling the predictor variables.

df.geom <- spatial %>% 
  dplyr::select(GEOID) %>% 
  right_join(df.subset, by= c("GEOID" = "FIPS")) %>% 
  mutate_at(vars(Inpatient_Days, Pop_Adult, Median_Income, Poverty_Percent, Old_No_Health, Pop_Old), scale)
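Because this is a right join, any county in the AHRF subset without a matching 2019 boundary comes through with an empty geometry. A quick check along these lines (a sketch, not run here) can confirm the merge:

# Number of counties that failed to match a county geometry
sum(st_is_empty(df.geom))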

Looking at our outcome variable:

hist(df.geom$Total_MD)

We can see that it is heavily right-skewed.
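One way to see the shape more clearly is to look at the counts on a log scale; a minimal sketch using the same column:

# log1p() keeps the counties with zero M.D.s on the plot
hist(log1p(df.geom$Total_MD))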

Let’s map our outcome variable, grouping counties by quintile:

df.geom %>%
  mutate(MD_group = cut(Total_MD,
                        breaks = quantile(Total_MD, probs = seq(0, 1, length.out = 6), na.rm = TRUE),
                        include.lowest = TRUE)) %>% 
  ggplot() +
  geom_sf(aes(fill = MD_group), color = NA) +  # color set outside aes() so county borders are dropped
  coord_sf(xlim = c(-125, -65), ylim = c(25, 50)) +
  scale_fill_brewer(palette = "Blues", na.value = "grey50")

Fitting a Poisson model with an offset of the log of the 2019 total population, so we are effectively modeling M.D.s per capita. This model does not account for spatial effects.

fit_pois<-glm(Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days + Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health + Poverty_Percent + Pop_Old,
               family=poisson, 
               data=df.geom)


summary(fit_pois)
## 
## Call:
## glm(formula = Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days + 
##     Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health + 
##     Poverty_Percent + Pop_Old, family = poisson, data = df.geom)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -94.318   -6.433   -4.382   -2.153   97.919  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)     -6.1487708  0.0015020 -4093.77   <2e-16 ***
## Inpatient_Days   0.1760316  0.0005048   348.70   <2e-16 ***
## Pop_Adult        0.0794142  0.0040010    19.85   <2e-16 ***
## Median_Income    0.2096329  0.0012350   169.74   <2e-16 ***
## Poverty_Percent  0.1794683  0.0022042    81.42   <2e-16 ***
## Old_No_Health   -0.0909379  0.0013178   -69.01   <2e-16 ***
## Pop_Old         -0.2089367  0.0041012   -50.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 449980  on 3112  degrees of freedom
## Residual deviance: 252712  on 3106  degrees of freedom
## AIC: 268035
## 
## Number of Fisher Scoring iterations: 5

We see that all of the predictors are statistically significant. Pop_Old has a negative coefficient, which goes against our initial hypothesis that elderly residents attract doctors. One possibility is omitted-variable bias: for example, older populations may be concentrated in less urban counties, and it may be rurality rather than age that is negatively associated with the number of M.D.s.
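For interpretation on the rate scale we can exponentiate the coefficients. A minimal sketch using Wald intervals (confint() can be slow on a model of this size):

# Incidence rate ratios with approximate 95% Wald intervals
se <- sqrt(diag(vcov(fit_pois)))
exp(cbind(RR    = coef(fit_pois),
          lower = coef(fit_pois) - 1.96 * se,
          upper = coef(fit_pois) + 1.96 * se))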

Examining overdispersion

sqrt(fit_pois$deviance/fit_pois$df.residual)
## [1] 9.020117

The data are heavily overdispersed: the deviance-based dispersion statistic (residual deviance divided by its degrees of freedom) is about 81, or roughly 9 on the square-root scale shown above, far above the value of 1 the Poisson model assumes.
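As a cross-check, the Pearson-based dispersion statistic should tell the same story; a quick sketch:

# Sum of squared Pearson residuals over the residual degrees of freedom;
# this should be close to 1 if the Poisson assumption held
sum(residuals(fit_pois, type = "pearson")^2) / fit_pois$df.residual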

1-pchisq(fit_pois$deviance, df = fit_pois$df.residual)
## [1] 0

This goodness-of-fit p-value is essentially zero, so we reject the hypothesis that the Poisson model fits the data adequately.

Because of this, let’s fit a negative binomial model.

fit_nb<-glm.nb(Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days + Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health + Poverty_Percent + Pop_Old,
               data=df.geom)


summary(fit_nb)
## 
## Call:
## glm.nb(formula = Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days + 
##     Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health + 
##     Poverty_Percent + Pop_Old, data = df.geom, init.theta = 2.064236585, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.3726  -0.9657  -0.3378   0.3769   7.0773  
## 
## Coefficients:
##                 Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)     -6.83795    0.01344 -508.659  < 2e-16 ***
## Inpatient_Days   0.87301    0.03122   27.960  < 2e-16 ***
## Pop_Adult        0.17939    0.18846    0.952  0.34116    
## Median_Income    0.34192    0.02198   15.559  < 2e-16 ***
## Poverty_Percent  0.07716    0.02195    3.515  0.00044 ***
## Old_No_Health   -0.13840    0.01504   -9.204  < 2e-16 ***
## Pop_Old         -0.68355    0.19197   -3.561  0.00037 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.0642) family taken to be 1)
## 
##     Null deviance: 5439.5  on 3112  degrees of freedom
## Residual deviance: 3371.3  on 3106  degrees of freedom
## AIC: 27236
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.0642 
##           Std. Err.:  0.0587 
## 
##  2 x log-likelihood:  -27220.3370

With the negative binomial model the coefficient magnitudes change, and Pop_Adult is no longer significant. Compared to the Poisson model the AIC is dramatically lower (27,236 vs. 268,035), and the distributional assumptions are far better satisfied, so we will proceed with this model.
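The AICs can also be compared directly; a one-line sketch (not run here, since the values are visible in the two summaries above):

AIC(fit_pois, fit_nb)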

Let’s check whether this model has spatial autocorrelation in its residuals.

# Build a k = 4 nearest-neighbor graph from the county centroids, force symmetry,
# and convert it to row-standardized spatial weights
nbs <- knearneigh(coordinates(as(df.geom, "Spatial")), k = 4, longlat = TRUE)
nbs <- knn2nb(nbs, sym = TRUE)
us.wt4 <- nb2listw(nbs, style = "W")
lm.morantest(fit_nb, listw = us.wt4)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm.nb(formula = Total_MD ~ offset(log(Pop_2019)) +
## Inpatient_Days + Pop_Adult + Median_Income + Poverty_Percent +
## Old_No_Health + Poverty_Percent + Pop_Old, data = df.geom, init.theta =
## 2.064236585, link = log)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 12.266, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.1455327138    -0.0006195144     0.0001419828

The Moran’s I test gives a highly significant p-value, so we can reject the null hypothesis of no autocorrelation: the residuals show positive spatial autocorrelation.
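As a cross-check on this analytical test, a permutation (Monte Carlo) version of Moran’s I on the Pearson residuals should give a similar picture; a sketch, not run here:

set.seed(1)  # reproducible permutations
moran.mc(residuals(fit_nb, type = "pearson"), listw = us.wt4, nsim = 999)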

Fitting a Spatially Autocorrelated Model

# Spatial lag of the M.D. rate (M.D.s per person) among the 4 nearest neighbors
df.geom$lag_MD <- lag.listw(x = us.wt4, var = (df.geom$Total_MD / df.geom$Pop_2019))
fit_nb.auto<-glm.nb(Total_MD ~ offset(log(Pop_2019)) + scale(lag_MD) + Inpatient_Days + Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health + Poverty_Percent + Pop_Old,
               data=df.geom)


summary(fit_nb.auto)
## 
## Call:
## glm.nb(formula = Total_MD ~ offset(log(Pop_2019)) + scale(lag_MD) + 
##     Inpatient_Days + Pop_Adult + Median_Income + Poverty_Percent + 
##     Old_No_Health + Poverty_Percent + Pop_Old, data = df.geom, 
##     init.theta = 2.073131627, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.4252  -0.9616  -0.3444   0.3755   6.9870  
## 
## Coefficients:
##                  Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)     -6.839803   0.013426 -509.430  < 2e-16 ***
## scale(lag_MD)    0.061067   0.015202    4.017 5.89e-05 ***
## Inpatient_Days   0.880079   0.031179   28.227  < 2e-16 ***
## Pop_Adult       -0.003738   0.190923   -0.020  0.98438    
## Median_Income    0.318693   0.022828   13.960  < 2e-16 ***
## Poverty_Percent  0.071581   0.022014    3.252  0.00115 ** 
## Old_No_Health   -0.130536   0.015203   -8.586  < 2e-16 ***
## Pop_Old         -0.525070   0.194066   -2.706  0.00682 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.0731) family taken to be 1)
## 
##     Null deviance: 5459.0  on 3112  degrees of freedom
## Residual deviance: 3367.1  on 3105  degrees of freedom
## AIC: 27223
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.0731 
##           Std. Err.:  0.0589 
## 
##  2 x log-likelihood:  -27204.7170
lm.morantest(fit_nb.auto, listw = us.wt4)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm.nb(formula = Total_MD ~ offset(log(Pop_2019)) +
## scale(lag_MD) + Inpatient_Days + Pop_Adult + Median_Income +
## Poverty_Percent + Old_No_Health + Poverty_Percent + Pop_Old, data =
## df.geom, init.theta = 2.073131627, link = log)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 8.963, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.1061314609    -0.0006670527     0.0001419795

Adding a spatial lag of the outcome did improve the model, with a lower AIC (27,223 vs. 27,236), but it did not remove the spatial autocorrelation in the residuals: Moran’s I fell from about 0.146 to 0.106 and is still highly significant.
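Since fit_nb is nested inside fit_nb.auto (the lag term is the only addition), an approximate likelihood-ratio test gives a quick sense of whether the lag term is worth keeping; a sketch (approximate because theta is re-estimated in each fit):

# p-value for the spatial lag term via a likelihood-ratio comparison
pchisq(as.numeric(2 * (logLik(fit_nb.auto) - logLik(fit_nb))),
       df = 1, lower.tail = FALSE)

Next, let’s try spatial filtering.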

Spatial Filtering

# Dispersion (theta) estimate from the non-spatial negative binomial fit;
# it is passed to the negative.binomial() family in ME() below
fit_nb$theta
## [1] 2.064237

Fitting the spatial filtering (Moran eigenvector) model:

me.fit<-ME(Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days + Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health + Poverty_Percent + Pop_Old,
           data=df.geom,
           family=negative.binomial(fit_nb$theta),
           listw = us.wt4,
           verbose=T,alpha=.05 )
## eV[,1605], I: 0.000585048 ZI: NA, pr(ZI): 0.01
## eV[,1612], I: 0.0004656239 ZI: NA, pr(ZI): 0.03
## eV[,2428], I: 0.0003799792 ZI: NA, pr(ZI): 0.02
## eV[,1643], I: 0.0003184138 ZI: NA, pr(ZI): 0.01
## eV[,8], I: 0.0002702454 ZI: NA, pr(ZI): 0.01
## eV[,1493], I: 0.0002304293 ZI: NA, pr(ZI): 0.02
## eV[,1578], I: 0.0001957289 ZI: NA, pr(ZI): 0.04
## eV[,2503], I: 0.0001688467 ZI: NA, pr(ZI): 0.01
## eV[,2458], I: 0.000145039 ZI: NA, pr(ZI): 0.04
## eV[,1548], I: 0.0001260045 ZI: NA, pr(ZI): 0.03
## eV[,1647], I: 0.0001095708 ZI: NA, pr(ZI): 0.01
## eV[,2214], I: 9.567298e-05 ZI: NA, pr(ZI): 0.01
## eV[,1574], I: 8.42178e-05 ZI: NA, pr(ZI): 0.06
# Pull out the selected Moran eigenvectors and attach eigenvector 8 for mapping
fits <- data.frame(fitted(me.fit))
df.geom$fit_8 <- fits$vec8

Plotting One of the Eigenvectors

df.geom %>%
  mutate(fit_8_group = cut(fit_8,
                           breaks = quantile(fit_8, probs = seq(0, 1, length.out = 6), na.rm = TRUE),
                           include.lowest = TRUE)) %>% 
  ggplot() +
  geom_sf(aes(fill = fit_8_group), color = NA) +  # color set outside aes() so county borders are dropped
  coord_sf(xlim = c(-125, -65), ylim = c(25, 50)) +
  scale_fill_brewer(palette = "Blues", na.value = "grey50") +
  ggtitle("Moran Eigenvector 8")

This eigenvector appears to capture a broad gradient running from the Midwest toward the South and from the East Coast toward the West.

Let’s see whether spatial filtering removes the residual spatial autocorrelation.

fit_nb.filter<-glm.nb(Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days + Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health + Poverty_Percent + Pop_Old + fitted(me.fit),
               data=df.geom)
summary(fit_nb.filter)
## 
## Call:
## glm.nb(formula = Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days + 
##     Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health + 
##     Poverty_Percent + Pop_Old + fitted(me.fit), data = df.geom, 
##     init.theta = 2.101960923, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.9238  -0.9589  -0.3339   0.4013   6.2089  
## 
## Coefficients:
##                       Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)           -6.84404    0.01337 -512.050  < 2e-16 ***
## Inpatient_Days         0.90086    0.03141   28.683  < 2e-16 ***
## Pop_Adult              0.13120    0.18776    0.699 0.484689    
## Median_Income          0.35266    0.02190   16.101  < 2e-16 ***
## Poverty_Percent        0.07633    0.02185    3.493 0.000477 ***
## Old_No_Health         -0.12891    0.01504   -8.570  < 2e-16 ***
## Pop_Old               -0.67246    0.19126   -3.516 0.000438 ***
## fitted(me.fit)vec1605  1.93150    0.71597    2.698 0.006981 ** 
## fitted(me.fit)vec1612 -2.41664    0.72148   -3.350 0.000809 ***
## fitted(me.fit)vec2428  1.22532    0.70984    1.726 0.084311 .  
## fitted(me.fit)vec1643 -1.21665    0.72731   -1.673 0.094367 .  
## fitted(me.fit)vec8    -3.10764    0.76487   -4.063 4.85e-05 ***
## fitted(me.fit)vec1493  0.98677    0.71421    1.382 0.167084    
## fitted(me.fit)vec1578  1.15069    0.73065    1.575 0.115282    
## fitted(me.fit)vec2503 -0.96934    0.69773   -1.389 0.164746    
## fitted(me.fit)vec2458 -0.71515    0.71197   -1.004 0.315158    
## fitted(me.fit)vec1548 -1.27854    0.73837   -1.732 0.083349 .  
## fitted(me.fit)vec1647  1.06226    0.73422    1.447 0.147960    
## fitted(me.fit)vec2214  1.23711    0.71244    1.736 0.082488 .  
## fitted(me.fit)vec1574  1.19945    0.75389    1.591 0.111608    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.102) family taken to be 1)
## 
##     Null deviance: 5522.0  on 3112  degrees of freedom
## Residual deviance: 3365.6  on 3093  degrees of freedom
## AIC: 27209
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.1020 
##           Std. Err.:  0.0599 
## 
##  2 x log-likelihood:  -27166.5410
lm.morantest(fit_nb.filter, listw=us.wt4)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm.nb(formula = Total_MD ~ offset(log(Pop_2019)) +
## Inpatient_Days + Pop_Adult + Median_Income + Poverty_Percent +
## Old_No_Health + Poverty_Percent + Pop_Old + fitted(me.fit), data =
## df.geom, init.theta = 2.101960923, link = log)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 11.628, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     0.1386268974    -0.0003777212     0.0001428997

It seems that the spatial filtering did not fix our spatial autocorrelation: even with the selected eigenvectors in the model (and a slightly lower AIC of 27,209), the residual Moran’s I is still about 0.14 and highly significant.
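One possible next step, sketched here but not run, would be to let ME() keep more eigenvectors by relaxing the selection threshold and then refit the filtered model:

# Hypothetical follow-up: raise alpha so ME() selects more eigenvectors
me.fit.25 <- ME(Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days + Pop_Adult +
                  Median_Income + Poverty_Percent + Old_No_Health + Pop_Old,
                data = df.geom,
                family = negative.binomial(fit_nb$theta),
                listw = us.wt4, alpha = 0.25)
fit_nb.filter25 <- glm.nb(Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days + Pop_Adult +
                            Median_Income + Poverty_Percent + Old_No_Health + Pop_Old +
                            fitted(me.fit.25),
                          data = df.geom)
lm.morantest(fit_nb.filter25, listw = us.wt4)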