library(dplyr)
library(spatialreg)
library(spdep)
library(MASS)
library(ggplot2)
library(sf)
library(tigris)
df<- haven::read_sas("AHRF_2020-2021_SAS/AHRF2021.sas7bdat")
labels <- sapply(1:ncol(df), function(x){attr(df[[x]], "label")}) %>%
cbind("Variable_Name"=names(df),"Label"=.)
I want to see if the number of doctors in a county are influenced by the number of old people. We can examine the labels in the data to explore what variables are in the dataset.
I have selected the following variables to model Active M.Ds
| Dependent Variable | Label |
|---|---|
| f1212919 | Tot Active M.D.s Non-Fed and Fed 2019 |
| Independent Variables | Label |
|---|---|
| f1318219 | Population (Persons) 2019 |
| f0954519 | Inpatient Days Incl Nurs Home;Total Hosp 2019 |
| f1547119 | Persons <65 Yrs 2019 |
| f1551919 | Persons 40-64 Yrs 2019 |
| f1434515 | Median Household Income 2015-19 |
| f1332119 | Percent Persons in Poverty 2019 |
| f1547419 | % <65 without Health Insurance 2019 |
| f00002 | Header - FIPS St and Cty Code |
df.subset <- df %>%
dplyr::select(f00002, f1212919, f1318219, f0954519, f1318219, f1551919, f1434515, f1332119, f1547419, f1547119) %>%
rename(FIPS = f00002,
Total_MD = f1212919,
Pop_2019 = f1318219,
Inpatient_Days = f0954519,
Pop_Old = f1547119,
Pop_Adult = f1551919,
Median_Income = f1434515,
Poverty_Percent = f1332119,
Old_No_Health = f1547419) %>%
filter(complete.cases(.))
summary(df.subset)
## FIPS Total_MD Pop_2019 Inpatient_Days
## Length:3113 Min. : 0.0 Min. : 169 Min. : 0
## Class :character 1st Qu.: 5.0 1st Qu.: 10907 1st Qu.: 843
## Mode :character Median : 18.0 Median : 26108 Median : 6089
## Mean : 309.4 Mean : 105442 Mean : 71190
## 3rd Qu.: 95.0 3rd Qu.: 69830 3rd Qu.: 36017
## Max. :31377.0 Max. :10039107 Max. :5825855
## Pop_Adult Median_Income Poverty_Percent Old_No_Health
## Min. : 41 Min. : 21504 Min. : 2.70 Min. : 2.40
## 1st Qu.: 3384 1st Qu.: 44226 1st Qu.:10.40 1st Qu.: 8.00
## Median : 8097 Median : 51775 Median :13.40 Median :11.10
## Mean : 32861 Mean : 53462 Mean :14.44 Mean :11.97
## 3rd Qu.: 21514 3rd Qu.: 59868 3rd Qu.:17.40 3rd Qu.:15.10
## Max. :3177577 Max. :142299 Max. :47.70 Max. :35.80
## Pop_Old
## Min. : 149
## 1st Qu.: 8301
## Median : 19890
## Mean : 85717
## 3rd Qu.: 54592
## Max. :8464644
Grabbing County spatial infomation
spatial <- counties(year = 2019, cb = TRUE)
Merging counties and scaling my x variables.
df.geom <- spatial %>%
dplyr::select(GEOID) %>%
right_join(df.subset, by= c("GEOID" = "FIPS")) %>%
mutate_at(vars(Inpatient_Days, Pop_Adult, Median_Income, Poverty_Percent, Old_No_Health, Poverty_Percent, Pop_Old), scale)
Looking at our outcome variable:
hist(df.geom$Total_MD)
We can see that it is very right skewed.
Let’s plot our outcome variable
df.geom %>%
mutate(MD_group = cut(Total_MD, breaks=quantile(Total_MD, p=seq(0,1,length.out = 6), na.rm=T ), include.lowest=T )) %>%
ggplot() +
geom_sf(aes(fill=MD_group, color = NA)) +
coord_sf(xlim = c(-125, -65), ylim = c(50,25)) +
scale_color_brewer(palette = "Blues") +
scale_fill_brewer(palette = "Blues",na.value = "grey50")
Fitting a poisson with an offset of the total population in 2019. This does not consider spatial effects.
fit_pois<-glm(Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days + Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health + Poverty_Percent + Pop_Old,
family=poisson,
data=df.geom)
summary(fit_pois)
##
## Call:
## glm(formula = Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days +
## Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health +
## Poverty_Percent + Pop_Old, family = poisson, data = df.geom)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -94.318 -6.433 -4.382 -2.153 97.919
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.1487708 0.0015020 -4093.77 <2e-16 ***
## Inpatient_Days 0.1760316 0.0005048 348.70 <2e-16 ***
## Pop_Adult 0.0794142 0.0040010 19.85 <2e-16 ***
## Median_Income 0.2096329 0.0012350 169.74 <2e-16 ***
## Poverty_Percent 0.1794683 0.0022042 81.42 <2e-16 ***
## Old_No_Health -0.0909379 0.0013178 -69.01 <2e-16 ***
## Pop_Old -0.2089367 0.0041012 -50.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 449980 on 3112 degrees of freedom
## Residual deviance: 252712 on 3106 degrees of freedom
## AIC: 268035
##
## Number of Fisher Scoring iterations: 5
We see that all of our variables are significant. We see that Population old has a negative relationship. Is goes against our initial hypothesis of the elderly people attracting doctors. It is possible that there is omitted variable bias in the model for example perhaps elder people live in an area that is less urban which that is the factor that is negative with total MD’s.
Examining overdispersion
sqrt(fit_pois$deviance/fit_pois$df.residual)
## [1] 9.020117
The data is very overdispersed the variance is 9 times greater then as the mean.
1-pchisq(fit_pois$deviance, df = fit_pois$df.residual)
## [1] 0
This p-value is extremely small which means the model is not fitting the data very well.
Because of this let’s fit a negative Binomial model.
fit_nb<-glm.nb(Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days + Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health + Poverty_Percent + Pop_Old,
data=df.geom)
summary(fit_nb)
##
## Call:
## glm.nb(formula = Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days +
## Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health +
## Poverty_Percent + Pop_Old, data = df.geom, init.theta = 2.064236585,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.3726 -0.9657 -0.3378 0.3769 7.0773
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.83795 0.01344 -508.659 < 2e-16 ***
## Inpatient_Days 0.87301 0.03122 27.960 < 2e-16 ***
## Pop_Adult 0.17939 0.18846 0.952 0.34116
## Median_Income 0.34192 0.02198 15.559 < 2e-16 ***
## Poverty_Percent 0.07716 0.02195 3.515 0.00044 ***
## Old_No_Health -0.13840 0.01504 -9.204 < 2e-16 ***
## Pop_Old -0.68355 0.19197 -3.561 0.00037 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.0642) family taken to be 1)
##
## Null deviance: 5439.5 on 3112 degrees of freedom
## Residual deviance: 3371.3 on 3106 degrees of freedom
## AIC: 27236
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.0642
## Std. Err.: 0.0587
##
## 2 x log-likelihood: -27220.3370
Doing a negative binomial model we see a change in the sizes of the coefficients and that the Population Adult is not significant. Compared to the poisson model it has a higher AIC. Though since the assumptions are better met with the negative binomial model we will proceed with this model.
Let’s check to see if this model has autocorrelation in the residuals.
nbs<-knearneigh(coordinates(as(df.geom, "Spatial")), k = 4, longlat = T)
nbs<-knn2nb(nbs, sym = T)
us.wt4<-nb2listw(nbs, style = "W")
lm.morantest(fit_nb, listw = us.wt4)
##
## Global Moran I for regression residuals
##
## data:
## model: glm.nb(formula = Total_MD ~ offset(log(Pop_2019)) +
## Inpatient_Days + Pop_Adult + Median_Income + Poverty_Percent +
## Old_No_Health + Poverty_Percent + Pop_Old, data = df.geom, init.theta =
## 2.064236585, link = log)
## weights: us.wt4
##
## Moran I statistic standard deviate = 12.266, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.1455327138 -0.0006195144 0.0001419828
Doing the Moran’s I we see that the p-value is significant so we can reject the null that there is no autocorrelation. There seems to be postive autocorrelation.
fit_nb$theta
## [1] 2.064237
Fitting the Spatial Filtering Model
me.fit<-ME(Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days + Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health + Poverty_Percent + Pop_Old,
data=df.geom,
family=negative.binomial(fit_nb$theta),
listw = us.wt4,
verbose=T,alpha=.05 )
## eV[,1605], I: 0.000585048 ZI: NA, pr(ZI): 0.01
## eV[,1612], I: 0.0004656239 ZI: NA, pr(ZI): 0.03
## eV[,2428], I: 0.0003799792 ZI: NA, pr(ZI): 0.02
## eV[,1643], I: 0.0003184138 ZI: NA, pr(ZI): 0.01
## eV[,8], I: 0.0002702454 ZI: NA, pr(ZI): 0.01
## eV[,1493], I: 0.0002304293 ZI: NA, pr(ZI): 0.02
## eV[,1578], I: 0.0001957289 ZI: NA, pr(ZI): 0.04
## eV[,2503], I: 0.0001688467 ZI: NA, pr(ZI): 0.01
## eV[,2458], I: 0.000145039 ZI: NA, pr(ZI): 0.04
## eV[,1548], I: 0.0001260045 ZI: NA, pr(ZI): 0.03
## eV[,1647], I: 0.0001095708 ZI: NA, pr(ZI): 0.01
## eV[,2214], I: 9.567298e-05 ZI: NA, pr(ZI): 0.01
## eV[,1574], I: 8.42178e-05 ZI: NA, pr(ZI): 0.06
fits<-data.frame(fitted(me.fit))
df.geom$fit_8 = fits$vec8
Plotting One of the Eigen Vectors
df.geom %>%
mutate(fit_8_group = cut(fit_8, breaks=quantile(fit_8, p=seq(0,1,length.out = 6), na.rm=T ), include.lowest=T )) %>%
ggplot() +
geom_sf(aes(fill=fit_8_group, color = NA)) +
coord_sf(xlim = c(-125, -65), ylim = c(50,25)) +
scale_color_brewer(palette = "Blues") +
scale_fill_brewer(palette = "Blues",na.value = "grey50") +
ggtitle("Moran Eigenvector 8")
This vector seems to be from midwest to south and east coas to west.
Let’s see if spatial filtering will fix our autocorrelated model.
fit_nb.filter<-glm.nb(Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days + Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health + Poverty_Percent + Pop_Old + fitted(me.fit),
data=df.geom)
summary(fit_nb.filter)
##
## Call:
## glm.nb(formula = Total_MD ~ offset(log(Pop_2019)) + Inpatient_Days +
## Pop_Adult + Median_Income + Poverty_Percent + Old_No_Health +
## Poverty_Percent + Pop_Old + fitted(me.fit), data = df.geom,
## init.theta = 2.101960923, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.9238 -0.9589 -0.3339 0.4013 6.2089
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.84404 0.01337 -512.050 < 2e-16 ***
## Inpatient_Days 0.90086 0.03141 28.683 < 2e-16 ***
## Pop_Adult 0.13120 0.18776 0.699 0.484689
## Median_Income 0.35266 0.02190 16.101 < 2e-16 ***
## Poverty_Percent 0.07633 0.02185 3.493 0.000477 ***
## Old_No_Health -0.12891 0.01504 -8.570 < 2e-16 ***
## Pop_Old -0.67246 0.19126 -3.516 0.000438 ***
## fitted(me.fit)vec1605 1.93150 0.71597 2.698 0.006981 **
## fitted(me.fit)vec1612 -2.41664 0.72148 -3.350 0.000809 ***
## fitted(me.fit)vec2428 1.22532 0.70984 1.726 0.084311 .
## fitted(me.fit)vec1643 -1.21665 0.72731 -1.673 0.094367 .
## fitted(me.fit)vec8 -3.10764 0.76487 -4.063 4.85e-05 ***
## fitted(me.fit)vec1493 0.98677 0.71421 1.382 0.167084
## fitted(me.fit)vec1578 1.15069 0.73065 1.575 0.115282
## fitted(me.fit)vec2503 -0.96934 0.69773 -1.389 0.164746
## fitted(me.fit)vec2458 -0.71515 0.71197 -1.004 0.315158
## fitted(me.fit)vec1548 -1.27854 0.73837 -1.732 0.083349 .
## fitted(me.fit)vec1647 1.06226 0.73422 1.447 0.147960
## fitted(me.fit)vec2214 1.23711 0.71244 1.736 0.082488 .
## fitted(me.fit)vec1574 1.19945 0.75389 1.591 0.111608
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.102) family taken to be 1)
##
## Null deviance: 5522.0 on 3112 degrees of freedom
## Residual deviance: 3365.6 on 3093 degrees of freedom
## AIC: 27209
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.1020
## Std. Err.: 0.0599
##
## 2 x log-likelihood: -27166.5410
lm.morantest(fit_nb.filter, listw=us.wt4)
##
## Global Moran I for regression residuals
##
## data:
## model: glm.nb(formula = Total_MD ~ offset(log(Pop_2019)) +
## Inpatient_Days + Pop_Adult + Median_Income + Poverty_Percent +
## Old_No_Health + Poverty_Percent + Pop_Old + fitted(me.fit), data =
## df.geom, init.theta = 2.101960923, link = log)
## weights: us.wt4
##
## Moran I statistic standard deviate = 11.628, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.1386268974 -0.0003777212 0.0001428997
It seems that the spatial filtering did not fix our spatial autocorrelation.