This is a csv that I made elsewhere1 that contains data for each county in the continental US.
Columns are as follows:
GEOID, NAME, and ST are the identifiers for each county, with GEOID being the primary key for the spatial data, NAME being the name of the countydist_cent and dist_min are two different ways of calculating the distance in miles of each county to the nearest provider (neurosurgeon). dist_min is the distance from the nearest provider to the nearest edge of the county, while dist_cent is the distance from the nearest provider to the center of the county.
dist_min is overly conservative and zero-inflated2 so I tend to use dist_centnpi is the National Provider Identifier of the closest provider, and providerCity and providerST gives their location. nProvs is the number of providers located within a county, and is NA if there aren’t any providers in the countyDeaths is the raw number of TBI deaths in the county, and AgeAdj and Crude give the age adjusted and crude rates of TBI deaths per 100,000 of the standardized population (StdPop).
pop17 is the 2017 5-year ACS population estimate for the county. This number is not the same as the standardized population (StdPop) mentioned aboveDataSourceglimpse(dist_TBI)
## Observations: 3,108
## Variables: 15
## $ GEOID <chr> "21007", "21017", "21031", "21065", "21069", "21093", "2…
## $ NAME <chr> "Ballard", "Bourbon", "Butler", "Estill", "Fleming", "Ha…
## $ ST <chr> "KY", "KY", "KY", "KY", "KY", "KY", "KY", "KY", "KY", "K…
## $ dist_cent <dbl> 19.465282, 19.857851, 20.245012, 36.640252, 49.878536, 6…
## $ dist_min <dbl> 11.0513851, 8.8352361, 10.7969186, 24.8402589, 39.317849…
## $ npi <dbl> 1689808107, 1518925247, 1477654937, 1720051501, 15189252…
## $ providerCity <chr> "paducah", "lexington", "bowling green", "lexington", "l…
## $ providerST <chr> "ky", "ky", "ky", "ky", "ky", "ky", "ky", "ky", "ky", "k…
## $ pop17 <dbl> 8152, 20017, 12735, 14382, 14515, 107699, 18531, 10648, …
## $ Deaths <dbl> NA, 38, NA, 38, 39, 130, 35, 27, 117, 23, 53, 107, 22, 3…
## $ AgeAdj <dbl> NA, 28.307121, NA, 35.785930, 40.896276, 17.963288, 28.0…
## $ Crude <dbl> NA, 27.151778, NA, 37.205659, 38.574523, 17.625828, 27.2…
## $ StdPop <dbl> 57829, 139954, 89375, 102135, 101103, 737554, 128633, 78…
## $ DataSource <chr> "Geocoded NPI names", "Geocoded NPI names", "Geocoded NP…
## $ nProvs <dbl> NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 3, NA, 3, NA,…
Our main interest is the relationship of Crude, the crude rate of TBI mortality per 100k, with the distance from the middle of the county to the nearest neurosurgeon (dist_cent). Although Crude already incorporates population into it’s estimate3, I feel that it still might be appropriate to include a population estimate (i.e. pop17) in our estimation, as it could serve as a surrogate for other factors like population density.
An important consideration/limitation of this data is that we have many missing values for the TBI mortality rate (Crude), as the CDC suppresses data for counties with 20 or fewer TBI deaths. In fact, 761 of the 3108 total counties have missing data for their TBI deaths.
Additionally, these missing data are not at random; they tend to come from counties that have lower populations, are farther away from the nearest neurosurgeon, and in particular states. The plot below illustrates this last point by showing the number of counties with & without data, broken down by state:
## requires maps
# library(maps)
dist_TBI %>%
left_join(maps::state.fips, by=c("ST"="abb")) %>%
mutate(missingTBI = ifelse(is.na(Deaths), "Missing", "Has data")) %>%
mutate(region = case_when(region==1 ~ "Northeast",
region==2 ~ "Midwest",
region==3 ~ "South",
region==4 ~ "West")) %>%
ggplot(aes(x=ST, fill=missingTBI)) +
geom_bar() +
ggthemes::scale_fill_few() +
coord_flip() +
facet_wrap("region", scales = "free") +
labs(title="Number of missing counties, by region",
x="", y="# of counties in state") +
theme_bw() + theme(legend.position = "bottom")
dist_TBI %>%
mutate(missingTBI = ifelse(is.na(Deaths), "Missing", "Has data")) %>%
ggplot(aes(x=dist_min, fill=missingTBI)) +
geom_histogram() +
ggthemes::scale_fill_few() +
labs(title="Histogram of minimum distance", x="Min distance (mi)") +
theme_bw()
dist_TBI %>%
mutate(missingTBI = ifelse(is.na(Deaths), "Missing", "Has data")) %>%
ggplot(aes(x=dist_cent, fill=missingTBI)) +
geom_histogram() +
ggthemes::scale_fill_few() +
labs(title="Histogram of centroid distance", x="Centroid distance (mi)") +
theme_bw()
dist_TBI %>%
mutate(missingTBI = ifelse(is.na(Deaths), "Missing", "Has data")) %>%
ggplot(aes(x=Crude, fill=missingTBI)) +
geom_histogram() +
ggthemes::scale_fill_few() +
labs(title="Histogram of crude rate", x="Crude TBI rate per 100k") +
theme_bw()
dist_TBI %>%
mutate(missingTBI = ifelse(is.na(Deaths), "Missing", "Has data")) %>%
ggplot(aes(x=AgeAdj, fill=missingTBI)) +
geom_histogram() +
ggthemes::scale_fill_few() +
labs(title="Histogram of age adjusted rate", x="age adjusted TBI rate per 100k") +
theme_bw()
Based on ACS estimates
dist_TBI %>%
mutate(missingTBI = ifelse(is.na(Deaths), "Missing", "Has data")) %>%
ggplot(aes(x=pop17, fill=missingTBI)) +
geom_histogram() +
ggthemes::scale_fill_few() +
labs(title="Histogram of population (2017)", x="Log10 of county population") +
scale_x_log10() +
theme_bw()
dist_TBI %>%
ggplot(aes(dist_cent, Crude)) +
# geoms
geom_jitter(alpha=.3) +
geom_smooth(aes(color="loess"), method="loess") +
geom_smooth(aes(color="lm"), method="lm", se=F) +
ggrepel::geom_label_repel(aes(label=str_glue("{NAME}, {ST}")),
min.segment.length = 0,
data=filter(dist_TBI, GEOID %in% c("38053", "30021", "20055"))) +
## Scales / Guides / Themes
scale_color_discrete(name="method") +
labs(caption="WARNING: Missing TBI data excluded", y="Crude rate per 100k",
x="Centroid distance", title="Rate vs Distance") +
theme_bw() + theme(legend.position = "bottom")
dist_TBI %>%
ggplot(aes(pop17, Crude)) +
# geoms
ggrepel::geom_label_repel(aes(label=str_glue("{NAME}, {ST}")), box.padding=1,
data=filter(dist_TBI, GEOID %in% c("38053", "28009", "51045", "16007", "30003"))) +
ggrepel::geom_label_repel(aes(label=str_glue("{NAME}, {ST}")), nudge_y=25,
data=filter(dist_TBI, GEOID %in% c("06037"))) +
geom_jitter(alpha=.1) +
geom_smooth(aes(color="loess"), method="loess") +
geom_smooth(aes(color="lm"), method="lm", se=F) +
## Scales / Guides / Themes
scale_x_log10(labels=scales::comma) +
scale_color_discrete(name="method") +
labs(caption="WARNING: Missing TBI data excluded", y="Crude rate per 100k",
x="Population (log10)", title="Rate vs Population") +
theme_bw() + theme(legend.position = "bottom")
dist_TBI %>%
mutate(missingTBI = ifelse(is.na(Deaths), "Missing", "Has data")) %>%
ggplot(aes(pop17, dist_cent, color=missingTBI)) +
geom_jitter(alpha=.2) +
scale_x_log10(labels=scales::comma) +
ggthemes::scale_color_few() +
guides(color=guide_legend(override.aes=list(alpha=1))) +
labs(x="Population (log10)", y="Centroid distance", title="Distance vs Population") +
theme_bw() + theme(legend.position = "bottom")
My first thought was to do a simple linear regression of Crude ~ dist_cent, but I’m concerned about the linearity as the scatterplots above don’t show a particularly linear relationship.
Obviously I’m deferring to you on what we should ultimately do, but my intuition tells me maybe:
The group of figures below show the same Rate vs Distance scatterplot seen earlier but with transformations on the X-axis:
p <- dist_TBI %>%
ggplot(aes(dist_cent, Crude)) +
# geoms
geom_jitter(alpha=.15) +
geom_smooth(aes(color="loess"), method="loess") +
geom_smooth(aes(color="lm"), method="lm", se=F) +
## Scales / Guides / Themes
guides(color=F) +
labs(x="Centroid distance") +
theme_bw() + theme(legend.position = "bottom")
if(require(gridExtra)) {
gridExtra::grid.arrange(ncol=2,
p + scale_x_continuous(name="Distance (original)"),
p + scale_x_continuous(name="Squished values >130",
limits = c(0,130),oob = scales::squish),
p + scale_x_sqrt(name="Distance SQRT"),
p + scale_x_log10(name="Distance LOG10"))
} else({
p + scale_x_continuous(name="Original distance")
p + scale_x_continuous(name="Squished values >130",
limits = c(0,130),oob = scales::squish)
p + scale_x_sqrt(name="Distance SQRT")
p + scale_x_log10(name="Distance LOG10")
p + xlim(c(0,125))
})
Top left: The original untransformed scatterplot; Top right: All distances greater than 130 miles have been rounded to 130; Bottom row: X-axis transformed by squareroot (bottom left) and log10 (bottom right)
Categorizing the distance would be another option
df_binned <- dist_TBI %>%
# Make bins manually
mutate(dist_bin = cut(dist_cent, right=F,
breaks = c(0,15,30,50,75,100,Inf))) %>%
# Make bins via pentiles
mutate(pentile = str_glue("Pentile{ ntile(dist_cent, 5) }"))
# lm(Crude ~ pentile, data=df_binned) %>%
# # plot()
# # summary()
# # confint()
df_binned %>%
ggplot(aes(dist_cent, Crude, color=pentile)) +
geom_jitter() +
labs(x="Distance (miles)", y="Crude mortality rate") +
theme_bw()
df_binned %>%
ggplot(aes(x=pentile, y=Crude, fill=pentile)) +
geom_boxplot() +
labs(x="Pentile of distance", y="Crude mortality rate") +
guides(fill=F) +
theme_bw()
fit <- lm(Crude ~ dist_cent, data=dist_TBI)
fit_df <- broom::augment(fit, data=filter(dist_TBI, !is.na(Crude)))
## Residual vs Fitted
fit_df %>%
ggplot(aes(x=.fitted, y=.resid)) +
geom_hline(yintercept=0, color="red") +
geom_jitter(alpha=.5) +
geom_smooth(se=F) +
labs(title="Residual vs Fitted") +
theme_bw()
## Normal Q-Q
# car::qqp(fit)
fit_df %>%
ggplot(aes(sample=.std.resid)) +
geom_qq() +
geom_qq_line()
## Scale-Location
fit_df %>%
ggplot(aes(x=.fitted, y=sqrt(abs(.std.resid)))) +
geom_point(alpha=.5) +
geom_smooth(se=F)
# plot(fit)
See https://github.com/HunterRatliff1/NSG_geography/blob/master/workflow.Rmd↩
if the county has a provider then dist_min = 0↩
Crude = Death * 1e5 / StdPop↩