Overview of the CSV

This is a csv that I made elsewhere¹ that contains data for each county in the continental US.

Columns are as follows:

GEOGRAPHY DATA: GEOID, NAME, and ST are the identifiers for each county, with GEOID being the primary key for the spatial data, NAME being the name of the county
DISTANCE DATA: dist_cent and dist_min are two different ways of calculating the distance in miles of each county to the nearest provider (neurosurgeon). dist_min is the distance from the nearest provider to the nearest edge of the county, while dist_cent is the distance from the nearest provider to the center of the county.
- Note that dist_min is overly conservative and zero-inflated² so I tend to use dist_cent
PROVIDER DATA: npi is the National Provider Identifier of the closest provider, and providerCity and providerST gives their location. nProvs is the number of providers located within a county, and is NA if there aren’t any providers in the county
TBI MORTALITY DATA: Deaths is the raw number of TBI deaths in the county, and AgeAdj and Crude give the age adjusted and crude rates of TBI deaths per 100,000 of the standardized population (StdPop).
- Note that these values are NA if 20 or fewer deaths occurred in the county
POPULATION: pop17 is the 2017 5-year ACS population estimate for the county. This number is not the same as the standardized population (StdPop) mentioned above
Ignore DataSource

glimpse(dist_TBI)

## Observations: 3,108
## Variables: 15
## $ GEOID        <chr> "21007", "21017", "21031", "21065", "21069", "21093", "2…
## $ NAME         <chr> "Ballard", "Bourbon", "Butler", "Estill", "Fleming", "Ha…
## $ ST           <chr> "KY", "KY", "KY", "KY", "KY", "KY", "KY", "KY", "KY", "K…
## $ dist_cent    <dbl> 19.465282, 19.857851, 20.245012, 36.640252, 49.878536, 6…
## $ dist_min     <dbl> 11.0513851, 8.8352361, 10.7969186, 24.8402589, 39.317849…
## $ npi          <dbl> 1689808107, 1518925247, 1477654937, 1720051501, 15189252…
## $ providerCity <chr> "paducah", "lexington", "bowling green", "lexington", "l…
## $ providerST   <chr> "ky", "ky", "ky", "ky", "ky", "ky", "ky", "ky", "ky", "k…
## $ pop17        <dbl> 8152, 20017, 12735, 14382, 14515, 107699, 18531, 10648, …
## $ Deaths       <dbl> NA, 38, NA, 38, 39, 130, 35, 27, 117, 23, 53, 107, 22, 3…
## $ AgeAdj       <dbl> NA, 28.307121, NA, 35.785930, 40.896276, 17.963288, 28.0…
## $ Crude        <dbl> NA, 27.151778, NA, 37.205659, 38.574523, 17.625828, 27.2…
## $ StdPop       <dbl> 57829, 139954, 89375, 102135, 101103, 737554, 128633, 78…
## $ DataSource   <chr> "Geocoded NPI names", "Geocoded NPI names", "Geocoded NP…
## $ nProvs       <dbl> NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 3, NA, 3, NA,…

Set up of the problem

Our main interest is the relationship of Crude, the crude rate of TBI mortality per 100k, with the distance from the middle of the county to the nearest neurosurgeon (dist_cent). Although Crude already incorporates population into it’s estimate³, I feel that it still might be appropriate to include a population estimate (i.e. pop17) in our estimation, as it could serve as a surrogate for other factors like population density.

An important consideration/limitation of this data is that we have many missing values for the TBI mortality rate (Crude), as the CDC suppresses data for counties with 20 or fewer TBI deaths. In fact, 761 of the 3108 total counties have missing data for their TBI deaths.

Additionally, these missing data are not at random; they tend to come from counties that have lower populations, are farther away from the nearest neurosurgeon, and in particular states. The plot below illustrates this last point by showing the number of counties with & without data, broken down by state:

## requires maps 
# library(maps)

dist_TBI %>%
  left_join(maps::state.fips, by=c("ST"="abb")) %>%
  mutate(missingTBI = ifelse(is.na(Deaths), "Missing", "Has data")) %>%
  mutate(region = case_when(region==1 ~ "Northeast",
                            region==2 ~ "Midwest",
                            region==3 ~ "South",
                            region==4 ~ "West")) %>%
  ggplot(aes(x=ST, fill=missingTBI)) +
  geom_bar() +
  
  ggthemes::scale_fill_few() +
  coord_flip() +
  facet_wrap("region", scales = "free") +
  labs(title="Number of missing counties, by region",
       x="", y="# of counties in state") +
  theme_bw() + theme(legend.position = "bottom")

Histograms

Distance

dist_TBI %>%
  mutate(missingTBI = ifelse(is.na(Deaths), "Missing", "Has data")) %>%
  ggplot(aes(x=dist_min, fill=missingTBI)) +
  geom_histogram() +
  ggthemes::scale_fill_few() +
  labs(title="Histogram of minimum distance", x="Min distance (mi)") +
  theme_bw()

dist_TBI %>%
  mutate(missingTBI = ifelse(is.na(Deaths), "Missing", "Has data")) %>%
  ggplot(aes(x=dist_cent, fill=missingTBI)) +
  geom_histogram() +
  ggthemes::scale_fill_few() +
  labs(title="Histogram of centroid distance", x="Centroid distance (mi)") +
  theme_bw()

TBI rate

dist_TBI %>%
  mutate(missingTBI = ifelse(is.na(Deaths), "Missing", "Has data")) %>%
  ggplot(aes(x=Crude, fill=missingTBI)) +
  geom_histogram() +
  ggthemes::scale_fill_few() +
  labs(title="Histogram of crude rate", x="Crude TBI rate per 100k") +
  theme_bw()

dist_TBI %>%
  mutate(missingTBI = ifelse(is.na(Deaths), "Missing", "Has data")) %>%
  ggplot(aes(x=AgeAdj, fill=missingTBI)) +
  geom_histogram() +
  ggthemes::scale_fill_few() +
  labs(title="Histogram of age adjusted rate", x="age adjusted TBI rate per 100k") +
  theme_bw()

Population

Based on ACS estimates

dist_TBI %>%
  mutate(missingTBI = ifelse(is.na(Deaths), "Missing", "Has data")) %>%
  ggplot(aes(x=pop17, fill=missingTBI)) +
  geom_histogram() +
  ggthemes::scale_fill_few() +
  labs(title="Histogram of population (2017)", x="Log10 of county population") +
  scale_x_log10() +
  theme_bw()

Scatterplots

dist_TBI %>%
  ggplot(aes(dist_cent, Crude)) + 
  # geoms
  geom_jitter(alpha=.3) +
  geom_smooth(aes(color="loess"), method="loess") +
  geom_smooth(aes(color="lm"), method="lm", se=F) +
  ggrepel::geom_label_repel(aes(label=str_glue("{NAME}, {ST}")),
                            min.segment.length = 0,
                            data=filter(dist_TBI, GEOID %in% c("38053", "30021", "20055"))) +
  
  ## Scales / Guides / Themes
  scale_color_discrete(name="method") +
  labs(caption="WARNING: Missing TBI data excluded", y="Crude rate per 100k",
       x="Centroid distance", title="Rate vs Distance") +
  theme_bw() + theme(legend.position = "bottom")

dist_TBI %>%
  ggplot(aes(pop17, Crude)) + 
  
  # geoms
  ggrepel::geom_label_repel(aes(label=str_glue("{NAME}, {ST}")), box.padding=1, 
                            data=filter(dist_TBI, GEOID %in% c("38053", "28009", "51045", "16007", "30003"))) +
  ggrepel::geom_label_repel(aes(label=str_glue("{NAME}, {ST}")), nudge_y=25,
                            data=filter(dist_TBI, GEOID %in% c("06037"))) +
  geom_jitter(alpha=.1) +
  geom_smooth(aes(color="loess"), method="loess") +
  geom_smooth(aes(color="lm"), method="lm", se=F) +
  
  
  ## Scales / Guides / Themes
  scale_x_log10(labels=scales::comma) +
  scale_color_discrete(name="method") +
  labs(caption="WARNING: Missing TBI data excluded", y="Crude rate per 100k",
       x="Population (log10)", title="Rate vs Population") +
  theme_bw() + theme(legend.position = "bottom")

dist_TBI %>%
  mutate(missingTBI = ifelse(is.na(Deaths), "Missing", "Has data")) %>%
  ggplot(aes(pop17, dist_cent, color=missingTBI)) + 
  geom_jitter(alpha=.2) +
  
  
  scale_x_log10(labels=scales::comma) +
  ggthemes::scale_color_few() +
  guides(color=guide_legend(override.aes=list(alpha=1))) +
  labs(x="Population (log10)", y="Centroid distance", title="Distance vs Population") +
  theme_bw() + theme(legend.position = "bottom")

Regression

My first thought was to do a simple linear regression of Crude ~ dist_cent, but I’m concerned about the linearity as the scatterplots above don’t show a particularly linear relationship.

Obviously I’m deferring to you on what we should ultimately do, but my intuition tells me maybe:

Distance should be transformed, because I wouldn’t expect the \(\Delta Crude\) to be that same between 15 & 25 miles and 115 & 125 miles. At some point you’re just too far away from the closest neurosurgeon to make much of a marginal difference
We should bin the data into categories for the same reason as above. It seems reasonable to pick some cutoff (e.g. distances over 130 miles become 130) or make categories some other way (e.g. pentiles)

The group of figures below show the same Rate vs Distance scatterplot seen earlier but with transformations on the X-axis:

p <- dist_TBI %>%
  ggplot(aes(dist_cent, Crude)) + 
  # geoms
  geom_jitter(alpha=.15) +
  geom_smooth(aes(color="loess"), method="loess") +
  geom_smooth(aes(color="lm"), method="lm", se=F) +
  
  
  ## Scales / Guides / Themes
  guides(color=F) +
  labs(x="Centroid distance") +
  theme_bw() + theme(legend.position = "bottom")




if(require(gridExtra)) {
  gridExtra::grid.arrange(ncol=2,
                          p + scale_x_continuous(name="Distance (original)"),
                          p + scale_x_continuous(name="Squished values >130", 
                                                 limits = c(0,130),oob = scales::squish),
                          p + scale_x_sqrt(name="Distance SQRT"),
                          p + scale_x_log10(name="Distance LOG10"))
} else({
  p + scale_x_continuous(name="Original distance")
  p + scale_x_continuous(name="Squished values >130", 
                         limits = c(0,130),oob = scales::squish)
  p + scale_x_sqrt(name="Distance SQRT")
  p + scale_x_log10(name="Distance LOG10")
  p + xlim(c(0,125))
})

Top left: The original untransformed scatterplot; Top right: All distances greater than 130 miles have been rounded to 130; Bottom row: X-axis transformed by squareroot (bottom left) and log10 (bottom right)

Categorizing the distance would be another option

df_binned <- dist_TBI %>%
  # Make bins manually
  mutate(dist_bin = cut(dist_cent, right=F,
                        breaks = c(0,15,30,50,75,100,Inf))) %>%
  # Make bins via pentiles
  mutate(pentile = str_glue("Pentile{ ntile(dist_cent, 5) }"))

 
# lm(Crude ~ pentile, data=df_binned) %>%
#   # plot()
#   # summary()
#   # confint()

df_binned %>%
  ggplot(aes(dist_cent, Crude, color=pentile)) + 
  geom_jitter() +
  labs(x="Distance (miles)", y="Crude mortality rate") +
  theme_bw()

df_binned %>%
  ggplot(aes(x=pentile, y=Crude, fill=pentile)) + 
  geom_boxplot() +
  labs(x="Pentile of distance", y="Crude mortality rate") +
  guides(fill=F) +
  theme_bw()

fit <- lm(Crude ~ dist_cent, data=dist_TBI)
fit_df <- broom::augment(fit, data=filter(dist_TBI, !is.na(Crude)))

## Residual vs Fitted
fit_df %>%
  ggplot(aes(x=.fitted, y=.resid)) + 
  geom_hline(yintercept=0, color="red") +
  geom_jitter(alpha=.5) +
  geom_smooth(se=F) +
  labs(title="Residual vs Fitted") +
  theme_bw()

## Normal Q-Q
# car::qqp(fit)
fit_df %>%
  ggplot(aes(sample=.std.resid)) +
  geom_qq() +
  geom_qq_line() 

## Scale-Location
fit_df %>%
  ggplot(aes(x=.fitted, y=sqrt(abs(.std.resid)))) + 
  geom_point(alpha=.5) +
  geom_smooth(se=F)


# plot(fit)

See https://github.com/HunterRatliff1/NSG_geography/blob/master/workflow.Rmd ↩
if the county has a provider then dist_min = 0↩
Crude = Death * 1e5 / StdPop↩

Email

Hunter Ratliff

2/2/2020