The goal: To quickly determine which WVU facilities are the closest to where a patient lives
The perfect solution: Take the patient’s address, geocode it to get coordinates (the origin), then calculate the route to each of our facilities (the destinations) and select whichever is closest
Why that won’t work:
HIPAA (probably): Addresses are PHI, and I’m not about to create a tool that people can use to disclose PHI
Addresses are too detailed: I don’t need turn-by-turn directions; I just want to know where to tell the patient to get labs, and to get a reasonable estimate of how far their drive to in-person appointments will be
Money: I’m not paying for everyone to use my personal MapBox API token! Also, there might be some value for QI in being able to map zip codes to the nearest facility
What will we do?
…
Important
Todo
Import data
Hospitals
The first bit to import is the names & locations of the hospitals. This is stored on a Google Sheet, which also holds the coordinates of each facility
Show the code
library(googlesheets4)

options(gargle_oauth_email = TRUE)
gs4_auth(cache = TRUE)

hospitals <- read_sheet("https://docs.google.com/spreadsheets/d/1MQG2mWaKPmqM7cbKUbg3BQvElWV4-_F1nINZPg2b5XQ") %>%
  select(abbv, name_short, ID_access, city, state, hosp_type, Address, Zip,
         CMS_ID, name_full, lat, long) %>%
  st_as_sf(coords = c("long", "lat"), crs = params$CRS) #%>%
  # geocode(city=city, state=state) %>%
  # mutate(ID_consult = case_when(is.na(ID_consult) ~ "No ID available",
  #                               ID_consult == F ~ "No ID available",
  #                               ID_consult == T ~ "ID consults available")) %>%
  # filter(include_on_map)
If the Google Sheet needs to be updated in the future, you’ll need to manually update the coordinates of the (new) facilities
Show the code
## To get the actual coordinates of the hospitals (i.e. what to run when updating it)
# Make note of what column is being joined on (in this case `abbv`)

# Geocode them
hospitals_geocoded <- read_sheet("https://docs.google.com/spreadsheets/d/1MQG2mWaKPmqM7cbKUbg3BQvElWV4-_F1nINZPg2b5XQ") %>%
  select(abbv, name_short, city, state, Address, Zip, name_full) %>%
  mutate(search_query = str_glue("{name_full}, {city}, {state}")) %>%
  geocode_combine(
    queries = list(
      list(method = "census", mode = "batch", street = "Address",
           city = "city", state = "state", postalcode = "Zip"),
      # list(
      #   method = "osm", street = "Address",
      #   city = "city", state = "state", postalcode = "Zip"
      # ),
      list(method = "mapbox", address = "search_query")
    ),
    query_names = c("census batch", "mapbox")
    # global_params = list(return_addresses = TRUE)
  )

# Write to file
hospitals_geocoded %>%
  select(abbv, name_short, lat, long) %>%
  readr::write_csv(glue::glue("{rstudioapi::selectDirectory()}/Hosp_geocoded.csv"))

rm(hospitals_geocoded)
Show the code
mapview(hospitals, label="name_full")
Defining our area of interest
Before we gather all of our data, we should first define the area that we will…
County shapes
One approach is to build a “buffer” around the state of West Virginia, and use that buffer to define our search space
Show the code
### Get the starting shape files
wv_counties <- tigris::counties(state = "WV", cb = T) %>%
  st_transform(crs = params$CRS)

wv_outline <- st_union(wv_counties)

all_conuties_raw <- tigris::counties(state = c("WV", "PA", "MD", "VA", "OH", "KY"), cb = T) %>%
  st_transform(crs = params$CRS)

### Define the buffer zone for counties
wv_buffer <- st_union(wv_counties) %>%
  st_buffer(params$buffer * 1000)

all_conuties <- all_conuties_raw %>%
  st_filter(wv_buffer, .predicate = st_intersects)

### Visual of selection criteria
all_conuties %>%
  ggplot() +
  geom_sf() +
  geom_sf(data = st_union(wv_counties), fill = 'darkgrey', alpha = .5) +
  geom_sf(data = wv_buffer, color = 'red', fill = NA) +
  labs(title = str_glue("Counties included using a {params$buffer}km buffer"))
This isn’t a bad starting place, but it doesn’t actually reflect our health system: we have multiple facilities outside of West Virginia, and stronger representation along the northern & western edges (compared to the south/east)
Another approach is to use ZIP Code Tabulation Areas (ZCTAs), which are areas defined by the US Census Bureau. They are rough approximations of ZIP code areas, built up from census blocks
Show the code
# c("WV", "PA", "MD", "VA", "OH", "KY")# This is NOT the high res shapefileZCTA_full <- tigris::zctas(year=2020, cb=T)ZCTA <- ZCTA_full %>%st_filter(wv_buffer, .predicate = st_intersects) rm(ZCTA_full)
Population weighted centroids
Our next step is to calculate a point (within each zip code) that will serve as our “origin” point
The easy solution is to calculate the geographic/geometric centroid of each ZCTA, as shown below
The issue with this method is that the population may not be spread evenly across the zip code, as shown for the same area below
Show the code
blocks <- c(041, 097, 083, 007, 101, 075) %>%
  # as.character() %>%
  map(~get_decennial(geography = "block",
                     variables = "P1_001N", # Total population
                     year = 2020,
                     state = "WV",
                     county = .x,
                     geometry = TRUE)) %>%
  bind_rows()

blocks %>%
  st_filter(st_union(subset_ZCTA), .predicate = st_intersects) %>%
  ggplot() +
  geom_sf(aes(fill = value)) +
  geom_sf(data = subset_ZCTA, fill = NA, color = "red", alpha = .5, linetype = "dashed") +
  geom_sf(data = st_centroid(subset_ZCTA), color = "red") +
  scale_fill_viridis_c() +
  labs(fill = "Population",
       title = "Select ZCTA, with population",
       subtitle = "Populated area is Webster Springs, WV") +
  ggthemes::theme_map()
Calculate the population weighted centroids
Unfortunately, the only way to calculate population weighted centroids at the ZCTA level is with block data (the most granular level of census data), so this section pulls in over a million rows of data (each with its own geometry, using >2 GB of RAM).
Since this takes so long, we will only run it once, then save the results in the ZCTA_coords tab of the same Google Sheet we used before
Show the code
## This could be another option: Using the census look up table to map [blocks] --> [ZCTA]
# https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.2020.html#zcta

x <- read_delim("https://www2.census.gov/geo/docs/maps-data/data/rel2020/zcta520/tab20_zcta520_tabblock20_natl.txt",
                delim = "|", escape_double = FALSE, trim_ws = TRUE)

crosswalk_block_ZCTA <- x %>%
  select(GEOID_ZCTA5_20, NAMELSAD_ZCTA5_20, GEOID_TABBLOCK_20, NAMELSAD_TABBLOCK_20) %>%
  # Limit to blocks that have associated ZCTA
  filter(!is.na(GEOID_ZCTA5_20)) %>%
  # Limit to the ZCTA's we are using
  filter(GEOID_ZCTA5_20 %in% ZCTA$GEOID20) %>%
  # Extract some helpful info
  mutate(GEOID_county = str_sub(GEOID_TABBLOCK_20, 1, 5),
         GEOID_tract = str_sub(GEOID_TABBLOCK_20, 6, 11),
         GEOID_blkgrp = str_sub(GEOID_TABBLOCK_20, 12, 12),
         GEOID_block = str_sub(GEOID_TABBLOCK_20, 12, 16))

rm(x)

crosswalk_block_ZCTA %>%
  group_by(GEOID_ZCTA5_20, GEOID_county, GEOID_tract) %>%
  summarise(n_blocks = n(), .groups = "drop_last") %>%
  summarise(n_blocks = sum(n_blocks), n_tracts = n(), .groups = "drop_last") %>%
  summarise(n_blocks = sum(n_blocks), n_tracts = sum(n_tracts), n_counties = n(), .groups = "drop_last")

crosswalk_block_ZCTA %>%
  select(GEOID_ZCTA5_20, GEOID_county, GEOID_tract) %>%
  unique() %>%
  group_by(GEOID_county, GEOID_tract) %>%
  summarise(n_ZCTA_per_tract = n()) %>%
  arrange(desc(n_ZCTA_per_tract))

rm(crosswalk_block_ZCTA)
Show the code
###################################################-
### [part 1] Get census block data
tictoc::tic("Entire process") # 449 sec (7:30)
projected_complete(c(449), name = "ENTIRE THING")
###################################################-

tic("Call census block data") # 152, 161, 197 sec
projected_complete(c(152, 161, 197, 311, 218, 256))

pop_block_ls <- c("WV", "PA", "MD", "VA", "OH", "KY") %>%
  map(~get_decennial(geography = "block",
                     variables = "P1_001N", # Total population
                     year = 2020,
                     state = .x,
                     geometry = TRUE))
tictoc::toc()

tic("Join lists to a single sf object of block data") # 35 sec
projected_complete(c(35, 29, 27))

pop_block <- bind_rows(pop_block_ls) %>%
  filter(value != 0)

rm(pop_block_ls)
toc()

###################################################-
### [part 2a] The **NEW** way to join to ZCTA
tic("Join blocks to ZCTA (type a)")
projected_complete(c(168, 156))
###################################################-

## Grab the census's look up table to map [blocks] --> [ZCTA]
# https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.2020.html#zcta
temp_df1 <- read_delim("https://www2.census.gov/geo/docs/maps-data/data/rel2020/zcta520/tab20_zcta520_tabblock20_natl.txt",
                       delim = "|", escape_double = FALSE, trim_ws = TRUE) %>%
  select(GEOID_ZCTA5_20, NAMELSAD_ZCTA5_20, GEOID_TABBLOCK_20, NAMELSAD_TABBLOCK_20) %>%
  # Limit to blocks that have associated ZCTA
  filter(!is.na(GEOID_ZCTA5_20)) %>%
  # Limit to the ZCTA's we are using
  filter(GEOID_ZCTA5_20 %in% ZCTA$GEOID20) %>%
  # Extract some helpful info
  mutate(GEOID_county = str_sub(GEOID_TABBLOCK_20, 1, 5),
         GEOID_tract = str_sub(GEOID_TABBLOCK_20, 6, 11),
         GEOID_blkgrp = str_sub(GEOID_TABBLOCK_20, 12, 12),
         GEOID_block = str_sub(GEOID_TABBLOCK_20, 12, 16)) %>%
  select(-starts_with("NAME"))

# Limit the geo_df of blocks to the ones we care about, then find their centroids
temp_df2 <- pop_block %>%
  filter(GEOID %in% temp_df1$GEOID_TABBLOCK_20) %>%
  st_centroid()

crosswalk_block_ZCTA <- temp_df1 %>%
  # Join to block data
  left_join(temp_df2, by = c("GEOID_TABBLOCK_20" = "GEOID")) %>%
  # Ensure no missing data (ie blocks with no population)
  filter(!is.na(value)) %>%
  # Group by ZCTA
  group_by(GEOID_ZCTA5_20) %>%
  summarize(
    pw_lon = weighted.mean(st_coordinates(geometry)[, 1], value),
    pw_lat = weighted.mean(st_coordinates(geometry)[, 2], value),
    population = sum(value, na.rm = T),
    blocks_list = str_flatten_comma(GEOID_TABBLOCK_20),
    .groups = "drop"
  )

rm(temp_df1, temp_df2)
toc()

# ###################################################-
# ### [part 2b.1] original way to join to ZCTA
# tic("Join blocks to ZCTA (type b)")
# projected_complete(c(136, 210, 273))
# ###################################################-
# blocks_zcta <- st_join(st_transform(pop_block, crs = params$CRS),
#                        st_transform(ZCTA, crs = params$CRS),
#                        join = st_within, # st_intersects is another option
#                        left = FALSE)
# toc()

rm(pop_block)

# tic()
# x <- crosswalk_block_ZCTA %>%
#   # Get the centroids of each block
#   st_centroid() %>%
#
#   # Ensure no missing data (ie blocks with no population)
#   filter(!is.na(value)) %>%
#
#   # Group by ZCTA
#   group_by(GEOID_ZCTA5_20) %>%
#
#   # Summarize to population weighted centroid coordinates
#   summarize(
#     pw_lon = weighted.mean(st_coordinates(geometry)[, 1], value),
#     pw_lat = weighted.mean(st_coordinates(geometry)[, 2], value),
#     population = sum(value, na.rm = T),
#     blocks_list = str_flatten_comma(GEOID_TABBLOCK_20),
#     .groups = "drop"
#   ) %>%
#
#   # Convert back to sf object
#   st_as_sf(coords = c("pw_lon", "pw_lat"), crs = params$CRS)
# toc()
#
# ###################################################-
# ### [part 2b.2] identify pop wt centroid
# tic("Get weighted centroids by ZCTA") # 56-126 sec
# projected_complete(c(56, 126, 85))
# ###################################################-
# ZCTA_wt_centroids <- blocks_zcta %>%
#   # Get the centroids of each block
#   st_centroid() %>%
#
#   # Ensure no missing data (ie blocks with no population)
#   filter(!is.na(value)) %>%
#
#   # Group by ZCTA
#   group_by(ZCTA5CE20) %>%
#
#   # Summarize to population weighted centroid coordinates
#   summarize(
#     pw_lon = weighted.mean(st_coordinates(geometry)[, 1], value),
#     pw_lat = weighted.mean(st_coordinates(geometry)[, 2], value),
#     .groups = "drop"
#   ) %>%
#
#   # Convert back to sf object
#   st_as_sf(coords = c("pw_lon", "pw_lat"), crs = params$CRS)
# toc() # end: weighted centroids
#
# rm(blocks_zcta)

###################################################-
### [part 3] Find ZCTA centroids
### Find the centroid of each ZCTA. This will be done both using a
### population weighted method (pw_lon / pw_lat) and using simple
### centroids
###
### The goal here is to build out a data frame with
### 1. **`cntr_lon/lat`**: Geometric coordinates of centroids accomplished
###    using the extract_coords function
### 2. **`pw_lon/lat`**: Population weighted centroids, pulled from the
###    crosswalk_block_ZCTA dataframe above. This will also include two
###    additional variables (from crosswalk_block_ZCTA), which are included
###    for assistance in troubleshooting
###    - **`population`** which is the total population used from the
###      blocks when calculating the population weighted centroid
###    - **`blocks_list`** which is a comma separated list of the GEOID's
###      of the blocks used in the calculation
### 3. **`lon/lat`**: As some of the ZCTA's couldn't calculate the population
###    weighted centroids, this is included for convenience. In cases where the
###    population weighted centroid could be calculated, it will use that.
###    If not, it falls back on the geometric/geographic centroid
###    - **`centroid_method`** defines which method is used for lon/lat
### 4. **`diff_dist_km`**: The distance between the two calculations
### 5. **`mid_lon/lat`**: The midpoint between the two calculations
###
tic("Find ZCTA centroids & merge various calculations")
projected_complete(c(168))
######################################################-

ZCTA_centroids <- list() # a list for the results

# Helper function that extracts the coordinates & keeps the geoid
extract_coords <- function(sf_geo){
  #' @description
  #' Extracts the coordinates from a SF object
  #'
  #' @param sf_geo sf object of points, with the first column being an ID

  # Test to make sure the provided geography is correct
  n_row_correctType <- sum(st_is(sf_geo, "POINT") | st_is(sf_geo, "MULTIPOINT"))

  if(n_row_correctType != nrow(sf_geo)){
    warning("Make sure correct geography type is provided! ")
    message("Did you forget to do `st_centroid()`?")
  }

  df <- sf_geo %>%
    st_coordinates() %>%
    as_tibble() %>%
    magrittr::set_names(c("lon", "lat"))

  df$geo_id <- st_drop_geometry(sf_geo)[,1]

  df %>%
    select(geo_id, lon, lat)
}

# Extract the centroid of unweighted ZCTA
ZCTA_centroids$cntr_unwt_ZCTA <- ZCTA %>%
  st_centroid() %>%
  extract_coords() %>%
  rename(cntr_lon = lon, cntr_lat = lat)

# Join the two. If you were able to find the population weighted centroid,
# use that one. If not, fall back on the regular version
ZCTA_centroids$part1 <- ZCTA_centroids$cntr_unwt_ZCTA %>%
  left_join(crosswalk_block_ZCTA, by = c("geo_id" = "GEOID_ZCTA5_20")) %>%
  rename(ZCTA_geoid = geo_id) %>%
  mutate(centroid_method = case_when(
    !is.na(pw_lon) ~ "pop_wt",
    is.na(pw_lon) & !is.na(cntr_lon) ~ "geo",
    is.na(pw_lon) & is.na(cntr_lon) ~ "error"
  )) %>%
  mutate(lon = if_else(!is.na(pw_lon), pw_lon, cntr_lon),
         lat = if_else(!is.na(pw_lat), pw_lat, cntr_lat))

# ### Used if you are running the old way of calculating it
# ZCTA_centroids <- cntr_unwt_ZCTA %>%
#   left_join(st_drop_geometry(ZCTA_wt_centroids), by=c("geo_id"="ZCTA5CE20")) %>%
#   rename(ZCTA_geoid = geo_id) %>%
#
#   mutate(centroid_method = case_when(
#     !is.na(pw_lon) ~ "pop_wt",
#     is.na(pw_lon) & !is.na(cntr_lon) ~ "geo",
#     is.na(pw_lon) & is.na(cntr_lon) ~ "error"
#   )) %>%
#   mutate(lon = if_else(!is.na(pw_lon), pw_lon, cntr_lon),
#          lat = if_else(!is.na(pw_lat), pw_lat, cntr_lat))

rm(ZCTA_wt_centroids, crosswalk_block_ZCTA, cntr_unwt_ZCTA, extract_coords)

### Calculate the difference between the two, in km
tictoc::tic("Calculate the difference in km") # 26 sec
projected_complete(c(26, 26, 23, 31, 28, 24))

ZCTA_centroids$ZCTA_method_diff <- ZCTA_centroids$part1 %>%
  filter(!is.na(pw_lon)) %>%
  mutate(
    cntr = st_sfc(mapply(function(lon, lat) st_point(c(lon, lat)),
                         cntr_lon, cntr_lat, SIMPLIFY = FALSE), crs = params$CRS),
    pw = st_sfc(mapply(function(lon, lat) st_point(c(lon, lat)),
                       pw_lon, pw_lat, SIMPLIFY = FALSE), crs = params$CRS)
  ) %>%
  st_as_sf() %>%
  rowwise() %>%
  mutate(
    dist_m = st_distance(cntr, pw, by_element = TRUE),
    diff_dist_km = as.numeric(dist_m) / 1000
  ) %>%
  ungroup() %>%
  select(ZCTA_geoid, diff_dist_km)

# Save distances to prior df
ZCTA_centroids$part2 <- ZCTA_centroids$part1 %>%
  left_join(st_drop_geometry(ZCTA_centroids$ZCTA_method_diff), by = "ZCTA_geoid")
toc() # End: calc the diff

# Calculate the midpoint between the two methods
ZCTA_centroids$final <- ZCTA_centroids$part2 %>%
  rowwise() %>%
  mutate(
    mid_lon = if_else(centroid_method == "pop_wt", mean(c(cntr_lon, pw_lon)), lon),
    mid_lat = if_else(centroid_method == "pop_wt", mean(c(cntr_lat, pw_lat)), lat)
  ) %>%
  ungroup()

# Rearrange columns
ZCTA_centroids <- ZCTA_centroids$final %>%
  select(ZCTA_geoid, lon, lat, centroid_method, mid_lon, mid_lat, diff_dist_km, everything())

toc() # End: Overall

### -------------- WRITE TO GOOGLE SHEET
library(googlesheets4)
options(gargle_oauth_email = TRUE)
gs4_auth(cache = TRUE)

ss <- gs4_get("https://docs.google.com/spreadsheets/d/1MQG2mWaKPmqM7cbKUbg3BQvElWV4-_F1nINZPg2b5XQ")

# Delete the old values
range_delete(ss, sheet = "ZCTA_coords", range = "A:O")

# Write to sheet
ZCTA_centroids %>%
  write_sheet(ss, sheet = "ZCTA_coords")

rm(ZCTA_centroids, ss, crosswalk_block_ZCTA, pop_block)
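On later runs, the cached centroids can presumably be pulled straight from that tab rather than recomputed from block data; a minimal sketch (same sheet URL and tab name as above):

# Sketch (not part of the original script): read the cached centroids back
# from the ZCTA_coords tab instead of recomputing them from block data
library(googlesheets4)
ZCTA_centroids <- read_sheet(
  "https://docs.google.com/spreadsheets/d/1MQG2mWaKPmqM7cbKUbg3BQvElWV4-_F1nINZPg2b5XQ",
  sheet = "ZCTA_coords"
)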
For some reason, I was not able to calculate the population weighted centroid for every ZCTA. It looks like most of these are in Tennessee or North Carolina (and maybe the Washington DC area?). The ZCTAs that are missing population weighted centroids (n_missing = 70) represent only 2.80% of the total (n_total = 2497).
Calculating the population weighted centroid for this small number of ZCTAs would require pulling even more of the extremely detailed block level data (substantially increasing the script’s run time) for unclear benefit; I find it very unlikely that patients who live close to Washington DC are driving to the WVUM system. I’m still not sure why the others failed. Regardless, wherever the population weighted centroid couldn’t be calculated, I used the geographic/geometric centroid instead.
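The counts quoted above can be tallied from the centroid_method flag built in the centroid code; the exact summary isn’t shown in the original, so this is just one way to do it:

# Sketch: count how many ZCTAs fell back to the geometric centroid
# (centroid_method comes from the case_when() in the centroid code above)
ZCTA_centroids %>%
  count(centroid_method) %>%
  mutate(pct = round(100 * n / sum(n), 2))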
This approach does have some edge cases, as you can see for the zip code 26298 in Webster County, WV. For this zip code, the population weighted centroid lands far in the northern corner.
Show the code
## Example using 26298
tmp <- list()

tmp$within <- blocks %>%
  st_filter(filter(subset_ZCTA, GEOID20 == "26298"), .predicate = st_within)

tmp$intersect <- blocks %>%
  st_filter(filter(subset_ZCTA, GEOID20 == "26298"), .predicate = st_intersects)

tmp$lookupTable_block_IDs <- ZCTA_centroids %>%
  # Look up the blocks used in the lookup table
  mutate(blocks_list = str_split(blocks_list, ",\\s*")) %>%
  # Limit to ZCTA(s) of interest
  filter(ZCTA_geoid == "26298") %>%
  # Convert to char vector of the block IDs
  pull(blocks_list) %>%
  unlist()

tmp$lookupTable_block_geos <- blocks %>%
  filter(GEOID %in% tmp$lookupTable_block_IDs)

tmp$zip <- filter(subset_ZCTA, GEOID20 == "26298")

tmp$pop_cent <- ZCTA_centroids %>%
  filter(ZCTA_geoid == "26298") %>%
  st_as_sf(coords = c("pw_lon", "pw_lat")) %>%
  st_set_crs(params$CRS)

tmp$geo_cent <- filter(st_centroid(ZCTA), GEOID20 == "26298")

# ggplot() +
#   # geom_sf(data=tmp$intersect, fill="grey70", color="grey50") +
#   geom_sf(data=tmp$within, aes(fill=value)) +
#   geom_sf(data=tmp$zip, fill=NA, color="red", alpha=.5, linetype="dashed") +
#   geom_sf(data=tmp$geo_cent, color="red") +
#   # geom_sf(data=tmp$pop_cent, color="blue") +
#   labs(fill="Population",
#        title="Zipcode 26298 (red outline)", subtitle = "In Webster County, WV") +
#   scale_fill_viridis_c() +
#   ggthemes::theme_map() +
#   theme(legend.position = c(.9,0))

ggplot() +
  # geom_sf(data=tmp$intersect, fill="grey70", color="grey50") +
  geom_sf(data = tmp$lookupTable_block_geos, aes(fill = value)) +
  geom_sf(data = tmp$zip, fill = NA, color = "red", alpha = .5, linetype = "dashed") +
  geom_sf(data = tmp$geo_cent, color = "red") +
  geom_sf(data = tmp$pop_cent, color = "blue") +
  labs(fill = "Population",
       title = "Zipcode 26298 (red outline)",
       subtitle = "In Webster County, WV") +
  # scale_fill_viridis_c() +
  scale_fill_binned(type = "gradient",
                    low = RColorBrewer::brewer.pal(3, "Greens")[1],
                    high = RColorBrewer::brewer.pal(3, "Greens")[3]) +
  ggthemes::theme_map() +
  theme(legend.position = c(.9, 0))
At first I thought this was because the blocks don’t line up perfectly with the ZCTA, but that might not actually be the problem. When I looked at the granular (block level) data for the census blocks that are within the ZCTA or adjacent to it (st_intersects, the most conservative estimate), it turns out that many of the blocks have a population of zero. These blocks are shown in grey below
Compare this with the zip code 26288 in Webster County, WV, which contains Webster Springs, WV. Here there are fewer areas without any population, so the population weighted centroid isn’t skewed as drastically
Show the code
## Example using 26288
tmp <- list()

tmp$within <- blocks %>%
  st_filter(filter(subset_ZCTA, GEOID20 == "26288"), .predicate = st_within) %>%
  mutate(value = na_if(value, 0))

tmp$intersect <- blocks %>%
  st_filter(filter(subset_ZCTA, GEOID20 == "26288"), .predicate = st_intersects) %>%
  mutate(value = na_if(value, 0))

tmp$zip <- filter(subset_ZCTA, GEOID20 == "26288")

tmp$lookupTable_block_IDs <- ZCTA_centroids %>%
  # Look up the blocks used in the lookup table
  mutate(blocks_list = str_split(blocks_list, ",\\s*")) %>%
  # Limit to ZCTA(s) of interest
  filter(ZCTA_geoid == "26288") %>%
  # Convert to char vector of the block IDs
  pull(blocks_list) %>%
  unlist()

tmp$lookupTable_block_geos <- blocks %>%
  filter(GEOID %in% tmp$lookupTable_block_IDs)

tmp$pop_cent <- ZCTA_centroids %>%
  filter(ZCTA_geoid == "26288") %>%
  st_as_sf(coords = c("pw_lon", "pw_lat")) %>%
  st_set_crs(params$CRS)

tmp$geo_cent <- filter(st_centroid(ZCTA), GEOID20 == "26288")

ggplot() +
  # geom_sf(data=tmp$intersect, fill="grey70", color="grey50") +
  geom_sf(data = tmp$lookupTable_block_geos, aes(fill = value)) +
  geom_sf(data = tmp$zip, fill = NA, color = "red", linetype = "dashed") +
  geom_sf(data = tmp$geo_cent, color = "red") +
  geom_sf(data = tmp$pop_cent, color = "blue") +
  labs(fill = "Population",
       title = "Zipcode 26288 (Webster Springs, WV)",
       subtitle = "Dots are centroid (Red = geographic, Blue = population weighted)") +
  scale_fill_binned(type = "gradient",
                    low = RColorBrewer::brewer.pal(3, "Greens")[1],
                    high = RColorBrewer::brewer.pal(3, "Greens")[3]) +
  ggthemes::theme_map() +
  theme(legend.position = c(.9, 0))
Show the code
rm(tmp)
Multiple ZCTAs together
Show the code
tmp <- list()

tmp$lookupTable_block_IDs <- ZCTA_centroids %>%
  # Look up the blocks used in the lookup table
  mutate(blocks_list = str_split(blocks_list, ",\\s*")) %>%
  # Limit to ZCTA(s) of interest
  filter(ZCTA_geoid %in% subset_ZCTA$ZCTA5CE20) %>%
  # drop some of the extra columns
  select(ZCTA_geoid, population, blocks_list) %>%
  # Unnest the list longer
  unnest_longer(blocks_list)

tmp$lookupTable_block_geos <- blocks %>%
  # Join blocks to the lookup table to find their ZCTA
  inner_join(tmp$lookupTable_block_IDs, by = c("GEOID" = "blocks_list")) %>%
  # Rearrange the df
  select(ZCTA = ZCTA_geoid, GEOID, value, ZCTA_pop = population, geometry) %>%
  # Calculate the proportion that each block makes up (relative to ZCTA's
  # total population)
  ### Method 1
  mutate(pct_of_pop = value / ZCTA_pop)
  ### Method 2
  # group_by(ZCTA) %>%
  # mutate(pct_of_pop = value / sum(value, na.rm = T))

tmp$pop_cent <- ZCTA_centroids %>%
  filter(ZCTA_geoid %in% subset_ZCTA$ZCTA5CE20) %>%
  st_as_sf(coords = c("pw_lon", "pw_lat")) %>%
  st_set_crs(params$CRS)

tmp$geo_cent <- filter(st_centroid(ZCTA), GEOID20 %in% subset_ZCTA$ZCTA5CE20)

subset_ZCTA %>%
  rename(ZCTA = ZCTA5CE20) %>%
  ggplot() +
  # geom_sf(aes(fill=ZCTA), color="black", alpha=.1) +
  geom_sf(fill = "grey", color = "black", alpha = .1) +
  geom_sf(data = tmp$lookupTable_block_geos,
          # aes(alpha=value, fill=ZCTA),
          aes(alpha = pct_of_pop, fill = ZCTA),
          color = NA) +
  geom_sf(data = tmp$geo_cent, shape = 1) +
  geom_sf(data = tmp$pop_cent, shape = 16) +
  # scale_alpha_binned(range = c(.1, .9)) +
  labs(title = "Centroids for select zip codes",
       caption = "Grey = population of zero",
       subtitle = "Geometric (empty circle) & population weighted (filled circles)",
       alpha = "Proportion of zipcode's population") +
  ggthemes::theme_map() +
  theme(legend.position = c(1, 0))
Show the code
rm(blocks, subset_ZCTA, tmp)
Meeting in the middle
In reality, there isn’t a huge difference between the two methods, as shown in the map below:
how_many_by_cutoff <- function(df, cutoff) {
  n_cut <- df %>%
    filter(diff_dist_km > cutoff) %>%
    nrow()
  n_tot <- nrow(df)
  pct <- signif(n_cut / n_tot, 2) * 100

  str_glue("{n_cut} zipcodes ({pct}%) have a discrepancy of >{cutoff} km ({round(cutoff/1.609, 1)} mi)")
}

# how_many_by_cutoff(ZCTA_centroids, 3)
Only 254 zipcodes (10%) have a discrepancy of >3 km (1.9 mi), 62 zipcodes (2.5%) have a discrepancy of >5 km (3.1 mi), and 3 zipcodes (0.12%) have a discrepancy of >10 km (6.2 mi).
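Those figures come from the how_many_by_cutoff() helper above; the calls presumably looked something like this:

how_many_by_cutoff(ZCTA_centroids, 3)
how_many_by_cutoff(ZCTA_centroids, 5)
how_many_by_cutoff(ZCTA_centroids, 10)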
The map below shows the ZCTAs that have a discrepancy of at least 5 km. The ones around New Martinsville, Beckley, and Princeton all seem like appropriate uses of the population weighted centroids, but the area just south of Elkins looks a little problematic (as does the 28675 zipcode, the one in North Carolina).
Note
The map below uses Census tract data, which is not nearly as granular as the block level data (which is used in the static figures above & the calculations)
As a compromise, we can split the difference between the population weighted centroids and the geometric centroids
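This is what the mid_lon / mid_lat columns computed earlier hold; a toy illustration with made-up coordinates:

# Toy illustration (hypothetical coordinates): the compromise point is just the
# unweighted mean of the geometric and population weighted centroids
geo_cent <- c(lon = -80.50, lat = 38.60)  # hypothetical geometric centroid
pop_cent <- c(lon = -80.46, lat = 38.64)  # hypothetical population weighted centroid
mid_point <- (geo_cent + pop_cent) / 2    # c(lon = -80.48, lat = 38.62)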
Identify the closest facilities
Now that we have our zip code centroids (which serve as a surrogate for where the patient lives), we need to identify the facilities closest to each one
Reducing the search space
One approach would be to calculate the distance from every zipcode (n=2497) to every facility (n=23). This is computationally demanding (57,431 queries) and not necessary.
Instead, we find the 3 closest hospitals for each zip code (using a straight-line nearest-neighbour search via RANN::nn2(), shown below)
Show the code
# If you want to show code, use code-fold: show

find_closest_nn <- function(zcta, hosp){
  #' @description
  #' Finds the 3 closest hospitals
  #'
  #' @param zcta a plain df where the first column is the ZCTA. There should
  #'   also be two columns named "lon" and "lat"
  #' @param hosp a data frame where the first column is the ID for the hospital.
  #'   there should also be a column of "geometry"

  library(RANN)

  # Extract the first column of `zcta` & `hosp`
  list_of_zcta <- zcta[[1]]
  list_of_hosp <- hosp[[1]]

  # Temp list for the results
  results <- list()

  results$hosp_index_lookup <- tibble(abbv = list_of_hosp) %>%
    mutate(hosp_index = row_number())

  ##################################
  # Testing assumptions
  ################################

  # Test that this column is indeed ZCTAs
  testthat::expect_equal(nrow(zcta), {
    list_of_zcta %>%
      str_detect("^\\d{5}$") %>%
      sum()
  })
  testthat::expect_equal(nrow(zcta), n_distinct(list_of_zcta))

  # Test that `zcta` contains columns named lon & lat
  testthat::expect_in("lon", names(zcta))
  testthat::expect_in("lat", names(zcta))

  # Make sure the list of hospitals is unique
  testthat::expect_equal(nrow(hosp), n_distinct(list_of_hosp))

  # Make sure `hosp` is an SF object
  testthat::expect_s3_class(hosp, "sf")

  ##################################
  # Do the nearest neighbour search
  ################################

  ## Matrix of ZCTA centroid coordinates
  mat_ZCTA <- zcta %>%
    # Make the correct geometry
    st_as_sf(coords = c("lon", "lat"), crs = params$CRS) %>%
    st_transform(3857) %>% # must be projected
    # Extract coords
    st_coordinates()

  ## Matrix of hospital coordinates (use the `hosp` argument, not the global object)
  mat_hosp <- hosp %>%
    select(abbv, geometry) %>%
    # Make the correct geometry
    st_as_sf(crs = params$CRS) %>%
    st_transform(3857) %>% # must be projected
    # Extract coords
    st_coordinates()

  nn_results <- RANN::nn2(data = mat_hosp,
                          query = mat_ZCTA,
                          k = 3)

  ##################################
  # Make a table of the closest hospitals for each ZCTA
  ################################

  # {nn.idx} is a matrix, with each row corresponding to each ZCTA. The first
  # column is the index of the closest hospital, the second column is the index
  # of the second closest hospital, and so on
  results$closest_hosp <- nn_results$nn.idx %>%
    # Convert matrix to tibble, and append the ZCTAs
    as_tibble() %>%
    mutate(ZCTA = list_of_zcta) %>%
    # this is now a table with columns named V1, V2, V3, and ZCTA. Let's
    # convert this to long format
    pivot_longer(cols = starts_with("V"),
                 names_to = "order",
                 values_to = "hosp_index") %>%
    # And replace the index with the supplied abbreviation
    left_join(results$hosp_index_lookup, by = "hosp_index") %>%
    select(-hosp_index) # drop the index column

  ##################################
  # Similar idea, but with their distances
  ################################

  # {nn.dists} is a matrix, with each row corresponding to each ZCTA. Instead of
  # having the index of the hospital, the values of the matrix represent the
  # distance
  results$closest_dist <- nn_results$nn.dists %>%
    # Convert matrix to tibble, and append the ZCTAs
    as_tibble() %>%
    mutate(ZCTA = list_of_zcta) %>%
    # this is now a table with columns named V1, V2, V3, and ZCTA. Let's
    # convert this to long format
    pivot_longer(cols = starts_with("V"),
                 names_to = "order",
                 values_to = "nn_dist") %>%
    mutate(nn_dist = round(nn_dist))

  ##################################
  # Return results
  ################################

  results$closest_hosp %>%
    left_join(results$closest_dist, by = c("ZCTA", "order"))
}

ZCTA_n3 <- ZCTA_centroids %>%
  select(ZCTA_geoid, lon = mid_lon, lat = mid_lat) %>%
  find_closest_nn(hospitals)
With this, we can map each facility’s service area (without accounting for drive time)
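A minimal sketch of such a map (not the original figure) could join the nearest-facility table back to the ZCTA polygons and shade each ZCTA by its single closest facility (order == "V1"):

# Sketch (assumes ZCTA_n3 and the ZCTA shapes from above): shade each ZCTA by
# its closest facility, ignoring drive time
ZCTA %>%
  left_join(filter(ZCTA_n3, order == "V1"), by = c("GEOID20" = "ZCTA")) %>%
  ggplot() +
  geom_sf(aes(fill = abbv), color = NA) +
  geom_sf(data = hospitals, shape = 17) +
  labs(fill = "Closest facility",
       title = "Approximate service areas (straight-line distance)") +
  ggthemes::theme_map()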