knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(here)
library(janitor)
library(stringr)
library(sf)
library(tidycensus)
library(rgeoda)
library(leaflet)
# census_api_key("YOUR KEY GOES HERE", install = TRUE)
This project explores a methodology for identifying potentially errant values in the reported indicator of Census geography that most closely aligns with a public library service area. It does so by comparing the reported population of the service area to the 2020 decennial census count for that same jurisdiction (obtained with tidycensus), calculating a “population comparison ratio”. Then it generates local Moran’s I statistics (using rgeoda) and maps the resulting significance clusters to identify outliers in the population comparison ratio (using ggplot2 and leaflet). The results suggest that a local measure of spatial autocorrelation is not a sufficient method to identify outliers in the population comparison ratio.
The Public Libraries Survey (PLS) is an annual census that collects finances, staffing, offerings, and usage of all public libraries in the U.S. One of the major impediments to combining PLS data with other data sources is that the spatial information for each library is limited to the point location of the facilities; detailed data about the service area (or jurisdiction) boundaries has never been compiled nationally and is not currently available from all states. These service area boundaries are typically coterminous with municipal, school district, county subdivision, or county boundaries, but the prevalence of each of these boundary types varies by state (and sometimes within a state), and in some places they are special districts that split existing jurisdictions.
The PLS data already contained an indicator of the type of Census geography that best represented the service area boundaries; however, the levels (categories) of this variable did not actually match the available Census geographies and it was not clear whether the values were up to date. For the fiscal year (FY) 2022 PLS collection, this indicator was revised to better align with the Census geography types, as well as to allow new precision levels for the indicator (e.g., the remainder of a county outside of municipalities that may have separate library service). Now that this data is being collected, we need to devise methods for identifying outliers in the ratio of reported service area population to 2020 decennial census counts. These outliers could be errant values in the new geography type indicator. This report limits the scope to public libraries in Illinois, but the goal is to devise a method that will work for all states.
How can a measure of spatial autocorrelation (e.g., local Moran’s I) identify errant outliers in the reported values of geography type for public library service area jurisdictions?
This analysis uses two main data sources:
Any library systems added to the data universe in 2022 (that did not exist in 2021) have been excluded from the analysis because the address has not been geocoded to a point location yet. In the data for Illinois, there is one record excluded for that reason.
Due the use of tidycensus to obtain 2020 population counts for existing Census geographies, this analysis only includes the four Census geographies that have one-to-one matches with the GEOCODE values in the FY 2022 PLS data: place, county, county subdivision (i.e., township), and unified school district. There are 87 library records excluded from the analysis because they have a multi-place, multi-county-subdivision, or other GEOCODE value that would have required identifying the composite geographies and aggregating the results from tidycensus, both of which are out of scope for this report.
This analysis assumes the accuracy of the library service area population counts. If any of those counts are errant, they will negatively bias the results of the analysis.
This section details the data preparation procedures necessary to conduct an analysis of spatial autocorrelation of the ratio of reported service area population to 2020 decennial census counts using local Moran’s I.
The code chunk that follows accomplishes the following tasks:
# Load the raw data
publiclibraries22 <- read_csv(here("data","publiclibraries22.csv"))
# Load the fully processed data from FY 2021 that contains point locations
publiclibraries21 <- read_csv(here("data","publiclibraries21.csv"))
publiclibraries21_geo <- publiclibraries21 %>%
select(FSCSKEY, LONGITUD, LATITUDE, LOCALE_ADD)
# 1. Merge the point location and locale variables from 21 to 22 data
all_public_libraries <- publiclibraries22 %>%
left_join(publiclibraries21_geo,
by="FSCSKEY")
# 2./3. Clean the data
illinois_libraries <- all_public_libraries %>%
clean_names() %>%
filter(stabr == 'IL') %>% # select only Illinois libraries
filter(fscskey != "-3") %>% # remove closed libraries
filter(fscskey != "IL8034") %>% #drop record that doesn't exist in 2021 data
rename(library_id = fscskey, # rename variables to be easier to remember
library_name = libname,
legal_basis = c_legbas,
geography_code = geocode,
service_area_pop = popu_lsa,
cardholders = regbor,
long = longitud,
lat = latitude,
locale12 = locale_add
) %>%
select(library_id, library_name, legal_basis, geography_code,
service_area_pop, cardholders, long, lat, locale12) %>%
# 4. Separate the geographic_code variable into parts
separate(geography_code,
c("geography_type", "geography_precision"), # name the new variables
2, # split after 2nd character
remove = FALSE) %>% #keep the original variable
# 5. Recode geography_type values to be human readable
mutate(geography_type = case_match(geography_type,
'CO' ~ 'county',
'CD' ~ 'county subdivision',
'MD' ~ 'multi-county subdivision',
'PL' ~ 'place',
'MP' ~ 'multi-place',
'SU' ~ 'school district (unified)',
'OT' ~ 'other'
)) %>%
# Convert geography_precision to a factor with labels (for later leaflet use)
mutate(geography_precision = factor(geography_precision,
labels = c("exact", "overlap", "remainder"),
ordered = TRUE)) %>%
# 6. Calculate proportion of cardholders to service area population
mutate(users_pop_proportion = cardholders / service_area_pop) %>%
# 7. Collapse 12 locale levels to 4
mutate(locale4 = case_match(locale12,
11:13 ~ 1,
21:23 ~ 2,
31:33 ~ 3,
41:43 ~ 4
)) %>%
# Convert locale4 to a factor with labels
mutate(locale4 = factor(locale4, #use factor() NOT as.factor()!
labels = c("city", "suburb", "town", "rural"),
ordered = TRUE))
# 8. Convert dataframe to sf object with point geometry
illinois_libraries_geo <- illinois_libraries %>%
st_as_sf(coords = c("long", "lat"),
crs = 4269) # important to specify here to enable spatial join later
# 9. Get 2020 Decennial Census state total and state polygon
illinois_2020pop_total <- get_decennial(
geography = 'state',
variables = 'P1_001N',
year = 2020,
state = 'illinois',
geometry = TRUE,
progress_bar = FALSE
)
# 10. Loop over four primary geography types to get 2020 Decennial counts and polygons then spatial join to library data
# Create list of geography types for loop
geotypes <- c("place", "county", "county subdivision", "school district (unified)")
# Loop
for(i in (geotypes)) {
# Filter only libraries with reported census geotype 'i' (in loop list)
illinois_libraries_geo_temp <- illinois_libraries_geo %>%
filter(geography_type == i)
# Get 2020 population counts for geotype 'i' in state
illinois_2020pop_temp <- get_decennial(
geography = i,
variables = 'P1_001N',
year = 2020,
state = 'illinois',
geometry = TRUE,
progress_bar = FALSE
)
# Spatial join the Census polygons with the library point locations
illinois_2020pop_libraries_temp <- illinois_2020pop_temp %>%
st_join(illinois_libraries_geo_temp, left = FALSE) %>% # default to inner join
# Calculate the ratio of 2020 Census count to reported library service area pop
mutate(pop_align = value/service_area_pop) # 'value' is 2020 count
# this direction had more manageable scale than service_area_pop/value
# Trim spaces from geotype value before appending to df name
geotemp <- str_remove_all(i, " ")
# Save temp df with name that includes trimmed geotype 'i'
assign(paste0('illinois_2020pop_libraries_', geotemp), illinois_2020pop_libraries_temp)
# Cleanup temp objects
remove(illinois_libraries_geo_temp)
remove(illinois_2020pop_libraries_temp)
}
# 11. Append/stack separate geotype dfs back together
illinois_2020pop_libraries_all <- illinois_2020pop_libraries_place %>%
bind_rows(illinois_2020pop_libraries_county) %>%
bind_rows(illinois_2020pop_libraries_countysubdivision) %>%
bind_rows(`illinois_2020pop_libraries_schooldistrict(unified)`)
Tobler’s first law of geography states, “Everything is related to everything else, but near things are more related than distant things” (Tobler, 1970). That is, the values of a variable measured at different locations should be more correlated the closer the locations are to each other. Another name for this principle is spatial autocorrelation.
The analysis in this report depends on Tobler’s first law, in the form of local Moran’s I, to identify spatial clusters (or, more importantly here, exceptions to those clusters) in the values of a certain variable. The basic idea of local Moran’s I is to provide an indicator of how well the value at a given location is correlated to the values of neighboring locations.
Using the rgeoda package, we can spatially visualize the results of the local Moran’s I scatterplot that relates a given value to a weighted average of values from neighboring locations (Lloyd, 2011). We are primarily interested in the upper left (‘low-high’) and lower right (‘high-low’) quadrants of the scatterplot to identify the outliers in our variable of interest–the ratio of reported service area population to 2020 decennial census counts.
Note: Because we are only interested in identifying individual outlier values, the global Moran’s I statistic is of little use. The global statistic indicates how well the values are similar to neighboring values in an overall, summary sense within a given dataset. Because most of the ratio values are 1, the global statistic would show a very high level of autocorrelation that would not help answer the research question.
The code chunk below produces maps of the ratio of library service area population to the 2020 decennial counts, and then performs the local Moran’s I calculations and maps the resulting significance clusters.
Of the 535 libraries included in the analysis, 286 have a population comparison ratio of exactly 1. (This is expected because we expect the service area populations to be based on the 2020 decennial counts.) Thus, the map below shows most of the library service areas in a medium blue color. Only a few are a dark blue or a green/yellow color, indicating they have population comparison ratios further from 1.
# Create static map plot of population comparison ratio
illinois_2020pop_libraries_all %>%
ggplot(aes(fill = pop_align)) +
geom_sf(data = illinois_2020pop_total, fill = NA) + # add Illinois outline
geom_sf(data = illinois_2020pop_libraries_all, linewidth = 0.5) +
scale_fill_viridis_c() +
theme_void() +
labs(title = str_wrap("Ratio of 2020 Decennial Census Count to Reported Library Service Area Population", width = 40),
subtitle = str_wrap("For Illinois public libraries with municipal, township, county, or school district service areas", width = 50),
fill = str_wrap("2020 Census Count / Reported LSA Population", width = 20),
caption = str_wrap("2020 Decennial Census Counts obtained with tidycensus R package; Public library locations from IMLS's PLS, FY 2021", width = 70)
)
The map below shows the results of the local Moran’s I clusters, calculated with a weighted average of the six nearest neighbors and a significance cutoff of 0.075. However, because it is difficult to view several library service areas at a state-wide scale, the interactive map that appears below this static map is more useful.
Overall, it does not appear that the local Moran’s I clusters are helpful in identifying outliers in the population comparison ratios. Often, a service area is indicated as high-low or low-high when it has a ratio close to 1 but the surrounding service areas all have high or low values. In these instances, all of the neighbors need to be reviewed, but the “outlier” does not.
# Generate a geoda object of the spatial weights using K-nearest neighbors (k=6)
k_weights <- illinois_2020pop_libraries_all %>%
knn_weights(6)
# Using those weights, calculate the local Moran's I values for the dataset
libraries_lisa <- local_moran(
k_weights,
illinois_2020pop_libraries_all["pop_align"],
significance_cutoff = 0.075
)
# Store various values from LISA/local Moran's I analysis for mapping
cluster_colors <- c("Not significant" = "#eeeeee",
"High-High" = "#FF0000",
"Low-Low" = "#0000FF",
"Low-High" = "#a7adf9",
"High-Low" = "#f4ada8",
"Undefined" = "#464646",
"Isolated" = "#999999")
# Append the LISA results to the original dataset
illinois_2020pop_libraries_all$cluster <- as.factor(
libraries_lisa$GetClusterIndicators()
)
# Assign level values to that new LISA results variable
levels(illinois_2020pop_libraries_all$cluster) <- libraries_lisa$GetLabels()
# Create static map plot of LISA/local Moran's I results
illinois_2020pop_libraries_all %>%
ggplot(aes(fill = cluster)) +
geom_sf(data = illinois_2020pop_total, fill = NA) + # add Illinois outline
geom_sf(data = illinois_2020pop_libraries_all) +
scale_fill_manual(values = cluster_colors,
limits = c("Not significant",
"High-High",
"Low-Low",
"Low-High",
"High-Low",
"Undefined",
"Isolated"),
labels = c("Not significant",
"High-High",
"Low-Low",
"Low-High",
"High-Low",
"Undefined",
"Isolated")) +
theme_void() +
labs(title = str_wrap("Local Moran's I for Ratio of 2020 Decennial Census Count to Reported Library Service Area Population", width = 40),
subtitle = str_wrap("For Illinois public libraries with municipal, township, county, or school district service areas", width = 50),
fill = str_wrap("Local Moran's I Quadrant", width = 20),
caption = str_wrap("Local Moran's I calculated with rgeoda package", width = 70)
)
In this interactive version of the local Moran’s I significance cluster map, there are additional information provided:
# Establish color scheme for significance clusters
cluster_pal <- colorFactor(c("#eeeeee",
"#FF0000",
"#0000FF",
"#a7adf9",
"#f4ada8",
"#464646",
"#999999"),
illinois_2020pop_libraries_all$cluster)
# Establish color scheme for polygon outlines to show geography precision level
precision_pal <- colorFactor(c("green", "orange", "purple"),
illinois_2020pop_libraries_all$geography_precision)
# Generate Leaflet interactive map
leaflet() %>%
addProviderTiles(providers$CartoDB) %>%
addPolygons(data = illinois_2020pop_libraries_all,
stroke = TRUE,
weight = 1,
color = ~precision_pal(geography_precision),
fill = TRUE,
fillOpacity = 0.8,
fillColor = ~cluster_pal(cluster),
label = illinois_2020pop_libraries_all$library_name,
popup = paste(sep = "<br/>", #line break
paste("<b>", #bold
illinois_2020pop_libraries_all$library_name,
"</b>"),
paste("Geography Type:",
illinois_2020pop_libraries_all$geography_type),
paste("Geography Precision:",
illinois_2020pop_libraries_all$geography_precision),
paste("Service Area Population:",
illinois_2020pop_libraries_all$service_area_pop),
paste("2020 Decennial Count:",
illinois_2020pop_libraries_all$value),
paste("Ratio of 2020 Count to Service Area Pop:",
illinois_2020pop_libraries_all$pop_align)
)
) %>%
addLegend("topright",
pal = cluster_pal,
values = illinois_2020pop_libraries_all$cluster,
title = "Local Moran's I Cluster",
opacity = 1) %>%
addLegend("topright",
pal = precision_pal,
values = illinois_2020pop_libraries_all$geography_precision,
title = "Geography Match Precision",
opacity = 1)
While I am very pleased with how the Leaflet interactive map turned out, I am overall disappointed that this spatial analysis did not appear to be a value-add to the outlier detection process. I also would have liked to figure out a relatively simple way to include the libraries that had multi-place service areas, since leaving those 77 records out of the analysis may have led to some bias in the results.
This exercise did make me much more familiar with local Moran’s I as a spatial analysis method, and I have started thinking about other research questions I could answer using it with public library data, possibly for my capstone paper. For example, I could assess the spatial autocorrelation of circulation of children’s materials to identify libraries that may be doing something exceptional to increase reading among children in a given area.