Validating the Reported Service Area Geographies of Public Libraries in Illinois

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(here)
library(janitor)
library(stringr)
library(sf)
library(tidycensus)
library(rgeoda)
library(leaflet)
# census_api_key("YOUR KEY GOES HERE", install = TRUE)

Abstract

This project explores a methodology for identifying potentially errant values in the reported indicator of Census geography that most closely aligns with a public library service area. It does so by comparing the reported population of the service area to the 2020 decennial census count for that same jurisdiction (obtained with tidycensus), calculating a “population comparison ratio”. Then it generates local Moran’s I statistics (using rgeoda) and maps the resulting significance clusters to identify outliers in the population comparison ratio (using ggplot2 and leaflet). The results suggest that a local measure of spatial autocorrelation is not a sufficient method to identify outliers in the population comparison ratio.

Introduction

The Public Libraries Survey (PLS) is an annual census that collects finances, staffing, offerings, and usage of all public libraries in the U.S. One of the major impediments to combining PLS data with other data sources is that the spatial information for each library is limited to the point location of the facilities; detailed data about the service area (or jurisdiction) boundaries has never been compiled nationally and is not currently available from all states. These service area boundaries are typically coterminous with municipal, school district, county subdivision, or county boundaries, but the prevalence of each of these boundary types varies by state (and sometimes within a state), and in some places they are special districts that split existing jurisdictions.

The PLS data already contained an indicator of the type of Census geography that best represented the service area boundaries; however, the levels (categories) of this variable did not actually match the available Census geographies and it was not clear whether the values were up to date. For the fiscal year (FY) 2022 PLS collection, this indicator was revised to better align with the Census geography types, as well as to allow new precision levels for the indicator (e.g., the remainder of a county outside of municipalities that may have separate library service). Now that this data is being collected, we need to devise methods for identifying outliers in the ratio of reported service area population to 2020 decennial census counts. These outliers could be errant values in the new geography type indicator. This report limits the scope to public libraries in Illinois, but the goal is to devise a method that will work for all states.

Research Question

How can a measure of spatial autocorrelation (e.g., local Moran’s I) identify errant outliers in the reported values of geography type for public library service area jurisdictions?

Data

This analysis uses two main data sources:

The Institute of Museum and Library Services’s Public Libraries Survey. Point locations of library systems (i.e., administrative entities) from the FY 2021 public-use files are joined to a preliminary version of the 2022 data that is still being processed. The main variables of interest are:
- GEOCODE: the indicator of the type of Census Bureau geography that most closely aligns with the library’s legal service area.
- POPU_LSA: the official population count of the library’s legal service area.
2020 Decennial Census total population counts for various geographic areas. These data–along with the polygon geometries for each area–are obtained using the tidycensus package originally developed by Kyle Walker.

Limitations

Any library systems added to the data universe in 2022 (that did not exist in 2021) have been excluded from the analysis because the address has not been geocoded to a point location yet. In the data for Illinois, there is one record excluded for that reason.

Due the use of tidycensus to obtain 2020 population counts for existing Census geographies, this analysis only includes the four Census geographies that have one-to-one matches with the GEOCODE values in the FY 2022 PLS data: place, county, county subdivision (i.e., township), and unified school district. There are 87 library records excluded from the analysis because they have a multi-place, multi-county-subdivision, or other GEOCODE value that would have required identifying the composite geographies and aggregating the results from tidycensus, both of which are out of scope for this report.

This analysis assumes the accuracy of the library service area population counts. If any of those counts are errant, they will negatively bias the results of the analysis.

Methods

This section details the data preparation procedures necessary to conduct an analysis of spatial autocorrelation of the ratio of reported service area population to 2020 decennial census counts using local Moran’s I.

Data Preparation

The code chunk that follows accomplishes the following tasks:

Append geographic variables (latitude, longitude, and locale) from published 2021 data
Filter the records to only public libraries in Illinois
Keep and rename only the five variables of interest for this analysis
Parse the geography_code variable into two separate variables (to make it tidy)
- geography type (e.g., place, county, school district)
- precision level (e.g., exact, overlap, remainder)
Recode the values of geography_type to match what tidycensus is expecting (used in step 10 loop)
Calculate the proportion of cardholders to service area population for each library
Collapse/recode the locale variable and convert to a 4-level factor
Convert the dataframe to an sf object with point geometry for spatial join
Obtain the state polygon (with the 2020 decennial count, which isn’t needed)
Loop over four geography types to get 2020 Decennial counts and polygons then spatial join to library data
- Filter sf data to only libraries with given geography type (in preparation for spatial join)
- Use get_decennial() to obtain 2020 Census counts for all features of given geography type in state with their polygon geometries
- Spatially inner-join each feature set with the place-based libraries to end up with only the municipalities that contain a place-based library
- Calculate the ratio that compares the 2020 Census count with the reported library service area population for each library record
Append/bind all joined data subsets based on geography_type back together

# Load the raw data
publiclibraries22 <- read_csv(here("data","publiclibraries22.csv"))

# Load the fully processed data from FY 2021 that contains point locations
publiclibraries21 <- read_csv(here("data","publiclibraries21.csv"))
publiclibraries21_geo <- publiclibraries21 %>%
  select(FSCSKEY, LONGITUD, LATITUDE, LOCALE_ADD)

# 1. Merge the point location and locale variables from 21 to 22 data
all_public_libraries <- publiclibraries22 %>%
  left_join(publiclibraries21_geo,
            by="FSCSKEY")

# 2./3. Clean the data
illinois_libraries <- all_public_libraries %>%
  clean_names() %>%
  filter(stabr == 'IL') %>% # select only Illinois libraries
  filter(fscskey != "-3") %>% # remove closed libraries
  filter(fscskey != "IL8034") %>% #drop record that doesn't exist in 2021 data
  rename(library_id = fscskey, # rename variables to be easier to remember
         library_name = libname,
         legal_basis = c_legbas,
         geography_code = geocode,
         service_area_pop = popu_lsa,
         cardholders = regbor,
         long = longitud,
         lat = latitude,
         locale12 = locale_add
         ) %>%
  select(library_id, library_name, legal_basis, geography_code, 
         service_area_pop, cardholders, long, lat, locale12) %>%
  # 4. Separate the geographic_code variable into parts
  separate(geography_code, 
           c("geography_type", "geography_precision"), # name the new variables
           2, # split after 2nd character
           remove = FALSE) %>% #keep the original variable
  # 5. Recode geography_type values to be human readable
  mutate(geography_type = case_match(geography_type,
                                     'CO' ~ 'county',
                                     'CD' ~ 'county subdivision',
                                     'MD' ~ 'multi-county subdivision',
                                     'PL' ~ 'place',
                                     'MP' ~ 'multi-place',
                                     'SU' ~ 'school district (unified)',
                                     'OT' ~ 'other'
                                     )) %>%
  # Convert geography_precision to a factor with labels (for later leaflet use)
  mutate(geography_precision = factor(geography_precision,
                                      labels = c("exact", "overlap", "remainder"),
                                      ordered = TRUE)) %>%
  # 6. Calculate proportion of cardholders to service area population
  mutate(users_pop_proportion = cardholders / service_area_pop) %>%
  # 7. Collapse 12 locale levels to 4
  mutate(locale4 = case_match(locale12,
                              11:13 ~ 1,
                              21:23 ~ 2,
                              31:33 ~ 3,
                              41:43 ~ 4
                              )) %>%
  # Convert locale4 to a factor with labels
  mutate(locale4 = factor(locale4, #use factor() NOT as.factor()!
                          labels = c("city", "suburb", "town", "rural"),
                          ordered = TRUE))

# 8. Convert dataframe to sf object with point geometry
illinois_libraries_geo <- illinois_libraries %>%
  st_as_sf(coords = c("long", "lat"),
           crs = 4269) # important to specify here to enable spatial join later

# 9. Get 2020 Decennial Census state total and state polygon
illinois_2020pop_total <- get_decennial(
  geography = 'state',
  variables = 'P1_001N',
  year = 2020,
  state = 'illinois',
  geometry = TRUE,
  progress_bar = FALSE
)

# 10. Loop over four primary geography types to get 2020 Decennial counts and polygons then spatial join to library data
# Create list of geography types for loop
geotypes <- c("place", "county", "county subdivision", "school district (unified)")

# Loop
for(i in (geotypes)) {
  # Filter only libraries with reported census geotype 'i' (in loop list)
  illinois_libraries_geo_temp <- illinois_libraries_geo %>%
    filter(geography_type == i)
  
  # Get 2020 population counts for geotype 'i' in state
  illinois_2020pop_temp <- get_decennial(
    geography = i,
    variables = 'P1_001N',
    year = 2020,
    state = 'illinois',
    geometry = TRUE,
    progress_bar = FALSE
  )
  
  # Spatial join the Census polygons with the library point locations 
  illinois_2020pop_libraries_temp <- illinois_2020pop_temp %>%
    st_join(illinois_libraries_geo_temp, left = FALSE) %>% # default to inner join
    # Calculate the ratio of 2020 Census count to reported library service area pop
    mutate(pop_align = value/service_area_pop) # 'value' is 2020 count
    # this direction had more manageable scale than service_area_pop/value
  
  # Trim spaces from geotype value before appending to df name
  geotemp <- str_remove_all(i, " ")
  
  # Save temp df with name that includes trimmed geotype 'i'
  assign(paste0('illinois_2020pop_libraries_', geotemp), illinois_2020pop_libraries_temp)
  
  # Cleanup temp objects
  remove(illinois_libraries_geo_temp)
  remove(illinois_2020pop_libraries_temp)
}

# 11. Append/stack separate geotype dfs back together
illinois_2020pop_libraries_all <- illinois_2020pop_libraries_place %>%
  bind_rows(illinois_2020pop_libraries_county) %>%
  bind_rows(illinois_2020pop_libraries_countysubdivision) %>%
  bind_rows(`illinois_2020pop_libraries_schooldistrict(unified)`)

Analysis of Spatial Autocorrelation

Tobler’s first law of geography states, “Everything is related to everything else, but near things are more related than distant things” (Tobler, 1970). That is, the values of a variable measured at different locations should be more correlated the closer the locations are to each other. Another name for this principle is spatial autocorrelation.

The analysis in this report depends on Tobler’s first law, in the form of local Moran’s I, to identify spatial clusters (or, more importantly here, exceptions to those clusters) in the values of a certain variable. The basic idea of local Moran’s I is to provide an indicator of how well the value at a given location is correlated to the values of neighboring locations.

Using the rgeoda package, we can spatially visualize the results of the local Moran’s I scatterplot that relates a given value to a weighted average of values from neighboring locations (Lloyd, 2011). We are primarily interested in the upper left (‘low-high’) and lower right (‘high-low’) quadrants of the scatterplot to identify the outliers in our variable of interest–the ratio of reported service area population to 2020 decennial census counts.

Note: Because we are only interested in identifying individual outlier values, the global Moran’s I statistic is of little use. The global statistic indicates how well the values are similar to neighboring values in an overall, summary sense within a given dataset. Because most of the ratio values are 1, the global statistic would show a very high level of autocorrelation that would not help answer the research question.

Results

The code chunk below produces maps of the ratio of library service area population to the 2020 decennial counts, and then performs the local Moran’s I calculations and maps the resulting significance clusters.

Map of Population Comparison Ratio

Of the 535 libraries included in the analysis, 286 have a population comparison ratio of exactly 1. (This is expected because we expect the service area populations to be based on the 2020 decennial counts.) Thus, the map below shows most of the library service areas in a medium blue color. Only a few are a dark blue or a green/yellow color, indicating they have population comparison ratios further from 1.

# Create static map plot of population comparison ratio
illinois_2020pop_libraries_all %>%
  ggplot(aes(fill = pop_align)) +
    geom_sf(data = illinois_2020pop_total, fill = NA) + # add Illinois outline
    geom_sf(data = illinois_2020pop_libraries_all, linewidth = 0.5) +
    scale_fill_viridis_c() +
    theme_void() +
    labs(title = str_wrap("Ratio of 2020 Decennial Census Count to Reported Library Service Area Population", width = 40),
         subtitle = str_wrap("For Illinois public libraries with municipal, township, county, or school district service areas", width = 50),
         fill = str_wrap("2020 Census Count / Reported LSA Population", width = 20),
         caption = str_wrap("2020 Decennial Census Counts obtained with tidycensus R package; Public library locations from IMLS's PLS, FY 2021", width = 70)
         )

Map of Local Moran’s I Signficance Clusters

The map below shows the results of the local Moran’s I clusters, calculated with a weighted average of the six nearest neighbors and a significance cutoff of 0.075. However, because it is difficult to view several library service areas at a state-wide scale, the interactive map that appears below this static map is more useful.

Overall, it does not appear that the local Moran’s I clusters are helpful in identifying outliers in the population comparison ratios. Often, a service area is indicated as high-low or low-high when it has a ratio close to 1 but the surrounding service areas all have high or low values. In these instances, all of the neighbors need to be reviewed, but the “outlier” does not.

# Generate a geoda object of the spatial weights using K-nearest neighbors (k=6)
k_weights <- illinois_2020pop_libraries_all %>%
  knn_weights(6)

# Using those weights, calculate the local Moran's I values for the dataset
libraries_lisa <- local_moran(
  k_weights,
  illinois_2020pop_libraries_all["pop_align"],
  significance_cutoff = 0.075
)

# Store various values from LISA/local Moran's I analysis for mapping
cluster_colors <- c("Not significant" = "#eeeeee",
                    "High-High" = "#FF0000", 
                    "Low-Low" = "#0000FF", 
                    "Low-High" = "#a7adf9",
                    "High-Low" = "#f4ada8",              
                    "Undefined" = "#464646",
                    "Isolated" = "#999999")

# Append the LISA results to the original dataset
illinois_2020pop_libraries_all$cluster <- as.factor(
  libraries_lisa$GetClusterIndicators()
)

# Assign level values to that new LISA results variable
levels(illinois_2020pop_libraries_all$cluster) <- libraries_lisa$GetLabels()

# Create static map plot of LISA/local Moran's I results
illinois_2020pop_libraries_all %>%
  ggplot(aes(fill = cluster)) +
    geom_sf(data = illinois_2020pop_total, fill = NA) + # add Illinois outline
    geom_sf(data = illinois_2020pop_libraries_all) +
    scale_fill_manual(values = cluster_colors,
                      limits = c("Not significant", 
                                 "High-High", 
                                 "Low-Low", 
                                 "Low-High", 
                                 "High-Low", 
                                 "Undefined", 
                                 "Isolated"),
                      labels = c("Not significant", 
                                 "High-High", 
                                 "Low-Low", 
                                 "Low-High", 
                                 "High-Low", 
                                 "Undefined", 
                                 "Isolated")) + 
    theme_void() +
    labs(title = str_wrap("Local Moran's I for Ratio of 2020 Decennial Census Count to Reported Library Service Area Population", width = 40),
         subtitle = str_wrap("For Illinois public libraries with municipal, township, county, or school district service areas", width = 50),
         fill = str_wrap("Local Moran's I Quadrant", width = 20),
         caption = str_wrap("Local Moran's I calculated with rgeoda package", width = 70)
         )

Interactive Version

In this interactive version of the local Moran’s I significance cluster map, there are additional information provided:

The precision level of the original GEOCODE value is reflected in the outline color of each service area: green means “exact”, orange means “overlap” (i.e., the service area overlaps but doesn’t match a Census geography exactly), and “remainder” (of which there are none in Illinois).
A pop-up that shows all relevant information that would be needed to review these records and identify outliers.

# Establish color scheme for significance clusters
cluster_pal <- colorFactor(c("#eeeeee", 
                             "#FF0000", 
                             "#0000FF", 
                             "#a7adf9", 
                             "#f4ada8", 
                             "#464646", 
                             "#999999"),
                           illinois_2020pop_libraries_all$cluster)

# Establish color scheme for polygon outlines to show geography precision level
precision_pal <- colorFactor(c("green", "orange", "purple"),
                             illinois_2020pop_libraries_all$geography_precision)

# Generate Leaflet interactive map
leaflet() %>%
  addProviderTiles(providers$CartoDB) %>%
  addPolygons(data = illinois_2020pop_libraries_all,
              stroke = TRUE,
              weight = 1,
              color = ~precision_pal(geography_precision),
              fill = TRUE,
              fillOpacity = 0.8,
              fillColor = ~cluster_pal(cluster),
              label = illinois_2020pop_libraries_all$library_name,
              popup = paste(sep = "<br/>", #line break
                            paste("<b>", #bold
                                  illinois_2020pop_libraries_all$library_name, 
                                  "</b>"), 
                            paste("Geography Type:", 
                                  illinois_2020pop_libraries_all$geography_type),
                            paste("Geography Precision:",
                                  illinois_2020pop_libraries_all$geography_precision),
                            paste("Service Area Population:",
                                  illinois_2020pop_libraries_all$service_area_pop),
                            paste("2020 Decennial Count:",
                                  illinois_2020pop_libraries_all$value),
                            paste("Ratio of 2020 Count to Service Area Pop:",
                                  illinois_2020pop_libraries_all$pop_align)
                            ) 
             ) %>%
  addLegend("topright", 
            pal = cluster_pal, 
            values = illinois_2020pop_libraries_all$cluster,
            title = "Local Moran's I Cluster",
            opacity = 1) %>%
  addLegend("topright", 
            pal = precision_pal, 
            values = illinois_2020pop_libraries_all$geography_precision,
            title = "Geography Match Precision",
            opacity = 1)

Reflection

While I am very pleased with how the Leaflet interactive map turned out, I am overall disappointed that this spatial analysis did not appear to be a value-add to the outlier detection process. I also would have liked to figure out a relatively simple way to include the libraries that had multi-place service areas, since leaving those 77 records out of the analysis may have led to some bias in the results.

This exercise did make me much more familiar with local Moran’s I as a spatial analysis method, and I have started thinking about other research questions I could answer using it with public library data, possibly for my capstone paper. For example, I could assess the spatial autocorrelation of circulation of children’s materials to identify libraries that may be doing something exceptional to increase reading among children in a given area.

References

Anselin, L. (2017). “Spatial Autocorrelation Lectures”. Recordings retrieved 7/1/2023 from https://www.youtube.com/playlist?list=PLzREt6r1NennT20oeK46QOloT2k9QM_xB.
Li, X. (2021). “Tutorial of rgeoda”. Website retrieved 7/1/2023 from https://geodacenter.github.io/rgeoda/articles/rgeoda_tutorial.html.
LLoyd, C. (2011). Local Models for Spatial Analysis, 2nd ed. CRC Press.
Tobler, W. R. (1970). “A Computer Movie Simulating Urban Growth in the Detroit Region.” Economic Geography 46: 234–40. https://doi.org/10.2307/143141.