1. Introduction

Every public library in the U.S. has a jurisdiction that it was established to serve and from which, in almost all cases, it draws revenue from public funds (e.g., taxes). However, unlike school district boundaries, the boundaries of these jurisdictions may not be readily available. A new indicator in the national dataset of public libraries aims to identify the Census geography that most closely aligns with the service area of each public library. Because this indicator is new, it needs to be validated. This report answers the following research questions in order to validate the new indicator, that is, to flag outliers.

Research Questions:

  • A. What are the most common service area geography types for public libraries in Illinois?
  • B. What is the typical service population size for libraries in each geography type?
  • C. What is the distribution of the proportion of a public library’s cardholders to its service area population, and does it vary by geography type?
  • D. Is there a relationship between the number of cardholders and the service area population for a public library? Are there outliers to this relationship?

After a section that describes the data preparation, the final section presents the results for each research question.

2. Data Preparation

The raw data file contains all public libraries in the U.S. in 2022. It is a preliminary version of the Public Libraries Survey; older vintages of these data are available from the Institute of Museum and Library Services.

The code chunk that follows accomplishes the following tasks:

  1. Filter the records to only public libraries in Illinois
  2. Keep and rename only the five variables of interest for this analysis
  3. Parse the geography_code variable into two separate variables (to make it tidy)
    • geography type (e.g., place, county, school district)
    • precision level (e.g., exact, overlap, remainder)
  4. Recode the values of geography_type to match what tidycensus is expecting
    • [Above step is in preparation for anticipated processing for final project]
  5. Calculate the proportion of cardholders to service area population for each library
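# Load packages (tidyverse supplies readr, dplyr, tidyr, and ggplot2)
library(tidyverse)
library(janitor) # clean_names()
library(here)    # here()

# Load the raw data
all_public_libraries <- read_csv(here("data","publiclibraries22.csv"))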
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
# Clean the data
illinois_libraries <- all_public_libraries %>%
  clean_names() %>%
  filter(stabr=='IL') %>% # select only Illinois libraries
  filter(fscskey!="-3") %>% # remove closed libraries
  rename(library_id = fscskey, # rename variables to be easier to remember
         legal_basis = c_legbas,
         geography_code = geocode,
         service_area_pop = popu_lsa,
         cardholders = regbor) %>%
  select(library_id, legal_basis, geography_code, service_area_pop, cardholders) %>%
  # Next line replaces incorrect value for IL8034 in preliminary data
  mutate(geography_code = replace(geography_code, geography_code == "CI1", "CD1")) %>%
  # Separate the geography_code variable into parts
  separate(geography_code,
           into = c("geography_type", "geography_precision"), # name the new variables
           sep = 2, # split after 2nd character
           remove = FALSE) %>% # keep the original variable
  # recode geography_type values to be human readable
  mutate(geography_type = case_match(geography_type,
                                     'CO' ~ 'county',
                                     'CD' ~ 'county subdivision',
                                     'MD' ~ 'multi-county subdivision',
                                     'PL' ~ 'place',
                                     'MP' ~ 'multi-place',
                                     'SU' ~ 'school district (unified)',
                                     'OT' ~ 'other'
                                     )) %>%
  # calculate proportion of cardholders to service area population
  mutate(users_pop_proportion = cardholders / service_area_pop)
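
The parsing and recoding steps can be checked on a small example. The sketch below uses made-up library IDs and geography codes (the toy_* objects are hypothetical and not taken from the survey file) to show how separate() splits a code after its second character, and how case_match() returns NA for any code that has no matching rule. Counting those NA values after recoding is one way to catch unexpected codes in the preliminary data.

```r
library(dplyr)
library(tidyr)

# Hypothetical codes for illustration only; not taken from the survey file
toy_codes <- tibble(
  library_id = c("X0001", "X0002", "X0003"),
  geography_code = c("PL1", "CO2", "CD1")
)

# Split each code after its 2nd character, keeping the original column
toy_split <- toy_codes %>%
  separate(geography_code,
           into = c("geography_type", "geography_precision"),
           sep = 2,
           remove = FALSE)

# Recode with one rule deliberately left out ("CD") to show the NA behavior
toy_recoded <- toy_split %>%
  mutate(geography_label = case_match(geography_type,
                                      "PL" ~ "place",
                                      "CO" ~ "county"))

# Number of records whose code was not covered by any rule
n_unmatched <- sum(is.na(toy_recoded$geography_label))
n_unmatched
```
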

3. Analysis

3a. What are the most common service area geography types for public libraries in Illinois?

Out of 623 public libraries in Illinois, just over half (n=331) have a service area that is based on a “place”: either incorporated, like a city or village, or unincorporated, which in this case means “Census-designated”. The second most common type (n=115) is “county subdivision,” which in this case means a township. Another 77 libraries are based on a combination of places (i.e., “multi-place”).

# geom_bar for frequencies by geography_type
illinois_libraries %>%
  ggplot(aes(x = geography_type, fill = geography_type)) +
  geom_bar() +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.2) +
  guides(fill = "none") + # remove unnecessary legend
  labs(
    title = "Count of Illinois Libraries by Geography Type",
    x = "Geography Type"
  )
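
The same frequencies can also be produced as a table, which is easier to quote in the text than a chart. This is a minimal sketch on a hypothetical toy data frame (toy_libraries is not the survey data); with the real data, illinois_libraries would take its place.

```r
library(dplyr)

# Hypothetical toy data standing in for illinois_libraries
toy_libraries <- tibble(
  geography_type = c("place", "place", "place",
                     "county subdivision", "multi-place")
)

# Count libraries per geography type, largest group first,
# and add each group's share of all libraries
toy_counts <- toy_libraries %>%
  count(geography_type, sort = TRUE) %>%
  mutate(share = n / sum(n))

toy_counts
```
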

3b. What is the typical service population size for libraries in each geography type?

We would expect, in general, for counties to have larger populations than cities. The chart below confirms this assumption by showing that the median service area population of Illinois public libraries based on a county service area is larger than that of those based on a place or county subdivision. (Note: the value for “Other” in this chart should be ignored as there is only one library of geography type “Other”.)

illinois_libraries %>%
  group_by(geography_type) %>% # summarize() returns one row per geography type
  summarize(med_svc_area_pop = median(service_area_pop)) %>%
  ggplot(aes(x = geography_type, y = med_svc_area_pop, fill = geography_type)) +
  geom_col() +
  theme(legend.position = "none") + # remove unnecessary legend
  labs(
    title = "Median Service Area Population of Illinois Libraries by Geography Type",
    x = "Geography Type",
    y = "Median Service Area Population"
  )
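
Because a median based on a single library is not informative, it can help to report the group size next to each median. The sketch below does this on hypothetical toy values (toy_pop is not the survey data). The na.rm = TRUE argument is an added precaution, assuming that some records may have a missing population.

```r
library(dplyr)

# Hypothetical toy data standing in for illinois_libraries
toy_pop <- tibble(
  geography_type = c("county", "county", "place", "place", "place", "other"),
  service_area_pop = c(50000, 30000, 5000, 8000, 12000, 2000)
)

# Median service area population and number of libraries per geography type
toy_medians <- toy_pop %>%
  group_by(geography_type) %>%
  summarize(
    n_libraries = n(),
    med_svc_area_pop = median(service_area_pop, na.rm = TRUE)
  )

toy_medians
```

Groups with a very small n_libraries, such as “other” here, can then be read with caution or dropped from the chart.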

3c. What is the distribution of the proportion of a public library’s cardholders to its service area population, and does it vary by geography type?

According to the jitter plot below, libraries with service areas based on place and multi-place tend to have higher proportions of the service area population that are cardholders. Conversely, libraries with service areas based on school districts and counties tend to have lower proportions.

illinois_libraries %>%
  ggplot(aes(x = geography_type, y = users_pop_proportion, color = geography_type)) +
  geom_jitter() +
  theme(legend.position = "none") + # remove unnecessary legend
  labs(
    title = "Cardholder Proportion of Service Area Population by Geography Type",
    subtitle = "Among Public Libraries in Illinois",
    x = "Geography Type",
    y = "Cardholder Proportion of Service Area Population"
  )
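
The visual impression from a jitter plot can be backed up with a numeric summary. The sketch below computes quartiles of the cardholder proportion per geography type on hypothetical toy values (toy_props is not the survey data). It also counts proportions above 1, on the assumption that more cardholders than residents is worth a second look during validation.

```r
library(dplyr)

# Hypothetical toy data standing in for illinois_libraries
toy_props <- tibble(
  geography_type = c("place", "place", "place", "place", "county", "county"),
  users_pop_proportion = c(0.2, 0.4, 0.6, 1.2, 0.1, 0.3)
)

# Quartiles of the cardholder proportion within each geography type
toy_summary <- toy_props %>%
  group_by(geography_type) %>%
  summarize(
    q25 = quantile(users_pop_proportion, 0.25, na.rm = TRUE, names = FALSE),
    med = median(users_pop_proportion, na.rm = TRUE),
    q75 = quantile(users_pop_proportion, 0.75, na.rm = TRUE, names = FALSE),
    n_above_one = sum(users_pop_proportion > 1, na.rm = TRUE)
  )

toy_summary
```
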

3d. Is there a relationship between the number of cardholders and the service area population for a public library? Are there outliers to this relationship?

The faceted scatterplots below show that public libraries with service areas based on a county subdivision (i.e., township) have a clear positive relationship between the number of cardholders and the size of the service area population. For those based on a “place” geography, the relationship is more varied, and there appear to be many outliers with a low proportion of cardholders to service area population, a fact that was not visible in 3c.

illinois_libraries %>%
  filter(cardholders < 1000000) %>% # remove extreme outlier (Chicago) to preserve scale
  ggplot(aes(x = service_area_pop, y = cardholders, color = geography_type)) +
  geom_point() +
  theme(legend.position = "none") + # remove unnecessary legend
  facet_wrap(~ geography_type) +
  labs(
    title = "Scatterplot of Cardholders and Service Area Population of Public Libraries",
    subtitle = "Faceted by Geography Type, Among Public Libraries in Illinois",
    x = "Service Area Population",
    y = "Cardholders"
  )

# Save the last plot as PNG file
# ggsave(here("results", "scatterplot.png"))
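
The outliers seen in the scatterplots could also be listed explicitly so that individual libraries can be reviewed. The sketch below is one possible approach rather than the method used in this report: it applies the common 1.5 × IQR rule to the cardholder proportion within each geography type. The toy_outliers data frame and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for illinois_libraries
toy_outliers <- tibble(
  library_id = c("P1", "P2", "P3", "P4", "P5", "P6",
                 "T1", "T2", "T3", "T4"),
  geography_type = c(rep("place", 6), rep("county subdivision", 4)),
  service_area_pop = c(rep(10000, 6), rep(20000, 4)),
  cardholders = c(200, 3000, 3200, 3500, 3800, 4000,
                  5000, 5200, 5400, 5600)
)

# Flag libraries whose cardholder proportion lies more than 1.5 * IQR
# below the first quartile or above the third quartile of their own group
flagged <- toy_outliers %>%
  mutate(users_pop_proportion = cardholders / service_area_pop) %>%
  group_by(geography_type) %>%
  mutate(
    q1 = quantile(users_pop_proportion, 0.25, na.rm = TRUE, names = FALSE),
    q3 = quantile(users_pop_proportion, 0.75, na.rm = TRUE, names = FALSE),
    is_outlier = users_pop_proportion < q1 - 1.5 * (q3 - q1) |
      users_pop_proportion > q3 + 1.5 * (q3 - q1)
  ) %>%
  ungroup() %>%
  filter(is_outlier)

flagged
```

Grouping by geography type before computing the quartiles means that each library is compared only with libraries of the same type, which matches the faceting used in the scatterplots.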