Every public library in the U.S. has a jurisdiction that it was established to serve and from which it almost always draws revenue through public funds (e.g., taxes). However, unlike those of school districts, the boundaries of these jurisdictions may not be readily available. A new indicator in the national dataset of public libraries aims to identify the Census geography that most closely aligns with the service area of each public library. Because this indicator is new, it needs to be validated. This report aims to answer the following research questions in order to validate the new indicator, that is, to flag outliers.
After a section that describes the data preparation, the final section presents the results for each research question.
The raw data file contains all public libraries in the U.S. in 2022. It is a preliminary version of the Public Libraries Survey; older vintages of these data are available from the Institute of Museum and Library Services.
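The code chunk that follows loads the raw data, keeps only the Illinois libraries that are not closed, renames and selects the variables of interest, corrects one known bad value, splits the geography code into a geography type and a precision component, and calculates the proportion of cardholders to service area population.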
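# Packages used below: read_csv(), the dplyr/tidyr verbs and ggplot2 come
# with tidyverse; here() is from here; clean_names() is from janitor
library(tidyverse)
library(here)
library(janitor)
# Load the raw data
all_public_libraries <- read_csv(here("data","publiclibraries22.csv"))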
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
# Clean the data
illinois_libraries <- all_public_libraries %>%
  clean_names() %>%
  filter(stabr=='IL') %>% # select only Illinois libraries
  filter(fscskey!="-3") %>% # remove closed libraries
  rename(library_id = fscskey, # rename variables to be easier to remember
         legal_basis = c_legbas,
         geography_code = geocode,
         service_area_pop = popu_lsa,
         cardholders = regbor) %>%
  select(library_id, legal_basis, geography_code, service_area_pop, cardholders) %>%
  # Next line replaces incorrect value for IL8034 in preliminary data
  mutate(geography_code = replace(geography_code, geography_code == "CI1", "CD1")) %>%
  # Separate the geography_code variable into parts
  separate(geography_code,
           into = c("geography_type", "geography_precision"), # name the new variables
           sep = 2, # split after 2nd character
           remove = FALSE) %>% # keep the original variable
  # recode geography_type values to be human readable
  mutate(geography_type = case_match(geography_type,
    'CO' ~ 'county',
    'CD' ~ 'county subdivision',
    'MD' ~ 'multi-county subdivision',
    'PL' ~ 'place',
    'MP' ~ 'multi-place',
    'SU' ~ 'school district (unified)',
    'OT' ~ 'other'
  )) %>%
  # calculate proportion of cardholders to service area population
  mutate(users_pop_proportion = cardholders / service_area_pop)
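The splitting and recoding step above can be tried in isolation. The following is a minimal sketch on a made-up tibble; the codes in `toy`, including the invalid `ZZ1`, are hypothetical and are not taken from the survey file. It shows that a code with no matching label is recoded to `NA`, which gives one simple way to spot unexpected geography codes.

```r
library(dplyr)
library(tidyr)

# Hypothetical geography codes, made up for illustration only
toy <- tibble(geography_code = c("CO1", "PL2", "CD1", "ZZ1"))

toy_split <- toy %>%
  separate(geography_code,
           into = c("geography_type", "geography_precision"),
           sep = 2,        # split after the 2nd character
           remove = FALSE) %>%
  mutate(geography_type = case_match(geography_type,
    "CO" ~ "county",
    "CD" ~ "county subdivision",
    "PL" ~ "place"
  ))

# Codes without a matching label are recoded to NA by case_match()
unmatched <- toy_split %>%
  filter(is.na(geography_type))

print(toy_split)
print(unmatched)
```

Note that `case_match()` requires dplyr 1.1.0 or later.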
Out of 623 public libraries in Illinois, just over half (n=331) have a service area that is based on a “place”: either incorporated, like a city or village, or unincorporated, which in this case means “Census-designated”. The second most common type (n=115) is “county subdivision,” which in this case means a township. Another 77 libraries are based on a combination of places (i.e., “multi-place”).
# geom_bar for frequencies by geography_type
illinois_libraries %>%
  ggplot(aes(x = geography_type, fill = geography_type)) +
  geom_bar() +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.2) +
  guides(fill = "none") + # remove unnecessary legend
  labs(
    title = "Count of Illinois Libraries by Geography Type",
    x = "Geography Type"
  )
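The same frequencies can also be tabulated rather than plotted, which makes it easy to add each type's share of the total. This is a minimal sketch on a made-up tibble (the values in `toy` are hypothetical); with the real data, `illinois_libraries` would presumably take the place of `toy`.

```r
library(dplyr)

# Hypothetical geography types, made up for illustration only
toy <- tibble(
  geography_type = c("place", "place", "place", "county", "multi-place")
)

type_counts <- toy %>%
  count(geography_type, sort = TRUE) %>% # one row per type, largest first
  mutate(share = n / sum(n))             # proportion of all libraries

print(type_counts)
```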
In general, we would expect counties to have larger populations than cities. The chart below is consistent with this expectation: the median service area population of Illinois public libraries based on a county service area is larger than that of libraries based on a place or county subdivision. (Note: the value for “other” in this chart should be ignored, as there is only one library of geography type “other”.)
illinois_libraries %>%
  group_by(geography_type) %>% #add vars here that need to be included in summarize
  summarize(med_svc_area_pop = median(service_area_pop)) %>%
  ggplot(aes(x = geography_type, y = med_svc_area_pop, fill = geography_type)) +
  geom_col() +
  theme(legend.position = "none") + # remove unnecessary legend
  labs(
    title = "Median Service Area Population of Illinois Libraries by Geography Type",
    x = "Geography Type",
    y = "Median Service Area Population"
  )
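Because a median computed from a single library is not informative, it can help to report the group size next to each median. The sketch below does this on made-up numbers; the tibble `toy`, the `small_group` column, and the cutoff of 3 libraries are all assumptions for illustration, not part of the original analysis.

```r
library(dplyr)

# Hypothetical service area populations, made up for illustration only
toy <- tibble(
  geography_type   = c("county", "county", "county", "place", "place", "other"),
  service_area_pop = c(50000, 30000, 40000, 8000, 12000, 999)
)

med_by_type <- toy %>%
  group_by(geography_type) %>%
  summarize(
    n_libraries      = n(),
    med_svc_area_pop = median(service_area_pop, na.rm = TRUE)
  ) %>%
  # Arbitrary cutoff (assumption): mark groups too small to interpret
  mutate(small_group = n_libraries < 3)

print(med_by_type)
```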
According to the jitter plot below, libraries with service areas based on place and multi-place seem to have a higher proportion of cardholders in their service area population. Conversely, libraries with service areas based on school districts and counties seem to have a lower proportion of cardholders.
illinois_libraries %>%
  ggplot(aes(x = geography_type, y = users_pop_proportion, color = geography_type)) +
  geom_jitter() +
  theme(legend.position = "none") + # remove unnecessary legend
  labs(
    title = "Cardholder Proportion of Service Area Population by Geography Type",
    subtitle = "Among Public Libraries in Illinois",
    x = "Geography Type",
    y = "Cardholder Proportion of Service Area Population"
  )
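One simple check that follows from the proportion plotted above is to list libraries whose value is implausible on its face: more cardholders than residents, or a proportion that is not finite because the service area population is zero or missing. The sketch below uses made-up values; the tibble `toy` and the threshold of 1 are assumptions for illustration. A proportion above 1 is not necessarily an error, only a case worth a closer look.

```r
library(dplyr)

# Hypothetical libraries, made up for illustration only
toy <- tibble(
  library_id       = c("A", "B", "C", "D"),
  service_area_pop = c(1000, 2000, 0, 500),
  cardholders      = c(400, 2600, 10, 50)
) %>%
  mutate(users_pop_proportion = cardholders / service_area_pop)

# Keep rows where the proportion is above 1 or is Inf/NaN/NA
flagged <- toy %>%
  filter(!is.finite(users_pop_proportion) | users_pop_proportion > 1)

print(flagged)
```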
The faceted scatterplots below show that public libraries with service areas based on a county subdivision (i.e., township) have a clear positive relationship between the number of cardholders and the size of the service area population. For those based on a “place” geography, the relationship is more varied, and there appear to be many outliers with a low proportion of cardholders to service area population, a fact that was not visible in 3c.
illinois_libraries %>%
  filter(cardholders < 1000000) %>% # remove extreme outlier (Chicago) to preserve scale
  ggplot(aes(x = service_area_pop, y = cardholders, color = geography_type)) +
  geom_point() +
  theme(legend.position = "none") + # remove unnecessary legend
  facet_wrap(~ geography_type) +
  labs(
    title = "Scatterplot of Cardholders and Service Area Population of Public Libraries",
    subtitle = "Faceted by Geography Type, Among Public Libraries in Illinois",
    x = "Service Area Population",
    y = "Cardholders"
  )
# Save the last plot as PNG file
# ggsave(here("results", "scatterplot.png"))
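The outliers that are visible in the scatterplots could also be listed programmatically. The sketch below applies the common 1.5 × IQR rule to the cardholder proportion within each geography type. This rule is swapped in here as one possible approach and is not the method used in this report; the tibble `toy` and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical cardholder proportions, made up for illustration only
toy <- tibble(
  geography_type = rep(c("place", "county subdivision"), each = 6),
  users_pop_proportion = c(0.40, 0.45, 0.50, 0.55, 0.60, 0.02,
                           0.30, 0.32, 0.34, 0.36, 0.38, 0.35)
)

flagged <- toy %>%
  group_by(geography_type) %>%
  mutate(
    q1  = quantile(users_pop_proportion, 0.25, names = FALSE),
    q3  = quantile(users_pop_proportion, 0.75, names = FALSE),
    iqr = q3 - q1,
    # Outlier: more than 1.5 IQRs below Q1 or above Q3 of its own group
    is_outlier = users_pop_proportion < q1 - 1.5 * iqr |
                 users_pop_proportion > q3 + 1.5 * iqr
  ) %>%
  ungroup()

flagged %>%
  filter(is_outlier) %>%
  print()
```

Computing the fences within each geography type, rather than across all libraries, keeps a low-proportion county library from being compared against place-based libraries that tend to have higher proportions.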