Public libraries are an important part of a community; they provide peaceful, quiet, and free spaces to study, with access to books and computers, and often offer community volunteer or support programs.
For my project, I will use the “2014 Public Libraries Survey.” The data was originally collected by the Institute of Museum and Library Services (IMLS), an independent federal agency that conducts an annual census of library services. It is accessible via Kaggle Datasets, where it was uploaded by Abigail Larion.
The dataset contains a mix of categorical and quantitative variables. I selected the specific variables necessary for mapping: categorical variables including library_name, city, state, and street_address to identify locations, and quantitative variables such as latitude and longitude to plot them spatially. I also kept submission_year to ensure the data timeframe is accurate.
I chose this dataset because I spend quite a lot of my weekend time in the library, so I thought it might be nice to learn more about libraries in general, not just the ones I have access to here in Montgomery County (MoCo).
The original dataset contains over seventy columns, a lot of which I don’t need for this project. To make it easier, I selected only the variables that are necessary for visualizing library locations: state, library_id, submission_year, library_name, street_address, city, zip_code, longitude, latitude, and state_code.
In terms of cleaning, I noticed that a library in Whitehall, Wisconsin, had coordinates that placed it near West Africa, far outside the US. To fix this and other potential errors, I made every longitude negative, dropped rows with missing coordinates, and kept only rows whose latitude and longitude fall within rough bounding boxes for the contiguous United States, Alaska, and Hawaii.
library(tidyverse)
library(leaflet)
library(janitor)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 9242 Columns: 74
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): State, Library ID, Library Name, Street Address, City, County, Pub...
dbl (62): Submission Year, Zip Code, Longitude, Latitude, State Code, County...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
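The column summary above lists the original headers (Library ID, Street Address, and so on), while the rest of the code uses snake_case names such as library_id. The call that created libraries_raw is not shown in this report, so the following is only a sketch of one way it could look, assuming readr's read_csv() followed by janitor's clean_names(). The file here is a tiny temporary stand-in with made-up values, not the Kaggle CSV, and its path is hypothetical.

```r
library(readr)
library(janitor)

# Hypothetical stand-in for the Kaggle CSV: two made-up rows with
# headers in the same "Title Case With Spaces" style as the real file.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "State,Library ID,Submission Year,Library Name,Zip Code,Longitude,Latitude",
    "XX,XX0001,2014,Example Library A,12345,-90.5,40.1",
    "YY,YY0002,2014,Example Library B,67890,-120.2,35.7"
  ),
  csv_path
)

# Read the file, then convert the headers to snake_case
# ("Library ID" becomes library_id, and so on).
libraries_raw <- read_csv(csv_path, show_col_types = FALSE) |>
  clean_names()

names(libraries_raw)
```

With the real file, the path passed to read_csv() would point at the downloaded CSV instead of a temporary file.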
Cleaning the Data
libraries_clean <- libraries_raw |>
  select(
    state, library_id, submission_year, library_name, street_address,
    city, zip_code, longitude, latitude, state_code
  ) |>
  # Fix longitude to be negative for US coordinates; this fixes the issue of the dot appearing near West Africa
  mutate(longitude = -abs(longitude)) |>
  # Remove rows with missing coordinates
  filter(!is.na(latitude) & !is.na(longitude)) |>
  filter(
    (between(latitude, 24, 50) & between(longitude, -125, -66)) |
      (between(latitude, 51, 72) & between(longitude, -170, -129)) |
      (between(latitude, 18, 23) & between(longitude, -161, -154))
  )
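To see what the bounding-box filter keeps and drops, it can help to run it on a few hand-picked points. The sketch below wraps the same three boxes in a helper called in_us_bounds(), which is a hypothetical name introduced only for this illustration. The test points are made up: three plausible US locations and one at (0, 0), a common placeholder for missing coordinates that plots in the ocean off West Africa. The actual bad values in the Whitehall row are not shown in this report, so this is not a reconstruction of that row.

```r
library(dplyr)

# Hypothetical helper wrapping the same three bounding boxes used above.
in_us_bounds <- function(latitude, longitude) {
  (between(latitude, 24, 50) & between(longitude, -125, -66)) |    # contiguous US
    (between(latitude, 51, 72) & between(longitude, -170, -129)) | # Alaska
    (between(latitude, 18, 23) & between(longitude, -161, -154))   # Hawaii
}

# Made-up test points; none of these come from the dataset.
test_points <- tibble(
  label = c("Wisconsin", "Alaska", "Hawaii", "Zero coordinates"),
  latitude = c(44.4, 61.2, 21.3, 0),
  longitude = c(-91.3, -149.9, -157.9, 0)
)

# Flag which points the filter would keep.
checked <- test_points |>
  mutate(kept = in_us_bounds(latitude, longitude))

checked
```

The first three points fall inside one of the boxes and the (0, 0) point does not, so a filter on kept would drop only the last row.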
# A tibble: 51 × 2
state num_libraries
<chr> <int>
1 NY 750
2 IL 620
3 TX 545
4 IA 537
5 PA 453
6 MI 388
7 WI 375
8 MA 362
9 KS 322
10 NJ 296
# ℹ 41 more rows
Exploring the Data
I grouped the dataset by state and used a summary function to count how many libraries appear in each state. This shows which states have more public libraries and which have fewer. The plot below displays the number of public libraries in each state based on the cleaned dataset.
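The code that builds state_counts is not shown in this report, but the printed table has a state column and a num_libraries count sorted from highest to lowest. A minimal sketch of that step, assuming a dplyr group_by() and summarise() pipeline and using a small made-up stand-in for libraries_clean:

```r
library(dplyr)

# Made-up stand-in for libraries_clean; only the state column matters here.
libraries_clean <- tibble(
  state = c("NY", "NY", "NY", "IL", "IL", "TX"),
  library_name = paste("Example Library", 1:6)
)

# Count rows (libraries) per state, largest first.
state_counts <- libraries_clean |>
  group_by(state) |>
  summarise(num_libraries = n()) |>
  arrange(desc(num_libraries))

state_counts
```

A shorter alternative with the same result would be count(state, name = "num_libraries", sort = TRUE).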
ggplot(state_counts, aes(x = reorder(state, num_libraries), y = num_libraries)) +
  geom_col(fill = "skyblue") +
  coord_flip() +
  labs(
    title = "Number of Public Libraries by State (2014)",
    x = "State",
    y = "Number of Libraries"
  )
Final Visualization
For the final visualization, I grouped states into three categories based on how many public libraries they have: High (300 or more), Medium (150 to 299), and Low (fewer than 150). This makes it easier to compare states without reading exact numbers. The chart uses color to represent these groups, and a text label marks the state with the most libraries. Grouping gives a clearer comparison by showing broader patterns rather than individual states alone.
state_counts_grouped <- state_counts |>
  mutate(library_group = case_when(
    num_libraries >= 300 ~ "High",
    num_libraries >= 150 ~ "Medium",
    TRUE ~ "Low"
  ))

top_state <- state_counts_grouped |>
  slice_max(num_libraries, n = 1)

ggplot(state_counts_grouped, aes(x = reorder(state, num_libraries), y = num_libraries, fill = library_group)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Public Libraries by State in the United States (2014)",
    subtitle = "States grouped into high, medium, and low library counts",
    x = "State",
    y = "Number of Public Libraries",
    fill = "Library Count Group",
    caption = "Source: Institute of Museum and Library Services, Public Libraries Survey (2014)"
  ) +
  theme_minimal() +
  annotate(
    "text",
    x = top_state$state,
    y = top_state$num_libraries + 30,
    label = paste("Highest:", top_state$state),
    size = 3
  )
Interactive Map
In addition to the bar charts, I created an interactive map showing the exact locations of public libraries across the United States. Each library is shown as a point placed using its latitude and longitude. Clicking a point opens a popup with the library’s name, city, state, and ZIP code.
The map is helpful because it shows how library locations are spaced. It makes patterns easy to see, such as dense clusters in major cities and wide spacing in rural states. It also includes the libraries of both Alaska and Hawaii.
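The map code is not included in this report. As a rough sketch of how such a map could be built with leaflet, assuming circle markers and an HTML popup (the marker styling, the two sample rows, and the popup layout are illustrative assumptions, not the project's actual settings):

```r
library(leaflet)

# Two made-up rows with the same column names as libraries_clean.
libraries_sample <- data.frame(
  library_name = c("Example Library A", "Example Library B"),
  city = c("Anytown", "Otherville"),
  state = c("MD", "NY"),
  zip_code = c(20850, 10001),
  longitude = c(-77.15, -73.99),
  latitude = c(39.08, 40.75)
)

library_map <- leaflet(libraries_sample) |>
  addTiles() |> # default OpenStreetMap basemap
  addCircleMarkers(
    lng = ~longitude,
    lat = ~latitude,
    radius = 3,
    stroke = FALSE,
    fillOpacity = 0.6,
    # Popup shown on click: name, then city, state, and ZIP code.
    popup = ~paste0(
      "<b>", library_name, "</b><br>",
      city, ", ", state, " ", zip_code
    )
  )

library_map
```

Because zip_code was read as a number, ZIP codes that begin with zero would lose that digit in the popup; padding with something like sprintf("%05d", zip_code) is one possible way to restore it.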
The visualizations in this project helped me understand how public libraries are distributed across the country. The grouped bar chart showed clear differences between states, and the map revealed which areas have dense library networks (such as New York) compared to those with wider, more rural spacing (such as Nevada). One challenge in this project was dealing with incorrect, out-of-range coordinates in the dataset, which caused a library to appear near West Africa. Filtering by US latitude and longitude ranges fixed the issue and produced a clean map.
Bibliography
Institute of Museum and Library Services (IMLS). “2014 Public Libraries Survey”.
Kaggle - Access point for downloading the CSV file.
R packages - tidyverse, leaflet, janitor.
Gemini AI - helped me add the image at the start of my QMD, since I’d forgotten how.
ChatGPT - helped me fix the point appearing near West Africa. It suggested checking that my longitude values were all negative and that all my latitudes and longitudes were valid US coordinates.