Geographic Analysis of Public Libraries and User Activity
Author
M Madinko
Introduction: About the Dataset
This research project, titled “Geographic Analysis of Public Libraries and User Activity,” examines regional variations in library usage and resource availability across the United States. The dataset comes from the Public Libraries Survey 2023 (pub_libraries_instMuseum_pub_libs2023.csv) and contains 185 variables and 9,252 observations. For this project, I selected 12 primary variables. The categorical variables are library name (libname), state (stabr), city (city), county (cnty), and ZIP code (zip). The quantitative variables are total visits (visits), electronic book collections (ebook), print volumes (bkvol), staff expenditures (salaries), and total circulation (totcir). I also retained the spatial coordinates (latitude and longitud) to enable precise geographic mapping.
For data cleaning, I first converted the selected variable names to lowercase for consistency and ease of manipulation. Then, I removed invalid observations, such as negative values for the number of visits, which I treated as data entry errors. Finally, I filtered the dataset to a specific geographic subset: Maryland, Virginia, Washington D.C., South Carolina, and Texas. I chose these locations because I currently live in the DMV area, and the other two states are regions my husband grew fond of during his military deployments.
This topic is motivated by my family’s plan to relocate in the future. As a parent, I am carefully considering the factors that contribute to my children’s education and development. I view public libraries as essential indicators of a city’s commitment to education, accessibility, and community development. This analysis helps me evaluate which areas prioritize these vital resources.
Load the Libraries and Upload the Dataset
library(tidyverse)
library(leaflet)
library(highcharter)
setwd("C:/Users/monik/OneDrive/Desktop/DATA 110")
libraries <- read_csv('pub_libraries_instMuseum_pub_libs2023.csv')
head(libraries) # show the first six rows of the dataset
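The plots below use a data frame called libraries2, but the code that creates it is not shown in this report. The cleaning steps described in the introduction could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name clean_libraries is hypothetical, the raw column names are assumed to be uppercase versions of the names listed in the introduction, the states are assumed to be stored as two-letter codes, and the toy rows are made up so the example runs without the CSV file.

```r
library(dplyr)

# Hypothetical helper: the report's actual cleaning code is not shown.
clean_libraries <- function(raw) {
  # Assumption: stabr holds two-letter state codes
  keep_states <- c("MD", "VA", "DC", "SC", "TX")
  raw |>
    rename_with(tolower) |>   # lowercase all column names
    select(libname, stabr, city, cnty, zip, visits, ebook,
           bkvol, salaries, totcir, latitude, longitud) |>
    filter(visits >= 0,       # drop negative visit counts
           stabr %in% keep_states)
}

# Made-up rows standing in for the raw survey file
toy <- data.frame(
  FSCSKEY  = c("X1", "X2", "X3", "X4"),  # extra column, dropped by select()
  LIBNAME  = c("Library A", "Library B", "Library C", "Library D"),
  STABR    = c("MD", "TX", "CA", "VA"),
  CITY     = c("Rockville", "Austin", "Fresno", "Richmond"),
  CNTY     = c("Montgomery", "Travis", "Fresno", "Richmond City"),
  ZIP      = c("20850", "78701", "93721", "23219"),
  VISITS   = c(1200, -1, 500, 80000),
  EBOOK    = c(3000, 15000, 2000, 40000),
  BKVOL    = c(10000, 50000, 8000, 120000),
  SALARIES = c(90000, 400000, 60000, 900000),
  TOTCIR   = c(5000, 70000, 3000, 200000),
  LATITUDE = c(39.08, 30.27, 36.74, 37.54),
  LONGITUD = c(-77.15, -97.74, -119.79, -77.43)
)

toy_clean <- clean_libraries(toy)
toy_clean
```

With the real data, the same helper would be applied as libraries2 <- clean_libraries(libraries), assuming the column names match.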
Exploring Both Quantitative and Categorical Variables with Simple Plots
Quantitative Variable
options(scipen = 999)
ggplot(libraries2, aes(x = visits)) +
  geom_histogram(bins = 30, fill = "green") +
  theme_minimal() +
  coord_cartesian(xlim = c(0, 3500000)) +
  scale_x_continuous(breaks = seq(0, 3500000, by = 500000)) +
  labs(
    title = "Distribution of Library Visits Across U.S. Libraries",
    x = "Number of Visits",
    y = "Number of Libraries",
    caption = "Source: Public Libraries Dataset 2023"
  )
ggplot(libraries2, aes(x = ebook)) +
  geom_histogram(bins = 30, fill = "purple") +
  theme_minimal() +
  coord_cartesian(xlim = c(0, 3500000)) +
  scale_x_continuous(breaks = seq(0, 2500000, by = 500000)) +
  labs(
    title = "Distribution of eBook Usage Across U.S. Libraries",
    x = "eBooks",
    y = "Number of Libraries",
    caption = "Source: Public Libraries Dataset 2023"
  )
Qualitative Variable
libraries2 |>
  count(stabr) |>
  ggplot(aes(x = reorder(stabr, n), y = n)) +
  geom_col(fill = "orange") +
  coord_flip() +
  theme_minimal() +
  ylim(0, 600) +
  labs(
    title = "Libraries per State",
    x = "State",
    y = "Number of Libraries",
    caption = "Source: Public Libraries Dataset 2023"
  )
leaflet(data = libraries2) |>
  addTiles() |>
  addProviderTiles("Esri.WorldPhysical") |>
  addControl(
    html = "<h3 style='margin:10px; color: #2c3e50;'>Geographic Distribution of Public Libraries</h3>",
    position = "topright"
  ) |>
  addCircleMarkers(
    lng = ~longitud,
    lat = ~latitude,
    radius = ~sqrt(ebook) / 50,
    color = "blue",
    fillColor = "purple",
    fillOpacity = 0.5,
    clusterOptions = markerClusterOptions(),
    popup = ~paste0(
      "<b>Library: </b>", libname, "<br><br>",
      "<b>County: </b>", cnty, "<br>",
      "<b>State: </b>", stabr, "<br>",
      "<b>ZIP Code: </b>", zip, "<br>",
      "<b>City: </b>", city, "<br>",
      "<b>eBooks: </b>", ebook, "<br>",
      "<b>Visits: </b>", visits
    )
  )
Essay
In my analysis, I studied both quantitative and categorical variables to produce different visualizations and decide what to focus on for my final visualization. First, I created two histograms, one for e-books and one for visits. Both distributions show a strong concentration around moderate values. In other words, most libraries experience a moderate level of activity, while only a small proportion handle very high or very low traffic.
Next, I analyzed the categorical variables using a bar chart, which allowed me to compare the number of libraries across states. The results show that some states have significantly more libraries than others, which may be influenced by demographic, economic, or urban development factors.
Additionally, I created a scatter plot to examine the correlation between e-books and visits. The correlation coefficient is 0.192, which indicates a very weak positive relationship: as e-book usage increases, physical visits also tend to increase slightly. This result was surprising, as I initially expected an inverse relationship, assuming that higher e-book usage would reduce physical visits. One possible explanation is that e-books do not replace physical visits for their users but rather complement them.
Finally, in a scatter plot where the size of the points represents e-book usage, I observed differences between states. For example, in Virginia and Texas, e-book usage is high relative to physical visits. In contrast, in South Carolina and Maryland, physical visits are more dominant than e-book usage. This suggests that user preferences vary across states. If the goal is to focus on physical library activities and community programs, states such as South Carolina or Maryland would be more suitable. However, for a more digital-focused model, states like Virginia or Texas would be preferable.
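The scatter plot and the correlation between e-books and visits mentioned above are not shown as code in this report. A minimal sketch of how they might be computed is given here. It is an assumption about the approach, not the original code: the toy values are made up, so the coefficient will not match the 0.192 reported for the real data, and Pearson correlation via cor() is assumed.

```r
library(ggplot2)

# Made-up values standing in for libraries2 (ebook and visits per library)
toy <- data.frame(
  stabr  = c("MD", "VA", "TX", "SC", "TX"),
  ebook  = c(1000, 5000, 20000, 40000, 80000),
  visits = c(3000, 2500, 60000, 30000, 90000)
)

# Pearson correlation, ignoring rows with missing values
r <- cor(toy$ebook, toy$visits, use = "complete.obs")

# Scatter plot of e-books against visits, colored by state
p <- ggplot(toy, aes(x = ebook, y = visits, color = stabr)) +
  geom_point(size = 3, alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "eBooks vs. Visits (toy data)",
    x = "eBooks",
    y = "Number of Visits",
    color = "State"
  )

r
p
```

With the real data, toy would be replaced by libraries2. A coefficient close to 0 means a weak linear relationship, and its sign gives the direction.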
In my case, I would choose to focus on physical visits, as these libraries offer reading programs for children and many other community activities.
The map represents the geographic distribution of public libraries across the United States, where each point corresponds to a library. The size of each circle reflects the number of e-books available: larger circles indicate more e-books and smaller circles indicate fewer. The clustering of markers highlights areas with a high concentration of libraries, which are often located in more populated or urban regions.
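To make the circle sizes concrete, the map code sets the marker radius to sqrt(ebook) / 50. The short example below applies that same formula to a few illustrative e-book counts (the counts are made up for the example, not taken from the dataset).

```r
# Illustrative e-book counts (not from the dataset)
ebook_counts <- c(2500, 10000, 250000, 1000000)

# Same formula as the radius argument in the map code
radius <- sqrt(ebook_counts) / 50

data.frame(ebook = ebook_counts, radius = radius)
```

Because the radius grows with the square root of the count, the area of each circle is proportional to the number of e-books, so a library with 100 times more e-books gets a circle with 10 times the radius rather than 100 times.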