Geographic Analysis of Public Libraries and User Activity

Author

M Madinko

Introduction: About the Dataset

This research project, titled “Geographic Analysis of Public Libraries and User Activity,” examines regional variations in library usage and resource availability across the United States. The dataset comes from the Public Libraries Survey 2023 (pub_libraries_instMuseum_pub_libs2023.csv). It contains 187 variables and 9,252 observations. For this project, I selected 12 primary variables: the categorical variables library name (libname), state (stabr), city (city), county (cnty), and ZIP code (zip); the quantitative variables total visits (visits), electronic book collections (ebook), print volumes (bkvol), staff salaries (salaries), and total circulation (totcir); and the spatial coordinates (latitude and longitud), retained to enable precise geographic mapping.

For data cleaning, I first converted the selected variable names to lowercase for consistency and ease of manipulation. Then I removed invalid observations, namely rows with zero or negative values for the number of visits and rows with negative values for print volumes or e-books, since these cannot be real counts. Finally, I filtered the dataset to a specific geographic subset: Maryland, Virginia, Washington, D.C., South Carolina, and Texas. I chose these locations because I currently live in the DMV area, and the other two states are regions my husband grew fond of during his military deployments.

The choice of this topic is motivated by my family’s plan to relocate in the future. As a parent, I am carefully considering factors that contribute to my children’s education and development. I view public libraries as essential indicators of a city’s commitment to education, accessibility, and community development. This analysis helps me evaluate which areas prioritize these vital resources for my family.

Load the Libraries and Upload the Dataset

library(tidyverse)
library(leaflet)
library(highcharter)
setwd("C:/Users/monik/OneDrive/Desktop/DATA 110")
libraries <- read_csv('pub_libraries_instMuseum_pub_libs2023.csv')
head(libraries) # show the first six lines of the dataset

# A tibble: 6 × 187
  STABR FSCSKEY LIBID    LIBNAME ADDRESS CITY    ZIP ZIP4  ADDRES_M CITY_M ZIP_M
  <chr> <chr>   <chr>    <chr>   <chr>   <chr> <dbl> <chr> <chr>    <chr>  <dbl>
1 AK    AK0001  AK0001-… ANCHOR… 34020 … ANCH… 99556 9150  P.O. BO… ANCHO… 99556
2 AK    AK0002  AK0002-… ANCHOR… 3600 D… ANCH… 99503 6093  3600 DE… ANCHO… 99503
3 AK    AK0003  AK0003-… ANDERS… 101 FI… ANDE… 99744 M     P.O. BO… ANDER… 99744
4 AK    AK0006  AK0006-… KUSKOK… 420 CH… BETH… 99559 M     P.O. BO… BETHEL 99559
5 AK    AK0007  AK0007-… BIG LA… 3140 S… WASI… 99623 9663  P.O. BO… BIG L… 99652
6 AK    AK0008  AK0008-… CANTWE… 1 SCHO… CANT… 99729 M     P.O. BO… CANTW… 99729
# ℹ 176 more variables: ZIP4_M <chr>, CNTY <chr>, PHONE <dbl>, C_RELATN <chr>,
#   C_LEGBAS <chr>, C_ADMIN <chr>, C_FSCS <chr>, GEOCODE <chr>, LSABOUND <chr>,
#   STARTDAT <chr>, ENDDATE <chr>, POPU_LSA <dbl>, F_POPLSA <chr>,
#   POPU_UND <dbl>, CENTLIB <dbl>, F_CENLIB <chr>, BRANLIB <dbl>,
#   F_BRLIB <chr>, BKMOB <dbl>, F_BKMOB <chr>, MASTER <dbl>, F_MASTER <chr>,
#   LIBRARIA <dbl>, F_LIBRAR <chr>, OTHPAID <dbl>, F_OTHSTF <chr>,
#   TOTSTAFF <dbl>, F_TOTSTF <chr>, LOCGVT <dbl>, F_LOCGVT <chr>, …

Data Cleaning

Focus on only 12 variables

My dataset has 187 variables

names(libraries)

  [1] "STABR"       "FSCSKEY"     "LIBID"       "LIBNAME"     "ADDRESS"    
  [6] "CITY"        "ZIP"         "ZIP4"        "ADDRES_M"    "CITY_M"     
 [11] "ZIP_M"       "ZIP4_M"      "CNTY"        "PHONE"       "C_RELATN"   
 [16] "C_LEGBAS"    "C_ADMIN"     "C_FSCS"      "GEOCODE"     "LSABOUND"   
 [21] "STARTDAT"    "ENDDATE"     "POPU_LSA"    "F_POPLSA"    "POPU_UND"   
 [26] "CENTLIB"     "F_CENLIB"    "BRANLIB"     "F_BRLIB"     "BKMOB"      
 [31] "F_BKMOB"     "MASTER"      "F_MASTER"    "LIBRARIA"    "F_LIBRAR"   
 [36] "OTHPAID"     "F_OTHSTF"    "TOTSTAFF"    "F_TOTSTF"    "LOCGVT"     
 [41] "F_LOCGVT"    "STGVT"       "F_STGVT"     "FEDGVT"      "F_FEDGVT"   
 [46] "OTHINCM"     "F_OTHINC"    "TOTINCM"     "F_TOTINC"    "SALARIES"   
 [51] "F_SALX"      "BENEFIT"     "F_BENX"      "STAFFEXP"    "F_TOSTFX"   
 [56] "PRMATEXP"    "F_PRMATX"    "ELMATEXP"    "F_ELMATX"    "OTHMATEX"   
 [61] "F_OTMATX"    "TOTEXPCO"    "F_TOCOLX"    "OTHOPEXP"    "F_OTHOPX"   
 [66] "TOTOPEXP"    "F_TOTOPX"    "LCAP_REV"    "F_LCAPRV"    "SCAP_REV"   
 [71] "F_SCAPRV"    "FCAP_REV"    "F_FCAPRV"    "OCAP_REV"    "F_OCAPRV"   
 [76] "CAP_REV"     "F_TCAPRV"    "CAPITAL"     "F_TCAPX"     "BKVOL"      
 [81] "F_BKVOL"     "EBOOK"       "F_EBOOK"     "AUDIO_PH"    "F_AUD_PH"   
 [86] "AUDIO_DL"    "F_AUD_DL"    "VIDEO_PH"    "F_VID_PH"    "VIDEO_DL"   
 [91] "F_VID_DL"    "TOTPHYS"     "F_TOTPHY"    "OTHPHYS"     "F_OTHPHY"   
 [96] "EC_LO_OT"    "F_EC_L_O"    "EC_ST"       "F_EC_ST"     "ELECCOLL"   
[101] "F_ELECOL"    "HRS_OPEN"    "F_HRS_OP"    "VISITS"      "F_VISITS"   
[106] "VISITRPT"    "REFERENC"    "F_REFER"     "REFERRPT"    "REGBOR"     
[111] "F_REGBOR"    "ODFINE"      "TOTCIR"      "F_TOTCIR"    "KIDCIRCL"   
[116] "F_KIDCIR"    "ELMATCIR"    "F_EMTCIR"    "PHYSCIR"     "F_PHYSCR"   
[121] "ELINFO"      "F_ELINFO"    "ELCONT"      "F_ELCONT"    "TOTCOLL"    
[126] "F_TOTCOL"    "OTHPHCIR"    "F_OTHPCR"    "LOANTO"      "F_LOANTO"   
[131] "LOANFM"      "F_LOANFM"    "TOTPRO"      "F_TOTPRO"    "K0_5PRO"    
[136] "K6_11PRO"    "YAPRO"       "F_YAPRO"     "ADULTPRO"    "GENPRO"     
[141] "ONPRO"       "OFFPRO"      "VIRPRO"      "TOTATTEN"    "F_TOTATT"   
[146] "K0_5ATTEN"   "K6_11ATTEN"  "YAATTEN"     "F_YAATT"     "ADULTATTEN" 
[151] "GENATTEN"    "ONATTEN"     "OFFATTEN"    "VIRATTEN"    "TOTPRES"    
[156] "TOTVIEWS"    "GPTERMS"     "F_GPTERM"    "PITUSR"      "F_PITUSR"   
[161] "PITUSRRPT"   "WIFISESS"    "F_WIFISS"    "WIFISRPT"    "WEBVISIT"   
[166] "YR_SUB"      "OBEREG"      "RSTATUS"     "STATSTRU"    "STATNAME"   
[171] "STATADDR"    "LONGITUD"    "LATITUDE"    "LSAGEOID"    "LSAGEORATIO"
[176] "LSAGEOTYPE"  "CNTYPOP"     "LOCALE_ADD"  "LOCALE_MOD"  "CENTRACT"   
[181] "CENBLOCK"    "CDCODE"      "CBSA"        "MICROF"      "GEOSTATUS"  
[186] "GEOSCORE"    "GEOMTYPE"

I start by selecting only the variables that will be used

libraries1 <- libraries |>
  select(STABR, CITY, ZIP, VISITS, EBOOK, LATITUDE, LONGITUD, BKVOL, LIBNAME, CNTY, SALARIES, TOTCIR)
names(libraries1) <- tolower(names(libraries1))
head(libraries1)

# A tibble: 6 × 12
  stabr city    zip visits ebook latitude longitud  bkvol libname cnty  salaries
  <chr> <chr> <dbl>  <dbl> <dbl>    <dbl>    <dbl>  <dbl> <chr>   <chr>    <dbl>
1 AK    ANCH… 99556   5032     0     59.8    -152.  19054 ANCHOR… KENA…       -9
2 AK    ANCH… 99503 503934 48543     61.2    -150. 401288 ANCHOR… ANCH…  3728731
3 AK    ANDE… 99744    421     0     64.3    -149.  15600 ANDERS… DENA…       -9
4 AK    BETH… 99559  22697 22604     60.8    -162.  35096 KUSKOK… BETH…   144833
5 AK    WASI… 99623  33623  3066     61.5    -150.  22217 BIG LA… MATA…   201339
6 AK    CANT… 99729    432 22604     63.4    -149.  11045 CANTWE… DENA…       -9
# ℹ 1 more variable: totcir <dbl>

Checking for NA values

colSums(is.na(libraries1))

   stabr     city      zip   visits    ebook latitude longitud    bkvol 
       0        0        0        0        0        0        0        0 
 libname     cnty salaries   totcir 
       0        0        0        0

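There are no NA values, but I noticed that the dataset has many zeros and some negative values, so I filtered out rows with zero or negative visits or with negative print-volume or e-book counts, and kept only the five states of interest.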

libraries2 <- libraries1 |>
  filter(visits > 0, bkvol >= 0, ebook >= 0) |>
  filter(stabr %in% c("MD", "VA", "DC", "SC", "TX"))

head(libraries2)

# A tibble: 6 × 12
  stabr city            zip visits  ebook latitude longitud  bkvol libname cnty 
  <chr> <chr>         <dbl>  <dbl>  <dbl>    <dbl>    <dbl>  <dbl> <chr>   <chr>
1 DC    WASHINGTON    20001 3.03e6 131121     38.9    -77.0 1.21e6 DISTRI… DIST…
2 MD    CUMBERLAND    21502 1.11e5  73818     39.7    -78.8 1.13e5 ALLEGA… ALLE…
3 MD    ANNAPOLIS     21401 1.47e6 173928     39.0    -76.6 5.33e5 ANNE A… ANNE…
4 MD    BALTIMORE     21201 9.94e5 137288     39.3    -76.6 2.06e6 ENOCH … BALT…
5 MD    TOWSON        21204 2.44e6 147343     39.4    -76.6 9.81e5 BALTIM… BALT…
6 MD    PRINCE FREDE… 20678 2.91e5  61037     38.6    -76.6 1.70e5 CALVER… CALV…
# ℹ 2 more variables: salaries <dbl>, totcir <dbl>
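
The filter above screens only visits, bkvol, and ebook, while the earlier preview shows values of -9 in the salaries column. A quick count of negative entries per column can show which variables still carry such values. The sketch below uses a small invented data frame (toy) instead of the survey data, and recoding negatives as NA is an alternative I am assuming here for illustration, not a step taken in this project.

```r
# Toy stand-in for the selected columns; all values are invented for illustration.
toy <- data.frame(
  visits   = c(5032, -1, 421, 22697),
  ebook    = c(0, 48543, -1, 22604),
  salaries = c(-9, 3728731, -9, 144833)
)

# Count the negative entries in each numeric column.
neg_counts <- colSums(toy < 0)
print(neg_counts)

# Assumed alternative: recode negatives as NA instead of dropping whole rows,
# so a row with a valid visit count is kept even if its salary is missing.
toy_na <- toy
toy_na[toy_na < 0] <- NA
print(colSums(is.na(toy_na)))
```

With NA in place of negative codes, functions such as mean() or cor() need na.rm = TRUE or use = "complete.obs" to skip the missing entries.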

Exploring both quantitative and categorical variables with simple plots

Quantitative Variable

options(scipen = 999)

ggplot(libraries2, aes(x = visits)) +
  geom_histogram(bins = 30, fill = "green") +
  theme_minimal() +
  coord_cartesian(xlim = c(0, 3500000)) +
  scale_x_continuous(breaks = seq(0, 3500000, by = 500000)) +
  labs(
    title = "Distribution of Library Visits Across U.S. Libraries",
    x = "Number of Visits",
    y = "Number of Libraries",
    caption = "Source: Public Libraries Dataset 2023"
  )

ggplot(libraries2, aes(x = ebook)) +
  geom_histogram(bins = 30, fill = "purple") +
  theme_minimal() +
  coord_cartesian(xlim = c(0, 3500000)) +
  scale_x_continuous(breaks = seq(0, 2500000, by = 500000)) +
  labs(
    title = "Distribution of eBook Usage Across U.S. Libraries",
    x = "eBooks",
    y = "Number of Libraries",
    caption = "Source: Public Libraries Dataset 2023"
  )

Categorical Variable

libraries2 |>
  count(stabr) |>
  ggplot(aes(x = reorder(stabr,n), y= n)) +
  geom_col(fill = "orange") +
  coord_flip() +
  theme_minimal() +
  ylim(0,600)+
  labs(
    title = "Libraries per State",
    x = "State",
    y = "Number of Libraries",
    caption = "Source: Public Libraries Dataset 2023"
  )

Scatterplot

library(RColorBrewer)

cols <- brewer.pal(5, "Set1")

hchart(
  libraries2,
  "scatter",
  hcaes(
    x = ebook,
    y = visits,
    group = stabr,
    libname = libname,
    stabr = stabr
  )
) |>

  hc_colors(cols) |>

  hc_plotOptions(
    scatter = list(
      marker = list(
        fillOpacity = 0.4
      )
    )
  ) |>

  hc_title(text = "Library eBook Usage vs Visits (U.S. Libraries)") |>
  hc_subtitle(text = "Transparency added for better readability") |>

  hc_xAxis(title = list(text = "Number of eBooks")) |>
  hc_yAxis(title = list(text = "Number of Visits")) |>

  hc_tooltip(
    pointFormat = "
      <b>{point.libname}</b><br>
      <b>State</b>: {point.stabr}<br>
      <b>eBooks</b>: {point.x}<br>
      <b>Visits</b>: {point.y}
    "
  
  )

Is there any correlation?

cor(libraries2$visits, libraries2$ebook)

[1] 0.192277
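
The value above is the default Pearson correlation, which measures linear association and can be pulled around by a few very large libraries. As a cross-check, a rank-based (Spearman) correlation or a correlation on log-transformed values is less sensitive to such outliers. The sketch below uses invented numbers (toy), not the survey data, and these alternative measures are my own suggestion, not part of the analysis above.

```r
# Hypothetical values, invented for illustration; not taken from the survey.
toy <- data.frame(
  ebook  = c(100, 500, 2000, 10000, 60000, 150000),
  visits = c(400, 300, 9000, 20000, 15000, 3000000)
)

# Pearson: linear association on the raw scale (what cor() computes by default).
pearson <- cor(toy$visits, toy$ebook)

# Spearman: association between ranks, robust to extreme values.
spearman <- cor(toy$visits, toy$ebook, method = "spearman")

# Pearson on the log10 scale; requires strictly positive values.
log_pearson <- cor(log10(toy$visits), log10(toy$ebook))

print(c(pearson = pearson, spearman = spearman, log_pearson = log_pearson))
```

On the real data, the log version would need libraries with zero e-books to be excluded or handled first, since log10(0) is -Inf.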

Sizing by ebook

cols <- brewer.pal(5, "Set1")

hchart(
  libraries2,
  "scatter",
  hcaes(
    x = ebook,
    y = visits,
    group = stabr,
    size = ebook
  )
) |>

  hc_colors(cols) |>

  hc_plotOptions(
    scatter = list(
      marker = list(
        fillOpacity = 0.4
      )
    )
  ) |>

  hc_title(text = "Library eBook Usage vs Visits (U.S. Libraries)") |>
  hc_subtitle(text = "Transparency added for better readability") |>

  hc_xAxis(title = list(text = "Number of eBooks")) |>
  hc_yAxis(title = list(text = "Number of Visits")) |>

  hc_tooltip(
    pointFormat = "
      <b>{point.libname}</b><br>
      <b>State</b>: {point.stabr}<br>
      <b>eBooks</b>: {point.x}<br>
      <b>Visits</b>: {point.y}
    "
  
  )
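
The scatter plot groups points by state, but differences between states can be hard to judge by eye. A per-state summary table gives numbers to compare. The sketch below is a minimal example on invented data (toy) using base R; the ebooks_per_visit ratio is a hypothetical measure I introduce only for illustration.

```r
# Toy data, invented for illustration; not taken from the survey.
toy <- data.frame(
  stabr  = c("MD", "MD", "VA", "VA", "TX"),
  visits = c(1000, 3000, 500, 1500, 2000),
  ebook  = c(200, 600, 1000, 3000, 8000)
)

# Sum visits and e-books within each state.
state_totals <- aggregate(cbind(visits, ebook) ~ stabr, data = toy, FUN = sum)

# Hypothetical comparison measure: e-books held per physical visit.
state_totals$ebooks_per_visit <- state_totals$ebook / state_totals$visits

# Sort so the most e-book-heavy state comes first.
state_totals <- state_totals[order(-state_totals$ebooks_per_visit), ]
print(state_totals)
```

Applied to libraries2, the same steps would show whether the state differences seen in the plot hold up once all libraries in a state are added together.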

Map Using Cluster

leaflet(data = libraries2) |>
  addTiles() |>
  addProviderTiles("Esri.WorldPhysical") |>
  addControl(html = "<h3 style='margin:10px; color: #2c3e50;'>Geographic Distribution of Public Libraries</h3>", 
             position = "topright") |>


  addCircleMarkers(
    lng = ~longitud,
    lat = ~latitude,

    radius = ~sqrt(ebook) / 50,
    color =  "blue",
    fillColor = "purple",
    fillOpacity = 0.5,

    clusterOptions = markerClusterOptions(),

    popup = ~paste0(
      "<b>Library: </b>", libname, "<br><br>",

      "<b>County: </b>", cnty, "<br>",
      "<b>State: </b>", stabr, "<br>",
      "<b>ZIP Code: </b>", zip, "<br>",
      "<b>City: </b>", city, "<br>",

      "<b>eBooks: </b>", ebook, "<br>",
      "<b>Visits: </b>", visits
    )
  )
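
The marker radius in the map is sqrt(ebook) / 50, measured in pixels. The square root keeps the largest collections from producing oversized circles, but a library with zero e-books gets a radius of 0 and does not appear once the clusters are expanded. The sketch below works through the scaling on made-up counts; the minimum radius of 2 is an assumed tweak, not something used in the map above.

```r
# Hypothetical e-book counts, chosen so the square roots are easy to check.
ebook <- c(0, 2500, 10000, 1000000)

# Same scaling as the map: the square root compresses very large collections.
radius <- sqrt(ebook) / 50
print(radius)

# Assumed tweak (not used in the map above): enforce a minimum radius of
# 2 pixels so that libraries reporting zero e-books remain visible.
radius_min <- pmax(radius, 2)
print(radius_min)
```

In the leaflet call this would correspond to radius = ~pmax(sqrt(ebook) / 50, 2).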

Essay

In my analysis, I studied both quantitative and categorical variables to produce different visualizations and decide what to focus on for my final visualization. First, I created two histograms for the variables e-books and visits. Both distributions show a strong concentration around moderate values. In other words, most libraries experience a moderate level of activity, while only a small proportion handle very high or very low traffic. Next, I analyzed the categorical variables using a bar chart, which allowed me to compare the number of libraries across states. The results show that some states have significantly more libraries than others, which may be influenced by demographic, economic, or urban development factors.

Additionally, I created a scatter plot to examine the correlation between e-books and visits. The correlation coefficient is 0.192, which indicates a very weak positive relationship: physical visits tend to increase only slightly as e-book usage increases. This result was surprising, as I initially expected an inverse relationship, assuming that higher e-book usage would reduce physical visits. One possible explanation is that e-book users do not replace physical visits with e-books but rather use e-books as a complement.

Finally, in a scatter plot where the size of the points represents e-book usage, I observed differences between states. For example, in Virginia and Texas, e-book usage is higher relative to physical visits. In contrast, in South Carolina and Maryland, physical visits are more dominant than e-book usage. This suggests that user preferences vary across states. If the goal is to focus on physical library activities and community programs, states such as South Carolina or Maryland would be more suitable. However, for a more digital-focused model, states like Virginia or Texas would be preferable. In my case, I would choose to focus on physical visits, as these libraries offer reading programs for children and many other community activities.

The map shows the geographic distribution of public libraries in the selected states, where each point corresponds to a library. The size of each circle reflects the number of e-books available, with larger circles indicating larger e-book collections and smaller circles indicating smaller ones. The clustering of markers highlights areas with a high concentration of libraries, which are often located in more populated or urban regions.