W7-proj1-Library-Services

Author

Project 1: Library Services

Introduction

The selected dataset records the services available at the 21 library branches in Montgomery County, MD, as of July 26th, 2023; I accessed it on October 16th, 2024. It has 105 variables per branch, including the services offered (e-readers, computers, literacy programs, makerspaces, etc.), nearby transportation, collections in different languages, and general information such as branch names, phone numbers and material inventories. I decided to focus on predicting the number of computers and visualizing the non-English collections.

document setup

library(tidyverse) #load in the tidyverse library

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

options(scipen = 999) #change scientific notation display settings

#set working directory to project folder
setwd("C:/Users/zivsa/Desktop/OneDrive - montgomerycollege.edu/school/DATA 110/W7 - Project 1")

#load original dataset into global environment
library_services <- read_csv("Library_Services_20241016.csv")

Rows: 21 Columns: 105
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (83): Branch, Webpage, Address, City, Phone, TTY Phone, Wi-Fi, Charging,...
dbl (21): Zip, Square Feet, Inventory, Sunday Open, Sunday Close, Monday Ope...
lgl  (1): Population Size

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

cleaning

#Preliminary cleaning to make the dataset usable:

#duplicate the data for manipulation
cleanlib <- library_services

#rename columns
names(cleanlib) <- tolower(names(cleanlib)) 
names(cleanlib) <- gsub(" ","_",names(cleanlib))
names(cleanlib) <- gsub("-","",names(cleanlib))

#remove population_size (entirely NA) and other columns not needed for this analysis
cleanlib <- cleanlib |> select(-c(population_size, phone, zip, tty_phone, bike_share_1, bike_share_2, webpage))

#turn all the y/n columns into boolean values for easier analysis
for (col in colnames(cleanlib)) { #for all columns
  if ("n" %in% cleanlib[[col]] || "y" %in% cleanlib[[col]]) { #if column contains y/n values
    cleanlib[[col]] <- cleanlib[[col]] != "n" #make them booleans: anything other than "n" becomes TRUE
  }
}

#continue cleaning, condensing and exploring - not all of this was used

#condense the 47 bus stop columns into 1
cleanlib <- cleanlib |> mutate(bus_stops = rowSums(across(bus_4:metrobus_z9)), .keep = "unused")

#set hours into 24-hour format & create variables for total operating hours
cleanlib <- cleanlib |> mutate(
  sunday_open = sunday_open + 12,
  sunday_close = sunday_close + 12,
  sunday_hours = sunday_close - sunday_open,
  monday_close = monday_close + 12,
  monday_hours = monday_close - monday_open,
  tuesday_open = recode(tuesday_open, "1"=13),
  tuesday_close = tuesday_close + 12,
  tuesday_hours = tuesday_close - tuesday_open,
  wednesday_close = wednesday_close + 12,
  wednesday_hours = wednesday_close - wednesday_open,
  thursday_open = recode(thursday_open, "1" = 13),
  thursday_close = thursday_close +12,
  thursday_hours = thursday_close - thursday_open,
  friday_close = friday_close + 12,
  friday_hours = friday_close - friday_open,
  saturday_close = saturday_close + 12,
  saturday_hours = saturday_close - saturday_open
) 

#total language collections
cleanlib <- cleanlib |> mutate(lang_chi= rowSums(across(contains("_child"))))
cleanlib <- cleanlib |> mutate(lang_ad = rowSums(across(contains("_adult"))))

Linear Regression: predicting number of computers in service

#linear regression equation: computers = 0.00044377(square ft) + 7.82718741 
fitcomp <- lm(computers ~ square_feet, data = cleanlib)
summary(fitcomp)


Call:
lm(formula = computers ~ square_feet, data = cleanlib)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2294 -3.1797 -0.9275  3.3006  9.5614 

Coefficients:
              Estimate Std. Error t value     Pr(>|t|)    
(Intercept) 7.82718741 1.84846362   4.234     0.000499 ***
square_feet 0.00044377 0.00005103   8.696 0.0000000732 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.215 on 18 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.8077,    Adjusted R-squared:  0.7971 
F-statistic: 75.63 on 1 and 18 DF,  p-value: 0.00000007318

#linear model plot > computers vs ft^2
cleanlib |> ggplot(aes(x=square_feet, y=computers)) +
  geom_point(aes(colour = ereader)) +
  geom_smooth(method='lm',formula=y~x) +
  labs( x = "library size (ft^2)", y = "amount of computers",
    title = "Library Size vs Available Computers", 
    caption = "dataMontgomery, Montgomery County, MD"
  )

Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

Analysis

After trying a few variables and eliminating those with high p-values, the equation I came up with for a linear regression line to predict the number of computers at a library is computers = 0.00044377(square ft) + 7.82718741. The p-value for the coefficient of the library’s size is very low (<.0000001), the plot shows a pretty tight fit line, and the adjusted R^2 is pretty high, explaining about 80% of the variance. However, because the sample size is so small (only 20 branches, after one was dropped for a missing value), I wouldn’t trust the equation as a great predictor.
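To make the equation concrete, here is a minimal sketch that plugs a few library sizes into it. The coefficients are copied from the summary(fitcomp) output above; the helper name predict_computers and the example sizes are my own illustration, not part of the original analysis.

```r
# Coefficients copied from the summary(fitcomp) output above
intercept <- 7.82718741
slope <- 0.00044377 # additional computers per square foot

# Hypothetical helper: predicted number of computers for a given size in ft^2
predict_computers <- function(square_feet) {
  intercept + slope * square_feet
}

# Example sizes chosen for illustration only
predict_computers(c(10000, 20000, 40000)) # roughly 12.3, 16.7 and 25.6
```

With the fitted model in memory, predict(fitcomp, newdata = data.frame(square_feet = 20000), interval = "prediction") should give the same point estimate along with a prediction interval, which is a more honest way to show the uncertainty from such a small sample.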

Visualising language collections.

selecting a few relevant columns

libviz <- cleanlib |>
  select(branch, inventory, contains("_adult")) |>
  mutate(branch = fct_reorder(branch, inventory)) |>
  pivot_longer(3:10, names_to = "collection") |>
  group_by(branch, inventory)

libviz$collection <- gsub("_adult", "", libviz$collection)

plotting them

libviz|>
ggplot(aes(x=collection, y=branch)) +
  geom_tile(color="black") + 
  geom_tile(aes(fill=collection, alpha=value)) +
  scale_fill_brewer(palette = "Spectral") + 
  theme_classic()+
  labs(y = "library branch, ordered by inventory size",
       x = "adult non-english collections",
       title = "Library World Language Collections", 
       caption = "dataMontgomery, Montgomery County, MD")

Warning: Using alpha for a discrete variable is not advised.

Essay

When I first imported the dataset, it was hard to work with. I first renamed the columns to be easier to work with in R. The next challenge was coercing the many columns that had string “y” and “n” values into Boolean values, so that I could summarize them or otherwise perform logical operations on them. This was difficult and it took a lot of trial and error to find a method that worked, but eventually I managed to create a loop that checked whether a column contained “y” or “n” values, and then replaced “n” with FALSE and anything that wasn’t an “n” with TRUE. I lost some data on the way, like the names of the metro stations, but I decided that wasn’t relevant to what I wanted to explore. Next I merged the columns for 47 different nearby bus stops into one column that just counted how many bus stops were nearby. Again I lost the specificity of which bus routes those were, but I didn’t need that for what I was trying to do.
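The y/n conversion and the bus stop count described above could also be written without a loop. This is only a minimal sketch on made-up data: the tibble toy, its column names and the helper is_yn are all hypothetical, not taken from the real file. Unlike the loop, it converts a column only when every value is “y” or “n”, so a column that mixes “n” with station names would be left as text.

```r
library(dplyr)

# Hypothetical toy data; column names are illustrative, not from the real file
toy <- tibble(
  branch = c("A", "B", "C"),
  wifi = c("y", "n", "y"),
  metro_station = c("n", "Silver Spring", "n"),
  bus_4 = c("y", "y", "n"),
  bus_5 = c("n", "y", "n")
)

# TRUE only when a character column holds nothing but "y", "n" or NA
is_yn <- function(x) is.character(x) && all(x %in% c("y", "n", NA))

converted <- toy |>
  # turn pure y/n columns into logicals ("y" becomes TRUE, "n" becomes FALSE)
  mutate(across(where(is_yn), \(x) x == "y")) |>
  # count the TRUE values across the bus columns, then drop those columns
  mutate(bus_stops = rowSums(across(bus_4:bus_5)), .keep = "unused")

converted
```

The trade-off is that a column with a stray value such as “Y” or “yes” would be skipped entirely, so it would be worth checking which columns were not converted.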

To create the visualization, I cleaned the dataset further by selecting the language collection columns and pivoting them into a longer format, with one row for each branch and collection. I created a tile map that shows which libraries have collections in different non-English languages. It was surprising to me that the Germantown library, which had the largest total inventory, had collections for only 2 languages, while many smaller libraries had 4 or more. I’m not surprised that Spanish is among the most common collections, since there are so many Spanish speakers here, but I am a little surprised that Chinese is up there with it. I wonder how much of that collection is in Mandarin, Cantonese or other Chinese languages.

I’m not satisfied with how the chart turned out. I originally wanted to show black tiles in place of blank tiles for libraries that didn’t have a particular collection, but I haven’t figured out how yet, and in trying I lost the rows for libraries that didn’t have any non-English collections, which would have been interesting to see. There are also a lot of variables I cleaned but didn’t explore deeply, like the many services, library hours and nearby transportation, and I would want to keep looking into them. Overall this dataset has a lot of interesting information that I feel I didn’t spend enough time truly dissecting.
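One possible way to get black tiles for missing collections is to set the fill to NA wherever a branch lacks a collection and give the fill scale a black na.value. This is an untested idea sketched on made-up data, not something I ran on the real dataset: the data frame toy_long, its branch names and the tile_fill column are all hypothetical.

```r
library(ggplot2)

# Hypothetical long-format data: one row per branch and collection
toy_long <- data.frame(
  branch = rep(c("Branch A", "Branch B"), each = 3),
  collection = rep(c("chinese", "french", "spanish"), times = 2),
  value = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Fill a tile by its collection only when the branch holds it; otherwise NA
toy_long$tile_fill <- ifelse(toy_long$value, toy_long$collection, NA)

p <- ggplot(toy_long, aes(x = collection, y = branch)) +
  geom_tile(aes(fill = tile_fill), color = "black") +
  # NA fills are drawn in black instead of being left blank
  scale_fill_brewer(palette = "Spectral", na.value = "black") +
  theme_classic()

p
```

Because every branch and collection pair keeps its row, a branch with no collections would appear as a full row of black tiles instead of disappearing. Mapping fill this way also avoids putting a discrete variable on alpha.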