The dataset comes from NYC Open Data and contains information about public library branch services in Manhattan. The dataset includes details on:
Circulation Data (number of checkouts across different age groups)
Program Attendance (number of attendees across various programs)
Reference Transactions (number of librarian-assisted reference questions)
Weekly Hours of Public Service (total hours the library is open per week)
Branch Information (name and network association of the library)
Research Questions:
1. How do circulation numbers vary across different library branches?
2. Do branches with longer public service hours have higher circulation numbers?
3. Which branches have the highest and lowest circulation?
Key Variables:
Branch - The name of the library location.
Weekly Hours of Public Service - The number of hours the library is open per week.
CIRCULATION Adult - Number of books checked out by adults.
CIRCULATION Young Adult - Number of books checked out by young adults.
CIRCULATION Juvenile - Number of books checked out by juveniles.
CIRCULATION - Total circulation across all age groups.
TOTAL Attendance - Total attendance across all programs.
REFERENCE TRANSACTIONS - Total number of reference transactions at each branch.
Analysis Methods:
1. Descriptive Statistics - Summary of circulation and hours of public service across branches.
2. Visualizations - Bar charts and scatter plots to explore relationships between variables.
3. Pivoting Data - Reshaping data to explore different perspectives.
4. MLE Analysis - Using Maximum Likelihood Estimation (MLE) to model circulation data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the dataset
nypl_data <- read_csv("~/Documents/DATA 712/NYPL_Branch_Services_Manhattan.csv", show_col_types = FALSE)
# View dataset structure
glimpse(nypl_data)
## Rows: 43
## Columns: 22
## $ `Boro/Central Library` <chr> "Manhattan", "Manhattan", "Manhat…
## $ Network <chr> "Countee Cullen Network", "Counte…
## $ Branch <chr> "67th Street", "96th Street", "11…
## $ `ADULT Program` <dbl> 146, 154, 53, 116, 29, 107, 199, …
## $ `ADULT Attendance` <dbl> 1298, 2466, 497, 1008, 524, 1241,…
## $ `YOUNG ADULT Program` <dbl> 80, 90, 75, 46, 28, 105, 159, 124…
## $ `YOUNG ADULT Attendance` <dbl> 884, 1710, 1036, 345, 450, 1245, …
## $ `JUVENILE Program` <dbl> 253, 276, 142, 142, 364, 340, 254…
## $ `JUVENILE Attendance` <dbl> 9004, 8470, 5382, 1311, 8143, 106…
## $ `OUTREACH SERVICES Program` <dbl> 71, 28, 12, 67, 247, 21, 20, 158,…
## $ `OUTREACH SERVICES Attendance` <dbl> 834, 706, 708, 1015, 3343, 327, 5…
## $ `TOTAL Program` <dbl> 550, 548, 282, 371, 668, 573, 632…
## $ `TOTAL Attendance` <dbl> 12020, 13352, 7623, 3679, 12460, …
## $ `REFERENCE TRANSACTIONS Adult` <dbl> 66209, 71448, 14924, 23400, 34502…
## $ `REFERENCE TRANSACTIONS Young Adult` <dbl> 3497, 19591, 3471, 8632, 9711, 75…
## $ `REFERENCE TRANSACTIONS Juvenile` <dbl> 5343, 9503, 6903, 11479, 21281, 6…
## $ `REFERENCE TRANSACTIONS` <dbl> 75049, 100542, 25298, 43511, 6549…
## $ `CIRCULATION Adult` <dbl> 173087, 228111, 77833, 45516, 922…
## $ `CIRCULATION Young Adult` <dbl> 20149, 39402, 16783, 14352, 26945…
## $ `CIRCULATION Juvenile` <dbl> 112878, 105718, 65517, 20366, 894…
## $ CIRCULATION <dbl> 306114, 373231, 160133, 80234, 20…
## $ `Weekly Hours of Public Service` <dbl> 42, 50, 42, 42, 44, 44, 54, 44, 4…
# Pivot circulation data to long format
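nypl_long <- nypl_data %>%
  pivot_longer(
    # The trailing space matches the three age-group columns but not the
    # CIRCULATION total, which would otherwise be double-counted when summed
    cols = starts_with("CIRCULATION "),
    names_to = "Circulation_Type",
    values_to = "Circulation_Count"
  ) %>%
  mutate(Circulation_Count = replace_na(Circulation_Count, 0)) # Replace NA values with 0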
# Convert back to wide format
nypl_wide <- nypl_long %>%
  pivot_wider(
    names_from = "Circulation_Type",
    values_from = "Circulation_Count"
  )
# View transformed data
head(nypl_wide)
## # A tibble: 6 × 22
## `Boro/Central Library` Network Branch `ADULT Program` `ADULT Attendance`
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Manhattan Countee Cull… 67th … 146 1298
## 2 Manhattan Countee Cull… 96th … 154 2466
## 3 Manhattan Countee Cull… 115th… 53 497
## 4 Manhattan Countee Cull… 125th… 116 1008
## 5 Manhattan Countee Cull… Aguil… 29 524
## 6 Manhattan Countee Cull… Bloom… 107 1241
## # ℹ 17 more variables: `YOUNG ADULT Program` <dbl>,
## # `YOUNG ADULT Attendance` <dbl>, `JUVENILE Program` <dbl>,
## # `JUVENILE Attendance` <dbl>, `OUTREACH SERVICES Program` <dbl>,
## # `OUTREACH SERVICES Attendance` <dbl>, `TOTAL Program` <dbl>,
## # `TOTAL Attendance` <dbl>, `REFERENCE TRANSACTIONS Adult` <dbl>,
## # `REFERENCE TRANSACTIONS Young Adult` <dbl>,
## # `REFERENCE TRANSACTIONS Juvenile` <dbl>, `REFERENCE TRANSACTIONS` <dbl>, …
# Get top 10 branches by total circulation
top_branches <- nypl_long %>%
  group_by(Branch) %>%
  summarize(Total_Circulation = sum(Circulation_Count, na.rm = TRUE)) %>%
  arrange(desc(Total_Circulation)) %>%
  slice_head(n = 10) %>%
  pull(Branch)
# Filter dataset for only top 10 branches
nypl_top <- nypl_long %>%
  filter(Branch %in% top_branches)
library(scales) # For formatting scales
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
# Calculate total circulation for each branch
nypl_totals <- nypl_top %>%
  group_by(Branch) %>%
  summarize(Total_Circulation = sum(Circulation_Count))
# Order branches by summed circulation (fct_reorder() defaults to the median)
ggplot(nypl_top, aes(x = fct_reorder(Branch, Circulation_Count, .fun = sum),
                     y = Circulation_Count, fill = Circulation_Type)) +
  geom_col() + # Stacked bars of the pre-computed counts
  labs(title = "Top 10 Library Branches by Circulation",
       x = "Library Branch", y = "Circulation Count") +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal()
In this part of the assignment, I worked on reshaping the NYPL Branch Services - Manhattan dataset using tidyverse functions to make it easier to analyze circulation trends across different library branches.
When I first looked at the dataset, I noticed that circulation data was spread across multiple columns (Adult, Young Adult, Juvenile, and Total), which made it difficult to compare circulation types directly. To solve this, I used pivot_longer() to gather the circulation columns into a single value column (Circulation_Count), with a new column (Circulation_Type) recording which category each value belongs to. In this long format each row holds one circulation count for one branch, which is the shape ggplot2 needs in order to map circulation type to fill color.
Next, I applied pivot_wider(), which essentially reversed the transformation by spreading circulation types back into separate columns. This demonstrated how I can move between long and wide formats, depending on the type of analysis I need to do.
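The round trip between wide and long formats can be sketched on a small made-up table. This is only an illustration: the branch names and counts below are hypothetical, not values from the NYPL file, and the column names simply mirror the ones in the dataset. The sketch also assumes that only the three age-group columns should be pivoted, so that summing the long data reproduces the reported total instead of doubling it.

```r
library(tidyr)
library(dplyr)

# Toy data: hypothetical branches and counts, not values from the NYPL file
toy <- tibble(
  Branch = c("Branch A", "Branch B"),
  `CIRCULATION Adult` = c(100, 40),
  `CIRCULATION Young Adult` = c(20, 10),
  `CIRCULATION Juvenile` = c(30, 50),
  CIRCULATION = c(150, 100)
)

# The trailing space in the prefix leaves the CIRCULATION total column out
toy_long <- toy %>%
  pivot_longer(
    cols = starts_with("CIRCULATION "),
    names_to = "Circulation_Type",
    values_to = "Circulation_Count"
  )

# Per-branch sums of the long data can be compared with the reported totals
check <- toy_long %>%
  group_by(Branch) %>%
  summarize(
    Summed = sum(Circulation_Count),
    Reported = first(CIRCULATION)
  )

# pivot_wider() restores one column per circulation type
toy_wide <- toy_long %>%
  pivot_wider(
    names_from = "Circulation_Type",
    values_from = "Circulation_Count"
  )
```

After the round trip, toy_wide has the same columns as toy, although the pivoted columns are moved to the end.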
After reshaping the data, I created a bar chart of the Top 10 Library Branches by Circulation. Since the dataset includes many branches, I filtered it to the 10 with the highest total circulation. At first the graph was cluttered, so I used coord_flip() to swap the axes. Branch names are now listed down the vertical axis, where they are easier to read, and circulation counts run along the horizontal axis. I also applied scale_y_continuous(labels = scales::comma) so that circulation counts are printed with thousands separators (the y scale still controls the counts after the flip).
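The ordering of the bars can be illustrated in the same way. The sketch below uses hypothetical long-format data (not NYPL values) and assumes the goal is to sort branches by their summed count. fct_reorder() summarizes with the median unless told otherwise, so .fun = sum is passed explicitly.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical long-format data (not NYPL values)
toy_long <- tibble(
  Branch = rep(c("Branch A", "Branch B", "Branch C"), each = 2),
  Circulation_Type = rep(c("Adult", "Juvenile"), times = 3),
  Circulation_Count = c(50, 10, 20, 90, 30, 40)
)

# Factor levels sorted by each branch's summed count (ascending)
ordered_branch <- fct_reorder(toy_long$Branch, toy_long$Circulation_Count, .fun = sum)
levels(ordered_branch)

# Same ordering used inside the plot; coord_flip() puts branch names on the
# vertical axis, and the y scale still formats the counts after the flip
p <- ggplot(toy_long, aes(x = fct_reorder(Branch, Circulation_Count, .fun = sum),
                          y = Circulation_Count, fill = Circulation_Type)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Library Branch", y = "Circulation Count") +
  theme_minimal()
```

Because the levels are sorted in ascending order, the branch with the largest total ends up at the top of the flipped chart.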
Key Takeaways:
pivot_longer() helped me restructure the dataset by consolidating circulation types into one column, making comparisons easier.
pivot_wider() showed me how to convert data back into its original wide format when needed.
Filtering for the top 10 branches made the visualization more focused and clear.
Using coord_flip() improved readability by displaying branch names in a vertical list.
Overall, this section helped me see the importance of data tidying and how simple transformations can make complex datasets easier to analyze and visualize.
library(tidyverse)
library(maxLik)
## Loading required package: miscTools
##
## Please cite the 'maxLik' package as:
## Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
##
## If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
## https://r-forge.r-project.org/projects/maxlik/
library(ggplot2)
library(dplyr)
# Check for invalid values in CIRCULATION
summary(nypl_data$CIRCULATION)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 161295 286563 717810 393804 10049339 1
# Remove NAs and ensure no zero values in circulation
nypl_data <- nypl_data %>%
  filter(!is.na(CIRCULATION) & CIRCULATION > 0) %>% # Remove NAs and zeroes
  mutate(CIRCULATION = as.numeric(CIRCULATION)) # Ensure numeric format
# Define a better starting lambda value
start_param <- mean(nypl_data$CIRCULATION, na.rm = TRUE) / 2
# Define log-likelihood function for Poisson distribution
logLik_circulation <- function(param) {
  lambda <- param[1] # Poisson rate parameter (expected circulation)
  y <- nypl_data$CIRCULATION # Observed circulation numbers
  ll <- sum(dpois(y, lambda, log = TRUE)) # Compute log-likelihood
  return(ll)
}
# Run MLE estimation
mle_circulation <- maxLik(logLik = logLik_circulation, start = c(start_param))
# Display MLE results
summary(mle_circulation)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 1 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -42057593
## 1 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## [1,] 376850 Inf 0 1
## --------------------------------------------
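The Poisson parameter (λ) represents the expected circulation per branch. For a Poisson model, the maximum likelihood estimate of λ is the sample mean, so a successful fit should return the mean circulation of the filtered data. The reported estimate of 376,850 does not do that: the optimizer stopped after a single Newton-Raphson iteration (return code 8), and the value appears to be the starting value (half the sample mean) returned unchanged. It should therefore not be read as the average circulation per branch.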
The summary of the raw data also deserves a closer look. The median is 286,563 and the third quartile is 393,804, but the maximum is 10,049,339. A value that far above the rest is more likely an aggregate row (for example a network or borough total) than a single branch. This should be checked before modeling, because it pulls the mean (717,810) well above the median.
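Additionally, the standard error of Inf indicates that the optimization itself did not succeed (it stopped after one iteration), most likely because the numerically computed Hessian was zero or could not be inverted at that point. It is not evidence about the data. Overdispersion is a separate and real concern: a Poisson model with λ in the hundreds of thousands implies a standard deviation under 1,000 (√λ), while the interquartile range of circulation is about 232,000. A Negative Binomial model, which estimates the variance separately from the mean, would likely provide a better fit for future analyses.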
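A Poisson fit of this kind can be sanity-checked against its closed form. Setting the derivative of the log-likelihood, sum(y)/λ - n, to zero gives λ = mean(y). The sketch below uses hypothetical counts (not NYPL values) and base R's optimize() instead of maxLik, and it optimizes over log(λ) as one possible way to keep the parameter positive and well scaled. It illustrates the check and is not the method used above.

```r
# Hypothetical counts standing in for branch circulation (not NYPL values)
y <- c(120, 95, 143, 110, 87, 132)

# Poisson log-likelihood written in terms of log(lambda), which keeps
# lambda positive and puts the parameter on a well-scaled range
loglik <- function(log_lambda) {
  sum(dpois(y, lambda = exp(log_lambda), log = TRUE))
}

# One-dimensional maximization over a plausible range of log(lambda)
fit <- optimize(loglik, interval = log(range(y)), maximum = TRUE, tol = 1e-8)
lambda_hat <- exp(fit$maximum)

# Numerical estimate next to the closed-form estimate (the sample mean)
c(numerical = lambda_hat, closed_form = mean(y))
```

If the two numbers disagree noticeably, the optimizer has not converged, which is a useful check before interpreting any estimate.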
Histogram of Observed Circulation
ggplot(nypl_data %>% drop_na(CIRCULATION), aes(x = CIRCULATION)) +
  geom_histogram(binwidth = 500000, fill = "orange", alpha = 0.7) +
  labs(title = "Distribution of Circulation Numbers",
       x = "Circulation Count", y = "Frequency") +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal()
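The histogram gives a visual impression of the spread. A numeric check for overdispersion could look like the sketch below. It uses simulated counts (not NYPL values) and assumes the MASS package, which ships with standard R installations, for an intercept-only Negative Binomial fit. The seed and simulation parameters are arbitrary choices for illustration.

```r
set.seed(712)

# Simulated overdispersed counts (negative binomial), not NYPL values
y <- rnbinom(200, mu = 300, size = 2)

# Under a Poisson model the variance equals the mean, so this ratio
# should be close to 1; values far above 1 indicate overdispersion
dispersion_ratio <- var(y) / mean(y)
dispersion_ratio

# Intercept-only negative binomial fit; MASS:: avoids attaching the package
# and masking dplyr::select()
nb_fit <- MASS::glm.nb(y ~ 1)
nb_mean <- unname(exp(coef(nb_fit))) # estimated mean on the count scale
nb_theta <- nb_fit$theta             # estimated dispersion parameter
```

For the real data, y would be replaced by the filtered CIRCULATION column, after confirming that no aggregate rows remain in it.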
In this analysis, I explored circulation trends across different New York Public Library branches in Manhattan to understand how branch location, service hours, and library usage patterns impact circulation numbers. Based on my research questions, I found the following key insights:
1. Circulation numbers vary significantly across branches.
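Circulation ranges widely: the median is 286,563, the middle half of the values falls between 161,295 and 393,804, and the reported maximum is 10,049,339. If that maximum is a single branch rather than an aggregate row, it circulates far more than any typical branch. Differences of this size suggest that certain libraries serve larger populations or have higher community engagement, possibly due to location, collection size, or additional services offered.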
2. Branches with longer public service hours tend to have higher circulation.
Libraries that are open for more hours per week generally see higher circulation numbers. This suggests that longer accessibility increases book checkouts, as patrons have more opportunities to visit the library.
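One way to check this relationship directly is a rank correlation between weekly hours and circulation. The sketch below uses hypothetical branch-level data (not NYPL values) with simplified column names. Spearman's method is chosen here only as an assumption that a rank-based measure suits data with a few very large values.

```r
library(dplyr)

# Hypothetical branch-level data (not NYPL values)
toy <- tibble(
  Branch = paste("Branch", LETTERS[1:6]),
  Weekly_Hours = c(40, 42, 44, 46, 50, 54),
  Circulation = c(90000, 150000, 120000, 210000, 260000, 380000)
)

# Rank-based correlation is less sensitive to a few very large branches
hours_test <- cor.test(toy$Weekly_Hours, toy$Circulation, method = "spearman")
rho <- unname(hours_test$estimate)
rho
```

With the real data, the two columns would be `Weekly Hours of Public Service` and CIRCULATION. Tied values in the hours column would make cor.test() fall back to an approximate p-value.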
3. Some branches have significantly higher circulation than others.
The highest-circulating branches likely benefit from larger facilities, popular collections, or high foot traffic locations, while lower-circulating branches may have smaller collections or fewer visitors.
These findings align with expectations: branches with longer hours and higher demand circulate more books, while others serve smaller or less active communities. Library circulation appears to be influenced by both accessibility and demand. Future analysis could incorporate additional factors like demographics, digital resource checkouts, or seasonal trends to better understand user engagement patterns.