Using data from public libraries in New York City, we find a positive and significant link between hours of public service and attendance.
This study aims at establishing whether there is a link between hours of public service and attendance in public libraries, in particular for young adults. The results can be useful to public authorities and managers of these institutions to evaluate ex ante the likely impact of increasing or decreasing the opening hours.
The data was collected from July 2010 to June 2011 by the City of New York and made available on its open data platform : https://opendata.cityofnewyork.us/
It can be accessed at this link : https://data.cityofnewyork.us/Recreation/New-York-Public-Library-NYPL-Branch-Services-from-/ne9z-skhf
The following code loads the data and removes rows and columns irrelevant to the study. In particular, we remove observations for which the time of public service is zero, for instance for libraries that go through renovation.
library(readr)
library(tidyverse)
library(caret)
library(ggplot2)
libraries <- read_csv("New_York_Public_Library__NYPL__Branch_Services_from_7-2010_to_6-2011-1.csv")
Parsed with column specification:
cols(
.default = col_double(),
`Boro/Central Library` = [31mcol_character()[39m,
Network = [31mcol_character()[39m,
Branch = [31mcol_character()[39m
)
See spec(...) for full column specifications.
libraries <- data.frame(libraries)
str(libraries)
'data.frame': 109 obs. of 22 variables:
$ Boro.Central.Library : chr "Bronx" "Bronx" "Bronx" "Bronx" ...
$ Network : chr "Bronx Library Center Network" "Bronx Library Center Network" "Bronx Library Center Network" "Bronx Library Center Network" ...
$ Branch : chr "Allerton" "Belmont" "Bronx Library Center" "Francis Martin" ...
$ ADULT.Program : num 25 226 449 7 228 74 94 15 84 119 ...
$ ADULT.Attendance : num 250 2795 15179 108 1841 ...
$ YOUNG.ADULT.Program : num 37 186 201 44 220 47 106 82 186 132 ...
$ YOUNG.ADULT.Attendance : num 420 2166 4324 506 2450 ...
$ JUVENILE.Program : num 183 352 379 206 349 45 117 147 316 198 ...
$ JUVENILE.Attendance : num 4714 6792 13069 3805 8976 ...
$ OUTREACH.SERVICES.Program : num 1 62 303 18 368 40 7 117 99 59 ...
$ OUTREACH.SERVICES.Attendance : num 17 740 8151 645 5097 ...
$ TOTAL.Program : num 246 826 1332 275 1165 ...
$ TOTAL.Attendance : num 5401 12493 40723 5064 18364 ...
$ REFERENCE.TRANSACTIONS.Adult : num 75595 96980 254384 27469 44460 ...
$ REFERENCE.TRANSACTIONS.Young.Adult: num 32253 93028 78494 8112 17810 ...
$ REFERENCE.TRANSACTIONS.Juvenile : num 38038 94263 173134 11609 17238 ...
$ REFERENCE.TRANSACTIONS : num 145886 284271 506012 47190 79508 ...
$ CIRCULATION.Adult : num 62767 61924 354231 43431 77381 ...
$ CIRCULATION.Young.Adult : num 20191 14614 75239 15379 21413 ...
$ CIRCULATION.Juvenile : num 47961 42140 184508 37966 51671 ...
$ CIRCULATION : num 130919 118678 613978 96776 150465 ...
$ Weekly.Hours.of.Public.Service : num 46 46 78 46 50 46 46 46 46 46 ...
# Remove some useless rows
libraries <- filter(libraries, Branch != "Subtotal:") # 9 observations were removed
libraries <- filter(libraries, Weekly.Hours.of.Public.Service != 0) # 8 were removed
libraries <- libraries[!grepl("TOTAL", libraries$Branch),] # 5 were removed
libraries <- filter(libraries, TOTAL.Attendance != 0) # 1 was removed
# Remove useless columns
libraries$Branch <- NULL
libraries$Boro.Central.Library <- NULL
colSums(sapply(libraries, is.na)) # no missing values
Network ADULT.Program ADULT.Attendance
0 0 0
YOUNG.ADULT.Program YOUNG.ADULT.Attendance JUVENILE.Program
0 0 0
JUVENILE.Attendance OUTREACH.SERVICES.Program OUTREACH.SERVICES.Attendance
0 0 0
TOTAL.Program TOTAL.Attendance REFERENCE.TRANSACTIONS.Adult
0 0 0
REFERENCE.TRANSACTIONS.Young.Adult REFERENCE.TRANSACTIONS.Juvenile REFERENCE.TRANSACTIONS
0 0 0
CIRCULATION.Adult CIRCULATION.Young.Adult CIRCULATION.Juvenile
0 0 0
CIRCULATION Weekly.Hours.of.Public.Service
0 0
The following graphs display some patterns :
Also notice at least two important outliers. Looking back at the data, these are the Mid-Manhattan Library and, to a lesser extent, the Bronx Library Center. In particular, the former will for sure severely impact the coefficient of the regression for young adults because although it has extensive opening hours its young adults’ attendance is very low.
ggplot(data = libraries) +
ggtitle("Total attendance") +
geom_histogram(mapping = aes(x = TOTAL.Attendance), bins = 60)
ggplot(data = libraries) +
ggtitle("Young adult attendance") +
geom_histogram(mapping = aes(x = YOUNG.ADULT.Attendance), bins = 60)
ggplot(data = libraries) +
geom_boxplot(mapping = aes(x = Network, y = TOTAL.Attendance)) +
labs(title = "Attendance by network", x = "Network") +
coord_flip()
ggplot(data = libraries) +
geom_boxplot(mapping = aes(x = Network, y = YOUNG.ADULT.Attendance)) +
labs(title = "Young adult attendance by network", x = "Network") +
coord_flip()
ggplot(data = libraries) +
geom_point(mapping = aes(x = Weekly.Hours.of.Public.Service,
y = TOTAL.Attendance, color = Network)) +
labs(title = "Hours of public service, network and attendance", x = "Hours of public service")
ggplot(data = libraries) +
geom_point(mapping = aes(x = Weekly.Hours.of.Public.Service,
y = YOUNG.ADULT.Attendance,
color = Network)) +
labs(title = "Hours of public service, network and young adult attendance", x = "Hours of public service")
cor(libraries$Weekly.Hours.of.Public.Service, libraries$TOTAL.Attendance)
[1] 0.6962132
cor(libraries$Weekly.Hours.of.Public.Service, libraries$YOUNG.ADULT.Attendance)
[1] 0.3671619
We implement a linear regression to establish the link between hours of public service and attendance. We control for the network of the public library, which is first one-hot encoded.
# save the response variable
Target <- libraries$TOTAL.Attendance # total attendance
Target_young <- libraries$YOUNG.ADULT.Attendance # young adult attendance
# get dummy variables
dummies <- dummyVars(TOTAL.Attendance ~ ., data = libraries)
predictors <- predict(dummies, newdata = libraries)
predictors <- data.frame(predictors)
# remove useless predictors
libraries_total <- predictors[, -c(7:23)]
# append the total attendance variable
libraries_total$Target <- Target
For total attendance, we find a positive and very significant coefficient for the hours of public service variable : controlling for the network, one more hour of public service in the library is associated with 727 more people attending, everything else kept equal. Notice the NA coefficient for one of the control variable means it is linearly dependent with other predictors.
# linear regression model
model_lr <- train(Target ~ .,
data = libraries_total,
method = "lm")
model_lr
Linear Regression
86 samples
7 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 86, 86, 86, 86, 86, 86, ...
Resampling results:
RMSE Rsquared MAE
5082.937 0.2982751 4032.802
Tuning parameter 'intercept' was held constant at a value of TRUE
summary(model_lr)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-9020.8 -2964.2 -792.9 3451.9 13226.2
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -24483.7 3885.9 -6.301 1.57e-08 ***
NetworkBronx.Library.Center.Network -1109.5 1790.0 -0.620 0.537
NetworkCentral -2978.2 2808.1 -1.061 0.292
NetworkCountee.Cullen.Network 2229.5 1766.6 1.262 0.211
NetworkParkchester.Network 395.2 1766.3 0.224 0.824
NetworkSeward.Park.Network 1626.9 1766.0 0.921 0.360
NetworkStaten.Island.Network NA NA NA NA
Weekly.Hours.of.Public.Service 726.9 78.9 9.214 3.72e-14 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 4606 on 79 degrees of freedom
Multiple R-squared: 0.5277, Adjusted R-squared: 0.4918
F-statistic: 14.71 on 6 and 79 DF, p-value: 3.323e-11
For young adults’ attendance, we also find a positive and very significant coefficient for the hours of public service variable : controlling for the network, one more hour of public service in the library is associated with 69 more young adults, everything else kept equal. This number is of course much lower because it only pertains to a share of the population. It is also due to the presence of an important outlier (the Mid-Manhattan Library : see above, Data exploration). Without this outlier, the added number of young adults in the library would jump to 103 (compared to 69) for one more hour of public service, everything else equal. The coefficient remains very significant.
# switch the response variable
libraries_total$Target <- NULL
libraries_total$Target_young <- Target_young
# linear regression model
model_lr_young <- train(Target_young ~ .,
data = libraries_total,
method = "lm")
model_lr_young
Linear Regression
86 samples
7 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 86, 86, 86, 86, 86, 86, ...
Resampling results:
RMSE Rsquared MAE
1089.502 0.2371012 802.7384
Tuning parameter 'intercept' was held constant at a value of TRUE
summary(model_lr_young)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-1692.8 -646.6 -144.9 485.9 2938.3
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2195.70 786.39 -2.792 0.00656 **
NetworkBronx.Library.Center.Network 425.74 362.25 1.175 0.24341
NetworkCentral -1399.85 568.28 -2.463 0.01594 *
NetworkCountee.Cullen.Network 238.60 357.51 0.667 0.50646
NetworkParkchester.Network 788.47 357.45 2.206 0.03030 *
NetworkSeward.Park.Network 117.60 357.39 0.329 0.74298
NetworkStaten.Island.Network NA NA NA NA
Weekly.Hours.of.Public.Service 68.97 15.97 4.320 4.5e-05 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 932.1 on 79 degrees of freedom
Multiple R-squared: 0.3077, Adjusted R-squared: 0.2551
F-statistic: 5.853 on 6 and 79 DF, p-value: 4.364e-05
# remove the outlier
libraries_total <- filter(libraries_total, Weekly.Hours.of.Public.Service != 88)
# linear regression model without the outlier
model_lr_young_2 <- train(Target_young ~ .,
data = libraries_total,
method = "lm")
model_lr_young_2
Linear Regression
85 samples
7 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 85, 85, 85, 85, 85, 85, ...
Resampling results:
RMSE Rsquared MAE
928.4923 0.2345439 712.4985
Tuning parameter 'intercept' was held constant at a value of TRUE
summary(model_lr_young_2)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-1287.1 -586.8 -184.2 452.9 3011.9
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3793.08 944.40 -4.016 0.000135 ***
NetworkBronx.Library.Center.Network 352.21 348.38 1.011 0.315155
NetworkCentral -824.01 582.24 -1.415 0.160980
NetworkCountee.Cullen.Network 290.69 343.36 0.847 0.399803
NetworkParkchester.Network 738.31 343.27 2.151 0.034585 *
NetworkSeward.Park.Network 165.83 343.18 0.483 0.630284
NetworkStaten.Island.Network NA NA NA NA
Weekly.Hours.of.Public.Service 103.70 19.68 5.270 1.18e-06 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 893.9 on 78 degrees of freedom
Multiple R-squared: 0.3698, Adjusted R-squared: 0.3213
F-statistic: 7.628 on 6 and 78 DF, p-value: 1.848e-06