For every university, the library is a place possessing various functions, allowing students to learn, to study, and sometimes to take a nap. At NCCU, there are two main libraries: Chung-Cheng (中正) Library and Dah-Hsian (達賢) Library. The former is located in the main campus, while the latter is approximately a 5 to 10 minutes walk away. These libraries differ not only in location, but also in design, fulfilling users for multiple purposes.
For example, if students have to prepare for a group project, they are more likely to visit Dah-Hsian library to book a discussion room. In contrast, those seeking a quiet place to study might prefer Chung-Cheng Library. Additionally, the ground floor of Chung-Cheng library is available for students to study overnight. This area is divided in to three sections: A, B and C. Section A and B provide access to computers and other electronic devices, which the other prohibits the use of any electronic devices for silent.
Out of curiosity, the research aims to explore the usage trends between these libraries and investigate how Bayesian statistics can be applied to model the orientation and utilization of each area.
Method
Data Collection and Pre-Processing
The final data is crawled from the website of the NCCU library via Python, where the code can be found in this page.
We set the program to run every hour since 16th October 16,2024 till January 15, 2025, and the total sample size is 2126. Also, the original column of time is the combination of date and time (2024/10/16 09:00, for instance), we use excel to sepearate this column into three. See below the table.
Last but not the least, the opening of each area is different and it would vary within the exam periods, so we need to filter the data individually, and exam each data format is what we desired for later analysis.
Here are the descriptive statistics of visitors of each library. Combining the table result with the boxplot, we can tell that people mainly go to CC_L more that the DH_L, and CC_L contains more extreme values. Further, visitors in area A is more than area B and area C, and neither one area is correspond to normal distribution.
CC_L
DH_L
CC_A
CC_B
CC_C
Sample Size
1216
1055
2078
2078
2078
Min
.0
.0
.0
.0
.0
1st Qu.
180.8
172.5
14.0
.0
4.0
Median
652.5
557.0
34.0
7.0
21.0
Mean
848.5
571.9
39.5
9.9
27.8
3rd Qu.
1358.8
953.0
60.0
15.0
44.0
Max
3484.0
1667.0
133.0
60.0
130.0
Figure 1: Boxplots showing the distribution of counts across different libraries.
The two figures below are time serious plot using package xts and lubridate (can zoom in and out according to x or y axis, and double click the figure can return). The top chart represents the number of people entering Chung-Cheng library(blue lines) and Dah-Hsian libaray (green lines) during the recorded period. Notice that the sample size is smaller than original data due to the mismatch of two libraries in opening hours and days. From the pattern we can tell visitors in Chung-Cheng library is more than Dah-Hsian generally.
The figure 3 is composed by CC_A(navy lines), CC_B(green lines), and CC_C(purple lines) three area. These areas are open for 24 hours and seems no huge difference from the time series plot.
Figure 2: Time series plot for CC_L and DH_L
Figure 3: Time series plot for CC_A, CC_B, and CC_C
To summarize, the data indicates a routine variation and a slight gap between libraries. In the next section, we plan to apply Bayesian inference to model the data and explore potential factors that could provide a meaningful explanation.
Bayesian Statistics
Model 1
For our analysis, we adapted the poisson model because it is a discrete probability distribution suitable for modeling count data during a fixed time interval. The Poisson model can be defined as:
\[
P(Y=y|\lambda) = \frac{\lambda^ye^{-\lambda}}{y!}, y =0,1,2,...
\]
In this context, we recorded the number of people entering libraries per hour. Each entry was assumed to be independent and not influenced by others. Based on this assumption, the likelihood function for N observations is given by:
Here, \(\lambda\) represents both the mean and variance of the Poisson distribution, which specifically reflects the average number of people entering the library each hour in our study.
For the prior, we used Gamma distribution,which serves as a conjugate prior for the Poisson likelihood. This process will simplifies the Bayesian inference process, as the posterior distribution also follows a Gamma distribution. Specifically, the Gamma prior is parameterized as:
\[
P(\lambda)=\frac{\beta^\alpha}{\tau(\alpha)}\lambda^{\alpha-1}e^{-\beta\lambda}, \lambda>0
\] where \(\alpha\) determines the shape of the Gamma distribution and \(\beta\) influence the spread of the distribution.
We build a new dataframe for modeling,where the library_dta includes all the counts across libraries (n=8505), and library_dta_index represents the length of each library counts. Further, transform them into a list including L(the number of libraries).
# preparing data #library_dta <-c(CC_L$CC_L, DH_L$DH_L, CC_A$CC_A, CC_B$CC_B, CC_C$CC_C)library_dta_index <-rep((1:5),times=c(length(CC_L$CC_L),length(DH_L$DH_L),length(CC_A$CC_A),length(CC_B$CC_B),length(CC_C$CC_C)))dta_L <-list(y = library_dta,index = library_dta_index,N =length(library_dta), L =5)
m1 <-"model{ for (i in 1:N){ # likelihood function y[i] ~ dpois(lambda[index[i]]) pred_y[i] ~ dpois(lambda[index[i]]) } # prior for (j in 1:L){ alpha[j] ~ dgamma(1,1) beta[j] ~ dgamma(1,1) lambda[j] ~ dgamma(alpha[j],beta[j]) }}"
We set the iteration for 10000 times and 5 chains, and using thin=10 to avoid sampling from a specific range of estimation.
Extracting parameters, here we only extract lambda and prediction because we only concern the Poisson rate of each library and the predictability of model 1.
# lambda: poisson rate parameter lambda_m1 <-sapply(1:5,function(s){ temp<-paste0("lambda[",s,"]")filter(sample_m1,Parameter==temp)$value})# prediction of Model 1pred_m1 <-pblapply(1:length(dta_L$y),function(x){ #pblapply for showing progress bar temp<-paste0("pred_y[",x,"]")filter(sample_m1,Parameter==temp)$value %>%mean()})
The figure below is the density plot of lambda across each library, representing the expected number of people entering the library in an hour. The results indicate that CC_L has the highest lambda values followed by DH_L with the second-highest values.CC_A, CC_B, and CC_C have much lower values, with higher, centered density compared to CC_A and CC_C.
Figure 4: Posterior Density of Lambda in Each Library
Figure 5 demonstrates that the predicted value successfully differentiate the five libraries and align with the lambda results, showing that CC_L has the greatest number of visitors. However, the model fails to fit the observed data over time and instead represent only the average numbers of visitors per hour. Thought this limitation suggests that the model does not account for time-dependent variations in visitor patterns, it did manifest the Poisson distribution is appropriate for our data.
# data for m1 comparison compare_dta <-data.frame(y = library_dta,pred =do.call(rbind,pred_m1),index = library_dta_index)ggplot(compare_dta) +geom_line(aes(x =seq_along(y), y = y, color =as.factor(index), group =as.factor(index)), size =1) +geom_line(aes(x =seq_along(pred), y = pred, color ="Predicted", group =as.factor(index)), size =1, linetype ="dashed") +theme_minimal() +scale_color_manual(name ="Libraries",values =c("1"="#8AD6E4", "2"="#B6EEE2", "3"="#A1BDE6", "4"="#E6BBC7", "5"="#C6A5CC", "Predicted"="blue"),labels =c("1"="CC_L", "2"="DH_L", "3"="CC_A", "4"="CC_B", "5"="CC_C", "Predicted"="Predicted") ) +labs(x ="Observation Index",y ="People (hour)") +theme(legend.position ="top")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Figure 5: Comparison of Observed and Predicted Values
Model 2
With the success of model 1, we intend to build model 2 based on it and estimate the data by using a Poisson regression model. Again, the likelihood function for each entry is specified as : \[
Y_i\sim Poisson(\lambda_i),\lambda_i >0
\] where \(\lambda_i\) is modeled as a linear function of a set of predictors \(x\) with a log link function such that:
\[
log(\lambda_i) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p.
\] This formula is analogous to linear regression, where \(\beta_0\) represents the intercept, and \(\beta_j\) denotes the coefficient corresponding to the change in predictor \(x_j\). Specifically, a one-unit increase in \(x_j\) is associated with a multiplicative change of \(e^{\beta_j}\) in the expected outcome \(\mathbb{E}[Y]\).
#prepare data num_CCL <- CC_L %>%select(days,hours,weekdays)num_DHL <- DH_L %>%select(days,hours,weekdays)num_CCA <- CC_A %>%select(days,hours,weekdays)num_CCB <- CC_B %>%select(days,hours,weekdays)num_CCC <- CC_C %>%select(days,hours,weekdays)num_varia <-rbind(num_CCL,num_DHL,num_CCA,num_CCB,num_CCC)# the same with the m1, just for a check library_dta <-c(CC_L$CC_L, DH_L$DH_L, CC_A$CC_A, CC_B$CC_B, CC_C$CC_C)library_dta_index <-rep((1:5),times=c(length(CC_L$CC_L),length(DH_L$DH_L),length(CC_A$CC_A),length(CC_B$CC_B),length(CC_C$CC_C)))dta_all <-list(y = library_dta,index = library_dta_index,D = num_varia$days, H = num_varia$hours, W = num_varia$weekdays,N =length(library_dta), L =5)
m2 <-"model { for (i in 1:N) { # likelihood function log(lambda[i]) <- log(lambda_lib[index[i]]) + b0[index[i]] + b1[index[i],W[i]] + b2[index[i],H[i]] + b3[index[i]] * D[i] y[i] ~ dpois(lambda[i]) # prediction pred_y[i] ~ dpois(lambda[i]) } for (k in 1:L) { lambda_lib[k] ~ dgamma(alpha[k], beta[k]) #for designed library alpha[k] ~ dgamma(1, 1) beta[k] ~ dgamma(1, 1) b0[k] ~ dnorm(0,0.001) #intercept b3[k] ~ dnorm(0,0.001) # day for (j in 1:7){ b1[k,j] ~ dnorm(0, 0.001) #weekdays } for (m in 1:24){ b2[k,m] ~ dnorm(0,0.001) #hours } }}"
We also set the 10000 iteration, 5 chains, and thin=10. (Note: For avoiding runing to long, the iteration can be changed into 1000, 3 chains, and without thin. The results would be similar.)
# parameter of intercepts * 5 librariesb0_samples <-pbsapply(1:5,function(j){ temp <-paste0("b0[",j,"]")filter(sample_m2,Parameter==temp)$value})# parameter of days * 5 librariesb3_samples <-pbsapply(1:5,function(j){ temp <-paste0("b3[",j,"]")filter(sample_m2,Parameter==temp)$value})# parameter of weekdays b1_samples <-pbsapply(1:5,function(x){sapply(1:7,function(j){ temp <-paste0("b1[",x,",",j,"]")filter(sample_m2,Parameter==temp)$value })})# mean of b1 parameter mean_b1 <-pbsapply(1:5,function(x){ #mean sapply(1:7,function(j){ temp <-paste0("b1[",x,",",j,"]")filter(sample_m2,Parameter==temp)$value %>%mean() })})# parameter of hours b2_samples <-pbsapply(1:5,function(x){sapply(1:24,function(j){ temp <-paste0("b2[",x,",",j,"]")filter(sample_m2,Parameter==temp)$value })})# mean of b2 parameter mean_b2 <-pbsapply(1:5,function(x){sapply(1:24,function(j){ temp <-paste0("b2[",x,",",j,"]")filter(sample_m2,Parameter==temp)$value %>%mean() })})
# parameter of lambda lambda_samples <-pbsapply(1:length(dta_all$y),function(s){ temp <-paste0("lambda[",s,"]")filter(sample_m2,Parameter==temp)$value})# iteration mean of Predicted valuespred_samples <-pbsapply(1:length(dta_all$y),function(s){ temp <-paste0("pred_y[",s,"]")filter(sample_m2,Parameter==temp)$value %>%mean()})
Figure 6 demonstrates the estimation of coefficient \(\beta_1\) for weekdays across different libraries. Examining the weekday patterns, it is evident that only Thursday deviates from the others. Further inspection on the mean of the coefficient \(\beta_1\) reveals that on Thursday, the coefficient for DH_L is significantly smaller, indicating that visitor numbers at DH_L are lower compared to other libraries on this day.
Another important point to note is that Model 2 did not exclude Sunday for DH_L in the estimation of opening hours. Similarly, Figure 7 shows that the opening hours for DH_L do not meet the expected criteria.
b1_dta <-data.frame(b1_samples)colnames(b1_dta) <-c("CC_L","DH_L","CC_A","CC_B","CC_C")b1_dta$weekdays <-rep(c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"),each=500)b1_dta <-melt(b1_dta,id="weekdays")ggplot(b1_dta, aes(x = value, fill = variable, color = variable)) +geom_density(alpha =0.4) +facet_wrap(~ weekdays, scales ="free_y") +scale_x_continuous(limits=c(-10,10)) +#limit the range for clear patternlabs(x="b1")
Warning: Removed 377 rows containing non-finite outside the scale range
(`stat_density()`).
Figure 6: Density plot of B1 across weekdays and libraries
Figure 7 illustrates the coefficient \(\beta_2\) of 24 hours for various libraries. A significant drop in coefficients is observed around 9 a.m., which coincides with the opening time of CC_L and DH_L. In contrast, CC_A, CC_B, and CC_C operate 24 hours a day, leading to less variation in their coefficients throughout the day.
Notably, CC_B consistently exhibits the lowest coefficients, while DH_L shows the largest fluctuations, particularly during opening hours.
b2_dta <-data.frame(mean_b2)colnames(b2_dta) <-c("CC_L","DH_L","CC_A","CC_B","CC_C")b2_dta$hours <-rep(as.factor(1:24))b2_dta <-melt(b2_dta,id="hours")ggplot(b2_dta, aes(x = hours, y = value, color = variable, group = variable)) +geom_line() +geom_point() +labs(x ="Hours",y ="Regression Coef (B2)",color ="Libraries") +theme_minimal()
To evaluate whether Model 2 is proper for our data, the below figure shows a better estimation, effectively capturing the trends vary over time compare to Model 1. In addition, the \(R^2\) was calculated to assess the model’s performance. With an \(R^2\) of approximately 0.93, Model 2 demonstrates a strong ability to depict the usage trends across the libraries.
compare_dta <-data.frame(y = library_dta,pred = pred_samples,index = library_dta_index)ggplot(compare_dta) +geom_line(aes(x =seq_along(y), y = y, color =as.factor(index), group =as.factor(index)), size =1) +geom_line(aes(x =seq_along(pred), y = pred, color ="Predicted", group =as.factor(index)), size =0.3, linetype ="dashed") +theme_minimal() +scale_color_manual(name ="Libraies",values =c("1"="#8AD6E4", "2"="#B6EEE2", "3"="#A1BDE6", "4"="#E6BBC7", "5"="#C6A5CC", "Predicted"="#FFDA83"),labels =c("1"="CC_L", "2"="DH_L", "3"="CC_A", "4"="CC_B", "5"="CC_C", "Predicted"="Predicted") ) +labs(x ="Observation Index",y ="People (hour)") +theme(legend.position ="top")
Figure 8: Comparison of Observed and Predicted Values
In this research, we successfully achieved our goal of using Bayesian inference to clearly portray the patterns of library usage across the five libraries in NCCU. Among them, the Chung-Cheng library serves the largest number of visitors, highlighting its central role when in meeting students’ needs.
When examining usage trends by weekday, we observed a relatively stable pattern across most libraries, with the exception of the Dah-Hsian Library, which experiences a significant drop in visitors on Thursdays. However, the reason behind this unique trend remains unclear based on the current data and needs further investigation.
Turning to hourly trends, the data reveal that library usage is at its lowest around 9 a.m., likely overlapping the opening hours of certain libraries. Usage gradually increases throughout the day, stabilizing during the afternoon and evening. Interestingly, for libraries such as CC_A, CC_B, and CC_C, which operate 24 hours, the hourly fluctuations are less pronounced compared to CC_L and DH_L, whose restricted operating hours contribute to more distinct peaks and troughs in usage. Another contributing factor might be the limited seating availability of CC_A, CC_B, and CC_C, which could lead to a ceiling effect during peak times.
In summary, Bayesian statistics has proven to be a powerful tool for uncovering patterns in library usage. The results provide valuable insights that can help the university optimize scheduling for activities such as cleaning, events, and maintenance, ensuring better resource management and user satisfaction.
Reference
Jarad Niemi. (2013, January 22). Bayesian inference for Poisson data [Video]. YouTube. https://www.youtube.com/watch?v=lNrpPNk6InU
Kruschke, J. (2014). Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. Academic Press.
OpenAI. (2025). ChatGPT (January 2025 version) [Large language model]. Used for grammar checking and content verification. Retrieved from https://openai.com
Shao, S. (2022). A tutorial on Bayesian analysis of count data using JAGS. Journal of Behavioral Data Science, 2(2), 1–18. https://doi.org/10.35566/jbds/v2n2/shao
Zhou, Y. (2022, March 11). Bayesian inference for the poisson model. Yi’s Knowledge Base. https://www.y1zhou.com/series/bayesian-stat/bayesian-stat-bayesian-inference-poisson/