1 Introduction

1.1 Data Set & Variables

bike.counts <- read.csv("C:\\Users\\eh738\\OneDrive\\Documents\\STA321\\BikeCounts.csv")
#removing commas from count variable values and converting to numeric
bike.counts$BrooklynBridge <- as.numeric(gsub(",", "", bike.counts$BrooklynBridge))
bike.counts$Total <- as.numeric(gsub(",", "", bike.counts$Total))

The data set used for this analysis is a subset of a larger pool of data collected by the Traffic Information Management System (TIMS), which keeps daily counts of the number of cyclists using different New York City bridges to enter and leave the city. This subset focuses exclusively on the Brooklyn Bridge during July 2017. Each observation corresponds to a day in that month and consists of the count of cyclists who used the bridge that day (BrooklynBridge), as well as some other variables, including the day of the week, the high temperature, the low temperature, the amount of precipitation, and the total count of all cyclists who used any of the four East River bridges (Brooklyn, Manhattan, Williamsburg, or Queensboro).

1.2 Research Question

The aim of this analysis is to investigate a potential relationship between the number of cyclists who use the Brooklyn Bridge on a given day (response variable) and the other variables in the data set, particularly the day of the week, the temperature, and precipitation (predictor variables). Both Poisson & quasi-Poisson regression will be used to model this relationship, and one will be chosen as the final model from which to make statistical and practical inferences.

1.3 Statistical Models

Poisson regression relies on a few assumptions when used for statistical inference: that the response variable follows a Poisson distribution, in which the mean is equal to the variance; that all observations are independent; and that the log of the mean rate of occurrence is a linear function of the predictors. Quasi-Poisson regression is mostly similar, but relaxes the assumption that the mean & variance of the response variable are equal. Therefore, for a data set in which the variance of the chosen response variable is greater than its mean, quasi-Poisson regression would be preferable for constructing a model and drawing inferences.
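As a compact restatement of these assumptions, the two models can be written as follows. The notation is introduced here for illustration and is not taken from the original analysis: \(Y_i\) is the count on day \(i\), \(\mu_i\) its mean, \(x_{i1}, \dots, x_{ip}\) the predictor values, and \(\phi\) the dispersion parameter.

\[
\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, \qquad
\operatorname{Var}(Y_i) =
\begin{cases}
\mu_i & \text{(Poisson)} \\
\phi \, \mu_i & \text{(quasi-Poisson)}
\end{cases}
\]

Setting \(\phi = 1\) recovers the Poisson case, while \(\phi > 1\) corresponds to overdispersion.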

In this study, the final model will be chosen based on the estimated dispersion parameter of the quasi-Poisson model, which indicates to what degree the actual variance of the response variable deviates from the variance to be expected under the assumption of a perfect Poisson distribution. A value close to 1 is consistent with the Poisson assumption, while a value well above 1 indicates overdispersion.
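To make the dispersion parameter more concrete, the following is a minimal sketch of how it can be computed by hand as the Pearson chi-square statistic divided by the residual degrees of freedom. It uses simulated overdispersed counts rather than the bike data, and the names sim.data, pois.fit, quasi.fit, and phi.hat are assumptions made for this illustration.

```r
# Simulated overdispersed counts (hypothetical data, not the bike counts)
set.seed(1)
n  <- 200
x  <- rnorm(n)
mu <- exp(2 + 0.3 * x)
y  <- rnbinom(n, mu = mu, size = 2)  # variance = mu + mu^2 / size, which exceeds mu

sim.data  <- data.frame(y = y, x = x)
pois.fit  <- glm(y ~ x, family = poisson, data = sim.data)
quasi.fit <- glm(y ~ x, family = quasipoisson, data = sim.data)

# Pearson chi-square statistic divided by the residual degrees of freedom
phi.hat <- sum(residuals(pois.fit, type = "pearson")^2) / df.residual(pois.fit)

phi.hat                          # hand-computed dispersion estimate
summary(quasi.fit)$dispersion    # estimate reported by the quasi-Poisson fit
```

The two printed values should agree, since the quasi-Poisson fit uses the same coefficient estimates as the Poisson fit and only rescales the variance.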

2 Exploratory Data Analysis

Looking at the data set, there are some issues with the numerical predictor variables that could lead to sub-optimal regression models and invalid inferences if not properly addressed. First, it is intuitive to suspect that there is a correlation between the daily high & low temperatures (i.e., the predictor variables HighTemp & LowTemp), which can be checked via a pair-wise scatter plot:

library(psych)
pairs.panels(bike.counts[, c(3, 4, 5)], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE, # show correlation ellipses
             main = "Pair-wise Scatter Plot of Numerical Predictor Variables"
             )

The high correlation coefficient for HighTemp & LowTemp confirms that the two variables are in fact highly positively correlated. Instead of simply dropping one of them, however, their average will be taken to create a new variable which functions as a single balanced metric of the general temperature level on each day. This variable will be used in place of the two original temperature variables in the construction of the regression models.

bike.counts$AvgTemp <-  (bike.counts$HighTemp + bike.counts$LowTemp)/2

Another issue highlighted by the pair-wise scatter plot is the extreme skewness of the distribution of the Precipitation variable. It is clear from the histogram that the value of this variable is 0 for the vast majority of the observations (i.e., that there was no recorded precipitation on those days). Therefore, instead of keeping it as a continuous numerical variable, it will be discretized into a binary variable, equal to 0 if there was no recorded precipitation on that day or 1 if the amount of precipitation was greater than 0.

bike.counts$NewPrecip <- as.factor(ifelse(bike.counts$Precipitation > 0, "1", "0"))

With these predictor-variable issues addressed, the regression models can now be constructed using the new transformed variables as predictors.

3 Constructing Poisson Regression Models

3.1 Poisson Model

The first attempt at modeling the daily cyclist count on the Brooklyn Bridge based on the predictor variables will use regular Poisson regression. That is, its interpretation will rely on the implicit assumption that the variance of the response variable (cyclist count) is equal to its mean. The predictor variables will be the day of the week (Day), the daily average temperature (AvgTemp), and the occurrence or absence of precipitation (NewPrecip).

library(knitr)
poisson.model <- glm(BrooklynBridge ~ Day + AvgTemp + NewPrecip, family = poisson(link = "log"), data = bike.counts)
pois.count.coef <- summary(poisson.model)$coef
kable(pois.count.coef, caption = "Poisson Regression Model for Daily Brooklyn Bridge Cyclist Counts")
Poisson Regression Model for Daily Brooklyn Bridge Cyclist Counts
                Estimate  Std. Error     z value  Pr(>|z|)
(Intercept)    7.0310771   0.0571813  122.961145         0
DayMonday      0.3155691   0.0137790   22.902236         0
DaySaturday    0.0954727   0.0143402    6.657703         0
DaySunday      0.2117931   0.0140364   15.088854         0
DayThursday    0.2535264   0.0144529   17.541511         0
DayTuesday     0.2044464   0.0147564   13.854763         0
DayWednesday   0.2633815   0.0145623   18.086560         0
AvgTemp        0.0100066   0.0007239   13.822369         0
NewPrecip1    -0.3632282   0.0093898  -38.683148         0

According to this output, all three predictor variables are highly statistically significant. The model predicts that, assuming the other two predictors remain constant, the highest cyclist counts should occur on Mondays, and the lowest on Fridays. It also suggests that cyclist count is positively associated with the daily average temperature (even in July), and that the occurrence of precipitation should result in fewer cyclists using the bridge that day, as intuition would dictate.
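Friday does not appear in the coefficient table because it serves as the reference level: R orders factor levels alphabetically by default, so Friday comes first and is absorbed into the intercept. The sketch below illustrates this with a stand-alone vector of day names; day.factor and day.releveled are hypothetical names, and the choice of Monday as an alternative baseline is only an example, not something done in the original analysis.

```r
# Stand-alone vector of day names (illustration only, not the bike data)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")

# factor() sorts the levels alphabetically; the first level is the baseline in glm()
day.factor <- factor(days)
levels(day.factor)

# relevel() moves a chosen level to the front, making it the new baseline
day.releveled <- relevel(day.factor, ref = "Monday")
levels(day.releveled)
```

Changing the baseline would change the individual day coefficients and their p-values, but not the fitted values of the model.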

3.2 Quasi-Poisson Model

Before performing a deeper analysis of the regression coefficients & their potential practical significance, a quasi-Poisson regression model using the same response & predictor variables will be constructed. This is to account for the possibility that the variance of the response variable is significantly greater than its mean, which would be a violation of a major assumption of the previous model, and therefore raise concerns about the validity of that model’s p-values.

quasi.poisson.model <- glm(BrooklynBridge ~ Day + AvgTemp + NewPrecip, family = quasipoisson, data = bike.counts)
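# store the quasi-Poisson coefficients under their own name so the Poisson table is not overwritten
quasi.count.coef <- summary(quasi.poisson.model)$coef
kable(quasi.count.coef, caption = "Quasi-Poisson Regression Model for Daily Brooklyn Bridge Cyclist Counts")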
Quasi-Poisson Regression Model for Daily Brooklyn Bridge Cyclist Counts
                Estimate  Std. Error     t value   Pr(>|t|)
(Intercept)    7.0310771   0.5355679  13.1282648  0.0000000
DayMonday      0.3155691   0.1290557   2.4452165  0.0229443
DaySaturday    0.0954727   0.1343122   0.7108268  0.4846600
DaySunday      0.2117931   0.1314668   1.6110005  0.1214352
DayThursday    0.2535264   0.1353682   1.8728648  0.0744367
DayTuesday     0.2044464   0.1382104   1.4792396  0.1532537
DayWednesday   0.2633815   0.1363923   1.9310584  0.0664620
AvgTemp        0.0100066   0.0067805   1.4757810  0.1541728
NewPrecip1    -0.3632282   0.0879464  -4.1301064  0.0004391
summary(quasi.poisson.model)
## 
## Call:
## glm(formula = BrooklynBridge ~ Day + AvgTemp + NewPrecip, family = quasipoisson, 
##     data = bike.counts)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.031077   0.535568  13.128 6.95e-12 ***
## DayMonday     0.315569   0.129056   2.445 0.022944 *  
## DaySaturday   0.095473   0.134312   0.711 0.484660    
## DaySunday     0.211793   0.131467   1.611 0.121435    
## DayThursday   0.253526   0.135368   1.873 0.074437 .  
## DayTuesday    0.204446   0.138210   1.479 0.153254    
## DayWednesday  0.263382   0.136392   1.931 0.066462 .  
## AvgTemp       0.010007   0.006781   1.476 0.154173    
## NewPrecip1   -0.363228   0.087946  -4.130 0.000439 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 87.72455)
## 
##     Null deviance: 5318.8  on 30  degrees of freedom
## Residual deviance: 1988.0  on 22  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

Given the estimated dispersion parameter of 87.72455, far above the value of 1 expected under a Poisson distribution, it would appear that the data set does indeed violate the major assumption of Poisson regression that the mean and variance of the response variable are equal. Consequently, the p-values of the regression coefficients produced by the regular Poisson model are likely invalid, and this quasi-Poisson model should therefore be chosen as the final model from which inferences are made.
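The effect of the dispersion parameter on the inference can be checked directly from the two coefficient tables: the coefficient estimates are identical, and each quasi-Poisson standard error is the corresponding Poisson standard error multiplied by the square root of the dispersion estimate. The sketch below verifies this for three coefficients using the reported values; the variable names are assumptions made for this illustration.

```r
# Values copied from the two coefficient tables above
phi.hat <- 87.72455  # reported quasi-Poisson dispersion parameter

se.poisson <- c("(Intercept)" = 0.0571813,
                AvgTemp       = 0.0007239,
                NewPrecip1    = 0.0093898)

se.quasi.reported <- c("(Intercept)" = 0.5355679,
                       AvgTemp       = 0.0067805,
                       NewPrecip1    = 0.0879464)

# Quasi-Poisson standard errors implied by rescaling the Poisson ones
se.quasi.implied <- sqrt(phi.hat) * se.poisson

round(cbind(se.quasi.implied, se.quasi.reported), 5)
```

With a dispersion estimate this large, every standard error grows by a factor of roughly 9.4, which is why most coefficients lose their statistical significance in the quasi-Poisson model.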

The statistical significance of the regression coefficients in the final model varies considerably. Only two of the coefficients are statistically significant at the 0.05 level, those of the dummy variables DayMonday & NewPrecip1, though a few other coefficients are just slightly beyond this threshold and may still have some practical significance.

For instance, while the coefficient estimate for DayWednesday has a p-value of about 0.066, above the conventional cutoff for statistical significance, the estimate itself suggests that, assuming the other variables remain constant, the log of the expected daily count of cyclists who use the Brooklyn Bridge should be about 0.26 higher on a Wednesday than on a Friday (the baseline level). To express this in terms of the count itself rather than its log, we exponentiate the coefficient estimate, subtract 1, and multiply by 100. So, according to the model, the increase in daily cyclist count from a Friday to a Wednesday with all other variables equal should be about \((e^{0.2633815} - 1) \times 100 \approx 30.13\%\), i.e., a 30.13% increase. While not technically statistically significant, this value is large enough that it may very well have some practical significance. The difference between Thursday and Friday is similar, but slightly smaller.

The most highly statistically significant predictor is NewPrecip. The dummy variable NewPrecip1, which indicates a day with a non-zero amount of precipitation, has a coefficient of -0.3632282. This suggests that the log of the expected cyclist count should be, on average, about 0.36 lower on a day with precipitation than on a day with no precipitation, assuming all other variables remain constant. Following the same formula as above, this corresponds to a change in the cyclist count itself of \((e^{-0.3632282} - 1) \times 100 \approx -30.46\%\), i.e., a 30.46% decrease. Interestingly, although it has a much smaller p-value, the magnitude of this change from the baseline level is roughly equal to that of the change from Friday to Wednesday, which as pointed out above is technically not statistically significant. The impact of precipitation therefore appears to be practically as well as statistically significant.
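The conversion from log scale to percent change used above can be applied to all of the dummy-variable coefficients at once. The following sketch does so with the estimates copied from the quasi-Poisson output; the names coefs and pct.change are assumptions made for this illustration.

```r
# Coefficient estimates copied from the quasi-Poisson output
coefs <- c(DayMonday    =  0.3155691,
           DaySaturday  =  0.0954727,
           DaySunday    =  0.2117931,
           DayThursday  =  0.2535264,
           DayTuesday   =  0.2044464,
           DayWednesday =  0.2633815,
           NewPrecip1   = -0.3632282)

# Percent change in the expected count relative to the baseline level
# (Friday for the day dummies, no precipitation for NewPrecip1)
pct.change <- (exp(coefs) - 1) * 100

round(pct.change, 2)
```

Each value is a change relative to the baseline with the other predictors held fixed, so the day effects and the precipitation effect combine multiplicatively rather than additively.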

Looking at the lone numerical variable in the model, the coefficient estimate for AvgTemp suggests that for every one-unit increase in daily average temperature, the log of the expected cyclist count should increase by about 0.01 on average, assuming all other variables remain equal. For a one-unit increase, this corresponds to an increase of about 1% in the count itself. Because the effect is multiplicative, larger changes compound: a rise in daily average temperature of 15 degrees should result in roughly a 16% increase in the daily cyclist count, with all other variables remaining unchanged. Though AvgTemp is not statistically significant at the 0.05 level, it will be retained in the final model for its potential practical significance.
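As a worked check of the temperature effect, the same exponentiation formula can be applied to a one-degree and to a 15-degree rise in AvgTemp, using the reported coefficient of 0.0100066. The 15-degree figure shows how the effect compounds on the count scale:

\[
(e^{0.0100066} - 1) \times 100 \approx 1.01\%, \qquad
(e^{15 \times 0.0100066} - 1) \times 100 \approx 16.2\%
\]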

The formula for the final working model can be written as:

log(E[BrooklynBridge]) = 7.031077 + 0.315569 * DayMonday + 0.095473 * DaySaturday + 0.211793 * DaySunday + 0.253526 * DayThursday + 0.204446 * DayTuesday + 0.263382 * DayWednesday + 0.010007 * AvgTemp - 0.363228 * NewPrecip1
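To illustrate how this formula turns predictor values into an expected count, the sketch below evaluates it by hand. The helper expected.count and the input values (a Friday with an average temperature of 75) are hypothetical and chosen only for illustration; with the fitted model object available, predict(quasi.poisson.model, newdata = ..., type = "response") would serve the same purpose.

```r
# Coefficients of the final model, copied from the formula above
beta <- c("(Intercept)" = 7.031077,
          DayMonday     = 0.315569,
          DaySaturday   = 0.095473,
          DaySunday     = 0.211793,
          DayThursday   = 0.253526,
          DayTuesday    = 0.204446,
          DayWednesday  = 0.263382,
          AvgTemp       = 0.010007,
          NewPrecip1    = -0.363228)

# Hypothetical helper: expected cyclist count for one day
expected.count <- function(day, avg.temp, precip) {
  # Friday is the baseline level, so it contributes no day term
  day.term <- if (day == "Friday") 0 else beta[[paste0("Day", day)]]
  eta <- beta[["(Intercept)"]] + day.term +
    beta[["AvgTemp"]] * avg.temp +
    beta[["NewPrecip1"]] * as.numeric(precip)
  exp(eta)  # invert the log link
}

dry.friday <- expected.count("Friday", avg.temp = 75, precip = FALSE)
wet.friday <- expected.count("Friday", avg.temp = 75, precip = TRUE)

c(dry = dry.friday, wet = wet.friday, ratio = wet.friday / dry.friday)
```

The ratio of the two predictions equals the exponentiated NewPrecip1 coefficient regardless of the day or temperature chosen, which is the multiplicative interpretation used throughout the analysis.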

4 Conclusion & Discussion

The model-derived inference that there is a statistically significant negative association between the daily number of cyclists using the Brooklyn Bridge and the occurrence of precipitation is a fairly intuitive one. This can also be said of the model’s suggestion of a weak positive association between the daily cyclist count and daily average temperature, although that association is not statistically significant at the 0.05 level.

However, when analyzing the regression coefficient estimates for the different days of the week, there is no immediately obvious rationalization for the pattern described by the model. For instance, the model predicts that higher cyclist counts should generally be observed on weekdays (assuming all other variables are equal), which could be explained by a higher number of people commuting to and from work via bike. However, Friday is an exception to this, as the model actually predicts this day to have the lowest daily cyclist counts. Furthermore, Sundays are actually predicted to have higher cyclist counts than Tuesdays, which are predicted to have considerably lower counts than Wednesdays.

These seemingly counter-intuitive inferences might indeed have rational explanations, but they may also be the result of constructing the regression model with a relatively small and specific data set (31 daily observations from a single month), in which most of the day-of-week coefficients are not statistically significant at the 0.05 level. Thus, regression modeling based on a larger data set and consultation with experts may be warranted if these patterns were to be investigated further.