bike.counts <- read.csv("C:\\Users\\eh738\\OneDrive\\Documents\\STA321\\BikeCounts.csv")
#removing commas from count variable values and converting to numeric
bike.counts$BrooklynBridge <- as.numeric(gsub(",", "", bike.counts$BrooklynBridge))
bike.counts$Total <- as.numeric(gsub(",", "", bike.counts$Total))
The data set used for this analysis is a subset of a larger pool of data collected by the Traffic Information Management System (TIMS), which keeps daily counts of the number of cyclists using different New York City bridges to enter and leave the city. This subset focuses on the Brooklyn Bridge during July 2017. Each observation corresponds to one day in that month and consists of the count of cyclists who used the bridge that day (BrooklynBridge), as well as some other variables, including:
Day: The day of the week (character)
HighTemp: The highest temperature recorded in the city that day, in Fahrenheit (numeric)
LowTemp: The lowest temperature recorded in the city that day, in Fahrenheit (numeric)
Precipitation: amount of precipitation that day (units unknown) (numeric)
Total: total count of all cyclists who used any of the four East River Bridges (Brooklyn, Manhattan, Williamsburg, or Queensboro) that day (numeric)
Note: The count variables (BrooklynBridge & Total) from the original data set were in the form of character strings and needed to be converted to numeric variables in order to construct regression models for analysis.
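The conversion described in this note can be illustrated on a few made-up strings. This is only a sketch: the values in `raw.counts` are hypothetical and are not taken from the data set, and the object names are chosen here for illustration.

```r
# Hypothetical raw values in the same "thousands separator" format as the CSV
raw.counts <- c("3,216", "2,807", "954")

# Direct conversion fails for any string containing a comma (result is NA)
direct <- suppressWarnings(as.numeric(raw.counts))

# Stripping the commas first gives the intended numbers
clean.counts <- as.numeric(gsub(",", "", raw.counts))

print(direct)
print(clean.counts)
```

Strings without a comma convert either way, which is why the problem only shows up on days with counts of 1,000 or more.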
The aim of this analysis is to investigate how these weather conditions or the day of the week might affect the number of cyclists who use the Brooklyn Bridge to travel in or out of the city on a given day, and to what extent. In addition, this investigation will extend to the daily Brooklyn Bridge cyclist count as a proportion of the total number of cyclists who use any of the four bridges in a given day.
To answer these questions, the data will be used to construct Poisson regression models with the daily cyclist count as the response variable and the day of the week, high & low temperatures, and precipitation amount as the predictors. These Poisson regression models rely on some underlying assumptions, namely that the response variable follows a Poisson distribution (in which the mean is equal to the variance), that all observations are independent, and that the log of the mean rate of occurrence is a linear function of the predictors. The first model will focus only on the daily count of cyclists who use the Brooklyn Bridge, while the second will include an offset term using the total daily count among all bridges, so that the response is analyzed as a proportion of that total. The estimated regression coefficients will be evaluated for statistical significance and interpreted in the context of the data set to assess each predictor variable’s impact on the daily cyclist count and rate.
As noted above, the first model focuses on the daily Brooklyn Bridge count alone and does not take into account the total daily cyclist count among all bridges.
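In symbols, the assumptions listed above can be sketched as follows. The notation is introduced here for illustration and does not appear in the original analysis: \(Y_i\) is the Brooklyn Bridge count on day \(i\), \(\mu_i\) is its mean, \(\mathbf{x}_i\) is the vector of predictors (the day-of-week indicators, HighTemp, LowTemp, and Precipitation), and \(\boldsymbol{\beta}\) is the vector of their coefficients.

\[
Y_i \sim \text{Poisson}(\mu_i), \qquad
\mathrm{E}(Y_i) = \mathrm{Var}(Y_i) = \mu_i, \qquad
\log(\mu_i) = \beta_0 + \mathbf{x}_i^\top \boldsymbol{\beta}
\]

Because the linear relationship is on the log scale, each coefficient acts multiplicatively on the mean count: a one-unit increase in a predictor multiplies \(\mu_i\) by \(\exp(\beta)\) for that predictor.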
model.counts <- glm(BrooklynBridge ~ Day + HighTemp + LowTemp + Precipitation, family = poisson(link = "log"), data = bike.counts)
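library(knitr)  # provides kable()
# extract the coefficient table (estimates, standard errors, z values, p-values)
pois.count.coef <- summary(model.counts)$coef
kable(pois.count.coef, caption = "Poisson Regression Model for Daily Brooklyn Bridge Cyclist Counts")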
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 7.7899944 | 0.0624330 | 124.773724 | 0.0000000 |
| DayMonday | 0.1823740 | 0.0144986 | 12.578719 | 0.0000000 |
| DaySaturday | -0.0301305 | 0.0147101 | -2.048285 | 0.0405321 |
| DaySunday | 0.0787037 | 0.0148751 | 5.290985 | 0.0000001 |
| DayThursday | 0.1697322 | 0.0151941 | 11.170906 | 0.0000000 |
| DayTuesday | 0.1836214 | 0.0151793 | 12.096852 | 0.0000000 |
| DayWednesday | 0.2334373 | 0.0148752 | 15.693077 | 0.0000000 |
| HighTemp | 0.0199422 | 0.0010254 | 19.448952 | 0.0000000 |
| LowTemp | -0.0223977 | 0.0014722 | -15.213681 | 0.0000000 |
| Precipitation | -0.4381650 | 0.0148996 | -29.407830 | 0.0000000 |
Every coefficient is statistically significant at the 0.05 level (DaySaturday only narrowly, with p ≈ 0.041; all others have p < 0.001), so all four predictor variables will be included in the final daily cyclist count model, which can be written as:
log(BrooklynBridge) = 7.7899944 + 0.1823740 * DayMonday - 0.0301305 * DaySaturday + 0.0787037 * DaySunday + 0.1697322 * DayThursday + 0.1836214 * DayTuesday + 0.2334373 * DayWednesday + 0.0199422 * HighTemp - 0.0223977 * LowTemp - 0.4381650 * Precipitation
According to the model, Friday (the baseline level for the Day variable) tends to be one of the least busy days of the week for cyclists on the Brooklyn Bridge; only Saturday has a negative regression coefficient, suggesting a lower count than the baseline level. Sunday, Monday, Tuesday, Wednesday, & Thursday all have positive regression coefficients, suggesting that, holding the other variables constant, there should be higher cyclist counts on these days than on Fridays or Saturdays. Each Day coefficient represents the mean increase (or decrease) in the log of the cyclist count relative to a Friday. So, for instance, assuming all other variables remain equal, the log of the daily cyclist count should be greater on a Monday than on a Friday by about 0.182.
For the numerical variables (i.e., HighTemp, LowTemp, & Precipitation), the regression coefficients represent the mean increase (or decrease) in the log of the daily cyclist count per one-unit increase in the corresponding predictor variable, holding all other predictor variables constant. For instance, the model suggests that a one-degree increase in the daily high temperature, assuming all other predictors remain unchanged, is associated with a mean increase of 0.020 in the log of the daily cyclist count. On the other hand, as the daily low temperature or precipitation amount increases, the log of the daily cyclist count should decrease, based on the negative sign of these variables’ regression coefficient estimates.
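Because these coefficients are on the log scale, exponentiating them gives multiplicative effects on the expected count, which can be easier to read. The sketch below hard-codes the estimates from the count model table rather than refitting the model. The names `count.coef`, `rate.ratios`, and `percent.change` are introduced here for illustration.

```r
# Coefficient estimates copied from the count model table above
count.coef <- c(
  DayMonday     =  0.1823740,
  DaySaturday   = -0.0301305,
  DaySunday     =  0.0787037,
  DayThursday   =  0.1697322,
  DayTuesday    =  0.1836214,
  DayWednesday  =  0.2334373,
  HighTemp      =  0.0199422,
  LowTemp       = -0.0223977,
  Precipitation = -0.4381650
)

# exp(coefficient) is the factor by which the expected count is multiplied
rate.ratios <- exp(count.coef)

# The same effect expressed as a percent change in the expected count
percent.change <- 100 * (rate.ratios - 1)

print(round(cbind(rate.ratio = rate.ratios, percent.change = percent.change), 3))
```

Read this way, the Monday coefficient corresponds to an expected count about 1.20 times (roughly 20% above) that of a Friday with the same weather, and a one-unit increase in Precipitation multiplies the expected count by about 0.645.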
By incorporating the Total variable (which represents the total daily cyclist count among all four East River Bridges) into the regression formula in the form of an offset term, the response variable can be transformed from the log of the daily cyclist count to the log of the daily proportion of all cyclists across all bridges who used the Brooklyn Bridge specifically. This can provide some insight into how the predictor variables affect the number of cyclists who use this particular bridge as opposed to the other three.
model.props <- glm(BrooklynBridge ~ Day + HighTemp + LowTemp + Precipitation, offset = log(Total), family = poisson(link = "log"), data = bike.counts)
# extract the coefficient table for the offset (proportions) model
pois.props.coef <- summary(model.props)$coef
kable(pois.props.coef, caption = "Poisson Regression Model for Daily Brooklyn Bridge Cyclist Proportions")
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -2.0922813 | 0.0627497 | -33.343311 | 0.0000000 |
| DayMonday | 0.0608475 | 0.0145808 | 4.173139 | 0.0000300 |
| DaySaturday | 0.1451904 | 0.0148051 | 9.806798 | 0.0000000 |
| DaySunday | 0.2353611 | 0.0150947 | 15.592325 | 0.0000000 |
| DayThursday | 0.0517490 | 0.0153925 | 3.361951 | 0.0007739 |
| DayTuesday | 0.0613867 | 0.0154622 | 3.970123 | 0.0000718 |
| DayWednesday | 0.0193336 | 0.0150105 | 1.288009 | 0.1977428 |
| HighTemp | 0.0073823 | 0.0010494 | 7.034602 | 0.0000000 |
| LowTemp | -0.0073422 | 0.0014634 | -5.017090 | 0.0000005 |
| Precipitation | -0.0374192 | 0.0136533 | -2.740664 | 0.0061315 |
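Once again, each of the four predictor variables was found to be statistically significant, although the individual DayWednesday coefficient is not (p ≈ 0.198), meaning the Wednesday proportion cannot be distinguished from the Friday baseline. All four variables will be included in the final model for the daily Brooklyn Bridge cyclist proportions, which is: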
log(BrooklynBridge) = -2.0922813 + 0.0608475 * DayMonday + 0.1451904 * DaySaturday + 0.2353611 * DaySunday + 0.0517490 * DayThursday + 0.0613867 * DayTuesday + 0.0193336 * DayWednesday + 0.0073823 * HighTemp - 0.0073422 * LowTemp - 0.0374192 * Precipitation + log(Total)
Alternatively, the offset term \(\log(\textbf{Total})\) at the end of the regression formula can be subtracted from both sides of the equation, removing it from the right-hand side and transforming the left-hand side to \(\log{(\frac{\textbf{BrooklynBridge}}{\textbf{Total}})}\).
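The rearrangement just described can be written out step by step. The notation is introduced here for illustration: \(\mu_i\) is the expected Brooklyn Bridge count on day \(i\), \(\mathbf{x}_i^\top \boldsymbol{\beta}\) stands for the intercept plus all of the Day, HighTemp, LowTemp, and Precipitation terms, and the offset enters with its coefficient fixed at 1.

\[
\log(\mu_i) = \mathbf{x}_i^\top \boldsymbol{\beta} + \log(\textbf{Total}_i)
\;\Longrightarrow\;
\log\left(\frac{\mu_i}{\textbf{Total}_i}\right) = \mathbf{x}_i^\top \boldsymbol{\beta}
\;\Longrightarrow\;
\frac{\mu_i}{\textbf{Total}_i} = \exp\left(\mathbf{x}_i^\top \boldsymbol{\beta}\right)
\]

The last form shows that the exponentiated linear predictor is the expected share of all East River bridge cyclists who use the Brooklyn Bridge.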
Interestingly, the regression coefficients of the daily proportions model mostly share their signs with those of the daily counts model, and by extension the direction of the association between each predictor and the response variable. The exception is DaySaturday, whose coefficient is negative in the counts model (-0.030) but positive in the proportions model (0.145). According to this model, Fridays should exhibit the lowest daily proportion of total cyclists using the Brooklyn Bridge among all days of the week, assuming all other conditions are equal (although the Wednesday proportion is not significantly different). This is broadly in agreement with the first model’s implication that Fridays tend to exhibit the second lowest daily counts of any day of the week. However, while the first model implied that Sundays also exhibit a relatively low daily count of Brooklyn Bridge cyclists, this model suggests that Sundays see the greatest daily proportion of total cyclists using the Brooklyn Bridge of any day of the week. Specifically, the model suggests the log of the daily proportion of total cyclists who use the Brooklyn Bridge should be greater on Sundays than on Fridays by about 0.235, assuming all other conditions are equal. The same pattern appears on Saturdays, though the difference from Fridays (about 0.145) is not as large.
As for the weather-related numerical variables, their respective effects on the daily proportion of total cyclists who use the Brooklyn Bridge have the same direction as their effects on the daily cyclist count, but are smaller in magnitude. In this case, the model suggests that a one-degree increase in the daily high temperature is associated with an increase of about 0.007 in the log of the proportion of total cyclists using the Brooklyn Bridge; that is, the proportion itself is multiplied by exp(0.0074) ≈ 1.007, a relative increase of roughly 0.7%. As with the daily count, this model implies that the proportion decreases as the daily low temperature or amount of precipitation increases.
The daily count model may have some use in the case where one wants to predict the total number of cyclists who will use the Brooklyn Bridge on a given day of the week in given weather conditions. However, some more interesting and potentially more valuable insights, regarding the relative cyclist usage of the Brooklyn Bridge versus the other three bridges, can be taken from the daily proportions model. For instance, this model suggests that the Brooklyn Bridge carries a larger share of all East River bridge cyclists on Saturdays & Sundays than on weekdays. On the other hand, it also suggests that more precipitation is associated with a smaller daily proportion of cyclists using this particular bridge. Investigation into these suggested phenomena may lead to improvements in the handling of the daily flow of cyclists or the ease of riding on the bridge in inclement weather.
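To see what these log-scale effects mean for the proportion itself, the fitted equation can be evaluated for a particular day. The sketch below hard-codes the relevant estimates from the proportions model table. The weather values (`high`, `low`, `precip`) are hypothetical, chosen only for illustration, and the object names are introduced here.

```r
# Coefficient estimates copied from the proportions model table above
prop.coef <- c(
  Intercept     = -2.0922813,
  DaySunday     =  0.2353611,
  HighTemp      =  0.0073823,
  LowTemp       = -0.0073422,
  Precipitation = -0.0374192
)

# Hypothetical weather for illustration only (not taken from the data set)
high   <- 85
low    <- 70
precip <- 0

# Linear predictor for a Friday (the baseline day, so no Day term is added)
eta.friday <- prop.coef[["Intercept"]] +
  prop.coef[["HighTemp"]] * high +
  prop.coef[["LowTemp"]] * low +
  prop.coef[["Precipitation"]] * precip

# A Sunday with the same weather differs only by the DaySunday coefficient
eta.sunday <- eta.friday + prop.coef[["DaySunday"]]

# Exponentiating gives the expected share of all bridge cyclists
share.friday <- exp(eta.friday)
share.sunday <- exp(eta.sunday)

print(round(c(Friday = share.friday, Sunday = share.sunday,
              ratio = share.sunday / share.friday), 3))
```

Under these assumed conditions the expected Brooklyn Bridge share is roughly 14% on a Friday and roughly 17% on a Sunday. The ratio between the two equals exp(0.2353611) and does not depend on the weather values chosen.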
That said, these models only provide a vague and not necessarily completely trustworthy notion of what’s really going on, leaving considerable room for improvement. In particular, they’re based on data from only one specific month in one specific year. An established trend in cyclists’ usage of the bridge in July may not be present in, say, January. Even a phenomenon that is demonstrably present in July of 2017 may no longer be evident in the same month a few years later. Including more observations that span a longer period of time can provide a more comprehensive picture of how the cyclist usage of the bridge changes from month to month and even year to year, leading to more useful and reliable inferences.