This data set was gathered by New York City’s Traffic Information Management System (TIMS), which monitors and records cyclists 24 hours a day. Each entry records the total number of bicyclists counted on a given day. It is a subset of a larger data set that captures monthly records of bike counts across New York City’s four East River bridges: Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. The data used here capture the number of bikes that cross the Williamsburg Bridge.
The daily counts are modeled with a Poisson distribution, and analyzing them helps us understand ridership patterns and predict counts based on day of the week, temperature, and precipitation, among other predictor variables.
The raw CSV file is posted on GitHub: https://raw.githubusercontent.com/JZhong01/STA321/main/Topic%205%20(Poisson)/Poisson_Data_Set.csv.
The primary objective of this analysis is to build two Poisson regression models that describe 1) how the predictors affect the bicyclist counts on the Williamsburg Bridge and 2) how the predictors affect the proportion of total East River bridge ridership that uses the Williamsburg Bridge.
Secondary objectives for this case study are as follows:
The hypotheses of the study are that:
Poisson regression models are particularly useful for modeling count data where the response variable represents the number of occurrences of an event. The response can be a count over a fixed time period or a rate for a given time span.
The Poisson regression model for counts is a statistical approach for analyzing count data, specifically when we are interested in the number of events occurring within a fixed period or area. This model facilitates the exploration of how various independent variables influence the rate at which events occur, with its parameters estimated using maximum likelihood estimation. By employing the natural logarithm as a link function, Poisson regression models the log of the expected event count as a linear combination of the independent variables, enabling the direct interpretation of the effect of predictors on the event rate.
The general regression model for counts follows this expression:
\(\log(\text{Response}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\)
The betas are the coefficients of the Poisson regression model. \(\beta_0\) is the intercept, which is not of direct interest in our analysis. \(\beta_i\) is the change in the log of the expected response for a one-unit increase in \(x_i\), with all other predictor variables held constant.

- When \(\beta_i = 0\): The predictor variable isn’t associated with the response variable.
- When \(\beta_i > 0\): The predictor variable is positively associated with the response. An increase in the predictor variable increases the expected number of occurrences in the response.
- When \(\beta_i < 0\): The predictor variable is negatively associated with the response. An increase in the predictor variable decreases the expected number of occurrences in the response.
In the context of Poisson regression, the exponential of the beta coefficient \(e^\beta\) can be interpreted as a relative risk. This means if you take the exponential of the beta coefficient for a predictor variable, you get a factor that tells you how much the risk (or rate) of the event occurring increases (if \(\beta > 0\)) or decreases (if \(\beta < 0\)) with a one-unit increase in that predictor variable.
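To see where this multiplicative reading comes from, write \(\mu(x_i)\) for the expected response at a given value of \(x_i\) with every other predictor held fixed (the symbol \(\mu\) is introduced here only for illustration). Taking the ratio of the expected response after and before a one-unit increase, everything except the \(\beta_i\) term cancels:

\[
\frac{\mu(x_i + 1)}{\mu(x_i)}
= \frac{e^{\beta_0 + \cdots + \beta_i (x_i + 1) + \cdots + \beta_p x_p}}{e^{\beta_0 + \cdots + \beta_i x_i + \cdots + \beta_p x_p}}
= e^{\beta_i}
\]

As a purely illustrative number, a coefficient of \(\beta_i = 0.05\) gives \(e^{0.05} \approx 1.051\), that is, roughly a 5.1% increase in the expected count per unit of \(x_i\).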
The Poisson regression model for rates is concerned not just with how many times an event occurs, but with how often it occurs relative to a measure of time or opportunity. We therefore adjust our model to include a term for exposure. This adjustment term is referred to as an offset; observations can share a common exposure or each have their own. The general regression model for rates follows this expression:
\(\log(\text{Response}/t) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\)
where t is the exposure for an observation and log(t) is the offset. Using properties of logarithms, we can add log(t) to both sides,
\(\log(\text{Response}) = \log(t) + \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\)
and then exponentiate both sides to get this equivalent expression:
\(\text{Response} = t \cdot e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p}\)
This demonstrates that the expected response is proportional to t. The beta coefficients are interpreted as in Poisson regression for counts, except that \(e^{\beta_i}\) now describes the multiplicative change in the rate (the expected count per unit of t).
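The offset idea can be checked on a tiny example. The sketch below uses invented numbers (the data frame `toy` and its columns `y`, `t`, and `x` are hypothetical and unrelated to the bike data) and base R's `glm()` with `offset(log(t))`. With a single binary predictor, the fitted rates should match the pooled events-per-exposure in each group, which makes the role of the offset easy to verify by hand.

```r
# Toy data, invented purely for illustration (NOT from the bike data set):
# y = event counts, t = exposure, x = a binary predictor.
toy <- data.frame(
  y = c(12, 30, 45, 80),
  t = c(100, 250, 150, 300),
  x = c(0, 0, 1, 1)
)

# Rate model: log(E[y] / t) = b0 + b1 * x, written with log(t) as an offset.
fit <- glm(y ~ x + offset(log(t)), family = poisson(link = "log"), data = toy)

# Pooled group rates: total events divided by total exposure in each group.
rate0 <- sum(toy$y[toy$x == 0]) / sum(toy$t[toy$x == 0])
rate1 <- sum(toy$y[toy$x == 1]) / sum(toy$t[toy$x == 1])

# exp(intercept) is the fitted rate when x = 0;
# exp(slope) is the ratio of the x = 1 rate to the x = 0 rate.
fitted_rate0 <- unname(exp(coef(fit)["(Intercept)"]))
fitted_rate_ratio <- unname(exp(coef(fit)["x"]))
```

In this setup `fitted_rate0` agrees with `rate0` and `fitted_rate_ratio` with `rate1 / rate0`, up to the numerical tolerance of the fitting algorithm.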
We start by creating a Poisson regression model for the bike counts on the Williamsburg Bridge.
In our Poisson model, we’re starting off with the full model including all variables barring Total count across all bridges and Date, which acts as an observation ID and is thus not a predictor. If any predictors are not statistically significant, we will perform variable selection and remove them.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 7.6536 | 0.0220 | 347.5441 | 0.0000 |
| DayMonday | 0.0546 | 0.0099 | 5.4862 | 0.0000 |
| DaySaturday | -0.2744 | 0.0102 | -26.9960 | 0.0000 |
| DaySunday | -0.2346 | 0.0097 | -24.0919 | 0.0000 |
| DayThursday | 0.0319 | 0.0103 | 3.0997 | 0.0019 |
| DayTuesday | 0.1988 | 0.0104 | 19.2033 | 0.0000 |
| DayWednesday | 0.0511 | 0.0101 | 5.0736 | 0.0000 |
| HighTemp | 0.0171 | 0.0006 | 29.0374 | 0.0000 |
| LowTemp | -0.0024 | 0.0008 | -3.0474 | 0.0023 |
| Precipitation | -1.0321 | 0.0166 | -62.1742 | 0.0000 |
Day of the week is a categorical variable with seven levels, so it enters the model as 6 dummy variables with Friday as the baseline. All p-values are below the \(\alpha = 0.05\) significance level, so every predictor is a statistically significant predictor of the Williamsburg bike count. As such, we will continue to use this full model.
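For readers who want to see how the z values and p-values in the table relate to the estimates, the sketch below recomputes them for two rows using the standard Wald formula. The inputs are the rounded values printed in the table, so the recomputed z values differ slightly from the tabulated ones, which were calculated from unrounded estimates.

```r
# Estimates and standard errors copied (rounded) from the coefficient table.
est <- c(DayThursday = 0.0319, LowTemp = -0.0024)
se  <- c(DayThursday = 0.0103, LowTemp = 0.0008)

# Wald statistic: estimate divided by its standard error.
z <- est / se

# Two-sided p-value from the standard normal distribution.
p <- 2 * pnorm(-abs(z))

# Compare against the 0.05 significance level used in the text.
significant <- p < 0.05
print(round(cbind(z, p), 4))
```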
This model is log-linear, so the coefficients represent the expected change in the log count of the response for a one-unit change in the predictor.
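Because log-scale coefficients are hard to read directly, it can help to exponentiate them. The sketch below takes the estimates printed in the count-model table and converts them to multiplicative effects and percent changes; only the variable names `beta`, `rate_ratio`, and `pct_change` are introduced here.

```r
# Estimates copied from the count-model coefficient table.
beta <- c(
  DayMonday = 0.0546, DaySaturday = -0.2744, DaySunday = -0.2346,
  DayThursday = 0.0319, DayTuesday = 0.1988, DayWednesday = 0.0511,
  HighTemp = 0.0171, LowTemp = -0.0024, Precipitation = -1.0321
)

# exp(beta) is the multiplicative change in the expected count for a
# one-unit increase in the predictor (or versus Friday for a day dummy).
rate_ratio <- exp(beta)

# The same information expressed as a percent change.
pct_change <- 100 * (rate_ratio - 1)

print(round(cbind(rate_ratio, pct_change), 3))
```

Read this way, Saturday corresponds to roughly 24% fewer expected crossings than Friday (\(e^{-0.2744} \approx 0.76\)), and each additional degree of HighTemp to roughly a 1.7% increase.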
Before the data could be analyzed, we needed to transform them into a usable format. The count of Williamsburg Bridge crossings is originally stored as a character variable containing commas. In order to fit a generalized linear model, we converted this response variable to numeric by removing the non-numeric characters and coercing the result to a number.
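A minimal sketch of this cleaning step in base R is shown below. The input strings are hypothetical examples of the comma-formatted values described in the text, not rows from the actual file, and the exact code used in the original analysis may differ.

```r
# Hypothetical raw values, formatted the way the text describes
# (character strings with thousands separators).
raw_counts <- c("4,115", "987", "6,234")

# Drop every character that is not a digit, then convert to numeric.
clean_counts <- as.numeric(gsub("[^0-9]", "", raw_counts))
```

The same expression can be applied to a whole data frame column once the CSV has been read in.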
The model was then fit as a Poisson regression using the ‘glm()’ function in R with a log link function (the default link for the Poisson family). The log link matters because the Poisson model assumes a linear relationship between the predictors and the log of the expected response; on the original scale, the mean of the response is therefore a nonlinear (exponential) function of the predictors. All predictor variables are included in the model except Total count, which is used in our next Poisson regression model.
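The call might look roughly like the sketch below. The predictor names `Day`, `HighTemp`, `LowTemp`, and `Precipitation` follow the coefficient table; the data frame name `bikes`, the response column name `Williamsburg`, and all simulated values are assumptions made only so the example runs without the real CSV.

```r
# Synthetic stand-in data (NOT the real bike counts), only to make the call runnable.
set.seed(1)
n <- 70
bikes <- data.frame(
  Day = factor(rep(c("Friday", "Monday", "Saturday", "Sunday",
                     "Thursday", "Tuesday", "Wednesday"), length.out = n)),
  HighTemp = runif(n, 40, 90),
  Precipitation = runif(n, 0, 1)
)
bikes$LowTemp <- bikes$HighTemp - runif(n, 5, 20)
bikes$Williamsburg <- rpois(
  n, lambda = exp(8 + 0.01 * bikes$HighTemp - bikes$Precipitation)
)

# Full count model: all predictors, log link (the Poisson default).
# Factor levels are alphabetical, so Friday becomes the baseline day.
count_fit <- glm(
  Williamsburg ~ Day + HighTemp + LowTemp + Precipitation,
  family = poisson(link = "log"),
  data = bikes
)

# Coefficient table with estimates, standard errors, z values, and p-values.
coef_table <- summary(count_fit)$coefficients
print(round(coef_table, 4))
```

With the real data in place of the simulated frame, `coef_table` has the same layout as the table reported for the count model.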
No variables were removed from the model because all predictor variables were statistically significant at the \(\alpha = 0.05\) significance level.
The goal of this analysis was to address one of the study hypotheses that the bike count on the Williamsburg Bridge is dependent on the predictor variables Day of the week, HighTemp, LowTemp, and Precipitation.
Our model indicates that Monday through Thursday experience higher bike traffic than Friday, while Saturday and Sunday see a reduction. This pattern is consistent with commuting and suggests that people who bike across the Williamsburg Bridge are mostly using it to commute.
Higher daytime temperatures are associated with increased bike traffic, perhaps due to more favorable cycling conditions: each additional degree of HighTemp multiplies the expected count by about \(e^{0.0171} \approx 1.017\). The LowTemp coefficient is small and negative, meaning that for a fixed HighTemp a higher daily low is associated with slightly fewer cyclists. Because daily high and low temperatures tend to move together, this partial effect should be read with caution, and it does not by itself show that colder weather deters cyclists. Overall, temperature influences how many people choose to commute by bike, and people are more likely to ride when conditions are favorable.
A strong negative relationship with precipitation indicates that wet weather significantly deters cyclists from biking across the Williamsburg Bridge: each additional unit of precipitation multiplies the expected count by about \(e^{-1.0321} \approx 0.36\). Bad weather tends to push would-be bikers toward an alternative form of transport.
Overall, our Poisson regression model provides valuable insights into the dynamics of bicycle traffic on the Williamsburg Bridge and highlights the impact of temporal and weather-related factors on urban cycling patterns.
We now create a Poisson regression model for bike rates to determine whether the predictor variables affect the proportion of bicyclists that use the Williamsburg Bridge out of all East River bridges.
To model Williamsburg Bridge crossings relative to total crossings, we define the rate as the count of cyclists on the Williamsburg Bridge divided by the aggregate count from all monitored East River bridges. This proportion normalizes the counts by the total opportunity for crossings. In the Poisson regression it is expressed by keeping the Williamsburg count as the response and including log(Total count) as an offset term, which is equivalent to modeling the log of the expected proportion and ensures that the model reflects the rate of crossings rather than mere counts.
Similar to our first Poisson model, we start with the full model including all predictor variables, with the response expressed as a rate. Total count and the Williamsburg count are therefore not used as predictors, and Date is excluded because it merely acts as an observation ID.
If any predictors are not statistically significant, we will perform variable selection and remove them.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.0682 | 0.0223 | -47.8899 | 0.0000 |
| DayMonday | 0.0004 | 0.0099 | 0.0390 | 0.9689 |
| DaySaturday | 0.0375 | 0.0101 | 3.7140 | 0.0002 |
| DaySunday | 0.0051 | 0.0097 | 0.5278 | 0.5976 |
| DayThursday | 0.0206 | 0.0103 | 1.9990 | 0.0456 |
| DayTuesday | 0.0138 | 0.0104 | 1.3288 | 0.1839 |
| DayWednesday | 0.0233 | 0.0101 | 2.3146 | 0.0206 |
| HighTemp | -0.0012 | 0.0006 | -2.0273 | 0.0426 |
| LowTemp | 0.0004 | 0.0008 | 0.4461 | 0.6555 |
| Precipitation | 0.0505 | 0.0161 | 3.1363 | 0.0017 |
Four variables aren’t statistically significant at the \(\alpha = 0.05\) significance level, and three of them are dummy variables for day of the week. Removing only those dummies would be unwise: Monday, Sunday, and Tuesday being insignificant only indicates that the proportion of bikers riding the Williamsburg Bridge on those days does not differ detectably from the baseline, Friday. We therefore keep all the dummy variables to maintain the integrity of the categorical predictor. We remove LowTemp from our model because it is a numerical predictor that isn’t significant.
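One way to assess the categorical predictor as a whole, rather than one dummy at a time, is a likelihood-ratio test that drops all six Day dummies together. The sketch below uses `drop1()` from base R on synthetic stand-in data. The data frame `bikes`, the column names `Williamsburg` and `Total`, and the simulated values are assumptions made only so the code runs, and this is not necessarily the procedure used in the original analysis.

```r
# Synthetic stand-in data (NOT the real bike counts), only to make the calls runnable.
set.seed(321)
n <- 84
bikes <- data.frame(
  Day = factor(rep(c("Friday", "Monday", "Saturday", "Sunday",
                     "Thursday", "Tuesday", "Wednesday"), length.out = n)),
  HighTemp = runif(n, 40, 90),
  Precipitation = runif(n, 0, 1)
)
bikes$LowTemp <- bikes$HighTemp - runif(n, 5, 20)
bikes$Total <- rpois(n, lambda = 15000)
bikes$Williamsburg <- rbinom(n, size = bikes$Total, prob = 0.34)

# Full rate model: Williamsburg count with log(Total) as the offset.
rate_fit <- glm(
  Williamsburg ~ Day + HighTemp + LowTemp + Precipitation + offset(log(Total)),
  family = poisson(link = "log"),
  data = bikes
)

# Likelihood-ratio test for dropping each term; Day is tested as one
# 6-degree-of-freedom block instead of six separate dummies.
term_tests <- drop1(rate_fit, test = "Chisq")
print(term_tests)
```

The `Day` row of `term_tests` then gives a single p-value for the whole categorical variable, which supports an all-or-nothing decision about keeping it.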
Therefore the final model for Poisson regression of rates is as follows:
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.0656 | 0.0215 | -49.5711 | 0.0000 |
| DayMonday | 0.0016 | 0.0096 | 0.1682 | 0.8665 |
| DaySaturday | 0.0381 | 0.0100 | 3.7975 | 0.0001 |
| DaySunday | 0.0045 | 0.0096 | 0.4655 | 0.6416 |
| DayThursday | 0.0213 | 0.0102 | 2.0970 | 0.0360 |
| DayTuesday | 0.0142 | 0.0104 | 1.3688 | 0.1711 |
| DayWednesday | 0.0240 | 0.0099 | 2.4165 | 0.0157 |
| HighTemp | -0.0010 | 0.0003 | -3.2512 | 0.0011 |
| Precipitation | 0.0527 | 0.0154 | 3.4221 | 0.0006 |
This model is log-linear, so the coefficients represent the expected change in the log rate (the log of the expected share of East River crossings that use the Williamsburg Bridge) for a one-unit change in the predictor.
Before the data could be analyzed, we needed to transform them into a usable format. The Total count of crossings for all East River bridges is originally stored as a character variable containing commas. In order to use it in a generalized linear model, we converted this variable to numeric by removing the non-numeric characters and coercing the result to a number.
The model was then fit as a Poisson regression using the ‘glm()’ function in R with a log link function (the default link for the Poisson family). The log link matters because the Poisson model assumes a linear relationship between the predictors and the log of the expected response; on the original scale, the mean of the response is therefore a nonlinear (exponential) function of the predictors.
To model rates and not just counts with Poisson regression, we include an offset of log(Total count) to account for the varying amounts of exposure across observations. We do this because a rate must be standardized by a measure of time, population, or area; here, the exposure is the total number of East River bridge crossings.
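As a small illustration of this offset, the sketch below fits an intercept-only rate model to three invented days of counts (the numbers and the names `toy`, `Williamsburg`, and `Total` are hypothetical). It shows the two equivalent ways base R's `glm()` accepts an offset, and that the exponentiated intercept is the pooled share of crossings.

```r
# Invented counts for three days (NOT from the real data set).
toy <- data.frame(
  Williamsburg = c(4000, 5200, 3100),
  Total = c(12000, 15000, 9500)
)

# Offset written inside the formula.
fit_formula <- glm(Williamsburg ~ 1 + offset(log(Total)),
                   family = poisson(link = "log"), data = toy)

# The same offset supplied through glm()'s `offset` argument.
fit_argument <- glm(Williamsburg ~ 1, offset = log(Total),
                    family = poisson(link = "log"), data = toy)

# exp(intercept) is the share of all crossings that use the bridge.
share_hat <- unname(exp(coef(fit_formula)["(Intercept)"]))
pooled_share <- sum(toy$Williamsburg) / sum(toy$Total)
```

By the same logic, the intercept of the final rate model, -1.0656, corresponds to a baseline share of about \(e^{-1.0656} \approx 0.34\) (Friday with the other predictors at zero, which is an extrapolation outside the likely range of temperatures).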
All predictor variables are initially included in the model; Total count is not a predictor and enters only through the offset. Only LowTemp was removed, because every other predictor was either statistically significant at the \(\alpha = 0.05\) significance level or a dummy variable for the categorical Day predictor; LowTemp was the only numerical predictor with a p-value larger than 0.05.
The goal of this analysis was to address one of the study hypotheses, that the bike rates on the Williamsburg Bridge are independent of the predictor variables Day of the week, HighTemp, LowTemp, and Precipitation.
The model shows some variation across days. With Friday as the baseline, Saturdays show a significantly higher rate, meaning that a larger share of East River bridge cyclists use the Williamsburg Bridge on Saturdays. The rates for Sunday, Monday, and Tuesday do not differ significantly from Friday, suggesting similar bridge choice on these days. Wednesday and Thursday exhibit a small but significant increase in rates, which may reflect mid-week behaviors or events influencing cycling traffic.
Interestingly, higher daytime temperatures are associated with a slight decrease in the rate (about 0.1% per degree, since \(e^{-0.0010} \approx 0.999\)), a finding that might seem counterintuitive. This could suggest a threshold beyond which higher temperatures become a deterrent to cyclists on this bridge, possibly due to extreme heat discomfort or the availability of alternative leisure activities during very warm weather.
Contrary to common assumptions, an increase in precipitation is associated with a higher rate of cyclists crossing the bridge. This may indicate a strong commitment to cycling among the regular commuters who use this bridge, regardless of weather conditions. It could also reflect other unmeasured variables that correlate with both increased precipitation and cycling rates, such as specific events or infrastructural factors.
The results in the previous sections weren’t intuitive, so we created a graphical comparison to look for patterns that the coefficient tables do not make visible.
The parallel nature of the lines indicates that the modeled effect of temperature and precipitation on bike rates is the same across days of the week. Regardless of the day, an increase or decrease in HighTemp or Precipitation is associated with the same proportional change in bike rates. The slopes of these lines are determined by the coefficients of HighTemp and Precipitation in our Poisson regression model.
In addition, the lines neither converge nor diverge because the model contains no interaction terms between the day of the week and HighTemp or Precipitation. The difference in bike rates between days, measured on the log scale, therefore remains constant regardless of changes in HighTemp or Precipitation.
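The constant gap between days can be verified directly from the final rate-model coefficients. The sketch below rebuilds the linear predictor from the tabulated estimates. The helper `log_rate()` and the temperature grid are introduced here for illustration only and are not part of the original analysis.

```r
# Estimates copied from the final rate-model coefficient table.
b0 <- -1.0656
b_day <- c(Friday = 0, Monday = 0.0016, Saturday = 0.0381, Sunday = 0.0045,
           Thursday = 0.0213, Tuesday = 0.0142, Wednesday = 0.0240)
b_high <- -0.0010
b_precip <- 0.0527

# Log of the predicted rate for a given day, temperature, and precipitation.
log_rate <- function(day, high_temp, precip = 0) {
  unname(b0 + b_day[day] + b_high * high_temp + b_precip * precip)
}

# Illustrative temperature grid (values chosen arbitrarily).
temps <- seq(40, 90, by = 10)

# Gap between the Saturday and Friday lines at each temperature.
gap <- log_rate("Saturday", temps) - log_rate("Friday", temps)
print(gap)
```

Here `gap` equals the Saturday coefficient (0.0381) at every temperature, which is what parallel lines on the log scale mean. On the rate scale the days differ by a constant ratio, \(e^{0.0381} \approx 1.039\), rather than a constant difference.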
Our analysis through two distinct Poisson regression models has yielded insightful findings into cycling patterns across the Williamsburg Bridge. The first model, which focused on the raw counts of bicyclists, found that Monday through Thursday, and Tuesday in particular, have higher bicycle counts than Friday, while Saturday and Sunday have lower counts. This aligns with the hypothesis that cycling is influenced by the day of the week, potentially reflecting commuting patterns.
The second model adjusted these counts to rates, considering the total traffic across all East River bridges. Here, Saturday emerged as a day with a significantly higher rate, meaning the Williamsburg Bridge carries a larger share of East River cyclists on Saturdays, potentially because of recreational riding. Interestingly, we observed that higher temperatures correlate with a slight decrease in the rate, hinting at a complex relationship between weather conditions and cycling behavior that might involve factors such as extreme heat or the availability of other leisure options.
An unexpected finding across the two models was that precipitation decreases the raw count of cyclists on the Williamsburg Bridge but increases the rate. Taken together, this means that ridership on the Williamsburg Bridge falls proportionally less in wet weather than ridership on the other East River bridges. This suggests that something besides environmental factors motivates biking on the Williamsburg Bridge specifically (e.g. the Williamsburg Bridge better shelters bikers from rain).
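A rough back-of-the-envelope reading of the two precipitation coefficients makes this concrete. It assumes, as a simplification not stated in the analysis itself, that the estimates from the two separately fitted models can be combined. Since the Williamsburg count equals the total count times the Williamsburg share, the implied log-scale effect on the total is the count coefficient minus the rate coefficient:

\[
e^{-1.0321} \approx 0.36, \qquad
e^{0.0527} \approx 1.05, \qquad
e^{-1.0321 - 0.0527} = e^{-1.0848} \approx 0.34
\]

Under that assumption, one additional unit of precipitation reduces expected Williamsburg crossings to about 36% of their dry-weather level and total East River crossings to about 34%, so the Williamsburg share rises by about 5% in relative terms even though fewer cyclists cross the bridge.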
These models together affirm the initial hypothesis that the number of bicyclists varies with the day of the week, temperature, and precipitation, while also providing new insights into the proportionate use of the Williamsburg Bridge relative to other bridges. This comprehensive analysis reinforces the utility of Poisson regression in understanding discrete count data and rates in urban transportation planning.