1 Data set Description

This data set was gathered by New York City’s Traffic Information Management System (TIMS), which monitors and records cyclists 24 hours a day. Each entry is the total number of bicyclists observed on a given day. The data are a subset of a larger data set that captures monthly records of bike counts across New York City’s four East River bridges: Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. Our analysis focuses on the number of bikes that cross the Williamsburg Bridge.

The daily counts are modeled with a Poisson distribution. Analyzing them helps us understand ridership patterns and predict counts from day of the week, temperature, and precipitation.

The raw CSV file is posted on GitHub: https://raw.githubusercontent.com/JZhong01/STA321/main/Topic%205%20(Poisson)/Poisson_Data_Set.csv.

  • Date: This is our ID variable and marks the date on which the observation was made. Note that each observation is a record of an entire day’s bicycle count.
  • Day: This is a categorical predictor variable that gives the day of the week of the observation.
  • HighTemp: This is a numerical predictor variable that gives the recorded high temperature on the observed day.
  • LowTemp: This is a numerical predictor variable that gives the recorded low temperature on the observed day.
  • Precipitation: This is a numerical predictor variable that tracks the amount of precipitation, measured in millimeters of height.
  • WilliamsburgBridge: This is our numerical response variable that tracks the bicyclist count across the Williamsburg Bridge for each observed day.
  • Total: This numerical variable tracks the total number of bicyclists across all four East River bridges. We will use it in tandem with the WilliamsburgBridge response variable to form a second response: the proportion of East River riders who took the Williamsburg Bridge.

2 Research Question

The primary objective of this analysis is to build 2 Poisson regression models that describe 1) how the predictors affect the bicyclist counts across the Williamsburg Bridge and 2) how the predictors affect the proportion of bicyclists that ride on the Williamsburg Bridge out of total East River bridge ridership.

Secondary objectives for this case study are as follows:

  • To identify daily and weekly patterns in bike traffic across the Williamsburg Bridge.
  • To determine the effect of temperature on volume of bike traffic.
  • To evaluate the role precipitation plays in bike traffic.
  • To assess the utility of a Poisson distribution in modeling discrete biking data.

The hypotheses of the study are that:

  • The number of bicyclists is dependent on the day of the week, temperature, and precipitation.
  • The proportion of bicyclists crossing the Williamsburg Bridge isn’t affected by our predictor variables.

3 Poisson Models

Poisson regression models are particularly useful for modeling count data where the response variable represents the number of occurrences of an event. The response can be a count over a fixed time period or a rate per unit of time or exposure.

3.1 Assumptions

  • Poisson Response: The assumption that the response variable is a count per unit of time or space, described by a Poisson distribution, is fundamental because the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space.
  • Independence: The occurrence of one event must not influence the occurrence of another. In practice, the model assumes no correlation between event occurrences; the assumption fails, for example, when one event is likely to trigger subsequent events. Violating this assumption can lead to an observed variance that is greater than the mean.
  • Mean is equal to the variance: In a Poisson distribution, the mean and variance are both equal to \(\lambda\). When actual data has a variance that is significantly different from the mean (overdispersion or underdispersion), the standard Poisson regression may not be the best fit. In such cases, alternative models like Negative Binomial regression can be used, which include an additional parameter to account for overdispersion.
  • Linearity: Poisson regression assumes that the natural log of the event rate, \(log(\lambda)\), is a linear function of the independent variables (x). Modeling on the log scale keeps predicted counts positive, and the Poisson variance function accommodates the fact that the variance of count data grows with its mean.
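The equal mean and variance assumption can be screened with a quick ratio of sample variance to sample mean. The sketch below is written in Python using only the standard library; the counts are made-up illustrative values, not rows from the bridge data, and `dispersion_ratio` is a hypothetical helper name. A ratio far above 1 hints at overdispersion, although a fuller check would use the residuals of the fitted model, since the assumption is conditional on the predictors.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Return the sample variance divided by the sample mean of the counts."""
    return variance(counts) / mean(counts)

# Hypothetical counts, for illustration only (not from the data set).
example_counts = [2, 4, 6]
ratio = dispersion_ratio(example_counts)
print(f"variance / mean = {ratio:.2f}")
```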

3.2 Poisson Regression Model for Counts

The Poisson regression model for counts is a statistical approach for analyzing count data, specifically when we are interested in the number of events occurring within a fixed period or area. This model facilitates the exploration of how various independent variables influence the rate at which events occur, with its parameters estimated using maximum likelihood estimation. By employing the natural logarithm as a link function, Poisson regression models the log of the expected event count as a linear combination of the independent variables, enabling the direct interpretation of the effect of predictors on the event rate.

The general regression model for counts follows this expression:

\(log(Response) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p\)

The betas are the coefficients of the Poisson regression model. \(\beta_0\) is the intercept, which is not of interest in our analysis. \(\beta_i\) is the change in the log mean of the response for a one-unit increase in \(x_i\), holding all other predictor variables constant.

  • When \(\beta = 0\): The predictor variable isn’t associated with the response variable.
  • When \(\beta > 0\): The predictor variable is positively associated with the response. This means that an increase in the predictor variable increases the expected number of occurrences in the response.
  • When \(\beta < 0\): The predictor variable is negatively associated with the response. This means that an increase in the predictor variable decreases the expected number of occurrences in the response.

In the context of Poisson regression, the exponential of the beta coefficient, \(e^\beta\), can be interpreted as a rate ratio (often loosely called a relative risk). It is the factor by which the expected rate of the event is multiplied for a one-unit increase in that predictor: the rate increases if \(\beta > 0\) and decreases if \(\beta < 0\).
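The multiplicative reading follows directly from the model equation. As a short worked derivation, write \(\mu(x_i)\) for the expected response at a given value of \(x_i\) with all other predictors held fixed (this symbol is introduced here only for illustration):

```latex
\begin{aligned}
\log \mu(x_i + 1) - \log \mu(x_i) &= \beta_i \\
\frac{\mu(x_i + 1)}{\mu(x_i)} &= e^{\beta_i}
\end{aligned}
```

A coefficient of 0 therefore gives a factor of 1 (no change), a positive coefficient gives a factor above 1, and a negative coefficient gives a factor below 1.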

3.3 Poisson Regression Model for Rates

The Poisson regression model for rates is concerned not just with how many times an event occurs, but with how often it occurs relative to a measure of time or opportunity. We therefore adjust the model to include a term for exposure. This adjustment term is referred to as an offset; observations can share a common exposure or each have their own. The general regression model for rates follows this expression:

\(log(Response/t) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p\)

where t is the exposure and log(t) is the offset, a known quantity for each observation whose coefficient is fixed at 1. Using properties of logarithms, we can add log(t) to both sides and exponentiate to get this equivalent expression:

\(Response = t * e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p}\)

This demonstrates that the expected response is proportional to t. The beta coefficients are interpreted as in Poisson regression for counts, except that they describe changes in the rate; the expected count is that rate multiplied by t.
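The algebra linking the two forms of the rate model can be written out step by step, using the same symbols as the expressions in this section:

```latex
\begin{aligned}
\log(Response / t) &= \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \\
\log(Response) - \log(t) &= \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \\
\log(Response) &= \log(t) + \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \\
Response &= t \cdot e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}
\end{aligned}
```

The third line is the form a software fit typically uses: an ordinary count model with \(\log(t)\) added to the linear predictor as a term with no coefficient to estimate.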

4 Modeling Williamsburg Counts

We start by building a Poisson regression model for the bike counts on the Williamsburg Bridge.

4.1 Variable Selection

In our Poisson model, we’re starting off with the full model including all variables barring Total count across all bridges and Date, which acts as an observation ID and is thus not a predictor. If any predictors are not statistically significant, we will perform variable selection and remove them.

Full Poisson regression model for the counts of Williamsburg Bridge bicyclists.
Term            Estimate  Std. Error   z value  Pr(>|z|)
(Intercept)       7.6536      0.0220  347.5441    0.0000
DayMonday         0.0546      0.0099    5.4862    0.0000
DaySaturday      -0.2744      0.0102  -26.9960    0.0000
DaySunday        -0.2346      0.0097  -24.0919    0.0000
DayThursday       0.0319      0.0103    3.0997    0.0019
DayTuesday        0.1988      0.0104   19.2033    0.0000
DayWednesday      0.0511      0.0101    5.0736    0.0000
HighTemp          0.0171      0.0006   29.0374    0.0000
LowTemp          -0.0024      0.0008   -3.0474    0.0023
Precipitation    -1.0321      0.0166  -62.1742    0.0000


Day of the week is a categorical variable, so it is encoded as 6 dummy variables with Friday as the baseline. All p-values are below the \(\alpha = 0.05\) significance level, so every term is a statistically significant predictor of the Williamsburg bike count. As such, we will continue to use this full model.
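The z values and p-values in the table come from a Wald test: each estimate is divided by its standard error and compared with a standard normal distribution. A minimal Python sketch of that calculation, using only the standard library (`wald_test` is a hypothetical helper name, and the inputs are the rounded DayThursday values from the table, so the result differs slightly from the printed z value):

```python
import math

def wald_test(estimate, std_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    # Two-sided tail area of the standard normal: P(|Z| > |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# DayThursday row of the count model (rounded table values).
z, p = wald_test(0.0319, 0.0103)
print(f"z = {z:.3f}, p = {p:.4f}")
```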

4.2 Regression Coefficient Interpretation

  • Intercept: \(\beta_0\) indicates the log count of bike crossings on the Williamsburg Bridge when all other variables in the model are 0; this means that this is the log count on a Friday assuming other predictor variables are 0. However, we’re not interested in this value.
  • Day: Day was split into 6 dummy variables for this model. Friday was chosen as the baseline day of the week. In our model, Saturday and Sunday have negative coefficients; this suggests that compared to Friday, these days have a lower log count of bikes on the Williamsburg Bridge. Employing similar logic, weekdays Monday through Thursday all have positive coefficients, which means that these days all have higher log count of bikes compared to Friday. Note this is assuming that all other predictors remain constant and the only variation is the day of the week.
  • High Temp: High Temp has a positive coefficient, which means that a higher day-high temperature has a positive association with bike count on the Williamsburg Bridge. A one-unit increase in High Temp raises the log count by 0.0171, which multiplies the expected count by \(e^{0.0171} \approx 1.017\) (about a 1.7% increase).
  • Low Temp: Low Temp has a negative coefficient, which means that a higher day-low temperature has a negative association with bike count. A one-unit increase in Low Temp lowers the log count by 0.0024, which multiplies the expected count by \(e^{-0.0024} \approx 0.998\) (about a 0.2% decrease).
  • Precipitation: Precipitation has a relatively large negative coefficient, which means that there is a large negative association between precipitation and bike count on the Williamsburg Bridge. A one-unit increase in precipitation lowers the log count by 1.0321, which multiplies the expected count by \(e^{-1.0321} \approx 0.356\) (roughly a 64% decrease).

This model is log-linear, so the coefficients represent the expected change in the log count of the response for a one-unit change in the predictor.
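Because the coefficients live on the log scale, exponentiating them gives multiplicative effects that are often easier to read. The Python sketch below converts the estimates from the full count model table in Section 4.1 into rate ratios and percent changes; `percent_change` is a hypothetical helper name, and the values are the rounded table entries.

```python
import math

# Estimates copied from the full count model table (Friday is the baseline day).
coefficients = {
    "DayMonday": 0.0546,
    "DaySaturday": -0.2744,
    "DaySunday": -0.2346,
    "DayThursday": 0.0319,
    "DayTuesday": 0.1988,
    "DayWednesday": 0.0511,
    "HighTemp": 0.0171,
    "LowTemp": -0.0024,
    "Precipitation": -1.0321,
}

def percent_change(beta):
    """Percent change in the expected count for a one-unit increase."""
    return (math.exp(beta) - 1.0) * 100.0

for name, beta in coefficients.items():
    print(f"{name:14s} ratio = {math.exp(beta):.4f} ({percent_change(beta):+.1f}%)")
```

For the day dummies, the "one-unit increase" is the switch from Friday to the named day.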

4.3 Model Construction

Before the data could be analyzed, we needed to transform them into a usable format. The count for Williamsburg Bridge crossings is originally a character variable containing commas. To fit a generalized linear model, we converted this response variable to a number by removing the non-numeric characters.
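The cleaning step can be sketched as follows. This is a Python illustration rather than the R code used for the analysis; `parse_count` is a hypothetical helper, and the sample strings are invented rather than taken from the data set.

```python
def parse_count(raw):
    """Convert a count stored as text, such as '6,247', into an integer.

    Keeps only the ASCII digits, so thousands separators and stray spaces
    are dropped. Intended for non-negative whole-number counts only.
    """
    cleaned = "".join(ch for ch in raw if ch in "0123456789")
    if not cleaned:
        raise ValueError(f"no digits found in {raw!r}")
    return int(cleaned)

# Invented example values, for illustration only.
print(parse_count("6,247"))
print(parse_count(" 512 "))
```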

The model was then fit using the ‘glm()’ function in R with the Poisson family and a log link function. The log link matters because the Poisson model assumes a linear relationship between the predictors and log(Response); equivalently, the mean of the response is an exponential, and therefore nonlinear, function of the predictors. All predictor variables are included in the model except Total count, which is used in our next Poisson regression model.
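To make the fitting step less of a black box, here is a simplified Python stand-in for what a Poisson fit with a log link does. It handles a single predictor only and uses Newton-Raphson on the Poisson log-likelihood; it is a sketch under those assumptions, not the R code behind the tables. The function name `fit_poisson` and the data are made up, with the responses generated exactly from known coefficients so the fit can be checked.

```python
import math

def fit_poisson(x, y, max_iter=50, tol=1e-10):
    """Fit log(mean) = b0 + b1 * x by Newton-Raphson; return (b0, b1)."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the Poisson log-likelihood.
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix (2 x 2, symmetric).
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: solve information * step = score.
        step0 = (i11 * u0 - i01 * u1) / det
        step1 = (i00 * u1 - i01 * u0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

# Made-up data generated exactly from b0 = 1.0 and b1 = 0.3.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [math.exp(1.0 + 0.3 * xi) for xi in x]
b0, b1 = fit_poisson(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```

With several predictors the same idea applies, but the 2 x 2 solve becomes a general linear system, which is why a library routine is normally used.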

No variables were removed from the model because all predictor variables were statistically significant at the \(\alpha = 0.05\) significance level.

4.4 Motivation

The goal of this analysis was to address the first study hypothesis: that the bike count on the Williamsburg Bridge depends on the predictor variables Day of the week, HighTemp, LowTemp, and Precipitation.

4.5 Findings and Interpretations

Our model indicates that Monday through Thursday experience higher bike traffic than Friday, while Saturday and Sunday see a reduction. This pattern is consistent with weekday commuting and suggests that most people who bike across the Williamsburg Bridge use it to commute.

Higher daytime temperatures correlate with increased bike traffic, most likely because warmer days offer more favorable cycling conditions. The LowTemp coefficient is negative but small: with the daily high held constant, a higher daily low (a warmer night) is associated with slightly fewer crossings. Because daily highs and lows tend to move together, this is best read as a minor adjustment to the dominant HighTemp effect rather than as evidence that colder weather encourages cycling. Overall, temperature influences how many people choose to commute by bike, and people are more likely to ride when weather conditions are favorable.

A strong negative relationship with precipitation indicates that worse weather conditions significantly deter cyclists from choosing to bike across the Williamsburg Bridge. Negative weather events tend to cause would-be bikers to choose an alternative form of transport.

Overall, our Poisson regression model provides valuable insights into the dynamics of bicycle traffic on the Williamsburg Bridge and highlights the impact of temporal and weather-related factors on urban cycling patterns.

5 Modeling Williamsburg Rates

We now build a Poisson regression model for bike rates to determine whether the predictor variables affect the proportion of bicyclists that use the Williamsburg Bridge out of all East River bridges.

To model the rate of Williamsburg Bridge crossings in the context of total crossings, we work with the proportion of crossings on this bridge relative to the total number of East River bridge crossings, that is, the count of cyclists on the Williamsburg Bridge divided by the aggregate count from all monitored bridges. In the Poisson regression this proportion is modeled through an offset: the Williamsburg count remains the response and the log of the total count is included as an offset term, which ensures that the model reflects the rate of crossings rather than mere counts.

5.1 Variable Selection

Similar to our first Poisson model, we start with the full model including all predictor variables, now with the rate as the response. Total count and the Williamsburg count are therefore not used as predictors, and Date is excluded because it merely acts as an observation ID.

If any predictors are not statistically significant, we will perform variable selection and remove them.

Full Poisson regression model for the rate of Williamsburg Bridge bicyclists.
Term            Estimate  Std. Error   z value  Pr(>|z|)
(Intercept)      -1.0682      0.0223  -47.8899    0.0000
DayMonday         0.0004      0.0099    0.0390    0.9689
DaySaturday       0.0375      0.0101    3.7140    0.0002
DaySunday         0.0051      0.0097    0.5278    0.5976
DayThursday       0.0206      0.0103    1.9990    0.0456
DayTuesday        0.0138      0.0104    1.3288    0.1839
DayWednesday      0.0233      0.0101    2.3146    0.0206
HighTemp         -0.0012      0.0006   -2.0273    0.0426
LowTemp           0.0004      0.0008    0.4461    0.6555
Precipitation     0.0505      0.0161    3.1363    0.0017


Four terms aren’t statistically significant at the \(\alpha = 0.05\) significance level: DayMonday, DaySunday, DayTuesday, and LowTemp. Three of these are dummy variables for day of the week. Removing them individually would be unwise: their insignificance only indicates that the proportion of bikers riding the Williamsburg Bridge on those days does not differ detectably from the Friday baseline. We therefore keep all of the day dummy variables to maintain the integrity of the categorical predictor variable, and remove only LowTemp, the one insignificant numerical predictor.

Therefore the final model for Poisson regression of rates is as follows:

Reduced Poisson regression model for the rate of Williamsburg Bridge bicyclists.
Term            Estimate  Std. Error   z value  Pr(>|z|)
(Intercept)      -1.0656      0.0215  -49.5711    0.0000
DayMonday         0.0016      0.0096    0.1682    0.8665
DaySaturday       0.0381      0.0100    3.7975    0.0001
DaySunday         0.0045      0.0096    0.4655    0.6416
DayThursday       0.0213      0.0102    2.0970    0.0360
DayTuesday        0.0142      0.0104    1.3688    0.1711
DayWednesday      0.0240      0.0099    2.4165    0.0157
HighTemp         -0.0010      0.0003   -3.2512    0.0011
Precipitation     0.0527      0.0154    3.4221    0.0006

5.2 Regression Coefficient Interpretation

  • Intercept: \(\beta_0\) is the log of the proportion of East River crossings that use the Williamsburg Bridge when all other variables in the model are 0; that is, the log rate on a Friday assuming the other predictor variables are 0. However, we’re not interested in this value.
  • Day: Day was split into 6 dummy variables for this model. Friday was chosen as the baseline day of the week. In our model Monday, Sunday, and Tuesday aren’t statistically significant; this implies that those days of the week are not significantly different in rate from Friday. The remaining days of the week - Saturday, Wednesday, and Thursday - are all statistically significant with positive coefficients: This means that the rates of Williamsburg Bridge usage on those days are higher than on Friday. Note this is assuming that all other predictors remain constant and the only variation is the day of the week.
  • High Temp: High Temp has a negative coefficient, which means that a higher day-high temperature has a negative association with the bike rate of the Williamsburg Bridge. A one-unit increase in High Temp multiplies the rate of Williamsburg Bridge usage relative to total East River bridge crossings by \(e^{-0.0010} \approx 0.999\), a decrease of about 0.1%.
  • Precipitation: Precipitation has a positive coefficient, which means that there is a positive association between precipitation and the bike rate of the Williamsburg Bridge. A one-unit increase in precipitation multiplies the rate of bicyclists using the Williamsburg Bridge by \(e^{0.0527} \approx 1.054\), an increase of about 5.4%.

This model is log-linear, so the coefficients represent the expected change in the log rate of the response for a one-unit change in the predictor.

5.3 Model Construction

Before the data could be analyzed, we needed to transform them into a usable format. The Total count of crossings for all East River bridges is originally a character variable containing commas. To use it as the offset in a generalized linear model, we converted it to a number by removing the non-numeric characters.

The model was then fit using the ‘glm()’ function in R with the Poisson family and a log link function. The log link matters because the Poisson model assumes a linear relationship between the predictors and log(Response); equivalently, the mean of the response is an exponential, and therefore nonlinear, function of the predictors.

To model rates rather than just counts with Poisson regression, we include an offset of log(Total count), which accounts for the varying exposure across observations. This is needed because a rate must be standardized by a measure of time, population, or area; here the measure is the total number of East River crossings on that day.
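The role of the offset can be illustrated with the reduced rate model from Section 5.1. In the Python sketch below, the coefficients are the rounded table values, while the function names and the example inputs (a Saturday, a high of 70, no precipitation, 20,000 total crossings) are hypothetical and chosen only for illustration.

```python
import math

# Estimates copied from the reduced rate model table (Friday is the baseline).
INTERCEPT = -1.0656
DAY_EFFECT = {
    "Friday": 0.0, "Monday": 0.0016, "Saturday": 0.0381, "Sunday": 0.0045,
    "Thursday": 0.0213, "Tuesday": 0.0142, "Wednesday": 0.0240,
}
HIGH_TEMP = -0.0010
PRECIPITATION = 0.0527

def linear_predictor(day, high_temp, precipitation):
    """Log of the predicted Williamsburg share of East River crossings."""
    return (INTERCEPT + DAY_EFFECT[day]
            + HIGH_TEMP * high_temp + PRECIPITATION * precipitation)

def predicted_share(day, high_temp, precipitation):
    return math.exp(linear_predictor(day, high_temp, precipitation))

def predicted_count(total, day, high_temp, precipitation):
    # The offset log(total) is added to the linear predictor with coefficient 1.
    return math.exp(math.log(total) + linear_predictor(day, high_temp, precipitation))

share = predicted_share("Saturday", 70, 0.0)
count = predicted_count(20000, "Saturday", 70, 0.0)
print(f"share = {share:.3f}, expected count = {count:.0f}")
```

Adding the offset on the log scale is the same as multiplying the predicted share by the total, which is why the fitted coefficients describe the share rather than the raw count.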

All predictor variables are included in the model; Total count enters only through the offset. Only LowTemp was removed from the model: every other term was either statistically significant at the \(\alpha = 0.05\) significance level or a dummy variable for the categorical Day predictor, whereas LowTemp was a numerical predictor with a p-value larger than 0.05.

5.4 Motivation

The goal of this analysis was to address the second study hypothesis: that the bike rate on the Williamsburg Bridge is independent of the predictor variables Day of the week, HighTemp, LowTemp, and Precipitation.

5.5 Findings and Interpretations

The model shows some variability across days. With Friday as the baseline, Saturday shows a significantly higher rate, meaning the Williamsburg Bridge carries a larger share of East River crossings on that day. The rates for Sunday, Monday, and Tuesday do not differ significantly from Friday, suggesting a similar split of riders across bridges on these days. Wednesday and Thursday exhibit a small but significant increase in rates, which may reflect mid-week behaviors or events influencing cycling traffic.

Interestingly, higher daytime temperatures are associated with a slight decrease in the rate of Williamsburg Bridge crossings, a finding that might seem counterintuitive. Because the rate is a share of all East River crossings, one reading is that ridership on the other bridges grows slightly faster than on the Williamsburg Bridge as temperatures rise, possibly because of extreme heat discomfort on this route or the availability of alternative leisure activities during very warm weather.

Contrary to common assumptions, an increase in precipitation is associated with a higher rate of cyclists crossing the Williamsburg Bridge. This may indicate a strong commitment to cycling among its regular commuters, regardless of weather conditions. It could also reflect unmeasured variables that correlate with both precipitation and cycling rates, such as specific events or infrastructural factors.

6 Graphical Comparison

The coefficient tables in the previous sections are not easy to read at a glance, so we create a graphical comparison to look for visible patterns.

The parallel nature of the lines indicates that the effect of temperature and precipitation on bike rates is consistent across different days of the week. This means that regardless of the day, an increase or decrease in HighTemp or Precipitation is associated with a uniform change in bike rates. The slopes of these lines are determined by the coefficients of HighTemp and Precipitation in our Poisson regression model.

In addition, the lines neither converge nor diverge because the model contains no interaction terms between day of the week and HighTemp or Precipitation. On the log scale, the difference in bike rates by day of the week therefore remains constant regardless of changes in HighTemp or Precipitation; the plot illustrates this feature of the model rather than testing it.
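The constant gap between days can be checked numerically from the reduced rate model. The Python sketch below uses the rounded Saturday and HighTemp estimates from Section 5.1; the function name and the temperature grid are hypothetical, chosen only for illustration.

```python
# Estimates copied from the reduced rate model table (Friday is the baseline).
INTERCEPT = -1.0656
SATURDAY = 0.0381
HIGH_TEMP = -0.0010

def log_rate(is_saturday, high_temp):
    """Predicted log rate with precipitation fixed at 0."""
    day_effect = SATURDAY if is_saturday else 0.0
    return INTERCEPT + day_effect + HIGH_TEMP * high_temp

temps = [50, 60, 70, 80, 90]  # illustrative temperature grid
gaps = [log_rate(True, t) - log_rate(False, t) for t in temps]
for t, gap in zip(temps, gaps):
    print(f"HighTemp {t}: Saturday minus Friday log rate = {gap:.4f}")
```

Every gap equals the Saturday coefficient, so on the log scale the two lines stay the same distance apart at every temperature.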

7 Summary

Our analysis through two distinct Poisson regression models has yielded insightful findings into cycling patterns across the Williamsburg Bridge. The first model, which focused on the raw counts of bicyclists, identified that Monday through Thursday, particularly Tuesday, have higher bicycle counts than Friday, while Saturday and Sunday have lower counts. This aligns with the hypothesis that cycling is influenced by the day of the week, potentially reflecting commuting patterns.

The second model adjusted these counts to rates, considering the total traffic across all East River bridges. Here, Saturday emerged as a day on which the Williamsburg Bridge carries a significantly higher share of East River crossings, even though raw counts are lower on weekends, possibly reflecting recreational riding patterns. Interestingly, we observed that higher temperatures correlate with a slight decrease in cycling rates, hinting at a complex relationship between weather conditions and cycling behavior that might involve factors such as extreme heat or the availability of other leisure options.

An unexpected finding across the two models was that precipitation decreases the raw count of cyclists on the Williamsburg Bridge but increases the rate. This suggests that something besides environmental factors motivates biking on the Williamsburg Bridge specifically (e.g. the Williamsburg Bridge may shelter bikers from rain better than the other bridges).

These models together affirm the initial hypothesis that the number of bicyclists varies with the day of the week, temperature, and precipitation, while also providing new insights into the proportionate use of the Williamsburg Bridge relative to other bridges. This comprehensive analysis reinforces the utility of Poisson regression in understanding discrete count data and rates in urban transportation planning.
