This data set was gathered by New York City’s Traffic Information Management System (TIMS), which monitors and records cyclists 24 hours a day. Each entry records the total number of bicyclists on a given day. It is a subset of a larger data set that captures monthly records of bike counts across New York City’s four East River bridges: Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. Our data captures the number of bikes that cross the Williamsburg Bridge.
The daily counts can be modeled with a Poisson distribution, and analyzing them helps us understand ridership patterns and predict counts based on day of the week, temperature, and precipitation, among other predictor variables.
The link to the raw CSV file is posted on GitHub: https://raw.githubusercontent.com/JZhong01/STA321/main/Topic%205%20(Poisson)/Poisson_Data_Set.csv.
To get an idea of what our data set looks like, here are the first 6 observations in the data set.
| Date | Day | HighTemp | LowTemp | Precipitation | WilliamsburgBridge | Total |
|---|---|---|---|---|---|---|
| 4/1 | Saturday | 46 | 37 | 0 | 1,915 | 5,397 |
| 4/2 | Sunday | 62.1 | 41 | 0 | 4,207 | 13,033 |
| 4/3 | Monday | 63 | 50 | 0.03 | 5,178 | 16,325 |
| 4/4 | Tuesday | 51.1 | 46 | 1.18 | 2,279 | 6,581 |
| 4/5 | Wednesday | 63 | 46 | 0 | 5,711 | 17,991 |
| 4/6 | Thursday | 48.9 | 41 | 0.73 | 1,739 | 4,896 |
The primary objective of this analysis is to build two Poisson regression models and one quasi-Poisson model that describe 1) how the predictors affect the bicyclist counts across the Williamsburg Bridge and 2) how the predictors affect the proportion of bicyclists that ride on the Williamsburg Bridge out of total East River bridge ridership.
Secondary objectives for this case study are as follows:
The hypotheses of the study are that:
Poisson regression models are particularly useful for modeling count data, where the response variable represents the number of occurrences of an event. The response can be expressed either as a count over a fixed time period or as a rate per unit of time or exposure.
The Poisson regression model for counts is a statistical approach for analyzing count data, specifically when we are interested in the number of events occurring within a fixed period or area. This model facilitates the exploration of how various independent variables influence the rate at which events occur, with its parameters estimated using maximum likelihood estimation. By employing the natural logarithm as a link function, Poisson regression models the log of the expected event count as a linear combination of the independent variables, enabling the direct interpretation of the effect of predictors on the event rate.
The general regression model for counts follows this expression:
\(log(Response) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p\)
The betas are the coefficients of the Poisson regression model. \(\beta_0\) represents the intercept of the function, which is not of interest in our analysis. \(\beta_i\) represents the change in the log mean for a one-unit change in \(x_i\) when all other predictor variables are held constant.
- When \(\beta = 0\): The predictor variable isn’t associated with the response variable.
- When \(\beta > 0\): The predictor variable is positively associated with the response. This means that an increase in the predictor variable increases the expected number of occurrences in the response.
- When \(\beta < 0\): The predictor variable is negatively associated with the response. This means that an increase in the predictor variable decreases the expected number of occurrences in the response.
In the context of Poisson regression, the exponential of the beta coefficient \(e^\beta\) can be interpreted as a rate ratio, which is often read as a relative risk. This means if you take the exponential of the beta coefficient for a predictor variable, you get the factor by which the rate of the event is multiplied for a one-unit increase in that predictor variable: the rate increases if \(\beta > 0\) (factor above 1) and decreases if \(\beta < 0\) (factor below 1).
The Poisson regression model for rates is concerned not just with how many times an event occurs, but with how often it occurs relative to a measure of time or opportunity. We therefore have to adjust our model to include a term for exposure. This adjustment term is referred to as an offset, and observations can share a common offset or each have their own. The general regression model for rates follows this expression:
\(log(Response/t) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p\)
where t is the exposure and log(t) is the offset, a known value for each observation rather than an estimated parameter. Using properties of logarithms, we can add log(t) to both sides and then exponentiate both sides to get this equivalent expression:
\(Response = t * e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p}\)
This demonstrates that the expected response is proportional to t. The beta coefficients are therefore interpreted as in Poisson regression for counts, except that they describe changes in the rate per unit of exposure rather than in the raw count.
The Quasi-Poisson regression model is an extension of the traditional Poisson regression, tailored for count data that exhibit overdispersion, where the variance is greater than the mean. Similar to the Poisson model, it predicts the log of the expected count of events, but it relaxes the strict equality between the mean and the variance by introducing a dispersion parameter. This parameter scales the variance independently of the mean, allowing for more flexibility and providing a better fit for data that do not conform to the Poisson assumption of equal mean and variance. The general form of the model remains consistent with the Poisson regression, represented as:
\(log(Response) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p\)
In this framework, the \(\beta\) coefficients estimate the change in the log of the expected counts, with the interpretation of these coefficients remaining analogous to the standard Poisson model. However, the standard errors of the estimated coefficients are adjusted to reflect the additional variability inherent in the data, making inference more reliable for overdispersed data sets. The Quasi-Poisson model, therefore, provides a robust alternative for analysts dealing with count data where the Poisson distribution’s assumptions do not hold.
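Written out, the relaxed variance assumption and its effect on the standard errors take the following standard form. The symbols \(\mu_i\) (expected count for observation \(i\)) and \(\phi\) (dispersion parameter) are introduced here for illustration and do not come from the original output:
\(Var(Response_i) = \phi \mu_i\)
\(SE_{quasi}(\hat{\beta}_j) = \sqrt{\hat{\phi}} \cdot SE_{Poisson}(\hat{\beta}_j)\)
Setting \(\phi = 1\) recovers the ordinary Poisson model. For the same set of predictors, the coefficient estimates are unchanged by the adjustment; only the standard errors, test statistics, and p-values change.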
We will start by creating a Poisson regression model for the bike counts on the Williamsburg Bridge.
In our Poisson model, we’re starting off with the full model including all variables barring Total count across all bridges and Date, which acts as an observation ID and is thus not a predictor. If any predictors are not statistically significant, we will perform variable selection and remove them.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 7.6536 | 0.0220 | 347.5441 | 0.0000 |
| DayMonday | 0.0546 | 0.0099 | 5.4862 | 0.0000 |
| DaySaturday | -0.2744 | 0.0102 | -26.9960 | 0.0000 |
| DaySunday | -0.2346 | 0.0097 | -24.0919 | 0.0000 |
| DayThursday | 0.0319 | 0.0103 | 3.0997 | 0.0019 |
| DayTuesday | 0.1988 | 0.0104 | 19.2033 | 0.0000 |
| DayWednesday | 0.0511 | 0.0101 | 5.0736 | 0.0000 |
| HighTemp | 0.0171 | 0.0006 | 29.0374 | 0.0000 |
| LowTemp | -0.0024 | 0.0008 | -3.0474 | 0.0023 |
| Precipitation | -1.0321 | 0.0166 | -62.1742 | 0.0000 |
Day of the week is a categorical variable, so it is split into 6 dummy variables with Friday as the baseline. All p-values are below the \(\alpha = 0.05\) significance level, so every predictor is a statistically significant predictor of the Williamsburg bike count. As such, we will continue to use this full model.
This model is log-linear, so the coefficients represent the expected change in the log count of the response for a one-unit change in the predictor.
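Because the coefficients are on the log scale, exponentiating them gives multiplicative effects that are easier to read. The sketch below does this for a few coefficients copied from the table above; the helper name `percent_change` is ours and is not part of the original analysis, which was carried out in R.

```python
import math

# Coefficients copied from the count-model table (log scale, Friday baseline).
coefficients = {
    "DaySaturday": -0.2744,
    "DayTuesday": 0.1988,
    "HighTemp": 0.0171,
    "Precipitation": -1.0321,
}


def percent_change(beta):
    """Percent change in the expected count for a one-unit increase in the predictor."""
    return (math.exp(beta) - 1.0) * 100.0


for name, beta in coefficients.items():
    print(f"{name:>14}: rate ratio = {math.exp(beta):.4f}, "
          f"change = {percent_change(beta):+.1f}%")
```

Read this way, one additional inch of precipitation multiplies the expected count by roughly 0.36, while one additional degree of high temperature raises it by a little under 2%, all else held constant.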
Before the data could be analyzed, we needed to transform it into a usable format. The count for Williamsburg Bridge crossings is originally a character variable containing commas. To fit a generalized linear model, we converted this response variable to a numeric type by removing the non-numeric characters.
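The cleaning step can be sketched as follows. This is a minimal Python illustration, not the R code used in the analysis; the function name `to_count` is hypothetical, and the sample strings are the six Williamsburg counts shown in the preview table.

```python
import re

# Williamsburg counts as they appear in the raw file (strings with commas).
raw_counts = ["1,915", "4,207", "5,178", "2,279", "5,711", "1,739"]


def to_count(text):
    """Strip every non-digit character and convert what remains to an integer."""
    digits = re.sub(r"[^0-9]", "", text)
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


williamsburg = [to_count(value) for value in raw_counts]
print(williamsburg)
```

Stripping all non-digits is safe here only because the counts are whole numbers; it would silently drop the decimal point from a value such as a temperature.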
The Poisson regression model was then fit using the ‘glm()’ function in R with a log link function. Specifying a log link is crucial because the Poisson model assumes a linear relationship between the predictors and log(Response); on the original scale, the mean of the response therefore has a nonlinear (exponential) relationship with the predictors. All predictor variables are included in the model except Total count, which will be used in our next Poisson regression model.
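To make the maximum likelihood fit less of a black box, here is a deliberately small sketch that fits a Poisson model with a log link by Newton-Raphson. It is not what ‘glm()’ does internally line for line, and it is not the model reported above: it uses only the six preview rows and a single predictor (Precipitation), so its estimates will differ from the table. The function name `fit_poisson_one_predictor` is hypothetical.

```python
import math

# First six rows of the data set: precipitation and Williamsburg counts.
precip = [0.0, 0.0, 0.03, 1.18, 0.0, 0.73]
counts = [1915, 4207, 5178, 2279, 5711, 1739]


def fit_poisson_one_predictor(x, y, max_iter=50, tol=1e-10):
    """Newton-Raphson for log(mu_i) = b0 + b1 * x_i under a Poisson likelihood."""
    b0 = math.log(sum(y) / len(y))  # start at the intercept-only fit
    b1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Solve H * step = g for the 2x2 case.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break
    return b0, b1


b0_hat, b1_hat = fit_poisson_one_predictor(precip, counts)
print(f"intercept = {b0_hat:.4f}, precipitation slope = {b1_hat:.4f}")
```

At the solution the score equations are zero, which means the fitted means sum to the observed total; this is a quick sanity check on any Poisson fit that includes an intercept.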
No variables were removed from the model because all predictor variables were statistically significant at the \(\alpha = 0.05\) significance level.
The goal of this analysis was to address one of the study hypotheses that the bike count on the Williamsburg Bridge is dependent on the predictor variables Day of the week, HighTemp, LowTemp, and Precipitation.
Our model indicates that Monday through Thursday experience higher bike traffic than Friday, the baseline, while Saturday and Sunday see a reduction. This pattern is consistent with commuting and suggests that people who bike across the Williamsburg Bridge are largely using it to commute.
Higher daily high temperatures are associated with increased bike traffic, perhaps due to more favorable cycling conditions. Holding the high temperature constant, a higher daily low is associated with slightly fewer crossings (coefficient -0.0024). Note that a higher LowTemp means a warmer, not colder, minimum, so this small negative effect should not be read as cold weather deterring cyclists. One possible explanation is that HighTemp and LowTemp tend to move together, which makes their separate effects hard to distinguish. Overall, temperature influences how many people choose to commute by bike: people are more likely to ride when conditions are favorable.
A strong negative relationship with precipitation indicates that worse weather conditions deter cyclists from biking across the Williamsburg Bridge. Bad weather tends to cause would-be bikers to choose an alternative form of transport.
Overall, our Poisson regression model provides valuable insights into the dynamics of bicycle traffic on the Williamsburg Bridge and highlights the impact of temporal and weather-related factors on urban cycling patterns.
We now create a Poisson regression model for the bike rates to determine whether the predictor variables affect the proportion of bicyclists that use the Williamsburg Bridge out of all East River bridges.
To model the rate of Williamsburg Bridge crossings relative to total crossings, we first compute a new variable representing the proportion of crossings on this bridge out of the total number of East River bridge crossings. This rate is calculated by dividing the count of cyclists on the Williamsburg Bridge by the aggregate count from all monitored bridges. In the Poisson regression itself, this rate is modeled by keeping the Williamsburg count as the response and including the log of the total count as an offset term, which ensures that our model reflects the rate of crossings rather than mere counts.
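As a quick illustration of the proportion being modeled, the sketch below computes it for the six preview rows, using the already-cleaned Williamsburg and Total counts. The variable names are ours.

```python
# Cleaned counts for the six preview rows (4/1 through 4/6).
dates = ["4/1", "4/2", "4/3", "4/4", "4/5", "4/6"]
williamsburg = [1915, 4207, 5178, 2279, 5711, 1739]
total = [5397, 13033, 16325, 6581, 17991, 4896]

# Share of all East River bridge crossings that used the Williamsburg Bridge.
share = [w / t for w, t in zip(williamsburg, total)]

for date, value in zip(dates, share):
    print(f"{date}: {value:.3f}")
```

On these six days the share stays close to one third even though the raw counts vary by a factor of three, which is why the rate model can behave so differently from the count model.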
Similar to our first Poisson model, we start with the full model including all predictor variables, with the response expressed as a rate. This means that Total count enters only through the offset rather than as a predictor, and that Date is excluded because it merely acts as an observation ID and is thus not a predictor.
If any predictors are not statistically significant, we will perform variable selection and remove them.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.0682 | 0.0223 | -47.8899 | 0.0000 |
| DayMonday | 0.0004 | 0.0099 | 0.0390 | 0.9689 |
| DaySaturday | 0.0375 | 0.0101 | 3.7140 | 0.0002 |
| DaySunday | 0.0051 | 0.0097 | 0.5278 | 0.5976 |
| DayThursday | 0.0206 | 0.0103 | 1.9990 | 0.0456 |
| DayTuesday | 0.0138 | 0.0104 | 1.3288 | 0.1839 |
| DayWednesday | 0.0233 | 0.0101 | 2.3146 | 0.0206 |
| HighTemp | -0.0012 | 0.0006 | -2.0273 | 0.0426 |
| LowTemp | 0.0004 | 0.0008 | 0.4461 | 0.6555 |
| Precipitation | 0.0505 | 0.0161 | 3.1363 | 0.0017 |
There are 4 variables that aren’t statistically significant at the \(\alpha = 0.05\) significance level. However, 3 of these variables are dummy variables for day of the week. Removing the insignificant ones is unwise because Monday, Sunday, and Tuesday being insignificant merely explains how those days of the week have no change in proportion of bikers riding the Williamsburg Bridge as compared to the baseline Friday. As a result, we will choose to keep all the insignificant dummy variables to maintain the integrity of the categorical predictor variable. We will remove LowTemp from our model because it isn’t a significant numerical predictor variable.
Therefore the final model for Poisson regression of rates is as follows:
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.0656 | 0.0215 | -49.5711 | 0.0000 |
| DayMonday | 0.0016 | 0.0096 | 0.1682 | 0.8665 |
| DaySaturday | 0.0381 | 0.0100 | 3.7975 | 0.0001 |
| DaySunday | 0.0045 | 0.0096 | 0.4655 | 0.6416 |
| DayThursday | 0.0213 | 0.0102 | 2.0970 | 0.0360 |
| DayTuesday | 0.0142 | 0.0104 | 1.3688 | 0.1711 |
| DayWednesday | 0.0240 | 0.0099 | 2.4165 | 0.0157 |
| HighTemp | -0.0010 | 0.0003 | -3.2512 | 0.0011 |
| Precipitation | 0.0527 | 0.0154 | 3.4221 | 0.0006 |
This model is log-linear, so the coefficients represent the expected change in the log rate of the response for a one-unit change in the predictor.
Before the data could be analyzed, we needed to transform it into a usable format. The Total count of crossings for all East River bridges is originally a character variable containing commas. To fit a generalized linear model, we converted this variable to a numeric type by removing the non-numeric characters.
The Poisson regression model was then fit using the ‘glm()’ function in R with a log link function. As in the count model, the log link is crucial because the Poisson model assumes a linear relationship between the predictors and log(Response), so the mean of the response has a nonlinear relationship with the predictors on the original scale.
To model rates and not just counts, we include an offset of log(Total count) to account for the varying amount of exposure across observations. We employ this because a rate needs to be standardized by a measure of time, population, or area; here the measure of opportunity is the total number of East River bridge crossings on each day.
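The role of the offset can be seen by turning the fitted rate back into an expected count: the linear predictor gives the log rate, and multiplying its exponential by the day's total gives the expected Williamsburg count. The sketch below uses the coefficients from the final rate model table above; the function name `expected_count` is hypothetical, and the calculation is an illustration rather than output from the original R session.

```python
import math

# Coefficients from the final rate model (LowTemp removed); Friday is the baseline.
coef = {
    "(Intercept)": -1.0656,
    "DayMonday": 0.0016,
    "DaySaturday": 0.0381,
    "DaySunday": 0.0045,
    "DayThursday": 0.0213,
    "DayTuesday": 0.0142,
    "DayWednesday": 0.0240,
    "HighTemp": -0.0010,
    "Precipitation": 0.0527,
}


def expected_count(day, high_temp, precipitation, total):
    """Expected Williamsburg count: total * exp(linear predictor)."""
    eta = (coef["(Intercept)"]
           + coef["HighTemp"] * high_temp
           + coef["Precipitation"] * precipitation)
    if day != "Friday":  # the baseline day has no dummy variable
        eta += coef["Day" + day]
    # Same as exp(log(total) + eta): the offset has a fixed coefficient of 1.
    return total * math.exp(eta)


# Row 4/5 of the preview: Wednesday, high of 63, no precipitation, 17,991 total.
predicted = expected_count("Wednesday", 63, 0.0, 17991)
print(f"expected {predicted:.0f} crossings, observed 5711")
```

Because the offset's coefficient is fixed at 1, doubling the total doubles the expected Williamsburg count while leaving the fitted rate unchanged.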
All predictor variables were initially included in the model; Total count enters only through the offset. Only LowTemp was removed, because every other predictor was either statistically significant at the \(\alpha = 0.05\) significance level or a dummy variable for the categorical Day variable; LowTemp was the only numerical predictor with a p-value larger than 0.05.
The goal of this analysis was to address one of the study hypotheses: that the bike rates on the Williamsburg Bridge are independent of the predictor variables Day of the week, HighTemp, LowTemp, and Precipitation.
The model reveals a clear pattern of variability across different days. With Friday as the baseline, Saturdays show a significantly higher rate of bicycle crossings, indicating a preference or increased opportunity for cycling during weekends. Conversely, the rates for Sunday, Monday, and Tuesday do not differ significantly from Friday, suggesting similar cycling behaviors on these days. However, Wednesday and Thursday exhibit a significant increase in rates, which may reflect mid-week behaviors or events influencing cycling traffic.
Interestingly, higher daytime temperatures are associated with a slight decrease in the rate of bicycle crossings, a finding that might seem counter intuitive. This could suggest a threshold beyond which higher temperatures become a deterrent to cyclists, possibly due to extreme heat discomfort or the availability of alternative leisure activities during very warm weather.
Contrary to common assumptions, an increase in precipitation is associated with a higher rate of cyclists crossing the bridge. This may indicate a strong commitment to cycling among regular commuters, regardless of weather conditions. It could also reflect other unmeasured variables that correlate with both increased precipitation and cycling rates, such as specific events or infrastructural factors.
Quasi-Poisson regression is an extension of the standard Poisson regression model, used primarily to address the issue of overdispersion. Overdispersion occurs when the variance of the count data is greater than the mean, which violates one of the key assumptions of a traditional Poisson model. Quasi-Poisson models handle this by introducing a dispersion parameter that allows the variance to be a function of the mean, effectively scaling the standard errors of the estimates to provide more accurate confidence intervals and p-values.
This approach is particularly useful in practical scenarios where the data exhibit greater variability than the Poisson distribution can account for. Our data is one such example: for bicycle counts across a bridge, factors like weather, traffic disruptions, or social events might lead to more erratic counts than expected.
In our analysis of the Williamsburg Bridge bicycle traffic, the assumption that the variance equals the mean may not hold due to complex urban dynamics and a diverse cycling population. By employing a Quasi-Poisson model, we can adjust for the extra variation, yielding more robust standard errors and thus more reliable statistical inferences. This makes the Quasi-Poisson model an attractive option for ensuring the validity of our findings despite the presence of overdispersion in our count data.
We make several alterations to build our quasi-Poisson (dispersed Poisson) regression. Our revised approach streamlines the temperature inputs by combining HighTemp and LowTemp into a single predictor, AvgTemp, representing the average temperature. We also simplify our precipitation variable: when there is no precipitation, NewPrecip is set to 0, and for any amount of precipitation, it is set to 1. This binary approach to rainfall allows us to examine its presence rather than its quantity. Consequently, our dispersed Poisson regression model will consider three main predictors: the day of the week (Day), the computed average temperature (AvgTemp), and the binary precipitation variable (NewPrecip).
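The two derived predictors can be sketched as below, using the six preview rows. We assume here that AvgTemp is the simple midpoint of HighTemp and LowTemp, which the text implies but does not state explicitly; the row dictionaries are just a convenient stand-in for a data frame.

```python
# Six preview rows with the original weather variables.
rows = [
    {"Date": "4/1", "HighTemp": 46.0, "LowTemp": 37.0, "Precipitation": 0.0},
    {"Date": "4/2", "HighTemp": 62.1, "LowTemp": 41.0, "Precipitation": 0.0},
    {"Date": "4/3", "HighTemp": 63.0, "LowTemp": 50.0, "Precipitation": 0.03},
    {"Date": "4/4", "HighTemp": 51.1, "LowTemp": 46.0, "Precipitation": 1.18},
    {"Date": "4/5", "HighTemp": 63.0, "LowTemp": 46.0, "Precipitation": 0.0},
    {"Date": "4/6", "HighTemp": 48.9, "LowTemp": 41.0, "Precipitation": 0.73},
]

for row in rows:
    # Assumption: average temperature is the midpoint of the daily high and low.
    row["AvgTemp"] = (row["HighTemp"] + row["LowTemp"]) / 2
    # Binary indicator: 1 for any precipitation at all, 0 otherwise.
    row["NewPrecip"] = 1 if row["Precipitation"] > 0 else 0

for row in rows:
    print(row["Date"], row["AvgTemp"], row["NewPrecip"])
```

Note that the binary coding treats 0.03 and 1.18 inches identically, so the model can no longer distinguish a drizzle from a downpour.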
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -1.044 | 0.04341 | -24.04 | 9.198e-17 |
| DayMonday | 0.001815 | 0.01971 | 0.09211 | 0.9275 |
| DaySaturday | 0.03454 | 0.02063 | 1.674 | 0.109 |
| DaySunday | 0.003457 | 0.02021 | 0.171 | 0.8658 |
| DayThursday | 0.0237 | 0.02089 | 1.134 | 0.2695 |
| DayTuesday | 0.02529 | 0.02082 | 1.215 | 0.238 |
| DayWednesday | 0.01835 | 0.02087 | 0.8795 | 0.3891 |
| AvgTemp | -0.001467 | 0.00068 | -2.158 | 0.04268 |
| NewPrecip | 0.01366 | 0.01314 | 1.04 | 0.3101 |
This model is a bit troubling as it shows that two of the predictors are not statistically significant. However, the meaning behind these p-values cannot be fully understood unless we also take a look at the estimated dispersion parameter.
We analyze the dispersion parameter as a form of goodness-of-fit check. This parameter is especially important for quasi-Poisson regression because it measures how much more (or less) spread out the data are around the fitted means than a Poisson distribution would predict. It is used to correct the standard errors for overdispersion, and it also indicates whether the assumption that the variance equals the mean is violated for a traditional Poisson model.
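A common estimator of the dispersion parameter, and to our understanding the one R reports for quasi families, is the Pearson chi-square statistic divided by the residual degrees of freedom. The sketch below implements it; the observed and fitted values are hypothetical toy numbers chosen only to show the arithmetic and are not taken from the bike data.

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square statistic divided by residual degrees of freedom."""
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    residual_df = len(observed) - n_params
    if residual_df <= 0:
        raise ValueError("need more observations than parameters")
    # For a Poisson model the variance function is V(mu) = mu.
    chi_square = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi_square / residual_df


# Hypothetical toy values, NOT from the bike data set.
observed = [2, 9, 14, 1]
fitted = [4.0, 6.0, 9.0, 5.0]

phi_hat = pearson_dispersion(observed, fitted, n_params=2)
print(f"estimated dispersion = {phi_hat:.3f}")
```

A value near 1 is consistent with the Poisson assumption, while a value well above 1, as in this toy example, signals overdispersion.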
| Dispersion |
|---|
| 5.941 |
The estimated dispersion was 5.9410864, which is substantially larger than 1. This value shows that our Poisson models suffer from overdispersion and thus violate the assumption that the variance equals the mean. As a result, the p-values in the output of the Poisson regression models are not reliable.
Given this substantial overdispersion, it is advisable to continue using the quasi-Poisson model rather than the standard Poisson model. The quasi-Poisson approach adjusts for overdispersion by allowing the variance to be a multiple (in this case, 5.941 times) of the mean, which leads to more reliable standard errors and consequently more valid inference statistics for hypothesis testing and confidence interval construction. This adjustment can provide a better fit for the data and more trustworthy conclusions about our model.
With a dispersion index of 5.941, the data are strongly overdispersed. We will therefore use the quasi-Poisson regression model as our final model.
The intercept for our model is -1.044, which represents the log rate of bike crossings on our baseline day, Friday, when average temperature is 0 and there is no precipitation. Average Temperature is the only predictor that is statistically significant at the \(\alpha = 0.05\) significance level. Its regression coefficient of -0.001467 means that a one degree Fahrenheit increase in average temperature changes the log rate by -0.001467. Exponentiating this value gives a rate ratio of \(e^{-0.001467} = 0.9985\); this means that increasing average temperature by one degree decreases the rate of Williamsburg Bridge bike crossings by about 0.15%.
Since none of the dummy variables is statistically significant, we find no evidence that day of the week affects the rate at which bicyclists take the Williamsburg Bridge compared to the remaining East River bridges. Similarly, our new variable New Precipitation, which records whether or not there was any precipitation on a given day, isn’t statistically significant, so we find no evidence that the rate at which bikers use the Williamsburg Bridge compared to the other East River bridges changes with the presence or absence of precipitation.
The inferential tables from our Poisson regression analyses have provided us with a numerical understanding of how bike rates may vary with different days and weather conditions. To visually explore these potential variations, we proceed to graphically represent the relationship between bike rates, days of the week, and weather conditions.
Building on our established regression model, we express the bike rate as a function of the day of the week and weather conditions using the formula:
\(\log(rate) = \beta_0 + \beta_{Monday} * Monday + ... + \beta_{NewPrecip} * NewPrecip\)
We then transform this log-rate into an actual rate by taking its exponential. This rate can be calculated for each day of the week under different weather scenarios, which are characterized by the average temperature and whether there was precipitation or not.
For instance, the exponential of the intercept gives us the base bike rate for a Friday with an average temperature of 0 and no precipitation. By adding the coefficients for the other days and the weather terms to the intercept before exponentiating, we can determine the bike rate for each scenario.
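The scenario calculation can be sketched as follows, using the quasi-Poisson coefficients from the table above. The function name `bike_rate` and the example scenario (an average temperature of 60 with no precipitation) are our own choices for illustration.

```python
import math

# Quasi-Poisson coefficients from the table (Friday is the baseline day).
coef = {
    "(Intercept)": -1.044,
    "DayMonday": 0.001815,
    "DaySaturday": 0.03454,
    "DaySunday": 0.003457,
    "DayThursday": 0.0237,
    "DayTuesday": 0.02529,
    "DayWednesday": 0.01835,
    "AvgTemp": -0.001467,
    "NewPrecip": 0.01366,
}

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def bike_rate(day, avg_temp, new_precip):
    """Fitted share of East River crossings that use the Williamsburg Bridge."""
    if day not in DAYS:
        raise ValueError(f"unknown day: {day!r}")
    log_rate = (coef["(Intercept)"]
                + coef["AvgTemp"] * avg_temp
                + coef["NewPrecip"] * new_precip)
    # Friday has no dummy variable, so it contributes 0 to the log rate.
    log_rate += coef.get("Day" + day, 0.0)
    return math.exp(log_rate)


for day in DAYS:
    print(f"{day:>9}: {bike_rate(day, avg_temp=60, new_precip=0):.4f}")
```

Evaluating the function over a grid of temperatures for each day produces the values needed to draw one line per day of the week.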
The lines are parallel because the model contains no interaction terms, so the effect of temperature and precipitation on bike rates is the same on every day of the week. This means that regardless of the day, an increase or decrease in Average Temperature or (New) Precipitation is associated with a uniform change in the log bike rate. The slopes of these lines are determined by the coefficients of Average Temperature and New Precipitation in our quasi-Poisson regression model.
In addition, the fact that the lines neither converge nor diverge reflects the model’s assumption that the numerical predictors do not interact with the day of the week. The difference in log bike rates between days of the week remains constant regardless of changes in average temperature and the presence of precipitation.
Our analytical journey, which now includes the exploration through a third model – the quasi-Poisson regression – has broadened our understanding of cycling patterns across the Williamsburg Bridge. This model, chosen to account for overdispersion observed in our data, allowed us to validate the appropriateness of our statistical approach in capturing the nuances of bike traffic counts.
Taken together, the models addressed our initial objectives and hypotheses. The Poisson count model indicated that bicycle traffic is affected by the day of the week, with weekdays showing a distinct pattern compared to the weekend, and that both temperature and precipitation influence cycling activity. Notably, the quasi-Poisson model found that day of the week and precipitation did not significantly alter the rate at which bikers crossed the Williamsburg Bridge compared to the remaining East River bridges.
In light of the research questions and objectives set forth at the beginning of our study, the analyses suggest that our objectives have been largely met. The models enabled us to identify patterns in daily and weekly bike traffic, examine the impact of temperature and precipitation, and assess the fit of Poisson distribution models for discrete biking data. Comparing the Poisson and quasi-Poisson models, we examined the validity of these statistical tools in the face of overdispersion, enhancing the robustness of our conclusions.
Comparing our quasi-Poisson model with the two traditional Poisson regression models yielded several insights. The results support the initial hypothesis that cycling numbers depend on temporal and weather variables, and they also suggest that the proportion of cyclists choosing the Williamsburg Bridge is not merely a function of these factors. This comprehensive study, which includes the quasi-Poisson adjustment, underlines the utility of different modeling approaches when dealing with real-world data where assumptions of standard models may not hold true.