1 Introduction

For this project, we will build Poisson regression models for the daily number of cyclists on the Williamsburg Bridge in Brooklyn, New York. The data set records the total number of cyclists entering and leaving this cycling route on each day. We will look at factors that may affect the number of cyclists on each day, such as the weather conditions on that particular day.

1.1 Data Description

The data set records the total number of cyclists on the Williamsburg Bridge on a given day, along with the weather conditions of that day, such as temperature and precipitation. It also includes the total number of cyclists on all four of the major New York bridges: the Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the Queensboro Bridge.

First, let’s find the data set which will be used for this assignment.

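library(openxlsx)  # provides read.xlsx()
id <- sample(1:10, 1)  # randomly pick one of the ten sheets
dat <- read.xlsx("https://pengdsci.github.io/STA321/ww09/w09-AssignDataSet.xlsx", sheet = paste0("data", id))
# Save the sheet under the name of its sixth column (the bridge name)
write.csv(dat, paste0("C:\\Users\\josie\\Downloads\\", names(dat)[6], ".csv"))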

When I ran this code, the data set I received was for the Williamsburg Bridge, so that is what we will use for this Poisson regression modeling project. The data set has been uploaded to GitHub and can now be read in directly from the GitHub repository.

We will read in the data set from GitHub and call it “cycling”.

cycling <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/WilliamsburgBridge.csv", header = TRUE)

str(cycling)
'data.frame':   31 obs. of  8 variables:
 $ X                 : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Date              : int  42917 42918 42919 42920 42921 42922 42923 42924 42925 42926 ...
 $ Day               : int  42917 42918 42919 42920 42921 42922 42923 42924 42925 42926 ...
 $ HighTemp          : num  84.9 87.1 87.1 82.9 84.9 75 79 82.9 81 82.9 ...
 $ LowTemp           : num  72 73 71.1 70 71.1 71.1 68 70 69.1 71.1 ...
 $ Precipitation     : num  0.23 0 0.45 0 0 0 1.78 0 0 0 ...
 $ WilliamsburgBridge: int  3845 4173 4924 3684 7308 7302 4421 5781 5782 8106 ...
 $ Total             : int  11867 13995 16067 13925 23110 21861 12805 17258 18320 24827 ...

We will use this cycling data set to create two Poisson regression models: one for the daily frequency counts of cyclists on the Williamsburg Bridge, and another for the rate of cyclists on the Williamsburg Bridge relative to the total number of cyclists on all of the major New York bridges, which enters the model as an offset.

1.2 Variables

There are 8 total variables in the cycling data set. These variables include:

  • X: The number of each observation. This is not a variable that is useful for analysis, but rather is for listing each of the 31 observations in order, from observation 1 to observation 31. This ordering was added when creating the .csv file, so it is not an essential part of the dataset for our analysis.

  • Date: The date on which a given observation was collected, stored as an integer date code (42917, 42918, ...). It takes a different value for each observation, so it also works as an observation ID.

  • Day: This represents the day on which a given observation was collected.

  • HighTemp: The high temperature on the given day, given in degrees Fahrenheit.

  • LowTemp: The low temperature on the given day, given in degrees Fahrenheit.

  • Precipitation: The amount of rain which occurred on the given day, given in inches.

  • WilliamsburgBridge: The total number of cyclists on the Williamsburg Bridge on the given day.

  • Total: The total number of cyclists on all four bridges on the given day.

For the Poisson regression model for the frequency counts, the WilliamsburgBridge variable will serve as the response variable. For the Poisson regression model for the rates, WilliamsburgBridge will again serve as the response variable, with the log of the Total variable included as an offset.

1.3 Research Questions

The main goal of this project is to create Poisson regression models for both the frequency counts and the rates of cyclists entering and leaving Brooklyn, New York through the Williamsburg Bridge, and to see whether these models can successfully predict those counts and rates.

Some key questions for this project include:

  • Does the data set meet all of the necessary conditions required for a Poisson regression model? If not, is there any potential explanation for this discrepancy?

  • Can we create Poisson regression models whose predictors are statistically significant for both the frequency counts and the rates of cyclists on the Williamsburg Bridge on a given day?

We will build Poisson regression models for both the frequency counts and the rates in order to answer these questions.

2 Exploratory Data Analysis

Let’s take a look at the first few entries within this cycling data set for the Williamsburg Bridge.

library(knitr)  # provides kable()
kable(head(cycling), caption = "First Few Observations in the Data Set")

First Few Observations in the Data Set

X   Date    Day     HighTemp   LowTemp   Precipitation   WilliamsburgBridge   Total
1   42917   42917   84.9       72.0      0.23            3845                 11867
2   42918   42918   87.1       73.0      0.00            4173                 13995
3   42919   42919   87.1       71.1      0.45            4924                 16067
4   42920   42920   82.9       70.0      0.00            3684                 13925
5   42921   42921   84.9       71.1      0.00            7308                 23110
6   42922   42922   75.0       71.1      0.00            7302                 21861

This data set includes various factors which may have an influence on the number of individuals cycling, along with the date on which the data was collected. It also includes both the number of cyclists on the Williamsburg Bridge and the total number of cyclists on all bridges for each day.

While looking at the data set, I observed that the entries for the Date and Day variables are identical for all 31 observations. Including both in our models would be redundant, so we will include only the Date variable in our Poisson regression models.
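The integer Date values look like spreadsheet serial day numbers. The sketch below assumes they are Excel serial dates (days counted from the origin 1899-12-30), which is an assumption about how the workbook stored them rather than something stated in the data set. The names serial and dates are illustrative.

```r
# Assumption: Date holds Excel serial day numbers (origin 1899-12-30).
serial <- c(42917, 42918, 42919)   # first three Date values from str(cycling)

# Convert the serial numbers to calendar dates.
dates <- as.Date(serial, origin = "1899-12-30")
print(dates)
```

Under that assumption the first observation falls on 2017-07-01, and the 31 observations would cover one calendar month. The same conversion applied to cycling$Date would give a date column that plots and labels more readably than the raw integers.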

2.1 Assumptions and Conditions

There are four assumptions which must be met in order to create a Poisson regression model. These assumptions include:

  1. The response variable is a count described by a Poisson distribution.

  2. Observations are independent of one another.

  3. The mean of the Poisson random variable is equal to the variance of said Poisson random variable.

  4. The log of the mean rate, log (λ), must be a linear function of x.
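Written as a single model statement, the four assumptions look roughly as follows. The index i, the coefficients β_j, and the number of predictors p are notation assumed here for illustration. λ matches the symbol used in the fourth assumption, and the fitted equations later in the report write the same quantity as μ.

```latex
Y_i \mid x_i \sim \mathrm{Poisson}(\lambda_i) \quad \text{independently for } i = 1, \dots, n,
\qquad
\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip},
\qquad
\mathrm{E}(Y_i \mid x_i) = \mathrm{Var}(Y_i \mid x_i) = \lambda_i .
```

The first expression covers assumptions 1 and 2, the second covers assumption 4, and the third covers assumption 3.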

Before beginning the model building process, we will check whether our cycling data set meets each of these four necessary conditions for a Poisson regression model.

2.1.1 Condition 1: The response variable is a count described by a Poisson distribution.

The response variable in this data set is WilliamsburgBridge, the total number of cyclists on the Williamsburg Bridge for a given observation. This variable is a non-negative integer count, which fits the first part of this assumption. Whether the count behaves like a Poisson random variable is examined under Condition 3.

2.1.2 Condition 2: Observations are independent of one another.

Each observation was collected on a different date, and we will assume that the conditions of one day did not affect the conditions of another day. Under this assumption, the number of cyclists on the Williamsburg Bridge for one observation is independent of the number for any other observation, so we treat the observations as independent of one another.

2.1.3 Condition 3: The mean of the Poisson random variable is equal to the variance of said Poisson random variable.

In order for a variable to be a Poisson random variable, its mean must be equal to its variance. We previously stated that the WilliamsburgBridge variable will be our response variable. Therefore, we must check that this variable meets the criteria for a Poisson random variable, having a mean which is equal to its variance.

# Finding the mean (stored under a name that does not reuse the function name mean).
wb_mean <- mean(cycling$WilliamsburgBridge)
print(wb_mean)
[1] 6073.677

The mean of the WilliamsburgBridge variable is 6,073.677, so on average about 6,074 cyclists were on the Williamsburg Bridge on a given date. We round this value because the number of individuals is a whole number.

Next, let’s find the variance of our response variable.

# Finding the variance (stored under a name distinct from the function var).
wb_var <- var(cycling$WilliamsburgBridge)
print(wb_var)
[1] 2482822

The variance of the WilliamsburgBridge variable is 2,482,822, which is far larger than the mean of 6,073.677. This indicates a violation of one of the necessary conditions for a Poisson regression model: because the mean is not equal to the variance, our response variable does not appear to be a Poisson random variable.
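The size of the gap can be summarized as a variance-to-mean ratio, which would be close to 1 for a Poisson random variable. The short sketch below uses the two values printed above. The names obs_mean, obs_var, and ratio are illustrative.

```r
# Values copied from the printed output above.
obs_mean <- 6073.677
obs_var  <- 2482822

# For a Poisson random variable this ratio would be close to 1.
ratio <- obs_var / obs_mean
print(round(ratio, 1))
```

The ratio comes out at roughly 409, so the variance is several hundred times the mean. This compares the marginal mean and variance only. It does not account for the predictors, so it is a rough indication rather than a formal test.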

2.1.4 Condition 4: The log of the mean rate, log (λ), must be a linear function of x.

To check this condition, we will plot the response variable against each of the predictor variables.

First, let’s look at the predictor variable of Date vs. our response variable of WilliamsburgBridge.

plot(cycling$Date, cycling$WilliamsburgBridge, main = "Date vs. Williamsburg Bridge", xlab = "Date", ylab = "WilliamsburgBridge")

The scatterplot of these two variables shows randomly scattered points that do not appear to follow a linear pattern. This could suggest a possible violation of this condition, since WilliamsburgBridge does not appear to be a linear function of the Date predictor variable.

Next, let’s look at the predictor variable of high temperature vs our response variable of WilliamsburgBridge.

plot(cycling$HighTemp, cycling$WilliamsburgBridge, main = "HighTemp vs. Williamsburg Bridge", xlab = "HighTemp", ylab = "WilliamsburgBridge")

The scatterplot of these two variables again shows randomly scattered points that do not appear to follow a distinctly linear pattern. This could suggest a possible violation of this condition, since WilliamsburgBridge does not appear to be a linear function of the HighTemp predictor variable.

Next, let’s look at the predictor variable of low temperature vs our response variable of WilliamsburgBridge.

plot(cycling$LowTemp, cycling$WilliamsburgBridge, main = "LowTemp vs. Williamsburg Bridge", xlab = "LowTemp", ylab = "WilliamsburgBridge")

The scatterplot of these two variables again shows randomly scattered points that do not appear to follow a distinctly linear pattern. This could suggest a possible violation of this condition, since WilliamsburgBridge does not appear to be a linear function of the LowTemp predictor variable.

Lastly, let’s look at the predictor variable of precipitation vs our response variable of WilliamsburgBridge.

plot(cycling$Precipitation, cycling$WilliamsburgBridge, main = "Precipitation vs. Williamsburg Bridge", xlab = "Precipitation", ylab = "WilliamsburgBridge")

The scatterplot of these two variables again does not appear to follow a distinctly linear pattern. The points are mostly concentrated around x = 0, with some outliers to the right. This could suggest a possible violation of this condition, since WilliamsburgBridge does not appear to be a linear function of the Precipitation predictor variable.

Overall, it seems that we do have some violations of the conditions of a Poisson regression model: the variance of the response variable is far from equal to its mean, and the response variable does not appear to follow a linear function of the predictor variables in our model.
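Condition 4 is stated for the log of the mean, so it can also help to view the response on the log scale. The sketch below does this for HighTemp using only the six rows printed by head(cycling). The object head_rows and the column logCount are illustrative names. Six points are far too few to judge linearity, so this only shows the mechanics.

```r
# First six rows of the cycling data, copied from the head() output.
head_rows <- data.frame(
  HighTemp           = c(84.9, 87.1, 87.1, 82.9, 84.9, 75.0),
  WilliamsburgBridge = c(3845, 4173, 4924, 3684, 7308, 7302)
)

# Condition 4 concerns log(mean count), so take the log of the response.
head_rows$logCount <- log(head_rows$WilliamsburgBridge)

# Scatterplot of the log count against the predictor.
plot(head_rows$HighTemp, head_rows$logCount,
     main = "HighTemp vs. log(WilliamsburgBridge)",
     xlab = "HighTemp", ylab = "log(WilliamsburgBridge)")
```

With the full data set, the same plot could be drawn by replacing head_rows with cycling, and repeated for Date, LowTemp, and Precipitation.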

We will still continue with building the Poisson regression models, but it is important to keep in mind that these violations may mean that a Poisson regression model is not the best choice for this data set, because some of the necessary conditions have not been met.

3 Poisson Regression Model on Frequency Counts

We will begin by creating a Poisson regression model of the frequency counts, that is, the number of individuals on the Williamsburg Bridge for a given observation. Our goal is a model whose predictors are statistically significant for the expected count, based upon the various factors in this data set.

We will now fit our Poisson regression model on the frequency counts.

# Poisson Regression Model of Counts
model.counts <- glm(WilliamsburgBridge ~ Date + HighTemp + LowTemp + Precipitation, family = poisson(link = "log"), data = cycling)
pois.count.coef <- summary(model.counts)$coef
kable(pois.count.coef, caption = "Poisson Regression Model for the Counts of Cyclists \n on the Williamsburg Bridge")

Poisson Regression Model for the Counts of Cyclists on the Williamsburg Bridge

                Estimate       Std. Error   z value      Pr(>|z|)
(Intercept)     -329.7412813   11.7142108   -28.148826   0
Date            0.0078648      0.0002726    28.850990    0
HighTemp        0.0035901      0.0006334    5.667892     0
LowTemp         0.0075718      0.0009046    8.370178     0
Precipitation   -0.3516535     0.0086431    -40.685836   0

The regression equation for the Poisson regression model on the frequency counts is given as:

log(μ) = -329.7413 + 0.0079 * Date + 0.0036 * HighTemp + 0.0076 * LowTemp - 0.3517 * Precipitation

All four of the predictor variables, Date, HighTemp, LowTemp, and Precipitation, have p-values of p < .001 (the values shown as 0 in the table are rounded). This indicates that all of the predictor variables in our model are statistically significant in predicting the expected count of cyclists on the Williamsburg Bridge on a given day.

The significance of the weather variables is plausible, since adverse conditions such as excessive heat or cold, intense precipitation, and storms make outdoor activities such as cycling less appealing. The signs of the estimates point the same way: the expected count rises with the high and low temperatures and falls with precipitation.

Overall, this Poisson model of the frequency counts showed statistical significance in its prediction of the expected log count of cyclists on the Williamsburg Bridge for a given observation.

3.1 Regression Coefficients Interpretation

The Poisson regression model on frequency counts was found to have the following regression equation:

log(μ) = -329.7413 + 0.0079 * Date + 0.0036 * HighTemp + 0.0076 * LowTemp - 0.3517 * Precipitation

We will analyze the regression coefficients for the variables in this Poisson regression model on frequency counts.

  • The value of the y-intercept is given as -329.7413. This represents the log of the expected count, log(μ), when all predictor variables are equal to 0. However, the y-intercept does not have a practical interpretation in this scenario, so we are not interested in its meaning for the Poisson regression model.

  • Date: The regression coefficient of the Date variable in this model is 0.0079. This means that the log of the expected count increases by 0.0079 for every 1 day increase in the date on which the observation was collected, holding all other variables constant.

  • HighTemp: The regression coefficient of the HighTemp variable in this model is 0.0036. This means that the log of the expected count increases by 0.0036 for every 1 degree Fahrenheit increase in the high temperature for the given observation, holding all other variables constant.

  • LowTemp: The regression coefficient of the LowTemp variable in this model is 0.0076. This means that the log of the expected count increases by 0.0076 for every 1 degree Fahrenheit increase in the low temperature for the given observation, holding all other variables constant.

  • Precipitation: The regression coefficient of the Precipitation variable in this model is -0.3517. This means that the log of the expected count decreases by 0.3517 for every 1 inch increase in the amount of precipitation for the given observation, holding all other variables constant.
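Changes on the log scale can be hard to picture, so the coefficients are often exponentiated to give multiplicative changes in the expected count. The sketch below does this with the estimates copied from the coefficient table. The names count_coef, rate_ratio, and pct_change are illustrative.

```r
# Coefficient estimates copied from the count-model table.
count_coef <- c(Date = 0.0078648, HighTemp = 0.0035901,
                LowTemp = 0.0075718, Precipitation = -0.3516535)

# exp(coefficient) is the multiplicative change in the expected count
# for a one-unit increase in that predictor, holding the others constant.
rate_ratio <- exp(count_coef)
pct_change <- 100 * (rate_ratio - 1)

print(round(cbind(rate_ratio, pct_change), 4))
```

For example, the Precipitation ratio is about 0.70, so each additional inch of precipitation corresponds to roughly 30% fewer expected cyclists under this model. The Date ratio is about 1.008, or roughly a 0.8% increase per day.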

4 Poisson Regression Model on Rates

Now, we will create a Poisson regression model of the rates at which cyclists enter and leave via the Williamsburg Bridge, relative to the total number of cyclists on all four of the major New York bridges: the Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the Queensboro Bridge. Unlike the previous model, which focused only on the frequency counts on the Williamsburg Bridge, this model treats the number of cyclists on the Williamsburg Bridge for a given observation as a rate out of the total number of cyclists on all four bridges for that observation.

We will build our Poisson regression model for the rates. We will still use the WilliamsburgBridge variable as our response variable, but we will include the log of the Total variable as an offset.

# Poisson Model of Rates
model.rates <- glm(WilliamsburgBridge ~ Date + HighTemp + LowTemp + Precipitation, offset = log(Total), 
                   family = poisson(link = "log"), data = cycling)
kable(summary(model.rates)$coef, caption = "Poisson Regression Model of the Rates of Cyclists \n on the Williamsburg Bridge out of all Four Bridges")

Poisson Regression Model of the Rates of Cyclists on the Williamsburg Bridge out of all Four Bridges

                Estimate      Std. Error   z value     Pr(>|z|)
(Intercept)     -50.4101583   12.0801410   -4.172978   3.01e-05
Date            0.0011422     0.0002811    4.063496    4.83e-05
HighTemp        -0.0050794    0.0006460    -7.862790   0.00e+00
LowTemp         0.0092517     0.0009198    10.057847   0.00e+00
Precipitation   0.0356817     0.0078863    4.524499    6.10e-06

The regression equation for the Poisson regression model on the rates is given as:

log(μ/t) = -50.4102 + 0.0011 * Date - 0.0051 * HighTemp + 0.0093 * LowTemp + 0.0357 * Precipitation
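Because log(μ/t) = log(μ) - log(t), the rate equation is the count equation with log(t) added as a term whose coefficient is fixed at 1. In glm() this term can be supplied either through the offset argument or as an offset() term in the formula. The sketch below checks that the two forms give the same coefficients. It uses only the six rows printed by head(cycling) and two predictors so that it stays self-contained, which means its estimates are not the ones in the table above. The names head_rows, fit_arg, and fit_formula are illustrative.

```r
# First six rows of the cycling data, copied from the head() output.
head_rows <- data.frame(
  HighTemp           = c(84.9, 87.1, 87.1, 82.9, 84.9, 75.0),
  Precipitation      = c(0.23, 0.00, 0.45, 0.00, 0.00, 0.00),
  WilliamsburgBridge = c(3845, 4173, 4924, 3684, 7308, 7302),
  Total              = c(11867, 13995, 16067, 13925, 23110, 21861)
)

# Form 1: offset supplied through the offset argument.
fit_arg <- glm(WilliamsburgBridge ~ HighTemp + Precipitation,
               offset = log(Total),
               family = poisson(link = "log"), data = head_rows)

# Form 2: offset written as a term in the model formula.
fit_formula <- glm(WilliamsburgBridge ~ HighTemp + Precipitation + offset(log(Total)),
                   family = poisson(link = "log"), data = head_rows)

# Both forms describe the same model, so the coefficients should agree.
print(all.equal(coef(fit_arg), coef(fit_formula)))
```

One practical difference is that predict() on new data handles the formula version automatically, which is a reason some analysts prefer it.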

All four of the predictor variables in this Poisson model, Date, HighTemp, LowTemp, and Precipitation, have p-values of p < .001. This indicates that all of the predictor variables in our model are statistically significant in predicting the expected rate of cyclists on the Williamsburg Bridge on a given day, relative to the total number of cyclists on all four of the major New York bridges.

Overall, this model for the rates shows statistical significance in its prediction, which suggests that it is useful for prediction and estimation.

4.1 Regression Coefficients Interpretation

  • The value of the y-intercept is given as -50.4102. This represents the log of the expected rate, log(μ/t), when all predictor variables are equal to 0. However, the y-intercept does not have a practical interpretation in this scenario, so we are not interested in its meaning for the Poisson regression model.

  • Date: The regression coefficient of the Date variable in this model is 0.0011. This means that the log of the expected rate, log(μ/t), increases by 0.0011 for every 1 day increase in the date on which the observation was collected, holding all other variables constant.

  • HighTemp: The regression coefficient of the HighTemp variable in this model is -0.0051. This means that the log of the expected rate decreases by 0.0051 for every 1 degree Fahrenheit increase in the high temperature for the given observation, holding all other variables constant.

  • LowTemp: The regression coefficient of the LowTemp variable in this model is 0.0093. This means that the log of the expected rate increases by 0.0093 for every 1 degree Fahrenheit increase in the low temperature for the given observation, holding all other variables constant.

  • Precipitation: The regression coefficient of the Precipitation variable in this model is 0.0357. This means that the log of the expected rate increases by 0.0357 for every 1 inch increase in the amount of precipitation for the given observation, holding all other variables constant.
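To see how these coefficients combine, the sketch below evaluates the fitted rate equation for the first observation in the data set (Date 42917, HighTemp 84.9, LowTemp 72.0, Precipitation 0.23) and compares it with the observed share 3845 / 11867. The estimates are copied from the table at their printed precision, so the result is approximate. The names rate_coef, x_first, pred_share, and obs_share are illustrative.

```r
# Coefficient estimates copied from the rate-model table.
rate_coef <- c(Intercept = -50.4101583, Date = 0.0011422,
               HighTemp = -0.0050794, LowTemp = 0.0092517,
               Precipitation = 0.0356817)

# Predictor values for the first observation (1 multiplies the intercept).
x_first <- c(1, 42917, 84.9, 72.0, 0.23)

# Linear predictor log(mu / t), then back-transform to the rate itself.
log_rate   <- sum(rate_coef * x_first)
pred_share <- exp(log_rate)

# Observed share of all four-bridge cyclists on the Williamsburg Bridge.
obs_share <- 3845 / 11867

print(round(c(predicted = pred_share, observed = obs_share), 3))
```

Both values come out at roughly 0.32, meaning about a third of the cyclists counted on the four bridges that day used the Williamsburg Bridge. Multiplying pred_share by Total gives the corresponding expected count.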

5 Summary and Comparisons of the Two Models

Both of the Poisson regression models we created, the model for the frequency counts and the model for the rates, showed statistical significance in their prediction. In both models, we looked at the total number of cyclists on the Williamsburg Bridge in New York for a specific observation, together with the date of the observation and factors which may affect the number of cyclists on that date: the high temperature, the low temperature, and the amount of precipitation. All of these factors were statistically significant in both models, indicating that these weather-related conditions have a statistically significant association with both the counts and the rates of cyclists on the Williamsburg Bridge. This can be attributed to certain weather conditions making it more or less ideal for individuals to be cycling outdoors. For instance, a day with very high temperatures, very cold temperatures, or severe storms with heavy precipitation would be less ideal and would likely lead to fewer cyclists being out than a day with pleasant weather. Note that the rates model describes the Williamsburg Bridge share of the cyclists on all four bridges, so the signs of its coefficients need not match those of the count model.

However, as was previously stated, there were some violations of the conditions for a Poisson regression model within our data set. First, the mean of the response variable, WilliamsburgBridge, was not equal to its variance. This suggests that the response variable is not Poisson distributed. Additionally, all four predictor variables were checked, and the response variable did not appear to be a linear function of any of them. These violations suggest that a Poisson model was perhaps not the best choice for this data set, and it is important to be mindful of them when using either of the Poisson regression models we created for prediction.

6 Conclusion

Overall, two Poisson regression models were created in this project. Both of these models looked at the total number of cyclists on the Williamsburg Bridge for a given day. The first model looked at the counts of cyclists that were on the Williamsburg Bridge, and the second model looked at the rates of the cyclists that were on the Williamsburg Bridge offset by the total number of cyclists on all four of the major New York bridges.

Both of these Poisson regression models showed statistical significance in their predictions, with all of the predictor variables in both models having p-values of p < .001. This indicates that Date, HighTemp, LowTemp, and Precipitation are statistically significant in predicting the frequency counts and the rates of cyclists on the Williamsburg Bridge for a given observation.

However, our data set failed to meet some of the necessary conditions for a Poisson regression model. The mean of the response variable, WilliamsburgBridge, was not equal to its variance, which suggests that this response variable is not Poisson distributed. Also, all four predictor variables were checked, and the response variable did not appear to be a linear function of any of them. These are two violations of the assumptions for a Poisson regression model. They mean that a Poisson regression model was perhaps not the best choice for this data set, and they should be kept in mind when using either of these models for prediction.

6.1 Recommendations

Some recommendations I would suggest for further projects include:

  • Look further into the violations that were found within this data set and into possible explanations for these violations of the conditions required for a Poisson regression model. Further consider whether the Poisson regression model is in fact the best choice for this data set and whether it is sufficient to use this model for prediction despite these violations.

  • Consider other variables which may affect the number of cyclists out on a given observation. Perhaps there are other factors which may provide further significance for model building which may strengthen the regression model.

  • Further expand the data set to ensure the accuracy of the predictions and to further strengthen the Poisson regression models.
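As a possible starting point for the first recommendation, a common way to quantify a mean-variance violation after fitting is the Pearson dispersion statistic, which should be near 1 for a well-behaved Poisson model. The sketch below uses simulated overdispersed counts, not the cycling data, and every name in it (x, y, fit_pois, fit_quasi, dispersion, se_ratio) is illustrative. The quasi-Poisson family is shown only as one option that could be explored. It was not used in this project.

```r
set.seed(321)

# Simulated data (an assumption for illustration, not the cycling data):
# counts drawn from a negative binomial, so variance exceeds the mean.
n  <- 200
x  <- runif(n, 60, 90)                 # hypothetical temperature-like predictor
mu <- exp(5 + 0.02 * x)
y  <- rnbinom(n, size = 5, mu = mu)

# Ordinary Poisson fit.
fit_pois <- glm(y ~ x, family = poisson(link = "log"))

# Pearson dispersion statistic: near 1 if mean = variance holds.
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
print(dispersion)

# Quasi-Poisson fit: same coefficients, standard errors scaled by sqrt(dispersion).
fit_quasi <- glm(y ~ x, family = quasipoisson(link = "log"))
se_ratio  <- summary(fit_quasi)$coefficients[, "Std. Error"] /
             summary(fit_pois)$coefficients[, "Std. Error"]
print(se_ratio)
```

The same dispersion calculation could be applied to model.counts and model.rates. A value far above 1 would mean the Poisson standard errors, and therefore the p-values reported above, are likely too optimistic.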

