Introduction
For this project, we will build Poisson regression models for the daily
number of cyclists on the Williamsburg Bridge in New York City. The data
set records the total number of cyclists entering and leaving this
cycling route on each day. We will look at the various factors that may
affect the number of cyclists on each day, such as the weather
conditions on that particular day.
Data Description
The data set in this project looks at the total number of cyclists on
the Williamsburg Bridge on a given day along with the weather conditions
of that day, such as temperature and precipitation. This data set also
includes the total number of cyclists on all four of the major New York
bridges: the Brooklyn Bridge, the Manhattan Bridge, the Williamsburg
Bridge, and the Queensboro Bridge.
First, let’s find the data set which will be used for this
assignment.
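library(openxlsx)  # read.xlsx() with a `sheet` argument is assumed to come from openxlsx
# Randomly pick one of the ten assignment data sets.
id <- sample(1:10, 1)
dat <- read.xlsx("https://pengdsci.github.io/STA321/ww09/w09-AssignDataSet.xlsx", sheet = paste0("data", id))
# Save a local copy named after the bridge column (the sixth column).
write.csv(dat, paste0("C:\\Users\\josie\\Downloads\\", names(dat)[6], ".csv"))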
When I ran this code, the data set I received was for the
Williamsburg Bridge, so that is what we will use for this Poisson
regression modeling project. The data set has been uploaded to GitHub
and can now be read in directly from the GitHub repository.
We will read in the data set from GitHub and call it “cycling”.
cycling <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/WilliamsburgBridge.csv", header = TRUE)
str(cycling)
'data.frame': 31 obs. of 8 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ Date : int 42917 42918 42919 42920 42921 42922 42923 42924 42925 42926 ...
$ Day : int 42917 42918 42919 42920 42921 42922 42923 42924 42925 42926 ...
$ HighTemp : num 84.9 87.1 87.1 82.9 84.9 75 79 82.9 81 82.9 ...
$ LowTemp : num 72 73 71.1 70 71.1 71.1 68 70 69.1 71.1 ...
$ Precipitation : num 0.23 0 0.45 0 0 0 1.78 0 0 0 ...
$ WilliamsburgBridge: int 3845 4173 4924 3684 7308 7302 4421 5781 5782 8106 ...
$ Total : int 11867 13995 16067 13925 23110 21861 12805 17258 18320 24827 ...
We will use this cycling data set to create two Poisson regression
models, one for the frequency counts of cyclists on the Williamsburg
Bridge on a given observation, and another for the rates of cyclists
entering and leaving via the Williamsburg Bridge offset by the total
number of cyclists on all of the major New York bridges.
Variables
There are 8 total variables in the cycling data set. These variables
include:
X: The number of each observation. This is not a variable that is
useful for analysis, but rather is for listing each of the 31
observations in order, from observation 1 to observation 31. This
ordering was added when creating the .csv file, so it is not an
essential part of the dataset for our analysis.
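Date: This represents the date on which a given observation was
collected, stored as an integer day number (42917, 42918, ...) rather
than a formatted date. Because it increases by one with each
observation, it also acts as the observation ID number.
Day: This represents the day on which a given observation was
collected. In this file its values are identical to those of Date.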
HighTemp: The high temperature on the given day, given in degrees
Fahrenheit.
LowTemp: The low temperature on the given day, given in degrees
Fahrenheit.
Precipitation: The amount of rain which occurred on the given
day, given in inches.
WilliamsburgBridge: The total number of cyclists on the
Williamsburg Bridge on a given observation.
Total: The total number of cyclists on all bridges on a given
observation.
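The integer values in Date (42917, 42918, ...) look like Excel serial day numbers, which is what an .xlsx import typically returns for date cells. Assuming that is the case (the source file does not confirm it), a minimal sketch of converting them to calendar dates could look like this:

```r
# Assumption: Date holds Excel serial day numbers (origin 1899-12-30 in R).
serial_days <- c(42917, 42918, 42919)   # first three Date values shown by str()

# as.Date() converts a numeric day count relative to the given origin.
obs_dates <- as.Date(serial_days, origin = "1899-12-30")

format(obs_dates, "%Y-%m-%d")
# "2017-07-01" "2017-07-02" "2017-07-03"
```

Under this assumption, and if the 31 values are consecutive, the observations would cover July 2017, and a day-of-week variable could be derived from the converted dates with `weekdays()`.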
For the Poisson regression model for the frequency counts, the
Williamsburg Bridge variable will serve as the response variable. For
the Poisson regression model for the rates, the Williamsburg Bridge
variable will again serve as the response variable, and it will be
offset by the Total variable for this model.
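In symbols, the two models can be sketched as follows, where μ is the expected Williamsburg Bridge count, t is the Total count, and the β terms are regression coefficients (the β notation is an assumption made here for illustration):

```latex
\begin{align*}
\text{Counts model:}\quad
  \log(\mu) &= \beta_0 + \beta_1\,\text{Date} + \beta_2\,\text{HighTemp}
             + \beta_3\,\text{LowTemp} + \beta_4\,\text{Precipitation} \\
\text{Rates model:}\quad
  \log\!\left(\frac{\mu}{t}\right) &= \beta_0 + \beta_1\,\text{Date} + \beta_2\,\text{HighTemp}
             + \beta_3\,\text{LowTemp} + \beta_4\,\text{Precipitation} \\
\text{equivalently}\quad
  \log(\mu) &= \log(t) + \beta_0 + \beta_1\,\text{Date} + \beta_2\,\text{HighTemp}
             + \beta_3\,\text{LowTemp} + \beta_4\,\text{Precipitation}
\end{align*}
```

The last line shows why log(Total) enters the rates model as an offset: it is a term on the log scale whose coefficient is fixed at 1 rather than estimated.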
Research Questions
The main goal for this project is to create a Poisson regression
model for both the frequency counts and the rates of the cyclists
entering and leaving Brooklyn, New York through the Williamsburg Bridge.
So, the focus for this project will be on creating two Poisson
regression models which can successfully predict the frequency counts
and the rates of the cyclists on the Williamsburg Bridge.
Some key questions for this project include:
Does the data set meet all of the necessary conditions required
for a Poisson regression model? If not, is there any potential
explanation for this discrepancy?
Can we create Poisson regression models whose predictors are
statistically significant for both the frequency counts and the rates
of cyclists on the Williamsburg Bridge on a given day?
We will build Poisson regression models for both the frequency counts
and the rates in order to see whether the predictors in these models are
in fact statistically significant.
Exploratory Data Analysis
Let’s take a look at the first few entries within this cycling data
set for the Williamsburg Bridge.
kable(head(cycling), caption = "First Few Observations in the Data Set")
First Few Observations in the Data Set
| X | Date | Day | HighTemp | LowTemp | Precipitation | WilliamsburgBridge | Total |
|---|---|---|---|---|---|---|---|
| 1 | 42917 | 42917 | 84.9 | 72.0 | 0.23 | 3845 | 11867 |
| 2 | 42918 | 42918 | 87.1 | 73.0 | 0.00 | 4173 | 13995 |
| 3 | 42919 | 42919 | 87.1 | 71.1 | 0.45 | 4924 | 16067 |
| 4 | 42920 | 42920 | 82.9 | 70.0 | 0.00 | 3684 | 13925 |
| 5 | 42921 | 42921 | 84.9 | 71.1 | 0.00 | 7308 | 23110 |
| 6 | 42922 | 42922 | 75.0 | 71.1 | 0.00 | 7302 | 21861 |
This data set includes various factors which may have an influence on
the number of individuals cycling, along with the date on which this
data was collected. Additionally, this data set includes variables for
both the number of cyclists on the Williamsburg Bridge on that given
day, along with the total number of cyclists on all bridges on that
given day.
An observation I made while looking at the data set is that the
entries for the Date and the Day variables are identical for all 31
observations. Both variables therefore carry the same information, and
it would be redundant to include both in our models. We will include
only the Date variable in our Poisson regression models.
Assumptions and Conditions
There are four assumptions which must be met in order to create a
Poisson regression model. These assumptions include:
The response variable is a count described by a Poisson
distribution.
Observations are independent of one another.
The mean of the Poisson random variable is equal to the variance
of said Poisson random variable.
The log of the mean rate, log (λ), must be a linear function of
x.
We will check whether each of these four necessary conditions is met
by our cycling data set before beginning the model building process for
our Poisson regression models.
Condition 1: The response variable is a count described by a Poisson distribution.
The response variable in this data set was stated to be the
WilliamsburgBridge variable, representing the total number of cyclists
on the Williamsburg Bridge on a given observation. This variable is
described as a count, representing the number of cyclists on a given
observation. This fits the criteria for this assumption, because we can
conclude that we have a response variable that is a count.
Condition 2: Observations are independent of one another.
Each observation was collected on a given date, and we assume that
the conditions of one day did not affect the conditions of another day.
Under this assumption, the number of cyclists on the Williamsburg Bridge
for a given observation is independent of the number for any other
observation. So, we conclude that the observations are independent and
separate from one another.
Condition 3: The mean of the Poisson random variable is equal to the variance of said Poisson random variable.
In order for a variable to be a Poisson random variable, its mean
must be equal to its variance. We previously stated that the
WilliamsburgBridge variable will be our response variable. Therefore, we
must check that this variable meets the criteria for a Poisson random
variable, having a mean which is equal to its variance.
# Finding the mean.
mean <- mean(cycling$WilliamsburgBridge)
print(mean)
[1] 6073.677
The mean of the WilliamsburgBridge variable is 6,073.677. This
represents the mean number of individuals on the Williamsburg Bridge on
a given observation. This means that the mean number of individuals on
the Williamsburg Bridge on any given date is around 6,074 people. We
round this value because the number of individuals is a whole
number.
Next, let’s find the variance of our response variable.
# Finding the variance.
variance <- var(cycling$WilliamsburgBridge)
print(variance)
[1] 2482822
The variance of the WilliamsburgBridge variable is 2,482,822, which is
roughly 409 times the mean of 6,073.677. This does not match the value
of the mean, and indicates a violation of one of the necessary
conditions for a Poisson regression model. This implies that our
response variable is in fact not a Poisson random variable, because its
variance is far larger than its mean.
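Comparing the raw mean and variance ignores the predictors, whereas the Poisson assumption concerns the variance around the fitted mean. One common way to check this is the Pearson dispersion statistic of a fitted Poisson model, where values far above 1 point to overdispersion. The sketch below is illustrative only: it uses just the six rows printed in the Exploratory Data Analysis section and a single predictor, and the object names `first6`, `fit`, and `dispersion` are hypothetical. The real check would use the full cycling data frame and the full model.

```r
# Illustration only: the six rows shown by head(cycling), not the full data set.
first6 <- data.frame(
  HighTemp = c(84.9, 87.1, 87.1, 82.9, 84.9, 75.0),
  WilliamsburgBridge = c(3845, 4173, 4924, 3684, 7308, 7302)
)

# Poisson GLM with one predictor (kept small because there are only six rows).
fit <- glm(WilliamsburgBridge ~ HighTemp, family = poisson(link = "log"),
           data = first6)

# Pearson dispersion statistic: sum of squared Pearson residuals / residual df.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion   # values much larger than 1 suggest overdispersion
```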
Condition 4: The log of the mean rate, log (λ), must be a linear function of x.
We will take a look at plots of the response variable against each of
the predictor variables to check this condition.
First, let’s look at the predictor variable of date vs our response
variable of WilliamsburgBridge.
plot(cycling$Date, cycling$WilliamsburgBridge, main = "Date vs. Williamsburg Bridge", xlab = "Date", ylab = "WilliamsburgBridge")

The scatterplot of these two variables shows a random distribution,
but it does not appear to follow a linear pattern. This could suggest a
possible violation of this condition due to the WilliamsburgBridge
variable not being a linear function of the Date predictor variable.
Next, let’s look at the predictor variable of high temperature vs our
response variable of WilliamsburgBridge.
plot(cycling$HighTemp, cycling$WilliamsburgBridge, main = "HighTemp vs. Williamsburg Bridge", xlab = "HighTemp", ylab = "WilliamsburgBridge")

The scatterplot of these two variables again shows a random
distribution, which does not appear to follow a distinctly linear
pattern. This could suggest a possible violation of this condition due
to the WilliamsburgBridge variable not being a linear function of the
HighTemp predictor variable.
Next, let’s look at the predictor variable of low temperature vs our
response variable of WilliamsburgBridge.
plot(cycling$LowTemp, cycling$WilliamsburgBridge, main = "LowTemp vs. Williamsburg Bridge", xlab = "LowTemp", ylab = "WilliamsburgBridge")

The scatterplot of these two variables again shows a random
distribution, which does not appear to follow a distinctly linear
pattern. This could suggest a possible violation of this condition due
to the WilliamsburgBridge variable not being a linear function of the
LowTemp predictor variable.
Lastly, let’s look at the predictor variable of precipitation vs our
response variable of WilliamsburgBridge.
plot(cycling$Precipitation, cycling$WilliamsburgBridge, main = "Precipitation vs. Williamsburg Bridge", xlab = "Precipitation", ylab = "WilliamsburgBridge")

The scatterplot of these two variables again shows a distribution
which does not appear to follow a distinctly linear pattern. The points
are mostly clustered around x = 0, with some outliers to the right. This
could suggest a possible violation of this condition due to the
WilliamsburgBridge variable not being a linear function of the
Precipitation predictor variable.
Overall, it seems that we do have some violations of the conditions
of a Poisson regression model: the variance of the response variable is
far larger than its mean, and the response variable does not appear to
follow a linear function of the predictor variables in our model.
We will still continue with building the Poisson regression models,
but it is important to keep in mind that these violations may mean that
the Poisson regression model is not the best model choice for this data
set, because some of the necessary conditions have not been met.
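Because Condition 4 concerns log(λ) rather than λ itself, one rough visual check is to plot the log of the observed counts against a predictor. A minimal sketch, using only the six rows printed in the Exploratory Data Analysis section (the names `first6` and `logCount` are hypothetical, and the full cycling data frame would be used in practice):

```r
# Illustration only: the six rows shown by head(cycling), not the full data set.
first6 <- data.frame(
  HighTemp = c(84.9, 87.1, 87.1, 82.9, 84.9, 75.0),
  WilliamsburgBridge = c(3845, 4173, 4924, 3684, 7308, 7302)
)

# Observed counts on the log scale, as a rough stand-in for log(lambda).
first6$logCount <- log(first6$WilliamsburgBridge)

plot(first6$HighTemp, first6$logCount,
     main = "HighTemp vs. log(Williamsburg Bridge)",
     xlab = "HighTemp", ylab = "log(WilliamsburgBridge)")

# Straight-line fit on the log scale; a clear curve around this line
# would hint at non-linearity.
abline(lm(logCount ~ HighTemp, data = first6))
```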
Poisson Regression Model on Frequency Counts
We will begin by creating a Poisson regression model of the frequency
counts. Specifically, this model will be on the frequency counts of
individuals on the Williamsburg Bridge for a given observation. Our goal
is to create a Poisson regression model whose predictors are
statistically significant for the number of individuals on the
Williamsburg Bridge for a given observation, based upon the various
factors in this data set.
# Poisson Regression Model of Counts
model.counts <- glm(WilliamsburgBridge ~ Date + HighTemp + LowTemp + Precipitation, family = poisson(link = "log"), data = cycling)
pois.count.coef = summary(model.counts)$coef
kable(pois.count.coef, caption = "Poisson Regression Model for the Counts of Cyclists \n on the Williamsburg Bridge")
Poisson Regression Model for the Counts of Cyclists on the Williamsburg Bridge
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -329.7412813 | 11.7142108 | -28.148826 | 0 |
| Date | 0.0078648 | 0.0002726 | 28.850990 | 0 |
| HighTemp | 0.0035901 | 0.0006334 | 5.667892 | 0 |
| LowTemp | 0.0075718 | 0.0009046 | 8.370178 | 0 |
| Precipitation | -0.3516535 | 0.0086431 | -40.685836 | 0 |
The regression equation for the Poisson regression model on the
frequency counts is given as:
log(μ) = -329.7413 + 0.0079 * Date + 0.0036 * HighTemp + 0.0076 *
LowTemp - 0.3517 * Precipitation
All four of the predictor variables, Date, HighTemp, LowTemp, and
Precipitation, have p-values of p < .001. This indicates that all of
the predictor variables in our model are statistically significant in
predicting the expected count of cyclists on the Williamsburg Bridge on
a given day.
The significance of these variables can likely be attributed to
adverse weather conditions, such as excessive heat or cold, along with
intense precipitation and storms, making those days less suitable for
outdoor activities such as cycling. These predictor variables all being
statistically significant suggests that the number of cyclists on the
Williamsburg Bridge differs from day to day with changes in temperature
and precipitation.
Overall, every predictor in this Poisson model of the frequency
counts was statistically significant for the expected log count of
cyclists on the Williamsburg Bridge for a given observation.
Regression Coefficients Interpretation
The Poisson regression model on frequency counts was found to have
the following regression equation:
log(μ) = -329.7413 + 0.0079 * Date + 0.0036 * HighTemp + 0.0076 *
LowTemp - 0.3517 * Precipitation
We will analyze the regression coefficients for the variables in
this Poisson regression model on frequency counts.
The value of the y-intercept is -329.7413. This represents the value
of log(μ), the log of the expected count, when all predictor variables
are equal to 0. However, the y-intercept does not have a practical
interpretation in this scenario, because a Date of 0 and temperatures of
0 degrees fall far outside the observed data, so we are not interested
in its meaning for the Poisson regression model.
Date: The regression coefficient of the Date variable in this
model is 0.0079. This means that the log of the expected count increases
by 0.0079 for every 1 day increase in the date on which the observation
was collected, holding all other variables constant.
HighTemp: The regression coefficient of the HighTemp variable in
this model is 0.0036. This means that the log of the expected count
increases by 0.0036 for every 1 degree Fahrenheit increase in the high
temperature for the given observation, holding all other variables
constant.
LowTemp: The regression coefficient of the LowTemp variable in
this model is 0.0076. This means that the log of the expected count
increases by 0.0076 for every 1 degree Fahrenheit increase in the low
temperature for the given observation, holding all other variables
constant.
Precipitation: The regression coefficient of the Precipitation
variable in this model is -0.3517. This means that the log of the
expected count decreases by 0.3517 for every 1 inch increase in the
amount of precipitation for the given observation, holding all other
variables constant.
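Because the model uses a log link, exponentiating a coefficient gives the multiplicative change in the expected count for a one-unit increase in that predictor. A minimal sketch using the estimates copied from the coefficient table for the counts model (the names `count_coef`, `rate_ratio`, and `percent_change` are hypothetical; with the fitted object, `exp(coef(model.counts))` should give the same values for the four predictors up to rounding):

```r
# Estimates copied from the coefficient table for the counts model.
count_coef <- c(Date = 0.0078648, HighTemp = 0.0035901,
                LowTemp = 0.0075718, Precipitation = -0.3516535)

# Rate ratio: multiplicative change in the expected count per one-unit increase.
rate_ratio <- exp(count_coef)

# The same information expressed as a percent change.
percent_change <- 100 * (rate_ratio - 1)
round(percent_change, 2)
```

For example, exp(-0.3517) is about 0.70, so each additional inch of precipitation is associated with roughly a 30% lower expected count, holding the other variables constant.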
Poisson Regression Model on Rates
Now, we will create a Poisson regression model of the rates at which
cyclists enter and leave via the Williamsburg Bridge, offset by the
total number of cyclists on all four of the major New York bridges: the
Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the
Queensboro Bridge. Unlike the previous model, which focused only on the
frequency counts of cyclists on the Williamsburg Bridge, this model
expresses the Williamsburg Bridge count for a given observation as a
rate out of the total number of cyclists on all four bridges for that
observation.
We will still use the WilliamsburgBridge variable as our response
variable, but we will include the log of the Total variable as an offset
so that the model describes the rate of cyclists on the Williamsburg
Bridge out of the total number of cyclists on all four of the bridges.
# Poisson Model of Rates
model.rates <- glm(WilliamsburgBridge ~ Date + HighTemp + LowTemp + Precipitation, offset = log(Total),
family = poisson(link = "log"), data = cycling)
kable(summary(model.rates)$coef, caption = "Poisson Regression Model of the Rates of Cyclists \n on the Williamsburg Bridge out of all Four Bridges")
Poisson Regression Model of the Rates of Cyclists on the Williamsburg Bridge out of all Four Bridges
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -50.4101583 | 12.0801410 | -4.172978 | 3.01e-05 |
| Date | 0.0011422 | 0.0002811 | 4.063496 | 4.83e-05 |
| HighTemp | -0.0050794 | 0.0006460 | -7.862790 | 0.00e+00 |
| LowTemp | 0.0092517 | 0.0009198 | 10.057847 | 0.00e+00 |
| Precipitation | 0.0356817 | 0.0078863 | 4.524499 | 6.10e-06 |
The regression equation for the Poisson regression model on the rates
is given as:
log(μ/t) = -50.4102 + 0.0011 * Date - 0.0051 * HighTemp + 0.0093 *
LowTemp + 0.0357 * Precipitation
All four of the predictor variables in this Poisson model, Date,
HighTemp, LowTemp, and Precipitation, have p-values of p < .001.
This indicates that all of the predictor variables in our model are
statistically significant in predicting the expected rate of cyclists on
the Williamsburg Bridge on a given day, that is, the expected count
relative to the total number of cyclists on all four of the major New
York bridges.
This model therefore shows statistically significant relationships
between each predictor and the rate, which suggests that it may provide
good utility for prediction and estimation.
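In `glm()`, an offset can be supplied either through the `offset` argument, as in the model above, or as an `offset()` term inside the formula; both fix the coefficient of log(Total) at 1. The sketch below checks this on the six rows printed in the Exploratory Data Analysis section, with a single predictor because six rows cannot support the full model (the names `first6`, `fit_arg`, and `fit_term` are hypothetical):

```r
# Illustration only: the six rows shown by head(cycling), not the full data set.
first6 <- data.frame(
  Precipitation = c(0.23, 0, 0.45, 0, 0, 0),
  WilliamsburgBridge = c(3845, 4173, 4924, 3684, 7308, 7302),
  Total = c(11867, 13995, 16067, 13925, 23110, 21861)
)

# Offset passed through the `offset` argument.
fit_arg <- glm(WilliamsburgBridge ~ Precipitation, offset = log(Total),
               family = poisson(link = "log"), data = first6)

# Offset written as a term in the model formula.
fit_term <- glm(WilliamsburgBridge ~ Precipitation + offset(log(Total)),
                family = poisson(link = "log"), data = first6)

# The two specifications should give the same coefficient estimates.
all.equal(coef(fit_arg), coef(fit_term))
```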
Regression Coefficients Interpretation
The value of the y-intercept is -50.4102. This represents the value
of log(μ/t), the log of the expected rate, when all predictor variables
are equal to 0. However, the y-intercept does not have a practical
interpretation in this scenario, so we are not interested in its meaning
for the Poisson regression model.
Date: The regression coefficient of the Date variable in this
model is 0.0011. This means that the log of the expected rate increases
by 0.0011 for every 1 day increase in the date on which the observation
was collected, holding all other variables constant.
HighTemp: The regression coefficient of the HighTemp variable in
this model is -0.0051. This means that the log of the expected rate
decreases by 0.0051 for every 1 degree Fahrenheit increase in the high
temperature for the given observation, holding all other variables
constant.
LowTemp: The regression coefficient of the LowTemp variable in
this model is 0.0093. This means that the log of the expected rate
increases by 0.0093 for every 1 degree Fahrenheit increase in the low
temperature for the given observation, holding all other variables
constant.
Precipitation: The regression coefficient of the Precipitation
variable in this model is 0.0357. This means that the log of the
expected rate increases by 0.0357 for every 1 inch increase in the
amount of precipitation for the given observation, holding all other
variables constant.
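To make the rate interpretation concrete, the fitted equation can be evaluated for the first observation in the data set (Date 42917, HighTemp 84.9, LowTemp 72.0, Precipitation 0.23). A minimal sketch using the estimates as printed in the coefficient table for the rates model, so the result is subject to rounding in those printed values (the names `rate_coef`, `day1`, `pred_share`, and `obs_share` are hypothetical):

```r
# Estimates copied from the coefficient table for the rates model.
rate_coef <- c(Intercept = -50.4101583, Date = 0.0011422,
               HighTemp = -0.0050794, LowTemp = 0.0092517,
               Precipitation = 0.0356817)

# Predictor values for the first observation (1 multiplies the intercept).
day1 <- c(Intercept = 1, Date = 42917, HighTemp = 84.9,
          LowTemp = 72.0, Precipitation = 0.23)

# Linear predictor on the log scale, then back-transform to a rate.
log_share <- sum(rate_coef * day1)
pred_share <- exp(log_share)

# Observed rate for the same day: WilliamsburgBridge / Total.
obs_share <- 3845 / 11867

round(c(predicted = pred_share, observed = obs_share), 3)
```

The predicted rate comes out at roughly 0.32, close to the observed rate of about 0.324 for that day, meaning about a third of the cyclists counted on the four bridges used the Williamsburg Bridge.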
Summary and Comparisons of the Two Models
Both of the Poisson regression models we created, the model for the
frequency counts and the model for the rates, had statistically
significant predictors and showed good utility overall. In both models,
we looked at the total number of cyclists on the Williamsburg Bridge in
New York for a specific observation, along with the date of the
observation and some factors which may affect the number of cyclists out
on that date. These factors included the high temperature, the low
temperature, and the amount of precipitation for that given date. All of
these factors were statistically significant in both of the Poisson
regression models, indicating that these weather related conditions have
a statistically significant association with both the counts and the
rates of cyclists on the Williamsburg Bridge for a given observation.
This can be attributed to certain weather conditions making it more or
less ideal for individuals to be cycling outdoors. For instance, a day
with incredibly high temperatures, incredibly cold temperatures, or
severe storms with heavy precipitation would be less ideal and likely
lead to fewer cyclists being out on that given day as opposed to a day
with pleasant weather.
Overall, both of the Poisson regression models showed statistical
significance and good utility in their prediction. However, as was
previously stated, there were some violations of the conditions for a
Poisson regression model within our data set. First, it was found that
the mean of the response variable, WilliamsburgBridge, was not equal to
its variance. This suggests that this response variable is in fact not
Poisson distributed, because it fails the condition that a Poisson
random variable has a mean equal to its variance. Additionally, all four
predictor variables were checked, and the response variable did not
appear to be a linear function of any of them. This indicates another
major violation of the conditions. These violations suggest that perhaps
a Poisson model was not the best model choice for this data set, and
that it is important to be mindful of these violations when using either
of the Poisson regression models we created for prediction.
Conclusion
Overall, two Poisson regression models were created in this project.
Both of these models looked at the total number of cyclists on the
Williamsburg Bridge for a given day. The first model looked at the
counts of cyclists that were on the Williamsburg Bridge, and the second
model looked at the rates of the cyclists that were on the Williamsburg
Bridge offset by the total number of cyclists on all four of the major
New York bridges.
Both of these Poisson regression models showed statistical
significance in their predictions, with all of the predictor variables
in both models having p-values of p < .001. This indicates that the
factors in the data set of Date, HighTemp, LowTemp, and Precipitation
are statistically significant in predicting the frequency counts and
the rates of the cyclists on the Williamsburg Bridge for a given
observation.
However, our data set failed to meet some of the necessary
conditions for a Poisson regression model. The mean of the response
variable, WilliamsburgBridge, was not equal to its variance. This
suggests that this response variable is in fact not Poisson distributed.
Also, all four predictor variables were checked, and the response
variable did not appear to be a linear function of any of them. These
are two violations of the assumptions for a Poisson regression model.
These violations mean that perhaps a Poisson regression model was not
the best choice for this data set, and that they should be kept in mind
when using either of these models for prediction.
Recommendations
Some recommendations I would suggest for further projects
include:
Look further into the violations that were found within this data
set and look into possible explanations for these violations of the
requirements of a Poisson regression model. Further consider whether the
Poisson regression model is in fact the best choice for this data set
and whether it is sufficient to use this model for prediction despite
these violations.
Consider other variables which may affect the number of cyclists
out on a given observation. Perhaps there are other factors which may
provide further significance for model building which may strengthen the
regression model.
Further expand the data set to ensure the accuracy of the
predictions and to further strengthen the Poisson regression
models.
