Introduction
For this project, we will create a Poisson regression model. The data
set records the daily total number of cyclists entering and leaving via
the Williamsburg Bridge in Brooklyn, New York. We will look at the
factors that may affect the number of cyclists on each day, such as the
weather conditions on that particular day. We will also create a
Quasi-Poisson regression model and analyze the dispersion of our
model.
Data Description
The data set in this project looks at the total number of cyclists on
the Williamsburg Bridge on a given day along with the weather conditions
of that day such as temperature and precipitation. This data set also
includes the total number of cyclists on all four of the major New York
bridges: the Brooklyn Bridge, the Manhattan Bridge, the Williamsburg
Bridge, and the Queensboro Bridge.
First, I obtained the data set for this assignment. I ran the code
given in the assignment description, and the data set I received was for
the Williamsburg Bridge, so that is what we will use for this Poisson
regression modeling project. When opening the downloaded data set, I
noticed that some values had not been stored properly. The Date and Day
variables had exactly the same values, and these values did not make
sense in this context. I replaced them with the values as they appeared
in the original data set, which was given under the tab data2 in the
w09-AssignDataSet.xlsx file. Now, the Date variable identifies the date
on which the observation occurred, and the Day variable represents the
day of the week on which the observation occurred. I also checked all of
the other variables to ensure their values had not been altered by the
formatting of the Excel file, and they all appeared to be intact.
The data set has been uploaded to GitHub and can now be read in
directly from the GitHub repository. We will call it “cycling”.
cycling <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/WilliamsburgBridge.csv", header = TRUE)
str(cycling)
'data.frame': 30 obs. of 7 variables:
$ Date : chr "4/1" "4/2" "4/3" "4/4" ...
$ Day : chr "Saturday" "Sunday" "Monday" "Tuesday" ...
$ HighTemp : num 46 62.1 63 51.1 63 48.9 48 55.9 66 73.9 ...
$ LowTemp : num 37 41 50 46 46 41 43 39.9 45 55 ...
$ Precipitation : num 0 0 0.03 1.18 0 0.73 0.21 0 0 0 ...
$ WilliamsburgBridge: int 1915 4207 5178 2279 5711 1739 3399 4082 4886 6881 ...
$ Total : int 5397 13033 16325 6581 17991 4896 10341 11610 14899 21295 ...
We will use this cycling data set to create two Poisson regression
models, one for the frequency counts of cyclists on the Williamsburg
Bridge on a given observation, and another for the rates of cyclists
entering and leaving via the Williamsburg Bridge offset by the total
number of cyclists on all of the major New York bridges.
Variables
There are 7 total variables in the cycling data set. These variables
include:
Date: This represents the date on which a given observation was
collected. This is the observation ID number. This variable is just for
identification purposes of the observations, not for actual prediction.
The date is given in the format of month/day.
Day: This is a character predictor variable which represents the
day of the week on which a given observation was collected. For
instance, Monday, Tuesday, etc.
HighTemp: A quantitative predictor variable representing the high
temperature on the given day, given in degrees Fahrenheit.
LowTemp: A quantitative predictor variable representing the low
temperature on the given day, given in degrees Fahrenheit.
Precipitation: A quantitative predictor variable representing the
amount of precipitation, rain, which occurred on the given day, given in
inches.
WilliamsburgBridge: A quantitative variable representing the
total number of cyclists on the Williamsburg Bridge on a given
observation. This will be our response variable for the Poisson
regression models.
Total: The total number of cyclists on all bridges on a given
observation. This variable will serve as the offset for our Poisson
regression model of the rates and for the Quasi-Poisson regression
model.
We also will create two new variables within our analysis later on to
use for the Poisson regression model building process. These two
variables include:
AvgTemp: A quantitative predictor variable representing the
average temperature for a given observation, given in degrees
Fahrenheit. This variable will be the average of HighTemp and LowTemp,
found by calculating (HighTemp + LowTemp)/2.
NewPrecip: A discretized version of the Precipitation variable.
This will be a binary predictor variable, where 0 represents a
precipitation value equal to 0 inches, and where 1 represents a
precipitation value greater than 0 inches.
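The two derived variables can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily the code used later in the analysis: the data frame name cycling_demo is hypothetical, and it holds only the first six observations shown in this report so that the example runs without downloading the full file.

```r
# First six observations, copied from the str() and head() output in this report.
# cycling_demo is a hypothetical name used only for this illustration.
cycling_demo <- data.frame(
  HighTemp      = c(46.0, 62.1, 63.0, 51.1, 63.0, 48.9),
  LowTemp       = c(37, 41, 50, 46, 46, 41),
  Precipitation = c(0.00, 0.00, 0.03, 1.18, 0.00, 0.73)
)

# AvgTemp: midpoint of the daily high and low temperature.
cycling_demo$AvgTemp <- (cycling_demo$HighTemp + cycling_demo$LowTemp) / 2

# NewPrecip: 1 if any precipitation was recorded that day, 0 otherwise.
cycling_demo$NewPrecip <- ifelse(cycling_demo$Precipitation > 0, 1, 0)

print(cycling_demo)
```

The same two assignments should work on the full cycling data frame, since they rely only on the HighTemp, LowTemp, and Precipitation columns.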
For the Poisson regression model for the frequency counts, the
Williamsburg Bridge variable will serve as the response variable. For
the Poisson regression model for the rates, the Williamsburg Bridge
variable will again serve as the response variable, and it will be
offset by the Total variable for this model.
Research Questions
The main goal for this project is to create a Poisson regression
model for both the frequency counts and the rates of the cyclists
entering and leaving Brooklyn, New York through the Williamsburg Bridge.
So, the focus for this project will be on creating two Poisson
regression models which can successfully predict the frequency counts
and the rates of the cyclists on the Williamsburg Bridge.
Some key questions for this project include:
Does the data set meet all of the necessary conditions required
for a Poisson regression model? If not, is there any potential
explanation for this discrepancy?
Can we create Poisson regression models which provide statistical
significance for predicting both the frequency counts and for the rates
of cyclists on the Williamsburg Bridge on a given day?
Is the Quasi-Poisson regression model a better choice than either
of the standard Poisson regression models for frequency counts and for
rates? How dispersed is this Quasi-Poisson regression model?
We will work on creating our Poisson regression models for both the
frequency counts and rates in order to see if we can in fact create
models which provide statistical significance in their predictive
ability. We will also create a Quasi-Poisson regression model and
find how dispersed it is. We will determine which of these models is
the ideal choice for our final regression model.
Exploratory Data Analysis
Let’s take a look at the first few entries within this cycling data
set for the Williamsburg Bridge.
kable(head(cycling), caption = "First Few Observations in the Data Set")
First Few Observations in the Data Set
Date | Day | HighTemp | LowTemp | Precipitation | WilliamsburgBridge | Total
4/1 | Saturday | 46.0 | 37 | 0.00 | 1915 | 5397
4/2 | Sunday | 62.1 | 41 | 0.00 | 4207 | 13033
4/3 | Monday | 63.0 | 50 | 0.03 | 5178 | 16325
4/4 | Tuesday | 51.1 | 46 | 1.18 | 2279 | 6581
4/5 | Wednesday | 63.0 | 46 | 0.00 | 5711 | 17991
4/6 | Thursday | 48.9 | 41 | 0.73 | 1739 | 4896
This data set includes various factors which may have an influence on
the number of individuals cycling, along with the date on which this
data was collected. Additionally, this data set includes variables for
both the number of cyclists on the Williamsburg Bridge on that given
day, along with the total number of cyclists on all bridges on that
given day.
First, let’s check if there are any missing values in our data
set.
colSums(is.na(cycling))
Date Day HighTemp LowTemp
0 0 0 0
Precipitation WilliamsburgBridge Total
0 0 0
It turns out that all of the variables in the data set have exactly
zero missing values. So, there are no missing values in our data set.
This is very good and means we can move on with further analyzing the
data set and the variables within it.
Checking the Variable Distributions
We have three predictor variables which we want to use in our final
model, Day, AvgTemp, and NewPrecip. Out of these variables, Day is a
categorical character variable, AvgTemp is a quantitative variable, and
NewPrecip is a binary variable. We will check that the distributions of
all of these variables appear to be random, without any noticeable
patterns or concerns which could cause issues with the model building
process.
First, let’s look at our Day variable. This is a categorical
character variable. To check that this variable is distributed without
any major concerns, we will look for potential imbalances. An imbalance
would occur if there were a substantially greater number of observations
occurring on one day of the week than on another.
table(cycling$Day)
Friday Monday Saturday Sunday Thursday Tuesday Wednesday
4 4 5 5 4 4 4
Each weekday from Monday to Friday has exactly four observations in
our data set, and the weekend days of Saturday and Sunday each have
exactly five. The observations are therefore distributed very evenly
across the days of the week, with the weekend days having only one more
observation each than the weekdays. There are no imbalances to be
concerned about for our Day predictor variable.
Next, let’s check the distribution of our AvgTemp variable. This is a
quantitative predictor variable and so we can check its distribution by
using a histogram. We will check to see if this variable has an overall
normal distribution without any notable skew or outliers.
ylimit = max(density(cycling$AvgTemp)$y)
hist(cycling$AvgTemp, probability = TRUE, main = "AvgTemp Distribution", xlab="AvgTemp",
col = "aliceblue", border="cornflowerblue")
lines(density(cycling$AvgTemp, adjust=2), col="darkorchid")

The histogram of the AvgTemp variable appears unimodal, with the
majority of the data centered at an average temperature between 55 and
60 degrees Fahrenheit. The distribution appears to be approximately
normal without any notable skew or outliers. There may be slightly more
entries on the left side of the histogram, but the difference is very
small and not enough to cause a noticeable skew in the distribution.
Overall, it appears safe to say that our AvgTemp variable follows an
approximately normal distribution, and so this variable is suitable for
use in our model building process.
Lastly, let’s check our NewPrecip variable. This is a binary
predictor variable, and so we expect it to take only the values 0 and
1. To check that it meets this criterion, we can look at a table of its
values and confirm that 0 and 1 are the only entries which appear.
table(cycling$NewPrecip)
0 1
18 12
As we can see, this NewPrecip variable has only two entries, 0 and 1.
This is exactly what we wanted to see because these are the only two
values which a binary variable can take. There are somewhat more days
with no precipitation (a value of 0, with 18 observations) than days
with precipitation (a value of 1, with 12 observations). However, this
difference is not large enough to be a cause for concern. And so, we can
conclude that there are no issues with our binary predictor variable of
NewPrecip, and that we can continue with the model building process.
Now, we have checked the distributions of all three of our predictor
variables, Day, AvgTemp, and NewPrecip, and ensured that there are not
any apparent issues with any of these variables or their distributions.
So, we can continue with using these predictor variables in our model
building process.
Assumptions and Conditions
Before we begin with building our model, we must check the
assumptions and conditions which are required for a Poisson regression
model.
There are four assumptions which must be met in order to create a
Poisson regression model. These assumptions include:
The response variable is a count described by a Poisson
distribution.
Observations are independent of one another.
The mean of the Poisson random variable is equal to the variance
of said Poisson random variable.
The log of the mean rate, log (λ), must be a linear function of
x.
We will go through and check whether each of these four conditions is
met by our cycling data set before beginning the model building process
for our Poisson regression model.
Condition 1: The response variable is a count described by a Poisson distribution.
The response variable in this data set is the WilliamsburgBridge
variable, representing the total number of cyclists on the Williamsburg
Bridge on a given observation. This variable is a count: it takes
non-negative whole-number values recording how many cyclists were
observed. So, the response variable meets the count part of this
assumption.
Condition 2: Observations are independent of one another.
Each observation was collected on a different date, and we will
assume that the conditions of one day did not affect the conditions of
another day. Under this assumption, the number of cyclists on the
Williamsburg Bridge for a given observation is independent of the number
for any other observation. So, we will treat the observations as
independent of one another.
Condition 3: The mean of the Poisson random variable is equal to the variance of said Poisson random variable.
In order for a variable to be a Poisson random variable, its mean
must be equal to its variance. We previously stated that the
WilliamsburgBridge variable will be our response variable. Therefore, we
must check that this variable meets the criteria for a Poisson random
variable, having a mean which is equal to its variance.
# Finding the mean.
mean <- mean(cycling$WilliamsburgBridge)
print(mean)
[1] 4942.267
The mean of the WilliamsburgBridge variable is 4,942.267. This means
that, averaged over the 30 observations, there were about 4,942
cyclists on the Williamsburg Bridge per day. A mean does not need to be
a whole number even though each individual count is one, so we simply
round it to the nearest whole number when reporting it.
Next, let’s find the variance of our response variable.
# Finding the variance.
variance <- var(cycling$WilliamsburgBridge)
print(variance)
[1] 3005665
The variance of the WilliamsburgBridge variable is 3,005,665. This is
far larger than the mean of 4,942.267, which indicates a violation of
one of the necessary conditions for a Poisson regression model. A
Poisson random variable has a mean equal to its variance, so this
result implies that our response variable does not behave like a
Poisson random variable.
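To put a number on how far apart the mean and variance are, one option is the variance-to-mean ratio, which would be close to 1 for a Poisson variable. The sketch below is illustrative only: the helper dispersion_ratio is a hypothetical name, the first calculation uses the mean and variance reported above, and the second uses only the first six counts shown in this report. Note that this is a rough check on the raw response; the Poisson assumption strictly concerns the variance given the predictors.

```r
# Variance-to-mean ratio; a value near 1 is what a Poisson variable would give.
# dispersion_ratio is a hypothetical helper written for this illustration.
dispersion_ratio <- function(x) var(x) / mean(x)

# Summary values reported above for WilliamsburgBridge (all 30 observations).
reported_mean <- 4942.267
reported_var  <- 3005665
reported_ratio <- reported_var / reported_mean
print(reported_ratio)  # roughly 608, far above 1

# The same check applied to the first six counts shown in this report.
first_six <- c(1915, 4207, 5178, 2279, 5711, 1739)
print(dispersion_ratio(first_six))
```

On the full data set, dispersion_ratio(cycling$WilliamsburgBridge) would reproduce the first ratio directly.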
Condition 4: The log of the mean rate, log (λ), must be a linear function of x.
We will take a look at the plot of the mean rate against the
predictor variables to check this condition.
Since our first predictor variable is Day, and this is a categorical
character variable, it cannot show a linear relationship in a
scatterplot. So, we will use the numerical predictor variable to check
this condition instead.
Let’s look at the predictor variable of average temperature vs our
response variable of WilliamsburgBridge. AvgTemp is a quantitative,
numeric variable so we can use it to check this condition.
plot(cycling$AvgTemp, cycling$WilliamsburgBridge, main = "AvgTemp vs. Williamsburg Bridge", xlab = "AvgTemp", ylab = "WilliamsburgBridge")

The scatterplot of AvgTemp against WilliamsburgBridge shows what
appears to be a linear relationship between these two variables. The
relationship is positive: as average temperature increases, so does the
number of cyclists on the Williamsburg Bridge. This seems logical, as
more people would want to go cycling on a warmer day than on a colder
day. The relationship is of moderate strength, but the linear pattern
can clearly be seen. Strictly, the condition concerns log(λ) rather than
the raw count, so this plot is only an informal check, but it gives no
indication that the condition is violated for AvgTemp.
Lastly, we have our predictor variable of NewPrecip. This is a binary
predictor variable, so a scatterplot would only show points at x = 0 and
x = 1. We therefore cannot assess linearity between NewPrecip and
WilliamsburgBridge from a plot, because NewPrecip cannot take any value
between 0 and 1.
Overall, we can consider this condition satisfied since our numerical
predictor variable of AvgTemp showed that it does indeed have a linear
relationship with our response variable.
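Because Condition 4 is stated in terms of log(λ), a slightly more direct informal check is to look at the log of the observed counts against AvgTemp. The sketch below is a minimal illustration using only the first six observations shown in this report; the vector names are hypothetical, and on the full data one would use cycling$AvgTemp and log(cycling$WilliamsburgBridge) instead.

```r
# First six observations from this report (hypothetical vector names).
high_temp    <- c(46.0, 62.1, 63.0, 51.1, 63.0, 48.9)
low_temp     <- c(37, 41, 50, 46, 46, 41)
bridge_count <- c(1915, 4207, 5178, 2279, 5711, 1739)

avg_temp  <- (high_temp + low_temp) / 2
log_count <- log(bridge_count)

# Correlation between AvgTemp and the log count; a value near +1 or -1
# is consistent with a roughly linear relationship on the log scale.
log_cor <- cor(avg_temp, log_count)
print(log_cor)

# Draw the scatterplot only in an interactive session.
if (interactive()) {
  plot(avg_temp, log_count,
       main = "AvgTemp vs. log(WilliamsburgBridge)",
       xlab = "AvgTemp", ylab = "log(WilliamsburgBridge)")
}
```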
Summary of Violations
Overall, we found one notable violation of the conditions of a
Poisson regression model within our data set. The response variable,
WilliamsburgBridge, does not meet the necessary criterion of a Poisson
random variable, because its mean is not equal to its variance. This is
a major concern, and it suggests that a Poisson regression model may not
be the best model choice for this data set after all.
We will still continue with building the Poisson regression models
for this project, but it is important to keep in mind that, because the
mean of the response variable does not equal its variance, the Poisson
regression model may not be the best model choice for this data set.
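Once a Poisson model has been fitted, the size of this violation can also be summarized with the Pearson dispersion estimate: the sum of squared Pearson residuals divided by the residual degrees of freedom. Values well above 1 point to overdispersion, and this is the same quantity that R reports as the dispersion parameter for a quasipoisson fit. The sketch below is a toy illustration only, fitted to the first six observations shown in this report with a reduced set of predictors; the names demo and demo_fit are hypothetical.

```r
# Toy data: first six observations shown in this report (hypothetical name demo).
demo <- data.frame(
  WilliamsburgBridge = c(1915, 4207, 5178, 2279, 5711, 1739),
  HighTemp           = c(46.0, 62.1, 63.0, 51.1, 63.0, 48.9),
  Precipitation      = c(0.00, 0.00, 0.03, 1.18, 0.00, 0.73)
)

# Small Poisson model, used only to demonstrate the calculation.
demo_fit <- glm(WilliamsburgBridge ~ HighTemp + Precipitation,
                family = poisson(link = "log"), data = demo)

# Pearson dispersion estimate: sum of squared Pearson residuals / residual df.
pearson_dispersion <- sum(residuals(demo_fit, type = "pearson")^2) /
  df.residual(demo_fit)
print(pearson_dispersion)
```

The same two lines can be applied to any fitted Poisson glm object in place of demo_fit.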
Poisson Regression Models on the Original Variables
First, we will revisit the two Poisson regression models which were
created in the previous week’s assignment, one on frequency counts and
one on rates, and present the corrected versions of these models. Since
that assignment did not alter any of the predictor variables, we will
first use the original predictor variables of Day, HighTemp, LowTemp,
and Precipitation. In later steps of this project, we will create models
using the new predictor variables of AvgTemp and NewPrecip.
Poisson Regression Model on Frequency Counts
We will begin by creating a Poisson regression model of the frequency
counts of cyclists on the Williamsburg Bridge for a given observation.
Our goal is to create a Poisson regression model whose predictors are
statistically significant in predicting the number of cyclists on the
Williamsburg Bridge for a given observation, based upon the various
factors in this data set.
# Poisson Regression Model of Counts
model.counts <- glm(WilliamsburgBridge ~ Day + HighTemp + LowTemp + Precipitation, family = poisson(link = "log"), data = cycling)
pois.count.coef = summary(model.counts)$coef
kable(pois.count.coef, caption = "Poisson Regression Model for the Counts of Cyclists \n on the Williamsburg Bridge")
Poisson Regression Model for the Counts of Cyclists on the Williamsburg Bridge
Term | Estimate | Std. Error | z value | Pr(>|z|)
(Intercept) | 7.6535762 | 0.0220219 | 347.544093 | 0.0000000
DayMonday | 0.0545822 | 0.0099489 | 5.486232 | 0.0000000
DaySaturday | -0.2743978 | 0.0101644 | -26.996022 | 0.0000000
DaySunday | -0.2345666 | 0.0097363 | -24.091886 | 0.0000000
DayThursday | 0.0319064 | 0.0102933 | 3.099730 | 0.0019370
DayTuesday | 0.1988133 | 0.0103531 | 19.203340 | 0.0000000
DayWednesday | 0.0511286 | 0.0100775 | 5.073556 | 0.0000004
HighTemp | 0.0170556 | 0.0005874 | 29.037387 | 0.0000000
LowTemp | -0.0023861 | 0.0007830 | -3.047372 | 0.0023085
Precipitation | -1.0320675 | 0.0165996 | -62.174240 | 0.0000000
The regression equation for the Poisson regression model on the
frequency counts is given as:
log(μ) = 7.6536 + 0.0546 * DayMonday - 0.2744 * DaySaturday
- 0.2346 * DaySunday + 0.0319 * DayThursday + 0.1988 * DayTuesday
+ 0.0511 * DayWednesday + 0.0171 * HighTemp - 0.0024 * LowTemp
- 1.0321 * Precipitation
All of the terms in the model, DayMonday, DaySaturday, DaySunday,
DayThursday, DayTuesday, DayWednesday, HighTemp, LowTemp, and
Precipitation, have p-values below 0.01, and all except DayThursday
(p = 0.0019) and LowTemp (p = 0.0023) have p < .001. This indicates
that all of the predictor variables in our model are statistically
significant in predicting the total expected counts of cyclists on the
Williamsburg Bridge on a given day.
The significance of these variables in predicting the expected counts
can likely be attributed to adverse weather conditions, such as
excessive heat or cold or heavy precipitation and storms, which make
days with poor conditions less suitable for outdoor activities such as
cycling. The fact that these predictor variables are all statistically
significant suggests that the number of cyclists on the Williamsburg
Bridge differs from day to day along with these changes in temperature
and precipitation.
Overall, this Poisson model of the frequency counts of the cyclists
on the Williamsburg Bridge showed statistical significance in its
prediction of the expected log counts for the number of cyclists on the
Williamsburg Bridge for a given observation.
For our categorical predictor variable of Day, Friday was chosen as
the baseline level, which can be seen from the absence of a “DayFriday”
term in the regression output. This is because, of the seven days,
Friday comes first alphabetically, and R uses the level which comes
first alphabetically as the baseline level by default. Therefore, the
regression coefficients for the different levels of the Day variable
are interpreted as comparisons against the baseline level of Friday.
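The default choice of baseline can be seen, and changed, with a short sketch. This is illustrative only: the vector below simply lists the seven day names, and choosing Monday as an alternative reference is an arbitrary example rather than something done in this analysis.

```r
# The seven day names as they appear in the Day variable.
day <- c("Saturday", "Sunday", "Monday", "Tuesday",
         "Wednesday", "Thursday", "Friday")

# factor() sorts the levels alphabetically; the first level is the baseline
# that glm() uses for the dummy variables.
day_factor <- factor(day)
print(levels(day_factor))

# relevel() moves a chosen level to the front, making it the new baseline.
# Monday is an arbitrary choice made for this example.
day_monday_ref <- relevel(day_factor, ref = "Monday")
print(levels(day_monday_ref))
```

Refitting a model with a releveled Day variable changes which day the coefficients are compared against, but not the fitted values.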
Regression Coefficients Interpretation
We will analyze the regression coefficients for the variables in
this Poisson regression model on frequency counts.
The value of the intercept is given as 7.6536. This represents the
baseline value of log(μ) when all predictor variables are equal to 0,
that is, on the baseline day of Friday. However, the intercept does not
have a practical interpretation in this scenario, so we are not
interested in its meaning for the Poisson regression model.
DayMonday (p < .001): The regression coefficient for the
variable DayMonday was found to be 0.0546. This means that the mean log
count of cyclists on the Williamsburg Bridge was 0.0546 greater on
Monday than on Friday. Equivalently, the expected count of cyclists on
Monday is exp(0.0546) = 1.0561 times the count on Friday, holding all
other variables constant.
DaySaturday (p < .001): The regression coefficient for the
variable DaySaturday was found to be -0.2744. This means that the mean
log count of cyclists on the Williamsburg Bridge was 0.2744 less on
Saturday than on Friday. Equivalently, the expected count of cyclists on
Saturday is exp(-0.2744) = 0.7600 times the count on Friday, holding all
other variables constant.
DaySunday (p < .001): The regression coefficient for the
variable DaySunday was found to be -0.2346. This means that the mean log
count of cyclists on the Williamsburg Bridge was 0.2346 less on Sunday
than on Friday. Equivalently, the expected count of cyclists on Sunday
is exp(-0.2346) = 0.7909 times the count on Friday, holding all other
variables constant.
DayThursday (p = 0.0019): The regression coefficient for the
variable DayThursday was found to be 0.0319. This means that the mean
log count of cyclists on the Williamsburg Bridge was 0.0319 greater on
Thursday than on Friday. Equivalently, the expected count of cyclists on
Thursday is exp(0.0319) = 1.0324 times the count on Friday, holding all
other variables constant.
DayTuesday (p < .001): The regression coefficient for the
variable DayTuesday was found to be 0.1988. This means that the mean log
count of cyclists on the Williamsburg Bridge was 0.1988 greater on
Tuesday than on Friday. Equivalently, the expected count of cyclists on
Tuesday is exp(0.1988) = 1.2199 times the count on Friday, holding all
other variables constant.
DayWednesday (p < .001): The regression coefficient for the
variable DayWednesday was found to be 0.0511. This means that the mean
log count of cyclists on the Williamsburg Bridge was 0.0511 greater on
Wednesday than on Friday. Equivalently, the expected count of cyclists
on Wednesday is exp(0.0511) = 1.0524 times the count on Friday, holding
all other variables constant.
HighTemp (p <.001): The regression coefficient of the HighTemp
variable in this model is 0.0171. This means that the mean log of the
counts increases by 0.0171 units for every 1 degree Fahrenheit increase
in the high temperature for the given observation, holding all other
variables constant.
LowTemp (p = 0.0023): The regression coefficient of the LowTemp
variable in this model is -0.0024. This means that the mean log of the
counts decreases by 0.0024 units for every 1 degree Fahrenheit increase
in the low temperature for the given observation, holding all other
variables constant.
Precipitation (p < .001): The regression coefficient of the
Precipitation variable in this model is -1.0321. This means that the
mean log of the counts decreases by 1.0321 units for every 1 inch
increase in the amount of precipitation for the given observation,
holding all other variables constant.
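The multiplicative interpretations above come from exponentiating the coefficients. As a minimal sketch, the estimates reported in the table for this model can be entered by hand and converted to rate ratios and percent changes; the vector name count_coef is hypothetical. With the fitted model object available, exp(coef(model.counts)) would give the same ratios.

```r
# Estimates copied from the coefficient table of the counts model above
# (intercept omitted). count_coef is a hypothetical name for this example.
count_coef <- c(
  DayMonday     =  0.0545822,
  DaySaturday   = -0.2743978,
  DaySunday     = -0.2345666,
  DayThursday   =  0.0319064,
  DayTuesday    =  0.1988133,
  DayWednesday  =  0.0511286,
  HighTemp      =  0.0170556,
  LowTemp       = -0.0023861,
  Precipitation = -1.0320675
)

# exp(coefficient) is the multiplicative change in the expected count for a
# one-unit increase in the predictor (or versus Friday for the Day terms).
rate_ratio <- exp(count_coef)

# The same effect expressed as a percent change in the expected count.
percent_change <- 100 * (rate_ratio - 1)

print(round(cbind(rate_ratio, percent_change), 4))
```

For example, the HighTemp ratio is slightly above 1, which corresponds to an increase of a little under 2 percent in the expected count per additional degree Fahrenheit.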
Poisson Regression Model on Rates
Now, we will create a Poisson regression model of the rates at which
cyclists enter and leave via the Williamsburg Bridge. Unlike the
previous model, which focused only on the frequency counts of cyclists
on the Williamsburg Bridge, this model also accounts for the total
number of cyclists on all four of the major New York bridges: the
Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the
Queensboro Bridge. It models the number of cyclists on the Williamsburg
Bridge for a given observation as a rate out of the total number of
cyclists on all four bridges for that observation.
To build this model, we will still use the WilliamsburgBridge
variable as our response variable, but we will include the log of the
Total variable as an offset.
# Poisson Model of Rates
model.rates <- glm(WilliamsburgBridge ~ Day + HighTemp + LowTemp + Precipitation, offset = log(Total),
family = poisson(link = "log"), data = cycling)
kable(summary(model.rates)$coef, caption = "Poisson Regression Model of the Rates of Cyclists \n on the Williamsburg Bridge out of all Four Bridges")
Poisson Regression Model of the Rates of Cyclists on the Williamsburg Bridge out of all Four Bridges
Term | Estimate | Std. Error | z value | Pr(>|z|)
(Intercept) | -1.0682224 | 0.0223058 | -47.8899346 | 0.0000000
DayMonday | 0.0003873 | 0.0099397 | 0.0389685 | 0.9689155
DaySaturday | 0.0375055 | 0.0100984 | 3.7140004 | 0.0002040
DaySunday | 0.0051455 | 0.0097487 | 0.5278112 | 0.5976304
DayThursday | 0.0205573 | 0.0102839 | 1.9989823 | 0.0456103
DayTuesday | 0.0138077 | 0.0103909 | 1.3288272 | 0.1839050
DayWednesday | 0.0233018 | 0.0100672 | 2.3146274 | 0.0206333
HighTemp | -0.0011895 | 0.0005867 | -2.0272588 | 0.0426359
LowTemp | 0.0003500 | 0.0007846 | 0.4460900 | 0.6555322
Precipitation | 0.0505341 | 0.0161127 | 3.1362935 | 0.0017110
The regression equation for the Poisson regression model on the rates
is given as:
log(μ/t) = -1.0682 + 0.0004 * DayMonday + 0.0375 * DaySaturday +
0.0051 * DaySunday + 0.0206 * DayThursday + 0.0138 * DayTuesday + 0.0233
* DayWednesday - 0.0012 * HighTemp + 0.0004 * LowTemp + 0.0505 *
Precipitation
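The left-hand side log(μ/t) follows from the offset: if log(μ/t) equals the linear predictor, then log(μ) = log(t) + linear predictor, so log(t) enters the model with its coefficient fixed at 1. In R this can be written either with the offset argument, as in the model above, or with an offset() term in the formula. The sketch below is a toy illustration that checks the two forms agree; it uses only the first six observations shown in this report and a reduced set of predictors, and the names demo, fit_arg, and fit_term are hypothetical.

```r
# Toy data: first six observations shown in this report (hypothetical name demo).
demo <- data.frame(
  WilliamsburgBridge = c(1915, 4207, 5178, 2279, 5711, 1739),
  Total              = c(5397, 13033, 16325, 6581, 17991, 4896),
  HighTemp           = c(46.0, 62.1, 63.0, 51.1, 63.0, 48.9),
  Precipitation      = c(0.00, 0.00, 0.03, 1.18, 0.00, 0.73)
)

# Form 1: offset supplied through the offset argument.
fit_arg <- glm(WilliamsburgBridge ~ HighTemp + Precipitation,
               offset = log(Total),
               family = poisson(link = "log"), data = demo)

# Form 2: offset supplied as a term in the formula.
fit_term <- glm(WilliamsburgBridge ~ HighTemp + Precipitation + offset(log(Total)),
                family = poisson(link = "log"), data = demo)

# Both forms fix the coefficient of log(Total) at 1, so the estimates match.
same_fit <- isTRUE(all.equal(coef(fit_arg), coef(fit_term)))
print(same_fit)
```

One practical difference is that predict() with new data handles the formula version of the offset more reliably, which can matter if the model is later used for prediction.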
Unlike in the counts model, only some of the terms in this Poisson
model are statistically significant at the 0.05 level. DaySaturday
(p = 0.0002), DayThursday (p = 0.0456), DayWednesday (p = 0.0206),
HighTemp (p = 0.0426), and Precipitation (p = 0.0017) have p-values
below 0.05, while DayMonday (p = 0.9689), DaySunday (p = 0.5976),
DayTuesday (p = 0.1839), and LowTemp (p = 0.6555) do not.
The estimated coefficients are also all close to 0. This indicates
that, once the total number of cyclists on all four of the major New
York bridges is accounted for through the offset, the rate of cyclists
on the Williamsburg Bridge changes only slightly with the day of the
week and the weather. So, this model for the rates shows statistical
significance for several of its predictors, but the effects it
estimates are small.
As stated for the Poisson regression model on frequency counts,
Friday was chosen by R to be the baseline level of the Day variable,
and so we will compare the regression coefficients against this
baseline level.
Regression Coefficients Interpretation
We will analysis the regression coefficients for the variables in
this Poisson regression model on rates.
The value of the y-intercept is given as -1.0682. This represents
the baseline of the mean of the log counts multiplied by t, when all
predictor variables are equal to 0. However, the y-intercept does not
have a practical interpretation or meaning in this scenario so we are
not interested in its meaning for the Poisson regression model.
DayMonday (p = 0.9689): The regression coefficient for the
variable DayMonday was found to be 0.0004. This means that the mean log
count of cyclists on the Williamsburgs Bridge was 0.0004 greater on
Monday than on Friday. We can also say this means that the count of
cyclists is 1.0056 times greater on Monday than on Friday, holding all
other variables constant.
DaySaturday (p < .001): The regression coefficient for the
variable DaySaturday was found to be 0.0375. This means that the log of
the expected rate of cyclists on the Williamsburg Bridge was 0.0375
greater on Saturday than on Friday. We can also say this means that the
rate of cyclists on Saturday is 1.0382 times the rate on Friday, holding
all other variables constant.
DaySunday (p = 0.5976): The regression coefficient for the
variable DaySunday was found to be 0.0051. This means that the log of
the expected rate of cyclists on the Williamsburg Bridge was 0.0051
greater on Sunday than on Friday. We can also say this means that the
rate of cyclists on Sunday is 1.0051 times the rate on Friday, holding
all other variables constant.
DayThursday (p = 0.0456): The regression coefficient for the
variable DayThursday was found to be 0.0206. This means that the log of
the expected rate of cyclists on the Williamsburg Bridge was 0.0206
greater on Thursday than on Friday. We can also say this means that the
rate of cyclists on Thursday is 1.0208 times the rate on Friday, holding
all other variables constant.
DayTuesday (p = 0.1839): The regression coefficient for the
variable DayTuesday was found to be 0.0138. This means that the log of
the expected rate of cyclists on the Williamsburg Bridge was 0.0138
greater on Tuesday than on Friday. We can also say this means that the
rate of cyclists on Tuesday is 1.0139 times the rate on Friday, holding
all other variables constant.
DayWednesday (p = 0.0206): The regression coefficient for the
variable DayWednesday was found to be 0.0233. This means that the log of
the expected rate of cyclists on the Williamsburg Bridge was 0.0233
greater on Wednesday than on Friday. We can also say this means that the
rate of cyclists on Wednesday is 1.0236 times the rate on Friday,
holding all other variables constant.
HighTemp (p = 0.0426): The regression coefficient of the HighTemp
variable in this model is -0.0012. This means that the log of the
expected rate, log(μ/t), decreases by 0.0012 units for every 1 degree
Fahrenheit increase in the high temperature for the given observation,
holding all other variables constant.
LowTemp (p = 0.6555): The regression coefficient of the LowTemp
variable in this model is 0.0004. This means that the log of the
expected rate increases by 0.0004 units for every 1 degree Fahrenheit
increase in the low temperature for the given observation, holding all
other variables constant.
Precipitation (p = 0.0017): The regression coefficient of the
Precipitation variable in this model is 0.0505. This means that the log
of the expected rate increases by 0.0505 units for every 1 inch
increase in the amount of precipitation for the given observation,
holding all other variables constant.
Summary and Comparisons of the Two Models
Both Poisson regression models we created, the model for the
frequency counts and the model for the rates, provided statistical
significance for prediction and showed good utility overall. In both
models, we looked into the total number of cyclists on the Williamsburg
Bridge in New York for a specific observation, along with the day of
the observation and some factors which may affect the total number of
cyclists out on that date: the high temperature, the low temperature,
and the amount of precipitation. These weather factors were
statistically significant in the model for the frequency counts. In the
model for the rates, HighTemp (p = 0.0426) and Precipitation
(p = 0.0017) were statistically significant, while LowTemp (p = 0.6555)
was not. This indicates that weather conditions have a statistically
significant impact on both the counts and the rates of cyclists out on
the Williamsburg Bridge for a given observation. This can be attributed
to certain weather conditions making it more or less ideal for
individuals to be cycling outdoors. For instance, a day with extremely
high temperatures, extremely low temperatures, or severe storms with
heavy precipitation would be less ideal and would likely lead to fewer
cyclists being out on that day than a day with pleasant weather.
Overall, both of the Poisson regression models showed statistical
significance and good utility in their prediction. However, as was
previously stated, there were some violations of the conditions for a
Poisson regression model within our data set. First, it was found that
the mean of the response variable, WilliamsburgBridge, was not equal to
its variance. This suggests that the response variable is not Poisson
distributed, since a Poisson random variable has a mean equal to its
variance.
Additionally, all four predictor variables were checked, and it was
found that the response variable was not a linear function of any of
these predictor variables. This is another major violation within this
data set. These violations suggest that a Poisson model may not have
been the best model choice for this data set, and it is important to be
mindful of them when using either of the Poisson regression models we
created for prediction.
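The comparison of the mean and variance described above can be sketched in a few lines. This is only an illustration under stated assumptions: the vector `counts` is simulated stand-in data, not the bridge data, and with the real data set it would be replaced by `cycling$WilliamsburgBridge`.

```r
# Stand-in data: simulated over-dispersed daily counts (NOT the bridge data)
set.seed(1)
counts <- rnbinom(30, mu = 6000, size = 20)

# For a Poisson random variable the ratio below should be close to 1
mean.count <- mean(counts)
var.count  <- var(counts)
ratio <- var.count / mean.count
c(mean = mean.count, variance = var.count, ratio = ratio)
```

A ratio far above 1 suggests that the variance exceeds the mean, which is the pattern reported for the response variable in this project.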
Poisson Regression Model on Frequency Counts
We will begin by creating a Poisson regression model of the
frequency counts using the new variables we created for this project.
Specifically, this model will be on the frequency counts of cyclists
on the Williamsburg Bridge for a given observation. Our goal is to
create a Poisson regression model which can predict, with statistical
significance, the count of cyclists on the Williamsburg Bridge for a
given observation, based upon the various factors in this data set.
We will now fit this Poisson regression model on the frequency
counts.
# Poisson Regression Model of Counts
model.counts <- glm(WilliamsburgBridge ~ Day + AvgTemp + NewPrecip,
family = poisson(link = "log"), data = cycling)
pois.count.coef = summary(model.counts)$coef
kable(pois.count.coef, caption = "Poisson Regression Model for the Counts of Cyclists \n on the Williamsburg Bridge")
Poisson Regression Model for the Counts of Cyclists on the Williamsburg Bridge
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 7.2024973 | 0.0209663 | 343.527497 | 0.0000000 |
| DayMonday | 0.0301551 | 0.0095684 | 3.151542 | 0.0016241 |
| DaySaturday | -0.1944899 | 0.0100224 | -19.405597 | 0.0000000 |
| DaySunday | -0.2450724 | 0.0097649 | -25.097317 | 0.0000000 |
| DayThursday | -0.0298802 | 0.0101579 | -2.941565 | 0.0032656 |
| DayTuesday | -0.0604242 | 0.0100400 | -6.018322 | 0.0000000 |
| DayWednesday | 0.1714460 | 0.0101413 | 16.905694 | 0.0000000 |
| AvgTemp | 0.0253493 | 0.0003233 | 78.418600 | 0.0000000 |
| NewPrecip | -0.3407990 | 0.0063666 | -53.528941 | 0.0000000 |
The regression equation for the Poisson regression model on the
frequency counts is given as:
log(μ) = 7.2025 + 0.0302 * DayMonday - 0.1945 * DaySaturday - 0.2451
* DaySunday - 0.0299 * DayThursday - 0.0604 * DayTuesday + 0.1714 *
DayWednesday + 0.0253 * AvgTemp - 0.3408 * NewPrecip
The predictor variables Day, AvgTemp, and NewPrecip all have p-values
below the alpha value of 0.05. Every coefficient has p < .001 except
DayMonday (p = 0.0016) and DayThursday (p = 0.0033). This indicates
that all of the predictor variables in our model are statistically
significant in predicting the total expected counts of cyclists on the
Williamsburg Bridge on a given day.
For our categorical predictor variable of Day, Friday was chosen as
the baseline level, which can be seen from the absence of a “DayFriday”
variable in the regression output. This is because, of the seven days,
Friday comes first alphabetically, and R chooses the level which comes
first alphabetically as the baseline level. Therefore, in our
regression coefficient interpretations, the different levels of the Day
variable will be compared against the baseline level of Friday.
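The alphabetical default, and one way to pick a different baseline, can be illustrated with base R's relevel(). This is a small sketch on a toy vector; the names `day` and `day.monday` are illustrative and the original analysis keeps Friday as the baseline.

```r
# Toy vector of day names (illustrative only)
day <- factor(c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday"))

# Default: levels are sorted alphabetically, so Friday is the baseline
levels(day)[1]

# Make Monday the baseline instead
day.monday <- relevel(day, ref = "Monday")
levels(day.monday)[1]
```

Refitting a model with a releveled factor changes which day the coefficients are compared against, but not the fitted values.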
Regression Coefficients Interpretation
We will analyze the regression coefficients for the variables in
this Poisson regression model on frequency counts.
The value of the y-intercept is given as 7.2025. This represents
the baseline value of log(μ) on Friday when all other predictor
variables are equal to 0. However, the y-intercept does not have a
practical interpretation in this scenario, so we are not interested in
its meaning for the Poisson regression model.
DayMonday (p = 0.0016): The regression coefficient of the
DayMonday variable in this model is 0.0302. This means that the mean log
count of cyclists on the Williamsburg Bridge was 0.0302 greater on
Monday than on Friday. We can also say this means that the count of
cyclists on Monday is 1.0307 times the count on Friday, holding all
other variables constant.
DaySaturday (p < .001): The regression coefficient of the
DaySaturday variable in this model is -0.1945. This means that the mean
log count of cyclists on the Williamsburg Bridge was 0.1945 less on
Saturday than on Friday. We can also say this means that the count of
cyclists on Saturday is 0.8232 times the count on Friday, holding all
other variables constant.
DaySunday (p < .001): The regression coefficient of the
DaySunday variable in this model is -0.2451. This means that the mean
log count of cyclists on the Williamsburg Bridge was 0.2451 less on
Sunday than on Friday. We can also say this means that the count of
cyclists on Sunday is 0.7826 times the count on Friday, holding all
other variables constant.
DayThursday (p = 0.0033): The regression coefficient of the
DayThursday variable in this model is -0.0299. This means that the mean
log count of cyclists on the Williamsburg Bridge was 0.0299 less on
Thursday than on Friday. We can also say this means that the count of
cyclists on Thursday is 0.9705 times the count on Friday, holding all
other variables constant.
DayTuesday (p < .001): The regression coefficient of the
DayTuesday variable in this model is -0.0604. This means that the mean
log count of cyclists on the Williamsburg Bridge was 0.0604 less on
Tuesday than on Friday. We can also say this means that the count of
cyclists on Tuesday is 0.9414 times the count on Friday, holding all
other variables constant.
DayWednesday (p < .001): The regression coefficient of the
DayWednesday variable in this model is 0.1714. This means that the mean
log count of cyclists on the Williamsburg Bridge was 0.1714 greater on
Wednesday than on Friday. We can also say this means that the count of
cyclists on Wednesday is 1.1870 times the count on Friday, holding all
other variables constant.
AvgTemp (p < .001): The regression coefficient of the AvgTemp
variable in this model is 0.0253. This means that the mean log of the
counts increases by 0.0253 units for every 1 degree Fahrenheit increase
in the average temperature for the given observation, holding all other
variables constant.
NewPrecip (p < .001): The regression coefficient of the
NewPrecip variable in this model is -0.3408. This means that the mean
log of the count of cyclists on the Williamsburg Bridge is 0.3408 less
on days where there is precipitation than on days where there is no
precipitation. We can also say that the count of cyclists on days with
precipitation is 0.7112 times the count on days with no precipitation,
holding all other variables constant.
All of the predictor variables in this Poisson regression model on
frequency counts were statistically significant, with all of their
p-values falling below the alpha value of 0.05.
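The multiplicative interpretations above come from exponentiating each coefficient. As a hedged sketch, the estimates from the counts model table can be re-entered by hand and converted to ratios and percent changes; the names `count.coef`, `count.ratio`, and `pct.change` are illustrative. With the fitted object available, `exp(coef(model.counts))` should give the same ratios.

```r
# Estimates copied from the counts model table above
count.coef <- c(DayMonday = 0.0301551, DaySaturday = -0.1944899,
                DaySunday = -0.2450724, DayThursday = -0.0298802,
                DayTuesday = -0.0604242, DayWednesday = 0.1714460,
                AvgTemp = 0.0253493, NewPrecip = -0.3407990)

# Ratio of expected counts relative to the baseline (or per one-unit increase)
count.ratio <- exp(count.coef)

# Same information expressed as a percent change
pct.change <- 100 * (count.ratio - 1)
round(cbind(ratio = count.ratio, pct.change = pct.change), 2)
```

Because the table estimates carry more decimal places than the rounded coefficients quoted in the text, the last digit of a ratio can differ slightly from the values reported above.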
Poisson Regression Model on Rates
Now, we will create a Poisson regression model of the rates at which
cyclists enter and leave via the Williamsburg Bridge offset by the total
number of cyclists on all four of the major New York bridges, using the
new variables we created in this project.
This model, unlike the previous model which just focused on the
frequency counts of cyclists on the Williamsburg Bridge, will also
account for the total number of cyclists on all four of the major New
York bridges, the Brooklyn Bridge, the Manhattan Bridge, the
Williamsburg Bridge, and the Queensboro Bridge. This Poisson model will
look at the rates of the number of cyclists on the Williamsburg Bridge
for a given observation as a rate out of the total number of cyclists on
all four of these major bridges for that specific observation.
We will build our Poisson regression model for the rates. This time,
we will still use the WilliamsburgBridge variable as our response
variable, but we will offset the model by the Total variable to make our
Poisson model for the rates of cyclists on the Williamsburg Bridge out
of the total number of cyclists on all four of the bridges.
# Poisson Model of Rates
model.rates <- glm(WilliamsburgBridge ~ Day + AvgTemp + NewPrecip,
offset = log(Total),
family = poisson(link = "log"), data = cycling)
kable(summary(model.rates)$coef, caption = "Poisson Regression Model of the Rates of Cyclists \n on the Williamsburg Bridge out of all Four Bridges")
Poisson Regression Model of the Rates of Cyclists on the Williamsburg Bridge out of all Four Bridges
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.0435841 | 0.0210730 | -49.5222689 | 0.0000000 |
| DayMonday | 0.0018152 | 0.0095668 | 0.1897380 | 0.8495145 |
| DaySaturday | 0.0345382 | 0.0100162 | 3.4482277 | 0.0005643 |
| DaySunday | 0.0034566 | 0.0098102 | 0.3523466 | 0.7245783 |
| DayThursday | 0.0236982 | 0.0101422 | 2.3365925 | 0.0194604 |
| DayTuesday | 0.0252916 | 0.0101088 | 2.5019260 | 0.0123520 |
| DayWednesday | 0.0183518 | 0.0101291 | 1.8117867 | 0.0700192 |
| AvgTemp | -0.0014673 | 0.0003301 | -4.4449423 | 0.0000088 |
| NewPrecip | 0.0136632 | 0.0063769 | 2.1426111 | 0.0321443 |
The regression equation for the Poisson regression model on the rates
is given as:
log(μ/t) = -1.0436 + 0.0018 * DayMonday + 0.0345 * DaySaturday +
0.0035 * DaySunday + 0.0237 * DayThursday + 0.0253 * DayTuesday + 0.0184
* DayWednesday - 0.0015 * AvgTemp + 0.0137 * NewPrecip
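The role of the offset can be made explicit by rearranging this equation. Writing t for the Total variable and η for the linear predictor on the right-hand side (the symbol η is introduced here only for illustration), the rates model is an ordinary Poisson model for the counts in which log(t) enters as a term whose coefficient is fixed at 1:

```latex
\begin{aligned}
\log\left(\frac{\mu}{t}\right) &= \eta \\
\log(\mu) - \log(t) &= \eta \\
\log(\mu) &= \log(t) + \eta \\
\mu &= t \, e^{\eta}
\end{aligned}
```

This is why `offset = log(Total)` is supplied to glm() rather than adding Total as an ordinary predictor with its own estimated coefficient.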
The variables DaySaturday, DayThursday, DayTuesday, AvgTemp, and
NewPrecip all had p-values less than the alpha value of 0.05, meaning
that these are the variables which are statistically significant in this
model.
As stated for the Poisson regression model on frequency counts,
Friday was chosen by R as the baseline level of the Day variable,
so we will compare the regression coefficients against this baseline
level.
Regression Coefficients Interpretation
We will analyze the regression coefficients for the variables in
this Poisson regression model on rates.
The value of the y-intercept is given as -1.0436. This represents
the log of the expected rate, log(μ/t), on the baseline day of Friday
when all other predictor variables are equal to 0. However, the
y-intercept does not have a practical interpretation in this scenario,
so we are not interested in its meaning for the Poisson regression
model.
DayMonday (p = 0.8495): The regression coefficient of the
DayMonday variable in this model is 0.0018. This means that the log of
the expected rate of cyclists on the Williamsburg Bridge was 0.0018
greater on Monday than on Friday. We can also say this means that the
rate of cyclists on Monday is 1.0018 times the rate on Friday, holding
all other variables constant.
DaySaturday (p < .001): The regression coefficient of the
DaySaturday variable in this model is 0.0345. This means that the log of
the expected rate of cyclists on the Williamsburg Bridge was 0.0345
greater on Saturday than on Friday. We can also say this means that the
rate of cyclists on Saturday is 1.0351 times the rate on Friday, holding
all other variables constant.
DaySunday (p = 0.7246): The regression coefficient of the
DaySunday variable in this model is 0.0035. This means that the log of
the expected rate of cyclists on the Williamsburg Bridge was 0.0035
greater on Sunday than on Friday. We can also say this means that the
rate of cyclists on Sunday is 1.0035 times the rate on Friday, holding
all other variables constant.
DayThursday (p = 0.0195): The regression coefficient of the
DayThursday variable in this model is 0.0237. This means that the log of
the expected rate of cyclists on the Williamsburg Bridge was 0.0237
greater on Thursday than on Friday. We can also say this means that the
rate of cyclists on Thursday is 1.0240 times the rate on Friday, holding
all other variables constant.
DayTuesday (p = 0.0124): The regression coefficient of the
DayTuesday variable in this model is 0.0253. This means that the log of
the expected rate of cyclists on the Williamsburg Bridge was 0.0253
greater on Tuesday than on Friday. We can also say this means that the
rate of cyclists on Tuesday is 1.0256 times the rate on Friday, holding
all other variables constant.
DayWednesday (p = 0.0700): The regression coefficient of the
DayWednesday variable in this model is 0.0184. This means that the log
of the expected rate of cyclists on the Williamsburg Bridge was 0.0184
greater on Wednesday than on Friday. We can also say this means that the
rate of cyclists on Wednesday is 1.0186 times the rate on Friday,
holding all other variables constant.
AvgTemp (p < .001): The regression coefficient of the AvgTemp
variable in this model is -0.0015. This means that the log of the
expected rate decreases by 0.0015 units for every 1 degree Fahrenheit
increase in the average temperature for the given observation, holding
all other variables constant.
NewPrecip (p = 0.0321): The regression coefficient of the
NewPrecip variable in this model is 0.0137. This means that the log of
the expected rate of cyclists on the Williamsburg Bridge is 0.0137
greater on days where there is precipitation than on days where there is
no precipitation. We can also say that the rate of cyclists on days with
precipitation is 1.0138 times the rate on days with no precipitation,
holding all other variables constant.
Out of all of the predictor variables, the ones which showed
statistical significance were DaySaturday (p < .001), DayThursday
(p = 0.0195), DayTuesday (p = 0.0124), AvgTemp (p < .001), and NewPrecip
(p = 0.0321). All of these predictor variables have p-values less than
the alpha value of 0.05, indicating they are statistically significant
to the model.
The variables of DayMonday (p = .8495), DaySunday (p = 0.7246), and
DayWednesday (p = 0.0700) did not show statistical significance as they
have p-values greater than the alpha value of 0.05, indicating they are
not statistically significant to the model.
Quasi-Poisson Regression Model
Next, we will create a Quasi-Poisson regression model. This
Quasi-Poisson regression model will be fit on the rates, so it will be
offset by the Total variable while still using WilliamsburgBridge as
its response variable.
# Quasi-Poisson Regression Model
quasimodel.rates <- glm(WilliamsburgBridge ~ Day + AvgTemp + NewPrecip,
offset = log(Total),
family = quasipoisson, data = cycling)
summary(quasimodel.rates)
Call:
glm(formula = WilliamsburgBridge ~ Day + AvgTemp + NewPrecip,
family = quasipoisson, data = cycling, offset = log(Total))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.043584 0.043411 -24.040 <2e-16 ***
DayMonday 0.001815 0.019708 0.092 0.9275
DaySaturday 0.034538 0.020633 1.674 0.1090
DaySunday 0.003457 0.020209 0.171 0.8658
DayThursday 0.023698 0.020893 1.134 0.2695
DayTuesday 0.025292 0.020824 1.215 0.2380
DayWednesday 0.018352 0.020866 0.880 0.3891
AvgTemp -0.001467 0.000680 -2.158 0.0427 *
NewPrecip 0.013663 0.013137 1.040 0.3101
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 4.243634)
Null deviance: 151.051 on 29 degrees of freedom
Residual deviance: 89.094 on 21 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 3
pander(summary(quasimodel.rates)$coef, caption = "Quasi-Poisson Regression Model")
Quasi-Poisson Regression Model
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -1.044 | 0.04341 | -24.04 | 9.198e-17 |
| DayMonday | 0.001815 | 0.01971 | 0.09211 | 0.9275 |
| DaySaturday | 0.03454 | 0.02063 | 1.674 | 0.109 |
| DaySunday | 0.003457 | 0.02021 | 0.171 | 0.8658 |
| DayThursday | 0.0237 | 0.02089 | 1.134 | 0.2695 |
| DayTuesday | 0.02529 | 0.02082 | 1.215 | 0.238 |
| DayWednesday | 0.01835 | 0.02087 | 0.8795 | 0.3891 |
| AvgTemp | -0.001467 | 0.00068 | -2.158 | 0.04268 |
| NewPrecip | 0.01366 | 0.01314 | 1.04 | 0.3101 |
The regression equation of the Quasi-Poisson Regression Model is
given as follows:
log(μ/t) = -1.0436 + 0.0018 * DayMonday + 0.0345 * DaySaturday +
0.0035 * DaySunday + 0.0237 * DayThursday + 0.0253 * DayTuesday + 0.0184
* DayWednesday - 0.0015 * AvgTemp + 0.0137 * NewPrecip
As we can see, the Quasi-Poisson regression model has the same
coefficient estimates as the standard Poisson regression model on rates.
However, the standard errors, and therefore the p-values, for these
regression coefficients differ between the two models.
So, the interpretations of the regression coefficients for this
Quasi-Poisson regression model are exactly the same as they were for
the Poisson regression model on rates with the new predictor variables
of Day, AvgTemp, and NewPrecip.
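The link between the two sets of p-values can be checked by hand: the Quasi-Poisson standard errors are the Poisson standard errors multiplied by the square root of the dispersion parameter. The sketch below re-enters a few values from the two model summaries above; the object names are illustrative and only the intercept and AvgTemp rows are used.

```r
# Values copied from the two model summaries above
dispersion <- 4.243634                                        # quasi-Poisson dispersion
pois.se <- c(Intercept = 0.0210730, AvgTemp = 0.0003301)      # Poisson std. errors
pois.z  <- c(Intercept = -49.5222689, AvgTemp = -4.4449423)   # Poisson z values

# Quasi-Poisson standard errors are inflated by sqrt(dispersion)
quasi.se <- pois.se * sqrt(dispersion)

# The test statistics shrink by the same factor and are referred to a
# t distribution on the residual degrees of freedom (21 in the summary)
quasi.t <- pois.z / sqrt(dispersion)
quasi.p <- 2 * pt(-abs(quasi.t), df = 21)

round(cbind(quasi.se, quasi.t, quasi.p), 4)
```

The results should agree with the Quasi-Poisson summary up to rounding, which shows that the loss of significance comes entirely from the larger standard errors and not from any change in the estimates.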
Out of all of the predictor variables in our Quasi-Poisson
regression model on rates, only the variable of AvgTemp (p = 0.043) was
statistically significant, as it was the only predictor variable with a
p-value less than the alpha value of 0.05. This means that AvgTemp is
the only statistically significant predictor variable in predicting the
rate of cyclists on the Williamsburg Bridge.
All of the other predictor variables, DayMonday, DaySaturday,
DaySunday, DayThursday, DayTuesday, DayWednesday, and NewPrecip, were
not statistically significant in the Quasi-Poisson regression model,
because they all had p-values greater than the alpha value of 0.05.
Dispersion
Now, we will look at the dispersion parameter for the Quasi-Poisson
regression model in order to see how dispersed it is.
In the output of the model summary, we were given that the
dispersion parameter for the Quasi-Poisson model is 4.2436. This
dispersion parameter given in the model summary is the Pearson
dispersion parameter.
We can also calculate the Deviance dispersion parameter to compare
these two dispersion parameters for our Quasi-Poisson regression model
on rates.
# Dispersion Parameters
yhat = quasimodel.rates$fitted.values
pearson.resid = (cycling$WilliamsburgBridge - yhat)/sqrt(yhat)
Pearson.dispersion = sum(pearson.resid^2)/quasimodel.rates$df.residual
Deviance.dispersion = (quasimodel.rates$deviance)/quasimodel.rates$df.residual
disp = cbind(Pearson.dispersion = Pearson.dispersion,
Deviance.dispersion = Deviance.dispersion)
kable(disp, caption="Dispersion parameter", align = 'c')
Dispersion parameter
| Pearson.dispersion | Deviance.dispersion |
|---|---|
| 4.243633 | 4.242561 |
As we can see, the value of the Pearson dispersion parameter for our
Quasi-Poisson regression model is 4.2436. The value of the Deviance
dispersion parameter for our Quasi-Poisson regression model is
4.2426.
These dispersion parameters show that our model is over-dispersed,
as both dispersion indexes are well above the value of 1. Therefore,
the standard Poisson regression model would likely not be an ideal
choice, since over-dispersion makes its standard errors too small and
its p-values unreliable. The Quasi-Poisson model is likely the better
choice, as it adjusts the standard errors for this dispersion.
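The Deviance dispersion parameter is simply the residual deviance divided by its degrees of freedom, and the same two numbers support a common informal check for over-dispersion. The sketch below re-enters the values from the model summary; the chi-square comparison is an addition for illustration, was not part of the original analysis, and relies on a large-sample approximation.

```r
# Values copied from the quasi-Poisson model summary above
resid.deviance <- 89.094
resid.df <- 21

# Deviance-based dispersion estimate
deviance.dispersion <- resid.deviance / resid.df

# Upper-tail chi-square probability of a deviance this large if the
# Poisson mean-variance assumption held
p.overdispersion <- pchisq(resid.deviance, df = resid.df, lower.tail = FALSE)

c(dispersion = deviance.dispersion, p.value = p.overdispersion)
```

A very small probability here is consistent with the conclusion that the data are more variable than a standard Poisson model allows.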
Final Model
Now, for our final model we must choose between the standard Poisson
regression model on rates and the Quasi-Poisson regression model.
One important thing to note when making this choice is that the
standard Poisson model assumes that the mean of the response variable
is equal to its variance, while the Quasi-Poisson model does not. When
we checked the conditions of the standard Poisson regression model
earlier, we found that the mean of the response variable does not equal
its variance, indicating a major violation. This violation raises
concern for the standard Poisson regression model, as it suggests that
the response variable is, in fact, not a Poisson random variable, and
therefore a standard Poisson regression model may not be the best choice
for this data set.
Here, the Quasi-Poisson model has the advantage, as it does not
assume that the mean of the response variable is equal to its variance,
which suits our data set since it failed to meet this required
condition for a standard Poisson regression model.
Both models have advantages and disadvantages which must be considered
when choosing a final model. The standard Poisson regression model on
rates showed statistical significance for the majority of its predictor
variables. However, the data set failed to meet the condition of the
mean of the response variable equaling its variance, which raises
concern for the fit of this model. On the other hand, the Quasi-Poisson
regression model does not require this condition. However, in the
Quasi-Poisson model, only a single predictor variable showed
statistical significance, indicating that most of the predictors may
not be useful for prediction after all.
Additionally, we found that our data is over-dispersed, with a
dispersion parameter of 4.2436, which is well above 1. Since our data
is over-dispersed, a standard Poisson regression model is likely not the
ideal choice, as over-dispersion causes it to understate the standard
errors of the coefficients and so makes its p-values too small. When the
data is over-dispersed, the Quasi-Poisson regression model should be
used. So, even though the Quasi-Poisson regression model in this case
showed few statistically significant predictor variables, it is likely
the better choice.
In the end, it is a choice between the standard Poisson regression
model, which shows more statistical significance but whose standard
errors and p-values are likely unreliable due to over-dispersion, and
the Quasi-Poisson regression model, which shows less statistical
significance but accounts for dispersion and is not affected by our data
set failing to meet the equal mean and variance condition required for
Poisson regression.
Overall, I would say that the Quasi-Poisson regression model is the
safer choice of the two, as it does not require the mean of the
response variable to equal its variance, a condition our data set
failed. Additionally, using a standard Poisson regression model on
over-dispersed data can lead to overstated significance for its
predictor variables. The Quasi-Poisson regression model does show far
fewer significant predictor variables, which means that most of the
predictors may not be useful after all. Still, the Quasi-Poisson
regression model remains the better choice in this situation, as our
data fails the required condition for the response variable to be a
Poisson random variable and we saw substantial dispersion as well.
Visual Comparisons
Now, let’s look at some visual comparisons of the data within our
models.
I chose to create a graph which illustrated the predicted rates of
the cyclists on the Williamsburg Bridge based upon the day of the week
and whether or not it rained for that given day. This graph will create
two lines, one for precipitation (blue), and one for no precipitation
(red).
graph <- expand.grid(
Day = unique(cycling$Day),
NewPrecip = unique(cycling$NewPrecip),
AvgTemp = mean(cycling$AvgTemp, na.rm = TRUE),
Total = mean(cycling$Total, na.rm = TRUE)
)
# predict() returns expected counts at the supplied Total, so divide by Total to obtain the rate
graph$predicted_rate <- predict(quasimodel.rates, newdata = graph, type = "response") / graph$Total
graph$NewPrecip <- factor(graph$NewPrecip, levels = c(0, 1), labels = c("No Precipitation", "Precipitation"))
ggplot(graph, aes(x = Day, y = predicted_rate, color = NewPrecip, group = NewPrecip)) +
geom_line(size = 1) +
geom_point(size = 2) +
labs(title = "Predicted Rates of the Cyclists \n on the Williamsburg Bridge by the \n Day of the Week and the Precipitation \n Conditions",
x = "Day",
y = "Rate of Cyclists",
color = "Precipitation Conditions") +
theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

As we can see, this graph illustrates the predicted rate of cyclists on
the Williamsburg Bridge out of the total on all four of the major New
York bridges, based on the day of the week and whether there was
precipitation or not. The graph contains two lines, one for
precipitation (blue) and one for no precipitation (red), with a point
for each of the seven days of the week under each precipitation
condition.
As we can see by looking at our graph, it is predicted that the
highest rate of cyclists on the Williamsburg Bridge occurs on Saturdays
with precipitation, and the lowest rate of cyclists on the Williamsburg
Bridge occurs on Fridays with no precipitation.
Conclusion
Overall, we looked at various Poisson regression models in this
project to predict the frequency counts and the rates of the cyclists on
the Williamsburg Bridge. We also looked at a Quasi-Poisson regression
model to account for the dispersion of the data, along with the
violations which indicated that a standard Poisson regression model may
not be the ideal fit for our data.
It was found that our standard Poisson regression model on rates had
several variables which showed statistical significance, indicating that
these predictor variables were statistically significant in predicting
the rates of cyclists on the Williamsburg Bridge. In our Quasi-Poisson
regression model, only one of the predictor variables showed statistical
significance in predicting the rates of cyclists on the Williamsburg
Bridge. This made it seem like the standard Poisson regression model
provided better significance for prediction.
However, we looked at the dispersion parameter of the Quasi-Poisson
regression model and found that our data is in fact over-dispersed.
This indicates that a standard Poisson regression model is likely not
an ideal choice, because it does not account for this over-dispersion,
which makes its standard errors too small and its p-values unreliable.
This over-dispersion, along with the fact that our data set violated the
condition of the mean of the response variable equaling its variance,
showed that the standard Poisson regression model is not an ideal choice
after all. Due to this dispersion, the Quasi-Poisson regression model
would be the better and safer choice than the standard Poisson
regression model, despite it having fewer statistically significant
variables. Even though the Quasi-Poisson regression model shows that
the majority of the predictor variables were not statistically
significant in their prediction of the rates of cyclists on the
Williamsburg Bridge, its assessment of significance is more trustworthy
because it accounts for the dispersion in the data.
Recommendations
Some recommendations I would make for future projects include:
Look further into the violation that was found within this data
set and look into possible explanations for why this violation occurred.
It was found that the mean of the response variable is not equal to its
variance, which violates one of the necessities of a Poisson regression
model. It should be further considered whether a Poisson regression
model in fact is the best choice for this data set and if it is
sufficient to use this model for prediction despite these
violations.
Consider other variables which may affect the number of cyclists
out on a given observation. Perhaps there are other factors which may
provide further significance for model building and strengthen the
regression model. For instance, a variable indicating whether there are
any holidays or other notable events occurring on the day of a given
observation could be useful. This could be a binary predictor variable
with a value of 1 if there are any events or holidays, and a value of 0
if there are not. This could be useful as there may tend to be fewer
cyclists out if there is a major holiday or an event occurring in the
city on that given observation.
Further expand the data set to ensure the accuracy of the
predictions and to further strengthen the Poisson regression models.
Collecting more observations over a longer period of time could help to
strengthen the Poisson regression models and provide better accuracy
and reliability in the results found by the model building process.
This would help strengthen the conclusions and findings from the
process of building the Poisson regression models of this data set.
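As an illustration of the binary holiday variable suggested above, the sketch below builds such an indicator on a toy data frame. Everything in it is hypothetical: the data frame `toy`, the dates, and the `holiday.dates` vector are made up for the example and do not come from the bridge data.

```r
# Toy data frame with hypothetical dates (NOT the bridge data)
toy <- data.frame(Date = as.Date(c("2024-07-03", "2024-07-04", "2024-07-05")))

# Hypothetical list of holiday or event dates
holiday.dates <- as.Date(c("2024-07-04"))

# 1 if the observation falls on a listed holiday or event, 0 otherwise
toy$Holiday <- as.integer(toy$Date %in% holiday.dates)
toy
```

A variable built this way could then be added to the model formula alongside Day, AvgTemp, and NewPrecip.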
