Introduction and Background
Here we have a dataset sourced from New York City’s Traffic
Information Management System (TIMS). TIMS recorded the number of
cyclists entering and leaving three of New York City’s five boroughs -
Queens, Manhattan and Brooklyn - via a collection of bridges known as
the East River Bridges (Brooklyn Bridge, Manhattan Bridge, Williamsburg
Bridge, and Queensboro Bridge). These recordings took place in 2017.
April, July and October are the three months that are present in our
available copy of the data.
For today’s analysis we are going to look at a randomly selected
subset of the larger dataset (subset was chosen using R’s runif
function), that pertains to cyclists who entered and left our three
boroughs of interest - Queens, Manhattan and Brooklyn - via the
Manhattan Bridge throughout the entire month of July
2017. This subset has 31 observations, one for each day, and no
missing values. A breakdown of the original dataset’s variables,
their practical meanings and their data types is below.
| Name | Meaning | Data_Type |
| --- | --- | --- |
| Date | Date for that observation; YYYY-MM-DD form | Date |
| Day | Day of the week for that observation | character |
| HighTemp | That day’s highest recorded temperature | double |
| LowTemp | That day’s lowest recorded temperature | double |
| Precipitation | Measure of rain that day (inches) | double |
| Manhattan | Number of cyclists entering/leaving Queens, Manhattan or Brooklyn via the MANHATTAN Bridge | double |
| Total | Total number of cyclists entering/leaving Queens, Manhattan or Brooklyn via ANY of the East River Bridges | double |
Objective of Analysis
With the available data, my goal for this analysis is to examine how
weather conditions and day of the week are associated with the amount of
cyclist traffic that the Manhattan Bridge experiences. To do this, I
created two new variables, MeanTemp and TempDiff: MeanTemp averages a
given day’s low and high temperatures, and TempDiff is the difference
between them.
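The two derived variables can be sketched as follows. This is a minimal illustration on a made-up three-row data frame named `toy` (not the TIMS data), and it assumes TempDiff is taken as high minus low, which the text does not state explicitly.

```r
# Toy data frame standing in for the real sheet (values are made up).
toy <- data.frame(
  HighTemp = c(84.9, 87.1, 75.0),
  LowTemp  = c(72.0, 73.0, 66.9)
)

# Mean of the daily high and low, and the spread between them
# (assumed direction: high minus low).
toy$MeanTemp <- (toy$HighTemp + toy$LowTemp) / 2
toy$TempDiff <- toy$HighTemp - toy$LowTemp

print(toy)
```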
Using these temperature-related metrics, along with measures of
precipitation and records of the day of the week, I will use Poisson and
quasi-Poisson regression techniques to see which if any of these factors
play a particular role in the overall amount or the relative
rate of cyclist traffic that the Manhattan Bridge experiences.
Poisson Regression Modeling
To explore any potential associations, I created Poisson models of
two different regression types, one being for counts and one being for
rates.
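Poisson counts regression models the total number of occurrences of
a particular event (in this case, cyclists on the Manhattan Bridge). It
uses a log link to relate the mean of the response to the explanatory
variables, which lets us judge which of them, if any, have a significant
association with that mean. The formula for this regression is below:

\[\log(\mu) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\]

\(\mu\) = the mean of our response variable; \(x_1, \dots, x_p\) = the
predictor variables
\(\beta_0\) = the log of our response variable’s mean when every
numeric predictor equals zero and Day is at its baseline level; not very
useful for practical interpretation
\(\beta_1, \beta_2, \beta_3, \dots, \beta_p\) = the change in our
response variable’s log mean associated with a one-unit increase in
the corresponding predictor variable, holding the others fixed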
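To make the log-scale interpretation concrete, here is a small sketch on simulated data (not the cyclist data). The seed, sample size and true coefficients are arbitrary choices made for illustration only.

```r
set.seed(321)  # arbitrary seed for reproducibility

# Simulated example: counts whose log mean is 2 + 0.3 * x.
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(2 + 0.3 * x))

fit <- glm(y ~ x, family = poisson(link = "log"))

# Coefficients are reported on the log-mean scale ...
print(coef(fit))

# ... so exponentiating gives multiplicative effects on the mean itself.
print(exp(coef(fit)))
```

The fitted slope should land close to the 0.3 used to generate the data, and its exponential is the factor by which the expected count changes for a one-unit increase in `x`.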
Poisson rates regression instead models how often the event occurs
relative to the size of a larger “population.” In this dataset, that
population is our variable Total, the total number of cyclists on all
the East River Bridges, so the rate is the share of that total that
crossed the Manhattan Bridge. The calculation is similar to counts
regression, except that the logarithm of the population variable enters
the model as an offset (a term whose coefficient is fixed at 1). Writing
\(t\) for Total, this can be expressed in both of the following ways:

\[\log\left(\frac{\mu}{t}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p\]

\[\log(\mu) = \log(t) + \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p\]

- In Poisson rates regression, the parameters \(\beta_0, \dots, \beta_p\) are interpreted
in the same manner as in the Poisson counts model, except that they describe the log of the rate \(\mu / t\) rather than the log of the mean count.
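The offset can be supplied to `glm` either through the `offset` argument or as an `offset()` term in the formula. The sketch below, on simulated data with a hypothetical population variable `total`, suggests the two specifications give the same fit.

```r
set.seed(321)  # arbitrary seed
n <- 200
total <- rpois(n, lambda = 20000)   # hypothetical "population" size
x <- rnorm(n)

# Counts generated so that the rate y / total has log mean -1.2 + 0.05 * x.
y <- rpois(n, lambda = total * exp(-1.2 + 0.05 * x))

# Offset passed as an argument, and as a term in the formula.
fit_arg     <- glm(y ~ x, offset = log(total), family = poisson)
fit_formula <- glm(y ~ x + offset(log(total)), family = poisson)

print(all.equal(coef(fit_arg), coef(fit_formula)))
print(coef(fit_arg))
```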
Poisson Regression (Counts)
Below is a summary of the Poisson counts regression model I created,
with measures of temperature range and averages, precipitation amount
and day of the week all functioning as predictors of how many cyclists
crossed the Manhattan Bridge in or out of our three boroughs of
interest.
# Counts Model:
# Response = Manhattan
# Predictors = Day, MeanTemp, TempDiff, Precipitation
# Day is stored as a Factor
Counts_Model = glm(Manhattan ~ Day + MeanTemp + TempDiff + Precipitation, family = poisson(link = "log"), data = Data)
Counts_Model_Sum = summary(Counts_Model)
Counts_Model_Coef = Counts_Model_Sum$coefficients
invisible(Counts_Model_Coef)
kable(Counts_Model_Coef, caption = "<b><center> Poisson Counts Regression: Weather and Schedule Relationship with Count of Manhattan Bridge Cyclists </center></b>")
Table: Poisson Counts Regression: Weather and Schedule Relationship with Count of Manhattan Bridge Cyclists
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
| --- | --- | --- | --- | --- |
| (Intercept) | 8.5013371 | 0.0421490 | 201.697286 | 0.0000000 |
| DayMonday | 0.3199236 | 0.0089935 | 35.572866 | 0.0000000 |
| DayTuesday | 0.3357894 | 0.0093242 | 36.012843 | 0.0000000 |
| DayWednesday | 0.4023102 | 0.0090796 | 44.309388 | 0.0000000 |
| DayThursday | 0.2807381 | 0.0096557 | 29.074878 | 0.0000000 |
| DayFriday | 0.1331873 | 0.0107366 | 12.405032 | 0.0000000 |
| DaySaturday | -0.0859127 | 0.0097279 | -8.831613 | 0.0000000 |
| MeanTemp | -0.0029260 | 0.0006129 | -4.774076 | 0.0000018 |
| TempDiff | 0.0143243 | 0.0008575 | 16.703807 | 0.0000000 |
| Precipitation | -0.4307477 | 0.0104214 | -41.332836 | 0.0000000 |
# All predictor variables are significant
In the model, every predictor variable is statistically significant,
with p-values well below the conventional 0.05 threshold, so no stepwise
regression or model simplification is necessary.
As for the practical implications of our model summary, we can say
that although every predictor variable is statistically significant, the
magnitude of their effects is relatively small. Precipitation’s
estimated negative effect on the log mean of Manhattan Bridge cyclists
has an absolute value of ~ 0.4307, which is the highest of all our
predictors.
It appears that the day’s average temperature and the difference
between its high and low had very little practical association with the
log mean of that day’s cyclists. When we look at the difference in log
means from a day-of-the-week perspective, we see a somewhat larger
effect. With Sunday coded as the baseline, Wednesday has the greatest
amount of cyclist traffic and Saturday has the least. This higher count
of cyclists during the workweek could be due to the Manhattan Bridge
serving many riders as a commuting route.
All in all, our Poisson counts model yields some interesting and
statistically significant revelations, most notably that cyclists care
far more about precipitation than they do temperature fluctuation, and
that cyclist traffic appears to tick upwards throughout the workweek
before dying down for the weekend. However, the relatively small
magnitude of each variable’s estimated effect is a downside regarding
the model’s utility.
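Because the coefficients sit on the log scale, one way to gauge their practical size is to convert them into percentage changes in the expected count. The sketch below copies a handful of estimates from the counts table above; the vector name `est` and the choice of which rows to include are illustrative.

```r
# Estimates copied from the Poisson counts table above (log-mean scale).
est <- c(DayWednesday  = 0.4023102,
         DaySaturday   = -0.0859127,
         MeanTemp      = -0.0029260,
         TempDiff      = 0.0143243,
         Precipitation = -0.4307477)

# exp(beta) is the multiplicative change in the expected count for a
# one-unit increase; (exp(beta) - 1) * 100 expresses it as a percentage.
pct_change <- (exp(est) - 1) * 100
print(round(pct_change, 1))
```

Read this way, one additional inch of precipitation corresponds to an expected count roughly 35% lower, while a one-degree change in MeanTemp corresponds to a change of well under 1%.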
Poisson Regression (Rates)
After Poisson counts regression, I performed Poisson rates
regression, using the total number of cyclists entering and exiting our
three boroughs of interest across all the East River Bridges as
the “population” of which the Manhattan Bridge cyclists are a part.
I created two different Poisson rates models. In the first, both
temperature variables were statistically insignificant. Given that
result, and their minute practical significance in the previous counts
model, I chose to remove them and create a second Poisson rates model
that did not factor in the day’s average or range of temperature.
### Rates Model 1
Rates_Model = glm(Manhattan ~ Day + MeanTemp + TempDiff + Precipitation, offset = log(Total), family = poisson(link = "log"), data = Data)
Rates_Model_Sum = summary(Rates_Model)
Rates_Model_Coef = Rates_Model_Sum$coefficients
invisible(Rates_Model_Coef)
kable(Rates_Model_Coef, caption = "<b><center> Poisson Rates Regression (1): Weather and Schedule Relationship with Count of Manhattan Bridge Cyclists </center></b>")
Table: Poisson Rates Regression (1): Weather and Schedule Relationship with Count of Manhattan Bridge Cyclists
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
| --- | --- | --- | --- | --- |
| (Intercept) | -1.1844325 | 0.0418215 | -28.3211719 | 0.0000000 |
| DayMonday | 0.0418134 | 0.0088829 | 4.7071774 | 0.0000025 |
| DayTuesday | 0.0549949 | 0.0094416 | 5.8247706 | 0.0000000 |
| DayWednesday | 0.0316272 | 0.0090743 | 3.4853702 | 0.0004915 |
| DayThursday | 0.0048565 | 0.0096974 | 0.5008067 | 0.6165072 |
| DayFriday | -0.0167479 | 0.0108925 | -1.5375635 | 0.1241554 |
| DaySaturday | -0.0667274 | 0.0097414 | -6.8498669 | 0.0000000 |
| MeanTemp | -0.0010004 | 0.0006053 | -1.6527512 | 0.0983815 |
| TempDiff | 0.0008449 | 0.0008628 | 0.9792330 | 0.3274649 |
| Precipitation | -0.0306511 | 0.0095235 | -3.2184824 | 0.0012887 |
### Rates Model 2
Rates_Model2 = glm(Manhattan ~ Day + Precipitation, offset = log(Total), family = poisson(link = "log"), data = Data)
Rates_Model2_Sum = summary(Rates_Model2)
Rates_Model2_Coef = Rates_Model2_Sum$coefficients
invisible(Rates_Model2_Coef)
kable(Rates_Model2_Coef, caption = "<b><center> Poisson Rates Regression (2): Precipitation and Schedule Relationship with Count of Manhattan Bridge Cyclists </center></b>")
Table: Poisson Rates Regression (2): Precipitation and Schedule Relationship with Count of Manhattan Bridge Cyclists
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
| --- | --- | --- | --- | --- |
| (Intercept) | -1.2497005 | 0.0065309 | -191.3530685 | 0.0000000 |
| DayMonday | 0.0417392 | 0.0088231 | 4.7306551 | 0.0000022 |
| DayTuesday | 0.0522471 | 0.0090521 | 5.7718313 | 0.0000000 |
| DayWednesday | 0.0285783 | 0.0088706 | 3.2216687 | 0.0012745 |
| DayThursday | 0.0006398 | 0.0091826 | 0.0696739 | 0.9444532 |
| DayFriday | -0.0205402 | 0.0106430 | -1.9299220 | 0.0536165 |
| DaySaturday | -0.0684652 | 0.0096858 | -7.0685797 | 0.0000000 |
| Precipitation | -0.0288171 | 0.0093266 | -3.0897749 | 0.0020031 |
Looking at the findings of our second Poisson rates regression model,
we see a trend similar to that of our Poisson counts regression model:
statistical significance is common, but the regression coefficients are
small in magnitude, so practical significance is limited.
Once again treating Sunday as our baseline, the rate of Manhattan
Bridge cyclists in proportion to the entirety of East River Bridge
cyclists is at its highest early in the week and declines going into
the weekend. That said, the Thursday (p ~ 0.94) and Friday (p ~ 0.054)
coefficients are not statistically significant at the 0.05 level, so
the rates on those days cannot be distinguished from Sunday’s. The
apparent decline at the tail end of the workweek could therefore be
chalked up to random chance rather than a particular characteristic of
the Bridge that affects the experience of its cyclists only on those
particular days.
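The choice between the two rates models was made above from the individual p-values. A likelihood-ratio (deviance) comparison of the nested models is another common check. The sketch below uses simulated data in which the temperature variable has no true effect; the data frame `sim` and its columns are hypothetical stand-ins, not the cyclist data.

```r
set.seed(321)  # arbitrary seed
n <- 120
sim <- data.frame(
  total  = rpois(n, 20000),
  precip = rexp(n, rate = 4),
  temp   = rnorm(n, mean = 78, sd = 5)  # no true effect in this simulation
)
sim$y <- rpois(n, sim$total * exp(-1.25 - 0.03 * sim$precip))

# Full model with temperature, reduced model without it.
full    <- glm(y ~ precip + temp, offset = log(total),
               family = poisson, data = sim)
reduced <- glm(y ~ precip, offset = log(total),
               family = poisson, data = sim)

# Likelihood-ratio (deviance) test of the dropped term.
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)
```

A large p-value in the second row would support dropping the extra term. Applied to the real models, the same call would take `Rates_Model2` and `Rates_Model` as its first two arguments.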
Day of the Week Averages
Since both our counts and rates models suggested that the day of the
week has the greatest association with the log mean of the Manhattan
Bridge’s cyclists, I decided to calculate the average counts and rates
per day to compare them to each other and the mean across all days
considered. The table with this information is below.
Count_Averages = c(
round(mean(Data$Manhattan)),
round(mean(Data$Manhattan[Data$Day == "Sunday"])),
round(mean(Data$Manhattan[Data$Day == "Monday"])),
round(mean(Data$Manhattan[Data$Day == "Tuesday"])),
round(mean(Data$Manhattan[Data$Day == "Wednesday"])),
round(mean(Data$Manhattan[Data$Day == "Thursday"])),
round(mean(Data$Manhattan[Data$Day == "Friday"])),
round(mean(Data$Manhattan[Data$Day == "Saturday"]))
)
AllDays_Rates_Avg = sum(Data$Manhattan)/sum(Data$Total)
Sun_Rates_Avg = sum(Data$Manhattan[Data$Day == "Sunday"])/sum(Data$Total[Data$Day == "Sunday"])
Mon_Rates_Avg = sum(Data$Manhattan[Data$Day == "Monday"])/sum(Data$Total[Data$Day == "Monday"])
Tues_Rates_Avg = sum(Data$Manhattan[Data$Day == "Tuesday"])/sum(Data$Total[Data$Day == "Tuesday"])
Wed_Rates_Avg = sum(Data$Manhattan[Data$Day == "Wednesday"])/sum(Data$Total[Data$Day == "Wednesday"])
Thur_Rates_Avg = sum(Data$Manhattan[Data$Day == "Thursday"])/sum(Data$Total[Data$Day == "Thursday"])
Fri_Rates_Avg = sum(Data$Manhattan[Data$Day == "Friday"])/sum(Data$Total[Data$Day == "Friday"])
Sat_Rates_Avg = sum(Data$Manhattan[Data$Day == "Saturday"])/sum(Data$Total[Data$Day == "Saturday"])
Day_Rates_Averages = c(AllDays_Rates_Avg, Sun_Rates_Avg, Mon_Rates_Avg, Tues_Rates_Avg, Wed_Rates_Avg, Thur_Rates_Avg, Fri_Rates_Avg, Sat_Rates_Avg)
Rate_Averages = round(Day_Rates_Averages, digits = 4)
Days = c("All Days", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
Counts_Difference = c(
0, # Difference between the average count of all days and itself
round(mean(Data$Manhattan[Data$Day == "Sunday"])) - round(mean(Data$Manhattan)),
round(mean(Data$Manhattan[Data$Day == "Monday"])) - round(mean(Data$Manhattan)),
round(mean(Data$Manhattan[Data$Day == "Tuesday"])) - round(mean(Data$Manhattan)),
round(mean(Data$Manhattan[Data$Day == "Wednesday"])) - round(mean(Data$Manhattan)),
round(mean(Data$Manhattan[Data$Day == "Thursday"])) - round(mean(Data$Manhattan)),
round(mean(Data$Manhattan[Data$Day == "Friday"])) - round(mean(Data$Manhattan)),
round(mean(Data$Manhattan[Data$Day == "Saturday"])) - round(mean(Data$Manhattan))
)
Rates_DifferenceB = c(
0,
Sun_Rates_Avg - AllDays_Rates_Avg,
Mon_Rates_Avg - AllDays_Rates_Avg,
Tues_Rates_Avg - AllDays_Rates_Avg,
Wed_Rates_Avg - AllDays_Rates_Avg,
Thur_Rates_Avg - AllDays_Rates_Avg,
Fri_Rates_Avg - AllDays_Rates_Avg,
Sat_Rates_Avg - AllDays_Rates_Avg
)
Rates_Difference = round(Rates_DifferenceB, digits = 4)
Table = cbind(Days, Count_Averages, Counts_Difference, Rate_Averages, Rates_Difference)
kable(Table, caption = "<b><center><span style='color:#000000;'>Distribution of Manhattan Bridge Cyclist Count and Rates July 2017</span></center></b>") %>%
kable_styling(
bootstrap_options = c("striped", "bordered"),
full_width = FALSE,
position = "center"
)
Distribution of Manhattan Bridge Cyclist Count and Rates July 2017
| Days | Count_Averages | Counts_Difference | Rate_Averages | Rates_Difference |
| --- | --- | --- | --- | --- |
| All Days | 5425 | 0 | 0.2885 | 0 |
| Sunday | 4690 | -735 | 0.2865 | -0.002 |
| Monday | 6001 | 576 | 0.2975 | 0.009 |
| Tuesday | 6363 | 938 | 0.302 | 0.0135 |
| Wednesday | 6938 | 1513 | 0.2949 | 0.0064 |
| Thursday | 5999 | 574 | 0.2868 | -0.0017 |
| Friday | 4338 | -1087 | 0.2775 | -0.0109 |
| Saturday | 4031 | -1394 | 0.2665 | -0.022 |
The table provides greater detail on the implications of our
Poisson count and rate models. Weekday counts of Manhattan Bridge
cyclists (specifically Monday - Thursday) far outweigh the counts from
Friday to Sunday: the average number of cyclists from Monday - Thursday
is about 6,325, while the average from Friday - Sunday is about 4,353.
As for the rate of Manhattan Bridge cyclists relative to cyclists on
all East River Bridges, we see that the Manhattan Bridge’s cyclist rate
is slightly above average Monday - Wednesday, but then below average
Thursday through Sunday.
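The per-day averages in the table above can also be computed more compactly with `tapply`, which applies a function within each level of a grouping variable. The sketch below uses a made-up five-row data frame named `toy` rather than the real data.

```r
# Made-up stand-in for the real data (three days only, for brevity).
toy <- data.frame(
  Day       = c("Sunday", "Monday", "Sunday", "Monday", "Tuesday"),
  Manhattan = c(4000, 6000, 5000, 6200, 6400),
  Total     = c(14000, 20000, 17000, 21000, 21000)
)

# Fix the level order so results print Sunday-first, not alphabetically.
toy$Day <- factor(toy$Day, levels = c("Sunday", "Monday", "Tuesday"))

# Average count per day of the week.
count_avg <- round(tapply(toy$Manhattan, toy$Day, mean))

# Pooled rate per day: sum of Manhattan counts over sum of totals.
rate_avg <- round(tapply(toy$Manhattan, toy$Day, sum) /
                  tapply(toy$Total, toy$Day, sum), 4)

print(data.frame(Day = names(count_avg),
                 Count_Average = as.vector(count_avg),
                 Rate_Average  = as.vector(rate_avg)))
```

This keeps one line per statistic instead of one line per day, which makes it harder for a mistyped day name to slip in unnoticed.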
Poisson Modeling Takeaways
To conclude, any implementations done in response to our Poisson
models’ findings should be done with some degree of caution due to the
low practical significance found in both our count and rate models. That
being said, there are still valuable takeaways that we can draw from our
analysis.
First, the Manhattan Bridge is clearly busier, both in the sense of
raw volume and as a proportion of the overall East River Bridge network,
early and throughout the standard workweek than it is during the
weekend. Second, the daily average temperature as well as the difference
between that day’s high and low played very little if any role in the
count or rate of cyclists on any given day, but the measure of
precipitation does appear to have a relatively noticeable and negative
association with the number of that day’s cyclists on the Manhattan
Bridge.
Quasi-Poisson Regression Modeling
In addition to analyzing our data via Poisson regression, I
decided to also create a quasi-Poisson model of the data. Quasi-Poisson
modeling is an alternative to Poisson modeling, and it is particularly
valuable when the variance of the model’s response variable
(number of cyclists on the Manhattan Bridge in this case) is not
approximately equal to its mean, as a Poisson distribution assumes. This
is known as overdispersion when the variance exceeds the mean and
underdispersion when it falls short.
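The practical difference between the two families can be seen on simulated overdispersed counts. In the sketch below (negative binomial draws with arbitrary parameters, not the cyclist data), the point estimates match and only the standard errors change.

```r
set.seed(321)  # arbitrary seed
n <- 300
x <- rnorm(n)

# Negative binomial draws: same mean structure as a Poisson model,
# but with variance larger than the mean.
y <- rnbinom(n, mu = exp(3 + 0.2 * x), size = 2)

pois  <- glm(y ~ x, family = poisson)
quasi <- glm(y ~ x, family = quasipoisson)

# Point estimates are identical ...
print(all.equal(coef(pois), coef(quasi)))

# ... but quasi-Poisson standard errors are inflated by sqrt(phi-hat).
phi_hat  <- summary(quasi)$dispersion
se_ratio <- summary(quasi)$coefficients[, "Std. Error"] /
            summary(pois)$coefficients[, "Std. Error"]
print(c(sqrt_phi = sqrt(phi_hat), se_ratio))
```

Note that the comparison only isolates the dispersion adjustment when both models use the same predictors, which is not the case for the Poisson and quasi-Poisson counts models in this report.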
For my quasi-Poisson model, I included that day’s average
temperature, day of the week and precipitation amount as the relevant
factors. Day of the week obviously played the biggest role in our
previous Poisson models, with precipitation consistently being cited as
statistically significant despite relatively low practical significance.
For this model, I chose to discretize precipitation, with days of no
recorded rain being marked as “0” and days with any amount of
rain being marked as “1.”
Data$NewPrecip = Data$Precipitation
Data$NewPrecip[Data$Precipitation == 0] = 0
Data$NewPrecip[Data$Precipitation > 0] = 1
Data = data.frame(Data$Date, Data$Day, Data$Day_Num, Data$HighTemp, Data$LowTemp, Data$MeanTemp, Data$TempDiff, Data$Precipitation, Data$NewPrecip, Data$Manhattan, Data$Total)
colnames(Data) = c("Date", "Day", "Day_Num", "HighTemp", "LowTemp", "MeanTemp","TempDiff", "Precipitation", "NewPrecip","Manhattan", "Total")
# 1.) Below is the quasi-Poisson regression model
# As instructed, only includes Day, MeanTemp and NewPrecip
Quasi_Counts_Model = glm(Manhattan ~ Day + MeanTemp + NewPrecip, family = quasipoisson, data = Data)
Quasi_Counts_Model_Sum = summary(Quasi_Counts_Model)
Quasi_Counts_Model_Coef = Quasi_Counts_Model_Sum$coefficients
invisible(Quasi_Counts_Model_Coef)
kable(Quasi_Counts_Model_Coef, caption = "<b><center> Quasi-Poisson Counts Regression: Weather and Schedule Relationship with Count of Manhattan Bridge Cyclists </center></b>")
Table: Quasi-Poisson Counts Regression: Weather and Schedule Relationship with Count of Manhattan Bridge Cyclists
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
| --- | --- | --- | --- | --- |
| (Intercept) | 8.0428119 | 0.5287239 | 15.2117416 | 0.0000000 |
| DayMonday | 0.3201598 | 0.1180432 | 2.7122270 | 0.0127244 |
| DayTuesday | 0.2356884 | 0.1223379 | 1.9265366 | 0.0670535 |
| DayWednesday | 0.3064451 | 0.1211249 | 2.5299920 | 0.0190731 |
| DayThursday | 0.2542985 | 0.1240203 | 2.0504581 | 0.0524206 |
| DayFriday | 0.0302193 | 0.1365853 | 0.2212482 | 0.8269398 |
| DaySaturday | -0.0805066 | 0.1301538 | -0.6185500 | 0.5425648 |
| MeanTemp | 0.0062851 | 0.0068223 | 0.9212583 | 0.3669084 |
| NewPrecip | -0.4057049 | 0.0921307 | -4.4035809 | 0.0002251 |
A summary of the quasi-Poisson counts model can be seen above. We can
see that there is great similarity between the findings of this model
and our original Poisson counts model. However before we can determine
which one is superior for interpretative use, we must calculate this
quasi-Poisson’s dispersion parameter, “phi hat” (\(\hat{\phi}\)).
Dispersion and Counts Model Selection
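\(\hat{\phi}\) is used in
quasi-Poisson regression to determine whether our data’s response
variable is overdispersed or underdispersed relative to a Poisson
distribution. Generally, a \(\hat{\phi}\) value of around 1 indicates
that the variance of the response is approximately equal to its mean.
If a quasi-Poisson model’s dispersion value is substantially different
from 1, then that model should be used for associative analysis rather
than a traditional Poisson counterpart, because the quasi-Poisson
calculation rescales the standard errors by \(\sqrt{\hat{\phi}}\) to
account for the extra variation. However, if \(\hat{\phi}\) ~ 1,
then the traditional Poisson model should be used, as it is simpler and
avoids an otherwise unnecessary adjustment.
The formula for \(\hat{\phi}\)’s
calculation can be seen below.

\[\hat{\phi} = \frac{1}{n - p} \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}\]

Here \(y_i\) is the observed count on day \(i\), \(\hat{\mu}_i\) is the model’s fitted mean for that day, \(n\) is the number of observations, and \(p\) is the number of parameters subtracted for the degrees of freedom.
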
n = nrow(Data)
p = 3
Pearson_Residuals = residuals(Quasi_Counts_Model, type = "pearson")
Sq_Pearson_Residuals = Pearson_Residuals^2
Dispersion_Parameter = (sum(Sq_Pearson_Residuals))/(n-p)
#### Double checked phi's value using Prof's coding method; got same result
ydif=Data$Manhattan-exp(Quasi_Counts_Model$linear.predictors) # diff between y and yhat
prsd = ydif/sqrt(exp(Quasi_Counts_Model$linear.predictors)) # Pearson residuals
phi_check = sum(prsd^2)/(n-p)
####
invisible(Dispersion_Parameter)
invisible(phi_check)
Our model yielded a value of \(\hat{\phi}\) ~ 142, which is
far above the value of roughly 1 expected for a properly
dispersed Poisson response variable. For this reason, the quasi-Poisson
counts model is more appropriate for associative analysis than the
Poisson counts model, and we will use the quasi-Poisson for our ultimate
interpretations.
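One detail worth checking in the hand calculation above is the denominator. The code sets p = 3, whereas R's own dispersion estimate divides by the residual degrees of freedom, which counts every estimated coefficient (a seven-level Day factor alone contributes six). The sketch below uses simulated data with a hypothetical predictor `x` to show the two routes agreeing when `df.residual` is used. It does not establish how much the reported value would change.

```r
set.seed(321)  # arbitrary seed
n <- 31
sim <- data.frame(
  day = factor(rep(c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
                   length.out = n)),
  x   = rnorm(n)
)
sim$y <- rnbinom(n, mu = exp(6 + 0.1 * sim$x), size = 5)

fit <- glm(y ~ day + x, family = quasipoisson, data = sim)

pearson <- residuals(fit, type = "pearson")

# Divide by the residual degrees of freedom: n minus the number of
# estimated coefficients (intercept, six day dummies, and x here).
phi_by_hand <- sum(pearson^2) / df.residual(fit)

print(c(by_hand        = phi_by_hand,
        from_summary   = summary(fit)$dispersion,
        n_coefficients = length(coef(fit))))
```

For the model in this report, `summary(Quasi_Counts_Model)$dispersion` would give R's estimate directly for comparison with the hand-computed figure.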
Visual Aids
Referring to our quasi-Poisson model summary above, we see that the
day’s average temperature does not appear to have a statistically or
practically significant association with the Manhattan Bridge’s number
of cyclists. However, there does appear to be a meaningful difference
between the number of cyclists on a totally clear day and on a day with
at least some level of precipitation (recorded via the variable
NewPrecip). And, as consistently seen in our original Poisson regression
models, there is certainly a large difference in the typical number
of cyclists depending on the day of the week.
Knowing this, I created two visuals below to enhance our grasp of the
relationship that both the day of the week and the presence of
precipitation have with each other as well as the standard number of
cyclists that were on the Manhattan Bridge throughout July 2017.
Unfortunately, there were no instances of Tuesdays or Wednesdays with
precipitation in this study, resulting in a blank in both our table and
bar chart below.
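The blanks arise because the mean of an empty selection is `NaN` in R. A quick cross-tabulation shows which day and precipitation combinations have no observations. The sketch below uses a made-up four-row data frame named `toy`, not the real data.

```r
# Made-up stand-in: Tuesday has no rainy days.
toy <- data.frame(
  Day       = c("Monday", "Monday", "Tuesday", "Tuesday"),
  NewPrecip = c(0, 1, 0, 0),
  Manhattan = c(7400, 3900, 6300, 6400)
)

# Cell counts by day and precipitation indicator.
print(table(toy$Day, toy$NewPrecip))

# The mean of an empty selection is NaN, which is what the table shows.
rainy_tuesday <- toy$Manhattan[toy$NewPrecip == 1 & toy$Day == "Tuesday"]
print(length(rainy_tuesday))
print(mean(rainy_tuesday))
```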
#### Table
Days = c("All Days", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
No_Precipitation = c(
round(mean(Data$Manhattan[Data$NewPrecip == 0])),
round(mean(Data$Manhattan[Data$NewPrecip == 0 & Data$Day == "Sunday"])),
round(mean(Data$Manhattan[Data$NewPrecip == 0 & Data$Day == "Monday"])),
round(mean(Data$Manhattan[Data$NewPrecip == 0 & Data$Day == "Tuesday"])),
round(mean(Data$Manhattan[Data$NewPrecip == 0 & Data$Day == "Wednesday"])),
round(mean(Data$Manhattan[Data$NewPrecip == 0 & Data$Day == "Thursday"])),
round(mean(Data$Manhattan[Data$NewPrecip == 0 & Data$Day == "Friday"])),
round(mean(Data$Manhattan[Data$NewPrecip == 0 & Data$Day == "Saturday"]))
)
Some_Precipitation = c(
round(mean(Data$Manhattan[Data$NewPrecip == 1])),
round(mean(Data$Manhattan[Data$NewPrecip == 1 & Data$Day == "Sunday"])),
round(mean(Data$Manhattan[Data$NewPrecip == 1 & Data$Day == "Monday"])),
round(mean(Data$Manhattan[Data$NewPrecip == 1 & Data$Day == "Tuesday"])),
round(mean(Data$Manhattan[Data$NewPrecip == 1 & Data$Day == "Wednesday"])),
round(mean(Data$Manhattan[Data$NewPrecip == 1 & Data$Day == "Thursday"])),
round(mean(Data$Manhattan[Data$NewPrecip == 1 & Data$Day == "Friday"])),
round(mean(Data$Manhattan[Data$NewPrecip == 1 & Data$Day == "Saturday"]))
)
Vis_Table = data.frame(Days, No_Precipitation, Some_Precipitation)
kable(Vis_Table, caption = "<b><center><span style='color:#000000;'>Average Cyclist Counts on the Manhattan Bridge July 2017</span></center></b>") %>%
kable_styling(
bootstrap_options = c("striped", "bordered"),
full_width = FALSE,
position = "center")
Average Cyclist Counts on the Manhattan Bridge July 2017
| Days | No_Precipitation | Some_Precipitation |
| --- | --- | --- |
| All Days | 6008 | 3746 |
| Sunday | 4924 | 3756 |
| Monday | 7408 | 3892 |
| Tuesday | 6363 | NaN |
| Wednesday | 6938 | NaN |
| Thursday | 6006 | 5980 |
| Friday | 5802 | 2874 |
| Saturday | 4484 | 3352 |
#### Barchart
Vis_long =
Vis_Table %>%
pivot_longer(
cols = c(No_Precipitation, Some_Precipitation),
names_to = "Precipitation",
values_to = "Manhattan"
)
Vis_long$Days = factor(
Vis_long$Days,
levels = c("All Days", "Sunday", "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday")
)
ggplot(Vis_long,
aes(x = Days, y = Manhattan, fill = Precipitation)) +
geom_bar(stat = "identity",
position = position_dodge(width = 0.9),
na.rm = TRUE) +
labs(
title = "Average Cyclist Counts on the Manhattan Bridge July 2017",
x = "Day of the Week",
y = "Number of Cyclists",
fill = "Precipitation"
) +
scale_fill_manual(
values = c("No_Precipitation" = "darkred",
"Some_Precipitation" = "lightblue")
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()
)

The depictions above provide added clarity to the summary statistics
from our quasi-Poisson model. Sunday was the baseline for our regression
model’s calculations, and all remaining days other than Saturday had
positive coefficients. We can see in both our chart and table that the
count of cyclists is certainly higher throughout the week than it is on
the weekends. As for the role of precipitation, our discretized variable
NewPrecip had a regression coefficient of ~ -0.4 (the highest absolute
value of any predictor in the model) and a significant p-value of
~ 0.0002. That finding can be intuitively confirmed by glancing at our
visuals: other than Thursday, every day with data points for both
precipitation and no precipitation shows a noticeable decrease in
cyclists when precipitation is present.
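The NewPrecip coefficient can be translated into a ratio of expected counts, with an approximate 95% interval, using only the estimate and standard error reported in the quasi-Poisson table. The sketch below assumes 22 residual degrees of freedom (31 days minus the 9 coefficients listed in that table); the variable names are illustrative.

```r
# Values copied from the quasi-Poisson table above.
est <- -0.4057049
se  <- 0.0921307

# Assumed residual degrees of freedom: 31 days minus 9 estimated
# coefficients (intercept, six Day dummies, MeanTemp, NewPrecip).
df_resid <- 31 - 9

# 95% Wald interval on the log scale, then exponentiated.
ci_log <- est + c(-1, 1) * qt(0.975, df = df_resid) * se
ratio  <- exp(est)
ci     <- exp(ci_log)

print(round(c(ratio = ratio, lower = ci[1], upper = ci[2]), 3))
```

A ratio of about 0.67 means that, holding day of the week and average temperature fixed, days with any precipitation are associated with roughly one third fewer cyclists than dry days.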
Conclusion
Across both Poisson and quasi-Poisson modeling techniques, our
findings were relatively consistent. They are the following:
- The Manhattan Bridge is far busier during the week than it is on the
weekend.
- At least in the month of July, the temperature does not play any
sort of significant role in the raw number of the bridge’s cyclists nor
its share of the totality of East River Bridge cyclists.
- Although temperature is not significantly associated with cyclist
traffic, precipitation is. Any amount of precipitation has a negative
and statistically significant relationship with Manhattan Bridge’s
cyclist traffic.
Of the Poisson and quasi-Poisson models used to estimate the
association between all these factors, the quasi-Poisson is more
appropriate due to this dataset’s extremely high dispersion
parameter (\(\hat{\phi}\) ~ 142).
If we were to continue or expand on this analysis in the future, it
would be valuable to expand the scope of our data outside of the month
of July and into months that border on seasonal changes such as March,
April or October. Intuitively, one might guess that the day’s
temperatures play a much larger role in a time of the year like
that.
---
title: "Poisson and Quasi-Poisson Analysis of Relationship between Weather and Day of Week with Cyclist Traffic on Manhattan Bridge"
author: "Chris Bahm"
date: "2025-11-10"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: true
    toc_depth: 4
    fig_width: 6
    fig_height: 4
    fig_caption: true
    number_sections: true
    code_folding: hide
    code_download: true
    theme: lumen
    highlight: tango
  pdf_document:
    toc: true
    toc_depth: 4
    fig_caption: true
    number_sections: true
  word_document:
    toc: true
    toc_depth: 4
---

```{css, echo = FALSE}
div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.

if (!require("knitr")) {                      # use conditional statement to detect
   install.packages("knitr")                  # whether a package was installed in
   library(knitr)                             # your machine. If not, install it and
}                                             # load it to the working directory.

if (!require(tidyverse)) {library(tidyverse)} 

if (!require(GGally)) {library(GGally)} 

if (!require(kableExtra)) {library(kableExtra)} 

if (!require(ggplot2)) {library(ggplot2)} 

if (!require(car)) {library(car)} 

if (!require(dplyr)) {library(dplyr)} 

if (!require(pander)) {library(pander)} 

if (!require("scales")) {
install.packages("scales")                                        
library("scales") 
}

knitr::opts_chunk$set(
	echo = TRUE,
	message = FALSE,
	warning = FALSE,
	comment = NA,
	results = TRUE
)

```

# Introduction and Background 
Here we have a dataset sourced from New York City's Traffic Information Management System (TIMS). TIMS recorded the number of cyclists entering and leaving three of New York City's five boroughs - Queens, Manhattan and Brooklyn - via a collection of bridges known as the East River Bridges (Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge). These recordings took place in 2017. April, July and October are the three months that are present in our available copy of the data.

For today's analysis we are going to look at a randomly selected subset of the larger dataset (subset was chosen using R's runif function), that pertains to cyclists who entered and left our three boroughs of interest - Queens, Manhattan and Brooklyn - via the **Manhattan** Bridge throughout the entire month of July 2017. This subset has 31 observations, one for each day, and no missing values. A breakdown of the original dataset's variables, their practical meanings and their data types is below.

```{r Variable Table, echo=FALSE}
library(knitr)

Var_Table = data.frame(
  Name = c("Date",
           "Day", 
           "HighTemp", 
           "LowTemp",
           "Precipitation", 
           "Manhattan", 
           "Total"),
  
  Meaning = c("Date for that observation; YYYY-MM-DD form", 
              "Day of the week for that observation",
              "That day's highest recorded temperature", 
              "That day's lowest recorded temperature", 
              "Measure of rain that day (inches)",  
              "Number of cyclists entering/leaving Queens, Manhattan or Brooklyn via the MANHATTAN Bridge",
              "Total number of cyclists entering/leaving Queens, Manhattan or Brooklyn via ANY of the East River Bridges"), 
              
  
  Data_Type = c("Date", "character", "double", "double", "double", "double", "double"))

kable(Var_Table) %>%
  kable_styling(
    bootstrap_options = c("striped", "bordered"),
    full_width = FALSE,
    position = "center")

```
## Objective of Analysis
With the available data, my goal for this analysis is to examine how weather conditions and day of the week are associated with the amount of cyclist traffic that the Manhattan Bridge experiences. In order to do this, I created two new variables - MeanTemp and TempDiff - which were calculated by averaging that particular day's low and high temperatures and by finding the difference between those temperatures, respectively. 

Using these temperature-related metrics, along with measures of precipitation and records of the day of the week, I will use Poisson and quasi-Poisson regression techniques to see which, if any, of these factors play a particular role in the overall amount *or* the relative rate of cyclist traffic that the Manhattan Bridge experiences.

```{r Data Loading and Cleaning, include=FALSE}
library (openxlsx)
options(scipen = 999)

# round(runif(1, min = 1, max = 10))
  # Used line above to randomly select which subset of the data to do my analysis on. The fifth tab on the original data Excel spreadsheet was for observations on the Manhattan Bridge from 7/1 to 7/31

Data = read.xlsx("https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/NYC_Cyclists_Data.xlsx", sheet = "Manhattan 2")

glimpse(Data)

# Converting variable Date to a date object.
# Origin = 1899-12-30 since Excel stores data values as the number of days since then.
Data$Date = as.Date(Data$Date, origin = "1899-12-30")

# Convert variable Day to a date object, then replace it with the weekday name.
Data$Day = as.Date(Data$Day, origin = "1899-12-30")
Data$Day = weekdays(Data$Day)

  # Creation of Day_Num variable
Data$Day_Num[Data$Day == "Sunday"] = 1
Data$Day_Num[Data$Day == "Monday"] = 2
Data$Day_Num[Data$Day == "Tuesday"] = 3
Data$Day_Num[Data$Day == "Wednesday"] = 4
Data$Day_Num[Data$Day == "Thursday"] = 5
Data$Day_Num[Data$Day == "Friday"] = 6
Data$Day_Num[Data$Day == "Saturday"] = 7

  Data$MeanTemp = (Data$HighTemp + Data$LowTemp)/2
  Data$TempDiff = Data$HighTemp - Data$LowTemp
  Data$Day = factor(Data$Day,
                  levels = c("Sunday", "Monday", "Tuesday", "Wednesday", 
                             "Thursday", "Friday", "Saturday"))
            # Since Sunday is the first level of the factor listed here, it will be recognized as the baseline by R

# Reorder columns for ideal visual perception
Data = data.frame(Data$Date, Data$Day, Data$Day_Num, Data$HighTemp, Data$LowTemp, Data$MeanTemp, Data$TempDiff, Data$Precipitation, Data$Manhattan, Data$Total)
colnames(Data) = c("Date", "Day", "Day_Num", "HighTemp", "LowTemp", "MeanTemp","TempDiff", "Precipitation", "Manhattan", "Total")

glimpse(Data)
```
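
To illustrate the Excel date conversion used in the chunk above, here is a minimal sketch. The serial number is a hypothetical value chosen for illustration and is not read from the dataset; it simply shows how `as.Date` with `origin = "1899-12-30"` maps an Excel serial number back to a calendar date.

```r
# Hypothetical Excel serial number (assumed for illustration only)
excel_serial = 42917

# Excel stores dates as the number of days since 1899-12-30
converted = as.Date(excel_serial, origin = "1899-12-30")
format(converted, "%Y-%m-%d")   # "2017-07-01"

# weekdays() returns the day name in the current locale
weekdays(converted)
```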

# Poisson Regression Modeling
To explore any potential associations, I created Poisson models of two different regression types, one being for counts and one being for rates.

Poisson counts regression examines the total number of occurrences of a particular event (in this case cyclists on the Manhattan Bridge) and uses a log link to determine which, if any, of the explanatory variables have a significant effect on said response variable's mean. The formula for said regression is below:

```{r, echo=FALSE}
include_graphics("Poisson_Model_Form.png")
```

- $\beta$~0~ = the log of our response variable's mean when every numeric predictor equals zero and every factor is at its baseline level; not very useful for practical interpretation 

- $\beta$~1~, $\beta$~2~, $\beta$~3~, ... $\beta$~p~ = the change in our response variable's log mean associated with a one-unit increase in the corresponding predictor variable, holding the other predictors fixed
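
The interpretation of the coefficients above can be checked on simulated data. The sketch below is not part of the cyclist analysis; the variable names (`sim_x`, `sim_y`, `sim_fit`) and the true values 1 and 0.5 are assumptions made purely for illustration.

```r
set.seed(321)

# Simulated data (hypothetical): true intercept 1, true slope 0.5
sim_x = runif(2000, min = 0, max = 2)
sim_y = rpois(2000, lambda = exp(1 + 0.5 * sim_x))

sim_fit = glm(sim_y ~ sim_x, family = poisson(link = "log"))

# Estimates are on the log-mean scale and should land near 1 and 0.5
coef(sim_fit)

# Exponentiating the slope gives the multiplicative change in the mean
# for a one-unit increase in sim_x (true value exp(0.5))
exp(coef(sim_fit)[["sim_x"]])
```

Because the link is logarithmic, a coefficient acts additively on the log mean and multiplicatively on the mean itself.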

 <br> 

Additionally, Poisson rates regression aims to find the expected rate of a particular event's occurrence relative to the size of a larger "population." In this dataset and analysis, our variable Total, which represents the **total** number of cyclists on all the East River Bridges, serves as the population of which the number of cyclists on the Manhattan Bridge is treated as a proportion. The calculation for this type of Poisson regression is similar to counts regression, but the logarithm of the population variable is also included as an offset. This can be expressed in both of the following ways.
<br>
```{r, echo=FALSE}
include_graphics("Poisson_Form_Rates1.png")
```
 <br> 
```{r, echo=FALSE}
include_graphics("Poisson_Model_Form_Rates.png")
```
 <br> 
 
 - In Poisson rates regression, the parameters $\beta$~0~, .... $\beta$~p~ should be interpreted in the same manner as they are in the Poisson counts model, except that they describe the log of the rate rather than the log of the count.
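
The two ways of writing the rates model correspond to two equivalent ways of supplying the offset in `glm`. The sketch below uses simulated data, not the cyclist data; `sim_exposure` is a hypothetical stand-in for a population variable such as Total, and the true coefficients are assumed for illustration.

```r
set.seed(321)

# Simulated data (hypothetical): sim_exposure plays the role of a population size
sim_exposure = round(runif(200, min = 500, max = 5000))
sim_x = rnorm(200)
sim_events = rpois(200, lambda = sim_exposure * exp(-2 + 0.3 * sim_x))

# Offset passed as an argument
fit_arg = glm(sim_events ~ sim_x, offset = log(sim_exposure),
              family = poisson(link = "log"))

# Offset written inside the formula
fit_formula = glm(sim_events ~ sim_x + offset(log(sim_exposure)),
                  family = poisson(link = "log"))

# Both specifications describe log(mean / exposure) = b0 + b1 * x
all.equal(coef(fit_arg), coef(fit_formula))
```

In both versions the offset enters the linear predictor with its coefficient fixed at 1, which is what turns a model for counts into a model for rates.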

## Poisson Regression (Counts)
Below is a summary of the Poisson counts regression model I created, with measures of temperature range and averages, precipitation amount and day of the week all functioning as predictors of how many cyclists crossed the Manhattan Bridge in or out of our three boroughs of interest.
```{r}
# Counts Model:
  # Response = Manhattan
  # Predictors = Day, MeanTemp, TempDiff, Precipitation
    # Day is stored as a Factor

Counts_Model = glm(Manhattan ~ Day + MeanTemp + TempDiff + Precipitation, family = poisson(link = "log"), data = Data)

Counts_Model_Sum = summary(Counts_Model)
Counts_Model_Coef = Counts_Model_Sum$coefficients

invisible(Counts_Model_Coef)
kable(Counts_Model_Coef, caption = "<b><center> Poisson Counts Regression: Weather and Schedule Relationship with Count of Manhattan Bridge Cyclists </center></b>")

# All predictor variables are significant
```
In the model, we can see that *every* predictor variable is statistically significant as per p values well below the standard of 0.05, so no stepwise regression or model simplification is necessary. 

As for the practical implications of our model summary, we can say that although every predictor variable is statistically significant, the magnitudes of their estimated effects are relatively small. Precipitation's estimated negative effect on the log mean of Manhattan Bridge cyclists has an absolute value ~ |.4307|, which is the highest of all our predictors.
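
Because the model uses a log link, a coefficient is often easier to judge after exponentiating it. As a rough worked example, taking the precipitation estimate quoted above (about -0.4307) at face value:

```r
# Precipitation coefficient as quoted in the text above (log scale)
precip_coef = -0.4307

# Multiplicative change in the expected count per additional inch of rain
rate_ratio = exp(precip_coef)
round(rate_ratio, 2)             # 0.65

# The same quantity expressed as a percentage change
round(100 * (rate_ratio - 1))    # -35
```

Read this way, the estimate corresponds to the expected count being multiplied by roughly 0.65 for each additional inch of rain, holding the other predictors fixed.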

It appears that the day's average temperature and the difference between its daily high and low had very little practical association with the log mean of that day's cyclists. When we look at the difference in log means from a day-of-the-week perspective, we do see a slightly larger effect. With Sunday coded as the baseline, it looks like Wednesday has the greatest amount of cyclist traffic and Saturday has the least. This higher count of cyclists during the workweek could be due to the Manhattan Bridge functioning for many as a commuting route.

All in all, our Poisson counts model yields some interesting and statistically significant revelations, most notably that cyclists care far more about precipitation than they do temperature fluctuation, and that cyclist traffic appears to tick upwards throughout the workweek before dying down for the weekend. However, the relatively small magnitude of each variable's estimated effect is a downside regarding the model's utility. 

## Poisson Regression (Rates)
After Poisson counts regression, I then performed Poisson rates regression with the total number of cyclists entering and exiting our three boroughs of interest across *all* the East River Bridges as the "population" of which the Manhattan Bridge cyclists are treated as a share. 

This process consisted of me creating two different Poisson rates models. The first one I created listed both temperature variables as statistically insignificant. Given their status as statistically insignificant in this model, and their minute practical significance in the previous counts model, I chose to remove them and create a second Poisson rates model which did not factor in the day's average or range of temperature.
```{r}
### Rates Model 1
Rates_Model = glm(Manhattan ~ Day + MeanTemp + TempDiff + Precipitation, offset = log(Total), family = poisson(link = "log"), data = Data)

Rates_Model_Sum = summary(Rates_Model)
Rates_Model_Coef = Rates_Model_Sum$coefficients

invisible(Rates_Model_Coef)
kable(Rates_Model_Coef, caption = "<b><center> Poisson Rates Regression (1): Weather and Schedule Relationship with Rate of Manhattan Bridge Cyclists </center></b>")


### Rates Model 2
Rates_Model2 = glm(Manhattan ~ Day + Precipitation, offset = log(Total), family = poisson(link = "log"), data = Data)

Rates_Model2_Sum = summary(Rates_Model2)
Rates_Model2_Coef = Rates_Model2_Sum$coefficients

invisible(Rates_Model2_Coef)
kable(Rates_Model2_Coef, caption = "<b><center> Poisson Rates Regression (2): Precipitation and Schedule Relationship with Rate of Manhattan Bridge Cyclists </center></b>")

```
Looking at the findings of our second Poisson rates regression model, we see a trend similar to that of our Poisson counts regression model: statistical significance is common, but practical significance is limited once the magnitude of each regression coefficient is taken into consideration.

Once again treating Sunday as our baseline, it looks like the rate of Manhattan Bridge cyclists in proportion to the entirety of East River Bridge cyclists is at its highest early in the week, with that rate declining going into the weekend. That being said, the statistical significance of this breakdown also greatly decreases when we look at the data for Thursday and to a much lesser but still noticeable extent Friday, perhaps suggesting that the Manhattan Bridge cyclist rate's decline at the tail end of the workweek could be chalked up to random chance and not a particular characteristic of the Bridge that affects the experience of its cyclists only on those particular days.
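
The choice between a fuller and a reduced rates model, made above by inspecting p values, can also be examined with a deviance comparison of nested models. The sketch below uses simulated data with hypothetical names (`sim_rain`, `sim_noise`, `sim_exposure`); it is an illustration of the technique and not a re-analysis of the cyclist data.

```r
set.seed(321)

# Simulated data (hypothetical): sim_noise has no true effect on the rate
sim_n = 300
sim_exposure = round(runif(sim_n, min = 1000, max = 5000))
sim_rain = rbinom(sim_n, size = 1, prob = 0.3)
sim_noise = rnorm(sim_n)
sim_events = rpois(sim_n, lambda = sim_exposure * exp(-1.5 - 0.2 * sim_rain))

full_fit = glm(sim_events ~ sim_rain + sim_noise, offset = log(sim_exposure),
               family = poisson(link = "log"))
reduced_fit = glm(sim_events ~ sim_rain, offset = log(sim_exposure),
                  family = poisson(link = "log"))

# Likelihood-ratio (deviance) test of the reduced model against the full one
comparison = anova(reduced_fit, full_fit, test = "Chisq")
comparison
```

A large p value in the second row would suggest that the extra predictor adds little beyond the reduced model. This test relies on the Poisson variance assumption, so it should be read with caution when the response is overdispersed.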

## Day of the Week Averages
Since both our counts and rates models suggested that the day of the week has the greatest association with the log mean of the Manhattan Bridge's cyclists, I decided to calculate the average counts and rates per day to compare them to each other and the mean across all days considered. The table with this information is below.
```{r}
Count_Averages = c(
  round(mean(Data$Manhattan)),
  round(mean(Data$Manhattan[Data$Day == "Sunday"])),
  round(mean(Data$Manhattan[Data$Day == "Monday"])),
  round(mean(Data$Manhattan[Data$Day == "Tuesday"])),
  round(mean(Data$Manhattan[Data$Day == "Wednesday"])),
  round(mean(Data$Manhattan[Data$Day == "Thursday"])),
  round(mean(Data$Manhattan[Data$Day == "Friday"])),
  round(mean(Data$Manhattan[Data$Day == "Saturday"]))
)

AllDays_Rates_Avg = sum(Data$Manhattan)/sum(Data$Total)

Sun_Rates_Avg = sum(Data$Manhattan[Data$Day == "Sunday"])/sum(Data$Total[Data$Day == "Sunday"])

Mon_Rates_Avg = sum(Data$Manhattan[Data$Day == "Monday"])/sum(Data$Total[Data$Day == "Monday"])

Tues_Rates_Avg = sum(Data$Manhattan[Data$Day == "Tuesday"])/sum(Data$Total[Data$Day == "Tuesday"])

Wed_Rates_Avg = sum(Data$Manhattan[Data$Day == "Wednesday"])/sum(Data$Total[Data$Day == "Wednesday"])

Thur_Rates_Avg = sum(Data$Manhattan[Data$Day == "Thursday"])/sum(Data$Total[Data$Day == "Thursday"])

Fri_Rates_Avg = sum(Data$Manhattan[Data$Day == "Friday"])/sum(Data$Total[Data$Day == "Friday"])

Sat_Rates_Avg = sum(Data$Manhattan[Data$Day == "Saturday"])/sum(Data$Total[Data$Day == "Saturday"])

Day_Rates_Averages = c(AllDays_Rates_Avg, Sun_Rates_Avg, Mon_Rates_Avg, Tues_Rates_Avg, Wed_Rates_Avg, Thur_Rates_Avg, Fri_Rates_Avg, Sat_Rates_Avg)

Rate_Averages = round(Day_Rates_Averages, digits = 4)

Days = c("All Days", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

# Difference between each row's average count and the all-days average (first entry is 0)
Counts_Difference = Count_Averages - Count_Averages[1]

# Same comparison for the unrounded rates, rounded afterwards
Rates_DifferenceB = Day_Rates_Averages - AllDays_Rates_Avg

Rates_Difference = round(Rates_DifferenceB, digits = 4)

Table = cbind(Days, Count_Averages, Counts_Difference, Rate_Averages, Rates_Difference)

kable(Table, caption = "<b><center><span style='color:#000000;'>Distribution of Manhattan Bridge Cyclist Count and Rates July 2017</span></center></b>") %>%
  kable_styling(
    bootstrap_options = c("striped", "bordered"),
    full_width = FALSE,
    position = "center"
  )

```

The table provides greater detail on the implications of our Poisson count and rate models. Namely, weekday totals of Manhattan Bridge cyclists (specifically Monday - Thursday) far outweigh the count of cyclists on the bridge from Friday to Sunday: the average number of cyclists from Monday - Thursday is about 6,325, while the average number from Friday - Sunday is about 4,353.

As for the rate of Manhattan Bridge cyclists relative to cyclists on all East River Bridges, we see that the Manhattan Bridge's cyclist rate is slightly above average Monday - Wednesday, but then below average Thursday through Sunday.
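
The per-day averages and rates in this section could also be computed more compactly with `tapply`. The sketch below uses a small made-up data frame (`toy`, with invented values) that only mimics the column names of the real data.

```r
# Toy data (hypothetical values) with the same column names as the real data
toy = data.frame(
  Day = factor(c("Sunday", "Monday", "Sunday", "Monday"),
               levels = c("Sunday", "Monday")),
  Manhattan = c(100, 300, 200, 500),
  Total = c(1000, 2000, 1000, 2000)
)

# Average count per day of the week
count_avg = tapply(toy$Manhattan, toy$Day, mean)

# Rate per day of the week: summed bridge count over summed total
rate_avg = tapply(toy$Manhattan, toy$Day, sum) / tapply(toy$Total, toy$Day, sum)

count_avg   # Sunday 150, Monday 400
rate_avg    # Sunday 0.15, Monday 0.20
```

Grouping by the factor keeps the days in their level order and avoids repeating the same subsetting expression for each day.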


## Poisson Modeling Takeaways
To conclude, any actions taken in response to our Poisson models' findings should be taken with some degree of caution due to the low practical significance found in both our count and rate models. That being said, there are still valuable takeaways that we can draw from our analysis. 

First, the Manhattan Bridge is clearly busier, both in the sense of raw volume and as a proportion of the overall East River Bridge network, early and throughout the standard workweek than it is during the weekend. Second, the daily average temperature as well as the difference between that day's high and low played very little if any role in the count or rate of cyclists on any given day, but the measure of precipitation does appear to have a relatively noticeable and negative association with the number of that day's cyclists on the Manhattan Bridge.

# Quasi-Poisson Regression Modeling
In addition to analyzing our data at hand via Poisson regression, I decided to also create a quasi-Poisson model of the data. Quasi-Poisson modeling is an alternative to Poisson modeling, and it is particularly valuable when the variance of the model's response variable (number of cyclists on the Manhattan Bridge in this case) is not approximately equal to its mean (known as overdispersion when the variance is larger than the mean, and underdispersion when it is smaller).
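
As a rough illustration of what unequal mean and variance look like, the sketch below compares simulated Poisson counts with simulated negative binomial counts that share the same mean. The data and the names `pois_counts` and `nb_counts` are hypothetical and unrelated to the cyclist dataset.

```r
set.seed(321)

# Simulated counts (hypothetical), both with mean 20
pois_counts = rpois(5000, lambda = 20)          # variance equals the mean
nb_counts = rnbinom(5000, mu = 20, size = 2)    # variance = mu + mu^2 / size = 220

# Variance-to-mean ratio: near 1 for Poisson, far above 1 when overdispersed
pois_ratio = var(pois_counts) / mean(pois_counts)
nb_ratio = var(nb_counts) / mean(nb_counts)

pois_ratio
nb_ratio
```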

For my quasi-Poisson model, I included that day's average temperature, day of the week and precipitation amount as the relevant factors. Day of the week obviously played the biggest role in our previous Poisson models, with precipitation consistently being cited as statistically significant despite relatively low practical significance. For this model, I chose to discretize precipitation, with days of no recorded rain being marked as "0" and days with *any* amount of rain being marked as "1." 
```{r, Quasi-Poisson Model}
# Discretize precipitation: 0 = no recorded rain, 1 = any amount of rain
Data$NewPrecip = as.numeric(Data$Precipitation > 0)

# Reorder columns so that NewPrecip sits next to Precipitation
Data = Data[, c("Date", "Day", "Day_Num", "HighTemp", "LowTemp", "MeanTemp", "TempDiff", "Precipitation", "NewPrecip", "Manhattan", "Total")]


# 1.) Below is the quasi-Poisson regression model
  # As instructed, only includes Day, MeanTemp and NewPrecip

Quasi_Counts_Model = glm(Manhattan ~ Day + MeanTemp + NewPrecip, family = quasipoisson, data = Data)
  
Quasi_Counts_Model_Sum = summary(Quasi_Counts_Model)
Quasi_Counts_Model_Coef = Quasi_Counts_Model_Sum$coefficients

invisible(Quasi_Counts_Model_Coef)
kable(Quasi_Counts_Model_Coef, caption = "<b><center> Quasi-Poisson Counts Regression: Weather and Schedule Relationship with Count of Manhattan Bridge Cyclists </center></b>")
  
```

A summary of the quasi-Poisson counts model can be seen above. We can see that there is great similarity between the findings of this model and our original Poisson counts model. However, before we can determine which one is superior for interpretative use, we must calculate this quasi-Poisson model's dispersion parameter, "phi hat" ($\hat{\phi}$).

## Dispersion and Counts Model Selection

$\hat{\phi}$ is used in quasi-Poisson regression to determine whether our response variable is overdispersed or underdispersed relative to the Poisson assumption. Generally, a $\hat{\phi}$ value of around 1 indicates that the variance of the response is approximately equal to its mean. If a quasi-Poisson model's dispersion value is substantially different from 1, then that model should be used for associative analysis rather than its traditional Poisson counterpart, because the quasi-Poisson model rescales the standard errors by $\sqrt{\hat{\phi}}$ while leaving the coefficient estimates the same as those of a Poisson model with the same predictors. However, if $\hat{\phi}$ ~ 1, then the traditional Poisson model should be used, as it is the simpler of the two and avoids otherwise unnecessary extra steps. The formula for calculating $\hat{\phi}$ can be seen below.

```{r,  fig.align="center", echo=FALSE}
include_graphics("Dispersion_Parameter.png")
```

```{r Dispersion Parameter}
n = nrow(Data)
p = 3
Pearson_Residuals = residuals(Quasi_Counts_Model, type = "pearson")
Sq_Pearson_Residuals = Pearson_Residuals^2
Dispersion_Parameter = (sum(Sq_Pearson_Residuals))/(n-p)

#### Double checked phi's value using Prof's coding method; got same result
  ydif=Data$Manhattan-exp(Quasi_Counts_Model$linear.predictors)  # diff between y and yhat
  prsd = ydif/sqrt(exp(Quasi_Counts_Model$linear.predictors))   # Pearson residuals
  phi_check = sum(prsd^2)/(n-p)
#### 
  
invisible(Dispersion_Parameter)
invisible(phi_check)
```

Our model yielded a value of $\hat{\phi}$ ~ 142, which is **well** beyond the margin of error for a properly dispersed Poisson response variable. For this reason, we can deem that the quasi-Poisson counts model is more valuable for associative analysis than the Poisson counts model. Because of this, we will use the quasi-Poisson for our ultimate interpretations.
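
A dispersion estimate of this kind can be cross-checked against R's own `summary()` output. The sketch below uses simulated data with hypothetical names and is not a recomputation of the value reported above. Note that `summary()` divides the Pearson statistic by the residual degrees of freedom, meaning n minus the number of estimated coefficients; a factor with k levels contributes k - 1 coefficients, so this count can be larger than the number of model terms.

```r
set.seed(321)

# Simulated overdispersed data (hypothetical) with a three-level factor
sim_group = factor(sample(c("a", "b", "c"), size = 300, replace = TRUE))
sim_x = rnorm(300)
sim_mu = exp(3 + 0.2 * sim_x + 0.3 * (sim_group == "b"))
sim_y = rnbinom(300, mu = sim_mu, size = 5)

pois_fit = glm(sim_y ~ sim_group + sim_x, family = poisson)
quasi_fit = glm(sim_y ~ sim_group + sim_x, family = quasipoisson)

# Pearson chi-square divided by the residual degrees of freedom
# (n minus the number of estimated coefficients, here 300 - 4)
phi_manual = sum(residuals(quasi_fit, type = "pearson")^2) / df.residual(quasi_fit)

# summary() reports the same estimate for quasi families
phi_summary = summary(quasi_fit)$dispersion
all.equal(phi_manual, phi_summary)

# Quasi-Poisson standard errors are the Poisson ones scaled by sqrt(phi)
se_pois = summary(pois_fit)$coefficients[, "Std. Error"]
se_quasi = summary(quasi_fit)$coefficients[, "Std. Error"]
all.equal(unname(se_quasi), unname(se_pois * sqrt(phi_manual)))
```

With a dispersion estimate far above 1, the choice of denominator changes the numeric value somewhat but would not be expected to change the conclusion that the response is overdispersed.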

## Visual Aids
Referring to our quasi-Poisson model summary above, we see that the day's average temperature does not appear to have a significant statistical or practical association with the Manhattan Bridge's number of cyclists. However, there does appear to be a difference between the number of cyclists on a totally clear day as opposed to a day with at least *some* level of precipitation (recorded via variable NewPrecip). And, as consistently seen in our original Poisson regression models, there is certainly a large difference between the typical number of cyclists depending on the day of the week.

Knowing this, I created two visuals below to enhance our grasp of the relationship that both the day of the week and the presence of precipitation have with each other as well as the standard number of cyclists that were on the Manhattan Bridge throughout July 2017. Unfortunately, there were no instances of Tuesdays or Wednesdays with precipitation in this study, resulting in a blank in both our table and bar chart below.

```{r, fig.align='center', fig.width=10, fig.height=6}
#### Table
Days = c("All Days", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

No_Precipitation = c(
  round(mean(Data$Manhattan[Data$NewPrecip == 0])),
  round(mean(Data$Manhattan[Data$NewPrecip == 0 & Data$Day == "Sunday"])),
  round(mean(Data$Manhattan[Data$NewPrecip == 0 & Data$Day == "Monday"])),
  round(mean(Data$Manhattan[Data$NewPrecip == 0 & Data$Day == "Tuesday"])),
  round(mean(Data$Manhattan[Data$NewPrecip == 0 & Data$Day == "Wednesday"])),
  round(mean(Data$Manhattan[Data$NewPrecip == 0 & Data$Day == "Thursday"])),
  round(mean(Data$Manhattan[Data$NewPrecip == 0 & Data$Day == "Friday"])),
  round(mean(Data$Manhattan[Data$NewPrecip == 0 & Data$Day == "Saturday"]))
)

Some_Precipitation = c(
  round(mean(Data$Manhattan[Data$NewPrecip == 1])),
  round(mean(Data$Manhattan[Data$NewPrecip == 1 & Data$Day == "Sunday"])),
  round(mean(Data$Manhattan[Data$NewPrecip == 1 & Data$Day == "Monday"])),
  round(mean(Data$Manhattan[Data$NewPrecip == 1 & Data$Day == "Tuesday"])),
  round(mean(Data$Manhattan[Data$NewPrecip == 1 & Data$Day == "Wednesday"])),
  round(mean(Data$Manhattan[Data$NewPrecip == 1 & Data$Day == "Thursday"])),
  round(mean(Data$Manhattan[Data$NewPrecip == 1 & Data$Day == "Friday"])),
  round(mean(Data$Manhattan[Data$NewPrecip == 1 & Data$Day == "Saturday"]))
)

Vis_Table = data.frame(Days, No_Precipitation, Some_Precipitation)

kable(Vis_Table, caption = "<b><center><span style='color:#000000;'>Average Cyclist Counts on the Manhattan Bridge July 2017</span></center></b>") %>%
  kable_styling(
    bootstrap_options = c("striped", "bordered"),
    full_width = FALSE,
    position = "center")

#### Barchart
Vis_long =
  Vis_Table %>%
  pivot_longer(
    cols = c(No_Precipitation, Some_Precipitation),
    names_to = "Precipitation",
    values_to = "Manhattan"
  )

Vis_long$Days = factor(
  Vis_long$Days,
  levels = c("All Days", "Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday")
)
ggplot(Vis_long,
       aes(x = Days, y = Manhattan, fill = Precipitation)) +
  geom_bar(stat = "identity",
           position = position_dodge(width = 0.9),
           na.rm = TRUE) +
  labs(
    title = "Average Cyclist Counts on the Manhattan Bridge July 2017",
    x = "Day of the Week",
    y = "Number of Cyclists",
    fill = "Precipitation"
  ) +
  scale_fill_manual(
    values = c("No_Precipitation" = "darkred",
               "Some_Precipitation"    = "lightblue")
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank()
  )
```

The depictions above provide added clarity to the summary statistics from our quasi-Poisson model. Sunday was the baseline for our regression model's calculations, and all remaining days other than Saturday had positive coefficients. We can see in both our chart and table that the count of cyclists is certainly higher throughout the week than it is on the weekends. As for the role of precipitation, our discretized variable NewPrecip had a regression coefficient with an absolute value of ~ 0.4 (the highest of any factor in the model) and a significant p value ~ .0002. That finding can be intuitively confirmed by taking a glance at our visuals. Other than Thursday, every day for which there are data points for both precipitation and no precipitation shows a *noticeable* decrease in cyclists when precipitation is present.

# Conclusion
Through our analysis using both Poisson and quasi-Poisson modeling techniques, our findings were relatively consistent. They are as follows:

- The Manhattan Bridge is far busier during the week than it is on the weekend.
- At least in the month of July, the temperature does not play any sort of significant role in the raw number of the bridge's cyclists nor its share of the totality of East River Bridge cyclists.
- Although temperature is not significantly associated with cyclist traffic, precipitation is. Any amount of precipitation has a negative and statistically significant relationship with Manhattan Bridge's cyclist traffic.

Regarding the choice between our Poisson and quasi-Poisson models for estimating the association between all these factors, the quasi-Poisson model is preferable due to this dataset's *extremely* high dispersion parameter ($\hat{\phi}$ ~ 142). 

If we were to continue or expand on this analysis in the future, it would be valuable to expand the scope of our data outside of the month of July and into months that border on seasonal changes such as March, April or October. Intuitively, one might guess that the day's temperatures play a much larger role in a time of the year like that.

# References:

Original Dataset Source:

- https://pengdsci.github.io/STA321/ww09/w09-AssignDataSet.xlsx

Dataset Download Links via Github:

- https://raw.githubusercontent.com/ChrisB2323/STA321/refs/heads/main/NYC_Cyclists_Data.xlsx