Introduction
This project investigates how weekday patterns and weather conditions
affect daily cyclist counts on the Queensboro Bridge in New York City.
Because cyclist traffic is a count variable that can be influenced by
multiple environmental and behavioral factors, our analysis combines
Poisson and quasi-Poisson generalized linear models to evaluate how the
number of cyclists changes with day of the week, average temperature,
precipitation, and total roadway exposure.
We address three core research questions:
RQ1: Do cyclist counts vary systematically by weekday, temperature, and precipitation?
RQ2: Is the standard Poisson model appropriate, or is over-dispersion so severe that inference becomes unreliable without adjustment?
RQ3: Does incorporating traffic exposure through an offset term meaningfully change conclusions about which predictors matter?
By comparing Poisson, quasi-Poisson, and quasi-Poisson rate models,
we aim to determine which modeling framework provides the most
trustworthy inference while offering a practical interpretation of how
weather and weekly patterns shape cycling behavior.
Data and Variables
Data Source
The dataset contains daily observations of cyclist counts crossing
the Queensboro Bridge along with associated weather variables and total
traffic exposure. Because observations are aggregated by day, each row
represents a distinct measurement of environmental and traffic
conditions.
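library(readxl)   # read_excel()
library(dplyr)    # %>%, mutate(), filter()
library(knitr)    # kable()
data_path <- "C:/Users/rg03/Downloads/sta321/PoissonData.xlsx"
raw <- read_excel(data_path)
dim(raw)          # number of rows, number of columns
[1] 31 7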
Variable Definitions
To ensure clarity and reproducibility, all variables used in the
analysis are listed below with names, descriptions, and data types:
count (numeric, integer-valued): Total number of cyclists counted that day.
exposure (numeric): Total traffic volume, used as an offset in rate models.
avgtemp (numeric): Average daily temperature, computed as the mean of high and low temperatures.
newprecip (binary indicator): 1 if precipitation > 0; 0 otherwise.
day (ordered factor): Day of the week (Monday through Sunday), coded as an ordered factor.
Because day is ordered, R generates orthogonal polynomial contrasts by default:
day.L: linear trend across the week,
day.Q: quadratic curvature,
day.C, day^4, day^5, day^6: cubic and higher-order polynomial terms.
These terms appear in the model output but are not standalone
variables. They represent structured contrasts that capture how cyclist
counts shift across the week.
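As a minimal sketch of where these contrast names come from, the base R snippet below builds a toy ordered factor with the same seven levels and inspects its default contrast matrix. The object names `days`, `day_toy`, and `cp` are illustrative and do not appear in the analysis.

```r
# Illustrative only: a toy ordered factor with the same seven levels as `day`.
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
day_toy <- factor(days, levels = days, ordered = TRUE)

# Ordered factors use polynomial contrasts (contr.poly) by default.
cp <- contrasts(day_toy)

# Column suffixes that R appends to the factor name in model output.
colnames(cp)

# The columns are orthonormal: their cross-products form the identity matrix.
round(crossprod(cp), 10)

# Seven levels yield six contrast columns, the same number of parameters
# that six treatment (dummy) indicators would use.
ncol(cp)
```

Because the six polynomial columns span the same space as six dummy indicators, fitted values are unchanged by this coding; only the meaning of the individual coefficients differs.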
Data Processing
All preprocessing steps were explicitly documented:
- Variable names were standardized to lower case and simplified for consistency.
- Temperature variables were combined into a single, interpretable variable (avgtemp).
- Precipitation was converted to a binary indicator to capture the main effect of rain vs. dry conditions.
- day was encoded as an ordered factor, so the model represents it with six polynomial contrasts instead of six dummy indicators; the overall fit is the same and only the parameterization differs.
- Only rows with complete information were retained.
This preprocessing ensures the final dataset is clean, interpretable,
and ready for generalized linear modeling.
# Standardize column names: lower case, underscores, no punctuation
nm <- names(raw)
nm <- tolower(nm)
nm <- gsub("[[:space:]]+", "_", nm)   # whitespace runs -> single underscore
nm <- gsub("[^a-z0-9_]", "", nm)      # drop any remaining special characters
nm <- gsub("_+", "_", nm)             # collapse repeated underscores
nm <- gsub("^_+|_+$", "", nm)         # trim leading/trailing underscores
names(raw) <- nm

# Derive the analysis variables, then keep rows with complete information
dat <- raw %>%
  mutate(
    count     = as.numeric(queensborobridge),
    exposure  = as.numeric(total),
    avgtemp   = (as.numeric(hightemp) + as.numeric(lowtemp)) / 2,
    newprecip = ifelse(as.numeric(precipitation) > 0, 1L, 0L),
    day       = factor(day,
                       levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                  "Friday", "Saturday", "Sunday"),
                       ordered = TRUE)
  ) %>%
  filter(!is.na(count), !is.na(avgtemp), !is.na(newprecip), !is.na(day))
kable(head(dat), caption = "Preview of processed data")
Preview of processed data
| date | day | hightemp | lowtemp | precipitation | queensborobridge | total | count | exposure | avgtemp | newprecip |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2025-07-01 | Saturday | 84.9 | 72.0 | 0.23 | 3216 | 11867 | 3216 | 11867 | 78.45 | 1 |
| 2025-07-02 | Sunday | 87.1 | 73.0 | 0.00 | 3579 | 13995 | 3579 | 13995 | 80.05 | 0 |
| 2025-07-03 | Monday | 87.1 | 71.1 | 0.45 | 4230 | 16067 | 4230 | 16067 | 79.10 | 1 |
| 2025-07-04 | Tuesday | 82.9 | 70.0 | 0.00 | 3861 | 13925 | 3861 | 13925 | 76.45 | 0 |
| 2025-07-05 | Wednesday | 84.9 | 71.1 | 0.00 | 5862 | 23110 | 5862 | 23110 | 78.00 | 0 |
| 2025-07-06 | Thursday | 75.0 | 71.1 | 0.00 | 5251 | 21861 | 5251 | 21861 | 73.05 | 0 |
The first six rows are consistent with the preprocessing steps: count and
exposure are numeric copies of the bridge and total columns, avgtemp equals
the mean of the high and low temperatures (for example, (84.9 + 72.0) / 2 =
78.45 in the first row), newprecip is 1 only on days with precipitation > 0,
and day holds the weekday labels encoded as an ordered factor. The dataset is
ready for GLM modeling.
Exploratory Data Analysis
Rationale
Exploratory Data Analysis (EDA) provides an initial understanding of
how cyclist counts behave relative to weekday patterns, temperature, and
precipitation. While pairwise plots cannot establish causal or adjusted
relationships, they identify broad trends, potential nonlinearity, and
sources of variability that must be accounted for in the GLM
framework.
Cyclist Counts by Day
Visualizing counts by weekday suggests that cyclist volume tends to
be higher on weekdays than weekends, though variability is substantial
across all days. This aligns with expectations for commuter patterns but
cannot yet distinguish whether day-of-week differences persist after
adjusting for weather and exposure.
library(ggplot2)   # ggplot(), geom_boxplot(), labs()
ggplot(dat, aes(day, count)) + geom_boxplot() +
  labs(title = "Cyclist Counts by Day", x = "Day", y = "Count")

Interpretation: Weekdays show somewhat higher median
counts, but variability is large. A formal model is needed.
Counts vs Temperature
Cyclist volume rises with temperature, particularly as conditions
warm from cooler days into moderate temperatures. However, the large
spread in counts at any given temperature indicates that temperature
alone cannot explain the full variation in cycling patterns—reinforcing
the need for multivariable models.
ggplot(dat, aes(avgtemp, count)) + geom_point(alpha = .5) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Counts vs Temperature", x = "Avg Temp (°F)", y = "Count")

Interpretation: Counts trend upward on warmer days, but the wide
variation suggests that several predictors matter jointly.
Counts by Precipitation
Rainy days clearly show lower cyclist counts than dry days. This
strong pairwise contrast suggests precipitation is a key driver of
cycling behavior. Later modeling will determine whether this effect
remains after accounting for weekday structure, exposure, and
temperature.
ggplot(dat, aes(factor(newprecip), count)) + geom_boxplot() +
  labs(title = "Counts by Precipitation", x = "0=Dry / 1=Wet", y = "Count")

Interpretation: Rain clearly lowers counts.
Poisson and Quasi-Poisson Models
Rationale
Cyclist counts are non-negative integers, making Poisson regression a
natural starting point. However, the Poisson model assumes that the
conditional variance equals the conditional mean: Var(Y) = E(Y) = μ. When
the variance greatly exceeds the mean, a condition known as
over-dispersion, standard errors are underestimated and statistical
significance is overstated. A quasi-Poisson model corrects this by
introducing a dispersion parameter φ, so that Var(Y) = φμ. This model
preserves the mean structure (and therefore the coefficient estimates) of
the Poisson GLM while scaling the standard errors by √φ, so inference
remains approximately valid when dispersion is large. Additionally,
including log(exposure) as an offset models the cycling rate per unit of
traffic, which is needed to interpret cyclist counts relative to total
traffic flow.
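To illustrate the relationship between the two model families, the sketch below fits both to simulated over-dispersed counts. The data, the negative binomial generator, and the names `toy`, `fit_pois`, `fit_quasi`, and `phi_toy` are assumptions made for illustration; they are not part of the bridge analysis.

```r
set.seed(321)
n  <- 200
x  <- rnorm(n)
mu <- exp(5 + 0.3 * x)

# Negative binomial draws have variance well above the mean (over-dispersion).
y   <- rnbinom(n, mu = mu, size = 2)
toy <- data.frame(y = y, x = x)

fit_pois  <- glm(y ~ x, family = poisson(link = "log"), data = toy)
fit_quasi <- glm(y ~ x, family = quasipoisson(link = "log"), data = toy)

# Pearson chi-square statistic divided by residual degrees of freedom.
phi_toy <- sum(residuals(fit_pois, type = "pearson")^2) / fit_pois$df.residual

# Point estimates agree; only the standard errors differ.
all.equal(coef(fit_pois), coef(fit_quasi))

se_pois  <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]

# Each ratio equals sqrt(phi_toy), the standard-error inflation factor.
se_quasi / se_pois
sqrt(phi_toy)
```

The same pattern should hold for any data set: switching from `poisson` to `quasipoisson` leaves the fitted means untouched and rescales every standard error by the square root of the estimated dispersion.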
Fit Poisson & Quasi-Poisson Models
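# Poisson GLM for raw counts
m_pois <- glm(count ~ day + avgtemp + newprecip,
              family = poisson(link = "log"), data = dat)
# Quasi-Poisson GLM: same mean model, dispersion estimated from the data
m_quasi <- glm(count ~ day + avgtemp + newprecip,
               family = quasipoisson(link = "log"), data = dat)
# Quasi-Poisson rate model: log(exposure) offset, rows with valid exposure only
m_quasi_rate <- glm(count ~ day + avgtemp + newprecip +
                      offset(log(exposure)),
                    family = quasipoisson(link = "log"),
                    data = dat %>% filter(!is.na(exposure), exposure > 0))
# Dispersion estimate: Pearson chi-square divided by residual degrees of freedom
phi_hat <- sum(residuals(m_pois, type = "pearson")^2) / m_pois$df.residual
kable(data.frame(Dispersion = round(phi_hat, 2)),
      caption = "Estimated Over-Dispersion φ")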
Estimated Over-Dispersion φ
| Dispersion |
| --- |
| 109.11 |
The dispersion estimate from the Poisson model is φ = 109.11. This means
the variance of cyclist counts is roughly 109 times the value the Poisson
model assumes (the mean), rendering Poisson-based inference unreliable.
Consequences include:
- Misleadingly small p-values
- Confidence intervals that are too narrow
- Overstated significance for several predictors
- Inflated Type I error rate
Because of this extreme over-dispersion, the quasi-Poisson framework
is essential for drawing valid conclusions.
Poisson IRRs (for comparison)
library(broom)   # tidy()
kable(tidy(m_pois, conf.int = TRUE, exponentiate = TRUE),
      digits = 3, caption = "Poisson IRRs")
Poisson IRRs
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
| --- | --- | --- | --- | --- | --- | --- |
| (Intercept) | 3067.571 | 0.042 | 189.020 | 0.000 | 2822.419 | 3333.737 |
| day.L | 0.762 | 0.007 | -38.606 | 0.000 | 0.752 | 0.773 |
| day.Q | 0.888 | 0.007 | -16.881 | 0.000 | 0.876 | 0.900 |
| day.C | 1.039 | 0.008 | 5.091 | 0.000 | 1.024 | 1.055 |
| day^4 | 1.113 | 0.007 | 15.060 | 0.000 | 1.098 | 1.129 |
| day^5 | 0.951 | 0.007 | -6.953 | 0.000 | 0.937 | 0.964 |
| day^6 | 0.985 | 0.007 | -2.149 | 0.032 | 0.971 | 0.999 |
| avgtemp | 1.006 | 0.001 | 11.007 | 0.000 | 1.005 | 1.007 |
| newprecip | 0.727 | 0.007 | -43.950 | 0.000 | 0.717 | 0.737 |
The Poisson model suggests almost every term is statistically
significant, including all polynomial weekday contrasts, temperature,
and precipitation. However, due to φ > 100, these results represent
false precision and cannot be trusted.
These results serve only as a comparison point, demonstrating what
happens when over-dispersion is ignored.
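As a worked check on how much precision is lost, the arithmetic below divides two Poisson test statistics by the square root of the dispersion estimate. The numbers are copied from this report; the variable names `inflate`, `z_pois`, and `t_quasi` are illustrative.

```r
phi_hat <- 109.11            # dispersion estimate reported in this analysis
inflate <- sqrt(phi_hat)     # factor by which Poisson standard errors are too small

# Poisson test statistics for two terms, copied from the Poisson IRR table.
z_pois <- c(day.L = -38.606, newprecip = -43.950)

# Dividing by sqrt(phi) gives the statistics a quasi-Poisson fit would report.
t_quasi <- z_pois / inflate

round(inflate, 2)
round(t_quasi, 3)
```

The rescaled values should agree, up to rounding of the inputs, with the statistic column of the quasi-Poisson table for the same terms: the standard errors grow by a factor of about ten, which is why most terms lose significance.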
Quasi-Poisson IRRs (no offset)
kable(tidy(m_quasi, conf.int = TRUE, exponentiate = TRUE),
      digits = 3, caption = "Quasi-Poisson IRRs")
Quasi-Poisson IRRs
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
| --- | --- | --- | --- | --- | --- | --- |
| (Intercept) | 3067.571 | 0.444 | 18.095 | 0.000 | 1279.506 | 7287.428 |
| day.L | 0.762 | 0.073 | -3.696 | 0.001 | 0.660 | 0.880 |
| day.Q | 0.888 | 0.074 | -1.616 | 0.120 | 0.768 | 1.025 |
| day.C | 1.039 | 0.079 | 0.487 | 0.631 | 0.890 | 1.215 |
| day^4 | 1.113 | 0.074 | 1.442 | 0.163 | 0.963 | 1.289 |
| day^5 | 0.951 | 0.076 | -0.666 | 0.513 | 0.818 | 1.103 |
| day^6 | 0.985 | 0.075 | -0.206 | 0.839 | 0.850 | 1.140 |
| avgtemp | 1.006 | 0.006 | 1.054 | 0.303 | 0.995 | 1.017 |
| newprecip | 0.727 | 0.076 | -4.207 | 0.000 | 0.626 | 0.842 |
After correcting for over-dispersion:
- The weekday linear contrast (day.L, p = 0.001) remains significant, suggesting a genuine weekly pattern in cyclist counts.
- The temperature effect (IRR ≈ 1.006, p = 0.303) becomes statistically non-significant: the estimated upward trend is small relative to the variability in the data.
- The newprecip IRR of 0.727 (95% CI 0.626 to 0.842) remains strongly significant, indicating that cyclist counts are about 27% lower on rainy days after adjusting for weekday and temperature.
This model offers a more realistic assessment of which predictors
meaningfully influence cyclist traffic.
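For readers less familiar with incidence rate ratios, the short calculation below converts two of the reported IRRs into percent changes. The values are copied from the quasi-Poisson table; `irr`, `pct_change`, and `irr_temp_10` are illustrative names.

```r
# IRRs copied from the quasi-Poisson (no offset) table in this report.
irr <- c(avgtemp = 1.006, newprecip = 0.727)

# Percent change in expected count per one-unit increase in the predictor.
pct_change <- (irr - 1) * 100
round(pct_change, 1)

# IRRs act multiplicatively, so a 10-degree increase corresponds to IRR^10.
irr_temp_10 <- irr[["avgtemp"]]^10
round((irr_temp_10 - 1) * 100, 1)
```

On this scale, a wet day corresponds to roughly 27% fewer cyclists, while a ten-degree rise in average temperature corresponds to an increase of only about 6%, an estimate the model cannot distinguish from zero.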
Quasi-Poisson Rate Model (Offset Included)
kable(tidy(m_quasi_rate, conf.int = TRUE, exponentiate = TRUE),
      digits = 3, caption = "Quasi-Poisson Rate Model IRRs")
Quasi-Poisson Rate Model IRRs
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
| --- | --- | --- | --- | --- | --- | --- |
| (Intercept) | 0.272 | 0.165 | -7.879 | 0.000 | 0.197 | 0.376 |
| day.L | 1.005 | 0.028 | 0.176 | 0.862 | 0.952 | 1.060 |
| day.Q | 0.949 | 0.028 | -1.895 | 0.071 | 0.899 | 1.002 |
| day.C | 0.972 | 0.029 | -0.964 | 0.345 | 0.918 | 1.030 |
| day^4 | 0.991 | 0.028 | -0.341 | 0.737 | 0.938 | 1.046 |
| day^5 | 1.017 | 0.028 | 0.596 | 0.557 | 0.962 | 1.075 |
| day^6 | 1.013 | 0.028 | 0.471 | 0.642 | 0.959 | 1.070 |
| avgtemp | 0.998 | 0.002 | -0.755 | 0.458 | 0.994 | 1.003 |
| newprecip | 1.049 | 0.028 | 1.697 | 0.104 | 0.992 | 1.108 |
The quasi-Poisson rate model includes log(exposure) as an offset,
allowing us to analyze cycling rates per unit traffic.
Key findings:
- Weekday contrasts lose significance, suggesting cycling rates are more stable across the week than raw counts.
- Temperature shows a near-null rate effect (IRR = 0.998, p = 0.458).
- Precipitation is no longer a significant predictor of the cycling rate (IRR = 1.049, p = 0.104); the point estimate moves slightly above 1 rather than staying below it.
This does not contradict the count-model results. Taken together, the two
models are consistent with the following reading:
- Rain lowers cyclist counts and total traffic volume by similar proportions.
- The cyclist share of total traffic may therefore remain roughly stable on rainy days.
The distinction between counts and rates matters for interpreting how
weather affects cycling behavior.
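The contrast between count and rate conclusions can be reproduced on simulated data. The sketch below assumes, purely for illustration, that rain cuts total traffic by 27% while the cyclist share stays fixed at 25%; none of these numbers or object names (`toy`, `fit_count`, `fit_rate`) come from the bridge data.

```r
set.seed(42)
n    <- 200
rain <- rep(c(0, 1), each = n / 2)

# Assumption for illustration: rain reduces total traffic by 27%,
# while cyclists remain 25% of traffic in both conditions.
exposure <- round(runif(n, 12000, 20000) * ifelse(rain == 1, 0.73, 1))
count    <- rpois(n, lambda = 0.25 * exposure)
toy      <- data.frame(count, exposure, rain)

# Count model: no offset, so the rain coefficient absorbs the traffic drop.
fit_count <- glm(count ~ rain, family = quasipoisson(link = "log"), data = toy)

# Rate model: the offset fixes the coefficient of log(exposure) at 1,
# so the rain coefficient describes count / exposure.
fit_rate <- glm(count ~ rain + offset(log(exposure)),
                family = quasipoisson(link = "log"), data = toy)

irr_count <- exp(coef(fit_count)[["rain"]])  # below 1: fewer cyclists on wet days
irr_rate  <- exp(coef(fit_rate)[["rain"]])   # near 1: share of traffic unchanged
c(count_model = irr_count, rate_model = irr_rate)
```

In this toy setting the count model reports a clear rain effect and the rate model reports almost none, which mirrors the pattern seen in the fitted models without implying that the same mechanism generated the real data.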
To synthesize the results:
- The Poisson model overstates significance across all predictors due to severe over-dispersion.
- The quasi-Poisson model provides valid inference and confirms rainfall and weekday patterns as meaningful drivers of counts.
- The quasi-Poisson rate model offers the most conservative inference, adjusting for traffic volume and controlling for over-dispersion.
The rate model is therefore the final recommended model for
interpretation.
Model-Based Effect Visualization
Rationale
Pairwise plots ignore confounding; predictions from the fitted model
show adjusted relationships.
Predicted Counts vs Temperature
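library(tidyr)   # expand_grid()
# Temperature grid spanning the 5th to 95th percentiles of observed values
temp_seq <- seq(quantile(dat$avgtemp, .05), quantile(dat$avgtemp, .95), length.out = 50)
# Hold day at Wednesday and exposure at its median; vary temperature and rain
newdat <- expand_grid(avgtemp = temp_seq, newprecip = c(0, 1)) %>%
  mutate(day = factor("Wednesday", levels = levels(dat$day), ordered = TRUE),
         exposure = median(dat$exposure, na.rm = TRUE))
newdat$pred <- predict(m_quasi_rate, newdata = newdat, type = "response")
ggplot(newdat, aes(avgtemp, pred, color = factor(newprecip))) +
  geom_line(linewidth = 1) +   # linewidth replaces size for lines in ggplot2 >= 3.4.0
  labs(title = "Predicted Counts vs Temperature",
       x = "Avg Temp", y = "Predicted Count",
       color = "newprecip")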

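Interpretation: With day held at Wednesday and exposure held at its
median, the dry and wet curves lie close together. The rate model's
precipitation IRR of 1.049 places the wet-day curve about 5% above the
dry-day curve, a difference that is not statistically significant (p =
0.104), and the temperature IRR of 0.998 gives both curves a slight
downward slope. The strong rain effect seen in raw counts does not carry
over once exposure is held fixed.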
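Prediction curves like these carry uncertainty that a single line does not show. One common way to add approximate confidence bands, sketched below on simulated data, is to predict on the link scale with standard errors and back-transform. The toy data and the names `toy`, `fit`, `grid`, and `crit` are assumptions for illustration only.

```r
set.seed(7)
n   <- 120
toy <- data.frame(avgtemp  = runif(n, 60, 90),
                  exposure = round(runif(n, 12000, 22000)))
toy$count <- rnbinom(n,
                     mu = 0.25 * toy$exposure * exp(0.002 * (toy$avgtemp - 75)),
                     size = 20)

fit <- glm(count ~ avgtemp + offset(log(exposure)),
           family = quasipoisson(link = "log"), data = toy)

# Prediction grid with exposure held at its median.
grid <- data.frame(avgtemp  = seq(65, 85, length.out = 5),
                   exposure = median(toy$exposure))

# Predict on the link (log) scale, then exponentiate the interval endpoints
# so the band stays positive on the count scale.
p    <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)
crit <- qt(0.975, df = fit$df.residual)
grid$pred  <- exp(p$fit)
grid$lower <- exp(p$fit - crit * p$se.fit)
grid$upper <- exp(p$fit + crit * p$se.fit)
grid
```

The resulting `lower` and `upper` columns could be drawn with `geom_ribbon()` behind the predicted lines; overlapping bands for wet and dry days would be the visual counterpart of a non-significant precipitation term.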
Results and Conclusions
Connection to Research Questions.
RQ1: In the quasi-Poisson count model, the linear weekday contrast and precipitation significantly affect cyclist volume, while the temperature effect is small and not statistically significant.
RQ2: The Poisson model is not appropriate due to extreme over-dispersion (φ ≈ 109), and the quasi-Poisson approach corrects the underestimated standard errors.
RQ3: Including an exposure offset alters interpretation by modeling cycling rates instead of raw counts; rainy days reduce total counts but do not significantly change the cycling rate after adjusting for traffic volume.
Key Findings
- Cyclist counts are highly over-dispersed; Poisson inference is invalid without correction.
- The quasi-Poisson GLM provides reliable inference and shows:
  - Rain significantly decreases daily cyclist counts.
  - Weekday patterns persist after correcting for over-dispersion.
  - Temperature effects are mild and statistically uncertain.
- When adjusting for exposure, precipitation no longer significantly predicts cycling rate, highlighting the advantage of interpreting rates rather than raw counts.
Final Model Recommendation
Based on corrected dispersion, interpretability, and adjustment for
exposure, the quasi-Poisson rate model with a log(exposure) offset is
the final chosen model for this analysis. This model provides the most
reliable inference and best reflects cycling behavior per unit of
traffic flow.
Modeling cyclist rates rather than raw counts allows us to interpret
cycling activity relative to overall traffic volume, ensuring that
changes in total roadway use do not obscure actual behavioral patterns
among cyclists.
Although precipitation has a strong negative effect on total cyclist
counts, its effect on cycling rates becomes statistically
non-significant once exposure is included. This shift indicates that
rainy days reduce overall traffic volume across all modes, not only
bicycles; therefore, cyclists may represent a similar share of total
traffic even though their absolute numbers decline.
Assumptions Check
The quasi-Poisson rate model reasonably satisfies the generalized
linear model assumptions. Daily observations are treated as independent,
and visual diagnostics indicate no serious violation of log-linearity
under the log link. Zero inflation is not a concern because cyclist counts
are always well above zero. The quasi-Poisson variance structure addresses
over-dispersion, producing standard errors that better reflect the
uncertainty in the data.
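Checks of this kind can also be made numerically. The sketch below, run on simulated data of the same length as the study period, computes scaled Pearson residuals, their lag-1 autocorrelation, and their correlation with the linear predictor. The data and the names `toy`, `fit`, `res_std`, `lag1`, and `trend` are illustrative assumptions, not output from the fitted bridge models.

```r
set.seed(11)
n   <- 31
toy <- data.frame(x = rnorm(n))
toy$y <- rnbinom(n, mu = exp(6 + 0.2 * toy$x), size = 5)

fit <- glm(y ~ x, family = quasipoisson(link = "log"), data = toy)

# Pearson residuals scaled by sqrt(dispersion) have roughly unit variance
# when the quasi-Poisson variance function is adequate.
res_std <- residuals(fit, type = "pearson") / sqrt(summary(fit)$dispersion)

# Lag-1 autocorrelation of residuals in row (time) order:
# values near zero are consistent with independent days.
lag1 <- acf(res_std, lag.max = 1, plot = FALSE)$acf[2]

# A rough linearity check: residuals should show little association
# with the linear predictor.
trend <- cor(res_std, predict(fit, type = "link"))

c(lag1 = lag1, trend = trend)
```

With only 31 daily observations, both summaries are noisy, so they are best read alongside residual plots and not as formal tests.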
Practical Implications
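- Rain is associated with a major drop in cyclist counts (about 27% in the count model), so infrastructure should support wet-weather riding.
- Temperature shows only a weak and statistically uncertain association with counts.
- Weekday cycling demand follows a pattern, which is important for planning.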
