1 Introduction

This project investigates how weekday patterns and weather conditions affect daily cyclist counts on the Queensboro Bridge in New York City. Because cyclist traffic is a count variable that can be influenced by multiple environmental and behavioral factors, our analysis combines Poisson and quasi-Poisson generalized linear models to evaluate how the number of cyclists changes with day of the week, average temperature, precipitation, and total roadway exposure.

We address three core research questions:

Do cyclist counts vary systematically by weekday, temperature, and precipitation?

Is the standard Poisson model appropriate, or is over-dispersion so severe that inference becomes unreliable without adjustment?

Does incorporating traffic exposure through an offset term meaningfully change conclusions about which predictors matter?

By comparing Poisson, quasi-Poisson, and quasi-Poisson rate models, we aim to determine which modeling framework provides the most trustworthy inference while offering a practical interpretation of how weather and weekly patterns shape cycling behavior.

2 Data and Variables

2.1 Data Source

The dataset contains daily observations of cyclist counts crossing the Queensboro Bridge along with associated weather variables and total traffic exposure. Because observations are aggregated by day, each row represents a distinct measurement of environmental and traffic conditions.

library(readxl)  # provides read_excel()

data_path <- "C:/Users/rg03/Downloads/sta321/PoissonData.xlsx"
raw <- read_excel(data_path)
dim(raw)

[1] 31 7

2.2 Variable Definitions

To ensure clarity and reproducibility, all variables used in the analysis are listed below with names, descriptions, and data types:

count (numeric, integer-valued): Total number of cyclists counted on the Queensboro Bridge that day.

exposure (numeric): Total traffic volume, used as the exposure offset in rate models.

avgtemp (numeric): Average daily temperature in °F, computed as the mean of the high and low temperatures.

newprecip (binary indicator): 1 if precipitation > 0; 0 otherwise.

day (ordered factor): Day of the week (Monday through Sunday), coded as an ordered factor to detect smooth weekly patterns.

Because day is ordered, R generates orthogonal polynomial contrasts such as:

day.L: linear weekday trend,

day.Q: quadratic weekly curvature,

day.C, day^4, day^5, day^6: cubic and higher-order polynomial terms.

These terms appear in the model output but are not standalone variables. They represent structured contrasts that capture how cyclist counts shift across the week.
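As a quick illustration of where these names come from, the sketch below builds a seven-level ordered factor and inspects the contrast columns R generates by default. The weekday labels match the levels used in this report; the object names (`wk`, `P`, `X`) are hypothetical. It only shows the coding scheme, not the fitted model.

```r
# Seven weekday levels, as in this report; `wk` is a hypothetical object name.
lv <- c("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")
wk <- factor(lv, levels = lv, ordered = TRUE)

# Ordered factors use orthogonal polynomial contrasts (contr.poly) by default.
P <- contr.poly(7)
poly_names <- colnames(P)   # ".L" ".Q" ".C" "^4" "^5" "^6"

# The design matrix has an intercept plus six contrast columns,
# the same number of parameters as six treatment-coded dummies.
X <- model.matrix(~ wk)
design_names <- colnames(X)

# The contrast columns are orthonormal: crossprod(P) is numerically the identity.
is_orthonormal <- isTRUE(all.equal(crossprod(P), diag(6),
                                   check.attributes = FALSE))
```

Because the polynomial and treatment codings span the same column space, switching between them changes the individual coefficients but not the fitted values.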

2.3 Data Processing

All preprocessing steps were explicitly documented:

Variable names were standardized to lower case and simplified for consistency.

Temperature variables were combined into a single, interpretable variable (avgtemp).

Precipitation was converted to a binary indicator to capture the main effect of rain vs. dry conditions.

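day was encoded as an ordered factor, so R represents it with six orthogonal polynomial contrasts instead of six treatment-coded dummy variables. The number of parameters and the fitted values are the same under either coding; only the interpretation of the individual coefficients differs.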

Only rows with complete information were retained.

This preprocessing ensures the final dataset is clean, interpretable, and ready for generalized linear modeling.

library(dplyr)  # %>%, mutate(), filter()
library(knitr)  # kable()

# Standardize column names: lower case, underscores for whitespace, no punctuation
nm <- names(raw)
nm <- tolower(nm)
nm <- gsub("[[:space:]]+", "_", nm)
nm <- gsub("[^a-z0-9_]", "", nm)
nm <- gsub("_+", "_", nm)
nm <- gsub("^_+|_+$", "", nm)
names(raw) <- nm

# Derive the modeling variables, then keep only rows with complete information
dat <- raw %>%
  mutate(
    count     = as.numeric(queensborobridge),
    exposure  = as.numeric(total),
    avgtemp   = (as.numeric(hightemp) + as.numeric(lowtemp)) / 2,
    newprecip = ifelse(as.numeric(precipitation) > 0, 1L, 0L),
    day       = factor(day,
                       levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                  "Friday", "Saturday", "Sunday"),
                       ordered = TRUE)
  ) %>%
  filter(!is.na(count), !is.na(avgtemp), !is.na(newprecip), !is.na(day))

kable(head(dat), caption="Preview of processed data")
Preview of processed data
date        day        hightemp  lowtemp  precipitation  queensborobridge  total  count  exposure  avgtemp  newprecip
2025-07-01  Saturday   84.9      72.0     0.23           3216              11867  3216   11867     78.45    1
2025-07-02  Sunday     87.1      73.0     0.00           3579              13995  3579   13995     80.05    0
2025-07-03  Monday     87.1      71.1     0.45           4230              16067  4230   16067     79.10    1
2025-07-04  Tuesday    82.9      70.0     0.00           3861              13925  3861   13925     76.45    0
2025-07-05  Wednesday  84.9      71.1     0.00           5862              23110  5862   23110     78.00    0
2025-07-06  Thursday   75.0      71.1     0.00           5251              21861  5251   21861     73.05    0

3 Exploratory Data Analysis

3.1 Rationale

Exploratory Data Analysis (EDA) provides an initial understanding of how cyclist counts behave relative to weekday patterns, temperature, and precipitation. While pairwise plots cannot establish causal or adjusted relationships, they identify broad trends, potential nonlinearity, and sources of variability that must be accounted for in the GLM framework.

3.2 Cyclist Counts by Day

Visualizing counts by weekday suggests that cyclist volume tends to be higher on weekdays than on weekends, though variability is substantial across all days. This aligns with expectations for commuter patterns, but the plot cannot show whether day-of-week differences persist after adjusting for weather and exposure.

library(ggplot2)

ggplot(dat, aes(day, count)) +
  geom_boxplot() +
  labs(title="Cyclist Counts by Day", x="Day", y="Count")

Interpretation: Weekdays show somewhat higher median counts, but variability is large. A formal model is needed.

3.3 Counts vs Temperature

Cyclist volume rises with temperature, particularly as conditions warm from cooler days into moderate temperatures. However, the large spread in counts at any given temperature indicates that temperature alone cannot explain the full variation in cycling patterns—reinforcing the need for multivariable models.

ggplot(dat, aes(avgtemp, count)) + geom_point(alpha=.5) +
geom_smooth(method="loess", se=FALSE) +
labs(title="Counts vs Temperature", x="Avg Temp (°F)", y="Count")

Interpretation: Counts trend upward on warmer days, but the spread is wide, which suggests that several predictors matter jointly.

3.4 Counts by Precipitation

Rainy days clearly show lower cyclist counts than dry days. This strong pairwise contrast suggests precipitation is a key driver of cycling behavior. Later modeling will determine whether this effect remains after accounting for weekday structure, exposure, and temperature.

ggplot(dat, aes(factor(newprecip), count)) + geom_boxplot() +
labs(title="Counts by Precipitation", x="0=Dry / 1=Wet", y="Count")

Interpretation: Rain clearly lowers counts.

4 Poisson and Quasi-Poisson Models

4.1 Rationale

Cyclist counts are non-negative integers, making Poisson regression a natural starting point. However, Poisson models assume that the conditional variance equals the conditional mean: Var(Y) = E(Y). When the variance greatly exceeds the mean, a phenomenon known as over-dispersion, standard errors are severely underestimated, which artificially inflates statistical significance. A quasi-Poisson model corrects this by introducing a dispersion parameter φ, so that Var(Y) = φ·E(Y). This model preserves the mean structure of the Poisson GLM, so the coefficient estimates are unchanged, while the standard errors are scaled by √φ and inference remains valid when dispersion is large. Additionally, including log(exposure) as an offset models the cycling rate per unit of traffic, which is essential for understanding cyclist behavior relative to total traffic flow.
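The three specifications compared in this section can be written side by side. This is the standard textbook formulation, not something specific to this dataset; the symbols $\mu_i$ (expected count on day $i$), $x_i$ (predictor vector), $E_i$ (exposure), and $\beta$ are notation introduced here for illustration.

```latex
\begin{aligned}
\text{Poisson:} \quad
  & \log \mu_i = x_i^\top \beta,
  & \operatorname{Var}(Y_i) &= \mu_i \\
\text{Quasi-Poisson:} \quad
  & \log \mu_i = x_i^\top \beta,
  & \operatorname{Var}(Y_i) &= \phi\,\mu_i \\
\text{Rate model:} \quad
  & \log \mu_i = \log E_i + x_i^\top \beta
    \;\Longleftrightarrow\;
    \frac{\mu_i}{E_i} = \exp\!\left(x_i^\top \beta\right),
  & \operatorname{Var}(Y_i) &= \phi\,\mu_i
\end{aligned}
```

Because the offset enters with a fixed coefficient of 1, exponentiated coefficients in the rate model are ratios of cyclists per unit of exposure, not ratios of raw counts.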

4.2 Fit Poisson & Quasi-Poisson Models

m_pois <- glm(count ~ day + avgtemp + newprecip,
              family=poisson(link="log"), data=dat)

m_quasi <- glm(count ~ day + avgtemp + newprecip,
               family=quasipoisson(link="log"), data=dat)

m_quasi_rate <- glm(count ~ day + avgtemp + newprecip +
                      offset(log(exposure)),
                    family=quasipoisson(link="log"),
                    data=dat %>% filter(!is.na(exposure), exposure>0))

phi_hat <- sum(residuals(m_pois, type="pearson")^2) / m_pois$df.residual
kable(data.frame(Dispersion=round(phi_hat,2)),
      caption="Estimated Over-Dispersion φ")
Estimated Over-Dispersion φ
Dispersion
109.11

The dispersion estimate from the Poisson model is φ = 109.11. This means the variance of cyclist counts is roughly 109 times larger than the Poisson model assumes for a given mean, rendering Poisson-based inference unreliable. Consequences include:

- Misleadingly small p-values

- Narrowed confidence intervals

- Overstated significance for several predictors

- Inflated Type I error rate

Because of this extreme over-dispersion, the quasi-Poisson framework is essential for drawing valid conclusions.
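The link between φ and the standard errors can be checked directly: quasi-Poisson standard errors equal the Poisson ones multiplied by √φ. With φ = 109.11 the factor is about 10.45, which matches the jump in the weekday standard errors from 0.007 to about 0.073 in the tables that follow. The sketch below demonstrates the identity on simulated over-dispersed counts; the data, seed, and variable names (`x`, `y`, `sim`) are hypothetical and unrelated to the bridge data.

```r
set.seed(321)

# Simulated over-dispersed counts (negative binomial), not the bridge data.
n   <- 200
x   <- rnorm(n)
mu  <- exp(3 + 0.3 * x)
y   <- rnbinom(n, mu = mu, size = 2)
sim <- data.frame(x = x, y = y)

fit_pois  <- glm(y ~ x, family = poisson(link = "log"),      data = sim)
fit_quasi <- glm(y ~ x, family = quasipoisson(link = "log"), data = sim)

# Pearson-based dispersion estimate, computed the same way as in this report.
phi_sim <- sum(residuals(fit_pois, type = "pearson")^2) / fit_pois$df.residual

se_pois  <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]

# Identical coefficients; standard errors inflated by sqrt(phi).
same_coef <- isTRUE(all.equal(coef(fit_pois), coef(fit_quasi)))
se_scaled <- isTRUE(all.equal(se_quasi, se_pois * sqrt(phi_sim)))
```

Both `same_coef` and `se_scaled` should be TRUE, which is why the IRR columns of the Poisson and quasi-Poisson tables agree while the test statistics shrink by the same factor.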

4.3 Poisson IRRs (for comparison)

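library(broom)  # tidy()

# Estimates and intervals are exponentiated; std.error stays on the log scale
kable(tidy(m_pois, conf.int=TRUE, exponentiate=TRUE),
      digits=3, caption="Poisson IRRs")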
Poisson IRRs
term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)  3067.571  0.042      189.020    0.000    2822.419  3333.737
day.L        0.762     0.007      -38.606    0.000    0.752     0.773
day.Q        0.888     0.007      -16.881    0.000    0.876     0.900
day.C        1.039     0.008      5.091      0.000    1.024     1.055
day^4        1.113     0.007      15.060     0.000    1.098     1.129
day^5        0.951     0.007      -6.953     0.000    0.937     0.964
day^6        0.985     0.007      -2.149     0.032    0.971     0.999
avgtemp      1.006     0.001      11.007     0.000    1.005     1.007
newprecip    0.727     0.007      -43.950    0.000    0.717     0.737

The Poisson model suggests that every term is statistically significant at the 5% level, including all polynomial weekday contrasts, temperature, and precipitation. However, because φ > 100, these results represent false precision and cannot be trusted.

These results serve only as a comparison point, demonstrating what happens when over-dispersion is ignored.

4.4 Quasi-Poisson IRRs (no offset)

kable(tidy(m_quasi, conf.int=TRUE, exponentiate=TRUE),
      digits=3, caption="Quasi-Poisson IRRs")
Quasi-Poisson IRRs
term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)  3067.571  0.444      18.095     0.000    1279.506  7287.428
day.L        0.762     0.073      -3.696     0.001    0.660     0.880
day.Q        0.888     0.074      -1.616     0.120    0.768     1.025
day.C        1.039     0.079      0.487      0.631    0.890     1.215
day^4        1.113     0.074      1.442      0.163    0.963     1.289
day^5        0.951     0.076      -0.666     0.513    0.818     1.103
day^6        0.985     0.075      -0.206     0.839    0.850     1.140
avgtemp      1.006     0.006      1.054      0.303    0.995     1.017
newprecip    0.727     0.076      -4.207     0.000    0.626     0.842

After correcting for over-dispersion:

The weekday linear contrast (day.L) remains significant, suggesting a genuine weekly pattern in cyclist counts.

The temperature effect (IRR ≈ 1.006) becomes statistically non-significant, indicating a modest upward trend that is overshadowed by variability.

newprecip IRR = 0.727 remains strongly significant, confirming that rain meaningfully reduces cyclist counts even after adjusting for weekday and temperature.

This model offers a more realistic assessment of which predictors meaningfully influence cyclist traffic.
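To make the IRRs concrete, an IRR can be converted into a percent change in the expected count via (IRR − 1) × 100, and a per-degree IRR compounds multiplicatively over larger temperature differences. The snippet below applies this to the point estimates from the table above; the helper `irr_to_pct` and the 10 °F comparison are illustrative choices, not part of the original analysis.

```r
# Hypothetical helper: percent change in expected count implied by an IRR.
irr_to_pct <- function(irr) (irr - 1) * 100

irr_precip  <- 0.727   # newprecip, quasi-Poisson count model
irr_avgtemp <- 1.006   # avgtemp, per 1 degree F

pct_rain   <- irr_to_pct(irr_precip)       # wet vs dry days: about -27.3
pct_temp10 <- irr_to_pct(irr_avgtemp^10)   # a day 10 degrees F warmer: about 6.2

round(c(rain = pct_rain, temp_10F = pct_temp10), 1)
```

The temperature figure is a point estimate only; its confidence interval (0.995 to 1.017 per °F) includes no change at all.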

4.5 Quasi-Poisson Rate Model (Offset Included)

kable(tidy(m_quasi_rate, conf.int=TRUE, exponentiate=TRUE),
      digits=3, caption="Quasi-Poisson Rate Model IRRs")
Quasi-Poisson Rate Model IRRs
term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)  0.272     0.165      -7.879     0.000    0.197     0.376
day.L        1.005     0.028      0.176      0.862    0.952     1.060
day.Q        0.949     0.028      -1.895     0.071    0.899     1.002
day.C        0.972     0.029      -0.964     0.345    0.918     1.030
day^4        0.991     0.028      -0.341     0.737    0.938     1.046
day^5        1.017     0.028      0.596      0.557    0.962     1.075
day^6        1.013     0.028      0.471      0.642    0.959     1.070
avgtemp      0.998     0.002      -0.755     0.458    0.994     1.003
newprecip    1.049     0.028      1.697      0.104    0.992     1.108

The quasi-Poisson rate model includes log(exposure) as an offset, allowing us to analyze cycling rates per unit traffic.

Key findings:

Weekday contrasts lose significance, suggesting cycling rates are more stable across the week than raw counts.

Temperature shows a near-null rate effect.

Precipitation has no significant effect on the cycling rate (IRR = 1.049, p = 0.104).

Importantly, the precipitation IRR shifts from 0.727 in the count model to slightly above 1 in the rate model, where it is statistically non-significant. This does not contradict earlier results; rather, it indicates:

Rain reduces cyclist counts roughly in proportion to the drop in overall traffic volume.

The cyclist share of total traffic may therefore remain stable on rainy days.

This nuance enhances the interpretation and is crucial for understanding behavioral patterns.
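This mechanism can be reproduced with a small synthetic example: if rain lowers total traffic by 30% and cyclists are always a fixed share of that traffic, a count model reports a strong rain effect while the rate model reports none. All numbers below (the 30% drop, the 25% share, the exposure range) are made up for illustration and are not estimates from the bridge data.

```r
# Synthetic data: alternating dry (0) and wet (1) days.
rain     <- rep(c(0, 1), times = 10)
base     <- seq(15000, 24500, length.out = 20)        # dry-weather traffic level
exposure <- round(base * ifelse(rain == 1, 0.7, 1))   # rain cuts total traffic by 30%
count    <- round(0.25 * exposure)                    # cyclists are a fixed 25% share
toy      <- data.frame(rain, exposure, count)

# Count model: rain appears to reduce cycling.
fit_count <- glm(count ~ rain, family = quasipoisson(link = "log"), data = toy)

# Rate model: with log(exposure) as an offset, the rain effect vanishes.
fit_rate <- glm(count ~ rain + offset(log(exposure)),
                family = quasipoisson(link = "log"), data = toy)

irr_count <- unname(exp(coef(fit_count)["rain"]))   # well below 1
irr_rate  <- unname(exp(coef(fit_rate)["rain"]))    # approximately 1
```

The same pattern appears in the fitted models above, where the precipitation IRR moves from 0.727 without the offset to 1.049 with it.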

To synthesize the results:

The Poisson model overstates significance across all predictors due to severe over-dispersion.

The quasi-Poisson model provides valid inference for raw counts and identifies rainfall and the linear weekday trend as meaningful drivers.

The quasi-Poisson rate model offers the most conservative inference, adjusting for traffic volume and controlling over-dispersion; under it, no predictor remains significant at the 5% level.

The rate model is therefore the final recommended model for interpretation.

4.6 Model-Based Effect Visualization: Rationale

Pairwise plots ignore confounding; predictions from the fitted model show adjusted relationships.

4.7 Predicted Counts vs Temperature

library(tidyr)  # expand_grid()

# Temperature grid spanning the 5th to 95th percentile of observed values
temp_seq <- seq(quantile(dat$avgtemp, .05), quantile(dat$avgtemp, .95), length.out=50)

# Hold day at Wednesday and exposure at its median; vary temperature and rain
newdat <- expand_grid(avgtemp=temp_seq, newprecip=c(0,1)) %>%
  mutate(day=factor("Wednesday", levels=levels(dat$day), ordered=TRUE),
         exposure=median(dat$exposure, na.rm=TRUE))

newdat$pred <- predict(m_quasi_rate, newdata=newdat, type="response")

ggplot(newdat, aes(avgtemp, pred, color=factor(newprecip))) +
  geom_line(linewidth=1) +  # `size` is deprecated for lines since ggplot2 3.4.0
  labs(title="Predicted Counts vs Temperature",
       x="Avg Temp", y="Predicted Count",
       color="newprecip")

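Interpretation: With day and exposure held fixed, the rate model predicts nearly identical counts for wet and dry days (precipitation IRR = 1.049, p = 0.104) and an almost flat temperature trend (IRR = 0.998 per °F). The strong rain effect seen in raw counts is absorbed by the exposure offset.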

5 Results and Conclusions

5.1 Key Findings

- Cyclist counts are highly over-dispersed; Poisson inference is invalid without correction.

- The quasi-Poisson GLM provides reliable inference and shows:

  - Rain significantly decreases daily cyclist counts.

  - Weekday patterns persist after correcting for over-dispersion.

  - Temperature effects are mild and statistically uncertain.

- When adjusting for exposure, precipitation no longer significantly predicts cycling rate, highlighting the advantage of interpreting rates rather than raw counts.

5.2 Final Model Recommendation

The quasi-Poisson rate model with log(exposure) is the best model because it:

  • Corrects over-dispersion
  • Adjusts for exposure differences
  • Produces interpretable IRRs

5.3 Practical Implications

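  • Rain → major drop in cyclists (about 27% fewer in the count model, IRR = 0.727) → infrastructure should support wet-weather riding.
  • Temperature has at most a weak association with counts (IRR ≈ 1.006 per °F, not significant after correcting for over-dispersion).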
  • Weekday cycling demand is patterned → important for planning.

6 Limitations and Future Work

  • Lacks wind, holiday/event indicators, autocorrelation modeling.
  • Polynomial day contrasts obscure individual-day comparisons.
  • Future analyses should try negative binomial, time-series GLMs, and interactions.
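As a starting point for the negative binomial suggestion, such a model can be fitted with `glm.nb()` from the MASS package, which estimates an extra parameter θ so that Var(Y) = μ + μ²/θ. The sketch below uses simulated data; the variable names and parameter values are hypothetical, and a formula for the bridge data would need to mirror the models above.

```r
library(MASS)  # glm.nb()

set.seed(321)

# Simulated over-dispersed counts with a binary rain indicator (not the bridge data).
n    <- 300
rain <- rbinom(n, size = 1, prob = 0.3)
mu   <- exp(6 - 0.3 * rain)
y    <- rnbinom(n, mu = mu, size = 5)   # true theta = 5
sim  <- data.frame(y = y, rain = rain)

fit_nb <- glm.nb(y ~ rain, data = sim)

theta_hat <- fit_nb$theta               # estimated theta
irr_nb    <- exp(coef(fit_nb))          # IRRs, read the same way as in the Poisson case
```

Loading MASS after dplyr masks `select()`, so call `dplyr::select()` explicitly if both packages are attached.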
