This project extends the previous Poisson report investigating NYC cyclist counts by addressing over-dispersion. The objective is to develop a quasi-Poisson model that produces reliable inference while explaining how day of the week, average temperature, and precipitation influence cyclist activity on the Queensboro bridge. The main questions being investigated are does daily cyclist volume depend significantly on weekday and weather conditions, and whether or not correcting for over-dispersion changes the conclusions drawn from the standard Poisson model.
The dataset contains daily observations of the day, high temperature, low temperature, precipitation, cyclist count across the bridge, and total traffic exposure. New variable average temperature was creating by averaging the high and low temp.The outcome variable is count = queensborobridge, representing the total cyclist count per day.
Two models were fit to the counts of cyclists: A regular Poisson GLM, and a Quasi-Poisson GLM. The estimated dispersion parameter phi was calculated from Pearson residuals. If phi is close to 1, Poisson suffices; if phi > 1, quasi-Poisson is more effective. Coefficients were exponentiation to IRRs, and visualization plots examined how counts vary with Day, average temperature, and new precipitation.
Estimated dispersion parameter phi = 109.12, meaning that the variance is over 100 times the mean, a clear sign of strong over-dispersion. Average temperature yielded a 0.6% increase per degree, new precipitation yielded a 27% decrease on rainy days, and week pattern loses significance after the quassi-Poisson implementation.
#Model Interpretation Rain is the only statistically robust predictor after applying the quassi-Poisson model to handle over-dispersion. Warmer days show slightly higher counts, but the variability is too large to claim formal significance. Weekday differences visible in raw data disappear.
Because Phi > 1, the quassi-Poisson Model yields better results. This is because the model preserves the mean fit but corrects the standard errors, yielding appropriate inference. Thus, the quasi-Poisson model is the final working model for this set of data. # General Discussion
This study shows how over-dispersion inflates apparent significance in count models. Using the quassi-Poisson model, we can apply robust methods without altering predicted means. The model confirms that precipitation strongly reduces cyclist count on the bridge, and temperature and weekday are not statistically reliable once daily variability is considered.
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(broom)
library(knitr)
data_path <- "C:/Users/rg03/Downloads/sta321/PoissonData.xlsx"
raw <- read_excel(data_path)
nm <- names(raw)
nm <- tolower(nm)
nm <- gsub("[[:space:]]+", "_", nm)
nm <- gsub("[^a-z0-9_]", "", nm)
nm <- gsub("_+", "_", nm)
nm <- gsub("^_+|_+$", "", nm)
names(raw) <- nm
dat <- raw %>%
mutate(
count = as.numeric(queensborobridge),
exposure = as.numeric(total),
avgtemp = (as.numeric(hightemp) + as.numeric(lowtemp)) / 2,
newprecip = ifelse(as.numeric(precipitation) > 0, 1, 0),
day = factor(
day,
levels = c("Monday","Tuesday","Wednesday","Thursday",
"Friday","Saturday","Sunday"),
ordered = TRUE
)
) %>%
filter(!is.na(count), !is.na(avgtemp), !is.na(newprecip), !is.na(day))
kable(head(dat), caption = "Preview of transformed data")
| date | day | hightemp | lowtemp | precipitation | queensborobridge | total | count | exposure | avgtemp | newprecip |
|---|---|---|---|---|---|---|---|---|---|---|
| 2025-07-01 | Saturday | 84.9 | 72.0 | 0.23 | 3216 | 11867 | 3216 | 11867 | 78.45 | 1 |
| 2025-07-02 | Sunday | 87.1 | 73.0 | 0.00 | 3579 | 13995 | 3579 | 13995 | 80.05 | 0 |
| 2025-07-03 | Monday | 87.1 | 71.1 | 0.45 | 4230 | 16067 | 4230 | 16067 | 79.10 | 1 |
| 2025-07-04 | Tuesday | 82.9 | 70.0 | 0.00 | 3861 | 13925 | 3861 | 13925 | 76.45 | 0 |
| 2025-07-05 | Wednesday | 84.9 | 71.1 | 0.00 | 5862 | 23110 | 5862 | 23110 | 78.00 | 0 |
| 2025-07-06 | Thursday | 75.0 | 71.1 | 0.00 | 5251 | 21861 | 5251 | 21861 | 73.05 | 0 |
m_pois <- glm(count ~ day + avgtemp + newprecip,
family = poisson(link = "log"),
data = dat)
m_quasi <- glm(count ~ day + avgtemp + newprecip,
family = quasipoisson(link = "log"),
data = dat)
m_quasi_rate <- glm(count ~ day + avgtemp + newprecip + offset(log(exposure)),
family = quasipoisson(link = "log"),
data = dat %>% filter(!is.na(exposure), exposure > 0))
phi_hat <- sum(residuals(m_pois, type = "pearson")^2) / m_pois$df.residual
pois_tab <- tidy(m_pois, conf.int = TRUE, exponentiate = TRUE)
quasi_tab <- tidy(m_quasi, conf.int = TRUE, exponentiate = TRUE)
quasi_rate <- tidy(m_quasi_rate, conf.int = TRUE, exponentiate = TRUE)
kable(pois_tab, digits = 3, caption = "Poisson model (IRRs)")
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 3067.571 | 0.042 | 189.020 | 0.000 | 2822.419 | 3333.737 |
| day.L | 0.762 | 0.007 | -38.606 | 0.000 | 0.752 | 0.773 |
| day.Q | 0.888 | 0.007 | -16.881 | 0.000 | 0.876 | 0.900 |
| day.C | 1.039 | 0.008 | 5.091 | 0.000 | 1.024 | 1.055 |
| day^4 | 1.113 | 0.007 | 15.060 | 0.000 | 1.098 | 1.129 |
| day^5 | 0.951 | 0.007 | -6.953 | 0.000 | 0.937 | 0.964 |
| day^6 | 0.985 | 0.007 | -2.149 | 0.032 | 0.971 | 0.999 |
| avgtemp | 1.006 | 0.001 | 11.007 | 0.000 | 1.005 | 1.007 |
| newprecip | 0.727 | 0.007 | -43.950 | 0.000 | 0.717 | 0.737 |
kable(quasi_tab, digits = 3, caption = "Quasi-Poisson model (IRRs)")
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 3067.571 | 0.444 | 18.095 | 0.000 | 1279.506 | 7287.428 |
| day.L | 0.762 | 0.073 | -3.696 | 0.001 | 0.660 | 0.880 |
| day.Q | 0.888 | 0.074 | -1.616 | 0.120 | 0.768 | 1.025 |
| day.C | 1.039 | 0.079 | 0.487 | 0.631 | 0.890 | 1.215 |
| day^4 | 1.113 | 0.074 | 1.442 | 0.163 | 0.963 | 1.289 |
| day^5 | 0.951 | 0.076 | -0.666 | 0.513 | 0.818 | 1.103 |
| day^6 | 0.985 | 0.075 | -0.206 | 0.839 | 0.850 | 1.140 |
| avgtemp | 1.006 | 0.006 | 1.054 | 0.303 | 0.995 | 1.017 |
| newprecip | 0.727 | 0.076 | -4.207 | 0.000 | 0.626 | 0.842 |
kable(data.frame(Dispersion_Estimate = round(phi_hat, 3)),
caption = "Estimated dispersion parameter ϕ̂")
| Dispersion_Estimate |
|---|
| 109.115 |
if (!is.null(m_quasi_rate) && inherits(m_quasi_rate, "glm")) {
kable(quasi_rate, digits = 3, caption = "Quasi-Poisson rate model (IRRs) with offset log(Total)")
}
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.272 | 0.165 | -7.879 | 0.000 | 0.197 | 0.376 |
| day.L | 1.005 | 0.028 | 0.176 | 0.862 | 0.952 | 1.060 |
| day.Q | 0.949 | 0.028 | -1.895 | 0.071 | 0.899 | 1.002 |
| day.C | 0.972 | 0.029 | -0.964 | 0.345 | 0.918 | 1.030 |
| day^4 | 0.991 | 0.028 | -0.341 | 0.737 | 0.938 | 1.046 |
| day^5 | 1.017 | 0.028 | 0.596 | 0.557 | 0.962 | 1.075 |
| day^6 | 1.013 | 0.028 | 0.471 | 0.642 | 0.959 | 1.070 |
| avgtemp | 0.998 | 0.002 | -0.755 | 0.458 | 0.994 | 1.003 |
| newprecip | 1.049 | 0.028 | 1.697 | 0.104 | 0.992 | 1.108 |
p1 <- ggplot(dat, aes(day, count)) +
geom_boxplot() +
labs(title = "Cyclist Counts by Day", y = "Count", x = "Day")
p2 <- ggplot(dat, aes(avgtemp, count)) +
geom_point() +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Counts vs Average Temperature",
y = "Count", x = "Average Temperature (°F)")
p3 <- ggplot(dat, aes(factor(newprecip), count)) +
geom_boxplot() +
labs(title = "Cyclist Counts by Precipitation",
x = "NewPrecip (0 = Dry, 1 = Wet)", y = "Count")
print(p1)
print(p2)
## `geom_smooth()` using formula = 'y ~ x'
print(p3)