Introduction

This project extends the previous Poisson report investigating NYC cyclist counts by addressing over-dispersion. The objective is to develop a quasi-Poisson model that produces reliable inference while explaining how day of the week, average temperature, and precipitation influence cyclist activity on the Queensboro bridge. The main questions being investigated are does daily cyclist volume depend significantly on weekday and weather conditions, and whether or not correcting for over-dispersion changes the conclusions drawn from the standard Poisson model.

Materials

The dataset contains daily observations of the day, high temperature, low temperature, precipitation, cyclist count across the bridge, and total traffic exposure. New variable average temperature was creating by averaging the high and low temp.The outcome variable is count = queensborobridge, representing the total cyclist count per day.

Methodology and Analysis

Two models were fit to the counts of cyclists: A regular Poisson GLM, and a Quasi-Poisson GLM. The estimated dispersion parameter phi was calculated from Pearson residuals. If phi is close to 1, Poisson suffices; if phi > 1, quasi-Poisson is more effective. Coefficients were exponentiation to IRRs, and visualization plots examined how counts vary with Day, average temperature, and new precipitation.

Results and Conclusions

Model Diagnostics

Estimated dispersion parameter phi = 109.12, meaning that the variance is over 100 times the mean, a clear sign of strong over-dispersion. Average temperature yielded a 0.6% increase per degree, new precipitation yielded a 27% decrease on rainy days, and week pattern loses significance after the quassi-Poisson implementation.

#Model Interpretation Rain is the only statistically robust predictor after applying the quassi-Poisson model to handle over-dispersion. Warmer days show slightly higher counts, but the variability is too large to claim formal significance. Weekday differences visible in raw data disappear.

Statistical Conclusion

Because Phi > 1, the quassi-Poisson Model yields better results. This is because the model preserves the mean fit but corrects the standard errors, yielding appropriate inference. Thus, the quasi-Poisson model is the final working model for this set of data. # General Discussion

This study shows how over-dispersion inflates apparent significance in count models. Using the quassi-Poisson model, we can apply robust methods without altering predicted means. The model confirms that precipitation strongly reduces cyclist count on the bridge, and temperature and weekday are not statistically reliable once daily variability is considered.

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

library(readxl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(broom)
library(knitr)



data_path <- "C:/Users/rg03/Downloads/sta321/PoissonData.xlsx"
raw <- read_excel(data_path)


nm <- names(raw)
nm <- tolower(nm)
nm <- gsub("[[:space:]]+", "_", nm)     
nm <- gsub("[^a-z0-9_]", "", nm)        
nm <- gsub("_+", "_", nm)               
nm <- gsub("^_+|_+$", "", nm)          
names(raw) <- nm

dat <- raw %>%
  mutate(

    count    = as.numeric(queensborobridge),
    exposure = as.numeric(total),

    avgtemp   = (as.numeric(hightemp) + as.numeric(lowtemp)) / 2,
    newprecip = ifelse(as.numeric(precipitation) > 0, 1, 0),

    
    day = factor(
      day,
      levels = c("Monday","Tuesday","Wednesday","Thursday",
                 "Friday","Saturday","Sunday"),
      ordered = TRUE
    )
  ) %>%
  filter(!is.na(count), !is.na(avgtemp), !is.na(newprecip), !is.na(day))


kable(head(dat), caption = "Preview of transformed data")
Preview of transformed data
date day hightemp lowtemp precipitation queensborobridge total count exposure avgtemp newprecip
2025-07-01 Saturday 84.9 72.0 0.23 3216 11867 3216 11867 78.45 1
2025-07-02 Sunday 87.1 73.0 0.00 3579 13995 3579 13995 80.05 0
2025-07-03 Monday 87.1 71.1 0.45 4230 16067 4230 16067 79.10 1
2025-07-04 Tuesday 82.9 70.0 0.00 3861 13925 3861 13925 76.45 0
2025-07-05 Wednesday 84.9 71.1 0.00 5862 23110 5862 23110 78.00 0
2025-07-06 Thursday 75.0 71.1 0.00 5251 21861 5251 21861 73.05 0
m_pois <- glm(count ~ day + avgtemp + newprecip,
              family = poisson(link = "log"),
              data = dat)


m_quasi <- glm(count ~ day + avgtemp + newprecip,
               family = quasipoisson(link = "log"),
               data = dat)


m_quasi_rate <- glm(count ~ day + avgtemp + newprecip + offset(log(exposure)),
                    family = quasipoisson(link = "log"),
                    data = dat %>% filter(!is.na(exposure), exposure > 0))


phi_hat <- sum(residuals(m_pois, type = "pearson")^2) / m_pois$df.residual



pois_tab   <- tidy(m_pois, conf.int = TRUE, exponentiate = TRUE)
quasi_tab  <- tidy(m_quasi, conf.int = TRUE, exponentiate = TRUE)
quasi_rate <- tidy(m_quasi_rate, conf.int = TRUE, exponentiate = TRUE)

kable(pois_tab,  digits = 3, caption = "Poisson model (IRRs)")
Poisson model (IRRs)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 3067.571 0.042 189.020 0.000 2822.419 3333.737
day.L 0.762 0.007 -38.606 0.000 0.752 0.773
day.Q 0.888 0.007 -16.881 0.000 0.876 0.900
day.C 1.039 0.008 5.091 0.000 1.024 1.055
day^4 1.113 0.007 15.060 0.000 1.098 1.129
day^5 0.951 0.007 -6.953 0.000 0.937 0.964
day^6 0.985 0.007 -2.149 0.032 0.971 0.999
avgtemp 1.006 0.001 11.007 0.000 1.005 1.007
newprecip 0.727 0.007 -43.950 0.000 0.717 0.737
kable(quasi_tab, digits = 3, caption = "Quasi-Poisson model (IRRs)")
Quasi-Poisson model (IRRs)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 3067.571 0.444 18.095 0.000 1279.506 7287.428
day.L 0.762 0.073 -3.696 0.001 0.660 0.880
day.Q 0.888 0.074 -1.616 0.120 0.768 1.025
day.C 1.039 0.079 0.487 0.631 0.890 1.215
day^4 1.113 0.074 1.442 0.163 0.963 1.289
day^5 0.951 0.076 -0.666 0.513 0.818 1.103
day^6 0.985 0.075 -0.206 0.839 0.850 1.140
avgtemp 1.006 0.006 1.054 0.303 0.995 1.017
newprecip 0.727 0.076 -4.207 0.000 0.626 0.842
kable(data.frame(Dispersion_Estimate = round(phi_hat, 3)),
      caption = "Estimated dispersion parameter ϕ̂")
Estimated dispersion parameter ϕ̂
Dispersion_Estimate
109.115
if (!is.null(m_quasi_rate) && inherits(m_quasi_rate, "glm")) {
  kable(quasi_rate, digits = 3, caption = "Quasi-Poisson rate model (IRRs) with offset log(Total)")
}
Quasi-Poisson rate model (IRRs) with offset log(Total)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.272 0.165 -7.879 0.000 0.197 0.376
day.L 1.005 0.028 0.176 0.862 0.952 1.060
day.Q 0.949 0.028 -1.895 0.071 0.899 1.002
day.C 0.972 0.029 -0.964 0.345 0.918 1.030
day^4 0.991 0.028 -0.341 0.737 0.938 1.046
day^5 1.017 0.028 0.596 0.557 0.962 1.075
day^6 1.013 0.028 0.471 0.642 0.959 1.070
avgtemp 0.998 0.002 -0.755 0.458 0.994 1.003
newprecip 1.049 0.028 1.697 0.104 0.992 1.108
p1 <- ggplot(dat, aes(day, count)) +
  geom_boxplot() +
  labs(title = "Cyclist Counts by Day", y = "Count", x = "Day")


p2 <- ggplot(dat, aes(avgtemp, count)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Counts vs Average Temperature",
       y = "Count", x = "Average Temperature (°F)")


p3 <- ggplot(dat, aes(factor(newprecip), count)) +
  geom_boxplot() +
  labs(title = "Cyclist Counts by Precipitation",
       x = "NewPrecip (0 = Dry, 1 = Wet)", y = "Count")

print(p1)

print(p2)
## `geom_smooth()` using formula = 'y ~ x'

print(p3)