load_packages <- function(pkg_list) {
for (pkg in pkg_list) {
if (!require(pkg, character.only = TRUE)) {
install.packages(pkg, dependencies = TRUE)
library(pkg, character.only = TRUE)
}
}
}
packages <- c("tidyverse", "readxl", "pander")
load_packages(packages)
knitr::opts_chunk$set(echo = TRUE,
warning = FALSE,
message = FALSE,
results = TRUE,
comment = NA,
fig.align = "center"
)
url <- "https://raw.githubusercontent.com/ncbrechbill/STA321/refs/heads/main/STA321/BrooklynBridge.csv"
data <- read.csv(url)
The Brooklyn Bridge is a suspension bridge that connects the boroughs of Manhattan and Brooklyn that was first opened in 1883. It has become an icon and major tourist attraction of New York City.
Bicycles are one form of transportation across the Brooklyn Bridge. We want to analyze how the counts of bicyclists across this bridge change based on various factors on a given day.
Count data was collected daily on bicyclists entering and exiting the bridge. The Traffic Information Management System (TIMS) collected the count data and stored each day 24-hour period for this bridge, as well as all East River bridges.
The collected information was:
Date: The calendar date.
Day: The day of the week (Monday, Tuesday, etc.)
HighTemp: The recorded high temperature, in Fahrenheit.
LowTemp: The recorded low temperature, in Fahrenheit.
Precipitation: The recorded precipitation, in inches.
BrookylnBridge: The total count of bicyclists across the Brooklyn Bridge.
Total: The total count of bicyclists across all East River bridges.
It is suspected that the variables HighTemp and LowTemp will be correlated, and thus present issues of potential colinearity in further analysis.
plot(data$HighTemp, data$LowTemp)
# Basic scatterplot with labels
plot(data$HighTemp, data$LowTemp,
xlab = "High Temperature (°F)",
ylab = "Low Temperature (°F)",
main = "High vs Low Temperature",
pch = 19, col = "steelblue")
# Add regression line
fit <- lm(LowTemp ~ HighTemp, data = data)
abline(fit, col = "darkred", lwd = 2)
# Add correlation coefficient
cor_val <- cor(data$HighTemp, data$LowTemp, use = "complete.obs")
legend("topleft", legend = paste("r =", round(cor_val, 2)),
bty = "n", text.col = "black")
As suspected, the high and low temperature of a given day is correlated. More specifically, a greater high temperature explains a greater low temperature. These variables will be transformed into their average to retain this trend and account for both recorded temperatures.
data$AvgTemp <- (data$HighTemp + data$LowTemp) / 2
hist(data$AvgTemp,
xlab = "Average Temperature (°F)",
main = "Average Temperature Distribution",
col = "steelblue")
This transformation results in a normally distributed numerical variable. AvgTemp will be used in any analyses moving forward.
In its current form, the precipitation variable is highly skewed. Leaving this variable in its numerical form reduces the power of regression.
hist(data$Precipitation,
breaks = 10,
xlab = "Precipitation (Inches)",
main = "Precipitation Distribution",
col = "steelblue")
Precipitation will be transformed into a factor, with possibilities “Yes”, and “No”. “Yes” will be assigned to all days with greater than 0 recorded precipitation, and “No” to days with exactly 0 recorded precipitation.
data <- data %>%
mutate(data, PrecipFactor = ifelse(Precipitation > 0, "Yes", "No"))
This creates a more useful variable for future analysis, where a day is assigned to whether any precipitation was recorded or not. This should be more robust, as the presence of any precipitation should impact whether people decide to take a bicycle or not.
The first Poisson regression model is constructed on purely the count of bicyclists that cross the Brooklyn Bridge in a 24-hour period.
model.freq <- glm(BrooklynBridge ~ Day + AvgTemp + PrecipFactor,
family = poisson(link = "log"),
data = data)
pander(summary(model.freq))
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 7.031 | 0.05718 | 123 | 0 |
| DayMonday | 0.3156 | 0.01378 | 22.9 | 4.414e-116 |
| DaySaturday | 0.09547 | 0.01434 | 6.658 | 2.781e-11 |
| DaySunday | 0.2118 | 0.01404 | 15.09 | 1.917e-51 |
| DayThursday | 0.2535 | 0.01445 | 17.54 | 6.907e-69 |
| DayTuesday | 0.2044 | 0.01476 | 13.85 | 1.19e-43 |
| DayWednesday | 0.2634 | 0.01456 | 18.09 | 4.067e-73 |
| AvgTemp | 0.01001 | 0.0007239 | 13.82 | 1.868e-43 |
| PrecipFactorYes | -0.3632 | 0.00939 | -38.68 | 0 |
(Dispersion parameter for poisson family taken to be 1 )
| Null deviance: | 5319 on 30 degrees of freedom |
| Residual deviance: | 1988 on 22 degrees of freedom |
All predictor variables are statistically significant. Of the categorical variables, “DayFriday” and “PrecipFactorNo” were considered baseline. Thus, the categorical analysis of their variables is compared to these baseline levels.
In poisson regression, the coefficients represent a the natural log of the response of bicycle count. So the coefficients must be transformed to analyze the modeled count for a given set of predictors. For example, a clear monday with average temperature of 80 degrees is given by: \[e^(7.031+0.3156+80*0.01001) = 3454.38598\]
The previous looks at only the traffic along the Brooklyn Bridge. We can also use the total traffic across all East River bridges for comparison of the rate of bikers that cross the Brooklyn Bridge along the same predictor variables.
model.rate <- glm(BrooklynBridge ~ Day + AvgTemp + PrecipFactor,
offset(Total),
family = poisson(link = "log"),
data = data)
pander(summary(model.rate))
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 7.533 | 0.0004172 | 18057 | 0 |
| DayMonday | 0.2539 | 9.989e-05 | 2541 | 0 |
| DaySaturday | 0.02116 | 0.0001109 | 190.7 | 0 |
| DaySunday | 0.1426 | 0.000106 | 1344 | 0 |
| DayThursday | 0.1933 | 0.0001039 | 1860 | 0 |
| DayTuesday | 0.1624 | 0.0001052 | 1544 | 0 |
| DayWednesday | 0.2176 | 0.0001016 | 2142 | 0 |
| AvgTemp | 0.004368 | 5.202e-06 | 839.6 | 0 |
| PrecipFactorYes | -0.3335 | 7.426e-05 | -4491 | 0 |
(Dispersion parameter for poisson family taken to be 1 )
| Null deviance: | 73281861 on 30 degrees of freedom |
| Residual deviance: | 31815310 on 22 degrees of freedom |
We can see similar results in significance with the newly calculated coefficients. “DayFriday” and “PrecipFactorYes” were again selected as baselines.
The factors can be compared on a simple line chart. Such charts are provided below. The X axis is built upon the Average Temperature, and the Y axis is built upon the projected volume of bikers. The different days of the week are represented as different colored lines. The two charts adds a dimension by differentiating precipitation or no precipitation.
#Store Coefs
rate.coef = model.rate$coefficients
NoMon = exp(rate.coef[1] + rate.coef[2] + (rate.coef[8]*data$AvgTemp))
NoTue = exp(rate.coef[1] + rate.coef[6] + (rate.coef[8]*data$AvgTemp))
NoWed = exp(rate.coef[1] + rate.coef[7] + (rate.coef[8]*data$AvgTemp))
NoThu = exp(rate.coef[1] + rate.coef[5] + (rate.coef[8]*data$AvgTemp))
NoFri = exp(rate.coef[1] + (rate.coef[8]*data$AvgTemp))
NoSat = exp(rate.coef[1] + rate.coef[3] + (rate.coef[8]*data$AvgTemp))
NoSun = exp(rate.coef[1] + rate.coef[4] + (rate.coef[8]*data$AvgTemp))
YesMon = exp(rate.coef[1] + rate.coef[9] + rate.coef[2] + (rate.coef[8]*data$AvgTemp))
YesTue = exp(rate.coef[1] + rate.coef[9] + rate.coef[6] + (rate.coef[8]*data$AvgTemp))
YesWed = exp(rate.coef[1] + rate.coef[9] + rate.coef[7] + (rate.coef[8]*data$AvgTemp))
YesThu = exp(rate.coef[1] + rate.coef[9] + rate.coef[5] + (rate.coef[8]*data$AvgTemp))
YesFri = exp(rate.coef[1] + rate.coef[9] + (rate.coef[8]*data$AvgTemp))
YesSat = exp(rate.coef[1] + rate.coef[9] + rate.coef[3] + (rate.coef[8]*data$AvgTemp))
YesSun = exp(rate.coef[1] + rate.coef[9] + rate.coef[4] + (rate.coef[8]*data$AvgTemp))
plot_data_no <- data.frame(
AvgTemp = data$AvgTemp,
Monday = NoMon,
Tuesday = NoTue,
Wednesday = NoWed,
Thursday = NoThu,
Friday = NoFri,
Saturday = NoSat,
Sunday = NoSun
)
plot_data_long.no <- pivot_longer(
plot_data_no,
cols = -AvgTemp,
names_to = "Day",
values_to = "ModeledCount"
)
plot_data_yes <- data.frame(
AvgTemp = data$AvgTemp,
Monday = YesMon,
Tuesday = YesTue,
Wednesday = YesWed,
Thursday = YesThu,
Friday = YesFri,
Saturday = YesSat,
Sunday = YesSun
)
plot_data_long.yes <- pivot_longer(
plot_data_yes,
cols = -AvgTemp,
names_to = "Day",
values_to = "ModeledCount"
)
# Combine all modeled values
all_counts <- c(NoMon, NoTue, NoWed, NoThu, NoFri, NoSat, NoSun,
YesMon, YesTue, YesWed, YesThu, YesFri, YesSat, YesSun)
# Get min and max for consistent y-axis
ymin <- min(all_counts)
ymax <- max(all_counts)
# No precipitation plot
ggplot(plot_data_long.no, aes(x = AvgTemp, y = ModeledCount, color = Day)) +
geom_line(linewidth = 1) +
labs(title = "Brooklyn Bridge Bikers (No Precipitation)",
x = "Average Temperature (°F)",
y = "Modeled Biker Count") +
scale_y_continuous(limits = c(ymin, ymax)) +
theme_minimal()
# Yes precipitation plot
ggplot(plot_data_long.yes, aes(x = AvgTemp, y = ModeledCount, color = Day)) +
geom_line(linewidth = 1) +
labs(title = "Brooklyn Bridge Bikers (With Precipitation)",
x = "Average Temperature (°F)",
y = "Modeled Biker Count") +
scale_y_continuous(limits = c(ymin, ymax)) +
theme_minimal()
It’s important to check required assumptions for poisson regression. The most important being the namesake poisson distribution. The key for this distribution is equal mean and variance.
mean_val <- mean(data$BrooklynBridge, na.rm = TRUE)
var_val <- var(data$BrooklynBridge, na.rm = TRUE)
summary_df <- data.frame(
Statistic = c("Mean", "Variance"),
Value = c(mean_val, var_val)
)
pander(summary_df, caption = "Summary Statistics for BrooklynBridge")
| Statistic | Value |
|---|---|
| Mean | 2756 |
| Variance | 434761 |
It can be seen that the values are not close to equal. Thus, the models built so far may not be appropriate for the dataset’s distribution. Further analysis is needed under Quasi-Poisson regression to appropriately model the travel over the bridge.
We adjust for the highly dispersed variance with quasi-poisson regression. This means that the poisson regression is changed with a defined dispersion parameter from the data set. This means that we can keep our process relatively the same despite the lack of a poisson regression.
The dispersion parameter in this data set was determined to be 87.72. This greatly justifies the quasi-poisson approach as this data is very dispersed, as seen in the poisson discussion. We find much different results in this regression to standard poisson regression. Precipitation is found to be statistically significant, and in relation to the baseline of Friday, the day Monday was significant at a standard acceptance level.
model.quasi.freq <- glm(BrooklynBridge ~ AvgTemp + PrecipFactor + Day,
family = quasipoisson,
data = data)
pander(summary(model.quasi.freq))
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 7.031 | 0.5356 | 13.13 | 6.947e-12 |
| AvgTemp | 0.01001 | 0.006781 | 1.476 | 0.1542 |
| PrecipFactorYes | -0.3632 | 0.08795 | -4.13 | 0.0004391 |
| DayMonday | 0.3156 | 0.1291 | 2.445 | 0.02294 |
| DaySaturday | 0.09547 | 0.1343 | 0.7108 | 0.4847 |
| DaySunday | 0.2118 | 0.1315 | 1.611 | 0.1214 |
| DayThursday | 0.2535 | 0.1354 | 1.873 | 0.07444 |
| DayTuesday | 0.2044 | 0.1382 | 1.479 | 0.1533 |
| DayWednesday | 0.2634 | 0.1364 | 1.931 | 0.06646 |
(Dispersion parameter for quasipoisson family taken to be 87.72455 )
| Null deviance: | 5319 on 30 degrees of freedom |
| Residual deviance: | 1988 on 22 degrees of freedom |
Again, line plots can be created to display the difference in each modeled level. Similar to the previous graphs, the scale is kept consistent, and the day factors are separated by color.
qcoefs <- model.quasi.freq$coefficients
qnomon <- exp(qcoefs[1] + (qcoefs[2]*data$AvgTemp) + qcoefs[4])
qnotue <- exp(qcoefs[1] + (qcoefs[2]*data$AvgTemp) + qcoefs[8])
qnowed <- exp(qcoefs[1] + (qcoefs[2]*data$AvgTemp) + qcoefs[9])
qnothu <- exp(qcoefs[1] + (qcoefs[2]*data$AvgTemp) + qcoefs[7])
qnofri <- exp(qcoefs[1] + (qcoefs[2]*data$AvgTemp))
qnosat <- exp(qcoefs[1] + (qcoefs[2]*data$AvgTemp) + qcoefs[5])
qnosun <- exp(qcoefs[1] + (qcoefs[2]*data$AvgTemp) + qcoefs[6])
qyesmon <- exp(qcoefs[1] + (qcoefs[2]*data$AvgTemp) + qcoefs[3] + qcoefs[4])
qyestue <- exp(qcoefs[1] + (qcoefs[2]*data$AvgTemp) + qcoefs[3] + qcoefs[8])
qyeswed <- exp(qcoefs[1] + (qcoefs[2]*data$AvgTemp) + qcoefs[3] + qcoefs[9])
qyesthu <- exp(qcoefs[1] + (qcoefs[2]*data$AvgTemp) + qcoefs[3] + qcoefs[7])
qyesfri <- exp(qcoefs[1] + (qcoefs[2]*data$AvgTemp) + qcoefs[3])
qyessat <- exp(qcoefs[1] + (qcoefs[2]*data$AvgTemp) + qcoefs[3] + qcoefs[5])
qyessun <- exp(qcoefs[1] + (qcoefs[2]*data$AvgTemp) + qcoefs[3] + qcoefs[6])
plot_quasi_no <- data.frame(
AvgTemp = data$AvgTemp,
Monday = qnomon,
Tuesday = qnotue,
Wednesday = qnowed,
Thursday = qnothu,
Friday = qnofri,
Saturday = qnosat,
Sunday = qnosun
)
plot_quasi_long.no <- pivot_longer(
plot_quasi_no,
cols = -AvgTemp,
names_to = "Day",
values_to = "ModeledCount"
)
plot_quasi_yes <- data.frame(
AvgTemp = data$AvgTemp,
Monday = qyesmon,
Tuesday = qyestue,
Wednesday = qyeswed,
Thursday = qyesthu,
Friday = qyesfri,
Saturday = qyessat,
Sunday = qyessun
)
plot_quasi_long.yes <- pivot_longer(
plot_quasi_yes,
cols = -AvgTemp,
names_to = "Day",
values_to = "ModeledCount"
)
# Combine all modeled values
quasi_all_counts <- c(qnomon, qnotue, qnowed, qnothu, qnofri, qnosat, qnosun,
qyesmon, qyestue, qyeswed, qyesthu, qyesfri, qyessat, qyessun)
# Get min and max for consistent y-axis
ymin <- min(quasi_all_counts)
ymax <- max(quasi_all_counts)
# No precipitation plot
ggplot(plot_quasi_long.no, aes(x = AvgTemp, y = ModeledCount, color = Day)) +
geom_line(linewidth = 1) +
labs(title = "Brooklyn Bridge Bikers (No Precipitation)",
x = "Average Temperature (°F)",
y = "Modeled Biker Count") +
scale_y_continuous(limits = c(ymin, ymax)) +
theme_minimal()
# Yes precipitation plot
ggplot(plot_quasi_long.yes, aes(x = AvgTemp, y = ModeledCount, color = Day)) +
geom_line(linewidth = 1) +
labs(title = "Brooklyn Bridge Bikers (With Precipitation)",
x = "Average Temperature (°F)",
y = "Modeled Biker Count") +
scale_y_continuous(limits = c(ymin, ymax)) +
theme_minimal()
It should be noticed that the coefficients remain the same across both methods. However, the significance greatly changes when adustments are made by the dispersion parameter. It’s important to note that should the dispersion parameter equal 1, the two methods are identical. Due to the high dispersion parameter in this example, it is highly recommended for the quasipoisson model to be used in any future analysis. This reduces the significance of many predictors, but more accurately fits the model to the distribution of the response variable.