1 Introduction

1.1 Description

The data are daily bike counts recorded on the Manhattan Bridge over one month (31 daily records). The Traffic Information Management System (TIMS) collects the counts of cyclists entering and leaving the bridge. Each record represents the total number of cyclists in a 24-hour period on the Manhattan Bridge.

Date - The date on which the total number of bikes was recorded; it serves as our observation ID.
Day - The day of the week, Monday through Sunday.
HighTemp - The highest temperature on a given day.
LowTemp - The lowest temperature on a given day.
Precipitation - The amount of rain recorded for that day, most likely in inches.
ManhattanBridge - The total number of cyclists on the bridge on a given day.
Total - The total number of bikes recorded on all the bridges on a given day.

1.2 Research Question

How do various predictors affect the number of cyclists on this city’s bridges? This analysis aims to identify significant predictors of cyclist counts using Poisson regression, focusing on temperature, day, and precipitation. Our response variable is the total number of cyclists on the Manhattan Bridge.

2 Data Analysis

We first summarize the data to get a sense of the scope of the data set.

Cyclists <- read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/ManhattanBridge.csv", header = TRUE)
summary(Cyclists)
       X             Date            Day           HighTemp        LowTemp     
 Min.   : 1.0   Min.   :43009   Min.   :43009   Min.   :54.00   Min.   :43.00  
 1st Qu.: 8.5   1st Qu.:43017   1st Qu.:43017   1st Qu.:64.90   1st Qu.:52.00  
 Median :16.0   Median :43024   Median :43024   Median :71.10   Median :57.00  
 Mean   :16.0   Mean   :43024   Mean   :43024   Mean   :69.84   Mean   :57.92  
 3rd Qu.:23.5   3rd Qu.:43032   3rd Qu.:43032   3rd Qu.:75.45   3rd Qu.:64.90  
 Max.   :31.0   Max.   :43039   Max.   :43039   Max.   :82.00   Max.   :72.00  
 Precipitation    ManhattanBridge     Total      
 Min.   :0.0000   Min.   : 661    Min.   : 2835  
 1st Qu.:0.0000   1st Qu.:4530    1st Qu.:15956  
 Median :0.0000   Median :5610    Median :19939  
 Mean   :0.1365   Mean   :5298    Mean   :18652  
 3rd Qu.:0.0550   3rd Qu.:6640    3rd Qu.:22977  
 Max.   :3.0300   Max.   :7691    Max.   :26050  
var(Cyclists)
                            X          Date           Day     HighTemp
X                   82.666667     82.666667     82.666667  -39.7333333
Date                82.666667     82.666667     82.666667  -39.7333333
Day                 82.666667     82.666667     82.666667  -39.7333333
HighTemp           -39.733333    -39.733333    -39.733333   61.4503656
LowTemp            -25.286667    -25.286667    -25.286667   48.4430753
Precipitation        1.337333      1.337333      1.337333   -0.4978032
ManhattanBridge  -4998.000000  -4998.000000  -4998.000000 1855.0206452
Total           -16000.166667 -16000.166667 -16000.166667 6982.1644086
                      LowTemp Precipitation ManhattanBridge        Total
X               -2.528667e+01     1.3373333      -4998.0000   -16000.167
Date            -2.528667e+01     1.3373333      -4998.0000   -16000.167
Day             -2.528667e+01     1.3373333      -4998.0000   -16000.167
HighTemp         4.844308e+01    -0.4978032       1855.0206     6982.164
LowTemp          6.296340e+01     0.5293258      -3489.4718   -10768.102
Precipitation    5.293258e-01     0.2946570       -556.8061    -1890.055
ManhattanBridge -3.489472e+03  -556.8060645    2923576.0129  9607968.955
Total           -1.076810e+04 -1890.0554409    9607968.9548 31808787.525
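The summary shows Date and Day as numbers between 43009 and 43039 rather than as calendar dates or weekday names. One plausible explanation, which is an assumption and not something the source confirms, is that the CSV stores Excel-style serial dates (days counted from 1899-12-30). Under that assumption, a minimal sketch of recovering the calendar date and the weekday looks like this; `serial`, `dates`, and `weekday_num` are illustrative names.

```r
# Assumption: the values are Excel-style serial dates (origin 1899-12-30).
serial <- c(43009, 43039)                        # min and max of Date from summary()
dates <- as.Date(serial, origin = "1899-12-30")  # convert serial numbers to Date objects
weekday_num <- as.integer(format(dates, "%u"))   # 1 = Monday, ..., 7 = Sunday
print(dates)
print(weekday_num)
```

If the assumption holds, the records cover October 2017. In that case a weekday variable derived this way would describe a day-of-week effect, whereas the raw serial number used as Day acts as a linear time trend.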

2.1 Poisson Regression on Cyclists Count

We first build a Poisson regression with the Manhattan Bridge count as the response, leaving out Total. Poisson regression suits these data because the response is a count: the number of occurrences of an event, here cyclists crossing the bridge, within a fixed interval of time. We assume that the daily counts are independent of one another and that the variance of the count equals its mean.
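Written out, the model fitted next can be sketched as follows. Here $y_i$ is the Manhattan Bridge count on day $i$ and $\mu_i$ is its expected value; these symbols are notation introduced for this sketch and do not appear in the source.

$$
y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \beta_1\,\mathrm{HighTemp}_i + \beta_2\,\mathrm{Precipitation}_i + \beta_3\,\mathrm{LowTemp}_i + \beta_4\,\mathrm{Day}_i, \qquad
\mathrm{Var}(y_i) = \mathrm{E}(y_i) = \mu_i
$$

The last equality is the equal mean and variance assumption, which is checked later through the dispersion index.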

library(knitr)  # provides kable()
model.freq <- glm(ManhattanBridge ~ HighTemp + Precipitation + LowTemp + Day, family = poisson(link = "log"), data = Cyclists)
pois.count.coef = summary(model.freq)$coef
kable(pois.count.coef, caption = "The Poisson regression model for the counts of cyclists entering the bridge vs the temperature, precipitation, and day.")
The Poisson regression model for the counts of cyclists entering the bridge vs the temperature, precipitation, and day.
                 Estimate  Std. Error     z value    Pr(>|z|)
(Intercept)   218.5321098  14.1546585    15.43888           0
HighTemp        0.0163560   0.0005981    27.34862           0
Precipitation  -0.9591493   0.0194684   -49.26688           0
LowTemp        -0.0206914   0.0005624   -36.79199           0
Day            -0.0048775   0.0003287   -14.83949           0

The output above shows that every coefficient is significant at the .05 level, so we have statistical evidence that each predictor is associated with the count. Each coefficient represents the change in the log of the expected count for a one-unit increase in the predictor, holding all other predictors constant. For example, when precipitation goes up by one unit, the log of the expected count goes down by 0.959, which multiplies the expected count by exp(-0.959), or about 0.38. Precipitation has the largest coefficient in absolute value among the predictors, so it has the largest per-unit effect on the count, although the predictors are measured in different units. Comparing the two temperature variables, which share the same units, LowTemp has a slightly larger absolute coefficient than HighTemp, so a one-degree change in the low temperature is associated with a larger change in the count than a one-degree change in the high temperature.
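Because the coefficients are on the log scale, exponentiating them gives multiplicative effects on the expected count. The sketch below does this for the estimates copied from the table above; `est`, `rate_ratio`, and `pct_change` are illustrative names, not objects from the original analysis.

```r
# Coefficients copied from the Poisson count model table above
est <- c(HighTemp = 0.0163560, Precipitation = -0.9591493,
         LowTemp = -0.0206914, Day = -0.0048775)
rate_ratio <- exp(est)                # multiplicative change in expected count per unit increase
pct_change <- 100 * (rate_ratio - 1)  # the same effect expressed as a percentage
print(round(cbind(rate_ratio, pct_change), 3))
```

The precipitation rate ratio is about 0.38, so under this model each additional unit of precipitation is associated with roughly a 62% lower expected count when the other predictors are held constant.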

2.2 Poisson Regression on Rates

The following model keeps the same predictors but adds the logarithm of Total as an offset. With log(Total) as the offset, we model the Manhattan Bridge count relative to the total number of cyclists across all bridges, that is, the bridge's share of the day's total. If the total number of cyclists varies substantially from day to day, the model accounts for that variability.
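In symbols, the offset turns the count model into a model for a rate. As a sketch, with $\mu_i$ denoting the expected Manhattan Bridge count on day $i$ (notation introduced here, not taken from the source):

$$
\log\!\left(\frac{\mu_i}{\mathrm{Total}_i}\right) = \beta_0 + \beta_1\,\mathrm{HighTemp}_i + \beta_2\,\mathrm{Precipitation}_i + \beta_3\,\mathrm{LowTemp}_i + \beta_4\,\mathrm{Day}_i
\;\iff\;
\log \mu_i = \log(\mathrm{Total}_i) + \beta_0 + \beta_1\,\mathrm{HighTemp}_i + \beta_2\,\mathrm{Precipitation}_i + \beta_3\,\mathrm{LowTemp}_i + \beta_4\,\mathrm{Day}_i
$$

The term $\log(\mathrm{Total}_i)$ enters with its coefficient fixed at 1, which is what the `offset` argument of `glm()` does.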

model.rates <- glm(ManhattanBridge ~ HighTemp + Precipitation + LowTemp + Day, offset = log(Total), 
                   family = poisson(link = "log"), data = Cyclists)
kable(summary(model.rates)$coef, caption = "Poisson regression on the rate of 
      cyclists entering and leaving the Manhattan Bridge.")
Poisson regression on the rate of cyclists entering and leaving the Manhattan Bridge.
                 Estimate  Std. Error     z value    Pr(>|z|)
(Intercept)    59.9613619  14.6285560   4.0989255   0.0000415
HighTemp        0.0000196   0.0006049   0.0324407   0.9741206
Precipitation  -0.0707936   0.0130863  -5.4097362   0.0000001
LowTemp        -0.0018394   0.0005745  -3.2018355   0.0013655
Day            -0.0014205   0.0003397  -4.1817681   0.0000289

We see from the above output that all of our p-values are below a significance level of .01 except for HighTemp (p = 0.974). All of the coefficients have also shrunk in absolute value. One change that stands out is HighTemp: relative to its value in the count model it dropped the most, and it now contributes essentially nothing to the rate model.

2.3 EDA and Feature Engineering

Since LowTemp and HighTemp are very similar variables, we combine them into an average temperature by adding the two and dividing by two. This gives a single measure of how warm the day was when relating temperature to the number of cyclists on the bridge. We also recode precipitation to see whether any precipitation at all during the day changes the counts: we assign 0 if there was no rain on a day and 1 if there was any precipitation.

Cyclists$AvgTemp <- (Cyclists$HighTemp + Cyclists$LowTemp) / 2

Cyclists$NewPrecip <- ifelse(Cyclists$Precipitation > 0, 1, 0)
Cyclists$NewPrecip <- as.factor(Cyclists$NewPrecip)


Cyclists<- Cyclists[, !(names(Cyclists) %in% c("HighTemp", "LowTemp", "Precipitation"))]

head(Cyclists)
  X  Date   Day ManhattanBridge Total AvgTemp NewPrecip
1 1 43009 43009            4540 15975   58.45         0
2 2 43010 43010            7059 23784   62.00         0
3 3 43011 43011            7370 25280   63.50         0
4 4 43012 43012            7691 25477   65.45         0
5 5 43013 43013            7034 23942   73.45         0
6 6 43014 43014            6204 22197   75.05         0

2.4 New Variables

Date - The date on which the total number of bikes was recorded; it serves as our observation ID.
Day - The day, given as a number.
AvgTemp - The average temperature of a given day. Numerical.
NewPrecip - Indicator of precipitation: 1 if there was any precipitation, 0 if none.
ManhattanBridge - The total number of cyclists on the bridge on a given day. Numerical.
Total - The total number of bikes recorded on all the bridges on a given day. Numerical.

2.5 Dispersion Index

Next, we extract the approximate dispersion index using both Pearson and deviance residuals.

pois.model = glm(ManhattanBridge ~ Day + AvgTemp+ NewPrecip, 
                 family = poisson(link="log"), data =Cyclists)  
yhat = pois.model$fitted.values
pearson.resid = (Cyclists$ManhattanBridge - yhat)/sqrt(yhat)
Pearson.disp = sum(pearson.resid^2)/pois.model$df.residual

Deviance.disp = (pois.model$deviance)/pois.model$df.residual

disp = cbind(Pearson.disp = Pearson.disp, Deviance.disp = Deviance.disp)
kable(disp, caption="Dispersion parameter", align = 'c')
Dispersion parameter
Pearson.disp Deviance.disp
324.1135 358.387

The output above shows that the Poisson assumption is seriously violated. Both dispersion estimates are far above the value of 1 expected under the Poisson model, so we use a quasi-Poisson regression model to account for the variance being much larger than the mean.
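The Pearson dispersion estimate is the sum of squared Pearson residuals divided by the residual degrees of freedom, and R's built-in residual extractor gives the same number as the manual calculation. The sketch below checks this on simulated overdispersed counts; the data, `x`, `y`, and `fit` are hypothetical and are not the cyclist data.

```r
set.seed(1)
# Simulated overdispersed counts (hypothetical data, NOT the cyclist data)
x <- runif(200)
y <- rnbinom(200, mu = exp(1 + x), size = 1)
fit <- glm(y ~ x, family = poisson(link = "log"))

# Manual Pearson dispersion: sum of (observed - fitted)^2 / fitted over residual df
manual_disp <- sum((y - fitted(fit))^2 / fitted(fit)) / df.residual(fit)
# Same quantity from the built-in Pearson residuals
builtin_disp <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
# Deviance-based version
deviance_disp <- deviance(fit) / df.residual(fit)
print(c(manual = manual_disp, builtin = builtin_disp, deviance = deviance_disp))
```

Values well above 1, as in this simulation and in the cyclist data, indicate more variability than a Poisson model allows.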

2.6 Quasi Model

We now summarize the inferential statistics of the quasi-Poisson model.

quasi.model = glm(ManhattanBridge ~ Day + AvgTemp + NewPrecip, 
                 family = quasipoisson, data =Cyclists)  
summary(quasi.model )

Call:
glm(formula = ManhattanBridge ~ Day + AvgTemp + NewPrecip, family = quasipoisson, 
    data = Cyclists)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 507.263386 236.058816   2.149   0.0408 *  
Day          -0.011580   0.005482  -2.112   0.0441 *  
AvgTemp      -0.005623   0.006963  -0.808   0.4264    
NewPrecip1   -0.503976   0.109360  -4.608 8.72e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 324.1152)

    Null deviance: 19829.5  on 30  degrees of freedom
Residual deviance:  9676.5  on 27  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

We see from the above output that AvgTemp is not significant (p = 0.426). Although this is concerning, we expect average temperature to have an effect on how many cyclists use the bridge, so we keep it in the model and interpret its coefficient and significance with caution.

SE.quasi.pois = summary(quasi.model)$coef
kable(SE.quasi.pois, caption = "Summary statistics of quasi-poisson regression model")
Summary statistics of quasi-poisson regression model
                Estimate  Std. Error     t value    Pr(>|t|)
(Intercept)  507.2633860 236.0588157   2.1488856   0.0407684
Day           -0.0115796   0.0054821  -2.1122515   0.0440572
AvgTemp       -0.0056231   0.0069633  -0.8075388   0.4264135
NewPrecip1    -0.5039760   0.1093600  -4.6084131   0.0000872
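A quasi-Poisson fit keeps the Poisson point estimates and rescales the standard errors by the square root of the estimated dispersion. As a sketch, with $\hat\phi$ denoting the dispersion estimate reported in the summary (notation introduced here):

$$
\widehat{\mathrm{SE}}_{\mathrm{quasi}}(\hat\beta_j) = \sqrt{\hat\phi}\;\widehat{\mathrm{SE}}_{\mathrm{Poisson}}(\hat\beta_j),
\qquad \sqrt{\hat\phi} = \sqrt{324.1152} \approx 18.0
$$

Each standard error is therefore about 18 times larger than it would be under a plain Poisson fit of the same model, and the tests use a t distribution with 27 residual degrees of freedom instead of the normal distribution.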

This is our final model. Since the model is on the log scale, we need to express these values in a way that lets us draw practical conclusions. For example, the coefficient for NewPrecip1 is -0.504. This is the estimated regression coefficient comparing days with precipitation to days without, holding the other variables constant. The log of the expected cyclist count is 0.504 lower on days with precipitation than on days without, which corresponds to multiplying the expected count by exp(-0.504), or about 0.60. To make this easier to understand, we create visual aids that show what is actually happening.
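The back-transformation can be sketched numerically using the NewPrecip1 row of the table. The interval below is an approximate Wald interval built with a t quantile, which is one reasonable choice and not necessarily what the original analysis would use; `est`, `se`, `df_resid`, and `ci` are illustrative names.

```r
# Values copied from the quasi-Poisson table: NewPrecip1 estimate, standard error, residual df
est <- -0.5039760
se <- 0.1093600
df_resid <- 27

rate_ratio <- exp(est)                   # expected count with precipitation / without
t_crit <- qt(0.975, df_resid)            # t quantile, matching the t tests in the summary
ci <- exp(est + c(-1, 1) * t_crit * se)  # approximate 95% Wald interval on the ratio scale
print(round(c(rate_ratio = rate_ratio, lower = ci[1], upper = ci[2]), 3))
```

The ratio is about 0.60, so days with precipitation are expected to have roughly 40% fewer cyclists than dry days with the same Day and AvgTemp values, and the whole interval lies below 1.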

2.7 Visual Aid

Next, we make visualizations to show how the explanatory variables in the final working model affect the number of cyclists. We exponentiate the predicted log count of cyclists and make two graphs, each with one line for precipitation and one for no precipitation. The first graph holds temperature constant and shows the predicted count as the days progress. The second graph holds the day constant and shows the effect of average temperature on the predicted count.

library(ggplot2)

day_range <- seq(min(Cyclists$Day), max(Cyclists$Day), length.out = 100)  # Days from min to max
temp_range <- seq(min(Cyclists$AvgTemp), 90, length.out = 100)  # Temperature from min to 90 degrees
b <- coef(quasi.model)  # Fitted coefficients (log scale) from the final model

# Create a new data frame for predictions: every day appears once per precipitation level
pred_data <- data.frame(Day = rep(day_range, times = 2),  # Full day range, repeated for Precipitation 0 and 1
                        AvgTemp = rep(mean(Cyclists$AvgTemp), 200),  # Hold AvgTemp constant
                        NewPrecip1 = rep(c(0, 1), each = 100))  # Precipitation 0 and 1

# Calculate predicted cyclist counts for both conditions (no precipitation and precipitation)
pred_data$Cyclists_Predicted <- exp(b[["(Intercept)"]] + b[["Day"]] * pred_data$Day + 
                                     b[["AvgTemp"]] * pred_data$AvgTemp + 
                                     b[["NewPrecip1"]] * pred_data$NewPrecip1)

# Plot the effect of Day on predicted cyclist counts for both Precipitation 0 and 1
ggplot(pred_data, aes(x = Day, y = Cyclists_Predicted, color = factor(NewPrecip1))) +
  geom_line() +
  scale_color_manual(values = c("red", "blue"), labels = c("No Precipitation", "Precipitation")) +
  labs(title = "Effect of Day on Predicted Cyclist Counts",
       x = "Day",
       y = "Predicted Cyclist Counts",
       color = "Precipitation") +
  theme_minimal()

# Now plot the effect of AvgTemp on predicted cyclist counts for both Precipitation 0 and 1
pred_data$Day <- mean(Cyclists$Day)  # Hold Day constant
pred_data$AvgTemp <- rep(temp_range, times = 2)  # Full temperature range for each precipitation level
pred_data$Cyclists_Predicted <- exp(b[["(Intercept)"]] + b[["Day"]] * pred_data$Day + 
                                     b[["AvgTemp"]] * pred_data$AvgTemp + 
                                     b[["NewPrecip1"]] * pred_data$NewPrecip1)

ggplot(pred_data, aes(x = AvgTemp, y = Cyclists_Predicted, color = factor(NewPrecip1))) +
  geom_line() +
  scale_color_manual(values = c("red", "blue"), labels = c("No Precipitation", "Precipitation")) +
  labs(title = "Effect of AvgTemp on Predicted Cyclist Counts",
       x = "Average Temperature (\u00b0F)",  # Using Unicode for degree symbol
       y = "Predicted Cyclist Counts",
       color = "Precipitation") +
  scale_x_continuous(limits = c(min(Cyclists$AvgTemp), 80)) +  # Show the x-axis up to 80 degrees; points above 80 are dropped
  theme_minimal()
Warning: Removed 50 rows containing missing values or values outside the scale range
(`geom_line()`).

We see from the graphs above that precipitation has a major effect in both. The red line (no precipitation) sits well above the blue line (precipitation), showing that predicted counts go down when there is precipitation. The first graph shows that the predicted count decreases as the day increases. The second graph shows that the predicted count likewise decreases as the temperature goes up, although the temperature effect was not statistically significant in the model.

3 Conclusion

We do not use our original Poisson regression because of its large dispersion value, which means that the variance is much larger than the mean. We instead use a quasi-Poisson model, which accounts for this overdispersion by introducing a dispersion parameter and gives more reliable standard errors and p-values. Judging from the graphs and output above, precipitation has the biggest influence: it has the largest coefficient in absolute value and shows the clearest separation between its two levels. Both graphs also show a gradual decrease: as the day and the temperature go up in their respective graphs, the predicted count of cyclists goes down.

