1 Introduction

The daily total of bike counts was conducted monthly on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. To keep count of cyclists entering and leaving Queens, Manhattan, and Brooklyn via the East River Bridges. The Traffic Information Management System (TIMS) collects the count data. Each record represents the total number of cyclists per 24 hours at Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. My assigned tab in the data set concerns 31 records only from the Queensboro Bridge.

library(readxl)
Week9Data <- read_excel("Week9Data.xlsx")
#View(Week9Data)
kable(head(Week9Data), caption = "First few records in the data set") 
First few records in the data set
Date Day HighTemp LowTemp Precipitation QueensboroBridge Total
2023-07-01 Saturday 84.9 72.0 0.23 3216 11867
2023-07-02 Sunday 87.1 73.0 0.00 3579 13995
2023-07-03 Monday 87.1 71.1 0.45 4230 16067
2023-07-04 Tuesday 82.9 70.0 0.00 3861 13925
2023-07-05 Wednesday 84.9 71.1 0.00 5862 23110
2023-07-06 Thursday 75.0 71.1 0.00 5251 21861

The variables in the data set are as follows:

  • Date: The calendar date - also serves as the observation ID

  • Day: The day of the week the observation was recorded on

  • HighTemp: The high temperature on that day

  • LowTemp: The low temperature on that day

  • Precipitation: Amount of precipitation on that day, presumably measured in inches

  • QueensboroBridge: Amount of cyclists traversing the Queensboro bridge on that day

  • Total: The total amount of cyclists entering and leaving Queens, Manhattan, and Brooklyn via the East River Bridges.

1.1 Research Question

In this analysis we use cyclist traversal data that has a frequency count and other predictor variables. Our intention is to discern what relationship exists between the number of cyclists who travel in and out of Queens, Manhattan, and Brooklyn via (for this analysis) the Queensboro Bridge, and the various predictor variables.

In this data set, the response variable is measured as a count, indicating the number of cyclists travelling over the Queensboro Bridge.

2 Poisson Regression

2.1 Assumptions and Conditions

The basic assumptions of Poisson regression are

  • Poisson Response: The response variable is a count per unit of time or space, described by a Poisson distribution.

  • Independence: The observations must be independent of one another.

  • Mean is equal to variance: By definition, the mean of a Poisson random variable must be equal to its variance.

  • Linearity: The log of the mean rate, log(λ), must be a linear function of x.

2.2 Discretizing HighTemp

Here, I will discretize the variable HighTemp to streamline the poisson rates regression process.

highTemp = Week9Data$HighTemp
grp.highTemp = highTemp
grp.highTemp[highTemp %in% c(69.1, 71.1, 75.0, 75.9, 77.0)] = "69.1-77.0"
grp.highTemp[highTemp %in% c(78.1, 79.0, 81.0, 82.9, 84.0)] = "78.1-84.0"
grp.highTemp[highTemp %in% c(84.9, 87.1)] = "84.9-87.1"
grp.highTemp[highTemp %in% c(88.0, 89.1, 91.0, 93.0)] = "88.0-93.0"

Week9Data$grp.highTemp = grp.highTemp

2.3 Building the Poisson Regression Model

We first build a Poisson frequency regression model and ignore the population size of each city in the data.

model.freq <- glm(QueensboroBridge ~ Day + grp.highTemp + LowTemp + Precipitation, family = poisson(link = "log"), data = Week9Data)
##
pois.count.coef = summary(model.freq)$coef
kable(pois.count.coef, caption = "The Poisson regression model for the counts of Queensboro Bridge cyclists versus the weekday, temperatures, and precipitation level.")
The Poisson regression model for the counts of Queensboro Bridge cyclists versus the weekday, temperatures, and precipitation level.
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.0471658 0.0744620 121.500436 0
DayMonday 0.0792691 0.0108133 7.330716 0
DaySaturday -0.1739902 0.0113063 -15.388749 0
DaySunday -0.2274712 0.0124348 -18.293085 0
DayThursday 0.1400892 0.0121733 11.507951 0
DayTuesday 0.0867184 0.0118603 7.311647 0
DayWednesday 0.1990623 0.0111421 17.865747 0
grp.highTemp78.1-84.0 0.0944110 0.0107462 8.785557 0
grp.highTemp84.9-87.1 0.1231826 0.0124164 9.920987 0
grp.highTemp88.0-93.0 0.1190514 0.0132101 9.012182 0
LowTemp -0.0096028 0.0011246 -8.538612 0
Precipitation -0.3529774 0.0105627 -33.417438 0

The above inferential table about the regression coefficients indicates weekday, temperature, and precipitation levels are all significant. This means, if we look at cyclists travelling across the Queensboro across the day, temperature, and precipitation, there is significant statistical evidence to support the potential discrepancy across the predictor variables. Keep in mind that this does not prove causation, but merely indicates a correlation between trends in travel. Also keep in mind that the sample size could affect the outcome of the model one way or another.

2.4 Poisson Regression on Rates

The following model assesses the potential relationship between Queensboro cyclist rates and the HighTemp. This is the primary interest of the model. We also want to adjust the relationship be the potential days of the week

model.rates <- glm(QueensboroBridge ~ Day + grp.highTemp, offset = log(Total), 
                   family = poisson(link = "log"), data = Week9Data)

kable(summary(model.rates)$coef, caption = "Poisson regression on the rate of the 
      the Queensboro Bridge Cyclists on the seven weekdays adjusted by HighTemp.")
Poisson regression on the rate of the the Queensboro Bridge Cyclists on the seven weekdays adjusted by HighTemp.
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.3478768 0.0099091 -136.024813 0.0000000
DayMonday -0.0686368 0.0102340 -6.706767 0.0000000
DaySaturday -0.0291092 0.0109601 -2.655928 0.0079090
DaySunday -0.0779492 0.0112852 -6.907207 0.0000000
DayThursday -0.0478951 0.0111917 -4.279513 0.0000187
DayTuesday -0.0605713 0.0108875 -5.563356 0.0000000
DayWednesday -0.0691736 0.0104731 -6.604867 0.0000000
grp.highTemp78.1-84.0 -0.0394314 0.0088408 -4.460146 0.0000082
grp.highTemp84.9-87.1 -0.0090362 0.0081616 -1.107169 0.2682209
grp.highTemp88.0-93.0 -0.0222552 0.0082689 -2.691444 0.0071143

The above table indicates that the log of Queensboro Bridge rate is not identical across the age groups and among the seven weekdays. To be more specific, the log rates of Friday (baseline day) were higher than in the other six days.The lowest high temperature group (69.1-77.0) has the highest log rate. The regression coefficients represent the change of log rate between the associate high tempetature group and the reference high temperature group. The same interpretation applies to the change in log rate among the weekdays.

3 Week 10 Adjustments:

This week’s assignment is to revise your analysis in Week #9 by adding a new section to include a quasi-Poisson model to the report. In this new analysis, we modify the predictor variables in the following ways.

AvgTemp = (Week9Data$HighTemp + Week9Data$LowTemp)/2.
Week9Data$AvgTemp = AvgTemp

precipitation = Week9Data$Precipitation
grp.precipitation = precipitation
grp.precipitation[precipitation %in% c(0.00)] = "0"
grp.precipitation[precipitation %in% c(0.01, 0.06, 0.23, 0.35, 0.45, 0.57, 0.74, 1.78)] = "1"

Week9Data$grp.Precipitation = grp.precipitation

3.1 Quasi-Poisson Model

quasimodel.rates <- glm(QueensboroBridge ~ Day + grp.highTemp + grp.Precipitation, offset = log(Total), 
                   family = quasipoisson, data = Week9Data)
pander(summary(model.rates)$coef, caption = "Quasi-Poisson regression on the rate of Queensboro cyclists on the seven weekdays adjusted by temperature")
Quasi-Poisson regression on the rate of Queensboro cyclists on the seven weekdays adjusted by temperature
  Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.348 0.009909 -136 0
DayMonday -0.06864 0.01023 -6.707 1.99e-11
DaySaturday -0.02911 0.01096 -2.656 0.007909
DaySunday -0.07795 0.01129 -6.907 4.943e-12
DayThursday -0.0479 0.01119 -4.28 1.873e-05
DayTuesday -0.06057 0.01089 -5.563 2.646e-08
DayWednesday -0.06917 0.01047 -6.605 3.979e-11
grp.highTemp78.1-84.0 -0.03943 0.008841 -4.46 8.19e-06
grp.highTemp84.9-87.1 -0.009036 0.008162 -1.107 0.2682
grp.highTemp88.0-93.0 -0.02226 0.008269 -2.691 0.007114
##Collect the Dispersion index:

ydif=Week9Data$QueensboroBridge-exp(model.rates$linear.predictors)  # diff between y and yhat
prsd = ydif/sqrt(exp(model.rates$linear.predictors))   # Pearson residuals
phi = sum(prsd^2)/15          # Dispersion index: 24-9 = 15  
pander(cbind(Dispersion = phi))
Dispersion
24.1

The dispersion index is 24.1, which is very dispersed. As such, we will stick to the regular Poisson model.

4 Some Graphical Comparison

# Friday
Friday = c(exp(-1.348), exp(-1.348-0.039),   
               exp(-1.348-0.009),exp(-1.348-0.022))
# Monday
Monday = c(exp(-1.348-0.069), exp(-1.348-0.069-0.039),   
            exp(-1.348-0.069-0.009),exp(-1.348-0.069-0.022))
# Saturday
Saturday= c(exp(-1.348-0.029), exp(-1.348-0.029-0.039),   
           exp(-1.348-0.029-0.009),exp(-1.348-0.029-0.022))
# Sunday
Sunday = c(exp(-1.348-0.078), exp(-1.348-0.078-0.039),   
          exp(-1.348-0.078-0.009),exp(-1.348-0.078-0.022))
# Thursday
Thursday = c(exp(-1.348-0.048), exp(-1.348-0.048-0.039),   
               exp(-1.348-0.048-0.009),exp(-1.348-0.048-0.022))
# Tuesday
Tuesday = c(exp(-1.348-0.061), exp(-1.348-0.061-0.039),   
            exp(-1.348-0.061-0.009),exp(-1.348-0.061-0.022))
# Wednesday
Wednesday= c(exp(-1.348-0.069), exp(-1.348-0.069-0.039),   
           exp(-1.348-0.069-0.009),exp(-1.348-0.069-0.022))
minmax = range(c(Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday))
####
plot(1:4,Friday, type="l", lty =1, col="red", xlab="", 
               ylab="Queensboro Cyclists Rate", xlim=c(0,4), ylim=c(0.2, 0.3), axes=FALSE )
axis(2)
axis(1, labels=c("[69.1,77.0]","[78.1,84.0]","[84.9,87.1]","[88.0,93.0]"), 
            at = 1:4)
points(1:4,Friday, pch=19, col="red")
##
lines(1:4, Monday, lty =2, col="blue")
points(1:4, Monday, pch=20, col="blue")
##
lines(1:4, Saturday, lty =3, col="purple")
points(1:4, Saturday, pch=21, col="purple")
###
lines(1:4, Sunday, lty =4, col="mediumvioletred")
points(1:4, Sunday, pch=22, col="mediumvioletred")
###
lines(1:4, Thursday, lty =4, col="darkgreen")
points(1:4, Thursday, pch=22, col="darkgreen")
###
lines(1:4, Tuesday, lty =4, col="gold4")
points(1:4, Tuesday, pch=22, col="gold4")
###
lines(1:4, Wednesday, lty =4, col="lightblue4")
points(1:4, Wednesday, pch=22, col="lightblue4")
##
legend("topleft", c("Friday","Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday" ),
                  pch=19:22, lty=1:4,  bty="n", 
        col=c("red", "blue", "purple", "mediumvioletred", "darkgreen", "gold4", "lightblue4"))

5 Conclusion & Discussion

Several conclusions we can draw from the output of the regression models.

The regression model based on the cancer count is not appropriate since the information on the total number of cyclists can not be used. Simply include the Total in the regression model to improve the model performance. See the following output of the fitted Poisson regression model.

model.freq.pop <- glm(QueensboroBridge ~ Day + grp.highTemp + Total, family = poisson(link = "log"), 
                      data = Week9Data)
##
pois.count.coef.pop = summary(model.freq.pop)$coef
kable(pois.count.coef.pop, caption = "The Poisson regression model for 
         the counts of Queensboro Bridge Cyclists versus the weekday, 
         total cyclists, and temperature group")
The Poisson regression model for the counts of Queensboro Bridge Cyclists versus the weekday, total cyclists, and temperature group
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.4532089 0.0165792 449.5511188 0.0000000
DayMonday -0.0388068 0.0108008 -3.5929637 0.0003269
DaySaturday -0.0045603 0.0110113 -0.4141456 0.6787675
DaySunday -0.0459664 0.0112871 -4.0724823 0.0000465
DayThursday 0.0275690 0.0116011 2.3764129 0.0174819
DayTuesday -0.0136387 0.0115410 -1.1817588 0.2373014
DayWednesday -0.0188565 0.0117243 -1.6083275 0.1077635
grp.highTemp78.1-84.0 0.0060722 0.0088252 0.6880490 0.4914219
grp.highTemp84.9-87.1 0.0356574 0.0081311 4.3853115 0.0000116
grp.highTemp88.0-93.0 0.0249607 0.0082450 3.0273762 0.0024669
Total 0.0000500 0.0000008 62.7806550 0.0000000

The cyclists rate on Friday is significantly higher than on the other weekdays It seems that there is no significant difference between Saturday, Tuesday, and Wednesday. The reason why Friday has a higher cyclists rate needs further investigation with additional information.

There is an uncertain relationship between highTemp and the Queensboro Cyclist rate. The cyclist rate undulates as temperature increases, reaching peaks in the 69-77 range and the 85-87 range.

The last statistical observation is that there is no interaction effect between the highTemp groups and the days of the week. The rate curves are “parallel”.

This is only a small data set with limited information. All conclusions in this report are only based on the given data set.