The daily total of bike counts was conducted monthly on the Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. To keep count of cyclists entering and leaving Queens, Manhattan, and Brooklyn via the East River Bridges. The Traffic Information Management System (TIMS) collects the count data. Each record represents the total number of cyclists per 24 hours at Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge. My assigned tab in the data set concerns 31 records only from the Queensboro Bridge.
library(readxl)
Week9Data <- read_excel("Week9Data.xlsx")
#View(Week9Data)
kable(head(Week9Data), caption = "First few records in the data set")
| Date | Day | HighTemp | LowTemp | Precipitation | QueensboroBridge | Total |
|---|---|---|---|---|---|---|
| 2023-07-01 | Saturday | 84.9 | 72.0 | 0.23 | 3216 | 11867 |
| 2023-07-02 | Sunday | 87.1 | 73.0 | 0.00 | 3579 | 13995 |
| 2023-07-03 | Monday | 87.1 | 71.1 | 0.45 | 4230 | 16067 |
| 2023-07-04 | Tuesday | 82.9 | 70.0 | 0.00 | 3861 | 13925 |
| 2023-07-05 | Wednesday | 84.9 | 71.1 | 0.00 | 5862 | 23110 |
| 2023-07-06 | Thursday | 75.0 | 71.1 | 0.00 | 5251 | 21861 |
The variables in the data set are as follows:
Date: The calendar date - also serves as the observation ID
Day: The day of the week the observation was recorded on
HighTemp: The high temperature on that day
LowTemp: The low temperature on that day
Precipitation: Amount of precipitation on that day, presumably measured in inches
QueensboroBridge: Amount of cyclists traversing the Queensboro bridge on that day
Total: The total amount of cyclists entering and leaving Queens, Manhattan, and Brooklyn via the East River Bridges.
In this analysis we use cyclist traversal data that has a frequency count and other predictor variables. Our intention is to discern what relationship exists between the number of cyclists who travel in and out of Queens, Manhattan, and Brooklyn via (for this analysis) the Queensboro Bridge, and the various predictor variables.
In this data set, the response variable is measured as a count, indicating the number of cyclists travelling over the Queensboro Bridge.
The basic assumptions of Poisson regression are
Poisson Response: The response variable is a count per unit of time or space, described by a Poisson distribution.
Independence: The observations must be independent of one another.
Mean is equal to variance: By definition, the mean of a Poisson random variable must be equal to its variance.
Linearity: The log of the mean rate, log(λ), must be a linear function of x.
Here, I will discretize the variable HighTemp to streamline the poisson rates regression process.
highTemp = Week9Data$HighTemp
grp.highTemp = highTemp
grp.highTemp[highTemp %in% c(69.1, 71.1, 75.0, 75.9, 77.0)] = "69.1-77.0"
grp.highTemp[highTemp %in% c(78.1, 79.0, 81.0, 82.9, 84.0)] = "78.1-84.0"
grp.highTemp[highTemp %in% c(84.9, 87.1)] = "84.9-87.1"
grp.highTemp[highTemp %in% c(88.0, 89.1, 91.0, 93.0)] = "88.0-93.0"
Week9Data$grp.highTemp = grp.highTemp
We first build a Poisson frequency regression model and ignore the population size of each city in the data.
model.freq <- glm(QueensboroBridge ~ Day + grp.highTemp + LowTemp + Precipitation, family = poisson(link = "log"), data = Week9Data)
##
pois.count.coef = summary(model.freq)$coef
kable(pois.count.coef, caption = "The Poisson regression model for the counts of Queensboro Bridge cyclists versus the weekday, temperatures, and precipitation level.")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 9.0471658 | 0.0744620 | 121.500436 | 0 |
| DayMonday | 0.0792691 | 0.0108133 | 7.330716 | 0 |
| DaySaturday | -0.1739902 | 0.0113063 | -15.388749 | 0 |
| DaySunday | -0.2274712 | 0.0124348 | -18.293085 | 0 |
| DayThursday | 0.1400892 | 0.0121733 | 11.507951 | 0 |
| DayTuesday | 0.0867184 | 0.0118603 | 7.311647 | 0 |
| DayWednesday | 0.1990623 | 0.0111421 | 17.865747 | 0 |
| grp.highTemp78.1-84.0 | 0.0944110 | 0.0107462 | 8.785557 | 0 |
| grp.highTemp84.9-87.1 | 0.1231826 | 0.0124164 | 9.920987 | 0 |
| grp.highTemp88.0-93.0 | 0.1190514 | 0.0132101 | 9.012182 | 0 |
| LowTemp | -0.0096028 | 0.0011246 | -8.538612 | 0 |
| Precipitation | -0.3529774 | 0.0105627 | -33.417438 | 0 |
The above inferential table about the regression coefficients indicates weekday, temperature, and precipitation levels are all significant. This means, if we look at cyclists travelling across the Queensboro across the day, temperature, and precipitation, there is significant statistical evidence to support the potential discrepancy across the predictor variables. Keep in mind that this does not prove causation, but merely indicates a correlation between trends in travel. Also keep in mind that the sample size could affect the outcome of the model one way or another.
The following model assesses the potential relationship between Queensboro cyclist rates and the HighTemp. This is the primary interest of the model. We also want to adjust the relationship be the potential days of the week
model.rates <- glm(QueensboroBridge ~ Day + grp.highTemp, offset = log(Total),
family = poisson(link = "log"), data = Week9Data)
kable(summary(model.rates)$coef, caption = "Poisson regression on the rate of the
the Queensboro Bridge Cyclists on the seven weekdays adjusted by HighTemp.")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -1.3478768 | 0.0099091 | -136.024813 | 0.0000000 |
| DayMonday | -0.0686368 | 0.0102340 | -6.706767 | 0.0000000 |
| DaySaturday | -0.0291092 | 0.0109601 | -2.655928 | 0.0079090 |
| DaySunday | -0.0779492 | 0.0112852 | -6.907207 | 0.0000000 |
| DayThursday | -0.0478951 | 0.0111917 | -4.279513 | 0.0000187 |
| DayTuesday | -0.0605713 | 0.0108875 | -5.563356 | 0.0000000 |
| DayWednesday | -0.0691736 | 0.0104731 | -6.604867 | 0.0000000 |
| grp.highTemp78.1-84.0 | -0.0394314 | 0.0088408 | -4.460146 | 0.0000082 |
| grp.highTemp84.9-87.1 | -0.0090362 | 0.0081616 | -1.107169 | 0.2682209 |
| grp.highTemp88.0-93.0 | -0.0222552 | 0.0082689 | -2.691444 | 0.0071143 |
The above table indicates that the log of Queensboro Bridge rate is not identical across the age groups and among the seven weekdays. To be more specific, the log rates of Friday (baseline day) were higher than in the other six days.The lowest high temperature group (69.1-77.0) has the highest log rate. The regression coefficients represent the change of log rate between the associate high tempetature group and the reference high temperature group. The same interpretation applies to the change in log rate among the weekdays.
This week’s assignment is to revise your analysis in Week #9 by adding a new section to include a quasi-Poisson model to the report. In this new analysis, we modify the predictor variables in the following ways.
AvgTemp = (Week9Data$HighTemp + Week9Data$LowTemp)/2.
Week9Data$AvgTemp = AvgTemp
precipitation = Week9Data$Precipitation
grp.precipitation = precipitation
grp.precipitation[precipitation %in% c(0.00)] = "0"
grp.precipitation[precipitation %in% c(0.01, 0.06, 0.23, 0.35, 0.45, 0.57, 0.74, 1.78)] = "1"
Week9Data$grp.Precipitation = grp.precipitation
quasimodel.rates <- glm(QueensboroBridge ~ Day + grp.highTemp + grp.Precipitation, offset = log(Total),
family = quasipoisson, data = Week9Data)
pander(summary(model.rates)$coef, caption = "Quasi-Poisson regression on the rate of Queensboro cyclists on the seven weekdays adjusted by temperature")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -1.348 | 0.009909 | -136 | 0 |
| DayMonday | -0.06864 | 0.01023 | -6.707 | 1.99e-11 |
| DaySaturday | -0.02911 | 0.01096 | -2.656 | 0.007909 |
| DaySunday | -0.07795 | 0.01129 | -6.907 | 4.943e-12 |
| DayThursday | -0.0479 | 0.01119 | -4.28 | 1.873e-05 |
| DayTuesday | -0.06057 | 0.01089 | -5.563 | 2.646e-08 |
| DayWednesday | -0.06917 | 0.01047 | -6.605 | 3.979e-11 |
| grp.highTemp78.1-84.0 | -0.03943 | 0.008841 | -4.46 | 8.19e-06 |
| grp.highTemp84.9-87.1 | -0.009036 | 0.008162 | -1.107 | 0.2682 |
| grp.highTemp88.0-93.0 | -0.02226 | 0.008269 | -2.691 | 0.007114 |
##Collect the Dispersion index:
ydif=Week9Data$QueensboroBridge-exp(model.rates$linear.predictors) # diff between y and yhat
prsd = ydif/sqrt(exp(model.rates$linear.predictors)) # Pearson residuals
phi = sum(prsd^2)/15 # Dispersion index: 24-9 = 15
pander(cbind(Dispersion = phi))
| Dispersion |
|---|
| 24.1 |
The dispersion index is 24.1, which is very dispersed. As such, we will stick to the regular Poisson model.
# Friday
Friday = c(exp(-1.348), exp(-1.348-0.039),
exp(-1.348-0.009),exp(-1.348-0.022))
# Monday
Monday = c(exp(-1.348-0.069), exp(-1.348-0.069-0.039),
exp(-1.348-0.069-0.009),exp(-1.348-0.069-0.022))
# Saturday
Saturday= c(exp(-1.348-0.029), exp(-1.348-0.029-0.039),
exp(-1.348-0.029-0.009),exp(-1.348-0.029-0.022))
# Sunday
Sunday = c(exp(-1.348-0.078), exp(-1.348-0.078-0.039),
exp(-1.348-0.078-0.009),exp(-1.348-0.078-0.022))
# Thursday
Thursday = c(exp(-1.348-0.048), exp(-1.348-0.048-0.039),
exp(-1.348-0.048-0.009),exp(-1.348-0.048-0.022))
# Tuesday
Tuesday = c(exp(-1.348-0.061), exp(-1.348-0.061-0.039),
exp(-1.348-0.061-0.009),exp(-1.348-0.061-0.022))
# Wednesday
Wednesday= c(exp(-1.348-0.069), exp(-1.348-0.069-0.039),
exp(-1.348-0.069-0.009),exp(-1.348-0.069-0.022))
minmax = range(c(Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday))
####
plot(1:4,Friday, type="l", lty =1, col="red", xlab="",
ylab="Queensboro Cyclists Rate", xlim=c(0,4), ylim=c(0.2, 0.3), axes=FALSE )
axis(2)
axis(1, labels=c("[69.1,77.0]","[78.1,84.0]","[84.9,87.1]","[88.0,93.0]"),
at = 1:4)
points(1:4,Friday, pch=19, col="red")
##
lines(1:4, Monday, lty =2, col="blue")
points(1:4, Monday, pch=20, col="blue")
##
lines(1:4, Saturday, lty =3, col="purple")
points(1:4, Saturday, pch=21, col="purple")
###
lines(1:4, Sunday, lty =4, col="mediumvioletred")
points(1:4, Sunday, pch=22, col="mediumvioletred")
###
lines(1:4, Thursday, lty =4, col="darkgreen")
points(1:4, Thursday, pch=22, col="darkgreen")
###
lines(1:4, Tuesday, lty =4, col="gold4")
points(1:4, Tuesday, pch=22, col="gold4")
###
lines(1:4, Wednesday, lty =4, col="lightblue4")
points(1:4, Wednesday, pch=22, col="lightblue4")
##
legend("topleft", c("Friday","Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday" ),
pch=19:22, lty=1:4, bty="n",
col=c("red", "blue", "purple", "mediumvioletred", "darkgreen", "gold4", "lightblue4"))
Several conclusions we can draw from the output of the regression models.
The regression model based on the cancer count is not appropriate since the information on the total number of cyclists can not be used. Simply include the Total in the regression model to improve the model performance. See the following output of the fitted Poisson regression model.
model.freq.pop <- glm(QueensboroBridge ~ Day + grp.highTemp + Total, family = poisson(link = "log"),
data = Week9Data)
##
pois.count.coef.pop = summary(model.freq.pop)$coef
kable(pois.count.coef.pop, caption = "The Poisson regression model for
the counts of Queensboro Bridge Cyclists versus the weekday,
total cyclists, and temperature group")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 7.4532089 | 0.0165792 | 449.5511188 | 0.0000000 |
| DayMonday | -0.0388068 | 0.0108008 | -3.5929637 | 0.0003269 |
| DaySaturday | -0.0045603 | 0.0110113 | -0.4141456 | 0.6787675 |
| DaySunday | -0.0459664 | 0.0112871 | -4.0724823 | 0.0000465 |
| DayThursday | 0.0275690 | 0.0116011 | 2.3764129 | 0.0174819 |
| DayTuesday | -0.0136387 | 0.0115410 | -1.1817588 | 0.2373014 |
| DayWednesday | -0.0188565 | 0.0117243 | -1.6083275 | 0.1077635 |
| grp.highTemp78.1-84.0 | 0.0060722 | 0.0088252 | 0.6880490 | 0.4914219 |
| grp.highTemp84.9-87.1 | 0.0356574 | 0.0081311 | 4.3853115 | 0.0000116 |
| grp.highTemp88.0-93.0 | 0.0249607 | 0.0082450 | 3.0273762 | 0.0024669 |
| Total | 0.0000500 | 0.0000008 | 62.7806550 | 0.0000000 |
The cyclists rate on Friday is significantly higher than on the other weekdays It seems that there is no significant difference between Saturday, Tuesday, and Wednesday. The reason why Friday has a higher cyclists rate needs further investigation with additional information.
There is an uncertain relationship between highTemp and the Queensboro Cyclist rate. The cyclist rate undulates as temperature increases, reaching peaks in the 69-77 range and the 85-87 range.
The last statistical observation is that there is no interaction effect between the highTemp groups and the days of the week. The rate curves are “parallel”.
This is only a small data set with limited information. All conclusions in this report are only based on the given data set.