1 Introduction

In the following report, we will build a model to represent the number of cyclists entering and leaving Queens, Manhattan, and Brooklyn using the Williamsburg Bridge on a given day. In this report, we will build two models: a Poisson regression on counts and a Poisson regression on rates. An analysis will be conducted to determine the best model for the data. Finally, we will summarize the final results in the concluding remarks.

1.1 Data Description

The data is taken from the Traffic Information Management System (TIMS), which counts the number of cyclists entering and leaving Queens, Manhattan, and Brooklyn using the Williamsburg Bridge on a given day. The data is taken over a month time period in July and contains the following variables:

  1. Date: the year/month/day of the observation. This is the observation ID.
  2. Day: the year/month/day of the observation.
  3. HighTemp: the maximum temperature on a particular date.
  4. LowTemp: the minimum temperature on a particular date.
  5. Precipitation: amount of rain (in inches) on a particular date.
  6. WilliamsburgBridge: number of cyclists that used the Williamsburg Bridge on a particular date.
  7. Total: total number of cyclists on a given day.

1.2 Practical Question

How many cyclists use the Williamsburg Bridge on a given day with varying temperatures and precipitation compared to the total number of cyclists?

2 Analysis

In the following section, we will first provide a preliminary analysis of the data. After conducting the preliminary analysis, we will then building two models:

Finally, we will choose a model that best fits the data and summarize the model in concluding remarks.

2.1 Preliminary Analysis

In the following preliminary analysis, we will load and summarize the data. We will look at each variable and determine various descriptive statistics.

2.1.1 Load and Summarize Data

assig.dta <- readxl::read_xlsx("C:/Users/Angelo/OneDrive/Desktop/College Babyyyyyyy/Fourth Year/STA321/data/w09-AssignDataSet.xlsx", sheet = "Saporito") %>%
  as.data.frame() 
  assig.dta$Day <- as.Date(as.POSIXct(assig.dta$Day, "GMT"),"%Y-%m-%d") 
## Warning in as.POSIXlt.POSIXct(x, tz = tz): unknown timezone '%Y-%m-%d'
  assig.dta$Day <- weekdays(assig.dta$Day)
pander(summary(assig.dta[3:7]), caption = "Table 1: Summary Statistics") 
Table 1: Summary Statistics (continued below)
HighTemp LowTemp Precipitation WilliamsburgBridge
Min. :69.10 Min. :63.00 Min. :0.0000 Min. :2974
1st Qu.:78.55 1st Qu.:68.00 1st Qu.:0.0000 1st Qu.:4730
Median :84.00 Median :71.10 Median :0.0000 Median :6116
Mean :82.70 Mean :71.10 Mean :0.1352 Mean :6074
3rd Qu.:87.10 3rd Qu.:74.45 3rd Qu.:0.0050 3rd Qu.:7305
Max. :93.00 Max. :78.10 Max. :1.7800 Max. :8605
Total
Min. : 8210
1st Qu.:14618
Median :18696
Mean :18805
3rd Qu.:22979
Max. :26969

After loading and developing a summary statistics table for the data we can see that the variables have the following order statistics:

  • HighTemp has a mean of 82.7 meaning that the maximum temperature on a given day is expected to be in the low 80s.
  • LowTemp has a mean of 71.1 meaning that the expected minimum temperature is in the low 70s.
  • Precipitation has a mean of 0.1352 meaning that the expected amount of precipitation on a give day is approximately 0.14 inches.
  • WilliamsburgBridge has a mean of 6074 meaning that approximately 6000 cyclists are expected to cross the Williamsburg Bridge on a given day.
  • Total has a mean of 18805 meaning that approximately 19000 cyclists are expected to cross East River Bridges on a given day.

2.1.2 Poisson Regression Frequency Model

After conducting a preliminary analysis of the data, We will now build a Poisson frequency regression model and ignore the population size of each city in the data. This model estimates the frequency of cyclists on a given day given certain weather conditions.

model.freq <- glm(WilliamsburgBridge ~ Day+HighTemp+LowTemp+Precipitation, family = poisson(link = "log"), data = assig.dta)
##
pois.count.coef = summary(model.freq)$coef
pander(pois.count.coef, caption = "Table 2: The Poisson regression model on the number of cyclists using the Williamsburg Bridge")
Table 2: The Poisson regression model on the number of cyclists using the Williamsburg Bridge
  Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.561 0.04149 206.3 0
DayMonday 0.1023 0.009376 10.91 1.072e-27
DaySaturday -0.1675 0.00964 -17.38 1.258e-67
DaySunday -0.2144 0.01004 -21.36 2.839e-101
DayThursday 0.105 0.009785 10.73 7.575e-27
DayTuesday 0.08168 0.00984 8.3 1.039e-16
DayWednesday 0.2068 0.009542 21.67 3.548e-104
HighTemp 0.008349 0.0006955 12 3.35e-33
LowTemp -0.007239 0.001014 -7.141 9.248e-13
Precipitation -0.374 0.00939 -39.82 0

The above inferential table about the regression coefficients indicates that the model as well as all variables are statistically significant. This means, if we look at the number of cyclists on a given day given certain temperature and precipitation, there is statistical evidence to support the potential discrepancy across the adifferent days and weather types.

However, the other way to look at the model is the appropriateness model. The counts of cyclists are dependent on the total number of cyclists. Ignoring the population size implies the information in the sample was not effectively used. In the next subsection, we model the cyclist rates that involve the population size.

The other way to look at the model is goodness of the model. The cyclist counts are dependent on the population size. Ignoring the population size implies the information in the sample was not effectively used. In the next subsection, we model the cyclist rates that involve the population size.

2.1.3 Poisson regression model on rates of cyclists using the Williamsburg Bridge

The following model assesses the potential relationship between cyclist rates, weekdays, and weather conditions. This is the primary interest of the model.

model.rates <- glm(WilliamsburgBridge ~ Day+HighTemp+LowTemp+Precipitation, offset = log(Total), 
                   family = poisson(link = "log"), data = assig.dta)

pander(summary(model.rates)$coef, caption = "Table 3: Poisson regression on rates of cyclists using the Williamsburg Bridge")
Table 3: Poisson regression on rates of cyclists using the Williamsburg Bridge
  Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.229 0.0417 -29.48 5.29e-191
DayMonday -0.02595 0.009431 -2.752 0.005927
DaySaturday -0.001161 0.009696 -0.1197 0.9047
DaySunday -0.06814 0.01017 -6.701 2.073e-11
DayThursday -0.02796 0.009916 -2.82 0.004807
DayTuesday -0.05931 0.01002 -5.919 3.239e-09
DayWednesday -0.0182 0.009618 -1.892 0.05849
HighTemp -0.005297 0.0007123 -7.437 1.034e-13
LowTemp 0.007968 0.001008 7.908 2.61e-15
Precipitation 0.005687 0.008631 0.6589 0.51

The above table indicates that the log of cyclists is not identical across the weekdays and among the weather conditions. To be more specific, the log rates of Friday (baseline day) were higher than the other weekdays. The log rates of LowTemp were higher than the HighTemp. The regression coefficients represent the change of log rate between the associate weekdays and the reference weekday. The same interpretation applies to the change in log rate among the weather conditions.

2.1.4 Quasi-Poisson model on rates of cyclists using the Williamsburg Bridge

The above two Poison models assume that there is no dispersion issue in the model. In this section, a quasi-Poisson rates model will be constructed so that we can understand the dispersion of the model.

##assign new variables
assig.dta$AvgTemp <- (assig.dta$HighTemp+assig.dta$LowTemp)/2 #average temperature
assig.dta$NewPrecip <- ifelse(assig.dta$Precipitation==0,0,
                              ifelse(assig.dta$Precipitation>0,1,NA)) #discretize Precipitation


quasimodel.rates <- glm(WilliamsburgBridge ~ Day + AvgTemp + NewPrecip, offset = log(Total), 
                   family = quasipoisson, data = assig.dta)

pander(summary(quasimodel.rates)$coef, caption = "Quasi-Poisson regression on the rate of cyclists using the Williamsburg Bridge.")
Quasi-Poisson regression on the rate of cyclists using the Williamsburg Bridge.
  Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.102 0.1423 -7.741 1.015e-07
DayMonday -0.04638 0.03357 -1.382 0.1809
DaySaturday -0.01098 0.03534 -0.3107 0.759
DaySunday -0.08415 0.03565 -2.36 0.02754
DayThursday -0.02134 0.03468 -0.6152 0.5448
DayTuesday -0.05508 0.03581 -1.538 0.1383
DayWednesday -0.02268 0.03473 -0.653 0.5205
AvgTemp 9.334e-05 0.001798 0.05192 0.9591
NewPrecip 0.0001672 0.02373 0.007045 0.9944

The dispersion index is as follows:

ydif=assig.dta$WilliamsburgBridge-exp(quasimodel.rates$linear.predictors)  # diff between y and yhat
prsd = ydif/sqrt(exp(quasimodel.rates$linear.predictors))   # Pearson residuals
phi = sum(prsd^2)/15          # Dispersion index: 24-9 = 15  
pander(cbind(Dispersion = phi))
Dispersion
20.86

2.1.5 Final Model

The dispersion index is 20.86 which means that the p-values of the model are not reliable since the dispersion is significantly different than 1. Thus, we will stay with the regular Poisson model.

The intercept represents the baseline log-cyclist rate (of baseline day Friday). The actual rate is exp(-1.229)≈29.26 percent. The coefficent of DayWednesday is -0.02268 which is the difference of the log-rates between baseline day Friday and the day of Wednesday, to be more specific, log(RWednesday)−log(RFriday)=-0.02268 which is equivalent to

\[ \log \left( \frac{R_{\text{Wednesday}}}{R_{\text{Friday}}} \right) = -0.02268 ~~~\Rightarrow~~~\frac{R_{\text{Wednesday}}}{R_{\text{Friday}}} = e^{-0.02268} \approx 0.9776. \]

This means, with all other variables remaining fixed, the cyclist rate on Wednesdays is about 2% lower than on Fridays.

2.1.6 Graphical Representation

The inferential tables of the Poisson regression models in the previous sections give numerical information about the potential discrepancy between various days of the week and weather conditions. But it is not intuitive. Next, we create a graphic that makes the hidden pattern visible.

The following calculation is based on the regression equation with coefficients given in the above table (table 3). Note that all variables in the model are indicator variables. Each of these indicator variables takes only two possible values: 0 and 1.

For example, exp(−1.229) gives the cyclist rate of the baseline day, Friday. Next, exp(−1.229-.005) gives the cyclist rate of baseline day, Friday, and the HighTemp. Following the same pattern, you can find the cyclist rate for each combination of the day and weather condition.

Monday <- c(exp(-1.229-.026), exp(-1.229-.026-.005), exp(-1.229-.026+.008), exp(-1.229-.026+.006))
Tuesday <- c(exp(-1.229-.059), exp(-1.229-.059-.005), exp(-1.229-.059+.008), exp(-1.229-.059+.006))
Wednesday <-c(exp(-1.229-.018), exp(-1.229-.018-.005), exp(-1.229-.018+.008), exp(-1.229-.018+.006))
Thursday <- c(exp(-1.229-.028), exp(-1.229-.028-.005), exp(-1.229-.028+.008), exp(-1.229-.028+.006))
Friday <- c(exp(-1.229), exp(-1.229+(-.005)), exp(-1.229+.008), exp(-1.229+.006))
Saturday <- c(exp(-1.229-.001), exp(-1.229-.001-.005), exp(-1.229-.001+.008), exp(-1.229-.001+.006))
Sunday <- c(exp(-1.229-.068), exp(-1.229-.068-.005), exp(-1.229-.068+.008), exp(-1.229-.068+.006))

day.mat <- matrix(data = c(Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday), ncol = 7, nrow = 4)
matplot(day.mat, type = "b", pch = 1, col = 1:7)
legend("bottomright", legend = c("M","T","W","Th","F","Sa","S"), col=1:7, pch=1, horiz = T, cex = .75)

The above graph shows that Friday and Saturday have the highest rates of cyclists on the Williamsburg Bridge, whereas, Sunday and Tuesday have the lowest rates.

3 Conclusion and Final Remarks

In the above sections, we developed a narrative to understand cyclist behavior on the Willaimsburg Bridge. A preliminary analysis of the data was developed and three models, a Poisson regression frequency model, a Poisson regression rate model, and a quasi-Poisson rate model, were built to understand cyclist behavior on various weekdays and weather conditions. After eliminating the quasi-Poisson model as the dispersion was deemed to be too high, we settled with the Poisson rates model which took the form \[\text{log-rate} = -1.229 -0.026 \times \text{DayMonday} -0.0012 \times \text{DaySaturday} -0.0681 \times \text{DaySunday} \\ -0.027 \times \text{DayThursday} -0.0593 \times \text{DayTuesday} -0.0182 \times \text{DayWednesday} \\ -0.0053 \times \text{HighTemp} + 0.008 \times \text{LowTemp} +0.0057 \times \text{Precipitation}\]

Or equivalently, we can write the rate model as

\[ rate =\exp(-1.229 -0.026 \times \text{DayMonday} -0.0012 \times \text{DaySaturday} -0.0681 \times \text{DaySunday} \\ -0.027 \times \text{DayThursday} -0.0593 \times \text{DayTuesday} -0.0182 \times \text{DayWednesday} \\ -0.0053 \times \text{HighTemp} + 0.008 \times \text{LowTemp} +0.0057 \times \text{Precipitation}) \] Finally, a graphical representation of the data was developed for a clearer understanding of the rates model. It was deduced that Friday and Saturday had the highest rates of cyclists and Sunday and Tuesday had the lowest rates.