In the following report, we will build a model to represent the number of cyclists entering and leaving Queens, Manhattan, and Brooklyn using the Williamsburg Bridge on a given day. In this report, we will build two models: a Poisson regression on counts and a Poisson regression on rates. An analysis will be conducted to determine the best model for the data. Finally, we will summarize the final results in the concluding remarks.
The data is taken from the Traffic Information Management System (TIMS), which counts the number of cyclists entering and leaving Queens, Manhattan, and Brooklyn using the Williamsburg Bridge on a given day. The data is taken over a month time period in July and contains the following variables:
How many cyclists use the Williamsburg Bridge on a given day with varying temperatures and precipitation compared to the total number of cyclists?
In the following section, we will first provide a preliminary analysis of the data. After conducting the preliminary analysis, we will then building two models:
Finally, we will choose a model that best fits the data and summarize the model in concluding remarks.
In the following preliminary analysis, we will load and summarize the data. We will look at each variable and determine various descriptive statistics.
assig.dta <- readxl::read_xlsx("C:/Users/Angelo/OneDrive/Desktop/College Babyyyyyyy/Fourth Year/STA321/data/w09-AssignDataSet.xlsx", sheet = "Saporito") %>%
as.data.frame()
assig.dta$Day <- as.Date(as.POSIXct(assig.dta$Day, "GMT"),"%Y-%m-%d")
## Warning in as.POSIXlt.POSIXct(x, tz = tz): unknown timezone '%Y-%m-%d'
assig.dta$Day <- weekdays(assig.dta$Day)
pander(summary(assig.dta[3:7]), caption = "Table 1: Summary Statistics")
| HighTemp | LowTemp | Precipitation | WilliamsburgBridge |
|---|---|---|---|
| Min. :69.10 | Min. :63.00 | Min. :0.0000 | Min. :2974 |
| 1st Qu.:78.55 | 1st Qu.:68.00 | 1st Qu.:0.0000 | 1st Qu.:4730 |
| Median :84.00 | Median :71.10 | Median :0.0000 | Median :6116 |
| Mean :82.70 | Mean :71.10 | Mean :0.1352 | Mean :6074 |
| 3rd Qu.:87.10 | 3rd Qu.:74.45 | 3rd Qu.:0.0050 | 3rd Qu.:7305 |
| Max. :93.00 | Max. :78.10 | Max. :1.7800 | Max. :8605 |
| Total |
|---|
| Min. : 8210 |
| 1st Qu.:14618 |
| Median :18696 |
| Mean :18805 |
| 3rd Qu.:22979 |
| Max. :26969 |
After loading and developing a summary statistics table for the data we can see that the variables have the following order statistics:
After conducting a preliminary analysis of the data, We will now build a Poisson frequency regression model and ignore the population size of each city in the data. This model estimates the frequency of cyclists on a given day given certain weather conditions.
model.freq <- glm(WilliamsburgBridge ~ Day+HighTemp+LowTemp+Precipitation, family = poisson(link = "log"), data = assig.dta)
##
pois.count.coef = summary(model.freq)$coef
pander(pois.count.coef, caption = "Table 2: The Poisson regression model on the number of cyclists using the Williamsburg Bridge")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 8.561 | 0.04149 | 206.3 | 0 |
| DayMonday | 0.1023 | 0.009376 | 10.91 | 1.072e-27 |
| DaySaturday | -0.1675 | 0.00964 | -17.38 | 1.258e-67 |
| DaySunday | -0.2144 | 0.01004 | -21.36 | 2.839e-101 |
| DayThursday | 0.105 | 0.009785 | 10.73 | 7.575e-27 |
| DayTuesday | 0.08168 | 0.00984 | 8.3 | 1.039e-16 |
| DayWednesday | 0.2068 | 0.009542 | 21.67 | 3.548e-104 |
| HighTemp | 0.008349 | 0.0006955 | 12 | 3.35e-33 |
| LowTemp | -0.007239 | 0.001014 | -7.141 | 9.248e-13 |
| Precipitation | -0.374 | 0.00939 | -39.82 | 0 |
The above inferential table about the regression coefficients indicates that the model as well as all variables are statistically significant. This means, if we look at the number of cyclists on a given day given certain temperature and precipitation, there is statistical evidence to support the potential discrepancy across the adifferent days and weather types.
However, the other way to look at the model is the appropriateness model. The counts of cyclists are dependent on the total number of cyclists. Ignoring the population size implies the information in the sample was not effectively used. In the next subsection, we model the cyclist rates that involve the population size.
The other way to look at the model is goodness of the model. The cyclist counts are dependent on the population size. Ignoring the population size implies the information in the sample was not effectively used. In the next subsection, we model the cyclist rates that involve the population size.
The following model assesses the potential relationship between cyclist rates, weekdays, and weather conditions. This is the primary interest of the model.
model.rates <- glm(WilliamsburgBridge ~ Day+HighTemp+LowTemp+Precipitation, offset = log(Total),
family = poisson(link = "log"), data = assig.dta)
pander(summary(model.rates)$coef, caption = "Table 3: Poisson regression on rates of cyclists using the Williamsburg Bridge")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -1.229 | 0.0417 | -29.48 | 5.29e-191 |
| DayMonday | -0.02595 | 0.009431 | -2.752 | 0.005927 |
| DaySaturday | -0.001161 | 0.009696 | -0.1197 | 0.9047 |
| DaySunday | -0.06814 | 0.01017 | -6.701 | 2.073e-11 |
| DayThursday | -0.02796 | 0.009916 | -2.82 | 0.004807 |
| DayTuesday | -0.05931 | 0.01002 | -5.919 | 3.239e-09 |
| DayWednesday | -0.0182 | 0.009618 | -1.892 | 0.05849 |
| HighTemp | -0.005297 | 0.0007123 | -7.437 | 1.034e-13 |
| LowTemp | 0.007968 | 0.001008 | 7.908 | 2.61e-15 |
| Precipitation | 0.005687 | 0.008631 | 0.6589 | 0.51 |
The above table indicates that the log of cyclists is not identical across the weekdays and among the weather conditions. To be more specific, the log rates of Friday (baseline day) were higher than the other weekdays. The log rates of LowTemp were higher than the HighTemp. The regression coefficients represent the change of log rate between the associate weekdays and the reference weekday. The same interpretation applies to the change in log rate among the weather conditions.
The above two Poison models assume that there is no dispersion issue in the model. In this section, a quasi-Poisson rates model will be constructed so that we can understand the dispersion of the model.
##assign new variables
assig.dta$AvgTemp <- (assig.dta$HighTemp+assig.dta$LowTemp)/2 #average temperature
assig.dta$NewPrecip <- ifelse(assig.dta$Precipitation==0,0,
ifelse(assig.dta$Precipitation>0,1,NA)) #discretize Precipitation
quasimodel.rates <- glm(WilliamsburgBridge ~ Day + AvgTemp + NewPrecip, offset = log(Total),
family = quasipoisson, data = assig.dta)
pander(summary(quasimodel.rates)$coef, caption = "Quasi-Poisson regression on the rate of cyclists using the Williamsburg Bridge.")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -1.102 | 0.1423 | -7.741 | 1.015e-07 |
| DayMonday | -0.04638 | 0.03357 | -1.382 | 0.1809 |
| DaySaturday | -0.01098 | 0.03534 | -0.3107 | 0.759 |
| DaySunday | -0.08415 | 0.03565 | -2.36 | 0.02754 |
| DayThursday | -0.02134 | 0.03468 | -0.6152 | 0.5448 |
| DayTuesday | -0.05508 | 0.03581 | -1.538 | 0.1383 |
| DayWednesday | -0.02268 | 0.03473 | -0.653 | 0.5205 |
| AvgTemp | 9.334e-05 | 0.001798 | 0.05192 | 0.9591 |
| NewPrecip | 0.0001672 | 0.02373 | 0.007045 | 0.9944 |
The dispersion index is as follows:
ydif=assig.dta$WilliamsburgBridge-exp(quasimodel.rates$linear.predictors) # diff between y and yhat
prsd = ydif/sqrt(exp(quasimodel.rates$linear.predictors)) # Pearson residuals
phi = sum(prsd^2)/15 # Dispersion index: 24-9 = 15
pander(cbind(Dispersion = phi))
| Dispersion |
|---|
| 20.86 |
The dispersion index is 20.86 which means that the p-values of the model are not reliable since the dispersion is significantly different than 1. Thus, we will stay with the regular Poisson model.
The intercept represents the baseline log-cyclist rate (of baseline day Friday). The actual rate is exp(-1.229)≈29.26 percent. The coefficent of DayWednesday is -0.02268 which is the difference of the log-rates between baseline day Friday and the day of Wednesday, to be more specific, log(RWednesday)−log(RFriday)=-0.02268 which is equivalent to
\[ \log \left( \frac{R_{\text{Wednesday}}}{R_{\text{Friday}}} \right) = -0.02268 ~~~\Rightarrow~~~\frac{R_{\text{Wednesday}}}{R_{\text{Friday}}} = e^{-0.02268} \approx 0.9776. \]
This means, with all other variables remaining fixed, the cyclist rate on Wednesdays is about 2% lower than on Fridays.
The inferential tables of the Poisson regression models in the previous sections give numerical information about the potential discrepancy between various days of the week and weather conditions. But it is not intuitive. Next, we create a graphic that makes the hidden pattern visible.
The following calculation is based on the regression equation with coefficients given in the above table (table 3). Note that all variables in the model are indicator variables. Each of these indicator variables takes only two possible values: 0 and 1.
For example, exp(−1.229) gives the cyclist rate of the baseline day, Friday. Next, exp(−1.229-.005) gives the cyclist rate of baseline day, Friday, and the HighTemp. Following the same pattern, you can find the cyclist rate for each combination of the day and weather condition.
Monday <- c(exp(-1.229-.026), exp(-1.229-.026-.005), exp(-1.229-.026+.008), exp(-1.229-.026+.006))
Tuesday <- c(exp(-1.229-.059), exp(-1.229-.059-.005), exp(-1.229-.059+.008), exp(-1.229-.059+.006))
Wednesday <-c(exp(-1.229-.018), exp(-1.229-.018-.005), exp(-1.229-.018+.008), exp(-1.229-.018+.006))
Thursday <- c(exp(-1.229-.028), exp(-1.229-.028-.005), exp(-1.229-.028+.008), exp(-1.229-.028+.006))
Friday <- c(exp(-1.229), exp(-1.229+(-.005)), exp(-1.229+.008), exp(-1.229+.006))
Saturday <- c(exp(-1.229-.001), exp(-1.229-.001-.005), exp(-1.229-.001+.008), exp(-1.229-.001+.006))
Sunday <- c(exp(-1.229-.068), exp(-1.229-.068-.005), exp(-1.229-.068+.008), exp(-1.229-.068+.006))
day.mat <- matrix(data = c(Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday), ncol = 7, nrow = 4)
matplot(day.mat, type = "b", pch = 1, col = 1:7)
legend("bottomright", legend = c("M","T","W","Th","F","Sa","S"), col=1:7, pch=1, horiz = T, cex = .75)
The above graph shows that Friday and Saturday have the highest rates of cyclists on the Williamsburg Bridge, whereas, Sunday and Tuesday have the lowest rates.
In the above sections, we developed a narrative to understand cyclist behavior on the Willaimsburg Bridge. A preliminary analysis of the data was developed and three models, a Poisson regression frequency model, a Poisson regression rate model, and a quasi-Poisson rate model, were built to understand cyclist behavior on various weekdays and weather conditions. After eliminating the quasi-Poisson model as the dispersion was deemed to be too high, we settled with the Poisson rates model which took the form \[\text{log-rate} = -1.229 -0.026 \times \text{DayMonday} -0.0012 \times \text{DaySaturday} -0.0681 \times \text{DaySunday} \\ -0.027 \times \text{DayThursday} -0.0593 \times \text{DayTuesday} -0.0182 \times \text{DayWednesday} \\ -0.0053 \times \text{HighTemp} + 0.008 \times \text{LowTemp} +0.0057 \times \text{Precipitation}\]
Or equivalently, we can write the rate model as
\[ rate =\exp(-1.229 -0.026 \times \text{DayMonday} -0.0012 \times \text{DaySaturday} -0.0681 \times \text{DaySunday} \\ -0.027 \times \text{DayThursday} -0.0593 \times \text{DayTuesday} -0.0182 \times \text{DayWednesday} \\ -0.0053 \times \text{HighTemp} + 0.008 \times \text{LowTemp} +0.0057 \times \text{Precipitation}) \] Finally, a graphical representation of the data was developed for a clearer understanding of the rates model. It was deduced that Friday and Saturday had the highest rates of cyclists and Sunday and Tuesday had the lowest rates.