Introduction
The data set for this study was collected from the Traffic
Information Management System. It records the number of cyclists
entering and leaving the Queensboro Bridge each day from July 1st to
July 31st. The data set contains 31 observations and seven variables.
The response variable is the total number of cyclists that pass over
the Queensboro Bridge on a given day. The explanatory variables
describe the conditions of each day, such as the day of the week and
the weather.
Variable Description
The seven variables in the data set are:
Date (x1) - the calendar date, which serves as the observation ID
Day (x2) - the day of the week
HighTemp (x3) - the high temperature for the day in degrees Fahrenheit
LowTemp (x4) - the low temperature for the day in degrees Fahrenheit
Precipitation (x5) - the total precipitation for the day in inches
QueensboroBridge (Y) - the number of cyclists on the Queensboro Bridge
Total (x6) - the total number of cyclists who enter and leave the bridges in NYC each day
Practical Question
Do the conditions on the day the cyclists are recorded affect the
number of cyclists who enter and leave the Queensboro Bridge?
Data Download and Cleaning
First, we load the data. Since it is a small data set, we can inspect
it directly and confirm that there are no missing values. We also
remove the commas from the variables “Total” and “QueensboroBridge” so
that R treats them as numeric.
library(knitr)  # provides kable()
cycle <- read.csv("https://raw.githubusercontent.com/AvaDeSt/STA-321/refs/heads/main/Assignment%205%20data(Sheet1).csv", header = TRUE)
# Strip the thousands separators so the counts are stored as numeric
cycle$Total <- as.numeric(gsub(",", "", cycle$Total))
cycle$QueensboroBridge <- as.numeric(gsub(",", "", cycle$QueensboroBridge))
kable(head(cycle), caption = "First few records in the data set")
First few records in the data set
| Date | Day | HighTemp | LowTemp | Precipitation | QueensboroBridge | Total |
| --- | --- | --- | --- | --- | --- | --- |
| 1-Jul | Saturday | 84.9 | 72.0 | 0.23 | 3216 | 11867 |
| 2-Jul | Sunday | 87.1 | 73.0 | 0.00 | 3579 | 13995 |
| 3-Jul | Monday | 87.1 | 71.1 | 0.45 | 4230 | 16067 |
| 4-Jul | Tuesday | 82.9 | 70.0 | 0.00 | 3861 | 13925 |
| 5-Jul | Wednesday | 84.9 | 71.1 | 0.00 | 5862 | 23110 |
| 6-Jul | Thursday | 75.0 | 71.1 | 0.00 | 5251 | 21861 |
Model Building
For this study, a Poisson regression model will be used. The Poisson
regression model has four basic assumptions:
The response variable is a count per unit of time or space (in our case, the count of cyclists per day).
The observations are independent of one another.
The mean of the Poisson random variable is equal to its variance.
The log of the mean rate, log(λ), is a linear function of x.
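The equal mean and variance assumption can be given a rough first look by comparing the sample mean and variance of the response. The sketch below uses only the six Queensboro Bridge counts from the data preview (the vector name `qb_counts` is illustrative). It looks at the marginal variance rather than the variance conditional on the predictors, so it is a hint, not a formal test.

```r
# First six QueensboroBridge counts from the data preview (1-Jul to 6-Jul)
qb_counts <- c(3216, 3579, 4230, 3861, 5862, 5251)

qb_mean <- mean(qb_counts)  # sample mean
qb_var  <- var(qb_counts)   # sample variance (n - 1 denominator)

# A ratio far above 1 hints that the variance exceeds the mean
ratio <- qb_var / qb_mean
print(c(mean = qb_mean, variance = qb_var, ratio = ratio))
```

Running the same comparison on the full `cycle$QueensboroBridge` column would be the natural next step.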
Poisson Regression on Queensboro Bridge Cyclists Only
Here we build a Poisson frequency regression model for our data set.
The variable “Date” is left out of this model since it is only an
observation ID. The variable “Total” is also left out because we assume
that the number of cyclists crossing the Queensboro Bridge is roughly
proportional to the total number of cyclists overall.
model.freq <- glm(QueensboroBridge ~ Day + HighTemp + LowTemp + Precipitation, family = poisson(link = "log"), data = cycle)
pois.count.coef = summary(model.freq)$coef
kable(pois.count.coef, caption = "The Poisson regression model for the counts of cyclist entering and leaving the Queensboro Bridge.")
The Poisson regression model for the counts of cyclists entering and leaving the Queensboro Bridge.
| Term | Estimate | Std. Error | z value | p-value |
| --- | --- | --- | --- | --- |
| (Intercept) | 8.5517234 | 0.0479068 | 178.507570 | 0e+00 |
| DayMonday | 0.0541753 | 0.0108546 | 4.991021 | 6e-07 |
| DaySaturday | -0.1953380 | 0.0111460 | -17.525400 | 0e+00 |
| DaySunday | -0.2226164 | 0.0115879 | -19.211108 | 0e+00 |
| DayThursday | 0.1157122 | 0.0113093 | 10.231628 | 0e+00 |
| DayTuesday | 0.0965569 | 0.0113512 | 8.506349 | 0e+00 |
| DayWednesday | 0.1836698 | 0.0110677 | 16.595046 | 0e+00 |
| HighTemp | 0.0158199 | 0.0008034 | 19.691515 | 0e+00 |
| LowTemp | -0.0197465 | 0.0011701 | -16.876092 | 0e+00 |
| Precipitation | -0.3221763 | 0.0105342 | -30.583731 | 0e+00 |
The table indicates that the day of the week, the daily high and low
temperatures, and the precipitation level are all highly significant.
This suggests that the weather and the day of the week are good
indicators of how many cyclists will cross the Queensboro Bridge on a
given day. However, statistical significance does not necessarily mean
the model is practically important: the sample covers only 31 days in a
single month and may not represent the entire population. Note also
that this model describes the Queensboro Bridge counts without using
the total number of cyclists on all the New York bridges. Because all
of these variables are highly significant, they are kept in the models
that follow. The coefficient for the high temperature is about 0.0158.
This means that for every one-degree increase in the high temperature,
the log of the expected count of cyclists increases by 0.0158. Since
exp(0.0158) ≈ 1.016, each one-degree increase in the high temperature
raises the expected count by about 1.6%, holding the other variables
constant.
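The same conversion from a log-scale coefficient to a percentage change applies to every numeric predictor. The following is a minimal sketch that copies the three weather coefficients from the table above into a vector (the names `coefs`, `rate_ratio`, and `pct_change` are illustrative and not part of the original analysis).

```r
# Coefficients copied from the Poisson count model table
coefs <- c(HighTemp = 0.0158199, LowTemp = -0.0197465, Precipitation = -0.3221763)

# exp(beta) is the multiplicative change in the expected count
# for a one-unit increase in the predictor, other variables held fixed
rate_ratio <- exp(coefs)
pct_change <- 100 * (rate_ratio - 1)

print(round(cbind(rate_ratio, pct_change), 3))
```

With the fitted model object available, `exp(coef(model.freq))` should give the same ratios for all terms at once.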
Poisson Regression on Rates with the Total Count
This model looks at the relationship between the rate of cyclists and
the day of the week, the temperature, and the precipitation. Here the
rate is the Queensboro Bridge count relative to the total number of
cyclists that cross all the bridges in New York, which enters the model
as an offset.
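The role of the offset can be written out explicitly. In the sketch below, λ_i denotes the expected Queensboro Bridge count on day i, Total_i the count across all bridges on that day, and x_i^T β the linear predictor; the subscript notation is introduced here for illustration only.

```latex
\begin{aligned}
\log(\lambda_i) &= \log(\text{Total}_i) + \mathbf{x}_i^{\top}\boldsymbol{\beta} \\
\Longrightarrow\quad \log\!\left(\frac{\lambda_i}{\text{Total}_i}\right) &= \mathbf{x}_i^{\top}\boldsymbol{\beta}
\end{aligned}
```

Because the offset enters with a fixed coefficient of 1, the regression coefficients describe changes in the Queensboro share of the total rather than changes in the raw count.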
model.rates <- glm(QueensboroBridge ~ Day + HighTemp +LowTemp + Precipitation, offset = log(Total),
family = poisson(link = "log"), data = cycle)
kable(summary(model.rates)$coef, caption = "Poisson regression on the rate of cyclists.")
Poisson regression on the rate of cyclists.
| Term | Estimate | Std. Error | z value | p-value |
| --- | --- | --- | --- | --- |
| (Intercept) | -1.2497959 | 0.0481655 | -25.947936 | 0.0000000 |
| DayMonday | -0.0673728 | 0.0109257 | -6.166450 | 0.0000000 |
| DaySaturday | -0.0264248 | 0.0112172 | -2.355741 | 0.0184858 |
| DaySunday | -0.0771257 | 0.0117421 | -6.568307 | 0.0000000 |
| DayThursday | -0.0166461 | 0.0114600 | -1.452543 | 0.1463505 |
| DayTuesday | -0.0410056 | 0.0115661 | -3.545335 | 0.0003921 |
| DayWednesday | -0.0420583 | 0.0111611 | -3.768289 | 0.0001644 |
| HighTemp | 0.0020227 | 0.0008231 | 2.457506 | 0.0139906 |
| LowTemp | -0.0042179 | 0.0011624 | -3.628492 | 0.0002851 |
| Precipitation | 0.0451435 | 0.0097403 | 4.634702 | 0.0000036 |
The table shows that the log rate of cyclists crossing the bridge is
not the same across all days of the week. Every day coefficient is
negative, so the estimated log rate for Friday is higher than for any
other day. The intercept represents the log rate for the baseline day,
Friday, when the other predictors are zero. The remaining day
coefficients are the differences in log rate between each day and
Friday. For example, the coefficient for Monday is -0.067, so:
$$\log\left(\frac{R_{\text{Monday}}}{R_{\text{Friday}}}\right) = -0.067 \;\Rightarrow\; \frac{R_{\text{Monday}}}{R_{\text{Friday}}} = e^{-0.067} \approx 0.935$$
This means that the rate of cyclists on Monday is about 6.5% lower than
on Friday.
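The Monday calculation extends to the other days in the same way. As a rough sketch, the day coefficients from the rates table can be copied into a vector and exponentiated (the names `day_coefs`, `rate_ratio`, and `pct_vs_friday` are illustrative).

```r
# Day coefficients copied from the Poisson rates model table (baseline: Friday)
day_coefs <- c(Monday = -0.0673728, Saturday = -0.0264248, Sunday = -0.0771257,
               Thursday = -0.0166461, Tuesday = -0.0410056, Wednesday = -0.0420583)

# exp(coefficient) is the rate on that day divided by the rate on Friday
rate_ratio <- exp(day_coefs)

# Percentage difference from Friday; negative values mean a lower rate
pct_vs_friday <- 100 * (rate_ratio - 1)

print(round(pct_vs_friday, 1))
```

These are point estimates only; the standard errors in the table indicate how precisely each ratio is estimated.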
Next, we build a quasi-Poisson model. This is generally a better model
to use when the mean and the variance of the data are not the same.
model.rates <- glm(QueensboroBridge ~ Day + HighTemp + LowTemp + Precipitation, offset = log(Total),
family = quasipoisson, data = cycle)
summary(model.rates)
##
## Call:
## glm(formula = QueensboroBridge ~ Day + HighTemp + LowTemp + Precipitation,
## family = quasipoisson, data = cycle, offset = log(Total))
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -1.249796   0.195999  -6.377 2.54e-06 ***
## DayMonday     -0.067373   0.044460  -1.515    0.145
## DaySaturday   -0.026425   0.045646  -0.579    0.569
## DaySunday     -0.077126   0.047782  -1.614    0.121
## DayThursday   -0.016646   0.046634  -0.357    0.725
## DayTuesday    -0.041006   0.047066  -0.871    0.393
## DayWednesday  -0.042058   0.045418  -0.926    0.365
## HighTemp       0.002023   0.003349   0.604    0.552
## LowTemp       -0.004218   0.004730  -0.892    0.383
## Precipitation  0.045143   0.039636   1.139    0.268
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 16.559)
##
## Null deviance: 467.32 on 30 degrees of freedom
## Residual deviance: 343.68 on 21 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 3
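We can see that none of the variables remain significant in this
model. The estimated dispersion parameter is 16.559, far above the
value of 1 that a Poisson model assumes, so the variance is much larger
than the mean. The quasi-Poisson standard errors account for this
overdispersion, which suggests that the small p-values in the Poisson
rates model were overstated.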
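The quasi-Poisson fit keeps the same coefficient estimates as the Poisson rates model and rescales each standard error by the square root of the dispersion estimate. The sketch below reproduces that rescaling for three terms, using values copied from the two outputs above (the vector names are illustrative).

```r
# Poisson standard errors copied from the rates model table
se_poisson <- c(Intercept = 0.0481655, DayMonday = 0.0109257, HighTemp = 0.0008231)

# Dispersion estimate reported in the quasi-Poisson summary
dispersion <- 16.559

# Quasi-Poisson inflates each Poisson standard error by sqrt(dispersion)
se_quasi <- se_poisson * sqrt(dispersion)

print(round(se_quasi, 6))
```

The results can be compared against the Std. Error column of the quasi-Poisson summary, which explains why the test statistics shrink even though the estimates do not change.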
Final Model
Given the data that we have, the best model to use is the first
Poisson frequency regression model, which does not take into account
the total number of cyclists that cross every bridge. While this
variable can help predict the number of cyclists on the Queensboro
Bridge, it is not needed for a successful model. The Poisson frequency
regression model can be written on the log scale as
$$
\begin{aligned}
\log(\lambda) ={} & 8.552 - 0.223\,\text{DaySunday} + 0.054\,\text{DayMonday} + 0.097\,\text{DayTuesday} \\
& + 0.184\,\text{DayWednesday} + 0.116\,\text{DayThursday} - 0.195\,\text{DaySaturday} \\
& + 0.016\,\text{HighTemp} - 0.020\,\text{LowTemp} - 0.322\,\text{Precipitation}
\end{aligned}
$$
where λ is the expected number of cyclists on the Queensboro Bridge for
the day and Friday is the baseline day of the week.
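To see how the fitted equation turns into a predicted count, the sketch below plugs in the conditions recorded for 1-Jul in the data preview (Saturday, high of 84.9, low of 72.0, 0.23 inches of precipitation). The coefficients are copied from the count model table, and the variable names are illustrative.

```r
# Coefficients copied from the Poisson count model table (baseline day: Friday)
b <- c(Intercept = 8.5517234, DaySaturday = -0.1953380, HighTemp = 0.0158199,
       LowTemp = -0.0197465, Precipitation = -0.3221763)

# Conditions for 1-Jul: Saturday, 84.9 F high, 72.0 F low, 0.23 in of rain
log_mean <- b["Intercept"] + b["DaySaturday"] + b["HighTemp"] * 84.9 +
  b["LowTemp"] * 72.0 + b["Precipitation"] * 0.23

# The model is linear on the log scale, so exponentiate to get the expected count
expected_count <- unname(exp(log_mean))

print(round(expected_count))
```

The result can be set against the observed count of 3216 for that day. With the fitted object available, `predict(model.freq, type = "response")` should return the corresponding fitted values for every day.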
Summary and Conclusion
To summarize, we examined a data set recording how many cyclists cross
the Queensboro Bridge in NYC each day during the month of July, which
gives 31 observations. The four explanatory variables describe the day
of the week and the weather conditions on each day. Our goal was to see
how these variables relate to the number of cyclists. We also wanted to
see whether taking into account the total number of cyclists that cross
several major bridges in NYC improves the model. To do this, we built a
Poisson frequency regression model, a Poisson model on rates, and a
quasi-Poisson model. Based on our small sample, we found that the
Poisson frequency regression model performed the best. This would
indicate that the number of cyclists on the Queensboro Bridge is not
reliant on the total number of cyclists on the other bridges, although
including the total still gives a reasonable model.
