EDA and Feature
Engineering
First, we are going to download the data. The variable “Date” will be
dropped because it is the observation ID, Since it is a small data set,
we can look at the data and conclude there are no missing values. We are
also going to remove to commas from the variables “Total” and
“QueensboroBridge” so R Studio classifies them as numeric.
cycle <- read.csv("https://raw.githubusercontent.com/AvaDeSt/STA-321/refs/heads/main/Assignment%205%20data(Sheet1).csv", header = TRUE)[,-1]
cycle$Total <- as.numeric(gsub(",", "", cycle$Total))
cycle$QueensboroBridge <- as.numeric(gsub(",", "", cycle$QueensboroBridge))
data(cycle)
kable(head(cycle), caption = "First few records in the data set")
First few records in the data set
Saturday |
84.9 |
72.0 |
0.23 |
3216 |
11867 |
Sunday |
87.1 |
73.0 |
0.00 |
3579 |
13995 |
Monday |
87.1 |
71.1 |
0.45 |
4230 |
16067 |
Tuesday |
82.9 |
70.0 |
0.00 |
3861 |
13925 |
Wednesday |
84.9 |
71.1 |
0.00 |
5862 |
23110 |
Thursday |
75.0 |
71.1 |
0.00 |
5251 |
21861 |
Now that we have the data, we are going to create a new data set
called “cycle.new” that contains the variable AvgTemp. This variable
will take the average temperature in degrees farenheit of the variables
HighTemp and LowTemp.
cycle.new<- cycle %>%
mutate(AvgTemp = (HighTemp + LowTemp) / 2)
Next We are going to discretize the variable Precipitation. This will
mark any observation with 0 precipitation as “0”, and any observations
with precipitation levels greater than 0 as “1”.
cycle.new$NewPrecip = ifelse(cycle.new$Precipitation > 0, 1, 0)
data(cycle.new)
kable(head(cycle.new), caption = "First few records in the data set")
First few records in the data set
Saturday |
84.9 |
72.0 |
0.23 |
3216 |
11867 |
78.45 |
1 |
Sunday |
87.1 |
73.0 |
0.00 |
3579 |
13995 |
80.05 |
0 |
Monday |
87.1 |
71.1 |
0.45 |
4230 |
16067 |
79.10 |
1 |
Tuesday |
82.9 |
70.0 |
0.00 |
3861 |
13925 |
76.45 |
0 |
Wednesday |
84.9 |
71.1 |
0.00 |
5862 |
23110 |
78.00 |
0 |
Thursday |
75.0 |
71.1 |
0.00 |
5251 |
21861 |
73.05 |
0 |
Model Building
For this study, a poisson regression and a quasi poisson model will
be used. The quasi poisson model is similar to a poisson model except
that it corrects for over dispersion. This occurs when the mean and
variance of the data is not equal.The poisson regression model has four
basic assumptions that are as follows:
The response variable is a count per unit of time or space. ( In
our case it is the count of cyclists per day).
The observations are independent of one another.
The mean of the poisson random variable is equal to the variance.
(for a quasi poisson model, this does not have to be the case)
The log of the mean rate, log(λ), is a linear function of
x
Poisson Model
First, we will build a standard poisson model using our new variables
“AvgTemp” and “NewPrecip” as well as the original one “Day”. We will
also be extracting the pearson and deviance dispersion.
Note that ϕ = variance / mean. So a dispersion equal to one means
that the variance is equal to the mean. Anything not equal to one
indicates dispersion.
pois.model = glm(QueensboroBridge ~ factor(Day) + AvgTemp + NewPrecip,
family = poisson(link="log"), data =cycle.new)
summary(pois.model)
##
## Call:
## glm(formula = QueensboroBridge ~ factor(Day) + AvgTemp + NewPrecip,
## family = poisson(link = "log"), data = cycle.new)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.9739275 0.0431838 184.651 <2e-16 ***
## factor(Day)Monday 0.1586579 0.0102399 15.494 <2e-16 ***
## factor(Day)Saturday -0.0989334 0.0107978 -9.162 <2e-16 ***
## factor(Day)Sunday -0.1283273 0.0108855 -11.789 <2e-16 ***
## factor(Day)Thursday 0.1686654 0.0106693 15.809 <2e-16 ***
## factor(Day)Tuesday 0.0934492 0.0109975 8.497 <2e-16 ***
## factor(Day)Wednesday 0.1894845 0.0107878 17.565 <2e-16 ***
## AvgTemp 0.0060353 0.0005483 11.007 <2e-16 ***
## NewPrecip -0.3189133 0.0072562 -43.950 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7891 on 30 degrees of freedom
## Residual deviance: 2466 on 22 degrees of freedom
## AIC: 2801.1
##
## Number of Fisher Scoring iterations: 4
## predicted y: yhat
yhat = pois.model$fitted.values
pearson.resid = (cycle.new$QueensboroBridge - yhat)/sqrt(yhat)
Pearson.disp = sum(pearson.resid^2)/pois.model$df.residual
##
Deviance.disp = (pois.model$deviance)/pois.model$df.residual
##
disp = cbind(Pearson.disp = Pearson.disp, Deviance.disp = Deviance.disp)
kable(disp, caption="Dispersion parameter", align = 'c')
Dispersion parameter
109.1148 |
112.0891 |
We can see from the summary table that all of the p values indicate
that each variable is highly significant. However, both the pearson and
deviance dispersion are way more than one, both exceeding one hundred.
This indicates that this poisson model seriously violates that
assumption of equal mean and variance. This suggests that the p values
from this model are not reliable in the first place. We can see that the
coefficient for the AvgTemp is about 0.006. This means that for every
one degree increase in the temperature high, the log of the expected
count of cyclists increases by 0.006. Since exp(0.006) = 1.006, for each
one-unit increase in the predictor variable, the expected count of the
outcome variable increases by about 10.06%, holding other variables
constant.
Poisson Regression on
Rates
This model and analysis is similar to the regular poisson model
except it will take into account the total number of cycles who cross
other major bridges in New York.
model.rates <- glm(QueensboroBridge ~ Day + AvgTemp + factor(NewPrecip), offset = log(Total),
family = poisson(link = "log"), data = cycle.new)
kable(summary(model.rates)$coef, caption = "Poisson regression on the rate of cyclists.")
Poisson regression on the rate of cyclists.
(Intercept) |
-1.2569023 |
0.0432727 |
-29.046054 |
0.0000000 |
DayMonday |
-0.0678714 |
0.0102390 |
-6.628717 |
0.0000000 |
DaySaturday |
-0.0357733 |
0.0107891 |
-3.315688 |
0.0009142 |
DaySunday |
-0.0817864 |
0.0108370 |
-7.546972 |
0.0000000 |
DayThursday |
-0.0350102 |
0.0105830 |
-3.308148 |
0.0009392 |
DayTuesday |
-0.0477855 |
0.0109113 |
-4.379469 |
0.0000119 |
DayWednesday |
-0.0433977 |
0.0106518 |
-4.074197 |
0.0000462 |
AvgTemp |
-0.0016134 |
0.0005475 |
-2.946729 |
0.0032115 |
factor(NewPrecip)1 |
0.0476215 |
0.0071899 |
6.623369 |
0.0000000 |
yhat = model.rates$fitted.values
pearson.resid = (cycle.new$QueensboroBridge - yhat)/sqrt(yhat)
Pearson.disp = sum(pearson.resid^2)/model.rates$df.residual
##
Deviance.disp = (model.rates$deviance)/model.rates$df.residual
##
disp = cbind(Pearson.disp = Pearson.disp, Deviance.disp = Deviance.disp)
kable(disp, caption="Dispersion parameter", align = 'c')
Dispersion parameter
15.23799 |
15.02153 |
We can see that similar to the first model, that all of the p values
indicate that each variable is highly significant. However, both the
pearson and deviance dispersion are also way more than one, but they are
also significantly less than the dispersion from the previous poisson
regression model. This indicates that our poisson model on rates still
violates that assumption of equal mean and variance. This suggests that
the p values from this model are not reliable in the first place.
The table also shows that the log of bikers crossing the bridge is
not the same across all days of the week. The log rates for the day
Friday are higher than the rest of the days of the week. The intercept
represents the log base cyclist rate for the baseline day Friday. The
rest of the coefficients are the difference of log rates between the
baseline day Friday and the rest of the days of the week. We can see
from the table that -0.067 is the coefficient for Monday so:
log(RMonday / RFriday) = -0.067 ⇒ RMonday / RFriday = e^−0.067 ≈
0.935
This means that the rate of bikers on Monday is about 6.5% lower on
Monday than on Friday.
Quasi Poisson
Model
Since the dispersion from the previous model was very high, we are
going to instead use a quasi poisson model. This model does not require
an equal mean and variance.
quasi.model = glm(QueensboroBridge ~ Day + AvgTemp + factor(NewPrecip),
family = quasipoisson, data = cycle.new)
SE.quasi.pois = summary(quasi.model)$coef
kable(SE.quasi.pois, caption = "Summary statistics of quasi-poisson regression model")
Summary statistics of quasi-poisson regression model
(Intercept) |
7.9739275 |
0.4510894 |
17.6770432 |
0.0000000 |
DayMonday |
0.1586579 |
0.1069643 |
1.4832787 |
0.1521859 |
DaySaturday |
-0.0989334 |
0.1127917 |
-0.8771338 |
0.3898920 |
DaySunday |
-0.1283273 |
0.1137081 |
-1.1285685 |
0.2712370 |
DayThursday |
0.1686654 |
0.1114490 |
1.5133867 |
0.1444160 |
DayTuesday |
0.0934492 |
0.1148775 |
0.8134680 |
0.4246668 |
DayWednesday |
0.1894845 |
0.1126871 |
1.6815099 |
0.1068069 |
AvgTemp |
0.0060353 |
0.0057277 |
1.0537146 |
0.3034495 |
factor(NewPrecip)1 |
-0.3189133 |
0.0757971 |
-4.2074578 |
0.0003635 |
We can see from the table above that the variables “Day” and
“AvgTemp” are not significant but the variable “NewPrcip” is. This could
be a concern. But first, we are going to look at the dispersion
parameter.
Next, we will calculate the dispersion index of our quasi poisson
model.
ydif=cycle.new$QueensboroBridge-exp(quasi.model$linear.predictors) # diff between y and yhat
prsd = ydif/sqrt(exp(quasi.model$linear.predictors)) # Pearson residuals
phi = sum(prsd^2)/15 # Dispersion index: 24-9 = 15
pander(cbind(Dispersion = phi))
The dispersion index for this model is very high. But given what we
know of the dispersion from the origional poisson model this is to be
expected.
Quasi Poisson model
on Rates
Similar to the last section, we are going to build another quasi
poisson regression model but this time, based on the rates of the total
number of cyclists crossing several major bridges in New York. We will
also calculate the dispersion for this model.
quasimodel.rates <- glm(QueensboroBridge ~ Day + AvgTemp + factor(NewPrecip), offset = log(Total),
family = quasipoisson, data = cycle.new)
pander(summary(quasimodel.rates)$coef, caption = "Quasi-Poisson regression on the Number of Cyclists")
Quasi-Poisson regression on the Number of Cyclists
(Intercept) |
-1.257 |
0.1689 |
-7.441 |
1.923e-07 |
DayMonday |
-0.06787 |
0.03997 |
-1.698 |
0.1036 |
DaySaturday |
-0.03577 |
0.04212 |
-0.8494 |
0.4048 |
DaySunday |
-0.08179 |
0.0423 |
-1.933 |
0.06617 |
DayThursday |
-0.03501 |
0.04131 |
-0.8475 |
0.4059 |
DayTuesday |
-0.04779 |
0.04259 |
-1.122 |
0.274 |
DayWednesday |
-0.0434 |
0.04158 |
-1.044 |
0.308 |
AvgTemp |
-0.001613 |
0.002137 |
-0.7549 |
0.4583 |
factor(NewPrecip)1 |
0.04762 |
0.02807 |
1.697 |
0.1039 |
ydif=cycle.new$QueensboroBridge-exp(quasimodel.rates$linear.predictors) # diff between y and yhat
prsd = ydif/sqrt(exp(quasimodel.rates$linear.predictors)) # Pearson residuals
phi = sum(prsd^2)/15 # Dispersion index: 24-9 = 15
pander(cbind(Dispersion = phi))
From the table above, we can see that none of the p values are
significant. The dispersion index is still very high but not nearly as
high as the one from the regular quasi poisson model.
Final Model
The final model chosen for this analysis will be the quasi poisson
model on rates. Despite the quasi models having less significant p
values, they do not require equal mean and variance. We have seen from
calculating the dispersion of the poisson models that the means and
variances for both the poisson and poisson rates models were not at all
equal with very high dispersion. The rate quasi poisson model on rates
will be used because despite still having a high dispersion index, it
was much closer to one than on the regular The final model can be
written as:
rate = exp(-1.257 -0.0818 * DaySunday - 0.0679 * DayMonday - 0.0478 *
DayTuesday - 0.0434 * DayWednesday - 0.035 * DayThursday - 0.0358 *
DaySaturday) x (0.0016 * AvgTemp) x (0.0476 * NewPrecip)
Some Visual
Comparisons
Now, we will create a visual representation of how the explanatory
variables affect the actual number of cyclists on the Queensboro Bridge.
Each line on the graph will represent a day of the week. We will
visualize what effect day and average temperature have on them.
Sunday = c(exp(-1.257+7), exp(-1.257-0.0818-0.0016),
exp(-1.257-0.0818+0.0476))
Monday = c(exp(-1.257+0.001815), exp(-1.257+0.0679-0.0016),
exp(-1.257+0.0679+0.0476))
Tuesday = c(exp(-1.257+0.02529), exp(-1.257-0.0478-0.0016),
exp(-1.257-0.0478+0.0476))
Wednesday= c(exp(-1.257+0.01835), exp(-1.257-0.0434-0.0016),
exp(-1.257-0.0434+0.0476))
Thursday = c(exp(-1.257+0.0237), exp(-1.257-0.035-0.0016),
exp(-1.257-0.035+0.0476))
Friday = c(exp(-1.257), exp(-1.257-0.0016),
exp(-1.257+0.0476))
Saturday = c(exp(-1.257+0.03454), exp(-1.257-0.0358-0.0016),
exp(-1.257-0.0358+0.0476))
minmax = range(c(Monday,Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday))
plot(1:3, Friday, type="l", lty =1, col="red", xlab="",
ylab="Bike Rate", xlim=c(0,6), ylim=c(0.2, 0.35), axes=FALSE )
axis(2)
axis(1, labels=c("AvgTemp", "NewPrecip"), at = 1:2)
points(1:3,Friday, pch=19, col="red")
##
lines(1:3, Saturday, lty =2, col="green")
points(1:3, Saturday, pch=20, col="green")
##
lines(1:3, Sunday, lty =3, col="purple")
points(1:3, Sunday, pch=21, col="purple")
###
lines(1:3, Monday, lty =4, col="black")
points(1:3, Monday, pch=22, col="black")
##
lines(1:3, Tuesday, lty =5, col="blue")
points(1:3, Tuesday, pch=23, col="blue")
##
lines(1:3, Wednesday, lty =6, col="orange")
points(1:3, Wednesday, pch=24, col="orange")
##
lines(1:3, Thursday, lty =7, col="turquoise")
points(1:3, Thursday, pch=25, col="turquoise")
##
legend("topright", c("Friday","Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday" ),
pch=19:22, lty=1:7, bty="n",
col=c("red", "green", "purple", "black", "blue", "orange", "turquoise"))

From the graph we can see a few things:
---
title: 'Dispersed Poisson Regression on New York Cyclists'
author: 'Ava DeStefano'
date: "10-29-24"
output:
  html_document: 
    toc: yes
    toc_float: yes
    toc_depth: 4
    fig_width: 6
    fig_height: 4
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 5
    fig_height: 4
---

```{=html}
<style type="text/css">
h1.title {
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}
h4.author { /* Header 4 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 3 - and the author and data headers use this too  */
    font-size: 22px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}
h2 { /* Header 3 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}
</style>
```
```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("ISwR")) {
   install.packages("ISwR")
   library(ISwR)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("dplyr")) {
   install.packages("dplyr")
   library(dplyr)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
knitr::opts_chunk$set(echo = TRUE,      # include code chunk in the output file
                      warning = FALSE,  # sometimes, you code may produce warning messages,
                                        # you can choose to include the warning messages in
                                        # the output file. 
                      results = TRUE,   # you can also decide whether to include the output
                                        # in the output file.
                      message = FALSE,
                      fig.align='center', 
                      fig.pos = 'h'
                     
                      )  
```

# Introduction 

The data set for this study was collected from the Traffic Information Management System. It keeps track of the number of cyclists entering and leaving the Queensboro Bridge from the dates July 1st to July 31st. This data set includes a total of 31 observations and seven variables. The response variable is the total number of cyclist that pass through the  Queensboro bridge on each given day. The explanatory variables for this data set involve the specific conditions of each day, such as the weather.

## Variable Description

Here are what the seven variables in the data set represent:

* Date (x1) - note this represents the observation ID

* Day (x2) - The day of the week

* HighTemp (x3) - the temperature high for the day in degrees Fahrenheit

* LowTemp (x4) - the temperature high for the day in degrees Fahrenheit

* Precipitation (x5) - the total precipitation for the day in inches

* Queensboro Bridge (Y) - The number of cyclists on the Queensboro bridge.

* Total (x6) - the total number of cyclists who enter and leave the bridges in NYC each day

## Practical Question 

Do the conditions surrounding the day the cyclists are recorded affect the number of them enter and leave the QueensboroBridge?

# EDA and Feature Engineering 

First, we are going to download the data. The variable "Date" will be dropped because it is the observation ID, Since it is a small data set, we can look at the data and conclude there are no missing values. We are also going to remove to commas from the variables "Total" and "QueensboroBridge" so R Studio classifies them as numeric. 

```{r}
cycle <- read.csv("https://raw.githubusercontent.com/AvaDeSt/STA-321/refs/heads/main/Assignment%205%20data(Sheet1).csv", header = TRUE)[,-1]
          
cycle$Total <- as.numeric(gsub(",", "", cycle$Total))
cycle$QueensboroBridge <- as.numeric(gsub(",", "", cycle$QueensboroBridge))


data(cycle)
kable(head(cycle), caption = "First few records in the data set")

```



Now that we have the data, we are going to create a new data set called "cycle.new" that contains the variable AvgTemp. This variable will take the average temperature in degrees farenheit of the variables HighTemp and LowTemp.

```{r}
cycle.new<- cycle %>%
  mutate(AvgTemp = (HighTemp + LowTemp) / 2)

```

Next We are going to discretize the variable Precipitation. This will mark any observation with 0 precipitation as "0", and any observations with precipitation levels greater than 0 as "1". 

```{r}
cycle.new$NewPrecip = ifelse(cycle.new$Precipitation > 0, 1, 0)

data(cycle.new)
kable(head(cycle.new), caption = "First few records in the data set")

```

# Model Building

For this study, a poisson regression and a quasi poisson model will be used. The quasi poisson model is similar to a poisson model except that it corrects for over dispersion. This occurs when the mean and variance of the data is not equal.The poisson regression model has four basic assumptions that are as follows:

- The response variable is a count per unit of time or space. ( In our case it is the count of cyclists per day).

- The observations are independent of one another.

- The mean of the poisson random variable is equal to the variance. (for a quasi poisson model, this does not have to be the case)

- The log of the mean rate, log(λ), is a linear function of x

## Poisson Model

First, we will build a standard poisson model using our new variables "AvgTemp" and "NewPrecip" as well as the original one "Day".  We will also be extracting the pearson and deviance dispersion.

Note that ϕ = variance / mean. So a dispersion equal to one means that the variance is equal to the mean. Anything not equal to one indicates dispersion. 

```{r}
pois.model = glm(QueensboroBridge ~ factor(Day) + AvgTemp + NewPrecip, 
                 family = poisson(link="log"), data =cycle.new) 
summary(pois.model)

## predicted y: yhat
yhat = pois.model$fitted.values
pearson.resid = (cycle.new$QueensboroBridge - yhat)/sqrt(yhat)
Pearson.disp = sum(pearson.resid^2)/pois.model$df.residual
##
Deviance.disp = (pois.model$deviance)/pois.model$df.residual
##
disp = cbind(Pearson.disp = Pearson.disp, Deviance.disp = Deviance.disp)
kable(disp, caption="Dispersion parameter", align = 'c')
```
We can see from the summary table that all of the p values indicate that each variable is highly significant. However, both the pearson and deviance dispersion are way more than one, both exceeding one hundred. This indicates that this poisson model seriously violates that assumption of equal mean and variance. This suggests that the p values from this model are not reliable in the first place. We can see that the coefficient for the AvgTemp is about 0.006. This means that for every one degree increase in the temperature high, the log of the expected count of cyclists increases by 0.006. Since exp(0.006) = 1.006, for each one-unit increase in the predictor variable, the expected count of the outcome variable increases by about 10.06%, holding other variables constant.

## Poisson Regression on Rates

This model and analysis is similar to the regular poisson model except it will take into account the total number of cycles who cross other major bridges in New York.

```{r}
model.rates <- glm(QueensboroBridge ~ Day + AvgTemp + factor(NewPrecip), offset = log(Total), 
                   family = poisson(link = "log"), data = cycle.new)
kable(summary(model.rates)$coef, caption = "Poisson regression on the rate of cyclists.")



yhat = model.rates$fitted.values
pearson.resid = (cycle.new$QueensboroBridge - yhat)/sqrt(yhat)
Pearson.disp = sum(pearson.resid^2)/model.rates$df.residual
##
Deviance.disp = (model.rates$deviance)/model.rates$df.residual
##
disp = cbind(Pearson.disp = Pearson.disp, Deviance.disp = Deviance.disp)
kable(disp, caption="Dispersion parameter", align = 'c')
```
We can see that similar to the first model, that all of the p values indicate that each variable is highly significant. However, both the pearson and deviance dispersion are also way more than one, but they are also significantly less than the dispersion from the previous poisson regression model. This indicates that our poisson  model on rates still violates that assumption of equal mean and variance. This suggests that the p values from this model are not reliable in the first place. 

The table also shows that the log of bikers crossing the bridge is not the same across all days of the week. The log rates for the day Friday are higher than the rest of the days of the week. The intercept represents the log base cyclist rate for the baseline day Friday. The rest of the coefficients are the difference of log rates between the baseline day Friday and the rest of the days of the week. We can see from the table that -0.067 is the coefficient for Monday so:

log(RMonday / RFriday) = -0.067 ⇒   RMonday / RFriday = e^−0.067 ≈ 0.935

This means that the rate of bikers on Monday is about 6.5% lower on Monday than on Friday.

## Quasi Poisson Model

Since the dispersion from the previous model was very high, we are going to instead use a quasi poisson model. This model does not require an equal mean and variance. 

```{r}
quasi.model = glm(QueensboroBridge ~ Day + AvgTemp + factor(NewPrecip), 
                 family = quasipoisson, data = cycle.new)  

SE.quasi.pois = summary(quasi.model)$coef
kable(SE.quasi.pois, caption = "Summary statistics of quasi-poisson regression model")

```
We can see from the table above that the variables "Day" and "AvgTemp" are not significant but the variable "NewPrcip" is. This could be a concern. But first, we are going to look at the dispersion parameter.

Next, we will calculate the dispersion index of our quasi poisson model. 

```{r}
ydif=cycle.new$QueensboroBridge-exp(quasi.model$linear.predictors)  # diff between y and yhat
prsd = ydif/sqrt(exp(quasi.model$linear.predictors))   # Pearson residuals
phi = sum(prsd^2)/15          # Dispersion index: 24-9 = 15  
pander(cbind(Dispersion = phi))

```
The dispersion index for this model is very high. But given what we know of the dispersion from the origional poisson model this is to be expected. 

## Quasi Poisson model on Rates

Similar to the last section, we are going to build another quasi poisson regression model but this time, based on the rates of the total number of cyclists crossing several major bridges in New York. We will also calculate the dispersion for this model.

```{r}
quasimodel.rates <- glm(QueensboroBridge ~ Day + AvgTemp + factor(NewPrecip), offset = log(Total), 
                   family = quasipoisson, data = cycle.new)
pander(summary(quasimodel.rates)$coef, caption = "Quasi-Poisson regression on the Number of Cyclists")

ydif=cycle.new$QueensboroBridge-exp(quasimodel.rates$linear.predictors)  # diff between y and yhat
prsd = ydif/sqrt(exp(quasimodel.rates$linear.predictors))   # Pearson residuals
phi = sum(prsd^2)/15          # Dispersion index: 24-9 = 15  
pander(cbind(Dispersion = phi))
```
From the table above, we can see that none of the p values are significant. The dispersion index is still very high but not nearly as high as the one from the regular quasi poisson model. 

## Final Model

The final model chosen for this analysis will be the quasi poisson model on rates. Despite the quasi models having less significant p values, they do not require equal mean and variance. We have seen from calculating the dispersion of the poisson models that the means and variances for both the poisson and poisson rates models were not at all equal with very high dispersion. The rate quasi poisson model on rates will be used because despite still having a high dispersion index, it was much closer to one than on the regular  The final model can be written as:

rate = exp(-1.257 -0.0818 * DaySunday - 0.0679 * DayMonday - 0.0478 * DayTuesday - 0.0434 * DayWednesday - 0.035 * DayThursday - 0.0358 * DaySaturday) x (0.0016 * AvgTemp) x (0.0476 * NewPrecip)

## Some Visual Comparisons 

Now, we will create a visual representation of how the explanatory variables affect the actual number of cyclists on the Queensboro Bridge. Each line on the graph will represent a day of the week. We will visualize what effect day and average temperature have on them.  

```{r}
Sunday = c(exp(-1.257+7), exp(-1.257-0.0818-0.0016),   
               exp(-1.257-0.0818+0.0476))

Monday = c(exp(-1.257+0.001815), exp(-1.257+0.0679-0.0016),   
               exp(-1.257+0.0679+0.0476))


Tuesday = c(exp(-1.257+0.02529), exp(-1.257-0.0478-0.0016),   
               exp(-1.257-0.0478+0.0476))


Wednesday= c(exp(-1.257+0.01835), exp(-1.257-0.0434-0.0016),   
               exp(-1.257-0.0434+0.0476))


Thursday = c(exp(-1.257+0.0237), exp(-1.257-0.035-0.0016),   
               exp(-1.257-0.035+0.0476))


Friday = c(exp(-1.257), exp(-1.257-0.0016),   
               exp(-1.257+0.0476))


Saturday = c(exp(-1.257+0.03454), exp(-1.257-0.0358-0.0016),   
               exp(-1.257-0.0358+0.0476))

minmax = range(c(Monday,Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday))

```

```{r fig.align='center', fig.width=6, fig.height=6}
plot(1:3, Friday, type="l", lty =1, col="red", xlab="", 
               ylab="Bike Rate", xlim=c(0,6), ylim=c(0.2, 0.35), axes=FALSE )
axis(2)
axis(1, labels=c("AvgTemp", "NewPrecip"), at = 1:2)
points(1:3,Friday, pch=19, col="red")
##
lines(1:3, Saturday, lty =2, col="green")
points(1:3, Saturday, pch=20, col="green")
##
lines(1:3, Sunday, lty =3, col="purple")
points(1:3, Sunday, pch=21, col="purple")
###
lines(1:3, Monday, lty =4, col="black")
points(1:3, Monday, pch=22, col="black")
##
lines(1:3, Tuesday, lty =5, col="blue")
points(1:3, Tuesday, pch=23, col="blue")
##
lines(1:3, Wednesday, lty =6, col="orange")
points(1:3, Wednesday, pch=24, col="orange")
##
lines(1:3, Thursday, lty =7, col="turquoise")
points(1:3, Thursday, pch=25, col="turquoise")
##


legend("topright", c("Friday","Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday" ),
                  pch=19:22, lty=1:7,  bty="n", 
        col=c("red", "green", "purple", "black", "blue", "orange", "turquoise"))
```


From the graph we can see a few things:

- The most bikers cross the Queensboro Bridge on Monday on average.

- There is a positive relationship between average temperature and the number of cyclists.



# Summary and Conclusion

To summarize, we looked at a data set that looks at how many bikers cross over the Queensboro Bridge in NYC every day for the month of July, leaving us with 31 observations. The 4 explanatory variables look at the day of the week and the weather conditions on each day. We created two new variables in the data set and used them in all our models along with the variable 'Day". One was called "AvgTemp" and it took the average temperature of the high and low temperature variables. The other variable "NewPrecip" categorized the variable "Precipitation" as: if Precipitation = 0, then NewPrecip = 0; if Precipitation > 0, then NewPrecip = 1 Our goal was to see what the relationship was between these variables and the number of cyclists. We also wanted to see if taking into account the total number of cyclists that cross several major bridges in NYC. To figure this out, we built a Poisson frequency regression model, a Poisson Model on rates, and a quasi poisson model, and a quasi poisson regression model on rates. 

We found from calculating the dispersion of the poisson model that the mean and variance were not equal. Because of this, we decided this was not the best model. The final model chosen was the quasi poisson model on rates.Despite the quasi models having less significant p values, they do not require equal mean and variance. A recommendation for further research is to try to find how truly significant the explanatory variables. With each model we used, the p values changed, so they were not very meaningful. 




