Introduction
For this project, we will build Poisson regression models for daily
cyclist counts. The data set records the total number of cyclists on the
Williamsburg Bridge in Brooklyn, New York, on each day, in order to keep
track of how many cyclists enter and leave this cycling route. We will
look at factors that may affect the number of cyclists on each day, such
as the weather conditions on that day.
Data Description
The data set in this project records the total number of cyclists on
the Williamsburg Bridge on a given day, along with that day’s weather
conditions, such as temperature and precipitation. It also includes the
total number of cyclists on all four of the major New York bridges: the
Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the
Queensboro Bridge.
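First, let’s randomly select the data set which will be used for this
assignment.
# Randomly pick one of the ten assignment data sets (sheets "data1" to "data10").
# read.xlsx() comes from the openxlsx package.
id <- sample(1:10, 1)
dat <- read.xlsx("https://pengdsci.github.io/STA321/ww09/w09-AssignDataSet.xlsx", sheet = paste0("data", id))
# Save a local copy named after the sixth column (the bridge name).
write.csv(dat, paste0("C:\\Users\\josie\\Downloads\\", names(dat)[6], ".csv"))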
When running this code, the data set I received was for the
Williamsburg Bridge, so that is what we will use for this Poisson
regression modeling project. The data set has been uploaded to GitHub
and can now be read in directly from the GitHub repository.
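Because sample() draws at random, rerunning the selection code can pick a different sheet each time. A minimal sketch of one way to make the draw repeatable, assuming a fixed seed is acceptable for the assignment (the seed value 321 is a hypothetical choice, not part of the original analysis):

```r
# Hypothetical seed; any fixed integer makes the draw repeatable.
set.seed(321)
id_first <- sample(1:10, 1)

# Resetting the same seed reproduces the same draw.
set.seed(321)
id_second <- sample(1:10, 1)

identical(id_first, id_second)  # TRUE
```

Setting the seed once, before the call to sample(), would be enough in the actual chunk.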
We will read in the data set from GitHub and call it “cycling”.
cycling <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/WilliamsburgBridge.csv", header = TRUE)
str(cycling)
'data.frame': 31 obs. of 8 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ Date : int 42917 42918 42919 42920 42921 42922 42923 42924 42925 42926 ...
$ Day : int 42917 42918 42919 42920 42921 42922 42923 42924 42925 42926 ...
$ HighTemp : num 84.9 87.1 87.1 82.9 84.9 75 79 82.9 81 82.9 ...
$ LowTemp : num 72 73 71.1 70 71.1 71.1 68 70 69.1 71.1 ...
$ Precipitation : num 0.23 0 0.45 0 0 0 1.78 0 0 0 ...
$ WilliamsburgBridge: int 3845 4173 4924 3684 7308 7302 4421 5781 5782 8106 ...
$ Total : int 11867 13995 16067 13925 23110 21861 12805 17258 18320 24827 ...
We will use this cycling data set to create two Poisson regression
models, one for the frequency counts of cyclists on the Williamsburg
Bridge on a given observation, and another for the rates of cyclists
entering and leaving via the Williamsburg Bridge offset by the total
number of cyclists on all of the major New York bridges.
Variables
There are 8 total variables in the cycling data set. These variables
include:
X: The number of each observation. This is not a variable that is
useful for analysis, but rather is for listing each of the 31
observations in order, from observation 1 to observation 31. This
ordering was added when creating the .csv file, so it is not an
essential part of the dataset for our analysis.
Date: The date on which a given observation was collected, stored
as an integer day number (42917, 42918, and so on). This also serves as
the observation ID number.
Day: The day on which a given observation was collected. In this
data set its values are identical to those of Date.
HighTemp: The high temperature on the given day, given in degrees
Fahrenheit.
LowTemp: The low temperature on the given day, given in degrees
Fahrenheit.
Precipitation: The amount of rain which occurred on the given
day, given in inches.
WilliamsburgBridge: The total number of cyclists on the
Williamsburg Bridge on a given observation.
Total: The total number of cyclists on all bridges on a given
observation.
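The integer values in Date (42917, 42918, ...) look like Excel serial day numbers, which is plausible because the data came from an .xlsx workbook. This is an assumption, not something the data set documents. Under that assumption, a short sketch of converting them to calendar dates with base R:

```r
# Assumption: Date holds Excel serial day numbers (Windows Excel convention),
# for which the matching R origin is 1899-12-30.
serial_days <- c(42917, 42918, 42926)   # values shown in the str() output
calendar_dates <- as.Date(serial_days, origin = "1899-12-30")
format(calendar_dates)  # "2017-07-01" "2017-07-02" "2017-07-10"
```

If the assumption holds and the remaining days are consecutive, the 31 observations would cover July 2017. On the real data the same call would be as.Date(cycling$Date, origin = "1899-12-30").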
For the Poisson regression model for the frequency counts, the
Williamsburg Bridge variable will serve as the response variable. For
the Poisson regression model for the rates, the Williamsburg Bridge
variable will again serve as the response variable, and it will be
offset by the Total variable for this model.
Research Questions
The main goal for this project is to create two Poisson regression
models for cyclists entering and leaving Brooklyn, New York through the
Williamsburg Bridge: one for the frequency counts and one for the rates.
The focus will be on whether these models can successfully predict the
frequency counts and the rates of cyclists on the Williamsburg Bridge.
Some key questions for this project include:
Does the data set meet all of the necessary conditions required
for a Poisson regression model? If not, is there any potential
explanation for this discrepancy?
Can we create Poisson regression models whose predictors are
statistically significant for both the frequency counts and the rates
of cyclists on the Williamsburg Bridge on a given day?
We will build Poisson regression models for both the frequency counts
and the rates in order to see whether their predictors are in fact
statistically significant.
Exploratory Data Analysis
Let’s take a look at the first few entries within this cycling data
set for the Williamsburg Bridge.
kable(head(cycling), caption = "First Few Observations in the Data Set")
First Few Observations in the Data Set

| X | Date | Day | HighTemp | LowTemp | Precipitation | WilliamsburgBridge | Total |
|--:|------:|------:|---------:|--------:|--------------:|-------------------:|------:|
| 1 | 42917 | 42917 | 84.9 | 72.0 | 0.23 | 3845 | 11867 |
| 2 | 42918 | 42918 | 87.1 | 73.0 | 0.00 | 4173 | 13995 |
| 3 | 42919 | 42919 | 87.1 | 71.1 | 0.45 | 4924 | 16067 |
| 4 | 42920 | 42920 | 82.9 | 70.0 | 0.00 | 3684 | 13925 |
| 5 | 42921 | 42921 | 84.9 | 71.1 | 0.00 | 7308 | 23110 |
| 6 | 42922 | 42922 | 75.0 | 71.1 | 0.00 | 7302 | 21861 |

This data set includes various factors which may have an influence on
the number of individuals cycling, along with the date on which this
data was collected. Additionally, this data set includes variables for
both the number of cyclists on the Williamsburg Bridge on that given
day, along with the total number of cyclists on all bridges on that
given day.
One observation from looking at the data set is that the entries for
the Date and Day variables are identical for all 31 observations. Both
variables therefore carry the same information, so including both in our
models would be redundant. We will include only the Date variable in our
Poisson regression models.
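A claim like this can be checked in code rather than by eye. The sketch below uses only the first ten values of each column as printed in the str() output, so it illustrates the check rather than reproducing it on all 31 rows; the vector names are introduced here for illustration.

```r
# First ten values of Date and Day as shown in the str() output.
date_vals <- 42917:42926
day_vals  <- 42917:42926

# identical() is TRUE only if the two vectors match exactly.
identical(date_vals, day_vals)  # TRUE
```

On the full data set the equivalent call would be identical(cycling$Date, cycling$Day).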
Assumptions and Conditions
There are four assumptions which must be met in order to create a
Poisson regression model. These assumptions include:
The response variable is a count described by a Poisson
distribution.
Observations are independent of one another.
The mean of the Poisson random variable is equal to the variance
of said Poisson random variable.
The log of the mean rate, log (λ), must be a linear function of
x.
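Written out, the four conditions describe the following model. This is the standard Poisson regression formulation; the symbols $Y_i$, $x_{ij}$, and $\beta_j$ are notation introduced here for illustration.

$$
Y_i \mid x_i \sim \text{Poisson}(\lambda_i), \qquad
\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \qquad
E(Y_i \mid x_i) = \operatorname{Var}(Y_i \mid x_i) = \lambda_i
$$

Here $Y_i$ is the count for observation $i$, and the observations are taken to be independent.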
Before beginning the model building process, we will go through and
check whether each of these four necessary conditions is met by our
cycling data set.
Condition 1: The response variable is a count described by a Poisson distribution.
The response variable in this data set is the WilliamsburgBridge
variable, representing the total number of cyclists on the Williamsburg
Bridge on a given day. This variable is a non-negative whole-number
count, so it meets the count part of this assumption. Whether the count
is well described by a Poisson distribution is examined under
Condition 3.
Condition 2: Observations are independent of one another.
Each observation was collected on a different date, and we assume that
the conditions of one day did not affect the conditions of another day.
Under this assumption, the number of cyclists on the Williamsburg Bridge
for a given observation is independent of the number for any other
observation, so we treat the observations as independent of one another.
Because the observations are daily counts, this is an assumption rather
than something we can confirm from the data.
Condition 3: The mean of the Poisson random variable is equal to the variance of said Poisson random variable.
In order for a variable to be a Poisson random variable, its mean
must be equal to its variance. We previously stated that the
WilliamsburgBridge variable will be our response variable. Therefore, we
must check that this variable meets the criteria for a Poisson random
variable, having a mean which is equal to its variance.
# Finding the mean.
mean <- mean(cycling$WilliamsburgBridge)
print(mean)
[1] 6073.677
The mean of the WilliamsburgBridge variable is 6,073.677, so on
average about 6,074 cyclists were on the Williamsburg Bridge on a given
day. We round this value because the number of cyclists is a whole
number.
Next, let’s find the variance of our response variable.
# Finding the variance.
variance <- var(cycling$WilliamsburgBridge)
print(variance)
[1] 2482822
The variance of the WilliamsburgBridge variable is 2,482,822, which
is roughly 409 times the mean of 6,073.677. The variance is far larger
than the mean, so the response is overdispersed relative to a Poisson
random variable. This indicates a violation of one of the necessary
conditions for a Poisson regression model.
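The comparison above uses the overall mean and variance of the response. The Poisson assumption is usually stated for the mean and variance given the predictors, so a common complementary check is the Pearson dispersion statistic of a fitted model, which should be near 1 for Poisson data. The sketch below is illustrative only: it uses just the first ten rows printed by str() and a reduced set of predictors, so its numbers are not the results of the full 31-day analysis.

```r
# First ten rows as printed by str(cycling); not the full data set.
first10 <- data.frame(
  HighTemp = c(84.9, 87.1, 87.1, 82.9, 84.9, 75, 79, 82.9, 81, 82.9),
  LowTemp = c(72, 73, 71.1, 70, 71.1, 71.1, 68, 70, 69.1, 71.1),
  Precipitation = c(0.23, 0, 0.45, 0, 0, 0, 1.78, 0, 0, 0),
  WilliamsburgBridge = c(3845, 4173, 4924, 3684, 7308, 7302, 4421, 5781, 5782, 8106)
)

fit <- glm(WilliamsburgBridge ~ HighTemp + LowTemp + Precipitation,
           family = poisson(link = "log"), data = first10)

# Pearson dispersion: sum of squared Pearson residuals over residual df.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion
```

A value far above 1 points to overdispersion, which would be in line with the variance being much larger than the mean.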
Condition 4: The log of the mean rate, log (λ), must be a linear function of x.
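To check this condition, we will look at scatterplots of the response
variable against each of the predictor variables. Because the condition
concerns the log of the mean, plots of the raw counts give only a rough
visual check.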
First, let’s look at the predictor variable of date vs our response
variable of WilliamsburgBridge.
plot(cycling$Date, cycling$WilliamsburgBridge, main = "Date vs. Williamsburg Bridge", xlab = "Date", ylab = "WilliamsburgBridge")

The points in this scatterplot are scattered without a clear linear
pattern. This could suggest a possible violation of this condition,
since the WilliamsburgBridge counts do not appear to change linearly
with the Date predictor variable.
Next, let’s look at the predictor variable of high temperature vs our
response variable of WilliamsburgBridge.
plot(cycling$HighTemp, cycling$WilliamsburgBridge, main = "HighTemp vs. Williamsburg Bridge", xlab = "HighTemp", ylab = "WilliamsburgBridge")

The points in this scatterplot are again scattered and do not appear
to follow a distinctly linear pattern. This could suggest a possible
violation of this condition, since the WilliamsburgBridge counts do not
appear to change linearly with the HighTemp predictor variable.
Next, let’s look at the predictor variable of low temperature vs our
response variable of WilliamsburgBridge.
plot(cycling$LowTemp, cycling$WilliamsburgBridge, main = "LowTemp vs. Williamsburg Bridge", xlab = "LowTemp", ylab = "WilliamsburgBridge")

The points in this scatterplot are again scattered and do not appear
to follow a distinctly linear pattern. This could suggest a possible
violation of this condition, since the WilliamsburgBridge counts do not
appear to change linearly with the LowTemp predictor variable.
Lastly, let’s look at the predictor variable of precipitation vs our
response variable of WilliamsburgBridge.
plot(cycling$Precipitation, cycling$WilliamsburgBridge, main = "Precipitation vs. Williamsburg Bridge", xlab = "Precipitation", ylab = "WilliamsburgBridge")

The scatterplot of these two variables again does not appear to follow
a distinctly linear pattern. Most of the points are clustered at x = 0,
with some outliers to the right. This could suggest a possible violation
of this condition, since the WilliamsburgBridge counts do not appear to
change linearly with the Precipitation predictor variable.
Overall, it seems that we do have some violations of the conditions
of a Poisson regression model: the variance of the response variable is
much larger than its mean, and the response variable does not appear to
follow a linear function of the predictor variables in our model.
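Since the condition is stated for log(λ) rather than for the count itself, one rough alternative check is to plot the log of the observed counts against a predictor and add a smoother. A minimal sketch, using only the first ten rows shown in the str() output (so the picture is illustrative, not the full data):

```r
# First ten values of HighTemp and WilliamsburgBridge from the str() output.
high_temp <- c(84.9, 87.1, 87.1, 82.9, 84.9, 75, 79, 82.9, 81, 82.9)
wburg <- c(3845, 4173, 4924, 3684, 7308, 7302, 4421, 5781, 5782, 8106)

log_count <- log(wburg)

plot(high_temp, log_count,
     main = "HighTemp vs. log(Williamsburg Bridge)",
     xlab = "HighTemp", ylab = "log(WilliamsburgBridge)")
# lowess() adds a smooth trend line; a roughly straight line would be
# consistent with the linearity condition on the log scale.
lines(lowess(high_temp, log_count), col = "blue")
```

The same pattern could be repeated for Date, LowTemp, and Precipitation on the full data set.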
We will still continue with building the Poisson regression models,
but it is important to keep in mind that these violations may mean that
the Poisson regression model is not the best model choice for this data
set, since some of the necessary conditions have not been met.
Poisson Regression Model on Frequency Counts
We will begin by creating a Poisson regression model of the
frequency counts, that is, the number of cyclists on the Williamsburg
Bridge for a given observation. Our goal is a model that predicts this
count from the factors in this data set, with predictors that are
statistically significant.
# Poisson Regression Model of Counts
model.counts <- glm(WilliamsburgBridge ~ Date + HighTemp + LowTemp + Precipitation, family = poisson(link = "log"), data = cycling)
pois.count.coef = summary(model.counts)$coef
kable(pois.count.coef, caption = "Poisson Regression Model for the Counts of Cyclists \n on the Williamsburg Bridge")
Poisson Regression Model for the Counts of Cyclists on the
Williamsburg Bridge

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|:--|--:|--:|--:|--:|
| (Intercept) | -329.7412813 | 11.7142108 | -28.148826 | 0 |
| Date | 0.0078648 | 0.0002726 | 28.850990 | 0 |
| HighTemp | 0.0035901 | 0.0006334 | 5.667892 | 0 |
| LowTemp | 0.0075718 | 0.0009046 | 8.370178 | 0 |
| Precipitation | -0.3516535 | 0.0086431 | -40.685836 | 0 |

The regression equation for the Poisson regression model on the
frequency counts is given as:
log(μ) = -329.7413 + 0.0079 * Date + 0.0036 * HighTemp + 0.0076 *
LowTemp - 0.3517 * Precipitation
All four of the predictor variables, Date, HighTemp, LowTemp, and
Precipitation, have p-values of p < .001. This indicates that all of
the predictor variables in our model are statistically significant in
predicting the expected count of cyclists on the Williamsburg Bridge on
a given day.
The significance of these variables can likely be attributed to
weather conditions, such as excessive heat or cold or intense
precipitation and storms, making cycling less appealing on days with
poor conditions for outdoor activities. That all of these predictor
variables are statistically significant suggests that the number of
cyclists on the Williamsburg Bridge varies from day to day with changes
in temperature and precipitation.
Overall, the predictors in this Poisson model of the frequency counts
were statistically significant in predicting the expected log count of
cyclists on the Williamsburg Bridge for a given observation.
Regression Coefficients Interpretation
The Poisson regression model on frequency counts was found to have
the following regression equation:
log(μ) = -329.7413 + 0.0079 * Date + 0.0036 * HighTemp + 0.0076 *
LowTemp - 0.3517 * Precipitation
We will analyze the regression coefficients for the variables in
this Poisson regression model on frequency counts.
The value of the y-intercept is given as -329.7413. This
represents the baseline value of log(μ) when all predictor variables
are equal to 0. However, the y-intercept does not have a practical
interpretation in this scenario, since a Date value of 0 lies far
outside the observed data, so we are not interested in its meaning for
the Poisson regression model.
Date: The regression coefficient of the Date variable in this
model is 0.0079. This means that the log of the expected count
increases by 0.0079 units for every 1 day increase in the date on which
the observation was collected, holding all other variables
constant.
HighTemp: The regression coefficient of the HighTemp variable in
this model is 0.0036. This means that the log of the expected count
increases by 0.0036 units for every 1 degree Fahrenheit increase in the
high temperature for the given observation, holding all other variables
constant.
LowTemp: The regression coefficient of the LowTemp variable in
this model is 0.0076. This means that the log of the expected count
increases by 0.0076 units for every 1 degree Fahrenheit increase in the
low temperature for the given observation, holding all other variables
constant.
Precipitation: The regression coefficient of the Precipitation
variable in this model is -0.3517. This means that the log of the
expected count decreases by 0.3517 units for every 1 inch increase in
the amount of precipitation for the given observation, holding all
other variables constant.
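Because the model uses a log link, exponentiating a coefficient gives a multiplicative effect on the expected count, which is often easier to read than a change in log units. The sketch below simply exponentiates the estimates from the counts model table; the vector name count_coef is introduced here for illustration.

```r
# Estimates copied from the counts model table.
count_coef <- c(Date = 0.0078648, HighTemp = 0.0035901,
                LowTemp = 0.0075718, Precipitation = -0.3516535)

# exp(beta) is the factor by which the expected count is multiplied
# for a one-unit increase in the predictor, other predictors held fixed.
rate_ratio <- exp(count_coef)
percent_change <- 100 * (rate_ratio - 1)
round(percent_change, 2)
```

For example, exp(-0.3517) is about 0.70, so each additional inch of precipitation is associated with roughly a 30% lower expected count under this model.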
Poisson Regression Model on Rates
Now, we will create a Poisson regression model of the rates at which
cyclists enter and leave via the Williamsburg Bridge offset by the total
number of cyclists on all four of the major New York bridges. This
model, unlike the previous model which just focused on the frequency
counts of cyclists on the Williamsburg Bridge, will also account for the
total number of cyclists on all four of the major New York bridges, the
Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the
Queensboro Bridge. This Poisson model looks at the number of cyclists
on the Williamsburg Bridge for a given observation as a proportion of
the total number of cyclists on all four of these major bridges for
that specific observation.
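The role of the offset can be written out explicitly. With $\mu$ the expected Williamsburg Bridge count and $t$ the Total count (the same notation as the regression equation for this model), adding $\log(t)$ to the linear predictor with its coefficient fixed at 1 is the same as modeling the log of the rate $\mu/t$:

$$
\log(\mu) = \log(t) + \beta_0 + \beta_1\,\text{Date} + \beta_2\,\text{HighTemp} + \beta_3\,\text{LowTemp} + \beta_4\,\text{Precipitation}
\;\Longleftrightarrow\;
\log\!\left(\frac{\mu}{t}\right) = \beta_0 + \beta_1\,\text{Date} + \beta_2\,\text{HighTemp} + \beta_3\,\text{LowTemp} + \beta_4\,\text{Precipitation}
$$

The $\beta_j$ symbols are generic placeholders for the coefficients estimated by the model.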
We will build our Poisson regression model for the rates. This time,
we will still use the WilliamsburgBridge variable as our response
variable, but we will offset the model by the Total variable to make our
Poisson model for the rates of cyclists on the Williamsburg Bridge out
of the total number of cyclists on all four of the bridges.
# Poisson Model of Rates
model.rates <- glm(WilliamsburgBridge ~ Date + HighTemp + LowTemp + Precipitation, offset = log(Total),
family = poisson(link = "log"), data = cycling)
kable(summary(model.rates)$coef, caption = "Poisson Regression Model of the Rates of Cyclists \n on the Williamsburg Bridge out of all Four Bridges")
Poisson Regression Model of the Rates of Cyclists on the
Williamsburg Bridge out of all Four Bridges

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|:--|--:|--:|--:|--:|
| (Intercept) | -50.4101583 | 12.0801410 | -4.172978 | 3.01e-05 |
| Date | 0.0011422 | 0.0002811 | 4.063496 | 4.83e-05 |
| HighTemp | -0.0050794 | 0.0006460 | -7.862790 | 0.00e+00 |
| LowTemp | 0.0092517 | 0.0009198 | 10.057847 | 0.00e+00 |
| Precipitation | 0.0356817 | 0.0078863 | 4.524499 | 6.10e-06 |

The regression equation for the Poisson regression model on the rates
is given as:
log(μ/t) = -50.4102 + 0.0011 * Date - 0.0051 * HighTemp + 0.0093 *
LowTemp + 0.0357 * Precipitation
All four of the predictor variables in this Poisson model, Date,
HighTemp, LowTemp, and Precipitation, have p-values of p < .001.
This indicates that all of the predictor variables in our model are
statistically significant in predicting the expected rate of cyclists
on the Williamsburg Bridge on a given day, that is, the expected count
relative to the total number of cyclists on all four of the major New
York bridges.
This model for the rates therefore shows statistical significance for
all of its predictors, which suggests that it is useful for prediction
and estimation.
Regression Coefficients Interpretation
The value of the y-intercept is given as -50.4102. This represents
the baseline value of log(μ/t), the log of the expected rate, when all
predictor variables are equal to 0. However, the y-intercept does not
have a practical interpretation or meaning in this scenario so we are
not interested in its meaning for the Poisson regression model.
Date: The regression coefficient of the Date variable in this
model is 0.0011. This means that the log of the expected rate,
log(μ/t), increases by 0.0011 units for every 1 day increase in the
date on which the observation was collected, holding all other
variables constant.
HighTemp: The regression coefficient of the HighTemp variable in
this model is -0.0051. This means that the log of the expected rate
decreases by 0.0051 units for every 1 degree Fahrenheit increase in the
high temperature for the given observation, holding all other variables
constant.
LowTemp: The regression coefficient of the LowTemp variable in
this model is 0.0093. This means that the log of the expected rate
increases by 0.0093 units for every 1 degree Fahrenheit increase in the
low temperature for the given observation, holding all other variables
constant.
Precipitation: The regression coefficient of the Precipitation
variable in this model is 0.0357. This means that the log of the
expected rate increases by 0.0357 units for every 1 inch increase in
the amount of precipitation for the given observation, holding all
other variables constant.
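Since this model describes log(μ/t), exponentiating the linear predictor gives the expected share of all bridge cyclists who use the Williamsburg Bridge. As a rough worked example, the sketch below plugs the first observation from the data preview into the coefficients from the rates model table. The displayed coefficients are rounded, so the result is approximate, and the variable names are introduced here for illustration.

```r
# Estimates copied from the rates model table.
rate_coef <- c(Intercept = -50.4101583, Date = 0.0011422, HighTemp = -0.0050794,
               LowTemp = 0.0092517, Precipitation = 0.0356817)

# First observation from the data preview (the leading 1 multiplies the intercept).
first_obs <- c(1, 42917, 84.9, 72.0, 0.23)

log_rate <- sum(rate_coef * first_obs)
predicted_share <- exp(log_rate)   # expected WilliamsburgBridge / Total
observed_share <- 3845 / 11867

round(c(predicted = predicted_share, observed = observed_share), 3)
```

Both values come out at roughly 0.32, meaning that about a third of the cyclists counted on the four bridges that day used the Williamsburg Bridge.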
Summary and Comparisons of the Two Models
Both of the Poisson regression models we created, the model for the
frequency counts and the model for the rates, had statistically
significant predictors. In both models, we looked at the total number of
cyclists on the Williamsburg Bridge in New York for a specific
observation, along with the date of the observation and some factors
which may affect the number of cyclists out on that date: the high
temperature, the low temperature, and the amount of precipitation. All
of these factors were statistically significant in both Poisson
regression models, indicating that these weather related conditions are
significantly associated with both the counts and the rates of cyclists
on the Williamsburg Bridge for a given observation. This can be
attributed to certain weather conditions making it more or less ideal
for individuals to be cycling outdoors. For instance, a day with very
high temperatures, very cold temperatures, or severe storms with heavy
precipitation would be less ideal and would likely lead to fewer
cyclists being out than on a day with pleasant weather.
Overall, both of the Poisson regression models showed statistical
significance and good utility in their prediction. However, as was
previously stated, there were some violations of the conditions for a
Poisson regression model within our data set. First, the mean of the
response variable, WilliamsburgBridge, was not equal to its variance.
This suggests that the response variable is not Poisson distributed.
Additionally, all four predictor variables were checked, and the
response variable did not appear to be a linear function of any of
them. This indicates another violation. These violations suggest that
perhaps a Poisson model was not the best model choice for this data
set, and that it is important to be mindful of them when using either
of the Poisson regression models we created for prediction.
Conclusion
Overall, two Poisson regression models were created in this project.
Both of these models looked at the total number of cyclists on the
Williamsburg Bridge for a given day. The first model looked at the
counts of cyclists that were on the Williamsburg Bridge, and the second
model looked at the rates of the cyclists that were on the Williamsburg
Bridge offset by the total number of cyclists on all four of the major
New York bridges.
Both of these Poisson regression models showed statistical
significance in their predictions, with all of the predictor variables
in both models having p-values of p < .001. This indicates that Date,
HighTemp, LowTemp, and Precipitation are statistically significant in
predicting the frequency counts and the rates of cyclists on the
Williamsburg Bridge for a given observation.
However, our data set failed to meet some of the necessary
conditions for a Poisson regression model. The mean of the response
variable, WilliamsburgBridge, was not equal to its variance. This
suggests that the response variable is not Poisson distributed. Also,
all four predictor variables were checked, and the response variable
did not appear to be a linear function of any of them. These are two
violations of the assumptions for a Poisson regression model. They mean
that perhaps a Poisson regression model was not the best choice for
this data set, and they should be kept in mind when using either of
these models for prediction.
Recommendations
Some recommendations I would suggest for further projects
include:
Look further into the violations that were found within this data
set and into possible explanations for these violations of the
requirements of a Poisson regression model. Further consider whether
the Poisson regression model is in fact the best choice for this data
set and whether it is sufficient to use this model for prediction
despite these violations.
Consider other variables which may affect the number of cyclists
out on a given observation. Perhaps there are other factors which may
provide further significance for model building which may strengthen the
regression model.
Further expand the data set to ensure the accuracy of the
predictions and to further strengthen the Poisson regression
models.
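As one possible starting point for the first recommendation, a quasi-Poisson fit keeps the same mean model but estimates a dispersion parameter instead of fixing it at 1. This is a suggestion rather than part of the original analysis, and the sketch uses only the first ten rows printed by str() with a reduced predictor set, so its numbers are illustrative.

```r
# First ten rows as printed by str(cycling); not the full data set.
first10 <- data.frame(
  HighTemp = c(84.9, 87.1, 87.1, 82.9, 84.9, 75, 79, 82.9, 81, 82.9),
  LowTemp = c(72, 73, 71.1, 70, 71.1, 71.1, 68, 70, 69.1, 71.1),
  Precipitation = c(0.23, 0, 0.45, 0, 0, 0, 1.78, 0, 0, 0),
  WilliamsburgBridge = c(3845, 4173, 4924, 3684, 7308, 7302, 4421, 5781, 5782, 8106)
)

pois_fit <- glm(WilliamsburgBridge ~ HighTemp + LowTemp + Precipitation,
                family = poisson(link = "log"), data = first10)
quasi_fit <- glm(WilliamsburgBridge ~ HighTemp + LowTemp + Precipitation,
                 family = quasipoisson(link = "log"), data = first10)

# The coefficient estimates are the same in both fits.
all.equal(coef(pois_fit), coef(quasi_fit))

# Quasi-Poisson standard errors are the Poisson ones scaled by sqrt(dispersion).
est_dispersion <- summary(quasi_fit)$dispersion
se_ratio <- summary(quasi_fit)$coefficients[, "Std. Error"] /
  summary(pois_fit)$coefficients[, "Std. Error"]
```

When the estimated dispersion is well above 1, the wider quasi-Poisson standard errors give a more cautious picture of which predictors are statistically significant.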
---
title: "Poisson Regression of the Counts and Rates of Cyclists on the Williamsburg Bridge"
author: "Josie Gallop"
date: "2024-10-29"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 6
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
editor_options: 
  chunk_output_type: console
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```
```{r setup, include=FALSE}
# Detect, install, and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("leaflet")) {
   install.packages("leaflet")
   library(leaflet)
}
if (!require("EnvStats")) {
   install.packages("EnvStats")
   library(EnvStats)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("phytools")) {
   install.packages("phytools")
   library(phytools)
}
if(!require("dplyr")) {
   install.packages("dplyr")
   library(dplyr)
}
if(!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if (!require("boot")) {
   install.packages("boot")
   library(boot)
}
if(!require("pander")) {
   install.packages("pander")
   library(pander)
}
if(!require("mlbench")) {
   install.packages("mlbench")
   library(mlbench)
}
if(!require("psych")) {
   install.packages("psych")
   library(psych)
}
if(!require("broom.mixed")) {
   install.packages("broom.mixed")
   library(broom.mixed)
}
if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)
}
if (!require("openxlsx")) {
   install.packages("openxlsx")
   library(openxlsx)
}
knitr::opts_chunk$set(echo = TRUE,  
                   warning = FALSE,   
                   message = FALSE,  
                   results = TRUE,  
                   comment = NA   
                      )   
```


# Introduction

For this project, we will build Poisson regression models for daily cyclist counts. The data set records the total number of cyclists on the Williamsburg Bridge in Brooklyn, New York, on each day, in order to keep track of how many cyclists enter and leave this cycling route. We will look at factors that may affect the number of cyclists on each day, such as the weather conditions on that day.


## Data Description

The data set in this project records the total number of cyclists on the Williamsburg Bridge on a given day, along with that day's weather conditions, such as temperature and precipitation. It also includes the total number of cyclists on all four of the major New York bridges: the Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the Queensboro Bridge.

First, let's find the data set which will be used for this assignment. 

```{r}
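# Randomly pick one of the ten assignment data sets (sheets "data1" to "data10").
# read.xlsx() comes from the openxlsx package loaded in the setup chunk.
id <- sample(1:10, 1)
dat <- read.xlsx("https://pengdsci.github.io/STA321/ww09/w09-AssignDataSet.xlsx", sheet = paste0("data", id))
# Save a local copy named after the sixth column (the bridge name).
write.csv(dat, paste0("C:\\Users\\josie\\Downloads\\", names(dat)[6], ".csv"))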
```

When running this code, the data set I received was for the Williamsburg Bridge, so that is what we will use for this Poisson regression modeling project. The data set has been uploaded to GitHub and can now be read in directly from the GitHub repository.

We will read in the data set from GitHub and call it "cycling".

```{r}
cycling <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/WilliamsburgBridge.csv", header = TRUE)

str(cycling)
```

We will use this cycling data set to create two Poisson regression models, one for the frequency counts of cyclists on the Williamsburg Bridge on a given observation, and another for the rates of cyclists entering and leaving via the Williamsburg Bridge offset by the total number of cyclists on all of the major New York bridges. 


## Variables


There are 8 total variables in the cycling data set. These variables include:

* X: The row number of each observation, from 1 to 31. This column was added when the .csv file was created, so it is not useful for our analysis.

* Date: The date on which a given observation was collected, stored as an integer. It also serves as the observation ID.

* Day: This represents the day on which a given observation was collected. 

* HighTemp: The high temperature on the given day, given in degrees Fahrenheit. 

* LowTemp: The low temperature on the given day, given in degrees Fahrenheit.

* Precipitation: The amount of rain which occurred on the given day, given in inches. 

* WilliamsburgBridge: The total number of cyclists on the Williamsburg Bridge on the given day.

* Total: The total number of cyclists on all four bridges on the given day.
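
The integer Date values shown by `str(cycling)` (42917, 42918, ...) look like Excel serial date numbers. That is an assumption on my part, since the data source does not say so. Under that assumption, a minimal sketch of converting them to calendar dates uses the conventional Excel origin of 1899-12-30:

```r
# First few Date values as printed by str(cycling).
serial <- c(42917, 42918, 42919)

# Assumption: the values are Excel serial dates (origin 1899-12-30).
dates <- as.Date(serial, origin = "1899-12-30")
format(dates)
```

If the assumption holds, the 31 observations would correspond to consecutive days starting on 2017-07-01.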


For the Poisson regression model for the frequency counts, the Williamsburg Bridge variable will serve as the response variable. For the Poisson regression model for the rates, the Williamsburg Bridge variable will again serve as the response variable, and it will be offset by the Total variable for this model. 


## Research Questions

The main goal for this project is to create Poisson regression models for both the frequency counts and the rates of cyclists entering and leaving Brooklyn, New York through the Williamsburg Bridge, and to assess how well these two models predict those counts and rates.

Some key questions for this project include:

* Does the data set meet all of the necessary conditions required for a Poisson regression model? If not, is there any potential explanation for this discrepancy? 

* Can we create Poisson regression models whose predictors are statistically significant for both the frequency counts and the rates of cyclists on the Williamsburg Bridge on a given day?

We will build Poisson regression models for both the frequency counts and the rates in order to see whether they are in fact statistically significant in their predictive ability.




# Exploratory Data Analysis

Let's take a look at the first few entries within this cycling data set for the Williamsburg Bridge.

```{r}
kable(head(cycling), caption = "First Few Observations in the Data Set") 
```

This data set includes various factors which may influence the number of individuals cycling, along with the date on which each observation was collected. It also includes the number of cyclists on the Williamsburg Bridge on that day and the total number of cyclists on all bridges on that day.

While looking at the data set, I noticed that the entries for the Date and Day variables are identical for all 31 observations. Both variables therefore act as observation IDs, and it would be redundant to include both in our models. We will include only the Date variable in our Poisson regression models.
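
The redundancy can also be checked programmatically rather than by eye. The sketch below uses a small toy data frame; the column names mirror the cycling data, but the count values are hypothetical placeholders.

```r
# Toy stand-in for the cycling data (counts are made-up placeholders).
toy <- data.frame(
  X = 1:3,
  Date = c(42917L, 42918L, 42919L),
  Day = c(42917L, 42918L, 42919L),
  WilliamsburgBridge = c(10L, 20L, 30L)
)

# TRUE when the two columns match element by element.
same.columns <- identical(toy$Date, toy$Day)

# Keep every column except the row counter and the duplicate.
toy.clean <- toy[, setdiff(names(toy), c("X", "Day"))]
names(toy.clean)
```

With the real data, the same check would be `identical(cycling$Date, cycling$Day)`.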



## Assumptions and Conditions

There are four assumptions which must be met in order to create a Poisson regression model. These assumptions include:

1. The response variable is a count described by a Poisson distribution.

2. Observations are independent of one another.

3. The mean of the Poisson random variable is equal to the variance of said Poisson random variable.

4. The log of the mean rate, log(λ), must be a linear function of x.


We will check whether our cycling data set meets all four of these necessary conditions before beginning the model building process for our Poisson regression models.
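
For reference, the assumptions can be written compactly as follows. This is the standard textbook form rather than something taken from the data source; here $Y_i$ denotes the count for observation $i$, $\lambda_i$ its mean (written as μ in the regression equations later in this report), and $x_{i1}, \dots, x_{ik}$ the predictor values.

$$
P(Y_i = y) = \frac{e^{-\lambda_i}\,\lambda_i^{y}}{y!}, \qquad y = 0, 1, 2, \dots
$$

$$
E(Y_i) = \operatorname{Var}(Y_i) = \lambda_i
$$

$$
\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}
$$

The first line corresponds to Condition 1, the second to Condition 3, and the third to Condition 4. Condition 2 requires the $Y_i$ to be independent of one another.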


### Condition 1: The response variable is a count described by a Poisson distribution.

The response variable in this data set is the WilliamsburgBridge variable, representing the total number of cyclists on the Williamsburg Bridge on a given day. This variable is a count, so it fits the count part of this assumption. Whether the count behaves like a Poisson random variable is examined under Condition 3.


### Condition 2: Observations are independent of one another.

Each observation was collected on a different date, and we assume that the conditions of one day did not affect the conditions of another day. Under this assumption, the number of cyclists on the Williamsburg Bridge for one observation is independent of the number for any other observation, so we conclude that the observations are independent of one another.


### Condition 3: The mean of the Poisson random variable is equal to the variance of said Poisson random variable.

In order for a variable to be a Poisson random variable, its mean must be equal to its variance. We previously stated that the WilliamsburgBridge variable will be our response variable. Therefore, we must check that this variable meets the criteria for a Poisson random variable, having a mean which is equal to its variance.

```{r}
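# Sample mean of the daily Williamsburg Bridge counts.
# (Named wb.mean so it does not mask the base function mean().)
wb.mean <- mean(cycling$WilliamsburgBridge)
print(wb.mean)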
```

The mean of the WilliamsburgBridge variable is 6,073.677. In other words, about 6,074 cyclists use the Williamsburg Bridge on an average day in this data set. We round this value because the number of individuals is a whole number.

Next, let's find the variance of our response variable.

```{r}
# Sample variance of the daily Williamsburg Bridge counts.
# (Named wb.var to keep it distinct from the base function var().)
wb.var <- var(cycling$WilliamsburgBridge)
print(wb.var)
```

The variance of the WilliamsburgBridge variable is 2,482,822, which is roughly 400 times larger than the mean of 6,073.677. This violates one of the necessary conditions for a Poisson regression model: the response variable is overdispersed and so does not behave like a Poisson random variable, whose mean and variance would be equal.
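
The size of the mismatch can be summarized as a variance-to-mean ratio, which would be close to 1 for a Poisson random variable. A small sketch using the rounded values reported above (the variable names are illustrative):

```r
# Rounded values reported in the text above.
wb.mean <- 6073.677
wb.var <- 2482822

# A Poisson random variable has a variance-to-mean ratio of 1.
dispersion.ratio <- wb.var / wb.mean
round(dispersion.ratio, 1)
```

A ratio in the hundreds suggests the mismatch is far too large to be explained by rounding or sampling noise alone.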


### Condition 4: The log of the mean rate, log(λ), must be a linear function of x.

We will plot the response variable against each of the predictor variables to check this condition.

First, let's look at the predictor variable of date vs our response variable of WilliamsburgBridge. 

```{r}
plot(cycling$Date, cycling$WilliamsburgBridge, main = "Date vs. Williamsburg Bridge", xlab = "Date", ylab = "WilliamsburgBridge")
```

The scatterplot of these two variables shows scattered points with no clear linear pattern. This could suggest a violation of this condition, since the WilliamsburgBridge variable does not appear to be a linear function of the Date predictor variable.

Next, let's look at the predictor variable of high temperature vs our response variable of WilliamsburgBridge. 

```{r}
plot(cycling$HighTemp, cycling$WilliamsburgBridge, main = "HighTemp vs. Williamsburg Bridge", xlab = "HighTemp", ylab = "WilliamsburgBridge")
```

The scatterplot of these two variables again shows scattered points that do not follow a distinctly linear pattern. This could suggest a violation of this condition, since the WilliamsburgBridge variable does not appear to be a linear function of the HighTemp predictor variable.

Next, let's look at the predictor variable of low temperature vs our response variable of WilliamsburgBridge. 

```{r}
plot(cycling$LowTemp, cycling$WilliamsburgBridge, main = "LowTemp vs. Williamsburg Bridge", xlab = "LowTemp", ylab = "WilliamsburgBridge")
```

The scatterplot of these two variables again shows scattered points that do not follow a distinctly linear pattern. This could suggest a violation of this condition, since the WilliamsburgBridge variable does not appear to be a linear function of the LowTemp predictor variable.

Lastly, let's look at the predictor variable of precipitation vs our response variable of WilliamsburgBridge. 

```{r}
plot(cycling$Precipitation, cycling$WilliamsburgBridge, main = "Precipitation vs. Williamsburg Bridge", xlab = "Precipitation", ylab = "WilliamsburgBridge")
```

The scatterplot of these two variables again does not follow a distinctly linear pattern. Most of the points are clustered around x = 0, with some outliers to the right. This could suggest a violation of this condition, since the WilliamsburgBridge variable does not appear to be a linear function of the Precipitation predictor variable.


Overall, it seems that we do have some violations of the conditions of a Poisson regression model: the mean of the response variable is not equal to its variance, and the response variable does not appear to be a linear function of the predictor variables in our model.
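
Since Condition 4 concerns the log of the mean rather than the raw count, one complementary check is to plot the log of the response against a predictor and add a smoother as a rough visual guide. The sketch below uses a toy data frame with hypothetical values, not the cycling data, and the log of an observed count is only a rough stand-in for the log of its mean.

```r
# Hypothetical toy values, not taken from the cycling data.
toy <- data.frame(
  HighTemp = c(75, 79, 81, 83, 85, 87),
  Count = c(4200, 5100, 5600, 6000, 6400, 7100)
)

# Work on the log scale, matching the log link of the Poisson model.
toy$LogCount <- log(toy$Count)

plot(toy$HighTemp, toy$LogCount, main = "HighTemp vs. log(Count)",
     xlab = "HighTemp", ylab = "log(Count)")
# A lowess curve that is roughly straight supports linearity on the log scale.
lines(lowess(toy$HighTemp, toy$LogCount), lty = 2)
```

With the real data, `toy$HighTemp` and `toy$Count` would be replaced by `cycling$HighTemp` and `cycling$WilliamsburgBridge`.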

We will still continue with building the Poisson regression models, but it is important to keep in mind that these violations may mean that Poisson regression is not the best model choice for this data set, because some of the necessary conditions have not been met.



# Poisson Regression Model on Frequency Counts 

We will begin by creating a Poisson regression model of the frequency counts, that is, the number of individuals on the Williamsburg Bridge on a given day. Our goal is a Poisson regression model whose predictors are statistically significant in predicting this count, based upon the various factors in this data set.

```{r}
# Poisson Regression Model of Counts
model.counts <- glm(WilliamsburgBridge ~ Date + HighTemp + LowTemp + Precipitation, family = poisson(link = "log"), data = cycling)
pois.count.coef <- summary(model.counts)$coef
kable(pois.count.coef, caption = "Poisson Regression Model for the Counts of Cyclists \n on the Williamsburg Bridge")
```

The regression equation for the Poisson regression model on the frequency counts is given as:

log(μ) = -329.7413 + 0.0079 * Date + 0.0036 * HighTemp + 0.0076 * LowTemp - 0.3517 * Precipitation


All four of the predictor variables, Date, HighTemp, LowTemp, and Precipitation, have p-values of p < .001. This indicates that all of the predictor variables in our model are statistically significant in predicting the expected count of cyclists on the Williamsburg Bridge on a given day.

The significance of these variables can likely be attributed to adverse weather conditions, such as excessive heat or cold and intense precipitation or storms, making those days less than ideal for outdoor activities such as cycling. That these predictor variables are all statistically significant suggests that the number of cyclists on the Williamsburg Bridge differs from day to day with changes in temperature and precipitation.

Overall, this Poisson model of the frequency counts showed statistical significance in its prediction of the expected log counts of cyclists on the Williamsburg Bridge for a given observation.



## Regression Coefficients Interpretation


The Poisson regression model on frequency counts was found to have the following regression equation:

log(μ) = -329.7413 + 0.0079 * Date + 0.0036 * HighTemp + 0.0076 * LowTemp - 0.3517 * Precipitation

We will analyze the regression coefficients for the variables in this Poisson regression model on frequency counts.

* The value of the y-intercept is given as -329.7413. This represents the log of the expected count, log(μ), when all predictor variables are equal to 0. However, the y-intercept does not have a practical interpretation in this scenario, so we are not interested in its meaning for the Poisson regression model.

* Date: The regression coefficient of the Date variable in this model is 0.0079. This means that the log of the expected count increases by 0.0079 for every 1 day increase in the date on which the observation was collected, holding all other variables constant.

* HighTemp: The regression coefficient of the HighTemp variable in this model is 0.0036. This means that the log of the expected count increases by 0.0036 for every 1 degree Fahrenheit increase in the high temperature for the given observation, holding all other variables constant.

* LowTemp: The regression coefficient of the LowTemp variable in this model is 0.0076. This means that the log of the expected count increases by 0.0076 for every 1 degree Fahrenheit increase in the low temperature for the given observation, holding all other variables constant.

* Precipitation: The regression coefficient of the Precipitation variable in this model is -0.3517. This means that the log of the expected count decreases by 0.3517 for every 1 inch increase in the amount of precipitation for the given observation, holding all other variables constant.
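
Because the model uses a log link, exponentiating a coefficient gives a multiplicative effect on the expected count, which is often easier to read than a change on the log scale. A minimal sketch using the rounded coefficients from the regression equation above (the names `count.coefs`, `rate.ratios`, and `pct.change` are illustrative):

```r
# Rounded coefficients copied from the fitted equation above.
count.coefs <- c(Date = 0.0079, HighTemp = 0.0036,
                 LowTemp = 0.0076, Precipitation = -0.3517)

# exp(beta) is the factor by which the expected count is multiplied
# for a one-unit increase in the predictor, other predictors held fixed.
rate.ratios <- exp(count.coefs)

# The same effect expressed as a percent change in the expected count.
pct.change <- 100 * (rate.ratios - 1)
round(cbind(rate.ratios, pct.change), 4)
```

Under these rounded values, each additional inch of precipitation corresponds to an expected count that is about 30% lower, holding the other predictors fixed. With the fitted object available, `exp(coef(model.counts))` gives the same factors at full precision.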





# Poisson Regression Model on Rates

Now, we will create a Poisson regression model of the rates at which cyclists enter and leave via the Williamsburg Bridge, offset by the total number of cyclists on all four of the major New York bridges: the Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the Queensboro Bridge. Unlike the previous model, which focused only on the frequency counts, this model treats the number of cyclists on the Williamsburg Bridge for a given observation as a rate out of the total number of cyclists on all four bridges for that observation.

We will still use the WilliamsburgBridge variable as our response variable, but we will offset the model by the log of the Total variable so that it models the rate of cyclists on the Williamsburg Bridge out of the total number of cyclists on all four bridges.


```{r}
# Poisson Model of Rates
model.rates <- glm(WilliamsburgBridge ~ Date + HighTemp + LowTemp + Precipitation, offset = log(Total), 
                   family = poisson(link = "log"), data = cycling)
kable(summary(model.rates)$coef, caption = "Poisson Regression Model of the Rates of Cyclists \n on the Williamsburg Bridge out of all Four Bridges")
```

The regression equation for the Poisson regression model on the rates is given as:

log(μ/t) = -50.4102 + 0.0011 * Date - 0.0051 * HighTemp + 0.0093 * LowTemp - 0.0357 * Precipitation
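
The left-hand side log(μ/t) follows from how the offset enters the model. As a short derivation, writing $t$ for the Total variable and $\mu$ for the expected Williamsburg Bridge count, the offset adds $\log(t)$ to the linear predictor with its coefficient fixed at 1:

$$
\log(\mu) = \log(t) + \beta_0 + \beta_1\,\text{Date} + \beta_2\,\text{HighTemp} + \beta_3\,\text{LowTemp} + \beta_4\,\text{Precipitation}
$$

Subtracting $\log(t)$ from both sides gives the rate form:

$$
\log\!\left(\frac{\mu}{t}\right) = \beta_0 + \beta_1\,\text{Date} + \beta_2\,\text{HighTemp} + \beta_3\,\text{LowTemp} + \beta_4\,\text{Precipitation}
$$

So the coefficients describe changes in the log of the expected share of all bridge cyclists who use the Williamsburg Bridge.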


All four of the predictor variables in this Poisson model, Date, HighTemp, LowTemp, and Precipitation, have p-values of p < .001. This indicates that all of the predictor variables in our model are statistically significant in predicting the expected count of cyclists on the Williamsburg Bridge on a given day, offset by the total number of cyclists on all four of the major New York bridges.

This model shows statistical significance in predicting the expected counts of cyclists on the Williamsburg Bridge by using the rates, which suggests that the model for the rates provides good utility for prediction and estimation.


## Regression Coefficients Interpretation


* The value of the y-intercept is given as -50.4102. This represents the log of the expected rate, log(μ/t), when all predictor variables are equal to 0, where t is the total number of cyclists on all four bridges. However, the y-intercept does not have a practical interpretation in this scenario, so we are not interested in its meaning for the Poisson regression model.

* Date: The regression coefficient of the Date variable in this model is 0.0011. This means that the log of the expected rate increases by 0.0011 for every 1 day increase in the date on which the observation was collected, holding all other variables constant.

* HighTemp: The regression coefficient of the HighTemp variable in this model is -0.0051. This means that the log of the expected rate decreases by 0.0051 for every 1 degree Fahrenheit increase in the high temperature for the given observation, holding all other variables constant.

* LowTemp: The regression coefficient of the LowTemp variable in this model is 0.0093. This means that the log of the expected rate increases by 0.0093 for every 1 degree Fahrenheit increase in the low temperature for the given observation, holding all other variables constant.

* Precipitation: The regression coefficient of the Precipitation variable in this model is -0.0357. This means that the log of the expected rate decreases by 0.0357 for every 1 inch increase in the amount of precipitation for the given observation, holding all other variables constant.
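
To illustrate what "rate" means in a model with an offset, the sketch below fits the same kind of model to simulated data. Everything in it (the data frame `toy`, the predictor `x`, the true coefficients -1.2 and 0.1, and the seed) is hypothetical and chosen only for demonstration.

```r
set.seed(1)
n <- 50

# Simulated exposure (Total) and one predictor (x); all values are made up.
toy <- data.frame(x = runif(n, 0, 2),
                  Total = sample(5000:20000, n, replace = TRUE))

# Counts generated so that the true rate is exp(-1.2 + 0.1 * x).
toy$y <- rpois(n, lambda = toy$Total * exp(-1.2 + 0.1 * toy$x))

fit <- glm(y ~ x, offset = log(Total),
           family = poisson(link = "log"), data = toy)

# Fitted counts include the offset, so dividing by Total gives the fitted rate,
# which matches exp(intercept + slope * x) computed from the coefficients alone.
rate.from.mu <- as.numeric(fitted(fit)) / toy$Total
rate.from.beta <- as.numeric(exp(cbind(1, toy$x) %*% coef(fit)))
all.equal(rate.from.mu, rate.from.beta)
```

The same relationship applies to `model.rates`: dividing its fitted counts by `cycling$Total` would give the fitted share of cyclists using the Williamsburg Bridge.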




# Summary and Comparisons of the Two Models

Both of the Poisson regression models we created, the model for the frequency counts and the model for the rates, showed statistical significance for prediction and good utility overall. In both models, we looked at the total number of cyclists on the Williamsburg Bridge in New York on a given day, together with the date of the observation and the weather factors that may affect how many cyclists are out on that date: the high temperature, the low temperature, and the amount of precipitation.

All of these factors were statistically significant in both models, indicating that these weather-related conditions have a statistically significant impact on both the counts and the rates of cyclists on the Williamsburg Bridge. This can be attributed to certain weather conditions making it more or less ideal to cycle outdoors. For instance, a day with very high temperatures, very cold temperatures, or severe storms with heavy precipitation would be less ideal and would likely lead to fewer cyclists than a day with pleasant weather.

However, as was previously stated, the data set violated some of the conditions for a Poisson regression model. First, the mean of the response variable, WilliamsburgBridge, was not equal to its variance, which suggests that the response variable is not Poisson distributed. Additionally, all four predictor variables were checked, and the response variable did not appear to be a linear function of any of them. These violations suggest that a Poisson model may not have been the best choice for this data set, and they should be kept in mind when using either of the Poisson regression models we created for prediction.




# Conclusion

Overall, two Poisson regression models were created in this project. Both models looked at the total number of cyclists on the Williamsburg Bridge on a given day. The first model looked at the counts of cyclists on the Williamsburg Bridge, and the second model looked at the rates of cyclists on the Williamsburg Bridge, offset by the total number of cyclists on all four of the major New York bridges.

Both Poisson regression models showed statistical significance in their predictions, with all of the predictor variables in both models having p-values of p < .001. This indicates that Date, HighTemp, LowTemp, and Precipitation are statistically significant in predicting the frequency counts and the rates of cyclists on the Williamsburg Bridge for a given observation.

However, our data set failed to meet some of the necessary conditions for a Poisson regression model. The mean of the response variable, WilliamsburgBridge, was not equal to its variance, which suggests that the response variable is not Poisson distributed. Also, all four predictor variables were checked, and the response variable did not appear to be a linear function of any of them. These two violations mean that a Poisson regression model may not have been the best choice for this data set, and they should be kept in mind when using either of these models for prediction.





## Recommendations

Some recommendations I would suggest for further projects include:

* Look further into the violations that were found within this data set and into possible explanations for them. Consider whether a Poisson regression model is in fact the best choice for this data set and whether it is sufficient for prediction despite these violations.

* Consider other variables which may affect the number of cyclists out on a given observation. Perhaps there are other factors which may provide further significance for model building which may strengthen the regression model.

* Expand the data set beyond the current 31 observations to improve the accuracy of the predictions and to further strengthen the Poisson regression models.
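
As one possible starting point for the first recommendation, a quasi-Poisson fit is a commonly used way to allow the variance to exceed the mean. It was not part of this analysis, and the sketch below runs on simulated overdispersed counts; the data frame `toy`, its `temp` predictor, the coefficients, and the seed are all hypothetical.

```r
set.seed(42)
n <- 200

# Simulated predictor and overdispersed counts (negative binomial draws).
toy <- data.frame(temp = runif(n, 60, 90))
mu <- exp(7 + 0.02 * toy$temp)
toy$count <- rnbinom(n, mu = mu, size = 5)

fit.pois <- glm(count ~ temp, family = poisson(link = "log"), data = toy)
fit.quasi <- glm(count ~ temp, family = quasipoisson(link = "log"), data = toy)

# Pearson chi-square divided by residual degrees of freedom estimates the
# dispersion; values far above 1 indicate overdispersion.
pearson.disp <- sum(residuals(fit.pois, type = "pearson")^2) / df.residual(fit.pois)
quasi.disp <- summary(fit.quasi)$dispersion
c(pearson = pearson.disp, quasi = quasi.disp)
```

The quasi-Poisson fit keeps the same coefficient estimates as the Poisson fit but scales the standard errors by the square root of the estimated dispersion, so p-values from an overdispersed Poisson model can look far more significant than they should.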




