1 Introduction

The data set for this study was collected from the Traffic Information Management System. It records the number of cyclists entering and leaving the Queensboro Bridge each day from July 1st to July 31st. The data set includes a total of 31 observations and seven variables. The response variable is the total number of cyclists that pass over the Queensboro Bridge on each given day. The explanatory variables describe the specific conditions of each day, such as the weather.

1.1 Variable Description

The seven variables in the data set represent the following:

  • Date (x1) - note this represents the observation ID

  • Day (x2) - The day of the week

  • HighTemp (x3) - the temperature high for the day in degrees Fahrenheit

  • LowTemp (x4) - the temperature low for the day in degrees Fahrenheit

  • Precipitation (x5) - the total precipitation for the day in inches

  • Queensboro Bridge (Y) - The number of cyclists on the Queensboro bridge.

  • Total (x6) - the total number of cyclists who enter and leave the bridges in NYC each day

1.2 Practical Question

Do the conditions of the day on which the cyclists are recorded (the day of the week and the weather) affect the number of cyclists who enter and leave the Queensboro Bridge?

1.3 Data Download and Cleaning

First, we download the data. Since it is a small data set, we can inspect it directly and confirm that there are no missing values. We also remove the commas from the variables “Total” and “QueensboroBridge” so that R reads them as numeric.

cycle <- read.csv("https://raw.githubusercontent.com/AvaDeSt/STA-321/refs/heads/main/Assignment%205%20data(Sheet1).csv", header = TRUE)
          
cycle$Total <- as.numeric(gsub(",", "", cycle$Total))
cycle$QueensboroBridge <- as.numeric(gsub(",", "", cycle$QueensboroBridge))


kable(head(cycle), caption = "First few records in the data set")
First few records in the data set

Date    Day         HighTemp   LowTemp   Precipitation   QueensboroBridge   Total
1-Jul   Saturday    84.9       72.0      0.23            3216               11867
2-Jul   Sunday      87.1       73.0      0.00            3579               13995
3-Jul   Monday      87.1       71.1      0.45            4230               16067
4-Jul   Tuesday     82.9       70.0      0.00            3861               13925
5-Jul   Wednesday   84.9       71.1      0.00            5862               23110
6-Jul   Thursday    75.0       71.1      0.00            5251               21861

2 Model Building

For this study, a Poisson regression model will be used. The Poisson regression model has four basic assumptions:

  • The response variable is a count per unit of time or space (in our case, the count of cyclists per day).

  • The observations are independent of one another.

  • The mean of the Poisson random variable is equal to its variance.

  • The log of the mean rate, log(λ), is a linear function of the explanatory variables x.

2.1 Poisson Regression on Queensboro Bridge Cyclists Only

Here we build a Poisson frequency regression model for our data set. The variable “Date” was left out of this model since it is only an observation ID. The variable “Total” was left out because we assume that the number of cyclists crossing the Queensboro Bridge and the total number of cyclists overall are proportional to each other.

model.freq <- glm(QueensboroBridge ~ Day + HighTemp + LowTemp + Precipitation, family = poisson(link = "log"), data = cycle)



pois.count.coef = summary(model.freq)$coef
kable(pois.count.coef, caption = "The Poisson regression model for the counts of cyclist entering and leaving the Queensboro Bridge.")
The Poisson regression model for the counts of cyclist entering and leaving the Queensboro Bridge.

                Estimate     Std. Error   z value      Pr(>|z|)
(Intercept)     8.5517234    0.0479068    178.507570   0e+00
DayMonday       0.0541753    0.0108546    4.991021     6e-07
DaySaturday     -0.1953380   0.0111460    -17.525400   0e+00
DaySunday       -0.2226164   0.0115879    -19.211108   0e+00
DayThursday     0.1157122    0.0113093    10.231628    0e+00
DayTuesday      0.0965569    0.0113512    8.506349     0e+00
DayWednesday    0.1836698    0.0110677    16.595046    0e+00
HighTemp        0.0158199    0.0008034    19.691515    0e+00
LowTemp         -0.0197465   0.0011701    -16.876092   0e+00
Precipitation   -0.3221763   0.0105342    -30.583731   0e+00

The table indicates that the day of the week, the daily high and low temperatures, and the precipitation level are all highly significant (every p-value is below 0.001). This suggests that the weather and the day of the week are good indicators of how many cyclists will pass over the Queensboro Bridge on a given day. However, statistical significance alone does not mean the model is practically useful. For example, the sample is small (31 days from a single month) and may not represent the entire population. The model also describes the Queensboro Bridge counts without using the total number of cyclists on all the New York bridges. Because all of these variables are highly significant, they are kept in the following models.

The coefficient for the daily high temperature is about 0.0158. This means that for every one-degree increase in the high temperature, the log of the expected count of cyclists increases by 0.0158. Since exp(0.0158) ≈ 1.016, each additional degree Fahrenheit in the daily high is associated with an increase of about 1.6% in the expected count, holding the other variables constant.

2.2 Poisson Regression on Rates with the Total Count

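This model looks at the relationship between the rate of cyclists and the day of the week, the temperature, and the precipitation. Here the total number of cyclists that cross all the bridges in New York enters the model as an offset, log(Total), so the response is modeled as the Queensboro Bridge count relative to that total.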

model.rates <- glm(QueensboroBridge ~ Day + HighTemp +LowTemp + Precipitation, offset = log(Total), 
                   family = poisson(link = "log"), data = cycle)
kable(summary(model.rates)$coef, caption = "Poisson regression on the rate of cyclists.")
Poisson regression on the rate of cyclists.

                Estimate     Std. Error   z value      Pr(>|z|)
(Intercept)     -1.2497959   0.0481655    -25.947936   0.0000000
DayMonday       -0.0673728   0.0109257    -6.166450    0.0000000
DaySaturday     -0.0264248   0.0112172    -2.355741    0.0184858
DaySunday       -0.0771257   0.0117421    -6.568307    0.0000000
DayThursday     -0.0166461   0.0114600    -1.452543    0.1463505
DayTuesday      -0.0410056   0.0115661    -3.545335    0.0003921
DayWednesday    -0.0420583   0.0111611    -3.768289    0.0001644
HighTemp        0.0020227    0.0008231    2.457506     0.0139906
LowTemp         -0.0042179   0.0011624    -3.628492    0.0002851
Precipitation   0.0451435    0.0097403    4.634702     0.0000036

The table shows that the log rate of cyclists crossing the bridge is not the same across all days of the week. The intercept represents the baseline log rate for the reference day, Friday, and each day coefficient is the difference in log rate between that day and Friday. All of the day coefficients are negative, so the estimated rate is highest on Friday. For example, the coefficient for Monday is -0.067, so:

log(R_Monday / R_Friday) = -0.067 ⇒ R_Monday / R_Friday = exp(-0.067) ≈ 0.935

This means that the rate of cyclists on Monday is about 6.5% lower than on Friday.

Next, we build a quasi-Poisson model. This is generally a better model to use when the mean and the variance of the data are not the same.

model.rates <- glm(QueensboroBridge ~ Day + HighTemp + LowTemp + Precipitation, offset = log(Total), 
                   family = quasipoisson, data = cycle)
summary(model.rates)
## 
## Call:
## glm(formula = QueensboroBridge ~ Day + HighTemp + LowTemp + Precipitation, 
##     family = quasipoisson, data = cycle, offset = log(Total))
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.249796   0.195999  -6.377 2.54e-06 ***
## DayMonday     -0.067373   0.044460  -1.515    0.145    
## DaySaturday   -0.026425   0.045646  -0.579    0.569    
## DaySunday     -0.077126   0.047782  -1.614    0.121    
## DayThursday   -0.016646   0.046634  -0.357    0.725    
## DayTuesday    -0.041006   0.047066  -0.871    0.393    
## DayWednesday  -0.042058   0.045418  -0.926    0.365    
## HighTemp       0.002023   0.003349   0.604    0.552    
## LowTemp       -0.004218   0.004730  -0.892    0.383    
## Precipitation  0.045143   0.039636   1.139    0.268    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 16.559)
## 
##     Null deviance: 467.32  on 30  degrees of freedom
## Residual deviance: 343.68  on 21  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 3

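After the standard errors are scaled by the estimated dispersion parameter, none of the variables is significant any longer. The dispersion parameter is estimated at 16.559, far above the value of 1 that holds when the mean and the variance are equal. This indicates that the data are overdispersed, so the small p-values from the ordinary Poisson rate model overstate the evidence and should be interpreted with caution.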

2.3 Final Model

Given the data that we have, the model we choose is the first Poisson frequency regression model, which does not take into account the total number of cyclists that cross every bridge. While this variable can help predict the number of cyclists on the Queensboro Bridge, it is not needed for a successful model, and the response variable is not reliant on it. Because the model uses a log link, the fitted equation is written for the log of the expected count:

log(E[QueensboroBridge]) = 8.552 - 0.223 * DaySunday + 0.054 * DayMonday + 0.097 * DayTuesday + 0.184 * DayWednesday + 0.116 * DayThursday - 0.195 * DaySaturday + 0.016 * HighTemp - 0.020 * LowTemp - 0.322 * Precipitation

3 Summary and Conclusion

To summarize, we analyzed a data set recording how many cyclists crossed the Queensboro Bridge in NYC on each day of July, giving 31 observations. The four explanatory variables describe the day of the week and the weather conditions on each day. Our goal was to see how these variables relate to the number of cyclists. We also wanted to see whether taking into account the total number of cyclists that cross several major bridges in NYC changes the results. To investigate this, we built a Poisson frequency regression model, a Poisson model on rates, and a quasi-Poisson model. Based on our small sample, we selected the Poisson frequency regression model as the final model. This would indicate that the number of cyclists on the Queensboro Bridge can be modeled without the total number of cyclists on the other bridges, although including the total still gives a reasonable model.

---
title: 'Poisson Models For Cyclist Data'
author: 'Ava DeStefano'
date: "10-29-24"
output:
  html_document: 
    toc: yes
    toc_float: yes
    toc_depth: 4
    fig_width: 6
    fig_height: 4
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 5
    fig_height: 4
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: system-ui;
    color: navy;
    text-align: left;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```
```{r setup, include=FALSE}
# Detect, install and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
if (!require("mlbench")) {
   install.packages("mlbench")
   library(mlbench)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)   # load the package that was just installed
}
if (!require("psych")) {
   install.packages("psych")
   library(psych)       # load the package that was just installed
}
if (!require("ISwR")) {
   install.packages("ISwR")
   library(ISwR)
}

# specifications of outputs of code in code chunks
# (the knitr option names are singular: warning and message)
knitr::opts_chunk$set(echo = TRUE,      
                      warning = FALSE,   
                      message = FALSE,  
                      results = TRUE     
                      )   
```

# Introduction 

The data set for this study was collected from the Traffic Information Management System. It records the number of cyclists entering and leaving the Queensboro Bridge each day from July 1st to July 31st. The data set includes a total of 31 observations and seven variables. The response variable is the total number of cyclists that pass over the Queensboro Bridge on each given day. The explanatory variables describe the specific conditions of each day, such as the weather.

## Variable Description

The seven variables in the data set represent the following:

* Date (x1) - note this represents the observation ID

* Day (x2) - The day of the week

* HighTemp (x3) - the temperature high for the day in degrees Fahrenheit

* LowTemp (x4) - the temperature low for the day in degrees Fahrenheit

* Precipitation (x5) - the total precipitation for the day in inches

* Queensboro Bridge (Y) - The number of cyclists on the Queensboro bridge.

* Total (x6) - the total number of cyclists who enter and leave the bridges in NYC each day

## Practical Question 

Do the conditions of the day on which the cyclists are recorded (the day of the week and the weather) affect the number of cyclists who enter and leave the Queensboro Bridge?

## Data Download and Cleaning

First, we download the data. Since it is a small data set, we can inspect it directly and confirm that there are no missing values. We also remove the commas from the variables "Total" and "QueensboroBridge" so that R reads them as numeric.

```{r}
cycle <- read.csv("https://raw.githubusercontent.com/AvaDeSt/STA-321/refs/heads/main/Assignment%205%20data(Sheet1).csv", header = TRUE)
          
cycle$Total <- as.numeric(gsub(",", "", cycle$Total))
cycle$QueensboroBridge <- as.numeric(gsub(",", "", cycle$QueensboroBridge))


kable(head(cycle), caption = "First few records in the data set")

```
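The comma-removal step can be tried on a small stand-in data frame. The sketch below assumes the raw file stores the counts as text such as "3,216"; the values are taken from the first three rows of the data, and the names `raw` and `strip_commas` are illustrative rather than part of the original analysis.

```r
# Stand-in for the raw CSV: counts stored as text with thousands separators.
# (Assumed layout; values taken from the first three rows of the data.)
raw <- data.frame(
  Date = c("1-Jul", "2-Jul", "3-Jul"),
  QueensboroBridge = c("3,216", "3,579", "4,230"),
  Total = c("11,867", "13,995", "16,067"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop literal commas, then convert to numeric.
strip_commas <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

count_cols <- c("QueensboroBridge", "Total")
raw[count_cols] <- lapply(raw[count_cols], strip_commas)

# Check the column types and count the missing values in each column.
str(raw)
colSums(is.na(raw))
```

Counting the `NA` values after the conversion is a quick way to confirm the "no missing values" claim, since `as.numeric()` returns `NA` for any entry it cannot parse.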


# Model Building 

For this study, a Poisson regression model will be used. The Poisson regression model has four basic assumptions:

- The response variable is a count per unit of time or space (in our case, the count of cyclists per day).

- The observations are independent of one another.

- The mean of the Poisson random variable is equal to its variance.

- The log of the mean rate, log(λ), is a linear function of the explanatory variables x.


## Poisson Regression on Queensboro Bridge Cyclists Only

Here we build a Poisson frequency regression model for our data set. The variable "Date" was left out of this model since it is only an observation ID. The variable "Total" was left out because we assume that the number of cyclists crossing the Queensboro Bridge and the total number of cyclists overall are proportional to each other.
```{r}
model.freq <- glm(QueensboroBridge ~ Day + HighTemp + LowTemp + Precipitation, family = poisson(link = "log"), data = cycle)



pois.count.coef = summary(model.freq)$coef
kable(pois.count.coef, caption = "The Poisson regression model for the counts of cyclist entering and leaving the Queensboro Bridge.")

```
The table indicates that the day of the week, the daily high and low temperatures, and the precipitation level are all highly significant (every p-value is below 0.001). This suggests that the weather and the day of the week are good indicators of how many cyclists will pass over the Queensboro Bridge on a given day. However, statistical significance alone does not mean the model is practically useful. For example, the sample is small (31 days from a single month) and may not represent the entire population. The model also describes the Queensboro Bridge counts without using the total number of cyclists on all the New York bridges. Because all of these variables are highly significant, they are kept in the following models.

The coefficient for the daily high temperature is about 0.0158. This means that for every one-degree increase in the high temperature, the log of the expected count of cyclists increases by 0.0158. Since exp(0.0158) ≈ 1.016, each additional degree Fahrenheit in the daily high is associated with an increase of about 1.6% in the expected count, holding the other variables constant.
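Because the model uses a log link, each coefficient becomes a multiplicative effect on the expected count once it is exponentiated. The sketch below converts the weather coefficients of the count model into percentage changes. The estimates are copied from the fitted output of `model.freq`, and the object names are illustrative.

```r
# Estimates copied from the count model output (model.freq).
count_coefs <- c(
  HighTemp      =  0.0158199,
  LowTemp       = -0.0197465,
  Precipitation = -0.3221763
)

# exp(coefficient) is the multiplicative change in the expected count
# for a one-unit increase in the predictor, other variables held fixed.
rate_ratios <- exp(count_coefs)

# Express the same effect as a percentage change.
pct_change <- 100 * (rate_ratios - 1)
round(pct_change, 1)
```

For `HighTemp` this gives an increase of about 1.6% per degree Fahrenheit. With the fitted model available, `exp(coef(model.freq))` would give the same ratios for every term at once.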



## Poisson Regression on Rates with the Total Count

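This model looks at the relationship between the rate of cyclists and the day of the week, the temperature, and the precipitation. Here the total number of cyclists that cross all the bridges in New York enters the model as an offset, log(Total), so the response is modeled as the Queensboro Bridge count relative to that total.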

```{r}
model.rates <- glm(QueensboroBridge ~ Day + HighTemp +LowTemp + Precipitation, offset = log(Total), 
                   family = poisson(link = "log"), data = cycle)
kable(summary(model.rates)$coef, caption = "Poisson regression on the rate of cyclists.")
```

The table shows that the log rate of cyclists crossing the bridge is not the same across all days of the week. The intercept represents the baseline log rate for the reference day, Friday, and each day coefficient is the difference in log rate between that day and Friday. All of the day coefficients are negative, so the estimated rate is highest on Friday. For example, the coefficient for Monday is -0.067, so:

log(R_Monday / R_Friday) = -0.067 ⇒ R_Monday / R_Friday = exp(-0.067) ≈ 0.935

This means that the rate of cyclists on Monday is about 6.5% lower than on Friday.
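The Monday-to-Friday comparison can be reproduced, with a rough measure of uncertainty, from the estimate and standard error in the rate model output. This is a minimal sketch: it uses a normal-approximation (Wald) interval, the two numbers are copied from the Poisson rate model, and the object names are illustrative.

```r
# Monday coefficient and its standard error, copied from the Poisson
# rate model output.
est <- -0.0673728
se  <-  0.0109257

# Rate ratio of Monday relative to the reference day, Friday.
rate_ratio <- exp(est)

# Approximate 95% Wald interval, built on the log scale and then
# exponentiated.
z  <- qnorm(0.975)
ci <- exp(est + c(-1, 1) * z * se)

round(c(ratio = rate_ratio, lower = ci[1], upper = ci[2]), 3)
```

The interval relies on the Poisson standard error, so it would be wider under a model that allows for overdispersion.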

Next, we build a quasi-Poisson model. This is generally a better model to use when the mean and the variance of the data are not the same.

```{r}
model.rates <- glm(QueensboroBridge ~ Day + HighTemp + LowTemp + Precipitation, offset = log(Total), 
                   family = quasipoisson, data = cycle)
summary(model.rates)
```
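After the standard errors are scaled by the estimated dispersion parameter, none of the variables is significant any longer. The dispersion parameter is estimated at 16.559, far above the value of 1 that holds when the mean and the variance are equal. This indicates that the data are overdispersed, so the small p-values from the ordinary Poisson rate model overstate the evidence and should be interpreted with caution.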
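One way to check the equal mean and variance assumption directly is the Pearson dispersion estimate: the sum of squared Pearson residuals divided by the residual degrees of freedom. Values near 1 are consistent with a Poisson model, and values well above 1 point to overdispersion. The sketch below is only illustrative. It uses the first six rows of the data rather than all 31 days, a simplified formula with `HighTemp` alone, and a hypothetical helper named `pearson_dispersion`.

```r
# Illustrative subset: the first six rows of the cyclist data.
cycle_head <- data.frame(
  HighTemp         = c(84.9, 87.1, 87.1, 82.9, 84.9, 75.0),
  QueensboroBridge = c(3216, 3579, 4230, 3861, 5862, 5251),
  Total            = c(11867, 13995, 16067, 13925, 23110, 21861)
)

# Poisson rate model with log(Total) as the offset (simplified formula).
pois_fit <- glm(QueensboroBridge ~ HighTemp, offset = log(Total),
                family = poisson(link = "log"), data = cycle_head)

# Hypothetical helper: Pearson chi-square divided by residual df.
pearson_dispersion <- function(model) {
  sum(residuals(model, type = "pearson")^2) / df.residual(model)
}

phi_hat <- pearson_dispersion(pois_fit)

# The quasi-Poisson fit reports the same quantity as its dispersion.
quasi_fit <- glm(QueensboroBridge ~ HighTemp, offset = log(Total),
                 family = quasipoisson, data = cycle_head)
c(pearson = phi_hat, quasi = summary(quasi_fit)$dispersion)
```

The quasi-Poisson model keeps the same coefficient estimates and multiplies each standard error by the square root of this dispersion estimate, which is why terms can lose significance without the estimates changing.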


## Final Model 

Given the data that we have, the model we choose is the first Poisson frequency regression model, which does not take into account the total number of cyclists that cross every bridge. While this variable can help predict the number of cyclists on the Queensboro Bridge, it is not needed for a successful model, and the response variable is not reliant on it. Because the model uses a log link, the fitted equation is written for the log of the expected count:

log(E[QueensboroBridge]) = 8.552 - 0.223 * DaySunday + 0.054 * DayMonday + 0.097 * DayTuesday + 0.184 * DayWednesday + 0.116 * DayThursday - 0.195 * DaySaturday + 0.016 * HighTemp - 0.020 * LowTemp - 0.322 * Precipitation
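Since the Poisson model is fitted with a log link, the linear combination of coefficients gives the log of the expected count, and exponentiating it returns a prediction on the original scale. The sketch below hard-codes the estimates from the count model output. The function `expected_count` is a hypothetical helper for illustration, not part of the original analysis.

```r
# Estimates copied from the count model output; Friday is the reference day.
final_coefs <- c(
  "(Intercept)" = 8.5517234,
  DayMonday = 0.0541753, DaySaturday = -0.1953380, DaySunday = -0.2226164,
  DayThursday = 0.1157122, DayTuesday = 0.0965569, DayWednesday = 0.1836698,
  HighTemp = 0.0158199, LowTemp = -0.0197465, Precipitation = -0.3221763
)

# Hypothetical helper: expected daily count for a given day and weather.
expected_count <- function(day, high_temp, low_temp, precipitation) {
  # The reference level contributes no day term.
  day_effect <- if (day == "Friday") 0 else final_coefs[[paste0("Day", day)]]
  log_mean <- final_coefs[["(Intercept)"]] + day_effect +
    final_coefs[["HighTemp"]] * high_temp +
    final_coefs[["LowTemp"]] * low_temp +
    final_coefs[["Precipitation"]] * precipitation
  exp(log_mean)  # back-transform from the log scale
}

# Conditions recorded for Saturday, 1-Jul (observed count: 3216).
expected_count("Saturday", high_temp = 84.9, low_temp = 72.0,
               precipitation = 0.23)
```

With the fitted object available, `predict(model.freq, newdata = ..., type = "response")` would be the usual way to obtain the same kind of prediction.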

# Summary and Conclusion

To summarize, we analyzed a data set recording how many cyclists crossed the Queensboro Bridge in NYC on each day of July, giving 31 observations. The four explanatory variables describe the day of the week and the weather conditions on each day. Our goal was to see how these variables relate to the number of cyclists. We also wanted to see whether taking into account the total number of cyclists that cross several major bridges in NYC changes the results. To investigate this, we built a Poisson frequency regression model, a Poisson model on rates, and a quasi-Poisson model. Based on our small sample, we selected the Poisson frequency regression model as the final model. This would indicate that the number of cyclists on the Queensboro Bridge can be modeled without the total number of cyclists on the other bridges, although including the total still gives a reasonable model.
