1 Introduction

For this project, we will build a Poisson regression model. The data set records the daily total number of cyclists on the Williamsburg Bridge in Brooklyn, New York, in order to keep track of how many cyclists enter and leave this cycling route on a given day. We will look at factors that may affect the number of cyclists each day, such as the weather conditions on that day. We will also create a Quasi-Poisson regression model and analyze the dispersion of our model.

1.1 Data Description

The data set in this project looks at the total number of cyclists on the Williamsburg Bridge on a given day along with that day's weather conditions, such as temperature and precipitation. The data set also includes the total number of cyclists on all four of the major New York bridges: the Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the Queensboro Bridge.

First, I obtained the data set for this assignment. I ran the code given in the assignment description, and the data set I received was for the Williamsburg Bridge, so that is what we will use for this Poisson regression modeling project. When I opened the downloaded data set, I noticed that some values had not been stored properly. The Date and Day variables had exactly the same values, and these values did not make sense in this context. I replaced the improper values in both variables with the values from the original data set, which is given under the tab data2 in the w09-AssignDataSet.xlsx file. Now, the Date variable identifies the date on which the observation occurred, and the Day variable gives the day of the week on which it occurred. I also checked all of the other variables to ensure their values had not been altered by the formatting of the Excel file, and none appeared to have changed during the download.

The data set has been uploaded to GitHub and can now be read in directly from the GitHub repository. We will call it “cycling”.

cycling <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/WilliamsburgBridge.csv", header = TRUE)

str(cycling)
'data.frame':   30 obs. of  7 variables:
 $ Date              : chr  "4/1" "4/2" "4/3" "4/4" ...
 $ Day               : chr  "Saturday" "Sunday" "Monday" "Tuesday" ...
 $ HighTemp          : num  46 62.1 63 51.1 63 48.9 48 55.9 66 73.9 ...
 $ LowTemp           : num  37 41 50 46 46 41 43 39.9 45 55 ...
 $ Precipitation     : num  0 0 0.03 1.18 0 0.73 0.21 0 0 0 ...
 $ WilliamsburgBridge: int  1915 4207 5178 2279 5711 1739 3399 4082 4886 6881 ...
 $ Total             : int  5397 13033 16325 6581 17991 4896 10341 11610 14899 21295 ...

We will use this cycling data set to create two Poisson regression models, one for the frequency counts of cyclists on the Williamsburg Bridge on a given observation, and another for the rates of cyclists entering and leaving via the Williamsburg Bridge offset by the total number of cyclists on all of the major New York bridges.

1.2 Variables

There are 7 total variables in the cycling data set. These variables include:

  • Date: This represents the date on which a given observation was collected. This is the observation ID number. This variable is just for identification purposes of the observations, not for actual prediction. The date is given in the format of month/day.

  • Day: This is a character predictor variable which represents the day of the week on which a given observation was collected. For instance, Monday, Tuesday, etc.

  • HighTemp: A quantitative predictor variable representing the high temperature on the given day, given in degrees Fahrenheit.

  • LowTemp: A quantitative predictor variable representing the low temperature on the given day, given in degrees Fahrenheit.

  • Precipitation: A quantitative predictor variable representing the amount of precipitation, rain, which occurred on the given day, given in inches.

  • WilliamsburgBridge: A quantitative variable representing the total number of cyclists on the Williamsburg Bridge on a given observation. This will be our response variable for the Poisson regression models.

  • Total: The total number of cyclists on all bridges on a given observation. The log of this variable will serve as the offset for our Poisson regression model of the rates and for the Quasi-Poisson regression model.

We also will create two new variables within our analysis later on to use for the Poisson regression model building process. These two variables include:

  • AvgTemp: A quantitative predictor variable representing the average temperature for a given observation, given in degrees Fahrenheit. This variable will be the average of HighTemp and LowTemp, found by calculating (HighTemp + LowTemp)/2.

  • NewPrecip: A discretized version of the Precipitation variable. This will be a binary predictor variable, where 0 represents a precipitation value equal to 0 inches, and where 1 represents a precipitation value greater than 0 inches.

For the Poisson regression model for the frequency counts, the WilliamsburgBridge variable will serve as the response variable. For the Poisson regression model for the rates, the WilliamsburgBridge variable will again serve as the response variable, and the model will be offset by the Total variable.

1.3 Research Questions

The main goal for this project is to create a Poisson regression model for both the frequency counts and the rates of the cyclists entering and leaving Brooklyn, New York through the Williamsburg Bridge. So, the focus for this project will be on creating two Poisson regression models which can successfully predict the frequency counts and the rates of the cyclists on the Williamsburg Bridge.

Some key questions for this project include:

  • Does the data set meet all of the necessary conditions required for a Poisson regression model? If not, is there any potential explanation for this discrepancy?

  • Can we create Poisson regression models which provide statistical significance for predicting both the frequency counts and for the rates of cyclists on the Williamsburg Bridge on a given day?

  • Is the Quasi-Poisson regression model a better choice than either of the standard Poisson regression models for frequency counts and for rates? How dispersed is this Quasi-Poisson regression model?

We will work on creating our Poisson regression models for both the frequency counts and rates in order to see whether we can in fact create models which provide statistical significance in their predictive ability. We will also create a Quasi-Poisson regression model and find how dispersed it is. We will then determine which of these models is the ideal choice for our final regression model.

2 Exploratory Data Analysis

Let’s take a look at the first few entries within this cycling data set for the Williamsburg Bridge.

kable(head(cycling), caption = "First Few Observations in the Data Set") 
First Few Observations in the Data Set
Date  Day        HighTemp  LowTemp  Precipitation  WilliamsburgBridge  Total
4/1   Saturday       46.0       37           0.00                1915   5397
4/2   Sunday         62.1       41           0.00                4207  13033
4/3   Monday         63.0       50           0.03                5178  16325
4/4   Tuesday        51.1       46           1.18                2279   6581
4/5   Wednesday      63.0       46           0.00                5711  17991
4/6   Thursday       48.9       41           0.73                1739   4896

This data set includes various factors which may have an influence on the number of individuals cycling, along with the date on which this data was collected. Additionally, this data set includes variables for both the number of cyclists on the Williamsburg Bridge on that given day, along with the total number of cyclists on all bridges on that given day.

First, let’s check whether there are any missing values in our data set.

colSums(is.na(cycling))
              Date                Day           HighTemp            LowTemp 
                 0                  0                  0                  0 
     Precipitation WilliamsburgBridge              Total 
                 0                  0                  0 

All of the variables in the data set have exactly zero missing values, so there are no missing values in our data set. This means we can move on to further analyzing the data set and the variables within it.

2.1 Variable Transformations

We will transform some of our predictor variables before beginning with our analysis and building the Poisson regression models.

First, we will create a new variable for the temperature which is an average of the high temperature and the low temperature for that given observation. We will call our new variable AvgTemp and this will serve as the average of the HighTemp and the LowTemp variables.

# Creating the new AvgTemp variable.
cycling$AvgTemp <- (cycling$HighTemp + cycling$LowTemp)/2

Now, we have a new variable, AvgTemp, representing the average of the high temperature and the low temperature for a given observation.

Next, we will discretize the Precipitation variable. We will create a variable called NewPrecip which will be a discretized version of the original Precipitation variable. For this discretized variable, NewPrecip will equal 0 if Precipitation equals 0 for that observation, and NewPrecip will equal 1 if Precipitation is greater than 0 for that observation.

We will make the NewPrecip variable a binary variable with 0 representing precipitation of 0 inches, and 1 representing precipitation greater than 0 inches. We will also convert this binary NewPrecip variable into an integer variable instead of a numeric variable.

# Creating the NewPrecip variable.
cycling$NewPrecip <- ifelse(cycling$Precipitation == 0, 0, 1)

# Making the binary variable an int.
cycling$NewPrecip <- as.integer(cycling$NewPrecip)

Now, we have a discretized NewPrecip variable in our data set which represents a binary predictor variable where 0 is for a precipitation of 0 inches and 1 is for a precipitation greater than 0 inches.

2.2 Checking the Variable Distributions

We have three predictor variables which we want to use in our final model, Day, AvgTemp, and NewPrecip. Out of these variables, Day is a categorical character variable, AvgTemp is a quantitative variable, and NewPrecip is a binary variable. We will check that the distributions of all of these variables appear to be random, without any noticeable patterns or concerns which could cause issues with the model building process.

First, let’s look at our Day variable. This is a categorical character variable. To check that it is properly distributed, we will look for potential imbalances, which would occur if substantially more observations fell on one day of the week than on another.

table(cycling$Day)

   Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
        4         4         5         5         4         4         4 

All weekdays from Monday to Friday have exactly four observations in our data set, while Saturday and Sunday each have exactly five. The observations are therefore distributed very evenly among the days of the week, with each weekend day having only one more observation than each weekday. So, there are no imbalances to be concerned about for our Day predictor variable.

Next, let’s check the distribution of our AvgTemp variable. This is a quantitative predictor variable and so we can check its distribution by using a histogram. We will check to see if this variable has an overall normal distribution without any notable skew or outliers.

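# Peak of the default kernel density estimate of AvgTemp (kept for reference).
ylimit <- max(density(cycling$AvgTemp)$y)
# Histogram on the density scale with a smoothed density curve overlaid.
hist(cycling$AvgTemp, probability = TRUE, main = "AvgTemp Distribution", xlab = "AvgTemp",
     col = "aliceblue", border = "cornflowerblue")
lines(density(cycling$AvgTemp, adjust = 2), col = "darkorchid")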

The histogram of the AvgTemp variable is unimodal, with the majority of the data centered at an average temperature between 55 and 60 degrees Fahrenheit. The distribution appears approximately normal without any notable skew or outliers. There are perhaps slightly more entries on the left side of the histogram, but the difference is very slight and not enough to cause a noticeable skew. Overall, it appears safe to say that our AvgTemp variable follows an approximately normal distribution, so this variable is suitable to use in our model building process.

Lastly, let’s check our NewPrecip variable. This is a binary predictor variable, so we expect it to take only the values 0 and 1. We can look at a table to confirm that these are the only two values that appear.

table(cycling$NewPrecip)

 0  1 
18 12 

As we can see, the NewPrecip variable takes only two values, 0 and 1, which is exactly what we expect of a binary variable. There are somewhat more days with no precipitation (a value of 0, with 18 observations) than days with precipitation (a value of 1, with 12 observations). However, this difference is not large enough to be a cause for concern. So, we can conclude that there are no issues with our binary predictor variable NewPrecip, and we can continue with the model building process.

Now, we have checked the distributions of all three of our predictor variables, Day, AvgTemp, and NewPrecip, and ensured that there are not any apparent issues with any of these variables or their distributions. So, we can continue with using these predictor variables in our model building process.

2.3 Assumptions and Conditions

Before we begin with building our model, we must check the assumptions and conditions which are required for a Poisson regression model.

There are four assumptions which must be met in order to create a Poisson regression model. These assumptions include:

  1. The response variable is a count described by a Poisson distribution.

  2. Observations are independent of one another.

  3. The mean of the Poisson random variable is equal to the variance of said Poisson random variable.

  4. The log of the mean rate, log (λ), must be a linear function of x.
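The four assumptions above can be written compactly as follows. The notation is ours rather than the report's: Y_i is the response for observation i, x_{i1}, ..., x_{ip} are its predictor values, and the β's are the regression coefficients.

```latex
\[
\begin{aligned}
Y_i \mid x_i &\sim \text{Poisson}(\lambda_i), \qquad Y_1, \dots, Y_n \text{ independent}, \\
E(Y_i \mid x_i) &= \operatorname{Var}(Y_i \mid x_i) = \lambda_i, \\
\log(\lambda_i) &= \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}.
\end{aligned}
\]
```

The last line is equivalent to λ_i = exp(β_0 + β_1 x_{i1} + ... + β_p x_{ip}), which is why each coefficient acts multiplicatively on the expected count.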

We will check whether each of these four conditions is met by our cycling data set before beginning the model building process for our Poisson regression model.

2.3.1 Condition 1: The response variable is a count described by a Poisson distribution.

The response variable in this data set was stated to be the WilliamsburgBridge variable, representing the total number of cyclists on the Williamsburg Bridge on a given observation. This variable is described as a count, representing the number of cyclists on a given observation. This fits the criteria for this assumption, because we can conclude that we have a response variable that is a count.

2.3.2 Condition 2: Observations are independent of one another.

Each observation was collected on a different date, and we can safely assume that the conditions of one day did not affect the conditions of another day. The number of cyclists on the Williamsburg Bridge for one observation is therefore independent of the number for any other observation. So, we can conclude that the observations are all independent and separate from one another.

2.3.3 Condition 3: The mean of the Poisson random variable is equal to the variance of said Poisson random variable.

In order for a variable to be a Poisson random variable, its mean must be equal to its variance. We previously stated that the WilliamsburgBridge variable will be our response variable. Therefore, we must check that this variable meets the criteria for a Poisson random variable, having a mean which is equal to its variance.

# Finding the mean.
mean <- mean(cycling$WilliamsburgBridge)
print(mean)
[1] 4942.267

The mean of the WilliamsburgBridge variable is 4,942.267. This represents the mean number of cyclists on the Williamsburg Bridge on a given observation. In other words, on an average day there are about 4,942 cyclists on the Williamsburg Bridge.

Next, let’s find the variance of our response variable.

# Finding the variance.
variance <- var(cycling$WilliamsburgBridge)
print(variance)
[1] 3005665

The variance of the WilliamsburgBridge variable is 3,005,665. This is far larger than the mean of 4,942.267, which indicates a violation of one of the necessary conditions for a Poisson regression model. This implies that our response variable is not a Poisson random variable, because its variance is not equal to its mean.
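As a rough illustration of the size of this gap, the ratio of the variance to the mean can be computed from the two values printed above. This is only a sketch: it uses the overall mean and variance of the response and ignores the predictors, and the names mean_count, var_count, and dispersion_ratio are ours.

```r
# Values copied from the mean() and var() output reported above.
mean_count <- 4942.267
var_count  <- 3005665

# For a Poisson random variable this ratio would be close to 1.
dispersion_ratio <- var_count / mean_count
print(dispersion_ratio)
```

A ratio in the hundreds, as here, means the counts vary far more from day to day than a single Poisson distribution would allow.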

2.3.4 Condition 4: The log of the mean rate, log (λ), must be a linear function of x.

We will take a look at the plot of the mean rate against the predictor variables to check this condition.

Our first predictor variable, Day, is a categorical character variable, so it cannot be plotted as a linear function. Instead, we will use the numerical predictor variable to check this condition.

Let’s look at the predictor variable of average temperature vs our response variable of WilliamsburgBridge. AvgTemp is a quantitative, numeric variable so we can use it to check this condition.

plot(cycling$AvgTemp, cycling$WilliamsburgBridge, main = "AvgTemp vs. Williamsburg Bridge", xlab = "AvgTemp", ylab = "WilliamsburgBridge")

The scatterplot of AvgTemp and WilliamsburgBridge shows what appears to be a linear relationship between these two variables. The relationship is positive: as average temperature increases, so does the number of cyclists on the Williamsburg Bridge. This seems logical, since more people would want to go cycling on a warmer day than on a colder day. The relationship is only moderately strong, but the linear pattern can definitely be seen. Strictly speaking, this condition concerns the log of the mean count rather than the count itself, but the roughly linear, increasing pattern gives no obvious reason to doubt it. So, we will treat this necessary condition for a Poisson regression model as met for AvgTemp.

Lastly, we have our predictor variable NewPrecip. Because it is binary, a scatterplot of it would show points only at x = 0 and x = 1. So, we cannot expect to see a linear relationship between NewPrecip and WilliamsburgBridge.

Overall, we can consider this condition satisfied since our numerical predictor variable of AvgTemp showed that it does indeed have a linear relationship with our response variable.
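Since the condition is stated in terms of log(λ), one way to look at it more directly is to plot the log of the counts against AvgTemp. The sketch below uses only the six rows shown by head(cycling) earlier so that it runs on its own. The data frame name first_six and the column LogCount are ours, and six points are far too few to judge linearity.

```r
# First six observations, copied from the head(cycling) output above.
first_six <- data.frame(
  HighTemp           = c(46.0, 62.1, 63.0, 51.1, 63.0, 48.9),
  LowTemp            = c(37, 41, 50, 46, 46, 41),
  WilliamsburgBridge = c(1915, 4207, 5178, 2279, 5711, 1739)
)

# Same AvgTemp definition as in Section 2.1, plus the log of the counts.
first_six$AvgTemp  <- (first_six$HighTemp + first_six$LowTemp) / 2
first_six$LogCount <- log(first_six$WilliamsburgBridge)

# Scatterplot on the log scale with a least-squares reference line.
plot(first_six$AvgTemp, first_six$LogCount,
     main = "AvgTemp vs. log(WilliamsburgBridge)",
     xlab = "AvgTemp", ylab = "log(WilliamsburgBridge)")
abline(lm(LogCount ~ AvgTemp, data = first_six), lty = 2)
```

The same plotting lines could be applied to the full cycling data frame by replacing first_six with cycling after AvgTemp has been created.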

2.3.5 Summary of Violations

Overall, we have one notable violation of the conditions of a Poisson regression model in our data set. The response variable, WilliamsburgBridge, does not meet the necessary criterion of a Poisson random variable, because its mean is not equal to its variance.

This is a major concern, because it suggests that a Poisson regression model may not be the best model choice for this data set after all.

We will still continue with building the Poisson regression models for this project, but it is important to keep in mind that, because the mean of the response variable does not equal its variance, the Poisson regression model may not be the best model choice for this data set.
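Once a Poisson model has been fitted, a common way to quantify this kind of violation is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom. The sketch below is not part of the original analysis. The helper pearson_dispersion is a name of our own, and the data are simulated (deliberately overdispersed) only so the example runs by itself.

```r
# Hypothetical helper: Pearson dispersion estimate for a fitted glm.
pearson_dispersion <- function(model) {
  sum(residuals(model, type = "pearson")^2) / df.residual(model)
}

# Simulated, deliberately overdispersed counts (not the cycling data).
set.seed(321)
n  <- 200
x  <- runif(n, min = 40, max = 70)
mu <- exp(1 + 0.05 * x)
y  <- rnbinom(n, mu = mu, size = 2)

# Fit an ordinary Poisson model and estimate its dispersion.
sim_fit        <- glm(y ~ x, family = poisson(link = "log"))
sim_dispersion <- pearson_dispersion(sim_fit)
print(sim_dispersion)
```

A value near 1 is consistent with the Poisson assumption, while a value well above 1 points to overdispersion. The same helper could be applied to any of the Poisson models fitted in this report.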

3 Poisson Regression Models on the Original Variables

First, we will revisit the two Poisson regression models created in the previous week’s assignment, one on frequency counts and one on rates, and present corrected versions of them. That assignment did not alter any of the predictor variables, so for now we will use the original predictor variables of Day, HighTemp, LowTemp, and Precipitation. In later steps of this project, we will create new models using the new predictor variables of AvgTemp and NewPrecip.

3.1 Poisson Regression Model on Frequency Counts

We will begin by creating a Poisson regression model on the frequency counts of cyclists on the Williamsburg Bridge for a given observation. Our goal is a model whose predictors are statistically significant for predicting the number of cyclists on the Williamsburg Bridge, based on the various factors in this data set.

# Poisson Regression Model of Counts
model.counts <- glm(WilliamsburgBridge ~ Day + HighTemp + LowTemp + Precipitation, family = poisson(link = "log"), data = cycling)
pois.count.coef = summary(model.counts)$coef
kable(pois.count.coef, caption = "Poisson Regression Model for the Counts of Cyclists \n on the Williamsburg Bridge")
Poisson Regression Model for the Counts of Cyclists on the Williamsburg Bridge
                 Estimate  Std. Error     z value   Pr(>|z|)
(Intercept)     7.6535762   0.0220219  347.544093  0.0000000
DayMonday       0.0545822   0.0099489    5.486232  0.0000000
DaySaturday    -0.2743978   0.0101644  -26.996022  0.0000000
DaySunday      -0.2345666   0.0097363  -24.091886  0.0000000
DayThursday     0.0319064   0.0102933    3.099730  0.0019370
DayTuesday      0.1988133   0.0103531   19.203340  0.0000000
DayWednesday    0.0511286   0.0100775    5.073556  0.0000004
HighTemp        0.0170556   0.0005874   29.037387  0.0000000
LowTemp        -0.0023861   0.0007830   -3.047372  0.0023085
Precipitation  -1.0320675   0.0165996  -62.174240  0.0000000

The regression equation for the Poisson regression model on the frequency counts is given as:

log(μ) = 7.6536 + 0.0546 * DayMonday - 0.2744 * DaySaturday - 0.2346 * DaySunday + 0.0319 * DayThursday + 0.1988 * DayTuesday + 0.0511 * DayWednesday + 0.0171 * HighTemp - 0.0024 * LowTemp - 1.0321 * Precipitation

All of the predictor terms, DayMonday, DaySaturday, DaySunday, DayThursday, DayTuesday, DayWednesday, HighTemp, LowTemp, and Precipitation, have p-values below 0.01, and all except DayThursday (p = 0.0019) and LowTemp (p = 0.0023) have p < .001. This indicates that all of the predictor variables in our model are statistically significant in predicting the total expected counts of cyclists on the Williamsburg Bridge on a given day.

The significance of these variables in predicting the expected counts can likely be attributed to adverse weather conditions, such as excessive heat or cold or heavy precipitation, which make cycling less appealing on those days. The fact that these predictor variables are all statistically significant suggests that the number of cyclists on the Williamsburg Bridge varies from day to day with changes in temperature and precipitation.

Overall, this Poisson model of the frequency counts of the cyclists on the Williamsburg Bridge showed statistical significance in its prediction of the expected log counts for the number of cyclists on the Williamsburg Bridge for a given observation.

For our categorical predictor variable Day, Friday was chosen as the baseline level, which can be seen from the absence of a “DayFriday” term in the regression output. This is because, by default, R uses the level that comes first alphabetically as the baseline level, and Friday comes first alphabetically among the seven days. Therefore, the regression coefficients for the different levels of the Day variable are interpreted relative to the baseline level of Friday.
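The default choice of baseline can be seen, and changed, with factor() and relevel(). The following is a small sketch on a stand-alone vector of day names rather than on the cycling data. The object names days, day_factor, and day_monday_ref are ours, and choosing Monday as an alternative baseline is purely illustrative.

```r
# One entry per day of the week, in calendar order.
days <- c("Saturday", "Sunday", "Monday", "Tuesday",
          "Wednesday", "Thursday", "Friday")

# factor() sorts the levels alphabetically, so Friday becomes the first level.
day_factor <- factor(days)
print(levels(day_factor)[1])

# relevel() moves a chosen level to the front, making it the baseline.
day_monday_ref <- relevel(day_factor, ref = "Monday")
print(levels(day_monday_ref)[1])
```

If a different baseline were wanted in the models, the Day column could be converted with relevel(factor(...), ref = ...) before calling glm(). The fitted values would be unchanged, and only the meaning of the Day coefficients would shift.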

3.1.1 Regression Coefficients Interpretation

We will analyze the regression coefficients for the variables in this Poisson regression model on frequency counts.

  • The value of the y-intercept is given as 7.6536. This represents log(μ) for a Friday when all other predictor variables are equal to 0. However, the y-intercept does not have a practical interpretation or meaning in this scenario, so we are not interested in its meaning for the Poisson regression model.

  • DayMonday (p < .001): The regression coefficient for the variable DayMonday was found to be 0.0546. This means that the log of the expected count of cyclists on the Williamsburg Bridge was 0.0546 greater on Monday than on Friday. Equivalently, the expected count on Monday is exp(0.0546) ≈ 1.0561 times the expected count on Friday, holding all other variables constant.

  • DaySaturday (p < .001): The regression coefficient for the variable DaySaturday was found to be -0.2744. This means that the log of the expected count of cyclists on the Williamsburg Bridge was 0.2744 less on Saturday than on Friday. Equivalently, the expected count on Saturday is exp(-0.2744) ≈ 0.7600 times the expected count on Friday (about 24% lower), holding all other variables constant.

  • DaySunday (p < .001): The regression coefficient for the variable DaySunday was found to be -0.2346. This means that the log of the expected count of cyclists on the Williamsburg Bridge was 0.2346 less on Sunday than on Friday. Equivalently, the expected count on Sunday is exp(-0.2346) ≈ 0.7909 times the expected count on Friday (about 21% lower), holding all other variables constant.

  • DayThursday (p = 0.0019): The regression coefficient for the variable DayThursday was found to be 0.0319. This means that the log of the expected count of cyclists on the Williamsburg Bridge was 0.0319 greater on Thursday than on Friday. Equivalently, the expected count on Thursday is exp(0.0319) ≈ 1.0324 times the expected count on Friday, holding all other variables constant.

  • DayTuesday (p < .001): The regression coefficient for the variable DayTuesday was found to be 0.1988. This means that the log of the expected count of cyclists on the Williamsburg Bridge was 0.1988 greater on Tuesday than on Friday. Equivalently, the expected count on Tuesday is exp(0.1988) ≈ 1.2199 times the expected count on Friday, holding all other variables constant.

  • DayWednesday (p < .001): The regression coefficient for the variable DayWednesday was found to be 0.0511. This means that the log of the expected count of cyclists on the Williamsburg Bridge was 0.0511 greater on Wednesday than on Friday. Equivalently, the expected count on Wednesday is exp(0.0511) ≈ 1.0524 times the expected count on Friday, holding all other variables constant.

  • HighTemp (p <.001): The regression coefficient of the HighTemp variable in this model is 0.0171. This means that the mean log of the counts increases by 0.0171 units for every 1 degree Fahrenheit increase in the high temperature for the given observation, holding all other variables constant.

  • LowTemp (p = 0.0023): The regression coefficient of the LowTemp variable in this model is -0.0024. This means that the mean log of the counts decreases by 0.0024 units for every 1 degree Fahrenheit increase in the low temperature for the given observation, holding all other variables constant.

  • Precipitation (p < .001): The regression coefficient of the Precipitation variable in this model is -1.0321. This means that the mean log of the counts decreases by 1.0321 units for every 1 inch increase in the amount of precipitation for the given observation, holding all other variables constant.
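The multiplicative interpretations above come from exponentiating the coefficients. As a sketch, the estimates from the coefficient table can be exponentiated directly. The vector count_coef simply retypes the reported estimates, and the names rate_ratio and percent_change are ours.

```r
# Estimates copied from the coefficient table of the counts model.
count_coef <- c(
  DayMonday     =  0.0545822,
  DaySaturday   = -0.2743978,
  DaySunday     = -0.2345666,
  DayThursday   =  0.0319064,
  DayTuesday    =  0.1988133,
  DayWednesday  =  0.0511286,
  HighTemp      =  0.0170556,
  LowTemp       = -0.0023861,
  Precipitation = -1.0320675
)

# exp(coefficient) is the multiplicative change in the expected count.
rate_ratio     <- exp(count_coef)
percent_change <- 100 * (rate_ratio - 1)
print(round(cbind(rate_ratio, percent_change), 4))
```

With the fitted model object available, exp(coef(model.counts)) would give the same ratios, along with the exponentiated intercept.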

3.2 Poisson Regression Model on Rates

Now, we will create a Poisson regression model of the rates at which cyclists enter and leave via the Williamsburg Bridge, offset by the total number of cyclists on all four of the major New York bridges: the Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the Queensboro Bridge. Unlike the previous model, which focused only on the frequency counts of cyclists on the Williamsburg Bridge, this model looks at the number of cyclists on the Williamsburg Bridge for a given observation as a rate out of the total number of cyclists on all four bridges for that observation.

We will still use the WilliamsburgBridge variable as our response variable, but we will include log(Total) as an offset so that the model describes the rate of cyclists on the Williamsburg Bridge out of the total number of cyclists on all four bridges.
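To see why an offset turns a model for counts into a model for rates, write μ_i for the expected Williamsburg Bridge count and t_i for the Total on day i (the subscripted notation is ours). A model that is linear in the log of the rate can be rearranged into a model for the log of the count:

```latex
\[
\log\!\left(\frac{\mu_i}{t_i}\right) = \beta_0 + \sum_{j} \beta_j x_{ij}
\quad\Longleftrightarrow\quad
\log(\mu_i) = \log(t_i) + \beta_0 + \sum_{j} \beta_j x_{ij}
\]
```

The term log(t_i) enters with its coefficient fixed at 1 rather than estimated, which is what supplying log(Total) as the offset does.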

# Poisson Model of Rates
model.rates <- glm(WilliamsburgBridge ~ Day + HighTemp + LowTemp + Precipitation, offset = log(Total), 
                   family = poisson(link = "log"), data = cycling)
kable(summary(model.rates)$coef, caption = "Poisson Regression Model of the Rates of Cyclists \n on the Williamsburg Bridge out of all Four Bridges")
Poisson Regression Model of the Rates of Cyclists on the Williamsburg Bridge out of all Four Bridges
                 Estimate  Std. Error      z value   Pr(>|z|)
(Intercept)    -1.0682224   0.0223058  -47.8899346  0.0000000
DayMonday       0.0003873   0.0099397    0.0389685  0.9689155
DaySaturday     0.0375055   0.0100984    3.7140004  0.0002040
DaySunday       0.0051455   0.0097487    0.5278112  0.5976304
DayThursday     0.0205573   0.0102839    1.9989823  0.0456103
DayTuesday      0.0138077   0.0103909    1.3288272  0.1839050
DayWednesday    0.0233018   0.0100672    2.3146274  0.0206333
HighTemp       -0.0011895   0.0005867   -2.0272588  0.0426359
LowTemp         0.0003500   0.0007846    0.4460900  0.6555322
Precipitation   0.0505341   0.0161127    3.1362935  0.0017110

The regression equation for the Poisson regression model on the rates is given as:

log(μ/t) = -1.0682 + 0.0004 * DayMonday + 0.0375 * DaySaturday + 0.0051 * DaySunday + 0.0206 * DayThursday + 0.0138 * DayTuesday + 0.0233 * DayWednesday - 0.0012 * HighTemp + 0.0004 * LowTemp + 0.0505 * Precipitation

At the 0.05 significance level, the terms DaySaturday (p < .001), DayThursday (p = 0.0456), DayWednesday (p = 0.0206), HighTemp (p = 0.0426), and Precipitation (p = 0.0017) are statistically significant in this Poisson model. The terms DayMonday (p = 0.9689), DaySunday (p = 0.5976), DayTuesday (p = 0.1839), and LowTemp (p = 0.6555) are not. So, only some of the predictors are statistically significant in predicting the expected count of cyclists on the Williamsburg Bridge on a given day, offset by the total number of cyclists on all four of the major New York bridges.

Since several of its terms are statistically significant, this model for the rates shows some utility for prediction and estimation, although not every predictor contributes to it.

As stated for the Poisson regression model on frequency counts, R chose Friday as the baseline level of the Day variable, so we will compare the regression coefficients of the other days against this baseline level.

3.2.1 Regression Coefficients Interpretation

We will analyze the regression coefficients for the variables in this Poisson regression model on rates. Because the model is offset by log(Total), each coefficient describes a change in the log of the expected rate μ/t, that is, the expected share of all cyclists who use the Williamsburg Bridge.

  • The value of the y-intercept is -1.0682. This represents the log of the expected rate on a Friday when all other predictor variables are equal to 0. However, the y-intercept does not have a practical interpretation or meaning in this scenario, so we are not interested in its meaning for the Poisson regression model.

  • DayMonday (p = 0.9689): The regression coefficient for DayMonday is 0.0004. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0004 greater on Monday than on Friday. Equivalently, the rate on Monday is exp(0.0004) = 1.0004 times the rate on Friday, holding all other variables constant.

  • DaySaturday (p < .001): The regression coefficient for DaySaturday is 0.0375. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0375 greater on Saturday than on Friday. Equivalently, the rate on Saturday is exp(0.0375) = 1.0382 times the rate on Friday, holding all other variables constant.

  • DaySunday (p = 0.5976): The regression coefficient for DaySunday is 0.0051. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0051 greater on Sunday than on Friday. Equivalently, the rate on Sunday is exp(0.0051) = 1.0051 times the rate on Friday, holding all other variables constant.

  • DayThursday (p = 0.0456): The regression coefficient for DayThursday is 0.0206. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0206 greater on Thursday than on Friday. Equivalently, the rate on Thursday is exp(0.0206) = 1.0208 times the rate on Friday, holding all other variables constant.

  • DayTuesday (p = 0.1839): The regression coefficient for DayTuesday is 0.0138. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0138 greater on Tuesday than on Friday. Equivalently, the rate on Tuesday is exp(0.0138) = 1.0139 times the rate on Friday, holding all other variables constant.

  • DayWednesday (p = 0.0206): The regression coefficient for DayWednesday is 0.0233. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0233 greater on Wednesday than on Friday. Equivalently, the rate on Wednesday is exp(0.0233) = 1.0236 times the rate on Friday, holding all other variables constant.

  • HighTemp (p = 0.0426): The regression coefficient of the HighTemp variable in this model is -0.0012. This means that the log of the expected rate decreases by 0.0012 for every 1 degree Fahrenheit increase in the high temperature for the given observation, holding all other variables constant. Equivalently, the rate is multiplied by exp(-0.0012) = 0.9988 for each additional degree.

  • LowTemp (p = 0.6555): The regression coefficient of the LowTemp variable in this model is 0.0004. This means that the log of the expected rate increases by 0.0004 for every 1 degree Fahrenheit increase in the low temperature for the given observation, holding all other variables constant. Equivalently, the rate is multiplied by exp(0.0004) = 1.0004 for each additional degree.

  • Precipitation (p = 0.0017): The regression coefficient of the Precipitation variable in this model is 0.0505. This means that the log of the expected rate increases by 0.0505 for every 1 inch increase in the amount of precipitation for the given observation, holding all other variables constant. Equivalently, the rate is multiplied by exp(0.0505) = 1.0518 for each additional inch.
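The multiplicative values in the list above come from exponentiating the coefficients. The following is only a small sketch of that arithmetic: it hard-codes the rounded coefficients from the table above instead of reading them from model.rates, and the names coefs, rate_ratio, and pct_change are our own and not part of the original analysis.

```r
# Rounded coefficients copied by hand from the rates-model table above
# (assumption: 4-decimal rounding is precise enough for illustration).
coefs <- c(DayMonday = 0.0004, DaySaturday = 0.0375, DaySunday = 0.0051,
           DayThursday = 0.0206, DayTuesday = 0.0138, DayWednesday = 0.0233,
           HighTemp = -0.0012, LowTemp = 0.0004, Precipitation = 0.0505)

# exp(coefficient) is the multiplicative change in the expected rate.
rate_ratio <- exp(coefs)

# The same effect expressed as a percentage change in the rate.
pct_change <- 100 * (rate_ratio - 1)

round(cbind(rate_ratio, pct_change), 4)
```

With the fitted model object available, exp(coef(model.rates)) would give the same ratios without any hand-copied numbers.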

3.3 Summary and Comparisons of the Two Models

Both of the Poisson regression models we created, the model for the frequency counts and the model for the rates, contained statistically significant predictors and showed utility for prediction. In both of these models, we looked at the number of cyclists on the Williamsburg Bridge in New York for a specific observation, along with the day of the week and some weather factors which may affect the number of cyclists out on that date. These factors included the high temperature, the low temperature, and the amount of precipitation for that given date. It turned out that these weather factors were statistically significant in the frequency counts model, while in the rates model the high temperature and the amount of precipitation were statistically significant at the 0.05 level and the low temperature was not. This indicates that weather-related conditions have a statistically significant relationship with the number of cyclists out on the Williamsburg Bridge for a given observation. This can be attributed to certain weather conditions making it more or less ideal for individuals to be cycling outdoors. For instance, a day with incredibly high temperatures, incredibly cold temperatures, or severe storms with heavy precipitation would be less ideal and would likely lead to fewer cyclists being out on that day than on a day with pleasant weather.

Overall, both of the Poisson regression models contained statistically significant predictors and showed utility in their prediction. However, as was previously stated, there were some violations of the conditions for a Poisson regression model within our data set. First, it was found that the mean of the response variable, WilliamsburgBridge, was not equal to its variance. This suggests that the response variable is not Poisson distributed, since a Poisson random variable must have its mean equal to its variance. Additionally, all four predictor variables were checked, and it was found that the response variable was not a linear function of any of these predictor variables. This is another major violation within this data set. These violations suggest that a Poisson model may not have been the best model choice for this data set, and it is important to be mindful of these violations when using either of the Poisson regression models we created for prediction.

4 Poisson Regression Model on Frequency Counts

We will begin by creating a Poisson regression model of the frequency counts using the new variables we created for this project. Specifically, this model will be on the frequency counts of individuals on the Williamsburg Bridge for a given observation. Our goal is to create a Poisson regression model whose predictors are statistically significant in predicting the number of individuals on the Williamsburg Bridge for a given observation, based upon the various factors in this data set.

We will create our Poisson regression model on the frequency counts.

# Poisson Regression Model of Counts
model.counts <- glm(WilliamsburgBridge ~ Day + AvgTemp + NewPrecip, 
                    family = poisson(link = "log"), data = cycling)
pois.count.coef = summary(model.counts)$coef
kable(pois.count.coef, caption = "Poisson Regression Model for the Counts of Cyclists \n on the Williamsburg Bridge")
Poisson Regression Model for the Counts of Cyclists on the Williamsburg Bridge
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.2024973 0.0209663 343.527497 0.0000000
DayMonday 0.0301551 0.0095684 3.151542 0.0016241
DaySaturday -0.1944899 0.0100224 -19.405597 0.0000000
DaySunday -0.2450724 0.0097649 -25.097317 0.0000000
DayThursday -0.0298802 0.0101579 -2.941565 0.0032656
DayTuesday -0.0604242 0.0100400 -6.018322 0.0000000
DayWednesday 0.1714460 0.0101413 16.905694 0.0000000
AvgTemp 0.0253493 0.0003233 78.418600 0.0000000
NewPrecip -0.3407990 0.0063666 -53.528941 0.0000000

The regression equation for the Poisson regression model on the frequency counts is given as:

log(μ) = 7.2025 + 0.0302 * DayMonday - 0.1945 * DaySaturday - 0.2451 * DaySunday - 0.0299 * DayThursday - 0.0604 * DayTuesday + 0.1714 * DayWednesday + 0.0253 * AvgTemp - 0.3408 * NewPrecip

All of the terms for the predictor variables Day, AvgTemp, and NewPrecip have p-values below 0.01, and all but DayMonday (p = 0.0016) and DayThursday (p = 0.0033) have p-values of p < .001. This indicates that all of the predictor variables in our model are statistically significant in predicting the expected count of cyclists on the Williamsburg Bridge on a given day.

For our categorical predictor variable of Day, Friday was chosen as the baseline level, which can be seen by how there is no “DayFriday” term in the regression output. This is because, of the seven days, Friday is the one which comes first alphabetically, and R chooses the level which comes first alphabetically as the baseline level. Therefore, in our regression coefficient interpretations for the different levels of the Day variable, each day will be compared against the baseline level of Friday.
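The alphabetical choice of the baseline level can be checked without the data set. This is a minimal sketch using a hypothetical vector of weekday names (not the cycling data), and the object names are our own; relevel() is shown only as one possible way to pick a different reference day and was not used in this project.

```r
# Hypothetical vector of weekday names, standing in for cycling$Day.
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")

# factor() sorts the levels alphabetically; the first level is the baseline.
day_factor <- factor(days)
levels(day_factor)
baseline <- levels(day_factor)[1]

# relevel() moves a chosen level to the front, making it the new baseline.
day_monday_ref <- relevel(day_factor, ref = "Monday")
new_baseline <- levels(day_monday_ref)[1]
```

Changing the baseline changes which day the other coefficients are compared against, but it does not change the fitted values of the model.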

4.1 Regression Coefficients Interpretation

We will analyze the regression coefficients for the variables in this Poisson regression model on frequency counts.

  • The value of the y-intercept is 7.2025. This represents log(μ), the log of the expected count, on a Friday when all other predictor variables are equal to 0. However, the y-intercept does not have a practical interpretation or meaning in this scenario, so we are not interested in its meaning for the Poisson regression model.

  • DayMonday (p = 0.0016): The regression coefficient of the DayMonday variable in this model is 0.0302. This means that the log of the expected count of cyclists on the Williamsburg Bridge is 0.0302 greater on Monday than on Friday. Equivalently, the expected count on Monday is exp(0.0302) = 1.0307 times the count on Friday (about 3.1% higher), holding all other variables constant.

  • DaySaturday (p < .001): The regression coefficient of the DaySaturday variable in this model is -0.1945. This means that the log of the expected count of cyclists on the Williamsburg Bridge is 0.1945 less on Saturday than on Friday. Equivalently, the expected count on Saturday is exp(-0.1945) = 0.8232 times the count on Friday (about 17.7% lower), holding all other variables constant.

  • DaySunday (p < .001): The regression coefficient of the DaySunday variable in this model is -0.2451. This means that the log of the expected count of cyclists on the Williamsburg Bridge is 0.2451 less on Sunday than on Friday. Equivalently, the expected count on Sunday is exp(-0.2451) = 0.7826 times the count on Friday (about 21.7% lower), holding all other variables constant.

  • DayThursday (p = 0.0033): The regression coefficient of the DayThursday variable in this model is -0.0299. This means that the log of the expected count of cyclists on the Williamsburg Bridge is 0.0299 less on Thursday than on Friday. Equivalently, the expected count on Thursday is exp(-0.0299) = 0.9705 times the count on Friday (about 2.9% lower), holding all other variables constant.

  • DayTuesday (p < .001): The regression coefficient of the DayTuesday variable in this model is -0.0604. This means that the log of the expected count of cyclists on the Williamsburg Bridge is 0.0604 less on Tuesday than on Friday. Equivalently, the expected count on Tuesday is exp(-0.0604) = 0.9414 times the count on Friday (about 5.9% lower), holding all other variables constant.

  • DayWednesday (p < .001): The regression coefficient of the DayWednesday variable in this model is 0.1714. This means that the log of the expected count of cyclists on the Williamsburg Bridge is 0.1714 greater on Wednesday than on Friday. Equivalently, the expected count on Wednesday is exp(0.1714) = 1.1870 times the count on Friday (about 18.7% higher), holding all other variables constant.

  • AvgTemp (p < .001): The regression coefficient of the AvgTemp variable in this model is 0.0253. This means that the log of the expected count increases by 0.0253 for every 1 degree Fahrenheit increase in the average temperature for the given observation, holding all other variables constant. Equivalently, the expected count is multiplied by exp(0.0253) = 1.0256 for each additional degree.

  • NewPrecip (p < .001): The regression coefficient of the NewPrecip variable in this model is -0.3408. This means that the log of the expected count of cyclists on the Williamsburg Bridge is 0.3408 less on days where there is precipitation than on days where there is no precipitation. Equivalently, the expected count on days with precipitation is exp(-0.3408) = 0.7112 times the count on days with no precipitation (about 28.9% lower), holding all other variables constant.

All of the predictor variables in this Poisson regression model on frequency counts were statistically significant, with all of their p-values below 0.01.
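Because the model is linear on the log scale, effects combine by multiplication on the count scale. The sketch below illustrates this with the rounded coefficients from the table above; the two scenarios (a 10 degree increase in average temperature, and a rainy Sunday compared with a dry Friday) are hypothetical examples we chose for illustration, and the variable names are our own.

```r
# Rounded coefficients copied by hand from the counts-model table above.
b_avgtemp <- 0.0253
b_sunday  <- -0.2451
b_precip  <- -0.3408

# A 10-degree increase in AvgTemp multiplies the expected count by exp(10 * b).
ratio_10deg <- exp(10 * b_avgtemp)

# A rainy Sunday versus a dry Friday at the same temperature:
# coefficients add on the log scale, so the ratios multiply.
ratio_rainy_sunday <- exp(b_sunday + b_precip)

round(c(ratio_10deg = ratio_10deg,
        ratio_rainy_sunday = ratio_rainy_sunday), 4)
```

Under these assumptions, a combined ratio below 1 means fewer expected cyclists than in the comparison scenario, and a ratio above 1 means more.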

5 Poisson Regression Model on Rates

Now, we will create a Poisson regression model of the rates at which cyclists enter and leave via the Williamsburg Bridge offset by the total number of cyclists on all four of the major New York bridges, using the new variables we created in this project.

This model, unlike the previous model which just focused on the frequency counts of cyclists on the Williamsburg Bridge, will also account for the total number of cyclists on all four of the major New York bridges, the Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the Queensboro Bridge. This Poisson model will look at the rates of the number of cyclists on the Williamsburg Bridge for a given observation as a rate out of the total number of cyclists on all four of these major bridges for that specific observation.
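To make the role of the offset concrete, here is a minimal sketch on hypothetical toy numbers (not the cycling data). With an offset of log(total), the model is log(μ) = log(total) + b0, which is the same as log(μ/total) = b0, so exp(b0) is a rate. For an intercept-only model this fitted rate should match the pooled share sum(y) / sum(total). The data frame toy and its column names are assumptions made for this example.

```r
# Hypothetical toy data: y = counts on one bridge,
# total = counts across all bridges on the same day.
toy <- data.frame(y = c(30, 45, 28, 60),
                  total = c(100, 150, 90, 200))

# Intercept-only Poisson model with log(total) as the offset.
fit <- glm(y ~ 1, offset = log(total),
           family = poisson(link = "log"), data = toy)

# exp(intercept) is the fitted rate mu / total.
fitted_rate <- unname(exp(coef(fit)))

# The pooled share of counts, computed directly.
pooled_rate <- sum(toy$y) / sum(toy$total)

c(fitted_rate = fitted_rate, pooled_rate = pooled_rate)
```

The same idea carries over once predictors are added: the coefficients then describe how the rate, rather than the raw count, changes with each predictor.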

We will build our Poisson regression model for the rates. This time, we will still use the WilliamsburgBridge variable as our response variable, but we will offset the model by the Total variable to make our Poisson model for the rates of cyclists on the Williamsburg Bridge out of the total number of cyclists on all four of the bridges.

# Poisson Model of Rates
model.rates <- glm(WilliamsburgBridge ~ Day + AvgTemp + NewPrecip, 
                   offset = log(Total), 
                   family = poisson(link = "log"), data = cycling)
kable(summary(model.rates)$coef, caption = "Poisson Regression Model of the Rates of Cyclists \n on the Williamsburg Bridge out of all Four Bridges")
Poisson Regression Model of the Rates of Cyclists on the Williamsburg Bridge out of all Four Bridges
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0435841 0.0210730 -49.5222689 0.0000000
DayMonday 0.0018152 0.0095668 0.1897380 0.8495145
DaySaturday 0.0345382 0.0100162 3.4482277 0.0005643
DaySunday 0.0034566 0.0098102 0.3523466 0.7245783
DayThursday 0.0236982 0.0101422 2.3365925 0.0194604
DayTuesday 0.0252916 0.0101088 2.5019260 0.0123520
DayWednesday 0.0183518 0.0101291 1.8117867 0.0700192
AvgTemp -0.0014673 0.0003301 -4.4449423 0.0000088
NewPrecip 0.0136632 0.0063769 2.1426111 0.0321443

The regression equation for the Poisson regression model on the rates is given as:

log(μ/t) = -1.0436 + 0.0018 * DayMonday + 0.0345 * DaySaturday + 0.0035 * DaySunday + 0.0237 * DayThursday + 0.0253 * DayTuesday + 0.0184 * DayWednesday - 0.0015 * AvgTemp + 0.0137 * NewPrecip

The variables DaySaturday, DayThursday, DayTuesday, AvgTemp, and NewPrecip all had p-values less than the alpha value of 0.05, meaning that these are the variables which are statistically significant in this model.
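The p-values in the table can be reproduced from the z values alone, which is a quick way to double check which terms fall below 0.05. This sketch copies the z values by hand from the table above; the names z, p, and significant are our own.

```r
# z statistics copied by hand from the rates-model table above.
z <- c(DayMonday = 0.1897380, DaySaturday = 3.4482277, DaySunday = 0.3523466,
       DayThursday = 2.3365925, DayTuesday = 2.5019260, DayWednesday = 1.8117867,
       AvgTemp = -4.4449423, NewPrecip = 2.1426111)

# Two-sided p-value using the standard normal distribution.
p <- 2 * pnorm(-abs(z))

# Terms that are statistically significant at alpha = 0.05.
significant <- names(p)[p < 0.05]
significant
```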

As stated for the Poisson regression model on frequency counts, R chose Friday as the baseline level of the Day variable, so we will compare the regression coefficients of the other days against this baseline level.

5.1 Regression Coefficients Interpretation

We will analyze the regression coefficients for the variables in this Poisson regression model on rates. Because the model is offset by log(Total), each coefficient describes a change in the log of the expected rate μ/t, that is, the expected share of all cyclists who use the Williamsburg Bridge.

  • The value of the y-intercept is -1.0436. This represents the log of the expected rate on a Friday when all other predictor variables are equal to 0. However, the y-intercept does not have a practical interpretation or meaning in this scenario, so we are not interested in its meaning for the Poisson regression model.

  • DayMonday (p = 0.8495): The regression coefficient of the DayMonday variable in this model is 0.0018. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0018 greater on Monday than on Friday. Equivalently, the rate on Monday is exp(0.0018) = 1.0018 times the rate on Friday, holding all other variables constant.

  • DaySaturday (p < .001): The regression coefficient of the DaySaturday variable in this model is 0.0345. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0345 greater on Saturday than on Friday. Equivalently, the rate on Saturday is exp(0.0345) = 1.0351 times the rate on Friday, holding all other variables constant.

  • DaySunday (p = 0.7246): The regression coefficient of the DaySunday variable in this model is 0.0035. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0035 greater on Sunday than on Friday. Equivalently, the rate on Sunday is exp(0.0035) = 1.0035 times the rate on Friday, holding all other variables constant.

  • DayThursday (p = 0.0195): The regression coefficient of the DayThursday variable in this model is 0.0237. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0237 greater on Thursday than on Friday. Equivalently, the rate on Thursday is exp(0.0237) = 1.0240 times the rate on Friday, holding all other variables constant.

  • DayTuesday (p = 0.0124): The regression coefficient of the DayTuesday variable in this model is 0.0253. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0253 greater on Tuesday than on Friday. Equivalently, the rate on Tuesday is exp(0.0253) = 1.0256 times the rate on Friday, holding all other variables constant.

  • DayWednesday (p = 0.0700): The regression coefficient of the DayWednesday variable in this model is 0.0184. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0184 greater on Wednesday than on Friday. Equivalently, the rate on Wednesday is exp(0.0184) = 1.0186 times the rate on Friday, holding all other variables constant.

  • AvgTemp (p < .001): The regression coefficient of the AvgTemp variable in this model is -0.0015. This means that the log of the expected rate decreases by 0.0015 for every 1 degree Fahrenheit increase in the average temperature for the given observation, holding all other variables constant. Equivalently, the rate is multiplied by exp(-0.0015) = 0.9985 for each additional degree.

  • NewPrecip (p = 0.0321): The regression coefficient of the NewPrecip variable in this model is 0.0137. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0137 greater on days where there is precipitation than on days where there is no precipitation. Equivalently, the rate on days with precipitation is exp(0.0137) = 1.0138 times the rate on days with no precipitation, holding all other variables constant.

Out of all of the predictor variables, the ones which showed statistical significance were DaySaturday (p < .001), DayThursday (p = 0.0195), DayTuesday (p = 0.0124), AvgTemp (p < .001), and NewPrecip (p = 0.0321). All of these terms have p-values less than the alpha value of 0.05, indicating they are statistically significant to the model.

The variables DayMonday (p = 0.8495), DaySunday (p = 0.7246), and DayWednesday (p = 0.0700) did not show statistical significance, as they have p-values greater than the alpha value of 0.05.
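The regression equation for the rates can also be turned into a small helper that returns the predicted share of cyclists on the Williamsburg Bridge. This is only a sketch under stated assumptions: it uses the rounded coefficients from the equation above, the function predict_share() and its argument names are hypothetical, and the inputs in the example call (a dry Saturday with an average temperature of 70 degrees Fahrenheit) are made up for illustration.

```r
# Rounded coefficients from the rates-model equation above.
# Friday is the baseline level, so its effect is 0.
b0 <- -1.0436
day_effect <- c(Friday = 0, Monday = 0.0018, Saturday = 0.0345,
                Sunday = 0.0035, Thursday = 0.0237, Tuesday = 0.0253,
                Wednesday = 0.0184)
b_temp <- -0.0015
b_precip <- 0.0137

# Predicted rate mu / t: exponentiate the linear predictor.
predict_share <- function(day, avg_temp, new_precip) {
  unname(exp(b0 + day_effect[day] + b_temp * avg_temp + b_precip * new_precip))
}

# Hypothetical inputs: dry Saturday, average temperature of 70 degrees F.
predict_share("Saturday", avg_temp = 70, new_precip = 0)
```

With the fitted model available, predict() on model.rates would be the more reliable route, since it uses the unrounded coefficients.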

6 Quasi-Poisson Regression Model

Next, we will create a Quasi-Poisson regression model. This Quasi-Poisson regression model will be done on the rates, so it will be offset by the Total variable while still using WilliamsburgBridge as its response variable.

# Quasi-Poisson Regression Model
quasimodel.rates <- glm(WilliamsburgBridge ~ Day + AvgTemp + NewPrecip, 
                        offset = log(Total), 
                        family = quasipoisson, data = cycling)
summary(quasimodel.rates)

Call:
glm(formula = WilliamsburgBridge ~ Day + AvgTemp + NewPrecip, 
    family = quasipoisson, data = cycling, offset = log(Total))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.043584   0.043411 -24.040   <2e-16 ***
DayMonday     0.001815   0.019708   0.092   0.9275    
DaySaturday   0.034538   0.020633   1.674   0.1090    
DaySunday     0.003457   0.020209   0.171   0.8658    
DayThursday   0.023698   0.020893   1.134   0.2695    
DayTuesday    0.025292   0.020824   1.215   0.2380    
DayWednesday  0.018352   0.020866   0.880   0.3891    
AvgTemp      -0.001467   0.000680  -2.158   0.0427 *  
NewPrecip     0.013663   0.013137   1.040   0.3101    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 4.243634)

    Null deviance: 151.051  on 29  degrees of freedom
Residual deviance:  89.094  on 21  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 3
pander(summary(quasimodel.rates)$coef, caption = "Quasi-Poisson Regression Model")
Quasi-Poisson Regression Model
  Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.044 0.04341 -24.04 9.198e-17
DayMonday 0.001815 0.01971 0.09211 0.9275
DaySaturday 0.03454 0.02063 1.674 0.109
DaySunday 0.003457 0.02021 0.171 0.8658
DayThursday 0.0237 0.02089 1.134 0.2695
DayTuesday 0.02529 0.02082 1.215 0.238
DayWednesday 0.01835 0.02087 0.8795 0.3891
AvgTemp -0.001467 0.00068 -2.158 0.04268
NewPrecip 0.01366 0.01314 1.04 0.3101

The regression equation of the Quasi-Poisson Regression Model is given as follows:

log(μ/t) = -1.0436 + 0.0018 * DayMonday + 0.0345 * DaySaturday + 0.0035 * DaySunday + 0.0237 * DayThursday + 0.0253 * DayTuesday + 0.0184 * DayWednesday - 0.0015 * AvgTemp + 0.0137 * NewPrecip

As we can see, the Quasi-Poisson regression model has the same coefficient estimates as the standard Poisson regression model on rates. However, the standard errors, and therefore the p-values, for these regression coefficients differ between the two models. The Quasi-Poisson standard errors are the Poisson standard errors multiplied by the square root of the dispersion parameter, and the tests use a t distribution on the 21 residual degrees of freedom instead of a z test.

So, the interpretations of the regression coefficients for this Quasi-Poisson regression model are exactly the same as they were for the Poisson regression model of rates on the new predictor variables of Day, AvgTemp, and NewPrecip.
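The link between the two sets of standard errors can be checked by hand. This sketch copies two rows (AvgTemp and NewPrecip) from the Poisson output and the dispersion parameter and residual degrees of freedom from the quasi-Poisson summary above; the variable names are our own.

```r
# Values copied by hand from the Poisson and quasi-Poisson outputs above.
estimate   <- c(AvgTemp = -0.0014673, NewPrecip = 0.0136632)
se_poisson <- c(AvgTemp = 0.0003301, NewPrecip = 0.0063769)
dispersion <- 4.243634
df_resid   <- 21

# Quasi-Poisson standard errors: Poisson ones scaled by sqrt(dispersion).
se_quasi <- se_poisson * sqrt(dispersion)

# The test statistic is compared against a t distribution.
t_value <- estimate / se_quasi
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(cbind(se_quasi, t_value, p_value), 4)
```

Since sqrt(4.243634) is a little over 2, every standard error roughly doubles, which is why several terms move above the 0.05 threshold.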

Out of all of the predictor variables in our Quasi-Poisson regression model on rates, only AvgTemp (p = 0.043) was statistically significant, as it was the only predictor variable with a p-value less than the alpha value of 0.05. This means that AvgTemp is the only statistically significant predictor of the rate of cyclists on the Williamsburg Bridge in this model.

All of the other terms, DayMonday, DaySaturday, DaySunday, DayThursday, DayTuesday, DayWednesday, and NewPrecip, were not statistically significant in the Quasi-Poisson regression model, because they all had p-values greater than the alpha value of 0.05.

6.1 Dispersion

Now, we will look at the dispersion parameter for the Quasi-Poisson regression model in order to see how dispersed the data is relative to what a Poisson model assumes.

In this output of the model summary, we were given that the dispersion parameter for the Quasi-Poisson model is 4.2436. This dispersion parameter given in the model summary is the Pearson dispersion parameter.

We can also calculate the Deviance dispersion parameter to compare these two dispersion parameters for our Quasi-Poisson regression model on rates.

# Dispersion Parameters
yhat = quasimodel.rates$fitted.values
pearson.resid = (cycling$WilliamsburgBridge - yhat)/sqrt(yhat)
Pearson.dispersion = sum(pearson.resid^2)/quasimodel.rates$df.residual
Deviance.dispersion = (quasimodel.rates$deviance)/quasimodel.rates$df.residual
disp = cbind(Pearson.dispersion = Pearson.dispersion, 
             Deviance.dispersion = Deviance.dispersion)
kable(disp, caption="Dispersion parameter", align = 'c')
Dispersion parameter
Pearson.dispersion Deviance.dispersion
4.243633 4.242561

As we can see, the value of the Pearson dispersion parameter for our Quasi-Poisson regression model is 4.2436. The value of the Deviance dispersion parameter for our Quasi-Poisson regression model is 4.2426.

These dispersion parameters show that our data is over-dispersed, as both values are well above 1; the variance is roughly 4.2 times what a Poisson model would assume. Therefore, using the standard Poisson regression model would likely not be an ideal choice, since over-dispersion causes it to understate the standard errors, which makes its p-values too small and its conclusions unreliable. With dispersion this far from a value of 1, the Quasi-Poisson model is likely the better choice.
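Both dispersion values can be recovered from numbers already printed in the quasi-Poisson summary. This short sketch only redoes that arithmetic with hand-copied values; the variable names are our own.

```r
# Values copied by hand from the quasi-Poisson summary above.
residual_deviance <- 89.094
df_resid <- 21          # 30 observations minus 9 estimated coefficients

# Deviance dispersion: residual deviance divided by residual df.
deviance_dispersion <- residual_deviance / df_resid
deviance_dispersion

# Pearson dispersion is (sum of squared Pearson residuals) / df, so the
# Pearson chi-square statistic follows from the reported dispersion.
pearson_dispersion <- 4.243634
pearson_chisq <- pearson_dispersion * df_resid
pearson_chisq
```

A value near 1 for either ratio would have been consistent with the Poisson assumption that the variance equals the mean.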

7 Final Model

Now, for our final model, we must choose between the standard Poisson regression model on rates and the Quasi-Poisson regression model.

One important thing to note when making this choice is that the regular Poisson model assumes that the mean of the response variable is equal to its variance, while the Quasi-Poisson model does not. When we checked the conditions of the standard Poisson regression model earlier, we found that the mean of the response variable does not equal its variance, indicating a major violation. This violation causes some concern for the regular Poisson regression model, as it suggests that the response variable is not a Poisson random variable, and therefore a standard Poisson regression model may not be the best choice for this data set.

Here, the Quasi-Poisson model has the advantage, as it does not assume that the mean of the response variable is equal to its variance, which is good for our data set since it failed to meet this required condition for a standard Poisson regression model.

Both models have advantages and disadvantages which must be considered when making the choice of a final model. The standard Poisson regression model on rates showed statistical significance for the majority of its predictor terms. However, the data set failed to meet the condition of the mean of the response variable equaling its variance, which raises concern for the fit of this model. On the other hand, the Quasi-Poisson regression model does not require this condition. However, in the Quasi-Poisson model, only one predictor variable showed any statistical significance, indicating that most of the predictors may not be useful after all.

Additionally, we found that our data is substantially over-dispersed, with a dispersion parameter of 4.2436, which is far from 1. Since our data is over-dispersed, a standard Poisson regression model is likely not the ideal choice, as over-dispersion makes its standard errors too small and its p-values too optimistic. When the data is over-dispersed, the Quasi-Poisson regression model should be used. So, even though the Quasi-Poisson regression model in this case did not show much statistical significance among its predictor variables, it is likely the better choice for our data.

In the end, it is a choice between the standard Poisson regression model, which shows more statistical significance but whose standard errors and p-values are likely too optimistic due to over-dispersion, and the Quasi-Poisson regression model, which shows less statistical significance but accounts for dispersion and is not affected by our data set failing the equal mean and variance condition required for Poisson regression.

Overall, I would say that the Quasi-Poisson regression model is the safer choice of the two, as it does not require the mean of the response variable to equal its variance, which was a condition our data set failed. Additionally, using a standard Poisson regression model on over-dispersed data can lead to unreliable conclusions about which predictors matter. It is true that the Quasi-Poisson regression model shows much weaker significance, which means that most of its predictors may not be significant after all. But the Quasi-Poisson regression model remains the better choice in this situation, as our data fails the required condition for the response variable to be a Poisson random variable and we saw substantial over-dispersion as well.
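One way to see what the Quasi-Poisson model still supports is to look at confidence intervals for the rate ratios. The sketch below builds Wald-type 95% intervals from the rounded estimates and standard errors in the quasi-Poisson table above, using the t distribution with 21 degrees of freedom. It is an illustration only: the variable names are our own, and confint() on the fitted model may give slightly different limits.

```r
# Rounded values copied by hand from the quasi-Poisson table above.
estimate <- c(AvgTemp = -0.001467, NewPrecip = 0.013663)
se_quasi <- c(AvgTemp = 0.000680, NewPrecip = 0.013137)
df_resid <- 21

# Critical value for a two-sided 95% interval on 21 degrees of freedom.
crit <- qt(0.975, df = df_resid)

# Interval on the log scale, then exponentiated to the rate-ratio scale.
ci_log <- cbind(lower = estimate - crit * se_quasi,
                upper = estimate + crit * se_quasi)
ci_ratio <- exp(ci_log)

round(ci_ratio, 4)
```

An interval for a rate ratio that contains 1 corresponds to a term that is not statistically significant at the 0.05 level, which matches the p-values reported for AvgTemp and NewPrecip.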

8 Visual Comparisons

Now, let’s look at some visual comparisons of the data within our models.

I chose to create a graph which illustrated the predicted rates of the cyclists on the Williamsburg Bridge based upon the day of the week and whether or not it rained for that given day. This graph will create two lines, one for precipitation (blue), and one for no precipitation (red).

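# Prediction grid: one row per Day x NewPrecip combination,
# holding AvgTemp and Total at their sample means
graph <- expand.grid(
  Day = unique(cycling$Day), 
  NewPrecip = unique(cycling$NewPrecip), 
  AvgTemp = mean(cycling$AvgTemp, na.rm = TRUE),  
  Total = mean(cycling$Total, na.rm = TRUE)      
)
# type = "response" returns expected counts at the given Total,
# so divide by Total to express the prediction as a rate
graph$predicted_rate <- predict(quasimodel.rates, newdata = graph, type = "response") / graph$Total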

graph$NewPrecip <- factor(graph$NewPrecip, levels = c(0, 1), labels = c("No Precipitation", "Precipitation"))

ggplot(graph, aes(x = Day, y = predicted_rate, color = NewPrecip, group = NewPrecip)) +
  geom_line(size = 1) +   
  geom_point(size = 2) +   
  labs(title = "Predicted Rates of the Cyclists \n on the Williamsburg Bridge by the \n Day of the Week and the Precipitation \n Conditions",
       x = "Day", 
       y = "Rate of Cyclists",
       color = "Precipitation Conditions") +
  theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

As we can see, this graph illustrates the predicted rate of cyclists on the Williamsburg Bridge out of all four of the major New York bridges. The rate is predicted based on the day of the week and whether there was precipitation or not. The graph has two lines, one for precipitation (blue) and one for no precipitation (red), with a point on each line for each of the seven days of the week.

As we can see by looking at our graph, it is predicted that the highest rate of cyclists on the Williamsburg Bridge occurs on Saturdays with precipitation, and the lowest rate of cyclists on the Williamsburg Bridge occurs on Fridays with no precipitation.
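A note on the prediction grid behind a plot like this: expand.grid() crosses every value it is given, so a grid built from distinct values has exactly one row per combination of day and precipitation condition. The sketch below shows the difference on hypothetical stand-in vectors of length 30 (not the cycling data); the names days, precip, full_grid, and small_grid are our own.

```r
# Hypothetical stand-ins for the Day and NewPrecip columns (30 daily rows).
days <- rep(c("Friday", "Saturday", "Sunday", "Monday",
              "Tuesday", "Wednesday", "Thursday"), length.out = 30)
precip <- rep(c(0, 1, 0), length.out = 30)

# Crossing the raw columns repeats every duplicated value.
full_grid <- expand.grid(Day = days, NewPrecip = precip)

# Crossing only the distinct values gives one row per Day x NewPrecip pair.
small_grid <- expand.grid(Day = unique(days), NewPrecip = unique(precip))

c(full = nrow(full_grid), small = nrow(small_grid))
```

Both grids lead to the same plotted points, but the smaller grid avoids drawing the same point many times.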

9 Conclusion

Overall, we looked at various Poisson regression models in this project to predict the frequency counts and the rates of the cyclists on the Williamsburg Bridge. We also looked at a Quasi-Poisson regression model to account for the dispersion of the data, along with the violations which indicated that a standard Poisson regression model may not be the ideal fit for our data.

It was found that our standard Poisson regression model on rates had several terms which showed statistical significance, indicating that these predictor variables were statistically significant in predicting the rates of cyclists on the Williamsburg Bridge. In our Quasi-Poisson regression model, only one of the predictor variables showed statistical significance in predicting the rates of cyclists on the Williamsburg Bridge. This made it seem like the standard Poisson regression model provided better significance for prediction.

However, we looked at the dispersion parameter of the Quasi-Poisson regression model and found that our data is substantially over-dispersed. This indicates that a standard Poisson regression model is likely not an ideal choice, because it does not account for this over-dispersion, which makes its standard errors too small and its p-values too optimistic. This over-dispersion, along with the fact that our data set violated the condition of the mean of the response variable equaling its variance, showed that the standard Poisson regression model is not an ideal choice after all. Due to this dispersion, the Quasi-Poisson regression model is a better and safer choice than the standard Poisson regression model, despite it having fewer statistically significant variables. Even though the majority of its predictor variables were not statistically significant in predicting the rates of cyclists on the Williamsburg Bridge, its standard errors and p-values are more trustworthy because they account for the dispersion in the data.

9.1 Recommendations

Some recommendations I would make for future projects include:

  • Look further into the violation that was found within this data set and look into possible explanations for why this violation occurred. It was found that the mean of the response variable is not equal to its variance, which violates one of the necessities of a Poisson regression model. It should be further considered whether a Poisson regression model in fact is the best choice for this data set and if it is sufficient to use this model for prediction despite these violations.

  • Consider other variables which may affect the number of cyclists out on a given day. There may be other factors which would strengthen the regression model. For instance, a variable recording whether any holidays or other notable events occurred on the day of a given observation could be useful. This could be a binary predictor variable with a value of 1 if there are any events or holidays, and a value of 0 if there are not. This could be useful because there may tend to be fewer cyclists out if there is a major holiday or an event occurring in the city on that day.

  • Further expand the data set to improve the accuracy of the predictions and to strengthen the Poisson regression models. Collecting more observations over a longer period of time would provide better accuracy and reliability in the results of the model building process, and would strengthen the conclusions drawn from the Poisson regression models of this data set.

---
title: "Quasi-Poisson Regression Model of the Cyclists on the Williamsburg Bridge"
author: "Josie Gallop"
date: "2024-11-08"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 6
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
editor_options: 
  chunk_output_type: console
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. It is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```
```{r setup, include=FALSE}
# Detect, install, and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("leaflet")) {
   install.packages("leaflet")
   library(leaflet)
}
if (!require("EnvStats")) {
   install.packages("EnvStats")
   library(EnvStats)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("phytools")) {
   install.packages("phytools")
   library(phytools)
}
if(!require("dplyr")) {
   install.packages("dplyr")
   library(dplyr)
}
if(!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}
if(!require("GGally")) {
   install.packages("GGally")
   library(GGally)
}
if (!require("boot")) {
   install.packages("boot")
   library(boot)
}
if(!require("pander")) {
   install.packages("pander")
   library(pander)
}
if(!require("mlbench")) {
   install.packages("mlbench")
   library(mlbench)
}
if(!require("psych")) {
   install.packages("psych")
   library(psych)
}
if(!require("broom.mixed")) {
   install.packages("broom.mixed")
   library(broom.mixed)
}
if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)
}
if (!require("openxlsx")) {
   install.packages("openxlsx")
   library(openxlsx)
}
knitr::opts_chunk$set(echo = TRUE,  
                   warning = FALSE,   
                   message = FALSE,  
                   results = TRUE,  
                   comment = NA   
                      )   
```


# Introduction

For this project, we will be creating a Poisson regression model. The data set records the total number of cyclists on the Williamsburg Bridge in Brooklyn, New York, on each day, in order to keep track of how many cyclists enter and leave this cycling route. We will look at the various factors affecting the number of cyclists on each day, such as the weather conditions on that particular day. We will also create a Quasi-Poisson regression model and analyze the dispersion of our model.



## Data Description

The data set in this project looks at the total number of cyclists on the Williamsburg Bridge on a given day along with the weather conditions of that day, such as temperature and precipitation. This data set also includes the total number of cyclists on all four of the major New York bridges: the Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the Queensboro Bridge.

First, I will find the data set which will be used for this assignment. I ran the code which was given in the assignment description, and the data set I received was for the Williamsburg Bridge, so that is what we will use for this Poisson regression modeling project. When opening the downloaded data set, I noticed that some of it had not been stored properly. The Date and Day variables had the exact same values, and these values did not make sense in this context. I replaced the improper values in the Date and Day variables with the values from the original data set, which was given under the tab data2 in the w09-AssignDataSet.xlsx file. Now, the Date variable identifies the date on which the observation occurred, and the Day variable represents the day of the week on which the observation occurred. I also checked all of the other variables to ensure their values had not been altered by the formatting of the Excel file, and they all appeared to be unchanged by the downloading process.

The data set has been uploaded to GitHub and can now be read in directly from the GitHub repository.

We will read in the data set from GitHub and we will call it "cycling".

```{r}
cycling <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA321/refs/heads/main/dataset/WilliamsburgBridge.csv", header = TRUE)

str(cycling)
```

We will use this cycling data set to create two Poisson regression models, one for the frequency counts of cyclists on the Williamsburg Bridge on a given observation, and another for the rates of cyclists entering and leaving via the Williamsburg Bridge offset by the total number of cyclists on all of the major New York bridges. 


## Variables


There are 7 total variables in the cycling data set. These variables include:

* Date: This represents the date on which a given observation was collected. This is the observation ID number. This variable is just for identification purposes of the observations, not for actual prediction. The date is given in the format of month/day. 

* Day: This is a character predictor variable which represents the day of the week on which a given observation was collected. For instance, Monday, Tuesday, etc. 

* HighTemp: A quantitative predictor variable representing the high temperature on the given day, given in degrees Fahrenheit. 

* LowTemp: A quantitative predictor variable representing the low temperature on the given day, given in degrees Fahrenheit.

* Precipitation: A quantitative predictor variable representing the amount of precipitation, rain, which occurred on the given day, given in inches. 

* WilliamsburgBridge: A quantitative variable representing the total number of cyclists on the Williamsburg Bridge on a given observation. This will be our response variable for the Poisson regression models. 

* Total: The total number of cyclists on all bridges on a given observation. This variable will be used as the offset for our Poisson regression model of the rates and for the Quasi-Poisson regression model.


We also will create two new variables within our analysis later on to use for the Poisson regression model building process. These two variables include:

* AvgTemp: A quantitative predictor variable representing the average temperature for a given observation, given in degrees Fahrenheit. This variable will be the average of HighTemp and LowTemp, found by calculating (HighTemp + LowTemp)/2. 

* NewPrecip: A discretized version of the Precipitation variable. This will be a binary predictor variable, where 0 represents a precipitation value equal to 0 inches, and where 1 represents a precipitation value greater than 0 inches. 


For the Poisson regression model for the frequency counts, the Williamsburg Bridge variable will serve as the response variable. For the Poisson regression model for the rates, the Williamsburg Bridge variable will again serve as the response variable, and it will be offset by the Total variable for this model. 
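As a rough illustration of what offsetting by Total means, the sketch below uses simulated data. The `demo` data frame, its column names, and the coefficient values are made up for illustration and are not the cycling data. Modeling the rate μ/Total with a log link is the same as modeling log(μ) = log(Total) + Xβ, so log(Total) enters the model with its coefficient fixed at 1.

```r
# Simulated stand-in for the cycling data (hypothetical names and values).
set.seed(2024)
n <- 30
demo <- data.frame(
  temp  = runif(n, 40, 80),
  total = round(runif(n, 5000, 25000))
)
# Bridge counts are simulated as a share of the total that rises slightly with temperature.
demo$bridge <- rpois(n, lambda = demo$total * exp(-1.5 + 0.002 * demo$temp))

# Offset written inside the formula ...
fit.formula <- glm(bridge ~ temp + offset(log(total)),
                   family = poisson(link = "log"), data = demo)

# ... or passed through the offset argument. Both fix the coefficient of log(total) at 1.
fit.argument <- glm(bridge ~ temp, offset = log(total),
                    family = poisson(link = "log"), data = demo)

print(coef(fit.formula))
print(coef(fit.argument))
```

Both calls should give the same coefficients. The coefficients then describe changes in the log of the rate rather than the log of the raw count.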


## Research Questions

The main goal for this project is to create a Poisson regression model for both the frequency counts and the rates of the cyclists entering and leaving Brooklyn, New York through the Williamsburg Bridge. So, the focus for this project will be on creating two Poisson regression models which can successfully predict the frequency counts and the rates of the cyclists on the Williamsburg Bridge. 

Some key questions for this project include:

* Does the data set meet all of the necessary conditions required for a Poisson regression model? If not, is there any potential explanation for this discrepancy? 

* Can we create Poisson regression models which provide statistical significance for predicting both the frequency counts and for the rates of cyclists on the Williamsburg Bridge on a given day?

* Is the Quasi-Poisson regression model a better choice than either of the standard Poisson regression models for frequency counts and for rates? How dispersed is this Quasi-Poisson regression model?


We will work on creating our Poisson regression models for both the frequency counts and rates in order to see if we can in fact create models which provide statistical significance in their predictive ability. We will also create a Quasi-Poisson regression model and find how dispersed it is. We will determine which of these models is the ideal choice for our final regression model.



# Exploratory Data Analysis

Let's take a look at the first few entries within this cycling data set for the Williamsburg Bridge.

```{r}
kable(head(cycling), caption = "First Few Observations in the Data Set") 
```

This data set includes various factors which may have an influence on the number of individuals cycling, along with the date on which this data was collected. Additionally, this data set includes variables for both the number of cyclists on the Williamsburg Bridge on that given day, along with the total number of cyclists on all bridges on that given day.  


First, let's check if there are any missing values in our data set.

```{r}
colSums(is.na(cycling))
```

It turns out that all of the variables in the data set have exactly zero missing values. So, there are no missing values in our data set. This is very good and means we can move on with further analyzing the data set and the variables within it.


## Variable Transformations

We will transform some of our predictor variables before beginning with our analysis and building the Poisson regression models.

First, we will create a new variable for the temperature which is an average of the high temperature and the low temperature for that given observation. We will call our new variable AvgTemp and this will serve as the average of the HighTemp and the LowTemp variables. 

```{r}
# Creating the new AvgTemp variable.
cycling$AvgTemp <- (cycling$HighTemp + cycling$LowTemp)/2
```

Now, we have a new variable, AvgTemp, representing the average of the high temperature and the low temperature for a given observation.


Next, we will discretize the Precipitation variable. We will create a variable called NewPrecip which will be a discretized version of the original Precipitation variable. For this discretized variable, NewPrecip will equal 0 if Precipitation equals 0 for that observation, and NewPrecip will equal 1 if Precipitation is greater than 0 for that observation. 

We will make the NewPrecip variable a binary variable with 0 representing precipitation of 0 inches, and 1 representing precipitation greater than 0 inches. We will also convert this binary NewPrecip variable into an integer variable instead of a numeric variable. 

```{r}
# Creating the NewPrecip variable.
cycling$NewPrecip <- ifelse(cycling$Precipitation == 0, 0, 1)

# Making the binary variable an int.
cycling$NewPrecip <- as.integer(cycling$NewPrecip)
```

Now, we have a discretized NewPrecip variable in our data set which represents a binary predictor variable where 0 is for a precipitation of 0 inches and 1 is for a precipitation greater than 0 inches. 



## Checking the Variable Distributions

We have three predictor variables which we want to use in our final model, Day, AvgTemp, and NewPrecip. Out of these variables, Day is a categorical character variable, AvgTemp is a quantitative variable, and NewPrecip is a binary variable. We will check that the distributions of all of these variables appear to be random, without any noticeable patterns or concerns which could cause issues with the model building process. 

First, let's look at our Day variable. This is a categorical character variable. In order to check that this variable is properly distributed without any major concerns, we will check whether there are any potential imbalances within this variable. An imbalance would occur if there were a significantly greater number of observations occurring on one day as opposed to another day.

```{r}
table(cycling$Day)
```

Each weekday from Monday to Friday has exactly four observations in our data set, and the weekend days of Saturday and Sunday each have exactly five. The observations are distributed very evenly amongst the days of the week, with the weekend days having only one more observation each than the weekdays. So, there are not any imbalances to be concerned about for our Day predictor variable.

Next, let's check the distribution of our AvgTemp variable. This is a quantitative predictor variable and so we can check its distribution by using a histogram. We will check to see if this variable has an overall normal distribution without any notable skew or outliers. 

```{r}
# Kernel density estimate (smoothed with adjust = 2) to overlay on the histogram.
avgtemp.density <- density(cycling$AvgTemp, adjust = 2)
hist(cycling$AvgTemp, probability = TRUE, main = "AvgTemp Distribution", xlab = "AvgTemp",
     col = "aliceblue", border = "cornflowerblue")
lines(avgtemp.density, col = "darkorchid")
```

The histogram of the AvgTemp variable is unimodal, with the majority of the data centered at an average temperature between 55 and 60 degrees Fahrenheit. The distribution appears to be approximately normal without any notable skew or outliers. There are perhaps slightly more entries on the left side of the histogram, but the difference is very slight and is not enough to cause a noticeable skew in the distribution. Overall, it appears safe to say that our AvgTemp variable follows an approximately normal distribution, and so this variable will be fine to use in our model building process.

Lastly, let's check our NewPrecip variable. This is a binary predictor variable, and so we can expect it to have values of only 0 and 1. In order to check the reliability of this variable in its use for prediction, we will make sure it appears to meet this criteria of a binary variable. We can take a look at a table to ensure that there are only two possible entries for this NewPrecip variable, 0 and 1, because these are the only two values which a binary variable can be.

```{r}
table(cycling$NewPrecip)
```

As we can see, this NewPrecip variable has only two values, 0 and 1. This is exactly what we wanted to see, because these are the only two values which a binary variable can take. There are somewhat more days with no precipitation (a value of 0, 18 observations) than days with precipitation (a value of 1, 12 observations). However, this difference is not large enough to be a cause for concern. So, we can conclude that everything is alright with our binary predictor variable NewPrecip, and that we can continue with the model building process.

Now, we have checked the distributions of all three of our predictor variables, Day, AvgTemp, and NewPrecip, and ensured that there are not any apparent issues with any of these variables or their distributions. So, we can continue with using these predictor variables in our model building process.


## Assumptions and Conditions

Before we begin with building our model, we must check the assumptions and conditions which are required for a Poisson regression model.

There are four assumptions which must be met in order to create a Poisson regression model. These assumptions include:

1. The response variable is a count described by a Poisson distribution.

2. Observations are independent of one another.

3. The mean of the Poisson random variable is equal to the variance of said Poisson random variable.

4. The log of the mean rate, log (λ), must be a linear function of x.


We will go through and check whether each of these four conditions has been met by our cycling data set before beginning the model building process for our Poisson regression model.


### Condition 1: The response variable is a count described by a Poisson distribution.

The response variable in this data set was stated to be the WilliamsburgBridge variable, representing the total number of cyclists on the Williamsburg Bridge on a given observation. This variable is described as a count, representing the number of cyclists on a given observation. This fits the criteria for this assumption, because we can conclude that we have a response variable that is a count.


### Condition 2: Observations are independent of one another.

Each observation was collected on a different date, and we assume that the conditions of one day did not affect the conditions of another day. Under this assumption, the number of cyclists on the Williamsburg Bridge for a given observation is independent of the number for any other observation. So, we can conclude that the observations are all independent and separate from one another.


### Condition 3: The mean of the Poisson random variable is equal to the variance of said Poisson random variable.

In order for a variable to be a Poisson random variable, its mean must be equal to its variance. We previously stated that the WilliamsburgBridge variable will be our response variable. Therefore, we must check that this variable meets the criteria for a Poisson random variable, having a mean which is equal to its variance.

```{r}
# Finding the mean.
mean <- mean(cycling$WilliamsburgBridge)
print(mean)
```

The mean of the WilliamsburgBridge variable is 4,942.267. This means that, on average, about 4,942 cyclists were counted on the Williamsburg Bridge on a given date.

Next, let's find the variance of our response variable.

```{r}
# Finding the variance.
variance <- var(cycling$WilliamsburgBridge)
print(variance)
```

The variance of the WilliamsburgBridge variable is 3,005,665, which is roughly 600 times larger than the mean. This indicates a violation of one of the necessary conditions for a Poisson regression model. It implies that our response variable is in fact not a Poisson random variable, because the value of its mean is not equivalent to the value of its variance.
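To put the size of this gap in perspective, here is a small sketch that computes the variance-to-mean ratio from the two values reported above and compares it with the ratio for 30 simulated draws from a true Poisson distribution with the same mean. The simulated draws and the names `sim` and `sim.ratio` are only for illustration and are not part of the cycling data.

```r
# Mean and variance reported above for the WilliamsburgBridge variable.
reported.mean     <- 4942.267
reported.variance <- 3005665

# For a Poisson random variable this ratio should be close to 1.
ratio <- reported.variance / reported.mean
print(ratio)

# Comparison: 30 simulated draws from a genuine Poisson distribution with the same mean.
set.seed(321)
sim <- rpois(30, lambda = reported.mean)
sim.ratio <- var(sim) / mean(sim)
print(sim.ratio)
```

A ratio in the hundreds, compared with a ratio near 1 for the simulated Poisson counts, is a strong sign of over-dispersion in the raw response.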


### Condition 4: The log of the mean rate, log (λ), must be a linear function of x.

We will take a look at the plot of the mean rate against the predictor variables to check this condition. 

Since our first predictor variable, Day, is a categorical character variable, it cannot be plotted as a linear function. So, we will look at the numerical predictor variable instead to check this condition.

Let's look at the predictor variable of average temperature vs our response variable of WilliamsburgBridge. AvgTemp is a quantitative, numeric variable so we can use it to check this condition.

```{r}
plot(cycling$AvgTemp, cycling$WilliamsburgBridge, main = "AvgTemp vs. Williamsburg Bridge", xlab = "AvgTemp", ylab = "WilliamsburgBridge")
```

The scatterplot of AvgTemp and WilliamsburgBridge shows what appears to be a linear relationship between these two variables. We can see a positive relationship: as average temperature increases, so does the number of cyclists on the Williamsburg Bridge. This seems logical, as more people would want to go outside and go cycling on a day that is warmer rather than a day that is colder. The relationship appears to be of only moderate strength, but the linear pattern can definitely be seen. So, it does appear that WilliamsburgBridge has a linear relationship with AvgTemp, which supports this necessary condition for a Poisson regression model.


Lastly, we have our predictor variable of NewPrecip. This is a binary predictor variable, so a scatterplot of it would only show points at x = 0 and x = 1. So, we cannot expect to see a linear relationship between NewPrecip and WilliamsburgBridge, because NewPrecip can only take values of 0 and 1, not anything in between.

Overall, we can consider this condition satisfied since our numerical predictor variable of AvgTemp showed that it does indeed have a linear relationship with our response variable. 
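Because the condition is stated for log(λ) rather than for the count itself, a more direct check is to plot the log of the count against the predictor. The sketch below shows the idea on simulated data. The `demo` data frame, the coefficient values 7 and 0.02, and the name `log.fit` are assumptions made for illustration only.

```r
# Simulated counts whose log-mean is linear in temperature (hypothetical values).
set.seed(9)
demo <- data.frame(AvgTemp = runif(30, 45, 75))
demo$count <- rpois(30, lambda = exp(7 + 0.02 * demo$AvgTemp))

# Straight-line fit on the log scale, used only as a visual guide.
log.fit <- lm(log(count) ~ AvgTemp, data = demo)

plot(demo$AvgTemp, log(demo$count),
     main = "AvgTemp vs. log(count), simulated data",
     xlab = "AvgTemp", ylab = "log(count)")
abline(log.fit, col = "darkorchid")

print(coef(log.fit))
```

If the points scatter evenly around the straight line, the log of the mean count is plausibly linear in the predictor. The same kind of plot could be made with the cycling data by replacing the simulated columns.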



### Summary of Violations

Overall, we have one notable violation of the conditions of a Poisson regression model within our data set. We found that the response variable, WilliamsburgBridge, does not meet the necessary criteria of a Poisson random variable, because its mean is not equal to its variance.

This is a major concern, because it suggests that a Poisson regression model may not be the best model choice for this data set after all.

We will still continue with building the Poisson regression models for this project, but it is important to keep in mind that this condition was not met.
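To show what this kind of violation does in practice, the sketch below fits a Poisson and a Quasi-Poisson model to simulated over-dispersed counts. The data are drawn from a negative binomial distribution, and all names and values (`demo`, `temp`, `count`, `phi.hat`) are assumptions for illustration and not the cycling data. The dispersion is estimated as the Pearson chi-square statistic divided by the residual degrees of freedom.

```r
# Simulated over-dispersed counts (hypothetical data).
set.seed(321)
n <- 200
demo <- data.frame(temp = runif(n, 40, 80))
# Negative binomial counts: the variance exceeds the mean.
demo$count <- rnbinom(n, mu = exp(1 + 0.03 * demo$temp), size = 2)

pois.fit  <- glm(count ~ temp, family = poisson(link = "log"), data = demo)
quasi.fit <- glm(count ~ temp, family = quasipoisson(link = "log"), data = demo)

# Dispersion estimate: Pearson chi-square divided by residual degrees of freedom.
phi.hat <- sum(residuals(pois.fit, type = "pearson")^2) / df.residual(pois.fit)

se.pois  <- summary(pois.fit)$coefficients[, "Std. Error"]
se.quasi <- summary(quasi.fit)$coefficients[, "Std. Error"]

print(phi.hat)
print(cbind(poisson = coef(pois.fit), quasi = coef(quasi.fit)))
# Each Quasi-Poisson standard error is the Poisson one multiplied by sqrt(phi.hat).
print(se.quasi / se.pois)
```

The two models give the same coefficient estimates. Only the standard errors change, which is why predictors can lose statistical significance under the Quasi-Poisson model without the fitted values changing.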




# Poisson Regression Models on the Original Variables

First, we will revisit the Poisson regression models which were created in the previous week's assignment and look at the corrected versions of these models. In that assignment, we created two Poisson regression models, one on the frequency counts and one on the rates, and we did not alter any of the predictor variables.

So, for now, we will use the original predictor variables of Day, HighTemp, LowTemp, and Precipitation to see the proper, corrected models.

In later steps of this project, we will create new models using the new predictor variables of AvgTemp and NewPrecip.

## Poisson Regression Model on Frequency Counts 

We will begin with creating a Poisson regression model of the frequency counts. This model will be on the frequency counts of individuals on the Williamsburg Bridge for a given observation. Our goal is to create a Poisson regression model which can predict, with statistical significance, the count of individuals on the Williamsburg Bridge for a given observation, based upon the various factors in this data set.

We will create our Poisson regression model on the frequency counts.

```{r}
# Poisson Regression Model of Counts
model.counts <- glm(WilliamsburgBridge ~ Day + HighTemp + LowTemp + Precipitation, family = poisson(link = "log"), data = cycling)
pois.count.coef = summary(model.counts)$coef
kable(pois.count.coef, caption = "Poisson Regression Model for the Counts of Cyclists \n on the Williamsburg Bridge")
```

The regression equation for the Poisson regression model on the frequency counts is given as:

log(μ) = 7.6538 + 0.0546 * DayMonday - 0.2744 * DaySaturday - 0.2346 * DaySunday + 0.0319 * DayThursday + 0.1988 * DayTuesday + 0.0511 * DayWednesday + 0.0171 * HighTemp - 0.0024 * LowTemp - 1.0321 * Precipitation


All of the predictor variables, DayMonday, DaySaturday, DaySunday, DayThursday, DayTuesday, DayWednesday, HighTemp, LowTemp, and Precipitation, have p-values below .01, and all except DayThursday (p = 0.0019) and LowTemp (p = 0.0023) have p < .001. This indicates that all of the predictor variables in our model are statistically significant in predicting the expected counts of cyclists on the Williamsburg Bridge on a given day.

The significance of these variables with regard to predicting the expected counts can likely be attributed to adverse weather conditions, such as excessive heat or cold, or intense precipitation and storms, which make conditions less than ideal for outdoor activities such as cycling. These predictor variables all being statistically significant suggests that changes in temperature and precipitation are associated with differences in the number of cyclists on the Williamsburg Bridge from day to day.

Overall, this Poisson model of the frequency counts of the cyclists on the Williamsburg Bridge showed statistical significance in its prediction of the expected log counts for the number of cyclists on the Williamsburg Bridge for a given observation.

For our categorical predictor variable of Day, Friday was chosen as the baseline level, which can be seen from the absence of a "DayFriday" variable in the regression output. This is because, of the seven days, Friday is the one which comes first alphabetically, and R chooses the level which comes first alphabetically as the baseline level. Therefore, in our regression coefficient interpretations, the different levels of the Day variable will be compared against the baseline level of Friday.
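This default can be seen with a small stand-alone vector. In the sketch below, `day` and `day.monday` are illustrative names and not objects from this analysis. It also shows how `relevel()` could be used if a different baseline, for example Monday, were preferred.

```r
# A stand-alone factor holding the seven day names.
day <- factor(c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday"))

# Levels are sorted alphabetically, and the first level is the baseline.
print(levels(day))

# relevel() moves a chosen level to the front, making it the new baseline.
day.monday <- relevel(day, ref = "Monday")
print(levels(day.monday))
```

Changing the baseline does not change the fitted values of a model. It only changes which day the other days are compared against.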


### Regression Coefficients Interpretation

We will analyze the regression coefficients for the variables in this Poisson regression model on frequency counts.

* The value of the y-intercept is given as 7.6536. This represents the baseline of the mean of log(μ) when all predictor variables are equal to 0. However, the y-intercept does not have a practical interpretation or meaning in this scenario so we are not interested in its meaning for the Poisson regression model.

* DayMonday (p < .001): The regression coefficient for the variable DayMonday was found to be 0.0546. This means that the mean log count of cyclists on the Williamsburg Bridge was 0.0546 greater on Monday than on Friday. Equivalently, the expected count of cyclists on Monday is 1.0561 times the expected count on Friday, holding all other variables constant.

* DaySaturday (p < .001): The regression coefficient for the variable DaySaturday was found to be -0.2744. This means that the mean log count of cyclists on the Williamsburg Bridge was 0.2744 less on Saturday than on Friday. Equivalently, the expected count of cyclists on Saturday is 0.7600 times the expected count on Friday (about 24% lower), holding all other variables constant.

* DaySunday (p < .001): The regression coefficient for the variable DaySunday was found to be -0.2346. This means that the mean log count of cyclists on the Williamsburg Bridge was 0.2346 less on Sunday than on Friday. Equivalently, the expected count of cyclists on Sunday is 0.7909 times the expected count on Friday (about 21% lower), holding all other variables constant.

* DayThursday (p = 0.0019): The regression coefficient for the variable DayThursday was found to be 0.0319. This means that the mean log count of cyclists on the Williamsburg Bridge was 0.0319 greater on Thursday than on Friday. Equivalently, the expected count of cyclists on Thursday is 1.0324 times the expected count on Friday, holding all other variables constant.

* DayTuesday (p < .001): The regression coefficient for the variable DayTuesday was found to be 0.1988. This means that the mean log count of cyclists on the Williamsburg Bridge was 0.1988 greater on Tuesday than on Friday. Equivalently, the expected count of cyclists on Tuesday is 1.2199 times the expected count on Friday, holding all other variables constant.

* DayWednesday (p < .001): The regression coefficient for the variable DayWednesday was found to be 0.0511. This means that the mean log count of cyclists on the Williamsburg Bridge was 0.0511 greater on Wednesday than on Friday. Equivalently, the expected count of cyclists on Wednesday is 1.0524 times the expected count on Friday, holding all other variables constant.

* HighTemp (p <.001): The regression coefficient of the HighTemp variable in this model is 0.0171. This means that the mean log of the counts increases by 0.0171 units for every 1 degree Fahrenheit increase in the high temperature for the given observation, holding all other variables constant. 

* LowTemp (p = 0.0023): The regression coefficient of the LowTemp variable in this model is -0.0024. This means that the mean log of the counts decreases by 0.0024 units for every 1 degree Fahrenheit increase in the low temperature for the given observation, holding all other variables constant. 

* Precipitation (p < .001): The regression coefficient of the Precipitation variable in this model is -1.0321. This means that the mean log of the counts decreases by 1.0321 units for every 1 inch increase in the amount of precipitation for the given observation, holding all other variables constant. 
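
The coefficients for the three numeric predictors can also be read on the multiplicative scale, the same way the Day coefficients were. The chunk below is a small sketch of that conversion. The object names `count_coefs`, `count_ratio`, and `count_pct_change` are made up for this illustration, and the coefficient values are typed in from the interpretations above rather than pulled from the fitted model.

```{r}
# Coefficients copied by hand from the counts model output above (log scale)
count_coefs <- c(HighTemp = 0.0171, LowTemp = -0.0024, Precipitation = -1.0321)

# exp(coefficient) is the multiplicative change in the expected count
# for a one-unit increase in the predictor
count_ratio <- exp(count_coefs)

# The same change expressed as a percentage
count_pct_change <- 100 * (count_ratio - 1)

round(cbind(count_ratio, count_pct_change), 4)
```

A ratio above 1 means the expected count goes up as the predictor increases, and a ratio below 1 means it goes down.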


## Poisson Regression Model on Rates

Now, we will create a Poisson regression model of the rates at which cyclists enter and leave via the Williamsburg Bridge offset by the total number of cyclists on all four of the major New York bridges. This model, unlike the previous model which just focused on the frequency counts of cyclists on the Williamsburg Bridge, will also account for the total number of cyclists on all four of the major New York bridges, the Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the Queensboro Bridge. This Poisson model will look at the rates of the number of cyclists on the Williamsburg Bridge for a given observation as a rate out of the total number of cyclists on all four of these major bridges for that specific observation.

We will build our Poisson regression model for the rates. This time, we will still use the WilliamsburgBridge variable as our response variable, but we will offset the model by the Total variable to make our Poisson model for the rates of cyclists on the Williamsburg Bridge out of the total number of cyclists on all four of the bridges. 
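
To make the role of the offset concrete, here is a short derivation. The notation is my own for this sketch: μ is the expected count on the Williamsburg Bridge, t is the Total variable, and x_1 through x_k stand for the predictor variables. Adding log(t) to the linear predictor with a coefficient fixed at 1 is the same as modeling the log of the rate μ/t:

$$
\log(\mu) = \log(t) + \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longleftrightarrow\quad
\log\left(\frac{\mu}{t}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
$$

This is why the regression equation for this model is written with log(μ/t) on the left-hand side.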


```{r}
# Poisson Model of Rates
model.rates <- glm(WilliamsburgBridge ~ Day + HighTemp + LowTemp + Precipitation, offset = log(Total), 
                   family = poisson(link = "log"), data = cycling)
kable(summary(model.rates)$coef, caption = "Poisson Regression Model of the Rates of Cyclists \n on the Williamsburg Bridge out of all Four Bridges")
```

The regression equation for the Poisson regression model on the rates is given as:

log(μ/t) = -1.0682 + 0.0004 * DayMonday + 0.0375 * DaySaturday + 0.0051 * DaySunday + 0.0206 * DayThursday + 0.0138 * DayTuesday + 0.0233 * DayWednesday - 0.0012 * HighTemp + 0.0004 * LowTemp + 0.0505 * Precipitation

Not all of the predictor variables in this Poisson model are statistically significant. At the alpha value of 0.05, DaySaturday (p < .001), DayThursday (p = 0.0456), DayWednesday (p = 0.0206), HighTemp (p = 0.0426), and Precipitation (p = 0.0017) are statistically significant in predicting the expected count of cyclists on the Williamsburg Bridge on a given day, offset by the total number of cyclists on all four of the major New York bridges. DayMonday (p = 0.9689), DaySunday (p = 0.5976), DayTuesday (p = 0.1839), and LowTemp (p = 0.6555) are not statistically significant.

Since several of its predictor variables are statistically significant, this model for the rates still has some utility for prediction and estimation, although fewer predictors are significant here than in the model for the frequency counts.

As was stated for the Poisson regression model on frequency counts, Friday was chosen by R to be the baseline level of the Day variable, and so we will compare the regression coefficients against this baseline level.



### Regression Coefficients Interpretation

We will analyze the regression coefficients for the variables in this Poisson regression model on rates.

* The value of the y-intercept is given as -1.0682. This is the value of log(μ/t), the log of the expected rate, on the baseline day (Friday) when all of the numeric predictor variables are equal to 0. However, the y-intercept does not have a practical interpretation in this scenario, so we are not interested in its meaning for the Poisson regression model.

* DayMonday (p = 0.9689): The regression coefficient for the variable DayMonday was found to be 0.0004. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0004 higher on Monday than on Friday. Equivalently, the expected rate on Monday is 1.0004 times the expected rate on Friday, holding all other variables constant.

* DaySaturday (p < .001): The regression coefficient for the variable DaySaturday was found to be 0.0375. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0375 higher on Saturday than on Friday. Equivalently, the expected rate on Saturday is 1.0382 times the expected rate on Friday, holding all other variables constant.

* DaySunday (p = 0.5976): The regression coefficient for the variable DaySunday was found to be 0.0051. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0051 higher on Sunday than on Friday. Equivalently, the expected rate on Sunday is 1.0051 times the expected rate on Friday, holding all other variables constant.

* DayThursday (p = 0.0456): The regression coefficient for the variable DayThursday was found to be 0.0206. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0206 higher on Thursday than on Friday. Equivalently, the expected rate on Thursday is 1.0208 times the expected rate on Friday, holding all other variables constant.

* DayTuesday (p = 0.1839): The regression coefficient for the variable DayTuesday was found to be 0.0138. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0138 higher on Tuesday than on Friday. Equivalently, the expected rate on Tuesday is 1.0139 times the expected rate on Friday, holding all other variables constant.

* DayWednesday (p = 0.0206): The regression coefficient for the variable DayWednesday was found to be 0.0233. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0233 higher on Wednesday than on Friday. Equivalently, the expected rate on Wednesday is 1.0236 times the expected rate on Friday, holding all other variables constant.

* HighTemp (p = 0.0426): The regression coefficient of the HighTemp variable in this model is -0.0012. This means that the log of the expected rate, log(μ/t), decreases by 0.0012 units for every 1 degree Fahrenheit increase in the high temperature for the given observation, holding all other variables constant.

* LowTemp (p = 0.6555): The regression coefficient of the LowTemp variable in this model is 0.0004. This means that the log of the expected rate, log(μ/t), increases by 0.0004 units for every 1 degree Fahrenheit increase in the low temperature for the given observation, holding all other variables constant.

* Precipitation (p = 0.0017): The regression coefficient of the Precipitation variable in this model is 0.0505. This means that the log of the expected rate, log(μ/t), increases by 0.0505 units for every 1 inch increase in the amount of precipitation for the given observation, holding all other variables constant.


## Summary and Comparisons of the Two Models

Both of the Poisson regression models we created, the model for the frequency counts and the model for the rates, had statistically significant predictors. In both of these models, we looked at the number of cyclists on the Williamsburg Bridge in New York for a specific observation, using the day of the week along with some weather factors which may affect how many cyclists are out on that date. These factors were the high temperature, the low temperature, and the amount of precipitation. All three of these weather factors were statistically significant in the model for the frequency counts. In the model for the rates, HighTemp (p = 0.0426) and Precipitation (p = 0.0017) were statistically significant, but LowTemp (p = 0.6555) was not. The significance of the weather factors in the counts model can be attributed to certain weather conditions making it more or less ideal for individuals to be cycling outdoors. For instance, a day with very high temperatures, very cold temperatures, or severe storms with heavy precipitation would be less ideal and would likely lead to fewer cyclists being out than on a day with pleasant weather.

Overall, both of the Poisson regression models had statistically significant predictors. However, as was previously stated, there were some violations of the conditions for a Poisson regression model within our data set. First, it was found that the mean of the response variable, WilliamsburgBridge, was not equal to its variance. This suggests that this response variable is not Poisson distributed, since a Poisson random variable must have its mean equal to its variance. Additionally, all four predictor variables were checked, and it was found that the response variable was not a linear function of any of them. This indicates another major violation. These violations suggest that a Poisson model may not have been the best choice for this data set, and it is important to be mindful of them when using either of the Poisson regression models we created for prediction.





# Poisson Regression Model on Frequency Counts


We will begin by creating a Poisson regression model of the frequency counts using the new variables we created for this project. Specifically, this model will be on the frequency counts of cyclists on the Williamsburg Bridge for a given observation. Our goal is to create a Poisson regression model in which the predictors are statistically significant in predicting the count of cyclists on the Williamsburg Bridge for a given observation, based upon the various factors in this data set.

We will create our Poisson regression model on the frequency counts.

```{r}
# Poisson Regression Model of Counts
model.counts <- glm(WilliamsburgBridge ~ Day + AvgTemp + NewPrecip, 
                    family = poisson(link = "log"), data = cycling)
pois.count.coef = summary(model.counts)$coef
kable(pois.count.coef, caption = "Poisson Regression Model for the Counts of Cyclists \n on the Williamsburg Bridge")
```

The regression equation for the Poisson regression model on the frequency counts is given as:

log(μ) = 7.2025 + 0.0302 * DayMonday - 0.1945 * DaySaturday - 0.2451 * DaySunday - 0.0299 * DayThursday - 0.0604 * DayTuesday + 0.1714 * DayWednesday + 0.0253 * AvgTemp - 0.3408 * NewPrecip

All of the predictor variables, Day, AvgTemp, and NewPrecip, have p-values of p < .001. This indicates that all of the predictor variables in our model are statistically significant in predicting the expected count of cyclists on the Williamsburg Bridge on a given day.

For our categorical predictor variable of Day, Friday was chosen as the baseline level, which can be seen from the fact that there is no "DayFriday" variable in the regression equation output. This is because, of the seven days, Friday is the one which comes first alphabetically, and R chooses the level which comes first alphabetically as the baseline level. Therefore, in our regression coefficient interpretations for the different levels of the Day variable, each level will be compared against the baseline level of Friday.
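
The chunk below is a small standalone sketch of this behavior. It uses a made-up factor called `demo_days` (not the cycling data itself) to show that R sorts the levels alphabetically, and that `relevel()` could be used if a different baseline day were wanted.

```{r}
# Illustrative factor only; this is not the Day column from the cycling data
demo_days <- factor(c("Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday"))

# Levels are sorted alphabetically, so the first (baseline) level is "Friday"
levels(demo_days)

# relevel() moves a chosen level to the front, making it the new baseline
demo_days_monday <- relevel(demo_days, ref = "Monday")
levels(demo_days_monday)[1]
```

I kept Friday as the baseline in this project, so the interpretations that follow are all relative to Friday.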


## Regression Coefficients Interpretation

We will analyze the regression coefficients for the variables in this Poisson regression model on frequency counts.

* The value of the y-intercept is given as 7.2025. This is the value of log(μ), the log of the expected count, on the baseline day (Friday) when all of the other predictor variables are equal to 0. However, the y-intercept does not have a practical interpretation in this scenario, so we are not interested in its meaning for the Poisson regression model.

* DayMonday (p < .001): The regression coefficient of the DayMonday variable in this model is 0.0302. This means that the log of the expected count of cyclists on the Williamsburg Bridge is 0.0302 higher on Monday than on Friday. Equivalently, the expected count on Monday is 1.0307 times the expected count on Friday, holding all other variables constant.

* DaySaturday (p < .001): The regression coefficient of the DaySaturday variable in this model is -0.1945. This means that the log of the expected count of cyclists on the Williamsburg Bridge is 0.1945 lower on Saturday than on Friday. Equivalently, the expected count on Saturday is 0.8232 times the expected count on Friday (about 18% lower), holding all other variables constant.

* DaySunday (p < .001): The regression coefficient of the DaySunday variable in this model is -0.2451. This means that the log of the expected count of cyclists on the Williamsburg Bridge is 0.2451 lower on Sunday than on Friday. Equivalently, the expected count on Sunday is 0.7826 times the expected count on Friday (about 22% lower), holding all other variables constant.

* DayThursday (p < .001): The regression coefficient of the DayThursday variable in this model is -0.0299. This means that the log of the expected count of cyclists on the Williamsburg Bridge is 0.0299 lower on Thursday than on Friday. Equivalently, the expected count on Thursday is 0.9705 times the expected count on Friday (about 3% lower), holding all other variables constant.

* DayTuesday (p < .001): The regression coefficient of the DayTuesday variable in this model is -0.0604. This means that the log of the expected count of cyclists on the Williamsburg Bridge is 0.0604 lower on Tuesday than on Friday. Equivalently, the expected count on Tuesday is 0.9414 times the expected count on Friday (about 6% lower), holding all other variables constant.

* DayWednesday (p < .001): The regression coefficient of the DayWednesday variable in this model is 0.1714. This means that the log of the expected count of cyclists on the Williamsburg Bridge is 0.1714 higher on Wednesday than on Friday. Equivalently, the expected count on Wednesday is 1.1870 times the expected count on Friday, holding all other variables constant.

* AvgTemp (p < .001): The regression coefficient of the AvgTemp variable in this model is 0.0253. This means that the log of the expected count increases by 0.0253 units for every 1 degree Fahrenheit increase in the average temperature for the given observation, holding all other variables constant.

* NewPrecip (p < .001): The regression coefficient of the NewPrecip variable in this model is -0.3408. This means that the log of the expected count of cyclists on the Williamsburg Bridge is 0.3408 lower on days where there is precipitation than on days where there is no precipitation. Equivalently, the expected count on days with precipitation is 0.7112 times the expected count on days with no precipitation (about 29% lower), holding all other variables constant.


All of the predictor variables in this Poisson regression model on frequency counts were statistically significant, with all of their p-values being less than .001.
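
As a worked example of how the regression equation turns into an expected count, the chunk below plugs values into the equation by hand. The coefficients are typed in from the equation above. The temperature of 70 degrees Fahrenheit is a hypothetical value I picked for illustration, and the object names (`b0_counts`, `example_temp`, and so on) are made up for this sketch.

```{r}
# Coefficients copied by hand from the counts regression equation above
b0_counts     <- 7.2025
b_avgtemp     <- 0.0253
b_newprecip   <- -0.3408

# Hypothetical scenario: a Friday (baseline day) with an average temperature of 70
example_temp <- 70

# Expected count without and with precipitation (inverse of the log link)
mu_dry <- exp(b0_counts + b_avgtemp * example_temp)
mu_wet <- exp(b0_counts + b_avgtemp * example_temp + b_newprecip)

# The ratio of the two expected counts is exactly exp(b_newprecip)
c(mu_dry = mu_dry, mu_wet = mu_wet, ratio = mu_wet / mu_dry)
```

The ratio does not depend on the temperature chosen, which is what "holding all other variables constant" means in the interpretations above.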




# Poisson Regression Model on Rates


Now, we will create a Poisson regression model of the rates at which cyclists enter and leave via the Williamsburg Bridge offset by the total number of cyclists on all four of the major New York bridges, using the new variables we created in this project. 

This model, unlike the previous model which just focused on the frequency counts of cyclists on the Williamsburg Bridge, will also account for the total number of cyclists on all four of the major New York bridges, the Brooklyn Bridge, the Manhattan Bridge, the Williamsburg Bridge, and the Queensboro Bridge. This Poisson model will look at the rates of the number of cyclists on the Williamsburg Bridge for a given observation as a rate out of the total number of cyclists on all four of these major bridges for that specific observation.

We will build our Poisson regression model for the rates. This time, we will still use the WilliamsburgBridge variable as our response variable, but we will offset the model by the Total variable to make our Poisson model for the rates of cyclists on the Williamsburg Bridge out of the total number of cyclists on all four of the bridges. 


```{r}
# Poisson Model of Rates
model.rates <- glm(WilliamsburgBridge ~ Day + AvgTemp + NewPrecip, 
                   offset = log(Total), 
                   family = poisson(link = "log"), data = cycling)
kable(summary(model.rates)$coef, caption = "Poisson Regression Model of the Rates of Cyclists \n on the Williamsburg Bridge out of all Four Bridges")
```

The regression equation for the Poisson regression model on the rates is given as:

log(μ/t) = -1.0436 + 0.0018 * DayMonday + 0.0345 * DaySaturday + 0.0035 * DaySunday + 0.0237 * DayThursday + 0.0253 * DayTuesday + 0.0184 * DayWednesday - 0.0015 * AvgTemp + 0.0137 * NewPrecip


The variables DaySaturday, DayThursday, DayTuesday, AvgTemp, and NewPrecip all had p-values less than the alpha value of 0.05, meaning that these are the variables which are statistically significant in this model. 

As was stated for the Poisson regression model on frequency counts, Friday was chosen by R to be the baseline level of the Day variable, and so we will compare the regression coefficients against this baseline level.



## Regression Coefficients Interpretation

We will analyze the regression coefficients for the variables in this Poisson regression model on rates.

* The value of the y-intercept is given as -1.0436. This is the value of log(μ/t), the log of the expected rate, on the baseline day (Friday) when all of the other predictor variables are equal to 0. However, the y-intercept does not have a practical interpretation in this scenario, so we are not interested in its meaning for the Poisson regression model.

* DayMonday (p = 0.8495): The regression coefficient of the DayMonday variable in this model is 0.0018. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0018 higher on Monday than on Friday. Equivalently, the expected rate on Monday is 1.0018 times the expected rate on Friday, holding all other variables constant.

* DaySaturday (p < .001): The regression coefficient of the DaySaturday variable in this model is 0.0345. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0345 higher on Saturday than on Friday. Equivalently, the expected rate on Saturday is 1.0351 times the expected rate on Friday, holding all other variables constant.

* DaySunday (p = 0.7246): The regression coefficient of the DaySunday variable in this model is 0.0035. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0035 higher on Sunday than on Friday. Equivalently, the expected rate on Sunday is 1.0035 times the expected rate on Friday, holding all other variables constant.

* DayThursday (p = 0.0195): The regression coefficient of the DayThursday variable in this model is 0.0237. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0237 higher on Thursday than on Friday. Equivalently, the expected rate on Thursday is 1.0240 times the expected rate on Friday, holding all other variables constant.

* DayTuesday (p = 0.0124): The regression coefficient of the DayTuesday variable in this model is 0.0253. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0253 higher on Tuesday than on Friday. Equivalently, the expected rate on Tuesday is 1.0256 times the expected rate on Friday, holding all other variables constant.

* DayWednesday (p = 0.0700): The regression coefficient of the DayWednesday variable in this model is 0.0184. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0184 higher on Wednesday than on Friday. Equivalently, the expected rate on Wednesday is 1.0186 times the expected rate on Friday, holding all other variables constant.

* AvgTemp (p < .001): The regression coefficient of the AvgTemp variable in this model is -0.0015. This means that the log of the expected rate decreases by 0.0015 units for every 1 degree Fahrenheit increase in the average temperature for the given observation, holding all other variables constant.

* NewPrecip (p = 0.0321): The regression coefficient of the NewPrecip variable in this model is 0.0137. This means that the log of the expected rate of cyclists on the Williamsburg Bridge is 0.0137 higher on days where there is precipitation than on days where there is no precipitation. Equivalently, the expected rate on days with precipitation is 1.0138 times the expected rate on days with no precipitation, holding all other variables constant.



Out of all of the predictor variables, the ones which showed statistical significance were DaySaturday (p < .001), DayThursday (p = 0.0195), DayTuesday (p = 0.0124), AvgTemp (p < .001), and NewPrecip (p = 0.0321). All of these predictor variables have p-values less than the alpha value of 0.05, indicating they are statistically significant to the model.

The variables DayMonday (p = 0.8495), DaySunday (p = 0.7246), and DayWednesday (p = 0.0700) did not show statistical significance, as they have p-values greater than the alpha value of 0.05.



# Quasi-Poisson Regression Model


Next, we will create a Quasi-Poisson regression model. This Quasi-Poisson regression model will be fit on the rates, so it will be offset by the Total variable while still using WilliamsburgBridge as its response variable.

```{r}
# Quasi-Poisson Regression Model
quasimodel.rates <- glm(WilliamsburgBridge ~ Day + AvgTemp + NewPrecip, 
                        offset = log(Total), 
                        family = quasipoisson, data = cycling)
summary(quasimodel.rates)

pander(summary(quasimodel.rates)$coef, caption = "Quasi-Poisson Regression Model")
```


The regression equation of the Quasi-Poisson Regression Model is given as follows:

log(μ/t) = -1.0436 + 0.0018 * DayMonday + 0.0345 * DaySaturday + 0.0035 * DaySunday + 0.0237 * DayThursday + 0.0253 * DayTuesday + 0.0184 * DayWednesday - 0.0015 * AvgTemp + 0.0137 * NewPrecip
 
 
As we can see, the Quasi-Poisson regression model has the same coefficient estimates as the standard Poisson regression model on rates; however, the p-values for these regression coefficients differ between the two models.

So, the interpretations of the regression coefficients for this Quasi-Poisson regression model are exactly the same as those given above for the Poisson regression model on rates with the new predictor variables Day, AvgTemp, and NewPrecip.

Out of all of the predictor variables in our Quasi-Poisson regression model on rates, only AvgTemp (p = 0.043) was statistically significant, as it was the only predictor variable with a p-value less than the alpha value of 0.05. This means that AvgTemp is the only statistically significant predictor variable in predicting the rate of cyclists on the Williamsburg Bridge.

All of the other predictor variables, DayMonday, DaySaturday, DaySunday, DayThursday, DayTuesday, DayWednesday, and NewPrecip, were not statistically significant in the Quasi-Poisson regression model, because they all had p-values greater than the alpha value of 0.05.



## Dispersion

Now, we will look at the dispersion parameter for the Quasi-Poisson regression model in order to see how dispersed the data is relative to what a Poisson model assumes.

In the output of the model summary, we were given that the dispersion parameter for the Quasi-Poisson model is 4.2436. This dispersion parameter given in the model summary is the Pearson dispersion parameter.

We can also calculate the Deviance dispersion parameter to compare these two dispersion parameters for our Quasi-Poisson regression model on rates.
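
For reference, the two dispersion estimates computed in this section can be written out as formulas. The notation here is my own for this sketch: y_i is the observed count, the fitted value for observation i is written with a hat, D is the residual deviance of the model, and n - p is the residual degrees of freedom.

$$
\hat{\phi}_{\text{Pearson}} = \frac{1}{n - p} \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i},
\qquad
\hat{\phi}_{\text{Deviance}} = \frac{D}{n - p}
$$

A value close to 1 for either estimate would be consistent with the Poisson assumption that the variance equals the mean.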

```{r}
# Dispersion Parameters
yhat = quasimodel.rates$fitted.values
pearson.resid = (cycling$WilliamsburgBridge - yhat)/sqrt(yhat)
Pearson.dispersion = sum(pearson.resid^2)/quasimodel.rates$df.residual
Deviance.dispersion = (quasimodel.rates$deviance)/quasimodel.rates$df.residual
disp = cbind(Pearson.dispersion = Pearson.dispersion, 
             Deviance.dispersion = Deviance.dispersion)
kable(disp, caption="Dispersion parameter", align = 'c')
```

As we can see, the value of the Pearson dispersion parameter for our Quasi-Poisson regression model is 4.2436. The value of the Deviance dispersion parameter for our Quasi-Poisson regression model is 4.2426.

These dispersion parameters show that our data is over-dispersed, as both values are well above the value of 1 that a Poisson model assumes. Therefore, using the standard Poisson regression model would likely not be an ideal choice, because over-dispersion makes its standard errors too small and its p-values too optimistic. Since the dispersion differs from 1 by this much, the Quasi-Poisson model is likely the better choice.
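
To see how much the dispersion changes the p-values, the chunk below gives a rough sketch. For a quasi-Poisson fit, R's `summary()` scales each Poisson standard error by the square root of the estimated dispersion. The dispersion value and the example p-value (for NewPrecip in the Poisson model on rates) are typed in from the output above, the object names are made up for this illustration, and the adjusted p-value uses a normal approximation, so it will not exactly match the t-based value in the model summary.

```{r}
# Pearson dispersion estimate copied from the model summary above
phi_hat <- 4.2436

# Each Poisson standard error is multiplied by this factor in the quasi-Poisson fit
se_inflation <- sqrt(phi_hat)

# Example: Poisson p-value reported above for NewPrecip in the rates model
p_poisson <- 0.0321
z_poisson <- qnorm(1 - p_poisson / 2)      # two-sided p-value back to |z|
z_quasi   <- z_poisson / se_inflation      # test statistic shrinks by the same factor

# Normal approximation to the quasi-Poisson p-value (summary() uses a t reference)
p_quasi_approx <- 2 * pnorm(-z_quasi)

c(se_inflation = se_inflation, p_quasi_approx = p_quasi_approx)
```

This is why a predictor such as NewPrecip can be statistically significant in the standard Poisson model but not in the Quasi-Poisson model, even though its coefficient estimate is identical in both.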



# Final Model 

Now, for our final model, we must choose between the standard Poisson regression model on rates and the Quasi-Poisson regression model.

One important thing to note when making this choice is that the standard Poisson model assumes that the mean of the response variable is equal to its variance, while the Quasi-Poisson model does not. When we checked the conditions of the standard Poisson regression model earlier, we found that the mean of the response variable does not equal its variance, indicating a major violation. This violation raises concern for the standard Poisson regression model, as it suggests that the response variable is, in fact, not a Poisson random variable, and therefore a standard Poisson regression model may not be the best choice for this data set.

Here, the Quasi-Poisson model has the advantage, as it does not assume that the mean of the response variable is equal to its variance, which is a condition our data set failed to meet.

Both models have advantages and disadvantages which must be considered when choosing a final model. The standard Poisson regression model on rates showed statistical significance for the majority of its predictor variables. However, the data set failed to meet the condition of the mean of the response variable equaling its variance, which raises concern about whether those p-values can be trusted. On the other hand, the Quasi-Poisson regression model does not require this condition. However, in the Quasi-Poisson model, only a single predictor variable, AvgTemp, showed statistical significance, indicating that most of the predictors may not be significant after all.

Additionally, we found that our data is over-dispersed, with a dispersion parameter of 4.2436, which is well above 1. With over-dispersed data, a standard Poisson regression model is likely not the ideal choice, because it underestimates the standard errors of its coefficients and so overstates their statistical significance. When the data is over-dispersed, the Quasi-Poisson regression model should be used. So, even though the Quasi-Poisson regression model in this case had few statistically significant predictor variables, it is likely the better choice.

In the end, it is a choice between the standard Poisson regression model, which reports more statistically significant predictors but likely overstates that significance due to over-dispersion, and the Quasi-Poisson regression model, which reports fewer statistically significant predictors but accounts for dispersion and does not depend on the mean of the response variable equaling its variance. Both models have the same coefficient estimates.

Overall, I would say that the Quasi-Poisson regression model is the safer choice of the two, as it does not require the mean of the response variable to equal its variance, which was a condition our data set failed. Using a standard Poisson regression model on over-dispersed data can make predictors look more significant than they really are. The Quasi-Poisson regression model does show far fewer significant predictors, which means that many of the effects it estimates may not be distinguishable from zero. But the Quasi-Poisson regression model remains the better choice in this situation, as our data fails the required condition for the response variable to be a Poisson random variable, and we saw substantial over-dispersion as well.



# Visual Comparisons

Now, let's look at some visual comparisons of the data within our models. 

I chose to create a graph which illustrated the predicted rates of the cyclists on the Williamsburg Bridge based upon the day of the week and whether or not it rained for that given day. This graph will create two lines, one for precipitation (blue), and one for no precipitation (red). 

```{r}
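# One row per unique Day / NewPrecip combination, holding AvgTemp and Total
# at their sample means
graph <- expand.grid(
  Day = unique(cycling$Day), 
  NewPrecip = unique(cycling$NewPrecip), 
  AvgTemp = mean(cycling$AvgTemp, na.rm = TRUE),  
  Total = mean(cycling$Total, na.rm = TRUE)      
)
# type = "response" returns expected counts at the supplied Total, so divide
# by Total to get the rate (share of cyclists across all four bridges)
graph$predicted_rate <- predict(quasimodel.rates, newdata = graph, type = "response") / graph$Total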

graph$NewPrecip <- factor(graph$NewPrecip, levels = c(0, 1), labels = c("No Precipitation", "Precipitation"))

ggplot(graph, aes(x = Day, y = predicted_rate, color = NewPrecip, group = NewPrecip)) +
  geom_line(size = 1) +   
  geom_point(size = 2) +   
  labs(title = "Predicted Rates of the Cyclists \n on the Williamsburg Bridge by the \n Day of the Week and the Precipitation \n Conditions",
       x = "Day", 
       y = "Rate of Cyclists",
       color = "Precipitation Conditions") +
  theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

As we can see, this graph shows the predicted rate of cyclists on the Williamsburg Bridge out of the total across all four of the major New York bridges, based on the day of the week and whether there was precipitation or not. The graph has two lines, one for precipitation (blue) and one for no precipitation (red), with a point on each line for each of the seven days of the week.

As we can see by looking at our graph, it is predicted that the highest rate of cyclists on the Williamsburg Bridge occurs on Saturdays with precipitation, and the lowest rate of cyclists on the Williamsburg Bridge occurs on Fridays with no precipitation. 





# Conclusion

Overall, we looked at various Poisson regression models in this project to predict the frequency counts and the rates of the cyclists on the Williamsburg Bridge. We also looked at a Quassi-Poisson regression model to account for the dispersion of the data along with the violations that were seen which indicated that a standard Poisson regression model may not be the ideal fit for our data.

In our standard Poisson regression model on rates, several predictor variables were statistically significant in predicting the rates of cyclists on the Williamsburg Bridge. In our Quasi-Poisson regression model, only one predictor variable was statistically significant. At first glance, this made the standard Poisson regression model seem like the stronger model for prediction.

However, the dispersion parameter of the Quasi-Poisson regression model showed that our data are in fact significantly over-dispersed. A standard Poisson regression model does not account for this over-dispersion, which can lead to inaccurate results, in particular standard errors that are too small. This over-dispersion, along with the fact that our data set violated the condition that the mean of the response variable equals its variance, shows that the standard Poisson regression model is not an ideal choice after all. The Quasi-Poisson regression model is therefore the better and safer choice, despite having fewer statistically significant variables. Its standard errors are adjusted for the over-dispersion, so its inference is more reliable, even though it shows that the majority of the predictor variables were not statistically significant in predicting the rates of cyclists on the Williamsburg Bridge.
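
To make the role of the dispersion parameter concrete, here is a minimal sketch on simulated, deliberately over-dispersed counts rather than the project's data. The variables `x` and `y` and both fitted models are assumptions made for illustration. It shows the Pearson estimate of the dispersion, and that the Poisson and Quasi-Poisson fits share the same coefficients while the Quasi-Poisson standard errors are inflated by the square root of the dispersion.

```r
# Illustrative sketch on simulated, over-dispersed counts
# (NOT the project's data set).
set.seed(42)
n  <- 200
x  <- rnorm(n)
mu <- exp(4 + 0.3 * x)                # true conditional mean
y  <- rnbinom(n, size = 5, mu = mu)   # variance = mu + mu^2 / 5, larger than mu

# Rough check: sample mean versus sample variance of the response
c(mean = mean(y), variance = var(y))

pois_fit  <- glm(y ~ x, family = poisson)
quasi_fit <- glm(y ~ x, family = quasipoisson)

# Pearson estimate of the dispersion: values well above 1 suggest over-dispersion
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)
dispersion
summary(quasi_fit)$dispersion         # the same quantity, as reported by summary()

# The coefficient estimates are identical in both fits
cbind(poisson = coef(pois_fit), quasipoisson = coef(quasi_fit))

# Only the standard errors change: each ratio equals sqrt(dispersion)
se_pois  <- summary(pois_fit)$coefficients[, "Std. Error"]
se_quasi <- summary(quasi_fit)$coefficients[, "Std. Error"]
se_quasi / se_pois
sqrt(dispersion)
```

This is why a predictor can be significant under the Poisson model and not under the Quasi-Poisson model: the estimate is unchanged, but its standard error grows, so its test statistic shrinks.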


## Recommendations 


Some recommendations I would make for future projects include:

* Look further into the violation found in this data set and into possible explanations for why it occurred. The mean of the response variable is not equal to its variance, which violates one of the assumptions of a Poisson regression model. It should be considered further whether a Poisson regression model is in fact the best choice for this data set and whether it is sufficient for prediction despite this violation.

* Consider other variables which may affect the number of cyclists on a given day. Other factors may add predictive value and strengthen the regression model. For instance, a variable indicating whether any holidays or other notable events occurred on the day of a given observation could be useful. This could be a binary predictor variable with a value of 1 if there are any events or holidays, and a value of 0 if there are not. It may be useful because there may tend to be fewer cyclists out on a day with a major holiday or a major event in the city.

* Expand the data set to improve the accuracy of the predictions. Collecting more observations over a longer period of time could strengthen the Poisson regression models and provide better accuracy and reliability in their results, which would in turn strengthen the conclusions drawn from the model building process.
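
The binary holiday variable suggested above could be built along these lines. This is only a sketch: the data frame `toy`, its dates, and the `holiday_dates` vector are hypothetical and chosen for illustration, and it assumes the `Date` column is stored as a `Date` object.

```r
# Illustrative sketch: the dates and the holiday list are hypothetical
# and do not come from the project's data set.
toy <- data.frame(
  Date = seq(as.Date("2024-07-01"), by = "day", length.out = 7)
)

# Hypothetical list of holidays or notable events
holiday_dates <- as.Date(c("2024-07-04"))

# 1 if the observation falls on a listed holiday or event, 0 otherwise
toy$Holiday <- as.integer(toy$Date %in% holiday_dates)
toy
```

The resulting `Holiday` column could then be added to the model formula alongside the existing predictors to test whether it improves the fit.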
