Data Description

Research Question

For brevity, the data set is described as the counts of cyclists per 24 hours entering and leaving Queens, Manhattan, and Brooklyn via the East River bridges. The response variable for the proposed analysis is QueensboroBridge, and the primary explanatory variable is Date. One objective of this analysis is therefore to explore the relationship between the cyclist counts on the Queensboro Bridge and the 24-hour period in which those counts were observed. The other variables, Day, HighTemp, LowTemp, and Precipitation, are also included, since they may help the proposed model fit better. Total, however, was withheld from the predictors and is instead used to scale the response variable later in the analysis.

Exploratory Data Analysis

The output below shows the data set in its original state. Notice that none of the non-numeric variables are stored as factors. QueensboroBridge and Total, which should be numeric, are stored as character variables because their values contain thousands separators, and Date is also stored as a character variable. Before any statistics were computed from the data set, some adjustments were made.

'data.frame':   31 obs. of  7 variables:
 $ Date            : chr  "7/1" "7/2" "7/3" "7/4" ...
 $ Day             : chr  "Saturday" "Sunday" "Monday" "Tuesday" ...
 $ HighTemp        : num  84.9 87.1 87.1 82.9 84.9 75 79 82.9 81 82.9 ...
 $ LowTemp         : num  72 73 71.1 70 71.1 71.1 68 70 69.1 71.1 ...
 $ Precipitation   : num  0.23 0 0.45 0 0 0 1.78 0 0 0 ...
 $ QueensboroBridge: chr  "3,216" "3,579" "4,230" "3,861" ...
 $ Total           : chr  "11,867" "13,995" "16,067" "13,925" ...
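
The adjustments can be sketched in R roughly as follows. This is only an illustration under stated assumptions: the object names `raw` and `clean` are hypothetical, only the first four values printed above are used, and the numeric weather columns are left out because they need no conversion. The year 2023 is appended so that the month/day strings can be parsed as dates.

```r
# First four rows as printed in the structure output above.
# `raw` and `clean` are hypothetical names used only for this sketch.
raw <- data.frame(
  Date = c("7/1", "7/2", "7/3", "7/4"),
  Day = c("Saturday", "Sunday", "Monday", "Tuesday"),
  QueensboroBridge = c("3,216", "3,579", "4,230", "3,861"),
  Total = c("11,867", "13,995", "16,067", "13,925"),
  stringsAsFactors = FALSE
)

clean <- raw
# Remove the thousands separators, then convert the counts to integers.
clean$QueensboroBridge <- as.integer(gsub(",", "", raw$QueensboroBridge, fixed = TRUE))
clean$Total <- as.integer(gsub(",", "", raw$Total, fixed = TRUE))
# Append the year so that the month/day strings parse as Date values.
clean$Date <- as.Date(paste0(raw$Date, "/2023"), format = "%m/%d/%Y")
# Store the day of the week as a factor (levels are sorted alphabetically).
clean$Day <- factor(raw$Day)

str(clean)
```

Because factor levels are sorted alphabetically by default, Friday becomes the reference level once all seven days are present, which matches the coefficient tables later in the analysis.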

The following output displays the data set after the adjustments, along with the first few observations and summary statistics for each variable. Notice that the observations are recorded over 31 days. Notice also that there are five observations each for Monday, Saturday, and Sunday and four for every other day. These three days are the days of the first three observations, since a 31-day period covers four full weeks plus three extra days.

'data.frame':   31 obs. of  7 variables:
 $ Date            : Date, format: "2023-07-01" "2023-07-02" ...
 $ Day             : Factor w/ 7 levels "Friday","Monday",..: 3 4 2 6 7 5 1 3 4 2 ...
 $ HighTemp        : num  84.9 87.1 87.1 82.9 84.9 75 79 82.9 81 82.9 ...
 $ LowTemp         : num  72 73 71.1 70 71.1 71.1 68 70 69.1 71.1 ...
 $ Precipitation   : num  0.23 0 0.45 0 0 0 1.78 0 0 0 ...
 $ QueensboroBridge: int  3216 3579 4230 3861 5862 5251 3304 3952 4044 5712 ...
 $ Total           : int  11867 13995 16067 13925 23110 21861 12805 17258 18320 24827 ...

First few observations from the data set
          Date       Day HighTemp LowTemp Precipitation QueensboroBridge Total
7/1 2023-07-01  Saturday     84.9    72.0          0.23             3216 11867
7/2 2023-07-02    Sunday     87.1    73.0          0.00             3579 13995
7/3 2023-07-03    Monday     87.1    71.1          0.45             4230 16067
7/4 2023-07-04   Tuesday     82.9    70.0          0.00             3861 13925
7/5 2023-07-05 Wednesday     84.9    71.1          0.00             5862 23110
7/6 2023-07-06  Thursday     75.0    71.1          0.00             5251 21861
7/7 2023-07-07    Friday     79.0    68.0          1.78             3304 12805

Summary statistics from the data set
Date                Day          HighTemp       LowTemp        Precipitation   QueensboroBridge  Total
Min.   :2023-07-01  Friday   :4  Min.   :69.10  Min.   :63.00  Min.   :0.0000  Min.   :2147      Min.   : 8210
1st Qu.:2023-07-08  Monday   :5  1st Qu.:78.55  1st Qu.:68.00  1st Qu.:0.0000  1st Qu.:3851      1st Qu.:14618
Median :2023-07-16  Saturday :5  Median :84.00  Median :71.10  Median :0.0000  Median :4461      Median :18696
Mean   :2023-07-16  Sunday   :5  Mean   :82.70  Mean   :71.10  Mean   :0.1352  Mean   :4551      Mean   :18805
3rd Qu.:2023-07-23  Thursday :4  3rd Qu.:87.10  3rd Qu.:74.45  3rd Qu.:0.0050  3rd Qu.:5482      3rd Qu.:22979
Max.   :2023-07-31  Tuesday  :4  Max.   :93.00  Max.   :78.10  Max.   :1.7800  Max.   :6556      Max.   :26969
                    Wednesday:4

Before the modeling process begins, the assumptions of Poisson regression are addressed. The first assumption is that the response variable consists of count data. The second is that the observations are independent. The third is that the response follows a Poisson distribution. The last is that the variance of the response equals its mean. The following output suggests that these assumptions may be violated. Specifically, the sample variance of QueensboroBridge (1,151,703) is roughly 250 times its mean (4,550.581), so while the response is count data, it may not be Poisson distributed. This may mean that the Poisson regression model is not appropriate. Further, when the scatter plots of the response against the explanatory variables are inspected, one may not find an exponential relationship, which may also indicate that the Poisson regression model is not appropriate for this analysis. The modeling process follows.

Mean and Var of QueensboroBridge (DIF = Mean - Var)
    Mean      Var       DIF
4550.581  1151703  -1147152
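
As a small worked check, the figures in this table can be recomputed from the reported mean and variance. The sketch below uses only the two printed values, and the variable names are hypothetical. The variance-to-mean ratio is added here as a convenient summary of the gap; it is not part of the original output.

```r
# Values copied from the table above.
mean_qb <- 4550.581
var_qb <- 1151703

# Difference as reported in the DIF column (mean minus variance).
dif <- mean_qb - var_qb
# Ratio of variance to mean; a Poisson variable would have a ratio near 1.
ratio <- var_qb / mean_qb

cat(sprintf("DIF = %.0f, variance/mean = %.1f\n", dif, ratio))
```

Note that this compares the marginal mean and variance of the response, before any explanatory variable is taken into account, so it is only a first indication.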

Multiple Poisson Regression Model

Here, Poisson regression models are fitted to both the counts and the rates. Tabular output of each model's statistics follows, along with interpretations of the information in the tables. The count (frequency) data were fitted first, followed by the rate data.
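
The two fits can be written out in standard GLM notation. The symbols are introduced here for illustration and do not appear in the original output: $\mu_i$ is the expected QueensboroBridge count on day $i$, $T_i$ is the value of Total on that day, $x_{ij}$ are the explanatory variables, and $\beta_j$ are the coefficients. The rate model differs only by the offset $\log T_i$, which has a fixed coefficient of 1.

```latex
\begin{align}
\text{count model:} \quad & \log \mu_i = \beta_0 + \sum_j \beta_j x_{ij} \\
\text{rate model:}  \quad & \log \mu_i = \log T_i + \beta_0 + \sum_j \beta_j x_{ij}
  \;\Longleftrightarrow\;
  \log \frac{\mu_i}{T_i} = \beta_0 + \sum_j \beta_j x_{ij}
\end{align}
```

Read this way, the coefficients of the rate model describe the ratio of the Queensboro Bridge count to Total rather than the count itself.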

frequency data

The following output is from the multiple Poisson regression (MPR) on the count data. At the top, under Call, is the glm formula for the model. The Coefficients section gives the estimate, standard error, z-value, and p-value for each term in the model. Further down, the null deviance, residual deviance, and AIC are listed.


Call:
glm(formula = QueensboroBridge ~ Date + Day + HighTemp + LowTemp + 
    Precipitation, family = poisson(link = "log"), data = data0)

Coefficients:
                  Estimate   Std. Error z value             Pr(>|z|)    
(Intercept)   -101.5353926    6.4585844 -15.721 < 0.0000000000000002 ***
Date             0.0056143    0.0003294  17.046 < 0.0000000000000002 ***
DayMonday        0.0836674    0.0109719   7.626   0.0000000000000243 ***
DaySaturday     -0.1568821    0.0113079 -13.874 < 0.0000000000000002 ***
DaySunday       -0.1793569    0.0118122 -15.184 < 0.0000000000000002 ***
DayThursday      0.1390094    0.0113536  12.244 < 0.0000000000000002 ***
DayTuesday       0.1417822    0.0116331  12.188 < 0.0000000000000002 ***
DayWednesday     0.2188624    0.0112170  19.512 < 0.0000000000000002 ***
HighTemp         0.0147046    0.0008006  18.367 < 0.0000000000000002 ***
LowTemp         -0.0147034    0.0011943 -12.311 < 0.0000000000000002 ***
Precipitation   -0.2725458    0.0109020 -24.999 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 7891  on 30  degrees of freedom
Residual deviance: 2732  on 20  degrees of freedom
AIC: 3071.2

Number of Fisher Scoring iterations: 4

Interpretation

A Poisson regression was conducted on the count data. All terms in the model were found to be statistically significant at the 0.05 level. The z- and p-values are given below, along with an interpretation of each estimate. Each interpretation holds the other variables in the model constant, and Friday is the reference level for Day.

Date: (z = 17.0456, p < .05). The coefficient estimate for Date is 0.0056. This may mean that the expected log count of cyclists on the Queensboro Bridge increases by 0.0056 for each one-unit (one-day) increase in Date. Alternatively, after exponentiation, the expected cyclist count is multiplied by 1.0056 (an increase of about 0.56%) for each additional day.

DayMonday: (z = 7.6256, p < .05). The coefficient estimate for DayMonday is 0.0837. This may mean that the expected log count of cyclists on the Queensboro Bridge on a Monday is 0.0837 higher than on a Friday. Alternatively, after exponentiation, the expected cyclist count on a Monday is 1.0873 times the count on a Friday (about 8.73% higher).

DaySaturday: (z = -13.8737, p < .05). The coefficient estimate for DaySaturday is -0.1569. This may mean that the expected log count of cyclists on the Queensboro Bridge on a Saturday is 0.1569 lower than on a Friday. Alternatively, after exponentiation, the expected cyclist count on a Saturday is 0.8548 times the count on a Friday (about 14.52% lower).

DaySunday: (z = -15.184, p < .05). The coefficient estimate for DaySunday is -0.1794. This may mean that the expected log count of cyclists on the Queensboro Bridge on a Sunday is 0.1794 lower than on a Friday. Alternatively, after exponentiation, the expected cyclist count on a Sunday is 0.8358 times the count on a Friday (about 16.42% lower).

DayThursday: (z = 12.2436, p < .05). The coefficient estimate for DayThursday is 0.139. This may mean that the expected log count of cyclists on the Queensboro Bridge on a Thursday is 0.139 higher than on a Friday. Alternatively, after exponentiation, the expected cyclist count on a Thursday is 1.1491 times the count on a Friday (about 14.91% higher).

DayTuesday: (z = 12.1879, p < .05). The coefficient estimate for DayTuesday is 0.1418. This may mean that the expected log count of cyclists on the Queensboro Bridge on a Tuesday is 0.1418 higher than on a Friday. Alternatively, after exponentiation, the expected cyclist count on a Tuesday is 1.1523 times the count on a Friday (about 15.23% higher).

DayWednesday: (z = 19.5116, p < .05). The coefficient estimate for DayWednesday is 0.2189. This may mean that the expected log count of cyclists on the Queensboro Bridge on a Wednesday is 0.2189 higher than on a Friday. Alternatively, after exponentiation, the expected cyclist count on a Wednesday is 1.2447 times the count on a Friday (about 24.47% higher).

HighTemp: (z = 18.3671, p < .05). The coefficient estimate for HighTemp is 0.0147. This may mean that the expected log count of cyclists on the Queensboro Bridge increases by 0.0147 for each one-unit increase in HighTemp. Alternatively, after exponentiation, the expected cyclist count is multiplied by 1.0148 (an increase of about 1.48%) for each additional unit of HighTemp.

LowTemp: (z = -12.3112, p < .05). The coefficient estimate for LowTemp is -0.0147. This may mean that the expected log count of cyclists on the Queensboro Bridge decreases by 0.0147 for each one-unit increase in LowTemp. Alternatively, after exponentiation, the expected cyclist count is multiplied by 0.9854 (a decrease of about 1.46%) for each additional unit of LowTemp.

Precipitation: (z = -24.9995, p < .05). The coefficient estimate for Precipitation is -0.2725. This may mean that the expected log count of cyclists on the Queensboro Bridge decreases by 0.2725 for each one-unit increase in Precipitation. Alternatively, after exponentiation, the expected cyclist count is multiplied by 0.7614 (a decrease of about 23.86%) for each additional unit of Precipitation.
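
The exponentiated values quoted in these interpretations can be reproduced from the printed estimates. The sketch below copies the estimates from the count-model coefficient table; the object names `coefs`, `ratio`, and `effects` are hypothetical. With the fitted model object at hand, `exp(coef(fit))` would give the same ratios, assuming the fit is stored under a name such as `fit`.

```r
# Estimates copied from the coefficient table of the count model above.
coefs <- c(
  Date = 0.0056143, DayMonday = 0.0836674, DaySaturday = -0.1568821,
  DaySunday = -0.1793569, DayThursday = 0.1390094, DayTuesday = 0.1417822,
  DayWednesday = 0.2188624, HighTemp = 0.0147046, LowTemp = -0.0147034,
  Precipitation = -0.2725458
)

# exp(estimate) is the multiplicative change in the expected count.
ratio <- exp(coefs)

effects <- data.frame(
  estimate = round(coefs, 4),
  ratio = round(ratio, 4),
  # A ratio below 1 is a decrease, so the percentage change is negative.
  pct_change = round(100 * (ratio - 1), 2)
)
print(effects)
```

A ratio below 1, as for DaySaturday, DaySunday, LowTemp, and Precipitation, corresponds to a lower expected count, not an increase.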

rate data

The following output is from the multiple Poisson regression on the rate data, which uses log(Total) as an offset. At the top, under Call, is the glm formula for the model. The Coefficients section gives the estimate, standard error, z-value, and p-value for each term in the model. Further down, the null deviance, residual deviance, and AIC are listed.


Call:
glm(formula = QueensboroBridge ~ Date + Day + HighTemp + LowTemp + 
    Precipitation, family = poisson(link = "log"), data = data0, 
    offset = log(Total))

Coefficients:
                Estimate Std. Error z value           Pr(>|z|)    
(Intercept)   27.7076811  6.6039900   4.196 0.0000272153711376 ***
Date          -0.0014765  0.0003367  -4.385 0.0000116004923419 ***
DayMonday     -0.0779061  0.0111944  -6.959 0.0000000000034181 ***
DaySaturday   -0.0382064  0.0115547  -3.307           0.000945 ***
DaySunday     -0.0907494  0.0121608  -7.462 0.0000000000000849 ***
DayThursday   -0.0249113  0.0116257  -2.143           0.032131 *  
DayTuesday    -0.0523313  0.0118574  -4.413 0.0000101775111261 ***
DayWednesday  -0.0529423  0.0114447  -4.626 0.0000037295140346 ***
HighTemp       0.0024545  0.0008306   2.955           0.003124 ** 
LowTemp       -0.0057570  0.0012177  -4.728 0.0000022693644187 ***
Precipitation  0.0315786  0.0102236   3.089           0.002010 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 467.32  on 30  degrees of freedom
Residual deviance: 324.46  on 20  degrees of freedom
AIC: 663.63

Number of Fisher Scoring iterations: 3
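
As a rough check that ties back to the equal mean and variance assumption, the residual deviance of each model can be compared with its residual degrees of freedom. This check is not part of the original analysis and is offered only as a sketch. The numbers are copied from the two summaries above, and the object name `fits` is hypothetical. For a Poisson model that fits well, the ratio is expected to be near 1.

```r
# Residual deviance and residual degrees of freedom as printed above.
fits <- data.frame(
  model = c("count", "rate"),
  resid_dev = c(2732, 324.46),
  resid_df = c(20, 20)
)

# Deviance per degree of freedom; values far above 1 suggest overdispersion.
fits$dev_per_df <- fits$resid_dev / fits$resid_df

# Upper-tail chi-square probability of the residual deviance,
# used here as an approximate goodness-of-fit check.
fits$p_value <- pchisq(fits$resid_dev, df = fits$resid_df, lower.tail = FALSE)

print(fits)
```

Both ratios are well above 1, which is in line with the concern that the Poisson assumption may not hold, although the offset brings the rate model much closer than the count model.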

Interpretation

A Poisson regression was conducted on the rate data, with log(Total) as the offset. All terms in the model were found to be statistically significant at the 0.05 level. The z- and p-values are given below, along with an interpretation of each estimate. Because of the offset, each estimate describes the expected rate, that is, the Queensboro Bridge count relative to Total, rather than the raw count. Each interpretation holds the other variables in the model constant, and Friday is the reference level for Day.

Date: (z = -4.385, p < .05). The coefficient estimate for Date is -0.0015. This may mean that the expected log rate of cyclists on the Queensboro Bridge decreases by 0.0015 for each one-unit (one-day) increase in Date. Alternatively, after exponentiation, the expected rate is multiplied by 0.9985 (a decrease of about 0.15%) for each additional day.

DayMonday: (z = -6.9594, p < .05). The coefficient estimate for DayMonday is -0.0779. This may mean that the expected log rate of cyclists on the Queensboro Bridge on a Monday is 0.0779 lower than on a Friday. Alternatively, after exponentiation, the expected rate on a Monday is 0.9251 times the rate on a Friday (about 7.49% lower).

DaySaturday: (z = -3.3066, p < .05). The coefficient estimate for DaySaturday is -0.0382. This may mean that the expected log rate of cyclists on the Queensboro Bridge on a Saturday is 0.0382 lower than on a Friday. Alternatively, after exponentiation, the expected rate on a Saturday is 0.9625 times the rate on a Friday (about 3.75% lower).

DaySunday: (z = -7.4624, p < .05). The coefficient estimate for DaySunday is -0.0907. This may mean that the expected log rate of cyclists on the Queensboro Bridge on a Sunday is 0.0907 lower than on a Friday. Alternatively, after exponentiation, the expected rate on a Sunday is 0.9132 times the rate on a Friday (about 8.68% lower).

DayThursday: (z = -2.1428, p < .05). The coefficient estimate for DayThursday is -0.0249. This may mean that the expected log rate of cyclists on the Queensboro Bridge on a Thursday is 0.0249 lower than on a Friday. Alternatively, after exponentiation, the expected rate on a Thursday is 0.9754 times the rate on a Friday (about 2.46% lower).

DayTuesday: (z = -4.4134, p < .05). The coefficient estimate for DayTuesday is -0.0523. This may mean that the expected log rate of cyclists on the Queensboro Bridge on a Tuesday is 0.0523 lower than on a Friday. Alternatively, after exponentiation, the expected rate on a Tuesday is 0.949 times the rate on a Friday (about 5.10% lower).

DayWednesday: (z = -4.6259, p < .05). The coefficient estimate for DayWednesday is -0.0529. This may mean that the expected log rate of cyclists on the Queensboro Bridge on a Wednesday is 0.0529 lower than on a Friday. Alternatively, after exponentiation, the expected rate on a Wednesday is 0.9484 times the rate on a Friday (about 5.16% lower).

HighTemp: (z = 2.9552, p < .05). The coefficient estimate for HighTemp is 0.0025. This may mean that the expected log rate of cyclists on the Queensboro Bridge increases by 0.0025 for each one-unit increase in HighTemp. Alternatively, after exponentiation, the expected rate is multiplied by 1.0025 (an increase of about 0.25%) for each additional unit of HighTemp.

LowTemp: (z = -4.7278, p < .05). The coefficient estimate for LowTemp is -0.0058. This may mean that the expected log rate of cyclists on the Queensboro Bridge decreases by 0.0058 for each one-unit increase in LowTemp. Alternatively, after exponentiation, the expected rate is multiplied by 0.9943 (a decrease of about 0.57%) for each additional unit of LowTemp.

Precipitation: (z = 3.0888, p < .05). The coefficient estimate for Precipitation is 0.0316. This may mean that the expected log rate of cyclists on the Queensboro Bridge increases by 0.0316 for each one-unit increase in Precipitation. Alternatively, after exponentiation, the expected rate is multiplied by 1.0321 (an increase of about 3.21%) for each additional unit of Precipitation.
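
To make the rate interpretation concrete, the printed estimates can be applied to the first observation (Saturday, July 1). The sketch below is a worked example under stated assumptions: the estimates are copied from the rate-model table, so the result is approximate because the printed values are rounded, and the object names are hypothetical. It also relies on R representing a Date as the number of days since 1970-01-01, which is why a one-unit increase in Date is one day and why the intercept is so far from zero.

```r
# Estimates copied from the rate-model coefficient table above.
b <- c(Intercept = 27.7076811, Date = -0.0014765, DaySaturday = -0.0382064,
       HighTemp = 0.0024545, LowTemp = -0.0057570, Precipitation = 0.0315786)

# First observation: Saturday 2023-07-01, HighTemp 84.9, LowTemp 72.0,
# Precipitation 0.23, QueensboroBridge 3216, Total 11867.
date_num <- as.numeric(as.Date("2023-07-01"))  # days since 1970-01-01

# Linear predictor without the offset, i.e. the log of the expected rate.
eta <- b[["Intercept"]] + b[["Date"]] * date_num + b[["DaySaturday"]] +
  b[["HighTemp"]] * 84.9 + b[["LowTemp"]] * 72.0 + b[["Precipitation"]] * 0.23

pred_rate <- exp(eta)            # expected QueensboroBridge / Total
obs_rate <- 3216 / 11867         # observed share on that day
pred_count <- pred_rate * 11867  # adding the offset back gives a count

cat(sprintf("predicted rate %.3f, observed rate %.3f, predicted count %.0f\n",
            pred_rate, obs_rate, pred_count))
```

The predicted quantity is a share of the total crossings, so multiplying it by Total recovers an expected count for the bridge.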

Discussion

This analysis explored the relationship between the cyclist counts on the Queensboro Bridge and the 24-hour periods in which they were observed, along with other factors. A Poisson regression was conducted on both the rate and the count data. All variables in each model were found to be statistically significant, and no variables were removed between the model fittings. The z- and p-values were given, as well as interpretations of each estimate.

The data set used for the analysis was described as the counts of cyclists per 24 hours entering and leaving Queens, Manhattan, and Brooklyn via the East River bridges. The response variable was QueensboroBridge, and the explanatory variables were Date, Day, HighTemp, LowTemp, and Precipitation. The variable Total was withheld from the predictors and instead used to scale the response variable for the MPR on rates. During the exploratory data analysis, some adjustments were made to the variables in the data set. The non-numeric variable Day was changed to a factor, and the variables QueensboroBridge and Total were changed from character variables to numeric variables. The Date variable was also changed from a character variable to a date variable. Incidentally, the year 2023 was appended to each date; no effect on the analysis was detected.

Before the modeling process, the assumptions of Poisson regression were addressed, and output that may have demonstrated violations of those assumptions was given. Specifically, it was found that the mean and variance were not equal, so while the response may be count data, it may not be Poisson distributed. This may mean that using the Poisson regression model was incorrect. Further, one may not find an exponential relationship between the response and the explanatory variables after inspecting their respective scatter plots, which may also indicate that using the Poisson regression model for this analysis was incorrect. After these assumptions were checked, the modeling process followed.

Conclusion

Using the Poisson regression model to fit the data may not have been appropriate; there may have been multiple violations of the model's assumptions about the response variable. Mainly, the response may not have been Poisson distributed. Further investigation is recommended, though no clear direction is apparent.