Data Description

Research Question

For brevity, the data set is described as the counts of cyclists per 24 hours entering and leaving Queens, Manhattan, and Brooklyn via the East River bridges. The response variable for the proposed analysis is QueensboroBridge, and the primary explanatory variable is Date. One objective of this analysis is therefore to explore the relationship between the cyclist counts on the Queensboro Bridge and the 24-hour period in which those counts were observed. The other variables, Day, HighTemp, LowTemp, and Precipitation, are also included because they may help the proposed model fit better. Total, however, is withheld as a predictor and is instead used to scale the response variable later in the analysis.

Exploratory Data Analysis

The output below shows the data set in its original state. Notice that none of the non-numeric variables are stored as factors; that QueensboroBridge and Total, which are expected to be numeric, are stored as character variables; and that Date is also stored as a character variable. Before any statistics were computed from the data set, some adjustments were made.

'data.frame':   31 obs. of  7 variables:
 $ Date            : chr  "7/1" "7/2" "7/3" "7/4" ...
 $ Day             : chr  "Saturday" "Sunday" "Monday" "Tuesday" ...
 $ HighTemp        : num  84.9 87.1 87.1 82.9 84.9 75 79 82.9 81 82.9 ...
 $ LowTemp         : num  72 73 71.1 70 71.1 71.1 68 70 69.1 71.1 ...
 $ Precipitation   : num  0.23 0 0.45 0 0 0 1.78 0 0 0 ...
 $ QueensboroBridge: chr  "3,216" "3,579" "4,230" "3,861" ...
 $ Total           : chr  "11,867" "13,995" "16,067" "13,925" ...

The following output displays the data set after the adjustments were made, along with some of its observations and summary statistics for its variables. Notice that the observations are recorded over 31 days. Also notice that there are more observations for Saturday, Sunday, and Monday than for the other days: 31 days span four full weeks plus three days, and these are the weekdays of the first three observations.

'data.frame':   31 obs. of  7 variables:
 $ Date            : Date, format: "2023-07-01" "2023-07-02" ...
 $ Day             : Factor w/ 7 levels "Friday","Monday",..: 3 4 2 6 7 5 1 3 4 2 ...
 $ HighTemp        : num  84.9 87.1 87.1 82.9 84.9 75 79 82.9 81 82.9 ...
 $ LowTemp         : num  72 73 71.1 70 71.1 71.1 68 70 69.1 71.1 ...
 $ Precipitation   : num  0.23 0 0.45 0 0 0 1.78 0 0 0 ...
 $ QueensboroBridge: int  3216 3579 4230 3861 5862 5251 3304 3952 4044 5712 ...
 $ Total           : int  11867 13995 16067 13925 23110 21861 12805 17258 18320 24827 ...

First Few Observations from the data set

     Date        Day        HighTemp  LowTemp  Precipitation  QueensboroBridge  Total
7/1  2023-07-01  Saturday       84.9     72.0           0.23              3216  11867
7/2  2023-07-02  Sunday         87.1     73.0           0.00              3579  13995
7/3  2023-07-03  Monday         87.1     71.1           0.45              4230  16067
7/4  2023-07-04  Tuesday        82.9     70.0           0.00              3861  13925
7/5  2023-07-05  Wednesday      84.9     71.1           0.00              5862  23110
7/6  2023-07-06  Thursday       75.0     71.1           0.00              5251  21861
7/7  2023-07-07  Friday         79.0     68.0           1.78              3304  12805

Summary Statistics from the data set

Date                Day          HighTemp       LowTemp        Precipitation   QueensboroBridge  Total
Min.   :2023-07-01  Friday   :4  Min.   :69.10  Min.   :63.00  Min.   :0.0000  Min.   :2147      Min.   : 8210
1st Qu.:2023-07-08  Monday   :5  1st Qu.:78.55  1st Qu.:68.00  1st Qu.:0.0000  1st Qu.:3851      1st Qu.:14618
Median :2023-07-16  Saturday :5  Median :84.00  Median :71.10  Median :0.0000  Median :4461      Median :18696
Mean   :2023-07-16  Sunday   :5  Mean   :82.70  Mean   :71.10  Mean   :0.1352  Mean   :4551      Mean   :18805
3rd Qu.:2023-07-23  Thursday :4  3rd Qu.:87.10  3rd Qu.:74.45  3rd Qu.:0.0050  3rd Qu.:5482      3rd Qu.:22979
Max.   :2023-07-31  Tuesday  :4  Max.   :93.00  Max.   :78.10  Max.   :1.7800  Max.   :6556      Max.   :26969
NA                  Wednesday:4  NA             NA             NA              NA                NA
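
The adjustments described above amount to reading the comma-formatted counts as integers and the month/day strings as dates. The sketch below illustrates this on the first few values printed in the raw output. It is written in Python rather than taken from the analysis itself, the helper names are hypothetical, and the year 2023 is the one shown in the adjusted output. It also tallies the weekdays of the 31 observed days, which shows why three weekdays occur five times.

```python
from collections import Counter
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def parse_count(text):
    """Convert a comma-formatted count such as "3,216" to an integer."""
    return int(text.replace(",", ""))

def parse_month_day(text, year=2023):
    """Convert a "month/day" string such as "7/1" to a date in the given year."""
    month, day = (int(part) for part in text.split("/"))
    return date(year, month, day)

# First values of QueensboroBridge and Date as printed in the raw output
counts = [parse_count(s) for s in ["3,216", "3,579", "4,230", "3,861"]]
dates = [parse_month_day(s) for s in ["7/1", "7/2", "7/3", "7/4"]]

# Weekday frequencies over the 31 observed days, starting on 2023-07-01
start = date(2023, 7, 1)
weekday_counts = Counter(
    WEEKDAYS[(start + timedelta(days=i)).weekday()] for i in range(31)
)

print(counts)
print([d.isoformat() for d in dates])
print(dict(weekday_counts))
```

The weekday tally agrees with the Day column of the summary statistics: Saturday, Sunday, and Monday occur five times and the remaining days four times.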

Before the modeling process, the assumptions for Poisson regression are addressed. The first assumption is that the response variable consists of count data. The second is that the observations are independent. The third is that the response follows a Poisson distribution. The last is that the variance of the response is equal to its mean. The following output suggests that these assumptions may be violated. Specifically, the mean and variance are far from equal, so while the response is count data, it may not be Poisson distributed. This may mean that the Poisson regression model is inappropriate. Further, after the scatter plots are inspected, one may not find an exponential relationship between the response and the explanatory variables, which may also indicate that the Poisson regression model is inappropriate for this analysis. The modeling process follows.

Mean and Variance of QueensboroBridge
    Mean      Var  DIF (Mean - Var)
4550.581  1151703          -1147152
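
The size of the gap is easier to judge as a ratio than as a difference. The short sketch below, in Python and using only the two numbers reported above, reproduces the DIF value and computes the variance-to-mean ratio. Note that this is a marginal comparison; strictly, the Poisson assumption concerns the variance given the explanatory variables, so the ratio is only a rough warning sign.

```python
# Mean and variance of QueensboroBridge as reported in the table above
mean_count = 4550.581
var_count = 1151703

# Under a Poisson distribution the variance equals the mean,
# so the difference would be near 0 and the ratio near 1.
difference = mean_count - var_count
ratio = var_count / mean_count

print(round(difference))
print(round(ratio, 1))
```

The variance is roughly 253 times the mean, which points to substantial overdispersion before any explanatory variables are taken into account.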

Multiple Poisson and Quasi-Poisson Regression Models

Here the Poisson regression models for both the counts and the rates are fitted. Tabular output of each model's statistics follows, along with interpretations of the information within the tables. The frequency (count) data were fit first, followed by the rate data.

frequency data

The following output is from the multiple Poisson regression (MPR) on the count data. At the top, under Call, is the glm formula for the model. The Coefficients section lists the estimate, standard error, z value, and p-value for each variable in the model. Further down, the null deviance, residual deviance, and AIC are listed.


Call:
glm(formula = QueensboroBridge ~ Date + Day + HighTemp + LowTemp + 
    Precipitation, family = poisson(link = "log"), data = data0)

Coefficients:
                  Estimate   Std. Error z value             Pr(>|z|)    
(Intercept)   -101.5353926    6.4585844 -15.721 < 0.0000000000000002 ***
Date             0.0056143    0.0003294  17.046 < 0.0000000000000002 ***
DayMonday        0.0836674    0.0109719   7.626   0.0000000000000243 ***
DaySaturday     -0.1568821    0.0113079 -13.874 < 0.0000000000000002 ***
DaySunday       -0.1793569    0.0118122 -15.184 < 0.0000000000000002 ***
DayThursday      0.1390094    0.0113536  12.244 < 0.0000000000000002 ***
DayTuesday       0.1417822    0.0116331  12.188 < 0.0000000000000002 ***
DayWednesday     0.2188624    0.0112170  19.512 < 0.0000000000000002 ***
HighTemp         0.0147046    0.0008006  18.367 < 0.0000000000000002 ***
LowTemp         -0.0147034    0.0011943 -12.311 < 0.0000000000000002 ***
Precipitation   -0.2725458    0.0109020 -24.999 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 7891  on 30  degrees of freedom
Residual deviance: 2732  on 20  degrees of freedom
AIC: 3071.2

Number of Fisher Scoring iterations: 4
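
To see how the estimates above combine, the sketch below computes the fitted count for the first observation (Saturday, 7/1). It is written in Python and assumes that Date entered the model as R's numeric representation of a date, that is, the number of days since 1970-01-01; this would also explain the large negative intercept. The coefficients are the rounded values printed above, so the result is approximate.

```python
import math
from datetime import date

# Coefficients copied from the count model output (rounded as printed)
coef = {
    "(Intercept)": -101.5353926,
    "Date": 0.0056143,
    "DaySaturday": -0.1568821,
    "HighTemp": 0.0147046,
    "LowTemp": -0.0147034,
    "Precipitation": -0.2725458,
}

# First observation: Saturday 2023-07-01
obs = {"HighTemp": 84.9, "LowTemp": 72.0, "Precipitation": 0.23}

# Assumption: Date is measured in days since 1970-01-01
date_numeric = (date(2023, 7, 1) - date(1970, 1, 1)).days

# Linear predictor on the log scale
eta = (
    coef["(Intercept)"]
    + coef["Date"] * date_numeric
    + coef["DaySaturday"]
    + coef["HighTemp"] * obs["HighTemp"]
    + coef["LowTemp"] * obs["LowTemp"]
    + coef["Precipitation"] * obs["Precipitation"]
)

# The log link is inverted by exponentiating the linear predictor
fitted_count = math.exp(eta)

print(date_numeric)
print(round(fitted_count))
```

With the printed coefficients the fitted value comes out near 3,400, against an observed count of 3,216 for that day.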

Interpretation

A Poisson regression was conducted. All variables in the model were found to be statistically significant. The z and p values are now given, as well as interpretations for each estimate. Dispersion was not assessed; the model assumes there are no dispersion issues. However, the residual deviance was assessed against a \(\chi^2_{20}\) distribution: a deviance of 2732 on 20 degrees of freedom gives a p-value of approximately 0, so the model may not fit well.
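
The goodness-of-fit check mentioned above compares the residual deviance with a chi-square distribution on the residual degrees of freedom. The sketch below, in Python, evaluates the upper-tail probability with a closed form that is valid only for an even number of degrees of freedom (which holds here); the function name is hypothetical. It also reports the deviance per degree of freedom, a rough indicator of overdispersion.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of df.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2-1} (x/2)^k / k!,
    which holds only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0   # current term (x/2)^k / k!, starting at k = 0
    total = 1.0  # running sum of the terms
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Residual deviance and degrees of freedom from the count model output
deviance, df = 2732, 20

p_value = chi2_upper_tail_even_df(deviance, df)
deviance_per_df = deviance / df

print(p_value)          # underflows to 0.0 in double precision
print(deviance_per_df)  # a well-fitting Poisson model would give a value near 1
```

A p-value this small means the residual deviance is far larger than a well-fitting Poisson model would produce, which is consistent with the lack of fit noted in the interpretation.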

Date: (z = 17.0456, p < .05). The coefficient estimate for Date is 0.0056. This means that, holding the other variables fixed, the expected log count of cyclists on the Queensboro Bridge increases by 0.0056 for a one-unit (one-day) increase in Date. After exponentiation, the expected count is multiplied by 1.0056 for each additional day, an increase of about 0.56%.

DayMonday: (z = 7.6256, p < .05). The coefficient estimate for DayMonday is 0.0837. This means that, holding the other variables fixed, the expected log count of cyclists on the Queensboro Bridge on a Monday is 0.0837 higher than on a Friday, the reference day. After exponentiation, the expected count on a Monday is 1.0873 times the Friday count (about 8.7% higher).

DaySaturday: (z = -13.8737, p < .05). The coefficient estimate for DaySaturday is -0.1569. This means the expected log count on a Saturday is 0.1569 lower than on a Friday. After exponentiation, the expected count on a Saturday is 0.8548 times the Friday count (about 14.5% lower).

DaySunday: (z = -15.184, p < .05). The coefficient estimate for DaySunday is -0.1794. This means the expected log count on a Sunday is 0.1794 lower than on a Friday. After exponentiation, the expected count on a Sunday is 0.8358 times the Friday count (about 16.4% lower).

DayThursday: (z = 12.2436, p < .05). The coefficient estimate for DayThursday is 0.139. This means the expected log count on a Thursday is 0.139 higher than on a Friday. After exponentiation, the expected count on a Thursday is 1.1491 times the Friday count (about 14.9% higher).

DayTuesday: (z = 12.1879, p < .05). The coefficient estimate for DayTuesday is 0.1418. This means the expected log count on a Tuesday is 0.1418 higher than on a Friday. After exponentiation, the expected count on a Tuesday is 1.1523 times the Friday count (about 15.2% higher).

DayWednesday: (z = 19.5116, p < .05). The coefficient estimate for DayWednesday is 0.2189. This means the expected log count on a Wednesday is 0.2189 higher than on a Friday. After exponentiation, the expected count on a Wednesday is 1.2447 times the Friday count (about 24.5% higher).

HighTemp: (z = 18.3671, p < .05). The coefficient estimate for HighTemp is 0.0147. This means the expected log count increases by 0.0147 for a one-unit increase in HighTemp. After exponentiation, the expected count is multiplied by 1.0148 for each additional unit of HighTemp (about 1.5% higher).

LowTemp: (z = -12.3112, p < .05). The coefficient estimate for LowTemp is -0.0147. This means the expected log count decreases by 0.0147 for a one-unit increase in LowTemp. After exponentiation, the expected count is multiplied by 0.9854 for each additional unit of LowTemp (about 1.5% lower).

Precipitation: (z = -24.9995, p < .05). The coefficient estimate for Precipitation is -0.2725. This means the expected log count decreases by 0.2725 for a one-unit increase in Precipitation. After exponentiation, the expected count is multiplied by 0.7614 for each additional unit of Precipitation (about 23.9% lower).
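
The exponentiated values quoted in these interpretations can be checked directly from the printed estimates. The sketch below, in Python, converts each count-model coefficient to a multiplicative factor and to a percentage change in the expected count; the variable names are illustrative only.

```python
import math

# Estimates from the count model output, excluding the intercept
coef = {
    "Date": 0.0056143,
    "DayMonday": 0.0836674,
    "DaySaturday": -0.1568821,
    "DaySunday": -0.1793569,
    "DayThursday": 0.1390094,
    "DayTuesday": 0.1417822,
    "DayWednesday": 0.2188624,
    "HighTemp": 0.0147046,
    "LowTemp": -0.0147034,
    "Precipitation": -0.2725458,
}

# exp(beta) is the factor by which the expected count is multiplied
# for a one-unit increase (or, for a Day term, relative to Friday)
ratios = {name: math.exp(beta) for name, beta in coef.items()}

for name, ratio in ratios.items():
    percent_change = (ratio - 1.0) * 100.0
    print(f"{name:<14} factor = {ratio:.4f}  change = {percent_change:+.1f}%")
```

A factor below 1 corresponds to a decrease in the expected count, so a negative estimate should be read as "lower than" or "multiplied by a value less than 1" rather than as an increase.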

rate data

The following output is from the MPR on the rate data, which uses log(Total) as an offset. At the top, under Call, is the glm formula for the model. The Coefficients section lists the estimate, standard error, z value, and p-value for each variable in the model. Further down, the null deviance, residual deviance, and AIC are listed.


Call:
glm(formula = QueensboroBridge ~ Date + Day + HighTemp + LowTemp + 
    Precipitation, family = poisson(link = "log"), data = data0, 
    offset = log(Total))

Coefficients:
                Estimate Std. Error z value           Pr(>|z|)    
(Intercept)   27.7076811  6.6039900   4.196 0.0000272153711376 ***
Date          -0.0014765  0.0003367  -4.385 0.0000116004923419 ***
DayMonday     -0.0779061  0.0111944  -6.959 0.0000000000034181 ***
DaySaturday   -0.0382064  0.0115547  -3.307           0.000945 ***
DaySunday     -0.0907494  0.0121608  -7.462 0.0000000000000849 ***
DayThursday   -0.0249113  0.0116257  -2.143           0.032131 *  
DayTuesday    -0.0523313  0.0118574  -4.413 0.0000101775111261 ***
DayWednesday  -0.0529423  0.0114447  -4.626 0.0000037295140346 ***
HighTemp       0.0024545  0.0008306   2.955           0.003124 ** 
LowTemp       -0.0057570  0.0012177  -4.728 0.0000022693644187 ***
Precipitation  0.0315786  0.0102236   3.089           0.002010 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 467.32  on 30  degrees of freedom
Residual deviance: 324.46  on 20  degrees of freedom
AIC: 663.63

Number of Fisher Scoring iterations: 3
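
With log(Total) as an offset, the model is log(expected count) = log(Total) + linear predictor, so the exponentiated linear predictor is the expected share of Total that crosses the Queensboro Bridge. The sketch below, in Python, works this through for the first observation (Saturday, 7/1). As an assumption, Date is taken to be R's numeric date (days since 1970-01-01), and the coefficients are the rounded values printed above.

```python
import math
from datetime import date

# Coefficients copied from the rate model output (rounded as printed)
coef = {
    "(Intercept)": 27.7076811,
    "Date": -0.0014765,
    "DaySaturday": -0.0382064,
    "HighTemp": 0.0024545,
    "LowTemp": -0.0057570,
    "Precipitation": 0.0315786,
}

# First observation: Saturday 2023-07-01
obs = {"HighTemp": 84.9, "LowTemp": 72.0, "Precipitation": 0.23,
       "QueensboroBridge": 3216, "Total": 11867}

# Assumption: Date is measured in days since 1970-01-01
date_numeric = (date(2023, 7, 1) - date(1970, 1, 1)).days

eta = (
    coef["(Intercept)"]
    + coef["Date"] * date_numeric
    + coef["DaySaturday"]
    + coef["HighTemp"] * obs["HighTemp"]
    + coef["LowTemp"] * obs["LowTemp"]
    + coef["Precipitation"] * obs["Precipitation"]
)

fitted_share = math.exp(eta)                # expected QueensboroBridge / Total
fitted_count = obs["Total"] * fitted_share  # same as exp(log(Total) + eta)
observed_share = obs["QueensboroBridge"] / obs["Total"]

print(round(fitted_share, 3), round(observed_share, 3))
print(round(fitted_count))
```

For this day the fitted share is about 0.25 against an observed share of about 0.27. The coefficients of this model therefore describe changes in the Queensboro Bridge share of all East River bridge crossings, not changes in the raw count.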

Interpretation

A Poisson regression with an offset was conducted. All variables in the model were found to be statistically significant. The z and p values are now given, as well as interpretations for each estimate. Dispersion was not assessed; the model assumes there are no dispersion issues. However, the residual deviance was assessed against a \(\chi^2_{20}\) distribution: a deviance of 324.46 on 20 degrees of freedom gives a p-value of approximately 0, so the model may not fit well.

Date: (z = -4.385, p < .05). The coefficient estimate for Date is -0.0015. Because the model includes log(Total) as an offset, the estimates describe the rate, that is, the Queensboro Bridge count as a share of Total. Holding the other variables fixed, the expected log rate decreases by 0.0015 for a one-unit (one-day) increase in Date. After exponentiation, the expected rate is multiplied by 0.9985 for each additional day (about 0.15% lower).

DayMonday: (z = -6.9594, p < .05). The coefficient estimate for DayMonday is -0.0779. This means the expected log rate on a Monday is 0.0779 lower than on a Friday, the reference day. After exponentiation, the expected rate on a Monday is 0.9251 times the Friday rate (about 7.5% lower).

DaySaturday: (z = -3.3066, p < .05). The coefficient estimate for DaySaturday is -0.0382. This means the expected log rate on a Saturday is 0.0382 lower than on a Friday. After exponentiation, the expected rate on a Saturday is 0.9625 times the Friday rate (about 3.7% lower).

DaySunday: (z = -7.4624, p < .05). The coefficient estimate for DaySunday is -0.0907. This means the expected log rate on a Sunday is 0.0907 lower than on a Friday. After exponentiation, the expected rate on a Sunday is 0.9132 times the Friday rate (about 8.7% lower).

DayThursday: (z = -2.1428, p < .05). The coefficient estimate for DayThursday is -0.0249. This means the expected log rate on a Thursday is 0.0249 lower than on a Friday. After exponentiation, the expected rate on a Thursday is 0.9754 times the Friday rate (about 2.5% lower).

DayTuesday: (z = -4.4134, p < .05). The coefficient estimate for DayTuesday is -0.0523. This means the expected log rate on a Tuesday is 0.0523 lower than on a Friday. After exponentiation, the expected rate on a Tuesday is 0.949 times the Friday rate (about 5.1% lower).

DayWednesday: (z = -4.6259, p < .05). The coefficient estimate for DayWednesday is -0.0529. This means the expected log rate on a Wednesday is 0.0529 lower than on a Friday. After exponentiation, the expected rate on a Wednesday is 0.9484 times the Friday rate (about 5.2% lower).

HighTemp: (z = 2.9552, p < .05). The coefficient estimate for HighTemp is 0.0025. This means the expected log rate increases by 0.0025 for a one-unit increase in HighTemp. After exponentiation, the expected rate is multiplied by 1.0025 for each additional unit of HighTemp (about 0.25% higher).

LowTemp: (z = -4.7278, p < .05). The coefficient estimate for LowTemp is -0.0058. This means the expected log rate decreases by 0.0058 for a one-unit increase in LowTemp. After exponentiation, the expected rate is multiplied by 0.9943 for each additional unit of LowTemp (about 0.57% lower).

Precipitation: (z = 3.0888, p < .05). The coefficient estimate for Precipitation is 0.0316. This means the expected log rate increases by 0.0316 for a one-unit increase in Precipitation. After exponentiation, the expected rate is multiplied by 1.0321 for each additional unit of Precipitation (about 3.2% higher).

quasi-Poisson

Before the quasi-Poisson regression was conducted, some adjustments were made to the data set. Two variables, AvgTemp and NewPrecip, were added. AvgTemp is the average of HighTemp and LowTemp from the previous data set. NewPrecip is a binary variable, where 0 corresponds to a Precipitation of zero and 1 to all other values. The following output displays the data set after these adjustments, along with some of its observations and summary statistics for its variables. As before, the observations are recorded over 31 days, and there are more observations for Saturday, Sunday, and Monday because these are the weekdays of the first three observations.

'data.frame':   31 obs. of  5 variables:
 $ Day             : Factor w/ 7 levels "Friday","Monday",..: 3 4 2 6 7 5 1 3 4 2 ...
 $ AvgTemp         : num  78.5 80 79.1 76.5 78 ...
 $ NewPrecip       : num  1 0 1 0 0 0 1 0 0 0 ...
 $ QueensboroBridge: int  3216 3579 4230 3861 5862 5251 3304 3952 4044 5712 ...
 $ Total           : int  11867 13995 16067 13925 23110 21861 12805 17258 18320 24827 ...

First Few Observations from the data set

     Day        AvgTemp  NewPrecip  QueensboroBridge  Total
7/1  Saturday     78.45          1              3216  11867
7/2  Sunday       80.05          0              3579  13995
7/3  Monday       79.10          1              4230  16067
7/4  Tuesday      76.45          0              3861  13925
7/5  Wednesday    78.00          0              5862  23110
7/6  Thursday     73.05          0              5251  21861
7/7  Friday       73.50          1              3304  12805

Summary Statistics from the data set

Day          AvgTemp        NewPrecip       QueensboroBridge  Total
Friday   :4  Min.   :66.05  Min.   :0.0000  Min.   :2147      Min.   : 8210
Monday   :5  1st Qu.:73.28  1st Qu.:0.0000  1st Qu.:3851      1st Qu.:14618
Saturday :5  Median :77.45  Median :0.0000  Median :4461      Median :18696
Sunday   :5  Mean   :76.90  Mean   :0.2581  Mean   :4551      Mean   :18805
Thursday :4  3rd Qu.:79.78  3rd Qu.:0.5000  3rd Qu.:5482      3rd Qu.:22979
Tuesday  :4  Max.   :85.55  Max.   :1.0000  Max.   :6556      Max.   :26969
Wednesday:4  NA             NA              NA                NA
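
The two derived variables can be reproduced from the earlier columns. The sketch below, in Python, recomputes AvgTemp and NewPrecip for the first seven days from the HighTemp, LowTemp, and Precipitation values printed earlier; it is an illustration of the definitions given above, not the code used in the analysis.

```python
# (HighTemp, LowTemp, Precipitation) for the first seven days, as printed earlier
rows = [
    (84.9, 72.0, 0.23),
    (87.1, 73.0, 0.00),
    (87.1, 71.1, 0.45),
    (82.9, 70.0, 0.00),
    (84.9, 71.1, 0.00),
    (75.0, 71.1, 0.00),
    (79.0, 68.0, 1.78),
]

# AvgTemp: midpoint of the daily high and low temperature
avg_temp = [round((high + low) / 2, 2) for high, low, _ in rows]

# NewPrecip: 1 if any precipitation was recorded, otherwise 0
new_precip = [1 if precip > 0 else 0 for _, _, precip in rows]

print(avg_temp)
print(new_precip)
```

The results agree with the AvgTemp and NewPrecip columns of the first observations shown above. Note that collapsing Precipitation to a binary indicator discards the amount of rainfall, which is one reason the quasi-Poisson estimates are not directly comparable with those of the earlier models.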

The following output is from the quasi-Poisson multiple regression (QMPR) on the rate data. At the top, under Call, is the glm formula for the model. The Coefficients section lists the estimate, standard error, t value, and p-value for each variable in the model. Further down, the null deviance and residual deviance are listed; the AIC is reported as NA because it is not defined for a quasi-likelihood fit.


Call:
glm(formula = QueensboroBridge ~ Day + AvgTemp + NewPrecip, family = quasipoisson(link = "log"), 
    data = data1, offset = log(Total))

Coefficients:
              Estimate Std. Error t value    Pr(>|t|)    
(Intercept)  -1.256902   0.168919  -7.441 0.000000192 ***
DayMonday    -0.067871   0.039969  -1.698      0.1036    
DaySaturday  -0.035773   0.042116  -0.849      0.4048    
DaySunday    -0.081786   0.042303  -1.933      0.0662 .  
DayThursday  -0.035010   0.041312  -0.847      0.4059    
DayTuesday   -0.047786   0.042593  -1.122      0.2740    
DayWednesday -0.043398   0.041580  -1.044      0.3080    
AvgTemp      -0.001613   0.002137  -0.755      0.4583    
NewPrecip     0.047621   0.028067   1.697      0.1039    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 15.23802)

    Null deviance: 467.32  on 30  degrees of freedom
Residual deviance: 330.47  on 22  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 3

Interpretation

A quasi-Poisson regression was conducted. Dispersion was detected: \(\phi\) = 15.2380228. Only the intercept was found to be statistically significant. The t and p values are given, as well as an interpretation of the estimate. Likewise, the residual deviance was assessed against a \(\chi^2_{22}\) distribution: a deviance of 330.47 on 22 degrees of freedom gives a p-value of approximately 0, so the model may not fit well.

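(Intercept), corresponding to DayFriday: (t = -7.4409, p < .05). The intercept estimate is -1.2569. Because the model includes log(Total) as an offset, this is the expected log rate (the Queensboro Bridge count as a share of Total) on a Friday with NewPrecip equal to 0 and AvgTemp equal to 0. After exponentiation, the expected rate under those conditions is 0.2845. An AvgTemp of 0 lies far outside the observed range, so this value is a baseline for the model rather than a realistic prediction.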
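
Because the intercept refers to an AvgTemp of zero, it can help to evaluate the same baseline at a temperature that was actually observed. The sketch below, in Python, exponentiates the intercept and then repeats the calculation at the mean AvgTemp from the summary statistics, using the rounded estimates printed in the output; the variable names are illustrative.

```python
import math

intercept = -1.256902      # (Intercept) from the quasi-Poisson output
avg_temp_coef = -0.001613  # AvgTemp estimate from the same output
mean_avg_temp = 76.90      # mean of AvgTemp from the summary statistics

# Expected share of Total on a dry Friday when AvgTemp is 0
share_at_zero = math.exp(intercept)

# Expected share of Total on a dry Friday at the mean AvgTemp
share_at_mean_temp = math.exp(intercept + avg_temp_coef * mean_avg_temp)

print(round(share_at_zero, 4))
print(round(share_at_mean_temp, 4))
```

At the mean temperature the expected share is about 0.25, which is close to the ratio of the mean QueensboroBridge count to the mean Total in the summary statistics (4551 / 18805, about 0.24).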

Final Model

The count and rate models did not account for dispersion, so the statistical significance of their estimates may be overstated. The quasi-Poisson model did account for it, and found only the intercept to be significant. So, while a quasi-Poisson model may be appropriate for association analysis when dispersion is an issue, the associations in this model were not significant. Further, note that the models are similar but not the same: the rate model has an offset where the count model does not, and the quasi-Poisson model uses different explanatory variables from either, despite sharing the rate model's offset. Therefore, in predictive modeling, these models may not yield the same results. Additionally, after the residual deviance of each model was assessed, no evidence was found that it follows a \(\chi^2\) distribution on the residual degrees of freedom, so the models may not be useful. With these considerations, a determination on a final model could not be reached.
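
The effect of ignoring dispersion can be illustrated with the numbers already reported. The sketch below, in Python, makes two assumptions that are not part of the original analysis: it approximates each model's dispersion by residual deviance divided by residual degrees of freedom (R's quasi-Poisson fit uses Pearson residuals instead, which are not available here), and it rescales the rate model's z values by the square root of that dispersion, as a quasi-Poisson fit with the same variables would do to the standard errors.

```python
import math

# (residual deviance, residual df) from the three model outputs
models = {
    "count": (2732, 20),
    "rate": (324.46, 20),
    "quasi-Poisson": (330.47, 22),
}

# Assumption: deviance / df as a rough stand-in for the dispersion parameter
rough_dispersion = {name: dev / df for name, (dev, df) in models.items()}

# z values from the rate model output
rate_z = {
    "Date": -4.385, "DayMonday": -6.959, "DaySaturday": -3.307,
    "DaySunday": -7.462, "DayThursday": -2.143, "DayTuesday": -4.413,
    "DayWednesday": -4.626, "HighTemp": 2.955, "LowTemp": -4.728,
    "Precipitation": 3.089,
}

# Inflating each standard error by sqrt(dispersion) divides each z by it
scale = math.sqrt(rough_dispersion["rate"])
adjusted = {name: z / scale for name, z in rate_z.items()}

# Terms whose adjusted statistic would still exceed the usual 1.96 cut-off
still_large = [name for name, z in adjusted.items() if abs(z) > 1.96]

for name, value in rough_dispersion.items():
    print(f"{name:<14} deviance/df = {value:.2f}")
print(still_large)
```

Under these assumptions none of the rate model's terms would remain beyond the 1.96 cut-off, which is in line with the quasi-Poisson result above. The rough dispersion of the quasi-Poisson model (about 15.0) is also close to the value of 15.238 reported in its output.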

Discussion

This analysis explored the relationship between the cyclist counts on the Queensboro Bridge in 24-hour periods and the date of observation, along with other factors. A Poisson regression was conducted on both the count and the rate data. All variables in each model were found to be statistically significant; no variables were removed between the model fittings. The z and p values were given, as well as interpretations for each estimate.

The data set used for the analysis was described as the counts of cyclists per 24 hours entering and leaving Queens, Manhattan, and Brooklyn via the East River bridges. The response variable was QueensboroBridge, and the explanatory variables were Date, Day, HighTemp, LowTemp, and Precipitation. The variable Total was withheld from the model and instead used to scale the response variable for the MPR on rates.

During the exploratory data analysis, some adjustments were made to the variables in the data set. Non-numeric variables were changed to factors, and the variables QueensboroBridge and Total were changed from character variables to numeric variables. The Date variable was changed from a character variable to a date variable. Incidentally, the year 2023 was appended to each date; no effect on the analysis was detected.

Before the modeling process, the assumptions for Poisson regression were addressed, and output that may have demonstrated violations of those assumptions was given. Specifically, the mean and variance were found to be unequal, so while the response is count data, it may not be Poisson distributed. This may mean that the Poisson regression model was inappropriate. Further, one may not find an exponential relationship between the response and the explanatory variables after inspecting their scatter plots, which may also indicate that the Poisson regression model was inappropriate for this analysis. After these assumptions were checked, the modeling process followed.

After the MPR, a quasi-Poisson MPR was conducted. Dispersion was detected. Only the intercept was found to be statistically significant. The t and p values were given, as well as an interpretation of the estimate. The quasi-Poisson model used the same response variable and offset as the rate model; however, its explanatory variables were Day (from the initial data set), AvgTemp (the average of HighTemp and LowTemp), and NewPrecip (a binary variable equal to 0 when Precipitation is zero and 1 otherwise).

The residual deviance was assessed for the count, rate, and quasi-Poisson models. No evidence was found that it follows a \(\chi^2\) distribution on the residual degrees of freedom, so the models may not be useful. Note that the predictive ability of each model may vary; the models are similar but not the same. Further, the association analysis given by the Poisson models may be questionable because they do not account for dispersion. Additionally, the association analysis given by the quasi-Poisson model may be non-existent because only the intercept was found to be significant. With these considerations, a determination on a final model could not be reached.

Conclusion

Using the Poisson regression model to fit the data may not have been appropriate; there may have been multiple violations of the assumptions about the response variable. Mainly, the response may not have been Poisson distributed. Additionally, using the quasi-Poisson regression model to fit the data may not have been appropriate either; among other issues, it did not include a temporal explanatory variable. Lastly, based on the discrepancies between the models, a determination on a final model could not be reached. Further investigation may be recommended, though no direction seems apparent.