DATA712 Homework 06

Yung Ki Cho

2025-03-24

Introduction

This data set is available on NYC Open Data and is administered by the Transportation Division of the New York City Department of City Planning (NYCDCP). The count is conducted along 10 designated bicycle routes in Manhattan.

Methodology

This study investigates the following research question: What is the relationship between cyclist volume and the choice of travel location (bicycle lane, street, sidewalk) in NYC? To address this question, the study quantifies the relationship between cyclist volume and location across three distinct urban pathways in Manhattan: dedicated bicycle lanes (including protected paths), street roadways, and sidewalks. It also examines how overall ridership affects the distribution of traffic across these pathways. Prior to modeling, tests for overdispersion and excess zeros are conducted to select the most appropriate statistical approach. Simulation techniques from the clarify package are then used to estimate marginal effects and response functions, showing how varying ridership levels relate to traffic volume on each pathway.

Pre-Analysis

Understanding the Variables

  • CycBikeLane: Number of cyclists riding in the bicycle lane or protected bike path
  • CyclOtherLane: Number of cyclists riding in any one of the other travel lanes on the road (outside of the bike lane or protected bike path)
  • CycSidewalk: Number of cyclists riding on the sidewalks

Wrangling

  • Count the NA values in the data set.
  • Convert the NA values to 0.
  • Verify that no variable still contains an NA value.
sum(is.na(bc_data))
## [1] 14
bc_data[is.na(bc_data)] <- 0
colSums(is.na(bc_data))
##            LocationID          LocationType            TypeOfTime 
##                     0                     0                     0 
##              Location          Location_Lat         Location_Long 
##                     0                     0                     0 
##                  Year            TotalUsers      NonCyc_OtherUser 
##                     0                     0                     0 
##         CyclistVolume           CycBikeLane       CycAdjacentLane 
##                     0                     0                     0 
##         CyclOtherLane  CycCounterFlowInLane           CycSidewalk 
##                     0                     0                     0 
## CycCnterFlowOutOfLane       FemaleCyc_Total         MaleCyc_Total 
##                     0                     0                     0 
##     Female_Cyc_Helmet       Male_Cyc_Helmet       Cycl_Helmet_all 
##                     0                     0                     0 
##           Cyc_Under16         Citibike_Male       Citibike_female 
##                     0                     0                     0 
##          Citibike_All       Non_citibikeCyc 
##                     0                     0

Checking for zeros

Since there is only one zero (about 1% of observations), a zero-inflated regression model is not necessary.

## [1] "Number of Zeros: 1"
## [1] "Proportion of Zeros: 0.01"
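The code behind this check is not shown in the report. A minimal sketch of how such a count could be produced is given here, assuming the check is run on the outcome column (for example bc_data$CycBikeLane); the toy vector `counts` is a hypothetical stand-in so the snippet runs on its own.

```r
# Toy stand-in for the outcome column (assumption: the real check would use
# a column such as bc_data$CycBikeLane).
counts <- c(0, 12, 45, 7, 130, 88, 3, 56, 21, 9)

n_zero    <- sum(counts == 0)    # number of exact zeros
prop_zero <- mean(counts == 0)   # share of observations that are zero

print(paste("Number of Zeros:", n_zero))
print(paste("Proportion of Zeros:", round(prop_zero, 2)))
```

A small share of zeros, as in the report, gives little reason to model a separate zero-generating process.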

Analysis

Model 1 Poisson Regression

## 
## Call:
## glm(formula = CycBikeLane ~ CycSidewalk + CyclOtherLane, family = poisson, 
##     data = bc_data)
## 
## Coefficients:
##                 Estimate Std. Error z value            Pr(>|z|)    
## (Intercept)    6.3875381  0.0082061   778.4 <0.0000000000000002 ***
## CycSidewalk   -0.0104826  0.0002487   -42.1 <0.0000000000000002 ***
## CyclOtherLane  0.0026046  0.0000294    88.5 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 39293  on 107  degrees of freedom
## Residual deviance: 26902  on 105  degrees of freedom
## AIC: 27793
## 
## Number of Fisher Scoring iterations: 5

Checking for Overdispersion

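The overdispersion test reports an estimated dispersion of 273, far above the value of 1 that a Poisson model assumes (p < 0.001). This indicates severe overdispersion, meaning the Poisson model is not appropriate for these data.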

## 
##  Overdispersion test
## 
## data:  pm2
## z = 5, p-value = 0.0000008
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##        273
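One common way to gauge overdispersion by hand is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom. The sketch below uses simulated data (the variables `x` and `y` are hypothetical, not from bc_data). A formal test such as the one shown above may use a different estimator, so the two numbers need not match exactly.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 2)
# Negative binomial draws: variance = mu + mu^2 / size, so it exceeds the mean.
y <- rnbinom(n, mu = exp(2 + 0.5 * x), size = 1.5)

fit_pois <- glm(y ~ x, family = poisson)

# Pearson dispersion: sum of squared Pearson residuals / residual df.
# Values near 1 are consistent with Poisson; values well above 1 are not.
pearson_disp <- sum(residuals(fit_pois, type = "pearson")^2) /
  df.residual(fit_pois)
pearson_disp
```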

Model 2 Negative Binomial Regression

## 
## Call:
## glm.nb(formula = CycBikeLane ~ CycSidewalk + CyclOtherLane, data = bc_data, 
##     init.theta = 2.814935993, link = log)
## 
## Coefficients:
##                Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)    6.325329   0.130329   48.53 < 0.0000000000000002 ***
## CycSidewalk   -0.009232   0.003432   -2.69               0.0072 ** 
## CyclOtherLane  0.002847   0.000524    5.43          0.000000057 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.81) family taken to be 1)
## 
##     Null deviance: 166.71  on 107  degrees of freedom
## Residual deviance: 118.65  on 105  degrees of freedom
## AIC: 1596
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.815 
##           Std. Err.:  0.374 
## 
##  2 x log-likelihood:  -1588.432

Poisson vs Negative Binomial

Comparison of Model 1 (Poisson) and Model 2 (Negative Binomial)

Model 1 (Poisson) assumes that the mean and variance of the bike lane cyclist counts are equal, which is not the case in the data (as seen from the overdispersion issue in previous tests). Model 2 (Negative Binomial) accounts for overdispersion, indicated by the theta value (2.815), which suggests the presence of extra variability beyond what the Poisson model can handle.

Coefficient Estimates and Significance:

Both models show a negative association between sidewalk cycling and bike lane usage (i.e., more cyclists on the sidewalk are associated with fewer cyclists in the bike lane). However, the effect size in Model 2 (-0.0092) is slightly smaller in magnitude than in Model 1 (-0.0105), and its standard error is much larger (0.0034 versus 0.0002). The effect of cycling in other lanes (CyclOtherLane) is positive and significant in both models, meaning that more cyclists in other lanes are associated with more cyclists in the bike lane.
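Because both models use a log link, the coefficients are on the log scale. Exponentiating them gives the multiplicative change in the expected count per one-unit increase in the predictor. The short calculation below applies this to the Model 2 estimates reported above; the variable names are illustrative only.

```r
# Coefficients copied from the Model 2 (negative binomial) output above.
b_sidewalk <- -0.009232
b_other    <-  0.002847

# exp(coefficient) is the multiplicative change in the expected count
# for a one-unit increase in the predictor (an incidence rate ratio).
irr_sidewalk <- exp(b_sidewalk)
irr_other    <- exp(b_other)

# Percent change in expected bike lane cyclists per additional cyclist.
pct_sidewalk <- 100 * (irr_sidewalk - 1)
pct_other    <- 100 * (irr_other - 1)

round(c(sidewalk = pct_sidewalk, other_lane = pct_other), 2)
```

Read this way, each additional sidewalk cyclist corresponds to roughly 0.9% fewer expected bike lane cyclists, and each additional cyclist in other lanes to roughly 0.3% more, holding the other predictor fixed.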

Model Fit (Residual Deviance & AIC):

Model 1 (Poisson) has a high residual deviance (26,902 on 105 degrees of freedom) and a high AIC (27,793), indicating poor fit. Model 2 (Negative Binomial) has a much lower residual deviance (118.65 on 105 degrees of freedom) and a much lower AIC (1,596), suggesting a substantially better fit.
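A comparison of this kind can be sketched with simulated overdispersed counts. The data and the names `fit_pois` and `fit_nb` below are hypothetical; glm.nb comes from the MASS package, which matches the call shown in the Model 2 output. AIC is based on the log-likelihood, so it is generally the safer quantity for comparing models from different families.

```r
library(MASS)  # provides glm.nb

set.seed(42)
n <- 300
x <- runif(n, 0, 2)
y <- rnbinom(n, mu = exp(2 + 0.5 * x), size = 2)  # overdispersed counts

fit_pois <- glm(y ~ x, family = poisson)
fit_nb   <- glm.nb(y ~ x)

# Lower AIC indicates a better trade-off between fit and complexity.
aic_table <- AIC(fit_pois, fit_nb)
aic_table
```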

Simulation Techniques

Average Marginal Effect of Sidewalk Cyclists on Bike Lane Cyclists

The negative estimate (-7.16) indicates that each additional sidewalk cyclist is associated with about 7 fewer bike lane cyclists, on average. The 95% confidence interval (-12.55 to -1.87) lies entirely below zero, so the negative association is unlikely to be due to chance alone.

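library(clarify)
# Simulate coefficient draws from the fitted model pm3
sim_coefs3 <- sim(pm3)
# Average marginal effect of CycSidewalk on the expected outcome
sim_est3 <- sim_ame(sim_coefs3, var = "CycSidewalk",
                    contrast = "rd")
summary(sim_est3)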
##                      Estimate  2.5 % 97.5 %
## E[dY/d(CycSidewalk)]    -7.16 -12.55  -1.87
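To make the simulation idea concrete, the sketch below computes an average marginal effect by hand on simulated data. It is not the clarify implementation, only an illustration of the same general approach: draw coefficient vectors from a multivariate normal centered on the estimates, compute the quantity of interest for each draw, and take percentiles. The data frame `dat` and the helper `ame` are hypothetical.

```r
library(MASS)  # glm.nb and mvrnorm

set.seed(123)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- rnbinom(n, mu = exp(3 - 0.1 * dat$x), size = 3)

fit <- glm.nb(y ~ x, data = dat)

# Average marginal effect of x for a given coefficient vector.
# With a log link, dE[Y]/dx = beta_x * exp(X %*% beta), averaged over rows.
X <- model.matrix(fit)
ame <- function(beta) mean(beta["x"] * exp(X %*% beta))

# Draw coefficient vectors from their approximate sampling distribution.
draws <- mvrnorm(1000, mu = coef(fit), Sigma = vcov(fit))
ame_draws <- apply(draws, 1, ame)

ame_hat <- ame(coef(fit))                         # point estimate
ame_ci  <- quantile(ame_draws, c(0.025, 0.975))   # percentile interval
c(estimate = ame_hat, ame_ci)
```

Because the interval comes from random draws, its endpoints shift slightly from run to run unless a seed is set.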

Average Marginal Effect of Other-Lane Cyclists on Bike Lane Cyclists

The estimate (2.21) represents the expected change in the number of cyclists in the bike lane when the number of cyclists in other lanes increases by one. The 95% confidence interval (1.32 to 3.23) lies entirely above zero, indicating a positive association.

sim_est3a <- sim_ame(sim_coefs3, var = "CyclOtherLane",
                     contrast = "rd")
summary(sim_est3a)
##                        Estimate 2.5 % 97.5 %
## E[dY/d(CyclOtherLane)]     2.21  1.32   3.23

Graph 1: ADRF (Average Dose-Response Function)

The black line shows the estimated dose-response curve, which reflects how the expected outcome changes as CycSidewalk increases. The shaded gray area represents the confidence interval for these estimates.

Graph 2: AMEF (Average Marginal Effect Function)

The black line represents the estimated marginal effect on the outcome at different levels of CycSidewalk, and the shaded band is the confidence interval indicating the uncertainty around the estimate. The marginal effect starts negative at lower levels of CycSidewalk and increases (becomes less negative) as CycSidewalk increases. This suggests that increasing CycSidewalk has a diminishing negative impact on the outcome, with the effect approaching zero at higher levels.
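Both curves can be sketched by hand on simulated data. The dose-response value at a given level is the average predicted count when every observation is assigned that level, and the marginal effect at that level is the slope of this curve. The example below is a hypothetical illustration (the variables `x`, `z`, and `y` are not from bc_data) and omits the confidence bands. It also shows why the pattern described above arises: with a log link the expected count is exponential in the predictor, so a negative slope automatically flattens toward zero as the predicted count falls.

```r
library(MASS)  # provides glm.nb

set.seed(7)
n <- 200
dat <- data.frame(x = runif(n, 0, 10), z = runif(n, 0, 5))
dat$y <- rnbinom(n, mu = exp(3 - 0.1 * dat$x + 0.2 * dat$z), size = 3)

fit <- glm.nb(y ~ x + z, data = dat)

# Dose-response: set x to each grid value for every row (z stays as observed)
# and average the predicted counts.
grid <- seq(0, 10, by = 2)
adrf <- sapply(grid, function(v) {
  mean(predict(fit, newdata = transform(dat, x = v), type = "response"))
})

# Marginal effect at each grid value: with a log link, the slope of the
# averaged prediction with respect to x is beta_x times that prediction.
amef <- unname(coef(fit)["x"]) * adrf

data.frame(x = grid, expected_y = adrf, marginal_effect = amef)
```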

Summary

Conclusion: Model

Model 2 (Negative Binomial) is the better choice because it accounts for overdispersion and therefore provides more realistic standard errors for the coefficient estimates. The key takeaway remains that more sidewalk cycling is associated with slightly lower bike lane usage, while cycling in other lanes is positively associated with the number of bike lane cyclists.

Conclusion: Graph

Together, these graphs suggest that while increasing CycSidewalk negatively impacts the outcome, this effect becomes less pronounced at higher levels. This could indicate diminishing returns or a saturation effect where further increases in CycSidewalk have less additional impact.