Modeling the Impact of the IBX on 60minute+ commute times

Foundation

Census Data

Looking at census data from 2022.

Picked 2022 because:

This is a good Post-Pandemic option; NYC was “fully reopened”
The 5-Year ACS helps avoid the 2020 collection gap (where due to COVID, the Census Bureau couldn’t send people door to door) and the 2021 remote-forward living arrangements, where many people left Manhattan during the high period of remote work.
The 5-year estimate is a rolling average (2018–2022), and some COVID-impacted data is included either way. However, multiple years of more robust data are included in this time window.

Another option would be to switch to the most recent census data available, but it’s worth noting that that won’t avoid the COVID data being included in the 5 -year estimate.

Current Vars from census

#CENSUS_API_KEY is in environment 

nyc_counties <- c("New York", "Kings", "Queens", "Bronx", "Richmond")

 
# Entire 2022 5-Year ACS variable dictionary
acs_vars_2022 <- load_variables(2022, "acs5", cache = TRUE)

# MISC.
# Exploring different potential features
feature_exploration <- acs_vars_2022 %>%
  filter(
    str_detect(str_to_lower(label), "public transportation|bicycle|walked|worked from home") |
    str_detect(str_to_lower(concept), "means of transportation to work|travel time")
  )

# Comment Me out before final knit
#View(feature_exploration)

# hiding the results of this chunk becuase it jsut is unreadable in the knitted document

# Census variables
vars <- c(
  # Base Demographics
  no_car      = "B08201_002", 
  med_income  = "B19013_001", 
  med_age     = "B01002_001",
  total_pop   = "B01003_001",
  
  # Transportation Base & Features
  tot_workers = "B08301_001",
  transit     = "B08301_010",
  walk        = "B08301_019",
  wfh         = "B08301_021",
  
  # Travel Time Features
  # Potential target variable
  commute_60  = "B08303_012",
  commute_90  = "B08303_013",
  
  # Education Base & Features
  tot_pop_25  = "B15003_001",
  bachelors   = "B15003_022",
  
  # Occupation Base & The 5 Major Buckets
  tot_employ  = "C24010_001",
  mngmt_job   = "C24010_002",
  service_job = "C24010_014",
  sales_job   = "C24010_021",
  const_job   = "C24010_030",
  trans_job   = "C24010_034",

  # Continuous Age Spectrum (Table B01001 - Detailed Age)
  # 18 to 24
  m_18_19="B01001_007", m_20="B01001_008", m_21="B01001_009", m_22_24="B01001_010",
  f_18_19="B01001_031", f_20="B01001_032", f_21="B01001_033", f_22_24="B01001_034",
  # 25 to 34
  m_25_29="B01001_011", m_30_34="B01001_012", f_25_29="B01001_035", f_30_34="B01001_036",
  # 35 to 44
  m_35_39="B01001_013", m_40_44="B01001_014", f_35_39="B01001_037", f_40_44="B01001_038",
  # 45 to 64
  m_45_49="B01001_015", m_50_54="B01001_016", m_55_59="B01001_017", m_60_61="B01001_018", m_62_64="B01001_019",
  f_45_49="B01001_039", f_50_54="B01001_040", f_55_59="B01001_041", f_60_61="B01001_042", f_62_64="B01001_043",
  # 65 Plus
  m_65_66="B01001_020", m_67_69="B01001_021", m_70_74="B01001_022", m_75_79="B01001_023", m_80_84="B01001_024", m_85_up="B01001_025",
  f_65_66="B01001_044", f_67_69="B01001_045", f_70_74="B01001_046", f_75_79="B01001_047", f_80_84="B01001_048", f_85_up="B01001_049",

  # Commercial vs. Residential Proxy (Housing)
  tot_housing = "B25002_001",
  vacant_unit = "B25002_003",
  tot_occup   = "B25003_001",
  renter_unit = "B25003_003",

  # Economic Barrier Spectrum (Poverty)
  tot_pov_base = "B17001_001",
  pov_below    = "B17001_002"
) 

# Getting from census data
nyc_tracts <- get_acs(
  geography = "tract",
  variables = vars,
  state = "NY",
  county = nyc_counties,
  geometry = TRUE, 
  year = 2022,
  output = "wide"
)

Feature Engineering Census Data

This chunk does an explicit selection of columns from the census data to start with. (There are just so many features in that data that clearly selecting and renaming when applicable seemed like a cleaner route.)

After that this chunk is doing some feature engineering on the census data. Mostly turning raw count data into percent of population data.

pct_commute_60p is currently being used as the target variable. It’s the count of people with commutes over 60 min + the count of people with commutes over 90 minutes. Divided by total number of workers. x 100 to make is a percentage.
pct_no_car is percent of people who do not have a car
pct_transit is percent of workers using public transit (mta, buses not uber/taxi) as primary commute methos
pct_walk is percent of workers using walking as primary commute method
pct_wfhis percent of workers working from home
pct_bachelors is the percent of 25+ year olds with Bachelor’s Degrees or higher
Occupation buckets
pct_mngmt is percent of workers in management, business, science, and arts
pct_service is percent of workers in
pct_sales is percent of workers in sales and office administration
pct_const is percent of workers in natural resources, construction, and maintenance
pct_trans is percent of workers in production, transportation, and material moving
area_sq_miles is the area in square miles
pop_densityis the numebr of people per square mile
Age GRoups
I didn’t want to bucket the ages because I know having continuous data is better, but unfortuantely the census already had some underlying bucketing
pct_age_18_24 is the percent of the population in that age range
pct_age_25_34 is the percent of the population in that age range
pct_age_35_44 is the percent of the population in that age range
pct_age_45_64 is the percent of the population in that age range
pct_age_65_plus is the percent of the population in that age range
pct_vacant is the percent of housing that is vacant
pct_renter is the percent of occupued units that are being rented
pct_poverty is the percent of people living below the poverty line out of the total population for whom poverty status is determined

Not all of the engineered features are used in the later model.

nyc_tracts <- nyc_tracts %>%
  # Explicitly rename every variable 
  select(
    GEOID, NAME, geometry,
    
    # Base Demographics
    no_car      = no_carE, 
    med_income  = med_incomeE, 
    med_age     = med_ageE, 
    total_pop   = total_popE, 
    tot_workers = tot_workersE, 
    tot_pop_25  = tot_pop_25E, 
    tot_employ  = tot_employE,
    
    # Transportation & Travel Time
    transit     = transitE, 
    walk        = walkE, 
    wfh         = wfhE, 
    commute_60  = commute_60E, 
    commute_90  = commute_90E, 
    
    # Education & Occupation
    bachelors   = bachelorsE, 
    mngmt_job   = mngmt_jobE, 
    service_job = service_jobE, 
    sales_job   = sales_jobE, 
    const_job   = const_jobE, 
    trans_job   = trans_jobE,
    
    # Age - Male
    m_18_19 = m_18_19E, m_20 = m_20E, m_21 = m_21E, m_22_24 = m_22_24E,
    m_25_29 = m_25_29E, m_30_34 = m_30_34E,
    m_35_39 = m_35_39E, m_40_44 = m_40_44E,
    m_45_49 = m_45_49E, m_50_54 = m_50_54E, m_55_59 = m_55_59E, m_60_61 = m_60_61E, m_62_64 = m_62_64E,
    m_65_66 = m_65_66E, m_67_69 = m_67_69E, m_70_74 = m_70_74E, m_75_79 = m_75_79E, m_80_84 = m_80_84E, m_85_up = m_85_upE,
    
    # Age - Female
    f_18_19 = f_18_19E, f_20 = f_20E, f_21 = f_21E, f_22_24 = f_22_24E,
    f_25_29 = f_25_29E, f_30_34 = f_30_34E,
    f_35_39 = f_35_39E, f_40_44 = f_40_44E,
    f_45_49 = f_45_49E, f_50_54 = f_50_54E, f_55_59 = f_55_59E, f_60_61 = f_60_61E, f_62_64 = f_62_64E,
    f_65_66 = f_65_66E, f_67_69 = f_67_69E, f_70_74 = f_70_74E, f_75_79 = f_75_79E, f_80_84 = f_80_84E, f_85_up = f_85_upE,

    # Housing & Poverty
    tot_housing  = tot_housingE, 
    vacant_unit  = vacant_unitE, 
    tot_occup    = tot_occupE, 
    renter_unit  = renter_unitE, 
    tot_pov_base = tot_pov_baseE, 
    pov_below    = pov_belowE
  ) %>%
  
  # Feature Engineering: Convert raw counts to percentages
  mutate(
    pct_no_car      = (no_car / total_pop) * 100,
    pct_transit     = (transit / tot_workers) * 100,
    pct_walk        = (walk / tot_workers) * 100,
    pct_wfh         = (wfh / tot_workers) * 100,
    pct_commute_60p = ((commute_60 + commute_90) / tot_workers) * 100,
    pct_bachelors   = (bachelors / tot_pop_25) * 100,
    pct_mngmt       = (mngmt_job / tot_employ) * 100,
    pct_service     = (service_job / tot_employ) * 100,
    pct_sales       = (sales_job / tot_employ) * 100,
    pct_const       = (const_job / tot_employ) * 100,
    pct_trans       = (trans_job / tot_employ) * 100,
    area_sq_miles   = as.numeric(st_area(geometry)) / 2589988, 
    pop_density     = total_pop / area_sq_miles,
    
    # Continuous Age Spectrum
    pct_age_18_24   = ((m_18_19 + m_20 + m_21 + m_22_24 + f_18_19 + f_20 + f_21 + f_22_24) / total_pop) * 100,
    pct_age_25_34   = ((m_25_29 + m_30_34 + f_25_29 + f_30_34) / total_pop) * 100,
    pct_age_35_44   = ((m_35_39 + m_40_44 + f_35_39 + f_40_44) / total_pop) * 100,
    pct_age_45_64   = ((m_45_49 + m_50_54 + m_55_59 + m_60_61 + m_62_64 + f_45_49 + f_50_54 + f_55_59 + f_60_61 + f_62_64) / total_pop) * 100,
    pct_age_65_plus = ((m_65_66 + m_67_69 + m_70_74 + m_75_79 + m_80_84 + m_85_up + f_65_66 + f_67_69 + f_70_74 + f_75_79 + f_80_84 + f_85_up) / total_pop) * 100,
    
    # Commercial/Residential Proxy
    pct_vacant      = (vacant_unit / tot_housing) * 100,
    pct_renter      = (renter_unit / tot_occup) * 100,
    
    # Economic Barrier
    pct_poverty     = (pov_below / tot_pov_base) * 100
  ) %>%
  
  # Explicitly drop the raw counts so they don't leak into the model
  select(
    -no_car, -transit, -walk, -wfh, -commute_60, -commute_90, -bachelors, 
    -mngmt_job, -service_job, -sales_job, -const_job, -trans_job, 
    -tot_workers, -tot_pop_25, -tot_employ,
    -m_18_19, -m_20, -m_21, -m_22_24, -m_25_29, -m_30_34, -m_35_39, -m_40_44, 
    -m_45_49, -m_50_54, -m_55_59, -m_60_61, -m_62_64, -m_65_66, -m_67_69, -m_70_74, -m_75_79, -m_80_84, -m_85_up,
    -f_18_19, -f_20, -f_21, -f_22_24, -f_25_29, -f_30_34, -f_35_39, -f_40_44, 
    -f_45_49, -f_50_54, -f_55_59, -f_60_61, -f_62_64, -f_65_66, -f_67_69, -f_70_74, -f_75_79, -f_80_84, -f_85_up,
    -tot_housing, -vacant_unit, -tot_occup, -renter_unit, -tot_pov_base, -pov_below
  )

Checking for missing census data

## Checking for total NAs and NaNs in each column
colSums(is.na(nyc_tracts %>% st_drop_geometry())) %>%
  kable(caption = "# of NAs and NaNs in each column")

# of NAs and NaNs in each column
	x
GEOID	0
NAME	0
med_income	126
med_age	88
total_pop	0
pct_no_car	85
pct_transit	90
pct_walk	90
pct_wfh	90
pct_commute_60p	90
pct_bachelors	85
pct_mngmt	90
pct_service	90
pct_sales	90
pct_const	90
pct_trans	90
area_sq_miles	0
pop_density	3
pct_age_18_24	85
pct_age_25_34	85
pct_age_35_44	85
pct_age_45_64	85
pct_age_65_plus	85
pct_vacant	93
pct_renter	96
pct_poverty	90

## Percentage of missing data 
# (is it a low enough percent to be negligible) 
missing_summary <- nyc_tracts %>%
  st_drop_geometry() %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Missing_Count") %>%
  mutate(
    Total_Rows = nrow(nyc_tracts),
    Pct_Missing = (Missing_Count / Total_Rows) * 100,
    
    # Flag if the missing data is below 5% 
    Is_Negligible_Under_5Pct = Pct_Missing < 5 
  ) %>%
  # Filter to only show features that actually have missing data
  filter(Missing_Count > 0) %>% 
  arrange(desc(Pct_Missing))

# Output Percentage of Missing Data
kable(missing_summary, digits = 2, caption = "Missing Data Percentage by Feature")

Missing Data Percentage by Feature
Feature	Missing_Count	Total_Rows	Pct_Missing	Is_Negligible_Under_5Pct
med_income	126	2327	5.41	FALSE
pct_renter	96	2327	4.13	TRUE
pct_vacant	93	2327	4.00	TRUE
pct_transit	90	2327	3.87	TRUE
pct_walk	90	2327	3.87	TRUE
pct_wfh	90	2327	3.87	TRUE
pct_commute_60p	90	2327	3.87	TRUE
pct_mngmt	90	2327	3.87	TRUE
pct_service	90	2327	3.87	TRUE
pct_sales	90	2327	3.87	TRUE
pct_const	90	2327	3.87	TRUE
pct_trans	90	2327	3.87	TRUE
pct_poverty	90	2327	3.87	TRUE
med_age	88	2327	3.78	TRUE
pct_no_car	85	2327	3.65	TRUE
pct_bachelors	85	2327	3.65	TRUE
pct_age_18_24	85	2327	3.65	TRUE
pct_age_25_34	85	2327	3.65	TRUE
pct_age_35_44	85	2327	3.65	TRUE
pct_age_45_64	85	2327	3.65	TRUE
pct_age_65_plus	85	2327	3.65	TRUE
pop_density	3	2327	0.13	TRUE

## Map for missing income data
missing_map_data <- nyc_tracts %>%
  mutate(missing_income_flag = ifelse(is.na(med_income), "Missing Data", "Valid Data"))

ggplot(missing_map_data) +
  geom_sf(aes(fill = missing_income_flag), color = "white", size = 0.1) +
  scale_fill_manual(values = c("Valid Data" = "#e0e0e0", "Missing Data" = "#d73027")) +
  theme_void() +
  labs(title = "Map of Missing Census Income Data",
       fill = "Status")

Census NA Handling

This chunk is currently dropping census tracts that have very few people or no workers.

It’s also imputing where income or age was missing using the median value. This was just a simple implementation, but we can definitely consider more robust imputing methos, and also properly investingating MAR, MNAR etc. (Though given the nature of census data, MNAR is a good assumption.)

## Drop tracts with less than 50 people or 0 workers
# There were a ton of tracts with a small number of or 0 people, screwing with denominators and also making for super volatile percentages
nyc_tracts_clean <- nyc_tracts %>%
  filter(total_pop >= 50, !is.na(pct_transit))

## Imputing
# There were also many tracts where income or age was missing, but it wasn't due to a denominator issue
# Just doing simple median imputation
# More robust method would be better
# Checking for MNAR and stuff would be better
# The current median imputation: 
nyc_tracts_clean <- nyc_tracts_clean %>%
  # Extract the borough from the NAME string (e.g., "Queens County")
  mutate(borough = str_extract(NAME, "(?<=; ).*(?= County)")) %>%
  
  # Group by Borough so the imputation is localized
  group_by(borough) %>%
  mutate(
    med_income = ifelse(is.na(med_income), 
                        median(med_income, na.rm = TRUE), 
                        med_income),
    med_age = ifelse(is.na(med_age), 
                     median(med_age, na.rm = TRUE), 
                     med_age)
  ) %>%
  ungroup() %>%
  
  # Drop the temporary borough column 
  select(-borough)
 
print(paste("Original tracts:", nrow(nyc_tracts)))

## [1] "Original tracts: 2327"

print(paste("Cleaned tracts of NAs:", nrow(nyc_tracts_clean)))

## [1] "Cleaned tracts of NAs: 2226"

# Save 
census_save_path <- file.path(projPath, "data", "nyc_census_tracts.rds") 
saveRDS(nyc_tracts_clean, census_save_path)

print(paste("Census features and geometry saved to:", census_save_path))

## [1] "Census features and geometry saved to: C:/Users/cdube/src/DATA_622_Project/data/nyc_census_tracts.rds"

# Peak
head(nyc_tracts_clean)

## Simple feature collection with 6 features and 26 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -73.91641 ymin: 40.82439 xmax: -73.84726 ymax: 40.90264
## Geodetic CRS:  NAD83
## # A tibble: 6 × 27
##   GEOID       NAME  med_income med_age total_pop pct_no_car pct_transit pct_walk
##   <chr>       <chr>      <dbl>   <dbl>     <dbl>      <dbl>       <dbl>    <dbl>
## 1 36005013500 Cens…      27602    36.5      3125       31.8        48.6    14.5 
## 2 36005009200 Cens…      56208    33.9      5959       14.1        38.6     8.07
## 3 36005005400 Cens…      48750    30.5      5715       20.6        64.7     6.20
## 4 36005036501 Cens…      41287    31.8      4196       23.1        60.3     6.61
## 5 36005044902 Cens…      88264    32.4      2146       12.1        37.8     2.43
## 6 36005017500 Cens…      26463    33.9      6116       39.5        69.1     8.69
## # ℹ 19 more variables: pct_wfh <dbl>, pct_commute_60p <dbl>,
## #   pct_bachelors <dbl>, pct_mngmt <dbl>, pct_service <dbl>, pct_sales <dbl>,
## #   pct_const <dbl>, pct_trans <dbl>, area_sq_miles <dbl>, pop_density <dbl>,
## #   pct_age_18_24 <dbl>, pct_age_25_34 <dbl>, pct_age_35_44 <dbl>,
## #   pct_age_45_64 <dbl>, pct_age_65_plus <dbl>, pct_vacant <dbl>,
## #   pct_renter <dbl>, pct_poverty <dbl>, geometry <MULTIPOLYGON [°]>

Taxi Zones

The taxi zone data is luckily pretty straightforward. It’s just being imported here.

#Zip file from below url is downloaded and extracted into repo
#taxi_zones_url <- "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip" 

# Read the shapefile
taxi_zones_sf <- st_read(file.path(projPath, 'taxi_zones', 'taxi_zones', "taxi_zones.shp")) %>%
  st_transform(4326) # Transform to standard Lat/Long (WGS84)

## Reading layer `taxi_zones' from data source 
##   `C:\Users\cdube\src\DATA_622_Project\taxi_zones\taxi_zones\taxi_zones.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)

# Lookup table with names and boroughs is downloaded into repo
#zone_lookup_url <- "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv"
zone_lookup <- read_csv(file.path(projPath, 'taxi_zone_lookup.csv'))

## Rows: 265 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Borough, Zone, service_zone
## dbl (1): LocationID
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Join 
taxi_zones <- taxi_zones_sf %>%
  left_join(zone_lookup, by = c("LocationID" = "LocationID"))

# Peak: Does this map look right 
plot(st_geometry(taxi_zones))

TLC Data

TLC Data is from the NYC Taxi and Limousine Commission.

High volume for hire vehicles = Uber/Lyft.

Originally was pulling January 2026 data for High Volume for-hire vehicles.
https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2026-01.parquet
Currently pulling October 2023 data for yellow taxis, green taxis, ~for-hire vehicles (FHV (local livery/black car)),~ and high volume for-hire vehicles (FHVHV (Uber/Lyft) Uber/Lyft). This made a little more sense with the 2022 census data, and with the seasonality of commuting and to avoid holiday season.
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-10.parquet
https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-10.parquet
https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2023-10.parquet
https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-10.parquet

## Aggregate all vehicular trips for October 2023
# Combine Yellow, Green, FHV, and High-Volume FHV (Uber/Lyft) 
# See "all" the car-based 'Last-Mile' choices

# URL definitions for source data
tlc_urls <- c(
  yellow = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-10.parquet",
  green  = "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-10.parquet",
  #fhv    = "https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2023-10.parquet",
  fhvhv  = "https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-10.parquet"
)

# Function to read and count trips by Taxi Zone 
# PULocationID can be used to get the # of trips that started from (were picked up in) a zone
count_zone_trips <- function(url) { 
  read_parquet(url, col_select = any_of(c("PULocationID"))) %>%
    count(PULocationID, name = "trips") %>%
    filter(!is.na(PULocationID))
}

# Process the sources and sum them up 
# Getting number of trips that started per pick up location
# This 'map_dfr' loop downloads each, counts by zone, and stacks them.
total_zone_counts <- tlc_urls %>%
  map_dfr(count_zone_trips) %>%
  group_by(PULocationID) %>%
  summarize(total_vehicular_trips = sum(trips))

Census uses Tracts, TLC uses Taxi Zones

So Census Tracts are not the same as Taxi Zones. This chunk tries to handle it.

EPSG:2263 is apparently a specific map projection built just for the New York/Long Island area. In it, 1 unit equals exactly 1 physical foot. (I had not ever used this before.) The idea is that if the censues tracts and the taxi zones are both transformed to EPSG:2263, then this will be the most precise way to layer these maps on top of one another and see where they intersect. (At least it appears to be more precise than using latitude and longitude.)

st_intersection() lays the two maps on top of one another and separated out the overlapping pieces properly.

# Coordinate Reference System (CRS)
# EPSG:2263 (NAD83 / New York Long Island) uses feet 
nyc_tracts_proj <- st_transform(nyc_tracts, 2263)
taxi_zones_proj <- st_transform(taxi_zones, 2263)

# Intersect regions
# This creates a new spacial dataframe where tracts and zones overlap
intersection <- st_intersection(nyc_tracts_proj, taxi_zones_proj) %>%
  mutate(intersect_area = st_area(.))

## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries

# Calculate the weights
# For every unique Census Tract, 
# What % of Tract X's area is inside Taxi Zone Y?
tract_weights <- intersection %>%
  as_tibble() %>%
  group_by(GEOID) %>%
  mutate(total_tract_area = sum(intersect_area)) %>%
  ungroup() %>%
  mutate(weight = as.numeric(intersect_area / total_tract_area)) %>%
  select(GEOID, LocationID, zone, weight)

# Peak Crosswalk 
head(tract_weights)

## # A tibble: 6 × 4
##   GEOID       LocationID zone                       weight
##   <chr>            <dbl> <chr>                       <dbl>
## 1 36081071600          2 Jamaica Bay             0.000628 
## 2 36047070203          2 Jamaica Bay             1        
## 3 36081107201          2 Jamaica Bay             0.0124   
## 4 36081100801          2 Jamaica Bay             0.00106  
## 5 36081107202          2 Jamaica Bay             0.989    
## 6 36005037000          3 Allerton/Pelham Gardens 0.0000606

The below chunk uses the newly defined geography, and then it disperses the uber/taxi trip data (which uses taxi zones) fairly across the census tracts.

# issue is that taxi zones are larger than census tracts, and we obvs want to use the data together 
# Assumes 'tract_weights' object exists from your earlier step.

tract_vehicular_estimates <- tract_weights %>%
  left_join(total_zone_counts, by = c("LocationID" = "PULocationID")) %>%
  mutate(estimated_trips = total_vehicular_trips * weight) %>%
  group_by(GEOID) %>%
  summarize(total_uber_taxi_trips = sum(estimated_trips, na.rm = TRUE))

# Save the 'Vehicular' side of the Master Table
save_path_tlc <- file.path(projPath, "data", "processed_tlc_counts.rds")
saveRDS(tract_vehicular_estimates, save_path_tlc)

print(paste("Vehicular trip estimates for October 2023 saved to:", save_path_tlc))

## [1] "Vehicular trip estimates for October 2023 saved to: C:/Users/cdube/src/DATA_622_Project/data/processed_tlc_counts.rds"

Citibike

Citibike Data Download
- Because the Citibike publishes their data online in zip files, it can’t be read directly from web.
Originally was pulling January 2026 data for citibike
- 202601-citibike-tripdata_1.csv
- 202601-citibike-tripdata_2.csv
Currently pulling October 2023 data

# Map Station IDs to Census Tracts  

# file locations
cb_file1 <- file.path(projPath, "citibike", "202310-citibike-tripdata_1.csv")
cb_file2 <- file.path(projPath, "citibike", "202310-citibike-tripdata_2.csv")
cb_file3 <- file.path(projPath, "citibike", "202310-citibike-tripdata_3.csv")
cb_file4 <- file.path(projPath, "citibike", "202310-citibike-tripdata_4.csv")
cb_files <- c(cb_file1, cb_file2, cb_file3, cb_file4)

# Station-to-Tract Lookup Table
# unique coordinates for every station found in files
all_stations <- cb_files %>%
  map_dfr(~read_csv(.x, 
                    col_select = c(start_station_id, start_lat, start_lng),
                    # Force the station ID to be a character, and coords to be numeric
                    col_types = cols(
                      start_station_id = col_character(),
                      start_lat = col_double(),
                      start_lng = col_double()
                    ))) %>%
  filter(!is.na(start_lat)) %>%
  distinct(start_station_id, .keep_all = TRUE)

# Spatial Work: Convert Stations to Points
stations_sf <- st_as_sf(all_stations, coords = c("start_lng", "start_lat"), crs = 4326) %>%
  st_transform(2263) # Project to NYC Feet (EPSG:2263)

# Ensure nyc_tracts (from Step 1) matches the projection
nyc_tracts_proj <- st_transform(nyc_tracts, 2263)

# Find which Tract Polygon physically contains each Station Point
station_tract_lookup <- st_join(stations_sf, nyc_tracts_proj, join = st_intersects) %>%
  as_tibble() %>%
  select(start_station_id, GEOID)

# Count the number of unique stations in each Census Tract
tract_station_counts <- station_tract_lookup %>%
  group_by(GEOID) %>%
  summarize(total_stations = n()) %>%
  filter(!is.na(GEOID))

# Count Trips per Tract 
tract_bike_counts <- cb_files %>%
  map_dfr(~read_csv(.x, 
                    col_select = c(start_station_id),
                    col_types = cols(start_station_id = col_character()) 
                    )) %>%
  left_join(station_tract_lookup, by = "start_station_id") %>%
  group_by(GEOID) %>%
  summarize(total_bike_trips = n()) %>%
  filter(!is.na(GEOID)) %>%
  # NEW: Attach the station capacity feature to the trip counts
  left_join(tract_station_counts, by = "GEOID")


# Save the final "Label" feature for the Machine Learning model
save_path <- file.path(projPath, "data", "processed_bike_counts.rds")
saveRDS(tract_bike_counts, save_path)

print(paste("Success! Bike counts saved to:", save_path))

## [1] "Success! Bike counts saved to: C:/Users/cdube/src/DATA_622_Project/data/processed_bike_counts.rds"

Distance from Subway

Source is https://data.ny.gov/Transportation/MTA-Subway-Entrances-and-Exits-2024/i9wp-a4ja/about_data.

File name is “MTA_Subway_Entrances_and_Exits__2024.csv”

## Distance to Subway 

# Load csv
subway_csv_path <- file.path(projPath, "MTA_Subway_Entrances_and_Exits__2024.csv")
subway_raw <- read_csv(subway_csv_path) %>%
  clean_names() # otherwise this had spaces in the column names

## Rows: 2120 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): Division, Line, Borough, Stop Name, Constituent Station Name, GTFS...
## dbl  (4): Complex ID, Station ID, Entrance Latitude, Entrance Longitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Transform into a Spatial Object (sf)
subway_entrances_sf <- subway_raw %>%
  filter(!is.na(entrance_latitude) & !is.na(entrance_longitude)) %>% # 'entrance_longitude' is X and 'entrance_latitude' is Y
  st_as_sf(coords = c("entrance_longitude", "entrance_latitude"), crs = 4326) %>% # telling it to read this as 4326, the standard GPS coordinate system  
  st_transform(2263) # Converting to the same 1 unit = 1 foot NYC-specific map that was used earlier

# Centroids
# Find the center point of every Census Tract
tract_centers <- st_centroid(nyc_tracts_proj)

## Warning: st_centroid assumes attributes are constant over geometries

# Nearest Entrance for every Tract
# Find the closest subway for each tract
# Doing this from the tract center 
nearest_indices <- st_nearest_feature(tract_centers, subway_entrances_sf)

# Distance in Miles
# (Distance in feet / 5280)
dist_to_subway_feet <- st_distance(tract_centers, subway_entrances_sf[nearest_indices, ], by_element = TRUE)

subway_dist_lookup <- nyc_tracts_proj %>%
  as_tibble() %>%
  mutate(dist_subway_miles = as.numeric(dist_to_subway_feet) / 5280) %>%
  select(GEOID, dist_subway_miles)

# Save the Subway Distance
saveRDS(subway_dist_lookup, file.path(projPath, "data", "subway_dist_lookup.rds"))

print("Subway distance feature successfully calculated from MTA data!")

## [1] "Subway distance feature successfully calculated from MTA data!"

Assembling

Big Join

## Big Join 

# Load components
census_features <- readRDS(file.path(projPath, "data", "nyc_census_tracts.rds"))
bike_counts     <- readRDS(file.path(projPath, "data", "processed_bike_counts.rds"))
tlc_counts      <- readRDS(file.path(projPath, "data", "processed_tlc_counts.rds"))
subway_dist     <- readRDS(file.path(projPath, "data", "subway_dist_lookup.rds"))

# Before joining, force each Tract to belong to only one Taxi Zone, the one it overlaps most.
# I was seeing repeating values of income and distance for different zones in the final data set back when I was not doing this. 
unique_weights <- tract_weights %>%
  group_by(GEOID) %>%
  slice_max(weight, n = 1, with_ties = FALSE) %>% 
  ungroup() %>%
  select(GEOID, LocationID, zone) # Keep the Zone name and ID for joining

# Join everything to the Census Tracts
# left_join to not lose any tracts that might have had 0 trips
master_table <- census_features %>%
  st_drop_geometry() %>% # drops the maps, keeps ALL features 
  # unique Neighborhood/LocationID mapping
  left_join(unique_weights, by = "GEOID") %>%
  # Bike counts (already at Tract level)
  left_join(bike_counts, by = "GEOID") %>%
  # Uber/Taxi counts  
  left_join(tlc_counts, by = "GEOID") %>% 
  # Subway Distance (already at Tract level)
  left_join(subway_dist, by = "GEOID")


# Clean up NAs; create the Target Variable
# If a tract has no recorded trips, replace NA with 0
# If a tract has no stations, replace NA with 0 
master_table <- master_table %>%
  mutate(
    total_bike_trips      = replace_na(total_bike_trips, 0),
    total_uber_taxi_trips = replace_na(total_uber_taxi_trips, 0),
    total_stations        = replace_na(total_stations, 0),
    
    # Spatial Density (Stations per Square Mile)
    # How physically clustered is the infrastructure?
    stations_per_sq_mile  = total_stations / area_sq_miles,
    
    # Per Capita Availability (Stations per 1,000 Residents)
    # How much infrastructure exists relative to the local population size?
    stations_per_1000_pop = (total_stations / total_pop) * 1000,
    
    # CALCULATE THE TARGET: % of Last-Mile trips taken by Bike
    bike_index = total_bike_trips / (total_bike_trips + total_uber_taxi_trips)
  ) %>%
  # Filter out tracts with 0 total trips to avoid "Divided by Zero" errors
  filter(!is.na(bike_index))

# Save joined data set
saveRDS(master_table, file.path(projPath, "data", "model_input.rds"))

print("Master Join Complete. The model_input.rds file is ready for modeling.")

## [1] "Master Join Complete. The model_input.rds file is ready for modeling."

Intermediate Sanity Checking

We could drop this later, but right now it was just a helpful way to make sure that combining the census tracts and the taxi zones worked properly.

### The Data Audit (Sanity Check)

# Load the Master Table
master_table <- readRDS(file.path(projPath, "data", "model_input.rds"))

# Join with Taxi Zone names for readability
# We'll use the tract_weights crosswalk to get neighborhood names back
audit_table <- master_table %>%
  select(zone, GEOID, bike_index, total_bike_trips, total_uber_taxi_trips, med_income, dist_subway_miles)

# Top 10 "Bike-Heavy" Neighborhoods (Active)
top_bike <- audit_table %>%
  arrange(desc(bike_index)) %>%
  head(50)

# Top 10 "Uber-Heavy" Neighborhoods (Vehicular)
top_uber <- audit_table %>%
  arrange(bike_index) %>%
  head(50)

# quick summary
print("--- TOP 10 BIKE-HEAVY TRACTS ---")

## [1] "--- TOP 10 BIKE-HEAVY TRACTS ---"

print(top_bike)

## # A tibble: 50 × 7
##    zone       GEOID bike_index total_bike_trips total_uber_taxi_trips med_income
##    <chr>      <chr>      <dbl>            <int>                 <dbl>      <dbl>
##  1 Columbia … 3604…      0.390             7840                12250.     145417
##  2 Randalls … 3606…      0.374             3447                 5781.     112969
##  3 Stuy Town… 3606…      0.277            21350                55802.     142975
##  4 Hudson Sq  3606…      0.276            36404                95636.     246250
##  5 Battery P… 3606…      0.219            34144               121618.     250001
##  6 Bloomingd… 3606…      0.219            10638                37894.     113379
##  7 DUMBO/Vin… 3604…      0.208            20761                79157.     250001
##  8 World Tra… 3606…      0.189            23224                99747.     199576
##  9 Stuy Town… 3606…      0.187            12168                52846.     202632
## 10 Gowanus    3604…      0.153             7486                41594.     134167
## # ℹ 40 more rows
## # ℹ 1 more variable: dist_subway_miles <dbl>

print("--- TOP 10 UBER-HEAVY TRACTS ---")

## [1] "--- TOP 10 UBER-HEAVY TRACTS ---"

print(top_uber)

## # A tibble: 50 × 7
##    zone       GEOID bike_index total_bike_trips total_uber_taxi_trips med_income
##    <chr>      <chr>      <dbl>            <int>                 <dbl>      <dbl>
##  1 Westchest… 3600…          0                0                44590.      56208
##  2 West Farm… 3600…          0                0                44826.      48750
##  3 East Trem… 3600…          0                0                80595.      41287
##  4 Woodlawn/… 3600…          0                0                60533.      88264
##  5 Soundview… 3600…          0                0                87212       53166
##  6 East Trem… 3600…          0                0                75865.      32174
##  7 Williamsb… 3600…          0                0                87768.      82039
##  8 Eastchest… 3600…          0                0                51946.      27753
##  9 Williamsb… 3600…          0                0                87771       59758
## 10 Soundview… 3600…          0                0                47388.      29657
## # ℹ 40 more rows
## # ℹ 1 more variable: dist_subway_miles <dbl>

Model Building (Super Duper Rough, Proof of Concept)

This is currently a simple linear model for proof of concept. We’ll be able to build up something much more robust .

Model 1

ml_data is selecting the features to go into the model. A very very simple model is happening here. We need ot implement train-test and better model selection still.

# The selection for the Machine Learning model
# Right now, this has a lot of room for improvement trimming or adding features
ml_data <- readRDS(file.path(projPath, "data", "model_input.rds")) %>%
  select(
    GEOID,  
    
    # Old target variable which was preference for a bike versus taxi/uber
    bike_index, 
    
    # Scaled Infrastructure Controls
    stations_per_sq_mile, stations_per_1000_pop, dist_subway_miles, 
    
    # Economic & Housing Controls
    med_income, pct_poverty, pct_vacant, pct_renter, 
    
    # Demographic & Commuter Behaviors
    pop_density, pct_no_car, pct_transit, pct_commute_60p, pct_bachelors,
    
    # Continuous Age Spectrum
    pct_age_18_24, pct_age_25_34, pct_age_35_44, pct_age_45_64, pct_age_65_plus
  ) %>%
  drop_na() %>%
  
  # Standardize all predictors 
  # We deliberately exclude BOTH our target (bike_index) and our ID (GEOID) from scaling
  mutate(across(
    .cols = -c(bike_index, GEOID), 
    .fns = ~ as.numeric(scale(.))
  ))

# Fit Model
# The minus sign (- GEOID) explicitly means don't use ID as a predictor
lm_fit_standardized <- lm(bike_index ~ . - GEOID, data = ml_data)

print("Model Summary: ")

## [1] "Model Summary: "

summary(lm_fit_standardized)

## 
## Call:
## lm(formula = bike_index ~ . - GEOID, data = ml_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.05810 -0.00659 -0.00078  0.00381  0.34136 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.0105153  0.0004073  25.819  < 2e-16 ***
## stations_per_sq_mile   0.0077108  0.0005322  14.489  < 2e-16 ***
## stations_per_1000_pop  0.0009301  0.0004495   2.069  0.03865 *  
## dist_subway_miles      0.0003310  0.0005230   0.633  0.52691    
## med_income             0.0082630  0.0007194  11.486  < 2e-16 ***
## pct_poverty            0.0028011  0.0006727   4.164 3.25e-05 ***
## pct_vacant            -0.0011723  0.0004692  -2.498  0.01255 *  
## pct_renter             0.0018097  0.0008545   2.118  0.03430 *  
## pop_density           -0.0043459  0.0005404  -8.042 1.42e-15 ***
## pct_no_car             0.0068048  0.0008130   8.370  < 2e-16 ***
## pct_transit            0.0003460  0.0006247   0.554  0.57973    
## pct_commute_60p       -0.0030413  0.0005952  -5.109 3.51e-07 ***
## pct_bachelors          0.0027874  0.0006780   4.111 4.08e-05 ***
## pct_age_18_24          0.0001825  0.0005011   0.364  0.71567    
## pct_age_25_34         -0.0021820  0.0006761  -3.227  0.00127 ** 
## pct_age_35_44          0.0003264  0.0005906   0.553  0.58056    
## pct_age_45_64          0.0002210  0.0005514   0.401  0.68858    
## pct_age_65_plus       -0.0008789  0.0006432  -1.366  0.17196    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01919 on 2202 degrees of freedom
## Multiple R-squared:  0.4495, Adjusted R-squared:  0.4452 
## F-statistic: 105.7 on 17 and 2202 DF,  p-value: < 2.2e-16

IBX Simulation

Here, the approximate locations of the new IBX stations are being brought in. And then in the census areas where they go, we’re basically dropping the distance to train for those areas to really low. There may be a more robust way of implementing the “simulation” of the IBX station than this, but this seemed like a good starting point.

# Build the list of stations based on the MTA's 2025 CE Document
# This document is a pdf, and can be found in the repo
ibx_stations <- tibble::tribble(
  ~station_name,             ~address,
  "Brooklyn Army Terminal",  "2nd Ave & 61st St, Brooklyn, NY",
  "4th Ave",                 "5th Ave & 61st St, Brooklyn, NY",
  "8th Ave",                 "8th Ave & 61st St, Brooklyn, NY",
  "New Utrecht Ave",         "14th Ave & 61st St, Brooklyn, NY",
  "McDonald Ave",            "McDonald Ave & Avenue I, Brooklyn, NY",
  "East 16th St",            "East 16th St & Avenue H, Brooklyn, NY",
  "Flatbush-Nostrand Ave",   "Nostrand Ave & Avenue H, Brooklyn, NY",
  "Utica Ave",               "Utica Ave & Farragut Rd, Brooklyn, NY",
  "Remsen Ave",              "Remsen Ave & Farragut Rd, Brooklyn, NY",
  "Linden Blvd",             "Linden Blvd & Junius St, Brooklyn, NY",
  "Livonia Ave",             "Livonia Ave & Junius St, Brooklyn, NY",
  "Sutter Ave",              "Sutter Ave & Junius St, Brooklyn, NY",
  "Atlantic Ave",            "Atlantic Ave & East New York Ave, Brooklyn, NY",
  "Wilson Ave",              "Wilson Ave & Moffat St, Brooklyn, NY",
  "Myrtle Ave",              "Cypress Hills St & 78th Ave, Queens, NY",
  "Metropolitan Ave",        "Metropolitan Ave & 69th St, Queens, NY",
  "Eliot Ave",               "Eliot Ave & 69th St, Queens, NY",
  "Grand Ave",               "Grand Ave & 69th St, Queens, NY",
  "Roosevelt Ave",           "Roosevelt Ave & 74th St, Queens, NY"
)

# Geocode the addresses into Lat/Long coordinates
ibx_geocoded <- ibx_stations %>%
  geocode(address, method = 'osm', lat = latitude, long = longitude) %>% # get coordinated for each address
  drop_na() %>% # Drop any that the open-source map couldn't perfectly match
  
  # Convert the GPS coordinates into an sf object
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  
  # Convert and draw 10-minute-walk-radiuses 
  st_transform(2263) %>% # transform to the more precide NYC + Long Island map we've been using
  st_buffer(dist = 2640) %>% # drawing a circle around point that is 2640 feet = 0.5 miles ~ 10 minute walk. Most people don't want to walk further than 10 minutes to the train
  st_transform(4326) # Convert back to standard GPS

## Passing 19 addresses to the Nominatim single address geocoder

## Query completed in: 19.3 seconds

# Find all Census Tracts that fall at least partially inside these new 10-minute-walk-radiuses 
ibx_tracts_sf <- st_filter(
  nyc_tracts, 
  st_transform(ibx_geocoded, st_crs(nyc_tracts)), 
  .predicate = st_intersects
)

# Extract just the 11-digit IDs as a clean list
ibx_geoid_list <- ibx_tracts_sf$GEOID

## Create an IBX UNIVERSE data set
# finding the Census Tract in NYC that is physically closest to a subway entrance
# going to use this later to simulate a really close distance to subway for places where IBX is being built
best_transit_z_score <- min(ml_data$dist_subway_miles)

ibx_scenario_data <- ml_data %>%
  mutate(
    # Inject best_transit_z_score into the IBX tracts
    dist_subway_miles = ifelse(GEOID %in% ibx_geoid_list, best_transit_z_score, dist_subway_miles)
  )

# Predict and flag new IBX station locations
final_map_data <- ml_data %>%
  mutate(
    predicted_bike_baseline = predict(lm_fit_standardized, newdata = ml_data),
    predicted_bike_ibx      = predict(lm_fit_standardized, newdata = ibx_scenario_data),
    ibx_adoption_surge      = predicted_bike_ibx - predicted_bike_baseline, 
    is_ibx_location         = ifelse(GEOID %in% ibx_geoid_list, 1, 0)
  )

saveRDS(final_map_data, file.path(projPath, "data", "model_input.rds"))

# Model Results
lm_results <- tidy(lm_fit_standardized) %>%
  filter(term != "(Intercept)") # Remove the intercept for better scaling

# Plot the "Pull" or "Impact" of each feature
ggplot(lm_results, aes(x = estimate, y = reorder(term, estimate))) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  geom_point(size = 3, color = "darkblue") +
  geom_errorbarh(aes(xmin = estimate - std.error, xmax = estimate + std.error), height = 0.2) +
  theme_minimal() +
  labs(title = "What's driving 60 minute + length communtes?",
       subtitle = "Standardized Coefficients (Variables further from 0 have a stronger impact)",
       x = "Effect on 60+ Minute Commute Rate (per 1 Std. Dev. increase)",
       y = "Feature")

## Warning: `geom_errorbarh()` was deprecated in ggplot2 4.0.0.
## ℹ Please use the `orientation` argument of `geom_errorbar()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `height` was translated to `width`.

Misc. Notes Dump

nothing right now.

DATA 622 Project

Catherine Dube, Guillermo Schneider, Tai Chou-Kudu