Modeling the Impact of the IBX on 60minute+ commute times

Foundation

Census Data

Looking at census data from 2022.

Picked 2022 because:

This is a good Post-Pandemic option; NYC was “fully reopened”
The 5-Year ACS helps avoid the 2020 collection gap (where due to COVID, the Census Bureau couldn’t send people door to door) and the 2021 remote-forward living arrangements, where many people left Manhattan during the high period of remote work.
The 5-year estimate is a rolling average (2018–2022), and some COVID-impacted data is included either way. However, multiple years of more robust data are included in this time window.

Another option would be to switch to the most recent census data available, but it’s worth noting that that won’t avoid the COVID data being included in the 5 -year estimate.

Current Vars from census

#CENSUS_API_KEY is in environment 

nyc_counties <- c("New York", "Kings", "Queens", "Bronx", "Richmond")

 
# Entire 2022 5-Year ACS variable dictionary
acs_vars_2022 <- load_variables(2022, "acs5", cache = TRUE)

# MISC.
# Exploring different potential features
feature_exploration <- acs_vars_2022 %>%
  filter(
    str_detect(str_to_lower(label), "public transportation|bicycle|walked|worked from home") |
    str_detect(str_to_lower(concept), "means of transportation to work|travel time")
  )

# Comment Me out before final knit
#View(feature_exploration)

# hiding the results of this chunk becuase it jsut is unreadable in the knitted document

# Census variables
vars <- c(
  # Base Demographics
  no_car      = "B08201_002", 
  med_income  = "B19013_001", 
  med_age     = "B01002_001",
  total_pop   = "B01003_001",
  
  # Transportation Base & Features
  tot_workers = "B08301_001",
  transit     = "B08301_010",
  walk        = "B08301_019",
  wfh         = "B08301_021",
  
  # Travel Time Features
  # Potential target variable
  commute_60  = "B08303_012",
  commute_90  = "B08303_013",
  
  # Education Base & Features
  tot_pop_25  = "B15003_001",
  bachelors   = "B15003_022",
  
  # Occupation Base & The 5 Major Buckets
  tot_employ  = "C24010_001",
  mngmt_job   = "C24010_002",
  service_job = "C24010_014",
  sales_job   = "C24010_021",
  const_job   = "C24010_030",
  trans_job   = "C24010_034",

  # Continuous Age Spectrum (Table B01001 - Detailed Age)
  # 18 to 24
  m_18_19="B01001_007", m_20="B01001_008", m_21="B01001_009", m_22_24="B01001_010",
  f_18_19="B01001_031", f_20="B01001_032", f_21="B01001_033", f_22_24="B01001_034",
  # 25 to 34
  m_25_29="B01001_011", m_30_34="B01001_012", f_25_29="B01001_035", f_30_34="B01001_036",
  # 35 to 44
  m_35_39="B01001_013", m_40_44="B01001_014", f_35_39="B01001_037", f_40_44="B01001_038",
  # 45 to 64
  m_45_49="B01001_015", m_50_54="B01001_016", m_55_59="B01001_017", m_60_61="B01001_018", m_62_64="B01001_019",
  f_45_49="B01001_039", f_50_54="B01001_040", f_55_59="B01001_041", f_60_61="B01001_042", f_62_64="B01001_043",
  # 65 Plus
  m_65_66="B01001_020", m_67_69="B01001_021", m_70_74="B01001_022", m_75_79="B01001_023", m_80_84="B01001_024", m_85_up="B01001_025",
  f_65_66="B01001_044", f_67_69="B01001_045", f_70_74="B01001_046", f_75_79="B01001_047", f_80_84="B01001_048", f_85_up="B01001_049",

  # Commercial vs. Residential Proxy (Housing)
  tot_housing = "B25002_001",
  vacant_unit = "B25002_003",
  tot_occup   = "B25003_001",
  renter_unit = "B25003_003",

  # Economic Barrier Spectrum (Poverty)
  tot_pov_base = "B17001_001",
  pov_below    = "B17001_002"
) 

# Getting from census data
nyc_tracts <- get_acs(
  geography = "tract",
  variables = vars,
  state = "NY",
  county = nyc_counties,
  geometry = TRUE, 
  year = 2022,
  output = "wide"
)

Feature Engineering Census Data

This chunk does an explicit selection of columns from the census data to start with. (There are just so many features in that data that clearly selecting and renaming when applicable seemed like a cleaner route.)

After that this chunk is doing some feature engineering on the census data. Mostly turning raw count data into percent of population data.

pct_commute_60p is currently being used as the target variable. It’s the count of people with commutes over 60 min + the count of people with commutes over 90 minutes. Divided by total number of workers. x 100 to make is a percentage.
pct_no_car is percent of people who do not have a car
pct_transit is percent of workers using public transit (mta, buses not uber/taxi) as primary commute methos
pct_walk is percent of workers using walking as primary commute method
pct_wfhis percent of workers working from home
pct_bachelors is the percent of 25+ year olds with Bachelor’s Degrees or higher
Occupation buckets
pct_mngmt is percent of workers in management, business, science, and arts
pct_service is percent of workers in
pct_sales is percent of workers in sales and office administration
pct_const is percent of workers in natural resources, construction, and maintenance
pct_trans is percent of workers in production, transportation, and material moving
area_sq_miles is the area in square miles
pop_densityis the numebr of people per square mile
Age GRoups
I didn’t want to bucket the ages because I know having continuous data is better, but unfortuantely the census already had some underlying bucketing
pct_age_18_24 is the percent of the population in that age range
pct_age_25_34 is the percent of the population in that age range
pct_age_35_44 is the percent of the population in that age range
pct_age_45_64 is the percent of the population in that age range
pct_age_65_plus is the percent of the population in that age range
pct_vacant is the percent of housing that is vacant
pct_renter is the percent of occupued units that are being rented
pct_poverty is the percent of people living below the poverty line out of the total population for whom poverty status is determined

Not all of the engineered features are used in the later model.

nyc_tracts <- nyc_tracts %>%
  # Explicitly rename every variable 
  select(
    GEOID, NAME, geometry,
    
    # Base Demographics
    no_car      = no_carE, 
    med_income  = med_incomeE, 
    med_age     = med_ageE, 
    total_pop   = total_popE, 
    tot_workers = tot_workersE, 
    tot_pop_25  = tot_pop_25E, 
    tot_employ  = tot_employE,
    
    # Transportation & Travel Time
    transit     = transitE, 
    walk        = walkE, 
    wfh         = wfhE, 
    commute_60  = commute_60E, 
    commute_90  = commute_90E, 
    
    # Education & Occupation
    bachelors   = bachelorsE, 
    mngmt_job   = mngmt_jobE, 
    service_job = service_jobE, 
    sales_job   = sales_jobE, 
    const_job   = const_jobE, 
    trans_job   = trans_jobE,
    
    # Age - Male
    m_18_19 = m_18_19E, m_20 = m_20E, m_21 = m_21E, m_22_24 = m_22_24E,
    m_25_29 = m_25_29E, m_30_34 = m_30_34E,
    m_35_39 = m_35_39E, m_40_44 = m_40_44E,
    m_45_49 = m_45_49E, m_50_54 = m_50_54E, m_55_59 = m_55_59E, m_60_61 = m_60_61E, m_62_64 = m_62_64E,
    m_65_66 = m_65_66E, m_67_69 = m_67_69E, m_70_74 = m_70_74E, m_75_79 = m_75_79E, m_80_84 = m_80_84E, m_85_up = m_85_upE,
    
    # Age - Female
    f_18_19 = f_18_19E, f_20 = f_20E, f_21 = f_21E, f_22_24 = f_22_24E,
    f_25_29 = f_25_29E, f_30_34 = f_30_34E,
    f_35_39 = f_35_39E, f_40_44 = f_40_44E,
    f_45_49 = f_45_49E, f_50_54 = f_50_54E, f_55_59 = f_55_59E, f_60_61 = f_60_61E, f_62_64 = f_62_64E,
    f_65_66 = f_65_66E, f_67_69 = f_67_69E, f_70_74 = f_70_74E, f_75_79 = f_75_79E, f_80_84 = f_80_84E, f_85_up = f_85_upE,

    # Housing & Poverty
    tot_housing  = tot_housingE, 
    vacant_unit  = vacant_unitE, 
    tot_occup    = tot_occupE, 
    renter_unit  = renter_unitE, 
    tot_pov_base = tot_pov_baseE, 
    pov_below    = pov_belowE
  ) %>%
  
  # Feature Engineering: Convert raw counts to percentages
  mutate(
    pct_no_car      = (no_car / total_pop) * 100,
    pct_transit     = (transit / tot_workers) * 100,
    pct_walk        = (walk / tot_workers) * 100,
    pct_wfh         = (wfh / tot_workers) * 100,
    pct_commute_60p = ((commute_60 + commute_90) / tot_workers) * 100,
    pct_bachelors   = (bachelors / tot_pop_25) * 100,
    pct_mngmt       = (mngmt_job / tot_employ) * 100,
    pct_service     = (service_job / tot_employ) * 100,
    pct_sales       = (sales_job / tot_employ) * 100,
    pct_const       = (const_job / tot_employ) * 100,
    pct_trans       = (trans_job / tot_employ) * 100,
    area_sq_miles   = as.numeric(st_area(geometry)) / 2589988, 
    pop_density     = total_pop / area_sq_miles,
    
    # Continuous Age Spectrum
    pct_age_18_24   = ((m_18_19 + m_20 + m_21 + m_22_24 + f_18_19 + f_20 + f_21 + f_22_24) / total_pop) * 100,
    pct_age_25_34   = ((m_25_29 + m_30_34 + f_25_29 + f_30_34) / total_pop) * 100,
    pct_age_35_44   = ((m_35_39 + m_40_44 + f_35_39 + f_40_44) / total_pop) * 100,
    pct_age_45_64   = ((m_45_49 + m_50_54 + m_55_59 + m_60_61 + m_62_64 + f_45_49 + f_50_54 + f_55_59 + f_60_61 + f_62_64) / total_pop) * 100,
    pct_age_65_plus = ((m_65_66 + m_67_69 + m_70_74 + m_75_79 + m_80_84 + m_85_up + f_65_66 + f_67_69 + f_70_74 + f_75_79 + f_80_84 + f_85_up) / total_pop) * 100,
    
    # Commercial/Residential Proxy
    pct_vacant      = (vacant_unit / tot_housing) * 100,
    pct_renter      = (renter_unit / tot_occup) * 100,
    
    # Economic Barrier
    pct_poverty     = (pov_below / tot_pov_base) * 100
  ) %>%
  
  # Explicitly drop the raw counts so they don't leak into the model
  select(
    -no_car, -transit, -walk, -wfh, -commute_60, -commute_90, -bachelors, 
    -mngmt_job, -service_job, -sales_job, -const_job, -trans_job, 
    -tot_workers, -tot_pop_25, -tot_employ,
    -m_18_19, -m_20, -m_21, -m_22_24, -m_25_29, -m_30_34, -m_35_39, -m_40_44, 
    -m_45_49, -m_50_54, -m_55_59, -m_60_61, -m_62_64, -m_65_66, -m_67_69, -m_70_74, -m_75_79, -m_80_84, -m_85_up,
    -f_18_19, -f_20, -f_21, -f_22_24, -f_25_29, -f_30_34, -f_35_39, -f_40_44, 
    -f_45_49, -f_50_54, -f_55_59, -f_60_61, -f_62_64, -f_65_66, -f_67_69, -f_70_74, -f_75_79, -f_80_84, -f_85_up,
    -tot_housing, -vacant_unit, -tot_occup, -renter_unit, -tot_pov_base, -pov_below
  )

Checking for missing census data

## Checking for total NAs and NaNs in each column
colSums(is.na(nyc_tracts %>% st_drop_geometry())) %>%
  kable(caption = "# of NAs and NaNs in each column")

# of NAs and NaNs in each column
	x
GEOID	0
NAME	0
med_income	126
med_age	88
total_pop	0
pct_no_car	85
pct_transit	90
pct_walk	90
pct_wfh	90
pct_commute_60p	90
pct_bachelors	85
pct_mngmt	90
pct_service	90
pct_sales	90
pct_const	90
pct_trans	90
area_sq_miles	0
pop_density	3
pct_age_18_24	85
pct_age_25_34	85
pct_age_35_44	85
pct_age_45_64	85
pct_age_65_plus	85
pct_vacant	93
pct_renter	96
pct_poverty	90

## Percentage of missing data 
# (is it a low enough percent to be negligible) 
missing_summary <- nyc_tracts %>%
  st_drop_geometry() %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Missing_Count") %>%
  mutate(
    Total_Rows = nrow(nyc_tracts),
    Pct_Missing = (Missing_Count / Total_Rows) * 100,
    
    # Flag if the missing data is below 5% 
    Is_Negligible_Under_5Pct = Pct_Missing < 5 
  ) %>%
  # Filter to only show features that actually have missing data
  filter(Missing_Count > 0) %>% 
  arrange(desc(Pct_Missing))

# Output Percentage of Missing Data
kable(missing_summary, digits = 2, caption = "Missing Data Percentage by Feature")

Missing Data Percentage by Feature
Feature	Missing_Count	Total_Rows	Pct_Missing	Is_Negligible_Under_5Pct
med_income	126	2327	5.41	FALSE
pct_renter	96	2327	4.13	TRUE
pct_vacant	93	2327	4.00	TRUE
pct_transit	90	2327	3.87	TRUE
pct_walk	90	2327	3.87	TRUE
pct_wfh	90	2327	3.87	TRUE
pct_commute_60p	90	2327	3.87	TRUE
pct_mngmt	90	2327	3.87	TRUE
pct_service	90	2327	3.87	TRUE
pct_sales	90	2327	3.87	TRUE
pct_const	90	2327	3.87	TRUE
pct_trans	90	2327	3.87	TRUE
pct_poverty	90	2327	3.87	TRUE
med_age	88	2327	3.78	TRUE
pct_no_car	85	2327	3.65	TRUE
pct_bachelors	85	2327	3.65	TRUE
pct_age_18_24	85	2327	3.65	TRUE
pct_age_25_34	85	2327	3.65	TRUE
pct_age_35_44	85	2327	3.65	TRUE
pct_age_45_64	85	2327	3.65	TRUE
pct_age_65_plus	85	2327	3.65	TRUE
pop_density	3	2327	0.13	TRUE

## Map for missing income data
missing_map_data <- nyc_tracts %>%
  mutate(missing_income_flag = ifelse(is.na(med_income), "Missing Data", "Valid Data"))

ggplot(missing_map_data) +
  geom_sf(aes(fill = missing_income_flag), color = "white", size = 0.1) +
  scale_fill_manual(values = c("Valid Data" = "#e0e0e0", "Missing Data" = "#d73027")) +
  theme_void() +
  labs(title = "Map of Missing Census Income Data",
       fill = "Status")

How to interpret: Red tracts are missing the median income field; gray tracts have valid data. If the red dots cluster (same borough, same neighborhood), any model relying on med_income will be geographically biased — we’d be making predictions in those areas without real data. Scattered red is much less harmful.

Census NA Handling

This chunk is currently dropping census tracts that have very few people or no workers.

It’s also imputing where income or age was missing using the median value. This was just a simple implementation, but we can definitely consider more robust imputing methos, and also properly investingating MAR, MNAR etc. (Though given the nature of census data, MNAR is a good assumption.)

## Drop tracts with less than 50 people or 0 workers
# There were a ton of tracts with a small number of or 0 people, screwing with denominators and also making for super volatile percentages
nyc_tracts_clean <- nyc_tracts %>%
  filter(total_pop >= 50, !is.na(pct_transit))

## Imputing
# There were also many tracts where income or age was missing, but it wasn't due to a denominator issue
# Just doing simple median imputation
# More robust method would be better
# Checking for MNAR and stuff would be better
# The current median imputation: 
nyc_tracts_clean <- nyc_tracts_clean %>%
  # Extract the borough from the NAME string (e.g., "Queens County")
  mutate(borough = str_extract(NAME, "(?<=; ).*(?= County)")) %>%
  
  # Group by Borough so the imputation is localized
  group_by(borough) %>%
  mutate(
    med_income = ifelse(is.na(med_income), 
                        median(med_income, na.rm = TRUE), 
                        med_income),
    med_age = ifelse(is.na(med_age), 
                     median(med_age, na.rm = TRUE), 
                     med_age)
  ) %>%
  ungroup() %>%
  
  # Drop the temporary borough column 
  select(-borough)
 
print(paste("Original tracts:", nrow(nyc_tracts)))

## [1] "Original tracts: 2327"

print(paste("Cleaned tracts of NAs:", nrow(nyc_tracts_clean)))

## [1] "Cleaned tracts of NAs: 2226"

# Save 
census_save_path <- file.path(projPath, "data", "nyc_census_tracts.rds") 
saveRDS(nyc_tracts_clean, census_save_path)

print(paste("Census features and geometry saved to:", census_save_path))

## [1] "Census features and geometry saved to: /Users/catherinedube/src/DATA_622_Project/data/nyc_census_tracts.rds"

# Peak
head(nyc_tracts_clean)

## Simple feature collection with 6 features and 26 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -73.91641 ymin: 40.82439 xmax: -73.84726 ymax: 40.90264
## Geodetic CRS:  NAD83
## # A tibble: 6 × 27
##   GEOID       NAME  med_income med_age total_pop pct_no_car pct_transit pct_walk
##   <chr>       <chr>      <dbl>   <dbl>     <dbl>      <dbl>       <dbl>    <dbl>
## 1 36005013500 Cens…      27602    36.5      3125       31.8        48.6    14.5 
## 2 36005009200 Cens…      56208    33.9      5959       14.1        38.6     8.07
## 3 36005005400 Cens…      48750    30.5      5715       20.6        64.7     6.20
## 4 36005036501 Cens…      41287    31.8      4196       23.1        60.3     6.61
## 5 36005044902 Cens…      88264    32.4      2146       12.1        37.8     2.43
## 6 36005017500 Cens…      26463    33.9      6116       39.5        69.1     8.69
## # ℹ 19 more variables: pct_wfh <dbl>, pct_commute_60p <dbl>,
## #   pct_bachelors <dbl>, pct_mngmt <dbl>, pct_service <dbl>, pct_sales <dbl>,
## #   pct_const <dbl>, pct_trans <dbl>, area_sq_miles <dbl>, pop_density <dbl>,
## #   pct_age_18_24 <dbl>, pct_age_25_34 <dbl>, pct_age_35_44 <dbl>,
## #   pct_age_45_64 <dbl>, pct_age_65_plus <dbl>, pct_vacant <dbl>,
## #   pct_renter <dbl>, pct_poverty <dbl>, geometry <MULTIPOLYGON [°]>

Vehicle Availability (Cars Owned)

Adding car ownership as a feature — flagged by our professor as likely the strongest missing predictor and a key source of omitted-variable bias. Pulling just this table separately to avoid re-running the full cached census pull.

pct_2plus_cars captures car-dependent households (2+ vehicles) which is more informative than the binary pct_no_car already in the model.

# Pull B25044: Tenure by Vehicles Available (2022 ACS 5-Year)
vehicle_vars <- c(
  tot_occup_veh  = "B25044_001", # Total occupied housing units
  own_2car       = "B25044_005", # Owner occupied, 2 vehicles
  own_3car       = "B25044_006", # Owner occupied, 3 vehicles
  own_4plus_car  = "B25044_007", # Owner occupied, 4+ vehicles
  rent_2car      = "B25044_011", # Renter occupied, 2 vehicles
  rent_3car      = "B25044_012", # Renter occupied, 3 vehicles
  rent_4plus_car = "B25044_013"  # Renter occupied, 4+ vehicles
)

nyc_vehicle_raw <- get_acs(
  geography = "tract",
  variables = vehicle_vars,
  state     = "NY",
  county    = nyc_counties,
  year      = 2022,
  output    = "wide"
)

## Getting data from the 2018-2022 5-year ACS

# Engineer pct_2plus_cars: % of occupied households with 2 or more vehicles
nyc_vehicle <- nyc_vehicle_raw %>%
  transmute(
    GEOID,
    pct_2plus_cars = ((own_2carE + own_3carE + own_4plus_carE +
                       rent_2carE + rent_3carE + rent_4plus_carE) / tot_occup_vehE) * 100
  )

# Join onto existing census tracts and re-save
nyc_tracts_clean <- readRDS(file.path(projPath, "data", "nyc_census_tracts.rds")) %>%
  left_join(nyc_vehicle, by = "GEOID") %>%
  mutate(pct_2plus_cars = replace_na(pct_2plus_cars, 0))

saveRDS(nyc_tracts_clean, file.path(projPath, "data", "nyc_census_tracts.rds"))
print("Vehicle availability feature added to nyc_census_tracts.rds")

## [1] "Vehicle availability feature added to nyc_census_tracts.rds"

Taxi Zones

The taxi zone data is luckily pretty straightforward. It’s just being imported here.

#Zip file from below url is downloaded and extracted into repo
#taxi_zones_url <- "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip" 

# Read the shapefile
taxi_zones_sf <- st_read(file.path(projPath, 'taxi_zones', 'taxi_zones', "taxi_zones.shp")) %>%
  st_transform(4326) # Transform to standard Lat/Long (WGS84)

## Reading layer `taxi_zones' from data source 
##   `/Users/catherinedube/src/DATA_622_Project/taxi_zones/taxi_zones/taxi_zones.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)

# Lookup table with names and boroughs is downloaded into repo
#zone_lookup_url <- "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv"
zone_lookup <- read_csv(file.path(projPath, 'taxi_zone_lookup.csv'))

## Rows: 265 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Borough, Zone, service_zone
## dbl (1): LocationID
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Join 
taxi_zones <- taxi_zones_sf %>%
  left_join(zone_lookup, by = c("LocationID" = "LocationID"))

# Peak: Does this map look right
plot(st_geometry(taxi_zones))

How to interpret: Sanity check that the 263 NYC taxi zones loaded and overlay correctly. We’re looking for a recognizable NYC silhouette — five boroughs visible, no obvious gaps or holes. If the shape looks broken, the shapefile didn’t load right.

TLC Data

TLC Data is from the NYC Taxi and Limousine Commission.

High volume for hire vehicles = Uber/Lyft.

Originally was pulling January 2026 data for High Volume for-hire vehicles.
https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2026-01.parquet
Currently pulling October 2023 data for yellow taxis, green taxis, ~for-hire vehicles (FHV (local livery/black car)),~ and high volume for-hire vehicles (FHVHV (Uber/Lyft) Uber/Lyft). This made a little more sense with the 2022 census data, and with the seasonality of commuting and to avoid holiday season.
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-10.parquet
https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-10.parquet
https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2023-10.parquet
https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-10.parquet

## Aggregate all vehicular trips for October 2023
# Combine Yellow, Green, FHV, and High-Volume FHV (Uber/Lyft) 
# See "all" the car-based 'Last-Mile' choices

# URL definitions for source data
tlc_urls <- c(
  yellow = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-10.parquet",
  green  = "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-10.parquet",
  #fhv    = "https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2023-10.parquet",
  fhvhv  = "https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-10.parquet"
)

# Function to read and count trips by Taxi Zone 
# PULocationID can be used to get the # of trips that started from (were picked up in) a zone
count_zone_trips <- function(url) { 
  read_parquet(url, col_select = any_of(c("PULocationID"))) %>%
    count(PULocationID, name = "trips") %>%
    filter(!is.na(PULocationID))
}

# Process the sources and sum them up 
# Getting number of trips that started per pick up location
# This 'map_dfr' loop downloads each, counts by zone, and stacks them.
total_zone_counts <- tlc_urls %>%
  map_dfr(count_zone_trips) %>%
  group_by(PULocationID) %>%
  summarize(total_vehicular_trips = sum(trips))

Census uses Tracts, TLC uses Taxi Zones

So Census Tracts are not the same as Taxi Zones. This chunk tries to handle it.

EPSG:2263 is apparently a specific map projection built just for the New York/Long Island area. In it, 1 unit equals exactly 1 physical foot. (I had not ever used this before.) The idea is that if the censues tracts and the taxi zones are both transformed to EPSG:2263, then this will be the most precise way to layer these maps on top of one another and see where they intersect. (At least it appears to be more precise than using latitude and longitude.)

st_intersection() lays the two maps on top of one another and separated out the overlapping pieces properly.

# Coordinate Reference System (CRS)
# EPSG:2263 (NAD83 / New York Long Island) uses feet 
nyc_tracts_proj <- st_transform(nyc_tracts, 2263)
taxi_zones_proj <- st_transform(taxi_zones, 2263)

# Intersect regions
# This creates a new spacial dataframe where tracts and zones overlap
intersection <- st_intersection(nyc_tracts_proj, taxi_zones_proj) %>%
  mutate(intersect_area = st_area(.))

## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries

# Calculate the weights
# For every unique Census Tract, 
# What % of Tract X's area is inside Taxi Zone Y?
tract_weights <- intersection %>%
  as_tibble() %>%
  # Two complementary shares — different jobs, different denominators:
  # tract_share = how much of this tract is in this zone (picks the dominant zone)
  # zone_share  = how much of this zone is in this tract (apportions zone trips)
  group_by(GEOID) %>%
  mutate(tract_share = as.numeric(intersect_area / sum(intersect_area))) %>%
  ungroup() %>%
  group_by(LocationID) %>%
  mutate(zone_share = as.numeric(intersect_area / sum(intersect_area))) %>%
  ungroup() %>%
  select(GEOID, LocationID, zone, tract_share, zone_share)

# Peak Crosswalk 
head(tract_weights)

## # A tibble: 6 × 5
##   GEOID       LocationID zone                    tract_share zone_share
##   <chr>            <dbl> <chr>                         <dbl>      <dbl>
## 1 36081071600          2 Jamaica Bay               0.000628  0.000858  
## 2 36047070203          2 Jamaica Bay               1         0.451     
## 3 36081107201          2 Jamaica Bay               0.0124    0.00124   
## 4 36081100801          2 Jamaica Bay               0.00106   0.0000508 
## 5 36081107202          2 Jamaica Bay               0.989     0.547     
## 6 36005037000          3 Allerton/Pelham Gardens   0.0000606 0.00000326

The below chunk uses the newly defined geography, and then it disperses the uber/taxi trip data (which uses taxi zones) fairly across the census tracts.

# issue is that taxi zones are larger than census tracts, and we obvs want to use the data together 
# Assumes 'tract_weights' object exists from your earlier step.

tract_vehicular_estimates <- tract_weights %>%
  left_join(total_zone_counts, by = c("LocationID" = "PULocationID")) %>%
  mutate(estimated_trips = total_vehicular_trips * zone_share) %>%
  group_by(GEOID) %>%
  summarize(total_uber_taxi_trips = sum(estimated_trips, na.rm = TRUE))

# Save the 'Vehicular' side of the Master Table
save_path_tlc <- file.path(projPath, "data", "processed_tlc_counts.rds")
saveRDS(tract_vehicular_estimates, save_path_tlc)

print(paste("Vehicular trip estimates for October 2023 saved to:", save_path_tlc))

## [1] "Vehicular trip estimates for October 2023 saved to: /Users/catherinedube/src/DATA_622_Project/data/processed_tlc_counts.rds"

Citibike

Citibike Data Download
- Because the Citibike publishes their data online in zip files, it can’t be read directly from web.
Originally was pulling January 2026 data for citibike
- 202601-citibike-tripdata_1.csv
- 202601-citibike-tripdata_2.csv
Currently pulling October 2023 data

# Map Station IDs to Census Tracts  

# file locations
cb_file1 <- file.path(projPath, "citibike", "202310-citibike-tripdata_1.csv")
cb_file2 <- file.path(projPath, "citibike", "202310-citibike-tripdata_2.csv")
cb_file3 <- file.path(projPath, "citibike", "202310-citibike-tripdata_3.csv")
cb_file4 <- file.path(projPath, "citibike", "202310-citibike-tripdata_4.csv")
cb_files <- c(cb_file1, cb_file2, cb_file3, cb_file4)

# Station-to-Tract Lookup Table
# unique coordinates for every station found in files
all_stations <- cb_files %>%
  map_dfr(~read_csv(.x, 
                    col_select = c(start_station_id, start_lat, start_lng),
                    # Force the station ID to be a character, and coords to be numeric
                    col_types = cols(
                      start_station_id = col_character(),
                      start_lat = col_double(),
                      start_lng = col_double()
                    ))) %>%
  filter(!is.na(start_lat)) %>%
  distinct(start_station_id, .keep_all = TRUE)

# Spatial Work: Convert Stations to Points
stations_sf <- st_as_sf(all_stations, coords = c("start_lng", "start_lat"), crs = 4326) %>%
  st_transform(2263) # Project to NYC Feet (EPSG:2263)

# Ensure nyc_tracts (from Step 1) matches the projection
nyc_tracts_proj <- st_transform(nyc_tracts, 2263)

# Find which Tract Polygon physically contains each Station Point
station_tract_lookup <- st_join(stations_sf, nyc_tracts_proj, join = st_intersects) %>%
  as_tibble() %>%
  select(start_station_id, GEOID)

# Count the number of unique stations in each Census Tract
tract_station_counts <- station_tract_lookup %>%
  group_by(GEOID) %>%
  summarize(total_stations = n()) %>%
  filter(!is.na(GEOID))

# Count Trips per Tract 
tract_bike_counts <- cb_files %>%
  map_dfr(~read_csv(.x, 
                    col_select = c(start_station_id),
                    col_types = cols(start_station_id = col_character()) 
                    )) %>%
  left_join(station_tract_lookup, by = "start_station_id") %>%
  group_by(GEOID) %>%
  summarize(total_bike_trips = n()) %>%
  filter(!is.na(GEOID)) %>%
  # NEW: Attach the station capacity feature to the trip counts
  left_join(tract_station_counts, by = "GEOID")


# Save the final "Label" feature for the Machine Learning model
save_path <- file.path(projPath, "data", "processed_bike_counts.rds")
saveRDS(tract_bike_counts, save_path)

print(paste("Success! Bike counts saved to:", save_path))

## [1] "Success! Bike counts saved to: /Users/catherinedube/src/DATA_622_Project/data/processed_bike_counts.rds"

Distance from Subway

Source is https://data.ny.gov/Transportation/MTA-Subway-Entrances-and-Exits-2024/i9wp-a4ja/about_data.

File name is “MTA_Subway_Entrances_and_Exits__2024.csv”

## Distance to Subway 

# Load csv
subway_csv_path <- file.path(projPath, "MTA_Subway_Entrances_and_Exits__2024.csv")
subway_raw <- read_csv(subway_csv_path) %>%
  clean_names() # otherwise this had spaces in the column names

## Rows: 2120 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): Division, Line, Borough, Stop Name, Constituent Station Name, GTFS...
## dbl  (4): Complex ID, Station ID, Entrance Latitude, Entrance Longitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Transform into a Spatial Object (sf)
subway_entrances_sf <- subway_raw %>%
  filter(!is.na(entrance_latitude) & !is.na(entrance_longitude)) %>% # 'entrance_longitude' is X and 'entrance_latitude' is Y
  st_as_sf(coords = c("entrance_longitude", "entrance_latitude"), crs = 4326) %>% # telling it to read this as 4326, the standard GPS coordinate system  
  st_transform(2263) # Converting to the same 1 unit = 1 foot NYC-specific map that was used earlier

# Picking a reference point inside each tract.
# Switched away from st_centroid() — for L-shaped or coastal tracts the
# geographic center can land in water or in a corner where nobody lives.
# st_point_on_surface() is guaranteed to be inside the polygon.
tract_centers <- st_point_on_surface(nyc_tracts_proj)

## Warning: st_point_on_surface assumes attributes are constant over geometries

# Nearest Entrance for every Tract
# Find the closest subway for each tract
# Doing this from the tract center 
nearest_indices <- st_nearest_feature(tract_centers, subway_entrances_sf)

# Distance in Miles
# (Distance in feet / 5280)
dist_to_subway_feet <- st_distance(tract_centers, subway_entrances_sf[nearest_indices, ], by_element = TRUE)

subway_dist_lookup <- nyc_tracts_proj %>%
  as_tibble() %>%
  mutate(dist_subway_miles = as.numeric(dist_to_subway_feet) / 5280) %>%
  select(GEOID, dist_subway_miles)

# Save the Subway Distance
saveRDS(subway_dist_lookup, file.path(projPath, "data", "subway_dist_lookup.rds"))

print("Subway distance feature successfully calculated from MTA data!")

## [1] "Subway distance feature successfully calculated from MTA data!"

Distance to Manhattan (CBD)

Borough fixed effects can’t tell apart “Williamsburg, 1 mile from Manhattan” and “Bay Ridge, 8 miles away” — they’re both Brooklyn, same borough effect. This chunk computes distance from each tract to Grand Central Terminal as a CBD-proximity proxy. Per George’s flag.

## Distance from each tract to Grand Central — proxy for "proximity to the
## Manhattan CBD". Same point_on_surface reference points as subway distance.

# Grand Central Terminal: 40.7527° N, 73.9772° W
grand_central <- st_sfc(st_point(c(-73.9772, 40.7527)), crs = 4326) %>%
  st_transform(2263)  # NYC State Plane (feet)

tract_centers_pos <- st_point_on_surface(nyc_tracts_proj)

## Warning: st_point_on_surface assumes attributes are constant over geometries

manhattan_dist_lookup <- nyc_tracts_proj %>%
  as_tibble() %>%
  mutate(
    dist_to_manhattan_miles = as.numeric(
      st_distance(tract_centers_pos, grand_central)
    ) / 5280
  ) %>%
  select(GEOID, dist_to_manhattan_miles)

saveRDS(manhattan_dist_lookup, file.path(projPath, "data", "manhattan_dist_lookup.rds"))
print("Distance to Manhattan (Grand Central) feature saved.")

## [1] "Distance to Manhattan (Grand Central) feature saved."

Assembling

Big Join

## Big Join 

# Load components
census_features <- readRDS(file.path(projPath, "data", "nyc_census_tracts.rds"))
bike_counts     <- readRDS(file.path(projPath, "data", "processed_bike_counts.rds"))
tlc_counts      <- readRDS(file.path(projPath, "data", "processed_tlc_counts.rds"))
subway_dist     <- readRDS(file.path(projPath, "data", "subway_dist_lookup.rds"))
manhattan_dist  <- readRDS(file.path(projPath, "data", "manhattan_dist_lookup.rds"))

# Before joining, force each Tract to belong to only one Taxi Zone, the one it overlaps most.
# I was seeing repeating values of income and distance for different zones in the final data set back when I was not doing this. 
unique_weights <- tract_weights %>%
  group_by(GEOID) %>%
  slice_max(tract_share, n = 1, with_ties = FALSE) %>% 
  ungroup() %>%
  select(GEOID, LocationID, zone) # Keep the Zone name and ID for joining

# Join everything to the Census Tracts
# left_join to not lose any tracts that might have had 0 trips
master_table <- census_features %>%
  st_drop_geometry() %>% # drops the maps, keeps ALL features 
  # unique Neighborhood/LocationID mapping
  left_join(unique_weights, by = "GEOID") %>%
  # Bike counts (already at Tract level)
  left_join(bike_counts, by = "GEOID") %>%
  # Uber/Taxi counts  
  left_join(tlc_counts, by = "GEOID") %>% 
  # Subway Distance (already at Tract level)
  left_join(subway_dist, by = "GEOID") %>%
  # Distance to Manhattan CBD (Grand Central proxy)
  left_join(manhattan_dist, by = "GEOID")


# Clean up NAs; create the Target Variable
# If a tract has no recorded trips, replace NA with 0
# If a tract has no stations, replace NA with 0 
master_table <- master_table %>%
  mutate(
    total_bike_trips      = replace_na(total_bike_trips, 0),
    total_uber_taxi_trips = replace_na(total_uber_taxi_trips, 0),
    total_stations        = replace_na(total_stations, 0),
    
    # Spatial Density (Stations per Square Mile)
    # How physically clustered is the infrastructure?
    stations_per_sq_mile  = total_stations / area_sq_miles,
    
    # Per Capita Availability (Stations per 1,000 Residents)
    # How much infrastructure exists relative to the local population size?
    stations_per_1000_pop = (total_stations / total_pop) * 1000,
    uber_trips_per_capita = total_uber_taxi_trips / total_pop,
    bike_trips_per_capita = total_bike_trips / total_pop,
    
    # Old target: % of Last-Mile trips taken by Bike
    # I left this in here, but we aren't really using it now
    bike_index = total_bike_trips / (total_bike_trips + total_uber_taxi_trips)
  )  

# Save joined data set
saveRDS(master_table, file.path(projPath, "data", "model_input.rds"))

print("Master Join Complete. The model_input.rds file is ready for modeling.")

## [1] "Master Join Complete. The model_input.rds file is ready for modeling."

Intermediate Sanity Checking

We could drop this later, but right now it was just a helpful way to make sure that combining the census tracts and the taxi zones worked properly.

### The Data Audit (Sanity Check)

# Load the Master Table
master_table <- readRDS(file.path(projPath, "data", "model_input.rds"))

# Join with Taxi Zone names for readability
# We'll use the tract_weights crosswalk to get neighborhood names back
audit_table <- master_table %>%
  select(zone, GEOID, bike_index, total_bike_trips, total_uber_taxi_trips, med_income, dist_subway_miles)

# Top 10 "Bike-Heavy" Neighborhoods (Active)
top_bike <- audit_table %>%
  arrange(desc(bike_index)) %>%
  head(50)

# Top 10 "Uber-Heavy" Neighborhoods (Vehicular)
top_uber <- audit_table %>%
  arrange(bike_index) %>%
  head(50)

# quick summary
print("--- TOP 10 BIKE-HEAVY TRACTS ---")

## [1] "--- TOP 10 BIKE-HEAVY TRACTS ---"

print(top_bike)

## # A tibble: 50 × 7
##    zone       GEOID bike_index total_bike_trips total_uber_taxi_trips med_income
##    <chr>      <chr>      <dbl>            <int>                 <dbl>      <dbl>
##  1 Two Bridg… 3606…      0.635            11311                 6499.      89873
##  2 Steinway   3608…      0.580             2748                 1993.      76889
##  3 Columbia … 3604…      0.571             7840                 5882.     145417
##  4 Kips Bay   3606…      0.549            26003                21371.     153902
##  5 Sunnyside  3608…      0.537             2569                 2216.      65667
##  6 Crown Hei… 3604…      0.527             8783                 7875.     114750
##  7 East Will… 3604…      0.520             6701                 6177.     151250
##  8 Fort Gree… 3604…      0.519            11949                11052.     142813
##  9 Two Bridg… 3606…      0.510            17361                16690.      25655
## 10 Stuy Town… 3606…      0.502            12168                12052.     202632
## # ℹ 40 more rows
## # ℹ 1 more variable: dist_subway_miles <dbl>

print("--- TOP 10 UBER-HEAVY TRACTS ---")

## [1] "--- TOP 10 UBER-HEAVY TRACTS ---"

print(top_uber)

## # A tibble: 50 × 7
##    zone       GEOID bike_index total_bike_trips total_uber_taxi_trips med_income
##    <chr>      <chr>      <dbl>            <int>                 <dbl>      <dbl>
##  1 Westchest… 3600…          0                0                 6170.      56208
##  2 West Farm… 3600…          0                0                 6420.      48750
##  3 East Trem… 3600…          0                0                 4399.      41287
##  4 Woodlawn/… 3600…          0                0                 2676.      88264
##  5 Soundview… 3600…          0                0                 2949.      53166
##  6 East Trem… 3600…          0                0                 8996.      32174
##  7 Williamsb… 3600…          0                0                 4127.      82039
##  8 Eastchest… 3600…          0                0                 3373.      27753
##  9 Williamsb… 3600…          0                0                 4719.      59758
## 10 Soundview… 3600…          0                0                 5698.      29657
## # ℹ 40 more rows
## # ℹ 1 more variable: dist_subway_miles <dbl>

Model Building

This section has 3 predictive models:

Linear Regression “baseline” model.
A linear model with a natural cubic spline on dist_subway_miles
A random forest model

It also has 1 causal model * BART, to estimate the average treatment effect

Borough Fixed Effects

Prof. George flagged that NYC neighborhoods are very different from each other. Added this section, and borough as a factor variable lets both models learn a borough-level intercept, capturing unobserved variation that our census features don’t fully explain.

Borough is derived from the first 5 digits of GEOID (standard Census FIPS codes) so no additional data pull is needed.

# FIPS county codes for NYC boroughs
borough_lookup <- c(
  "36005" = "Bronx",
  "36047" = "Brooklyn",
  "36061" = "Manhattan",
  "36081" = "Queens",
  "36085" = "Staten Island"
)

# Derive borough from first 5 chars of GEOID and join onto the master table
master_table <- readRDS(file.path(projPath, "data", "model_input.rds")) %>%
  # Brooklyn as reference level — most-treated borough in the IBX corridor,
  # so all other borough coefficients read as "vs Brooklyn (the IBX baseline)".
  mutate(borough = factor(borough_lookup[substr(GEOID, 1, 5)],
                          levels = c("Brooklyn", "Bronx", "Manhattan",
                                     "Queens", "Staten Island")))

saveRDS(master_table, file.path(projPath, "data", "model_input.rds"))
print(table(master_table$borough))

## 
##      Brooklyn         Bronx     Manhattan        Queens Staten Island 
##           775           344           303           684           120

Models 1 and 2: Linear Regression (with and without spline)

Two linear baselines:

lm_fit_standardized: pure linear regression
lm_fit_splineL same model + a natural cubic spline on dist_subway_miles

Updated this section so that both fit on a logit-transformed outcome so predictions stay in [0, 100]. (Without it, lm was predicting percentages that didn’t make sense because they were outsid ethat range).

The spline version handles the saturating distance effect Prof. George flagged. When the distance to the subway is closer it matters a lot, far doesn’t.

Heavy-tailed predictors (uber_trips_per_capita, bike_trips_per_capita, stations_per_1000_pop, pop_density) get a log1p() transform before z-scoring. Prof. George also pointed this out at some point, that some of these variables would have heavily skewed distributions. After logging, a 1 SD change in the standardized predictor is an order-of-magnitude change rather than an absolute one, which is the right scale for these variables.

# The selection for the Machine Learning model
# Right now, this has a lot of room for improvement trimming or adding features 

# To make iteration easier:
##  COLUMN/FEATURE SELECTION
id_vars <- c("GEOID")
target_var <- c("pct_commute_60p")

predictor_vars <- c(
  # Borough Fixed Effect
  "borough",

  # Infrastructure & Mobility
  "stations_per_sq_mile", "stations_per_1000_pop", "dist_subway_miles",
  "dist_to_manhattan_miles",
  "uber_trips_per_capita", "bike_trips_per_capita",

  # Economic & Housing
  "med_income", "pct_poverty", "pct_vacant", "pct_renter", "pct_2plus_cars",

  # Demographics & Commute Behaviors
  "pop_density", "pct_no_car", "pct_transit", "pct_bachelors", "pct_wfh", "pct_walk",

  # Age Buckets
  "pct_age_18_24", "pct_age_25_34", "pct_age_35_44", "pct_age_45_64", "pct_age_65_plus"
)

## SAVING RAW (NON-Z-SCORE) DATA TO DISPLAY IN SHINY APP
ml_data_raw <- readRDS(file.path(projPath, "data", "model_input.rds")) %>%
  # all_of() safely selects only the columns defined in our lists above
  select(all_of(c(id_vars, target_var, predictor_vars))) %>%
  drop_na()

# Sanity: any Inf/NaN in the numeric columns? scale() turns Inf into NaN.
print(ml_data_raw %>% summarise(across(where(is.numeric), ~ sum(!is.finite(.)))))

## # A tibble: 1 × 23
##   pct_commute_60p stations_per_sq_mile stations_per_1000_pop dist_subway_miles
##             <int>                <int>                 <int>             <int>
## 1               0                    0                     0                 0
## # ℹ 19 more variables: dist_to_manhattan_miles <int>,
## #   uber_trips_per_capita <int>, bike_trips_per_capita <int>, med_income <int>,
## #   pct_poverty <int>, pct_vacant <int>, pct_renter <int>,
## #   pct_2plus_cars <int>, pop_density <int>, pct_no_car <int>,
## #   pct_transit <int>, pct_bachelors <int>, pct_wfh <int>, pct_walk <int>,
## #   pct_age_18_24 <int>, pct_age_25_34 <int>, pct_age_35_44 <int>,
## #   pct_age_45_64 <int>, pct_age_65_plus <int>

## SCALE THE DATA FOR THE MODEL
# Scale numeric predictors only — borough is a factor and must not be z-scored
numeric_predictors <- predictor_vars[predictor_vars != "borough"]

ml_data_scaled <- ml_data_raw %>%
  # Log-transform heavy-tailed predictors before scaling so standardized
  # values reflect order-of-magnitude differences (per George's flag re:
  # citibike share, uber/taxi share, plus pop_density spans 4 orders).
  mutate(across(
    .cols = c(uber_trips_per_capita, bike_trips_per_capita,
              stations_per_1000_pop, pop_density),
    .fns  = log1p
  )) %>%
  mutate(across(
    .cols = all_of(numeric_predictors),
    .fns  = ~ as.numeric(scale(.))
  )) %>%
  # Logit-transform the outcome so lm predictions can't escape [0, 100].
  # Small offset handles 0% / 100% edges without -Inf/+Inf.
  mutate(pct_commute_60p_logit = qlogis((pct_commute_60p + 0.5) / 101))

# Helper: bring logit-space predictions back to raw % space.
logit_to_pct <- function(z) (plogis(z) * 101) - 0.5

## TRAIN THE MODEL
# Fit on the logit-transformed outcome — back-transform when we predict.
lm_fit_standardized <- lm(pct_commute_60p_logit ~ . - GEOID - pct_commute_60p,
                          data = ml_data_scaled)

# Same model with a natural cubic spline on dist_subway_miles — kept alongside
# the pure-linear baseline so we can compare them in the Shiny app.
lm_fit_spline <- lm(pct_commute_60p_logit ~ . - GEOID - pct_commute_60p - dist_subway_miles
                    + splines::ns(dist_subway_miles, df = 4),
                    data = ml_data_scaled)

print("Model Summary: ")

## [1] "Model Summary: "

summary(lm_fit_standardized)

## 
## Call:
## lm(formula = pct_commute_60p_logit ~ . - GEOID - pct_commute_60p, 
##     data = ml_data_scaled)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5619 -0.2472  0.0243  0.3020  2.8051 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.152266   0.019216 -59.962  < 2e-16 ***
## boroughBronx             0.008190   0.035471   0.231 0.817418    
## boroughManhattan        -0.533225   0.047750 -11.167  < 2e-16 ***
## boroughQueens           -0.041091   0.031410  -1.308 0.190938    
## boroughStaten Island    -0.276646   0.062500  -4.426 1.01e-05 ***
## stations_per_sq_mile     0.044296   0.019031   2.328 0.020024 *  
## stations_per_1000_pop   -0.111398   0.025688  -4.337 1.51e-05 ***
## dist_subway_miles       -0.007864   0.014561  -0.540 0.589218    
## dist_to_manhattan_miles  0.241930   0.021178  11.423  < 2e-16 ***
## uber_trips_per_capita   -0.045419   0.026221  -1.732 0.083384 .  
## bike_trips_per_capita   -0.049733   0.026616  -1.869 0.061819 .  
## med_income              -0.106383   0.020043  -5.308 1.22e-07 ***
## pct_poverty             -0.068325   0.017665  -3.868 0.000113 ***
## pct_vacant               0.034210   0.012962   2.639 0.008367 ** 
## pct_renter              -0.001947   0.024678  -0.079 0.937135    
## pct_2plus_cars           0.019960   0.024783   0.805 0.420680    
## pop_density             -0.063155   0.021817  -2.895 0.003832 ** 
## pct_no_car              -0.005559   0.032784  -0.170 0.865376    
## pct_transit              0.180915   0.021174   8.544  < 2e-16 ***
## pct_bachelors           -0.056834   0.018176  -3.127 0.001790 ** 
## pct_wfh                 -0.168961   0.016207 -10.425  < 2e-16 ***
## pct_walk                -0.095116   0.017528  -5.426 6.38e-08 ***
## pct_age_18_24            0.028315   0.013139   2.155 0.031267 *  
## pct_age_25_34           -0.052292   0.017840  -2.931 0.003412 ** 
## pct_age_35_44           -0.003262   0.015474  -0.211 0.833049    
## pct_age_45_64            0.041001   0.014926   2.747 0.006062 ** 
## pct_age_65_plus          0.029769   0.016704   1.782 0.074857 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4876 on 2193 degrees of freedom
## Multiple R-squared:  0.7066, Adjusted R-squared:  0.7031 
## F-statistic: 203.1 on 26 and 2193 DF,  p-value: < 2.2e-16

Spatial lag features

For each geographic predictor below, we compute the neighbor-weighted average (the “spatial lag”). This lets a model express patterns like “a tract’s commute outcomes depend on its own features AND on its neighbors’ features.” We keep the original predictors as-is so we can compare with-lag vs. without-lag model performance directly.

library(spdep)

## Warning: package 'spdep' was built under R version 4.5.2

## Loading required package: spData

## Warning: package 'spData' was built under R version 4.5.2

## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

library(sf)

# Same spatial weights pattern as the residual diagnostic. Row order matches
# ml_data_scaled, so lag.listw is correctly positional.
tracts_sf <- nyc_tracts_clean %>%
  filter(GEOID %in% ml_data_scaled$GEOID) %>%
  arrange(match(GEOID, ml_data_scaled$GEOID))

neighbors <- poly2nb(tracts_sf)

## Warning in poly2nb(tracts_sf): some observations have no neighbours;
## if this seems unexpected, try increasing the snap argument.

## Warning in poly2nb(tracts_sf): neighbour object has 6 sub-graphs;
## if this sub-graph count seems unexpected, try increasing the snap argument.

weights   <- nb2listw(neighbors, style = "W", zero.policy = TRUE)

# Predictors with the most geographic structure. Skipping the outcome
# (lag-of-outcome opens the SAR can of worms) and skipping categorical
# borough (lag of factor levels is awkward).
predictors_to_lag <- c(
  "dist_to_manhattan_miles", "dist_subway_miles",
  "pop_density", "pct_transit", "pct_wfh",
  "med_income", "stations_per_1000_pop"
)

# Compute each lag column inside ml_data_scaled.
for (col in predictors_to_lag) {
  ml_data_scaled[[paste0("lag_", col)]] <-
    lag.listw(weights, ml_data_scaled[[col]], zero.policy = TRUE)
}

# Sanity: no Inf/NaN in the new lag columns
lag_finite_check <- ml_data_scaled %>%
  select(starts_with("lag_")) %>%
  summarise(across(everything(), ~ sum(!is.finite(.))))
print(lag_finite_check)

## # A tibble: 1 × 7
##   lag_dist_to_manhattan_…¹ lag_dist_subway_miles lag_pop_density lag_pct_transit
##                      <int>                 <int>           <int>           <int>
## 1                        0                     0               0               0
## # ℹ abbreviated name: ¹lag_dist_to_manhattan_miles
## # ℹ 3 more variables: lag_pct_wfh <int>, lag_med_income <int>,
## #   lag_stations_per_1000_pop <int>

Model 3: Random Forest + 10-fold Cross-Validation

Three jobs in this section: train the Random Forest (Model 3), build the spatial-lag variants of all three models (Linear, Linear + Spline, Random Forest), and put all six variants through 10-fold cross-validation to compare them to one another.

The spatial-lag variants use the same model formulas as the non-spatial ones, with the neighbor-weighted lag_* columns from the previous chunk added to the predictor set.

CV mechanics: each variant is fit 10 times (once per fold, on the 90% training side) and predicted on the held-out 10%. Folds are stratified by borough and reused across all six variants (same seed, same vfold_cv call), so any pair of models is directly comparable. The “one-SE rule” then picks the simplest model within one SE of the best.

A separate fit of rf_fit on the full dataset happens at the end (that’s the version used by the IBX simulation downstream, separate from the CV loop).

## CV SETUP
# 10-fold CV stratified by borough — same folds for every model so RMSE/R^2
# comparisons are paired 
set.seed(622)
model_data <- ml_data_scaled[, c(target_var, predictor_vars)]

# Spatial-lag predictor set: original predictors + the lag_* columns added
# in the build-spatial-lags chunk. Same target, same row order.
spatial_predictor_vars <- c(
  predictor_vars,
  paste0("lag_", predictors_to_lag)
)
model_data_spatial <- ml_data_scaled[, c(target_var, spatial_predictor_vars)]

folds <- rsample::vfold_cv(model_data, v = 10, strata = borough)

# Fit one model on the analysis side of a fold, predict on the assessment side.
run_cv <- function(folds, fit_fn) {
  imap_dfr(folds$splits, function(split, fold_id) {
    train <- rsample::analysis(split)
    test  <- rsample::assessment(split)
    fit   <- fit_fn(train)
    tibble(
      fold   = as.integer(fold_id),
      actual = test$pct_commute_60p,
      pred   = as.numeric(predict(fit, newdata = test))
    )
  })
}

## RUN EACH MODEL ACROSS THE FOLDS
lm_cv        <- run_cv(folds, function(d) lm(pct_commute_60p ~ ., data = d))
lm_spline_cv <- run_cv(folds, function(d) lm(pct_commute_60p ~ . - dist_subway_miles
                                              + splines::ns(dist_subway_miles, df = 4),
                                              data = d))
rf_cv        <- run_cv(folds, function(d) randomForest(pct_commute_60p ~ ., data = d, ntree = 500))

## SPATIAL-LAG VARIANTS: same models, same folds (same seed + same rows),
## but using model_data_spatial so the lag_* columns enter as predictors.
set.seed(622)
folds_spatial <- rsample::vfold_cv(model_data_spatial, v = 10, strata = borough)

lm_cv_spatial        <- run_cv(folds_spatial, function(d) lm(pct_commute_60p ~ ., data = d))
lm_spline_cv_spatial <- run_cv(folds_spatial, function(d) lm(pct_commute_60p ~ . - dist_subway_miles
                                                              + splines::ns(dist_subway_miles, df = 4),
                                                              data = d))
rf_cv_spatial        <- run_cv(folds_spatial, function(d) randomForest(pct_commute_60p ~ ., data = d, ntree = 500))

## Train RF on full dataset (used for map predictions — outside the CV loop)
rf_fit <- randomForest(
  pct_commute_60p ~ .,
  data       = model_data,
  ntree      = 500,
  importance = TRUE
)

## Spatial-lag production fits — used by the spillover IBX simulation
lm_fit_standardized_spatial <- lm(
  pct_commute_60p_logit ~ . - GEOID - pct_commute_60p,
  data = ml_data_scaled
)

lm_fit_spline_spatial <- lm(
  pct_commute_60p_logit ~ . - GEOID - pct_commute_60p - dist_subway_miles
    + splines::ns(dist_subway_miles, df = 4),
  data = ml_data_scaled
)

set.seed(622)
rf_fit_spatial <- randomForest(
  pct_commute_60p ~ .,
  data       = ml_data_scaled[, c(target_var, spatial_predictor_vars)],
  ntree      = 500,
  importance = TRUE
)

## SUMMARIZE
## mean +- SE across folds, plus the one-SE rule flag
rmse <- function(a, p) sqrt(mean((a - p)^2))
r2   <- function(a, p) 1 - sum((a - p)^2) / sum((a - mean(a))^2)

fold_metrics <- function(cv_df) {
  cv_df %>% group_by(fold) %>%
    summarise(rmse_fold = rmse(actual, pred),
              r2_fold   = r2(actual, pred),
              .groups   = "drop")
}

cv_summary <- function(name, cv_df) {
  m <- fold_metrics(cv_df)
  tibble(
    Model     = name,
    RMSE_mean = mean(m$rmse_fold),
    RMSE_SE   = sd(m$rmse_fold) / sqrt(nrow(m)),
    R2_mean   = mean(m$r2_fold),
    R2_SE     = sd(m$r2_fold)   / sqrt(nrow(m))
  )
}

cv_table <- bind_rows(
  cv_summary("Linear Regression",                  lm_cv),
  cv_summary("Linear Regression + Spatial Lags",   lm_cv_spatial),
  cv_summary("Linear + Spline (dist_subway_miles)", lm_spline_cv),
  cv_summary("Linear + Spline + Spatial Lags",     lm_spline_cv_spatial),
  cv_summary("Random Forest",                       rf_cv),
  cv_summary("Random Forest + Spatial Lags",       rf_cv_spatial)
)

# One-SE rule: any model within (best_RMSE + SE_of_best) is "as good as best".
best_idx      <- which.min(cv_table$RMSE_mean)
one_se_thresh <- cv_table$RMSE_mean[best_idx] + cv_table$RMSE_SE[best_idx]
cv_table <- cv_table %>%
  mutate(within_1se_of_best = ifelse(RMSE_mean <= one_se_thresh, "✓", ""))

kable(cv_table %>% mutate(across(where(is.numeric), ~ round(., 3))),
      caption = "10-fold CV (stratified by borough) — mean ± SE, one-SE rule flag")

10-fold CV (stratified by borough) — mean ± SE, one-SE rule flag
Model	RMSE_mean	RMSE_SE	R2_mean	R2_SE	within_1se_of_best
Linear Regression	7.768	0.118	0.610	0.011
Linear Regression + Spatial Lags	7.623	0.118	0.624	0.011
Linear + Spline (dist_subway_miles)	7.606	0.135	0.626	0.013
Linear + Spline + Spatial Lags	7.496	0.136	0.637	0.013
Random Forest	7.236	0.140	0.661	0.013	✓
Random Forest + Spatial Lags	7.138	0.144	0.670	0.013	✓

Interpretation: Random Forest + Spatial Lags is the CV-best by just a little bit, with 7.14 RMSE vs 7.24 for plain Random Forest. The two Random Forest variants are the only rows within 1 SE of the best, so they’re statistically a tie, but we’re opting to treat Random Forest + Spatial Lags as the winner. Its typical prediction is off by ~7.1pp on the 0-100% scale, and it explains ~67% of variance. There are also two patterns in the table that are worth pointing at. Across the spatial-lag axis, every model class gains roughly 0.10 to 0.15 RMSE once neighbor-weighted features are added (Linear Regression 7.77 to 7.62, Linear + Spline 7.61 to 7.5, Random Forest 7.24 to 7.14). The improvement is consistent across all three classes. Basically, neighbors are carrying real signal beyond what a tract’s own features capture. Across the model-family axis, the jump from Linear + Spline to Random Forest (-0.37 RMSE) is bigger than the jump from Linear Regression to Linear + Spline (-0.16), meaning Random Forest is catching nonlinearities and interactions the spline alone can’t. For the IBX simulation, the Linear + Spline, Random Forest, and spatial-lag variants are the ones to lean on. Plain Linear Regression can’t capture the saturating subway-access effect, so its predicted relief is essentially zero. The spatial-lag variants also let predicted relief spill over to tracts next to the new corridor, so they reach more tracts than the non-spatial versions (more on that in the IBX simulation section below).

P.s. the “saturating subway-access effect” was just the most succinct way I could phrase “the effect where 5 minutes versus 10 minutes as a walk to the subway is a big different, but 20 minutes versus 25 is not”. Ajs Prof. George pointed this out to us in feedback notes at some point in time.

RF Tuning: mtry sweep via OOB

mtry is the knob for how many features RF considers at each tree split. Tuning it using OOB error instead of a full CV loop (ISLP says they’re basically equivalent with enough trees, and OOB is faster). The sweep covers the default p/3 ≈ 7, a few neighbors, and mtry = 22 (which is plain bagging and useful as a control). If multiple values come back within 1% of the best, the default is being picked becuase it’s “cleaner” to defend than the algorithmic min when the gaps are tiny.

## mtry SWEEP via OOB
mtry_grid <- c(3, 5, 7, 11, 22)

rf_tuning <- map_dfr(mtry_grid, function(m) {
  set.seed(622)
  fit <- randomForest(pct_commute_60p ~ ., data = model_data,
                      mtry = m, ntree = 500)
  tibble(
    mtry     = m,
    OOB_RMSE = sqrt(tail(fit$mse, 1)),
    OOB_R2   = tail(fit$rsq, 1)
  )
})

# Flag mtry values within ~1% of best — rough OOB analogue of the one-SE rule.
best_oob <- min(rf_tuning$OOB_RMSE)
rf_tuning <- rf_tuning %>%
  mutate(within_1pct_of_best = ifelse(OOB_RMSE <= best_oob * 1.01, "✓", ""))

kable(rf_tuning %>% mutate(across(where(is.numeric), ~ round(., 3))),
      caption = "RF mtry sweep — OOB RMSE / R² on full data")

RF mtry sweep — OOB RMSE / R² on full data
mtry	OOB_RMSE	OOB_R2	within_1pct_of_best
3	7.364	0.652
5	7.278	0.660	✓
7	7.250	0.663	✓
11	7.241	0.664	✓
22	7.270	0.661	✓

## Refit production rf_fit with the chosen mtry
# OOB-best is mtry=11 by 0.03 RMSE, but ISLP default p/3 ≈ 7 is within 1%.
# Going with the simpler/standard default — easier to defend and statistically tied.
chosen_mtry <- 7
set.seed(622)
rf_fit <- randomForest(
  pct_commute_60p ~ .,
  data       = model_data,
  ntree      = 500,
  mtry       = chosen_mtry,
  importance = TRUE
)
cat("rf_fit refit with mtry =", chosen_mtry, "\n")

## rf_fit refit with mtry = 7

## OOB-vs-ntree — confirm 500 trees is enough
oob_curve <- tibble(
  ntree    = seq_along(rf_fit$mse),
  oob_rmse = sqrt(rf_fit$mse)
)

ggplot(oob_curve, aes(x = ntree, y = oob_rmse)) +
  geom_line(color = "darkblue") +
  geom_hline(yintercept = tail(oob_curve$oob_rmse, 1),
             color = "red", linetype = "dashed") +
  theme_minimal() +
  labs(title = paste("OOB RMSE vs number of trees (mtry =", chosen_mtry, ")"),
       subtitle = "Red dashed = final OOB RMSE — curve should plateau",
       x = "Number of trees", y = "OOB RMSE")

How to interpret: OOB error should drop fast, then plateau. If the curve flattens well before the right edge, 500 trees is enough. If it’s still falling at the right edge, ntree should be higher. The dashed red line marks the final OOB value — looking for the curve to be hugging it.

Interpretation: mtry sweep was a near-tie — the algorithmic best was mtry=11 at OOB RMS 7.241, but 4 of 5 values landed within 1% of the best, so we went with ISLP’s default mtry=7 (RMSE 7.25). That mtry=2 (essentially bagging) is only marginally worse tells us feature decorrelation isn’t doing heavy lifting on this dataset. The OOB-vs-ntree curve drops fast from ~10.5 at ntree=1 to ~7.28 by ntree=200, then flatlines through 500 — confirming ntree=500 is well past the plateau. Could safely cut to ~300 trees, but 500 is standard so we kept it.

IBX Simulation

Here, the approximate locations of the new IBX stations are being brought in. And then in the census areas where they go, we’re basically dropping the distance to train for those areas to really low. There may be a more robust way of implementing the “simulation” of the IBX station than this, but this seemed like a good starting point.

# Build the list of stations based on the MTA's 2025 CE Document
# This document is a pdf, and can be found in the repo
ibx_stations <- tibble::tribble(
  ~station_name,             ~address,
  "Brooklyn Army Terminal",  "2nd Ave & 61st St, Brooklyn, NY",
  "4th Ave",                 "5th Ave & 61st St, Brooklyn, NY",
  "8th Ave",                 "8th Ave & 61st St, Brooklyn, NY",
  "New Utrecht Ave",         "14th Ave & 61st St, Brooklyn, NY",
  "McDonald Ave",            "McDonald Ave & Avenue I, Brooklyn, NY",
  "East 16th St",            "East 16th St & Avenue H, Brooklyn, NY",
  "Flatbush-Nostrand Ave",   "Nostrand Ave & Avenue H, Brooklyn, NY",
  "Utica Ave",               "Utica Ave & Farragut Rd, Brooklyn, NY",
  "Remsen Ave",              "Remsen Ave & Farragut Rd, Brooklyn, NY",
  "Linden Blvd",             "Linden Blvd & Junius St, Brooklyn, NY",
  "Livonia Ave",             "Livonia Ave & Junius St, Brooklyn, NY",
  "Sutter Ave",              "Sutter Ave & Junius St, Brooklyn, NY",
  "Atlantic Ave",            "Atlantic Ave & East New York Ave, Brooklyn, NY",
  "Wilson Ave",              "Wilson Ave & Moffat St, Brooklyn, NY",
  "Myrtle Ave",              "Cypress Hills St & 78th Ave, Queens, NY",
  "Metropolitan Ave",        "Metropolitan Ave & 69th St, Queens, NY",
  "Eliot Ave",               "Eliot Ave & 69th St, Queens, NY",
  "Grand Ave",               "Grand Ave & 69th St, Queens, NY",
  "Roosevelt Ave",           "Roosevelt Ave & 74th St, Queens, NY"
)

# Geocode the addresses into Lat/Long coordinates
ibx_geocoded <- ibx_stations %>%
  geocode(address, method = 'osm', lat = latitude, long = longitude) %>% # get coordinated for each address
  drop_na() %>% # Drop any that the open-source map couldn't perfectly match
  
  # Convert the GPS coordinates into an sf object
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  
  # Convert and draw 10-minute-walk-radiuses 
  st_transform(2263) %>% # transform to the more precide NYC + Long Island map we've been using
  st_buffer(dist = 2640) %>% # drawing a circle around point that is 2640 feet = 0.5 miles ~ 10 minute walk. Most people don't want to walk further than 10 minutes to the train
  st_transform(4326) # Convert back to standard GPS

## Passing 19 addresses to the Nominatim single address geocoder

## Query completed in: 19.1 seconds

# Find all Census Tracts that fall at least partially inside these new 10-minute-walk-radiuses 
ibx_tracts_sf <- st_filter(
  nyc_tracts, 
  st_transform(ibx_geocoded, st_crs(nyc_tracts)), 
  .predicate = st_intersects
)

# Extract just the 11-digit IDs as a clean list
# Only keep tracts that made it into the modeling set, otherwise
# treated tracts get no prediction and show up as map holes.
ibx_geoid_list <- intersect(ibx_tracts_sf$GEOID, ml_data_scaled$GEOID)

The IBX simulation works like this: geocode the 19 IBX station addresses, draw 0.5-mile walksheds around each, and flag every tract that intersects a walkshed as “treated.” For the treated tracts, we’re setting dist_subway_miles to the city-wide minimum. This is our way of saying that this tract now has subway access as close as the best-served tract in NYC. We then re-predict commute outcomes. The same setup runs through all six model variants (Linear, Linear + Spline, Random Forest, each with and without spatial lags) so we can compare how each forecasts IBX’s effect. The spatial-lag variants additionally let predicted relief spill over to tracts next to the IBX corridor (more on that below). It’s a coarse proxy for what a real subway station does, but a defensible starting point.

## Create an IBX UNIVERSE data set
# finding the Census Tract in NYC that is physically closest to a subway entrance
# going to use this later to simulate a really close distance to subway for places where IBX is being built
best_transit_z_score <- min(ml_data_scaled$dist_subway_miles)

ibx_scenario_data_scaled <- ml_data_scaled %>%
  mutate(
     # Inject best_transit_z_score into the IBX tracts
    dist_subway_miles = ifelse(GEOID %in% ibx_geoid_list, best_transit_z_score, dist_subway_miles)
  )

# Predicted data is from the model which consumed scaled data
# But we're using raw data so the display of neighborhood features is correct
# lm predictions live in logit space — back-transform + clamp to [0, 100].
final_map_data <- ml_data_raw %>%
  mutate(
    predicted_commute_baseline = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_standardized, newdata = ml_data_scaled)))),
    predicted_commute_ibx      = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_standardized, newdata = ibx_scenario_data_scaled)))),
    ibx_commute_relief         = pmax(0, predicted_commute_baseline - predicted_commute_ibx),
    is_ibx_location            = ifelse(GEOID %in% ibx_geoid_list, 1, 0)
  )

# Sanity: where do predictions actually land?
cat("lm baseline range:", range(final_map_data$predicted_commute_baseline), "\n")

## lm baseline range: 0 50.66438

cat("lm IBX range:    ", range(final_map_data$predicted_commute_ibx), "\n")

## lm IBX range:     0 50.66438

cat("lm relief range: ", range(final_map_data$ibx_commute_relief), "\n")

## lm relief range:  0 0

saveRDS(final_map_data, file.path(projPath, "data", "model_input.rds"))

# Spline-lm IBX predictions — appended to final_map_data alongside the linear baseline.
final_map_data <- final_map_data %>%
  mutate(
    predicted_commute_baseline_spline = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_spline, newdata = ml_data_scaled)))),
    predicted_commute_ibx_spline      = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_spline, newdata = ibx_scenario_data_scaled)))),
    ibx_commute_relief_spline         = pmax(0, predicted_commute_baseline_spline - predicted_commute_ibx_spline)
  )

cat("spline baseline range:", range(final_map_data$predicted_commute_baseline_spline), "\n")

## spline baseline range: 0 49.26246

cat("spline IBX range:    ", range(final_map_data$predicted_commute_ibx_spline), "\n")

## spline IBX range:     0 49.26246

cat("spline relief range: ", range(final_map_data$ibx_commute_relief_spline), "\n")

## spline relief range:  0 7.856936

saveRDS(final_map_data, file.path(projPath, "data", "model_input.rds"))

# Append RF IBX predictions to the map data already saved by the chunk above.
# final_map_data is still in memory — adding RF columns and re-saving.
final_map_data_rf <- final_map_data %>%
  mutate(
    # Restore Tai's clamps — % can't be negative, neither can relief.
    predicted_commute_baseline_rf = pmax(0, predict(rf_fit, newdata = ml_data_scaled)),
    predicted_commute_ibx_rf      = pmax(0, predict(rf_fit, newdata = ibx_scenario_data_scaled)),
    ibx_commute_relief_rf         = pmax(0, predicted_commute_baseline_rf - predicted_commute_ibx_rf),
    # Residual: positive = model under-predicts (actual > predicted),
    #           negative = model over-predicts (actual < predicted).
    # Used by the Shiny app's residual layer.
    rf_residual                   = pct_commute_60p - predicted_commute_baseline_rf
  )

# Sanity: same range check for RF
cat("RF baseline range:", range(final_map_data_rf$predicted_commute_baseline_rf), "\n")

## RF baseline range: 1.731167 65.80985

cat("RF IBX range:    ", range(final_map_data_rf$predicted_commute_ibx_rf), "\n")

## RF IBX range:     1.731167 65.80985

cat("RF relief range: ", range(final_map_data_rf$ibx_commute_relief_rf), "\n")

## RF relief range:  0 7.366372

saveRDS(final_map_data_rf, file.path(projPath, "data", "model_input.rds"))
print("RF predictions appended to model_input.rds")

## [1] "RF predictions appended to model_input.rds"

IBX simulation with spatial-lag spillover

Same scenario as the non-lag versions (push dist_subway_miles to the global minimum for treated tracts), but with one extra step: we recompute lag_dist_subway_miles using the modified values. That way tracts neighboring the IBX corridor see their neighbor-weighted subway-distance drop too, and the spatial models can express spillover benefits to those adjacent tracts. The other six lag columns don’t move because their underlying features aren’t part of the simulation.

# Build the spatial scenario: modify dist_subway_miles (same as before) AND
# recompute its spatial lag using the updated values.
ibx_scenario_data_spatial <- ml_data_scaled %>%
  mutate(
    dist_subway_miles = ifelse(GEOID %in% ibx_geoid_list,
                                best_transit_z_score,
                                dist_subway_miles)
  )

ibx_scenario_data_spatial$lag_dist_subway_miles <-
  lag.listw(weights, ibx_scenario_data_spatial$dist_subway_miles, zero.policy = TRUE)

# Predict from the three spatial models on baseline and IBX scenarios.
# Append to final_map_data_rf so all simulation outputs live in one frame.
final_map_data_rf <- final_map_data_rf %>%
  mutate(
    # Linear + Spatial Lags
    predicted_commute_baseline_lm_spatial = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_standardized_spatial, newdata = ml_data_scaled)))),
    predicted_commute_ibx_lm_spatial      = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_standardized_spatial, newdata = ibx_scenario_data_spatial)))),
    ibx_commute_relief_lm_spatial         = pmax(0, predicted_commute_baseline_lm_spatial - predicted_commute_ibx_lm_spatial),

    # Spline + Spatial Lags
    predicted_commute_baseline_spline_spatial = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_spline_spatial, newdata = ml_data_scaled)))),
    predicted_commute_ibx_spline_spatial      = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_spline_spatial, newdata = ibx_scenario_data_spatial)))),
    ibx_commute_relief_spline_spatial         = pmax(0, predicted_commute_baseline_spline_spatial - predicted_commute_ibx_spline_spatial),

    # RF + Spatial Lags
    predicted_commute_baseline_rf_spatial = pmax(0, predict(rf_fit_spatial, newdata = ml_data_scaled)),
    predicted_commute_ibx_rf_spatial      = pmax(0, predict(rf_fit_spatial, newdata = ibx_scenario_data_spatial)),
    ibx_commute_relief_rf_spatial         = pmax(0, predicted_commute_baseline_rf_spatial - predicted_commute_ibx_rf_spatial)
  )

# Range printouts
cat("\n--- Spatial-lag IBX simulation ranges ---\n")

## 
## --- Spatial-lag IBX simulation ranges ---

cat("lm + lags relief range:     ", range(final_map_data_rf$ibx_commute_relief_lm_spatial), "\n")

## lm + lags relief range:      0 3.031121

cat("spline + lags relief range: ", range(final_map_data_rf$ibx_commute_relief_spline_spatial), "\n")

## spline + lags relief range:  0 8.148764

cat("RF + lags relief range:     ", range(final_map_data_rf$ibx_commute_relief_rf_spatial), "\n")

## RF + lags relief range:      0 9.283269

# Spillover diagnostic: how many tracts get non-zero relief in each scenario?
# Non-spatial: only directly-treated tracts.
# Spatial: directly-treated + their neighbors (because the lag changed for them).
cat("\n--- Spillover breadth (tracts with non-zero predicted relief) ---\n")

## 
## --- Spillover breadth (tracts with non-zero predicted relief) ---

cat("Treated tracts in dataset:    ", length(ibx_geoid_list), "\n")

## Treated tracts in dataset:     200

cat("Non-spatial (lm / spline / rf):       ",
    sum(final_map_data_rf$ibx_commute_relief         > 0), "/",
    sum(final_map_data_rf$ibx_commute_relief_spline  > 0), "/",
    sum(final_map_data_rf$ibx_commute_relief_rf      > 0), "\n")

## Non-spatial (lm / spline / rf):        0 / 200 / 157

cat("Spatial (lm+lag / spline+lag / rf+lag):",
    sum(final_map_data_rf$ibx_commute_relief_lm_spatial      > 0), "/",
    sum(final_map_data_rf$ibx_commute_relief_spline_spatial  > 0), "/",
    sum(final_map_data_rf$ibx_commute_relief_rf_spatial      > 0), "\n")

## Spatial (lm+lag / spline+lag / rf+lag): 120 / 358 / 270

saveRDS(final_map_data_rf, file.path(projPath, "data", "model_input.rds"))

Interpretation: The spillover diagnostic above adds a second dimension to the IBX picture. Without spatial lags, the simulation only registers relief for directly-treated tracts. The Linear model predicts zero for every tract. Linear + Spline reaches all 200 treated tracts. Random Forest reaches 157 of them. With spatial lags, the same three model classes reach 120 / 358 / 270 tracts (Linear, Linear + Spline, Random Forest). That broader reach comes from the way we recompute lag_dist_subway_miles after the treatment update. Once a tract’s neighbors get closer to subway, the tract’s own neighbor-weighted distance-to-subway also drops, so the spatial model picks up the spillover.

Max relief grows too. Plain Linear stays at 0pp (no saturating-effect capture). Linear + Spatial Lags rescues it to 3.03pp. Linear + Spline goes from 7.86pp to 8.15pp. Random Forest goes from 7.37pp to 9.28pp. Random Forest + Spatial Lags ends up with the highest max relief and the most tracts reached, which lines up with it being the CV-best model overall.

# Model Results
lm_results <- tidy(lm_fit_standardized) %>%
  filter(term != "(Intercept)") # Remove the intercept for better scaling

# Plot the "Pull" or "Impact" of each feature
ggplot(lm_results, aes(x = estimate, y = reorder(term, estimate))) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  geom_point(size = 3, color = "darkblue") +
  geom_errorbarh(aes(xmin = estimate - std.error, xmax = estimate + std.error), height = 0.2) +
  theme_minimal() +
  labs(title = "What's driving 60 minute + length communtes?",
       subtitle = "Standardized Coefficients (Variables further from 0 have a stronger impact)",
       x = "Effect on 60+ Minute Commute Rate (per 1 Std. Dev. increase)",
       y = "Feature")

## Warning: `geom_errobarh()` was deprecated in ggplot2 4.0.0.
## ℹ Please use the `orientation` argument of `geom_errorbar()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `height` was translated to `width`.

How to interpret: Each point is a feature’s standardized coefficient — how much logit(% long commute) shifts per 1 SD change in that feature, holding everything else fixed. Distance from zero = effect strength; the red dashed line marks “no effect.” Borough effects (vs Brooklyn baseline) and dist_to_manhattan_miles tend to dominate. Error bars show ±1 SE — features whose bars cross zero aren’t statistically distinguishable from zero effect.

Interpretation: Distance to Manhattan and being in Manhattan are the dominant features. Together they carry most of the geography story (more distance from Manhattan → more long commutes; Manhattan itself strongly fewer). The most counterintuitive coefficient is pct_transit (+0.18, p < 10⁻¹⁶) — tracts where more people commute by transit have more long commutes, reflecting NYC’s reality that long-distance commuters disproportionately use the subway. Distance to subway in the linear model is essentially zero (-0.008, p=0.59); the relationship is saturating, which is why we added the spline. After controlling for distance to Manhattan, Queens and Bronx are statistical ties with Brooklyn; only Manhattan and Staten Island stand out as boroughs with meaningfully fewer long commutes.

# Variable importance from the full RF model
rf_importance <- importance(rf_fit) %>%
  as.data.frame() %>%
  rownames_to_column("Feature") %>%
  arrange(desc(`%IncMSE`))

ggplot(rf_importance, aes(x = `%IncMSE`, y = reorder(Feature, `%IncMSE`))) +
  geom_point(size = 3, color = "darkgreen") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(
    title    = "Random Forest: Feature Importance",
    subtitle = "% increase in MSE when feature is permuted — higher = more important",
    x        = "% Increase in MSE (Permutation Importance)",
    y        = "Feature"
  )

How to interpret: Higher %IncMSE = the feature mattered more for prediction (test error rises more when that feature is randomly shuffled). Compare to the lm coefficient plot — features ranking high in both are “robustly important.” A feature high in RF but flat in lm typically means RF is capturing a nonlinear or interaction effect that lm can’t.

Interpretation: RF’s top three features — pct_transit, dist_to_manhattan_miles, and pct_wfh — are largely the same top features the linear model picked, so the predictive story holds across model classes. The big difference: dist_subway_miles ranks #5 in RF (%IncMSE = 23.2) despite being essentially zero in the linear model (-0.008, p=0.59) — clean confirmation that the relationship is nonlinear, which is exactly what the spline was meant to capture. borough only ranks #9 here — RF can use the continuous geographic features directly, so it leans less on the categorical borough proxy than the lm does. The bottom of the list (age buckets, pct_vacant) carries little predictive weight; candidates for trimming if we ever wanted a leaner feature set.

RF Residual Diagnostics

If RF residuals cluster geographically, the model is missing a feature with spatial structure (likely “proximity to Manhattan” per George’s flag). The map shows where the model under- or over-predicts; Moran’s I quantifies whether the clustering is statistically meaningful.

library(spdep)

## RESIDUAL MAP
residual_map_data <- final_map_data_rf %>%
  select(GEOID, rf_residual) %>%
  left_join(nyc_tracts_clean %>% select(GEOID, geometry), by = "GEOID") %>%
  st_as_sf()

# Symmetric color scale capped at p99 of |residual| for readability
cap <- quantile(abs(residual_map_data$rf_residual), 0.99, na.rm = TRUE)

ggplot(residual_map_data) +
  geom_sf(aes(fill = rf_residual), color = NA) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0,
                       limits = c(-cap, cap),
                       oob = scales::squish) +
  theme_void() +
  labs(title = "RF residuals — actual minus predicted % long commute",
       subtitle = "Red = model under-predicts | Blue = model over-predicts",
       fill = "Residual\n(pp)")

## MORAN'S I — is the geographic clustering statistically real?
tracts_for_moran <- nyc_tracts_clean %>%
  filter(GEOID %in% final_map_data_rf$GEOID) %>%
  arrange(match(GEOID, final_map_data_rf$GEOID))

neighbors_local <- poly2nb(tracts_for_moran)

## Warning in poly2nb(tracts_for_moran): some observations have no neighbours;
## if this seems unexpected, try increasing the snap argument.

## Warning in poly2nb(tracts_for_moran): neighbour object has 6 sub-graphs;
## if this sub-graph count seems unexpected, try increasing the snap argument.

weights_local   <- nb2listw(neighbors_local, style = "W", zero.policy = TRUE)

moran_test <- moran.test(final_map_data_rf$rf_residual,
                         weights_local,
                         zero.policy = TRUE,
                         na.action  = na.omit)

cat("Moran's I:   ", round(moran_test$estimate["Moran I statistic"], 3), "\n")

## Moran's I:    0.036

cat("Expected:    ", round(moran_test$estimate["Expectation"], 3),       "\n")

## Expected:     0

cat("p-value:     ", format.pval(moran_test$p.value),                     "\n")

## p-value:      0.0017458

How to interpret the map: Red = the model under-predicts (the real % long commute is higher than what the model expected). Blue = over-predicts. Patches of solid color = the model has a systematic geographic blind spot — a missing feature with spatial structure. A salt-and-pepper mix of red and blue = the model captures geography fine, residuals are just noise.

How to read Moran’s I:

I > 0 with low p: residuals cluster geographically — there’s a spatial pattern the model is missing.
I ≈ 0: residuals are spatially random — the model captures geographic structure well.
I < 0: residuals checkerboard (rare in real data) — would be unusual.

# Snapshot of the analysis state just before we added dist_to_manhattan_miles.
# Hardcoded because the prior model state isn't reproducible from current code.
hist_before <- list(
  morans_i             = 0.079,
  morans_i_p           = "1e-10",
  rmse_reduction_range = "0.3-0.4",
  spline_rf_gap_pp     = 1.4
)

The Moran’s I value above is after we added dist_to_manhattan_miles. Before that feature, Moran’s I was 0.079 (p < 1e-10), a much stronger spatial signal. Here’s what changed when we added distance to Grand Central as a CBD-proximity proxy. (This is a snapshot from before the spatial-lag work that came later.)

Cut Moran’s I roughly in half (0.079 -> 0.036)
Reduced cross-validated RMSE by 0.3-0.4 across the three model classes we had at that point (Linear, Linear + Spline, Random Forest)
Caused several previously-significant coefficients (borough effects for Brooklyn and Queens, uber_trips_per_capita, pop_density) to lose their inflated magnitudes. They had been silently proxying for Manhattan proximity.
Tightened spline-vs-RF IBX agreement to ~0.5pp (was ~1.4pp)

After that change, the pure linear baseline predicts ~0pp IBX relief (its coefficient on dist_subway_miles is -0.008), while the Linear + Spline and Random Forest models, which capture the saturating subway-access effect, predict around 7-8pp: Linear + Spline at 7.9pp and Random Forest at 7.4pp. Adding spatial lags on top of all this widened the relief picture further. See the CV table interpretation above and the IBX simulation diagnostics for the current numbers.

BART (Bayesian Additive Regression Trees)

Predicts the effect of an intervention (e.g., a new medicine) by combining hundreds of small, simple decision trees to model complex, non-linear relationships, even without explicit programming. It excels at estimating, “What would have happened to this specific person if they had taken the treatment?”.

#install.packages("bartCause")
library(bartCause)

## Warning: package 'bartCause' was built under R version 4.5.2

## 
## Attaching package: 'bartCause'

## The following object is masked from 'package:fabletools':
## 
##     refit

## The following object is masked from 'package:tidyr':
## 
##     extract

Treatment variable (0/1)

# add a binary 1/0 column noting which GEOID's where included in our IBX "treatment"
ml_data_scaled$ibx_treatment <- as.integer(ml_data_scaled$GEOID %in% ibx_geoid_list)

thinkCausal website:

# I was trying to use the thinkCausal website (https://apsta.shinyapps.io/thinkCausal/) to explore bartCause but it kept crashing when fitting our full model, but i was able to try using it with just a few variables at a time, some of the below sections were inspired from that
#write.csv(ml_data_scaled, "ml_data_scaled.csv")

# Drop ID column
BART_data <- ml_data_scaled[, !(names(ml_data_scaled) %in%
                                c("GEOID", "pct_commute_60p_logit"))]
BART_data <- na.omit(BART_data)

# Outcome
y <- BART_data$pct_commute_60p

# Treatment (must be numeric 0/1)
z <- as.numeric(BART_data$ibx_treatment)

# Covariates (confounders only)
# Keep x as a data.frame — borough is a factor, and as.matrix() would coerce
# the whole frame to character, which lands a NaN in bartc's sigma estimate.
x <- BART_data[, !(names(BART_data) %in% c("pct_commute_60p", "ibx_treatment"))]

Visualize the data: 1. overlap graph? (for ATE, ATT, and ATC). Overlap is a property of the data (treatment vs covariates)

ps_model <- glm(
  ibx_treatment ~ . - pct_commute_60p,
  data = BART_data,
  family = binomial
)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

ps <- predict(ps_model, type = "response")

Basic overlap plot (treated vs control)

hist(ps[z == 1], col = rgb(1, 0, 0, 0.5),
     xlim = c(0,1), main = "Propensity Score Overlap",
     xlab = "Propensity Score")

hist(ps[z == 0], col = rgb(0, 0, 1, 0.5), add = TRUE)

legend("topright", legend = c("Treated", "Control"),
       fill = c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)))

How to interpret: Red = treated tracts, blue = untreated. Good overlap (the two distributions both span the same range) is required for trustworthy causal estimates. Poor overlap (e.g., red cluster at high values, blue at low) means the BART model has to extrapolate — comparing treated tracts to controls that don’t really resemble them.

Weighting for ATE, ATT, ATC

w_ate <- ifelse(z == 1, 1/ps, 1/(1 - ps))
w_att <- ifelse(z == 1, 1, ps/(1 - ps))
w_atc <- ifelse(z == 1, (1 - ps)/ps, 1)

ATE plot: We want strong overlap across full range. Poor overlap implies ATE relies on extrapolation?

library(ggplot2)

BART_df <- data.frame(ps = ps, z = z, y = y, x=x)
BART_df <- na.omit(BART_df)

# ATE
ggplot(BART_df, aes(x = ps, fill = factor(z))) +
  geom_density(alpha = 0.4, aes(weight = w_ate)) +
  labs(title = "Overlap (ATE)", fill = "Treatment")

How to interpret: Same idea as the basic overlap plot, but reweighted for the ATE (effect on the average tract, treated or not). Tighter alignment between the two density curves = more trustworthy ATE estimate. Big gaps between curves at certain propensity values = those tracts have no good comparison group.

ATT plot: Do control units resemble treated units?

# ATT
ggplot(BART_df, aes(x = ps, fill = factor(z))) +
  geom_density(alpha = 0.4, aes(weight = w_att)) +
  labs(title = "Overlap (ATT)", fill = "Treatment")

How to interpret: Reweighted for the ATT — the effect on tracts that received treatment (the IBX corridor). Look for the control (blue) curve to match the shape of the treated (red) curve — that means we have similar-looking untreated tracts to compare each treated tract against.

ATC plot: Do treated units resemble controls?

# ATC
ggplot(BART_df, aes(x = ps, fill = factor(z))) +
  geom_density(alpha = 0.4, aes(weight = w_atc)) +
  labs(title = "Overlap (ATC)", fill = "Treatment")

How to interpret: Mirror image of the ATT plot — reweighted for tracts that did not receive treatment (everywhere except the IBX corridor). Good overlap here means we have treated tracts that resemble each control tract, so a hypothetical “if we treated this Bronx tract, what would happen?” estimate is well-grounded.

Balance graph, poitns represent the treatment group (for ATE, ATT, and ATC). Each point = a covariate. X-axis = standardized mean difference (SMD). Two sets of points: Before weighting and after weighting. |SMD| < 0.1 is generally considered “balanced”. Balance plots are still based on propensity-style weighting. They diagnose whether your causal identification assumption is plausible. If balance is poor, the BART ATE may still be biased, especially in regions with weak overlap. Should we be dropping any variables with weak overlap?

# loaded via pacman instead
#install.packages("cobalt")
#library(cobalt)

bal_ate <- bal.tab(x, treat = z, weights = w_ate, estimand = "ATE",un = TRUE,quick = TRUE)
bal_att <- bal.tab(x, treat = z, weights = w_att, estimand = "ATT",un = TRUE,quick = TRUE)
bal_atc <- bal.tab(x, treat = z, weights = w_atc, estimand = "ATC",un = TRUE,quick = TRUE)

Covariate Balance (ATE)

love.plot(bal_ate, threshold = 0.1, abs = TRUE,
          title = "Covariate Balance (ATE)",stars = "raw",stats = "mean.diffs")

How to interpret: Each dot is a covariate. X-axis = absolute standardized mean difference (SMD) between treated and untreated groups for that covariate. Dashed line at 0.1 marks the standard “balanced” threshold. Anything to the right of 0.1 means treated and control groups differ meaningfully on that variable — and the weighting hasn’t fully fixed it. Two columns of dots compare before-weighting vs after-weighting.

Covariate Balance (ATT)

love.plot(bal_att, threshold = 0.1, abs = TRUE,
          title = "Covariate Balance (ATT)", stars = "raw")

How to interpret: Same as the ATE balance plot but reweighted for the ATT estimand. This is the most relevant of the three for our project since the IBX claim is “what does treatment do for the treated tracts.” Anything past 0.1 here is a covariate where the comparison group doesn’t quite match the treated group — flag-able for the writeup.

Covariate Balance (ATC)

love.plot(bal_atc, threshold = 0.1, abs = TRUE,
          title = "Covariate Balance (ATC)",stars = "raw")

How to interpret: Same again, but reweighted for the ATC. Less directly relevant to our IBX question, but useful as a diagnostic — covariates that are imbalanced for ATC but not ATT tell us where the model is making bigger extrapolation leaps for the untreated tracts.

Fit our model: 1. Causal Estimand: do we want ATE, ATT, and ATC? We’re adding a new train line which is alittle different than just giving a new drug or something? I think we’re most interested in ATT to measure how effective our treatment was for our area where we’re adding it.

ATE (Average Treatment Effect): Measures the average impact of a treatment if everyone in the population was treated versus if nobody was treated. It answers: “What is the overall effect of the intervention?”.

ATT (Average Treatment Effect on the Treated): Measures the average impact of a treatment only for those who actually received it. It answers: “Did the treatment work for those who got it?”.

ATC / ATU (Average Treatment Effect on the Control/Untreated): Measures the impact of treatment on the subgroup that did not receive it. It answers: “What would have happened if those who didn’t get the treatment actually did?”

ATE BART Fit

# Exclude the treatment mechanism (dist_subway_miles and its spatial lag) from
# the BART confounders. Including them would condition on the very feature the
# IBX changes, which controls away the main causal pathway.
dist_cols    <- grep("dist_subway_miles", names(BART_df), value = TRUE)
exclude_cols <- c("y", "z", dist_cols)

set.seed(622)
bart_fit <- bartc(
  response    = BART_df$y,
  treatment   = BART_df$z,
  confounders = BART_df[, !(names(BART_df) %in% exclude_cols)],
  estimand    = "ate"
)

## fitting treatment model via method 'bart'
## fitting response model via method 'bart'

cat("BART refitted excluding from confounders:", paste(dist_cols, collapse = ", "), "\n")

## BART refitted excluding from confounders: x.dist_subway_miles, x.lag_dist_subway_miles

summary(bart_fit)

## Call: bartc(response = BART_df$y, treatment = BART_df$z, confounders = BART_df[, 
##             !(names(BART_df) %in% exclude_cols)], estimand = "ate")
## 
## Causal inference model fit by:
##   model.rsp: bart
##   model.trt: bart
## 
## Treatment effect (population average):
##     estimate     sd ci.lower ci.upper
## ate  -0.2973 0.7801   -1.826    1.232
## Estimates fit from 2220 total observations
## 95% credible interval calculated by: normal approximation
##   population TE approximated by: posterior predictive distribution
## Result based on 500 posterior samples times 10 chains

class(bart_fit)

## [1] "bartcFit"

Individual treatment effects

ite <- fitted(bart_fit, type = "icate")

ATT “Assuming Ignorability and SUTVA are satisfied, for participants in our study that received the treatment/new train line, receiving the new train line/treatment led to a decrease of n units (percent decrease in 60 min commutes?) compared to what would have happened had these participants not received the treatment.”:

att <- mean(ite[z == 1], na.rm = TRUE)
att

## [1] -0.1861364

ATT Credible interval:

att_ci <- quantile(ite[z == 1], c(0.025, 0.975), na.rm = TRUE)
att_ci

##        2.5%       97.5% 
## -0.49504799  0.04960185

Check: 1. ATT Distribution (Treated Units)

att <- mean(ite[z == 1], na.rm = TRUE)

hist(ite[z == 1],
     main = "ATT Distribution (Treated Units)",
     xlab = "Individual Treatment Effects")
abline(v = att, col = "red", lwd = 2)

How to interpret: Distribution of individual treatment effects (ITE) across the treated tracts. The red vertical line is the ATT (average across treated). Negative ITE values = treatment reduces long-commute % (i.e. helps). All-negative = treatment uniformly helps every treated tract, which is what we want to see.

Results:

Subgroup analysis: 1. treatment effect variation waterfall graph

Treatment Effect Heterogeneity (Waterfall graph)

ite_sorted <- as.numeric(ite)
ite_sorted <- sort(ite_sorted)

plot(ite_sorted, type = "l",
     col = "black",
     xlab = "Units (sorted)",
     ylab = "Treatment Effect",
     main = "Treatment Effect Heterogeneity (Waterfall graph)")
abline(h = 0, col = "red", lwd = 2)

How to interpret: Every tract’s ITE plotted in sorted order, smallest (most negative = biggest benefit) on the left, largest on the right. The red zero-line is the threshold — anything below it = treatment helps that tract. The slope tells you how heterogeneous the effects are: a steep curve = lots of variation in who benefits; a flat curve = effects are uniform.

predictors of variation (tree depth graph)

Individual Treatment Effect (ITE)

ite <- as.numeric(fitted(bart_fit, type = "icate"))

var_model <- lm(ite ~ ., data = as.data.frame(x))
summary(var_model)

## 
## Call:
## lm(formula = ite ~ ., data = as.data.frame(x))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.273472 -0.030732  0.007274  0.040740  0.196577 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -0.0750199  0.0027042 -27.742  < 2e-16 ***
## boroughBronx                -0.3327836  0.0050323 -66.129  < 2e-16 ***
## boroughManhattan            -0.2848545  0.0066419 -42.887  < 2e-16 ***
## boroughQueens               -0.3771035  0.0045216 -83.400  < 2e-16 ***
## boroughStaten Island        -0.3012093  0.0090001 -33.467  < 2e-16 ***
## stations_per_sq_mile        -0.0084048  0.0026402  -3.183 0.001476 ** 
## stations_per_1000_pop       -0.0064303  0.0037772  -1.702 0.088829 .  
## dist_subway_miles           -0.0052244  0.0089753  -0.582 0.560570    
## dist_to_manhattan_miles      0.0274150  0.0243614   1.125 0.260566    
## uber_trips_per_capita        0.0006526  0.0038519   0.169 0.865478    
## bike_trips_per_capita        0.0046746  0.0038586   1.211 0.225844    
## med_income                   0.0105479  0.0030725   3.433 0.000608 ***
## pct_poverty                 -0.0012517  0.0024499  -0.511 0.609452    
## pct_vacant                   0.0023991  0.0018014   1.332 0.183068    
## pct_renter                  -0.0028323  0.0034743  -0.815 0.415038    
## pct_2plus_cars               0.0078505  0.0034746   2.259 0.023957 *  
## pop_density                  0.0039865  0.0034012   1.172 0.241293    
## pct_no_car                   0.0097711  0.0045761   2.135 0.032854 *  
## pct_transit                  0.0231793  0.0030510   7.597 4.45e-14 ***
## pct_bachelors               -0.0033945  0.0025973  -1.307 0.191379    
## pct_wfh                     -0.0009672  0.0023507  -0.411 0.680786    
## pct_walk                    -0.0064697  0.0025093  -2.578 0.009992 ** 
## pct_age_18_24                0.0137972  0.0018174   7.592 4.64e-14 ***
## pct_age_25_34                0.0239213  0.0024834   9.632  < 2e-16 ***
## pct_age_35_44                0.0166713  0.0021515   7.749 1.41e-14 ***
## pct_age_45_64                0.0150208  0.0020725   7.248 5.85e-13 ***
## pct_age_65_plus              0.0132189  0.0023138   5.713 1.26e-08 ***
## lag_dist_to_manhattan_miles -0.0250021  0.0246521  -1.014 0.310600    
## lag_dist_subway_miles        0.0094854  0.0094393   1.005 0.315066    
## lag_pop_density              0.0084221  0.0040094   2.101 0.035793 *  
## lag_pct_transit              0.0312416  0.0039180   7.974 2.45e-15 ***
## lag_pct_wfh                  0.0110938  0.0038888   2.853 0.004375 ** 
## lag_med_income               0.0225664  0.0037486   6.020 2.04e-09 ***
## lag_stations_per_1000_pop    0.0010901  0.0035368   0.308 0.757954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06706 on 2186 degrees of freedom
## Multiple R-squared:  0.8801, Adjusted R-squared:  0.8782 
## F-statistic: 486.1 on 33 and 2186 DF,  p-value: < 2.2e-16

I guess this is expected, we put our new treatment line in queens, then ofc the biggest predictor of reduced commute from our new line in queens… is that the district is in queens…

coef_df <- data.frame(
  variable = names(coef(var_model))[-1],
  effect = coef(var_model)[-1]
)

ggplot(coef_df, aes(x = reorder(variable, effect), y = effect)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Predictors of Treatment Effect Variation",
       x = "Covariates",
       y = "Effect on ITE")

How to interpret: Each bar is a feature’s coefficient when regressing BART’s ITE on covariates across all NYC tracts. Negative coefficient = bigger benefit (more negative ITE) per unit of that feature. Positive = smaller benefit. Borough bars dominate because they capture geography that the continuous predictors don’t. Coefficients for boroughs the IBX doesn’t actually serve (Bronx, Manhattan, Staten Island) reflect BART’s extrapolation of “what would happen if we did treat them” — interpret cautiously.

TREATED TRACTS ONLY: Among the treated units only, which covariates explain variation in their treatment effects?

Individual Treatment Effect (ITE) for just treated, and X covariates just for treated:

x_treated <- x[z == 1, ]
ite_treated <- ite[z == 1]

var_model_treated <- lm(ite_treated ~ ., data = as.data.frame(x_treated))

coef_df <- data.frame(
  variable = names(coef(var_model_treated))[-1],
  effect = coef(var_model_treated)[-1]
)

Still Queens is Queen. But Bike trips per capita is up there too now! More stations per 1000 pop predicting a higher expected commute after treatment is confusing to me? Same with no car percent? Positive coefficient increases treatment effect among treated. Negative coefficient decreases treatment effect among treated.

ggplot(coef_df, aes(x = reorder(variable, effect), y = effect)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Predictors of Treatment Effect Variation (Treated Only)",
       x = "Covariates",
       y = "Effect on ITE")

How to interpret: Same as the previous bar plot but limited to actually-treated tracts — no extrapolation. More defensible than the all-tracts version. Negative bar = feature is associated with bigger benefit. The boroughQueens bar (negative) means Queens treated tracts benefit more than Brooklyn treated tracts (Brooklyn is the reference). The continuous predictors are mostly small here — borough explains most of the heterogeneity.

Exploratory subgroup analysis (histograms): explore treatment effect heterogeneity across subgroups

Why are there MORE expected 60+ min commutes in brooklyn post treatment??? More stations should just reduce expected commute times, i dont understand, that number should fall. I think we’ must’ve messed something up using our tracts geographic center for predicting commute times?

ggplot(x_treated, aes(x = ite_treated)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~ borough) +
  labs(title = "Treatment Effect Distribution by Borough (Treated Only)",
       x = "Treatment Effect",
       y = "Count") +
  theme_minimal()

How to interpret: ITE distribution among treated tracts, faceted by borough. All values being negative = treatment helps every treated tract in that borough (good). Boroughs whose histograms sit further left (more negative) benefit more on average. Width of the histogram = how heterogeneous treatment effects are within that borough.

ggplot(x_treated, aes(x = ite_treated, color = borough)) +
  geom_density(size = 1) +
  labs(title = "Treatment Effect Density by Borough",
       x = "ITE",
       y = "Density") +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

How to interpret: Smoothed version of the histograms above, easier to compare across boroughs at a glance. The further left a borough’s curve sits, the bigger the average benefit. Tight curves = uniform effects within that borough; wide curves = heterogeneous effects (some tracts benefit a lot, others less).

Tree-based subgroup discovery for just treatment group only: Which splits in covariates create groups with different treatment effects?

library("rpart.plot")

## Warning: package 'rpart.plot' was built under R version 4.5.2

## Loading required package: rpart

library(rpart)

tree_model <- rpart(ite_treated ~ ., data = x_treated, method = "anova", control = rpart.control(cp = 0.001, minbucket = 10))

rpart.plot(tree_model, main = "Subgroups Driving Treatment Effect Heterogeneity")

How to interpret: A regression tree on ITE for treated tracts only. Each split is the question that best separates “tracts with bigger benefit” from “tracts with smaller benefit.” Each terminal node shows the predicted ITE for tracts that follow that path (the % at the bottom shows what share of tracts ended up there). Read top-to-bottom along a branch to identify “tracts with feature X beyond Y benefit Z amount.”

# Append BART's individual treatment effect (ITE) per tract to the map data so
# the Shiny app can visualize the causal estimates alongside the predictive ones.
# ITE rows match ml_data_scaled row order; we map back to GEOID for a safe join.
bart_ite_df <- data.frame(
  GEOID    = ml_data_scaled$GEOID,
  bart_ite = as.numeric(fitted(bart_fit, type = "icate"))
)

final_map_data_rf <- final_map_data_rf %>%
  select(-any_of(c("bart_ite", "bart_ite_treated_only"))) %>%
  left_join(bart_ite_df, by = "GEOID") %>%
  mutate(
    bart_ite_treated_only = ifelse(is_ibx_location == 1, bart_ite, NA_real_)
  )

saveRDS(final_map_data_rf, file.path(projPath, "data", "model_input.rds"))

Summary of Findings

The question we wanted to answer was: how would the new Interborough Express (IBX) subway line change commute times for the NYC neighborhoods it would serve? Our best model says the IBX would cut the share of 60+ minute commutes by up to ~9.3pp in tracts within walking distance of a new station, plus measurable benefits spilling over to 113 nearby tracts that don’t have a station of their own. The benefit isn’t even across the city. Queens neighborhoods gain the most on average, in both our predictive simulation and our causal estimates. Brooklyn tracts gain too, just less.

Predictive modeling. We trained six versions of a model that predicts the share of long commutes in each tract. Three core types (Linear Regression, Linear + Spline, Random Forest), each tested with and without “spatial-lag” features. Spatial lags are just a way of letting each tract’s model also see its neighbors’ values. 10-fold cross-validation picked Random Forest + Spatial Lags as the best at 7.14 RMSE, with plain Random Forest close behind at 7.24 and statistically tied within 1 SE. The neighbor-aware features helped every type. Basically, what’s happening around a tract carries real information about that tract’s outcomes. To run the IBX simulation, we set each treated tract’s distance to the subway to the city’s smallest value (the closest any NYC tract is to a station) and re-ran the predictions. The spatial-lag versions go one step further. After the distance changes for treated tracts, we re-compute the neighbor-averages too, which is how the spillover effect shows up.

Causal modeling. Separately from the predictive modeling, we used a method called bartCause to estimate what each tract’s commute outcome would be if it got an IBX station of its own. One important fix happened along the way. An earlier version of the model included dist_subway_miles as a confounder, but that variable IS the mechanism the IBX changes. Controlling for it removed the effect we were trying to measure. Excluding it (and its spatial lag) from the BART confounders gave us a causal estimate that agrees with the predictive simulation in direction. Magnitudes are smaller than the predictive ones (BART’s max benefit is around 0.5pp versus the predictive simulation’s 9pp max relief), which is expected. BART estimates an average causal effect with Bayesian shrinkage toward zero, while the predictive model reports max relief under an extreme scenario. Queens treated tracts get the biggest causal benefit (mean -0.46pp). Brooklyn benefits too but less (mean -0.09pp). 49 Brooklyn tracts show tiny positive ITE that we treat as noise. The model also produces counterfactual estimates for boroughs the IBX doesn’t actually serve (Bronx, Manhattan, Staten Island), but we treat those with caution since BART is extrapolating into territory it didn’t see in the data.

Limitations.

We treat a tract as “treated by IBX” if any part of it falls within a 0.5-mile walk of any new station. That’s too generous. A tract barely touching a walk-zone gets credit for the same treatment as one sitting right on top of a station. A graded version would be more honest.
We simulate the IBX by setting each treated tract’s distance to subway to the smallest value in the city. This is a rough proxy. It doesn’t capture things like how often the IBX runs, how many lines connect at a station, or whether a station is a major transfer point.
The IBX station addresses are geocoded through OpenStreetMap, which doesn’t always return every address on every run. The count of geocoded stations (and so the count of treated tracts) bounces around a bit between runs.
The predictive simulation gives a single point estimate of IBX relief without uncertainty bounds. The causal (BART) side does include posterior credible intervals, so we have honest uncertainty there at least.

Future work.

Add XGBoost as a fourth model family for the predictive comparison.
Add SHAP values for our best model (Random Forest + Spatial Lags) so we can explain individual tract predictions.
Try a graded version of the IBX treatment (partial credit for partial walk-zone coverage) and see how much that shifts the answer.
Add a convergence-check plot for the BART model.
Refit the BART propensity model without dist_subway_miles (the propensity formula still includes it, which carries the mediator effect partly through the propensity score).

Misc. Notes Dump

Spatial lag features — include averages of neighboring tract values to account for spatial autocorrelation; used in similar transit research and directly relevant to the IBX corridor analysis

K-fold cross-validation — more robust evaluation than a single 80/20 split given our dataset size

George: it might be good to have a “proximity to Manhattan” feature, otherwise the coefficients that are going to be learned for other coefficients will partially be capturing the geographic structure.

George: Second, you may want think about whether linear relationships are appropriate for all variables, particularly those which have outlier values (for example the citibike share or uber/taxi share). I also suspect the distance from subway relationship is nonlinear, like the difference from 20 minutes and 25 minutes walk to the subway might not matter. This is important because if you use a linear model, the relationship at short walk times will be constrained by the data for long walk times.

George: Try the the bartCause R package. It has built in a lot of model checks that determine how much you can truly say about causal effects given the data. read about it more here: https://apsta.shinyapps.io/thinkCausal/. IMO it is better than the CausalForest stuff from the EconML package, but that is my subjective opinion.

George: In the predicted 60m commute time, note the presence of negative values. I highly recommend applying a transformation to take the interval from 0-1 to the entire number line, i.e. a log transformation or a sigmoid function of some type. This would prevent the negative number issue.

BART mediator-as-confounder fix: An earlier version of bart_fit included dist_subway_miles and lag_dist_subway_miles in the confounders, which IS the variable the IBX changes. Controlling for it removed the causal pathway we were trying to measure. The original BART estimates came back with tiny positive ITE for most treated tracts (all 148 Brooklyn treated tracts positive, mean +0.5pp), which contradicted the predictive simulation. We refit BART without those two columns. The new estimates agree with the predictive simulation in direction (treatment helps most treated tracts) but with smaller magnitudes due to BART’s Bayesian shrinkage toward zero.

DATA 622 Project

Catherine Dube, Guillermo Schneider, Tai Chou-Kudu