Looking at census data from 2022.
Picked 2022 because:
Another option would be to switch to the most recent census data available, but it’s worth noting that that won’t avoid the COVID data being included in the 5 -year estimate.
#CENSUS_API_KEY is in environment
nyc_counties <- c("New York", "Kings", "Queens", "Bronx", "Richmond")
# Entire 2022 5-Year ACS variable dictionary
acs_vars_2022 <- load_variables(2022, "acs5", cache = TRUE)
# MISC.
# Exploring different potential features
feature_exploration <- acs_vars_2022 %>%
filter(
str_detect(str_to_lower(label), "public transportation|bicycle|walked|worked from home") |
str_detect(str_to_lower(concept), "means of transportation to work|travel time")
)
# Comment Me out before final knit
#View(feature_exploration)
# hiding the results of this chunk becuase it jsut is unreadable in the knitted document
# Census variables
vars <- c(
# Base Demographics
no_car = "B08201_002",
med_income = "B19013_001",
med_age = "B01002_001",
total_pop = "B01003_001",
# Transportation Base & Features
tot_workers = "B08301_001",
transit = "B08301_010",
walk = "B08301_019",
wfh = "B08301_021",
# Travel Time Features
# Potential target variable
commute_60 = "B08303_012",
commute_90 = "B08303_013",
# Education Base & Features
tot_pop_25 = "B15003_001",
bachelors = "B15003_022",
# Occupation Base & The 5 Major Buckets
tot_employ = "C24010_001",
mngmt_job = "C24010_002",
service_job = "C24010_014",
sales_job = "C24010_021",
const_job = "C24010_030",
trans_job = "C24010_034",
# Continuous Age Spectrum (Table B01001 - Detailed Age)
# 18 to 24
m_18_19="B01001_007", m_20="B01001_008", m_21="B01001_009", m_22_24="B01001_010",
f_18_19="B01001_031", f_20="B01001_032", f_21="B01001_033", f_22_24="B01001_034",
# 25 to 34
m_25_29="B01001_011", m_30_34="B01001_012", f_25_29="B01001_035", f_30_34="B01001_036",
# 35 to 44
m_35_39="B01001_013", m_40_44="B01001_014", f_35_39="B01001_037", f_40_44="B01001_038",
# 45 to 64
m_45_49="B01001_015", m_50_54="B01001_016", m_55_59="B01001_017", m_60_61="B01001_018", m_62_64="B01001_019",
f_45_49="B01001_039", f_50_54="B01001_040", f_55_59="B01001_041", f_60_61="B01001_042", f_62_64="B01001_043",
# 65 Plus
m_65_66="B01001_020", m_67_69="B01001_021", m_70_74="B01001_022", m_75_79="B01001_023", m_80_84="B01001_024", m_85_up="B01001_025",
f_65_66="B01001_044", f_67_69="B01001_045", f_70_74="B01001_046", f_75_79="B01001_047", f_80_84="B01001_048", f_85_up="B01001_049",
# Commercial vs. Residential Proxy (Housing)
tot_housing = "B25002_001",
vacant_unit = "B25002_003",
tot_occup = "B25003_001",
renter_unit = "B25003_003",
# Economic Barrier Spectrum (Poverty)
tot_pov_base = "B17001_001",
pov_below = "B17001_002"
)
# Getting from census data
nyc_tracts <- get_acs(
geography = "tract",
variables = vars,
state = "NY",
county = nyc_counties,
geometry = TRUE,
year = 2022,
output = "wide"
)
This chunk does an explicit selection of columns from the census data to start with. (There are just so many features in that data that clearly selecting and renaming when applicable seemed like a cleaner route.)
After that this chunk is doing some feature engineering on the census data. Mostly turning raw count data into percent of population data.
pct_commute_60p is currently being used as the target
variable. It’s the count of people with commutes over 60 min + the count
of people with commutes over 90 minutes. Divided by total number of
workers. x 100 to make is a percentage.pct_no_car is percent of people who do not have a
carpct_transit is percent of workers using public transit
(mta, buses not uber/taxi) as primary commute
methospct_walk is percent of workers using walking as primary
commute methodpct_wfhis percent of workers working from homepct_bachelors is the percent of 25+ year olds with
Bachelor’s Degrees or higherpct_mngmt is percent of workers in management,
business, science, and artspct_service is percent of workers inpct_sales is percent of workers in sales and office
administrationpct_const is percent of workers in natural resources,
construction, and maintenancepct_trans is percent of workers in production,
transportation, and material movingarea_sq_miles is the area in square milespop_densityis the numebr of people per square milepct_age_18_24 is the percent of the population in that
age rangepct_age_25_34 is the percent of the population in that
age rangepct_age_35_44 is the percent of the population in that
age rangepct_age_45_64 is the percent of the population in that
age rangepct_age_65_plus is the percent of the population in
that age rangepct_vacant is the percent of housing that is
vacantpct_renter is the percent of occupued units that are
being rentedpct_poverty is the percent of people living below the
poverty line out of the total population for whom poverty status is
determinedNot all of the engineered features are used in the later model.
nyc_tracts <- nyc_tracts %>%
# Explicitly rename every variable
select(
GEOID, NAME, geometry,
# Base Demographics
no_car = no_carE,
med_income = med_incomeE,
med_age = med_ageE,
total_pop = total_popE,
tot_workers = tot_workersE,
tot_pop_25 = tot_pop_25E,
tot_employ = tot_employE,
# Transportation & Travel Time
transit = transitE,
walk = walkE,
wfh = wfhE,
commute_60 = commute_60E,
commute_90 = commute_90E,
# Education & Occupation
bachelors = bachelorsE,
mngmt_job = mngmt_jobE,
service_job = service_jobE,
sales_job = sales_jobE,
const_job = const_jobE,
trans_job = trans_jobE,
# Age - Male
m_18_19 = m_18_19E, m_20 = m_20E, m_21 = m_21E, m_22_24 = m_22_24E,
m_25_29 = m_25_29E, m_30_34 = m_30_34E,
m_35_39 = m_35_39E, m_40_44 = m_40_44E,
m_45_49 = m_45_49E, m_50_54 = m_50_54E, m_55_59 = m_55_59E, m_60_61 = m_60_61E, m_62_64 = m_62_64E,
m_65_66 = m_65_66E, m_67_69 = m_67_69E, m_70_74 = m_70_74E, m_75_79 = m_75_79E, m_80_84 = m_80_84E, m_85_up = m_85_upE,
# Age - Female
f_18_19 = f_18_19E, f_20 = f_20E, f_21 = f_21E, f_22_24 = f_22_24E,
f_25_29 = f_25_29E, f_30_34 = f_30_34E,
f_35_39 = f_35_39E, f_40_44 = f_40_44E,
f_45_49 = f_45_49E, f_50_54 = f_50_54E, f_55_59 = f_55_59E, f_60_61 = f_60_61E, f_62_64 = f_62_64E,
f_65_66 = f_65_66E, f_67_69 = f_67_69E, f_70_74 = f_70_74E, f_75_79 = f_75_79E, f_80_84 = f_80_84E, f_85_up = f_85_upE,
# Housing & Poverty
tot_housing = tot_housingE,
vacant_unit = vacant_unitE,
tot_occup = tot_occupE,
renter_unit = renter_unitE,
tot_pov_base = tot_pov_baseE,
pov_below = pov_belowE
) %>%
# Feature Engineering: Convert raw counts to percentages
mutate(
pct_no_car = (no_car / total_pop) * 100,
pct_transit = (transit / tot_workers) * 100,
pct_walk = (walk / tot_workers) * 100,
pct_wfh = (wfh / tot_workers) * 100,
pct_commute_60p = ((commute_60 + commute_90) / tot_workers) * 100,
pct_bachelors = (bachelors / tot_pop_25) * 100,
pct_mngmt = (mngmt_job / tot_employ) * 100,
pct_service = (service_job / tot_employ) * 100,
pct_sales = (sales_job / tot_employ) * 100,
pct_const = (const_job / tot_employ) * 100,
pct_trans = (trans_job / tot_employ) * 100,
area_sq_miles = as.numeric(st_area(geometry)) / 2589988,
pop_density = total_pop / area_sq_miles,
# Continuous Age Spectrum
pct_age_18_24 = ((m_18_19 + m_20 + m_21 + m_22_24 + f_18_19 + f_20 + f_21 + f_22_24) / total_pop) * 100,
pct_age_25_34 = ((m_25_29 + m_30_34 + f_25_29 + f_30_34) / total_pop) * 100,
pct_age_35_44 = ((m_35_39 + m_40_44 + f_35_39 + f_40_44) / total_pop) * 100,
pct_age_45_64 = ((m_45_49 + m_50_54 + m_55_59 + m_60_61 + m_62_64 + f_45_49 + f_50_54 + f_55_59 + f_60_61 + f_62_64) / total_pop) * 100,
pct_age_65_plus = ((m_65_66 + m_67_69 + m_70_74 + m_75_79 + m_80_84 + m_85_up + f_65_66 + f_67_69 + f_70_74 + f_75_79 + f_80_84 + f_85_up) / total_pop) * 100,
# Commercial/Residential Proxy
pct_vacant = (vacant_unit / tot_housing) * 100,
pct_renter = (renter_unit / tot_occup) * 100,
# Economic Barrier
pct_poverty = (pov_below / tot_pov_base) * 100
) %>%
# Explicitly drop the raw counts so they don't leak into the model
select(
-no_car, -transit, -walk, -wfh, -commute_60, -commute_90, -bachelors,
-mngmt_job, -service_job, -sales_job, -const_job, -trans_job,
-tot_workers, -tot_pop_25, -tot_employ,
-m_18_19, -m_20, -m_21, -m_22_24, -m_25_29, -m_30_34, -m_35_39, -m_40_44,
-m_45_49, -m_50_54, -m_55_59, -m_60_61, -m_62_64, -m_65_66, -m_67_69, -m_70_74, -m_75_79, -m_80_84, -m_85_up,
-f_18_19, -f_20, -f_21, -f_22_24, -f_25_29, -f_30_34, -f_35_39, -f_40_44,
-f_45_49, -f_50_54, -f_55_59, -f_60_61, -f_62_64, -f_65_66, -f_67_69, -f_70_74, -f_75_79, -f_80_84, -f_85_up,
-tot_housing, -vacant_unit, -tot_occup, -renter_unit, -tot_pov_base, -pov_below
)
## Checking for total NAs and NaNs in each column
colSums(is.na(nyc_tracts %>% st_drop_geometry())) %>%
kable(caption = "# of NAs and NaNs in each column")
| x | |
|---|---|
| GEOID | 0 |
| NAME | 0 |
| med_income | 126 |
| med_age | 88 |
| total_pop | 0 |
| pct_no_car | 85 |
| pct_transit | 90 |
| pct_walk | 90 |
| pct_wfh | 90 |
| pct_commute_60p | 90 |
| pct_bachelors | 85 |
| pct_mngmt | 90 |
| pct_service | 90 |
| pct_sales | 90 |
| pct_const | 90 |
| pct_trans | 90 |
| area_sq_miles | 0 |
| pop_density | 3 |
| pct_age_18_24 | 85 |
| pct_age_25_34 | 85 |
| pct_age_35_44 | 85 |
| pct_age_45_64 | 85 |
| pct_age_65_plus | 85 |
| pct_vacant | 93 |
| pct_renter | 96 |
| pct_poverty | 90 |
## Percentage of missing data
# (is it a low enough percent to be negligible)
missing_summary <- nyc_tracts %>%
st_drop_geometry() %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Missing_Count") %>%
mutate(
Total_Rows = nrow(nyc_tracts),
Pct_Missing = (Missing_Count / Total_Rows) * 100,
# Flag if the missing data is below 5%
Is_Negligible_Under_5Pct = Pct_Missing < 5
) %>%
# Filter to only show features that actually have missing data
filter(Missing_Count > 0) %>%
arrange(desc(Pct_Missing))
# Output Percentage of Missing Data
kable(missing_summary, digits = 2, caption = "Missing Data Percentage by Feature")
| Feature | Missing_Count | Total_Rows | Pct_Missing | Is_Negligible_Under_5Pct |
|---|---|---|---|---|
| med_income | 126 | 2327 | 5.41 | FALSE |
| pct_renter | 96 | 2327 | 4.13 | TRUE |
| pct_vacant | 93 | 2327 | 4.00 | TRUE |
| pct_transit | 90 | 2327 | 3.87 | TRUE |
| pct_walk | 90 | 2327 | 3.87 | TRUE |
| pct_wfh | 90 | 2327 | 3.87 | TRUE |
| pct_commute_60p | 90 | 2327 | 3.87 | TRUE |
| pct_mngmt | 90 | 2327 | 3.87 | TRUE |
| pct_service | 90 | 2327 | 3.87 | TRUE |
| pct_sales | 90 | 2327 | 3.87 | TRUE |
| pct_const | 90 | 2327 | 3.87 | TRUE |
| pct_trans | 90 | 2327 | 3.87 | TRUE |
| pct_poverty | 90 | 2327 | 3.87 | TRUE |
| med_age | 88 | 2327 | 3.78 | TRUE |
| pct_no_car | 85 | 2327 | 3.65 | TRUE |
| pct_bachelors | 85 | 2327 | 3.65 | TRUE |
| pct_age_18_24 | 85 | 2327 | 3.65 | TRUE |
| pct_age_25_34 | 85 | 2327 | 3.65 | TRUE |
| pct_age_35_44 | 85 | 2327 | 3.65 | TRUE |
| pct_age_45_64 | 85 | 2327 | 3.65 | TRUE |
| pct_age_65_plus | 85 | 2327 | 3.65 | TRUE |
| pop_density | 3 | 2327 | 0.13 | TRUE |
## Map for missing income data
missing_map_data <- nyc_tracts %>%
mutate(missing_income_flag = ifelse(is.na(med_income), "Missing Data", "Valid Data"))
ggplot(missing_map_data) +
geom_sf(aes(fill = missing_income_flag), color = "white", size = 0.1) +
scale_fill_manual(values = c("Valid Data" = "#e0e0e0", "Missing Data" = "#d73027")) +
theme_void() +
labs(title = "Map of Missing Census Income Data",
fill = "Status")
How to interpret: Red tracts are missing the median
income field; gray tracts have valid data. If the red dots cluster (same
borough, same neighborhood), any model relying on
med_income will be geographically biased — we’d be making
predictions in those areas without real data. Scattered red is much less
harmful.
This chunk is currently dropping census tracts that have very few people or no workers.
It’s also imputing where income or age was missing using the median value. This was just a simple implementation, but we can definitely consider more robust imputing methos, and also properly investingating MAR, MNAR etc. (Though given the nature of census data, MNAR is a good assumption.)
## Drop tracts with less than 50 people or 0 workers
# There were a ton of tracts with a small number of or 0 people, screwing with denominators and also making for super volatile percentages
nyc_tracts_clean <- nyc_tracts %>%
filter(total_pop >= 50, !is.na(pct_transit))
## Imputing
# There were also many tracts where income or age was missing, but it wasn't due to a denominator issue
# Just doing simple median imputation
# More robust method would be better
# Checking for MNAR and stuff would be better
# The current median imputation:
nyc_tracts_clean <- nyc_tracts_clean %>%
# Extract the borough from the NAME string (e.g., "Queens County")
mutate(borough = str_extract(NAME, "(?<=; ).*(?= County)")) %>%
# Group by Borough so the imputation is localized
group_by(borough) %>%
mutate(
med_income = ifelse(is.na(med_income),
median(med_income, na.rm = TRUE),
med_income),
med_age = ifelse(is.na(med_age),
median(med_age, na.rm = TRUE),
med_age)
) %>%
ungroup() %>%
# Drop the temporary borough column
select(-borough)
print(paste("Original tracts:", nrow(nyc_tracts)))
## [1] "Original tracts: 2327"
print(paste("Cleaned tracts of NAs:", nrow(nyc_tracts_clean)))
## [1] "Cleaned tracts of NAs: 2226"
# Save
census_save_path <- file.path(projPath, "data", "nyc_census_tracts.rds")
saveRDS(nyc_tracts_clean, census_save_path)
print(paste("Census features and geometry saved to:", census_save_path))
## [1] "Census features and geometry saved to: /Users/catherinedube/src/DATA_622_Project/data/nyc_census_tracts.rds"
# Peak
head(nyc_tracts_clean)
## Simple feature collection with 6 features and 26 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -73.91641 ymin: 40.82439 xmax: -73.84726 ymax: 40.90264
## Geodetic CRS: NAD83
## # A tibble: 6 × 27
## GEOID NAME med_income med_age total_pop pct_no_car pct_transit pct_walk
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 36005013500 Cens… 27602 36.5 3125 31.8 48.6 14.5
## 2 36005009200 Cens… 56208 33.9 5959 14.1 38.6 8.07
## 3 36005005400 Cens… 48750 30.5 5715 20.6 64.7 6.20
## 4 36005036501 Cens… 41287 31.8 4196 23.1 60.3 6.61
## 5 36005044902 Cens… 88264 32.4 2146 12.1 37.8 2.43
## 6 36005017500 Cens… 26463 33.9 6116 39.5 69.1 8.69
## # ℹ 19 more variables: pct_wfh <dbl>, pct_commute_60p <dbl>,
## # pct_bachelors <dbl>, pct_mngmt <dbl>, pct_service <dbl>, pct_sales <dbl>,
## # pct_const <dbl>, pct_trans <dbl>, area_sq_miles <dbl>, pop_density <dbl>,
## # pct_age_18_24 <dbl>, pct_age_25_34 <dbl>, pct_age_35_44 <dbl>,
## # pct_age_45_64 <dbl>, pct_age_65_plus <dbl>, pct_vacant <dbl>,
## # pct_renter <dbl>, pct_poverty <dbl>, geometry <MULTIPOLYGON [°]>
Adding car ownership as a feature — flagged by our professor as likely the strongest missing predictor and a key source of omitted-variable bias. Pulling just this table separately to avoid re-running the full cached census pull.
pct_2plus_cars captures car-dependent households (2+
vehicles) which is more informative than the binary
pct_no_car already in the model.
# Pull B25044: Tenure by Vehicles Available (2022 ACS 5-Year)
vehicle_vars <- c(
tot_occup_veh = "B25044_001", # Total occupied housing units
own_2car = "B25044_005", # Owner occupied, 2 vehicles
own_3car = "B25044_006", # Owner occupied, 3 vehicles
own_4plus_car = "B25044_007", # Owner occupied, 4+ vehicles
rent_2car = "B25044_011", # Renter occupied, 2 vehicles
rent_3car = "B25044_012", # Renter occupied, 3 vehicles
rent_4plus_car = "B25044_013" # Renter occupied, 4+ vehicles
)
nyc_vehicle_raw <- get_acs(
geography = "tract",
variables = vehicle_vars,
state = "NY",
county = nyc_counties,
year = 2022,
output = "wide"
)
## Getting data from the 2018-2022 5-year ACS
# Engineer pct_2plus_cars: % of occupied households with 2 or more vehicles
nyc_vehicle <- nyc_vehicle_raw %>%
transmute(
GEOID,
pct_2plus_cars = ((own_2carE + own_3carE + own_4plus_carE +
rent_2carE + rent_3carE + rent_4plus_carE) / tot_occup_vehE) * 100
)
# Join onto existing census tracts and re-save
nyc_tracts_clean <- readRDS(file.path(projPath, "data", "nyc_census_tracts.rds")) %>%
left_join(nyc_vehicle, by = "GEOID") %>%
mutate(pct_2plus_cars = replace_na(pct_2plus_cars, 0))
saveRDS(nyc_tracts_clean, file.path(projPath, "data", "nyc_census_tracts.rds"))
print("Vehicle availability feature added to nyc_census_tracts.rds")
## [1] "Vehicle availability feature added to nyc_census_tracts.rds"
The taxi zone data is luckily pretty straightforward. It’s just being imported here.
#Zip file from below url is downloaded and extracted into repo
#taxi_zones_url <- "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip"
# Read the shapefile
taxi_zones_sf <- st_read(file.path(projPath, 'taxi_zones', 'taxi_zones', "taxi_zones.shp")) %>%
st_transform(4326) # Transform to standard Lat/Long (WGS84)
## Reading layer `taxi_zones' from data source
## `/Users/catherinedube/src/DATA_622_Project/taxi_zones/taxi_zones/taxi_zones.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
# Lookup table with names and boroughs is downloaded into repo
#zone_lookup_url <- "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv"
zone_lookup <- read_csv(file.path(projPath, 'taxi_zone_lookup.csv'))
## Rows: 265 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Borough, Zone, service_zone
## dbl (1): LocationID
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Join
taxi_zones <- taxi_zones_sf %>%
left_join(zone_lookup, by = c("LocationID" = "LocationID"))
# Peak: Does this map look right
plot(st_geometry(taxi_zones))
How to interpret: Sanity check that the 263 NYC taxi zones loaded and overlay correctly. We’re looking for a recognizable NYC silhouette — five boroughs visible, no obvious gaps or holes. If the shape looks broken, the shapefile didn’t load right.
TLC Data is from the NYC Taxi and Limousine Commission.
High volume for hire vehicles = Uber/Lyft.
## Aggregate all vehicular trips for October 2023
# Combine Yellow, Green, FHV, and High-Volume FHV (Uber/Lyft)
# See "all" the car-based 'Last-Mile' choices
# URL definitions for source data
tlc_urls <- c(
yellow = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-10.parquet",
green = "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2023-10.parquet",
#fhv = "https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_2023-10.parquet",
fhvhv = "https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-10.parquet"
)
# Function to read and count trips by Taxi Zone
# PULocationID can be used to get the # of trips that started from (were picked up in) a zone
count_zone_trips <- function(url) {
read_parquet(url, col_select = any_of(c("PULocationID"))) %>%
count(PULocationID, name = "trips") %>%
filter(!is.na(PULocationID))
}
# Process the sources and sum them up
# Getting number of trips that started per pick up location
# This 'map_dfr' loop downloads each, counts by zone, and stacks them.
total_zone_counts <- tlc_urls %>%
map_dfr(count_zone_trips) %>%
group_by(PULocationID) %>%
summarize(total_vehicular_trips = sum(trips))
So Census Tracts are not the same as Taxi Zones. This chunk tries to handle it.
EPSG:2263 is apparently a specific map projection built just for the New York/Long Island area. In it, 1 unit equals exactly 1 physical foot. (I had not ever used this before.) The idea is that if the censues tracts and the taxi zones are both transformed to EPSG:2263, then this will be the most precise way to layer these maps on top of one another and see where they intersect. (At least it appears to be more precise than using latitude and longitude.)
st_intersection() lays the two maps on top of one
another and separated out the overlapping pieces properly.
# Coordinate Reference System (CRS)
# EPSG:2263 (NAD83 / New York Long Island) uses feet
nyc_tracts_proj <- st_transform(nyc_tracts, 2263)
taxi_zones_proj <- st_transform(taxi_zones, 2263)
# Intersect regions
# This creates a new spacial dataframe where tracts and zones overlap
intersection <- st_intersection(nyc_tracts_proj, taxi_zones_proj) %>%
mutate(intersect_area = st_area(.))
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
# Calculate the weights
# For every unique Census Tract,
# What % of Tract X's area is inside Taxi Zone Y?
tract_weights <- intersection %>%
as_tibble() %>%
# Two complementary shares — different jobs, different denominators:
# tract_share = how much of this tract is in this zone (picks the dominant zone)
# zone_share = how much of this zone is in this tract (apportions zone trips)
group_by(GEOID) %>%
mutate(tract_share = as.numeric(intersect_area / sum(intersect_area))) %>%
ungroup() %>%
group_by(LocationID) %>%
mutate(zone_share = as.numeric(intersect_area / sum(intersect_area))) %>%
ungroup() %>%
select(GEOID, LocationID, zone, tract_share, zone_share)
# Peak Crosswalk
head(tract_weights)
## # A tibble: 6 × 5
## GEOID LocationID zone tract_share zone_share
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 36081071600 2 Jamaica Bay 0.000628 0.000858
## 2 36047070203 2 Jamaica Bay 1 0.451
## 3 36081107201 2 Jamaica Bay 0.0124 0.00124
## 4 36081100801 2 Jamaica Bay 0.00106 0.0000508
## 5 36081107202 2 Jamaica Bay 0.989 0.547
## 6 36005037000 3 Allerton/Pelham Gardens 0.0000606 0.00000326
The below chunk uses the newly defined geography, and then it disperses the uber/taxi trip data (which uses taxi zones) fairly across the census tracts.
# issue is that taxi zones are larger than census tracts, and we obvs want to use the data together
# Assumes 'tract_weights' object exists from your earlier step.
tract_vehicular_estimates <- tract_weights %>%
left_join(total_zone_counts, by = c("LocationID" = "PULocationID")) %>%
mutate(estimated_trips = total_vehicular_trips * zone_share) %>%
group_by(GEOID) %>%
summarize(total_uber_taxi_trips = sum(estimated_trips, na.rm = TRUE))
# Save the 'Vehicular' side of the Master Table
save_path_tlc <- file.path(projPath, "data", "processed_tlc_counts.rds")
saveRDS(tract_vehicular_estimates, save_path_tlc)
print(paste("Vehicular trip estimates for October 2023 saved to:", save_path_tlc))
## [1] "Vehicular trip estimates for October 2023 saved to: /Users/catherinedube/src/DATA_622_Project/data/processed_tlc_counts.rds"
# Map Station IDs to Census Tracts
# file locations
cb_file1 <- file.path(projPath, "citibike", "202310-citibike-tripdata_1.csv")
cb_file2 <- file.path(projPath, "citibike", "202310-citibike-tripdata_2.csv")
cb_file3 <- file.path(projPath, "citibike", "202310-citibike-tripdata_3.csv")
cb_file4 <- file.path(projPath, "citibike", "202310-citibike-tripdata_4.csv")
cb_files <- c(cb_file1, cb_file2, cb_file3, cb_file4)
# Station-to-Tract Lookup Table
# unique coordinates for every station found in files
all_stations <- cb_files %>%
map_dfr(~read_csv(.x,
col_select = c(start_station_id, start_lat, start_lng),
# Force the station ID to be a character, and coords to be numeric
col_types = cols(
start_station_id = col_character(),
start_lat = col_double(),
start_lng = col_double()
))) %>%
filter(!is.na(start_lat)) %>%
distinct(start_station_id, .keep_all = TRUE)
# Spatial Work: Convert Stations to Points
stations_sf <- st_as_sf(all_stations, coords = c("start_lng", "start_lat"), crs = 4326) %>%
st_transform(2263) # Project to NYC Feet (EPSG:2263)
# Ensure nyc_tracts (from Step 1) matches the projection
nyc_tracts_proj <- st_transform(nyc_tracts, 2263)
# Find which Tract Polygon physically contains each Station Point
station_tract_lookup <- st_join(stations_sf, nyc_tracts_proj, join = st_intersects) %>%
as_tibble() %>%
select(start_station_id, GEOID)
# Count the number of unique stations in each Census Tract
tract_station_counts <- station_tract_lookup %>%
group_by(GEOID) %>%
summarize(total_stations = n()) %>%
filter(!is.na(GEOID))
# Count Trips per Tract
tract_bike_counts <- cb_files %>%
map_dfr(~read_csv(.x,
col_select = c(start_station_id),
col_types = cols(start_station_id = col_character())
)) %>%
left_join(station_tract_lookup, by = "start_station_id") %>%
group_by(GEOID) %>%
summarize(total_bike_trips = n()) %>%
filter(!is.na(GEOID)) %>%
# NEW: Attach the station capacity feature to the trip counts
left_join(tract_station_counts, by = "GEOID")
# Save the final "Label" feature for the Machine Learning model
save_path <- file.path(projPath, "data", "processed_bike_counts.rds")
saveRDS(tract_bike_counts, save_path)
print(paste("Success! Bike counts saved to:", save_path))
## [1] "Success! Bike counts saved to: /Users/catherinedube/src/DATA_622_Project/data/processed_bike_counts.rds"
Source is https://data.ny.gov/Transportation/MTA-Subway-Entrances-and-Exits-2024/i9wp-a4ja/about_data.
File name is “MTA_Subway_Entrances_and_Exits__2024.csv”
## Distance to Subway
# Load csv
subway_csv_path <- file.path(projPath, "MTA_Subway_Entrances_and_Exits__2024.csv")
subway_raw <- read_csv(subway_csv_path) %>%
clean_names() # otherwise this had spaces in the column names
## Rows: 2120 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): Division, Line, Borough, Stop Name, Constituent Station Name, GTFS...
## dbl (4): Complex ID, Station ID, Entrance Latitude, Entrance Longitude
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Transform into a Spatial Object (sf)
subway_entrances_sf <- subway_raw %>%
filter(!is.na(entrance_latitude) & !is.na(entrance_longitude)) %>% # 'entrance_longitude' is X and 'entrance_latitude' is Y
st_as_sf(coords = c("entrance_longitude", "entrance_latitude"), crs = 4326) %>% # telling it to read this as 4326, the standard GPS coordinate system
st_transform(2263) # Converting to the same 1 unit = 1 foot NYC-specific map that was used earlier
# Picking a reference point inside each tract.
# Switched away from st_centroid() — for L-shaped or coastal tracts the
# geographic center can land in water or in a corner where nobody lives.
# st_point_on_surface() is guaranteed to be inside the polygon.
tract_centers <- st_point_on_surface(nyc_tracts_proj)
## Warning: st_point_on_surface assumes attributes are constant over geometries
# Nearest Entrance for every Tract
# Find the closest subway for each tract
# Doing this from the tract center
nearest_indices <- st_nearest_feature(tract_centers, subway_entrances_sf)
# Distance in Miles
# (Distance in feet / 5280)
dist_to_subway_feet <- st_distance(tract_centers, subway_entrances_sf[nearest_indices, ], by_element = TRUE)
subway_dist_lookup <- nyc_tracts_proj %>%
as_tibble() %>%
mutate(dist_subway_miles = as.numeric(dist_to_subway_feet) / 5280) %>%
select(GEOID, dist_subway_miles)
# Save the Subway Distance
saveRDS(subway_dist_lookup, file.path(projPath, "data", "subway_dist_lookup.rds"))
print("Subway distance feature successfully calculated from MTA data!")
## [1] "Subway distance feature successfully calculated from MTA data!"
Borough fixed effects can’t tell apart “Williamsburg, 1 mile from Manhattan” and “Bay Ridge, 8 miles away” — they’re both Brooklyn, same borough effect. This chunk computes distance from each tract to Grand Central Terminal as a CBD-proximity proxy. Per George’s flag.
## Distance from each tract to Grand Central — proxy for "proximity to the
## Manhattan CBD". Same point_on_surface reference points as subway distance.
# Grand Central Terminal: 40.7527° N, 73.9772° W
grand_central <- st_sfc(st_point(c(-73.9772, 40.7527)), crs = 4326) %>%
st_transform(2263) # NYC State Plane (feet)
tract_centers_pos <- st_point_on_surface(nyc_tracts_proj)
## Warning: st_point_on_surface assumes attributes are constant over geometries
manhattan_dist_lookup <- nyc_tracts_proj %>%
as_tibble() %>%
mutate(
dist_to_manhattan_miles = as.numeric(
st_distance(tract_centers_pos, grand_central)
) / 5280
) %>%
select(GEOID, dist_to_manhattan_miles)
saveRDS(manhattan_dist_lookup, file.path(projPath, "data", "manhattan_dist_lookup.rds"))
print("Distance to Manhattan (Grand Central) feature saved.")
## [1] "Distance to Manhattan (Grand Central) feature saved."
## Big Join
# Load components
census_features <- readRDS(file.path(projPath, "data", "nyc_census_tracts.rds"))
bike_counts <- readRDS(file.path(projPath, "data", "processed_bike_counts.rds"))
tlc_counts <- readRDS(file.path(projPath, "data", "processed_tlc_counts.rds"))
subway_dist <- readRDS(file.path(projPath, "data", "subway_dist_lookup.rds"))
manhattan_dist <- readRDS(file.path(projPath, "data", "manhattan_dist_lookup.rds"))
# Before joining, force each Tract to belong to only one Taxi Zone, the one it overlaps most.
# I was seeing repeating values of income and distance for different zones in the final data set back when I was not doing this.
unique_weights <- tract_weights %>%
group_by(GEOID) %>%
slice_max(tract_share, n = 1, with_ties = FALSE) %>%
ungroup() %>%
select(GEOID, LocationID, zone) # Keep the Zone name and ID for joining
# Join everything to the Census Tracts
# left_join to not lose any tracts that might have had 0 trips
master_table <- census_features %>%
st_drop_geometry() %>% # drops the maps, keeps ALL features
# unique Neighborhood/LocationID mapping
left_join(unique_weights, by = "GEOID") %>%
# Bike counts (already at Tract level)
left_join(bike_counts, by = "GEOID") %>%
# Uber/Taxi counts
left_join(tlc_counts, by = "GEOID") %>%
# Subway Distance (already at Tract level)
left_join(subway_dist, by = "GEOID") %>%
# Distance to Manhattan CBD (Grand Central proxy)
left_join(manhattan_dist, by = "GEOID")
# Clean up NAs; create the Target Variable
# If a tract has no recorded trips, replace NA with 0
# If a tract has no stations, replace NA with 0
master_table <- master_table %>%
mutate(
total_bike_trips = replace_na(total_bike_trips, 0),
total_uber_taxi_trips = replace_na(total_uber_taxi_trips, 0),
total_stations = replace_na(total_stations, 0),
# Spatial Density (Stations per Square Mile)
# How physically clustered is the infrastructure?
stations_per_sq_mile = total_stations / area_sq_miles,
# Per Capita Availability (Stations per 1,000 Residents)
# How much infrastructure exists relative to the local population size?
stations_per_1000_pop = (total_stations / total_pop) * 1000,
uber_trips_per_capita = total_uber_taxi_trips / total_pop,
bike_trips_per_capita = total_bike_trips / total_pop,
# Old target: % of Last-Mile trips taken by Bike
# I left this in here, but we aren't really using it now
bike_index = total_bike_trips / (total_bike_trips + total_uber_taxi_trips)
)
# Save joined data set
saveRDS(master_table, file.path(projPath, "data", "model_input.rds"))
print("Master Join Complete. The model_input.rds file is ready for modeling.")
## [1] "Master Join Complete. The model_input.rds file is ready for modeling."
We could drop this later, but right now it was just a helpful way to make sure that combining the census tracts and the taxi zones worked properly.
### The Data Audit (Sanity Check)
# Load the Master Table
master_table <- readRDS(file.path(projPath, "data", "model_input.rds"))
# Join with Taxi Zone names for readability
# We'll use the tract_weights crosswalk to get neighborhood names back
audit_table <- master_table %>%
select(zone, GEOID, bike_index, total_bike_trips, total_uber_taxi_trips, med_income, dist_subway_miles)
# Top 10 "Bike-Heavy" Neighborhoods (Active)
top_bike <- audit_table %>%
arrange(desc(bike_index)) %>%
head(50)
# Top 10 "Uber-Heavy" Neighborhoods (Vehicular)
top_uber <- audit_table %>%
arrange(bike_index) %>%
head(50)
# quick summary
print("--- TOP 10 BIKE-HEAVY TRACTS ---")
## [1] "--- TOP 10 BIKE-HEAVY TRACTS ---"
print(top_bike)
## # A tibble: 50 × 7
## zone GEOID bike_index total_bike_trips total_uber_taxi_trips med_income
## <chr> <chr> <dbl> <int> <dbl> <dbl>
## 1 Two Bridg… 3606… 0.635 11311 6499. 89873
## 2 Steinway 3608… 0.580 2748 1993. 76889
## 3 Columbia … 3604… 0.571 7840 5882. 145417
## 4 Kips Bay 3606… 0.549 26003 21371. 153902
## 5 Sunnyside 3608… 0.537 2569 2216. 65667
## 6 Crown Hei… 3604… 0.527 8783 7875. 114750
## 7 East Will… 3604… 0.520 6701 6177. 151250
## 8 Fort Gree… 3604… 0.519 11949 11052. 142813
## 9 Two Bridg… 3606… 0.510 17361 16690. 25655
## 10 Stuy Town… 3606… 0.502 12168 12052. 202632
## # ℹ 40 more rows
## # ℹ 1 more variable: dist_subway_miles <dbl>
print("--- TOP 10 UBER-HEAVY TRACTS ---")
## [1] "--- TOP 10 UBER-HEAVY TRACTS ---"
print(top_uber)
## # A tibble: 50 × 7
## zone GEOID bike_index total_bike_trips total_uber_taxi_trips med_income
## <chr> <chr> <dbl> <int> <dbl> <dbl>
## 1 Westchest… 3600… 0 0 6170. 56208
## 2 West Farm… 3600… 0 0 6420. 48750
## 3 East Trem… 3600… 0 0 4399. 41287
## 4 Woodlawn/… 3600… 0 0 2676. 88264
## 5 Soundview… 3600… 0 0 2949. 53166
## 6 East Trem… 3600… 0 0 8996. 32174
## 7 Williamsb… 3600… 0 0 4127. 82039
## 8 Eastchest… 3600… 0 0 3373. 27753
## 9 Williamsb… 3600… 0 0 4719. 59758
## 10 Soundview… 3600… 0 0 5698. 29657
## # ℹ 40 more rows
## # ℹ 1 more variable: dist_subway_miles <dbl>
This section has 3 predictive models:
dist_subway_milesIt also has 1 causal model * BART, to estimate the average treatment effect
Prof. George flagged that NYC neighborhoods are very different from each other. Added this section, and borough as a factor variable lets both models learn a borough-level intercept, capturing unobserved variation that our census features don’t fully explain.
Borough is derived from the first 5 digits of GEOID (standard Census FIPS codes) so no additional data pull is needed.
# FIPS county codes for NYC boroughs
borough_lookup <- c(
"36005" = "Bronx",
"36047" = "Brooklyn",
"36061" = "Manhattan",
"36081" = "Queens",
"36085" = "Staten Island"
)
# Derive borough from first 5 chars of GEOID and join onto the master table
master_table <- readRDS(file.path(projPath, "data", "model_input.rds")) %>%
# Brooklyn as reference level — most-treated borough in the IBX corridor,
# so all other borough coefficients read as "vs Brooklyn (the IBX baseline)".
mutate(borough = factor(borough_lookup[substr(GEOID, 1, 5)],
levels = c("Brooklyn", "Bronx", "Manhattan",
"Queens", "Staten Island")))
saveRDS(master_table, file.path(projPath, "data", "model_input.rds"))
print(table(master_table$borough))
##
## Brooklyn Bronx Manhattan Queens Staten Island
## 775 344 303 684 120
Two linear baselines:
lm_fit_standardized: pure linear regressionlm_fit_splineL same model + a natural cubic spline on
dist_subway_milesUpdated this section so that both fit on a logit-transformed outcome
so predictions stay in [0, 100]. (Without it, lm was
predicting percentages that didn’t make sense because they were outsid
ethat range).
The spline version handles the saturating distance effect Prof. George flagged. When the distance to the subway is closer it matters a lot, far doesn’t.
Heavy-tailed predictors (uber_trips_per_capita,
bike_trips_per_capita, stations_per_1000_pop,
pop_density) get a log1p() transform before
z-scoring. Prof. George also pointed this out at some point, that some
of these variables would have heavily skewed distributions. After
logging, a 1 SD change in the standardized predictor is an
order-of-magnitude change rather than an absolute one, which is the
right scale for these variables.
# The selection for the Machine Learning model
# Right now, this has a lot of room for improvement trimming or adding features
# To make iteration easier:
## COLUMN/FEATURE SELECTION
id_vars <- c("GEOID")
target_var <- c("pct_commute_60p")
predictor_vars <- c(
# Borough Fixed Effect
"borough",
# Infrastructure & Mobility
"stations_per_sq_mile", "stations_per_1000_pop", "dist_subway_miles",
"dist_to_manhattan_miles",
"uber_trips_per_capita", "bike_trips_per_capita",
# Economic & Housing
"med_income", "pct_poverty", "pct_vacant", "pct_renter", "pct_2plus_cars",
# Demographics & Commute Behaviors
"pop_density", "pct_no_car", "pct_transit", "pct_bachelors", "pct_wfh", "pct_walk",
# Age Buckets
"pct_age_18_24", "pct_age_25_34", "pct_age_35_44", "pct_age_45_64", "pct_age_65_plus"
)
## SAVING RAW (NON-Z-SCORE) DATA TO DISPLAY IN SHINY APP
ml_data_raw <- readRDS(file.path(projPath, "data", "model_input.rds")) %>%
# all_of() safely selects only the columns defined in our lists above
select(all_of(c(id_vars, target_var, predictor_vars))) %>%
drop_na()
# Sanity: any Inf/NaN in the numeric columns? scale() turns Inf into NaN.
print(ml_data_raw %>% summarise(across(where(is.numeric), ~ sum(!is.finite(.)))))
## # A tibble: 1 × 23
## pct_commute_60p stations_per_sq_mile stations_per_1000_pop dist_subway_miles
## <int> <int> <int> <int>
## 1 0 0 0 0
## # ℹ 19 more variables: dist_to_manhattan_miles <int>,
## # uber_trips_per_capita <int>, bike_trips_per_capita <int>, med_income <int>,
## # pct_poverty <int>, pct_vacant <int>, pct_renter <int>,
## # pct_2plus_cars <int>, pop_density <int>, pct_no_car <int>,
## # pct_transit <int>, pct_bachelors <int>, pct_wfh <int>, pct_walk <int>,
## # pct_age_18_24 <int>, pct_age_25_34 <int>, pct_age_35_44 <int>,
## # pct_age_45_64 <int>, pct_age_65_plus <int>
## SCALE THE DATA FOR THE MODEL
# Scale numeric predictors only — borough is a factor and must not be z-scored
numeric_predictors <- predictor_vars[predictor_vars != "borough"]
ml_data_scaled <- ml_data_raw %>%
# Log-transform heavy-tailed predictors before scaling so standardized
# values reflect order-of-magnitude differences (per George's flag re:
# citibike share, uber/taxi share, plus pop_density spans 4 orders).
mutate(across(
.cols = c(uber_trips_per_capita, bike_trips_per_capita,
stations_per_1000_pop, pop_density),
.fns = log1p
)) %>%
mutate(across(
.cols = all_of(numeric_predictors),
.fns = ~ as.numeric(scale(.))
)) %>%
# Logit-transform the outcome so lm predictions can't escape [0, 100].
# Small offset handles 0% / 100% edges without -Inf/+Inf.
mutate(pct_commute_60p_logit = qlogis((pct_commute_60p + 0.5) / 101))
# Helper: bring logit-space predictions back to raw % space.
logit_to_pct <- function(z) (plogis(z) * 101) - 0.5
## TRAIN THE MODEL
# Fit on the logit-transformed outcome — back-transform when we predict.
lm_fit_standardized <- lm(pct_commute_60p_logit ~ . - GEOID - pct_commute_60p,
data = ml_data_scaled)
# Same model with a natural cubic spline on dist_subway_miles — kept alongside
# the pure-linear baseline so we can compare them in the Shiny app.
lm_fit_spline <- lm(pct_commute_60p_logit ~ . - GEOID - pct_commute_60p - dist_subway_miles
+ splines::ns(dist_subway_miles, df = 4),
data = ml_data_scaled)
print("Model Summary: ")
## [1] "Model Summary: "
summary(lm_fit_standardized)
##
## Call:
## lm(formula = pct_commute_60p_logit ~ . - GEOID - pct_commute_60p,
## data = ml_data_scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5619 -0.2472 0.0243 0.3020 2.8051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.152266 0.019216 -59.962 < 2e-16 ***
## boroughBronx 0.008190 0.035471 0.231 0.817418
## boroughManhattan -0.533225 0.047750 -11.167 < 2e-16 ***
## boroughQueens -0.041091 0.031410 -1.308 0.190938
## boroughStaten Island -0.276646 0.062500 -4.426 1.01e-05 ***
## stations_per_sq_mile 0.044296 0.019031 2.328 0.020024 *
## stations_per_1000_pop -0.111398 0.025688 -4.337 1.51e-05 ***
## dist_subway_miles -0.007864 0.014561 -0.540 0.589218
## dist_to_manhattan_miles 0.241930 0.021178 11.423 < 2e-16 ***
## uber_trips_per_capita -0.045419 0.026221 -1.732 0.083384 .
## bike_trips_per_capita -0.049733 0.026616 -1.869 0.061819 .
## med_income -0.106383 0.020043 -5.308 1.22e-07 ***
## pct_poverty -0.068325 0.017665 -3.868 0.000113 ***
## pct_vacant 0.034210 0.012962 2.639 0.008367 **
## pct_renter -0.001947 0.024678 -0.079 0.937135
## pct_2plus_cars 0.019960 0.024783 0.805 0.420680
## pop_density -0.063155 0.021817 -2.895 0.003832 **
## pct_no_car -0.005559 0.032784 -0.170 0.865376
## pct_transit 0.180915 0.021174 8.544 < 2e-16 ***
## pct_bachelors -0.056834 0.018176 -3.127 0.001790 **
## pct_wfh -0.168961 0.016207 -10.425 < 2e-16 ***
## pct_walk -0.095116 0.017528 -5.426 6.38e-08 ***
## pct_age_18_24 0.028315 0.013139 2.155 0.031267 *
## pct_age_25_34 -0.052292 0.017840 -2.931 0.003412 **
## pct_age_35_44 -0.003262 0.015474 -0.211 0.833049
## pct_age_45_64 0.041001 0.014926 2.747 0.006062 **
## pct_age_65_plus 0.029769 0.016704 1.782 0.074857 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4876 on 2193 degrees of freedom
## Multiple R-squared: 0.7066, Adjusted R-squared: 0.7031
## F-statistic: 203.1 on 26 and 2193 DF, p-value: < 2.2e-16
For each geographic predictor below, we compute the neighbor-weighted average (the “spatial lag”). This lets a model express patterns like “a tract’s commute outcomes depend on its own features AND on its neighbors’ features.” We keep the original predictors as-is so we can compare with-lag vs. without-lag model performance directly.
library(spdep)
## Warning: package 'spdep' was built under R version 4.5.2
## Loading required package: spData
## Warning: package 'spData' was built under R version 4.5.2
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
library(sf)
# Same spatial weights pattern as the residual diagnostic. Row order matches
# ml_data_scaled, so lag.listw is correctly positional.
tracts_sf <- nyc_tracts_clean %>%
filter(GEOID %in% ml_data_scaled$GEOID) %>%
arrange(match(GEOID, ml_data_scaled$GEOID))
neighbors <- poly2nb(tracts_sf)
## Warning in poly2nb(tracts_sf): some observations have no neighbours;
## if this seems unexpected, try increasing the snap argument.
## Warning in poly2nb(tracts_sf): neighbour object has 6 sub-graphs;
## if this sub-graph count seems unexpected, try increasing the snap argument.
weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)
# Predictors with the most geographic structure. Skipping the outcome
# (lag-of-outcome opens the SAR can of worms) and skipping categorical
# borough (lag of factor levels is awkward).
predictors_to_lag <- c(
"dist_to_manhattan_miles", "dist_subway_miles",
"pop_density", "pct_transit", "pct_wfh",
"med_income", "stations_per_1000_pop"
)
# Compute each lag column inside ml_data_scaled.
for (col in predictors_to_lag) {
ml_data_scaled[[paste0("lag_", col)]] <-
lag.listw(weights, ml_data_scaled[[col]], zero.policy = TRUE)
}
# Sanity: no Inf/NaN in the new lag columns
lag_finite_check <- ml_data_scaled %>%
select(starts_with("lag_")) %>%
summarise(across(everything(), ~ sum(!is.finite(.))))
print(lag_finite_check)
## # A tibble: 1 × 7
## lag_dist_to_manhattan_…¹ lag_dist_subway_miles lag_pop_density lag_pct_transit
## <int> <int> <int> <int>
## 1 0 0 0 0
## # ℹ abbreviated name: ¹lag_dist_to_manhattan_miles
## # ℹ 3 more variables: lag_pct_wfh <int>, lag_med_income <int>,
## # lag_stations_per_1000_pop <int>
Three jobs in this section: train the Random Forest (Model 3), build the spatial-lag variants of all three models (Linear, Linear + Spline, Random Forest), and put all six variants through 10-fold cross-validation to compare them to one another.
The spatial-lag variants use the same model formulas as the
non-spatial ones, with the neighbor-weighted lag_* columns
from the previous chunk added to the predictor set.
CV mechanics: each variant is fit 10 times (once per fold, on the 90%
training side) and predicted on the held-out 10%. Folds are stratified
by borough and reused across all six variants (same seed, same
vfold_cv call), so any pair of models is directly
comparable. The “one-SE rule” then picks the simplest model within one
SE of the best.
A separate fit of rf_fit on the full dataset happens at
the end (that’s the version used by the IBX simulation downstream,
separate from the CV loop).
## CV SETUP
# 10-fold CV stratified by borough — same folds for every model so RMSE/R^2
# comparisons are paired
set.seed(622)
model_data <- ml_data_scaled[, c(target_var, predictor_vars)]
# Spatial-lag predictor set: original predictors + the lag_* columns added
# in the build-spatial-lags chunk. Same target, same row order.
spatial_predictor_vars <- c(
predictor_vars,
paste0("lag_", predictors_to_lag)
)
model_data_spatial <- ml_data_scaled[, c(target_var, spatial_predictor_vars)]
folds <- rsample::vfold_cv(model_data, v = 10, strata = borough)
# Fit one model on the analysis side of a fold, predict on the assessment side.
run_cv <- function(folds, fit_fn) {
imap_dfr(folds$splits, function(split, fold_id) {
train <- rsample::analysis(split)
test <- rsample::assessment(split)
fit <- fit_fn(train)
tibble(
fold = as.integer(fold_id),
actual = test$pct_commute_60p,
pred = as.numeric(predict(fit, newdata = test))
)
})
}
## RUN EACH MODEL ACROSS THE FOLDS
lm_cv <- run_cv(folds, function(d) lm(pct_commute_60p ~ ., data = d))
lm_spline_cv <- run_cv(folds, function(d) lm(pct_commute_60p ~ . - dist_subway_miles
+ splines::ns(dist_subway_miles, df = 4),
data = d))
rf_cv <- run_cv(folds, function(d) randomForest(pct_commute_60p ~ ., data = d, ntree = 500))
## SPATIAL-LAG VARIANTS: same models, same folds (same seed + same rows),
## but using model_data_spatial so the lag_* columns enter as predictors.
set.seed(622)
folds_spatial <- rsample::vfold_cv(model_data_spatial, v = 10, strata = borough)
lm_cv_spatial <- run_cv(folds_spatial, function(d) lm(pct_commute_60p ~ ., data = d))
lm_spline_cv_spatial <- run_cv(folds_spatial, function(d) lm(pct_commute_60p ~ . - dist_subway_miles
+ splines::ns(dist_subway_miles, df = 4),
data = d))
rf_cv_spatial <- run_cv(folds_spatial, function(d) randomForest(pct_commute_60p ~ ., data = d, ntree = 500))
## Train RF on full dataset (used for map predictions — outside the CV loop)
rf_fit <- randomForest(
pct_commute_60p ~ .,
data = model_data,
ntree = 500,
importance = TRUE
)
## Spatial-lag production fits — used by the spillover IBX simulation
lm_fit_standardized_spatial <- lm(
pct_commute_60p_logit ~ . - GEOID - pct_commute_60p,
data = ml_data_scaled
)
lm_fit_spline_spatial <- lm(
pct_commute_60p_logit ~ . - GEOID - pct_commute_60p - dist_subway_miles
+ splines::ns(dist_subway_miles, df = 4),
data = ml_data_scaled
)
set.seed(622)
rf_fit_spatial <- randomForest(
pct_commute_60p ~ .,
data = ml_data_scaled[, c(target_var, spatial_predictor_vars)],
ntree = 500,
importance = TRUE
)
## SUMMARIZE
## mean +- SE across folds, plus the one-SE rule flag
rmse <- function(a, p) sqrt(mean((a - p)^2))
r2 <- function(a, p) 1 - sum((a - p)^2) / sum((a - mean(a))^2)
fold_metrics <- function(cv_df) {
cv_df %>% group_by(fold) %>%
summarise(rmse_fold = rmse(actual, pred),
r2_fold = r2(actual, pred),
.groups = "drop")
}
cv_summary <- function(name, cv_df) {
m <- fold_metrics(cv_df)
tibble(
Model = name,
RMSE_mean = mean(m$rmse_fold),
RMSE_SE = sd(m$rmse_fold) / sqrt(nrow(m)),
R2_mean = mean(m$r2_fold),
R2_SE = sd(m$r2_fold) / sqrt(nrow(m))
)
}
cv_table <- bind_rows(
cv_summary("Linear Regression", lm_cv),
cv_summary("Linear Regression + Spatial Lags", lm_cv_spatial),
cv_summary("Linear + Spline (dist_subway_miles)", lm_spline_cv),
cv_summary("Linear + Spline + Spatial Lags", lm_spline_cv_spatial),
cv_summary("Random Forest", rf_cv),
cv_summary("Random Forest + Spatial Lags", rf_cv_spatial)
)
# One-SE rule: any model within (best_RMSE + SE_of_best) is "as good as best".
best_idx <- which.min(cv_table$RMSE_mean)
one_se_thresh <- cv_table$RMSE_mean[best_idx] + cv_table$RMSE_SE[best_idx]
cv_table <- cv_table %>%
mutate(within_1se_of_best = ifelse(RMSE_mean <= one_se_thresh, "✓", ""))
kable(cv_table %>% mutate(across(where(is.numeric), ~ round(., 3))),
caption = "10-fold CV (stratified by borough) — mean ± SE, one-SE rule flag")
| Model | RMSE_mean | RMSE_SE | R2_mean | R2_SE | within_1se_of_best |
|---|---|---|---|---|---|
| Linear Regression | 7.768 | 0.118 | 0.610 | 0.011 | |
| Linear Regression + Spatial Lags | 7.623 | 0.118 | 0.624 | 0.011 | |
| Linear + Spline (dist_subway_miles) | 7.606 | 0.135 | 0.626 | 0.013 | |
| Linear + Spline + Spatial Lags | 7.496 | 0.136 | 0.637 | 0.013 | |
| Random Forest | 7.236 | 0.140 | 0.661 | 0.013 | ✓ |
| Random Forest + Spatial Lags | 7.138 | 0.144 | 0.670 | 0.013 | ✓ |
Interpretation: Random Forest + Spatial Lags is the CV-best by just a little bit, with 7.14 RMSE vs 7.24 for plain Random Forest. The two Random Forest variants are the only rows within 1 SE of the best, so they’re statistically a tie, but we’re opting to treat Random Forest + Spatial Lags as the winner. Its typical prediction is off by ~7.1pp on the 0-100% scale, and it explains ~67% of variance. There are also two patterns in the table that are worth pointing at. Across the spatial-lag axis, every model class gains roughly 0.10 to 0.15 RMSE once neighbor-weighted features are added (Linear Regression 7.77 to 7.62, Linear + Spline 7.61 to 7.5, Random Forest 7.24 to 7.14). The improvement is consistent across all three classes. Basically, neighbors are carrying real signal beyond what a tract’s own features capture. Across the model-family axis, the jump from Linear + Spline to Random Forest (-0.37 RMSE) is bigger than the jump from Linear Regression to Linear + Spline (-0.16), meaning Random Forest is catching nonlinearities and interactions the spline alone can’t. For the IBX simulation, the Linear + Spline, Random Forest, and spatial-lag variants are the ones to lean on. Plain Linear Regression can’t capture the saturating subway-access effect, so its predicted relief is essentially zero. The spatial-lag variants also let predicted relief spill over to tracts next to the new corridor, so they reach more tracts than the non-spatial versions (more on that in the IBX simulation section below).
P.s. the “saturating subway-access effect” was just the most succinct way I could phrase “the effect where 5 minutes versus 10 minutes as a walk to the subway is a big different, but 20 minutes versus 25 is not”. Ajs Prof. George pointed this out to us in feedback notes at some point in time.
mtry is the knob for how many features RF considers at
each tree split. Tuning it using OOB error instead of a full CV loop
(ISLP says they’re basically equivalent with enough trees, and OOB is
faster). The sweep covers the default p/3 ≈ 7, a few neighbors, and mtry
= 22 (which is plain bagging and useful as a control). If multiple
values come back within 1% of the best, the default is being picked
becuase it’s “cleaner” to defend than the algorithmic min when the gaps
are tiny.
## mtry SWEEP via OOB
mtry_grid <- c(3, 5, 7, 11, 22)
rf_tuning <- map_dfr(mtry_grid, function(m) {
set.seed(622)
fit <- randomForest(pct_commute_60p ~ ., data = model_data,
mtry = m, ntree = 500)
tibble(
mtry = m,
OOB_RMSE = sqrt(tail(fit$mse, 1)),
OOB_R2 = tail(fit$rsq, 1)
)
})
# Flag mtry values within ~1% of best — rough OOB analogue of the one-SE rule.
best_oob <- min(rf_tuning$OOB_RMSE)
rf_tuning <- rf_tuning %>%
mutate(within_1pct_of_best = ifelse(OOB_RMSE <= best_oob * 1.01, "✓", ""))
kable(rf_tuning %>% mutate(across(where(is.numeric), ~ round(., 3))),
caption = "RF mtry sweep — OOB RMSE / R² on full data")
| mtry | OOB_RMSE | OOB_R2 | within_1pct_of_best |
|---|---|---|---|
| 3 | 7.364 | 0.652 | |
| 5 | 7.278 | 0.660 | ✓ |
| 7 | 7.250 | 0.663 | ✓ |
| 11 | 7.241 | 0.664 | ✓ |
| 22 | 7.270 | 0.661 | ✓ |
## Refit production rf_fit with the chosen mtry
# OOB-best is mtry=11 by 0.03 RMSE, but ISLP default p/3 ≈ 7 is within 1%.
# Going with the simpler/standard default — easier to defend and statistically tied.
chosen_mtry <- 7
set.seed(622)
rf_fit <- randomForest(
pct_commute_60p ~ .,
data = model_data,
ntree = 500,
mtry = chosen_mtry,
importance = TRUE
)
cat("rf_fit refit with mtry =", chosen_mtry, "\n")
## rf_fit refit with mtry = 7
## OOB-vs-ntree — confirm 500 trees is enough
oob_curve <- tibble(
ntree = seq_along(rf_fit$mse),
oob_rmse = sqrt(rf_fit$mse)
)
ggplot(oob_curve, aes(x = ntree, y = oob_rmse)) +
geom_line(color = "darkblue") +
geom_hline(yintercept = tail(oob_curve$oob_rmse, 1),
color = "red", linetype = "dashed") +
theme_minimal() +
labs(title = paste("OOB RMSE vs number of trees (mtry =", chosen_mtry, ")"),
subtitle = "Red dashed = final OOB RMSE — curve should plateau",
x = "Number of trees", y = "OOB RMSE")
How to interpret: OOB error should drop fast, then plateau. If the curve flattens well before the right edge, 500 trees is enough. If it’s still falling at the right edge, ntree should be higher. The dashed red line marks the final OOB value — looking for the curve to be hugging it.
Interpretation: mtry sweep was a near-tie — the algorithmic best was mtry=11 at OOB RMS 7.241, but 4 of 5 values landed within 1% of the best, so we went with ISLP’s default mtry=7 (RMSE 7.25). That mtry=2 (essentially bagging) is only marginally worse tells us feature decorrelation isn’t doing heavy lifting on this dataset. The OOB-vs-ntree curve drops fast from ~10.5 at ntree=1 to ~7.28 by ntree=200, then flatlines through 500 — confirming ntree=500 is well past the plateau. Could safely cut to ~300 trees, but 500 is standard so we kept it.
Here, the approximate locations of the new IBX stations are being brought in. And then in the census areas where they go, we’re basically dropping the distance to train for those areas to really low. There may be a more robust way of implementing the “simulation” of the IBX station than this, but this seemed like a good starting point.
# Build the list of stations based on the MTA's 2025 CE Document
# This document is a pdf, and can be found in the repo
ibx_stations <- tibble::tribble(
~station_name, ~address,
"Brooklyn Army Terminal", "2nd Ave & 61st St, Brooklyn, NY",
"4th Ave", "5th Ave & 61st St, Brooklyn, NY",
"8th Ave", "8th Ave & 61st St, Brooklyn, NY",
"New Utrecht Ave", "14th Ave & 61st St, Brooklyn, NY",
"McDonald Ave", "McDonald Ave & Avenue I, Brooklyn, NY",
"East 16th St", "East 16th St & Avenue H, Brooklyn, NY",
"Flatbush-Nostrand Ave", "Nostrand Ave & Avenue H, Brooklyn, NY",
"Utica Ave", "Utica Ave & Farragut Rd, Brooklyn, NY",
"Remsen Ave", "Remsen Ave & Farragut Rd, Brooklyn, NY",
"Linden Blvd", "Linden Blvd & Junius St, Brooklyn, NY",
"Livonia Ave", "Livonia Ave & Junius St, Brooklyn, NY",
"Sutter Ave", "Sutter Ave & Junius St, Brooklyn, NY",
"Atlantic Ave", "Atlantic Ave & East New York Ave, Brooklyn, NY",
"Wilson Ave", "Wilson Ave & Moffat St, Brooklyn, NY",
"Myrtle Ave", "Cypress Hills St & 78th Ave, Queens, NY",
"Metropolitan Ave", "Metropolitan Ave & 69th St, Queens, NY",
"Eliot Ave", "Eliot Ave & 69th St, Queens, NY",
"Grand Ave", "Grand Ave & 69th St, Queens, NY",
"Roosevelt Ave", "Roosevelt Ave & 74th St, Queens, NY"
)
# Geocode the addresses into Lat/Long coordinates
ibx_geocoded <- ibx_stations %>%
geocode(address, method = 'osm', lat = latitude, long = longitude) %>% # get coordinated for each address
drop_na() %>% # Drop any that the open-source map couldn't perfectly match
# Convert the GPS coordinates into an sf object
st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
# Convert and draw 10-minute-walk-radiuses
st_transform(2263) %>% # transform to the more precide NYC + Long Island map we've been using
st_buffer(dist = 2640) %>% # drawing a circle around point that is 2640 feet = 0.5 miles ~ 10 minute walk. Most people don't want to walk further than 10 minutes to the train
st_transform(4326) # Convert back to standard GPS
## Passing 19 addresses to the Nominatim single address geocoder
## Query completed in: 19.1 seconds
# Find all Census Tracts that fall at least partially inside these new 10-minute-walk-radiuses
ibx_tracts_sf <- st_filter(
nyc_tracts,
st_transform(ibx_geocoded, st_crs(nyc_tracts)),
.predicate = st_intersects
)
# Extract just the 11-digit IDs as a clean list
# Only keep tracts that made it into the modeling set, otherwise
# treated tracts get no prediction and show up as map holes.
ibx_geoid_list <- intersect(ibx_tracts_sf$GEOID, ml_data_scaled$GEOID)
The IBX simulation works like this: geocode the 19 IBX station
addresses, draw 0.5-mile walksheds around each, and flag every tract
that intersects a walkshed as “treated.” For the treated tracts, we’re
setting dist_subway_miles to the city-wide minimum. This is
our way of saying that this tract now has subway access as close as the
best-served tract in NYC. We then re-predict commute outcomes. The same
setup runs through all six model variants (Linear, Linear + Spline,
Random Forest, each with and without spatial lags) so we can compare how
each forecasts IBX’s effect. The spatial-lag variants additionally let
predicted relief spill over to tracts next to the IBX corridor (more on
that below). It’s a coarse proxy for what a real subway station does,
but a defensible starting point.
## Create an IBX UNIVERSE data set
# finding the Census Tract in NYC that is physically closest to a subway entrance
# going to use this later to simulate a really close distance to subway for places where IBX is being built
best_transit_z_score <- min(ml_data_scaled$dist_subway_miles)
ibx_scenario_data_scaled <- ml_data_scaled %>%
mutate(
# Inject best_transit_z_score into the IBX tracts
dist_subway_miles = ifelse(GEOID %in% ibx_geoid_list, best_transit_z_score, dist_subway_miles)
)
# Predicted data is from the model which consumed scaled data
# But we're using raw data so the display of neighborhood features is correct
# lm predictions live in logit space — back-transform + clamp to [0, 100].
final_map_data <- ml_data_raw %>%
mutate(
predicted_commute_baseline = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_standardized, newdata = ml_data_scaled)))),
predicted_commute_ibx = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_standardized, newdata = ibx_scenario_data_scaled)))),
ibx_commute_relief = pmax(0, predicted_commute_baseline - predicted_commute_ibx),
is_ibx_location = ifelse(GEOID %in% ibx_geoid_list, 1, 0)
)
# Sanity: where do predictions actually land?
cat("lm baseline range:", range(final_map_data$predicted_commute_baseline), "\n")
## lm baseline range: 0 50.66438
cat("lm IBX range: ", range(final_map_data$predicted_commute_ibx), "\n")
## lm IBX range: 0 50.66438
cat("lm relief range: ", range(final_map_data$ibx_commute_relief), "\n")
## lm relief range: 0 0
saveRDS(final_map_data, file.path(projPath, "data", "model_input.rds"))
# Spline-lm IBX predictions — appended to final_map_data alongside the linear baseline.
final_map_data <- final_map_data %>%
mutate(
predicted_commute_baseline_spline = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_spline, newdata = ml_data_scaled)))),
predicted_commute_ibx_spline = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_spline, newdata = ibx_scenario_data_scaled)))),
ibx_commute_relief_spline = pmax(0, predicted_commute_baseline_spline - predicted_commute_ibx_spline)
)
cat("spline baseline range:", range(final_map_data$predicted_commute_baseline_spline), "\n")
## spline baseline range: 0 49.26246
cat("spline IBX range: ", range(final_map_data$predicted_commute_ibx_spline), "\n")
## spline IBX range: 0 49.26246
cat("spline relief range: ", range(final_map_data$ibx_commute_relief_spline), "\n")
## spline relief range: 0 7.856936
saveRDS(final_map_data, file.path(projPath, "data", "model_input.rds"))
# Append RF IBX predictions to the map data already saved by the chunk above.
# final_map_data is still in memory — adding RF columns and re-saving.
final_map_data_rf <- final_map_data %>%
mutate(
# Restore Tai's clamps — % can't be negative, neither can relief.
predicted_commute_baseline_rf = pmax(0, predict(rf_fit, newdata = ml_data_scaled)),
predicted_commute_ibx_rf = pmax(0, predict(rf_fit, newdata = ibx_scenario_data_scaled)),
ibx_commute_relief_rf = pmax(0, predicted_commute_baseline_rf - predicted_commute_ibx_rf),
# Residual: positive = model under-predicts (actual > predicted),
# negative = model over-predicts (actual < predicted).
# Used by the Shiny app's residual layer.
rf_residual = pct_commute_60p - predicted_commute_baseline_rf
)
# Sanity: same range check for RF
cat("RF baseline range:", range(final_map_data_rf$predicted_commute_baseline_rf), "\n")
## RF baseline range: 1.731167 65.80985
cat("RF IBX range: ", range(final_map_data_rf$predicted_commute_ibx_rf), "\n")
## RF IBX range: 1.731167 65.80985
cat("RF relief range: ", range(final_map_data_rf$ibx_commute_relief_rf), "\n")
## RF relief range: 0 7.366372
saveRDS(final_map_data_rf, file.path(projPath, "data", "model_input.rds"))
print("RF predictions appended to model_input.rds")
## [1] "RF predictions appended to model_input.rds"
Same scenario as the non-lag versions (push
dist_subway_miles to the global minimum for treated
tracts), but with one extra step: we recompute
lag_dist_subway_miles using the modified values. That way
tracts neighboring the IBX corridor see their neighbor-weighted
subway-distance drop too, and the spatial models can express spillover
benefits to those adjacent tracts. The other six lag columns don’t move
because their underlying features aren’t part of the simulation.
# Build the spatial scenario: modify dist_subway_miles (same as before) AND
# recompute its spatial lag using the updated values.
ibx_scenario_data_spatial <- ml_data_scaled %>%
mutate(
dist_subway_miles = ifelse(GEOID %in% ibx_geoid_list,
best_transit_z_score,
dist_subway_miles)
)
ibx_scenario_data_spatial$lag_dist_subway_miles <-
lag.listw(weights, ibx_scenario_data_spatial$dist_subway_miles, zero.policy = TRUE)
# Predict from the three spatial models on baseline and IBX scenarios.
# Append to final_map_data_rf so all simulation outputs live in one frame.
final_map_data_rf <- final_map_data_rf %>%
mutate(
# Linear + Spatial Lags
predicted_commute_baseline_lm_spatial = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_standardized_spatial, newdata = ml_data_scaled)))),
predicted_commute_ibx_lm_spatial = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_standardized_spatial, newdata = ibx_scenario_data_spatial)))),
ibx_commute_relief_lm_spatial = pmax(0, predicted_commute_baseline_lm_spatial - predicted_commute_ibx_lm_spatial),
# Spline + Spatial Lags
predicted_commute_baseline_spline_spatial = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_spline_spatial, newdata = ml_data_scaled)))),
predicted_commute_ibx_spline_spatial = pmax(0, pmin(100, logit_to_pct(predict(lm_fit_spline_spatial, newdata = ibx_scenario_data_spatial)))),
ibx_commute_relief_spline_spatial = pmax(0, predicted_commute_baseline_spline_spatial - predicted_commute_ibx_spline_spatial),
# RF + Spatial Lags
predicted_commute_baseline_rf_spatial = pmax(0, predict(rf_fit_spatial, newdata = ml_data_scaled)),
predicted_commute_ibx_rf_spatial = pmax(0, predict(rf_fit_spatial, newdata = ibx_scenario_data_spatial)),
ibx_commute_relief_rf_spatial = pmax(0, predicted_commute_baseline_rf_spatial - predicted_commute_ibx_rf_spatial)
)
# Range printouts
cat("\n--- Spatial-lag IBX simulation ranges ---\n")
##
## --- Spatial-lag IBX simulation ranges ---
cat("lm + lags relief range: ", range(final_map_data_rf$ibx_commute_relief_lm_spatial), "\n")
## lm + lags relief range: 0 3.031121
cat("spline + lags relief range: ", range(final_map_data_rf$ibx_commute_relief_spline_spatial), "\n")
## spline + lags relief range: 0 8.148764
cat("RF + lags relief range: ", range(final_map_data_rf$ibx_commute_relief_rf_spatial), "\n")
## RF + lags relief range: 0 9.283269
# Spillover diagnostic: how many tracts get non-zero relief in each scenario?
# Non-spatial: only directly-treated tracts.
# Spatial: directly-treated + their neighbors (because the lag changed for them).
cat("\n--- Spillover breadth (tracts with non-zero predicted relief) ---\n")
##
## --- Spillover breadth (tracts with non-zero predicted relief) ---
cat("Treated tracts in dataset: ", length(ibx_geoid_list), "\n")
## Treated tracts in dataset: 200
cat("Non-spatial (lm / spline / rf): ",
sum(final_map_data_rf$ibx_commute_relief > 0), "/",
sum(final_map_data_rf$ibx_commute_relief_spline > 0), "/",
sum(final_map_data_rf$ibx_commute_relief_rf > 0), "\n")
## Non-spatial (lm / spline / rf): 0 / 200 / 157
cat("Spatial (lm+lag / spline+lag / rf+lag):",
sum(final_map_data_rf$ibx_commute_relief_lm_spatial > 0), "/",
sum(final_map_data_rf$ibx_commute_relief_spline_spatial > 0), "/",
sum(final_map_data_rf$ibx_commute_relief_rf_spatial > 0), "\n")
## Spatial (lm+lag / spline+lag / rf+lag): 120 / 358 / 270
saveRDS(final_map_data_rf, file.path(projPath, "data", "model_input.rds"))
Interpretation: The spillover diagnostic above adds
a second dimension to the IBX picture. Without spatial lags, the
simulation only registers relief for directly-treated tracts. The Linear
model predicts zero for every tract. Linear + Spline reaches all 200
treated tracts. Random Forest reaches 157 of them. With spatial lags,
the same three model classes reach 120 / 358 / 270 tracts (Linear,
Linear + Spline, Random Forest). That broader reach comes from the way
we recompute lag_dist_subway_miles after the treatment
update. Once a tract’s neighbors get closer to subway, the tract’s own
neighbor-weighted distance-to-subway also drops, so the spatial model
picks up the spillover.
Max relief grows too. Plain Linear stays at 0pp (no saturating-effect capture). Linear + Spatial Lags rescues it to 3.03pp. Linear + Spline goes from 7.86pp to 8.15pp. Random Forest goes from 7.37pp to 9.28pp. Random Forest + Spatial Lags ends up with the highest max relief and the most tracts reached, which lines up with it being the CV-best model overall.
# Model Results
lm_results <- tidy(lm_fit_standardized) %>%
filter(term != "(Intercept)") # Remove the intercept for better scaling
# Plot the "Pull" or "Impact" of each feature
ggplot(lm_results, aes(x = estimate, y = reorder(term, estimate))) +
geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
geom_point(size = 3, color = "darkblue") +
geom_errorbarh(aes(xmin = estimate - std.error, xmax = estimate + std.error), height = 0.2) +
theme_minimal() +
labs(title = "What's driving 60 minute + length communtes?",
subtitle = "Standardized Coefficients (Variables further from 0 have a stronger impact)",
x = "Effect on 60+ Minute Commute Rate (per 1 Std. Dev. increase)",
y = "Feature")
## Warning: `geom_errobarh()` was deprecated in ggplot2 4.0.0.
## ℹ Please use the `orientation` argument of `geom_errorbar()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `height` was translated to `width`.
How to interpret: Each point is a feature’s
standardized coefficient — how much logit(% long commute) shifts per 1
SD change in that feature, holding everything else fixed. Distance from
zero = effect strength; the red dashed line marks “no effect.” Borough
effects (vs Brooklyn baseline) and dist_to_manhattan_miles
tend to dominate. Error bars show ±1 SE — features whose bars cross zero
aren’t statistically distinguishable from zero effect.
Interpretation: Distance to Manhattan and being in
Manhattan are the dominant features. Together they carry most of the
geography story (more distance from Manhattan → more long commutes;
Manhattan itself strongly fewer). The most counterintuitive coefficient
is pct_transit (+0.18, p < 10⁻¹⁶) — tracts where more
people commute by transit have more long commutes, reflecting NYC’s
reality that long-distance commuters disproportionately use the subway.
Distance to subway in the linear model is essentially zero (-0.008,
p=0.59); the relationship is saturating, which is why we added the
spline. After controlling for distance to Manhattan, Queens and Bronx
are statistical ties with Brooklyn; only Manhattan and Staten Island
stand out as boroughs with meaningfully fewer long commutes.
# Variable importance from the full RF model
rf_importance <- importance(rf_fit) %>%
as.data.frame() %>%
rownames_to_column("Feature") %>%
arrange(desc(`%IncMSE`))
ggplot(rf_importance, aes(x = `%IncMSE`, y = reorder(Feature, `%IncMSE`))) +
geom_point(size = 3, color = "darkgreen") +
geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
theme_minimal() +
labs(
title = "Random Forest: Feature Importance",
subtitle = "% increase in MSE when feature is permuted — higher = more important",
x = "% Increase in MSE (Permutation Importance)",
y = "Feature"
)
How to interpret: Higher %IncMSE = the feature mattered more for prediction (test error rises more when that feature is randomly shuffled). Compare to the lm coefficient plot — features ranking high in both are “robustly important.” A feature high in RF but flat in lm typically means RF is capturing a nonlinear or interaction effect that lm can’t.
Interpretation: RF’s top three features —
pct_transit, dist_to_manhattan_miles, and
pct_wfh — are largely the same top features the linear
model picked, so the predictive story holds across model classes. The
big difference: dist_subway_miles ranks #5 in RF (%IncMSE =
23.2) despite being essentially zero in the linear model (-0.008,
p=0.59) — clean confirmation that the relationship is nonlinear, which
is exactly what the spline was meant to capture. borough
only ranks #9 here — RF can use the continuous geographic features
directly, so it leans less on the categorical borough proxy than the lm
does. The bottom of the list (age buckets, pct_vacant)
carries little predictive weight; candidates for trimming if we ever
wanted a leaner feature set.
If RF residuals cluster geographically, the model is missing a feature with spatial structure (likely “proximity to Manhattan” per George’s flag). The map shows where the model under- or over-predicts; Moran’s I quantifies whether the clustering is statistically meaningful.
library(spdep)
## RESIDUAL MAP
residual_map_data <- final_map_data_rf %>%
select(GEOID, rf_residual) %>%
left_join(nyc_tracts_clean %>% select(GEOID, geometry), by = "GEOID") %>%
st_as_sf()
# Symmetric color scale capped at p99 of |residual| for readability
cap <- quantile(abs(residual_map_data$rf_residual), 0.99, na.rm = TRUE)
ggplot(residual_map_data) +
geom_sf(aes(fill = rf_residual), color = NA) +
scale_fill_gradient2(low = "blue", mid = "white", high = "red",
midpoint = 0,
limits = c(-cap, cap),
oob = scales::squish) +
theme_void() +
labs(title = "RF residuals — actual minus predicted % long commute",
subtitle = "Red = model under-predicts | Blue = model over-predicts",
fill = "Residual\n(pp)")
## MORAN'S I — is the geographic clustering statistically real?
tracts_for_moran <- nyc_tracts_clean %>%
filter(GEOID %in% final_map_data_rf$GEOID) %>%
arrange(match(GEOID, final_map_data_rf$GEOID))
neighbors_local <- poly2nb(tracts_for_moran)
## Warning in poly2nb(tracts_for_moran): some observations have no neighbours;
## if this seems unexpected, try increasing the snap argument.
## Warning in poly2nb(tracts_for_moran): neighbour object has 6 sub-graphs;
## if this sub-graph count seems unexpected, try increasing the snap argument.
weights_local <- nb2listw(neighbors_local, style = "W", zero.policy = TRUE)
moran_test <- moran.test(final_map_data_rf$rf_residual,
weights_local,
zero.policy = TRUE,
na.action = na.omit)
cat("Moran's I: ", round(moran_test$estimate["Moran I statistic"], 3), "\n")
## Moran's I: 0.036
cat("Expected: ", round(moran_test$estimate["Expectation"], 3), "\n")
## Expected: 0
cat("p-value: ", format.pval(moran_test$p.value), "\n")
## p-value: 0.0017458
How to interpret the map: Red = the model under-predicts (the real % long commute is higher than what the model expected). Blue = over-predicts. Patches of solid color = the model has a systematic geographic blind spot — a missing feature with spatial structure. A salt-and-pepper mix of red and blue = the model captures geography fine, residuals are just noise.
How to read Moran’s I:
# Snapshot of the analysis state just before we added dist_to_manhattan_miles.
# Hardcoded because the prior model state isn't reproducible from current code.
hist_before <- list(
morans_i = 0.079,
morans_i_p = "1e-10",
rmse_reduction_range = "0.3-0.4",
spline_rf_gap_pp = 1.4
)
The Moran’s I value above is after we added
dist_to_manhattan_miles. Before that feature,
Moran’s I was 0.079 (p < 1e-10), a much stronger spatial signal.
Here’s what changed when we added distance to Grand Central as a
CBD-proximity proxy. (This is a snapshot from before the spatial-lag
work that came later.)
uber_trips_per_capita,
pop_density) to lose their inflated magnitudes. They had
been silently proxying for Manhattan proximity.After that change, the pure linear baseline predicts ~0pp IBX relief
(its coefficient on dist_subway_miles is -0.008), while the
Linear + Spline and Random Forest models, which capture the saturating
subway-access effect, predict around 7-8pp: Linear + Spline at 7.9pp and
Random Forest at 7.4pp. Adding spatial lags on top of all this widened
the relief picture further. See the CV table interpretation above and
the IBX simulation diagnostics for the current numbers.
Predicts the effect of an intervention (e.g., a new medicine) by combining hundreds of small, simple decision trees to model complex, non-linear relationships, even without explicit programming. It excels at estimating, “What would have happened to this specific person if they had taken the treatment?”.
#install.packages("bartCause")
library(bartCause)
## Warning: package 'bartCause' was built under R version 4.5.2
##
## Attaching package: 'bartCause'
## The following object is masked from 'package:fabletools':
##
## refit
## The following object is masked from 'package:tidyr':
##
## extract
Treatment variable (0/1)
# add a binary 1/0 column noting which GEOID's where included in our IBX "treatment"
ml_data_scaled$ibx_treatment <- as.integer(ml_data_scaled$GEOID %in% ibx_geoid_list)
thinkCausal website:
# I was trying to use the thinkCausal website (https://apsta.shinyapps.io/thinkCausal/) to explore bartCause but it kept crashing when fitting our full model, but i was able to try using it with just a few variables at a time, some of the below sections were inspired from that
#write.csv(ml_data_scaled, "ml_data_scaled.csv")
# Drop ID column
BART_data <- ml_data_scaled[, !(names(ml_data_scaled) %in%
c("GEOID", "pct_commute_60p_logit"))]
BART_data <- na.omit(BART_data)
# Outcome
y <- BART_data$pct_commute_60p
# Treatment (must be numeric 0/1)
z <- as.numeric(BART_data$ibx_treatment)
# Covariates (confounders only)
# Keep x as a data.frame — borough is a factor, and as.matrix() would coerce
# the whole frame to character, which lands a NaN in bartc's sigma estimate.
x <- BART_data[, !(names(BART_data) %in% c("pct_commute_60p", "ibx_treatment"))]
Visualize the data: 1. overlap graph? (for ATE, ATT, and ATC). Overlap is a property of the data (treatment vs covariates)
ps_model <- glm(
ibx_treatment ~ . - pct_commute_60p,
data = BART_data,
family = binomial
)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
ps <- predict(ps_model, type = "response")
Basic overlap plot (treated vs control)
hist(ps[z == 1], col = rgb(1, 0, 0, 0.5),
xlim = c(0,1), main = "Propensity Score Overlap",
xlab = "Propensity Score")
hist(ps[z == 0], col = rgb(0, 0, 1, 0.5), add = TRUE)
legend("topright", legend = c("Treated", "Control"),
fill = c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)))
How to interpret: Red = treated tracts, blue = untreated. Good overlap (the two distributions both span the same range) is required for trustworthy causal estimates. Poor overlap (e.g., red cluster at high values, blue at low) means the BART model has to extrapolate — comparing treated tracts to controls that don’t really resemble them.
Weighting for ATE, ATT, ATC
w_ate <- ifelse(z == 1, 1/ps, 1/(1 - ps))
w_att <- ifelse(z == 1, 1, ps/(1 - ps))
w_atc <- ifelse(z == 1, (1 - ps)/ps, 1)
ATE plot: We want strong overlap across full range. Poor overlap implies ATE relies on extrapolation?
library(ggplot2)
BART_df <- data.frame(ps = ps, z = z, y = y, x=x)
BART_df <- na.omit(BART_df)
# ATE
ggplot(BART_df, aes(x = ps, fill = factor(z))) +
geom_density(alpha = 0.4, aes(weight = w_ate)) +
labs(title = "Overlap (ATE)", fill = "Treatment")
How to interpret: Same idea as the basic overlap plot, but reweighted for the ATE (effect on the average tract, treated or not). Tighter alignment between the two density curves = more trustworthy ATE estimate. Big gaps between curves at certain propensity values = those tracts have no good comparison group.
ATT plot: Do control units resemble treated units?
# ATT
ggplot(BART_df, aes(x = ps, fill = factor(z))) +
geom_density(alpha = 0.4, aes(weight = w_att)) +
labs(title = "Overlap (ATT)", fill = "Treatment")
How to interpret: Reweighted for the ATT — the effect on tracts that received treatment (the IBX corridor). Look for the control (blue) curve to match the shape of the treated (red) curve — that means we have similar-looking untreated tracts to compare each treated tract against.
ATC plot: Do treated units resemble controls?
# ATC
ggplot(BART_df, aes(x = ps, fill = factor(z))) +
geom_density(alpha = 0.4, aes(weight = w_atc)) +
labs(title = "Overlap (ATC)", fill = "Treatment")
How to interpret: Mirror image of the ATT plot — reweighted for tracts that did not receive treatment (everywhere except the IBX corridor). Good overlap here means we have treated tracts that resemble each control tract, so a hypothetical “if we treated this Bronx tract, what would happen?” estimate is well-grounded.
# loaded via pacman instead
#install.packages("cobalt")
#library(cobalt)
bal_ate <- bal.tab(x, treat = z, weights = w_ate, estimand = "ATE",un = TRUE,quick = TRUE)
bal_att <- bal.tab(x, treat = z, weights = w_att, estimand = "ATT",un = TRUE,quick = TRUE)
bal_atc <- bal.tab(x, treat = z, weights = w_atc, estimand = "ATC",un = TRUE,quick = TRUE)
Covariate Balance (ATE)
love.plot(bal_ate, threshold = 0.1, abs = TRUE,
title = "Covariate Balance (ATE)",stars = "raw",stats = "mean.diffs")
How to interpret: Each dot is a covariate. X-axis = absolute standardized mean difference (SMD) between treated and untreated groups for that covariate. Dashed line at 0.1 marks the standard “balanced” threshold. Anything to the right of 0.1 means treated and control groups differ meaningfully on that variable — and the weighting hasn’t fully fixed it. Two columns of dots compare before-weighting vs after-weighting.
Covariate Balance (ATT)
love.plot(bal_att, threshold = 0.1, abs = TRUE,
title = "Covariate Balance (ATT)", stars = "raw")
How to interpret: Same as the ATE balance plot but reweighted for the ATT estimand. This is the most relevant of the three for our project since the IBX claim is “what does treatment do for the treated tracts.” Anything past 0.1 here is a covariate where the comparison group doesn’t quite match the treated group — flag-able for the writeup.
Covariate Balance (ATC)
love.plot(bal_atc, threshold = 0.1, abs = TRUE,
title = "Covariate Balance (ATC)",stars = "raw")
How to interpret: Same again, but reweighted for the ATC. Less directly relevant to our IBX question, but useful as a diagnostic — covariates that are imbalanced for ATC but not ATT tell us where the model is making bigger extrapolation leaps for the untreated tracts.
Fit our model: 1. Causal Estimand: do we want ATE, ATT, and ATC? We’re adding a new train line which is alittle different than just giving a new drug or something? I think we’re most interested in ATT to measure how effective our treatment was for our area where we’re adding it.
ATE (Average Treatment Effect): Measures the average impact of a treatment if everyone in the population was treated versus if nobody was treated. It answers: “What is the overall effect of the intervention?”.
ATT (Average Treatment Effect on the Treated): Measures the average impact of a treatment only for those who actually received it. It answers: “Did the treatment work for those who got it?”.
ATC / ATU (Average Treatment Effect on the Control/Untreated): Measures the impact of treatment on the subgroup that did not receive it. It answers: “What would have happened if those who didn’t get the treatment actually did?”
ATE BART Fit
# Exclude the treatment mechanism (dist_subway_miles and its spatial lag) from
# the BART confounders. Including them would condition on the very feature the
# IBX changes, which controls away the main causal pathway.
dist_cols <- grep("dist_subway_miles", names(BART_df), value = TRUE)
exclude_cols <- c("y", "z", dist_cols)
set.seed(622)
bart_fit <- bartc(
response = BART_df$y,
treatment = BART_df$z,
confounders = BART_df[, !(names(BART_df) %in% exclude_cols)],
estimand = "ate"
)
## fitting treatment model via method 'bart'
## fitting response model via method 'bart'
cat("BART refitted excluding from confounders:", paste(dist_cols, collapse = ", "), "\n")
## BART refitted excluding from confounders: x.dist_subway_miles, x.lag_dist_subway_miles
summary(bart_fit)
## Call: bartc(response = BART_df$y, treatment = BART_df$z, confounders = BART_df[,
## !(names(BART_df) %in% exclude_cols)], estimand = "ate")
##
## Causal inference model fit by:
## model.rsp: bart
## model.trt: bart
##
## Treatment effect (population average):
## estimate sd ci.lower ci.upper
## ate -0.2973 0.7801 -1.826 1.232
## Estimates fit from 2220 total observations
## 95% credible interval calculated by: normal approximation
## population TE approximated by: posterior predictive distribution
## Result based on 500 posterior samples times 10 chains
class(bart_fit)
## [1] "bartcFit"
Individual treatment effects
ite <- fitted(bart_fit, type = "icate")
ATT “Assuming Ignorability and SUTVA are satisfied, for participants in our study that received the treatment/new train line, receiving the new train line/treatment led to a decrease of n units (percent decrease in 60 min commutes?) compared to what would have happened had these participants not received the treatment.”:
att <- mean(ite[z == 1], na.rm = TRUE)
att
## [1] -0.1861364
ATT Credible interval:
att_ci <- quantile(ite[z == 1], c(0.025, 0.975), na.rm = TRUE)
att_ci
## 2.5% 97.5%
## -0.49504799 0.04960185
Check: 1. ATT Distribution (Treated Units)
att <- mean(ite[z == 1], na.rm = TRUE)
hist(ite[z == 1],
main = "ATT Distribution (Treated Units)",
xlab = "Individual Treatment Effects")
abline(v = att, col = "red", lwd = 2)
How to interpret: Distribution of individual treatment effects (ITE) across the treated tracts. The red vertical line is the ATT (average across treated). Negative ITE values = treatment reduces long-commute % (i.e. helps). All-negative = treatment uniformly helps every treated tract, which is what we want to see.
Results:
Subgroup analysis: 1. treatment effect variation waterfall graph
Treatment Effect Heterogeneity (Waterfall graph)
ite_sorted <- as.numeric(ite)
ite_sorted <- sort(ite_sorted)
plot(ite_sorted, type = "l",
col = "black",
xlab = "Units (sorted)",
ylab = "Treatment Effect",
main = "Treatment Effect Heterogeneity (Waterfall graph)")
abline(h = 0, col = "red", lwd = 2)
How to interpret: Every tract’s ITE plotted in sorted order, smallest (most negative = biggest benefit) on the left, largest on the right. The red zero-line is the threshold — anything below it = treatment helps that tract. The slope tells you how heterogeneous the effects are: a steep curve = lots of variation in who benefits; a flat curve = effects are uniform.
Individual Treatment Effect (ITE)
ite <- as.numeric(fitted(bart_fit, type = "icate"))
var_model <- lm(ite ~ ., data = as.data.frame(x))
summary(var_model)
##
## Call:
## lm(formula = ite ~ ., data = as.data.frame(x))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.273472 -0.030732 0.007274 0.040740 0.196577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0750199 0.0027042 -27.742 < 2e-16 ***
## boroughBronx -0.3327836 0.0050323 -66.129 < 2e-16 ***
## boroughManhattan -0.2848545 0.0066419 -42.887 < 2e-16 ***
## boroughQueens -0.3771035 0.0045216 -83.400 < 2e-16 ***
## boroughStaten Island -0.3012093 0.0090001 -33.467 < 2e-16 ***
## stations_per_sq_mile -0.0084048 0.0026402 -3.183 0.001476 **
## stations_per_1000_pop -0.0064303 0.0037772 -1.702 0.088829 .
## dist_subway_miles -0.0052244 0.0089753 -0.582 0.560570
## dist_to_manhattan_miles 0.0274150 0.0243614 1.125 0.260566
## uber_trips_per_capita 0.0006526 0.0038519 0.169 0.865478
## bike_trips_per_capita 0.0046746 0.0038586 1.211 0.225844
## med_income 0.0105479 0.0030725 3.433 0.000608 ***
## pct_poverty -0.0012517 0.0024499 -0.511 0.609452
## pct_vacant 0.0023991 0.0018014 1.332 0.183068
## pct_renter -0.0028323 0.0034743 -0.815 0.415038
## pct_2plus_cars 0.0078505 0.0034746 2.259 0.023957 *
## pop_density 0.0039865 0.0034012 1.172 0.241293
## pct_no_car 0.0097711 0.0045761 2.135 0.032854 *
## pct_transit 0.0231793 0.0030510 7.597 4.45e-14 ***
## pct_bachelors -0.0033945 0.0025973 -1.307 0.191379
## pct_wfh -0.0009672 0.0023507 -0.411 0.680786
## pct_walk -0.0064697 0.0025093 -2.578 0.009992 **
## pct_age_18_24 0.0137972 0.0018174 7.592 4.64e-14 ***
## pct_age_25_34 0.0239213 0.0024834 9.632 < 2e-16 ***
## pct_age_35_44 0.0166713 0.0021515 7.749 1.41e-14 ***
## pct_age_45_64 0.0150208 0.0020725 7.248 5.85e-13 ***
## pct_age_65_plus 0.0132189 0.0023138 5.713 1.26e-08 ***
## lag_dist_to_manhattan_miles -0.0250021 0.0246521 -1.014 0.310600
## lag_dist_subway_miles 0.0094854 0.0094393 1.005 0.315066
## lag_pop_density 0.0084221 0.0040094 2.101 0.035793 *
## lag_pct_transit 0.0312416 0.0039180 7.974 2.45e-15 ***
## lag_pct_wfh 0.0110938 0.0038888 2.853 0.004375 **
## lag_med_income 0.0225664 0.0037486 6.020 2.04e-09 ***
## lag_stations_per_1000_pop 0.0010901 0.0035368 0.308 0.757954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06706 on 2186 degrees of freedom
## Multiple R-squared: 0.8801, Adjusted R-squared: 0.8782
## F-statistic: 486.1 on 33 and 2186 DF, p-value: < 2.2e-16
I guess this is expected, we put our new treatment line in queens, then ofc the biggest predictor of reduced commute from our new line in queens… is that the district is in queens…
coef_df <- data.frame(
variable = names(coef(var_model))[-1],
effect = coef(var_model)[-1]
)
ggplot(coef_df, aes(x = reorder(variable, effect), y = effect)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Predictors of Treatment Effect Variation",
x = "Covariates",
y = "Effect on ITE")
How to interpret: Each bar is a feature’s coefficient when regressing BART’s ITE on covariates across all NYC tracts. Negative coefficient = bigger benefit (more negative ITE) per unit of that feature. Positive = smaller benefit. Borough bars dominate because they capture geography that the continuous predictors don’t. Coefficients for boroughs the IBX doesn’t actually serve (Bronx, Manhattan, Staten Island) reflect BART’s extrapolation of “what would happen if we did treat them” — interpret cautiously.
Individual Treatment Effect (ITE) for just treated, and X covariates just for treated:
x_treated <- x[z == 1, ]
ite_treated <- ite[z == 1]
var_model_treated <- lm(ite_treated ~ ., data = as.data.frame(x_treated))
coef_df <- data.frame(
variable = names(coef(var_model_treated))[-1],
effect = coef(var_model_treated)[-1]
)
Still Queens is Queen. But Bike trips per capita is up there too now! More stations per 1000 pop predicting a higher expected commute after treatment is confusing to me? Same with no car percent? Positive coefficient increases treatment effect among treated. Negative coefficient decreases treatment effect among treated.
ggplot(coef_df, aes(x = reorder(variable, effect), y = effect)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Predictors of Treatment Effect Variation (Treated Only)",
x = "Covariates",
y = "Effect on ITE")
How to interpret: Same as the previous bar plot but limited to actually-treated tracts — no extrapolation. More defensible than the all-tracts version. Negative bar = feature is associated with bigger benefit. The boroughQueens bar (negative) means Queens treated tracts benefit more than Brooklyn treated tracts (Brooklyn is the reference). The continuous predictors are mostly small here — borough explains most of the heterogeneity.
Why are there MORE expected 60+ min commutes in brooklyn post treatment??? More stations should just reduce expected commute times, i dont understand, that number should fall. I think we’ must’ve messed something up using our tracts geographic center for predicting commute times?
ggplot(x_treated, aes(x = ite_treated)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
facet_wrap(~ borough) +
labs(title = "Treatment Effect Distribution by Borough (Treated Only)",
x = "Treatment Effect",
y = "Count") +
theme_minimal()
How to interpret: ITE distribution among treated tracts, faceted by borough. All values being negative = treatment helps every treated tract in that borough (good). Boroughs whose histograms sit further left (more negative) benefit more on average. Width of the histogram = how heterogeneous treatment effects are within that borough.
ggplot(x_treated, aes(x = ite_treated, color = borough)) +
geom_density(size = 1) +
labs(title = "Treatment Effect Density by Borough",
x = "ITE",
y = "Density") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
How to interpret: Smoothed version of the histograms above, easier to compare across boroughs at a glance. The further left a borough’s curve sits, the bigger the average benefit. Tight curves = uniform effects within that borough; wide curves = heterogeneous effects (some tracts benefit a lot, others less).
Tree-based subgroup discovery for just treatment group only: Which splits in covariates create groups with different treatment effects?
library("rpart.plot")
## Warning: package 'rpart.plot' was built under R version 4.5.2
## Loading required package: rpart
library(rpart)
tree_model <- rpart(ite_treated ~ ., data = x_treated, method = "anova", control = rpart.control(cp = 0.001, minbucket = 10))
rpart.plot(tree_model, main = "Subgroups Driving Treatment Effect Heterogeneity")
How to interpret: A regression tree on ITE for treated tracts only. Each split is the question that best separates “tracts with bigger benefit” from “tracts with smaller benefit.” Each terminal node shows the predicted ITE for tracts that follow that path (the % at the bottom shows what share of tracts ended up there). Read top-to-bottom along a branch to identify “tracts with feature X beyond Y benefit Z amount.”
# Append BART's individual treatment effect (ITE) per tract to the map data so
# the Shiny app can visualize the causal estimates alongside the predictive ones.
# ITE rows match ml_data_scaled row order; we map back to GEOID for a safe join.
bart_ite_df <- data.frame(
GEOID = ml_data_scaled$GEOID,
bart_ite = as.numeric(fitted(bart_fit, type = "icate"))
)
final_map_data_rf <- final_map_data_rf %>%
select(-any_of(c("bart_ite", "bart_ite_treated_only"))) %>%
left_join(bart_ite_df, by = "GEOID") %>%
mutate(
bart_ite_treated_only = ifelse(is_ibx_location == 1, bart_ite, NA_real_)
)
saveRDS(final_map_data_rf, file.path(projPath, "data", "model_input.rds"))
The question we wanted to answer was: how would the new Interborough Express (IBX) subway line change commute times for the NYC neighborhoods it would serve? Our best model says the IBX would cut the share of 60+ minute commutes by up to ~9.3pp in tracts within walking distance of a new station, plus measurable benefits spilling over to 113 nearby tracts that don’t have a station of their own. The benefit isn’t even across the city. Queens neighborhoods gain the most on average, in both our predictive simulation and our causal estimates. Brooklyn tracts gain too, just less.
Predictive modeling. We trained six versions of a model that predicts the share of long commutes in each tract. Three core types (Linear Regression, Linear + Spline, Random Forest), each tested with and without “spatial-lag” features. Spatial lags are just a way of letting each tract’s model also see its neighbors’ values. 10-fold cross-validation picked Random Forest + Spatial Lags as the best at 7.14 RMSE, with plain Random Forest close behind at 7.24 and statistically tied within 1 SE. The neighbor-aware features helped every type. Basically, what’s happening around a tract carries real information about that tract’s outcomes. To run the IBX simulation, we set each treated tract’s distance to the subway to the city’s smallest value (the closest any NYC tract is to a station) and re-ran the predictions. The spatial-lag versions go one step further. After the distance changes for treated tracts, we re-compute the neighbor-averages too, which is how the spillover effect shows up.
Causal modeling. Separately from the predictive
modeling, we used a method called bartCause to estimate what each
tract’s commute outcome would be if it got an IBX station of its own.
One important fix happened along the way. An earlier version of the
model included dist_subway_miles as a confounder, but that
variable IS the mechanism the IBX changes. Controlling for it removed
the effect we were trying to measure. Excluding it (and its spatial lag)
from the BART confounders gave us a causal estimate that agrees with the
predictive simulation in direction. Magnitudes are smaller than the
predictive ones (BART’s max benefit is around 0.5pp versus the
predictive simulation’s 9pp max relief), which is expected. BART
estimates an average causal effect with Bayesian shrinkage toward zero,
while the predictive model reports max relief under an extreme scenario.
Queens treated tracts get the biggest causal benefit (mean -0.46pp).
Brooklyn benefits too but less (mean -0.09pp). 49 Brooklyn tracts show
tiny positive ITE that we treat as noise. The model also produces
counterfactual estimates for boroughs the IBX doesn’t actually serve
(Bronx, Manhattan, Staten Island), but we treat those with caution since
BART is extrapolating into territory it didn’t see in the data.
Limitations.
Future work.
dist_subway_miles (the propensity formula still includes
it, which carries the mediator effect partly through the propensity
score).Spatial lag features — include averages of neighboring tract values to account for spatial autocorrelation; used in similar transit research and directly relevant to the IBX corridor analysis
K-fold cross-validation — more robust evaluation than a single 80/20 split given our dataset size
George: it might be good to have a “proximity to Manhattan” feature, otherwise the coefficients that are going to be learned for other coefficients will partially be capturing the geographic structure.
George: Second, you may want think about whether linear relationships are appropriate for all variables, particularly those which have outlier values (for example the citibike share or uber/taxi share). I also suspect the distance from subway relationship is nonlinear, like the difference from 20 minutes and 25 minutes walk to the subway might not matter. This is important because if you use a linear model, the relationship at short walk times will be constrained by the data for long walk times.
George: Try the the bartCause R package. It has built in a lot of model checks that determine how much you can truly say about causal effects given the data. read about it more here: https://apsta.shinyapps.io/thinkCausal/. IMO it is better than the CausalForest stuff from the EconML package, but that is my subjective opinion.
George: In the predicted 60m commute time, note the presence of negative values. I highly recommend applying a transformation to take the interval from 0-1 to the entire number line, i.e. a log transformation or a sigmoid function of some type. This would prevent the negative number issue.
BART mediator-as-confounder fix: An earlier version of bart_fit
included dist_subway_miles and
lag_dist_subway_miles in the confounders, which IS the
variable the IBX changes. Controlling for it removed the causal pathway
we were trying to measure. The original BART estimates came back with
tiny positive ITE for most treated tracts (all 148 Brooklyn treated
tracts positive, mean +0.5pp), which contradicted the predictive
simulation. We refit BART without those two columns. The new estimates
agree with the predictive simulation in direction (treatment helps most
treated tracts) but with smaller magnitudes due to BART’s Bayesian
shrinkage toward zero.