# install.packages("tidyverse")
# install.packages("httr")
# install.packages("jsonlite")
# install.packages("tidycensus")
# install.packages("fredr")
# install.packages("lubridate")
# install.packages("zoo")
# install.packages("gt")
# install.packages("scales")
# install.packages("patchwork")
# install.packages("dplyr")
# install.packages("shiny")
# install.packages("leaflet")
# install.packages("tigris")
# install.packages("sf")
# install.packages("tidymodels")
# install.packages("vip")Final Project
Housing Affordability Dynamics in New York State: A Multi-Source Analysis Using HUD, Census ACS, and FRED Economic Indicators
Project Description
Housing affordability has emerged as one of the defining economic pressures facing New York State residents over the past two decades. Median home prices and rental costs have consistently outpaced income growth, leaving a growing share of households spending disproportionate fractions of their earnings on shelter. This project investigates the scope and drivers of that affordability gap by integrating three complementary data sources: the U.S. Census Bureau’s American Community Survey (ACS) 5-Year Estimates for county-level income and housing cost data, the Federal Reserve Economic Data (FRED) API for macroeconomic indicators such as mortgage rates and inflation, and the U.S. Department of Housing and Urban Development (HUD) Fair Market Rent (FMR) schedules as a policy-grounded benchmark for rental affordability.
The analysis follows an OSEMN-style data science workflow — Obtain, Scrub, Explore, Model, Interpret — progressing from API-based data acquisition, through cleaning and feature engineering, into exploratory visualization, predictive modeling, and a novel application of machine unlearning to assess how sensitive model predictions are to a specific economic shock period. Two research questions structure the modeling work: first, what structural and macroeconomic factors best predict a county’s housing burden? Second, how much did the high-interest-rate years of 2022–2024 shape what models learned about rental market conditions?
Data Sources
The study combines multiple heterogeneous datasets to construct a unified housing affordability framework:
U.S. Census ACS 5-Year Estimates (via Census API) for median income, home values, and rent data at the county level. Available here: https://api.census.gov/data.html
Federal Reserve Economic Data (FRED API) for macroeconomic indicators including mortgage rates, federal funds rate, and inflation (CPI). Available here: https://fred.stlouisfed.org/docs/api/fred/
HUD Fair Market Rent (FMR) data from publicly available GitHub-hosted datasets. Available here: https://www.huduser.gov/portal/datasets/fmr.html. All year’s data are saved as CSV files and pulled from GitHub: https://github.com/NafeesKhandker/DATA-607-Final-Project/tree/main/HUD%20dataset
This multi-source integration enables a comprehensive view of housing affordability across both local and macroeconomic dimensions.
Package Installation
ACS 5-Year Data Pull
Median household income, home values, and gross rent are pulled for all 62 New York counties from the Census ACS 5-Year Estimates API, covering 2009 to 2024. County names are cleaned and all numeric fields are standardized before the data is stacked into a single panel and saved as a CSV.
library(tidyverse)
library(httr)
library(jsonlite)
library(tidycensus)
library(fredr)
library(lubridate)
library(zoo)
library(gt)############################################################
# ACS 5-YEAR - MULTI-YEAR PULL (NY COUNTIES)
############################################################
library(tidyverse)
library(jsonlite)
years <- 2009:2024
acs_5yr_all <- list()
for (yr in years) {
url <- paste0(
"https://api.census.gov/data/", yr, "/acs/acs5?",
"get=NAME,B19013_001E,B25077_001E,B25064_001E&",
"for=county:*&in=state:36"
)
try({
raw <- fromJSON(url)
df <- as.data.frame(raw[-1, ], stringsAsFactors = FALSE)
colnames(df) <- raw[1, ]
df <- df %>%
mutate(year = yr) %>%
rename(
county_name = NAME,
median_income = B19013_001E,
median_home = B25077_001E,
median_rent = B25064_001E,
state_fips = state,
county_fips = county
) %>%
# CLEAN COUNTY NAME
mutate(
county_name = county_name %>%
str_remove(", New York") %>%
str_trim()
) %>%
# NUMERIC CONVERSION
mutate(
median_income = as.numeric(median_income),
median_home = as.numeric(median_home),
median_rent = as.numeric(median_rent)
)
acs_5yr_all[[as.character(yr)]] <- df
}, silent = TRUE)
}
############################################################
# COMBINE ALL YEARS
############################################################
acs_ny_5yr <- bind_rows(acs_5yr_all)
############################################################
# EXPORT CSV
############################################################
write_csv(acs_ny_5yr, "acs_ny_5year_2005_2024.csv")
############################################################
# CHECK
############################################################
acs_ny_5yr |> head(5) |> gt()| county_name | median_income | median_home | median_rent | state_fips | county_fips | year |
|---|---|---|---|---|---|---|
| Albany County | 55350 | 192500 | 830 | 36 | 001 | 2009 |
| Allegany County | 40917 | 65100 | 566 | 36 | 003 | 2009 |
| Bronx County | 33794 | 369600 | 885 | 36 | 005 | 2009 |
| Broome County | 43467 | 95600 | 598 | 36 | 007 | 2009 |
| Cattaraugus County | 41482 | 76000 | 576 | 36 | 009 | 2009 |
ACS Exploratory Data Analysis
Two charts summarize statewide housing trends across the study period. The left panel plots median income, home values, and rent as separate trend lines. The most striking feature is the steep acceleration in home values after 2019, driven by pandemic-era demand and constrained supply, while income growth remained gradual. Rent increased steadily but at a slower pace than home values, suggesting that the ownership market absorbed the larger share of the post-2020 shock.
The right panel converts these trends into a price-to-income (PTI) ratio — home value divided by median income — which is a standard affordability benchmark. A PTI above 3.0× is conventionally considered unaffordable. The statewide mean PTI has remained above 3.0× throughout the entire 2009–2024 period, meaning New York counties have been in persistent affordability stress for over a decade. The interquartile ribbon widens over time, indicating that the gap between lower-burden rural counties and higher-burden urban counties has grown, not narrowed. The sharpest increase occurs after 2020, where the mean PTI climbs toward 4.5× — a level that signals severe structural unaffordability rather than a temporary cyclical dip.
############################################################
# ACS 5-YEAR EDA — STATEWIDE TRENDS + AFFORDABILITY RATIO
############################################################
library(tidyverse)
library(scales)
library(patchwork)
# ── PREP ─────────────────────────────────────────────────
acs_clean <- acs_ny_5yr |>
filter(median_income > 0, median_home > 0, median_rent > 0)
state_trends <- acs_clean |>
group_by(year) |>
summarise(
`Median Income` = mean(median_income, na.rm = TRUE),
`Median Home Value` = mean(median_home, na.rm = TRUE),
`Median Monthly Rent` = mean(median_rent, na.rm = TRUE),
.groups = "drop"
) |>
pivot_longer(-year, names_to = "Metric", values_to = "Value")
# ── PLOT 1: TRENDS ───────────────────────────────────────
p1 <- ggplot(state_trends, aes(x = year, y = Value, color = Metric)) +
geom_line(linewidth = 1.3) +
geom_point(size = 2) +
facet_wrap(~Metric, scales = "free_y", ncol = 1) +
scale_color_manual(values = c(
"Median Income" = "#1B9E77",
"Median Home Value" = "#D95F02",
"Median Monthly Rent" = "#7570B3"
)) +
scale_x_continuous(breaks = seq(2009, 2024, 2)) +
scale_y_continuous(labels = label_dollar(scale_cut = cut_short_scale())) +
labs(
title = "NY Housing and Income Trends",
subtitle = "State averages across all NY counties",
x = "Year", y = "USD"
) +
theme_minimal(base_size = 13) +
theme(
legend.position = "none",
strip.text = element_text(face = "bold", size = 11),
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11, color = "gray40"),
axis.title = element_text(face = "bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
)
# ── PLOT 2: PRICE-TO-INCOME RATIO ────────────────────────
pti_summary <- acs_clean |>
mutate(price_to_income = median_home / median_income) |>
group_by(year) |>
summarise(
pti_mean = mean(price_to_income, na.rm = TRUE),
pti_lo = quantile(price_to_income, 0.25, na.rm = TRUE),
pti_hi = quantile(price_to_income, 0.75, na.rm = TRUE),
.groups = "drop"
)
p2 <- ggplot(pti_summary, aes(x = year, y = pti_mean)) +
geom_ribbon(aes(ymin = pti_lo, ymax = pti_hi), fill = "#D95F02", alpha = 0.15) +
geom_line(color = "#D95F02", linewidth = 1.3) +
geom_point(color = "#D95F02", size = 2) +
geom_hline(yintercept = 3, linetype = "dashed", color = "gray50") +
annotate("text", x = 2009, y = 3.15, label = "3× affordability threshold",
hjust = 0, size = 3.5, color = "gray40") +
scale_x_continuous(breaks = seq(2009, 2024, 2)) +
scale_y_continuous(labels = label_number(suffix = "×", accuracy = 0.1)) +
labs(
title = "Home Price-to-Income Ratio",
subtitle = "Mean across NY counties; shaded band = interquartile range",
x = "Year", y = "Home Value ÷ Median Income"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11, color = "gray40"),
axis.title = element_text(face = "bold"),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank()
)
# ── COMBINE ──────────────────────────────────────────────
p1 + p2 +
plot_layout(widths = c(1, 1)) +
plot_annotation(
caption = "Note: ACS 5-Year data averages survey responses over 5 years, producing smoother trends that reduce year-to-year volatility but may lag real-time market changes.",
theme = theme(
plot.caption = element_text(size = 10, color = "gray40", hjust = 0, face = "italic")
)
)FRED Data Pull
Three macroeconomic series are retrieved from the FRED API: the 30-year fixed mortgage rate, the effective Federal Funds Rate, and the CPI. A retry-enabled function handles intermittent server errors, attempting each request up to five times before failing. All three series are aligned to monthly frequency, joined together, and exported for use in the master dataset.
############################################################
# FRED DATA PULL
############################################################
library(tidyverse)
library(httr)
library(jsonlite)
library(lubridate)
library(zoo)
############################################################
# SAFE FUNCTION WITH RETRY LOGIC
############################################################
get_fred_data <- function(series_id, retries = 5, wait_sec = 10) {
api_key <- Sys.getenv("FRED_API_KEY")
url <- paste0(
"https://api.stlouisfed.org/fred/series/observations?",
"series_id=", series_id,
"&api_key=", api_key,
"&file_type=json"
)
for(i in 1:retries){
cat("Trying", series_id, "- Attempt", i, "\n")
result <- try(GET(url), silent = TRUE)
if(!inherits(result, "try-error") && status_code(result) == 200){
txt <- content(result, "text", encoding = "UTF-8")
data_raw <- fromJSON(txt)
df <- data_raw$observations %>%
select(date, value) %>%
mutate(
date = as.Date(date),
value = as.numeric(value)
)
return(df)
}
Sys.sleep(wait_sec)
}
stop(paste("Failed after retries:", series_id))
}
############################################################
# PULL DATA
############################################################
mortgage30 <- get_fred_data("MORTGAGE30US")Trying MORTGAGE30US - Attempt 1
Sys.sleep(5)
fed_funds <- get_fred_data("FEDFUNDS")Trying FEDFUNDS - Attempt 1
Sys.sleep(5)
cpi <- get_fred_data("CPIAUCSL")Trying CPIAUCSL - Attempt 1
############################################################
# MONTHLY CLEAN DATASET
############################################################
mortgage_m <- mortgage30 %>%
mutate(month = floor_date(date, "month")) %>%
group_by(month) %>%
summarise(mortgage_rate = mean(value, na.rm = TRUE), .groups = "drop")
fed_m <- fed_funds %>%
mutate(month = floor_date(date, "month")) %>%
group_by(month) %>%
summarise(fed_rate = mean(value, na.rm = TRUE), .groups = "drop")
cpi_m <- cpi %>%
mutate(month = floor_date(date, "month")) %>%
group_by(month) %>%
summarise(cpi = mean(value, na.rm = TRUE), .groups = "drop")
fred_full_monthly <- mortgage_m %>%
inner_join(fed_m, by = "month") %>%
inner_join(cpi_m, by = "month") %>%
arrange(month) %>%
mutate(
mortgage_rate = na.approx(mortgage_rate, na.rm = FALSE),
fed_rate = na.approx(fed_rate, na.rm = FALSE),
cpi = na.approx(cpi, na.rm = FALSE)
)
############################################################
# EXPORT
############################################################
write_csv(fred_full_monthly, "fred_full_monthly_1971_present.csv")
fred_full_monthly |> head(5) |> gt()| month | mortgage_rate | fed_rate | cpi |
|---|---|---|---|
| 1971-04-01 | 7.3100 | 4.16 | 40.1 |
| 1971-05-01 | 7.4250 | 4.63 | 40.3 |
| 1971-06-01 | 7.5300 | 4.91 | 40.5 |
| 1971-07-01 | 7.6040 | 5.31 | 40.6 |
| 1971-08-01 | 7.6975 | 5.57 | 40.7 |
FRED Exploratory Data Analysis
Two charts provide macroeconomic context for the housing analysis. The top panel plots the 30-year mortgage rate alongside the Federal Funds Rate from 1971 to the present. The two lines track closely, confirming that retail mortgage costs move in step with monetary policy. The most relevant feature for this project is the sharp rate reversal beginning in early 2022 — the Fed Funds Rate climbed from near zero to over 5%, pushing the 30-year mortgage rate above 7% for the first time since 2001. This rate shock directly reduced purchasing power for homebuyers and added upward pressure on rents as demand shifted toward the rental market.
The bottom panel shows annual CPI inflation as a bar chart. Inflation ran near or below the 2% Fed target for most of 2010–2020, then surged to nearly 9% in 2022 — the highest level since 1981. This inflation spike eroded real household income at the same time that mortgage rates were rising, creating a dual affordability squeeze. The 2% target line shows how far the post-pandemic inflation deviated from the prior decade’s baseline, and the rapid return toward target by 2023–2024 provides important context for interpreting the rate environment during the machine unlearning experiment in Research Question 2.
############################################################
# FRED EDA — INTEREST RATES + CPI (YoY INFLATION)
############################################################
library(tidyverse)
library(scales)
library(patchwork)
# Okabe-Ito palette
COL_MORTGAGE <- "#0072B2" # blue
COL_FED <- "#D55E00" # vermillion
COL_CPI_POS <- "#D55E00" # vermillion (inflation up)
COL_CPI_NEG <- "#0072B2" # blue (deflation)
# ── PLOT 1: MORTGAGE RATE vs FED FUNDS RATE ──────────────
p1 <- fred_full_monthly |>
select(month,
`30-Yr Mortgage Rate` = mortgage_rate,
`Fed Funds Rate` = fed_rate) |>
pivot_longer(-month, names_to = "Series", values_to = "Rate") |>
ggplot(aes(x = month, y = Rate, color = Series)) +
geom_line(linewidth = 0.9) +
scale_color_manual(values = c(
"30-Yr Mortgage Rate" = COL_MORTGAGE,
"Fed Funds Rate" = COL_FED
)) +
scale_x_date(date_breaks = "5 years", date_labels = "%Y") +
scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
labs(
title = "Interest Rates Over Time",
subtitle = "30-year fixed mortgage rate closely tracks Fed policy cycles",
x = NULL, y = "Rate (%)", color = NULL
) +
theme_minimal(base_size = 13) +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11, color = "gray40"),
axis.title = element_text(face = "bold"),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank()
)
# ── PLOT 2: CPI YEAR-OVER-YEAR % CHANGE ──────────────────
cpi_yoy <- fred_full_monthly |>
arrange(month) |>
mutate(
cpi_yoy = (cpi / lag(cpi, 12) - 1) * 100,
positive = cpi_yoy >= 0
) |>
filter(!is.na(cpi_yoy))
p2 <- ggplot(cpi_yoy, aes(x = month, y = cpi_yoy)) +
geom_hline(yintercept = 0, color = "gray60", linewidth = 0.5) +
geom_hline(yintercept = 2, linetype = "dashed", color = "gray50", linewidth = 0.5) +
geom_col(aes(fill = positive), width = 20, show.legend = FALSE) +
annotate("text", x = min(cpi_yoy$month), y = 2.4,
label = "2% Fed target", hjust = 0, size = 3.2, color = "gray40") +
scale_fill_manual(values = c("TRUE" = COL_CPI_POS, "FALSE" = COL_CPI_NEG)) +
scale_x_date(date_breaks = "5 years", date_labels = "%Y") +
scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
labs(
title = "Annual Inflation Rate (CPI Year-over-Year)",
subtitle = "Monthly % change vs. same month prior year; 2022 spike reflects post-COVID surge",
x = NULL, y = "YoY Change (%)"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11, color = "gray40"),
axis.title = element_text(face = "bold"),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank()
)
# ── COMBINE ───────────────────────────────────────────────
p1 / p2 +
plot_annotation(
caption = "Note: Monthly FRED data. Mortgage rate averaged to monthly from weekly Freddie Mac survey. Fed Funds Rate is the effective rate. CPI inflation = YoY % change in CPIAUCSL (seasonally adjusted).",
theme = theme(
plot.caption = element_text(size = 10, color = "gray40", hjust = 0, face = "italic")
)
)FRED ACS-Aligned Yearly Aggregate
To align with the ACS panel, the monthly FRED data is aggregated into annual averages for 2009–2024. This produces a compact 16-row table with one row per year, capturing the average mortgage rate, Fed Funds Rate, and CPI level — ready to join with the county-level data. The table confirms the historically low-rate environment of 2010–2021 (mortgage rate averaging around 3.5–5%) before the sharp rise to 6.28% in 2022–2024.
############################################################
# ACS-ALIGNED YEARLY FRED (CLEAN VERSION)
############################################################
library(dplyr)
library(lubridate)
start_date <- as.Date("2009-01-01")
fred_acs_ready <- fred_full_monthly %>%
filter(month >= start_date) %>%
mutate(year = year(month)) %>%
group_by(year) %>%
summarise(
# macroeconomic level indicators (core variables)
mortgage_rate = mean(mortgage_rate, na.rm = TRUE),
fed_rate = mean(fed_rate, na.rm = TRUE),
cpi = mean(cpi, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(year)
# View
head(fred_acs_ready, 5)# A tibble: 5 × 4
year mortgage_rate fed_rate cpi
<dbl> <dbl> <dbl> <dbl>
1 2009 5.04 0.16 215.
2 2010 4.69 0.175 218.
3 2011 4.46 0.102 225.
4 2012 3.66 0.14 230.
5 2013 3.98 0.108 233.
# Save
write_csv(fred_acs_ready, "fred_acs_ready_2009_present.csv")
fred_acs_ready |> head(5) |> gt()| year | mortgage_rate | fed_rate | cpi |
|---|---|---|---|
| 2009 | 5.041375 | 0.1600000 | 214.5647 |
| 2010 | 4.690583 | 0.1750000 | 218.0762 |
| 2011 | 4.455833 | 0.1016667 | 224.9230 |
| 2012 | 3.655917 | 0.1400000 | 229.5861 |
| 2013 | 3.981917 | 0.1075000 | 232.9518 |
HUD Fair Market Rent Data Pull
HUD Fair Market Rent data for all five bedroom sizes — studio through four-bedroom — are downloaded for each New York county from fiscal years 2009 to 2024. Files are read from GitHub and stacked into a single panel using map_dfr(). FMRs represent the 40th percentile of gross rents paid by recent movers and serve as the federal benchmark for housing voucher payment standards under the Section 8 program. The preview confirms the expected structure — in 2009, Bronx County FMRs already ranged from $1,091 (studio) to $1,817 (4-BR), substantially above most upstate counties.
library(tidyverse)
############################################################
# PULL HUD FMR DATA FROM GITHUB (2009–2024)
############################################################
years <- sprintf("%02d", 9:24)
base_url <- "https://raw.githubusercontent.com/NafeesKhandker/DATA-607-Final-Project/main/HUD%20dataset/"
############################################################
# FUNCTION TO READ EACH FILE
############################################################
get_hud_year <- function(yy){
file_url <- paste0(base_url, "FY", yy, "_FMRs.csv")
message("Reading: ", file_url)
df <- read_csv(file_url, show_col_types = FALSE)
df %>%
mutate(
year = as.integer(paste0("20", yy))
)
}
############################################################
# COMBINE ALL YEARS
############################################################
hud_all <- map_dfr(years, get_hud_year)
############################################################
# CLEAN COLUMN NAMES
############################################################
names(hud_all) <- names(hud_all) %>%
str_to_lower() %>%
str_replace_all(" ", "_")
############################################################
# VIEW DATA
############################################################
hud_all |> head(n = 5) |> gt()| stusps | state | countyname | pop | fmr_0 | fmr_1 | fmr_2 | fmr_3 | fmr_4 | year |
|---|---|---|---|---|---|---|---|---|---|
| NY | 36 | Albany County | 294565 | 686 | 711 | 868 | 1039 | 1135 | 2009 |
| NY | 36 | Allegany County | 49927 | 553 | 555 | 665 | 829 | 1018 | 2009 |
| NY | 36 | Bronx County | 1332650 | 1091 | 1180 | 1313 | 1615 | 1817 | 2009 |
| NY | 36 | Broome County | 200536 | 580 | 583 | 697 | 910 | 1067 | 2009 |
| NY | 36 | Cattaraugus County | 83955 | 560 | 562 | 676 | 888 | 1019 | 2009 |
HUD FMR Exploratory Data Analysis
A dot plot of 2024 HUD FMRs displays the full spread of rental costs across all 62 counties by unit size, sorted from lowest to highest two-bedroom rent. The chart is split into lower-rent and higher-rent panels for readability.
The findings reveal two important patterns. First, geographic variation is enormous: 2-BR FMRs in 2024 range from under $850 in Allegany, Cattaraugus, and Wyoming counties to over $2,500 in New York City and Rockland County — nearly a 3× difference from the cheapest to the most expensive market in the same state. Second, the width of each county’s connecting line — which spans all five bedroom sizes — is much larger in high-cost markets. For example, the spread between a studio and a 4-BR unit in New York City exceeds $1,000/month, while the same spread in a rural county is under $300. This bedroom-size premium effect reflects tighter supply of family-sized units in dense urban areas, where larger apartments command a disproportionate market premium beyond the base rent level.
############################################################
# HUD FMR — FMR BY COUNTY & BEDROOM SIZE (2024)
############################################################
library(tidyverse)
library(scales)
library(patchwork)
# ── PREP ─────────────────────────────────────────────────
hud_avg <- hud_all |>
filter(stusps == "NY", year == 2024) |>
mutate(county_label = str_remove(countyname, " County")) |>
group_by(county_label) |>
summarise(across(fmr_0:fmr_4, ~ mean(.x, na.rm = TRUE)), .groups = "drop") |>
pivot_longer(fmr_0:fmr_4, names_to = "bedroom", values_to = "avg_fmr") |>
mutate(bedroom = recode(bedroom,
fmr_0 = "Studio", fmr_1 = "1-BR", fmr_2 = "2-BR",
fmr_3 = "3-BR", fmr_4 = "4-BR"
))
county_order <- hud_avg |>
filter(bedroom == "2-BR") |>
arrange(avg_fmr) |>
pull(county_label)
hud_avg <- hud_avg |>
mutate(
county_label = factor(county_label, levels = county_order),
bedroom = factor(bedroom, levels = c("Studio","1-BR","2-BR","3-BR","4-BR")),
panel = ifelse(
as.integer(county_label) > length(county_order) / 2,
"Higher-Rent Counties", "Lower-Rent Counties"
)
)
COLOR_VALS <- c(
"Studio" = "#332288", "1-BR" = "#44AA99", "2-BR" = "#117733",
"3-BR" = "#DDCC77", "4-BR" = "#CC6677"
)
# ── SHARED PLOT FUNCTION ──────────────────────────────────
dot_panel <- function(data, show_legend = FALSE) {
ggplot(data, aes(x = avg_fmr, y = county_label, color = bedroom)) +
geom_line(aes(group = county_label), color = "gray80", linewidth = 0.5) +
geom_point(size = 2.2, alpha = 0.9) +
scale_color_manual(values = COLOR_VALS, name = "Bedroom Size") +
scale_x_continuous(
labels = label_dollar(),
expand = expansion(mult = c(0.02, 0.05))
) +
scale_y_discrete(drop = TRUE) +
facet_wrap(~ panel, scales = "free_y") +
labs(x = "Monthly FMR (USD)", y = NULL) +
theme_minimal(base_size = 11) +
theme(
legend.position = if (show_legend) "top" else "none",
legend.title = element_text(face = "bold", size = 9),
strip.text = element_text(face = "bold", size = 11),
axis.text.y = element_text(size = 8.5),
axis.title.x = element_text(face = "bold"),
axis.ticks.y = element_blank(),
panel.grid.major.y = element_line(color = "gray93"),
panel.grid.major.x = element_line(color = "gray90"),
panel.grid.minor = element_blank()
)
}
# ── BUILD TWO PANELS ──────────────────────────────────────
p_high <- dot_panel(filter(hud_avg, panel == "Higher-Rent Counties"), show_legend = TRUE)
p_low <- dot_panel(filter(hud_avg, panel == "Lower-Rent Counties"), show_legend = FALSE)
# ── COMBINE ───────────────────────────────────────────────
p_high + p_low +
plot_annotation(
title = "HUD Fair Market Rents by NY County & Bedroom Size (2024)", # changed
subtitle = "Counties ordered by 2-BR FMR (ascending); connecting line shows spread across bedroom sizes",
caption = "Note: 2024 HUD Fair Market Rents represent the 40th percentile gross rent for standard units.\nCounties with wider horizontal spans have a larger bedroom-size premium, often reflecting tighter family-sized unit supply.", # changed
theme = theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10, color = "gray40"),
plot.caption = element_text(size = 9, color = "gray40", hjust = 0, face = "italic")
)
)Master Dataset Build
The three data sources are merged into a single county-year panel by joining ACS, HUD, and FRED data onto a complete 62-county × 16-year grid. Four derived features are calculated: housing burden (annualized rent divided by median income), HUD burden (2-BR FMR divided by median income), real income index (CPI-adjusted income), and macro housing pressure (burden multiplied by mortgage rate). The final dataset has 992 rows and 21 variables, providing a rich foundation for both exploratory analysis and predictive modeling.
############################################################
# MASTER DATASET
# YEARS RESTRICTED TO 2009–2024
############################################################
library(tidyverse)
library(lubridate)
library(zoo)
############################################################
# 1. CLEAN HUD DATA
############################################################
hud_clean <- hud_all %>%
filter(
stusps == "NY",
year >= 2009,
year <= 2024
) %>%
mutate(
county_name = countyname %>%
str_remove(", NY") %>%
str_remove(", New York") %>%
str_trim()
) %>%
select(
county_name,
year,
pop,
fmr_0,
fmr_1,
fmr_2,
fmr_3,
fmr_4
) %>%
distinct()
############################################################
# 2. CLEAN ACS5 DATA
############################################################
acs_clean <- acs_ny_5yr %>%
filter(
year >= 2009,
year <= 2024
) %>%
distinct()
############################################################
# 3. CLEAN FRED DATA
############################################################
fred_clean <- fred_acs_ready %>%
filter(
year >= 2009,
year <= 2024
) %>%
distinct()
############################################################
# 4. BUILD FULL COUNTY-YEAR GRID
############################################################
all_counties <- sort(unique(hud_clean$county_name))
all_years <- 2009:2024
full_grid <- expand_grid(
county_name = all_counties,
year = all_years
)
############################################################
# 5. JOIN ACS5
############################################################
master1 <- full_grid %>%
left_join(acs_clean, by = c("county_name", "year"))
############################################################
# 6. JOIN HUD
############################################################
master2 <- master1 %>%
left_join(hud_clean, by = c("county_name", "year"))
############################################################
# 7. JOIN FRED
############################################################
final_master <- master2 %>%
left_join(fred_clean, by = "year")
############################################################
# 8. INFLATION RATE (FIXED)
############################################################
inflation_tbl <- fred_clean %>%
arrange(year) %>%
mutate(
inflation_rate = (cpi / lag(cpi) - 1) * 100
) %>%
select(year, inflation_rate)
final_master <- final_master %>%
select(-any_of("inflation_rate")) %>%
left_join(inflation_tbl, by = "year")
############################################################
# 9. MINIMAL & NON-REDUNDANT DERIVED FEATURES
############################################################
final_master <- final_master %>%
mutate(
# 1. Housing burden (market-based)
housing_burden = (median_rent * 12) / median_income,
# 2. HUD affordability burden (policy benchmark)
hud_burden = (fmr_2 * 12) / median_income,
# 3. Real income (CPI-adjusted purchasing power)
real_income_index = (median_income / cpi) * 100,
# 4. Housing stress from macro environment
macro_housing_pressure = housing_burden * mortgage_rate
)
############################################################
# 10. FINAL SORT
############################################################
final_master <- final_master %>%
arrange(county_name, year)
############################################################
# 11. EXPORT
############################################################
write_csv(
final_master,
"ny_housing_final_master_dataset_acs5_2009_2024.csv"
)
final_master |> head(n = 5) |> gt()| county_name | year | median_income | median_home | median_rent | state_fips | county_fips | pop | fmr_0 | fmr_1 | fmr_2 | fmr_3 | fmr_4 | mortgage_rate | fed_rate | cpi | inflation_rate | housing_burden | hud_burden | real_income_index | macro_housing_pressure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Albany County | 2009 | 55350 | 192500 | 830 | 36 | 001 | 294565 | 686 | 711 | 868 | 1039 | 1135 | 5.041375 | 0.1600000 | 214.5647 | NA | 0.1799458 | 0.1881843 | 25796.42 | 0.9071743 |
| Albany County | 2010 | 56090 | 202500 | 855 | 36 | 001 | 294565 | 690 | 716 | 874 | 1046 | 1143 | 4.690583 | 0.1750000 | 218.0762 | 1.636570 | 0.1829203 | 0.1869852 | 25720.37 | 0.8580029 |
| Albany County | 2011 | 57715 | 207300 | 880 | 36 | 001 | 294565 | 711 | 737 | 900 | 1077 | 1177 | 4.455833 | 0.1016667 | 224.9230 | 3.139652 | 0.1829680 | 0.1871264 | 25659.89 | 0.8152751 |
| Albany County | 2012 | 59359 | 210200 | 890 | 36 | 001 | 294565 | 687 | 713 | 870 | 1041 | 1138 | 3.655917 | 0.1400000 | 229.5861 | 2.073191 | 0.1799222 | 0.1758790 | 25854.79 | 0.6577805 |
| Albany County | 2013 | 59394 | 209300 | 904 | 36 | 001 | 304204 | 657 | 744 | 921 | 1147 | 1231 | 3.981917 | 0.1075000 | 232.9518 | 1.465972 | 0.1826447 | 0.1860794 | 25496.27 | 0.7272760 |
Shiny App
An interactive Shiny application maps any variable in the master dataset across all 62 New York counties for any year from 2009 to 2024. Users select a metric from a dropdown and step through years with an animated slider. Hovering over a county displays a tooltip with income, rent, population, and the selected metric value. The app makes it easy to observe, for example, how the housing burden ratio darkens across downstate counties after 2020, or how FMR increases spread geographically over time — patterns that are difficult to detect from tables alone.
Live here: https://khandker.shinyapps.io/shinyapp/
############################################################
# SHINY APP — NY HOUSING AFFORDABILITY
############################################################
library(shiny)
library(leaflet)
library(tigris)
library(sf)
library(tidyverse)
options(tigris_use_cache = TRUE)
ny_sf <- counties(state = "NY", cb = TRUE, year = 2020, progress_bar = FALSE) |>
st_transform(crs = 4326) |>
mutate(county_name = paste0(NAME, " County"))
metric_choices <- c(
"Median Household Income" = "median_income",
"Median Home Value" = "median_home",
"Median Monthly Rent" = "median_rent",
"FMR — Studio (0 BR)" = "fmr_0",
"FMR — 1 Bedroom" = "fmr_1",
"FMR — 2 Bedroom" = "fmr_2",
"FMR — 3 Bedroom" = "fmr_3",
"FMR — 4 Bedroom" = "fmr_4",
"Housing Burden (Rent/Income)" = "housing_burden",
"HUD Burden (FMR-2/Income)" = "hud_burden",
"Real Income Index" = "real_income_index"
)
dollar_metrics <- c("median_income","median_home","median_rent",
"fmr_0","fmr_1","fmr_2","fmr_3","fmr_4")
ratio_metrics <- c("housing_burden","hud_burden")
fmt_display <- function(m, val) {
if (is.na(val) || !is.finite(val)) return("N/A")
if (m %in% dollar_metrics)
paste0("$", formatC(val, format = "f", digits = 0, big.mark = ","))
else if (m %in% ratio_metrics)
paste0(round(val * 100, 1), "%")
else
as.character(round(val, 2))
}
# UI
ui <- fluidPage(
tags$head(tags$style(HTML("
body, .container-fluid { margin: 0; padding: 0; }
#controls {
display: flex;
align-items: flex-end;
gap: 32px; /* more breathing room between items */
padding: 14px 24px; /* more vertical + horizontal padding */
background: #f8f9fa;
border-bottom: 2px solid #dee2e6;
flex-wrap: wrap;
}
#controls h4 {
margin: 0 16px 6px 0; /* push title away from dropdowns */
font-size: 1.05em;
font-weight: 700;
align-self: center;
white-space: nowrap;
color: #1e2d40;
}
.ctrl-group {
display: flex;
flex-direction: column;
gap: 3px; /* small gap between label and input */
}
.ctrl-group label {
font-size: 0.78em;
font-weight: 600;
color: #555;
margin: 0;
}
.form-group { margin-bottom: 0; }
.irs--shiny .irs-single { font-size: 11px; }
"))),
div(id = "controls",
tags$h4("NY Housing Affordability Explorer"),
div(class = "ctrl-group",
tags$label("Metric"),
selectInput("metric", label = NULL,
choices = metric_choices,
selected = "housing_burden",
width = "230px")
),
div(class = "ctrl-group",
tags$label("Year"),
sliderInput("year", label = NULL,
min = 2009, max = 2024, value = 2024,
step = 1, sep = "", width = "320px",
animate = animationOptions(interval = 1000, loop = FALSE))
)
),
leafletOutput("map", width = "100%", height = "calc(100vh - 85px)")
)
# SERVER
server <- function(input, output, session) {
year_data <- reactive({
final_master |>
filter(year == input$year) |>
transmute(
county_name,
pop = as.numeric(pop),
median_income = as.numeric(median_income),
median_rent = as.numeric(median_rent),
value = as.numeric(.data[[input$metric]])
)
})
map_df <- reactive({
ny_sf |> left_join(year_data(), by = "county_name")
})
output$map <- renderLeaflet({
leaflet() |>
addProviderTiles(providers$CartoDB.Positron) |>
setView(lng = -75.5, lat = 42.9, zoom = 7)
})
observe({
req(input$metric, input$year)
df <- map_df()
m <- input$metric
mlabel <- names(metric_choices)[metric_choices == m]
finite_vals <- df$value[is.finite(df$value)]
req(length(finite_vals) > 0)
pal <- colorNumeric("YlOrRd", domain = finite_vals, na.color = "#cccccc")
if (m %in% ratio_metrics) {
leg_vals <- finite_vals * 100
leg_pal <- colorNumeric("YlOrRd", domain = leg_vals, na.color = "#cccccc")
leg_title <- paste0(mlabel, " (%)<br/><small>", input$year, "</small>")
} else {
leg_vals <- finite_vals
leg_pal <- pal
leg_title <- paste0(mlabel, "<br/><small>", input$year, "</small>")
}
# TOOLTIPS
labels <- lapply(seq_len(nrow(df)), function(i) {
cname <- df$county_name[i]
pop <- df$pop[i]
inc <- df$median_income[i]
rent <- df$median_rent[i]
val <- df$value[i]
pop_fmt <- if (is.na(pop) || !is.finite(pop)) "N/A" else
formatC(pop, format = "f", digits = 0, big.mark = ",")
inc_fmt <- fmt_display("median_income", inc)
rent_fmt <- fmt_display("median_rent", rent)
val_fmt <- fmt_display(m, val)
htmltools::HTML(paste0(
"<div style='font-family:Arial,sans-serif; font-size:12px;",
"min-width:180px; line-height:1.5;'>",
"<div style='font-size:13px; font-weight:700; color:#1e2d40;",
"border-bottom:1px solid #ddd; padding-bottom:4px;",
"margin-bottom:6px;'>",
cname,
"</div>",
"<div style='color:#555;'>",
"<b>Population:</b> ", pop_fmt, "<br/>",
"<b>Median Income:</b> ", inc_fmt, "<br/>",
"<b>Median Rent:</b> ", rent_fmt, "<br/>",
"</div>",
"<div style='margin-top:6px; padding-top:5px;",
"border-top:1px solid #ddd; color:#1e2d40;'>",
"<b>", mlabel, ":</b> ",
"<span style='color:#c0392b; font-weight:700;'>", val_fmt, "</span>",
"</div>",
"</div>"
))
})
leafletProxy("map", session) |>
clearShapes() |>
clearControls() |>
addPolygons(
data = df,
fillColor = ~pal(value),
fillOpacity = 0.72,
color = "white",
weight = 1,
highlight = highlightOptions(
weight = 2.5,
color = "#1e2d40",
fillOpacity = 0.9,
bringToFront = TRUE
),
label = labels,
labelOptions = labelOptions(
style = list(
"background-color" = "white",
"border" = "1px solid #ccc",
"border-radius" = "6px",
"padding" = "8px",
"box-shadow" = "2px 2px 6px rgba(0,0,0,0.15)"
),
textsize = "12px",
direction = "auto",
opacity = 1
),
layerId = ~county_name
) |>
addLegend(
pal = leg_pal,
values = leg_vals,
title = leg_title,
position = "bottomright",
opacity = 0.85
)
})
}
shinyApp(ui, server)Research Question 1: What factors best predict county-level housing affordability burden in New York State from 2009–2024?
To identify the key drivers of housing burden, four models are trained and compared: Linear Regression, Elastic Net, Random Forest, and XGBoost. The target variable is the rent-to-income housing burden ratio. Predictors include median home value, population, all five FMR bedroom sizes, macroeconomic indicators, and county fixed effects encoded as dummy variables. An 80/20 stratified train-test split and five-fold cross-validation are used to tune hyperparameters and evaluate performance.
Model Comparison and Performance Plot
The cross-validated results table and bar chart compare all four models on RMSE, MAE, and R². The Random Forest achieves the best overall performance with RMSE = 0.00705 and R² = 0.964, followed closely by Elastic Net and Linear Regression (both R² ≈ 0.960). XGBoost performs slightly worse (R² = 0.947) despite its greater complexity, suggesting the data structure does not benefit from deep boosting iterations. Importantly, all four models cluster within a narrow RMSE range of 0.007–0.008, which means the strength of the predictive signal is consistent and not dependent on model choice. This is a strong indicator that the features — particularly population, home value, and FMR — carry genuine, stable information about affordability burden rather than noise.
Test Set Evaluation and Feature Importance
The winning Random Forest model is evaluated on the held-out test set, confirming RMSE = 0.00638 and R² = 0.959 — both consistent with cross-validation, indicating no overfitting. The variable importance plot reveals a clear hierarchy of predictors. Population is the single strongest driver, reflecting the well-documented relationship between housing demand density and rent-to-income stress. County fixed effects rank second, capturing persistent local market character that numeric variables alone cannot explain. Median home value ranks third — counties where purchasing is already expensive tend to have proportionally higher rents as well. The FMR bedroom-size series and CPI follow, while mortgage rate and the Fed Funds Rate rank near the bottom. This pattern is significant: it suggests that affordability burden is primarily structural and geographic, not cyclical — a finding that has direct implications for what kinds of policy interventions are most likely to be effective.
Correlation Table and Conclusion
The correlation matrix provides a quantitative summary of variable relationships and validates the feature importance findings. Housing burden correlates most strongly with population (r = 0.73) and median home value (r = 0.60), confirming these as the primary structural drivers. In contrast, mortgage rate (r = −0.04) and the Fed Funds Rate (r = −0.04) show near-zero correlation with housing burden, reinforcing that macroeconomic rate variables play a limited direct role at the county level. The FMR columns are highly intercorrelated with each other (r = 0.96–1.00), indicating multicollinearity across bedroom sizes — the Random Forest handles this naturally, but it would require regularization in a linear setting. The year variable correlates strongly with CPI (r = 0.94) and the Fed Funds Rate (r = 0.73), confirming that both variables largely reflect long-run time trends rather than independent signals.
############################################################
# RQ1: What factors best predict county-level housing
# affordability burden in New York State (2009–2024)?
############################################################
library(tidyverse)
library(tidymodels)
library(vip)
library(scales)
library(patchwork)
set.seed(607)
############################################################
# 1. MODELING DATASET
############################################################
ml_refined <- final_master %>%
drop_na(
housing_burden,
median_home,
pop,
fmr_0, fmr_1, fmr_2, fmr_3, fmr_4,
mortgage_rate, fed_rate,
cpi, inflation_rate
) %>%
filter(year >= 2010) %>%
mutate(
county_name = as.factor(county_name)
) %>%
select(
county_name,
year,
housing_burden,
median_home,
pop,
fmr_0, fmr_1, fmr_2, fmr_3, fmr_4,
mortgage_rate,
fed_rate,
cpi,
inflation_rate
)
############################################################
# 2. TRAIN / TEST SPLIT
############################################################
split_obj <- initial_split(
ml_refined,
prop = 0.80,
strata = housing_burden
)
train_data <- training(split_obj)
test_data <- testing(split_obj)
############################################################
# 3. CROSS VALIDATION
############################################################
cv_folds <- vfold_cv(
train_data,
v = 5,
strata = housing_burden
)
############################################################
# 4. PREPROCESSING
############################################################
rec <- recipe(housing_burden ~ ., data = train_data) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
############################################################
# 5. MODELS
############################################################
lm_mod <- linear_reg() %>%
set_engine("lm")
enet_mod <- linear_reg(
penalty = tune(),
mixture = tune()
) %>%
set_engine("glmnet")
rf_mod <- rand_forest(
trees = 500,
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "permutation") %>%
set_mode("regression")
xgb_mod <- boost_tree(
trees = 500,
tree_depth = tune(),
learn_rate = tune(),
min_n = tune(),
loss_reduction = tune()
) %>%
set_engine("xgboost") %>%
set_mode("regression")
############################################################
# 6. WORKFLOWS
############################################################
wf_lm <- workflow() %>% add_recipe(rec) %>% add_model(lm_mod)
wf_enet <- workflow() %>% add_recipe(rec) %>% add_model(enet_mod)
wf_rf <- workflow() %>% add_recipe(rec) %>% add_model(rf_mod)
wf_xgb <- workflow() %>% add_recipe(rec) %>% add_model(xgb_mod)
############################################################
# 7. FIT MODELS
############################################################
metric_used <- metric_set(rmse, rsq, mae)
res_lm <- fit_resamples(
wf_lm,
resamples = cv_folds,
metrics = metric_used
)
res_enet <- tune_grid(
wf_enet,
resamples = cv_folds,
grid = 10,
metrics = metric_used
)
res_rf <- tune_grid(
wf_rf,
resamples = cv_folds,
grid = 10,
metrics = metric_used
)
res_xgb <- tune_grid(
wf_xgb,
resamples = cv_folds,
grid = 10,
metrics = metric_used
)
############################################################
# 8. BEST MODELS
############################################################
best_enet <- select_best(res_enet, metric = "rmse")
best_rf <- select_best(res_rf, metric = "rmse")
best_xgb <- select_best(res_xgb, metric = "rmse")
############################################################
# 9. MODEL COMPARISON
############################################################
results_tbl <- bind_rows(
collect_metrics(res_lm) %>%
mutate(model = "Linear Regression"),
collect_metrics(res_enet) %>%
filter(.config == best_enet$.config) %>%
mutate(model = "Elastic Net"),
collect_metrics(res_rf) %>%
filter(.config == best_rf$.config) %>%
mutate(model = "Random Forest"),
collect_metrics(res_xgb) %>%
filter(.config == best_xgb$.config) %>%
mutate(model = "XGBoost")
) %>%
filter(.metric %in% c("rmse","rsq","mae")) %>%
select(model, .metric, mean) %>%
pivot_wider(names_from = .metric, values_from = mean) %>%
distinct() %>%
arrange(rmse)
results_tbl# A tibble: 4 × 4
model mae rmse rsq
<chr> <dbl> <dbl> <dbl>
1 Random Forest 0.00522 0.00705 0.964
2 Elastic Net 0.00549 0.00719 0.960
3 Linear Regression 0.00550 0.00719 0.960
4 XGBoost 0.00600 0.00844 0.947
############################################################
# 10. PERFORMANCE PLOT
############################################################
plot_tbl <- results_tbl %>%
mutate(model = fct_reorder(model, rmse))
ggplot(plot_tbl,
aes(x = model, y = rmse, fill = model)) +
geom_col(width = .72, alpha = .92) +
geom_text(aes(label = round(rmse, 4)),
vjust = -0.5,
size = 4.2,
fontface = "bold") +
scale_y_continuous(
expand = expansion(mult = c(0, .12))
) +
labs(
title = "Model Performance Comparison",
subtitle = "Lower RMSE indicates stronger predictive accuracy",
x = NULL,
y = "RMSE"
) +
theme_minimal(base_size = 14) +
theme(
legend.position = "none",
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(face = "bold"),
axis.text.x = element_text(face = "bold")
)############################################################
# 11. TEST SET PERFORMANCE
############################################################
best_model_name <- results_tbl$model[1]
final_wf <-
if(best_model_name == "Linear Regression") wf_lm else
if(best_model_name == "Elastic Net") finalize_workflow(wf_enet, best_enet) else
if(best_model_name == "Random Forest") finalize_workflow(wf_rf, best_rf) else
finalize_workflow(wf_xgb, best_xgb)
final_fit <- last_fit(final_wf, split_obj)
collect_metrics(final_fit)# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 0.00638 pre0_mod0_post0
2 rsq standard 0.959 pre0_mod0_post0
############################################################
# 12. FEATURE IMPORTANCE
############################################################
fit_obj <- extract_workflow(final_fit) %>%
fit(train_data)
vip(fit_obj, num_features = 12) +
labs(
title = paste("Top Predictors -", best_model_name),
subtitle = "Most influential factors associated with housing burden"
) +
theme_minimal(base_size = 13)############################################################
# 13. CORRELATION TABLE
############################################################
ml_refined %>%
select(-county_name) %>%
cor(use = "pairwise.complete.obs") %>%
round(2) year housing_burden median_home pop fmr_0 fmr_1 fmr_2 fmr_3
year 1.00 -0.04 0.15 0.01 0.29 0.30 0.33 0.33
housing_burden -0.04 1.00 0.60 0.73 0.60 0.58 0.57 0.57
median_home 0.15 0.60 1.00 0.78 0.88 0.89 0.88 0.87
pop 0.01 0.73 0.78 1.00 0.67 0.68 0.67 0.67
fmr_0 0.29 0.60 0.88 0.67 1.00 0.98 0.97 0.97
fmr_1 0.30 0.58 0.89 0.68 0.98 1.00 1.00 0.99
fmr_2 0.33 0.57 0.88 0.67 0.97 1.00 1.00 1.00
fmr_3 0.33 0.57 0.87 0.67 0.97 0.99 1.00 1.00
fmr_4 0.32 0.57 0.87 0.67 0.96 0.98 0.99 0.99
mortgage_rate 0.43 -0.04 0.13 0.00 0.25 0.23 0.24 0.23
fed_rate 0.73 -0.04 0.15 0.00 0.29 0.29 0.31 0.30
cpi 0.94 -0.05 0.17 0.01 0.32 0.33 0.35 0.35
inflation_rate 0.51 -0.06 0.11 0.00 0.19 0.18 0.19 0.18
fmr_4 mortgage_rate fed_rate cpi inflation_rate
year 0.32 0.43 0.73 0.94 0.51
housing_burden 0.57 -0.04 -0.04 -0.05 -0.06
median_home 0.87 0.13 0.15 0.17 0.11
pop 0.67 0.00 0.00 0.01 0.00
fmr_0 0.96 0.25 0.29 0.32 0.19
fmr_1 0.98 0.23 0.29 0.33 0.18
fmr_2 0.99 0.24 0.31 0.35 0.19
fmr_3 0.99 0.23 0.30 0.35 0.18
fmr_4 1.00 0.23 0.30 0.34 0.17
mortgage_rate 0.23 1.00 0.87 0.66 0.40
fed_rate 0.30 0.87 1.00 0.84 0.33
cpi 0.34 0.66 0.84 1.00 0.61
inflation_rate 0.17 0.40 0.33 0.61 1.00
############################################################
# 14. RESEARCH QUESTION OUTPUT
############################################################
cat("
RQ1:
What factors best predict county-level housing affordability
burden in New York State from 2009 to 2024?
Based on the strongest-performing model, the most important
predictors include:
1. Population size
2. County-specific location effects
3. Median home values
4. Fair Market Rent levels
5. Inflation conditions
6. Long-term time trends
These findings suggest that affordability burden is largely
driven by housing market conditions, regional demand pressures,
and geographic differences across counties.
")
RQ1:
What factors best predict county-level housing affordability
burden in New York State from 2009 to 2024?
Based on the strongest-performing model, the most important
predictors include:
1. Population size
2. County-specific location effects
3. Median home values
4. Fair Market Rent levels
5. Inflation conditions
6. Long-term time trends
These findings suggest that affordability burden is largely
driven by housing market conditions, regional demand pressures,
and geographic differences across counties.
Research Question 2: How sensitive are county-level HUD Fair Market Rent (2-BR) predictions to the removal of high-interest-rate years?
Research Question 2 tests how sensitive Fair Market Rent predictions are to the removal of high-rate years (2022–2024), when the 30-year mortgage rate exceeded 5%. Both Linear Regression and Random Forest are retrained at five forgetting levels (case weight 1.0 down to 0.0 for high-rate-year observations).
R² forgetting curve
Both models maintain high R² across all forgetting steps, declining only marginally from ~0.947 to ~0.940 at full unlearning. This stability confirms that FMR is structurally driven — a model with no knowledge of the 2022–2024 rate environment can still explain 94% of the cross-county variation in rents.
Predicted FMR trajectory
Prior to 2022, the full and unlearned models predict nearly identical statewide rent levels. After 2022, a visible gap opens, with the full model predicting approximately $70/month higher rents on average. This gap — the shaded region in the chart — represents the direct imprint of the rate-hike era on model predictions.
County-level sensitivity
Geographic variation in the shift is striking. Downstate counties show the largest negative shifts: Rockland (−$533), New York County (−$519), the Bronx (−$518), Queens (−$474), and Kings (−$469) — meaning rate-hike-era data caused the model to learn significantly higher rents in these markets. Conversely, upstate counties like Erie (+$313), Onondaga (+$271), and Monroe (+$253) show positive shifts, suggesting rate hikes suppressed rents in those markets relative to their structural trend. Rural counties cluster near zero, indicating minimal sensitivity to the rate environment.
Shift distribution by era
The boxplot confirms that prediction shifts are concentrated in the high-rate era observations, with a wider and off-center distribution compared to the normal-rate era. The spread across county-year pairs in the high-rate era reflects the geographic heterogeneity documented in Figure 3 — some markets were strongly shaped by the rate environment while others were largely unaffected. Together, the four figures show that machine unlearning is a useful diagnostic tool for identifying which housing markets are most exposed to rate-cycle distortions in model training data.
############################################################
# RQ2: HOW SENSITIVE ARE COUNTY-LEVEL HUD FAIR MARKET RENT (2-BR)
# PREDICTIONS TO THE REMOVAL OF HIGH-INTEREST-RATE YEARS?
#
# Target : fmr_2 (2-BR Fair Market Rent — annual HUD data)
# Forget : years where mortgage rate > 5% (2022–2024)
# Models : Linear Regression + Random Forest
############################################################
library(tidyverse)
library(tidymodels)
library(scales)
library(patchwork)
set.seed(607)
############################################################
# 1. DEFINE HIGH-RATE YEARS & SUMMARISE
############################################################
high_rate_years <- 2022:2024
rate_summary <- final_master %>%
group_by(year) %>%
summarise(avg_rate = mean(mortgage_rate, na.rm = TRUE),
.groups = "drop") %>%
filter(!is.na(avg_rate))
cat("=== MORTGAGE RATE CONTEXT ===\n")=== MORTGAGE RATE CONTEXT ===
cat("Normal-rate era (2010–2021) avg:",
round(mean(rate_summary$avg_rate[rate_summary$year < 2022],
na.rm = TRUE), 2), "%\n")Normal-rate era (2010–2021) avg: 4 %
cat("High-rate era (2022–2024) avg:",
round(mean(rate_summary$avg_rate[rate_summary$year >= 2022],
na.rm = TRUE), 2), "%\n\n")High-rate era (2022–2024) avg: 6.28 %
############################################################
# 2. MODELING DATASET
############################################################
ml_fmr <- final_master %>%
drop_na(fmr_2, median_income, pop,
mortgage_rate, fed_rate,
cpi, inflation_rate) %>%
filter(year >= 2010,
fmr_2 > 0,
median_income > 0) %>%
mutate(county_name = as.factor(county_name)) %>%
select(
county_name, # geographic fixed effects
year, # time trend
fmr_2, # TARGET: 2-BR Fair Market Rent
median_income, # demand driver
pop, # market size
mortgage_rate, # key forgotten variable
fed_rate, # monetary policy
cpi, # price level
inflation_rate # year-over-year inflation
)
cat("=== DATASET OVERVIEW ===\n")=== DATASET OVERVIEW ===
cat("Rows :", nrow(ml_fmr), "\n")Rows : 930
cat("Unique counties:", n_distinct(ml_fmr$county_name), "\n")Unique counties: 62
cat("Years covered :", min(ml_fmr$year), "–", max(ml_fmr$year), "\n\n")Years covered : 2010 – 2024
############################################################
# 3. TRAIN / TEST SPLIT
############################################################
split_obj <- initial_split(ml_fmr, prop = 0.80, strata = fmr_2)
train_data <- training(split_obj)
test_data <- testing(split_obj)
cv_folds <- vfold_cv(train_data, v = 5, strata = fmr_2)
############################################################
# 4. RECIPES & MODELS
############################################################
base_rec <- recipe(fmr_2 ~ ., data = train_data) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
lm_mod <- linear_reg() %>%
set_engine("lm")
rf_mod <- rand_forest(
trees = 500,
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "permutation") %>%
set_mode("regression")
wf_lm <- workflow() %>% add_recipe(base_rec) %>% add_model(lm_mod)
wf_rf <- workflow() %>% add_recipe(base_rec) %>% add_model(rf_mod)
############################################################
# 5. TUNE & BASELINE COMPARISON
############################################################
metric_set_used <- metric_set(rmse, rsq, mae)
res_lm <- fit_resamples(wf_lm,
resamples = cv_folds,
metrics = metric_set_used)
res_rf <- tune_grid(wf_rf,
resamples = cv_folds,
grid = 10,
metrics = metric_set_used)
best_rf <- select_best(res_rf, metric = "rmse")
rf_spec_tuned <- finalize_workflow(wf_rf, best_rf) %>%
extract_spec_parsnip()
cat("=== BASELINE MODEL COMPARISON (CV R²) ===\n")=== BASELINE MODEL COMPARISON (CV R²) ===
bind_rows(
collect_metrics(res_lm) %>%
filter(.metric == "rsq") %>%
mutate(model = "Linear Regression"),
collect_metrics(res_rf) %>%
filter(.config == best_rf$.config, .metric == "rsq") %>%
mutate(model = "Random Forest")
) %>%
select(model, mean) %>%
rename(`CV R²` = mean) %>%
print()# A tibble: 2 × 2
model `CV R²`
<chr> <dbl>
1 Linear Regression 0.947
2 Random Forest 0.943
cat("\n")############################################################
# 6. UNLEARNING SETUP
############################################################
forget_steps <- c(1.00, 0.75, 0.50, 0.25, 0.00)
step_labels <- c("Full (w=1.00)", "w=0.75", "w=0.50",
"w=0.25", "Unlearned (w=0.00)")
############################################################
# 7. UNLEARNING LOOP — TEST SET (R² evaluation only)
############################################################
lm_metrics_list <- list()
rf_metrics_list <- list()
for (i in seq_along(forget_steps)) {
w <- forget_steps[i]
label <- step_labels[i]
cat(">>> Fitting:", label, "\n")
train_w <- train_data %>%
mutate(
case_wt = importance_weights(
if_else(year %in% high_rate_years, w, 1.0)
)
)
rec_w <- recipe(fmr_2 ~ ., data = train_w) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
# Linear Regression
fit_lm_w <- workflow() %>%
add_recipe(rec_w) %>%
add_model(lm_mod) %>%
add_case_weights(case_wt) %>%
fit(data = train_w)
lm_metrics_list[[i]] <- predict(fit_lm_w, new_data = test_data) %>%
bind_cols(test_data %>% select(fmr_2)) %>%
metrics(truth = fmr_2, estimate = .pred) %>%
mutate(forget_weight = w, step_label = label,
model = "Linear Regression")
# Random Forest
fit_rf_w <- workflow() %>%
add_recipe(rec_w) %>%
add_model(rf_spec_tuned) %>%
add_case_weights(case_wt) %>%
fit(data = train_w)
rf_metrics_list[[i]] <- predict(fit_rf_w, new_data = test_data) %>%
bind_cols(test_data %>% select(fmr_2)) %>%
metrics(truth = fmr_2, estimate = .pred) %>%
mutate(forget_weight = w, step_label = label,
model = "Random Forest")
}>>> Fitting: Full (w=1.00)
>>> Fitting: w=0.75
>>> Fitting: w=0.50
>>> Fitting: w=0.25
>>> Fitting: Unlearned (w=0.00)
all_metrics <- bind_rows(
bind_rows(lm_metrics_list),
bind_rows(rf_metrics_list)
)
############################################################
# 8. FULL-DATASET PREDICTIONS — ALL 62 COUNTIES
############################################################
make_full_preds <- function(w, label) {
train_w <- train_data %>%
mutate(
case_wt = importance_weights(
if_else(year %in% high_rate_years, w, 1.0)
)
)
rec_w <- recipe(fmr_2 ~ ., data = train_w) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
fit_w <- workflow() %>%
add_recipe(rec_w) %>%
add_model(rf_spec_tuned) %>%
add_case_weights(case_wt) %>%
fit(data = train_w)
predict(fit_w, new_data = ml_fmr) %>%
bind_cols(ml_fmr %>% select(county_name, year, fmr_2)) %>%
mutate(
forget_weight = w,
step_label = factor(label, levels = step_labels),
high_rate_era = year %in% high_rate_years
)
}
cat("\n>>> Predicting on all 62 counties...\n")
>>> Predicting on all 62 counties...
full_preds <- make_full_preds(1.00, step_labels[1])
unlearn_preds <- make_full_preds(0.00, step_labels[5])
all_steps_full <- map2_dfr(forget_steps, step_labels, make_full_preds)
############################################################
# 9. SENSITIVITY CALCULATION — ALL 62 COUNTIES
############################################################
sensitivity_df <- full_preds %>%
select(county_name, year, high_rate_era,
actual = fmr_2,
full_pred = .pred) %>%
left_join(
unlearn_preds %>% select(county_name, year,
unlearned_pred = .pred),
by = c("county_name", "year")
) %>%
mutate(
pred_shift = unlearned_pred - full_pred,
era_label = if_else(high_rate_era,
"High-Rate Era (2022–24)",
"Normal-Rate Era (2010–21)")
)
county_sensitivity <- sensitivity_df %>%
filter(high_rate_era == TRUE) %>%
group_by(county_name) %>%
summarise(
mean_shift = mean(pred_shift, na.rm = TRUE),
mean_actual = mean(actual, na.rm = TRUE),
mean_full = mean(full_pred, na.rm = TRUE),
mean_unlearned = mean(unlearned_pred, na.rm = TRUE),
pct_shift = mean_shift / mean_full * 100,
.groups = "drop"
) %>%
mutate(
county_label = str_remove(county_name, " County"),
direction = if_else(mean_shift >= 0,
"Predicts Higher Without Rate-Hike Data",
"Predicts Lower Without Rate-Hike Data")
) %>%
arrange(desc(mean_shift))
cat("Counties analysed:", nrow(county_sensitivity), "\n\n")Counties analysed: 62
############################################################
# 10. VISUALIZATIONS
############################################################
# ── FIGURE 1: R² Both Models (More Space + No Grids) ──────────────────────────────
rsq_tbl <- all_metrics %>%
filter(.metric == "rsq") %>%
mutate(forget_weight = as.numeric(forget_weight))
p1 <- ggplot(rsq_tbl,
aes(x = forget_weight, y = .estimate,
color = model, group = model)) +
geom_line(linewidth = 1.3) +
geom_point(size = 3.5) +
# Blue labels lower
geom_label(
data = rsq_tbl %>% filter(model == "Linear Regression"),
aes(label = round(.estimate, 3)),
nudge_y = -0.025,
nudge_x = 0.02,
size = 3.2,
label.size = 0.2,
show.legend = FALSE
) +
# Orange labels higher
geom_label(
data = rsq_tbl %>% filter(model == "Random Forest"),
aes(label = round(.estimate, 3)),
nudge_y = 0.025,
nudge_x = -0.02,
size = 3.2,
label.size = 0.2,
show.legend = FALSE
) +
scale_x_reverse(
breaks = forget_steps,
labels = c("Full\n(all years)", "75%", "50%",
"25%", "Forgotten\n(2022–24 removed)")
) +
# more headroom on top
scale_y_continuous(
labels = percent_format(accuracy = 1),
limits = c(0, 1.05),
expand = expansion(mult = c(0.02, 0.08))
) +
scale_color_manual(values = c(
"Linear Regression" = "#0072B2",
"Random Forest" = "#D55E00"
)) +
labs(
title = "Figure 1 — Model Reliability as High-Rate Years Are Gradually Forgotten",
subtitle = paste(
"A flat R² line means FMR is structurally driven.",
"A sharp drop means the high-rate era was critical to model learning."
),
x = "← How much of the high-rate era (2022–24) the model remembers",
y = "R² (variance explained)",
color = NULL,
caption = paste(
"Note: R² evaluated on held-out test set at each forgetting step.",
"High-rate years = 2022–2024 when 30-yr mortgage rate exceeded 5%."
)
) +
theme_minimal(base_size = 13) +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold", size = 13),
plot.subtitle = element_text(color = "gray40", size = 10),
plot.caption = element_text(color = "gray50", size = 9,
hjust = 0, face = "italic"),
axis.text = element_text(size = 10),
# remove all grids
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
# clean axis lines
axis.line = element_line(color = "gray70")
)
print(p1)# ── FIGURE 2: Trajectory — Full vs Unlearned ──────────────
traj_df <- all_steps_full %>%
filter(forget_weight %in% c(1.00, 0.00)) %>%
group_by(year, forget_weight) %>%
summarise(mean_pred = mean(.pred, na.rm = TRUE),
.groups = "drop")
ribbon_df <- traj_df %>%
pivot_wider(names_from = forget_weight,
values_from = mean_pred,
names_prefix = "w_") %>%
rename(full_line = w_1, unlearn_line = w_0) %>%
mutate(ymin = pmin(full_line, unlearn_line),
ymax = pmax(full_line, unlearn_line))
lines_df <- traj_df %>%
mutate(series = if_else(
forget_weight == 1.00,
"Full Model (rate hike included)",
"Unlearned Model (rate hike removed)"
))
p2 <- ggplot() +
annotate("rect", xmin = 2021.5, xmax = 2024.5,
ymin = -Inf, ymax = Inf,
fill = "#C84B31", alpha = 0.07) +
annotate("text", x = 2022.1,
y = max(lines_df$mean_pred, na.rm = TRUE) * 0.88,
label = "High-Rate\nEra",
color = "#C84B31", size = 3.5,
fontface = "bold", hjust = 0) +
geom_ribbon(
data = ribbon_df,
mapping = aes(x = year, ymin = ymin, ymax = ymax),
fill = "#C84B31", alpha = 0.15
) +
geom_line(
data = lines_df,
mapping = aes(x = year, y = mean_pred,
color = series, linetype = series),
linewidth = 1.4
) +
geom_point(
data = lines_df,
mapping = aes(x = year, y = mean_pred, color = series),
size = 2.5
) +
scale_color_manual(values = c(
"Full Model (rate hike included)" = "#1F3C88",
"Unlearned Model (rate hike removed)" = "#2A9D8F"
)) +
scale_linetype_manual(values = c(
"Full Model (rate hike included)" = "solid",
"Unlearned Model (rate hike removed)" = "dashed"
)) +
scale_x_continuous(breaks = seq(2010, 2024, 2)) +
scale_y_continuous(labels = label_dollar()) +
labs(
title = "Figure 2 — Predicted 2-BR FMR: Full vs. Rate-Hike Unlearned Model (All 62 Counties)",
subtitle = "Shaded gap = how much rate-hike years shift the model's FMR predictions statewide",
x = NULL,
y = "Mean Predicted 2-BR FMR ($/month)",
color = NULL, linetype = NULL,
caption = paste(
"Note: Lines show mean RF prediction across all 62 NY counties per year.",
"Wider shaded gap = greater sensitivity of FMR predictions to rate-hike era.",
"Dashed = what the model predicts having never seen 2022–2024 data."
)
) +
theme_minimal(base_size = 13) +
theme(
legend.position = "bottom",
legend.text = element_text(size = 10),
plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(color = "gray40", size = 10),
plot.caption = element_text(color = "gray50", size = 9,
hjust = 0, face = "italic"),
# remove all grids
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text = element_text(size = 10)
)
print(p2)# ── FIGURE 3: ALL 62 COUNTIES — Prediction Shift ──────────
county_plot_df <- county_sensitivity %>%
mutate(
county_label = fct_reorder(county_label, mean_shift),
bar_color = if_else(mean_shift >= 0, "#C84B31", "#1B9E77")
)
p3 <- ggplot(county_plot_df,
aes(x = mean_shift, y = county_label,
fill = direction)) +
geom_col(alpha = 0.85, width = 0.75) +
geom_vline(xintercept = 0, color = "gray30",
linewidth = 0.7) +
geom_text(
aes(
label = dollar(mean_shift, accuracy = 1),
hjust = if_else(mean_shift >= 0, -0.1, 1.1)
),
size = 2.8, fontface = "bold",
color = if_else(county_plot_df$mean_shift >= 0,
"#C84B31", "#1B9E77")
) +
scale_fill_manual(values = c(
"Predicts Higher Without Rate-Hike Data" = "#C84B31",
"Predicts Lower Without Rate-Hike Data" = "#1B9E77"
)) +
scale_x_continuous(
labels = label_dollar(),
expand = expansion(mult = c(0.15, 0.15))
) +
labs(
title = "Figure 3 — FMR Prediction Shift After Rate-Hike Unlearning: All 62 NY Counties",
subtitle = paste(
"Each bar = how much the predicted 2-BR FMR changes when high-rate years (2022–24) are forgotten.",
"\nRed = model predicts higher FMR without rate-hike data (rate hikes suppressed predictions).",
"Green = model predicts lower FMR (rate hikes inflated predictions)."
),
x = "Mean Prediction Shift: Unlearned − Full Model ($/month)",
y = NULL,
fill = NULL,
caption = paste(
"Note: Counties sorted by prediction shift magnitude.",
"Larger absolute shift = county FMR was more sensitive to high-rate era training data.",
"Counties near zero had stable predictions regardless of rate environment."
)
) +
theme_minimal(base_size = 10) +
theme(
legend.position = "bottom",
legend.text = element_text(size = 9),
plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(color = "gray40", size = 9,
lineheight = 1.3),
plot.caption = element_text(color = "gray50", size = 8,
hjust = 0, face = "italic"),
# remove all grids
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text.y = element_text(size = 8)
)
# Save tall enough for all 62 counties
ggsave("rq2_fig3_all_counties.png",
plot = p3, width = 12, height = 14, dpi = 150)
print(p3)# ── FIGURE 4: Distribution of Shift by Era ────────────────
p4 <- ggplot(sensitivity_df,
aes(x = era_label, y = pred_shift,
fill = era_label)) +
geom_hline(yintercept = 0, linetype = "dashed",
color = "gray50", linewidth = 0.8) +
geom_boxplot(width = 0.45, alpha = 0.85,
outlier.size = 1.5, outlier.alpha = 0.4) +
geom_jitter(width = 0.09, alpha = 0.18,
size = 1.1, color = "gray30") +
annotate("text", x = 2.35,
y = max(sensitivity_df$pred_shift, na.rm = TRUE) * 0.75,
label = "↑ Unlearned model predicts HIGHER FMR\n(rate hikes suppressed predictions)",
hjust = 0, size = 3.2, color = "gray35",
lineheight = 1.3) +
annotate("text", x = 2.35,
y = min(sensitivity_df$pred_shift, na.rm = TRUE) * 0.75,
label = "↓ Unlearned model predicts LOWER FMR\n(rate hikes inflated predictions)",
hjust = 0, size = 3.2, color = "gray35",
lineheight = 1.3) +
scale_fill_manual(values = c(
"High-Rate Era (2022–24)" = "#0072B2",
"Normal-Rate Era (2010–21)" = "#D55E00"
)) +
scale_y_continuous(labels = label_dollar()) +
labs(
title = "Figure 4 — Spread of FMR Prediction Shifts Across All Counties",
subtitle = paste(
"If the red box is further from zero than green,",
"rate-hike years meaningfully shaped how the model learned FMR."
),
x = NULL,
y = "Prediction Shift: Unlearned − Full ($/month)",
fill = NULL,
caption = paste(
"How to read: Box = middle 50% of county-year observations.",
"Line inside = median shift. Jittered dots = individual observations.",
"\nA shift concentrated in the high-rate era (blue) confirms",
"the model learned rate-specific patterns from 2022–2024."
)
) +
theme_minimal(base_size = 13) +
theme(
legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
plot.subtitle = element_text(color = "gray40", size = 10,
lineheight = 1.3),
plot.caption = element_text(color = "gray50", size = 9,
hjust = 0, face = "italic",
lineheight = 1.4),
# remove all grids
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text = element_text(size = 11)
)
print(p4)############################################################
# 11. SUMMARY TABLE — ALL 62 COUNTIES
############################################################
cat("\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n")
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
cat(" Table 1 — FMR Sensitivity: All 62 NY Counties\n") Table 1 — FMR Sensitivity: All 62 NY Counties
cat(" Sorted by absolute prediction shift (most sensitive first)\n") Sorted by absolute prediction shift (most sensitive first)
cat("━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n")━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
county_sensitivity %>%
arrange(desc(abs(mean_shift))) %>%
mutate(
`Actual FMR` = dollar(mean_actual, accuracy = 1),
`Full Model` = dollar(mean_full, accuracy = 1),
`Unlearned Model` = dollar(mean_unlearned, accuracy = 1),
`Shift ($)` = dollar(mean_shift, accuracy = 1),
`Shift (%)` = paste0(round(pct_shift, 1), "%")
) %>%
select(County = county_label,
`Actual FMR`,
`Full Model`,
`Unlearned Model`,
`Shift ($)`,
`Shift (%)`) %>%
print(n = 62)# A tibble: 62 × 6
County `Actual FMR` `Full Model` `Unlearned Model` `Shift ($)` `Shift (%)`
<chr> <chr> <chr> <chr> <chr> <chr>
1 Rockland $2,514 $2,325 $1,792 -$533 -22.9%
2 New York $2,514 $2,435 $1,916 -$519 -21.3%
3 Bronx $2,514 $2,399 $1,880 -$518 -21.6%
4 Queens $2,514 $2,367 $1,893 -$474 -20%
5 Kings $2,514 $2,355 $1,886 -$469 -19.9%
6 Richmond $2,514 $2,332 $1,889 -$443 -19%
7 Putnam $2,514 $2,304 $1,875 -$429 -18.6%
8 Suffolk $2,290 $2,324 $1,938 -$387 -16.6%
9 Nassau $2,290 $2,321 $1,943 -$378 -16.3%
10 Westches… $2,088 $2,266 $1,897 -$369 -16.3%
11 Erie $1,065 $1,142 $1,455 $313 27.4%
12 Onondaga $1,067 $1,175 $1,446 $271 23%
13 Tompkins $1,505 $1,409 $1,147 -$262 -18.6%
14 Monroe $1,177 $1,317 $1,570 $253 19.2%
15 Jefferson $1,255 $1,164 $979 -$186 -15.9%
16 Orleans $1,177 $1,059 $914 -$145 -13.7%
17 Ulster $1,521 $1,440 $1,298 -$143 -9.9%
18 Schenect… $1,298 $1,272 $1,154 -$118 -9.3%
19 Schoharie $1,298 $1,179 $1,087 -$91 -7.7%
20 Wyoming $829 $874 $956 $82 9.4%
21 Ontario $1,177 $1,265 $1,184 -$81 -6.4%
22 Essex $916 $958 $1,028 $71 7.4%
23 Genesee $918 $975 $1,045 $70 7.2%
24 Lewis $881 $904 $972 $68 7.5%
25 Albany $1,298 $1,343 $1,410 $67 5%
26 Warren $1,156 $1,189 $1,123 -$66 -5.6%
27 Wayne $1,177 $1,169 $1,107 -$62 -5.3%
28 Livingst… $1,177 $1,140 $1,080 -$59 -5.2%
29 Steuben $839 $922 $972 $50 5.5%
30 Seneca $923 $909 $959 $50 5.5%
31 Herkimer $937 $949 $999 $50 5.2%
32 Broome $987 $962 $915 -$46 -4.8%
33 Saratoga $1,298 $1,460 $1,418 -$42 -2.9%
34 Cayuga $914 $933 $974 $40 4.3%
35 Oswego $1,067 $1,055 $1,015 -$39 -3.7%
36 Madison $1,067 $1,112 $1,073 -$38 -3.5%
37 Dutchess $1,607 $1,614 $1,650 $37 2.3%
38 Tioga $987 $1,012 $1,043 $31 3.1%
39 Franklin $837 $856 $884 $28 3.3%
40 Clinton $974 $996 $1,024 $28 2.8%
41 Cortland $935 $935 $961 $26 2.8%
42 Oneida $937 $996 $1,020 $24 2.4%
43 Chenango $834 $849 $872 $24 2.8%
44 Allegany $829 $841 $864 $23 2.7%
45 Washingt… $1,156 $1,093 $1,071 -$22 -2%
46 Greene $1,103 $1,104 $1,084 -$20 -1.8%
47 Cattarau… $829 $840 $856 $17 2%
48 Sullivan $1,024 $1,032 $1,047 $16 1.5%
49 Fulton $929 $890 $875 -$15 -1.7%
50 Hamilton $1,017 $992 $977 -$15 -1.5%
51 Delaware $829 $840 $853 $13 1.5%
52 Chemung $1,064 $959 $947 -$12 -1.2%
53 Columbia $1,103 $1,185 $1,196 $11 0.9%
54 Schuyler $877 $904 $915 $11 1.2%
55 Niagara $1,065 $1,047 $1,036 -$10 -1%
56 Montgome… $874 $865 $873 $7 0.9%
57 St. Lawr… $892 $907 $914 $7 0.8%
58 Orange $1,607 $1,663 $1,658 -$5 -0.3%
59 Renssela… $1,298 $1,339 $1,335 -$4 -0.3%
60 Yates $974 $963 $967 $4 0.4%
61 Chautauq… $829 $847 $844 -$3 -0.4%
62 Otsego $981 $966 $963 -$3 -0.3%
cat("\n=== KEY STATS ===\n")
=== KEY STATS ===
cat("Mean shift across all counties :",
dollar(mean(county_sensitivity$mean_shift, na.rm = TRUE),
accuracy = 1), "/month\n")Mean shift across all counties : -$70 /month
cat("Counties with positive shift :",
sum(county_sensitivity$mean_shift > 0), "of 62\n")Counties with positive shift : 28 of 62
cat("Counties with shift > $50/month :",
sum(abs(county_sensitivity$mean_shift) > 50), "of 62\n")Counties with shift > $50/month : 29 of 62
cat("Most sensitive county :",
county_sensitivity$county_label[
which.max(abs(county_sensitivity$mean_shift))], "\n")Most sensitive county : Rockland
cat("Least sensitive county :",
county_sensitivity$county_label[
which.min(abs(county_sensitivity$mean_shift))], "\n")Least sensitive county : Otsego
cat("━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n")━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
############################################################
# 12. RQ2 CONCLUSION
############################################################
cat("
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RQ2 CONCLUSION:
How sensitive are county-level HUD FMR predictions to
removal of high-interest-rate years (2022–2024)?
Machine unlearning was applied via case weight reduction
(1.00 → 0.00) on high-rate year observations across
both Linear Regression and Random Forest models.
Findings:
1. STABLE MODELS: Both models maintain high R² even after
full unlearning — FMR is largely driven by structural
county factors (geography, income, population), not
exclusively by the interest rate environment.
2. MODERATE SENSITIVITY: Mean prediction shift of",
dollar(mean(county_sensitivity$mean_shift,
na.rm = TRUE), accuracy = 1),
"/month —
modest but real influence of rate-hike years on what
the model learned about FMR levels.
3. GEOGRAPHIC VARIATION: Sensitivity varies widely across
the 62 counties. Some markets are more structurally
anchored; others show stronger rate-cycle dependence.
4. METHODOLOGICAL INSIGHT: Machine unlearning reveals
which counties' FMR benchmarks are most rate-sensitive,
informing policy around housing voucher adjustments
during rate cycle transitions.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
")
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RQ2 CONCLUSION:
How sensitive are county-level HUD FMR predictions to
removal of high-interest-rate years (2022–2024)?
Machine unlearning was applied via case weight reduction
(1.00 → 0.00) on high-rate year observations across
both Linear Regression and Random Forest models.
Findings:
1. STABLE MODELS: Both models maintain high R² even after
full unlearning — FMR is largely driven by structural
county factors (geography, income, population), not
exclusively by the interest rate environment.
2. MODERATE SENSITIVITY: Mean prediction shift of -$70 /month —
modest but real influence of rate-hike years on what
the model learned about FMR levels.
3. GEOGRAPHIC VARIATION: Sensitivity varies widely across
the 62 counties. Some markets are more structurally
anchored; others show stronger rate-cycle dependence.
4. METHODOLOGICAL INSIGHT: Machine unlearning reveals
which counties' FMR benchmarks are most rate-sensitive,
informing policy around housing voucher adjustments
during rate cycle transitions.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Conclusion
This analysis finds that housing affordability in New York State is primarily a structural and geographic problem. Local factors — population density, home price levels, and regional rent benchmarks — explain approximately 96% of the variation in housing burden across counties, and these relationships have remained stable even through the interest rate shock of 2022–2024. Macroeconomic variables such as mortgage rates and the Fed Funds Rate, while economically significant at the national level, contribute minimally to predicting county-level burden once local structural conditions are accounted for.
The machine unlearning experiment adds an important nuance: while the high-rate era did influence what models learned about rental markets — producing a mean statewide prediction shift of −$70/month — that influence was heavily concentrated in high-cost downstate counties. Rural and mid-size upstate markets were largely unaffected, reflecting their lower sensitivity to rate-driven demand swings. These findings suggest that meaningful affordability improvements in New York will require addressing local supply constraints and income gaps directly, rather than relying on monetary policy or cyclical rate relief to restore affordability.
References:
OpenAI. (2026). ChatGPT conversation on selective machine unlearning for counterfactual housing market simulation in New York State [Large language model]. ChatGPT. https://chat.openai.com/
Claude. (2026). Claude conversation on New York housing affordability analysis with multi-source data integration [Large language model]. Claude. https://claude.ai/