Introduction

This report documents follow-up research on Harvard researcher Justin Katz’s theory that the recent housing market phenomenon of rising prices, even in the face of higher interest rates, stems from housing supply rather than financial policy.

Hypothesis

Since the onset of COVID-19, housing market trends have become highly irregular: transactions have dwindled even as prices have risen. Initially, during 2020, prices fell, which created an opportunity for prospective home-buyers to take advantage of financial policies designed to stimulate the economy (Schwartz & Wachter, 2022). In the years following the lock-down, however, housing prices rebounded, and home-buyers who secured favorable mortgage terms prior to 2021 have “locked in”, unwilling to switch homes, so rising prices have not been accompanied by more transactions.

While researching this phenomenon, Justin Katz observed evidence that rising rates reduced the supply of houses on the market, as homeowners who might otherwise sell and buy again preferred to keep their current rates. Because of this, Katz argues that increasing the supply of houses, rather than changing mortgage or interest rates, is necessary to stimulate sales and lower prices for prospective home-buyers (Katz, 2026).

This paper examines housing market data provided by the US Census Bureau for 2019–2023, the most up-to-date complete set of data for the general US housing market available from government sources, to test the validity of Katz’s suggestion. If housing supply proves a stronger explanatory variable than mortgage rates or any other rate-related variable, this would serve as evidence that Katz’s theory holds weight, justifying further research into the ability of housing supply to explain fluctuations in prices better than mortgage rates do.

Variable Selection

Using the data provided by the US Census Bureau, we can use the Boruta algorithm to find which explanatory variables are most relevant for our regression model of median home values.

# Load packages
library(tidyverse)   # dplyr, tidyr, ggplot2, tibble
library(tidycensus)  # load_variables(), get_acs()
library(fredr)       # fredr(), fredr_set_key()
library(Boruta)      # Boruta(), attStats()

# Downloading data sets
vars <- load_variables(2023, "acs5", cache = TRUE)
View(vars)

# Pulling data from FRED to develop the Boruta algorithm
# The API key is read from the FRED_API_KEY environment variable rather than hard-coded
fredr_set_key(Sys.getenv("FRED_API_KEY"))

# Current 2023 mortgage rate
mortgage_30 <- fredr("MORTGAGE30US",
                     observation_start = as.Date("2023-01-01"),
                     observation_end = as.Date("2023-12-31")) %>%
  summarise(avg_mortgage_rate_2023 = mean(value, na.rm = TRUE))

# Locked-in rate baseline (2020-2021 low-rate era)
locked_in_rate <- fredr("MORTGAGE30US",
                        observation_start = as.Date("2020-01-01"),
                        observation_end = as.Date("2021-12-31")) %>%
  summarise(locked_in_rate = mean(value, na.rm = TRUE))

# Rate differential
rate_differential <- mortgage_30$avg_mortgage_rate_2023 -
  locked_in_rate$locked_in_rate

# Total housing units (regional stock)
housing_units <- get_acs(
  geography = "state", variables = "B25001_001",
  year = 2023, survey = "acs5"
) %>% select(GEOID, state = NAME, housing_units = estimate)

# Vacant units (available supply)
vacant_units <- get_acs(
  geography = "state", variables = "B25002_003",
  year = 2023, survey = "acs5"
) %>% select(GEOID, vacant_units = estimate)

# Owner-occupied units (proxy for locked-in homeowners)
owner_occupied <- get_acs(
  geography = "state", variables = "B25003_002",
  year = 2023, survey = "acs5"
) %>% select(GEOID, owner_occupied = estimate)

# Median home value (outcome variable)
home_value <- get_acs(
  geography = "state", variables = "B25077_001",
  year = 2023, survey = "acs5"
) %>% select(GEOID, median_home_value = estimate)

# --- Control Variables ---

# Median household income
income <- get_acs(
  geography = "state", variables = "B19013_001",
  year = 2023, survey = "acs5"
) %>% select(GEOID, median_income = estimate)

# Total population (to compute per-capita supply)
population <- get_acs(
  geography = "state", variables = "B01001_001",
  year = 2023, survey = "acs5"
) %>% select(GEOID, population = estimate)

# Combining FRED and Census data into one set
analysis_df <- home_value %>%
  left_join(housing_units, by = "GEOID") %>%
  left_join(vacant_units, by = "GEOID") %>%
  left_join(owner_occupied, by = "GEOID") %>%
  left_join(income, by = "GEOID") %>%
  left_join(population, by = "GEOID") %>%
  mutate(
    # Derived supply variables (Katz's core metrics)
    vacancy_rate = vacant_units / housing_units,
    units_per_capita = housing_units / population,
    owner_occupied_rate = owner_occupied / housing_units,

    # Attach national mortgage rate variables to each state row
    avg_mortgage_rate = mortgage_30$avg_mortgage_rate_2023,
    rate_differential = rate_differential,

    # Interaction term: does lock-in hit harder in low-supply states?
    lockin_x_supply = rate_differential * units_per_capita
  ) %>%
  drop_na()

View(analysis_df)

# Performing the actual Boruta analysis
set.seed(123)  # for code reproducibility
boruta_result <- Boruta(
  median_home_value ~ units_per_capita + vacancy_rate + avg_mortgage_rate +
    owner_occupied_rate + median_income + lockin_x_supply,
  data = analysis_df, doTrace = 2, maxRuns = 100
)

# Plot
attStats(boruta_result) %>%
  rownames_to_column("Feature") %>%
  ggplot(aes(x = reorder(Feature, meanImp), y = meanImp, fill = decision)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("Confirmed" = "green4",
                               "Rejected" = "red3",
                               "Tentative" = "gold")) +
  labs(title = "What Drives 2023 Home Prices? Supply vs. Rates",
       x = "Feature", y = "Mean Importance Score") +
  theme_minimal()

# Integrating the factor variable supply tier
analysis_df <- analysis_df %>%
  mutate(
    supply_tier = case_when(
      # assumption: "Low Supply" means the bottom quartile of units per capita
      units_per_capita <= quantile(units_per_capita, 0.25) ~ "Low Supply",
      TRUE ~ "Mid/High Supply"
    ),
    supply_tier = factor(supply_tier, levels = c("Mid/High Supply", "Low Supply"))
  )

Based on the results, the five most relevant variables were median income, units per capita, lockin_x_supply, owner-occupied rate, and vacancy rate. Median income is historically and intuitively the best indicator of home values, so even though it does not bear directly on our hypothesis, any regression model that omitted it would fit poorly. Units per capita, owner-occupied rate, and vacancy rate all capture the relationship between regional supply and median home values. lockin_x_supply is an interaction between the lock-in effect and supply, and its confirmed importance is consistent with Katz’s observation that constrained supply interacts with the lock-in effect.

# --- Histograms for all quantitative variables ---
analysis_df %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 15, fill = "steelblue",
                 color = "white", alpha = 0.8) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Histograms of All Quantitative Variables") +
  theme_minimal()

# --- Quantile (Q-Q) Plots ---
# loan_to_income is assumed to have been added to analysis_df beforehand
par(mfrow = c(2, 3))
for (var in c("median_home_value", "median_income",
              "units_per_capita", "vacancy_rate",
              "loan_to_income", "avg_mortgage_rate")) {
  qqnorm(analysis_df[[var]], main = paste("Q-Q Plot:", var))
  qqline(analysis_df[[var]], col = "red")
}

# Descriptive statistics table
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

desc_stats <- analysis_df %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(),
               names_to = "Variable",
               values_to = "Value") %>%
  group_by(Variable) %>%
  summarise(
    Mean = mean(Value, na.rm = TRUE),
    Median = median(Value, na.rm = TRUE),
    SD = sd(Value, na.rm = TRUE),
    Min = min(Value, na.rm = TRUE),
    Max = max(Value, na.rm = TRUE),
    .groups = "drop") %>%
  mutate(across(where(is.numeric), ~round(., 3)))

kable(desc_stats, caption = "Descriptive Statistics: Housing Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

In addition, the dataset includes a binary supply_tier variable that sorts states into those with low units per capita and those with adequate units per capita. This could serve better in models than raw units per capita as an explanatory variable if the effect of units per capita is not well captured by a linear term.
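As a rough illustration of why a tier variable can outperform a linear term, the sketch below uses simulated data only (not the ACS sample). The objects `toy`, `fit_linear`, and `fit_tier`, the bottom-quartile cutoff, and the size of the price jump are all assumptions made for the example.

```r
# Simulated data only: 51 hypothetical regions, not the ACS sample
set.seed(1)
toy <- data.frame(units_per_capita = runif(51, min = 0.35, max = 0.55))

# Assumed tier rule: bottom quartile of units per capita is "Low Supply"
cutoff <- quantile(toy$units_per_capita, 0.25)
toy$supply_tier <- factor(
  ifelse(toy$units_per_capita <= cutoff, "Low Supply", "Mid/High Supply"),
  levels = c("Mid/High Supply", "Low Supply")
)

# Hypothetical outcome: prices jump in low-supply regions instead of
# falling smoothly as supply rises
toy$log_value <- 12.5 + 0.4 * (toy$supply_tier == "Low Supply") +
  rnorm(51, sd = 0.1)

fit_linear <- lm(log_value ~ units_per_capita, data = toy)
fit_tier   <- lm(log_value ~ supply_tier, data = toy)

# Compare how much variation each specification explains
c(linear = summary(fit_linear)$r.squared,
  tier   = summary(fit_tier)$r.squared)
```

When the true relationship is a step like this one, the tier specification should explain more of the variation; if the relationship is actually smooth, the linear term would be the better choice.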

Dataset Description

Most of the dataset is drawn from the US Census Bureau’s 5-year ACS estimates ending in 2023 and comprises 51 state-level observations (the 50 states and the District of Columbia). In addition, a dataset was downloaded from the Federal Reserve Bank of St. Louis for the same time period to provide supplementary information on mortgage rates. The variables are defined as follows:

Median Home Value: median owner-occupied home value in dollars
Median Income: median household income in dollars
Units Per Capita: housing units per person in a given state
Vacancy Rate: ratio of vacant units to total units
Owner Occupied Rate: ratio of owner-occupied units to total units
Rate Differential: 2023 mortgage rate minus 2020–21 baseline rate
Supply Tier: binary factor, Low Supply vs. Adequate Supply (labeled Mid/High Supply in the code)

# Creating density plots
library(fitdistrplus)

plot_density_fit <- function(var, varname) {
  fit <- fitdist(var, "norm")
  fit_mean <- fit$estimate["mean"]
  fit_sd <- fit$estimate["sd"]
  data.frame(x = var) %>%
    ggplot(aes(x = x)) +
    geom_density(fill = "blue", alpha = 0.5) +
    stat_function(fun = dnorm,
                  args = list(mean = fit_mean, sd = fit_sd),
                  color = "red", linewidth = 1,
                  linetype = "dashed") +
    labs(title = paste("Density Plot:", varname),
         subtitle = paste("Fitted Normal: mean =",
                          round(fit_mean, 2), ", sd =", round(fit_sd, 2)),
         x = varname, y = "Density") +
    theme_minimal()
}

plot_density_fit(analysis_df$median_home_value, "Median Home Value")
plot_density_fit(analysis_df$median_income, "Median Income")
plot_density_fit(analysis_df$units_per_capita, "Units Per Capita")
plot_density_fit(analysis_df$vacancy_rate, "Vacancy Rate")
# loan_to_income is assumed to have been added to analysis_df beforehand
plot_density_fit(analysis_df$loan_to_income, "Loan to Income")

# Observing and fixing nonlinearities
library(car)  # crPlots(), vif()

# Checking for nonlinearity
model_check <- lm(median_home_value ~ median_income + units_per_capita + vacancy_rate,
                  data = analysis_df)
crPlots(model_check, main = "Component + Residual Plots (Non-Linearity Check)")

# Addressing nonlinearities
analysis_df <- analysis_df %>%
  mutate(log_home_value = log(median_home_value),
         log_income = log(median_income),
         log_units_percap = log(units_per_capita))

Upon analysis, home values, income, and units per capita are all right-skewed, so it is appropriate to use their natural logarithms going forward. Left untransformed, they would enter the model through relationships that a linear specification fits poorly, which could bias the OLS estimates and make our results less applicable.
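The effect of the log transformation on skewness can be seen in a small sketch. The data here are simulated from a log-normal distribution as a stand-in for a variable such as median home value; the parameters and the helper `skewness()` are assumptions for illustration, not values from the ACS data.

```r
# Simulated right-skewed data (log-normal), not the ACS figures
set.seed(42)
home_value <- rlnorm(500, meanlog = 12.5, sdlog = 0.5)

# Sample skewness: mean of the cubed standardized values
# (approximately 0 for a symmetric distribution)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(home_value)       # positive: long right tail
skew_log <- skewness(log(home_value))  # close to zero after the transform

round(c(raw = skew_raw, log = skew_log), 2)
```

The same comparison could be run on the actual columns of analysis_df to confirm that the transformed variables are closer to symmetric.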

Three models were considered as potential regression models:

Model 1: All six candidate variables
Model 2: Drops owner-occupied rate and the lock-in interaction term (lockin_x_supply)
Model 3 (Final): Median income, units per capita, and owner-occupied rate only

# Model building

# Model 1: All variables
model1 <- lm(log_home_value ~ log_units_percap + vacancy_rate + log_income +
               owner_occupied_rate + avg_mortgage_rate + lockin_x_supply,
             data = analysis_df)

# Model 2: Some variables discarded
model2 <- lm(log_home_value ~ avg_mortgage_rate + log_units_percap +
               vacancy_rate + log_income,
             data = analysis_df)

# Model 3: Minimalist model
model3 <- lm(log_home_value ~ log_units_percap + owner_occupied_rate + log_income,
             data = analysis_df)

summary(model1)
summary(model2)
summary(model3)

# Detecting outliers with Cook's distance
cooksd <- cooks.distance(model3)
data.frame(state = analysis_df$state,
           cooks = cooksd) %>%
  ggplot(aes(x = reorder(state, cooks), y = cooks)) +
  geom_col(fill = "blue") +
  geom_hline(yintercept = 4 / nrow(analysis_df),
             color = "red", linetype = "dashed") +
  coord_flip() +
  labs(title = "Cook's Distance by State",
       subtitle = "Red line = 4/n influence threshold",
       x = "State", y = "Cook's Distance") +
  theme_minimal()

# Remove the outlier (row 52, Puerto Rico) and refit the models
analysis_df <- analysis_df[-52, ]

# Model 1: All variables
model1 <- lm(log_home_value ~ log_units_percap + vacancy_rate + log_income +
               owner_occupied_rate + avg_mortgage_rate + lockin_x_supply,
             data = analysis_df)

# Model 2: Some variables discarded
model2 <- lm(log_home_value ~ avg_mortgage_rate + log_units_percap +
               vacancy_rate + log_income,
             data = analysis_df)

# Model 3: Minimalist model
model3 <- lm(log_home_value ~ log_units_percap + owner_occupied_rate + log_income,
             data = analysis_df)
summary(model3)

Cook’s distance revealed one highly influential outlier: Puerto Rico. This observation was removed, as outside research also made clear that the financial systems surrounding the Puerto Rican housing market differ significantly from those of the states. How well Katz’s theory applies to Puerto Rico specifically could be researched in a follow-up, but because this paper seeks an answer generalizable to most regions of the United States, removing the Puerto Rico data was deemed to yield a more applicable model.
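The flag-and-remove step can also be written so that influential observations are dropped by name rather than by row position, which is less fragile if the row order changes. The following is a minimal sketch on simulated data; the data frame `toy`, its `region` labels, and the planted extreme point are all hypothetical.

```r
# Simulated data: 20 hypothetical regions plus one deliberately extreme point
set.seed(7)
toy <- data.frame(region = paste0("R", 1:21), x = c(rnorm(20), 8))
toy$y <- 1 + 0.5 * toy$x + rnorm(21, sd = 0.3)
toy$y[21] <- -6  # breaks the pattern followed by the other 20 points

fit <- lm(y ~ x, data = toy)
cooks <- cooks.distance(fit)

# Flag observations above the 4/n rule of thumb
flagged <- toy$region[cooks > 4 / nrow(toy)]
flagged

# Drop flagged rows by name, then refit on the remaining observations
toy_clean <- toy[!toy$region %in% flagged, ]
fit_clean <- lm(y ~ x, data = toy_clean)
coef(fit_clean)
```

The 4/n threshold is only a rule of thumb, so any flagged observation still warrants the kind of substantive justification given above for Puerto Rico.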

# Diagnostic tests for preferred models
> library(lmtest)  # resettest()
> vif(model3)
   log_units_percap owner_occupied_rate          log_income
           1.223799            1.167750            1.282496
> resettest(model3, power = 2:3, type = "fitted")

	RESET test

data:  model3
RESET = 0.28951, df1 = 2, df2 = 45, p-value = 0.75

> anova(model3, model1)
Analysis of Variance Table

Model 1: log_home_value ~ log_units_percap + owner_occupied_rate + log_income
Model 2: log_home_value ~ log_units_percap + vacancy_rate + log_income +
    owner_occupied_rate + avg_mortgage_rate + lockin_x_supply
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     47 1.4362
2     45 1.3109  2    0.1253 2.1506 0.1282
> anova(model3, model2)
Analysis of Variance Table

Model 1: log_home_value ~ log_units_percap + owner_occupied_rate + log_income
Model 2: log_home_value ~ avg_mortgage_rate + log_units_percap + vacancy_rate + log_income
  Res.Df    RSS Df Sum of Sq F Pr(>F)
1     47 1.4362
2     47 1.8528  0  -0.41659
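The F statistic in the first analysis-of-variance table can be reproduced by hand from the residual sums of squares, which makes explicit what the test compares. The sketch below copies the four inputs from that table; the variable names are introduced here for illustration.

```r
# Values copied from the anova(model3, model1) table
rss_restricted   <- 1.4362  # model3 (fewer predictors)
rss_unrestricted <- 1.3109  # model1 (more predictors)
df_restricted    <- 47      # residual degrees of freedom, model3
df_unrestricted  <- 45      # residual degrees of freedom, model1

# Number of additional coefficients being tested
q <- df_restricted - df_unrestricted

# F = (drop in RSS per added coefficient) / (residual variance of larger model)
f_stat <- ((rss_restricted - rss_unrestricted) / q) /
  (rss_unrestricted / df_unrestricted)
p_value <- pf(f_stat, q, df_unrestricted, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 4)
```

The result should agree with the F and Pr(>F) columns up to rounding. This test assumes nested models; model2 and model3 each contain a predictor the other lacks, which appears to be why the second table reports zero degrees of freedom and no F statistic.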

Even before running more advanced diagnostic tests, the simple model showed clear signs of being the best candidate regression model. It was the only model in which every explanatory variable was statistically significant, and its smaller set of variables also reduces the risk of overfitting.

# Comparison of models
> AIC(model1, model2, model3)
       df       AIC
model1  7 -27.98407
model2  5 -14.33936
model3  5 -27.32850
> BIC(model1, model2, model3)
       df        BIC
model1  7 -14.461289
model2  5  -4.680228
model3  5 -17.669367
> library(lmtest)
> bptest(model3)
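As a consistency check on the AIC and BIC figures for model3, both criteria can be rebuilt from the residual sum of squares using the Gaussian log-likelihood that R applies to linear models. The inputs below are taken from the reported output (n = 51 follows from 47 residual degrees of freedom plus 4 coefficients); the variable names are illustrative.

```r
# Inputs taken from the reported output for model3
n   <- 51      # observations: 47 residual df + 4 estimated coefficients
rss <- 1.4362  # residual sum of squares from the anova table
k   <- 5       # "df" column: 4 coefficients plus the error variance

# Gaussian log-likelihood evaluated at the ML variance estimate rss / n
log_lik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)

aic <- -2 * log_lik + 2 * k
bic <- -2 * log_lik + log(n) * k

round(c(AIC = aic, BIC = bic), 2)
```

Small differences from the printed values are expected because the RSS is only available to four decimal places. The calculation also shows why BIC penalizes the larger model1 more heavily: each extra parameter costs log(51), roughly 3.9, instead of 2.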

The Breusch-Pagan test revealed heteroskedasticity. When interpreting our results, we therefore replace the conventional OLS standard errors with HC3 robust standard errors. This affects only the standard errors and the inference based on them; the coefficient estimates and their economic implications are unchanged.
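To show what HC3 standard errors involve, the sketch below computes them by hand in base R on simulated heteroskedastic data. It is an illustration under stated assumptions, not the report's actual computation: the data, `fit`, and `vcov_hc3` are hypothetical, and in practice a packaged routine such as `sandwich::vcovHC(fit, type = "HC3")` would normally be used.

```r
# Simulated heteroskedastic data: the error spread grows with x
set.seed(123)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)

fit <- lm(y ~ x)
X <- model.matrix(fit)  # design matrix, including the intercept column
e <- residuals(fit)     # OLS residuals
h <- hatvalues(fit)     # leverage of each observation

# HC3 sandwich estimator:
# (X'X)^-1  X' diag(e^2 / (1 - h)^2) X  (X'X)^-1
bread <- solve(crossprod(X))
meat  <- crossprod(X * (e / (1 - h)))
vcov_hc3 <- bread %*% meat %*% bread

# Coefficients are identical under both approaches; only the SEs differ
cbind(estimate = coef(fit),
      se_ols   = sqrt(diag(vcov(fit))),
      se_hc3   = sqrt(diag(vcov_hc3)))
```

Dividing each residual by (1 - h) inflates the contribution of high-leverage observations, which is what makes HC3 comparatively conservative in small samples such as 51 state-level observations.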

# Boot test
library(boot)

boot_fn <- function(data, indices) {
  d <- data[indices, ]  # rows resampled for this replication
  # Fit on the resample d so that the estimates vary across replications
  fit <- lm(log_home_value ~
              log_income +
              log_units_percap +
              owner_occupied_rate,
            data = d)
  return(coef(fit))
}

set.seed(123)
boot_results <- boot(
  data = analysis_df,
  statistic = boot_fn,
  R = 500
)

print(boot_results)

# Generating boot histogram
coef_names <- c("Intercept",
                "log(Income)",
                "log(Units Per Capita)",
                "Owner Occupied Rate")
boot_df <- as.data.frame(boot_results$t)
colnames(boot_df) <- coef_names

boot_df %>%
  pivot_longer(everything(),
               names_to = "Coefficient",
               values_to = "Estimate") %>%
  left_join(
    data.frame(
      Coefficient = coef_names,
      OLS_Estimate = boot_results$t0
    ),
    by = "Coefficient"
  ) %>%
  ggplot(aes(x = Estimate)) +
  geom_histogram(bins = 30, fill = "steelblue",
                 color = "white", alpha = 0.8) +
  geom_vline(aes(xintercept = OLS_Estimate),
             color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = 0,
             color = "black", linetype = "dotted", linewidth = 0.7) +
  facet_wrap(~Coefficient, scales = "free", ncol = 2) +
  labs(
    title = "Bootstrapped Coefficient Distributions: Model 3",
    subtitle = "500 replications | Red dashed = OLS estimate | Black dotted = zero",
    x = "Coefficient Estimate",
    y = "Count",
    caption = "Seed = 123"
  ) +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold", size = 9))

# Model summary
library(modelsummary)

# coef_map and gof_map are assumed to have been defined earlier in the session
modelsummary(model3,
             coef_map = coef_map,
             gof_map = gof_map,
             stars = c("*" = 0.10, "**" = 0.05, "***" = 0.01),
             title = "Table 1: Determinants of State-Level Home Prices (2023)",
             notes = list(
               "* p < 0.10, ** p < 0.05, *** p < 0.01",
               "Standard errors in parentheses.",
               "Outcome variable: log(Median Home Value).",
               "Sources: U.S. Census ACS 5-Year Estimates (2023); FRED MORTGAGE30US."
             ),
             output = "kableExtra")

The intercept serves to fit the model once the other variables are included and has no direct economic interpretation. The coefficient on log(Median Income) is an elasticity: a 1% increase in median household income is associated with a 1.711% increase in median home value. The coefficient on log(Units Per Capita) is also an elasticity: a 1% increase in units per capita is associated with a 0.852% decrease in median home value. The coefficient on Owner Occupied Rate (about -2.426) is a semi-elasticity. Because the rate is measured as a share between 0 and 1, a one-percentage-point increase in the owner-occupied rate is associated with a decrease of roughly 2.4% in median home value.
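The arithmetic behind these interpretations can be checked directly. In the sketch below, the income and units-per-capita coefficients are the values reported in the text, while -2.426 for the owner-occupied rate is an assumption inferred from the semi-elasticity reported in the text; the variable names are illustrative.

```r
# Coefficients on the log(median home value) outcome
b_income <- 1.711   # on log(median income), as reported
b_units  <- -0.852  # on log(units per capita), as reported
b_owner  <- -2.426  # on owner-occupied rate (0-1 share); assumed from the text

# Log-log terms: exact % change in home value for a 1% rise in the regressor
pct_income <- (1.01^b_income - 1) * 100
pct_units  <- (1.01^b_units - 1) * 100

# Log-level term: exact % change for a one-percentage-point (0.01) rise
pct_owner_1pp <- (exp(b_owner * 0.01) - 1) * 100

round(c(income = pct_income, units = pct_units, owner_1pp = pct_owner_1pp), 2)
```

The exact figures differ slightly from the coefficients themselves because reading a log coefficient as a percentage change is a first-order approximation, which is accurate for small changes like these.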

The coefficients for median income and units per capita both agree with general economic theory and support the hypothesis, as an increase in housing supply per person exerts a downward pressure on housing prices, as Katz outlined.

Model Fit

The R² and adjusted R² values of 0.810 and 0.798, respectively, indicate that the model explains roughly 80% of the variation in state-level median home values. For an economic model of this kind, that is enough explained variation for the results to be considered of potential value.
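The two fit statistics are linked by a simple degrees-of-freedom adjustment, which can be verified from the reported R². The sample size (51 observations) and the number of predictors (3) are taken from the final model; the variable names are illustrative.

```r
r2 <- 0.810  # reported R-squared
n  <- 51     # observations in the final sample
p  <- 3      # predictors in the final model (excluding the intercept)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)
```

The small gap between the two values reflects the fact that the final model uses only three predictors relative to 51 observations.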

While the regression model provides generalizable evidence that Katz’s theory holds — that supply conditions better inform home values than mortgage and interest rates — this evidence comes with caveats about its applicability that would demand further research.

Limitations

As the Puerto Rico case illustrates, one limitation is that this research relies on large-scale data for the entire US, so the findings may not generalize to every region of the country. Analysis of specific states could reveal a more complex picture that is diluted in national-level data.

Additionally, because the final regression model includes only three explanatory variables and interprets them causally, there is a high chance that omitted variable bias is present. However, when analyzing the housing market on a large scale, this may not prove as detrimental as it would for smaller models: many variables related to home or neighborhood quality affect the housing market, but when examined across a whole country they become less relevant in favor of grasping the “bigger picture.”

Future Research

As outlined in the limitations section, follow-up research that looks more closely at variables within each individual state or region could lead to better evidence for Katz’s theory. Additionally, incorporating HMDA loan-level data could provide a more granular view of the lock-in effect at the regional level.

References

Joint Center for Housing Studies. “Did Mortgages with Locked-in Low Rates Lead to Rising House Prices?” Harvard University, 3 Mar. 2026. www.jchs.harvard.edu/blog/did-mortgages-locked-low-rates-lead-rising-house-prices

Schwartz, Amy Ellen, and Susan Wachter. “COVID-19’s Impacts on Housing Markets: Introduction.” Journal of Housing Economics, 13 Dec. 2022, p. 101911. https://doi.org/10.1016/j.jhe.2022.101911

Report on Housing Market Data

Leo Thompson

March 12, 2026