Housing Affordability analysis

OVERVIEW AND BENCHMARK

The Department of Housing and Urban Development use a 30 percent of income benchmark to evaluate affordability. When housing costs exceed 30 percent of income, households are typically considered cost burdened and housing is treated as unaffordable under this standard.

In this project, affordability is defined as the share of annual income required to cover annual mortgage payments.

Affordability ratio equals annual mortgage payment divided by annual income.

Interpretation If affordability ratio is 0.30, mortgage costs equal 30 percent of income. If affordability ratio is 0.60, mortgage costs equal 60 percent of income, which is two times the 30 percent benchmark. If affordability ratio is 1.50, mortgage costs equal 150 percent of income, which is five times the 30 percent benchmark.

Research question

Are homes across metros affordable under the 30 percent benchmark, and how do affordability and home values vary by state and metro?

Hypotheses for the t test

Null hypothesis The mean affordability ratio across metros equals 0.30.

Alternative hypothesis The mean affordability ratio across metros is greater than 0.30.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)
library(stringr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(scales)
library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.5.2

zhvi_file <- "Metro_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv"
mort_file <- "Metro_mortgage_payment_downpayment_0.20_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv"
wage_file <- "LEU0252881500Q.csv"

zhvi_raw <- read.csv(zhvi_file, check.names = FALSE)
mort_raw <- read.csv(mort_file, check.names = FALSE)
wage_raw <- read.csv(wage_file, check.names = FALSE)

get_date_cols <- function(df) {
names(df)[str_detect(names(df), "^[0-9]{4}-[0-9]{2}-[0-9]{2}$")]
}

pick_first_existing <- function(df, candidates) {
hit <- candidates[candidates %in% names(df)][1]
if (is.na(hit)) NA_character_ else hit
}

to_long_zillow <- function(df, value_name) {
date_cols <- get_date_cols(df)

region_col <- pick_first_existing(df, c("RegionName", "Region", "Name"))
state_col  <- pick_first_existing(df, c("StateName", "State", "StateCode"))
rank_col   <- pick_first_existing(df, c("SizeRank", "Rank"))

if (is.na(region_col)) stop("Could not find a metro name column like RegionName.")
if (is.na(state_col))  stop("Could not find a state column like StateName or State.")

keep_cols <- c(region_col, state_col, rank_col)
keep_cols <- keep_cols[!is.na(keep_cols)]

out <- df %>%
select(all_of(keep_cols), all_of(date_cols)) %>%
rename(
RegionName = all_of(region_col),
StateName  = all_of(state_col)
)

out %>%
pivot_longer(
cols = all_of(date_cols),
names_to = "date",
values_to = value_name
) %>%
mutate(date = as.Date(date))
}

latest_value_by_metro <- function(long_df, value_col) {
long_df %>%
group_by(RegionName, StateName) %>%
filter(date == max(date, na.rm = TRUE)) %>%
ungroup() %>%
select(RegionName, StateName, date, all_of(value_col))
}

calc_growth_5y <- function(zhvi_long) {
last_date <- max(zhvi_long$date, na.rm = TRUE)
start_date <- last_date %m-% years(5)

start_vals <- zhvi_long %>%
group_by(RegionName, StateName) %>%
filter(date == min(date[date >= start_date], na.rm = TRUE)) %>%
ungroup() %>%
select(RegionName, StateName, zhvi_start = zhvi)

end_vals <- zhvi_long %>%
group_by(RegionName, StateName) %>%
filter(date == max(date, na.rm = TRUE)) %>%
ungroup() %>%
select(RegionName, StateName, zhvi_end = zhvi)

inner_join(start_vals, end_vals, by = c("RegionName", "StateName")) %>%
mutate(growth_5y = (zhvi_end / zhvi_start) - 1)
}

zhvi_long <- to_long_zillow(zhvi_raw, "zhvi") %>% filter(!is.na(zhvi))
mort_long <- to_long_zillow(mort_raw, "mortgage_monthly") %>% filter(!is.na(mortgage_monthly))

zhvi_latest <- latest_value_by_metro(zhvi_long, "zhvi")
mort_latest <- latest_value_by_metro(mort_long, "mortgage_monthly")

growth_5y <- calc_growth_5y(zhvi_long)

wage_raw_names <- names(wage_raw)
value_col <- wage_raw_names[str_detect(wage_raw_names, "VALUE|Value|value|LEU0252881500Q")]
date_col  <- wage_raw_names[str_detect(wage_raw_names, "DATE|Date|date")]

if (length(value_col) == 0) value_col <- wage_raw_names[2]
if (length(date_col) == 0)  date_col  <- wage_raw_names[1]

wage_series <- wage_raw %>%
transmute(
date = as.Date(.data[[date_col]]),
weekly_earnings = as.numeric(.data[[value_col]])
) %>%
filter(!is.na(date), !is.na(weekly_earnings))

latest_weekly <- wage_series %>%
filter(date == max(date, na.rm = TRUE)) %>%
summarize(weekly_earnings = mean(weekly_earnings, na.rm = TRUE)) %>%
pull(weekly_earnings)

annual_income_benchmark <- latest_weekly * 52

afford_df <- zhvi_latest %>%
inner_join(mort_latest, by = c("RegionName", "StateName", "date")) %>%
left_join(growth_5y, by = c("RegionName", "StateName")) %>%
mutate(
annual_mortgage = mortgage_monthly * 12,
annual_income = annual_income_benchmark,
affordability_ratio = annual_mortgage / annual_income,
affordability_gap = affordability_ratio - 0.30,
affordable_flag = affordability_ratio <= 0.30
) %>%
filter(is.finite(affordability_ratio), is.finite(zhvi), is.finite(annual_mortgage))

summary_tbl <- afford_df %>%
summarize(
metros = n(),
mean_afford_ratio = mean(affordability_ratio, na.rm = TRUE),
median_afford_ratio = median(affordability_ratio, na.rm = TRUE),
share_above_30pct = mean(affordability_ratio > 0.30, na.rm = TRUE)
)

summary_tbl

## # A tibble: 1 × 4
##   metros mean_afford_ratio median_afford_ratio share_above_30pct
##    <int>             <dbl>               <dbl>             <dbl>
## 1    390             0.317               0.279             0.426

top_state_price <- afford_df %>%
group_by(StateName) %>%
slice_max(zhvi, n = 1, with_ties = FALSE) %>%
ungroup() %>%
arrange(desc(zhvi)) %>%
slice_head(n = 15)

ggplot(top_state_price, aes(x = reorder(paste0(RegionName, ", ", StateName), zhvi), y = zhvi)) +
geom_col() +
coord_flip() +
scale_y_continuous(labels = dollar) +
labs(
title = "Most expensive metro in each state by typical home value",
x = "Metro and state",
y = "Typical home value"
)

Chart 1 Most expensive metro in each state

This chart shows the single most expensive metro in each state based on the Zillow Home Value Index. Think of the Zillow Home Value Index as the typical home value in that metro, similar to an average level for that housing market, not a specific listing price.

top_state_unaffordable <- afford_df %>%
group_by(StateName) %>%
slice_max(affordability_ratio, n = 1, with_ties = FALSE) %>%
ungroup() %>%
arrange(desc(affordability_ratio)) %>%
slice_head(n = 15)

ggplot(top_state_unaffordable,
aes(x = reorder(paste0(RegionName, ", ", StateName), affordability_ratio),
y = affordability_ratio,
fill = affordable_flag)) +
geom_col(show.legend = FALSE) +
coord_flip() +
scale_y_continuous(labels = percent) +
labs(
title = "Most unaffordable metro in each state by affordability ratio",
x = "Metro and state",
y = "Affordability ratio"
)

Chart 2 Most unaffordable metro in each state

This chart focuses on affordability instead of price. The affordability ratio is the share of income required to cover mortgage payments. If a metro shows 150 percent, that means the mortgage payment is about 1.5 times total annual income, which is five times the 30 percent affordability benchmark.

state_avg_price <- afford_df %>%
group_by(StateName) %>%
summarize(mean_zhvi = mean(zhvi, na.rm = TRUE)) %>%
arrange(mean_zhvi) %>%
slice_head(n = 10)

ggplot(state_avg_price,
aes(x = reorder(StateName, mean_zhvi), y = mean_zhvi)) +
geom_col() +
coord_flip() +
scale_y_continuous(labels = dollar) +
labs(
title = "Ten states with the lowest average typical home value",
x = "State",
y = "Average typical home value"
)

Chart 3 Most affordable states by average home value

This chart averages home values across all metros within each state. It answers a simple question: which states have the lowest typical home values on average

top_state_gap <- afford_df %>%
group_by(StateName) %>%
slice_max(affordability_gap, n = 1, with_ties = FALSE) %>%
ungroup() %>%
arrange(desc(affordability_gap)) %>%
slice_head(n = 15)

ggplot(top_state_gap,
aes(x = reorder(paste0(RegionName, ", ", StateName), affordability_gap),
y = affordability_gap)) +
geom_col() +
coord_flip() +
geom_hline(yintercept = 0) +
scale_y_continuous(labels = percent) +
labs(
title = "Largest affordability gaps by state using the 30 percent benchmark",
x = "Metro and state",
y = "Affordability gap"
)

Chart 4 Affordability gap relative to 30 percent

The affordability gap is the affordability ratio minus 0.30. A gap of 0.20 means mortgage costs are 20 percentage points above the benchmark, so 50 percent of income instead of 30 percent. A negative gap means the metro is below the benchmark.

state_growth_metros <- afford_df %>%
  filter(is.finite(growth_5y)) %>%
  group_by(StateName) %>%
  slice_max(growth_5y, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(RegionName, StateName)

growth_lines <- zhvi_long %>%
  inner_join(state_growth_metros, by = c("RegionName", "StateName"))

# pick the top 4 most expensive metros based on the first available point in 2020
price_2020 <- growth_lines %>%
  filter(date >= as.Date("2020-01-01")) %>%
  group_by(RegionName, StateName) %>%
  slice_min(date, n = 1, with_ties = FALSE) %>%
  ungroup()

top4_expensive_2020 <- price_2020 %>%
  arrange(desc(zhvi)) %>%
  slice_head(n = 4) %>%
  select(RegionName, StateName)

# label positions for the top 4 at the end of the series, but only considering dates after 2020
label_data <- growth_lines %>%
  inner_join(top4_expensive_2020, by = c("RegionName", "StateName")) %>%
  filter(date >= as.Date("2020-01-01")) %>%
  group_by(RegionName, StateName) %>%
  filter(date == max(date)) %>%
  ungroup() %>%
  mutate(label = paste0(RegionName, ", ", StateName))

ggplot(growth_lines,
       aes(x = date, y = zhvi,
           group = paste0(RegionName, StateName))) +

  # all lines in the background
  geom_line(color = "gray70", alpha = 0.6) +

  # highlight only the top 4, only after 2020
  geom_line(
    data = growth_lines %>%
      inner_join(top4_expensive_2020, by = c("RegionName", "StateName")) %>%
      filter(date >= as.Date("2020-01-01")),
    color = "red",
    linewidth = 1.1
  ) +

  # label only the top 4
  ggrepel::geom_text_repel(
    data = label_data,
    aes(label = label),
    nudge_x = 200,
    direction = "y",
    hjust = 0,
    segment.color = "gray50",
    size = 3
  ) +

  scale_y_continuous(labels = dollar) +
  labs(
    title = "Home value growth over time for fastest growing metros",
    subtitle = "Top four most expensive metros highlighted after 2020",
    x = "Year",
    y = "Typical home value"
  ) +
  theme(plot.margin = margin(5.5, 80, 5.5, 5.5))

Chart 5 Home value growth over time for the fastest growing metro in each state

This chart keeps the growth idea but makes it readable. For each state, it selects one metro, the metro with the highest five year growth, and plots its Zillow home value series over time. Steeper lines indicate faster appreciation.

ggplot(afford_df, aes(x = affordability_ratio)) +
geom_histogram(bins = 35) +
scale_x_continuous(labels = percent) +
geom_vline(xintercept = 0.30) +
labs(
title = "Distribution of affordability ratios across metros",
x = "Affordability ratio",
y = "Number of metros"
)

Chart 6 Distribution of affordability ratios across metros

This chart shows how affordability ratios are distributed across all metros. The vertical reference line marks the 30 percent affordability benchmark. Bars to the right of the line indicate metros that exceed the benchmark.

test_result <- t.test(afford_df$affordability_ratio, mu = 0.30, alternative = "greater")
test_result

## 
##  One Sample t-test
## 
## data:  afford_df$affordability_ratio
## t = 2.0647, df = 389, p-value = 0.01981
## alternative hypothesis: true mean is greater than 0.3
## 95 percent confidence interval:
##  0.3034304       Inf
## sample estimates:
## mean of x 
## 0.3170296

Hypothesis test

This test checks whether average affordability is above the 30 percent benchmark. The test produced a t statistic of 2.06 with 389 degrees of freedom and a p value of 0.0198. Because the p value is less than 0.05, the null hypothesis is rejected. This provides statistical evidence that the average share of income required to cover mortgage payments across metros exceeds the 30 percent benchmark. In practical terms, this result suggests that housing across many US metro areas is cost burdened under standard affordability definitions, even when using a national income benchmark rather than metro specific wages.

Housing Affordability analysis

Johan Lopez

2025-12-14