PUMS Housing Affordability Analysis

TidyCensus reference:

https://walker-data.com/census-r/index.html

See, particularly, chapters 9 and 10.

Measuring “rent burden”

The GRPIP variable must be treated carefully. In the raw data, it is a text variable with categories from 0 to 100 representing the percentage of household income spent on rent, and a 101 category representing “topcoded” households for which a meaningful GRPIP value cannot be computed. Treating it numerically - as if “101” means a household is spending 101 percent of its income on rent - wouldn’t make sense. One also could question whether the two “100” households are truly spending 100 percent of their household income on rent, or how anyone could be getting by while spending north of 70, 80 or 90 percent of their income on rent. About the only workable scenario I can think of involves a renter with an income so high that even when 70-plus percent of it goes toward rent, the remainder is enough to cover other household needs, like food, transportation, clothing, medical care, etc.

Here is what the distribution of raw GRPIP values looks like. Note that the values are sorted alphabetically, not numerically, which is why the “101” category shows up between the 100 and 11 categories.

Given this ambiguity about what high GRPIP values actually mean, it’s probably wise to use GRPIP to group renters into four broad categories:

The “Not burdened,” who spend less than 30 percent of their household income on rent.
The “Cost burdened,” who spend 30-49% of their household income on rent.
The “Severely burdened,” who spend 50% or more of their household income on rent.
The “Extreme / top-coded,” who for a range of reasons that may not be similar or valid, show up as spending 100% or more on rent.

Below is the distribution of these four categories, with weights applied so that the category totals estimated totals for all renter households in Davidson County. Step 14 in the code shows how the categories were created.

Exploring rent burden across demographics

Continuous and catgegorical variables. The PUMS data contain a range of demograpic variables. Some, like “Age,” are numeric in nature, or “continuous.” They measure countable quantities, like the number of years that have passed since a person was born. Others, like “Race,” are “categorical.” They sensibly place people into categories based on a some trait that is either fully present or fully absent, not present in amounts ranging from none to however much is possible. In an extremely important sense, nearly nobody is wholly “white” or wholly “black” or wholly “Asian.” Nearly everyone has ancestors from two or more racial groups, and “race” itself is is largely a socially constructed concept. But people do tend to describe themselves as belonging to distinct racial categories. For that reason, “Race” is typically treated as a catgegorical variable, often with at least one category for people who consider themselves to be a blend of two or more races. It is also important to note that “Race” and “ethnicity” are not the same thing. Ethnicity has more to do with cultural traits, like one’s language, preferred food, musical tastes, and (sometimes) religion. Often, people of different races can belong to the same ethnic group, and vice versa.

Typically, continuous variables are summarized with average or a median, usually with some indication of the variable’s variability or distribution shape. Categorical variables, meanwhile, are usually summarized using group percentages. However, it might be wise for us to convert all continuous variables into categorical ones, then use percentages as summaries. The reason: Averages - and, to a lesser degree, medians - imply that midrange values generally reflect all of the varible’s values. With demographics like age and income, this isn’t always the case.

Households and individuals. Another thing to consider: “Rent burden” is a household-level variable, in that all members of a household are affected by rent burden in more or less the same way. But demographics like age and race are individual-level variables, in that members of the same household can be different ages or belong to different racial groups. One way to address the mismatch is to focus on the demographic characteristics of a each household’s “householder.” The Census defines the “householder” as the person whose name is on the household’s deed or lease. If two or more people fit that criteria, the Census goes with whoever the members of the household designate as the “householder” or, if no such designation was made, infers the identity of the householder from answers to other questions in the ACS survey. The approach works fine, as long as we don’t try to generalize from the householder to all members of the household.

Here are graphics of rent burden broken down by the householder’s age and race categories:

Next steps

Explore the geographic distribution of these groups.
Explore regional and/or national comparisons.
Determine the best way to handle error margins.

Script:

# =========================================================
# 1. Load required libraries
# =========================================================

if (!require("tidycensus")) install.packages("tidycensus")
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tigris")) install.packages("tigris")
if (!require("sf")) install.packages("sf")
if (!require("leaflet")) install.packages("leaflet")
if (!require("kableExtra")) install.packages("kableExtra")
if (!require("plotly")) install.packages("plotly")

library(tidycensus)
library(tidyverse)
library(tigris)
library(sf)
library(leaflet)
library(kableExtra)
library(plotly)


# =========================================================
# 2. (Optional) Set Census API key
# =========================================================

# census_api_key("YOUR_API_KEY_HERE", install = TRUE)
# readRenviron("~/.Renviron")


# =========================================================
# 3. Load Tennessee PUMAs (2020)
# =========================================================

options(tigris_use_cache = TRUE)

TN_PUMAs <- pumas(state = "TN", cb = TRUE, year = 2020)


# =========================================================
# 4. Define Nashville PUMAs
# =========================================================

PUMA_list <- c("02401","02402","02403","02404","02405","02406")

NASH_PUMAs <- TN_PUMAs %>%
  filter(PUMACE20 %in% PUMA_list)


# =========================================================
# 5. Prepare geometry for Leaflet
# =========================================================

NASH_PUMAs <- st_transform(NASH_PUMAs, 4326)


# =========================================================
# 6. Create Leaflet map
# =========================================================

PUMA_map <- leaflet(data = NASH_PUMAs) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  setView(lng = -86.78, lat = 36.17, zoom = 10) %>%
  addPolygons(
    fillColor = "steelblue",
    fillOpacity = 0.5,
    color = "black",
    weight = 1,
    highlightOptions = highlightOptions(
      weight = 2,
      color = "white",
      bringToFront = TRUE
    ),
    popup = ~paste0(
      "<strong>PUMA GEOID:</strong> ", GEOID20, "<br>",
      "<strong>PUMA code:</strong> ", PUMACE20, "<br>",
      "<strong>Name:</strong> ", NAMELSAD20
    )
  )

PUMA_map


# =========================================================
# 7. Define affordability variables (UPDATED)
# =========================================================

vars <- c(
  "TEN","GRNTP","RNTP","GRPIP","HINCP",
  "NP","NOC","AGEP","HHLDRAGEP",   # ✅ ADDED
  "RAC1P","HISP","HHLDRRAC1P",
  "SCHL","ESR",
  "ELEP","GASP","WATP","FULP",
  "RMSP","BDSP",
  "PUMA","WGTP"
)


# =========================================================
# 8. Codebooks
# =========================================================

data(pums_variables)

pums_codebook_full <- pums_variables %>%
  filter(survey == "acs5", year == 2023)

pums_codebook_affordability <- pums_codebook_full %>%
  filter(var_code %in% vars)


# =========================================================
# 9. Pull renter data
# =========================================================

TN_Renters <- get_pums(
  variables = vars,
  state = "TN",
  survey = "acs1",
  variables_filter = list(TEN = 3),
  year = 2024
)


# =========================================================
# 10. Filter Nashville renters
# =========================================================

NASH_Renters <- TN_Renters %>%
  filter(PUMA %in% PUMA_list)


# =========================================================
# 11. Raw GRPIP distribution (Plotly)
# =========================================================

grpip_counts <- NASH_Renters %>%
  count(GRPIP)

GRPIP_raw_plot <- plot_ly(
  grpip_counts,
  x = ~GRPIP,
  y = ~n,
  type = "bar",
  marker = list(color = "steelblue"),
  text = ~n,
  hovertemplate = "GRPIP: %{x}<br>Count: %{y}<extra></extra>"
) %>%
  layout(
    title = "Raw Distribution of GRPIP (Sample Counts)",
    xaxis = list(title = "GRPIP (Raw Values)"),
    yaxis = list(title = "Count")
  )

GRPIP_raw_plot


# =========================================================
# 12. Convert to numeric
# =========================================================

NASH_Renters <- NASH_Renters %>%
  mutate(across(
    c(GRNTP, RNTP, GRPIP, HINCP,
      NP, NOC, AGEP, SCHL,
      ELEP, GASP, WATP, FULP,
      RMSP, BDSP, WGTP),
    as.numeric
  ))


# =========================================================
# 13. NA Summary Table
# =========================================================

na_summary <- NASH_Renters %>%
  summarize(across(
    c(GRNTP, GRPIP, HINCP, SCHL),
    ~ mean(is.na(.))
  )) %>%
  pivot_longer(
    everything(),
    names_to = "Variable",
    values_to = "Pct_Missing"
  )

na_summary %>%
  kable(digits = 3, caption = "NA Proportion by Variable") %>%
  kable_styling(full_width = FALSE)


# =========================================================
# 14. Affordability measures + burden groups
# =========================================================

NASH_Renters <- NASH_Renters %>%
  mutate(
    annual_rent = GRNTP * 12,
    residual_income = HINCP - annual_rent,
    college = SCHL >= 21,
    
    burden_group = case_when(
      GRPIP < 30 ~ "1 Not burdened",
      GRPIP < 50 ~ "2 Cost burdened (30–49%)",
      GRPIP < 100 ~ "3 Severely burdened (50–99%)",
      GRPIP >= 100 ~ "4 Extreme / top-coded (100%+)",
      TRUE ~ NA_character_
    )
  )


# =========================================================
# 15. Housing stress profile table
# =========================================================

profile_table <- NASH_Renters %>%
  group_by(burden_group) %>%
  summarize(
    households = sum(WGTP, na.rm = TRUE),
    avg_income = weighted.mean(HINCP, WGTP, na.rm = TRUE),
    avg_rent = weighted.mean(GRNTP, WGTP, na.rm = TRUE),
    pct_college = weighted.mean(college, WGTP, na.rm = TRUE)
  )

profile_table %>%
  kable(digits = 1, caption = "Housing Stress Profile by Burden Group") %>%
  kable_styling(full_width = FALSE)

profile_table


# =========================================================
# 16. Burden distribution (Plotly, weighted + labeled)
# =========================================================

burden_counts <- NASH_Renters %>%
  group_by(burden_group) %>%
  summarize(weighted_n = sum(WGTP, na.rm = TRUE))

burden_group_graph <- plot_ly(
  burden_counts,
  x = ~burden_group,
  y = ~weighted_n,
  type = "bar",
  marker = list(color = "steelblue"),
  text = ~round(weighted_n),
  textposition = "outside",
  hovertemplate = "%{x}<br>Households: %{y}<extra></extra>"
) %>%
  layout(
    title = "Distribution of Housing Burden (Weighted)",
    xaxis = list(title = "Burden Group"),
    yaxis = list(title = "Estimated Households")
  )

burden_group_graph


# =========================================================
# 17. Summary statistics table
# =========================================================

summary_stats <- NASH_Renters %>%
  mutate(
    cost_burdened = !str_detect(burden_group, "Not burdened"),
    severe_burden = str_detect(burden_group, "Severely")
  ) %>%
  summarize(
    pct_cost_burdened =
      weighted.mean(cost_burdened, WGTP, na.rm = TRUE),
    pct_severe_burden =
      weighted.mean(severe_burden, WGTP, na.rm = TRUE),
    avg_rent =
      weighted.mean(GRNTP, WGTP, na.rm = TRUE),
    avg_income =
      weighted.mean(HINCP, WGTP, na.rm = TRUE)
  )

summary_stats_table <- summary_stats %>%
  kable(digits = 3, caption = "Summary Statistics (Weighted)") %>%
  kable_styling(full_width = FALSE)

summary_stats_table


# =========================================================
# 18. Save dataset
# =========================================================

saveRDS(NASH_Renters, "nashville_renters_acs1_2024.rds")

# =========================================================
# 19. Age-by-burden table and interactive visualization
# =========================================================

# ---- Create age groups ----
NASH_Renters <- NASH_Renters %>%
  mutate(
    age_group = case_when(
      HHLDRAGEP < 30 ~ "Under 30",
      HHLDRAGEP < 45 ~ "30–44",
      HHLDRAGEP < 65 ~ "45–64",
      TRUE ~ "65+"
    )
  )


# ---- Create weighted age-by-burden table ----
age_burden_table <- NASH_Renters %>%
  filter(!is.na(age_group), !is.na(burden_group)) %>%
  group_by(age_group, burden_group) %>%
  summarize(
    households = sum(WGTP, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  group_by(age_group) %>%
  mutate(
    pct = households / sum(households)
  ) %>%
  ungroup()


# ---- Nicely formatted table ----
age_burden_table %>%
  mutate(
    households = round(households),
    pct = scales::percent(pct, accuracy = 0.1)
  ) %>%
  arrange(age_group, burden_group) %>%
  kable(
    caption = "Housing Burden by Age of Householder",
    col.names = c("Age Group", "Burden Group", "Households", "Percent")
  ) %>%
  kable_styling(full_width = FALSE)


# ---- Create label text (SUPPRESS Extreme group) ----
age_burden_table <- age_burden_table %>%
  mutate(
    label_text = if_else(
      str_detect(burden_group, "^4"),
      "",  # ✅ suppress label for Extreme group
      paste0(
        burden_group,
        "<br>",
        scales::comma(round(households)), " households",
        "<br>",
        scales::percent(pct, accuracy = 0.1)
      )
    )
  )


# ---- Plotly stacked bar chart ----
age_burden_plot <- plot_ly(
  age_burden_table,
  x = ~age_group,
  y = ~households,
  color = ~burden_group,
  type = "bar",
  text = ~label_text,
  hovertemplate = ~paste0(
    burden_group,
    "<br>Households: ", scales::comma(round(households)),
    "<br>Percent: ", scales::percent(pct, accuracy = 0.1),
    "<extra></extra>"
  )
) %>%
  layout(
    barmode = "stack",
    title = "Housing Burden by Age of Householder",
    xaxis = list(title = "Age Group"),
    yaxis = list(title = "Estimated Number of Households")
  )

age_burden_plot

# =========================================================
# 20. Race-by-burden table and interactive visualization
# =========================================================

# ---- Create race categories ----
NASH_Renters <- NASH_Renters %>%
  mutate(
    race_group = case_when(
      HHLDRRAC1P == 1 ~ "White alone",
      HHLDRRAC1P == 2 ~ "Black or African American alone",
      is.na(HHLDRRAC1P) ~ "N/A",
      TRUE ~ "Other race"
    )
  )


# ---- Create weighted race-by-burden table ----
race_burden_table <- NASH_Renters %>%
  filter(!is.na(race_group), !is.na(burden_group)) %>%
  group_by(race_group, burden_group) %>%
  summarize(
    households = sum(WGTP, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  group_by(race_group) %>%
  mutate(
    pct = households / sum(households)
  ) %>%
  ungroup()


# ---- Nicely formatted table ----
race_burden_table %>%
  mutate(
    households = round(households),
    pct = scales::percent(pct, accuracy = 0.1)
  ) %>%
  arrange(race_group, burden_group) %>%
  kable(
    caption = "Housing Burden by Race of Householder",
    col.names = c("Race Group", "Burden Group", "Households", "Percent")
  ) %>%
  kable_styling(full_width = FALSE)


# ---- Create label text (suppress Extreme group labels) ----
race_burden_table <- race_burden_table %>%
  mutate(
    label_text = if_else(
      str_detect(burden_group, "^4"),
      "",
      paste0(
        burden_group,
        "<br>",
        scales::comma(round(households)), " households",
        "<br>",
        scales::percent(pct, accuracy = 0.1)
      )
    )
  )


# ---- Plotly stacked bar chart ----
race_burden_plot <- plot_ly(
  race_burden_table,
  x = ~race_group,
  y = ~households,
  color = ~burden_group,
  type = "bar",
  text = ~label_text,
  hovertemplate = ~paste0(
    burden_group,
    "<br>Households: ", scales::comma(round(households)),
    "<br>Percent: ", scales::percent(pct, accuracy = 0.1),
    "<extra></extra>"
  )
) %>%
  layout(
    barmode = "stack",
    title = "Housing Burden by Race of Householder",
    xaxis = list(title = "Race of Householder"),
    yaxis = list(title = "Estimated Number of Households")
  )

race_burden_plot