R Assignment 8 - affordable housing

When Ms. Coats moved into the Baxter Street house, a family needed right around the area’s median income to afford the $82 monthly mortgage payment — the definition of middle class. Today a typical Clairemont home costs $850,000, up 30 percent from 2019. A family would need to make about double San Diego’s median income to afford one.

This is a back-of-the-envelope calculation to determine which cities the “average” household could afford to buy the “average” home, inspired by this article in the New York Times.

We will use:

For additional context we’ll download race/ethnicity and tenure variables to calculate:

library(tidycensus)
library(tidyverse)
library(scales)

# list variables in acs
acs20<- load_variables(2020, "acs5", cache = T)

# list variables in decennial
dec20<- load_variables(2020, "pl", cache = T)

Population data

Download the population data from Decennial census since it is available and more accurate. We’ll acquire the other variables from the 5-year American Community Survey for 2016 - 12020.

# get decennial populationand race data
raw_city_census_data_2020 <- get_decennial(geography = "place", 
                                       variables = c(pop20 = "P1_001N",
                                                     white_alone_not_hisp = "P2_005N"), 
                                       year = 2020,
                                       output = "wide")
GEOID NAME pop20 white_alone_not_hisp
4024200 Erick city, Oklahoma 1000 834
4024300 Erin Springs town, Oklahoma 89 69
4024460 Etowah town, Oklahoma 159 106
4024500 Etta CDP, Oklahoma 80 22
4024650 Eufaula city, Oklahoma 2766 1790

Household income, value, and tenure data

Download the other variables from the 5-year American Community Survey for 2016 - 12020.

# get acs 5-year mhi, mpv and tenure data
raw_city_acs_data_2020 <- get_acs(
  geography = "place",
  variables = c(mhi = "B19013_001",
                mpv = "B25077_001",
                housing_units = "B25003_001",
                own_occupied = "B25003_002",
                rent_occupied = "B25003_003",
                labor_force = "B23025_002",
                unemployed = "B23025_005"),
  year = 2020,
  output = "wide")
GEOID NAME mhiE mhiM mpvE mpvM housing_unitsE housing_unitsM own_occupiedE own_occupiedM rent_occupiedE rent_occupiedM labor_forceE labor_forceM unemployedE unemployedM
4024200 Erick city, Oklahoma 27045 13788 57500 18062 328 107 206 85 122 71 229 125 0 10
4024300 Erin Springs town, Oklahoma 59250 27989 71700 21208 49 13 44 13 5 8 46 24 0 10
4024460 Etowah town, Oklahoma 56563 27909 101300 22274 43 13 37 13 6 6 47 21 1 3
4024500 Etta CDP, Oklahoma 37727 8045 NA NA 40 22 29 22 11 15 29 25 0 10
4024650 Eufaula city, Oklahoma 46068 9359 157400 6134 1245 120 769 158 476 117 853 158 28 30

Margin of error

We’ll calculate a margin-of-error ratio to assess which cities have a very high margin of error and should be removed form the analyses.

# explore margin of error and determine which to remove
city_acs_moe <- raw_city_acs_data_2020 |> 
  mutate(mhi_moe_ratio = mhiM/mhiE,
         mpv_moe_ratio = mpvM/mpvE,
         housing_units_moe_ratio = housing_unitsM/housing_unitsE,
         own_occupied_moe_ratio = own_occupiedM/own_occupiedE,
         rent_occupied_moe_ratio = rent_occupiedM/rent_occupiedE) 

MOE visualization

We’ll look at the histograms for each variable to determine the quality of the data. Notice we don’t have to do this for the population data because it is from the decennial census and is a count, not an estimate like the American Community Survey.

#### histograms to understand the margin of error ####
ggplot(data = city_acs_moe, aes(x=mhi_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="lightblue") +
  labs(title = "MHI Margin of Error, as ratio of MHI Estimate",
       x = "MHI Margin of Error Ratio")
## Warning: Removed 3446 rows containing non-finite values (`stat_bin()`).

ggplot(data = city_acs_moe, aes(x=mpv_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="wheat2") +
  labs(title = "MPV Margin of Error, as ratio of MPV Estimate",
       x = "MPV Margin of Error Ratio")
## Warning: Removed 3554 rows containing non-finite values (`stat_bin()`).

ggplot(data = city_acs_moe, aes(x=own_occupied_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="lightsteelblue") +
  labs(title = "Owner Occupied Margin of Error, as ratio of estimate",
       x = "Owner Occupied Margin of Error Ratio")
## Warning: Removed 739 rows containing non-finite values (`stat_bin()`).

ggplot(data = city_acs_moe, aes(x=rent_occupied_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="seagreen2") +
  labs(title = "Renter Occupied Margin of Error, as ratio of estimate",
       x = "Renter Units Margin of Error Ratio")
## Warning: Removed 2825 rows containing non-finite values (`stat_bin()`).

Margin of error assessment

The margin of error is really high in some of the cities. We will definitely want to remove some areas with small population. We’ll start by removing cities with fewer than 50,000 people. That will help with the margin-of-error, and this back-of-the-envelope analysis will be more meaningful for cities and not small towns.

Combine and process the data

We’ll join the population data and the housing data, remove cities with fewer than 50,000 people and calculate our variables.

#### combine the data, filter small cities, assess impact of removing high moe
city_data <- raw_city_census_data_2020 |> 
  filter(pop20 > 50000) |> 
  left_join(city_acs_moe, by = c("GEOID", "NAME")) |> 
  mutate(affordable = 2.5* mhiE,
         afford_diff = affordable - mpvE,
         pct_owner_occupied = own_occupiedE/housing_unitsE,
         pct_renter = rent_occupiedE/housing_unitsE,
         pct_bipoc = 1 - (white_alone_not_hisp/pop20),
         unemployment_rate = unemployedE/labor_forceE)

Margin of error assessment

Look at the margin of error now. Do we need to filter more?

#### histograms to understand the margin of error after removing low pop ####
ggplot(data = city_data, aes(x=mhi_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="lightblue") +
  labs(title = "MHI Margin of Error, as ratio of MHI Estimate",
       subtitle = "Cities with 50,000 or more",
       x = "MHI Margin of Error Ratio")
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

ggplot(data = city_data, aes(x=mpv_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="wheat2") +
  labs(title = "MPV Margin of Error, as ratio of MPV Estimate",
       subtitle = "Cities with 50,000 or more",
       x = "MPV Margin of Error Ratio")
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

ggplot(data = city_data, aes(x=own_occupied_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="lightsteelblue") +
  labs(title = "Owner Occupied Margin of Error, as ratio of estimate",
       subtitle = "Cities with 50,000 or more",
       x = "Owner Occupied Margin of Error Ratio")
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

ggplot(data = city_data, aes(x=rent_occupied_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="seagreen2") +
  labs(title = "Renter Occupied Margin of Error, as ratio of estimate",
       subtitle = "Cities with 50,000 or more",
       x = "Renter Units Margin of Error Ratio")
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

I think, I want to remove all cities where the margin of error is more than 10% of the estimate

Create final analysis dataset

My analysis dataframe will be called cities. We’ll use two ways to create a column to define whether each city is affordable:

And I’ll create separate city and state columns.

Finally, I’ll create a removed_cities dataframe to look at who has been removed.

#### city data without high margin of error data
cities <- city_data |> 
  separate(NAME, into = c("city", "state"), sep = ",") |> 
  filter(mhi_moe_ratio < .1 &
         mpv_moe_ratio < .1 &
         housing_units_moe_ratio < .1 &
         own_occupied_moe_ratio < .1 &
         rent_occupied_moe_ratio < .1) |> 
  filter(state != " Puerto Rico") |> 
  mutate(affordable_yn = ifelse(afford_diff <= 0, "no", "yes"),
         affordable_cat = case_when(afford_diff <= -10000 ~ "not affordable", 
                                afford_diff > -10000 & afford_diff <= 0 ~ "almost affordable",
                                afford_diff > 0 ~ "affordable",
                                TRUE ~ "None of the above")) 

#### high MOE cities
removed_cities <- city_data |> 
  filter(mhi_moe_ratio >= .1 |
           mpv_moe_ratio >= .1 |
           housing_units_moe_ratio >= .1 |
           own_occupied_moe_ratio >= .1 |
           rent_occupied_moe_ratio >= .1)
GEOID city state pop20 white_alone_not_hisp mhiE mhiM mpvE mpvM housing_unitsE housing_unitsM own_occupiedE own_occupiedM rent_occupiedE rent_occupiedM labor_forceE labor_forceM unemployedE unemployedM mhi_moe_ratio mpv_moe_ratio housing_units_moe_ratio own_occupied_moe_ratio rent_occupied_moe_ratio affordable afford_diff pct_owner_occupied pct_renter pct_bipoc unemployment_rate affordable_yn affordable_cat
0651182 Newport Beach city California 85239 64352 133849 8839 1976400 104275 38596 827 20953 887 17643 879 46439 1181 1349 348 0.0660371 0.0527601 0.0214271 0.0423328 0.0498215 334622.5 -1641778 0.5428801 0.4571199 0.2450404 0.0290489 no not affordable
0617610 Cupertino city California 60381 13085 182857 9918 1866200 59692 20506 391 12468 524 8038 458 28137 730 710 190 0.0542391 0.0319859 0.0190676 0.0420276 0.0569793 457142.5 -1409058 0.6080172 0.3919828 0.7832928 0.0252337 no not affordable
0670000 Santa Monica city California 93076 60654 98300 5511 1452100 108646 45706 993 12761 586 32945 1052 54423 1207 3445 491 0.0560631 0.0748199 0.0217258 0.0459212 0.0319320 245750.0 -1206350 0.2791975 0.7208025 0.3483390 0.0633004 no not affordable
0649670 Mountain View city California 82376 33008 144116 7212 1560600 46859 33029 761 13661 666 19368 715 47903 1036 1616 330 0.0500430 0.0300263 0.0230404 0.0487519 0.0369166 360290.0 -1200310 0.4136062 0.5863938 0.5993008 0.0337348 no not affordable
0660102 Redwood City city California 84292 34067 123294 5083 1424200 58720 30175 816 14328 682 15847 772 48589 1079 1561 413 0.0412267 0.0412302 0.0270423 0.0475991 0.0487158 308235.0 -1115965 0.4748302 0.5251698 0.5958454 0.0321266 no not affordable

Summary stats

How many cities are “affordable”? And what are the characteristics of afforadble and not affordable cities?

I’ll use two new functions:

affordability_stats <- cities |> 
  group_by(affordable_cat) |> 
  summarise(`Number of cities` = n(),
            `Total Population` = comma(sum(pop20)),
            `Average difference between afforable and MPV` = dollar(mean(afford_diff)),
            `Average MHI` = dollar(mean(mhiE)),
            `Average MPV` = dollar(mean(mpvE)),
            `Average Percent BIPOC` = percent(mean(pct_bipoc)),
            `Average Percent Renter-Occupied` = percent(mean(pct_renter)),
            `Average Percent Renter-Occupied` = percent(mean(pct_owner_occupied)),
            `Unemployment Rate` = percent(sum(unemployedE)/sum(labor_forceE)),
            States = toString(unique(state)))
affordable_cat Number of cities Total Population Average difference between afforable and MPV Average MHI Average MPV Average Percent BIPOC Average Percent Renter-Occupied Unemployment Rate States
affordable 74 7,805,454 $12,615.95 $48,677.73 $109,078 45% 57% 8% Texas, Oklahoma, Ohio, Pennsylvania, Missouri, New York, Alabama, Illinois, Michigan, Mississippi, Indiana, Georgia, Iowa, Nebraska, Kansas
almost affordable 43 6,024,764 -$4,245.87 $56,715.60 $146,035 56% 58% 6% Wisconsin, Texas, Pennsylvania, Tennessee, Illinois, Missouri, Mississippi, Indiana, Michigan, Nebraska, Georgia, Iowa, New York, Kansas, Alabama, New Jersey
not affordable 657 115,471,286 -$186,942 $71,942.58 $366,798 52% 56% 6% Oregon, Virginia, Texas, Rhode Island, Utah, Pennsylvania, South Dakota, Oklahoma, Washington, Wisconsin, Wyoming, South Carolina, Tennessee, Maryland, North Dakota, Florida, Nebraska, Louisiana, Alabama, California, Connecticut, District of Columbia, Alaska, Delaware, Montana, Illinois, Mississippi, New Jersey, Georgia, Michigan, Missouri, Indiana, Arkansas, Massachusetts, Nevada, Ohio, Iowa, New Mexico, Colorado, Hawaii, Arizona, Idaho, New Hampshire, New York, Kansas, North Carolina, Kentucky, Maine, Minnesota

Visualizations

# ggplot(data = city_data, aes(x=pct_renter))+
#   geom_histogram(bins = 10, color="darkblue", fill="seagreen2") +
#   labs(title = "Percent Renter Occupied Cities",
#        subtitle = "Cities with 50,000 or more",
#        x = "Percent of Renter")
ggplot(data = cities,
         aes(x = mhiE, y = mpvE, 
             size = pop20, color = affordable_cat)) +
  geom_point(alpha = .85) +
  scale_y_continuous(labels = dollar_format(accuracy = 1)) +
  scale_x_continuous(labels = dollar_format(accuracy = 1)) +
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Estimated Median Household Income", y = "Estimated Median Property Value",
       title = "Cities and Housing Affordability",
       caption = "Sources: US Census, 2020; American Community Survey",
       # add nice label for size element
       size = "Population",
       color = "Affordability") +
  theme_bw()

ggplot(data = cities,
         aes(x = pct_renter, y = mpvE, 
             size = pop20, color = affordable_cat)) +
  geom_point(alpha = .85) +
  scale_y_continuous(labels = dollar_format(accuracy = 1)) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Estimated Percent Renter Occupied", y = "Estimated Median Property Value",
       title = "Cities and Housing Affordability",
       caption = "Sources: US Census, 2020; American Community Survey",
       # add nice label for size element
       size = "Population",
       color = "Affordability") +
  theme_bw()

ggplot(data = cities,
         aes(x = unemployment_rate, y = mhiE, 
             size = pop20, color = affordable_cat)) +
  geom_point(alpha = .85) +
  scale_y_continuous(labels = dollar_format(accuracy = 1)) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Estimated Unemployment Rate", y = "Estimated Median Household Income",
       title = "Cities and Housing Affordability",
       caption = "Sources: US Census, 2020; American Community Survey",
       # add nice label for size element
       size = "Population",
       color = "Affordability") +
  theme_bw()

ggplot(data = cities |> 
         filter(state == " Michigan"),
         aes(x = unemployment_rate, y = mhiE, 
             size = pop20, color = affordable_cat)) +
  geom_point(alpha = .85) +
  scale_y_continuous(labels = dollar_format(accuracy = 1)) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Estimated Unemployment Rate", y = "Estimated Median Household Income",
       title = "Michigan Cities and Housing Affordability",
       caption = "Sources: US Census, 2020; American Community Survey",
       # add nice label for size element
       size = "Population",
       color = "Affordability") +
  theme_bw()