When Ms. Coats moved into the Baxter Street house, a family needed right around the area’s median income to afford the $82 monthly mortgage payment — the definition of middle class. Today a typical Clairemont home costs $850,000, up 30 percent from 2019. A family would need to make about double San Diego’s median income to afford one.
This is a back-of-the-envelope calculation to determine which cities the “average” household could afford to buy the “average” home, inspired by this article in the New York Times.
We will use:
For additional context we’ll download race/ethnicity and tenure variables to calculate:
library(tidycensus)
library(tidyverse)
library(scales)
# list variables in acs
acs20<- load_variables(2020, "acs5", cache = T)
# list variables in decennial
dec20<- load_variables(2020, "pl", cache = T)
Download the population data from Decennial census since it is available and more accurate. We’ll acquire the other variables from the 5-year American Community Survey for 2016 - 12020.
# get decennial populationand race data
raw_city_census_data_2020 <- get_decennial(geography = "place",
variables = c(pop20 = "P1_001N",
white_alone_not_hisp = "P2_005N"),
year = 2020,
output = "wide")
GEOID | NAME | pop20 | white_alone_not_hisp |
---|---|---|---|
4024200 | Erick city, Oklahoma | 1000 | 834 |
4024300 | Erin Springs town, Oklahoma | 89 | 69 |
4024460 | Etowah town, Oklahoma | 159 | 106 |
4024500 | Etta CDP, Oklahoma | 80 | 22 |
4024650 | Eufaula city, Oklahoma | 2766 | 1790 |
Download the other variables from the 5-year American Community Survey for 2016 - 12020.
# get acs 5-year mhi, mpv and tenure data
raw_city_acs_data_2020 <- get_acs(
geography = "place",
variables = c(mhi = "B19013_001",
mpv = "B25077_001",
housing_units = "B25003_001",
own_occupied = "B25003_002",
rent_occupied = "B25003_003",
labor_force = "B23025_002",
unemployed = "B23025_005"),
year = 2020,
output = "wide")
GEOID | NAME | mhiE | mhiM | mpvE | mpvM | housing_unitsE | housing_unitsM | own_occupiedE | own_occupiedM | rent_occupiedE | rent_occupiedM | labor_forceE | labor_forceM | unemployedE | unemployedM |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4024200 | Erick city, Oklahoma | 27045 | 13788 | 57500 | 18062 | 328 | 107 | 206 | 85 | 122 | 71 | 229 | 125 | 0 | 10 |
4024300 | Erin Springs town, Oklahoma | 59250 | 27989 | 71700 | 21208 | 49 | 13 | 44 | 13 | 5 | 8 | 46 | 24 | 0 | 10 |
4024460 | Etowah town, Oklahoma | 56563 | 27909 | 101300 | 22274 | 43 | 13 | 37 | 13 | 6 | 6 | 47 | 21 | 1 | 3 |
4024500 | Etta CDP, Oklahoma | 37727 | 8045 | NA | NA | 40 | 22 | 29 | 22 | 11 | 15 | 29 | 25 | 0 | 10 |
4024650 | Eufaula city, Oklahoma | 46068 | 9359 | 157400 | 6134 | 1245 | 120 | 769 | 158 | 476 | 117 | 853 | 158 | 28 | 30 |
We’ll calculate a margin-of-error ratio to assess which cities have a very high margin of error and should be removed form the analyses.
# explore margin of error and determine which to remove
city_acs_moe <- raw_city_acs_data_2020 |>
mutate(mhi_moe_ratio = mhiM/mhiE,
mpv_moe_ratio = mpvM/mpvE,
housing_units_moe_ratio = housing_unitsM/housing_unitsE,
own_occupied_moe_ratio = own_occupiedM/own_occupiedE,
rent_occupied_moe_ratio = rent_occupiedM/rent_occupiedE)
We’ll look at the histograms for each variable to determine the quality of the data. Notice we don’t have to do this for the population data because it is from the decennial census and is a count, not an estimate like the American Community Survey.
#### histograms to understand the margin of error ####
ggplot(data = city_acs_moe, aes(x=mhi_moe_ratio))+
geom_histogram(bins = 10, color="darkblue", fill="lightblue") +
labs(title = "MHI Margin of Error, as ratio of MHI Estimate",
x = "MHI Margin of Error Ratio")
## Warning: Removed 3446 rows containing non-finite values (`stat_bin()`).
ggplot(data = city_acs_moe, aes(x=mpv_moe_ratio))+
geom_histogram(bins = 10, color="darkblue", fill="wheat2") +
labs(title = "MPV Margin of Error, as ratio of MPV Estimate",
x = "MPV Margin of Error Ratio")
## Warning: Removed 3554 rows containing non-finite values (`stat_bin()`).
ggplot(data = city_acs_moe, aes(x=own_occupied_moe_ratio))+
geom_histogram(bins = 10, color="darkblue", fill="lightsteelblue") +
labs(title = "Owner Occupied Margin of Error, as ratio of estimate",
x = "Owner Occupied Margin of Error Ratio")
## Warning: Removed 739 rows containing non-finite values (`stat_bin()`).
ggplot(data = city_acs_moe, aes(x=rent_occupied_moe_ratio))+
geom_histogram(bins = 10, color="darkblue", fill="seagreen2") +
labs(title = "Renter Occupied Margin of Error, as ratio of estimate",
x = "Renter Units Margin of Error Ratio")
## Warning: Removed 2825 rows containing non-finite values (`stat_bin()`).
The margin of error is really high in some of the cities. We will definitely want to remove some areas with small population. We’ll start by removing cities with fewer than 50,000 people. That will help with the margin-of-error, and this back-of-the-envelope analysis will be more meaningful for cities and not small towns.
We’ll join the population data and the housing data, remove cities with fewer than 50,000 people and calculate our variables.
#### combine the data, filter small cities, assess impact of removing high moe
city_data <- raw_city_census_data_2020 |>
filter(pop20 > 50000) |>
left_join(city_acs_moe, by = c("GEOID", "NAME")) |>
mutate(affordable = 2.5* mhiE,
afford_diff = affordable - mpvE,
pct_owner_occupied = own_occupiedE/housing_unitsE,
pct_renter = rent_occupiedE/housing_unitsE,
pct_bipoc = 1 - (white_alone_not_hisp/pop20),
unemployment_rate = unemployedE/labor_forceE)
Look at the margin of error now. Do we need to filter more?
#### histograms to understand the margin of error after removing low pop ####
ggplot(data = city_data, aes(x=mhi_moe_ratio))+
geom_histogram(bins = 10, color="darkblue", fill="lightblue") +
labs(title = "MHI Margin of Error, as ratio of MHI Estimate",
subtitle = "Cities with 50,000 or more",
x = "MHI Margin of Error Ratio")
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
ggplot(data = city_data, aes(x=mpv_moe_ratio))+
geom_histogram(bins = 10, color="darkblue", fill="wheat2") +
labs(title = "MPV Margin of Error, as ratio of MPV Estimate",
subtitle = "Cities with 50,000 or more",
x = "MPV Margin of Error Ratio")
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
ggplot(data = city_data, aes(x=own_occupied_moe_ratio))+
geom_histogram(bins = 10, color="darkblue", fill="lightsteelblue") +
labs(title = "Owner Occupied Margin of Error, as ratio of estimate",
subtitle = "Cities with 50,000 or more",
x = "Owner Occupied Margin of Error Ratio")
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
ggplot(data = city_data, aes(x=rent_occupied_moe_ratio))+
geom_histogram(bins = 10, color="darkblue", fill="seagreen2") +
labs(title = "Renter Occupied Margin of Error, as ratio of estimate",
subtitle = "Cities with 50,000 or more",
x = "Renter Units Margin of Error Ratio")
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
I think, I want to remove all cities where the margin of error is more than 10% of the estimate
My analysis dataframe will be called cities
. We’ll use
two ways to create a column to define whether each city is
affordable:
ifelse()
- defines the column one way if it meets a
criteria, and another way if it doesn’t
ifelse(criteria, value if it meets the criteria, value if it doesn't meet the criteria)
case_when()
- defines the column in many ways based on
criteria
case_when(criteria ~ value if it meets it, another criteria ~ value if it meets it, yet another criteria ~ value if it meets it, TRUE ~ value if it doesn't meet any of the above)
And I’ll create separate city and state columns.
Finally, I’ll create a removed_cities
dataframe to look
at who has been removed.
#### city data without high margin of error data
cities <- city_data |>
separate(NAME, into = c("city", "state"), sep = ",") |>
filter(mhi_moe_ratio < .1 &
mpv_moe_ratio < .1 &
housing_units_moe_ratio < .1 &
own_occupied_moe_ratio < .1 &
rent_occupied_moe_ratio < .1) |>
filter(state != " Puerto Rico") |>
mutate(affordable_yn = ifelse(afford_diff <= 0, "no", "yes"),
affordable_cat = case_when(afford_diff <= -10000 ~ "not affordable",
afford_diff > -10000 & afford_diff <= 0 ~ "almost affordable",
afford_diff > 0 ~ "affordable",
TRUE ~ "None of the above"))
#### high MOE cities
removed_cities <- city_data |>
filter(mhi_moe_ratio >= .1 |
mpv_moe_ratio >= .1 |
housing_units_moe_ratio >= .1 |
own_occupied_moe_ratio >= .1 |
rent_occupied_moe_ratio >= .1)
GEOID | city | state | pop20 | white_alone_not_hisp | mhiE | mhiM | mpvE | mpvM | housing_unitsE | housing_unitsM | own_occupiedE | own_occupiedM | rent_occupiedE | rent_occupiedM | labor_forceE | labor_forceM | unemployedE | unemployedM | mhi_moe_ratio | mpv_moe_ratio | housing_units_moe_ratio | own_occupied_moe_ratio | rent_occupied_moe_ratio | affordable | afford_diff | pct_owner_occupied | pct_renter | pct_bipoc | unemployment_rate | affordable_yn | affordable_cat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0651182 | Newport Beach city | California | 85239 | 64352 | 133849 | 8839 | 1976400 | 104275 | 38596 | 827 | 20953 | 887 | 17643 | 879 | 46439 | 1181 | 1349 | 348 | 0.0660371 | 0.0527601 | 0.0214271 | 0.0423328 | 0.0498215 | 334622.5 | -1641778 | 0.5428801 | 0.4571199 | 0.2450404 | 0.0290489 | no | not affordable |
0617610 | Cupertino city | California | 60381 | 13085 | 182857 | 9918 | 1866200 | 59692 | 20506 | 391 | 12468 | 524 | 8038 | 458 | 28137 | 730 | 710 | 190 | 0.0542391 | 0.0319859 | 0.0190676 | 0.0420276 | 0.0569793 | 457142.5 | -1409058 | 0.6080172 | 0.3919828 | 0.7832928 | 0.0252337 | no | not affordable |
0670000 | Santa Monica city | California | 93076 | 60654 | 98300 | 5511 | 1452100 | 108646 | 45706 | 993 | 12761 | 586 | 32945 | 1052 | 54423 | 1207 | 3445 | 491 | 0.0560631 | 0.0748199 | 0.0217258 | 0.0459212 | 0.0319320 | 245750.0 | -1206350 | 0.2791975 | 0.7208025 | 0.3483390 | 0.0633004 | no | not affordable |
0649670 | Mountain View city | California | 82376 | 33008 | 144116 | 7212 | 1560600 | 46859 | 33029 | 761 | 13661 | 666 | 19368 | 715 | 47903 | 1036 | 1616 | 330 | 0.0500430 | 0.0300263 | 0.0230404 | 0.0487519 | 0.0369166 | 360290.0 | -1200310 | 0.4136062 | 0.5863938 | 0.5993008 | 0.0337348 | no | not affordable |
0660102 | Redwood City city | California | 84292 | 34067 | 123294 | 5083 | 1424200 | 58720 | 30175 | 816 | 14328 | 682 | 15847 | 772 | 48589 | 1079 | 1561 | 413 | 0.0412267 | 0.0412302 | 0.0270423 | 0.0475991 | 0.0487158 | 308235.0 | -1115965 | 0.4748302 | 0.5251698 | 0.5958454 | 0.0321266 | no | not affordable |
How many cities are “affordable”? And what are the characteristics of afforadble and not affordable cities?
I’ll use two new functions:
toString()
to list all of the values, separated by
commasunique()
to select only the unique valuesaffordability_stats <- cities |>
group_by(affordable_cat) |>
summarise(`Number of cities` = n(),
`Total Population` = comma(sum(pop20)),
`Average difference between afforable and MPV` = dollar(mean(afford_diff)),
`Average MHI` = dollar(mean(mhiE)),
`Average MPV` = dollar(mean(mpvE)),
`Average Percent BIPOC` = percent(mean(pct_bipoc)),
`Average Percent Renter-Occupied` = percent(mean(pct_renter)),
`Average Percent Renter-Occupied` = percent(mean(pct_owner_occupied)),
`Unemployment Rate` = percent(sum(unemployedE)/sum(labor_forceE)),
States = toString(unique(state)))
affordable_cat | Number of cities | Total Population | Average difference between afforable and MPV | Average MHI | Average MPV | Average Percent BIPOC | Average Percent Renter-Occupied | Unemployment Rate | States |
---|---|---|---|---|---|---|---|---|---|
affordable | 74 | 7,805,454 | $12,615.95 | $48,677.73 | $109,078 | 45% | 57% | 8% | Texas, Oklahoma, Ohio, Pennsylvania, Missouri, New York, Alabama, Illinois, Michigan, Mississippi, Indiana, Georgia, Iowa, Nebraska, Kansas |
almost affordable | 43 | 6,024,764 | -$4,245.87 | $56,715.60 | $146,035 | 56% | 58% | 6% | Wisconsin, Texas, Pennsylvania, Tennessee, Illinois, Missouri, Mississippi, Indiana, Michigan, Nebraska, Georgia, Iowa, New York, Kansas, Alabama, New Jersey |
not affordable | 657 | 115,471,286 | -$186,942 | $71,942.58 | $366,798 | 52% | 56% | 6% | Oregon, Virginia, Texas, Rhode Island, Utah, Pennsylvania, South Dakota, Oklahoma, Washington, Wisconsin, Wyoming, South Carolina, Tennessee, Maryland, North Dakota, Florida, Nebraska, Louisiana, Alabama, California, Connecticut, District of Columbia, Alaska, Delaware, Montana, Illinois, Mississippi, New Jersey, Georgia, Michigan, Missouri, Indiana, Arkansas, Massachusetts, Nevada, Ohio, Iowa, New Mexico, Colorado, Hawaii, Arizona, Idaho, New Hampshire, New York, Kansas, North Carolina, Kentucky, Maine, Minnesota |
# ggplot(data = city_data, aes(x=pct_renter))+
# geom_histogram(bins = 10, color="darkblue", fill="seagreen2") +
# labs(title = "Percent Renter Occupied Cities",
# subtitle = "Cities with 50,000 or more",
# x = "Percent of Renter")
ggplot(data = cities,
aes(x = mhiE, y = mpvE,
size = pop20, color = affordable_cat)) +
geom_point(alpha = .85) +
scale_y_continuous(labels = dollar_format(accuracy = 1)) +
scale_x_continuous(labels = dollar_format(accuracy = 1)) +
# change legend label formatting
scale_size_area(labels = comma, max_size = 10) +
labs(x = "Estimated Median Household Income", y = "Estimated Median Property Value",
title = "Cities and Housing Affordability",
caption = "Sources: US Census, 2020; American Community Survey",
# add nice label for size element
size = "Population",
color = "Affordability") +
theme_bw()
ggplot(data = cities,
aes(x = pct_renter, y = mpvE,
size = pop20, color = affordable_cat)) +
geom_point(alpha = .85) +
scale_y_continuous(labels = dollar_format(accuracy = 1)) +
scale_x_continuous(labels = percent_format(accuracy = 1)) +
# change legend label formatting
scale_size_area(labels = comma, max_size = 10) +
labs(x = "Estimated Percent Renter Occupied", y = "Estimated Median Property Value",
title = "Cities and Housing Affordability",
caption = "Sources: US Census, 2020; American Community Survey",
# add nice label for size element
size = "Population",
color = "Affordability") +
theme_bw()
ggplot(data = cities,
aes(x = unemployment_rate, y = mhiE,
size = pop20, color = affordable_cat)) +
geom_point(alpha = .85) +
scale_y_continuous(labels = dollar_format(accuracy = 1)) +
scale_x_continuous(labels = percent_format(accuracy = 1)) +
# change legend label formatting
scale_size_area(labels = comma, max_size = 10) +
labs(x = "Estimated Unemployment Rate", y = "Estimated Median Household Income",
title = "Cities and Housing Affordability",
caption = "Sources: US Census, 2020; American Community Survey",
# add nice label for size element
size = "Population",
color = "Affordability") +
theme_bw()
ggplot(data = cities |>
filter(state == " Michigan"),
aes(x = unemployment_rate, y = mhiE,
size = pop20, color = affordable_cat)) +
geom_point(alpha = .85) +
scale_y_continuous(labels = dollar_format(accuracy = 1)) +
scale_x_continuous(labels = percent_format(accuracy = 1)) +
# change legend label formatting
scale_size_area(labels = comma, max_size = 10) +
labs(x = "Estimated Unemployment Rate", y = "Estimated Median Household Income",
title = "Michigan Cities and Housing Affordability",
caption = "Sources: US Census, 2020; American Community Survey",
# add nice label for size element
size = "Population",
color = "Affordability") +
theme_bw()