Assignment 8- affordable housing

R Assignment 8 - affordable housing

When Ms. Coats moved into the Baxter Street house, a family needed right around the area’s median income to afford the $82 monthly mortgage payment — the definition of middle class. Today a typical Clairemont home costs $850,000, up 30 percent from 2019. A family would need to make about double San Diego’s median income to afford one.

This is a back-of-the-envelope calculation to determine which cities the “average” household could afford to buy the “average” home, inspired by this article in the New York Times.

We will use:

the general rule of thumb that you can afford a house that is 2.5 times your income source
Median Household Income from ACS 2016-20 to identify the “average” household
Median owner-occupied property value from ACS 2016-20 to identify the “average” home

For additional context we’ll download race/ethnicity and tenure variables to calculate:

Percent BIPOC population
Percent renter-occupied homes
Percent owner-occipued homes

library(tidycensus)
library(tidyverse)
library(scales)

# list variables in acs
acs20<- load_variables(2020, "acs5", cache = T)

# list variables in decennial
dec20<- load_variables(2020, "pl", cache = T)

Population data

Download the population data from Decennial census since it is available and more accurate. We’ll acquire the other variables from the 5-year American Community Survey for 2016 - 12020.

# get decennial populationand race data
raw_city_census_data_2020 <- get_decennial(geography = "place", 
                                       variables = c(pop20 = "P1_001N",
                                                     white_alone_not_hisp = "P2_005N"), 
                                       year = 2020,
                                       output = "wide")

GEOID	NAME	pop20	white_alone_not_hisp
4024200	Erick city, Oklahoma	1000	834
4024300	Erin Springs town, Oklahoma	89	69
4024460	Etowah town, Oklahoma	159	106
4024500	Etta CDP, Oklahoma	80	22
4024650	Eufaula city, Oklahoma	2766	1790

Household income, value, and tenure data

Download the other variables from the 5-year American Community Survey for 2016 - 12020.

# get acs 5-year mhi, mpv and tenure data
raw_city_acs_data_2020 <- get_acs(
  geography = "place",
  variables = c(mhi = "B19013_001",
                mpv = "B25077_001",
                housing_units = "B25003_001",
                own_occupied = "B25003_002",
                rent_occupied = "B25003_003",
                labor_force = "B23025_002",
                unemployed = "B23025_005"),
  year = 2020,
  output = "wide")

GEOID	NAME	mhiE	mhiM	mpvE	mpvM	housing_unitsE	housing_unitsM	own_occupiedE	own_occupiedM	rent_occupiedE	rent_occupiedM	labor_forceE	labor_forceM	unemployedE	unemployedM
4024200	Erick city, Oklahoma	27045	13788	57500	18062	328	107	206	85	122	71	229	125	0	10
4024300	Erin Springs town, Oklahoma	59250	27989	71700	21208	49	13	44	13	5	8	46	24	0	10
4024460	Etowah town, Oklahoma	56563	27909	101300	22274	43	13	37	13	6	6	47	21	1	3
4024500	Etta CDP, Oklahoma	37727	8045	NA	NA	40	22	29	22	11	15	29	25	0	10
4024650	Eufaula city, Oklahoma	46068	9359	157400	6134	1245	120	769	158	476	117	853	158	28	30

Margin of error

We’ll calculate a margin-of-error ratio to assess which cities have a very high margin of error and should be removed form the analyses.

# explore margin of error and determine which to remove
city_acs_moe <- raw_city_acs_data_2020 |> 
  mutate(mhi_moe_ratio = mhiM/mhiE,
         mpv_moe_ratio = mpvM/mpvE,
         housing_units_moe_ratio = housing_unitsM/housing_unitsE,
         own_occupied_moe_ratio = own_occupiedM/own_occupiedE,
         rent_occupied_moe_ratio = rent_occupiedM/rent_occupiedE)

MOE visualization

We’ll look at the histograms for each variable to determine the quality of the data. Notice we don’t have to do this for the population data because it is from the decennial census and is a count, not an estimate like the American Community Survey.

#### histograms to understand the margin of error ####
ggplot(data = city_acs_moe, aes(x=mhi_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="lightblue") +
  labs(title = "MHI Margin of Error, as ratio of MHI Estimate",
       x = "MHI Margin of Error Ratio")

## Warning: Removed 3446 rows containing non-finite values (`stat_bin()`).

ggplot(data = city_acs_moe, aes(x=mpv_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="wheat2") +
  labs(title = "MPV Margin of Error, as ratio of MPV Estimate",
       x = "MPV Margin of Error Ratio")

## Warning: Removed 3554 rows containing non-finite values (`stat_bin()`).

ggplot(data = city_acs_moe, aes(x=own_occupied_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="lightsteelblue") +
  labs(title = "Owner Occupied Margin of Error, as ratio of estimate",
       x = "Owner Occupied Margin of Error Ratio")

## Warning: Removed 739 rows containing non-finite values (`stat_bin()`).

ggplot(data = city_acs_moe, aes(x=rent_occupied_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="seagreen2") +
  labs(title = "Renter Occupied Margin of Error, as ratio of estimate",
       x = "Renter Units Margin of Error Ratio")

## Warning: Removed 2825 rows containing non-finite values (`stat_bin()`).

Margin of error assessment

The margin of error is really high in some of the cities. We will definitely want to remove some areas with small population. We’ll start by removing cities with fewer than 50,000 people. That will help with the margin-of-error, and this back-of-the-envelope analysis will be more meaningful for cities and not small towns.

Combine and process the data

We’ll join the population data and the housing data, remove cities with fewer than 50,000 people and calculate our variables.

Price of an affordable home in each city (2.5 times the median household income)
Difference between the price of the median home, and what the median household could afford
Percent BIPOC population
Percent renter-occupied homes
Percent owner-occipued homes

#### combine the data, filter small cities, assess impact of removing high moe
city_data <- raw_city_census_data_2020 |> 
  filter(pop20 > 50000) |> 
  left_join(city_acs_moe, by = c("GEOID", "NAME")) |> 
  mutate(affordable = 2.5* mhiE,
         afford_diff = affordable - mpvE,
         pct_owner_occupied = own_occupiedE/housing_unitsE,
         pct_renter = rent_occupiedE/housing_unitsE,
         pct_bipoc = 1 - (white_alone_not_hisp/pop20),
         unemployment_rate = unemployedE/labor_forceE)

Margin of error assessment

Look at the margin of error now. Do we need to filter more?

#### histograms to understand the margin of error after removing low pop ####
ggplot(data = city_data, aes(x=mhi_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="lightblue") +
  labs(title = "MHI Margin of Error, as ratio of MHI Estimate",
       subtitle = "Cities with 50,000 or more",
       x = "MHI Margin of Error Ratio")

## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

ggplot(data = city_data, aes(x=mpv_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="wheat2") +
  labs(title = "MPV Margin of Error, as ratio of MPV Estimate",
       subtitle = "Cities with 50,000 or more",
       x = "MPV Margin of Error Ratio")

## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

ggplot(data = city_data, aes(x=own_occupied_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="lightsteelblue") +
  labs(title = "Owner Occupied Margin of Error, as ratio of estimate",
       subtitle = "Cities with 50,000 or more",
       x = "Owner Occupied Margin of Error Ratio")

## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

ggplot(data = city_data, aes(x=rent_occupied_moe_ratio))+
  geom_histogram(bins = 10, color="darkblue", fill="seagreen2") +
  labs(title = "Renter Occupied Margin of Error, as ratio of estimate",
       subtitle = "Cities with 50,000 or more",
       x = "Renter Units Margin of Error Ratio")

## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

I think, I want to remove all cities where the margin of error is more than 10% of the estimate

Create final analysis dataset

My analysis dataframe will be called cities. We’ll use two ways to create a column to define whether each city is affordable:

ifelse() - defines the column one way if it meets a criteria, and another way if it doesn’t
- ifelse(criteria, value if it meets the criteria, value if it doesn't meet the criteria)
case_when() - defines the column in many ways based on criteria
- case_when(criteria ~ value if it meets it, another criteria ~ value if it meets it, yet another criteria ~ value if it meets it, TRUE ~ value if it doesn't meet any of the above)

And I’ll create separate city and state columns.

Finally, I’ll create a removed_cities dataframe to look at who has been removed.

#### city data without high margin of error data
cities <- city_data |> 
  separate(NAME, into = c("city", "state"), sep = ",") |> 
  filter(mhi_moe_ratio < .1 &
         mpv_moe_ratio < .1 &
         housing_units_moe_ratio < .1 &
         own_occupied_moe_ratio < .1 &
         rent_occupied_moe_ratio < .1) |> 
  filter(state != " Puerto Rico") |> 
  mutate(affordable_yn = ifelse(afford_diff <= 0, "no", "yes"),
         affordable_cat = case_when(afford_diff <= -10000 ~ "not affordable", 
                                afford_diff > -10000 & afford_diff <= 0 ~ "almost affordable",
                                afford_diff > 0 ~ "affordable",
                                TRUE ~ "None of the above")) 

#### high MOE cities
removed_cities <- city_data |> 
  filter(mhi_moe_ratio >= .1 |
           mpv_moe_ratio >= .1 |
           housing_units_moe_ratio >= .1 |
           own_occupied_moe_ratio >= .1 |
           rent_occupied_moe_ratio >= .1)

GEOID	city	state	pop20	white_alone_not_hisp	mhiE	mhiM	mpvE	mpvM	housing_unitsE	housing_unitsM	own_occupiedE	own_occupiedM	rent_occupiedE	rent_occupiedM	labor_forceE	labor_forceM	unemployedE	unemployedM	mhi_moe_ratio	mpv_moe_ratio	housing_units_moe_ratio	own_occupied_moe_ratio	rent_occupied_moe_ratio	affordable	afford_diff	pct_owner_occupied	pct_renter	pct_bipoc	unemployment_rate	affordable_yn	affordable_cat
0651182	Newport Beach city	California	85239	64352	133849	8839	1976400	104275	38596	827	20953	887	17643	879	46439	1181	1349	348	0.0660371	0.0527601	0.0214271	0.0423328	0.0498215	334622.5	-1641778	0.5428801	0.4571199	0.2450404	0.0290489	no	not affordable
0617610	Cupertino city	California	60381	13085	182857	9918	1866200	59692	20506	391	12468	524	8038	458	28137	730	710	190	0.0542391	0.0319859	0.0190676	0.0420276	0.0569793	457142.5	-1409058	0.6080172	0.3919828	0.7832928	0.0252337	no	not affordable
0670000	Santa Monica city	California	93076	60654	98300	5511	1452100	108646	45706	993	12761	586	32945	1052	54423	1207	3445	491	0.0560631	0.0748199	0.0217258	0.0459212	0.0319320	245750.0	-1206350	0.2791975	0.7208025	0.3483390	0.0633004	no	not affordable
0649670	Mountain View city	California	82376	33008	144116	7212	1560600	46859	33029	761	13661	666	19368	715	47903	1036	1616	330	0.0500430	0.0300263	0.0230404	0.0487519	0.0369166	360290.0	-1200310	0.4136062	0.5863938	0.5993008	0.0337348	no	not affordable
0660102	Redwood City city	California	84292	34067	123294	5083	1424200	58720	30175	816	14328	682	15847	772	48589	1079	1561	413	0.0412267	0.0412302	0.0270423	0.0475991	0.0487158	308235.0	-1115965	0.4748302	0.5251698	0.5958454	0.0321266	no	not affordable

Summary stats

How many cities are “affordable”? And what are the characteristics of afforadble and not affordable cities?

I’ll use two new functions:

toString() to list all of the values, separated by commas
unique() to select only the unique values

affordability_stats <- cities |> 
  group_by(affordable_cat) |> 
  summarise(`Number of cities` = n(),
            `Total Population` = comma(sum(pop20)),
            `Average difference between afforable and MPV` = dollar(mean(afford_diff)),
            `Average MHI` = dollar(mean(mhiE)),
            `Average MPV` = dollar(mean(mpvE)),
            `Average Percent BIPOC` = percent(mean(pct_bipoc)),
            `Average Percent Renter-Occupied` = percent(mean(pct_renter)),
            `Average Percent Renter-Occupied` = percent(mean(pct_owner_occupied)),
            `Unemployment Rate` = percent(sum(unemployedE)/sum(labor_forceE)),
            States = toString(unique(state)))

affordable_cat	Number of cities	Total Population	Average difference between afforable and MPV	Average MHI	Average MPV	Average Percent BIPOC	Average Percent Renter-Occupied	Unemployment Rate	States
affordable	74	7,805,454	$12,615.95	$48,677.73	$109,078	45%	57%	8%	Texas, Oklahoma, Ohio, Pennsylvania, Missouri, New York, Alabama, Illinois, Michigan, Mississippi, Indiana, Georgia, Iowa, Nebraska, Kansas
almost affordable	43	6,024,764	-$4,245.87	$56,715.60	$146,035	56%	58%	6%	Wisconsin, Texas, Pennsylvania, Tennessee, Illinois, Missouri, Mississippi, Indiana, Michigan, Nebraska, Georgia, Iowa, New York, Kansas, Alabama, New Jersey
not affordable	657	115,471,286	-$186,942	$71,942.58	$366,798	52%	56%	6%	Oregon, Virginia, Texas, Rhode Island, Utah, Pennsylvania, South Dakota, Oklahoma, Washington, Wisconsin, Wyoming, South Carolina, Tennessee, Maryland, North Dakota, Florida, Nebraska, Louisiana, Alabama, California, Connecticut, District of Columbia, Alaska, Delaware, Montana, Illinois, Mississippi, New Jersey, Georgia, Michigan, Missouri, Indiana, Arkansas, Massachusetts, Nevada, Ohio, Iowa, New Mexico, Colorado, Hawaii, Arizona, Idaho, New Hampshire, New York, Kansas, North Carolina, Kentucky, Maine, Minnesota

Visualizations

# ggplot(data = city_data, aes(x=pct_renter))+
#   geom_histogram(bins = 10, color="darkblue", fill="seagreen2") +
#   labs(title = "Percent Renter Occupied Cities",
#        subtitle = "Cities with 50,000 or more",
#        x = "Percent of Renter")

ggplot(data = cities,
         aes(x = mhiE, y = mpvE, 
             size = pop20, color = affordable_cat)) +
  geom_point(alpha = .85) +
  scale_y_continuous(labels = dollar_format(accuracy = 1)) +
  scale_x_continuous(labels = dollar_format(accuracy = 1)) +
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Estimated Median Household Income", y = "Estimated Median Property Value",
       title = "Cities and Housing Affordability",
       caption = "Sources: US Census, 2020; American Community Survey",
       # add nice label for size element
       size = "Population",
       color = "Affordability") +
  theme_bw()

ggplot(data = cities,
         aes(x = pct_renter, y = mpvE, 
             size = pop20, color = affordable_cat)) +
  geom_point(alpha = .85) +
  scale_y_continuous(labels = dollar_format(accuracy = 1)) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Estimated Percent Renter Occupied", y = "Estimated Median Property Value",
       title = "Cities and Housing Affordability",
       caption = "Sources: US Census, 2020; American Community Survey",
       # add nice label for size element
       size = "Population",
       color = "Affordability") +
  theme_bw()

ggplot(data = cities,
         aes(x = unemployment_rate, y = mhiE, 
             size = pop20, color = affordable_cat)) +
  geom_point(alpha = .85) +
  scale_y_continuous(labels = dollar_format(accuracy = 1)) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Estimated Unemployment Rate", y = "Estimated Median Household Income",
       title = "Cities and Housing Affordability",
       caption = "Sources: US Census, 2020; American Community Survey",
       # add nice label for size element
       size = "Population",
       color = "Affordability") +
  theme_bw()

ggplot(data = cities |> 
         filter(state == " Michigan"),
         aes(x = unemployment_rate, y = mhiE, 
             size = pop20, color = affordable_cat)) +
  geom_point(alpha = .85) +
  scale_y_continuous(labels = dollar_format(accuracy = 1)) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  # change legend label formatting
  scale_size_area(labels = comma, max_size = 10) +
  labs(x = "Estimated Unemployment Rate", y = "Estimated Median Household Income",
       title = "Michigan Cities and Housing Affordability",
       caption = "Sources: US Census, 2020; American Community Survey",
       # add nice label for size element
       size = "Population",
       color = "Affordability") +
  theme_bw()