Introduction

In this document I do some basic analyses on median home value of individual census tracts of Howard County, Maryland over time.

For those readers unfamiliar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information check out the the tutorial “Getting started with the Tidyverse”.

Setup and data preparation

Libraries

I use the following packages for the following purposes:

  • tidyverse: do general data manipulation.
  • readxl: read Excel spreadsheets (part of the Tidyverse, but not loaded by default).
  • tidycensus: retrieve U.S. census data.
  • tigris: retrieve U.S. census geospatial data.
  • sf: manipulate geospatial data.
  • viridis: obtain special color palettes for graphs.
  • knitr: display tabular data.
  • tools: compute MD5 checksums.
library(tidyverse)
library(readxl)
library(tidycensus)
library(tigris)
library(sf)
library(viridis)
library(knitr)
library(tools)

The tidycensus and tigris packages require a Census API key. I previously obtained such a key and stored it for use by the functions included in the tidycensus and tigris packages.

Data sources

I use data from the following sources; see the References section below for more information:

  • Median home value estimates are from the American Community Survey table B25077, and are accessed via the tidycensus get_acs() function.
  • Census tract boundaries are from the U.S. Census Bureau cartographic boundaries data, and are accessed via the tigris tracts() function.
  • Road geometry is also from the U.S. Census bureau, and is accessed via the tigris roads() function.
  • allitems.xlsx. Consumer Price Index data used to adjust for inflation.

I check to make sure that the versions of the files being used in this analysis are identical to the versions of the files I originally downloaded. I do this by comparing the MD5 checksums of the files against MD5 values I previously computed, and stopping execution if they do not match.

stopifnot(md5sum("allitems.xlsx") == "f890a4aff139ca003246d17491d51272")

Reading in and preparing the data

I first use the tidycensus get_acs() function to retrieve median home value estimates from the 2010 and 2017 ACS 5-year estimates for all census tracts in Howard County. This function returns data in tidy format, with the following columns:

  • GEOID. An 11-character geographic ID combining the 2-digit FIPS code for Maryland (“24”), the 3-digit FIPS Code for Howard County (“027”), and the 6-digit FIPS code for the individual census tract within Howard County.
  • NAME. The name of the census tract.
  • variable. References the ACS variable being retrieved (which we here name mhv).
  • estimate. The ACS 5-year estimate for the variable’s value.
  • moe. The margin of error for the estimate.

I take the tables returned by get_acs() and add a new variable year containing the year of the estimate. To reduce the amount of code I create a function get_mhv() to retrieve the ACS estimates for a given year y, and then use the map() function to execute the function for both 2010 and 2017. I then use the bind_rows() function to consolidate the results together into a single table mhv.

get_mhv <- function(y) {
  get_acs(
    year = y, geography = "tract",
    state = "MD", county = "Howard County",
    survey = "acs5", variables = c(mhv = "B25077_001")
  ) %>%
    mutate(year = y)
}

mhv <- map(c(2010, 2017), get_mhv) %>%
  bind_rows()

I also get median home value estimates for Howard County as a whole, for both 2010 and 2017, using the same general procedure as used above. I then add that data to the existing mhv table.

get_hoco_mhv <- function(y) {
  get_acs(
    year = y, geography = "county",
    state = "MD", county = "Howard County",
    survey = "acs5", variables = c(mhv = "B25077_001")
  ) %>%
    mutate(year = y)
}

mhv <- map(c(2010, 2017), get_hoco_mhv) %>%
  bind_rows(mhv)

The tables above contain 5-year estimates that are expressed in dollars as of the year in question. (For example, the 5-year estimates for 2010 are in 2010 dollars.) In order to adjust values for inflation I read in an Excel spreadsheet containing consumer price index data from 1978 to 2018. The index values are expressed as a percentage of the 1997 index (set at 100).

The spreadsheet has two sheets, each of which has values for all twelve months of the year plus an average figure for the entire year. The sheets differ in whether or not the monthly figures are seasonally-adjusted. Since I am doing adjustments for entire years I set col_types to skip over the monthly values, and read only the first sheet (which is not seasonally-adjusted).

cpi_tbl <- read_excel("allitems.xlsx",
  sheet = 1,
  skip = 8,
  col_names = c("year", "index"),
  col_types = c("numeric", rep("skip", 12), "numeric")
)

I want to express all home values as 2017 dollars. I therefore extract from cpi_tbl the consumer price index value for 2017 by filtering for the row for 2017, selecting only the variable index for that row, and then converting that variable to a scalar numeric value. I then divide the 2017 CPI value by all other CPI values to create a new table adj_tbl. The variable adj in this table can be multiplied by median home value values for each year to express them in 2017 dollars.

index_2017 <- cpi_tbl %>%
  filter(year == 2017) %>%
  select(index) %>%
  as.numeric()

adj_tbl <- cpi_tbl %>%
  mutate(adj = index_2017 / index) %>%
  select(year, adj)

The adjustment is done as follows, creating a new table mhv_adj:

  1. I start with the mhv table.
  2. I then join that table with the adj_tbl table containing inflation adjustment figures for each year, using year as the common variable. This has the effect of adding a new column containing the variable adj. This variable has the value 1 for the year 2017, and for other years contains the adjustment factor needed to convert home values into 2017 dollars.
  3. I modify the estimate and moe variables to convert them to 2017 dollars.
mhv_adj <- mhv %>%
  inner_join(adj_tbl, by = "year") %>%
  mutate(estimate = estimate * adj, moe = moe * adj)

In order to produce maps of the median home value data I need geometry of the cartographic boundaries for all 55 census tracts in Howard County. I use the tigris function tracts() to return the cartographic boundaries in the form of a table tracts_sf compatible with the sf package. This table has a variable GEOID that is identical to the variable of the same name returned by get_acs().

options(tigris_use_cache = TRUE)
tracts_sf <- tracts(state = "MD", county = "Howard County", cb = TRUE, class = "sf", progress_bar = FALSE)

To help orient readers as to the locations of the census tracts, I also want any maps generated to also display major roads in Howard County that correspond in whole or in part to census tract boundaries. I use the tigris function roads() to return geometry for all Howard County roads.

Because I don’t need or want to display each and every Howard County road, I filter the results to include only I-70 and I-95, U.S. 29 and U.S. 40 (but not U.S. 1, which is not a census tract boundary), and Maryland routes 32, 100, 108, 175, and 216 (but again omitting roads like MD 97 that are not census tract boundaries). I then keep only the geometry from the table, as roads_geo.

major_roads <- c(
  "I- 70",
  "I- 95",
  "US Hwy 29",
  "US Hwy 40",
  "State Hwy 32",
  "State Hwy 100",
  "State Hwy 108",
  "State Hwy 175",
  "State Hwy 216"
)

roads_geo <- roads(state = "MD", county = "Howard County", class = "sf", progress_bar = FALSE) %>%
  filter(FULLNAME %in% major_roads) %>%
  st_geometry()

Analysis

I do analyses to answer the following questions:

Howard County median home value by census tract

Median home value estimates for census tracts are not available before 2009, at least for Howard County, and the boundaries for Howard County census tracts changed from 2009 to 2010 (presumably as part of the work on the 2010 census). As a result I choose 2010 as the first year for this analysis.

Unfortunately, figures for median home value for census tracts are available only in the American Community Survey 5-year estimates. This means, for example, that the 2010 median home value estimates published for Howard County census tracts actually reflect surveys in the years 2006-2010, the 2011 estimates reflect surveys in the years 2007-2011, and so on.

For this reason the U.S. Census Bureau recommends not comparing 5-year estimates from overlapping sets of years, for example comparing 2011 5-year estimates to 2010 5-year estimates. In order to look at changes over the maximum period of time I therefore choose to compare median home values between the 2010 and 2017 5-year estimates, representing home values over the timeframes 2006-2010 and 2013-2017.

The next issue is that the margins of error for median home values at the census tract level can be very high, up to 10% or even more relative to the estimates. That means that, for example, ranking individual census tracts based on their median home value doesn’t really make sense, given that the relative positions of tracts on the list will in large part reflect random measurement errors.

To work around this issue I take the 55 Howard County census tracts and divide them into 5 groups or “quintiles” of 11 census tracts each, based on their median home value 5-year estimates for 2010. Quintile 1 contains the 11 census tracts with the lowest median home values (lowest 20%) and quintile 5 contains the 11 census tracts with the highest median home values (highest 20%), with the other quintiles containing tracts with median home values intermediate between the lowest and highest.

I create a new table mhv_quintiles as follows:

  1. I start with the mhv_adj table created above, and then retain only the rows for 2010 that are associated with census tracts.
  2. I rank the rows in ascending order from 1 to 55 based on the value of the estimate variable, using the min_rank() function. I then divide the rank value by 11, and take the next highest integer (using the ceiling() function). This produces a value from 1 to 5 for each row. I then convert the value to a character string and save it in a new variable quintile.
  3. I retain only the variables GEOID and quintile.

The resulting table mhv_quintiles has 55 rows, each row containing the 11-character geographic ID of the corresponding census tract and the quintile of median home value that that census tract was in according to the (inflation-adjusted) 2010 5-year estimates.

mhv_quintiles <- mhv_adj %>%
  filter(year == 2010, GEOID != "24027") %>%
  mutate(quintile = as.character(ceiling(min_rank(estimate) / 11))) %>%
  select(GEOID, quintile)

I now want to map each census tract, coloring each tract according to the median home value quintile it was in according to the 2010 5-year estimate. I do this as follows:

  1. I first define variables quintile_colors, quintiles, and quintile_labels, to specify the colors used for the quintiles on the map, the order in which they will be listed in the legend, and the labels used for them in the legend. For the colors I use a special [“colorblind-friendly” palette][cbfp] designed to be more readable by people with various forms of color blindess.
  2. I then take the table tracts_sf containing the geometry for the census tracts and join it with the table mhv_quintiles created above, using the common variable GEOID.
  3. I plot the resulting table, using the value of the quintile variable to determine the fill color for each census tract.
  4. I first use geom_sf() to draw the outlines of the census tracts in white.
  5. I then use geom_sf() again, this time taking data from the roads_geo table, to draw lines for the major roads in Howard County. I also draw them in white but with lines that are over twice as thick.
  6. I use scale_fill_manual() to specify the colors for each census tract and to create a legend, with label text and label order as previously specified.
  7. I add a title, subtitle, and caption for the map.
  8. Finally I blank out the axis tick marks and axis labels (which would normally show latitude and longitude) and make the background of the graph blank for a cleaner look.

Note that I don’t bother trying to label the various Howard County roads because it would be more work and make the map more cluttered—and in any case the intended audience of Howard County residents should be able to identify most if not all of the roads by sight.

# Colors are assigned to the quintiles based on their alphabetical order.
quintile_colors <- c(
  "#0072B2", # 1
  "#56B4E9", # 2
  "#009E73", # 3
  "#E69F00", # 4
  "#F0E442"  # 5
)

# The order of quintiles in this list determines the order in the legend.
quintile_order <- c("5", "4", "3", "2", "1")

# These labels are used as substitutes for the quintile values in the legend.
quintile_labels <- c("5 (highest 20%)", "4", "3", "2", "1 (lowest 20%)")

tracts_sf %>%
  inner_join(mhv_quintiles, by = "GEOID") %>%
  ggplot(aes(fill = quintile, geometry = geometry)) +
  geom_sf(color = "white", size = 0.4) +
  geom_sf(data = roads_geo, color = "white", size = 1.0, fill = NA) +
  scale_fill_manual(name = "Quintile", values = quintile_colors, breaks = quintile_order, labels = quintile_labels) +
  labs(
    title = "2010 Howard County Median Home Value Quintiles",
    subtitle = "Based on 2010 5-Year Estimates for Census Tracts",
    caption = paste0(
      "Data sources:",
      "\n  U.S. Census Bureau, American Community Survey, Table B25077",
      "\n  U.S. Census Bureau, Cartographic boundaries",
      "\nCreated using the tidyverse and tidycensus R packages"
    )
  ) +
  theme(plot.caption = element_text(hjust = 0)) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())

The results are as one might expect: the census tracts with the lowest median home values tend to be in eastern Howard County, while the tracts with higher median home values tend to be in western Howard County.

Howard County median home value changes by census tract quintiles

I next want to look at the home value gaps between the quintiles, and how their average median home values have changed over time. More specifically, I investigate how the median home value for the various census tracts changed from 2010 to 2017—or, more correctly, from the 2006-2010 timeframe to the 2013-2017 timeframe.

To look at this question I take the quintiles from above, compute the averages of the median home values for the census tracts in each quintile, and graph the results for 2017 compared to 2010. Thus, for example, for the census tracts in quintile 1 I compute the average median home value from the 2010 5-year estimates for those 11 census tracts. I compute similar averages for the other quintiles. I then repeat the process using the 2017 5-year estimates.

As a side note, averaging the median home values in this way is not strictly speaking correct. The correct method would be to aggregate all the individual home values for all the tracts in a quintile, and then compute a median home value for the quintile overall. However this is not possible because the U.S. Census Bureau does not release data for individual households within a census tract. Averaging the median home values for the tracts is the next-best thing.

There’s another subtlety here as well. Many graphs showing changes between quintiles over time actually reflect different sets of people, households, or geographies between the different timeframes. For example, if in the graph the 2010 figures were to reflect the quintiles as they exist in 2010, and the 2017 figures were to reflect the quintiles as they exist in 2017, then we would not necessarily be comparing like to like. That’s because a given tract may have moved from one quintile to another over time. (For example, a census tract in the highest 20% of tracts in 2010 may have moved to the next lower 20% in 2017.)

To avoid this problem I have the 2017 values in the graph reflect the home values for the same sets of quintiles as for the 2010 values: if a tract were in, say, quintile 3 in 2010 then it is counted as part of the same set of tracts for 2017 (i.e., quintile 3).

Finally, because I’m more interested in real (not nominal) changes between 2010 and 2017, I use inflation-adjusted figures for the graph.

Now I’m ready to create the graph.

  1. I want to include the overall Howard County median home value in the graph. I therefore take the table mhv_quintiles from above, which assigns each census tract a quintile value, and create a new table mhv_qplus, adding a final row containing the geographic ID for Howard County and an arbitrary quintile value of 0.
  2. Since I’m adding Howard County as a whole to the graph, I define new variables to specify the colors for the graph, the labels for the legend, and the ordering of labels.
  3. For the actual plot I start with the table mhv_adj containing inflation-adjusted median home values for all the Howard County census tracts and for Howard County itself.
  4. I join the mhv_adj table with the table mhv_qplus containing quintile values for the census tracts and the extra row for Howard County (“quintile 0”), using the common field GEOID.
  5. I then group the rows by quintile (producing five sets of census tracts that are in the same quintile, along with a separate set containing the rows for Howard County) and then by year (for each quintile, producing two sets of rows for the two years 2010 and 2017 in the data).
  6. I next summarize the groups by calculating the mean (i.e., average) median home value for each group. This produces a 12-row table mhv_avgs, with values for the five quintiles plus Howard County for 2010, and similar values for 2017.
  7. Finally I plot the data using ggplot(), with the x-axis values being the years and the y-axis values the computed average median home values. I use geom_line() to draw lines for each quintile and for Howard County, coloring the lines according to the same color scheme as for the map above.
mhv_qplus <- mhv_quintiles %>%
  bind_rows(c(GEOID = "24027", quintile = "0"))

# Colors are assigned to the quintiles (and Howard County) based on their alphabetical order.
qplus_colors <- c("#000000",  # 0
                  "#0072B2",  # 1
                  "#56B4E9",  # 2
                  "#009E73",  # 3
                  "#E69F00",  # 4
                  "#F0E442")  # 5

# The order of quintiles in this list determines the order in the legend.
qplus_order <- c("5", "4", "3", "2", "1", "0")

# These labels are used as substitutes for the quintile values in the legend.
qplus_labels <- c("5 (highest 20%)", "4", "3", "2", "1 (lowest 20%)", "Howard County")

mhv_avgs <- mhv_adj %>%
  inner_join(mhv_qplus, by = "GEOID") %>%
  group_by(quintile, year) %>%
  summarize(estimate = mean(estimate))

upper_limit <- round(max(mhv_avgs$estimate + 50000), -5)
lower_limit <- 0

mhv_avgs %>%
  ggplot(aes(x = year, y = estimate, color = quintile)) +
  geom_line(size = 1.0) +
  scale_color_manual(name = "Quintiles", values = qplus_colors, labels = qplus_labels, breaks = qplus_order) +
  scale_x_continuous(breaks = seq(2010, 2017, 1)) +
  scale_y_continuous(labels = scales::dollar, limits = c(lower_limit, upper_limit), breaks = seq(lower_limit, upper_limit, 100000)) +
  xlab("Year") +
  ylab("Average Median Home Value") +
  labs(
    title = "Median Home Value Changes within Howard County",
    subtitle = "5-Year Estimates in 2017 Dollars for 2010 Census Tract Home Value Quintiles",
    caption = paste0(
      "Data source:",
      "\n  U.S. Census Bureau, American Community Survey, Table B25077",
      "\nCreated using the tidyverse and tidycensus R packages"
    )
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 10))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(hjust = 0, margin = margin(t = 15)))

Based on the graph above, all quintiles experienced a decline in average median home value between 2010 and 2017, but the decline appears to have been steepest for those tracts in the highest quintile.

To double-check this, I print a table of the percentage changes for each quintile (rounded to the nearest whole number), along with the 2010 and 2017 average median home values for each quintile and the absolute amount of the changes (all rounded to the nearest thousand).

mhv_avgs %>%
  filter(quintile != "0") %>%
  spread(year, estimate) %>%
  mutate(abs_change = `2017` - `2010`, pct_change = 100 * (`2017` - `2010`) / `2010`) %>%
  mutate(abs_change = round(abs_change, -3), pct_change = round(pct_change), `2010` = round(`2010`, -3), `2017` = round(`2017`, -3)) %>%
  arrange(desc(quintile)) %>%
  kable()
quintile 2010 2017 abs_change pct_change
5 787000 669000 -118000 -15
4 628000 531000 -97000 -15
3 482000 420000 -62000 -13
2 416000 350000 -66000 -16
1 342000 275000 -66000 -19

Based on the above table the quintile containing the census tracts with the highest median home values experienced the biggest absolute decline in average median home value, but the quintile containing the census tracts with the lowest median home values experienced the largest decline in percentage terms.

Howard County median home value changes by census tract

I now take another look at the median home value changes between the 2006-2010 and 2013-2017 timeframes, this time mapping the changes for individual census tracts. I do this as follows:

  1. I start with the table mhv_adj containing inflation-adjusted median home value values for all geographies, and then filter for census tracts only.
  2. From the resulting table I retain only the variables GEOID, estimate, and year.
  3. I then use spread() to recast the table so that instead of having different rows for different years, the table has different columns for different years.
  4. I then take the difference between the median home value values for 2017 and 2010 (i.e., the difference between the values in the column for 2017 and the column for 2010), divide the difference by the value for 2010, and multiply by 100 to calculate the percentage change between the two timeframes.
  5. I then take the resulting table mhv_change and join it to the table tracts_sf using the common field GEOID, to add the geometry for the census tracts.
  6. I then plot the resulting table, using the value of the pct_change variable to select the color for each census tract. Since that variable has continuous values I use scale_fill_viridis() to vary the colors along a predefined spectrum of color values, using previously-calculated lower and upper limits.
mhv_change <- mhv_adj %>%
  filter(GEOID != "24027") %>%
  select(GEOID, estimate, year) %>%
  spread(year, estimate) %>%
  mutate(pct_change = 100 * (`2017` - `2010`) / `2010`) %>%
  select(GEOID, pct_change)

upper_limit <- round(max(mhv_change$pct_change) + 5, -1)
lower_limit <- round(min(mhv_change$pct_change) - 5, -1)

tracts_sf %>%
  inner_join(mhv_change, by = "GEOID") %>%
  ggplot(aes(fill = pct_change, geometry = geometry)) +
  geom_sf(color = "white", size = 0.4) +
  geom_sf(data = roads_geo, color = "white", size = 1.0, fill = NA) +
  scale_fill_viridis(option = "plasma", name = "% Change", limits = c(lower_limit, upper_limit), breaks = seq(lower_limit, upper_limit, 10)) +
  labs(title="Changes in Howard County Median Home Value",
       subtitle = "2017 vs. 2010 5-Year Estimates (Inflation Adjusted) for Census Tracts",
       caption = paste0(
         "Data sources:",
         "\n  U.S. Census Bureau, American Community Survey, Table B25077",
         "\n  U.S. Census Bureau, Cartographic boundaries",
         "\nCreated using the tidyverse and tidycensus R packages"
       )
  ) +
  theme(plot.caption = element_text(hjust = 0)) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())