Introduction

In this document I do some basic analyses on median household income of Maryland counties over time.

For those readers unfamiliar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information check out the the tutorial “Getting started with the Tidyverse”.

Setup and data preparation

Libraries

I use the following packages for the following purposes:

  • tidyverse: do general data manipulation.
  • tidycensus: retrieve U.S. census data.
  • tigris: retrieve U.S. census shapefiles.
  • sf: manipulate geospatial data.
  • viridis: obtain special color palettes for graphs.
  • readxl: read Excel spreadsheets (part of the Tidyverse, but not loaded by default).
  • tools: includes the md5sum function.
library(tidyverse)
library(tidycensus)
library(tigris)
library(sf)
library(viridis)
library(readxl)
library(tools)

The tidycensus and tigris packages require a Census API key. I previously obtained such a key and stored it for use by the functions included in the tidycensus and tigris packages.

Data sources

I use data from the following sources; see the References section below for more information:

  • Median household income estimates are from the American Community Survey table B19013, and are accessed via the tidycensus get_acs() function.
  • County boundaries are from the U.S. Census Bureau cartographic boundaries data, and are accessed via the tigris counties() function.
  • allitems.xlsx. Consumer Price Index data used to adjust for inflation.

I check to make sure that the versions of the files being used in this analysis are identical to the versions of the files I originally downloaded. I do this by comparing the MD5 checksums of the files against MD5 values I previously computed, and stopping execution if they do not match.

stopifnot(md5sum("allitems.xlsx") == "f890a4aff139ca003246d17491d51272")

Reading in and preparing the data

I first use the tidycensus get_acs() function to retrieve median household income estimates from the 2010 and 2017 ACS 5-year estimates. This function returns data in tidy format, with the following columns:

  • GEOID. A 5-character geographic ID combining the 2-digit FIPS code for Maryland (“24”) with the 3-digit FIPS code for the individual county or county-equivalent within Maryland.
  • NAME. The name of the county or county-equivalent.
  • variable. References the ACS variable being retrieved (which we here name mhi).
  • estimate. The ACS 5-year estimate for the variable’s value.
  • moe. The margin of error for the estimate.

I take the table returned by get_acs() and add a new variable year containing the year of the estimate.

mhi_2010 <- get_acs(
  geography = "county",
  state = "MD",
  year = 2010,
  survey = "acs5",
  variables = c(mhi = "B19013_001")
) %>%
  mutate(year = 2010)
mhi_2017 <- get_acs(
  geography = "county",
  state = "MD",
  year = 2017,
  survey = "acs5",
  variables = c(mhi = "B19013_001")
) %>%
  mutate(year = 2017)

The ACS 5-year estimates are expressed in dollars as of the year in question. For example, the 5-year estimates for 2010 are in 2010 dollars. In order to adjust values for inflation I read in an Excel spreadsheet containing consumer price index data from 1978 to 2018. The index values are expressed as a percentage of the 1997 index (set at 100).

The spreadsheet has two sheets, each of which has values for all twelve months of the year plus an average figure for the entire year. The sheets differ in whether or not the monthly figures are seasonally-adjusted. Since I am doing adjustments for entire years I set col_types to skip over the monthly values, and read only the first sheet (which is not seasonally-adjusted).

cpi_tbl <- read_excel("allitems.xlsx",
  sheet = 1,
  skip = 8,
  col_names = c("year", "index"),
  col_types = c("numeric", rep("skip", 12), "numeric")
)

I want to express all income values as 2017 dollars. I therefore extract from cpi_tbl the consumer price index value for 2017 by filtering for the row for 2017, selecting only the variable index for that row, and then converting that variable to a scalar numeric value. I then divide the 2017 CPI value by all other CPI values to create a new table adj_tbl. The variable adj in this table can be multiplied by median household income values for each year to express them in 2017 dollars.

index_2017 <- cpi_tbl %>%
  filter(year == 2017) %>%
  select(index) %>%
  as.numeric()

adj_tbl <- cpi_tbl %>%
  mutate(adj = index_2017 / index) %>%
  select(year, adj)

The adjustment is done as follows, creating a new table mhi_2010_adj:

  1. I start with the mhi_2010 table.
  2. I then join that table with the adj_tbl table containing inflation adjustment figures for each year, using year as the common variable. This has the effect of adding a new column containing the variable adj. This variable has the value 1 for the year 2017, and for other years contains the adjustment factor needed to convert household incomes into 2017 dollars.
  3. I modify the estimate and moe variables to convert them to 2017 dollars.
mhi_2010_adj <- mhi_2010 %>%
  inner_join(adj_tbl, by = "year") %>%
  mutate(estimate = estimate * adj, moe = moe * adj)

In order to produce maps of the median household income data I need geometry of the cartographic boundaries for all 24 Maryland counties and county-equivalents (i.e., Baltimore city). I use the tigris function counties() to return the cartographic boundaries in the form of a table compatible with the sf package. This table has a variable GEOID that is identical to the variable of the same name returned by get_acs().

Since it can take some time to retrieve the boundary data, I cache the returned data.

options(tigris_use_cache = TRUE)
county_sf <- counties(state = "MD", cb = TRUE, class = "sf", progress_bar = FALSE)

Analysis

I do analyses to answer the following questions:

Maryland median household income by county

Median household income estimates for smaller Maryland counties (with populations under 65,000) are available only in the American Community Survey 5-year estimates. This means, for example, that the 2010 median household income estimates published for these counties actually reflect income surveys in the years 2006-2010, the 2011 estimates reflect surveys in the years 2007-2011, and so on.

For this reason the U.S. Census Bureau recommends not comparing 5-year estimates from overlapping sets of years, for example comparing 2011 5-year estimates to 2010 5-year estimates. In order to look at changes over the maximum period of time I therefore choose to compare median household incomes between the 2010 and 2017 5-year estimates, representing incomes over the timeframes 2006-2010 and 2013-2017.

I first do a bar chart showing the 2010 ACS 5-year estimates for median household income for each county, arranging the bars in order of decreasing income from left to right. I do this as follows:

  1. I start with the table mhi_2010_adj from above.
  2. I shorten the names of the counties by removing “, Maryland” from the end of the name, as well as the word “County” if present.
  3. I change NAME into a categorical variable (or “factor” in R-speak) and order the variable values (the county names) so that they are in order of decreasing median household income.
  4. I plot the data with county on the x-axis and median household income on the y-axis.
  5. I use geom_col to plot one bar for each county, with the height of the bar corresponding to the county’s median household income.
  6. I put tick marks on the y-axis corresponding to dollar values for every $20,000 interval from zero to $120,000.
  7. I customize the x-axis and y-axis labels, title, and caption as desired. Among other things, I rotate the county names by 60 degrees to make them more readable.
mhi_2010_adj %>%
  mutate(NAME = sub("( County)?, Maryland$", "", NAME)) %>%
  mutate(NAME = fct_reorder(NAME, -estimate)) %>%
  ggplot() +
  geom_col(aes(x = NAME, y = estimate)) +
  scale_y_continuous(labels = scales::dollar, breaks = seq(0, 120000, 20000)) +
  xlab("County") +
  ylab("Income") +
  labs(
    title = "Maryland Median Household Income by County",
    subtitle = "2010 ACS 5-Year Estimates in 2017 Dollars",
    caption = "Data source:\n  U.S. Census Bureau, American Community Survey, Table B25077\nCreated using the tidyverse and tidycensus R packages"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

To show another view of the same data I also map the counties, coloring each county according to its median household income in the 2010 5-year estimate. I do this as follows:

  1. I start with the table county_sf containing the geometry for the county boundaries.
  2. I then join that table with the table mhi_2010_adj, using the common variable GEOID.
  3. I plot the resulting table, using the value of the estimate variable to determine the fill color for each county.
  4. I use geom_sf() to draw the county physical boundaries in white.
  5. I use scale_fill_viridis() to specify the colors for each county along a predefined spectrum of color values, and to create a legend.
  6. I add a title, subtitle, and caption for the map.
  7. Finally I blank out the axis tick marks and axis labels (which would normally show latitude and longitude) and make the background of the graph blank for a cleaner look.
county_sf %>%
  inner_join(mhi_2010_adj, by = "GEOID") %>%
  ggplot() +
  geom_sf(aes(fill = estimate, geometry = geometry), color = "white", size = 0.4) +
  scale_fill_viridis(
    option = "plasma",
    name = "Income",
    labels = scales::dollar,
    limits = c(40000, 120000),
    breaks = seq(40000, 120000, 20000)
  ) +
  labs(
    title = "Maryland Median Household Income by County",
    subtitle = "2010 ACS 5-Year Estimates in 2017 Dollars",
    caption = "Data sources:\n  U.S. Census Bureau, American Community Survey, Table B25077\n  U.S. Census Bureau, Cartographic boundaries\nCreated using the tidyverse and tidycensus R packages"
  ) +
  theme(plot.caption = element_text(hjust = 0)) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())

I next repeat this analysis for the 2017 ACS 5-year estimates, starting with a bar chart. The code is almost identical to that above, except for changing the year where needed.

mhi_2017 %>%
  mutate(NAME = sub("( County)?, Maryland$", "", NAME)) %>%
  mutate(NAME = fct_reorder(NAME, -estimate)) %>%
  ggplot() +
  geom_col(aes(x = NAME, y = estimate)) +
  scale_y_continuous(labels = scales::dollar, breaks = seq(0, 120000, 20000)) +
  xlab("County") +
  ylab("Income") +
  labs(
    title = "Maryland Median Household Income by County",
    subtitle = "2017 ACS 5-Year Estimates in 2017 Dollars",
    caption = "Data source:\n  U.S. Census Bureau, American Community Survey, Table B25077\nCreated using the tidyverse and tidycensus R packages"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

As with the bar chart for the 2017 ACS 5-year estimates, the code for the map for the 2017 ACS 5-year estimates is almost identical to the code for the 2010 map.

county_sf %>%
  inner_join(mhi_2017, by = "GEOID") %>%
  ggplot() +
  geom_sf(aes(fill = estimate, geometry = geometry), color = "white", size = 0.4) +
  scale_fill_viridis(
    option = "plasma",
    name = "Income",
    labels = scales::dollar,
    limits = c(40000, 120000),
    breaks = seq(40000, 120000, 20000)
  ) +
  labs(
    title = "Maryland Median Household Income by County",
    subtitle = "2017 ACS 5-Year Estimates in 2017 Dollars",
    caption = "Data sources:\n  U.S. Census Bureau, American Community Survey, Table B25077\n  U.S. Census Bureau, Cartographic boundaries\nCreated using the tidyverse and tidycensus R packages"
  ) +
  theme(plot.caption = element_text(hjust = 0)) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())

At first glance the bar chart and map for the 2010 and 2017 ACS 5-year estimates look fairly comparable. In the next section I look more closely at the changes between the two timeframes.

Maryland median household income changes by county

I next want to look at the median household income for each Maryland county has changed over time. More specifically, I investigate how the median household income for the various counties changed from 2010 to 2017—or, more correctly, from the 2006-2010 timeframe to the 2013-2017 timeframe. Because I’m more interested in real income changes between 2010 and 2017, I use inflation-adjusted figures expressed in 2017 dollars.

I create the bar chart as follows:

  1. I start with the table mhi_2010_adj containing 2010 median household income estimates expressed in 2017 dollars.
  2. I extend the table with rows from the table mhi_2017 containing 2017 median household income estimates.
  3. I retain only the variables of interest: NAME, GEOID, year, and estimate.
  4. I modify the table so that the median household income estimates for each county are reformatted into two columns, one for each year.
  5. I create a new variable rel_change containing the relative increase or decrease in median household income from 2010 to 2017, expressed as a fraction (not a percentage) of the 2010 value.
  6. I retain only the GEOID, NAME, and rel_change variables.
  7. I modify the county names, convert them into factors, and reorder them in order of decreasing change.
  8. I save the resulting table as mhi_change so that I can use it for both a bar chart and a map.
mhi_change <- mhi_2010_adj %>%
  bind_rows(mhi_2017) %>%
  select(NAME, GEOID, year, estimate) %>%
  spread(year, estimate) %>%
  mutate(rel_change = (`2017` - `2010`) / `2010`) %>%
  select(NAME, GEOID, rel_change) %>%
  mutate(NAME = sub("( County)?, Maryland$", "", NAME)) %>%
  mutate(NAME = fct_reorder(NAME, -rel_change))

mhi_change %>%
  ggplot(aes(x = NAME, y = rel_change)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(-0.20, 0.05, 0.05)) +
  xlab("County") +
  ylab("Change") +
  labs(
    title = "Maryland Median Household Income Changes by County",
    subtitle = "2017 vs. 2010 ACS 5-Year Estimates in 2017 Dollars",
    caption = "Data source:\n  U.S. Census Bureau, American Community Survey, Table B25077\nCreated using the tidyverse and tidycensus R packages"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

All Maryland counties had stagnant or declining real median household income over the timeframe in question, with only Baltimore city showing a small increase.

I also create a map showing the same percentage changes. The computation of the changes from 2010 to 2017 is done in a similar manner to the bar chart above, and creation of the map is done in a similar manner to the map above.

I also create a map showing the same percentage changes, in a similar manner to the maps above.

county_sf %>%
  inner_join(mhi_change, by = "GEOID") %>%
  ggplot(aes(fill = rel_change, geometry = geometry)) +
  geom_sf(color = "white", size = 0.4) +
  scale_fill_viridis(
    option = "plasma",
    name = "Change",
    labels = scales::percent_format(accuracy = 1, trim = FALSE),
    limits = c(-0.30, 0.10),
    breaks = seq(-0.30, 0.10, 0.10)
  ) +
  labs(
    title = "Maryland Median Household Income Changes by County",
    subtitle = "2017 vs. 2010 ACS 5-Year Estimates in 2017 Dollars",
    caption = "Data sources:\n  U.S. Census Bureau, American Community Survey, Table B19013\n  U.S. Census Bureau, Cartographic boundaries\nCreated using the tidyverse and tidycensus R packages"
  ) +
  theme(plot.caption = element_text(hjust = 0)) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())

The steepest declines in real median household income were in rural areas on the Eastern Shore.

Appendix

Caveats

All values for median household income are estimates based on survey samples, with associated margins of error. For estimates at the county level the associated standard errors can be a few thousand dollars.

As noted above, 5-year estimates identified as being for a certain year actually reflect surveys conducted over the prior five years.

References

To obtain the ACS data I used the tidycensus package, which provides an easy-to-use interface to U.S. Census Bureau data made available via a set of public APIs. As its name suggests, the tidycensus package is designed to be compatible with the tidyverse approach to representing and manipulating data.

To obtain the county boundaries I used the tigris package, which is designed to accompany the tidycensus package and provides access to U.S. Census Bureau cartographic boundaries and TIGER/Line shapefiles.

To adjust for inflation I used the CPI-U-RS data produced by the Bureau of Labor Statistics, specifically the “Updated CPI-U-RS, All items, 1977-2018” Excel spreadsheet, as recommended by the U.S. Census Bureau’s guide to comparing 2017 ACS data to prior years.

Suggestions for others

When the 2018 ACS 5-year estimates are available the graphs and maps above could be re-done using 2018 data.

Environment

I used the following R environment in doing the analysis above:

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
## LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] tools     stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] readxl_1.3.1      viridis_0.5.1     viridisLite_0.3.0
##  [4] sf_0.7-5          tigris_0.8.2      tidycensus_0.9.2 
##  [7] forcats_0.4.0     stringr_1.4.0     dplyr_0.8.1      
## [10] purrr_0.3.2       readr_1.3.1       tidyr_0.8.3      
## [13] tibble_2.1.3      ggplot2_3.2.0     tidyverse_1.2.1  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5   xfun_0.8           haven_2.1.1       
##  [4] lattice_0.20-38    colorspace_1.4-1   generics_0.0.2    
##  [7] htmltools_0.3.6    yaml_2.2.0         rlang_0.4.0       
## [10] e1071_1.7-2        pillar_1.4.2       foreign_0.8-71    
## [13] glue_1.3.1         withr_2.1.2        DBI_1.0.0         
## [16] rappdirs_0.3.1     sp_1.3-1           uuid_0.1-2        
## [19] modelr_0.1.4       munsell_0.5.0      gtable_0.3.0      
## [22] cellranger_1.1.0   rvest_0.3.4        evaluate_0.14     
## [25] knitr_1.23         maptools_0.9-5     curl_3.3          
## [28] class_7.3-15       broom_0.5.2        Rcpp_1.0.1        
## [31] KernSmooth_2.23-15 scales_1.0.0       backports_1.1.4   
## [34] classInt_0.3-3     jsonlite_1.6       gridExtra_2.3     
## [37] hms_0.4.2          digest_0.6.20      stringi_1.4.3     
## [40] grid_3.6.0         rgdal_1.4-4        cli_1.1.0         
## [43] magrittr_1.5       lazyeval_0.2.2     crayon_1.3.4      
## [46] pkgconfig_2.0.2    ellipsis_0.2.0.1   xml2_1.2.0        
## [49] lubridate_1.7.4    assertthat_0.2.1   rmarkdown_1.13    
## [52] httr_1.4.0         rstudioapi_0.10    R6_2.4.0          
## [55] units_0.6-3        nlme_3.1-139       compiler_3.6.0

Source code

You can find the source code for this analysis and others at my hocodata public code repository. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.