Introduction

In this document I look at historical trends for Maryland median household income in relation to D.C. and Virginia as well as to the United States as a whole.

For those readers unfamiliar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information check out the the tutorial “Getting started with the Tidyverse”.

Setup and data preparation

Libraries

I use the tidyverse package of functions for general data manipulation, the readxl package (part of the Tidyverse, but not loaded by default) to read Excel files, and the knitr package to produce printed tables. I also use the tools package to get the md5sum function.

library(tidyverse)
library(readxl)
library(knitr)
library(tools)

Data sources

I use data from the following U.S. Census Bureau Excel spreadsheets:

  • h08.xls: Median household income by state, 1984-2017.
  • h08a.xls: 2-year average of median household income by state, 1985-2017.
  • h08b.xls: 3-year average of median household income by state, 1986-2017.

I check to make sure that the versions of the files being used in this analysis are identical to the versions of the files I originally downloaded. I do this by comparing the MD5 checksums of the files against MD5 values I previously computed, and stopping execution if they do not match.

stopifnot(md5sum("h08.xls") == "179175185733538c170127ef638a2c78")
stopifnot(md5sum("h08a.xls") == "fe390b065810e9ba42c52d1db797e723")
stopifnot(md5sum("h08b.xls") == "b742ec83105bc3e8fa351b17469be642")

Reading in and preparing the data

I use the read_xls function to read in the Excel spreadsheets, selecting only those cells that contain the data of interest. I use the col_types parameter to further skip over columns not used in this analysis and to provide the data types for those that are used. I also provide a set of new column names: “geography” for the first column used, and year values for the remaining columns.

As read in, the data has one row for each geography (U.S., state, etc.) and one column for each year. To simplify plotting and other data manipulation I convert the data into so-called “tidy” format, where each row contains the income value for a given year and jurisdiction.

I do this using the gather() function. For each row I take all columns except geography, put the value in each column into a new variable mhi, and convert the column name to a new variable year. I then convert the year values from character strings to integers.

ct <- c("text",
        rep(c("numeric", "skip"), 4),
        "skip", "skip",
        rep(c("numeric", "skip"), 30))

mhi_current <- read_xls("h08.xls",
                        range = "A7:BS58",
                        col_names = c("geography", 2017:1984),
                        col_types = ct) %>%
  gather(year, mhi, -geography) %>%
  mutate(year = as.integer(year))

mhi_constant <- read_xls("h08.xls",
                         range = "A62:BS113",
                         col_names = c("geography", 2017:1984),
                         col_types = ct) %>%
  gather(year, mhi, -geography) %>%
  mutate(year = as.integer(year))

mhi_2ya <- read_xls("h08a.xls",
                    range = "A6:BM57",
                    col_names = c("geography", 2017:2015, 2013:1985),
                    col_types = c("text",
                                  rep(c("numeric", "skip"), 32))) %>%
  gather(year, mhi, -geography) %>%
  mutate(year = as.integer(year))

mhi_3ya <- read_xls("h08b.xls",
                    range = "A6:BK57",
                    col_names = c("geography", 2017:2015, 2013:1986),
                    col_types = c("text",
                                  rep(c("numeric", "skip"), 31))) %>%
  gather(year, mhi, -geography) %>%
  mutate(year = as.integer(year))

Analysis

I do analyses to answer the following two questions:

Ranking of states based on median household incomes

Both Maryland and Virginia have had median household incomes above the U.S. median for some time now, with D.C. recently joining them. I now compute the rankings of each state (and D.C.) for each year, and look at how the rankings of the local jurisdiction have changed over time. I do this as follows:

  1. I start with the mhi_3ya table created above.
  2. I filter out the rows for the U.S. as a whole.
  3. I group the rows by year, and then use the min_rank function to compute a new variable ranking representing the relative rank of each state (and D.C.) for that year, with rank 1 being the state with the highest median household income.
mhi_rankings <- mhi_3ya %>%
  filter(geography != "United States") %>%
  group_by(year) %>%
  mutate(ranking = min_rank(desc(mhi)))

I then graph the rankings for Maryland, D.C., and Virginia. Because the list of geographies in the graph is different from that used in the graphs above, I modify the graph’s palette to ensure that the mapping of colors to geographies remains the same as in the other graphs. I also reverse the scale on the y-axis so that geographies with higher rankings will be listed at the top.

# Note: As before, colors are assigned to geographies based on their
# alphabetical order. Since some geographies have been removed from
# the graph I use a modified palette to maintain the same mapping of
# color to geography as in previous graphs.
cbPalette2 <- c("#E69F00",  # D.C.
                "#000000",  # Maryland
                "#56B4E9")  # Virginia

# Note: As before, the order of this list determines the order that
# geographies are listed in the legend, from top to bottom.
geographies2 <- c("Maryland", "D.C.", "Virginia")

mhi_rankings %>%
  filter(geography %in% geographies2) %>%
  ggplot(aes(x = year, y = ranking, color = geography)) +
  geom_line(size = 0.8) +
  scale_x_continuous(breaks=seq(1980, 2020, 5)) +
  scale_color_manual(values = cbPalette2, breaks = geographies2) +
  scale_y_reverse(breaks = seq(0, 40, 5)) +
  xlab("Year") +
  ylab("Rank") +
  labs(title="Rank of Maryland vs. D.C. and Virginia",
       subtitle="Based on 3-Year Estimates of U.S. Median Household Income",
       caption="Data source: U.S. Census Bureau, Historical Income Tables: Households, Table H-8B") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=10))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15))) +
  theme(legend.title = element_blank())

Maryland has consistently been in the top ten states based on median household income, and has recently been joined in the top five by D.C. Virginia was in the top ten for much of the past 30 years, but has recently fallen out of it.

Finally, to compare Maryland with other affluent states from across the U.S., I create a table top_20 of the highest-ranking states (and D.C.) by median household income in 2017 based on the 3-year average. (In other words, this ranking also reflects the income values for 2015 and 2016.) I do this as follows:

  1. I start with the mhi_rankings table created above, which contains rankings grouped by year.
  2. I remove the grouping by year. (This makes it possible to remove the year variable from the table being created.)
  3. I retain only the rank values, the geography names, and the median household income values.
  4. I sort the resulting list in ascending order by rank.
  5. I retain the first 20 rows, representing those states (and D.C.) ranked from 1 to 20.
top_20 <- mhi_rankings %>%
  ungroup() %>%
  filter(year == 2017) %>%
  select(ranking, geography, mhi) %>%
  arrange(ranking) %>%
  rename(Rank = ranking, Jurisdiction = geography, `Median Household Income` = mhi) %>%
  top_n(20)
## Selecting by Median Household Income

Finally I print top_20 as a formatted table.

States with the Highest 3-Year Average Median Household Incomes in 2017
Rank Jurisdiction Median Household Income
1 Maryland 77525
2 New Hampshire 77000
3 D.C. 76131
4 Alaska 75766
5 Connecticut 75251
6 Massachusetts 72421
7 Washington 72272
8 Colorado 71720
9 Minnesota 71587
10 Hawaii 71337
11 New Jersey 71222
12 Utah 69602
13 California 67891
14 Virginia 67597
15 Illinois 63274
16 Oregon 62654
17 Pennsylvania 62649
18 Vermont 62503
19 Rhode Island 62291
20 Iowa 62270

Appendix

Caveats

All values for median household income are estimates based on survey samples, with associated margins of error. For the 1-year Maryland estimates the associated standard errors are a few thousand dollars. For the multi-year averages the standard errors are smaller.

In 2013 the Census Bureau did two separate surveys using different subsamples. See footnotes 38 and 39 in the Historical Income Table Footnotes for a discussion of the differences. For this analysis I chose the sets of values collected according to footnote 38.

References

National and state data for 1984 through 2017 are from the U.S. Census Bureau “Historical Income Tables: Households”, Table H-8, “Median Household Income by State”. This table includes income values in both constant (2017) dollars and current dollars (i.e., current as of the year in question).

Note that there are two alternative sources for this data: Table H-8A, “Median Household Income by State - 2 Year Average”, and Table H-8B, “Median Household Income by State - 3 Year Average” These tables are in constant 2017 dollars only.

I downloaded all spreadsheets directly from the U.S. Census Bureau web site and did not make any changes to them.

Suggestions for future work

The plots above could be extended to include additional jurisdictions for comparison.

Environment

I used the following R environment in doing the analysis above:

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
## LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] tools     stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] knitr_1.22      readxl_1.3.1    forcats_0.4.0   stringr_1.4.0  
##  [5] dplyr_0.8.1     purrr_0.3.2     readr_1.3.1     tidyr_0.8.3    
##  [9] tibble_2.1.1    ggplot2_3.1.1   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1       highr_0.8        cellranger_1.1.0 pillar_1.4.0    
##  [5] compiler_3.6.0   plyr_1.8.4       digest_0.6.18    lubridate_1.7.4 
##  [9] jsonlite_1.6     evaluate_0.13    nlme_3.1-139     gtable_0.3.0    
## [13] lattice_0.20-38  pkgconfig_2.0.2  rlang_0.3.4      cli_1.1.0       
## [17] rstudioapi_0.10  yaml_2.2.0       haven_2.1.0      xfun_0.7        
## [21] withr_2.1.2      xml2_1.2.0       httr_1.4.0       hms_0.4.2       
## [25] generics_0.0.2   grid_3.6.0       tidyselect_0.2.5 glue_1.3.1      
## [29] R6_2.4.0         rematch_1.0.1    rmarkdown_1.12   modelr_0.1.4    
## [33] magrittr_1.5     backports_1.1.4  scales_1.0.0     htmltools_0.3.6 
## [37] rvest_0.3.4      assertthat_0.2.1 colorspace_1.4-1 labeling_0.3    
## [41] stringi_1.4.3    lazyeval_0.2.2   munsell_0.5.0    broom_0.5.2     
## [45] crayon_1.3.4

Source code

You can find the source code for this analysis and others at my hocodata public code repository. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.