Maryland Median Household Income Over Time

Introduction

In this document I look at historical trends for Maryland median household income in relation to D.C. and Virginia as well as to the United States as a whole.

For those readers unfamiliar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information check out the the tutorial “Getting started with the Tidyverse”.

Setup and data preparation

Libraries

I use the tidyverse package of functions for general data manipulation, the readxl package (part of the Tidyverse, but not loaded by default) to read Excel files, and the knitr package to produce printed tables. I also use the tools package to get the md5sum function.

library(tidyverse)
library(readxl)
library(knitr)
library(tools)

Data sources

I use data from the following U.S. Census Bureau Excel spreadsheets:

h08.xls: Median household income by state, 1984-2017.
h08a.xls: 2-year average of median household income by state, 1985-2017.
h08b.xls: 3-year average of median household income by state, 1986-2017.

I check to make sure that the versions of the files being used in this analysis are identical to the versions of the files I originally downloaded. I do this by comparing the MD5 checksums of the files against MD5 values I previously computed, and stopping execution if they do not match.

stopifnot(md5sum("h08.xls") == "179175185733538c170127ef638a2c78")
stopifnot(md5sum("h08a.xls") == "fe390b065810e9ba42c52d1db797e723")
stopifnot(md5sum("h08b.xls") == "b742ec83105bc3e8fa351b17469be642")

Reading in and preparing the data

I use the read_xls function to read in the Excel spreadsheets, selecting only those cells that contain the data of interest. I use the col_types parameter to further skip over columns not used in this analysis and to provide the data types for those that are used. I also provide a set of new column names: “geography” for the first column used, and year values for the remaining columns.

As read in, the data has one row for each geography (U.S., state, etc.) and one column for each year. To simplify plotting and other data manipulation I convert the data into so-called “tidy” format, where each row contains the income value for a given year and jurisdiction.

I do this using the gather() function. For each row I take all columns except geography, put the value in each column into a new variable mhi, and convert the column name to a new variable year. I then convert the year values from character strings to integers.

ct <- c("text",
        rep(c("numeric", "skip"), 4),
        "skip", "skip",
        rep(c("numeric", "skip"), 30))

mhi_current <- read_xls("h08.xls",
                        range = "A7:BS58",
                        col_names = c("geography", 2017:1984),
                        col_types = ct) %>%
  gather(year, mhi, -geography) %>%
  mutate(year = as.integer(year))

mhi_constant <- read_xls("h08.xls",
                         range = "A62:BS113",
                         col_names = c("geography", 2017:1984),
                         col_types = ct) %>%
  gather(year, mhi, -geography) %>%
  mutate(year = as.integer(year))

mhi_2ya <- read_xls("h08a.xls",
                    range = "A6:BM57",
                    col_names = c("geography", 2017:2015, 2013:1985),
                    col_types = c("text",
                                  rep(c("numeric", "skip"), 32))) %>%
  gather(year, mhi, -geography) %>%
  mutate(year = as.integer(year))

mhi_3ya <- read_xls("h08b.xls",
                    range = "A6:BK57",
                    col_names = c("geography", 2017:2015, 2013:1986),
                    col_types = c("text",
                                  rep(c("numeric", "skip"), 31))) %>%
  gather(year, mhi, -geography) %>%
  mutate(year = as.integer(year))

Analysis

I do analyses to answer the following two questions:

How does Maryland median household income compare to the median household income for the U.S. as a whole, as well as to the median household incomes for the neighboring jurisdictions of Virginia and the District of Columbia?
What are the states with the highest median household income, and where does Maryland rank among them?

Maryland median household income historical trends

I use ggplot and associated functions to plot a series of graphs showing estimated Maryland median household income in relation to income estimates for D.C. and Virginia, as well as in relation to the United States overall.

The first graph is for 1-year estimates expressed in current dollars (i.e., in dollars as of each year in question, without adjusting for inflation). I create the graph as follows:

I start with the mhi_current table.
I filter out all rows except for those for the jurisdictions of interest.
I use ggplot to create the plot, using geom_line to plot a line for each of the groups of data points (corresponding to the individual jurisdictions). I use color to distinguish between the lines for the different jurisdictions, using a special color palette designed to be more distinguishable for people who have various forms of colorblindness.
I specify an x-axis running from 1980 to 2020, with tick marks every five years.
I add labels for the x- and y-axes, along with a plot title, subtitle, and caption.
I use the theme_minimal theme for a clean look, and then tweak it slightly for readability, displaying the x-axis tick mark labels at an angle, moving the x- and y-axis labels slightly away from the tick mark labels, and moving the caption slightly lower.

# Colors are assigned to geographies based on alphabetical order.
cbPalette <- c("#E69F00",  # D.C.
               "#000000",  # Maryland
           "#999999",  # United States
               "#56B4E9")  # Virginia

# The order here determines the order geographies are listed in the legend.
geographies <-  c("Maryland", "D.C.", "Virginia", "United States")

mhi_current %>%
  filter(geography %in% geographies) %>%
  ggplot(aes(x = year, y = mhi, color = geography)) +
  geom_line(size = 0.8) +
  scale_color_manual(values = cbPalette, breaks = geographies) +
  scale_x_continuous(breaks=seq(1980, 2020, 5)) +
  scale_y_continuous(labels = scales::dollar) +
  xlab("Year") +
  ylab("Median Household Income") +
  labs(title="Median Household Income for Maryland vs. Virginia and D.C.",
       subtitle="Estimates in Current Dollars (Not Inflation-Adjusted)",
       caption="Data source: U.S. Census Bureau, Historical Income Tables: Households, Table H-8") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=10))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15))) +
  theme(legend.title = element_blank())

Maryland has consistently had the highest median household incomes of the three local jurisdictions, but the District of Columbia has caught up in recent years.

I repeat the same plot, but using inflation-adjusted values.

mhi_constant %>%
  filter(geography %in% geographies) %>%
  ggplot(aes(x = year, y = mhi, color = geography)) +
  geom_line(size = 0.8) +
  scale_x_continuous(breaks = seq(1980, 2020, 5)) +
  scale_y_continuous(labels = scales::dollar) +
  scale_color_manual(values = cbPalette, breaks = geographies) +
  xlab("Year") +
  ylab("Median Household Income") +
  labs(title="Median Household Income for Maryland vs. Virginia and D.C.",
       subtitle="Estimates in 2017 Dollars (Inflation-Adjusted)",
       caption="Data source: U.S. Census Bureau, Historical Income Tables: Households, Table H-8") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=10))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15))) +
  theme(legend.title = element_blank())

Except for the District of Columbia, median household income has not increased very much in real terms over time: less than $10,000 in 2017 dollars over 30 years, or about $300 per year—-significantly less than 1% per year.

I repeat the previous plot, but this time using 2-year averages.

mhi_2ya %>%
  filter(geography %in% geographies) %>%
  ggplot(aes(x = year, y = mhi, color = geography)) +
  geom_line(size = 0.8) +
  scale_x_continuous(breaks = seq(1980, 2020, 5)) +
  scale_y_continuous(breaks = seq(50000, 80000, 10000),
                     labels = scales::dollar) +
  scale_color_manual(values = cbPalette, breaks = geographies) +
  xlab("Year") +
  ylab("Median Household Income") +
  labs(title="Median Household Income for Maryland vs. Virginia and D.C.",
       subtitle="2-Year Averages, Estimates in 2017 Dollars (Inflation-Adjusted)",
       caption="Data source: U.S. Census Bureau, Historical Income Tables: Households, Table H-8A") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=10))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15))) +
  theme(legend.title = element_blank())

I repeat the previous plot, but this time using 3-year averages.

mhi_3ya %>%
  filter(geography %in% geographies) %>%
  ggplot(aes(x = year, y = mhi, color = geography)) +
  geom_line(size = 0.8) +
  scale_x_continuous(breaks = seq(1980, 2020, 5)) +
  scale_y_continuous(breaks = seq(50000, 80000, 10000),
                     labels = scales::dollar) +
  scale_color_manual(values = cbPalette, breaks = geographies) +
  xlab("Year") +
  ylab("Median Household Income") +
  labs(title="Median Household Income for Maryland vs. Virginia and D.C.",
       subtitle="3-Year Averages, Estimates in 2017 Dollars (Inflation-Adjusted)",
       caption="Data source: U.S. Census Bureau, Historical Income Tables: Households, Table H-8B") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=10))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15))) +
  theme(legend.title = element_blank())

Finally, I plot median household incomes for the jurisdictions of interest as a percentage of the U.S. median household incomes, using the 3-year averages. I do this as follows:

Again I start with the mhi_3ya table.
I retain only the year, geography, and mhi variables.
I use spread to change the table to make each geography into a separate column, with the column value being the mhi value. This produces a table with one row for each year.
I create an anonymous function that will divide a column’s value (the median household income) by the value for the column corresponding to the United States overall and then multiply by 100 to create a pecentage. I then apply that function to each column other than the first (which holds the year value). (Note that the “.” in ncol(.) references the (temporary) table against which mutate_at is executed.)
I then use gather to reverse the effect of spread and create a new table mhi_relative that has a similar form to mhi_3ya, with each row representing a combination of year, geography, and (relative) median household income.
Finally I use ggplot and related functions to create the actual graph for the geographies of interest.

mhi_relative <- mhi_3ya %>%
  spread(geography, mhi) %>%
  mutate_at(2:ncol(.), function(x) 100 * x / .[["United States"]]) %>%
  gather(geography, mhi, -year)
  
mhi_relative %>%
  filter(geography %in% geographies) %>%
  ggplot(aes(x = year, y = mhi, color = geography)) +
  geom_line(size = 0.8) +
  scale_x_continuous(breaks = seq(1980, 2020, 5)) +
  scale_y_continuous() +
  scale_color_manual(values = cbPalette, breaks = geographies) +
  xlab("Year") +
  ylab("% of U.S. Median Household Income") +
  labs(title="Relative Median Household Income for Maryland, D.C., and Virginia",
       subtitle="3-Year Averages as Percentage of U.S. Median Household Income",
       caption="Data source: U.S. Census Bureau, Historical Income Tables: Households, Table H-8B") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=10))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15))) +
  theme(legend.title = element_blank())

Maryland median household income has been about 130% or more of U.S. median household income in the past few years, while Virginia median household income has been about 120% or more, with a significant drop in the last decade.

In contrast the District of Columbia has undergone a startling change from the 1990s, when its median household income was less than 90% of U.S. median household income, to the present day, when its median household income rivals that of Maryland.

Ranking of states based on median household incomes

Both Maryland and Virginia have had median household incomes above the U.S. median for some time now, with D.C. recently joining them. I now compute the rankings of each state (and D.C.) for each year, and look at how the rankings of the local jurisdiction have changed over time. I do this as follows:

I start with the mhi_3ya table created above.
I filter out the rows for the U.S. as a whole.
I group the rows by year, and then use the min_rank function to compute a new variable ranking representing the relative rank of each state (and D.C.) for that year, with rank 1 being the state with the highest median household income.

mhi_rankings <- mhi_3ya %>%
  filter(geography != "United States") %>%
  group_by(year) %>%
  mutate(ranking = min_rank(desc(mhi)))

I then graph the rankings for Maryland, D.C., and Virginia. Because the list of geographies in the graph is different from that used in the graphs above, I modify the graph’s palette to ensure that the mapping of colors to geographies remains the same as in the other graphs. I also reverse the scale on the y-axis so that geographies with higher rankings will be listed at the top.

# Note: As before, colors are assigned to geographies based on their
# alphabetical order. Since some geographies have been removed from
# the graph I use a modified palette to maintain the same mapping of
# color to geography as in previous graphs.
cbPalette2 <- c("#E69F00",  # D.C.
                "#000000",  # Maryland
                "#56B4E9")  # Virginia

# Note: As before, the order of this list determines the order that
# geographies are listed in the legend, from top to bottom.
geographies2 <- c("Maryland", "D.C.", "Virginia")

mhi_rankings %>%
  filter(geography %in% geographies2) %>%
  ggplot(aes(x = year, y = ranking, color = geography)) +
  geom_line(size = 0.8) +
  scale_x_continuous(breaks=seq(1980, 2020, 5)) +
  scale_color_manual(values = cbPalette2, breaks = geographies2) +
  scale_y_reverse(breaks = seq(0, 40, 5)) +
  xlab("Year") +
  ylab("Rank") +
  labs(title="Rank of Maryland vs. D.C. and Virginia",
       subtitle="Based on 3-Year Estimates of U.S. Median Household Income",
       caption="Data source: U.S. Census Bureau, Historical Income Tables: Households, Table H-8B") +
  theme_minimal() +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=10))) +
  theme(axis.title.y=element_text(margin=margin(r=10))) +
  theme(plot.caption=element_text(margin=margin(t=15))) +
  theme(legend.title = element_blank())

Maryland has consistently been in the top ten states based on median household income, and has recently been joined in the top five by D.C. Virginia was in the top ten for much of the past 30 years, but has recently fallen out of it.

Finally, to compare Maryland with other affluent states from across the U.S., I create a table top_20 of the highest-ranking states (and D.C.) by median household income in 2017 based on the 3-year average. (In other words, this ranking also reflects the income values for 2015 and 2016.) I do this as follows:

I start with the mhi_rankings table created above, which contains rankings grouped by year.
I remove the grouping by year. (This makes it possible to remove the year variable from the table being created.)
I retain only the rank values, the geography names, and the median household income values.
I sort the resulting list in ascending order by rank.
I retain the first 20 rows, representing those states (and D.C.) ranked from 1 to 20.

top_20 <- mhi_rankings %>%
  ungroup() %>%
  filter(year == 2017) %>%
  select(ranking, geography, mhi) %>%
  arrange(ranking) %>%
  rename(Rank = ranking, Jurisdiction = geography, `Median Household Income` = mhi) %>%
  top_n(20)

## Selecting by Median Household Income

Finally I print top_20 as a formatted table.

States with the Highest 3-Year Average Median Household Incomes in 2017
Rank	Jurisdiction	Median Household Income
1	Maryland	77525
2	New Hampshire	77000
3	D.C.	76131
4	Alaska	75766
5	Connecticut	75251
6	Massachusetts	72421
7	Washington	72272
8	Colorado	71720
9	Minnesota	71587
10	Hawaii	71337
11	New Jersey	71222
12	Utah	69602
13	California	67891
14	Virginia	67597
15	Illinois	63274
16	Oregon	62654
17	Pennsylvania	62649
18	Vermont	62503
19	Rhode Island	62291
20	Iowa	62270

Appendix

Caveats

All values for median household income are estimates based on survey samples, with associated margins of error. For the 1-year Maryland estimates the associated standard errors are a few thousand dollars. For the multi-year averages the standard errors are smaller.

In 2013 the Census Bureau did two separate surveys using different subsamples. See footnotes 38 and 39 in the Historical Income Table Footnotes for a discussion of the differences. For this analysis I chose the sets of values collected according to footnote 38.

References

National and state data for 1984 through 2017 are from the U.S. Census Bureau “Historical Income Tables: Households”, Table H-8, “Median Household Income by State”. This table includes income values in both constant (2017) dollars and current dollars (i.e., current as of the year in question).

Note that there are two alternative sources for this data: Table H-8A, “Median Household Income by State - 2 Year Average”, and Table H-8B, “Median Household Income by State - 3 Year Average” These tables are in constant 2017 dollars only.

I downloaded all spreadsheets directly from the U.S. Census Bureau web site and did not make any changes to them.

Suggestions for future work

The plots above could be extended to include additional jurisdictions for comparison.

Environment

I used the following R environment in doing the analysis above:

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
## LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] tools     stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] knitr_1.22      readxl_1.3.1    forcats_0.4.0   stringr_1.4.0  
##  [5] dplyr_0.8.1     purrr_0.3.2     readr_1.3.1     tidyr_0.8.3    
##  [9] tibble_2.1.1    ggplot2_3.1.1   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1       highr_0.8        cellranger_1.1.0 pillar_1.4.0    
##  [5] compiler_3.6.0   plyr_1.8.4       digest_0.6.18    lubridate_1.7.4 
##  [9] jsonlite_1.6     evaluate_0.13    nlme_3.1-139     gtable_0.3.0    
## [13] lattice_0.20-38  pkgconfig_2.0.2  rlang_0.3.4      cli_1.1.0       
## [17] rstudioapi_0.10  yaml_2.2.0       haven_2.1.0      xfun_0.7        
## [21] withr_2.1.2      xml2_1.2.0       httr_1.4.0       hms_0.4.2       
## [25] generics_0.0.2   grid_3.6.0       tidyselect_0.2.5 glue_1.3.1      
## [29] R6_2.4.0         rematch_1.0.1    rmarkdown_1.12   modelr_0.1.4    
## [33] magrittr_1.5     backports_1.1.4  scales_1.0.0     htmltools_0.3.6 
## [37] rvest_0.3.4      assertthat_0.2.1 colorspace_1.4-1 labeling_0.3    
## [41] stringi_1.4.3    lazyeval_0.2.2   munsell_0.5.0    broom_0.5.2     
## [45] crayon_1.3.4

Source code

You can find the source code for this analysis and others at my hocodata public code repository. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.