Introduction

In this document I look at historical trends for Maryland median household income in relation to nearby jurisdictions.

For those readers unfamiliar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information check out the the tutorial “Getting started with the Tidyverse”.

Setup and data preparation

Libraries

I use the tidyverse package of functions for general data manipulation, the readxl package (part of the Tidyverse, but not loaded by default) to read Excel files, and the knitr package to produce printed tables. I also use the tools package to get the md5sum function.

library(tidyverse)
library(readxl)
library(knitr)
library(tools)

Data sources

I use data from the following U.S. Census Bureau Excel spreadsheets:

  • h08.xls: Median household income by state, 1984-2017.
  • h08a.xls: 2-Year average of median household income by state, 1985-2017.
  • h08b.xls: 3-Year average of median household income by state, 1986-2017.

I check to make sure that the versions of the files being used in this analysis are identical to the versions of the files I originally downloaded. I do this by comparing the MD5 checksums of the files against MD5 values I previously computed, and stopping execution if they do not match.

stopifnot(md5sum("h08.xls") == "179175185733538c170127ef638a2c78")
stopifnot(md5sum("h08a.xls") == "fe390b065810e9ba42c52d1db797e723")
stopifnot(md5sum("h08b.xls") == "b742ec83105bc3e8fa351b17469be642")

Reading in and preparing the data

I use the read_xls function to read in the Excel spreadsheets, selecting only those cells that contain the data of interest. I use the col_types parameter to further skip over columns not used in this analysis and to provide the data types for those that are used. I also provide a set of new column names.

As read in, the data has one row for each jurisdiction (U.S., state, etc.) and one column for each year. To simplify plotting and other data manipulation I convert the data into so-called “tidy” format, where each row contains the income value for a given year and jurisdiction.

I do this using the gather(Year, Income, -Jurisdiction) function. For each row I take all columns except Jurisdiction, put the value in each column into a new variable Income, and convert the column name to a new variable Year. I then convert the Year values from character strings to integers.

ct <- c("text",
        rep(c("numeric", "skip"), 4),
        "skip", "skip",
        rep(c("numeric", "skip"), 30))

mhi_current <- read_xls("h08.xls",
                        range = "A7:BS58",
                        col_names = c("Jurisdiction", 2017:1984),
                        col_types = ct) %>%
  gather(Year, Income, -Jurisdiction) %>%
  mutate(Year = as.integer(Year))

mhi_constant <- read_xls("h08.xls",
                         range = "A62:BS113",
                         col_names = c("Jurisdiction", 2017:1984),
                         col_types = ct) %>%
  gather(Year, Income, -Jurisdiction) %>%
  mutate(Year = as.integer(Year))

mhi_2ya <- read_xls("h08a.xls",
                    range = "A6:BM57",
                    col_names = c("Jurisdiction", 2017:2015, 2013:1985),
                    col_types = c("text",
                                  rep(c("numeric", "skip"), 32))) %>%
  gather(Year, Income, -Jurisdiction) %>%
  mutate(Year = as.integer(Year))

mhi_3ya <- read_xls("h08b.xls",
                    range = "A6:BK57",
                    col_names = c("Jurisdiction", 2017:2015, 2013:1986),
                    col_types = c("text",
                                  rep(c("numeric", "skip"), 31))) %>%
  gather(Year, Income, -Jurisdiction) %>%
  mutate(Year = as.integer(Year))

Analysis

I do analyses to answer the following two questions:

Jurisdictions with the highest median household incomes

I next display a list of jurisdictions with the highest estimated median household incomes in 2017, as follows:

  1. I start with the mh_current table.
  2. I filter for data from 2017.
  3. I remove the Year variable (since I no longer need it).
  4. I arrange the resulting table in descending order by the variable Income.
  5. I take the first 20 jurisdictions in the list.
top_20 <- mhi_current %>%
  filter(Year == 2017) %>%
  select(-Year) %>%
  arrange(desc(Income)) %>%
  top_n(20)
## Selecting by Income

I then use the kable function to print the resulting table.

Jurisdictions with the Highest Median Household Incomes in 2017
Jurisdiction Income
D.C. 83382
Maryland 81084
Washington 75418
New Hampshire 74801
Colorado 74172
Hawaii 73575
Massachusetts 73227
New Jersey 72997
Connecticut 72780
Alaska 72231
Minnesota 71920
Utah 71319
Virginia 71293
California 69759
Rhode Island 66390
Oregon 64610
Illinois 64609
Vermont 63805
Iowa 63481
Wisconsin 63451

Maryland is the state with the highest median household income, but as noted above for 2017 it is slightly outranked by the District of Columbia. Virginia is some ways doen the list.

Appendix

Caveats

All values for median household income are estimates based on survey samples, with associated margins of error. For the 1-year Maryland estimates the associated standard errors are a few thousand dollars. For the multi-year averages the standard errors are smaller.

In 2013 the Census Bureau did two separate surveys using different subsamples. See footnotes 38 and 39 in the Historical Income Table Footnotes for a discussion of the differences. For this analysis I chose the sets of values collected according to footnote 38.

References

National and state data for 1984 through 2017 are from the U.S. Census Bureau “Historical Income Tables: Households”, Table H-8, “Median Household Income by State”. This table includes income values in both constant (2017) dollars and current dollars (i.e., current as of the year in question).

Note that there are two alternative sources for this data: Table H-8A, “Median Household Income by State - 2 Year Average”, and Table H-8B, “Median Household Income by State - 3 Year Average” These tables are in constant 2017 dollars only.

I downloaded all spreadsheets directly from the U.S. Census Bureau web site and did not make any changes to them.

Suggestions for future work

The plots above could be extended to include additional jurisdictions for comparison.

The analysis of the jurisdictions with the highest median household incomes could be extended to include a trend plot showing how the relative rankings of Maryland, D.C., Virginia, and other jurisdictions have changed over time.

Environment

I used the following R environment in doing the analysis above:

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] tools     stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] knitr_1.22      readxl_1.3.1    forcats_0.4.0   stringr_1.4.0  
##  [5] dplyr_0.8.0.1   purrr_0.3.2     readr_1.3.1     tidyr_0.8.3    
##  [9] tibble_2.1.1    ggplot2_3.1.1   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1       highr_0.8        cellranger_1.1.0 pillar_1.3.1    
##  [5] compiler_3.6.0   plyr_1.8.4       digest_0.6.18    lubridate_1.7.4 
##  [9] jsonlite_1.6     evaluate_0.13    nlme_3.1-139     gtable_0.3.0    
## [13] lattice_0.20-38  pkgconfig_2.0.2  rlang_0.3.4      cli_1.1.0       
## [17] rstudioapi_0.10  yaml_2.2.0       haven_2.1.0      xfun_0.6        
## [21] withr_2.1.2      xml2_1.2.0       httr_1.4.0       hms_0.4.2       
## [25] generics_0.0.2   grid_3.6.0       tidyselect_0.2.5 glue_1.3.1      
## [29] R6_2.4.0         rematch_1.0.1    rmarkdown_1.12   modelr_0.1.4    
## [33] magrittr_1.5     backports_1.1.4  scales_1.0.0     htmltools_0.3.6 
## [37] rvest_0.3.3      assertthat_0.2.1 colorspace_1.4-1 labeling_0.3    
## [41] stringi_1.4.3    lazyeval_0.2.2   munsell_0.5.0    broom_0.5.2     
## [45] crayon_1.3.4

Source code

You can find the source code for this analysis and others at my hocodata public code repository. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.