In this document I look at historical trends for Maryland median household income in relation to D.C. and Virginia as well as to the United States as a whole.
For those readers unfamiliar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information check out the the tutorial “Getting started with the Tidyverse”.
I use the tidyverse package of functions for general data manipulation, the readxl package (part of the Tidyverse, but not loaded by default) to read Excel files, and the knitr package to produce printed tables. I also use the tools package to get the md5sum
function.
library(tidyverse)
library(readxl)
library(knitr)
library(tools)
I use data from the following U.S. Census Bureau Excel spreadsheets:
h08.xls
: Median household income by state, 1984-2017.h08a.xls
: 2-year average of median household income by state, 1985-2017.h08b.xls
: 3-year average of median household income by state, 1986-2017.I check to make sure that the versions of the files being used in this analysis are identical to the versions of the files I originally downloaded. I do this by comparing the MD5 checksums of the files against MD5 values I previously computed, and stopping execution if they do not match.
stopifnot(md5sum("h08.xls") == "179175185733538c170127ef638a2c78")
stopifnot(md5sum("h08a.xls") == "fe390b065810e9ba42c52d1db797e723")
stopifnot(md5sum("h08b.xls") == "b742ec83105bc3e8fa351b17469be642")
I use the read_xls
function to read in the Excel spreadsheets, selecting only those cells that contain the data of interest. I use the col_types
parameter to further skip over columns not used in this analysis and to provide the data types for those that are used. I also provide a set of new column names: “geography” for the first column used, and year values for the remaining columns.
As read in, the data has one row for each geography (U.S., state, etc.) and one column for each year. To simplify plotting and other data manipulation I convert the data into so-called “tidy” format, where each row contains the income value for a given year and jurisdiction.
I do this using the gather()
function. For each row I take all columns except geography
, put the value in each column into a new variable mhi
, and convert the column name to a new variable year
. I then convert the year
values from character strings to integers.
ct <- c("text",
rep(c("numeric", "skip"), 4),
"skip", "skip",
rep(c("numeric", "skip"), 30))
mhi_current <- read_xls("h08.xls",
range = "A7:BS58",
col_names = c("geography", 2017:1984),
col_types = ct) %>%
gather(year, mhi, -geography) %>%
mutate(year = as.integer(year))
mhi_constant <- read_xls("h08.xls",
range = "A62:BS113",
col_names = c("geography", 2017:1984),
col_types = ct) %>%
gather(year, mhi, -geography) %>%
mutate(year = as.integer(year))
mhi_2ya <- read_xls("h08a.xls",
range = "A6:BM57",
col_names = c("geography", 2017:2015, 2013:1985),
col_types = c("text",
rep(c("numeric", "skip"), 32))) %>%
gather(year, mhi, -geography) %>%
mutate(year = as.integer(year))
mhi_3ya <- read_xls("h08b.xls",
range = "A6:BK57",
col_names = c("geography", 2017:2015, 2013:1986),
col_types = c("text",
rep(c("numeric", "skip"), 31))) %>%
gather(year, mhi, -geography) %>%
mutate(year = as.integer(year))
I do analyses to answer the following two questions:
I use ggplot
and associated functions to plot a series of graphs showing estimated Maryland median household income in relation to income estimates for D.C. and Virginia, as well as in relation to the United States overall.
The first graph is for 1-year estimates expressed in current dollars (i.e., in dollars as of each year in question, without adjusting for inflation). I create the graph as follows:
mhi_current
table.ggplot
to create the plot, using geom_line
to plot a line for each of the groups of data points (corresponding to the individual jurisdictions). I use color to distinguish between the lines for the different jurisdictions, using a special color palette designed to be more distinguishable for people who have various forms of colorblindness.theme_minimal
theme for a clean look, and then tweak it slightly for readability, displaying the x-axis tick mark labels at an angle, moving the x- and y-axis labels slightly away from the tick mark labels, and moving the caption slightly lower.# Colors are assigned to geographies based on alphabetical order.
cbPalette <- c("#E69F00", # D.C.
"#000000", # Maryland
"#999999", # United States
"#56B4E9") # Virginia
# The order here determines the order geographies are listed in the legend.
geographies <- c("Maryland", "D.C.", "Virginia", "United States")
mhi_current %>%
filter(geography %in% geographies) %>%
ggplot(aes(x = year, y = mhi, color = geography)) +
geom_line(size = 0.8) +
scale_color_manual(values = cbPalette, breaks = geographies) +
scale_x_continuous(breaks=seq(1980, 2020, 5)) +
scale_y_continuous(labels = scales::dollar) +
xlab("Year") +
ylab("Median Household Income") +
labs(title="Median Household Income for Maryland vs. Virginia and D.C.",
subtitle="Estimates in Current Dollars (Not Inflation-Adjusted)",
caption="Data source: U.S. Census Bureau, Historical Income Tables: Households, Table H-8") +
theme_minimal() +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.title.x=element_text(margin=margin(t=10))) +
theme(axis.title.y=element_text(margin=margin(r=10))) +
theme(plot.caption=element_text(margin=margin(t=15))) +
theme(legend.title = element_blank())
Maryland has consistently had the highest median household incomes of the three local jurisdictions, but the District of Columbia has caught up in recent years.
I repeat the same plot, but using inflation-adjusted values.
mhi_constant %>%
filter(geography %in% geographies) %>%
ggplot(aes(x = year, y = mhi, color = geography)) +
geom_line(size = 0.8) +
scale_x_continuous(breaks = seq(1980, 2020, 5)) +
scale_y_continuous(labels = scales::dollar) +
scale_color_manual(values = cbPalette, breaks = geographies) +
xlab("Year") +
ylab("Median Household Income") +
labs(title="Median Household Income for Maryland vs. Virginia and D.C.",
subtitle="Estimates in 2017 Dollars (Inflation-Adjusted)",
caption="Data source: U.S. Census Bureau, Historical Income Tables: Households, Table H-8") +
theme_minimal() +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.title.x=element_text(margin=margin(t=10))) +
theme(axis.title.y=element_text(margin=margin(r=10))) +
theme(plot.caption=element_text(margin=margin(t=15))) +
theme(legend.title = element_blank())
Except for the District of Columbia, median household income has not increased very much in real terms over time: less than $10,000 in 2017 dollars over 30 years, or about $300 per year—-significantly less than 1% per year.
I repeat the previous plot, but this time using 2-year averages.
mhi_2ya %>%
filter(geography %in% geographies) %>%
ggplot(aes(x = year, y = mhi, color = geography)) +
geom_line(size = 0.8) +
scale_x_continuous(breaks = seq(1980, 2020, 5)) +
scale_y_continuous(breaks = seq(50000, 80000, 10000),
labels = scales::dollar) +
scale_color_manual(values = cbPalette, breaks = geographies) +
xlab("Year") +
ylab("Median Household Income") +
labs(title="Median Household Income for Maryland vs. Virginia and D.C.",
subtitle="2-Year Averages, Estimates in 2017 Dollars (Inflation-Adjusted)",
caption="Data source: U.S. Census Bureau, Historical Income Tables: Households, Table H-8A") +
theme_minimal() +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.title.x=element_text(margin=margin(t=10))) +
theme(axis.title.y=element_text(margin=margin(r=10))) +
theme(plot.caption=element_text(margin=margin(t=15))) +
theme(legend.title = element_blank())
I repeat the previous plot, but this time using 3-year averages.
mhi_3ya %>%
filter(geography %in% geographies) %>%
ggplot(aes(x = year, y = mhi, color = geography)) +
geom_line(size = 0.8) +
scale_x_continuous(breaks = seq(1980, 2020, 5)) +
scale_y_continuous(breaks = seq(50000, 80000, 10000),
labels = scales::dollar) +
scale_color_manual(values = cbPalette, breaks = geographies) +
xlab("Year") +
ylab("Median Household Income") +
labs(title="Median Household Income for Maryland vs. Virginia and D.C.",
subtitle="3-Year Averages, Estimates in 2017 Dollars (Inflation-Adjusted)",
caption="Data source: U.S. Census Bureau, Historical Income Tables: Households, Table H-8B") +
theme_minimal() +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.title.x=element_text(margin=margin(t=10))) +
theme(axis.title.y=element_text(margin=margin(r=10))) +
theme(plot.caption=element_text(margin=margin(t=15))) +
theme(legend.title = element_blank())
Finally, I plot median household incomes for the jurisdictions of interest as a percentage of the U.S. median household incomes, using the 3-year averages. I do this as follows:
mhi_3ya
table.year
, geography
, and mhi
variables.spread
to change the table to make each geography into a separate column, with the column value being the mhi
value. This produces a table with one row for each year.year
value). (Note that the “.” in ncol(.)
references the (temporary) table against which mutate_at
is executed.)gather
to reverse the effect of spread
and create a new table mhi_relative
that has a similar form to mhi_3ya
, with each row representing a combination of year, geography, and (relative) median household income.ggplot
and related functions to create the actual graph for the geographies of interest.mhi_relative <- mhi_3ya %>%
spread(geography, mhi) %>%
mutate_at(2:ncol(.), function(x) 100 * x / .[["United States"]]) %>%
gather(geography, mhi, -year)
mhi_relative %>%
filter(geography %in% geographies) %>%
ggplot(aes(x = year, y = mhi, color = geography)) +
geom_line(size = 0.8) +
scale_x_continuous(breaks = seq(1980, 2020, 5)) +
scale_y_continuous() +
scale_color_manual(values = cbPalette, breaks = geographies) +
xlab("Year") +
ylab("% of U.S. Median Household Income") +
labs(title="Relative Median Household Income for Maryland, D.C., and Virginia",
subtitle="3-Year Averages as Percentage of U.S. Median Household Income",
caption="Data source: U.S. Census Bureau, Historical Income Tables: Households, Table H-8B") +
theme_minimal() +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.title.x=element_text(margin=margin(t=10))) +
theme(axis.title.y=element_text(margin=margin(r=10))) +
theme(plot.caption=element_text(margin=margin(t=15))) +
theme(legend.title = element_blank())
Maryland median household income has been about 130% or more of U.S. median household income in the past few years, while Virginia median household income has been about 120% or more, with a significant drop in the last decade.
In contrast the District of Columbia has undergone a startling change from the 1990s, when its median household income was less than 90% of U.S. median household income, to the present day, when its median household income rivals that of Maryland.
Both Maryland and Virginia have had median household incomes above the U.S. median for some time now, with D.C. recently joining them. I now compute the rankings of each state (and D.C.) for each year, and look at how the rankings of the local jurisdiction have changed over time. I do this as follows:
mhi_3ya
table created above.min_rank
function to compute a new variable ranking
representing the relative rank of each state (and D.C.) for that year, with rank 1 being the state with the highest median household income.mhi_rankings <- mhi_3ya %>%
filter(geography != "United States") %>%
group_by(year) %>%
mutate(ranking = min_rank(desc(mhi)))
I then graph the rankings for Maryland, D.C., and Virginia. Because the list of geographies in the graph is different from that used in the graphs above, I modify the graph’s palette to ensure that the mapping of colors to geographies remains the same as in the other graphs. I also reverse the scale on the y-axis so that geographies with higher rankings will be listed at the top.
# Note: As before, colors are assigned to geographies based on their
# alphabetical order. Since some geographies have been removed from
# the graph I use a modified palette to maintain the same mapping of
# color to geography as in previous graphs.
cbPalette2 <- c("#E69F00", # D.C.
"#000000", # Maryland
"#56B4E9") # Virginia
# Note: As before, the order of this list determines the order that
# geographies are listed in the legend, from top to bottom.
geographies2 <- c("Maryland", "D.C.", "Virginia")
mhi_rankings %>%
filter(geography %in% geographies2) %>%
ggplot(aes(x = year, y = ranking, color = geography)) +
geom_line(size = 0.8) +
scale_x_continuous(breaks=seq(1980, 2020, 5)) +
scale_color_manual(values = cbPalette2, breaks = geographies2) +
scale_y_reverse(breaks = seq(0, 40, 5)) +
xlab("Year") +
ylab("Rank") +
labs(title="Rank of Maryland vs. D.C. and Virginia",
subtitle="Based on 3-Year Estimates of U.S. Median Household Income",
caption="Data source: U.S. Census Bureau, Historical Income Tables: Households, Table H-8B") +
theme_minimal() +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.title.x=element_text(margin=margin(t=10))) +
theme(axis.title.y=element_text(margin=margin(r=10))) +
theme(plot.caption=element_text(margin=margin(t=15))) +
theme(legend.title = element_blank())
Maryland has consistently been in the top ten states based on median household income, and has recently been joined in the top five by D.C. Virginia was in the top ten for much of the past 30 years, but has recently fallen out of it.
Finally, to compare Maryland with other affluent states from across the U.S., I create a table top_20
of the highest-ranking states (and D.C.) by median household income in 2017 based on the 3-year average. (In other words, this ranking also reflects the income values for 2015 and 2016.) I do this as follows:
mhi_rankings
table created above, which contains rankings grouped by year.year
variable from the table being created.)top_20 <- mhi_rankings %>%
ungroup() %>%
filter(year == 2017) %>%
select(ranking, geography, mhi) %>%
arrange(ranking) %>%
rename(Rank = ranking, Jurisdiction = geography, `Median Household Income` = mhi) %>%
top_n(20)
## Selecting by Median Household Income
Finally I print top_20
as a formatted table.
Rank | Jurisdiction | Median Household Income |
---|---|---|
1 | Maryland | 77525 |
2 | New Hampshire | 77000 |
3 | D.C. | 76131 |
4 | Alaska | 75766 |
5 | Connecticut | 75251 |
6 | Massachusetts | 72421 |
7 | Washington | 72272 |
8 | Colorado | 71720 |
9 | Minnesota | 71587 |
10 | Hawaii | 71337 |
11 | New Jersey | 71222 |
12 | Utah | 69602 |
13 | California | 67891 |
14 | Virginia | 67597 |
15 | Illinois | 63274 |
16 | Oregon | 62654 |
17 | Pennsylvania | 62649 |
18 | Vermont | 62503 |
19 | Rhode Island | 62291 |
20 | Iowa | 62270 |
All values for median household income are estimates based on survey samples, with associated margins of error. For the 1-year Maryland estimates the associated standard errors are a few thousand dollars. For the multi-year averages the standard errors are smaller.
In 2013 the Census Bureau did two separate surveys using different subsamples. See footnotes 38 and 39 in the Historical Income Table Footnotes for a discussion of the differences. For this analysis I chose the sets of values collected according to footnote 38.
National and state data for 1984 through 2017 are from the U.S. Census Bureau “Historical Income Tables: Households”, Table H-8, “Median Household Income by State”. This table includes income values in both constant (2017) dollars and current dollars (i.e., current as of the year in question).
Note that there are two alternative sources for this data: Table H-8A, “Median Household Income by State - 2 Year Average”, and Table H-8B, “Median Household Income by State - 3 Year Average” These tables are in constant 2017 dollars only.
I downloaded all spreadsheets directly from the U.S. Census Bureau web site and did not make any changes to them.
The plots above could be extended to include additional jurisdictions for comparison.
I used the following R environment in doing the analysis above:
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
## LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] tools stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] knitr_1.22 readxl_1.3.1 forcats_0.4.0 stringr_1.4.0
## [5] dplyr_0.8.1 purrr_0.3.2 readr_1.3.1 tidyr_0.8.3
## [9] tibble_2.1.1 ggplot2_3.1.1 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 highr_0.8 cellranger_1.1.0 pillar_1.4.0
## [5] compiler_3.6.0 plyr_1.8.4 digest_0.6.18 lubridate_1.7.4
## [9] jsonlite_1.6 evaluate_0.13 nlme_3.1-139 gtable_0.3.0
## [13] lattice_0.20-38 pkgconfig_2.0.2 rlang_0.3.4 cli_1.1.0
## [17] rstudioapi_0.10 yaml_2.2.0 haven_2.1.0 xfun_0.7
## [21] withr_2.1.2 xml2_1.2.0 httr_1.4.0 hms_0.4.2
## [25] generics_0.0.2 grid_3.6.0 tidyselect_0.2.5 glue_1.3.1
## [29] R6_2.4.0 rematch_1.0.1 rmarkdown_1.12 modelr_0.1.4
## [33] magrittr_1.5 backports_1.1.4 scales_1.0.0 htmltools_0.3.6
## [37] rvest_0.3.4 assertthat_0.2.1 colorspace_1.4-1 labeling_0.3
## [41] stringi_1.4.3 lazyeval_0.2.2 munsell_0.5.0 broom_0.5.2
## [45] crayon_1.3.4
You can find the source code for this analysis and others at my hocodata public code repository. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.