In this document I do some basic analyses on median household income of Howard County, Maryland over time compared to other local jurisdictions.
For those readers unfamiliar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information check out the the tutorial “Getting started with the Tidyverse”.
I use the tidyverse package of functions for general data manipulation, the readxl package (part of the Tidyverse, but not loaded by default) to read Excel spreadsheets, the knitr package to create an inline table using the kable
function, and the tools package to get the md5sum
function.
library("tidyverse")
library("readxl")
library("knitr")
library("tools")
I use data from the following sources; see the References section below for more information:
ACS_05_EST_B19013.csv
. Median household income 1-year estimates for the U.S., all states, and all counties (and county-equivalent geographies) for 2005. NOTE: The 1-year estimates do not include counties and county-equivalents with populations less than 65,000 people. Only 5-year estimates are available for these geographies.ACS_06_EST_B19013.csv
. Same data for 2006.ACS_07_1YR_B19013.csv
through ACS_17_1YR_B19013.csv
. Same data for 2007 through 2017.allitems.xlsx
. Consumer Price Index data used to adjust for inflation.I check to make sure that the versions of the files being used in this analysis are identical to the versions of the files I originally downloaded. I do this by comparing the MD5 checksums of the files against MD5 values I previously computed, and stopping execution if they do not match.
stopifnot(md5sum("ACS_05_EST_B19013.csv") == "83d2f4e4d941b2df9bbc600b1c581b9a")
stopifnot(md5sum("ACS_06_EST_B19013.csv") == "9814594aa476d72fe34db653f4b812ab")
stopifnot(md5sum("ACS_07_1YR_B19013.csv") == "d7786fa929289a2ab1aef3faaf65bb11")
stopifnot(md5sum("ACS_08_1YR_B19013.csv") == "ee2b141db5b59c6b66b75ca386244e3e")
stopifnot(md5sum("ACS_09_1YR_B19013.csv") == "675e1b729f43ee93f5cc58222b6fe6db")
stopifnot(md5sum("ACS_10_1YR_B19013.csv") == "415bc328bf0f869170b163a73e3ead1e")
stopifnot(md5sum("ACS_11_1YR_B19013.csv") == "de90254b51a547765dd3e4f3575fd51d")
stopifnot(md5sum("ACS_12_1YR_B19013.csv") == "0b0e3ff4f5a6717a9f6e7d8dbc0a95d4")
stopifnot(md5sum("ACS_13_1YR_B19013.csv") == "be67896e52ca070347a3a598ea896de2")
stopifnot(md5sum("ACS_14_1YR_B19013.csv") == "2ba769f819a0fb695a092d2d4f514580")
stopifnot(md5sum("ACS_15_1YR_B19013.csv") == "8ab9bf16870d6c401559afad5bb4506b")
stopifnot(md5sum("ACS_16_1YR_B19013.csv") == "e51169bd90138554447fe81a97f0390f")
stopifnot(md5sum("ACS_17_1YR_B19013.csv") == "a164336c7addb3eecfba09abb1dc2264")
stopifnot(md5sum("allitems.xlsx") == "f890a4aff139ca003246d17491d51272")
The CSV files all have the same format: two lines of header information followed by one row for each geography, containing the following variables:
For the analysis below I want all the data in one big table mhi_tbl
, with an additional field for the year value. I do this as follows:
mhi_files
, and a corresponding list mhi_years
of the years for those files.mhi_tbl
with five variables corresponding to the five columns of each CSV file–geo_id
, geo_id2
, geography
, mhi
, and moe
–as well as an additional variable year
. To do this I take advantage of the fact that the read_csv
function can take its input from a string literal, in this case a newline-terminated blank line.mhi_tbl
table.year
set to the year value corresponding to the CSV file.mhi_tbl
table.mhi_tbl
rather than the beginning.fix_nonascii_geography
which will take a value of the geo_id
variable and for the problematic geographies will return a modified name. (For all other geographies it will simply return the name modified.)fix_nonascii_geography
function, I use it to change the value of the geography
variable in the mhi_tbl
table. Note that this is a vector operation, so the fix_nonascii_geography
function has to be able to take a vector as an argument. That’s why it uses the ifelse
function, which can work on vectors.mhi_files <- c("ACS_05_EST_B19013.csv",
"ACS_06_EST_B19013.csv",
"ACS_07_1YR_B19013.csv",
"ACS_08_1YR_B19013.csv",
"ACS_09_1YR_B19013.csv",
"ACS_10_1YR_B19013.csv",
"ACS_11_1YR_B19013.csv",
"ACS_12_1YR_B19013.csv",
"ACS_13_1YR_B19013.csv",
"ACS_14_1YR_B19013.csv",
"ACS_15_1YR_B19013.csv",
"ACS_16_1YR_B19013.csv",
"ACS_17_1YR_B19013.csv")
mhi_years <- 2005:2017
mhi_cols <- c("geo_id", "geo_id2", "geography", "mhi", "moe")
mhi_tbl <- read_csv("\n", skip = 2, col_names = c(mhi_cols, "year"), col_types = "cccddd")
for (i in seq_along(mhi_years)) {
mhi_tbl <- bind_rows(mhi_tbl,
mutate(year = mhi_years[[i]],
read_csv(mhi_files[[i]],
skip = 2,
col_names = mhi_cols,
col_types = "cccdd")))
}
# For geographies with non-ASCII or missing names, return a usable name.
fix_nonascii_geography <- function(g, j) {
ifelse(g == "0500000US72021", "Bayamon Municipio, Puerto Rico",
ifelse(g == "0500000US72097", "Mayaguez Municipio, Puerto Rico",
ifelse(g == "0500000US35013", "Dona Ana County, New Mexico",
j
)))
}
mhi_tbl <- mhi_tbl %>%
mutate(geography = fix_nonascii_geography(geo_id, geography))
The CSV files above contain 1-year estimates that are expressed in dollars as of the year in question. (For example, the 1-year estimates for 2010 are in 2010 dollars.) In order to adjust values for inflation I read in an Excel spreadsheet containing consumer price index data from 1978 to 2018. The index values are expressed as a percentage of the 1997 index (set at 100).
The spreadsheet has two sheets, each of which has values for all twelve months of the year plus an average figure for the entire year. The sheets differ in whether or not the monthly figures are seasonally-adjusted. Since I am doing adjustments for entire years I set col_types
to skip over the monthly values, and read only the first sheet (which is not seasonally-adjusted).
cpi_tbl <- read_excel("allitems.xlsx",
sheet = 1,
skip = 8,
col_names = c("year", "index"),
col_types = c("numeric", rep("skip", 12), "numeric"))
I want to express all income values as 2017 dollars. I therefore take the index value for 2017 and divide all other index values by the 2017 index value to create a new table adj_tbl
. The variable adj
in this table can be multiplied by income values for each year to express them in 2017 dollars.
index_2017 <- cpi_tbl %>%
filter(year == 2017) %>%
select(index) %>%
as.integer()
adj_tbl <- cpi_tbl %>%
mutate(adj = index_2017 / index) %>%
select(year, adj)
I do analyses to answer the following questions:
For my first graph I do a simple line plot of how the median household income for Howard County and selected other geographies have changed over time. I don’t attempt to adjust the income values for inflation, so they are in current dollars for the year they were estimated.
I create the graph as follows:
mhi_tbl
table.ggplot
to create the plot, using geom_line
to plot a line for each of the groups of data points (corresponding to the individual geographies). I use color to distinguish between the lines for the different geographies, with a special “colorblind-friendly” palette designed to be more readable by people with various forms of color blindess. (See the comments in the code for more information on how the color assignment is done.)Year
), along with a plot title, subtitle, and caption.theme_minimal
theme for a clean look, and then tweak it slightly for readability, displaying the x-axis tick mark labels at an angle, moving the x- and y-axis labels slightly away from the tick mark labels, moving the caption slightly lower, and removing the title on the graph legend.# Note: The order of geographies in this list determines the order
# in which they are listed in the graph legends. To improve readability
# I list the geographies in order from highest to lowest median
# household income in 2017 (the last year of data graphed).
geographies <- c("Loudoun County, Virginia",
"Fairfax County, Virginia",
"Stafford County, Virginia",
"Howard County, Maryland",
"Montgomery County, Maryland",
"Anne Arundel County, Maryland",
"District of Columbia",
"United States",
"Baltimore city, Maryland")
# Note: Colors are assigned to geographies based on their
# alphabetical order. To improve readability I assign the colors
# black to Howard County and gray to the United States.
cbPalette <- c("#E69F00", # Anne Arundel County, Maryland
"#56B4E9", # Baltimore city, Maryland
"#009E73", # District of Columbia
"#F0E442", # Fairfax County, Virginia
"#000000", # Howard County, Maryland
"#0072B2", # Loudoun County, Virginia
"#D55E00", # Montgomery County, Maryland
"#CC79A7", # Stafford County, Virginia
"#999999") # United States
mhi_tbl %>%
filter(geography %in% geographies) %>%
ggplot(aes(x = year, y = mhi, color = geography)) +
geom_line(size = 0.8) +
scale_color_manual(values = cbPalette, breaks = geographies) +
scale_x_continuous(breaks=seq(2004, 2018, 2)) +
scale_y_continuous(labels = scales::dollar) +
xlab("Year") +
ylab("Median Household Income") +
labs(title="Median Household Income for Howard County vs. Other Jurisdictions",
subtitle="1-Year Estimates in Current Dollars (Not Inflation-Adjusted)",
caption="Data source: U.S. Census Bureau, American Community Survey, Table B19013") +
theme_minimal() +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.title.x=element_text(margin=margin(t=10))) +
theme(axis.title.y=element_text(margin=margin(r=10))) +
theme(plot.caption=element_text(margin=margin(t=15))) +
theme(legend.title = element_blank())
For my next graph I repeat the same plot, but add error bars to show the margins of error for the median household income estimates. I use position_dodge()
so that bars for two different geographies in the same year don’t display right on top of each other.
mhi_tbl %>%
filter(geography %in% geographies) %>%
ggplot() +
geom_line(aes(x = year, y = mhi, color = geography), size = 0.8) +
geom_errorbar(aes(x = year, ymin = mhi - moe, ymax = mhi + moe, color = geography), width=.2, position = position_dodge(.2)) +
scale_color_manual(values = cbPalette, breaks = geographies) +
scale_x_continuous(breaks = seq(2004, 2018, 2)) +
scale_y_continuous(labels = scales::dollar) +
xlab("Year") +
ylab("Median Household Income") +
labs(title="Median Household Income for Howard County vs. Other Jurisdictions",
subtitle="1-Year Estimates in Current Dollars (Not Inflation-Adjusted) Including Margins of Error (90%)",
caption="Data source: U.S. Census Bureau, American Community Survey, Table B19013") +
theme_minimal() +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.title.x=element_text(margin=margin(t=10))) +
theme(axis.title.y=element_text(margin=margin(r=10))) +
theme(plot.caption=element_text(margin=margin(t=15))) +
theme(legend.title = element_blank())
I then repeat the first plot, this time using constant 2017 dollars. The adjustment is done as follows, creating a new table mhi_adj_tbl
:
mhi_tbl
table.adj_tbl
table containing inflation adjustment figures for each year, using year
as the common variable. This has the effect of adding a new column containing the variable adj
. This variable has the value 1 for the year 2017, and for other years contains the adjustment factor needed to convert incomes into 2017 dollars.mhi
and moe
variables to convert them to 2017 dollars.I then plot the graph as before, first filtering for the geographies of interest.
mhi_adj_tbl <- mhi_tbl %>%
inner_join(adj_tbl, by = "year") %>%
mutate(mhi = mhi * adj, moe = moe * adj)
mhi_adj_tbl %>%
filter(geography %in% geographies) %>%
ggplot(aes(x = year, y = mhi, color = geography)) +
geom_line(size = 0.8) +
scale_color_manual(values = cbPalette, breaks = geographies) +
scale_x_continuous(breaks=seq(2004, 2018, 2)) +
scale_y_continuous(labels = scales::dollar) +
xlab("Year") +
ylab("Median Household Income") +
labs(title="Median Household Income for Howard County vs. Other Jurisdictions",
subtitle="1-Year Estimates in 2017 Dollars",
caption="Data source: U.S. Census Bureau, American Community Survey, Table B19013") +
theme_minimal() +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.title.x=element_text(margin=margin(t=10))) +
theme(axis.title.y=element_text(margin=margin(r=10))) +
theme(plot.caption=element_text(margin=margin(t=15))) +
theme(legend.title = element_blank())
I then redo the graph to show error bars:
mhi_adj_tbl %>%
filter(geography %in% geographies) %>%
ggplot() +
geom_line(aes(x = year, y = mhi, color = geography), size = 0.8) +
geom_errorbar(aes(x = year, ymin = mhi - moe, ymax = mhi + moe, color = geography), width=.2, position = position_dodge(.3)) +
scale_color_manual(values = cbPalette, breaks = geographies) +
scale_x_continuous(breaks = seq(2004, 2018, 2)) +
scale_y_continuous(labels = scales::dollar) +
xlab("Year") +
ylab("Median Household Income") +
labs(title="Median Household Income for Howard County vs. Other Jurisdictions",
subtitle="1-Year Estimates in 2017 Dollars Including Margins of Error (90%)",
caption="Data source: U.S. Census Bureau, American Community Survey, Table B19013") +
theme_minimal() +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.title.x=element_text(margin=margin(t=10))) +
theme(axis.title.y=element_text(margin=margin(r=10))) +
theme(plot.caption=element_text(margin=margin(t=15))) +
theme(legend.title = element_blank())
Next I plot median household incomes for the geographies of interest as a percentage of the overall U.S. median household income. I do this as follows:
mhi_tbl
table. (Because I am comparing incomes relative to one another, it does not matter whether I use inflation-adjusted values or not.)year
, geography
, and mhi
variables.spread
to change the table to make each geography into a separate column, with the column value being the mhi
value. This produces a table with one row for each year.year
value). (Note that the .
in ncol(.)
references the (temporary) table against which mutate_at
is executed.)gather
to reverse the effect of spread
and create a new table mhi_rel_tbl
that has a similar form to mhi_tbl
, with each row representing a combination of year, geography, and (relative) median household income.ggplot
and related functions to create the actual graph for the geographies of interest.mhi_rel_tbl <- mhi_tbl %>%
select(year, geography, mhi) %>%
spread(geography, mhi) %>%
mutate_at(2:ncol(.), function(x) 100 * x / .[["United States"]]) %>%
gather(geography, mhi, -year)
mhi_rel_tbl %>%
filter(geography %in% geographies) %>%
ggplot(aes(x = year, y = mhi, color = geography)) +
geom_line(size = 0.8) +
scale_color_manual(values = cbPalette, breaks = geographies) +
scale_x_continuous(breaks = seq(1980, 2020, 5)) +
scale_y_continuous() +
xlab("Year") +
ylab("% of U.S. Median Household Income") +
labs(title="Relative Median Household Income for Howard County, Maryland",
subtitle="1-Year Estimates as Percentage of U.S. Median Household Income",
caption="Data source: U.S. Census Bureau, American Community Survey, Table B19013") +
theme_minimal() +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.title.x=element_text(margin=margin(t=10))) +
theme(axis.title.y=element_text(margin=margin(r=10))) +
theme(plot.caption=element_text(margin=margin(t=15))) +
theme(legend.title = element_blank())
To see how the ranking of Howard County versus other large counties (with populations over 65,000) has changed over time, I first compute rankings for each year as follows:
mhi_tbl
table created above.geo_id
variable.the
min_rankfunction to compute a new variable
ranking` representing the relative rank of each county or county-equivalent for that year, with rank 1 being the county with highest median household income.mhi_rank_tbl <- mhi_tbl %>%
filter(str_sub(geo_id, 1, 2) == "05") %>%
group_by(year) %>%
mutate(ranking = min_rank(desc(mhi)))
I then graph the rankings for Howard County and selected other counties in Maryland and Virginia. Because the list of geographies in the graph is different from that used in the graphs above, I modify the graph’s palette to ensure that the mapping of colors to geographies remains the same as in the other graphs.
# Note: As before, the order of this list determines the order that
# geographies are listed in the legend, from top to bottom.
geographies2 <- c("Loudoun County, Virginia",
"Fairfax County, Virginia",
"Stafford County, Virginia",
"Howard County, Maryland",
"Montgomery County, Maryland",
"Anne Arundel County, Maryland")
# Note: As below, colors are assigned to geographies based on their
# alphabetical order. Since some geographies have been removed from
# the graph I use a modified palette to maintain the same mapping of
# color to geography as in previous graphs.
cbPalette2 <- c("#E69F00", # Anne Arundel County, Maryland
"#F0E442", # Fairfax County, Virginia
"#000000", # Howard County, Maryland
"#0072B2", # Loudoun County, Virginia
"#D55E00", # Montgomery County, Maryland
"#CC79A7") # Stafford County, Virginia
mhi_rank_tbl %>%
filter(geography %in% geographies2) %>%
ggplot(aes(x = year, y = ranking, color = geography)) +
geom_line(size = 0.8) +
scale_x_continuous(breaks=seq(2004, 2018, 2)) +
scale_color_manual(values = cbPalette2, breaks = geographies2) +
scale_y_reverse(breaks = seq(0, 40, 5)) +
xlab("Year") +
ylab("Rank") +
labs(title="Ranking Howard County, Maryland vs. Other Jurisdictions",
subtitle="Based on 1-Year Estimates of U.S. Median Household Income",
caption="Data source: U.S. Census Bureau, American Community Survey, Table B19013") +
theme_minimal() +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(axis.title.x=element_text(margin=margin(t=10))) +
theme(axis.title.y=element_text(margin=margin(r=10))) +
theme(plot.caption=element_text(margin=margin(t=15))) +
theme(legend.title = element_blank())
Finally, to compare Howard County with other affluent counties from across the U.S., I create a table top_30_tbl
of the highest-ranking counties (or county-equivalent geographies) by median household income in 2017. (Again, this is restricted to counties or county-equivalents with ppopulations over 65,000 that are represented in the ACS 1-year estimates.)
I do this as follows:
mhi_rank_tbl
table created above, which contains rankings grouped by year.year
variable from the table being created.)top_30_tbl <- mhi_rank_tbl %>%
ungroup() %>%
filter(year == 2017) %>%
select(ranking, geography, mhi) %>%
rename(Rank = ranking, Jurisdiction = geography, `Median Household Income` = mhi) %>%
arrange(Rank) %>%
top_n(30)
## Selecting by Median Household Income
Finally I print top_30_tbl
as a formatted table.
Rank | Jurisdiction | Median Household Income |
---|---|---|
1 | Loudoun County, Virginia | 135842 |
2 | Santa Clara County, California | 119035 |
3 | Fairfax County, Virginia | 118279 |
4 | Arlington County, Virginia | 117237 |
5 | San Mateo County, California | 116653 |
6 | Morris County, New Jersey | 114732 |
7 | Marin County, California | 113908 |
8 | Stafford County, Virginia | 112795 |
9 | Somerset County, New Jersey | 112153 |
10 | Hunterdon County, New Jersey | 112038 |
11 | Douglas County, Colorado | 111482 |
12 | Howard County, Maryland | 111473 |
13 | San Francisco County, California | 110816 |
14 | Nassau County, New York | 108133 |
15 | Delaware County, Ohio | 106933 |
16 | Williamson County, Tennessee | 105622 |
17 | Montgomery County, Maryland | 103235 |
18 | Forsyth County, Georgia | 102084 |
19 | Carver County, Minnesota | 101271 |
20 | Prince William County, Virginia | 100845 |
21 | Norfolk County, Massachusetts | 100829 |
22 | Calvert County, Maryland | 100590 |
23 | Alexandria city, Virginia | 100530 |
24 | Putnam County, New York | 99479 |
25 | Fauquier County, Virginia | 98585 |
26 | Middlesex County, Massachusetts | 98555 |
27 | Scott County, Minnesota | 98538 |
28 | Monmouth County, New Jersey | 98270 |
29 | Charles County, Maryland | 97986 |
30 | Anne Arundel County, Maryland | 97085 |
All values for median household income are estimates based on survey samples, with associated margins of error. For county-level estimates the associated standard errors are a few thousand dollars, as can be seen in the graph above showing error bars.
As noted above, rankings reflect only larger counties and county-equivalents (with populations over 65,000), since smaller counties and county-equivalents are not represented in the ACS 1-year estimates.
To obtain the CSV files used in this analysis I used the American Factfinder site maintained by the U.S. Census Bureau. More specifically, I did the following:
To adjust for inflation I used the CPI-U-RS data produced by the Bureau of Labor Statistics, specifically the “Updated CPI-U-RS, All items, 1977-2018” Excel spreadsheet, as recommended by the U.S. Census Bureau’s guide to comparing 2017 ACS data to prior years.
The plots above could be extended to include additional geographies for comparison. However this would require modifying the color palette used for the line plots, since the colorblind-friendly palette used above has only nine colors, including gray and black.
I used the following R environment in doing the analysis above:
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
## LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] tools stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] knitr_1.22 readxl_1.3.1 forcats_0.4.0 stringr_1.4.0
## [5] dplyr_0.8.1 purrr_0.3.2 readr_1.3.1 tidyr_0.8.3
## [9] tibble_2.1.1 ggplot2_3.1.1 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 highr_0.8 cellranger_1.1.0 pillar_1.4.0
## [5] compiler_3.6.0 plyr_1.8.4 digest_0.6.18 lubridate_1.7.4
## [9] jsonlite_1.6 evaluate_0.13 nlme_3.1-139 gtable_0.3.0
## [13] lattice_0.20-38 pkgconfig_2.0.2 rlang_0.3.4 cli_1.1.0
## [17] rstudioapi_0.10 yaml_2.2.0 haven_2.1.0 xfun_0.7
## [21] withr_2.1.2 xml2_1.2.0 httr_1.4.0 hms_0.4.2
## [25] generics_0.0.2 grid_3.6.0 tidyselect_0.2.5 glue_1.3.1
## [29] R6_2.4.0 rmarkdown_1.12 modelr_0.1.4 magrittr_1.5
## [33] backports_1.1.4 scales_1.0.0 htmltools_0.3.6 rvest_0.3.4
## [37] assertthat_0.2.1 colorspace_1.4-1 labeling_0.3 stringi_1.4.3
## [41] lazyeval_0.2.2 munsell_0.5.0 broom_0.5.2 crayon_1.3.4
You can find the source code for this analysis and others at my hocodata public code repository. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.