Introduction

In this document I do some basic analyses on the population density of Howard County, Maryland, census block groups over time. (A census block group is a census geography below the level of census tracts and above the level of census blocks.)

For those readers unfamiliar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information check out the the tutorial “Getting started with the Tidyverse”.

Setup and data preparation

Libraries

I use the following packages for the following purposes:

  • tidyverse: do general data manipulation.
  • tidycensus: retrieve U.S. census data.
  • tigris: retrieve U.S. census geospatial data.
  • sf: manipulate geospatial data.
  • viridis: obtain special color palettes for graphs.
  • scales: enable formatting numbers in legend with commas.
  • knitr: display tabular data.
library(tidyverse)
library(tidycensus)
library(tigris)
library(sf)
library(viridis)
library(scales)
library(knitr)

The tidycensus and tigris packages require a Census API key. I previously obtained such a key and stored it for use by the functions included in the tidycensus and tigris packages.

Data sources

I use data from the following sources; see the References section below for more information:

  • Population estimates for census block groups in 2010 are from the 2010 decennial census table P1, and are accessed via the tidycensus get_decennial() function.
  • Population estimates for census block groups for 2013-2017 are from the American Community Survey table B01003, and are accessed via the tidycensus get_acs() function.
  • Census block group boundaries are from the U.S. Census Bureau cartographic boundaries data for 2010 and 2017, and are accessed via the tigris block_groups() function.
  • Road geometry is also from the U.S. Census Bureau, and is accessed via the tigris roads() function.

Reading in and preparing the data

My goal is to look at population density across Howard County, at a relatively fine-grained level. (For example, are there particular neighborhoods in the county that have relatively low or relatively high population density?) In previous analyses I looked at data at the level of census tracts, of which there are 55 in Howard County.

In this analysis I primarily look at data at the level of a census block group (CBG), a geography that is one level below a census tract. There are currently 154 census block groups in Howard County. CBG-level data is not available in the American Community Survey 1-year estimates, but is available in the 5-year estimates, at least for the 2017 survey (2013-2017 time frame), which is the latest currently available.

The smallest geography in U.S. Census data is the census block, one level below the census block group. There are currently 4,845 census blocks defined for Howard County. The American Community Survey does not provide estimates at the census block level, but such data is available for the 2010 decennial census.

My primary interest is in population, so the only data I need is total population at the census block group level. In theory I could retrieve this data for 2010 by calling the tidycensus get_decennial() function and asking for the “P001P001” variable for the census block group geography. However, there is an apparent bug in the Census API that prevents this from working.

Instead I use the get_decennial() function to retrieve population counts for all census blocks in Howard County for the 2010 census. This function returns data in so-called “tidy” format (one row per observation), with the following columns:

  • GEOID. A 15-character geographic ID combining the 2-digit FIPS code for Maryland (“24”), the 3-digit FIPS Code for Howard County (“027”), a 6-digit identifier for an individual census tract within Howard County and a 4-digit identifier for the census block within the tract.
  • NAME. The name of the census block.
  • variable. References the census variable being retrieved, in this case “P001001”, the total population of the block.
  • value. The variable’s value. (There is no margin of error provided since the decennial census is a comprehensive count, not an estimate constructed from a sample.)

I retain only the GEOID value and the population count, renamed to population for consistency throughout this analysis.

pop_block_2010 <- get_decennial(year = 2010, geography = "block", state = "MD", county = "Howard County", variables = "P001001") %>%
  select(GEOID, value) %>%
  rename(population = value)
## Getting data from the 2010 decennial Census

I then summarize this data at the block group level to get the total population for each census block group. The GEOID for a census block group is the first 12 characters of the GEOID for a block within the group: the 2-digit state FIPS code plus the 3-digit county FIPS code plus the 6-digit FIPS code for the census tract plus a single digit to identify the block group within the tract.

Again I retain only the GEOID and population count, as GEOID and population respectively, and add a new variable year for later use.

pop_cbg_2010 <- pop_block_2010 %>%
  mutate(cbg = substr(GEOID, 1, 12)) %>%
  group_by(cbg) %>%
  summarize(population = sum(population)) %>%
  rename(GEOID = cbg) %>%
  mutate(year = 2010)

I next use the tidycensus get_acs() function to retrieve population estimates from the 2017 ACS 5-year estimates for all census block groups in Howard County. This function also returns data in tidy format, with the following columns:

  • GEOID. An 12-character geographic ID for the block group, as discussed above.
  • NAME. The name of the census block group.
  • variable. References the ACS variable being retrieved, in this case “B01003_001”).
  • estimate. The ACS 5-year estimate for the variable’s value.
  • moe. The margin of error for the estimate.

Again I retain only the GEOID and population count, as GEOID and population respectively, and add the variable year.

pop_cbg_2017 <- get_acs(year = 2017, survey = "acs5", geography = "block group", state = "MD", county = "Howard County", variables = "B01003_001") %>%
  select(GEOID, estimate) %>%
  rename(population = estimate) %>%
  mutate(year = 2017)
## Getting data from the 2013-2017 5-year ACS

I also get population values for Howard County as a whole for both 2010 and 2017, using the same general procedure as above.

pop_county_2010 <- get_decennial(year = 2010, geography = "county", state = "MD", county = "Howard County", variables = "P001001") %>%
  select(GEOID, value) %>%
  rename(population = value) %>%
  mutate(year = 2010)
## Getting data from the 2010 decennial Census
pop_county_2017 <- get_acs(year = 2017, survey = "acs5", geography = "county", state = "MD", county = "Howard County", variables = "B01003_001") %>%
  select(GEOID, estimate) %>%
  rename(population = estimate) %>%
  mutate(year = 2017)
## Getting data from the 2013-2017 5-year ACS

In order to produce maps of population density I need geometry of the cartographic boundaries for all census block groups in Howard County as of 2010 and 2017.

I use the tigris function block_groups() to return the cartographic boundaries in the form of a table compatible with the sf package. The table as returned for 2017 has a variable GEOID that is identical to the variable of the same name returned by get_decennial() and get_acs(). However, the table as returned for 2010 has instead a variable GEO_ID that contains the prefix “1500000US” for the geographic ids. I remove this prefix in order to create a GEOID variable of the same format as for the 2017 boundaries.

In order to compute population densities I need the areas of the census block groups for both 2010 and 2017. For 2010 the area is returned from the block_groups() function as a single variable CENSUSAREA, with values expressed as square miles. I retain this as the variable area.

For 2017 this data is returned in the form of two variables, ALAND and AWATER representing land area and water area respectively, both expressed in units of square meters. I convert the ALAND value to square miles and save the result as a variable area to match that for the 2010 boundaries. (The U.S. Census Bureau recommends using the land area when calculating population density.)

options(tigris_use_cache = TRUE)

area_cbg_2010 <- block_groups(year = 2010, state = "MD", county = "Howard County", cb = TRUE, class = "sf", progress_bar = FALSE) %>%
  rename(area = CENSUSAREA) %>%
  mutate(GEOID = substr(GEO_ID, 10, 21)) %>%
  select(GEOID, area)

area_cbg_2017 <- block_groups(year = 2017, state = "MD", county = "Howard County", cb = TRUE, class = "sf", progress_bar = FALSE) %>%
  mutate(area = ALAND / 2589988) %>%
  select(GEOID, area)

I now want to verify that the census block group boundaries have not changed from 2010 to 2017. Rather than comparing the boundaries directly, I just look at the census block group areas for each year.

areas_2010 <- area_cbg_2010 %>%
  st_drop_geometry() %>%
  select(GEOID, area)

areas_2017 <- area_cbg_2017 %>%
  st_drop_geometry() %>%
  select(GEOID, area)

The geographic IDs for the census block groups should be the same in each. If not I stop the analysis, because a discrepancy indicates some sort of problem.

stopifnot(sort(areas_2010$GEOID) == sort(areas_2017$GEOID))

I then compare the areas for each census block group.

area_diffs <- areas_2010 %>%
  left_join(areas_2017, by = "GEOID") %>%
  mutate(diff_pct = 100 * (area.x - area.y) / area.x)

The maximum difference between any of the areas for 2010 vs. 2017 is about 3.4%. I conclude that any census block group boundary changes from 2010 to 2017 are minimal, and elect to use the 2017 boundaries for the analyses below.

I repeat this process to get the (land) area for Howard County overall. (Since the county’s boundaries have not changed from 2010 to 2017, it does not matter which year I get the data for.) Since the counties() function returns all counties for a given state (in this case Maryland), I filter the results to retain only the data for Howard County (FIPS code “24027”).

options(tigris_use_cache = TRUE)
area_county <- counties(year = 2017, state = "MD", cb = TRUE, class = "sf", progress_bar = FALSE) %>%
  filter(GEOID == "24027") %>%
  mutate(area = ALAND / 2589988) %>%
  select(GEOID, area)

This calculation produces a land area of 251 square miles for Howard County, which matches the value found in the Howard County Wikipedia article.

To help orient readers as to the locations of the census block groups, I also want any maps generated to also display major roads in Howard County that correspond in whole or in part to census block group boundaries. I use the tigris function roads() to return geometry for all Howard County roads.

Because I don’t need or want to display each and every Howard County road, I use the RTTYP and FULLNAME variables to filter the results to retain only major roads (interstate and U.S. highways) and significant minor roads (Maryland state routes and roads with “Parkway” in their names). I store the geometry for each in separate variables, so that I can plot them at different widths.

all_roads <- roads(state = "MD", county = "Howard County", class = "sf", progress_bar = FALSE)

major_roads_geo <- all_roads %>%
  filter(RTTYP %in% c("I", "U")) %>%
  st_geometry()

minor_roads_geo <- all_roads %>%
  filter(RTTYP == "S" | str_detect(FULLNAME, "Pkwy")) %>%
  st_geometry()

Analysis

I do analyses to answer the following questions:

Summary of Howard County population statistics

Before I look at population density, here’s a quick summary of Howard County population at the county, census block group, and census block level:

The total population of Howard County according to the 2010 census was 287,085. The estimated total population of Howard County according to the ACS 5-year estimates for 2017 was 312,495. (Recall that ACS 5-year estimates for 2017 actually reflect surveys from 2013 through 2017.) This represents population growth of about 9% over roughly half a decade or so.

As discussed previously, there are currently 154 census block groups in Howard County, and 4,845 census blocks. In 2010 the least populated census block group contained 645 people, while the most populated census block group contained 3,632 people. The average population for all block groups was 1,864 people and the median population for all block groups was 1,824 people.

In 2010 the smallest block group covered an area of 0.1 square miles (about 65 acres), while the largest block group covered an area of 13.42 square miles. The average block group area was 1.63 square miles, and the median was 0.67 square miles.

A typical census block group is thus comparable in both size and population to a traditional village or small town. It is large enough to be a recognizable “place”, but small enough to have its own identity distinct from that of other places in the county.

On the other hand, census blocks are very small: in 2010 the average population for a Howard County census block was only

59, and the median population was 0, indicating that over half of the census blocks in the county contained no people at all.

Howard County population density by census block group

I first want to look at the differing population densities among the various census block groups. I start by computing population densities for 2010, including an overall population density for the county itself.

To do this I take the area_cbg_2017 table (which contains the variable area specifying the land area of the block group in square miles), delete the geometry (which I don’t need in this context), and join to it the table pop_cbg_2010 (which contains the variable population). I then create a new variable pop_density containing the population density. The calculation for the county-level density is similar.

density_cbg_2010 <- area_cbg_2017 %>%
  st_drop_geometry() %>%
  left_join(pop_cbg_2010, by = "GEOID") %>%
  mutate(pop_density = population / area)

density_county_2010 <- area_county %>%
  st_drop_geometry() %>%
  left_join(pop_county_2010, by = "GEOID") %>%
  mutate(pop_density = population / area)

Population density in 2010 varied from a low of 151 people per square mile to a high of 15,181 people per square mile, a difference of two orders of magnitude. Overall population density for the county in 2010 was 1,144 people per square mile.

The following histogram shows the distribution of population density among the various block groups for the 2010 census.

density_cbg_2010 %>%
  ggplot(aes(x = pop_density)) +
  geom_histogram(boundary = 0, binwidth = 500) +
  geom_vline(
    mapping = aes(xintercept = median(pop_density)),
    linetype = "dashed"
  ) +
  xlab("Population Density Per Square Mile") +
  ylab("Number of Census Block Groups") +
  scale_x_continuous(breaks = seq(0, 22000, 2000)) +
  labs(
    title = "2010 Howard County Population Density Distribution",
    subtitle = "Based on 2010 Census Values for Census Block Groups",
    caption = paste0(
      "Data sources:",
      "\n  U.S. Census Bureau, 2010 Decennial Census, Table P1",
      "\n  U.S. Census Bureau, Cartographic boundaries",
      "\nCreated using the tidyverse and tidycensus R packages"
    )
  ) +
  theme_minimal() +
  theme(axis.title.x = element_text(margin = margin(t = 10))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

The distribution is heavily skewed to the right, with only a few census block groups having a density over 6,000 people per square mile, and several block groups having population densities under 500 people per square mile. The median density, shown by the dashed line, was 2,419 people per square mile.

I now repeat the analysis for the 2017 ACS 5-year population estimates.

density_cbg_2017 <- area_cbg_2017 %>%
  st_drop_geometry() %>%
  left_join(pop_cbg_2017, by = "GEOID") %>%
  mutate(pop_density = population / area)

density_county_2017 <- area_county %>%
  st_drop_geometry() %>%
  left_join(pop_county_2017, by = "GEOID") %>%
  mutate(pop_density = population / area)

Population density in the 2013-2017 time frame varied from a low of 140 people per square mile to a high of 22,871 people per square mile, a wider range of densities than was present in 2010. Overall population density for the county in 2013-2017 was 1,245 people per square mile.

The following histogram shows the distribution of population density among the various block groups based on the 2017 ACS 5-year population estimates.

density_cbg_2017 %>%
  ggplot(aes(x = pop_density)) +
  geom_histogram(boundary = 0, binwidth = 500) +
  geom_vline(
    mapping = aes(xintercept = median(pop_density)),
    linetype = "dashed"
  ) +
  xlab("Population Density Per Square Mile") +
  ylab("Number of Census Block Groups") +
  scale_x_continuous(breaks = seq(0, 22000, 2000)) +
  labs(
    title = "2013-2017 Howard County Population Density Distribution",
    subtitle = "Based on 2017 ACS 5-Year Estimates for Census Block Groups",
    caption = paste0(
      "Data sources:",
      "\n  U.S. Census Bureau, 2017 American Community Survey, 5-Year Estimates, Table B01003",
      "\n  U.S. Census Bureau, Cartographic boundaries",
      "\nCreated using the tidyverse and tidycensus R packages"
    )
  ) +
  theme_minimal() +
  theme(axis.title.x = element_text(margin = margin(t = 10))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

The distribution is also heavily skewed to the right, with a median density of 2,560 people per square mile, a little increased from 2010.

The final histogram shows both distributions on a single graph, by combining the tables for 2010 and 2017, converting the year variable into a categorical variable (“factor”), and then plotting the bars of the histogram side by side using year to plot them in different colors.

bind_rows(density_cbg_2010, density_cbg_2017) %>%
  mutate(year = as.factor(year)) %>%
  ggplot(aes(x = pop_density, fill = year)) +
  geom_histogram(boundary = 0, binwidth = 500, position = "dodge") +
  xlab("Population Density Per Square Mile") +
  ylab("Number of Census Block Groups") +
  scale_x_continuous(breaks = seq(0, 22000, 2000)) +
  scale_fill_viridis(option = "viridis", name = "Year", discrete = TRUE) +
  labs(
    title = "2010 and 2017 Howard County Population Density Distribution",
    subtitle = "Based on 2010 Census and 2017 ACS 5-Year Estimates for Census Block Groups",
    caption = paste0(
      "Data sources:",
      "\n  U.S. Census Bureau, 2010 Decennial Census, Table P1",
      "\n  U.S. Census Bureau, American Community Survey, Table B01003",
      "\nCreated using the tidyverse and tidycensus R packages"
    )
  ) +
  theme_minimal() +
  theme(axis.title.x = element_text(margin = margin(t = 10))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

You can see the general rightward shift in the distribution in 2013-2017 compared to 2010, with more census blocks at higher densities and higher maximum densities.

Mapping Howard County population density variation

I next show the same data in the form of two maps, one for 2010 and one for 2013-2017. Because the population densities span such a wide range of values, I use a logarithmic scale when choosing the colors, based on powers of 10. I also use the same limits for the scale in both maps, to ensure that the color values are comparable between the maps.

Note that I don’t bother trying to label the various Howard County roads because it would be more work and make the map more cluttered—and in any case the intended audience of Howard County residents should be able to identify most if not all of the roads by sight.

area_cbg_2017 %>%
  left_join(density_cbg_2010, by = "GEOID") %>%
  ggplot(aes(fill = pop_density, geometry = geometry)) +
  geom_sf(size = 0) +
  geom_sf(data = major_roads_geo, color = "white", size = 1.0, fill = NA) +
  geom_sf(data = minor_roads_geo, color = "white", size = 0.5, fill = NA) +
  scale_fill_viridis(trans = "log10", option = "plasma", name = "Population\nPer Sq Mi", limits = c(100, 25000), breaks = c(100, 1000, 10000), labels = comma) +
  labs(title="2010 Howard County Population Density Variations",
       subtitle = "Population Density Per Square Mile for Census Block Groups",
       caption = paste0(
         "Data sources:",
         "\n  U.S. Census Bureau, 2010 Decennial Census, Table P1",
         "\n  U.S. Census Bureau, Cartographic boundaries",
         "\nCreated using the tidyverse and tidycensus R packages"
       )
  ) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0)) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())

You can clearly see the differences in population density between the eastern and western parts of the county. The distinction between Columbia and the areas south of MD 32 and northwest of MD 108 is particularly visible. It’s also possible to discern the areas associated with Ellicott City, Elkridge, Savage, and Laurel.

I now create a comparable map for the 2013-2017 timeframe.

area_cbg_2017 %>%
  left_join(density_cbg_2017, by = "GEOID") %>%
  ggplot(aes(fill = pop_density, geometry = geometry)) +
  geom_sf(size = 0) +
  geom_sf(data = major_roads_geo, color = "white", size = 0.8, fill = NA) +
  geom_sf(data = minor_roads_geo, color = "white", size = 0.4, fill = NA) +
  scale_fill_viridis(trans = "log10", option = "plasma", name = "Population\nPer Sq Mi", limits = c(100, 25000), breaks = c(100, 1000, 10000), labels = comma) +
  labs(title="2013-2017 Howard County Population Density Variations",
       subtitle = "Population Density Per Square Mile for Census Block Groups",
       caption = paste0(
         "Data sources:",
         "\n  U.S. Census Bureau, 2017 American Community Survey, 5-Year Estimates, Table B01003",
         "\n  U.S. Census Bureau, Cartographic boundaries",
         "\nCreated using the tidyverse and tidycensus R packages"
       )
  ) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0)) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())

The overall picture is similar to that for 2010—which is only to be expected since only a few years separate the two maps.

Mapping Howard County density quintiles

Population values in the 2010 census are exact counts of the entire population, except for any undercounts that might have occurred. (I assume that these were minimal for Howard County.) However, margins of error for ACS population estimates at the census block group level can be very high, up to 30% or even more relative to the estimates. That means that, for example, ranking individual census block groups based on their population density doesn’t always make sense, given that the relative positions of block groups on the list will in large part reflect random measurement errors.

To work around this issue I take the 154 Howard County census block groups and divide them into 5 groups or “quintiles” of 31 census block groups each, (except for the last quintile, which contains only 30 block groups). The quintiles are based on the block groups’ population densities 5-year estimates for 2010 Quintile 1 contains the 31 census block with the lowest population densities in 2010 (lowest 20%) and quintile 5 contains the 30 block groups with the highest population densities in 2010 (highest 20%). The other quintiles contain block groups with population densities intermediate between the lowest and highest values.

I create a new table density_quintiles as follows:

  1. I start with the density_cbg_2010 table created above.
  2. I rank the rows in ascending order from 1 to 154 based on the value of the population variable, using the min_rank() function. I then divide the rank value by 31, and take the next highest integer (using the ceiling() function). This produces a value from 1 to 5 for each row. I then convert the value to a character string and save it in a new variable quintile.
  3. I retain only the variables GEOID and quintile.

The resulting table density_quintiles has 154 rows, each row containing the 12-character geographic ID of the corresponding census block group and the quintile of population density that that block group was in according to the 2010 census.

density_quintiles <- density_cbg_2010 %>%
  mutate(quintile = as.character(ceiling(min_rank(pop_density) / 31))) %>%
  select(GEOID, quintile)

I now want to map each census block group, coloring each block group according to the density quintile it was in according to the 2010 census. I do this as follows:

  1. I first define variables quintile_colors, quintiles, and quintile_labels, to specify the colors used for the quintiles on the map, the order in which they will be listed in the legend, and the labels used for them in the legend. For the colors I use a special [“colorblind-friendly” palette][cbfp] designed to be more readable by people with various forms of color blindess.
  2. I then take the table area_cbg_2017 containing the geometry for the census block groups and join it with the table density_quintiles created above, using the common variable GEOID.
  3. I plot the resulting table, using the value of the quintile variable to determine the fill color for each census block group.
  4. I first use geom_sf() to draw the census block groups themselves. I set the size parameter to zero to avoid drawing the outlines of the census block groups.
  5. I then use geom_sf() again twice, this time taking data from the major_roads_geo and minor_roads_geo tables respectively, to draw lines for the major and minor roads in Howard County.
  6. I use scale_fill_manual() to specify the colors for each census block group and to create a legend, with label text and label order as previously specified.
  7. I add a title, subtitle, and caption for the map.
  8. Finally I blank out the axis tick marks and axis labels (which would normally show latitude and longitude) and make the background of the graph blank for a cleaner look.
# Colors are assigned to the quintiles based on their alphabetical order.
quintile_colors <- c("#0072B2",  # 1
                     "#56B4E9",  # 2
                     "#009E73",  # 3
                     "#E69F00",  # 4
                     "#F0E442")  # 5

# The order of quintiles in this list determines the order in the legend.
quintile_order <- c("5", "4", "3", "2", "1")

# These labels are used as substitutes for the quintile values in the legend.
quintile_labels <- c("5 (most dense)", "4", "3", "2", "1 (least dense)")

area_cbg_2017 %>%
  inner_join(density_quintiles, by = "GEOID") %>%
  ggplot(aes(fill = quintile, geometry = geometry)) +
  geom_sf(size = 0) +
  geom_sf(data = major_roads_geo, color = "white", size = 0.8, fill = NA) +
  geom_sf(data = minor_roads_geo, color = "white", size = 0.4, fill = NA) +
  scale_fill_manual(name = "Quintile", values = quintile_colors, breaks = quintile_order, labels = quintile_labels) +
  labs(title="2010 Howard County Population Density Variations",
       subtitle = "Population Density Quintiles for Census Block Groups",
       caption = paste0(
         "Data sources:",
         "\n  U.S. Census Bureau, 2010 Decennial Census, Table P1",
         "\n  U.S. Census Bureau, Cartographic boundaries",
         "\nCreated using the tidyverse and tidycensus R packages"
       )
  ) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0)) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())

For completeness I redo the original map of Howard County census block groups based on density quintiles, this time using the 2017 ACS 5-year estimates.

I first create a table density_2017_quintiles containing the quintile values for each census block group based on the 2017 estimates, similar to how I originally created the quintile values for 2010.

density_2017_quintiles <- density_cbg_2017 %>%
  mutate(quintile = as.character(ceiling(min_rank(pop_density) / 31))) %>%
  select(GEOID, quintile)

Then I map the Howard County census block groups based on these new 2017 quintile values. Again, I do this in a manner almost identical to how I mapped the census block group quintile values for 2010.

area_cbg_2017 %>%
  inner_join(density_quintiles, by = "GEOID") %>%
  ggplot(aes(fill = quintile, geometry = geometry)) +
  geom_sf(size = 0) +
  geom_sf(data = major_roads_geo, color = "white", size = 0.8, fill = NA) +
  geom_sf(data = minor_roads_geo, color = "white", size = 0.4, fill = NA) +
  scale_fill_manual(name = "Quintile", values = quintile_colors, breaks = quintile_order, labels = quintile_labels) +
  labs(title="2013-2017 Howard County Population Density Variations",
       subtitle = "Population Density Quintiles for Census Block Groups",
       caption = paste0(
         "Data sources:",
         "\n  U.S. Census Bureau, 2017 American Community Survey, 5-Year Estimates, Table B101003",
         "\n  U.S. Census Bureau, Cartographic boundaries",
         "\nCreated using the tidyverse and tidycensus R packages"
       )
  ) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0)) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())

The resulting map is very similar to the map for 2010 (as one might expect).

Changes from 2010 to 2013-2017 for Howard County 2010 density quintiles

I next look at how population densities across Howard County have changed from the 2010 census to the 2017 estimates for the 2013-2017 time frame.

I first look at the population density gaps among the quintiles, and how their average population densities have changed over time. More specifically, I investigate how the population densities for the various census block group quintiles changed from 2010 to 2017—or, more correctly, from the 2010 time frame to the 2013-2017 time frame.

To look at this question I take the quintiles from above, compute the averages of the population densities for the census block groups in each quintile, and graph the results for 2017 compared to 2010. Thus, for example, for the census block groups in quintile 1 I compute the average ppoulation densities from the 2010 census for those 31 census block groups. I compute similar averages for the other quintiles. I then repeat the process using the 2017 5-year ACS estimates.

There’s another subtlety here as well. Many graphs showing changes between quintiles over time actually reflect different sets of people, households, or geographies between the different timeframes. For example, if in the graph the 2010 figures were to reflect the quintiles as they exist in 2010, and the 2017 figures were to reflect the quintiles as they exist in 2017, then we would not necessarily be comparing like to like. That’s because a given census block group may have moved from one quintile to another over time. (For example, a census block group in the lowest 20% of block groups by population density in 2010 may have moved to the next higher 20% in 2017.)

To avoid this problem I have the 2017 values in the graph reflect the population densities for the same sets of quintiles as for the 2010 values: if a census block group were in, say, quintile 3 in 2010 then it is counted as part of the same set of census block groups for 2017 (i.e., quintile 3).

Now I’m ready to create the graph.

  1. I want to include the overall Howard County population density in the graph. I therefore take the table density_quintiles from above, which assigns each census block group a quintile value, and create a new table density_qplus, adding a final row containing the geographic ID for Howard County and an arbitrary quintile value of 0.
  2. I define variables to specify the colors for the graph, the labels for the legend, and the ordering of labels.
  3. For the actual plot I create a new table density_adj, starting by combining all the density tables (including those for the block groups and for Howard County overall) into one table.
  4. I join the resulting table with the table density_qplus containing quintile values for the census block groups and the extra row for Howard County (“quintile 0”), using the common field GEOID.
  5. I then group the rows by quintile (producing five sets of census block groups that are in the same quintile, along with a separate set containing the rows for Howard County) and then by year (for each quintile, producing two sets of rows for the two years 2010 and 2017 in the data).
  6. I next summarize the groups by calculating the mean (i.e., average) population density for each group. This produces a 12-row table density_avgs, with values for the five quintiles plus Howard County for 2010, and similar values for 2017.
  7. Finally I plot the data using ggplot(), with the x-axis values being the years and the y-axis values the computed average ppoulation densities for each quintile. I use geom_line() to draw lines for each quintile and for Howard County, coloring the lines using the colors previously defined.
density_qplus <- density_quintiles %>%
  bind_rows(c(GEOID = "24027", quintile = "0"))

# Colors are assigned to the quintiles (and Howard County) based on their alphabetical order.
qplus_colors <- c("#000000",  # 0
                  "#0072B2",  # 1
                  "#56B4E9",  # 2
                  "#009E73",  # 3
                  "#E69F00",  # 4
                  "#F0E442")  # 5

# The order of quintiles in this list determines the order in the legend.
qplus_order <- c("5", "4", "3", "2", "1", "0")

# These labels are used as substitutes for the quintile values in the legend.
qplus_labels <- c("5 (highest 20%)", "4", "3", "2", "1 (lowest 20%)", "Howard County")

density_avgs <- bind_rows(density_cbg_2010, density_county_2010, density_cbg_2017, density_county_2017) %>%
  inner_join(density_qplus, by = "GEOID") %>%
  group_by(quintile, year) %>%
  summarize(pop_density = mean(pop_density))

upper_limit <- 10000
lower_limit <- 0

density_avgs %>%
  ggplot(aes(x = year, y = pop_density, color = quintile)) +
  geom_line(size = 1.0) +
  scale_color_manual(name = "Quintiles", values = qplus_colors, labels = qplus_labels, breaks = qplus_order) +
  scale_x_continuous(breaks = seq(2010, 2017, 1)) +
  scale_y_continuous(limits = c(lower_limit, upper_limit), breaks = seq(lower_limit, upper_limit, 1000)) +
  xlab("Year") +
  ylab("Average Population Density") +
  labs(
    title = "Population Density Changes within Howard County from 2010 to 2013-2017",
    subtitle = "Average Population Densities for 2010 Census Block Group Quintiles",
    caption = paste0(
      "Data source:",
      "\n  U.S. Census Bureau, 2010 Decennial Census, Table P1",
      "\n  U.S. Census Bureau, 2017 American Community Survey 5-Year Estimates, Table B01003",
      "\n  U.S. Census Bureau, Cartographic boundaries",
      "\nCreated using the tidyverse and tidycensus R packages"
    )
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 10))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(hjust = 0, margin = margin(t = 15)))

Based on the graph above, the lowest quintile experienced minimal changes in population density between 2010 and the 2013-2017 time frame, the middle quintiles experienced only small increases in average population density, and only the highest quintile (the census block groups with the highest population densities in 2010) experienced significant increases in population density.

To double-check this, I print a table of the percentage changes for each quintile (rounded to the nearest whole number), along with the 2010 and 2017 average population densities for each quintile.

density_avgs %>%
  filter(quintile != "0") %>%
  spread(year, pop_density) %>%
  mutate(abs_change = `2017` - `2010`, pct_change = 100 * abs_change / `2010`) %>%
  mutate(abs_change = round(abs_change, 0), pct_change = round(pct_change), `2010` = round(`2010`, 0), `2017` = round(`2017`, 0)) %>%
  arrange(desc(quintile)) %>%
  kable()
quintile 2010 2017 abs_change pct_change
5 7503 8313 810 11
4 3627 3860 233 6
3 2502 2638 136 5
2 1668 1924 256 15
1 466 552 86 18

Based on the above table the quintile containing the census block groups with the highest population densities experienced the biggest absolute increase in average density, but the quintile containing the census block groups with the lowest population densities experienced the largest density increase in percentage terms.

Howard County population density changes by census block group

I now take another look at the population density changes between 2010 and the 2013-2017 time frame, this time mapping the changes for individual census block groups.

One decision I have to make is whether to plot density changes as absolute numbers or as percentages. Intuitively plotting percentage changes seems to make more sense: if a census block group with a population density of 200 people per square mile changes to have a population density of 300 people per square mile, that is arguably more significant than a census block group with a population density of 2,000 people per square mile changing to have a population density of 2,100 people per square mile. In absolute terms the population added is the same (100 more people per square mile), but the former case represents a 50% increase in density, and the latter only a 5% increase.

Given that, I proceed to create the map as follows:

  1. I start by combining into a single table the two tables containing population densities by census block groups for 2010 and 2013-2017.
  2. From the resulting table I retain only the variables GEOID, pop_density, and year.
  3. I then use spread() to recast the table so that instead of having different rows for different years, the table has different columns for different years.
  4. I then take the difference between the population densities for 2017 and 2010 (i.e., the difference between the values in the column for 2017 and the column for 2010), divide the difference by the value for 2010, and multiply by 100 to calculate the percentage change between the two timeframes.
  5. I then take the resulting table density_changes and join it to the table area_cbg_2017 using the common field GEOID, to add the geometry for the census block groups.
  6. I then plot the resulting table, using the value of the pct_change variable to select the color for each census block group. Since that variable has continuous values I use scale_fill_viridis() to vary the colors along a predefined spectrum of color values, using previously-calculated lower and upper limits.
density_changes <- bind_rows(density_cbg_2010, density_cbg_2017) %>%
  select(GEOID, pop_density, year) %>%
  spread(year, pop_density) %>%
  mutate(pct_change = 100 * (`2017` - `2010`) / `2010`) %>%
  select(GEOID, pct_change)

upper_limit <- round(max(density_changes$pct_change) + 10, -1)
lower_limit <- round(min(density_changes$pct_change) - 10, -1)

area_cbg_2017 %>%
  left_join(density_changes, by = "GEOID") %>%
  ggplot(aes(fill = pct_change, geometry = geometry)) +
  geom_sf(size = 0) +
  geom_sf(data = major_roads_geo, color = "white", size = 0.8, fill = NA) +
  geom_sf(data = minor_roads_geo, color = "white", size = 0.4, fill = NA) +
  scale_fill_viridis(option = "viridis", name = "% Change", limits = c(lower_limit, upper_limit), breaks = seq(lower_limit, upper_limit, 20)) +
  labs(title="Changes in Howard County Population Density",
       subtitle = "2017 5-Year Estimates vs. 2010 Values for Census Block Groups",
       caption = paste0(
         "Data sources:",
         "\n  U.S. Census Bureau, 2010 Decennial Census, Table P1",
         "\n  U.S. Census Bureau, 2017 American Community Survey, 5-Year Estimates, Table B101003",
         "\n  U.S. Census Bureau, Cartographic boundaries",
         "\nCreated using the tidyverse and tidycensus R packages"
       )
  ) +
  theme(plot.caption = element_text(hjust = 0, margin = margin(t = 15))) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())

A few areas stand out as having significant increases in population density. At first glance these areas appear to be in Fulton at the Maple Lawn Farm development and adjacent neighborhoods south of MD 216, along U.S. 1 south and north of MD 175, and near downtown Ellicott City.

Other areas actually experienced decreases in population density. Assuming that the number of housing units did not decrease in those areas, this likely was caused by the number of people per household decreasing, for example due to children leaving families and “empty nesters” remaining. (Additional census data, for example on household size and the ages of household members, should be able to confirm or refute this idea.)

Appendix

Caveats

Population values from the American Community Survey are estimates based on survey samples, with associated margins of error. For estimates at the census block group level the associated standard errors can be 10-30% or more of the estimates themselves.

As noted above, 5-year estimates identified as being for a certain year actually reflect surveys conducted over the prior five years. So the 2-17 population estimates actually correspond to the 2013-2017 time frame.

References

To obtain the 2010 census and 2017 ACS data I used the tidycensus package, which provides an easy-to-use interface to U.S. Census Bureau data made available via a set of public APIs. As its name suggests, the tidycensus package is designed to be compatible with the tidyverse approach to representing and manipulating data.

To obtain the census block group boundaries I used the tigris package, which is designed to accompany the tidycensus package and provides access to U.S. Census Bureau cartographic boundaries and TIGER/Line shapefiles.

Calculation of population density based on land area is discussed in the U.S. Census Bureau document 2010 Census Summary File 1, 2010 Census of Population and Housing, Technical Documentation. See Appendix A, “Geographic Terms and Concepts”, page A-22, “Population and Housing Unit Density”, and Appendix B, “Definitions of Subject Characteristics”, page B-23, “Area Measurement and Density”.

Suggestions for others

If you look closely at the maps above you’ll notice that there’s a gap where MD 100 joins U.S. 29. That appears to be due to an unnamed section of road that is not classified as being part of either highway. I have not yet been able to identify where the geometry for that road section is in the table returned by the roads() function.

Environment

I used the following R environment in doing the analysis above:

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
## LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.24        scales_1.0.0      viridis_0.5.1    
##  [4] viridisLite_0.3.0 sf_0.7-7          tigris_0.8.2     
##  [7] tidycensus_0.9.2  forcats_0.4.0     stringr_1.4.0    
## [10] dplyr_0.8.3       purrr_0.3.2       readr_1.3.1      
## [13] tidyr_0.8.3       tibble_2.1.3      ggplot2_3.2.1    
## [16] tidyverse_1.2.1  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2         lubridate_1.7.4    lattice_0.20-38   
##  [4] class_7.3-15       assertthat_0.2.1   zeallot_0.1.0     
##  [7] digest_0.6.20      R6_2.4.0           cellranger_1.1.0  
## [10] backports_1.1.4    evaluate_0.14      e1071_1.7-2       
## [13] highr_0.8          httr_1.4.1         pillar_1.4.2      
## [16] rlang_0.4.0        lazyeval_0.2.2     curl_4.0          
## [19] readxl_1.3.1       uuid_0.1-2         rstudioapi_0.10   
## [22] rmarkdown_1.14     labeling_0.3       rgdal_1.4-4       
## [25] foreign_0.8-71     munsell_0.5.0      broom_0.5.2       
## [28] compiler_3.6.0     modelr_0.1.5       xfun_0.8          
## [31] pkgconfig_2.0.2    htmltools_0.3.6    tidyselect_0.2.5  
## [34] gridExtra_2.3      crayon_1.3.4       withr_2.1.2       
## [37] rappdirs_0.3.1     grid_3.6.0         nlme_3.1-139      
## [40] jsonlite_1.6       gtable_0.3.0       DBI_1.0.0         
## [43] magrittr_1.5       units_0.6-3        KernSmooth_2.23-15
## [46] cli_1.1.0          stringi_1.4.3      sp_1.3-1          
## [49] xml2_1.2.2         vctrs_0.2.0        generics_0.0.2    
## [52] tools_3.6.0        glue_1.3.1         hms_0.5.0         
## [55] yaml_2.2.0         colorspace_1.4-1   maptools_0.9-5    
## [58] classInt_0.4-1     rvest_0.3.4        haven_2.1.1

Source code

You can find the source code for this analysis and others at my hocodata public code repository. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.