Introduction

In this document I do some basic analyses on the population density of Columbia, Maryland, census block groups over time. (A census block group is a census geography below the level of census tracts and above the level of census blocks.) I also look at density at the level of census blocks for 2010, the most recent year for which I have data at that level.

For those readers unfamiliar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, I’ve included some additional explanation of various steps. For more information check out the the tutorial “Getting started with the Tidyverse”.

Setup and data preparation

Libraries

I use the following packages for the following purposes:

  • tidyverse: do general data manipulation.
  • tidycensus: retrieve U.S. census data.
  • tigris: retrieve U.S. census geospatial data.
  • sf: manipulate geospatial data.
  • viridis: obtain special color palettes for graphs.
  • scales: enable better formatting of graph legend values
  • knitr: display tabular data.
library(tidyverse)
library(tidycensus)
library(tigris)
library(sf)
library(viridis)
library(scales)
library(knitr)

The tidycensus and tigris packages require a Census API key. I previously obtained such a key and used the tidycensus function census_api_key() with the install = TRUE option to store it for use by the functions included in the tidycensus and tigris packages.

Data sources

I use data from the following sources; see the References section below for more information:

  • Population estimates for census blocks in 2010 are from the 2010 decennial census table P1, and are accessed via the tidycensus get_decennial() function.
  • Population estimates for census block groups for 2013-2017 are from the American Community Survey table B01003, and are accessed via the tidycensus get_acs() function.
  • Census block group areas and boundaries are from the U.S. Census Bureau cartographic boundaries data for 2010 and 2017, and are accessed via the tigris block_groups() function.
  • Census block areas and boundaries are from the U.S. Census Bureau cartographic boundaries data for 2010, and are accessed via the tigris blocks() function.
  • Road geometry is also from the U.S. Census Bureau, and is accessed via the tigris roads() function.
  • The association between individual census blocks and Census Designated Places is provided by the block assignment files produced for the 2010 decennial census.

Reading in and preparing population data

My goal is to look at population density in Columbia, Maryland, at a relatively fine-grained level. (For example, are there particular neighborhoods in Columbia that have relatively low or relatively high population density?) In my previous analysis of population density in Howard County I looked at data at the level of census block groups, and will continue that practice for this analysis.

A census block group is a geography that is one level below a census tract and one level above a census block, the smallest geographic subdivision used by the U.S. Census Bureau. While the 2010 census contains population counts for each census block, the American Community Survey data contains population estimates only down to the level of census block groups—and even then these estimates are available only in the 5-year estimates and not in the 1-year estimates.

My primary interest is in population, so the only data I need is total population at the census block group level. In theory I could retrieve this data for 2010 by calling the tidycensus get_decennial() function and asking for the “P001P001” variable for the census block group geography. However, there is an apparent bug in the Census API that prevents this from working.

Instead I use the get_decennial() function to retrieve population counts for all census blocks in Howard County for the 2010 census. This function returns data in so-called “tidy” format (one row per observation), with the following columns:

  • GEOID. A 15-character geographic ID as discussed above.
  • NAME. The name of the census block.
  • variable. References the census variable being retrieved, in this case “P001001”, the total population of the block.
  • value. The variable’s value. (There is no margin of error provided since the decennial census is a comprehensive count, not an estimate constructed from a sample.)

I retain only the GEOID value and the population count, renamed to population for consistency throughout this analysis.

hoco_pop_block_2010 <- get_decennial(year = 2010, geography = "block", state = "MD", county = "Howard County", variables = "P001001") %>%
  select(GEOID, value) %>%
  rename(population = value)
## Getting data from the 2010 decennial Census

I then summarize this data at the block group level to get the total population for each census block group. Again I retain only the GEOID and population count, as GEOID and population respectively, and add a new variable year for later use.

hoco_pop_cbg_2010 <- hoco_pop_block_2010 %>%
  mutate(cbg = substr(GEOID, 1, 12)) %>%
  group_by(cbg) %>%
  summarize(population = sum(population)) %>%
  rename(GEOID = cbg) %>%
  mutate(year = 2010)

I next use the tidycensus get_acs() function to retrieve population estimates from the 2017 ACS 5-year estimates for all census block groups in Howard County. This function also returns data in tidy format, with the following columns:

  • GEOID. An 12-character geographic ID for the block group, as discussed above.
  • NAME. The name of the census block group.
  • variable. References the ACS variable being retrieved, in this case “B01003_001”).
  • estimate. The ACS 5-year estimate for the variable’s value.
  • moe. The margin of error for the estimate.

Again I retain only the GEOID and population count, as GEOID and population respectively, and add the variable year.

hoco_pop_cbg_2017 <- get_acs(year = 2017, survey = "acs5", geography = "block group", state = "MD", county = "Howard County", variables = "B01003_001") %>%
  select(GEOID, estimate) %>%
  rename(population = estimate) %>%
  mutate(year = 2017)
## Getting data from the 2013-2017 5-year ACS

As discussed above, the hoco_pop_cbg_2010 and hoco_pop_cbg_2017 tables contain data for all census block groups in Howard County. In the sections below I subset these tables to retain data only for census block groups associated with Columbia.

Reading in and preparing area and boundary data

In order to produce maps of population density I need the geometry of the cartographic boundaries for all census block groups associated with Columbia as of 2010 and 2017, as well as the associated (land) areas for each block group.

I use the tigris function block_groups() to return the cartographic boundaries for all block groups in Howard County in the form of a table compatible with the sf package. The table as returned for 2017 has a variable GEOID that is identical to the variable of the same name returned by get_decennial() and get_acs(). However, the table as returned for 2010 has instead a variable GEO_ID that contains the prefix “1500000US” for the geographic ids. I remove this prefix in order to create a GEOID variable of the same format as for the 2017 boundaries.

For 2010 the area is returned from the block_groups() function as a single variable CENSUSAREA, with values expressed as square miles. I retain this as the variable area.

For 2017 this data is returned in the form of two variables, ALAND and AWATER representing land area and water area respectively, both expressed in units of square meters. I convert the ALAND value to square miles and save the result as a variable area to match that for the 2010 boundaries. (The U.S. Census Bureau recommends using the land area when calculating population density.)

options(tigris_use_cache = TRUE)

hoco_area_cbg_2010 <- block_groups(year = 2010, state = "MD", county = "Howard County", cb = TRUE, class = "sf", progress_bar = FALSE) %>%
  rename(area = CENSUSAREA) %>%
  mutate(GEOID = substr(GEO_ID, 10, 21)) %>%
  select(GEOID, area)

hoco_area_cbg_2017 <- block_groups(year = 2017, state = "MD", county = "Howard County", cb = TRUE, class = "sf", progress_bar = FALSE) %>%
  mutate(area = ALAND / 2589988) %>%
  select(GEOID, area)

Determining which census block groups make up Columbia

In order to analyze data for Columbia specifically, I need to know which census block groups are associated with Columbia. I start with the so-called “block assignment files” published by the U.S. Census Bureau. These relate census blocks to the higher-level geographic entities with which they’re associated. In this case we want to know which census blocks are part of Columbia, which in census terms is a “census designated place” or CDP—basically a population center that is not an incorporated city.

(In this respect it’s important to note that “Columbia the CDP” is not 100% identical with “Columbia the planned community”. There have always been “outparcels” within Columbia that were not acquired by The Rouse Company and whose owners do not pay Columbia Association dues. There may also be other differences between Columbia’s boundaries as a CDP and as a planned community, based on U.S. Census Bureau decisions on what to include in the CDP.)

To do this I download the archive of block assignment files for Maryland from the 2010 census, unzip the archive, and read in the CSV file that maps census blocks to incorporated entities or CDPs.

Census blocks are uniquely identified by a 15-character ID that combines the 2-digit FIPS code for the state (“24” for Maryland), the 3-digit FIPS code identifying the county or county-equivalent within the state (“027” for Howard County), a 6-digit identifier for an individual census tract within the county or county-equivalent, and a 4-digit identifier for the census block within the tract.

Incorporated entities and CDPs are identified by a 5-digit code that identifies the place within the state. For Columbia that code is “19125”. (The full FIPS code for Columbia, “2419125”, adds the Maryland state code as a prefix.) Areas outside a CDP have missing values, which I replace with the string “none”.

if (!file.exists("BlockAssign_ST24_MD.zip")) {
  download.file("https://www2.census.gov/geo/docs/maps-data/data/baf/BlockAssign_ST24_MD.zip", "BlockAssign_ST24_MD.zip", method = "auto")
}

unzip("BlockAssign_ST24_MD.zip", overwrite = TRUE)

md_block_places <- read_csv("BlockAssign_ST24_MD_INCPLACE_CDP.txt", col_types = "cc") %>%
  rename(GEOID = BLOCKID) %>%
  replace_na(list(PLACEFP = "none"))

The md_block_places table contains entries for every census block in Maryland. I am interested only in Columbia, and only in census block groups, not individual census blocks. Unfortunately, it is possible for a given census block group to contain census blocks associated with two different places. This in fact the case for Howard County: for example, some census blocks in census block group “240276023022” are in Columbia (place code “19125”), some blocks are in Ellicott City (place code “26000”), and some blocks are not part of any CDP at all.

Which census block groups should I consider to be part of Columbia? I could consider only census block groups for which all census blocks are part of the Columbia CDP. However, this would risk omitting census block groups whose areas primarily fall within the Columbia CDP. Alternatively, I could consider any census block group that contains at least one census block associated with Columbia. However, this would risk including census block groups with only a small portion of their area actually part of Columbia.

For purposes of this analysis I consider a census block group to be part of Columbia only if at least 50% of its (land) area is included in the Columbia CDP. Since a given census block is either in Columbia or not, I can make this determination for a given census block group by comparing the total area of the census block group with the sum of the areas of all census blocks in the group that are associated with the Columbia CDP.

To do this I first need the areas for all census blocks in Howard County. I can get these by using the tigris functions blocks(). For consistency with other tables I rename the GEOID10 variable to GEOID and create a new variable area containing the area of each block in square miles.

hoco_area_blocks_2010 <- blocks(year = 2010, state = "MD", county = "Howard County", class = "sf", progress_bar = FALSE) %>%
  rename(GEOID = GEOID10) %>%
  mutate(area = ALAND10 / 2589988)

I now determine which census block groups are more than 50% in Columbia by area, as follows:

  1. I start with the table hoco_area_blocks_2010 containing areas for all Howard County census blocks, and drop any geometry associated with the blocks.
  2. I join to that table the table md_block_places containing the place ID associated with each census block, using the variable GEOID common to both tables. Since I use the left_join function the resulting table contains only rows corresponding to census blocks in Howard County. Each row contains (among other things) the GEOID value for a given block, the place associated with the block (in the PLACEFP variable), and the area of the block.
  3. I convert the values of PLACEFP (“19125”, “26000”, etc.) into their own columns, with the values for the columns being the areas of the census blocks. Since any given census block is associated with only one place, for any given row only one column will include a valid area.
  4. I truncate the blocks’ geography IDs to 12 characters to reflect only census block group identifiers. Since there are several census blocks for a given census block group, that means that the resulting table has multiple rows for each value of (the new) GEOID variable.
  5. I group the rows by census block group ID, and then summarize all the columns by taking their sums, using the na.rm parameter to ignore missing values. The resulting table has one row per census block group and one column for each Howard County place, with the value for that column being the sum of the areas of the blocks associated with that place.
  6. I next use the rowSums() function to create a new variable area containing the total land area of each census block group, summing the areas contained in all columns except the first (which contains the GEOID value).
  7. I compare the land area for Columbia in that census block group (in the column with name “19125”) to the total area of the census block group. If the ratio is 50% (0.5) or more then I retain the row for that census block group.
  8. Finally I retain only the GEOID and area variables.

The resulting table cbgs contains all census block groups considered as part of Columbia for this analysis, along with the total areas for those census block groups (which are not necessarily the areas within Columbia proper).

cbgs <- hoco_area_blocks_2010 %>%
  st_drop_geometry() %>%
  left_join(md_block_places, by = "GEOID") %>%
  select(GEOID, PLACEFP, area) %>%
  spread(PLACEFP, area) %>%
  mutate(GEOID = substr(GEOID, 1, 12)) %>%
  group_by(GEOID) %>%
  summarize_all(sum, na.rm = TRUE) %>%
  mutate(area = rowSums(.[2:length(.)])) %>%
  filter(`19125` / area >= 0.5) %>%
  select(GEOID, area)

There are 54 census block groups considered part of Columbia, with a total area of 31.1 square miles. Compare this to the figure of 31.9 square miles of land area given in the Wikipedia article on Columbia. The difference likely represents census block groups for which only a small portion of the groups’ area is in Columbia, block groups that I’ve omitted from this analysis.

Subsetting the population and boundary data for Columbia

I now have a list of census block groups associated with the Columbia CDP. I therefore join the hoco_pop_cbg_2010 table with the cbgs table to produce a table of population counts for the Columbia-associated census block groups only. (The right_join() function includes only rows for which the GEOID values exist in the second table, namely cbgs.) I do the same for the 2017 population estimates. In both case I remove the area variable (carried over from the cbgs table) because it’s not needed.

pop_cbg_2010 <- hoco_pop_cbg_2010 %>%
  right_join(cbgs, by = "GEOID") %>%
  select(-area)

pop_cbg_2017 <- hoco_pop_cbg_2017 %>%
  right_join(cbgs, by = "GEOID") %>%
  select(-area)

Again, because I am interested only in the Columbia census block groups I follow a similar procedure as above to produce tables of areas and boundaries restricted to the Columbia CDP.

area_cbg_2010 <- hoco_area_cbg_2010 %>%
  right_join(cbgs, by = "GEOID")

area_cbg_2017 <- hoco_area_cbg_2017 %>%
  right_join(cbgs, by = "GEOID")

The resulting tables will have two variables representing area, one from the original hoco_area_cbg_2010 and hoco_area_cbg_2017 tables (area.x) and one from the cbgs table (area.y). They should be the same even though they come from different sources. As a confirmation of the calculations above I check to make sure that the areas don’t differ by more than 1% (0.01), and stop the analysis if they do.

stopifnot(max(abs((area_cbg_2010$area.x - area_cbg_2010$area.y) / area_cbg_2010$area.x)) < 0.01)
stopifnot(max(abs((area_cbg_2017$area.x - area_cbg_2017$area.y) / area_cbg_2017$area.x)) < 0.01)

I then retain only one variable for area in both tables, naming it area.

area_cbg_2010 <- area_cbg_2010 %>%
  rename(area = area.x) %>%
  select(-area.y)

area_cbg_2017 <- area_cbg_2017 %>%
  rename(area = area.x) %>%
  select(-area.y)

Now that I have both population counts (for 2010) or estimates (for 2013-2017) along with areas for all of the block groups (wholly or mostly) associated with Columbia, I want to calculate the total population and total area for all Columbia block groups combined for both 2010 and 2017.

pop_col_2010 <- pop_cbg_2010 %>%
  summarize(population = sum(population)) %>%
  mutate(GEOID = "2419125")
pop_col_2017 <- pop_cbg_2017 %>%
  summarize(population = sum(population)) %>%
  mutate(GEOID = "2419125")

area_col_2010 <- area_cbg_2010 %>%
  summarize(area = sum(area)) %>%
  mutate(GEOID = "2419125")
area_col_2017 <- area_cbg_2017 %>%
  summarize(area = sum(area)) %>%
  mutate(GEOID = "2419125")

Columbia population and area by census block in 2010

Moving down one level from census block groups, for 2010 I have population and area data for individual census blocks within Howard County. In order to map population density at the census block level, I create a table pop_block_2010 containing 2010 population counts for each census block associated with the Columbia CDP.

pop_block_2010 <- hoco_pop_block_2010 %>%
  left_join(md_block_places, by = "GEOID") %>%
  filter(PLACEFP == "19125") %>%
  select(-PLACEFP)

I create another table area_block_2010 containing areas and boundaries for census blocks in the Columbia CDP.

area_block_2010 <- hoco_area_blocks_2010 %>%
  left_join(md_block_places, by = "GEOID") %>%
  filter(PLACEFP == "19125") %>%
  select(-PLACEFP)

Getting road geometry to help with orientation

To help orient readers as to the locations of the census block groups, I also want any maps generated to also display major roads in Columbia that correspond in whole or in part to census block group boundaries. I use the tigris function roads() to return geometry for all Howard County roads.

Because I don’t need or want to display each and every Columbia road, I use the RTTYP and FULLNAME variables to filter the results to retain only major roads (interstate and U.S. highways) and significant minor roads (Maryland state routes, roads with “Parkway” in their names, and other significant roads in Columbia). I store the geometry for each in separate variables, so that I can plot them at different widths.

hoco_roads <- roads(state = "MD", county = "Howard County", class = "sf", progress_bar = FALSE)

hoco_major_roads_geo <- hoco_roads %>%
  filter(RTTYP %in% c("I", "U")) %>%
  st_geometry() %>%
  st_union()

key_roads <- c(
  "Berger Rd",
  "Cedar Ln",
  "Columbia Rd",
  "Columbia Gateway Dr",
  "Cradlerock Way",
  "Dobbin Rd",
  "Eden Brook Dr",
  "Eliots Oak Rd",
  "Freetown Rd",
  "Gerwig Ln",
  "Great Star Dr",
  "Grey Rock Rd",
  "Guilford Rd",
  "Harmel Dr",
  "Harpers Farm Rd",
  "Hickory Ridge Rd",
  "High Tor Hill",
  "Homespun Dr",
  "Kilimanjaro Rd",
  "Martin Rd",
  "McGaw Rd",
  "Murray Hill Rd",
  "Oakland Mills Rd",
  "Old Columbia Rd",
  "Owen Brown Rd",
  "Phelps Luck Dr",
  "Robert Fulton Dr",
  "Rt 29 Sb To South Entrance Rd Rmp",
  "S Entrance Rd",
  "Saddle Dr",
  "Shaker Dr",
  "Stevens Forest Rd",
  "Ten Mills Rd",
  "Trotter Rd",
  "Twin Rivers Rd",
  "Vollmerhausen Rd",
  "Watch Chain Way",
  "White Acres Rd"
)

hoco_minor_roads_geo <- hoco_roads %>%
  filter(RTTYP == "S" | str_detect(FULLNAME, "Pkwy") | FULLNAME %in% key_roads) %>%
  st_geometry() %>%
  st_union()

Analysis

I do analyses to answer the following questions:

Summary of Columbia population statistics

Before I look at population density, here’s a quick summary of Columbia population at the CDP, census block group, and census block level:

The total 2010 population of census block groups wholly in Columbia was 99,085. The estimated total population of Columbia-associated block groups according to the ACS 5-year estimates for 2017 was 102,798. (Recall that ACS 5-year estimates for 2017 actually reflect surveys from 2013 through 2017.) This represents population growth of about 4% over roughly half a decade or so.

As discussed previously, there are currently 54 census block groups wholly or mostly in Columbia. In 2010 the least populated census block group contained 645 people, while the most populated census block group contained 3,632 people. The average population for all block groups was 1,835 people and the median population for all block groups was 1,824 people.

In 2010 the smallest block group covered an area of 0.12 square miles (about 79 acres), while the largest block group covered an area of 3.11 square miles. The average block group area was 0.58 square miles, and the median was 0.44 square miles.

A typical census block group is thus comparable in both size and population to a traditional village or small town. It is large enough to be a recognizable “place”, but small enough to have its own identity distinct from that of other places in the county. However a census block is smaller than a Columbia village: on average a Columbia village contains about 5 or 6 census block groups.

There are 1,605 census blocks in the Columbia CDP, over half of which contained no people at all in the 2010 census; the most populated census block contained 2,696 people. The average population for all blocks was 62 people.

In 2010 there were several census blocks that contained no land at all (and hence had zero land area), while the largest block covered an area of 0.84 square miles (about 535 acres). The average census block area was 0.02 square miles (about 13 acres).

Columbia population density by census block group

I first want to look at the differing population densities among the various census block groups. I start by computing population densities for 2010, including an overall population density for Columbia itself.

To do this I take the area_cbg_2017 table (which contains the variable area specifying the land area of the block groups in square miles), delete the geometry (which I don’t need in this context), and join to it the table pop_cbg_2010 (which contains the variable population). I then create a new variable pop_density containing the population density. I also calculate the population density for Columbia as a whole.

density_cbg_2010 <- area_cbg_2010 %>%
  st_drop_geometry() %>%
  left_join(pop_cbg_2010, by = "GEOID") %>%
  mutate(pop_density = population / area)

density_col_2010 <- pop_col_2010 %>%
  left_join(area_col_2010, by = "GEOID") %>%
  mutate(pop_density = population / area)

Population density in 2010 varied from a low of 761 people per square mile to a high of 13,285 people per square mile, a difference of two orders of magnitude. Overall population density for Columbia in 2010 was 3,187 people per square mile.

The following histogram shows the distribution of population density among the various block groups for the 2010 census. For consistency I use the same limits and break points as in the histogram shown in the analysis for 2017 below.

density_cbg_2010 %>%
  ggplot(aes(x = pop_density)) +
  geom_histogram(boundary = 0, binwidth = 500) +
  geom_vline(
    mapping = aes(xintercept = median(pop_density)),
    linetype = "dashed"
  ) +
  xlab("Population Density Per Square Mile") +
  ylab("Number of Census Block Groups") +
  scale_x_continuous(breaks = seq(0, 22000, 2000), limits = c(0, 22000)) +
  labs(
    title = "2010 Core Columbia Population Density Distribution",
    subtitle = "Based on 2010 Census Values for Census Block Groups",
    caption = paste0(
      "Data sources:",
      "\n  U.S. Census Bureau, 2010 Decennial Census, Table P1",
      "\n  U.S. Census Bureau, Cartographic boundaries",
      "\nCreated using the tidyverse and tidycensus R packages"
    )
  ) +
  theme_minimal() +
  theme(axis.title.x = element_text(margin = margin(t = 10))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

The distribution is heavily skewed to the right, with only a few census block groups having a density over 6,000 people per square mile, and several block groups having population densities under 500 people per square mile. The median density, shown by the dashed line, was 3,663 people per square mile.

I now repeat the analysis for the 2017 ACS 5-year population estimates.

density_cbg_2017 <- area_cbg_2017 %>%
  st_drop_geometry() %>%
  left_join(pop_cbg_2017, by = "GEOID") %>%
  mutate(pop_density = population / area)

density_col_2017 <- pop_col_2017 %>%
  left_join(area_col_2017, by = "GEOID") %>%
  mutate(pop_density = population / area)

Population density in the 2013-2017 time frame varied from a low of 805 people per square mile to a high of 22,376 people per square mile, a wider range of densities than was present in 2010. Overall population density for the core Columbia CDP in 2013-2017 was 3,307 people per square mile.

The following histogram shows the distribution of population density among the various block groups based on the 2017 ACS 5-year population estimates.

density_cbg_2017 %>%
  ggplot(aes(x = pop_density)) +
  geom_histogram(boundary = 0, binwidth = 500) +
  geom_vline(
    mapping = aes(xintercept = median(pop_density)),
    linetype = "dashed"
  ) +
  xlab("Population Density Per Square Mile") +
  ylab("Number of Census Block Groups") +
  scale_x_continuous(breaks = seq(0, 22000, 2000)) +
  labs(
    title = "2013-2017 Core Columbia Population Density Distribution",
    subtitle = "Based on 2017 ACS 5-Year Estimates for Census Block Groups",
    caption = paste0(
      "Data sources:",
      "\n  U.S. Census Bureau, 2017 American Community Survey, 5-Year Estimates, Table B01003",
      "\n  U.S. Census Bureau, Cartographic boundaries",
      "\nCreated using the tidyverse and tidycensus R packages"
    )
  ) +
  theme_minimal() +
  theme(axis.title.x = element_text(margin = margin(t = 10))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

The distribution is also heavily skewed to the right, with a median density of 3,596 people per square mile, a little increased from 2010.

The final histogram shows both distributions on a single graph, by combining the tables for 2010 and 2017, converting the year variable into a categorical variable (“factor”), and then plotting the bars of the histogram side by side using year to plot them in different colors.

bind_rows(density_cbg_2010, density_cbg_2017) %>%
  mutate(year = as.factor(year)) %>%
  ggplot(aes(x = pop_density, fill = year)) +
  geom_histogram(boundary = 0, binwidth = 500, position = "dodge") +
  xlab("Population Density Per Square Mile") +
  ylab("Number of Census Block Groups") +
  scale_x_continuous(breaks = seq(0, 22000, 2000)) +
  scale_fill_viridis(option = "viridis", name = "Year", discrete = TRUE) +
  labs(
    title = "2010 and 2013-2017 Core Columbia Population Density Distribution",
    subtitle = "Based on 2010 Census and 2017 ACS 5-Year Estimates for Census Block Groups",
    caption = paste0(
      "Data sources:",
      "\n  U.S. Census Bureau, 2010 Decennial Census, Table P1",
      "\n  U.S. Census Bureau, 2017 American Community Survey 5-Year Estimates, Table B01003",
      "\nCreated using the tidyverse and tidycensus R packages"
    )
  ) +
  theme_minimal() +
  theme(axis.title.x = element_text(margin = margin(t = 10))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

You can see the general rightward shift in the distribution in 2013-2017 compared to 2010, with more census blocks at higher densities and higher maximum densities.

Mapping Columbia population density variation by census block group

I next show the same data in the form of two maps, one for 2010 and one for 2013-2017. Because the population densities span such a wide range of values, I use a logarithmic scale when choosing the colors, based on powers of 10. I also use the same limits for the scale in both maps, to ensure that the color values are comparable between the maps.

Note that I don’t bother trying to label the various roads because it would be more work and make the map more cluttered—and in any case the intended audience of Howard County residents should be able to identify most if not all of the roads by sight. I compute the boundary of the area covered by all the census block groups, and then intersect that with the roads geometry to show only roads lying within Columbia.

col_boundary <- st_union(area_cbg_2010)

major_roads_geo <- st_intersection(col_boundary, hoco_major_roads_geo)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
minor_roads_geo <- st_intersection(col_boundary, hoco_minor_roads_geo)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
area_cbg_2010 %>%
  left_join(density_cbg_2010, by = "GEOID") %>%
  ggplot(aes(fill = pop_density, geometry = geometry)) +
  geom_sf(size = 0) +
  geom_sf(data = major_roads_geo, color = "white", size = 1.0, fill = NA) +
  geom_sf(data = minor_roads_geo, color = "white", size = 0.5, fill = NA) +
  scale_fill_viridis(trans = "log10", option = "plasma", name = "Population\nPer Sq Mi", limits = c(100, 25000), breaks = c(100, 1000, 10000), labels = comma) +
  labs(title="2010 Core Columbia Population Density Variations",
       subtitle = "Population Density Per Square Mile for Census Block Groups",
       caption = paste0(
         "Data sources:",
         "\n  U.S. Census Bureau, 2010 Decennial Census, Table P1",
         "\n  U.S. Census Bureau, Cartographic boundaries",
         "\nCreated using the tidyverse and tidycensus R packages"
       )
  ) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0)) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())

The areas of highest population density appear to be in Owen Brown and Wilde Lake villages.

I now create a comparable map for the 2013-2017 timeframe.

area_cbg_2017 %>%
  left_join(density_cbg_2017, by = "GEOID") %>%
  ggplot(aes(fill = pop_density, geometry = geometry)) +
  geom_sf(size = 0) +
  geom_sf(data = major_roads_geo, color = "white", size = 0.8, fill = NA) +
  geom_sf(data = minor_roads_geo, color = "white", size = 0.4, fill = NA) +
  scale_fill_viridis(trans = "log10", option = "plasma", name = "Population\nPer Sq Mi", limits = c(100, 25000), breaks = c(100, 1000, 10000), labels = comma) +
  labs(title="2013-2017 Core Columbia Population Density Variations",
       subtitle = "Population Density Per Square Mile for Census Block Groups",
       caption = paste0(
         "Data sources:",
         "\n  U.S. Census Bureau, 2017 American Community Survey 5-Year Estimates, Table B01003",
         "\n  U.S. Census Bureau, Cartographic boundaries",
         "\nCreated using the tidyverse and tidycensus R packages"
       )
  ) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0)) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())

The overall picture is similar to that for 2010—which is only to be expected since only a few years separate the two maps.

Columbia population density by census block

I also want to look at the differing population densities among the various census blocks in 2010.

I first create a table density_block_2010 containing per-block population densities. This table is analogous to the previous table density_cbg_2010 and is created in a similar way.

density_block_2010 <- area_block_2010 %>%
  st_drop_geometry() %>%
  left_join(pop_block_2010, by = "GEOID") %>%
  mutate(pop_density = ifelse(area > 0, population / area, 0))

Population density at the block level in 2010 varied from a low of 0 people per square mile (since most blocks contain no people at all) to a high of 610,379 people per square mile, a difference of almost six orders of magnitude. The average density for a block was 4,464.

However, the block with highest density is an outlier, as shown by the following table of the top 20 census blocks by density.

density_block_2010 %>%
  select(GEOID, area, population, pop_density) %>%
  arrange(desc(pop_density)) %>%
  head(20) %>%
  kable()
GEOID area population pop_density
240276067051016 0.0001212 74 610379.34
240276054023029 0.0016290 246 151016.13
240276068031043 0.0015931 211 132449.70
240276066034002 0.0001108 11 99267.83
240276054024017 0.0124495 1175 94381.46
240276067062003 0.0000965 9 93239.57
240276066061001 0.0001135 10 88094.83
240276054013007 0.0043958 370 84171.77
240276055021007 0.0003571 27 75599.65
240276055032004 0.0003402 21 61736.38
240276066061020 0.0060695 348 57335.61
240276054023008 0.0012537 71 56633.55
240276066062026 0.0001139 6 52677.72
240276066072009 0.0006181 31 50149.67
240276067051001 0.0001004 5 49807.46
240276055023015 0.0001861 9 48360.77
240276055041046 0.0001656 8 48298.14
240276023023045 0.0029668 139 46851.68
240276067043005 0.0002884 13 45073.42
240276056021044 0.0005795 26 44863.22

Since there is only one census block with a density over about 150,000 people per square miles, and several in the range from roughly 50-150,000 per square mile, I adjust the densities to cap them at a maximum value of 150,000. I also change any blocks with a density of zero to have a density of 1 person per square mile. This prevents errors when plotting the density on a logarithmic scale.

density_block_2010 <- density_block_2010 %>%
  mutate(pop_density = ifelse(pop_density > 150000, 150000, pop_density)) %>%
  mutate(pop_density = ifelse(pop_density < 1, 1, pop_density))

Mapping Columbia population density variation by census blocks

I next show the 2010 population density variations at the census block level. Because the population densities span such a wide range of values, I again use a logarithmic scale when choosing the colors, based on powers of 10. Note that the color values are not comparable between this map and the maps above, because this map spans a considerably wider range of densities.

area_block_2010 %>%
  left_join(density_block_2010, by = "GEOID") %>%
  ggplot(aes(fill = pop_density, geometry = geometry)) +
  geom_sf(size = 0) +
  geom_sf(data = major_roads_geo, color = "white", size = 1.0, fill = NA) +
  geom_sf(data = minor_roads_geo, color = "white", size = 0.5, fill = NA) +
#  scale_fill_viridis(trans = "log10", option = "plasma", name = "Population\nPer Sq Mi", limits = c(0, 150000), breaks = c(1, 100, 1000, 10000, 100000)) +
  scale_fill_viridis(trans = "log10", option = "plasma", name = "Population\nPer Sq Mi", breaks = c(1, 10, 100, 1000, 10000, 100000), labels = comma) +
  labs(title="2010 Core Columbia Population Density Variations",
       subtitle = "Population Density Per Square Mile for Census Blocks",
       caption = paste0(
         "Data sources:",
         "\n  U.S. Census Bureau, 2010 Decennial Census, Table P1",
         "\n  U.S. Census Bureau, Cartographic boundaries",
         "\nCreated using the tidyverse and tidycensus R packages"
       )
  ) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0)) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())

This map is most notable for showing areas of Columbia that have very low population densities. These are typically areas like the Mall at Columbia or Columbia Gateway that have office, retail, or industrial buildings but no residential buildings. (Recall that this data is from 2010, before the construction of the apartment buildings next to the mall.)

Note also that some areas appear on this map that were not on the prior maps. These reflect census blocks that are in the Columbia CDP, but that are in census block groups that are mostly not in Columbia. The most notable example of this is the area east of U.S. 29 and north of MD 108, which is split between Columbia and Ellicott City.

Columbia population density changes by census block group

I now take another look at the population density changes between 2010 and the 2013-2017 time frame, this time mapping the changes for individual census block groups. As in the previous analysis for Howard County as a whole, I choose to show percentage changes in density rather than absolute changes. The map itself is created in exactly the same way as before.

density_changes <- bind_rows(density_cbg_2010, density_cbg_2017) %>%
  select(GEOID, pop_density, year) %>%
  spread(year, pop_density) %>%
  mutate(pct_change = 100 * (`2017` - `2010`) / `2010`) %>%
  select(GEOID, pct_change)

upper_limit <- round(max(density_changes$pct_change) + 10, -1)
lower_limit <- round(min(density_changes$pct_change) - 10, -1)

area_cbg_2017 %>%
  left_join(density_changes, by = "GEOID") %>%
  ggplot(aes(fill = pct_change, geometry = geometry)) +
  geom_sf(size = 0) +
  geom_sf(data = major_roads_geo, color = "white", size = 0.8, fill = NA) +
  geom_sf(data = minor_roads_geo, color = "white", size = 0.4, fill = NA) +
  scale_fill_viridis(option = "viridis", name = "% Change", limits = c(lower_limit, upper_limit), breaks = seq(lower_limit, upper_limit, 20)) +
  labs(title="Changes in Core Columbia Population Density",
       subtitle = "2017 5-Year Estimates vs. 2010 Values for Census Block Groups",
       caption = paste0(
         "Data sources:",
         "\n  U.S. Census Bureau, 2010 Decennial Census, Table P1",
         "\n  U.S. Census Bureau, 2017 American Community Survey 5-Year Estimates, Table B101003",
         "\n  U.S. Census Bureau, Cartographic boundaries",
         "\nCreated using the tidyverse and tidycensus R packages"
       )
  ) +
  theme(plot.caption = element_text(hjust = 0, margin = margin(t = 15))) +
  theme(axis.ticks = element_blank(), axis.text = element_blank()) +
  theme(panel.background = element_blank())

Appendix

Caveats

Population values from the American Community Survey are estimates based on survey samples, with associated margins of error. For estimates at the census block group level the associated standard errors can be 10-30% or more of the estimates themselves.

As noted above, 5-year estimates identified as being for a certain year actually reflect surveys conducted over the prior five years. So the 2017 population estimates actually correspond to the 2013-2017 time frame.

References

For more information about the structure of geographic identifiers used for census data, see “Understanding Geographic Identifiers (GEOIDs)”.

To obtain the 2010 census and 2017 ACS data I used the tidycensus package, which provides an easy-to-use interface to U.S. Census Bureau data made available via a set of public APIs. As its name suggests, the tidycensus package is designed to be compatible with the tidyverse approach to representing and manipulating data.

To obtain the census block group and census block boundaries I used the tigris package, which is designed to accompany the tidycensus package and provides access to U.S. Census Bureau cartographic boundaries and TIGER/Line shapefiles.

Calculation of population density based on land area is discussed in the U.S. Census Bureau document 2010 Census Summary File 1, 2010 Census of Population and Housing, Technical Documentation. See Appendix A, “Geographic Terms and Concepts”, page A-22, “Population and Housing Unit Density”, and Appendix B, “Definitions of Subject Characteristics”, page B-23, “Area Measurement and Density”.

Suggestions for others

If you look closely at the maps above you’ll notice that there are gaps in several places between roads where there should be no gaps. I have not yet been able to identify where the geometry for those road sections is in the table returned by the roads() function.

Environment

I used the following R environment in doing the analysis above:

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
## LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.23        scales_1.0.0      viridis_0.5.1    
##  [4] viridisLite_0.3.0 sf_0.7-7          tigris_0.8.2     
##  [7] tidycensus_0.9.2  forcats_0.4.0     stringr_1.4.0    
## [10] dplyr_0.8.3       purrr_0.3.2       readr_1.3.1      
## [13] tidyr_0.8.3       tibble_2.1.3      ggplot2_3.2.0    
## [16] tidyverse_1.2.1  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2         lubridate_1.7.4    lattice_0.20-38   
##  [4] class_7.3-15       assertthat_0.2.1   zeallot_0.1.0     
##  [7] digest_0.6.20      R6_2.4.0           cellranger_1.1.0  
## [10] backports_1.1.4    evaluate_0.14      e1071_1.7-2       
## [13] highr_0.8          httr_1.4.1         pillar_1.4.2      
## [16] rlang_0.4.0        lazyeval_0.2.2     curl_4.0          
## [19] readxl_1.3.1       uuid_0.1-2         rstudioapi_0.10   
## [22] rmarkdown_1.14     labeling_0.3       rgdal_1.4-4       
## [25] foreign_0.8-71     munsell_0.5.0      broom_0.5.2       
## [28] compiler_3.6.0     modelr_0.1.4       xfun_0.8          
## [31] pkgconfig_2.0.2    htmltools_0.3.6    tidyselect_0.2.5  
## [34] gridExtra_2.3      crayon_1.3.4       withr_2.1.2       
## [37] rappdirs_0.3.1     grid_3.6.0         nlme_3.1-139      
## [40] jsonlite_1.6       gtable_0.3.0       DBI_1.0.0         
## [43] magrittr_1.5       units_0.6-3        KernSmooth_2.23-15
## [46] cli_1.1.0          stringi_1.4.3      sp_1.3-1          
## [49] xml2_1.2.1         vctrs_0.2.0        generics_0.0.2    
## [52] tools_3.6.0        glue_1.3.1         hms_0.5.0         
## [55] yaml_2.2.0         colorspace_1.4-1   maptools_0.9-5    
## [58] classInt_0.4-1     rvest_0.3.4        haven_2.1.1

Source code

You can find the source code for this analysis and others at my hocodata public code repository. This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.