State Centers of Population

After each Census, the Census Bureau calculates the nation’s “mean center of population.” Imagine a giant map of the United States with weights stacked on top corresponding to where people live. The center of population is that point at which you could balance that map on a pin.

Here’s the national map with population centers from 1790 to 2020, retrieved from the Census Bureau.

The Census Bureau also calculates each state’s center of population, going back to 1880. This document demonstrates how to use R to (1) retrieve that data from the web, (2) process it into GIS format, and (3) plot it on a simple leaflet map.

If you just want to see the map, scroll to the bottom!

Retrieve the data

The Census Bureau publishes each state’s population center in a different table for each year. Using the {{rvest}} package, this function scrapes each table.

library(tidyverse)
library(rvest)
library(sf)
library(leaflet)

scrape_coordinates <- function(year){
  centers <- read_html(paste0("https://www.census.gov/geographies/reference-files/time-series/geo/centers-population/historical-by-year.", year, ".html"))
  centers %>%
    html_elements("td") %>%
    html_text2() %>%
    tibble() %>%
    rename(state = 1) %>%
    mutate(latitude = lead(state, 1),
           longitude = lead(state, 2)) %>%
    filter(state %in% state.name,
           str_detect(latitude, "°|not available")) %>%
    mutate(year = year)
}

I’m no expert on web scraping, so I bet there is a better way to do this. A nice thing about the {{rvest}} package, in my opinion, is that you don’t need to be an expert to use. Even a web scraping novice like me can quickly write a function to grab data from some tables.

Once the function is defined, use dplyr::map_df() to retrieve each census year’s table. I noticed that the Census Bureau describes an error in the data point for Mississippi in 1890, so I manually fix that.

all.centers <- map_df(seq(1880, 2020, 10), scrape_coordinates) %>%
  mutate(latitude = ifelse(state == "Mississippi" & year == 1890,
                           yes = "32° 59′ 52″", no = latitude))
head(all.centers)

## # A tibble: 6 × 4
##   state      latitude      longitude      year
##   <chr>      <chr>         <chr>         <dbl>
## 1 Alabama    32° 51′ 09″   86° 43′ 16″    1880
## 2 Alaska     not available not available  1880
## 3 Arizona    33° 17′ 36″   111° 25′ 32″   1880
## 4 Arkansas   34° 55′ 41″   92° 30′ 25″    1880
## 5 California 37° 55′ 55″   121° 27′ 42″   1880
## 6 Colorado   39° 05′ 23″   105° 32′ 53″   1880

Clean the data

Notice that the coordinates are returned as “sexagesimal degrees” (i.e. days, minutes, and seconds, or “DMS”). We need to convert these to decimal degrees. I hadn’t done this before, so I found this Stackoverflow post very helpful. The {{sf}} package has largely superceded the older {{sp}} package for working with spatial data in R. But in this case, I couldn’t quickly find an {{sf}} equivalent to sp::char2dms.

# Convert coordinates to SF class
all.centers.sf <- all.centers %>%
  filter(latitude != "not available") %>%
  # Add compass direction to coordinates
  mutate(latitude = paste0(str_sub(latitude, 1, -2), "\"N"),
         longitude = paste0(str_sub(longitude, 1, -2), "\"W")) %>%
  # make sure separators are the same
  mutate(latitude = str_replace(latitude, "'", "′"),
         longitude = str_replace(longitude, "'", "′")) %>%
  # convert DMS coordinate representation to decimal degrees
  mutate(lat = as.numeric(sp::char2dms(latitude[!is.na(latitude)], chd = "°", chm = "′", chs = '"')),
         lon = as.numeric(sp::char2dms(longitude[!is.na(longitude)], chd = "°", chm = "′", chs = '"'))) %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326)
head(all.centers.sf)

## Simple feature collection with 6 features and 4 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -121.4617 ymin: 32.8525 xmax: -72.7725 ymax: 41.54694
## Geodetic CRS:  WGS 84
## # A tibble: 6 × 5
##   state       latitude        longitude         year             geometry
##   <chr>       <chr>           <chr>            <dbl>          <POINT [°]>
## 1 Alabama     "32° 51′ 09\"N" "86° 43′ 16\"W"   1880  (-86.72111 32.8525)
## 2 Arizona     "33° 17′ 36\"N" "111° 25′ 32\"W"  1880 (-111.4256 33.29333)
## 3 Arkansas    "34° 55′ 41\"N" "92° 30′ 25\"W"   1880 (-92.50694 34.92806)
## 4 California  "37° 55′ 55\"N" "121° 27′ 42\"W"  1880 (-121.4617 37.93194)
## 5 Colorado    "39° 05′ 23\"N" "105° 32′ 53\"W"  1880 (-105.5481 39.08972)
## 6 Connecticut "41° 32′ 49\"N" "72° 46′ 21\"W"   1880  (-72.7725 41.54694)

Add lines

One way to plot data like this is a marker for each census year. Another technique is to connect each pair of census years with a line, e.g. 1880-1890, 1890-1900, etc. I present the data in both ways in the map below. Here is how to make those lines.

First, define each pair of census years.

year.pairs <- tibble(year1 = seq(1880, 2010, 10),
                     year2 = seq(1890, 2020, 10)) %>%
  mutate(pair = paste(year1, year2, sep = "-"))
head(year.pairs)

## # A tibble: 6 × 3
##   year1 year2 pair     
##   <dbl> <dbl> <chr>    
## 1  1880  1890 1880-1890
## 2  1890  1900 1890-1900
## 3  1900  1910 1900-1910
## 4  1910  1920 1910-1920
## 5  1920  1930 1920-1930
## 6  1930  1940 1930-1940

Next, define a function which creates a line between the first and last year in each pair.

decade_line <- function(pair){
  all.centers.sf %>%
    filter(year %in% c(as.numeric(str_sub(pair, 1, 4)),
                       as.numeric(str_sub(pair, -4)))) %>%
    group_by(state) %>%
    filter(n() == 2) %>%
    group_by(state) %>%
    summarise(geometry = st_union(geometry)) %>%
    st_cast(to = "LINESTRING") %>%
    mutate(pair = pair) %>%
    select(state, pair, geometry) %>%
    mutate(start_year = as.numeric(str_sub(pair, 1, 4)),
           end_year = as.numeric(str_sub(pair, -4)))
}

Apply this function across the data.

all.pairs <- map_df(year.pairs$pair, decade_line) %>%
  mutate(pair = fct_reorder(pair, end_year))
head(all.pairs)

## Simple feature collection with 6 features and 4 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -121.4617 ymin: 32.8525 xmax: -72.7725 ymax: 41.54694
## Geodetic CRS:  WGS 84
## # A tibble: 6 × 5
##   state       pair                                  geometry start_year end_year
##   <chr>       <fct>                         <LINESTRING [°]>      <dbl>    <dbl>
## 1 Alabama     1880-1890 (-86.72111 32.8525, -86.74611 32.91…       1880     1890
## 2 Arizona     1880-1890 (-111.4256 33.29333, -111.4275 33.2…       1880     1890
## 3 Arkansas    1880-1890 (-92.49472 34.95972, -92.50694 34.9…       1880     1890
## 4 California  1880-1890 (-121.4617 37.93194, -121.0389 37.4…       1880     1890
## 5 Colorado    1880-1890 (-105.5481 39.08972, -105.2361 39.1…       1880     1890
## 6 Connecticut 1880-1890  (-72.8 41.52806, -72.7725 41.54694)       1880     1890

Make a leaflet map

Now, plot the data on a leaflet tile map. I create two baselayers which can be toggled between using radio buttons. One layer shows each population center as a text marker. The other layer shows a line capturing the movement between census years. I use the “viridis” palette because it particularly good at covering a wide range of continuous data.

pair.pal <- colorFactor(palette = "viridis", domain = all.pairs$pair, reverse = T)

leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addLabelOnlyMarkers(data = all.centers.sf,
                      label = ~as.character(year),
                      labelOptions = labelOptions(noHide = T, textOnly = T,
                                                  textsize = "20px"),
                      group = "Year markers") %>%
  addPolylines(data = all.pairs,
               color = ~pair.pal(pair),
               label = ~pair, group = "Lines") %>%
  addLegend(position = "bottomright", pal = pair.pal, values = all.pairs$pair,
            title = "Center of Pop.", group = "Lines") %>%
  addLayersControl(baseGroups = c("Year markers", "Lines"),
                   options = layersControlOptions(collapsed = F))