After each Census, the Census Bureau calculates the nation’s “mean center of population.” Imagine a giant map of the United States with weights stacked on top corresponding to where people live. The center of population is that point at which you could balance that map on a pin.
Here’s the national map with population centers from 1790 to 2020, retrieved from the Census Bureau.
The Census Bureau also calculates each state’s center of population, going back to 1880. This document demonstrates how to use R to (1) retrieve that data from the web, (2) process it into GIS format, and (3) plot it on a simple leaflet map.
If you just want to see the map, scroll to the bottom!
The Census Bureau publishes each state’s population center in a different table for each year. Using the {{rvest}} package, this function scrapes each table.
library(tidyverse)
library(rvest)
library(sf)
library(leaflet)
scrape_coordinates <- function(year){
centers <- read_html(paste0("https://www.census.gov/geographies/reference-files/time-series/geo/centers-population/historical-by-year.", year, ".html"))
centers %>%
html_elements("td") %>%
html_text2() %>%
tibble() %>%
rename(state = 1) %>%
mutate(latitude = lead(state, 1),
longitude = lead(state, 2)) %>%
filter(state %in% state.name,
str_detect(latitude, "°|not available")) %>%
mutate(year = year)
}
I’m no expert on web scraping, so I bet there is a better way to do this. A nice thing about the {{rvest}} package, in my opinion, is that you don’t need to be an expert to use. Even a web scraping novice like me can quickly write a function to grab data from some tables.
Once the function is defined, use dplyr::map_df()
to retrieve each census year’s table. I noticed that the Census Bureau describes an error in the data point for Mississippi in 1890, so I manually fix that.
all.centers <- map_df(seq(1880, 2020, 10), scrape_coordinates) %>%
mutate(latitude = ifelse(state == "Mississippi" & year == 1890,
yes = "32° 59′ 52″", no = latitude))
head(all.centers)
## # A tibble: 6 × 4
## state latitude longitude year
## <chr> <chr> <chr> <dbl>
## 1 Alabama 32° 51′ 09″ 86° 43′ 16″ 1880
## 2 Alaska not available not available 1880
## 3 Arizona 33° 17′ 36″ 111° 25′ 32″ 1880
## 4 Arkansas 34° 55′ 41″ 92° 30′ 25″ 1880
## 5 California 37° 55′ 55″ 121° 27′ 42″ 1880
## 6 Colorado 39° 05′ 23″ 105° 32′ 53″ 1880
Notice that the coordinates are returned as “sexagesimal degrees” (i.e. days, minutes, and seconds, or “DMS”). We need to convert these to decimal degrees. I hadn’t done this before, so I found this Stackoverflow post very helpful. The {{sf}} package has largely superceded the older {{sp}} package for working with spatial data in R. But in this case, I couldn’t quickly find an {{sf}} equivalent to sp::char2dms
.
# Convert coordinates to SF class
all.centers.sf <- all.centers %>%
filter(latitude != "not available") %>%
# Add compass direction to coordinates
mutate(latitude = paste0(str_sub(latitude, 1, -2), "\"N"),
longitude = paste0(str_sub(longitude, 1, -2), "\"W")) %>%
# make sure separators are the same
mutate(latitude = str_replace(latitude, "'", "′"),
longitude = str_replace(longitude, "'", "′")) %>%
# convert DMS coordinate representation to decimal degrees
mutate(lat = as.numeric(sp::char2dms(latitude[!is.na(latitude)], chd = "°", chm = "′", chs = '"')),
lon = as.numeric(sp::char2dms(longitude[!is.na(longitude)], chd = "°", chm = "′", chs = '"'))) %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326)
head(all.centers.sf)
## Simple feature collection with 6 features and 4 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -121.4617 ymin: 32.8525 xmax: -72.7725 ymax: 41.54694
## Geodetic CRS: WGS 84
## # A tibble: 6 × 5
## state latitude longitude year geometry
## <chr> <chr> <chr> <dbl> <POINT [°]>
## 1 Alabama "32° 51′ 09\"N" "86° 43′ 16\"W" 1880 (-86.72111 32.8525)
## 2 Arizona "33° 17′ 36\"N" "111° 25′ 32\"W" 1880 (-111.4256 33.29333)
## 3 Arkansas "34° 55′ 41\"N" "92° 30′ 25\"W" 1880 (-92.50694 34.92806)
## 4 California "37° 55′ 55\"N" "121° 27′ 42\"W" 1880 (-121.4617 37.93194)
## 5 Colorado "39° 05′ 23\"N" "105° 32′ 53\"W" 1880 (-105.5481 39.08972)
## 6 Connecticut "41° 32′ 49\"N" "72° 46′ 21\"W" 1880 (-72.7725 41.54694)
One way to plot data like this is a marker for each census year. Another technique is to connect each pair of census years with a line, e.g. 1880-1890, 1890-1900, etc. I present the data in both ways in the map below. Here is how to make those lines.
First, define each pair of census years.
year.pairs <- tibble(year1 = seq(1880, 2010, 10),
year2 = seq(1890, 2020, 10)) %>%
mutate(pair = paste(year1, year2, sep = "-"))
head(year.pairs)
## # A tibble: 6 × 3
## year1 year2 pair
## <dbl> <dbl> <chr>
## 1 1880 1890 1880-1890
## 2 1890 1900 1890-1900
## 3 1900 1910 1900-1910
## 4 1910 1920 1910-1920
## 5 1920 1930 1920-1930
## 6 1930 1940 1930-1940
Next, define a function which creates a line between the first and last year in each pair.
decade_line <- function(pair){
all.centers.sf %>%
filter(year %in% c(as.numeric(str_sub(pair, 1, 4)),
as.numeric(str_sub(pair, -4)))) %>%
group_by(state) %>%
filter(n() == 2) %>%
group_by(state) %>%
summarise(geometry = st_union(geometry)) %>%
st_cast(to = "LINESTRING") %>%
mutate(pair = pair) %>%
select(state, pair, geometry) %>%
mutate(start_year = as.numeric(str_sub(pair, 1, 4)),
end_year = as.numeric(str_sub(pair, -4)))
}
Apply this function across the data.
all.pairs <- map_df(year.pairs$pair, decade_line) %>%
mutate(pair = fct_reorder(pair, end_year))
head(all.pairs)
## Simple feature collection with 6 features and 4 fields
## Geometry type: LINESTRING
## Dimension: XY
## Bounding box: xmin: -121.4617 ymin: 32.8525 xmax: -72.7725 ymax: 41.54694
## Geodetic CRS: WGS 84
## # A tibble: 6 × 5
## state pair geometry start_year end_year
## <chr> <fct> <LINESTRING [°]> <dbl> <dbl>
## 1 Alabama 1880-1890 (-86.72111 32.8525, -86.74611 32.91… 1880 1890
## 2 Arizona 1880-1890 (-111.4256 33.29333, -111.4275 33.2… 1880 1890
## 3 Arkansas 1880-1890 (-92.49472 34.95972, -92.50694 34.9… 1880 1890
## 4 California 1880-1890 (-121.4617 37.93194, -121.0389 37.4… 1880 1890
## 5 Colorado 1880-1890 (-105.5481 39.08972, -105.2361 39.1… 1880 1890
## 6 Connecticut 1880-1890 (-72.8 41.52806, -72.7725 41.54694) 1880 1890
Now, plot the data on a leaflet tile map. I create two baselayers which can be toggled between using radio buttons. One layer shows each population center as a text marker. The other layer shows a line capturing the movement between census years. I use the “viridis” palette because it particularly good at covering a wide range of continuous data.
pair.pal <- colorFactor(palette = "viridis", domain = all.pairs$pair, reverse = T)
leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addLabelOnlyMarkers(data = all.centers.sf,
label = ~as.character(year),
labelOptions = labelOptions(noHide = T, textOnly = T,
textsize = "20px"),
group = "Year markers") %>%
addPolylines(data = all.pairs,
color = ~pair.pal(pair),
label = ~pair, group = "Lines") %>%
addLegend(position = "bottomright", pal = pair.pal, values = all.pairs$pair,
title = "Center of Pop.", group = "Lines") %>%
addLayersControl(baseGroups = c("Year markers", "Lines"),
options = layersControlOptions(collapsed = F))
Leaflet maps are amazing. This entire post can be accessed offline, and the file is only 1.6MB.