Introduction

This guide goes over a handful of useful data sources for geospatial analysis. One big limitation of working with geospatial data is the geographic unit or units that the data are available in, so always inspect what is available when choosing the right data for a project. If the data is available in a geographic unit supported by the US Census Bureau, then, using FIPS codes, it can be aggregated upward to larger geographic units (e.g. all counties in a state) or can be used to apply averages to smaller geographic units (e.g. state average to all counties).

Census Bureau Data and Shapefiles

The Census Bureau provides a wide variety of demographic and housing data for various geographic units. Not all of their data is available for each unit–specifically smaller units, so this is a serious limitation when working with their data.

Below is a graphic that depicts the Census Bureau’s geographic entities hierarchy (here is their page that goes over this hierarchy in greater detail):

The Census Bureau also provides corresponding shapefiles for that data, making it easy to create maps using that data.

The R package tidycensus provides several easy to use API calls for downloading Census Bureau data and their shapefiles. The first step for using these API calls is to set up an API key.

Set Census API key

Before running the code below, which sets the user’s API key for downloading Census Bureau data, first request an API code here.

Replace the “[Insert API Key Here]” line below with your own API key, and then run the following.

Note: this should only have to be run once per system.

# Set Census API key
census_api_key("[Insert API Key Here]",
               install = TRUE)

# Re-reads startup file to load Census API key
readRenviron("~/.Renviron")

FIPS Codes

The tidycensus package includes a built in data set that includes FIPS codes for all states, counties, and other territories in the US. This data set is useful for finding FIPS codes for the exact states and counties needed in an API call. Here is a method to quickly explore this data set, using the lookup_code() function from the tigris package (a companion package to tidycensus).

# Use pacman package to load packages used for exercise
library(pacman)

p_load(tidyverse,   # Variety of packages for data manipulation
       tidycensus,  # Package for downloading census data
       tigris,      # Package for downloading census shapefiles
       kableExtra)  # For creating tables

# Use the tigris package function lookup_code() to search for FIPS codes
# State only:
lookup_code(state = "PA")

## [1] "The code for Pennsylvania is '42'."

# State and County:
lookup_code(state = "PA", county = "Allegheny")

## [1] "The code for Pennsylvania is '42' and the code for Allegheny County is '003'."

FIPS Codes data

Users can also load the fips_codes data set directly to view the data set.

# Load the fips_codes data set
data(fips_codes)

# View the fips_codes data structure
fips_codes[1:5,] %>%
  kbl(caption = "US State and County FIPS Codes (first 5)", 
      align = "r") %>%
  kable_minimal("hover")

US State and County FIPS Codes (first 5)
state	state_code	state_name	county_code	county
AL	01	Alabama	001	Autauga County
AL	01	Alabama	003	Baldwin County
AL	01	Alabama	005	Barbour County
AL	01	Alabama	007	Bibb County
AL	01	Alabama	009	Blount County

The data can then be manipulated to return just the FIPS codes for each state and territory.

# Keep just one FIPS code entry per state/territory
fips <- fips_codes %>%
  select(state, state_code, state_name) %>%
  distinct()

# View state/territory FIPS code data structure
fips[c(1:5),] %>%
  kbl(caption = "US State and Territory FIPS Codes (first 5)",
      align = "r") %>%
  kable_minimal("hover")

US State and Territory FIPS Codes (first 5)
state	state_code	state_name
AL	01	Alabama
AK	02	Alaska
AZ	04	Arizona
AR	05	Arkansas
CA	06	California

# Washington, D.C. and FIPS codes over 56
fips %>%
  filter(state %in% c("DC") | state_code > 56) %>%
  kbl(caption = "Washington, D.C. and US Territory FIPS Codes",
      align = "r") %>%
  kable_minimal("hover")

Washington, D.C. and US Territory FIPS Codes
state	state_code	state_name
DC	11	District of Columbia
AS	60	American Samoa
GU	66	Guam
MP	69	Northern Mariana Islands
PR	72	Puerto Rico
UM	74	U.S. Minor Outlying Islands
VI	78	U.S. Virgin Islands

And the data can be further limited to just provide FIPS codes for the 50 states.

# Keep FIPS code for just the 50 states
state_fips <- fips_codes %>%
  select(state, state_code, state_name) %>%
  filter(state_code <= 56 & !(state %in% c("DC"))) %>% 
  distinct()

# View state FIPS codes data structure
state_fips[1:10,] %>%
  kbl(caption = "50 US State FIPS Codes (first 10)",
      align = "r") %>%
  kable_minimal("hover")

50 US State FIPS Codes (first 10)
state	state_code	state_name
AL	01	Alabama
AK	02	Alaska
AZ	04	Arizona
AR	05	Arkansas
CA	06	California
CO	08	Colorado
CT	09	Connecticut
DE	10	Delaware
FL	12	Florida
GA	13	Georgia

Download Census Bureau Data

The tidycensus package can also be used to download data directly from the Census Bureau using API calls. The Census Bureau provides tables online that list available variables. The tidycensus package can also download those tables to provide an internal list of variables.

List Census Bureau Variables

The code below populates a table with variables from a specific data set and year using the load_variables() function from the tidycensus package. In this example, it downloads data from the ACS 5-year estimates for 2020.

# Store ACS variables to view
variables <- load_variables(year = 2020,
                            dataset = "acs5/profile")

# View ACS variables
head(variables %>%
       filter(concept == "ACS DEMOGRAPHIC AND HOUSING ESTIMATES")) %>%
  kbl(caption = "First 5 Demographic Variables",
      align = "l") %>%
  kable_minimal("hover")

First 5 Demographic Variables
name	label	concept
DP05_0001	Estimate!!SEX AND AGE!!Total population	ACS DEMOGRAPHIC AND HOUSING ESTIMATES
DP05_0001P	Percent!!SEX AND AGE!!Total population	ACS DEMOGRAPHIC AND HOUSING ESTIMATES
DP05_0002	Estimate!!SEX AND AGE!!Total population!!Male	ACS DEMOGRAPHIC AND HOUSING ESTIMATES
DP05_0002P	Percent!!SEX AND AGE!!Total population!!Male	ACS DEMOGRAPHIC AND HOUSING ESTIMATES
DP05_0003	Estimate!!SEX AND AGE!!Total population!!Female	ACS DEMOGRAPHIC AND HOUSING ESTIMATES
DP05_0003P	Percent!!SEX AND AGE!!Total population!!Female	ACS DEMOGRAPHIC AND HOUSING ESTIMATES

Using the API

Using the information from above, the variable “DP05_0001” can be used to pull total population for a geographic region. Here is an example of an API call using the get_acs() function to download statewide populations for the year 2020.

# Download state populations
state_populations <- get_acs(geography = "state",
                             variables = "DP05_0001",
                             year = 2020) %>%
  rename(POPULATION = estimate) %>%
  select(GEOID, NAME, POPULATION)

# View data structure
state_populations[1:5,] %>%
  kbl(caption = "US State Populations (first 5)",
      align = "r",
      format.args = list(big.mark = ",",
                         scientific = FALSE)) %>%
  kable_minimal("hover")

US State Populations (first 5)
GEOID	NAME	POPULATION
01	Alabama	4,893,186
02	Alaska	736,990
04	Arizona	7,174,064
05	Arkansas	3,011,873
06	California	39,346,023

Download Shapefiles

Adding another line of code to the API call above, geometry = TRUE, also downloads the corresponding shapefiles for the data. The example below also includes shift_geo = TRUE, which resizes and relocates Alaksa and Hawaii to fit closer to the contiguous states for mapping purposes.

# Load tmap library
p_load(tmap)    # For creating categorical maps with tmap 

# Download state populations with shapefile
state_populations_shapefile <- get_acs(geography = "state",
                                       variables = "DP05_0001",
                                       year = 2020,
                                       geometry = TRUE,
                                       shift_geo = TRUE) %>%
  rename(POPULATION = estimate) %>%
  select(GEOID, NAME, POPULATION)

# Plot statewide populations
tm_shape(state_populations_shapefile) +
  tm_fill(title = "Statewide Population (2020)",
          col = "POPULATION",
          style = "cont",
          palette = "viridis",
          alpha = 0.7,
          n = 10) +
  tm_shape(state_populations_shapefile) +
  tm_borders(col = "white") +
  tm_layout(frame = FALSE,
            legend.outside = TRUE)

Use FIPS Codes to Limit Selection

Using the FIPS codes from above, this map can be limited easily to just the 48 contiguous states. This is also how a user could use the FIPS code dataset to limit which states are selected with any of the tidycensus API call functions.

# Create list of contiguous states
states <- unlist(state_fips %>%
                   filter(!state %in% c("AK", "HI")) %>%
                   select(state))

# Download population and shapefiles for contiguous states only
contiguous_populations_shapefile <- get_acs(geography = "state",
                                            state =  states,
                                            variables = "DP05_0001",
                                            year = 2020,
                                            geometry = TRUE)  %>%
  rename(POPULATION = estimate) %>%
  select(GEOID, NAME, POPULATION)

# Plot statewide populations for contiguous states
tm_shape(contiguous_populations_shapefile) +
  tm_fill(title = "Statewide Population (2020)",
          col = "POPULATION",
          style = "cont",
          palette = "viridis",
          alpha = 0.7,
          n = 10) +
  tm_shape(contiguous_populations_shapefile) +
  tm_borders(col = "white") +
  tm_layout(frame = FALSE,
            legend.outside = TRUE)

HUD Crosswalk Files

The US Department of Housing and Urban Development’s (HUD) Office of Policy Development and Research puts out ZIP code crosswalk data for each quarter year. These crosswalks allow for easy linking between ZIP codes and census tracts, counties, Core Based Statistical Areas (CBSA), and congressional districts.

Below are examples on how to use the ZIP code to County crosswalk file for the 4th quarter of 2021.

Note: this data is available as an .xlsx spreadsheet, but it has been converted to a .csv file for this example.

# Load the data (use colClasses to keep leading 0's in ZIP codes and county FIPS codes)
hud_crosswalk <- read.csv("./data/ZIP_COUNTY_122021.csv",
                          colClasses = c("zip" = "character",
                                         "county" = "character"))

# Inspect the data
hud_crosswalk[1:5,] %>%
       arrange(zip) %>%
  kbl(caption = "First Five Rows of HUD ZIP Code to County Crosswalk", 
      align = "r") %>%
  kable_minimal("hover")

First Five Rows of HUD ZIP Code to County Crosswalk
zip	county	usps_zip_pref_city	usps_zip_pref_state	res_ratio	bus_ratio	oth_ratio	tot_ratio
00683	72125	SAN GERMAN	PR	0.9530361	0.9966555	0.9807692	0.9563004
00683	72079	SAN GERMAN	PR	0.0009488	0.0000000	0.0000000	0.0008725
00683	72023	SAN GERMAN	PR	0.0007906	0.0011148	0.0000000	0.0007998
00683	72097	SAN GERMAN	PR	0.0001581	0.0000000	0.0000000	0.0001454
00683	72121	SAN GERMAN	PR	0.0450664	0.0022297	0.0192308	0.0418818

One common reason to use these crosswalks is to assign a county to a ZIP code. However, ZIP codes are not subsets of counties, so their routes can cross county lines. Fortunately, HUD provides the proportion of residents living in each linked county for that ZIP code, which can be used to asign a single county to each ZIP code. Here is an example of a ZIP code that crosses into multiple counties.

# Subset HUD data for ZIP code 39573
hud_crosswalk %>%
  filter(zip == "39573") %>%
  arrange(desc(res_ratio)) %>%
  kbl(caption = "All Counties Covered by ZIP Code 39573",
      align = "r") %>%
  kable_minimal("hover")

All Counties Covered by ZIP Code 39573
zip	county	usps_zip_pref_city	usps_zip_pref_state	res_ratio	bus_ratio	oth_ratio	tot_ratio
39573	28131	PERKINSTON	MS	0.6815420	0.6818182	1	0.6824586
39573	28045	PERKINSTON	MS	0.2214540	0.2272727	0	0.2209516
39573	28039	PERKINSTON	MS	0.0343599	0.0181818	0	0.0338983
39573	28109	PERKINSTON	MS	0.0335219	0.0000000	0	0.0326731
39573	28059	PERKINSTON	MS	0.0163419	0.0636364	0	0.0173576
39573	28047	PERKINSTON	MS	0.0127802	0.0090909	0	0.0126608

According to HUD, 68.2% of the residents with a 39573 ZIP code lived in county 28131 at the end of 2021, so it makes sense to assign 39573 to that county if the goal is to have a single ZIP code and county combination. Using the FIPS data from above, county 28131 can be quickly identified:

# Identify county 131 in state 28
fips_codes %>%
  filter(state_code == "28" & county_code == "131") %>%
  select(county, state_name) %>%
  kbl(caption = "County 28131",
      align = "r") %>%
  kable_minimal("hover")

County 28131
county	state_name
Stone County	Mississippi

One County per ZIP Code

Here is a quick code to limit the HUD crosswalk data to include only the county with the highest proportion of residents for each ZIP code.

# Use the variable res_ratio to keep only one county for each ZIP code
zip_county <- hud_crosswalk %>%
  group_by(zip) %>%
  arrange(desc(res_ratio)) %>%
  filter(row_number()==1) %>%
  arrange(county, zip)

# Inspect the modified data
zip_county[1:5,] %>%
  kbl(caption = "First Five Rows of HUD Data with One County per ZIP Code",
      align = "r") %>%
  kable_minimal("hover")

First Five Rows of HUD Data with One County per ZIP Code
zip	county	usps_zip_pref_city	usps_zip_pref_state	res_ratio	bus_ratio	oth_ratio	tot_ratio
36003	01001	AUTAUGAVILLE	AL	1.0000000	1.0000000	1.0000000	1.0000000
36006	01001	BILLINGSLEY	AL	0.7160326	0.5294118	0.5555556	0.7099738
36008	01001	BOOTH	AL	1.0000000	1.0000000	0.0000000	1.0000000
36051	01001	MARBURY	AL	0.6590747	0.8055556	1.0000000	0.6666667
36066	01001	PRATTVILLE	AL	0.8858471	0.7206266	0.7066327	0.8680474

Area Deprivation Index

Another useful data source for geospatial data is the Area Deprivation Index (ADI). This metric is used to rank how a neighborhood–at the census block group level–compares to the rest of its state or on a national level in terms of socioeconomic disadvantage. The rankings are based on factors that have been linked to several healthcare outcomes, healthcare utilization rates, and early mortality.

For a full description of the data, visit the Neighborhood Atlas page. To download the data, first create an account here, and then go to the download page.

To get started with the data, here are the variable descriptions provided in the data’s read me file.

Note: although the data is available in a 9-digit ZIP code format, the providers of the data discourage using this data for ZIP codes and ZCTAs.

# Load textreader library
p_load(textreadr)    # For loading text files

# Load ADI data documentation
adi_read_me <- read_document("./data/2019 ADI_Census Block Group_v3.1_ReadMe.txt")

# Inspect variable descriptions
adi_read_me[3:6] %>%
  kbl(caption = "Descriptions of ADI Data Columns",
      align = "l",
      col.names = c("Variable Description")) %>%
  kable_minimal("hover")

Descriptions of ADI Data Columns
Variable Description
GISJOIN: Key linkage field to the block group shapefile served by NHGIS
FIPS: The block group Census ID
ADI_NATRANK: National percentile of block group ADI score
ADI_STATERNK: State-specific decile of block group ADI score

With that information in mind, here is a quick snapshot of what the data looks like for census block groups.

# Load ADI data
adi <- read.delim("./data/US_2019_ADI_Census Block Group_v3.1.txt",
                  sep = ",",
                  colClasses = c("FIPS" = "character"),
                  row.names = FALSE)

# Inspect ADI data
adi[1:5,-1] %>%
  kbl(caption = "First Five Rows of ADI Data",
      align = "r") %>%
  kable_minimal("hover")

First Five Rows of ADI Data
FIPS	ADI_NATRANK	ADI_STATERNK
010010201001	89	8
010010201002	52	2
010010202001	88	8
010010202002	82	6
010010203001	71	5

Note: both the state and national ranks for ADI include different codes for different missing values. Here’s a quick code to replace those values with NAs and then map the index.

# List of missing value codes
values <- c("GQ", "GQ-PH", "NONE", "PH", "QDI")

# Replace missing value codes with NA
adi <- adi %>%
  mutate(ADI_NATRANK = ifelse(ADI_NATRANK %in% values,
                              NA, ADI_NATRANK),
         ADI_STATERNK = ifelse(ADI_STATERNK %in% values,
                               NA, ADI_STATERNK))

# Download census block groups shapefile for Pennsylvania (from tigris package)
block_group_shapefiles <- block_groups(state = c("PA"))

# Merge ADI values to census block groups shapefile
# Note: there is a census block group for lake Erie that needs to be removed
pa_adi_map <- left_join(block_group_shapefiles,
                        adi,
                        by = c("GEOID" = "FIPS")) %>%
  filter(GEOID != "420499900000")

# Download Pennsylvania counties shapefile for map
pa_counties <- counties(state = "PA")

# Set map categories to just 10
tmap_options(max.categories = 10)

# Plot national ADI rank values for Pennsylvania
tm_shape(pa_adi_map) +
  tm_fill(title = "National ADI Rank",
          col = "ADI_NATRANK",
          style = "cont",
          palette = "viridis",
          alpha = 0.7,
          n = 10,
          colorNA = "grey",
          textNA = "Missing") +
  tm_shape(pa_counties) +
  tm_borders(col = "white") +
  tm_layout(frame = FALSE,
          legend.outside = TRUE)

And here is the same map but with statewide ADI ranks.

# Plot state ADI rank values for Pennsylvania
tm_shape(pa_adi_map) +
  tm_fill(title = "State ADI Rank",
          col = "ADI_STATERNK",
          style = "cont",
          palette = "viridis",
          alpha = 0.7,
          n = 10,
          colorNA = "grey",
          textNA = "Missing") +
  tm_shape(pa_counties) +
  tm_borders(col = "white") +
  tm_layout(frame = FALSE,
            legend.outside = TRUE)

[Tutorial] Geospatial Data Resources

Dylan Nagy