Introduction

This guide goes over a handful of useful data sources for geospatial analysis. One big limitation of working with geospatial data is the geographic unit or units that the data are available in, so always inspect what is available when choosing the right data for a project. If the data is available in a geographic unit supported by the US Census Bureau, then, using FIPS codes, it can be aggregated upward to larger geographic units (e.g. all counties in a state) or can be used to apply averages to smaller geographic units (e.g. state average to all counties).

Census Bureau Data and Shapefiles

The Census Bureau provides a wide variety of demographic and housing data for various geographic units. Not all of their data is available for each unit–specifically smaller units, so this is a serious limitation when working with their data.

Below is a graphic that depicts the Census Bureau’s geographic entities hierarchy (here is their page that goes over this hierarchy in greater detail):


Census Bureau Geographic Hierarchy


The Census Bureau also provides corresponding shapefiles for that data, making it easy to create maps using that data.

The R package tidycensus provides several easy to use API calls for downloading Census Bureau data and their shapefiles. The first step for using these API calls is to set up an API key.

Set Census API key

Before running the code below, which sets the user’s API key for downloading Census Bureau data, first request an API code here.

Replace the “[Insert API Key Here]” line below with your own API key, and then run the following.

Note: this should only have to be run once per system.

# Set Census API key
census_api_key("[Insert API Key Here]",
               install = TRUE)

# Re-reads startup file to load Census API key
readRenviron("~/.Renviron")

FIPS Codes

The tidycensus package includes a built in data set that includes FIPS codes for all states, counties, and other territories in the US. This data set is useful for finding FIPS codes for the exact states and counties needed in an API call. Here is a method to quickly explore this data set, using the lookup_code() function from the tigris package (a companion package to tidycensus).

# Use pacman package to load packages used for exercise
library(pacman)

p_load(tidyverse,   # Variety of packages for data manipulation
       tidycensus,  # Package for downloading census data
       tigris,      # Package for downloading census shapefiles
       kableExtra)  # For creating tables

# Use the tigris package function lookup_code() to search for FIPS codes
# State only:
lookup_code(state = "PA")
## [1] "The code for Pennsylvania is '42'."
# State and County:
lookup_code(state = "PA", county = "Allegheny")
## [1] "The code for Pennsylvania is '42' and the code for Allegheny County is '003'."

FIPS Codes data

Users can also load the fips_codes data set directly to view the data set.

# Load the fips_codes data set
data(fips_codes)

# View the fips_codes data structure
fips_codes[1:5,] %>%
  kbl(caption = "US State and County FIPS Codes (first 5)", 
      align = "r") %>%
  kable_minimal("hover")
US State and County FIPS Codes (first 5)
state state_code state_name county_code county
AL 01 Alabama 001 Autauga County
AL 01 Alabama 003 Baldwin County
AL 01 Alabama 005 Barbour County
AL 01 Alabama 007 Bibb County
AL 01 Alabama 009 Blount County



The data can then be manipulated to return just the FIPS codes for each state and territory.

# Keep just one FIPS code entry per state/territory
fips <- fips_codes %>%
  select(state, state_code, state_name) %>%
  distinct()

# View state/territory FIPS code data structure
fips[c(1:5),] %>%
  kbl(caption = "US State and Territory FIPS Codes (first 5)",
      align = "r") %>%
  kable_minimal("hover")
US State and Territory FIPS Codes (first 5)
state state_code state_name
AL 01 Alabama
AK 02 Alaska
AZ 04 Arizona
AR 05 Arkansas
CA 06 California
# Washington, D.C. and FIPS codes over 56
fips %>%
  filter(state %in% c("DC") | state_code > 56) %>%
  kbl(caption = "Washington, D.C. and US Territory FIPS Codes",
      align = "r") %>%
  kable_minimal("hover")
Washington, D.C. and US Territory FIPS Codes
state state_code state_name
DC 11 District of Columbia
AS 60 American Samoa
GU 66 Guam
MP 69 Northern Mariana Islands
PR 72 Puerto Rico
UM 74 U.S. Minor Outlying Islands
VI 78 U.S. Virgin Islands



And the data can be further limited to just provide FIPS codes for the 50 states.

# Keep FIPS code for just the 50 states
state_fips <- fips_codes %>%
  select(state, state_code, state_name) %>%
  filter(state_code <= 56 & !(state %in% c("DC"))) %>% 
  distinct()

# View state FIPS codes data structure
state_fips[1:10,] %>%
  kbl(caption = "50 US State FIPS Codes (first 10)",
      align = "r") %>%
  kable_minimal("hover")
50 US State FIPS Codes (first 10)
state state_code state_name
AL 01 Alabama
AK 02 Alaska
AZ 04 Arizona
AR 05 Arkansas
CA 06 California
CO 08 Colorado
CT 09 Connecticut
DE 10 Delaware
FL 12 Florida
GA 13 Georgia

Download Census Bureau Data

The tidycensus package can also be used to download data directly from the Census Bureau using API calls. The Census Bureau provides tables online that list available variables. The tidycensus package can also download those tables to provide an internal list of variables.

List Census Bureau Variables

The code below populates a table with variables from a specific data set and year using the load_variables() function from the tidycensus package. In this example, it downloads data from the ACS 5-year estimates for 2020.

# Store ACS variables to view
variables <- load_variables(year = 2020,
                            dataset = "acs5/profile")

# View ACS variables
head(variables %>%
       filter(concept == "ACS DEMOGRAPHIC AND HOUSING ESTIMATES")) %>%
  kbl(caption = "First 5 Demographic Variables",
      align = "l") %>%
  kable_minimal("hover")
First 5 Demographic Variables
name label concept
DP05_0001 Estimate!!SEX AND AGE!!Total population ACS DEMOGRAPHIC AND HOUSING ESTIMATES
DP05_0001P Percent!!SEX AND AGE!!Total population ACS DEMOGRAPHIC AND HOUSING ESTIMATES
DP05_0002 Estimate!!SEX AND AGE!!Total population!!Male ACS DEMOGRAPHIC AND HOUSING ESTIMATES
DP05_0002P Percent!!SEX AND AGE!!Total population!!Male ACS DEMOGRAPHIC AND HOUSING ESTIMATES
DP05_0003 Estimate!!SEX AND AGE!!Total population!!Female ACS DEMOGRAPHIC AND HOUSING ESTIMATES
DP05_0003P Percent!!SEX AND AGE!!Total population!!Female ACS DEMOGRAPHIC AND HOUSING ESTIMATES

Using the API

Using the information from above, the variable “DP05_0001” can be used to pull total population for a geographic region. Here is an example of an API call using the get_acs() function to download statewide populations for the year 2020.

# Download state populations
state_populations <- get_acs(geography = "state",
                             variables = "DP05_0001",
                             year = 2020) %>%
  rename(POPULATION = estimate) %>%
  select(GEOID, NAME, POPULATION)

# View data structure
state_populations[1:5,] %>%
  kbl(caption = "US State Populations (first 5)",
      align = "r",
      format.args = list(big.mark = ",",
                         scientific = FALSE)) %>%
  kable_minimal("hover")
US State Populations (first 5)
GEOID NAME POPULATION
01 Alabama 4,893,186
02 Alaska 736,990
04 Arizona 7,174,064
05 Arkansas 3,011,873
06 California 39,346,023

Download Shapefiles

Adding another line of code to the API call above, geometry = TRUE, also downloads the corresponding shapefiles for the data. The example below also includes shift_geo = TRUE, which resizes and relocates Alaksa and Hawaii to fit closer to the contiguous states for mapping purposes.

# Load tmap library
p_load(tmap)    # For creating categorical maps with tmap 

# Download state populations with shapefile
state_populations_shapefile <- get_acs(geography = "state",
                                       variables = "DP05_0001",
                                       year = 2020,
                                       geometry = TRUE,
                                       shift_geo = TRUE) %>%
  rename(POPULATION = estimate) %>%
  select(GEOID, NAME, POPULATION)

# Plot statewide populations
tm_shape(state_populations_shapefile) +
  tm_fill(title = "Statewide Population (2020)",
          col = "POPULATION",
          style = "cont",
          palette = "viridis",
          alpha = 0.7,
          n = 10) +
  tm_shape(state_populations_shapefile) +
  tm_borders(col = "white") +
  tm_layout(frame = FALSE,
            legend.outside = TRUE)

Use FIPS Codes to Limit Selection

Using the FIPS codes from above, this map can be limited easily to just the 48 contiguous states. This is also how a user could use the FIPS code dataset to limit which states are selected with any of the tidycensus API call functions.

# Create list of contiguous states
states <- unlist(state_fips %>%
                   filter(!state %in% c("AK", "HI")) %>%
                   select(state))

# Download population and shapefiles for contiguous states only
contiguous_populations_shapefile <- get_acs(geography = "state",
                                            state =  states,
                                            variables = "DP05_0001",
                                            year = 2020,
                                            geometry = TRUE)  %>%
  rename(POPULATION = estimate) %>%
  select(GEOID, NAME, POPULATION)

# Plot statewide populations for contiguous states
tm_shape(contiguous_populations_shapefile) +
  tm_fill(title = "Statewide Population (2020)",
          col = "POPULATION",
          style = "cont",
          palette = "viridis",
          alpha = 0.7,
          n = 10) +
  tm_shape(contiguous_populations_shapefile) +
  tm_borders(col = "white") +
  tm_layout(frame = FALSE,
            legend.outside = TRUE)



HUD Crosswalk Files

The US Department of Housing and Urban Development’s (HUD) Office of Policy Development and Research puts out ZIP code crosswalk data for each quarter year. These crosswalks allow for easy linking between ZIP codes and census tracts, counties, Core Based Statistical Areas (CBSA), and congressional districts.

Below are examples on how to use the ZIP code to County crosswalk file for the 4th quarter of 2021.

Note: this data is available as an .xlsx spreadsheet, but it has been converted to a .csv file for this example.

# Load the data (use colClasses to keep leading 0's in ZIP codes and county FIPS codes)
hud_crosswalk <- read.csv("./data/ZIP_COUNTY_122021.csv",
                          colClasses = c("zip" = "character",
                                         "county" = "character"))

# Inspect the data
hud_crosswalk[1:5,] %>%
       arrange(zip) %>%
  kbl(caption = "First Five Rows of HUD ZIP Code to County Crosswalk", 
      align = "r") %>%
  kable_minimal("hover")
First Five Rows of HUD ZIP Code to County Crosswalk
zip county usps_zip_pref_city usps_zip_pref_state res_ratio bus_ratio oth_ratio tot_ratio
00683 72125 SAN GERMAN PR 0.9530361 0.9966555 0.9807692 0.9563004
00683 72079 SAN GERMAN PR 0.0009488 0.0000000 0.0000000 0.0008725
00683 72023 SAN GERMAN PR 0.0007906 0.0011148 0.0000000 0.0007998
00683 72097 SAN GERMAN PR 0.0001581 0.0000000 0.0000000 0.0001454
00683 72121 SAN GERMAN PR 0.0450664 0.0022297 0.0192308 0.0418818



One common reason to use these crosswalks is to assign a county to a ZIP code. However, ZIP codes are not subsets of counties, so their routes can cross county lines. Fortunately, HUD provides the proportion of residents living in each linked county for that ZIP code, which can be used to asign a single county to each ZIP code. Here is an example of a ZIP code that crosses into multiple counties.

# Subset HUD data for ZIP code 39573
hud_crosswalk %>%
  filter(zip == "39573") %>%
  arrange(desc(res_ratio)) %>%
  kbl(caption = "All Counties Covered by ZIP Code 39573",
      align = "r") %>%
  kable_minimal("hover")
All Counties Covered by ZIP Code 39573
zip county usps_zip_pref_city usps_zip_pref_state res_ratio bus_ratio oth_ratio tot_ratio
39573 28131 PERKINSTON MS 0.6815420 0.6818182 1 0.6824586
39573 28045 PERKINSTON MS 0.2214540 0.2272727 0 0.2209516
39573 28039 PERKINSTON MS 0.0343599 0.0181818 0 0.0338983
39573 28109 PERKINSTON MS 0.0335219 0.0000000 0 0.0326731
39573 28059 PERKINSTON MS 0.0163419 0.0636364 0 0.0173576
39573 28047 PERKINSTON MS 0.0127802 0.0090909 0 0.0126608



According to HUD, 68.2% of the residents with a 39573 ZIP code lived in county 28131 at the end of 2021, so it makes sense to assign 39573 to that county if the goal is to have a single ZIP code and county combination. Using the FIPS data from above, county 28131 can be quickly identified:

# Identify county 131 in state 28
fips_codes %>%
  filter(state_code == "28" & county_code == "131") %>%
  select(county, state_name) %>%
  kbl(caption = "County 28131",
      align = "r") %>%
  kable_minimal("hover")
County 28131
county state_name
Stone County Mississippi

One County per ZIP Code

Here is a quick code to limit the HUD crosswalk data to include only the county with the highest proportion of residents for each ZIP code.

# Use the variable res_ratio to keep only one county for each ZIP code
zip_county <- hud_crosswalk %>%
  group_by(zip) %>%
  arrange(desc(res_ratio)) %>%
  filter(row_number()==1) %>%
  arrange(county, zip)

# Inspect the modified data
zip_county[1:5,] %>%
  kbl(caption = "First Five Rows of HUD Data with One County per ZIP Code",
      align = "r") %>%
  kable_minimal("hover")
First Five Rows of HUD Data with One County per ZIP Code
zip county usps_zip_pref_city usps_zip_pref_state res_ratio bus_ratio oth_ratio tot_ratio
36003 01001 AUTAUGAVILLE AL 1.0000000 1.0000000 1.0000000 1.0000000
36006 01001 BILLINGSLEY AL 0.7160326 0.5294118 0.5555556 0.7099738
36008 01001 BOOTH AL 1.0000000 1.0000000 0.0000000 1.0000000
36051 01001 MARBURY AL 0.6590747 0.8055556 1.0000000 0.6666667
36066 01001 PRATTVILLE AL 0.8858471 0.7206266 0.7066327 0.8680474



Area Deprivation Index

Another useful data source for geospatial data is the Area Deprivation Index (ADI). This metric is used to rank how a neighborhood–at the census block group level–compares to the rest of its state or on a national level in terms of socioeconomic disadvantage. The rankings are based on factors that have been linked to several healthcare outcomes, healthcare utilization rates, and early mortality.

For a full description of the data, visit the Neighborhood Atlas page. To download the data, first create an account here, and then go to the download page.

To get started with the data, here are the variable descriptions provided in the data’s read me file.

Note: although the data is available in a 9-digit ZIP code format, the providers of the data discourage using this data for ZIP codes and ZCTAs.

# Load textreader library
p_load(textreadr)    # For loading text files

# Load ADI data documentation
adi_read_me <- read_document("./data/2019 ADI_Census Block Group_v3.1_ReadMe.txt")

# Inspect variable descriptions
adi_read_me[3:6] %>%
  kbl(caption = "Descriptions of ADI Data Columns",
      align = "l",
      col.names = c("Variable Description")) %>%
  kable_minimal("hover")
Descriptions of ADI Data Columns
Variable Description
GISJOIN: Key linkage field to the block group shapefile served by NHGIS
FIPS: The block group Census ID
ADI_NATRANK: National percentile of block group ADI score
ADI_STATERNK: State-specific decile of block group ADI score



With that information in mind, here is a quick snapshot of what the data looks like for census block groups.

# Load ADI data
adi <- read.delim("./data/US_2019_ADI_Census Block Group_v3.1.txt",
                  sep = ",",
                  colClasses = c("FIPS" = "character"),
                  row.names = FALSE)

# Inspect ADI data
adi[1:5,-1] %>%
  kbl(caption = "First Five Rows of ADI Data",
      align = "r") %>%
  kable_minimal("hover")
First Five Rows of ADI Data
FIPS ADI_NATRANK ADI_STATERNK
010010201001 89 8
010010201002 52 2
010010202001 88 8
010010202002 82 6
010010203001 71 5



Note: both the state and national ranks for ADI include different codes for different missing values. Here’s a quick code to replace those values with NAs and then map the index.

# List of missing value codes
values <- c("GQ", "GQ-PH", "NONE", "PH", "QDI")

# Replace missing value codes with NA
adi <- adi %>%
  mutate(ADI_NATRANK = ifelse(ADI_NATRANK %in% values,
                              NA, ADI_NATRANK),
         ADI_STATERNK = ifelse(ADI_STATERNK %in% values,
                               NA, ADI_STATERNK))

# Download census block groups shapefile for Pennsylvania (from tigris package)
block_group_shapefiles <- block_groups(state = c("PA"))

# Merge ADI values to census block groups shapefile
# Note: there is a census block group for lake Erie that needs to be removed
pa_adi_map <- left_join(block_group_shapefiles,
                        adi,
                        by = c("GEOID" = "FIPS")) %>%
  filter(GEOID != "420499900000")

# Download Pennsylvania counties shapefile for map
pa_counties <- counties(state = "PA")

# Set map categories to just 10
tmap_options(max.categories = 10)

# Plot national ADI rank values for Pennsylvania
tm_shape(pa_adi_map) +
  tm_fill(title = "National ADI Rank",
          col = "ADI_NATRANK",
          style = "cont",
          palette = "viridis",
          alpha = 0.7,
          n = 10,
          colorNA = "grey",
          textNA = "Missing") +
  tm_shape(pa_counties) +
  tm_borders(col = "white") +
  tm_layout(frame = FALSE,
          legend.outside = TRUE)



And here is the same map but with statewide ADI ranks.

# Plot state ADI rank values for Pennsylvania
tm_shape(pa_adi_map) +
  tm_fill(title = "State ADI Rank",
          col = "ADI_STATERNK",
          style = "cont",
          palette = "viridis",
          alpha = 0.7,
          n = 10,
          colorNA = "grey",
          textNA = "Missing") +
  tm_shape(pa_counties) +
  tm_borders(col = "white") +
  tm_layout(frame = FALSE,
            legend.outside = TRUE)