This guide goes over a handful of useful data sources for geospatial analysis. One big limitation of working with geospatial data is the geographic unit or units that the data are available in, so always inspect what is available when choosing the right data for a project. If the data is available in a geographic unit supported by the US Census Bureau, then, using FIPS codes, it can be aggregated upward to larger geographic units (e.g. all counties in a state) or can be used to apply averages to smaller geographic units (e.g. state average to all counties).
The Census Bureau provides a wide variety of demographic and housing data for various geographic units. Not all of their data is available for each unit–specifically smaller units, so this is a serious limitation when working with their data.
Below is a graphic that depicts the Census Bureau’s geographic entities hierarchy (here is their page that goes over this hierarchy in greater detail):
The Census Bureau also provides corresponding shapefiles for that data, making it easy to create maps using that data.
The R package tidycensus provides several easy to use API calls for downloading Census Bureau data and their shapefiles. The first step for using these API calls is to set up an API key.
Before running the code below, which sets the user’s API key for downloading Census Bureau data, first request an API code here.
Replace the “[Insert API Key Here]” line below with your own API key, and then run the following.
Note: this should only have to be run once per system.
# Set Census API key
census_api_key("[Insert API Key Here]",
install = TRUE)
# Re-reads startup file to load Census API key
readRenviron("~/.Renviron")
The tidycensus package includes a built in data set that includes FIPS codes for all states, counties, and other territories in the US. This data set is useful for finding FIPS codes for the exact states and counties needed in an API call. Here is a method to quickly explore this data set, using the lookup_code() function from the tigris package (a companion package to tidycensus).
# Use pacman package to load packages used for exercise
library(pacman)
p_load(tidyverse, # Variety of packages for data manipulation
tidycensus, # Package for downloading census data
tigris, # Package for downloading census shapefiles
kableExtra) # For creating tables
# Use the tigris package function lookup_code() to search for FIPS codes
# State only:
lookup_code(state = "PA")
## [1] "The code for Pennsylvania is '42'."
# State and County:
lookup_code(state = "PA", county = "Allegheny")
## [1] "The code for Pennsylvania is '42' and the code for Allegheny County is '003'."
Users can also load the fips_codes data set directly to view the data set.
# Load the fips_codes data set
data(fips_codes)
# View the fips_codes data structure
fips_codes[1:5,] %>%
kbl(caption = "US State and County FIPS Codes (first 5)",
align = "r") %>%
kable_minimal("hover")
| state | state_code | state_name | county_code | county |
|---|---|---|---|---|
| AL | 01 | Alabama | 001 | Autauga County |
| AL | 01 | Alabama | 003 | Baldwin County |
| AL | 01 | Alabama | 005 | Barbour County |
| AL | 01 | Alabama | 007 | Bibb County |
| AL | 01 | Alabama | 009 | Blount County |
The data can then be manipulated to return just the FIPS codes for each state and territory.
# Keep just one FIPS code entry per state/territory
fips <- fips_codes %>%
select(state, state_code, state_name) %>%
distinct()
# View state/territory FIPS code data structure
fips[c(1:5),] %>%
kbl(caption = "US State and Territory FIPS Codes (first 5)",
align = "r") %>%
kable_minimal("hover")
| state | state_code | state_name |
|---|---|---|
| AL | 01 | Alabama |
| AK | 02 | Alaska |
| AZ | 04 | Arizona |
| AR | 05 | Arkansas |
| CA | 06 | California |
# Washington, D.C. and FIPS codes over 56
fips %>%
filter(state %in% c("DC") | state_code > 56) %>%
kbl(caption = "Washington, D.C. and US Territory FIPS Codes",
align = "r") %>%
kable_minimal("hover")
| state | state_code | state_name |
|---|---|---|
| DC | 11 | District of Columbia |
| AS | 60 | American Samoa |
| GU | 66 | Guam |
| MP | 69 | Northern Mariana Islands |
| PR | 72 | Puerto Rico |
| UM | 74 | U.S. Minor Outlying Islands |
| VI | 78 | U.S. Virgin Islands |
And the data can be further limited to just provide FIPS codes for the 50 states.
# Keep FIPS code for just the 50 states
state_fips <- fips_codes %>%
select(state, state_code, state_name) %>%
filter(state_code <= 56 & !(state %in% c("DC"))) %>%
distinct()
# View state FIPS codes data structure
state_fips[1:10,] %>%
kbl(caption = "50 US State FIPS Codes (first 10)",
align = "r") %>%
kable_minimal("hover")
| state | state_code | state_name |
|---|---|---|
| AL | 01 | Alabama |
| AK | 02 | Alaska |
| AZ | 04 | Arizona |
| AR | 05 | Arkansas |
| CA | 06 | California |
| CO | 08 | Colorado |
| CT | 09 | Connecticut |
| DE | 10 | Delaware |
| FL | 12 | Florida |
| GA | 13 | Georgia |
The tidycensus package can also be used to download data directly from the Census Bureau using API calls. The Census Bureau provides tables online that list available variables. The tidycensus package can also download those tables to provide an internal list of variables.
The code below populates a table with variables from a specific data set and year using the load_variables() function from the tidycensus package. In this example, it downloads data from the ACS 5-year estimates for 2020.
# Store ACS variables to view
variables <- load_variables(year = 2020,
dataset = "acs5/profile")
# View ACS variables
head(variables %>%
filter(concept == "ACS DEMOGRAPHIC AND HOUSING ESTIMATES")) %>%
kbl(caption = "First 5 Demographic Variables",
align = "l") %>%
kable_minimal("hover")
| name | label | concept |
|---|---|---|
| DP05_0001 | Estimate!!SEX AND AGE!!Total population | ACS DEMOGRAPHIC AND HOUSING ESTIMATES |
| DP05_0001P | Percent!!SEX AND AGE!!Total population | ACS DEMOGRAPHIC AND HOUSING ESTIMATES |
| DP05_0002 | Estimate!!SEX AND AGE!!Total population!!Male | ACS DEMOGRAPHIC AND HOUSING ESTIMATES |
| DP05_0002P | Percent!!SEX AND AGE!!Total population!!Male | ACS DEMOGRAPHIC AND HOUSING ESTIMATES |
| DP05_0003 | Estimate!!SEX AND AGE!!Total population!!Female | ACS DEMOGRAPHIC AND HOUSING ESTIMATES |
| DP05_0003P | Percent!!SEX AND AGE!!Total population!!Female | ACS DEMOGRAPHIC AND HOUSING ESTIMATES |
Using the information from above, the variable “DP05_0001” can be used to pull total population for a geographic region. Here is an example of an API call using the get_acs() function to download statewide populations for the year 2020.
# Download state populations
state_populations <- get_acs(geography = "state",
variables = "DP05_0001",
year = 2020) %>%
rename(POPULATION = estimate) %>%
select(GEOID, NAME, POPULATION)
# View data structure
state_populations[1:5,] %>%
kbl(caption = "US State Populations (first 5)",
align = "r",
format.args = list(big.mark = ",",
scientific = FALSE)) %>%
kable_minimal("hover")
| GEOID | NAME | POPULATION |
|---|---|---|
| 01 | Alabama | 4,893,186 |
| 02 | Alaska | 736,990 |
| 04 | Arizona | 7,174,064 |
| 05 | Arkansas | 3,011,873 |
| 06 | California | 39,346,023 |
Adding another line of code to the API call above, geometry = TRUE, also downloads the corresponding shapefiles for the data. The example below also includes shift_geo = TRUE, which resizes and relocates Alaksa and Hawaii to fit closer to the contiguous states for mapping purposes.
# Load tmap library
p_load(tmap) # For creating categorical maps with tmap
# Download state populations with shapefile
state_populations_shapefile <- get_acs(geography = "state",
variables = "DP05_0001",
year = 2020,
geometry = TRUE,
shift_geo = TRUE) %>%
rename(POPULATION = estimate) %>%
select(GEOID, NAME, POPULATION)
# Plot statewide populations
tm_shape(state_populations_shapefile) +
tm_fill(title = "Statewide Population (2020)",
col = "POPULATION",
style = "cont",
palette = "viridis",
alpha = 0.7,
n = 10) +
tm_shape(state_populations_shapefile) +
tm_borders(col = "white") +
tm_layout(frame = FALSE,
legend.outside = TRUE)
Using the FIPS codes from above, this map can be limited easily to just the 48 contiguous states. This is also how a user could use the FIPS code dataset to limit which states are selected with any of the tidycensus API call functions.
# Create list of contiguous states
states <- unlist(state_fips %>%
filter(!state %in% c("AK", "HI")) %>%
select(state))
# Download population and shapefiles for contiguous states only
contiguous_populations_shapefile <- get_acs(geography = "state",
state = states,
variables = "DP05_0001",
year = 2020,
geometry = TRUE) %>%
rename(POPULATION = estimate) %>%
select(GEOID, NAME, POPULATION)
# Plot statewide populations for contiguous states
tm_shape(contiguous_populations_shapefile) +
tm_fill(title = "Statewide Population (2020)",
col = "POPULATION",
style = "cont",
palette = "viridis",
alpha = 0.7,
n = 10) +
tm_shape(contiguous_populations_shapefile) +
tm_borders(col = "white") +
tm_layout(frame = FALSE,
legend.outside = TRUE)
The US Department of Housing and Urban Development’s (HUD) Office of Policy Development and Research puts out ZIP code crosswalk data for each quarter year. These crosswalks allow for easy linking between ZIP codes and census tracts, counties, Core Based Statistical Areas (CBSA), and congressional districts.
Below are examples on how to use the ZIP code to County crosswalk file for the 4th quarter of 2021.
Note: this data is available as an .xlsx spreadsheet, but it has been converted to a .csv file for this example.
# Load the data (use colClasses to keep leading 0's in ZIP codes and county FIPS codes)
hud_crosswalk <- read.csv("./data/ZIP_COUNTY_122021.csv",
colClasses = c("zip" = "character",
"county" = "character"))
# Inspect the data
hud_crosswalk[1:5,] %>%
arrange(zip) %>%
kbl(caption = "First Five Rows of HUD ZIP Code to County Crosswalk",
align = "r") %>%
kable_minimal("hover")
| zip | county | usps_zip_pref_city | usps_zip_pref_state | res_ratio | bus_ratio | oth_ratio | tot_ratio |
|---|---|---|---|---|---|---|---|
| 00683 | 72125 | SAN GERMAN | PR | 0.9530361 | 0.9966555 | 0.9807692 | 0.9563004 |
| 00683 | 72079 | SAN GERMAN | PR | 0.0009488 | 0.0000000 | 0.0000000 | 0.0008725 |
| 00683 | 72023 | SAN GERMAN | PR | 0.0007906 | 0.0011148 | 0.0000000 | 0.0007998 |
| 00683 | 72097 | SAN GERMAN | PR | 0.0001581 | 0.0000000 | 0.0000000 | 0.0001454 |
| 00683 | 72121 | SAN GERMAN | PR | 0.0450664 | 0.0022297 | 0.0192308 | 0.0418818 |
One common reason to use these crosswalks is to assign a county to a ZIP code. However, ZIP codes are not subsets of counties, so their routes can cross county lines. Fortunately, HUD provides the proportion of residents living in each linked county for that ZIP code, which can be used to asign a single county to each ZIP code. Here is an example of a ZIP code that crosses into multiple counties.
# Subset HUD data for ZIP code 39573
hud_crosswalk %>%
filter(zip == "39573") %>%
arrange(desc(res_ratio)) %>%
kbl(caption = "All Counties Covered by ZIP Code 39573",
align = "r") %>%
kable_minimal("hover")
| zip | county | usps_zip_pref_city | usps_zip_pref_state | res_ratio | bus_ratio | oth_ratio | tot_ratio |
|---|---|---|---|---|---|---|---|
| 39573 | 28131 | PERKINSTON | MS | 0.6815420 | 0.6818182 | 1 | 0.6824586 |
| 39573 | 28045 | PERKINSTON | MS | 0.2214540 | 0.2272727 | 0 | 0.2209516 |
| 39573 | 28039 | PERKINSTON | MS | 0.0343599 | 0.0181818 | 0 | 0.0338983 |
| 39573 | 28109 | PERKINSTON | MS | 0.0335219 | 0.0000000 | 0 | 0.0326731 |
| 39573 | 28059 | PERKINSTON | MS | 0.0163419 | 0.0636364 | 0 | 0.0173576 |
| 39573 | 28047 | PERKINSTON | MS | 0.0127802 | 0.0090909 | 0 | 0.0126608 |
According to HUD, 68.2% of the residents with a 39573 ZIP code lived in county 28131 at the end of 2021, so it makes sense to assign 39573 to that county if the goal is to have a single ZIP code and county combination. Using the FIPS data from above, county 28131 can be quickly identified:
# Identify county 131 in state 28
fips_codes %>%
filter(state_code == "28" & county_code == "131") %>%
select(county, state_name) %>%
kbl(caption = "County 28131",
align = "r") %>%
kable_minimal("hover")
| county | state_name |
|---|---|
| Stone County | Mississippi |
Here is a quick code to limit the HUD crosswalk data to include only the county with the highest proportion of residents for each ZIP code.
# Use the variable res_ratio to keep only one county for each ZIP code
zip_county <- hud_crosswalk %>%
group_by(zip) %>%
arrange(desc(res_ratio)) %>%
filter(row_number()==1) %>%
arrange(county, zip)
# Inspect the modified data
zip_county[1:5,] %>%
kbl(caption = "First Five Rows of HUD Data with One County per ZIP Code",
align = "r") %>%
kable_minimal("hover")
| zip | county | usps_zip_pref_city | usps_zip_pref_state | res_ratio | bus_ratio | oth_ratio | tot_ratio |
|---|---|---|---|---|---|---|---|
| 36003 | 01001 | AUTAUGAVILLE | AL | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 |
| 36006 | 01001 | BILLINGSLEY | AL | 0.7160326 | 0.5294118 | 0.5555556 | 0.7099738 |
| 36008 | 01001 | BOOTH | AL | 1.0000000 | 1.0000000 | 0.0000000 | 1.0000000 |
| 36051 | 01001 | MARBURY | AL | 0.6590747 | 0.8055556 | 1.0000000 | 0.6666667 |
| 36066 | 01001 | PRATTVILLE | AL | 0.8858471 | 0.7206266 | 0.7066327 | 0.8680474 |
Another useful data source for geospatial data is the Area Deprivation Index (ADI). This metric is used to rank how a neighborhood–at the census block group level–compares to the rest of its state or on a national level in terms of socioeconomic disadvantage. The rankings are based on factors that have been linked to several healthcare outcomes, healthcare utilization rates, and early mortality.
For a full description of the data, visit the Neighborhood Atlas page. To download the data, first create an account here, and then go to the download page.
To get started with the data, here are the variable descriptions provided in the data’s read me file.
Note: although the data is available in a 9-digit ZIP code format, the providers of the data discourage using this data for ZIP codes and ZCTAs.
# Load textreader library
p_load(textreadr) # For loading text files
# Load ADI data documentation
adi_read_me <- read_document("./data/2019 ADI_Census Block Group_v3.1_ReadMe.txt")
# Inspect variable descriptions
adi_read_me[3:6] %>%
kbl(caption = "Descriptions of ADI Data Columns",
align = "l",
col.names = c("Variable Description")) %>%
kable_minimal("hover")
| Variable Description |
|---|
| GISJOIN: Key linkage field to the block group shapefile served by NHGIS |
| FIPS: The block group Census ID |
| ADI_NATRANK: National percentile of block group ADI score |
| ADI_STATERNK: State-specific decile of block group ADI score |
With that information in mind, here is a quick snapshot of what the data looks like for census block groups.
# Load ADI data
adi <- read.delim("./data/US_2019_ADI_Census Block Group_v3.1.txt",
sep = ",",
colClasses = c("FIPS" = "character"),
row.names = FALSE)
# Inspect ADI data
adi[1:5,-1] %>%
kbl(caption = "First Five Rows of ADI Data",
align = "r") %>%
kable_minimal("hover")
| FIPS | ADI_NATRANK | ADI_STATERNK |
|---|---|---|
| 010010201001 | 89 | 8 |
| 010010201002 | 52 | 2 |
| 010010202001 | 88 | 8 |
| 010010202002 | 82 | 6 |
| 010010203001 | 71 | 5 |
Note: both the state and national ranks for ADI include different codes for different missing values. Here’s a quick code to replace those values with NAs and then map the index.
# List of missing value codes
values <- c("GQ", "GQ-PH", "NONE", "PH", "QDI")
# Replace missing value codes with NA
adi <- adi %>%
mutate(ADI_NATRANK = ifelse(ADI_NATRANK %in% values,
NA, ADI_NATRANK),
ADI_STATERNK = ifelse(ADI_STATERNK %in% values,
NA, ADI_STATERNK))
# Download census block groups shapefile for Pennsylvania (from tigris package)
block_group_shapefiles <- block_groups(state = c("PA"))
# Merge ADI values to census block groups shapefile
# Note: there is a census block group for lake Erie that needs to be removed
pa_adi_map <- left_join(block_group_shapefiles,
adi,
by = c("GEOID" = "FIPS")) %>%
filter(GEOID != "420499900000")
# Download Pennsylvania counties shapefile for map
pa_counties <- counties(state = "PA")
# Set map categories to just 10
tmap_options(max.categories = 10)
# Plot national ADI rank values for Pennsylvania
tm_shape(pa_adi_map) +
tm_fill(title = "National ADI Rank",
col = "ADI_NATRANK",
style = "cont",
palette = "viridis",
alpha = 0.7,
n = 10,
colorNA = "grey",
textNA = "Missing") +
tm_shape(pa_counties) +
tm_borders(col = "white") +
tm_layout(frame = FALSE,
legend.outside = TRUE)
And here is the same map but with statewide ADI ranks.
# Plot state ADI rank values for Pennsylvania
tm_shape(pa_adi_map) +
tm_fill(title = "State ADI Rank",
col = "ADI_STATERNK",
style = "cont",
palette = "viridis",
alpha = 0.7,
n = 10,
colorNA = "grey",
textNA = "Missing") +
tm_shape(pa_counties) +
tm_borders(col = "white") +
tm_layout(frame = FALSE,
legend.outside = TRUE)