This file helps users access U.S. Decennial Census data.
The code and data in this file are based on the 2025 workshop, “Working with Decennial Census Data in R,” led by Prof. Kyle Walker (TCU); the slides for this workshop can be found here and the repository for the workshop can be found here.
Variable selection and targets have been prioritized for members of the Quantitative Histories Workshop and other quant-shop users. Please note that you will need to have your Census API key installed on your local machine.
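As a minimal sketch of how to confirm that a key is installed, the check below assumes the key is stored in the CENSUS_API_KEY environment variable, which is where tidycensus::census_api_key() with install = TRUE writes it. The helper name has_census_key is hypothetical and not part of the workshop materials.

```r
# Hypothetical helper: TRUE when a Census API key is visible to this R session.
# Assumes the key lives in the CENSUS_API_KEY environment variable, which is
# where tidycensus::census_api_key(..., install = TRUE) stores it.
has_census_key <- function() {
  nzchar(Sys.getenv("CENSUS_API_KEY"))
}

if (!has_census_key()) {
  message(
    "No Census API key found. Request one from the Census Bureau, then run ",
    "tidycensus::census_api_key(\"YOUR_KEY\", install = TRUE) and restart R."
  )
}
```

Running this check before the first get_decennial() call makes a missing key easier to diagnose than a failed API request.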
knitr::opts_chunk$set(echo = TRUE)
pkgs <- c("tidycensus", "tidyverse", "sf", "weights", "mapview", "survey", "srvyr", "arcgislayers")
# install.packages(pkgs) # uncomment to install the packages
library(tidyverse)
library(tidycensus)
library(sf)
library(ggplot2)
library(weights)
library(dplyr)
library(stringr)
options(tigris_use_cache = TRUE)
We will access the 2020 Census data using tidycensus.
We will gather the 2020 state-level population data using the PL 94-171 Redistricting Data Summary File. This data was used to inform redistricting as a part of a U.S. constitutional mandate.
pop20 <- get_decennial(
  geography = "state",
  variables = "P1_001N",
  year = 2020
)
pop20
## # A tibble: 52 × 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 42 Pennsylvania P1_001N 13002700
## 2 06 California P1_001N 39538223
## 3 54 West Virginia P1_001N 1793716
## 4 49 Utah P1_001N 3271616
## 5 36 New York P1_001N 20201249
## 6 11 District of Columbia P1_001N 689545
## 7 02 Alaska P1_001N 733391
## 8 12 Florida P1_001N 21538187
## 9 45 South Carolina P1_001N 5118425
## 10 38 North Dakota P1_001N 779094
## # ℹ 42 more rows
The table argument retrieves every variable in a table (here, P2) in a single call.
table_p2 <- get_decennial(
  geography = "state",
  table = "P2",
  year = 2020
)
head(table_p2)
## # A tibble: 6 × 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 42 Pennsylvania P2_001N 13002700
## 2 06 California P2_001N 39538223
## 3 54 West Virginia P2_001N 1793716
## 4 49 Utah P2_001N 3271616
## 5 36 New York P2_001N 20201249
## 6 11 District of Columbia P2_001N 689545
tail(table_p2)
## # A tibble: 6 × 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 32 Nevada P2_073N 15
## 2 10 Delaware P2_073N 5
## 3 72 Puerto Rico P2_073N 0
## 4 21 Kentucky P2_073N 13
## 5 46 South Dakota P2_073N 1
## 6 47 Tennessee P2_073N 10
We will start with data on Washington, DC.
Take note of the tibble dimensions.
dc_population <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  state = "DC",
  sumfile = "dhc",
  year = 2020
)
dc_population
## # A tibble: 1 × 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 11001 District of Columbia, District of Columbia P1_001N 689545
We will also collect data on North Carolina.
nc_population <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  state = "NC",
  sumfile = "dhc",
  year = 2020
)
nc_population
## # A tibble: 100 × 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 37151 Randolph County, North Carolina P1_001N 144171
## 2 37145 Person County, North Carolina P1_001N 39097
## 3 37147 Pitt County, North Carolina P1_001N 170243
## 4 37043 Clay County, North Carolina P1_001N 11089
## 5 37037 Chatham County, North Carolina P1_001N 76285
## 6 37039 Cherokee County, North Carolina P1_001N 28774
## 7 37041 Chowan County, North Carolina P1_001N 13708
## 8 37045 Cleveland County, North Carolina P1_001N 99519
## 9 37047 Columbus County, North Carolina P1_001N 50623
## 10 37049 Craven County, North Carolina P1_001N 100720
## # ℹ 90 more rows
Because DC returns only a single county-level row, we will focus on NC, specifically Mecklenburg County, NC.
Following Walker, a census block is roughly analogous to a city block, though its size varies by geography.
The block is the smallest geography available in the decennial data.
nc_mecklenburg_blocks <- get_decennial(
  geography = "block",
  variables = "P1_001N",
  state = "NC",
  county = "Mecklenburg",
  sumfile = "dhc",
  year = 2020
)
nc_mecklenburg_blocks
## # A tibble: 12,132 × 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 371190005012017 Block 2017, Block Group 2, Census Tract 5.01,… P1_001N 0
## 2 371190005012014 Block 2014, Block Group 2, Census Tract 5.01,… P1_001N 0
## 3 371190005012015 Block 2015, Block Group 2, Census Tract 5.01,… P1_001N 0
## 4 371190005012016 Block 2016, Block Group 2, Census Tract 5.01,… P1_001N 46
## 5 371190005012018 Block 2018, Block Group 2, Census Tract 5.01,… P1_001N 0
## 6 371190005012019 Block 2019, Block Group 2, Census Tract 5.01,… P1_001N 0
## 7 371190005012020 Block 2020, Block Group 2, Census Tract 5.01,… P1_001N 0
## 8 371190005021000 Block 1000, Block Group 1, Census Tract 5.02,… P1_001N 0
## 9 371190005021001 Block 1001, Block Group 1, Census Tract 5.02,… P1_001N 0
## 10 371190005021002 Block 1002, Block Group 1, Census Tract 5.02,… P1_001N 103
## # ℹ 12,122 more rows
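The block GEOIDs in the output above encode the nesting of census geographies. The sketch below splits a 15-character block GEOID into its parts, assuming the standard layout of state (2 digits), county (3), tract (6), and block (4). The function name parse_block_geoid is hypothetical.

```r
# Split 15-character census block GEOIDs into their nested components.
# Layout assumed here: state (2) + county (3) + tract (6) + block (4).
parse_block_geoid <- function(geoid) {
  stopifnot(all(nchar(geoid) == 15))
  data.frame(
    state  = substr(geoid, 1, 2),
    county = substr(geoid, 3, 5),
    tract  = substr(geoid, 6, 11),
    block  = substr(geoid, 12, 15),
    # The block group is the first digit of the block number
    block_group = substr(geoid, 12, 12),
    stringsAsFactors = FALSE
  )
}

# Two GEOIDs copied from the Mecklenburg County output above
parse_block_geoid(c("371190005012017", "371190005021002"))
```

Because the components are fixed-width, the same substrings can be used to aggregate block counts up to tracts or block groups.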
Next, we will identify our variables of interest.
vars <- load_variables(2020, "dhc")
View(vars) # view the variables, a new window will pop up
When the data frame loads in a new window, you can use the search function to identify variables.
Please note the different variable prefixes and what they indicate:
P: person-level variables
H: housing-level variables
HCT: housing-level variables only available at the census tract level and up.
PCT: person-level variables only available at the census tract level and up.
Take note of the sumfile that is used to access the data, which depends on the variable name and concept. get_decennial() defaults to sumfile = "pl" (the Redistricting Data Summary File). To access the variables loaded above, you will need to set sumfile = "dhc".
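Instead of searching in the viewer window, the variable list can also be filtered in code. The sketch below uses a small stand-in tibble with the same column names that load_variables() returns (name, label, concept). Its rows are simplified for illustration and are not the full Census labels.

```r
library(dplyr)
library(stringr)

# Stand-in for load_variables(2020, "dhc"): same columns, but only a few rows
# with simplified labels (assumption: real labels are longer and more nested).
vars_demo <- tibble(
  name    = c("P1_001N", "H10_004N", "H10_012N"),
  label   = c("Total",
              "Owner occupied: Black or African American alone",
              "Renter occupied: Black or African American alone"),
  concept = c("TOTAL POPULATION",
              "TENURE BY RACE OF HOUSEHOLDER",
              "TENURE BY RACE OF HOUSEHOLDER")
)

# Keep variables whose concept mentions tenure, and pull out the table prefix
tenure_vars <- vars_demo %>%
  filter(str_detect(concept, regex("tenure", ignore_case = TRUE))) %>%
  mutate(prefix = str_extract(name, "^[A-Z]+"))

tenure_vars
```

The same filter applied to the real vars object narrows thousands of rows to one table, and the extracted prefix (P, H, PCT, HCT) signals the geographies at which a variable is available.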
Next, we look at counts of owner-occupied and renter-occupied housing units where the householder is Black or African American alone.
nc_housing_black <- get_decennial(
  geography = "county",
  state = "NC",
  variables = c(
    owner = "H10_004N",  # owner occupied
    renter = "H10_012N"  # renter occupied
  ),
  year = 2020,
  sumfile = "dhc",
  output = "wide"
)
nc_housing_black
## # A tibble: 100 × 4
## GEOID NAME owner renter
## <chr> <chr> <dbl> <dbl>
## 1 37151 Randolph County, North Carolina 1721 1838
## 2 37145 Person County, North Carolina 2279 1791
## 3 37147 Pitt County, North Carolina 8769 15621
## 4 37043 Clay County, North Carolina 19 21
## 5 37037 Chatham County, North Carolina 2196 1162
## 6 37039 Cherokee County, North Carolina 100 58
## 7 37041 Chowan County, North Carolina 879 925
## 8 37045 Cleveland County, North Carolina 3468 4630
## 9 37047 Columbus County, North Carolina 3306 2274
## 10 37049 Craven County, North Carolina 4167 4036
## # ℹ 90 more rows
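With output = "wide", owner and renter counts sit side by side, so a derived measure takes a single mutate() call. The sketch below computes the owner-occupied share for the first three counties shown above. The column names occupied and pct_owner are illustrative choices, not part of the workshop code.

```r
library(dplyr)

# First three rows of the wide output above, entered by hand for illustration
nc_housing_demo <- tibble(
  NAME   = c("Randolph County, North Carolina",
             "Person County, North Carolina",
             "Pitt County, North Carolina"),
  owner  = c(1721, 2279, 8769),
  renter = c(1838, 1791, 15621)
)

# Share of these occupied units that are owner occupied, highest first
nc_owner_share <- nc_housing_demo %>%
  mutate(
    occupied  = owner + renter,
    pct_owner = 100 * owner / occupied
  ) %>%
  arrange(desc(pct_owner))

nc_owner_share
```

Note that the denominator here covers only the units counted by these two variables, not all occupied housing units in the county.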
Following the example provided by Walker, we can look at the percentages of same-sex married and same-sex partnered households.
nc_samesex <- get_decennial(
  geography = "county",
  state = "NC",
  variables = c(
    married = "DP1_0116P",
    partnered = "DP1_0118P"
  ),
  year = 2020,
  sumfile = "dp",
  output = "wide"
)
nc_samesex
## # A tibble: 100 × 4
## GEOID NAME married partnered
## <chr> <chr> <dbl> <dbl>
## 1 37001 Alamance County, North Carolina 0.2 0.1
## 2 37003 Alexander County, North Carolina 0.1 0.1
## 3 37005 Alleghany County, North Carolina 0.1 0.1
## 4 37007 Anson County, North Carolina 0.1 0.1
## 5 37009 Ashe County, North Carolina 0.2 0.1
## 6 37011 Avery County, North Carolina 0.1 0.1
## 7 37013 Beaufort County, North Carolina 0.1 0.1
## 8 37015 Bertie County, North Carolina 0.1 0.1
## 9 37017 Bladen County, North Carolina 0.1 0.1
## 10 37019 Brunswick County, North Carolina 0.2 0.1
## # ℹ 90 more rows
We use core-based statistical areas (CBSAs) to illustrate summary variables.
We follow Walker and use an example with race-ethnicity breakdowns.
race_vars <- c(
  Hispanic = "P5_010N",
  White = "P5_003N",
  Black = "P5_004N",
  Native = "P5_005N",
  Asian = "P5_006N",
  HIPI = "P5_007N"
)
# Core-Based Statistical Areas (CBSA)
cbsa_race <- get_decennial(
  geography = "cbsa",
  variables = race_vars,
  summary_var = "P5_001N",
  year = 2020,
  sumfile = "dhc"
)
The variable requested determines whether we get a count or a proportion.
For example, in the code above, the H10 variables (ending in N) returned counts, while the DP1 variables (ending in P) returned percentages.
We can then arrange data to view the largest populated areas by race.
arrange(cbsa_race, desc(value)) %>%
  head(n = 15) # view the top 15 rows
## # A tibble: 15 × 5
## GEOID NAME variable value summary_value
## <chr> <chr> <chr> <dbl> <dbl>
## 1 35620 New York-Newark-Jersey City, NY-NJ-PA Me… White 8.71e6 20140470
## 2 31080 Los Angeles-Long Beach-Anaheim, CA Metro… Hispanic 5.89e6 13200998
## 3 35620 New York-Newark-Jersey City, NY-NJ-PA Me… Hispanic 5.07e6 20140470
## 4 16980 Chicago-Naperville-Elgin, IL-IN-WI Metro… White 4.82e6 9618502
## 5 31080 Los Angeles-Long Beach-Anaheim, CA Metro… White 3.76e6 13200998
## 6 37980 Philadelphia-Camden-Wilmington, PA-NJ-DE… White 3.69e6 6245051
## 7 14460 Boston-Cambridge-Newton, MA-NH Metro Area White 3.29e6 4941632
## 8 19100 Dallas-Fort Worth-Arlington, TX Metro Ar… White 3.27e6 7637387
## 9 35620 New York-Newark-Jersey City, NY-NJ-PA Me… Black 3.00e6 20140470
## 10 33100 Miami-Fort Lauderdale-Pompano Beach, FL … Hispanic 2.82e6 6138333
## 11 19820 Detroit-Warren-Dearborn, MI Metro Area White 2.80e6 4392041
## 12 47900 Washington-Arlington-Alexandria, DC-VA-M… White 2.70e6 6385162
## 13 26420 Houston-The Woodlands-Sugar Land, TX Met… Hispanic 2.67e6 7122240
## 14 12060 Atlanta-Sandy Springs-Alpharetta, GA Met… White 2.66e6 6089815
## 15 33460 Minneapolis-St. Paul-Bloomington, MN-WI … White 2.65e6 3690261
Next, we convert the counts to percentages so that areas of different sizes can be compared directly.
cbsa_race_percent <- cbsa_race %>%
  mutate(percent = 100 * (value / summary_value)) %>%
  select(NAME, variable, percent)
cbsa_race_percent
## # A tibble: 5,634 × 3
## NAME variable percent
## <chr> <chr> <dbl>
## 1 Yakima, WA Metro Area Hispanic 50.7
## 2 Yankton, SD Micro Area Hispanic 5.29
## 3 Yauco, PR Metro Area Hispanic 99.4
## 4 York-Hanover, PA Metro Area Hispanic 8.62
## 5 Youngstown-Warren-Boardman, OH-PA Metro Area Hispanic 3.67
## 6 Yuba City, CA Metro Area Hispanic 30.4
## 7 Yuma, AZ Metro Area Hispanic 63.8
## 8 Zanesville, OH Micro Area Hispanic 1.22
## 9 Zapata, TX Micro Area Hispanic 93.6
## 10 Wenatchee, WA Metro Area Hispanic 30.1
## # ℹ 5,624 more rows
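One check worth running on percentages like these is whether the selected groups add up to 100 within each area. They generally will not here, because race_vars leaves out some P5 categories (to our understanding, "Some other race alone" and "Two or more races"). The sketch below uses hypothetical counts for a made-up area, not Census data.

```r
library(dplyr)

# Hypothetical counts for one made-up area, shaped like cbsa_race
toy_race <- tibble(
  NAME = "Example Metro Area",
  variable = c("Hispanic", "White", "Black", "Native", "Asian", "HIPI"),
  value = c(200, 500, 150, 10, 80, 10),
  summary_value = 1000
)

# Percent covered by the six selected groups, and what is left over
toy_check <- toy_race %>%
  mutate(percent = 100 * value / summary_value) %>%
  group_by(NAME) %>%
  summarize(
    covered = sum(percent),
    remainder = 100 - covered
  )

toy_check
```

A nonzero remainder is expected and reflects the omitted categories rather than an error in the calculation.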
Notice that the areas shown have changed, because this output is not sorted by count.
We can identify the largest group, by percentage, in each metro or micro area.
largest_group <- cbsa_race_percent %>%
  group_by(NAME) %>%
  filter(percent == max(percent))
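Filtering on percent == max(percent) keeps every tied row and leaves the result grouped by NAME. As an alternative sketch (a swapped-in technique, not the workshop's code), dplyr's slice_max() makes tie handling explicit, and ungroup() returns a plain tibble. The data below are hypothetical.

```r
library(dplyr)

# Hypothetical percentages for two made-up areas, shaped like cbsa_race_percent
toy_percent <- tibble(
  NAME = c("Area A", "Area A", "Area B", "Area B"),
  variable = c("White", "Black", "White", "Hispanic"),
  percent = c(40, 55, 30, 62)
)

# One row per area: the group with the highest percent; ties broken arbitrarily
toy_largest <- toy_percent %>%
  group_by(NAME) %>%
  slice_max(percent, n = 1, with_ties = FALSE) %>%
  ungroup()

toy_largest
```

Setting with_ties = TRUE instead would reproduce the behavior of the filter() approach, returning more than one row for an area when groups are exactly tied.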
We can then list the metro and micro areas where Black residents are the largest group, ordered by percentage.
largest_group %>%
  arrange(desc(percent)) %>%
  filter(variable == "Black")
## # A tibble: 25 × 3
## # Groups: NAME [25]
## NAME variable percent
## <chr> <chr> <dbl>
## 1 Clarksdale, MS Micro Area Black 75.8
## 2 Greenville, MS Micro Area Black 71.1
## 3 Selma, AL Micro Area Black 69.7
## 4 Indianola, MS Micro Area Black 69.6
## 5 Helena-West Helena, AR Micro Area Black 62.2
## 6 Greenwood, MS Micro Area Black 62.2
## 7 Cleveland, MS Micro Area Black 62.1
## 8 Orangeburg, SC Micro Area Black 60.3
## 9 West Point, MS Micro Area Black 57.9
## 10 Forrest City, AR Micro Area Black 54.1
## # ℹ 15 more rows