This file will support users in accessing the U.S. Decennial Census data.

The code and data in this file are based on the 2025 workshop “Working with Decennial Census Data in R,” led by Prof. Kyle Walker (TCU); the slides for this workshop can be found here and the repository for the workshop can be found here.

Variable selection and targets have been prioritized for members of the Quantitative Histories Workshop and other quant-shop-users. Please note that you will need to have your census API key installed on your local machine.
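
As a quick check before running the code below, the following sketch looks for a stored key. It assumes the key was saved with tidycensus::census_api_key(), which stores it in the CENSUS_API_KEY environment variable; the placeholder YOUR_KEY_HERE is hypothetical and must be replaced with your own key.

```r
# Minimal sketch: check whether a Census API key is available in this session.
# Assumption: the key was stored via tidycensus::census_api_key(), which
# saves it in the CENSUS_API_KEY environment variable.
key <- Sys.getenv("CENSUS_API_KEY")
has_key <- nzchar(key)

if (has_key) {
  message("Census API key found.")
} else {
  message(
    "No key found. Request one from the Census Bureau, then run:\n",
    "  tidycensus::census_api_key(\"YOUR_KEY_HERE\", install = TRUE)"
  )
}
```

Using install = TRUE writes the key to your .Renviron file, so it should be available in future sessions after R restarts.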

knitr::opts_chunk$set(echo = TRUE)

pkgs <- c("tidycensus", "tidyverse", "sf", "weights", "mapview", "survey", "srvyr", "arcgislayers")
# install.packages(pkgs) # uncomment to install the packages

library(tidyverse)  # also attaches ggplot2, dplyr, and stringr
library(tidycensus) # access to the Census Bureau APIs
library(sf)         # spatial data support
library(weights)    # weighted statistics

options(tigris_use_cache = TRUE)

1 Data

We will access the 2020 Census data using tidycensus.

1.1 State-level population data

We will gather the 2020 state-level population data using the PL 94-171 Redistricting Data Summary File. This data was used to inform redistricting as a part of a U.S. constitutional mandate.

pop20 <- get_decennial(
  geography = "state",
  variables = "P1_001N",
  year = 2020
)

pop20

## # A tibble: 52 × 4
##    GEOID NAME                 variable    value
##    <chr> <chr>                <chr>       <dbl>
##  1 42    Pennsylvania         P1_001N  13002700
##  2 06    California           P1_001N  39538223
##  3 54    West Virginia        P1_001N   1793716
##  4 49    Utah                 P1_001N   3271616
##  5 36    New York             P1_001N  20201249
##  6 11    District of Columbia P1_001N    689545
##  7 02    Alaska               P1_001N    733391
##  8 12    Florida              P1_001N  21538187
##  9 45    South Carolina       P1_001N   5118425
## 10 38    North Dakota         P1_001N    779094
## # ℹ 42 more rows

Passing the table argument instead of variables retrieves every variable in that table (here, P2) in a single call.

table_p2 <- get_decennial(
  geography = "state", 
  table = "P2", 
  year = 2020
)

head(table_p2)

## # A tibble: 6 × 4
##   GEOID NAME                 variable    value
##   <chr> <chr>                <chr>       <dbl>
## 1 42    Pennsylvania         P2_001N  13002700
## 2 06    California           P2_001N  39538223
## 3 54    West Virginia        P2_001N   1793716
## 4 49    Utah                 P2_001N   3271616
## 5 36    New York             P2_001N  20201249
## 6 11    District of Columbia P2_001N    689545

tail(table_p2)

## # A tibble: 6 × 4
##   GEOID NAME         variable value
##   <chr> <chr>        <chr>    <dbl>
## 1 32    Nevada       P2_073N     15
## 2 10    Delaware     P2_073N      5
## 3 72    Puerto Rico  P2_073N      0
## 4 21    Kentucky     P2_073N     13
## 5 46    South Dakota P2_073N      1
## 6 47    Tennessee    P2_073N     10

1.2 DC data

We will start with data on Washington, DC.

Take note of the tibble dimensions: DC is a single county-equivalent, so a county-level request returns one row.

dc_population <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  state = "DC",
  sumfile = "dhc",
  year = 2020
)

dc_population

## # A tibble: 1 × 4
##   GEOID NAME                                       variable  value
##   <chr> <chr>                                      <chr>     <dbl>
## 1 11001 District of Columbia, District of Columbia P1_001N  689545

1.3 NC data

We will also collect data on North Carolina.

nc_population <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  state = "NC",
  sumfile = "dhc",
  year = 2020
)

nc_population

## # A tibble: 100 × 4
##    GEOID NAME                             variable  value
##    <chr> <chr>                            <chr>     <dbl>
##  1 37151 Randolph County, North Carolina  P1_001N  144171
##  2 37145 Person County, North Carolina    P1_001N   39097
##  3 37147 Pitt County, North Carolina      P1_001N  170243
##  4 37043 Clay County, North Carolina      P1_001N   11089
##  5 37037 Chatham County, North Carolina   P1_001N   76285
##  6 37039 Cherokee County, North Carolina  P1_001N   28774
##  7 37041 Chowan County, North Carolina    P1_001N   13708
##  8 37045 Cleveland County, North Carolina P1_001N   99519
##  9 37047 Columbus County, North Carolina  P1_001N   50623
## 10 37049 Craven County, North Carolina    P1_001N  100720
## # ℹ 90 more rows

1.4 State and County

Given the uniqueness of DC, we will focus on NC, specifically Mecklenburg County, NC.

Following Walker, a census block is roughly analogous to a city block, although its size varies by geography.

The block is the smallest geography that you can get in the decennial data.

nc_mecklenburg_blocks <- get_decennial(
  geography = "block",
  variables = "P1_001N",
  state = "NC",
  county = "Mecklenburg",
  sumfile = "dhc",
  year = 2020
)

nc_mecklenburg_blocks

## # A tibble: 12,132 × 4
##    GEOID           NAME                                           variable value
##    <chr>           <chr>                                          <chr>    <dbl>
##  1 371190005012017 Block 2017, Block Group 2, Census Tract 5.01,… P1_001N      0
##  2 371190005012014 Block 2014, Block Group 2, Census Tract 5.01,… P1_001N      0
##  3 371190005012015 Block 2015, Block Group 2, Census Tract 5.01,… P1_001N      0
##  4 371190005012016 Block 2016, Block Group 2, Census Tract 5.01,… P1_001N     46
##  5 371190005012018 Block 2018, Block Group 2, Census Tract 5.01,… P1_001N      0
##  6 371190005012019 Block 2019, Block Group 2, Census Tract 5.01,… P1_001N      0
##  7 371190005012020 Block 2020, Block Group 2, Census Tract 5.01,… P1_001N      0
##  8 371190005021000 Block 1000, Block Group 1, Census Tract 5.02,… P1_001N      0
##  9 371190005021001 Block 1001, Block Group 1, Census Tract 5.02,… P1_001N      0
## 10 371190005021002 Block 1002, Block Group 1, Census Tract 5.02,… P1_001N    103
## # ℹ 12,122 more rows
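
The block GEOIDs above nest the larger geographies inside them: 2 digits for the state, 3 for the county, 6 for the tract, and 4 for the block. As a minimal sketch using only base R, the first GEOID in the output can be split into those parts; the object names here are illustrative.

```r
# Minimal sketch: split a 15-digit 2020 block GEOID into its nested parts.
# The GEOID below is copied from the first row of the output above.
geoid <- "371190005012017"

parts <- c(
  state  = substr(geoid, 1, 2),   # state FIPS code
  county = substr(geoid, 3, 5),   # county FIPS code
  tract  = substr(geoid, 6, 11),  # census tract code
  block  = substr(geoid, 12, 15)  # block code
)

# The first digit of the block code identifies the block group.
block_group <- substr(parts[["block"]], 1, 1)

parts
block_group
```

The pieces line up with the NAME column for that row: Block 2017, Block Group 2, Census Tract 5.01 (tract code 000501).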

2 Variables

Next, we will identify our variables of interest.

vars <- load_variables(2020, "dhc")

View(vars) # view the variables, a new window will pop up

When the data frame loads in a new window, you can use the search function to identify variables.

Please note the different variable concepts and labels, which are signaled by the prefix of each variable name:

P: person-level variables
H: housing-level variables
HCT: housing-level variables only available at the census tract level and up.
PCT: person-level variables only available at the census tract level and up.
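
To make the prefix convention concrete, here is a small base R sketch that pulls the leading letters out of a few variable names. The names are the ones used elsewhere in this file (the DP prefix belongs to the Demographic Profile variables in section 2.1.1); the object names var_names and prefix are illustrative.

```r
# Minimal sketch: extract the table prefix from Census variable names.
# All names below are taken from the examples in this file.
var_names <- c("P1_001N", "H10_004N", "H10_012N", "P5_010N", "DP1_0116P")

# Keep the leading capital letters that come before the first digit.
prefix <- sub("^([A-Z]+)[0-9].*$", "\\1", var_names)

data.frame(name = var_names, prefix = prefix)
```

The same idea can be applied to the vars data frame itself, for example filter(vars, str_detect(name, "^PCT")) to list only the PCT variables.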

2.1 Sumfiles

Take note of the sumfile that is used to access the data based on the variable name and concept.

get_decennial() defaults to sumfile = "pl" (the PL 94-171 Redistricting Data Summary File).

To access the Demographic and Housing Characteristics (DHC) variables, you will need to set sumfile = "dhc".

2.1.1 Example: Black owner-occupied or renter-occupied housing

We take a look at counts of owner-occupied and renter-occupied housing units with a householder who is Black or African American alone.

nc_housing_black <- get_decennial(
  geography = "county",
  state = "NC",
  variables = c(owner = "H10_004N",  # owner occupied
                renter = "H10_012N"), # renter occupied
  year = 2020,
  sumfile = "dhc",
  output = "wide"
)

nc_housing_black

## # A tibble: 100 × 4
##    GEOID NAME                             owner renter
##    <chr> <chr>                            <dbl>  <dbl>
##  1 37151 Randolph County, North Carolina   1721   1838
##  2 37145 Person County, North Carolina     2279   1791
##  3 37147 Pitt County, North Carolina       8769  15621
##  4 37043 Clay County, North Carolina         19     21
##  5 37037 Chatham County, North Carolina    2196   1162
##  6 37039 Cherokee County, North Carolina    100     58
##  7 37041 Chowan County, North Carolina      879    925
##  8 37045 Cleveland County, North Carolina  3468   4630
##  9 37047 Columbus County, North Carolina   3306   2274
## 10 37049 Craven County, North Carolina     4167   4036
## # ℹ 90 more rows
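
Because output = "wide" places each variable in its own column, row-wise calculations are straightforward. The sketch below is one possible follow-up, not part of Walker's workshop: it rebuilds the first three rows of the output above by hand and computes the owner-occupied share; the names nc_sample, occupied, and owner_pct are illustrative.

```r
library(dplyr)

# First three rows of the wide output printed above, entered by hand
# so that the example runs without an API call.
nc_sample <- tibble(
  NAME = c(
    "Randolph County, North Carolina",
    "Person County, North Carolina",
    "Pitt County, North Carolina"
  ),
  owner  = c(1721, 2279, 8769),
  renter = c(1838, 1791, 15621)
)

# Share of these occupied units that are owner occupied, as a percent.
nc_sample <- nc_sample %>%
  mutate(
    occupied  = owner + renter,
    owner_pct = 100 * owner / occupied
  )

nc_sample
```

The same mutate() call should work on the full nc_housing_black tibble, since it has the same owner and renter columns.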

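Following an example from Walker, we can look at same-sex married and same-sex partnered households by county. These variables come from the Demographic Profile (sumfile = "dp"), and the P suffix indicates a percentage rather than a count.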

nc_samesex <- get_decennial(
  geography = "county",
  state = "NC",
  variables = c(married = "DP1_0116P",
                partnered = "DP1_0118P"),
  year = 2020,
  sumfile = "dp",
  output = "wide"
)

nc_samesex

## # A tibble: 100 × 4
##    GEOID NAME                             married partnered
##    <chr> <chr>                              <dbl>     <dbl>
##  1 37001 Alamance County, North Carolina      0.2       0.1
##  2 37003 Alexander County, North Carolina     0.1       0.1
##  3 37005 Alleghany County, North Carolina     0.1       0.1
##  4 37007 Anson County, North Carolina         0.1       0.1
##  5 37009 Ashe County, North Carolina          0.2       0.1
##  6 37011 Avery County, North Carolina         0.1       0.1
##  7 37013 Beaufort County, North Carolina      0.1       0.1
##  8 37015 Bertie County, North Carolina        0.1       0.1
##  9 37017 Bladen County, North Carolina        0.1       0.1
## 10 37019 Brunswick County, North Carolina     0.2       0.1
## # ℹ 90 more rows

2.2 Summary variables

We use core-based statistical areas (CBSAs) to take a look at summary variables.

We follow Walker and use an example with race-ethnicity breakdowns.

race_vars <- c(
  Hispanic = "P5_010N",
  White = "P5_003N",
  Black = "P5_004N",
  Native = "P5_005N",
  Asian = "P5_006N",
  HIPI = "P5_007N"
)

# Core-Based Statistical Areas (CBSA)
cbsa_race <- get_decennial(
  geography = "cbsa",
  variables = race_vars,
  summary_var = "P5_001N", 
  year = 2020,
  sumfile = "dhc"
)

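Whether we get a count or a proportion depends on the variables requested.

For example, in the code above, the DHC housing variables provided counts and the DP variables provided percentages. Here, summary_var adds the P5_001N total as a summary_value column, which can serve as a denominator.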

We can then arrange data to view the largest populated areas by race.

arrange(cbsa_race, desc(value)) %>% 
  head(n = 15) # view the top 15 rows

## # A tibble: 15 × 5
##    GEOID NAME                                      variable  value summary_value
##    <chr> <chr>                                     <chr>     <dbl>         <dbl>
##  1 35620 New York-Newark-Jersey City, NY-NJ-PA Me… White    8.71e6      20140470
##  2 31080 Los Angeles-Long Beach-Anaheim, CA Metro… Hispanic 5.89e6      13200998
##  3 35620 New York-Newark-Jersey City, NY-NJ-PA Me… Hispanic 5.07e6      20140470
##  4 16980 Chicago-Naperville-Elgin, IL-IN-WI Metro… White    4.82e6       9618502
##  5 31080 Los Angeles-Long Beach-Anaheim, CA Metro… White    3.76e6      13200998
##  6 37980 Philadelphia-Camden-Wilmington, PA-NJ-DE… White    3.69e6       6245051
##  7 14460 Boston-Cambridge-Newton, MA-NH Metro Area White    3.29e6       4941632
##  8 19100 Dallas-Fort Worth-Arlington, TX Metro Ar… White    3.27e6       7637387
##  9 35620 New York-Newark-Jersey City, NY-NJ-PA Me… Black    3.00e6      20140470
## 10 33100 Miami-Fort Lauderdale-Pompano Beach, FL … Hispanic 2.82e6       6138333
## 11 19820 Detroit-Warren-Dearborn, MI Metro Area    White    2.80e6       4392041
## 12 47900 Washington-Arlington-Alexandria, DC-VA-M… White    2.70e6       6385162
## 13 26420 Houston-The Woodlands-Sugar Land, TX Met… Hispanic 2.67e6       7122240
## 14 12060 Atlanta-Sandy Springs-Alpharetta, GA Met… White    2.66e6       6089815
## 15 33460 Minneapolis-St. Paul-Bloomington, MN-WI … White    2.65e6       3690261

Next, we convert the counts to proportions (percent) so that areas of different sizes can be compared, rather than relying on the raw counts.

cbsa_race_percent <- cbsa_race %>%
  mutate(percent = 100 * (value / summary_value)) %>% 
  select(NAME, variable, percent) 

cbsa_race_percent

## # A tibble: 5,634 × 3
##    NAME                                         variable percent
##    <chr>                                        <chr>      <dbl>
##  1 Yakima, WA Metro Area                        Hispanic   50.7 
##  2 Yankton, SD Micro Area                       Hispanic    5.29
##  3 Yauco, PR Metro Area                         Hispanic   99.4 
##  4 York-Hanover, PA Metro Area                  Hispanic    8.62
##  5 Youngstown-Warren-Boardman, OH-PA Metro Area Hispanic    3.67
##  6 Yuba City, CA Metro Area                     Hispanic   30.4 
##  7 Yuma, AZ Metro Area                          Hispanic   63.8 
##  8 Zanesville, OH Micro Area                    Hispanic    1.22
##  9 Zapata, TX Micro Area                        Hispanic   93.6 
## 10 Wenatchee, WA Metro Area                     Hispanic   30.1 
## # ℹ 5,624 more rows

Notice that the areas listed have changed: this table is no longer sorted by count.

2.3 Group-wise analysis

We can view the largest proportions of groups by metro area.

largest_group <- cbsa_race_percent %>%
  group_by(NAME) %>% 
  filter(percent == max(percent))
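
To see what the group-wise filter does, the following sketch applies the same steps to a small hand-built table. The counts are the rounded values printed in the top-15 output above for two areas only, so the percentages are approximate; the names toy and toy_largest are illustrative.

```r
library(dplyr)

# Rounded counts copied from the top-15 output above (two areas only).
toy <- tibble(
  NAME = c(
    rep("New York-Newark-Jersey City", 3),
    rep("Los Angeles-Long Beach-Anaheim", 2)
  ),
  variable      = c("White", "Hispanic", "Black", "Hispanic", "White"),
  value         = c(8.71e6, 5.07e6, 3.00e6, 5.89e6, 3.76e6),
  summary_value = c(rep(20140470, 3), rep(13200998, 2))
)

# Same steps as above: compute percent, then keep the largest group per area.
toy_largest <- toy %>%
  mutate(percent = 100 * (value / summary_value)) %>%
  group_by(NAME) %>%
  filter(percent == max(percent)) %>%
  ungroup()

toy_largest
```

Each area keeps only the row with its highest percent. An alternative worth considering is slice_max(percent, n = 1) after group_by(), which also keeps ties by default.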

We can then list the metro and micro areas where Black residents are the largest group, sorted by percentage.

largest_group %>% 
  arrange(desc(percent)) %>% 
  filter(variable == "Black")

## # A tibble: 25 × 3
## # Groups:   NAME [25]
##    NAME                              variable percent
##    <chr>                             <chr>      <dbl>
##  1 Clarksdale, MS Micro Area         Black       75.8
##  2 Greenville, MS Micro Area         Black       71.1
##  3 Selma, AL Micro Area              Black       69.7
##  4 Indianola, MS Micro Area          Black       69.6
##  5 Helena-West Helena, AR Micro Area Black       62.2
##  6 Greenwood, MS Micro Area          Black       62.2
##  7 Cleveland, MS Micro Area          Black       62.1
##  8 Orangeburg, SC Micro Area         Black       60.3
##  9 West Point, MS Micro Area         Black       57.9
## 10 Forrest City, AR Micro Area       Black       54.1
## # ℹ 15 more rows

Analyzing Decennial Census Data

Nathan Alexander, PhD

Center for Applied Data Science and Analytics

Howard University