Working with Census Data

The R packages tidycensus and tidyverse provide easy access to Census data. To get started with tidycensus, load the package along with the tidyverse and set your Census API key. A key can be obtained from http://api.census.gov/data/key_signup.html.

library(tidycensus)
library(tidyverse)

# Replace the placeholder with your own key; do not share real keys in documents.
census_api_key("YOUR_API_KEY_HERE")
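
Hard-coding a key in a script makes it easy to leak when the script is shared. As a minimal sketch, the key can instead be read from an environment variable. This assumes the key has been stored under the name CENSUS_API_KEY (for example in ~/.Renviron), which is the variable tidycensus itself looks for; the message text is only illustrative.

```r
# Read the Census API key from the environment instead of the script.
# Sys.getenv() returns "" when the variable is not set.
key <- Sys.getenv("CENSUS_API_KEY")

if (nzchar(key)) {
  # Register the key with tidycensus for the current session.
  tidycensus::census_api_key(key)
} else {
  message("CENSUS_API_KEY is not set; add it to ~/.Renviron and restart R.")
}
```

Calling census_api_key() with install = TRUE is another option: it writes the key to .Renviron so later sessions pick it up without any call in the script.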

There are two major functions implemented in tidycensus: get_decennial(), which grants access to the 2000 and 2010 decennial US Census APIs, and get_acs(), which grants access to the 1-year and 5-year American Community Survey APIs.

In this basic example, let’s look at median age by state in 2010:

age10 <- get_decennial(geography = "state", 
                       variables = "P013001", 
                       year = 2010)

head(age10)

## # A tibble: 6 x 4
##   GEOID NAME       variable value
##   <chr> <chr>      <chr>    <dbl>
## 1 01    Alabama    P013001   37.9
## 2 02    Alaska     P013001   33.8
## 3 04    Arizona    P013001   35.9
## 4 05    Arkansas   P013001   37.4
## 5 06    California P013001   35.2
## 6 22    Louisiana  P013001   35.8

The function returns a tibble with four columns by default: GEOID, which is an identifier for the geographical unit associated with the row; NAME, which is a descriptive name of the geographical unit; variable, which is the Census variable represented in the row; and value, which is the value of the variable for that unit. By default, tidycensus functions return tidy data frames in which rows represent unit-variable combinations; for a wide data frame with Census variable names in the columns, set output = "wide" in the function call.

As the function has returned a tidy object, we can visualize it quickly with ggplot2:

age10 %>%
  ggplot(aes(x = value, y = reorder(NAME, value))) + 
  geom_point()

Geography in tidycensus

To get decennial Census data or American Community Survey data, tidycensus users supply an argument to the required geography parameter. Arguments are formatted as consumed by the Census API, and specified in the table below. Not all geographies are available for all surveys, all years, and all variables. Most Census geographies are supported in tidycensus at the moment; if you require a geography that is missing from the table below, please file an issue at https://github.com/walkerke/tidycensus/issues.

If state or county is in bold face in “Available by”, you are required to supply a state and/or county for the given geography.

Geography	Definition	Available by	Available in
`"us"`	United States		`get_acs()`, `get_decennial()`
`"region"`	Census region		`get_acs()`, `get_decennial()`
`"division"`	Census division		`get_acs()`, `get_decennial()`
`"state"`	State or equivalent	state	`get_acs()`, `get_decennial()`
`"county"`	County or equivalent	state, county	`get_acs()`, `get_decennial()`
`"county subdivision"`	County subdivision	state, county	`get_acs()`, `get_decennial()`
`"tract"`	Census tract	state, county	`get_acs()`, `get_decennial()`
`"block group"`	Census block group	state, county	`get_acs()`, `get_decennial()`
`"block"`	Census block	state, county	`get_decennial()`
`"place"`	Census-designated place	state	`get_acs()`, `get_decennial()`
`"alaska native regional corporation"`	Alaska native regional corporation	state	`get_acs()`, `get_decennial()`
`"american indian area/alaska native area/hawaiian home land"`	Federal and state-recognized American Indian reservations and Hawaiian home lands	state	`get_acs()`, `get_decennial()`
`"american indian area/alaska native area (reservation or statistical entity only)"`	Only reservations and statistical entities	state	`get_acs()`, `get_decennial()`
`"american indian area (off-reservation trust land only)/hawaiian home land"`	Only off-reservation trust lands and Hawaiian home lands	state	`get_acs()`
`"metropolitan statistical area/micropolitan statistical area"`	Core-based statistical area	state	`get_acs()`, `get_decennial()`
`"combined statistical area"`	Combined statistical area	state	`get_acs()`, `get_decennial()`
`"new england city and town area"`	New England city/town area	state	`get_acs()`, `get_decennial()`
`"combined new england city and town area"`	Combined New England area	state	`get_acs()`, `get_decennial()`
`"urban area"`	Census-defined urbanized areas		`get_acs()`, `get_decennial()`
`"congressional district"`	Congressional district for the year-appropriate Congress	state	`get_acs()`, `get_decennial()`
`"school district (elementary)"`	Elementary school district	state	`get_acs()`, `get_decennial()`
`"school district (secondary)"`	Secondary school district	state	`get_acs()`, `get_decennial()`
`"school district (unified)"`	Unified school district	state	`get_acs()`, `get_decennial()`
`"public use microdata area"`	PUMA (geography associated with Census microdata samples)	state	`get_acs()`
`"zip code tabulation area"` OR `"zcta"`	Zip code tabulation area		`get_acs()`, `get_decennial()`
`"state legislative district (upper chamber)"`	State senate districts	state	`get_acs()`, `get_decennial()`
`"state legislative district (lower chamber)"`	State house districts	state	`get_acs()`, `get_decennial()`

Searching for variables

Getting variables from the Census or ACS requires knowing the variable ID, and there are thousands of these IDs across the different Census files. To search for variables quickly, use the load_variables() function. The function takes two required arguments: the year of the Census or the end year of the ACS sample, and the dataset, one of "sf1", "sf3", or "acs5". I recommend assigning the result of this function to a variable, setting cache = TRUE to store the result on your computer for future access, and using the View() function in RStudio to browse the variables interactively.

v19 <- load_variables(2019, "acs5", cache = TRUE)

View(v19)

By filtering for “median age” I can quickly view the variable IDs that correspond to my query.
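
The same search can be done in code, which is handy outside RStudio. The helper below is a sketch, not part of tidycensus: find_variables() is a name made up for this example, and it assumes the table returned by load_variables() has the columns name, label, and concept.

```r
# Hypothetical helper: case-insensitive search over the label and concept
# columns of a variable table such as the one returned by load_variables().
find_variables <- function(vars, pattern) {
  hits <- grepl(pattern, vars$label, ignore.case = TRUE) |
    grepl(pattern, vars$concept, ignore.case = TRUE)
  # Keep only the matching rows and the three descriptive columns.
  vars[hits, c("name", "label", "concept")]
}

# Usage, once v19 has been loaded as shown above:
# find_variables(v19, "median age")
```

The pattern is treated as a regular expression, so something like "median (age|income)" would match either phrase.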

Working with ACS data

American Community Survey data differ from decennial Census data in that ACS data are based on an annual sample of approximately 3 million households, rather than a more complete enumeration of the US population. As a result, ACS data points are estimates characterized by a margin of error. tidycensus always returns the estimate and margin of error together for any requested variables when using get_acs(), so it is not necessary to specify the "E" or "M" suffix for a variable name. Let’s fetch median household income data from the 2015-2019 ACS for counties in Vermont.

vt <- get_acs(geography = "county", 
              variables = c(medincome = "B19013_001"), 
              state = "VT", 
              year = 2019)

vt

## # A tibble: 14 x 5
##    GEOID NAME                       variable  estimate   moe
##    <chr> <chr>                      <chr>        <dbl> <dbl>
##  1 50001 Addison County, Vermont    medincome    68825  3253
##  2 50003 Bennington County, Vermont medincome    56183  3305
##  3 50005 Caledonia County, Vermont  medincome    50563  2497
##  4 50007 Chittenden County, Vermont medincome    73647  2249
##  5 50009 Essex County, Vermont      medincome    44349  3355
##  6 50011 Franklin County, Vermont   medincome    65485  2094
##  7 50013 Grand Isle County, Vermont medincome    71587  6623
##  8 50015 Lamoille County, Vermont   medincome    64003  2777
##  9 50017 Orange County, Vermont     medincome    60925  1966
## 10 50019 Orleans County, Vermont    medincome    49168  2021
## 11 50021 Rutland County, Vermont    medincome    56139  1883
## 12 50023 Washington County, Vermont medincome    62791  2283
## 13 50025 Windham County, Vermont    medincome    51985  2160
## 14 50027 Windsor County, Vermont    medincome    60987  2083

The output is similar to a call to get_decennial(), but instead of a value column, get_acs() returns estimate and moe columns for the ACS estimate and margin of error, respectively. moe is the margin of error at the default 90 percent confidence level; this can be changed to 95 or 99 percent with the moe_level parameter in get_acs() if desired.
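
To make the confidence levels concrete, here is a small sketch that rescales a 90 percent margin of error to another level by the ratio of the conventional z-scores (1.645, 1.960, 2.576). I am assuming this matches what moe_level does; rescale_moe() is a name made up for this example, not a tidycensus function. The Addison County figures are copied from the output above.

```r
# Conventional z-scores for the 90, 95, and 99 percent confidence levels.
z_scores <- c("90" = 1.645, "95" = 1.960, "99" = 2.576)

# Hypothetical helper: rescale a 90 percent margin of error to another level.
rescale_moe <- function(moe, to = 95) {
  to <- as.character(to)
  stopifnot(to %in% names(z_scores))
  moe * z_scores[[to]] / z_scores[["90"]]
}

# Addison County, Vermont: estimate and 90 percent margin of error.
addison <- data.frame(estimate = 68825, moe = 3253)

# 95 percent margin of error and the matching interval around the estimate.
addison$moe95   <- rescale_moe(addison$moe, to = 95)
addison$lower95 <- addison$estimate - addison$moe95
addison$upper95 <- addison$estimate + addison$moe95

print(addison)
```

A higher confidence level always widens the interval, which is worth keeping in mind when comparing counties whose estimates are close together.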

As we have the margin of error, we can visualize the uncertainty around the estimate:

vt %>%
  mutate(NAME = gsub(" County, Vermont", "", NAME)) %>%
  ggplot(aes(x = estimate, y = reorder(NAME, estimate))) +
  geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) +
  geom_point(color = "red", size = 3) +
  labs(title = "Household income by county in Vermont",
       subtitle = "2015-2019 American Community Survey",
       y = "",
       x = "ACS estimate (bars represent margin of error)")

Importing Data into Redshift & Downloading Data as CSV

ACS datasets can be imported into Redshift or downloaded as CSV files, depending on the intended usage.

To import census data into Civis, use write_civis(x, "schema.table_name") from the civis R package. The data frame x and the schema-qualified table name are required; the remaining arguments are optional:

write_civis(x, tablename, database = NULL, if_exists = "fail", distkey = NULL, sortkey1 = NULL, sortkey2 = NULL, max_errors = NULL, verbose = FALSE, hidden = TRUE, diststyle = NULL, header = TRUE, credential_id = NULL, import_args = NULL, ...)

Arguments

x: data frame, file path of a CSV, or the id of a CSV file on S3 to upload to Platform.

tablename: string, name of the table and schema, in the form "schema.tablename".

database: string, name of the database where the data frame is to be uploaded. If no database is specified, uses options(civis.default_db).

if_exists: string, optional, action to take if the table already exists. Must be one of "fail", "drop", "truncate", or "append". Defaults to "fail".

distkey: string, optional, column name designating the distkey.

sortkey1: string, optional, column name designating the first sortkey.

sortkey2: string, optional, column name designating the second (compound) sortkey.

max_errors: int, optional, maximum number of rows with errors to remove before failing.

verbose: bool, set to TRUE to print intermediate progress indicators.

hidden: bool, if TRUE (default), this job will not appear in the Civis UI.

diststyle: string, optional, the diststyle to use for the table. One of "even", "all", or "key".

header: bool, if TRUE (default), the first row is a header.

credential_id: integer, the id of the credential to be used when performing the database import. If NULL (default), the default credential of the current user will be used.

import_args: list of additional arguments for imports_post_files.

delimiter: string, optional, which delimiter to use. One of ',', '\t', or '|'.

The following illustrates importing data directly into Civis, using the vt data frame from above:

library(civis)
write_civis(vt, "demographics.census_vermont")

## <civis_api>
## List of 7
##  $ id               : int 303841550
##  $ importId         : int 108185179
##  $ state            : chr "succeeded"
##  $ isCancelRequested: logi FALSE
##  $ startedAt        : chr "2021-02-08T18:19:50.000Z"
##  $ finishedAt       : chr "2021-02-08T18:19:57.000Z"
##  $ error            : NULL
##  - attr(*, "class")= chr [1:2] "civis_api" "list"
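
Because if_exists defaults to "fail", running the same import a second time stops with an error once the table exists. The wrapper below is one way to make that choice explicit. It is only a sketch: upload_census_table() is a name made up for this example, the table-name check is my own addition, and the real call needs the civis package and valid Civis credentials.

```r
# Hypothetical wrapper around write_civis() that validates its inputs
# before any data is sent to Civis.
upload_census_table <- function(df, tablename,
                                if_exists = c("fail", "drop", "truncate", "append")) {
  # Reject anything other than the four documented if_exists values.
  if_exists <- match.arg(if_exists)
  stopifnot(is.data.frame(df))
  # Require the "schema.tablename" form described in the argument list.
  if (!grepl("^[^.]+\\.[^.]+$", tablename)) {
    stop("tablename must look like \"schema.tablename\"")
  }
  civis::write_civis(df, tablename, if_exists = if_exists)
}

# Usage (requires Civis credentials), replacing the table on each run:
# upload_census_table(vt, "demographics.census_vermont", if_exists = "drop")
```

Use "drop" to replace the table on every run, or "append" to add rows to an existing table across runs.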

Alternatively, the data frame can be written to a CSV file:

write.csv(vt, "vermont_demo.csv")
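
Two details are worth knowing when round-tripping Census data through CSV: write.csv() adds a column of row names unless row.names = FALSE is set, and GEOID values such as "01" lose their leading zero if they are read back as numbers. The sketch below uses two state rows copied from the median age output earlier in this document and a temporary file; the file name is only illustrative.

```r
# Two rows from the 2010 median age example; GEOID is stored as text.
states <- data.frame(
  GEOID = c("01", "02"),
  NAME  = c("Alabama", "Alaska"),
  value = c(37.9, 33.8),
  stringsAsFactors = FALSE
)

# Write to a temporary file without the extra row-name column.
csv_path <- file.path(tempdir(), "census_demo.csv")
write.csv(states, csv_path, row.names = FALSE)

# Read GEOID back as character so the leading zeros survive.
restored <- read.csv(csv_path, colClasses = c(GEOID = "character"))
print(restored)
```

Without the colClasses setting, read.csv() would parse GEOID as an integer and "01" would come back as 1, which breaks later joins against other Census tables.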