Data from the US Census Bureau can be an invaluable resource for research. The enormous scale of Census data (both its breadth over time and the depth of its options!) can be overwhelming, so this tutorial is intended to guide students through using Census data in R. Although it is possible to download data from the Census manually (if you would prefer to do that, see the note below), we focus on doing so programmatically because it has several benefits. By using Census data in R:
Before we begin: if you would prefer not to use R and would rather download data manually, you may be interested in the official Census tutorials on working with their data. They focus on downloading data manually from the data.census.gov portal.
The first hurdle is knowing what data you are looking for. On their website, the US Census Bureau states that their mission is “…to serve as the nation’s leading provider of quality data about its people and economy.” As a result, they make an astonishing amount of information available to you. The screenshot below shows some of the sources offered by the Census Bureau, many of which you may have heard of:
List of Census Surveys and Programs - a lot to work with!
Here, we will focus on the two most popular Census products - the Decennial Census and the American Community Survey (ACS). Much more detail about the differences between the two can be found here, but the core differences are:
We will look at each release in much more detail below, but in general the Decennial Census is the best source of general population data (counts by area, racial group, etc.), though it can be several years out of date late in a decade (this will not be a problem for at least a few years given the 2020 Census). The ACS provides sample-based estimates on a wider variety of demographic topics.
All Census releases have data available at a wide variety of geographies. This graphic, from the Missouri Census Data Center, summarizes the geographic levels at which you can get data. The Decennial Census is available down to the Census Block level (generally comparable to a city block in urban areas, but can be geographically larger in rural areas – normally representing a few dozen or hundred people), while the ACS is available down to the block group level (combinations of adjacent blocks). Much more detail on each of these geographies is available at this link.
Diagram of Census geographic levels.
Both the Decennial Census and the ACS have a wide variety of information (called “variables”) that you can request. You can browse variables manually by searching for them on data.census.gov, but the R package tidycensus offers a more systematic way to do this.
The first step is to request a Census API Key. An API is a set of functions that lets you interact with the Census system that serves the data you request, and an API Key is like a password that grants you access. To request a key, visit this link and enter your information. They will email you a private key. Save it somewhere safe!
Then, we need the tidycensus package. If you have never used it before, you should run install.packages("tidycensus") once to install it. Once that is complete, load the tidycensus and tidyverse packages (install tidyverse the same way if you don’t have it).
library(tidyverse)
library(tidycensus)
Then, we use the tidycensus function census_api_key() to “login” to the Census. You should run this one time (ever! Your R will remember the password) by pasting this code into the console:
census_api_key("PASTE YOUR API KEY FROM THE EMAIL HERE", install = TRUE)
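If you are not sure whether the key was saved, a quick check like the one below can help. It is a minimal sketch that assumes the key is stored in the CENSUS_API_KEY environment variable, which is where census_api_key(install = TRUE) writes it (in your .Renviron file).

```r
# Look up the stored key without printing it to the console.
key <- Sys.getenv("CENSUS_API_KEY")

if (nzchar(key)) {
  message("Census API key found.")
} else {
  # A freshly installed key is only picked up after .Renviron is re-read.
  message("No key found yet. Restart R or run readRenviron(\"~/.Renviron\").")
}
```

If the check reports that no key was found right after you installed one, restarting R is usually enough.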
To see the variables available to you in a particular survey, you can use the load_variables() function. The function takes arguments for the year you are interested in and the survey (Census or ACS). You can also find these lists on the Census website (on individual pages for each survey, like this one for the 2019 ACS). Let’s look at all available variables from the 2018 ACS 5-year estimates:
# the cache = TRUE argument stores the result on your computer,
# which will make it run much faster next time you use it.
acs2018 <- load_variables(2018, "acs5", cache = TRUE)
This stores a dataset of available variables as a tibble in R (a type of data frame). You can easily explore it by running the View() function on the object, which will open a new window where you can scroll through all of the available variables (or type a phrase next to the magnifying glass to find something of interest to you). In these examples, the name column lists an ID code for the variable (this is what we will use in our requests), the label column gives a descriptive label (the Census uses !! symbols to indicate levels, so Estimate!!Total!!Male means the number of males), and the concept column gives more detail about the question. You can search through these objects to choose whatever variables you want to work with.
Example ACS columns.
Using the RStudio search option in View to search for variables related to ancestry.
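If you would rather search from code than through the View() window, one option is to filter the label column with dplyr and stringr. The sketch below runs on a tiny stand-in tibble so that it works without an API key: the two variable codes and labels follow this tutorial, but the concept text is a placeholder rather than the real Census wording, and the object name vars is made up for illustration.

```r
library(dplyr)
library(stringr)

# Stand-in for the tibble returned by load_variables().
# The concept values are placeholders, not real Census text.
vars <- tibble::tibble(
  name    = c("B04004_001", "B04004_034"),
  label   = c("Estimate!!Total", "Estimate!!Total!!Dutch"),
  concept = c("ANCESTRY (placeholder)", "ANCESTRY (placeholder)")
)

# Keep only rows whose label mentions "dutch", ignoring case.
dutch_vars <- vars %>%
  filter(str_detect(label, regex("dutch", ignore_case = TRUE)))

dutch_vars
```

To search the real list, replace vars with acs2018. Filtering on the concept column instead of label works the same way.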
The Census API (like all APIs) works through “requests.” Basically, you ask the Census nicely for specific data in a format that they offer, and they return a dataset to you. The tidycensus package handles cleaning this data and presenting it nicely in R. The functions get_decennial() and get_acs() are how you request data from each of the surveys we discussed above. The most important arguments are the year you want, the geography (according to the levels in the diagram above), and the variable (using the ID code we found in the previous section). It is also tempting with any Census data to visualize it on a map, so we will also set the argument geometry = TRUE to ask for spatial data as well. You can see a full list of arguments for this (or any other) function by running ?get_acs in the console. For example, let’s request the number of people reporting Dutch ancestry in every New Jersey county from the 2017 ACS 5-year estimates (get_acs() requests the 5-year ACS by default):
# B04004_034 = Estimate!!Total!!Dutch
dutch <- get_acs(geography = "county", state = "NJ",
                 variables = "B04004_034", year = 2017, geometry = TRUE)
head(dutch)
## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -75.19511 ymin: 39.28857 xmax: -73.89398 ymax: 41.13344
## Geodetic CRS: NAD83
## GEOID NAME variable estimate moe
## 1 34003 Bergen County, New Jersey B04004_034 3173 435
## 2 34007 Camden County, New Jersey B04004_034 405 134
## 3 34013 Essex County, New Jersey B04004_034 632 174
## 4 34019 Hunterdon County, New Jersey B04004_034 544 140
## 5 34001 Atlantic County, New Jersey B04004_034 304 100
## 6 34005 Burlington County, New Jersey B04004_034 755 233
## geometry
## 1 MULTIPOLYGON (((-74.27066 4...
## 2 MULTIPOLYGON (((-75.13598 3...
## 3 MULTIPOLYGON (((-74.37623 4...
## 4 MULTIPOLYGON (((-75.19511 4...
## 5 MULTIPOLYGON (((-74.98527 3...
## 6 MULTIPOLYGON (((-75.05902 3...
The function may print out some progress messages to let you know it is working, and if you receive data back in the object you know that it worked.
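Requests to the Decennial Census follow the same pattern through get_decennial(). The sketch below is only an illustration: the variable code P001001 (total population in the 2010 Census) is not used elsewhere in this tutorial and is an assumption here, so confirm it with load_variables(2010, "sf1") before relying on it. The request is skipped when no API key is available.

```r
library(tidycensus)

# Only send the request when an API key has been installed.
if (nzchar(Sys.getenv("CENSUS_API_KEY"))) {
  # P001001 is assumed to be the 2010 total population count.
  nj_pop <- get_decennial(geography = "county", state = "NJ",
                          variables = "P001001", year = 2010)
  print(head(nj_pop))
}
```

Because the Decennial Census is a full count rather than a sample, the result has a single value column in place of the estimate and moe columns returned by get_acs().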
As described above, a big benefit of loading Census data directly from R is that you can immediately start to analyze and visualize it without any further work or cleaning. Here, let’s use ggplot2 together with the sf package to make a simple map of our requested variable. This is possible because we asked for the map information with geometry = TRUE.
library(sf)
p1 <- dutch %>%
  ggplot() +
  geom_sf(aes(fill = estimate)) +
  # Code below is just minor visual editing, not necessary for the map to run!
  ggthemes::theme_map() +
  theme(legend.position = "right") +
  scale_fill_gradient2(name = "Dutch Population",
                       low = "white",
                       high = "orange")
p1
If you have experience working with demographic data, you may be thinking that maps of raw population counts often just reflect population density. For example, our map above may just reflect that there are more people overall in northern New Jersey, not necessarily that the proportion of the population that is Dutch is particularly high.
That’s why people often normalize Census data by turning raw counts into a proportion of the people in the area. For example, here we could plot the proportion of people in a given county who identified as Dutch. We would do that by dividing the same variable as above by the total population in that area. This pattern is so common that tidycensus has a built-in option for it: the summary_var argument, which adds the columns summary_est and summary_moe to our dataset, holding the variable we would like to normalize by.
If we open acs2018 back up again, we can see that the total number of people who reported their ancestry is given by the B04004_001 column, so this is what we will include as a summary variable:
dutch_prop <- get_acs(geography = "county", state = "NJ",
                      variables = "B04004_034", summary_var = "B04004_001",
                      year = 2017, geometry = TRUE)
head(dutch_prop)
## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -75.19511 ymin: 39.28857 xmax: -73.89398 ymax: 41.13344
## Geodetic CRS: NAD83
## GEOID NAME variable estimate moe summary_est
## 1 34003 Bergen County, New Jersey B04004_034 3173 435 610508
## 2 34007 Camden County, New Jersey B04004_034 405 134 286937
## 3 34013 Essex County, New Jersey B04004_034 632 174 586676
## 4 34019 Hunterdon County, New Jersey B04004_034 544 140 57850
## 5 34001 Atlantic County, New Jersey B04004_034 304 100 155640
## 6 34005 Burlington County, New Jersey B04004_034 755 233 216941
## summary_moe geometry
## 1 5030 MULTIPOLYGON (((-74.27066 4...
## 2 3110 MULTIPOLYGON (((-75.13598 3...
## 3 4764 MULTIPOLYGON (((-74.37623 4...
## 4 1554 MULTIPOLYGON (((-75.19511 4...
## 5 2705 MULTIPOLYGON (((-74.98527 3...
## 6 2627 MULTIPOLYGON (((-75.05902 3...
Now, we see that the estimate column gives us the number of people who reported Dutch ancestry, while the summary_est column reports the number of people who reported any ancestry at all. The moe and summary_moe columns are the margins of error (at a 90 percent confidence level by default), which give an indication of how precise the ACS thinks these estimates are.
Let’s use the columns above to calculate the proportion of each county in NJ that is Dutch (according to the 2017 ACS 5-year estimates, matching the year = 2017 in our request). We will use tidyverse functions to do this. If you’d like an introduction to or review of using tidyverse to work with data in R, I’d recommend these introductory materials or this free book by the creator of tidyverse for a more advanced look.
# Calculate the proportion of people who identify as Dutch
# Divide Dutch # by total
p2 <- dutch_prop %>%
  # I rename the columns for clarity, this is not necessary
  rename(dutch = estimate,
         total = summary_est) %>%
  mutate(prop = dutch / total) %>%
  ggplot() +
  geom_sf(aes(fill = prop)) +
  # Code below is just minor visual editing, not necessary for the map to run!
  ggthemes::theme_map() +
  theme(legend.position = "right") +
  # The fill is now a proportion, so the legend title says so
  scale_fill_gradient2(name = "Dutch Share of Population",
                       low = "white",
                       high = "orange",
                       labels = scales::percent)
p2
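The proportion is itself an estimate, so it carries its own margin of error. The sketch below applies the Census Bureau's approximation for the margin of error of a derived proportion to the Bergen County row printed earlier. The function name prop_moe is made up for this illustration, and the four input numbers are copied from that output.

```r
# Approximate margin of error for the proportion p = num / denom.
prop_moe <- function(num, denom, moe_num, moe_denom) {
  p <- num / denom
  under_root <- moe_num^2 - p^2 * moe_denom^2
  # If the term under the root is negative, use the ratio formula instead.
  under_root <- ifelse(under_root < 0,
                       moe_num^2 + p^2 * moe_denom^2,
                       under_root)
  sqrt(under_root) / denom
}

# Bergen County: Dutch estimate, ancestry total, and their margins of error.
bergen_prop <- 3173 / 610508
bergen_moe  <- prop_moe(3173, 610508, 435, 5030)

round(c(proportion = bergen_prop, moe = bergen_moe), 5)
```

tidycensus also ships a moe_prop() helper that takes the same four arguments, so in practice you could call it inside mutate() next to the prop column rather than writing the formula yourself.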