In today's class we'll learn more about the US Census and how to use it, including how to pull census data directly into RStudio with the tidycensus package.
To get started working with tidycensus, you need a Census API key. A key can be obtained from http://api.census.gov/data/key_signup.html.
Create new project from new folder:
- methods1/class5
Install the tidycensus package
install.packages("tidycensus")
- Homework review
- The decennial US Census
- 1-year and 5-year American Community Survey
- tidycensus
- creating tidy data
- in-class lab
- Readings discussion
- Assignment 6
The U.S. census counts every resident of the country every ten years, in years ending in zero. The US government has conducted this count every ten years since 1790. The Census Bureau counts each resident where they live on April 1. The Constitution mandates the enumeration to determine how to apportion the House of Representatives among the states. The census is not a count of citizens; it is a count of residents. The Census is used for apportioning seats in the House of Representatives, drawing legislative district boundaries, and distributing federal funds.
The American Community Survey (ACS) is a demographic survey conducted by the US Census Bureau every year. It is sent to a randomly selected 3.5 million addresses each year to collect more information than the decennial census, producing datasets on demographics, age, income, educational attainment, migration, employment, housing characteristics and much more. The ACS began in 2005; before that, the additional questions were collected on the decennial census "long form," which was last used in the 2000 Census.
ACS data are estimates. Every ACS figure comes with a margin of error, and the margins of error are larger for areas with smaller populations.
Differences between the Decennial Census and the American Community Survey
https://www.census.gov/content/dam/Census/data/developers/geoareaconcepts.pdf
It always depends on your analysis questions!
Decennial vs ACS
The decennial Census is an actual count, not an estimate. So use it when possible. The American Community Survey is useful for questions not included in the decennial survey and becomes more useful later in the decade when the decennial census data becomes out of date.
If you use ACS, never forget that it is an estimate.
The Census Bureau’s longstanding practice is to count persons incarcerated in state and federal correctional facilities as residents of the district where they are confined. By far the majority of states use the population and residence data reported in the census, as is.
A handful of states have changed their procedures for allocating inmate data for redistricting purposes. In these states, when possible, they reallocate data on inmates in the redistricting data file from where they are incarcerated to their residence prior to incarceration. (NCSL)
Tidycensus helps you import data from the decennial US Census and the 1-year and 5-year American Community Survey.
library(tidyverse)
library(tidycensus)
# add your census api key, install = TRUE installs it for future sessions as well
# census_api_key("put your census api key here", install = TRUE)
# look at the help section for the load_variables() function
# run the line of code below in your console and look at the help section
?load_variables
#### First let's look at what's available in the decennial census by looking at Summary File 1 from 2010
# create table of all variables in the 2010 SF1
sf1_2010 <- load_variables(2010, "sf1", cache = TRUE)
Summary File 1 (SF 1) contains the data compiled from the questions asked of all people and about every housing unit. Population items include sex, age, race, Hispanic or Latino origin, household relationship, household type, household size, family type, family size, and group quarters. Housing items include occupancy status, vacancy status, and tenure (whether a housing unit is owner-occupied or renter-occupied).
There are 177 population tables (identified with a "P") and 58 housing tables (identified with an "H") shown down to the block level; 82 population tables (identified with a "PCT") and 4 housing tables (identified with an "HCT") shown down to the census tract level; and 10 population tables (identified with a "PCO") shown down to the county level, for a total of 331 tables. The SF 1 Urban/Rural Update added 2 PCT tables, increasing the total number to 333 tables. There are 14 population tables and 4 housing tables shown down to the block level and 5 population tables shown down to the census tract level that are repeated by the major race and Hispanic or Latino groups.
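With more than 300 tables in SF 1, it is usually faster to search the variables table than to scroll through it. Here is a minimal sketch: View() opens the table interactively in RStudio, and the "OCCUPANCY" search term is just an example (str_detect() comes from stringr, which loads with the tidyverse).

# browse the full variables table interactively in RStudio
# View(sf1_2010)

# or search the concept column for a keyword ("OCCUPANCY" is just an example)
sf1_2010 %>%
  filter(str_detect(concept, "OCCUPANCY"))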
# create table of all variables in the 2020 redistricting file
pl_2020 <- load_variables(2020, "pl")
housing_units <- get_decennial(geography = "state",
                               variables = c(housing_units = "H1_001N"),
                               year = 2020)
## Getting data from the 2020 decennial Census
## Using the PL 94-171 Redistricting Data summary file
## Note: 2020 decennial Census data use differential privacy, a technique that
## introduces errors into data to preserve respondent confidentiality.
## ℹ Small counts should be interpreted with caution.
## ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
## This message is displayed once per session.
#### Let’s use this dataset to answer a question:
What percentage of housing units receive an American Community Survey each year?
# ACS questionnaires go to 3.5 million addresses each year
acs_percent <- 3500000/sum(housing_units$value)
acs_percent
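If you'd rather see that share as a readable percentage instead of a decimal, here is a quick sketch using base R's sprintf():

# express the share of addresses surveyed each year as a percentage
sprintf("%.2f%%", acs_percent * 100)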
housing_vars <- c("H1_001N", "H1_002N", "H1_003N")
raw_housing_2020 <- get_decennial(geography = "state",
                                  variables = housing_vars,
                                  year = 2020) %>%
  arrange(NAME)
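Before going further, take a look at what get_decennial() returned; head() and glimpse() (the latter from dplyr, loaded with the tidyverse) give a quick view:

# peek at the long format: one row per state-variable combination
head(raw_housing_2020)

# glimpse() lists every column and its type
glimpse(raw_housing_2020)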
Looking at this data, you can see that each of the three variables gets its own row. This is not a useful format for exploring and analyzing data.
We want the data to be tidy, meaning each row is an observation with all of the variables associated with the observation in a column. The census does not deliver tidy data.
There are three interrelated rules which make a dataset tidy:
- Each variable has its own column.
- Each observation has its own row.
- Each value has its own cell.
To convert this into tidy data we need to pivot wider:
housing_2020 <- raw_housing_2020 %>%
  pivot_wider(names_from = variable, values_from = value)
Better, but we need to make the column names more descriptive so that we know what each column means. We can look at the pl_2020 data frame that we created with the load_variables() function to create a descriptive name for each column. And we'll create a 'percent occupied' and 'percent vacant' variable while we're at it.
housing_2020 <- raw_housing_2020 %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  rename(tot_housing_units = H1_001N,
         occupied = H1_002N,
         vacant = H1_003N)
Now let’s create some variables from the census data: ‘percent occupied’ and ‘percent vacant’.
housing_2020 <- raw_housing_2020 %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  rename(state = NAME,
         tot_housing_units = H1_001N,
         occupied = H1_002N,
         vacant = H1_003N) %>%
  mutate(pct_occupied = round(occupied / tot_housing_units, 3),
         pct_vacant = round(vacant / tot_housing_units, 3))
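Before charting, it can help to sort the finished data frame to see which states have the highest vacancy rates; a quick sketch with arrange():

# states with the largest share of vacant housing units
housing_2020 %>%
  arrange(desc(pct_vacant)) %>%
  head()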
p <- ggplot(data = housing_2020, aes(x = reorder(state, pct_vacant), y = pct_vacant)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(x = "State", y = "Percent Vacant Housing Units")
p
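If you want to keep the chart as an image file, here is a minimal sketch using ggplot2's ggsave(); the file name and dimensions below are just placeholders:

# save the chart to the project folder (width and height are in inches)
ggsave("vacancy_by_state.png", plot = p, width = 10, height = 6)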
Great, we now know how to import census data, but we need more data to figure out what might be driving the occupancy rates. Let's import some data from the American Community Survey: median household income.
Use the load_variables() function to look at the available variables in the 5-year American Community Survey dataset. There are a lot of them!
acs201519 <- load_variables(2019, "acs5")
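Rather than scrolling through thousands of rows, you can search the concept column for what you need. A minimal sketch; the search term below is just an example:

# search the ACS variables for median household income tables
acs201519 %>%
  filter(str_detect(concept, "MEDIAN HOUSEHOLD INCOME"))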
## fill in the blank to import the dataset
# raw_income = get_acs(geography = "state",
# variables = "",
# year = 2019)
First, explore the four most common ACS tables to see what data is available:
Most common ACS tables:
For the coding assignment, you will continue the in-class work in a new script for homework. You can copy your code from the in-class assignment to get started, but beginning in a new script will help you keep it neat and well commented.
Create a state data frame with:
Load variables from the 2015-19 American Community Survey using the get_acs() function
#### import mhi ####
raw_income <- get_acs(geography = "state",
                      variables = "B19013_001",
                      year = 2019)
Look at the write-up on basic usage of tidycensus and its examples to learn how to download data for one state or county at a time (there is a short example sketch at the end of these instructions).
Import at least 3 variables either from the 2020 Decennial Census or the 2015-19 American Community Survey for either:
Process the data so that it is “tidy” and calculate at least one new variable from the data. You can look back at the Creating tidy data section of this week’s lesson and this week’s reading assignment for more details on tidy data.
Upload your completed script in Canvas. Don't forget to review your script to remove any unnecessary lines of code and to add comments to explain your work.
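As a minimal sketch of what downloading data for a single state might look like (the geography, state, and variable below are example choices, not requirements of the assignment):

# county-level median household income for one state from the 2015-19 ACS
md_income <- get_acs(geography = "county",
                     state = "MD",
                     variables = c(mhi = "B19013_001"),
                     year = 2019)

Note that the result includes both an estimate and a moe (margin of error) column, a reminder that ACS figures are estimates rather than counts.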