In today's class we'll learn more about the US Census and how to use it, including how to pull census data directly into RStudio with the tidycensus package.
To get started working with tidycensus, you need a Census API key. A key can be obtained from http://api.census.gov/data/key_signup.html.
Create new project from new folder:
- methods1/class5
Install the tidycensus package
install.packages("tidycensus")
- Homework review
- The decennial US Census
- 1-year and 5-year American Community Survey
- tidycensus
- creating tidy data
- in-class lab
- Readings discussion
- Assignment 6
The U.S. census counts every resident of the country every ten years, in years ending in zero. The US government has conducted this count every ten years since 1790. The Census Bureau counts each resident where they live on April 1. The Constitution mandates the enumeration to determine how to apportion the House of Representatives among the states. The census is not a count of citizens; it is a count of residents. The Census is used for apportioning seats in the House of Representatives, drawing legislative district boundaries, and distributing federal funds.
The American Community Survey (ACS) is a demographic survey conducted by the US Census Bureau every year. It is sent to a randomly selected 3.5 million addresses each year to collect more information than the decennial census, producing datasets on demographics, age, income, educational attainment, migration, employment, housing characteristics and much more. The ACS began in 2005; before that, the additional questions were collected on the decennial census "long form," which was last used in the 2000 Census.
ACS data are estimates. Every ACS figure comes with a margin of error, and the margins of error are larger for areas with smaller populations.
Differences between the Decennial Census and the American Community Survey
https://www.census.gov/content/dam/Census/data/developers/geoareaconcepts.pdf
It always depends on your analysis questions!
Decennial vs ACS
The decennial Census is an actual count, not an estimate. So use it when possible. The American Community Survey is useful for questions not included in the decennial survey and becomes more useful later in the decade when the decennial census data becomes out of date.
If you use ACS, never forget that it is an estimate.
The Census Bureau’s longstanding practice is to count persons incarcerated in state and federal correctional facilities as residents of the district where they are confined. By far the majority of states use the population and residence data reported in the census, as is.
A handful of states have changed their procedures for allocating inmate data for redistricting purposes. In these states, when possible, they reallocate data on inmates in the redistricting data file from where they are incarcerated to their residence prior to incarceration. (NCSL)
Tidycensus helps you import data from the decennial US Census and the 1-year and 5-year American Community Survey.
library(tidyverse)
library(tidycensus)
# add your census api key, install = TRUE installs it for future sessions as well
# census_api_key("put your census api key here", install = TRUE)
# look at the help section for the load_variables() function
# run the line of code below in your console and look at the help section
?load_variables
#### First let's look at what's available in the decennial census by looking at Summary File 1 from 2010
# create table of all variables in the 2010 SF1
sf1_2010 <- load_variables(2010, "sf1", cache = TRUE)
Summary File 1 (SF 1) contains the data compiled from the questions asked of all people and about every housing unit. Population items include sex, age, race, Hispanic or Latino origin, household relationship, household type, household size, family type, family size, and group quarters. Housing items include occupancy status, vacancy status, and tenure (whether a housing unit is owner-occupied or renter-occupied).
There are 177 population tables (identified with a "P") and 58 housing tables (identified with an "H") shown down to the block level; 82 population tables (identified with a "PCT") and 4 housing tables (identified with an "HCT") shown down to the census tract level; and 10 population tables (identified with a "PCO") shown down to the county level, for a total of 331 tables. The SF 1 Urban/Rural Update added 2 PCT tables, increasing the total number to 333 tables. There are 14 population tables and 4 housing tables shown down to the block level and 5 population tables shown down to the census tract level that are repeated by the major race and Hispanic or Latino groups.
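With more than 300 tables in SF 1, it is usually faster to search the variables table than to scroll through it. Here is a minimal sketch: View() opens the table interactively in RStudio, and the "OCCUPANCY" search term is just an example (str_detect() comes from stringr, which loads with the tidyverse).

# browse the full variables table interactively in RStudio
# View(sf1_2010)

# or search the concept column for a keyword ("OCCUPANCY" is just an example)
sf1_2010 %>%
  filter(str_detect(concept, "OCCUPANCY"))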
# create table of all variables in the 2020 redistricting file
pl_2020 <- load_variables(2020, "pl")
housing_units <- get_decennial(geography = "state",
                               variables = c(housing_units = "H1_001N"),
                               year = 2020)
## Getting data from the 2020 decennial Census
## Using the PL 94-171 Redistricting Data summary file
## Note: 2020 decennial Census data use differential privacy, a technique that
## introduces errors into data to preserve respondent confidentiality.
## ℹ Small counts should be interpreted with caution.
## ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
## This message is displayed once per session.
#### Let’s use this dataset to answer a question:
What percentage of housing units receive an American Community Survey each year?
# ACS questionnaires go to 3.5 million addresses each year
acs_percent <- 3500000/sum(housing_units$value)
acs_percent
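If you'd rather see that share as a readable percentage instead of a decimal, here is a quick sketch using base R's sprintf():

# express the share of addresses surveyed each year as a percentage
sprintf("%.2f%%", acs_percent * 100)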
housing_vars <- c("H1_001N", "H1_002N", "H1_003N")
raw_housing_2020 <- get_decennial(geography = "state",
                                  variables = housing_vars,
                                  year = 2020) %>%
  arrange(NAME)
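Before going further, take a look at what get_decennial() returned; head() and glimpse() (the latter from dplyr, loaded with the tidyverse) give a quick view:

# peek at the long format: one row per state-variable combination
head(raw_housing_2020)

# glimpse() lists every column and its type
glimpse(raw_housing_2020)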
Looking at this data, you can see that each of the three variables gets its own row. This is not a useful format for exploring and analyzing data.
We want the data to be tidy, meaning each row is an observation with all of the variables associated with the observation in a column. The census does not deliver tidy data.
There are three interrelated rules which make a dataset tidy:
- Each variable has its own column.
- Each observation has its own row.
- Each value has its own cell.
To convert this into tidy data we need to pivot wider:
housing_2020 <- raw_housing_2020 %>%
  pivot_wider(names_from = variable, values_from = value)
Better, but we need to make the column names more descriptive so that we know what each column means. We can look at the pl_2020 data frame that we created with the load_variables() function to create a descriptive name for each column. And we'll create a 'percent occupied' and 'percent vacant' variable while we're at it.
housing_2020 <- raw_housing_2020 %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  rename(tot_housing_units = H1_001N,
         occupied = H1_002N,
         vacant = H1_003N)
Now let’s create some variables from the census data: ‘percent occupied’ and ‘percent vacant’.
housing_2020 <- raw_housing_2020 %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  rename(state = NAME,
         tot_housing_units = H1_001N,
         occupied = H1_002N,
         vacant = H1_003N) %>%
  mutate(pct_occupied = round(occupied / tot_housing_units, 3),
         pct_vacant = round(vacant / tot_housing_units, 3))
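Before charting, it can help to sort the finished data frame to see which states have the highest vacancy rates; a quick sketch with arrange():

# states with the largest share of vacant housing units
housing_2020 %>%
  arrange(desc(pct_vacant)) %>%
  head()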
p <- ggplot(data = housing_2020, aes(x = reorder(state, pct_vacant), y = pct_vacant)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(x = "State", y = "Percent Vacant Housing Units")
p
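If you want to keep the chart as an image file, here is a minimal sketch using ggplot2's ggsave(); the file name and dimensions below are just placeholders:

# save the chart to the project folder (width and height are in inches)
ggsave("vacancy_by_state.png", plot = p, width = 10, height = 6)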
Great, we now know how to import census data, but we need more data to figure out what might be driving the occupancy rates. Let's import some data from the American Community Survey: median household income.
Use the load_variables() function to look at the available variables in the 5-year American Community Survey dataset. There are a lot of them!
acs201519 <- load_variables(2019, "acs5")
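Rather than scrolling through thousands of rows, you can search the concept column for what you need. A minimal sketch; the search term below is just an example:

# search the ACS variables for median household income tables
acs201519 %>%
  filter(str_detect(concept, "MEDIAN HOUSEHOLD INCOME"))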
## fill in the blank to import the dataset
# raw_income = get_acs(geography = "state",
# variables = "",
# year = 2019)
First, explore the four most common ACS tables to see what data is available:
Most common ACS tables:
For the coding assignment, you will continue the in-class work in a new script for homework. You can copy your code from the in-class assignment to get started, but beginning in a new script will help you keep it neat and well commented.
Create a state data frame with:
Load variables from the 2015-19 American Community Survey using the get_acs() function
#### import mhi ####
raw_income <- get_acs(geography = "state",
                      variables = "B19013_001",
                      year = 2019)
Look at the write-up on basic usage of tidycensus and its examples to learn how to download data for one state or county at a time (there is a short example sketch at the end of these instructions).
Import at least 3 variables either from the 2020 Decennial Census or the 2015-19 American Community Survey for either:
Process the data so that it is “tidy” and calculate at least one new variable from the data. You can look back at the Creating tidy data section of this week’s lesson and this week’s reading assignment for more details on tidy data.
Upload your completed script in Canvas. Don't forget to review your script to remove any unnecessary lines of code and to add comments to explain your work.
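As a minimal sketch of what downloading data for a single state might look like (the geography, state, and variable below are example choices, not requirements of the assignment):

# county-level median household income for one state from the 2015-19 ACS
md_income <- get_acs(geography = "county",
                     state = "MD",
                     variables = c(mhi = "B19013_001"),
                     year = 2019)

Note that the result includes both an estimate and a moe (margin of error) column, a reminder that ACS figures are estimates rather than counts.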