Methods 1, Week 6

tidycensus

Today we talk about the US Census and the tidycensus package.

tidycensus requires a key: get it here.
Create new project from new folder: methods1/class5
Install the tidycensus package: install.packages("tidycensus")

Outline

Homework questions and overview
US Census
tidycensus
In-class exercise
Homework

The decennial US Census

The U.S. census counts every resident of the country every ten years (in years ending in zero).
The Constitution mandates the count.
It has been conducted every ten years since 1790.
The Census Bureau counts each resident where they live on April 1.
The census is a count of residents, not a count of citizens.

Uses of the census: Apportionment

The process of dividing the 435 seats in the U.S. House of Representatives among the 50 states, based on the state population counts.
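To make the idea concrete, here is a minimal sketch of the method of equal proportions (Huntington-Hill), the method used for House apportionment since 1941. The three states, their populations, and the helper name `apportion()` are made up for illustration; nothing here comes from tidycensus.

```r
# Hypothetical helper: give every state one seat, then hand out the
# remaining seats one at a time to the state with the highest priority value.
apportion <- function(pop, seats) {
  alloc <- rep(1L, length(pop))
  names(alloc) <- names(pop)
  while (sum(alloc) < seats) {
    # priority value = population / sqrt(n * (n + 1)), n = seats held so far
    priority <- pop / sqrt(alloc * (alloc + 1))
    winner <- which.max(priority)
    alloc[winner] <- alloc[winner] + 1L
  }
  alloc
}

# Made-up populations for three imaginary states
toy_pop <- c(A = 6000000, B = 3000000, C = 1000000)
toy_seats <- apportion(toy_pop, seats = 10)
toy_seats
```

With these made-up populations the ten seats split 6, 3, and 1. The real calculation uses the same rule with 50 states and 435 seats.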

Uses of the census: Redistricting

Provides population data to adjust or redraw electoral districts.

Uses of the census: Allocate federal money

Helps determine how federal funds are distributed across the country, including:

funding for schools, hospitals, roads, and public infrastructure.
Medicaid, Head Start, block grant programs for community mental health services, and the Supplemental Nutrition Assistance Program (SNAP).

Uses of the census:

Research
Maps

The 1-year and 5-year American Community Survey

The American Community Survey (ACS) is a demographic survey that the US Census Bureau has conducted every year since 2005.

The Census Bureau sends the survey to about 3.5 million randomly selected addresses every year.

ACS data are estimates. Every ACS estimate has a margin of error, and the margin is larger for areas with smaller populations.

1-year estimates are available for areas with a population of 65,000 or more.
5-year estimates are available for areas down to the block group level.
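A small sketch of what a margin of error means in practice. The two rows are made-up numbers, shaped like ACS output with an `estimate` and a `moe` column; the published ACS margins of error correspond to a 90 percent confidence level.

```r
library(dplyr)
library(tibble)

# Made-up ACS-style rows: an estimate and its margin of error (moe)
toy_acs <- tribble(
  ~area,          ~estimate, ~moe,
  "Large county",     62000,  1200,
  "Small county",     48000,  6500
)

toy_acs <- toy_acs %>%
  mutate(lower = estimate - moe,   # low end of the interval
         upper = estimate + moe,   # high end of the interval
         # moe relative to the estimate: bigger means a noisier estimate
         moe_share = round(moe / estimate, 3))

toy_acs
```

In this toy example the small county's interval is much wider relative to its estimate, which is the pattern to watch for when comparing small areas.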

Comparison

Geographic levels

Smallest areas

Which census data should you use?

It always depends on your analysis questions!

Decennial vs ACS

The Decennial Census is an actual count, not an estimate. Use it when possible.
The ACS is useful for questions not included in the decennial census.
- It becomes more useful later in the decade, as the decennial count ages.

If you use the ACS, never forget that its numbers are estimates.

Question changes

The questions and the methods of collection change from one census to the next.
- historical overview

Prison gerrymandering

The Census Bureau counts incarcerated people as residents of the district where they are confined.

A few states now reallocate incarcerated people from where they are incarcerated to their last residence.
Most do not.

Resources:

Prisoners of the census overview
Prison gerrymandering factsheet for Georgia
National Conference of State Legislatures (NCSL) 50-state overview of prison gerrymandering policy.

tidycensus

tidycensus is a package that imports census data directly from the U.S. Census as tidyverse-ready dataframes. It is very nice.

Overview
Basic usage
You can also look at the ChangeLog to see what is available for the 2020 census.

Major functions

There are two major functions implemented in tidycensus:

get_decennial(): grants access to the 2000, 2010, and 2020 decennial US Census APIs
- only the redistricting file - “pl” - is available from the 2020 decennial census
get_acs(): grants access to the 1-year and 5-year American Community Survey APIs
- ACS 1-year for 2021 was released Sept 15, 2022 (more release notes)

`get_decennial()`

find available geographies here

get_decennial(
  geography,
  variables = NULL,
  table = NULL,
  cache_table = FALSE,
  year = 2010,
  sumfile = "sf1",
  state = NULL,
  county = NULL
)

Install census api key

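Install tidycensus and the Census API key

Get a Census API key to use tidycensus here.

In your console:

install.packages("tidycensus")
library(tidycensus)
# install = TRUE saves the key to your .Renviron file, so you only do this once
census_api_key("put your census api key here", install = TRUE)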
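If a later call complains about a missing key, a quick check like the one below can help. It assumes the key was saved with `install = TRUE`, which stores it in `.Renviron` under the name `CENSUS_API_KEY`; the helper name `has_census_key()` is made up for this sketch.

```r
# Hypothetical helper: TRUE if R can see a Census API key in the environment
has_census_key <- function() {
  nzchar(Sys.getenv("CENSUS_API_KEY"))
}

if (has_census_key()) {
  message("Census API key found.")
} else {
  message("No key found. Run census_api_key() and then restart R.")
}
```

A newly installed key is only picked up after R restarts (or after running `readRenviron("~/.Renviron")`), so a missing key right after installation is usually not an error.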

Help section

library(tidyverse)
library(tidycensus)


# look at the help section for the load_variables() function 
# run the line of code below in your console and look at the help section
?load_variables

HELP

List 2020 Census variables

The 2020 Census data release was delayed by COVID-19.

Only the redistricting data are available so far:
- The 2020 Census Redistricting Data (P.L. 94-171) Summary Files
- “pl”

# create a table of all variables in the 2020 redistricting file
# cache = TRUE stores the table locally so later calls load faster
pl_2020 <- load_variables(2020, "pl", cache = TRUE)
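The variables table has three columns: `name`, `label`, and `concept`. Filtering on `concept` is one way to find the codes you need. The sketch below uses a small hand-typed stand-in called `toy_vars` (with simplified labels) so that it runs without an API key; with the real table you would filter `pl_2020` the same way.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hand-typed stand-in for pl_2020: illustrative rows, simplified labels
toy_vars <- tribble(
  ~name,     ~label,            ~concept,
  "H1_001N", "Total",           "OCCUPANCY STATUS",
  "H1_002N", "Total: Occupied", "OCCUPANCY STATUS",
  "H1_003N", "Total: Vacant",   "OCCUPANCY STATUS",
  "P1_001N", "Total",           "RACE"
)

# keep only the rows whose concept mentions occupancy
housing_rows <- toy_vars %>%
  filter(str_detect(concept, "OCCUPANCY"))

housing_rows
```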

Redistricting Race/Ethnicity data

P1. Race
P2. Hispanic or Latino, and Not Hispanic or Latino by Race
P3. Race for the Population 18 Years and Over
P4. Hispanic or Latino, and Not Hispanic or Latino by Race for the Population 18 Years and Over
P5. Group Quarters Population by Major Group Quarters Type
H1. Occupancy Status

Import Housing Units data

housing_units <- get_decennial(geography = "state",
                             variables = c(housing_units = "H1_001N"), 
                             year = 2020)

GEOID	NAME	variable	value
42	Pennsylvania	housing_units	5742828
06	California	housing_units	14392140

A question:

What percentage of housing units receive an American Community Survey each year?

The answer

# ACS questionnaires go to about 3.5 million addresses each year
# the result is a proportion; multiply by 100 to get a percentage
acs_percent <- 3500000/sum(housing_units$value)

acs_percent

[1] 0.02463108
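The printed value is a proportion, not a percentage. A short sketch of turning it into something easier to read, using the value printed above (the names `acs_share`, `acs_pct_label`, and `one_in` are made up here):

```r
acs_share <- 0.02463108                  # proportion printed above

# format as a percentage with one decimal place
acs_pct_label <- sprintf("%.1f%%", 100 * acs_share)

# "about 1 in N" housing units
one_in <- round(1 / acs_share)

acs_pct_label
one_in
```

So roughly 2.5 percent of housing units, or about 1 in 41, receive the ACS in a given year.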

Import multiple columns

housing_vars <- c("H1_001N", "H1_002N", "H1_003N")

raw_housing_2020 <- get_decennial(geography = "state",
                                  variables = housing_vars,
                                  year = 2020) %>%
  arrange(NAME)

GEOID	NAME	variable	value
01	Alabama	H1_001N	2288330
01	Alabama	H1_002N	2011947
01	Alabama	H1_003N	276383
02	Alaska	H1_001N	326200
02	Alaska	H1_002N	269148
02	Alaska	H1_003N	57052

Tidy data

Each state has 3 rows, one for each of the 3 housing variables. This is called long format.

We want the data in wide format to make it tidy:

each row is one observation (here, a state), and each variable for that observation has its own column.

GEOID	NAME	H1_001N	H1_002N	H1_003N
01	Alabama	2288330	2011947	276383
02	Alaska	326200	269148	57052
04	Arizona	3082000	2705878	376122

pivot_wider()

We’ll use pivot_wider() to convert from long to wide format:

names_from = the column you want the new column-names to come from
values_from = the column you want the data to come from

Then rename the columns to be more descriptive

housing_2020 <-  raw_housing_2020 %>%
  pivot_wider(names_from = variable, values_from = value) %>% 
  rename(tot_housing_units = H1_001N,
         occupied = H1_002N,
         vacant = H1_003N)

GEOID	NAME	tot_housing_units	occupied	vacant
01	Alabama	2288330	2011947	276383
02	Alaska	326200	269148	57052
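To try the reshaping step without calling the API, the sketch below types in the Alabama and Alaska rows shown earlier and runs the same `pivot_wider()` and `rename()` steps. The object names `toy_long` and `toy_wide` are made up for this example.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Long format: one row per state per variable (values copied from above)
toy_long <- tribble(
  ~GEOID, ~NAME,     ~variable, ~value,
  "01",   "Alabama", "H1_001N", 2288330,
  "01",   "Alabama", "H1_002N", 2011947,
  "01",   "Alabama", "H1_003N",  276383,
  "02",   "Alaska",  "H1_001N",  326200,
  "02",   "Alaska",  "H1_002N",  269148,
  "02",   "Alaska",  "H1_003N",   57052
)

# Wide format: one row per state, one column per variable
toy_wide <- toy_long %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  rename(tot_housing_units = H1_001N,
         occupied = H1_002N,
         vacant = H1_003N)

toy_wide
```

Six long rows become two wide rows, one per state.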

Tidy data

There are three interrelated rules which make a dataset tidy:

Each observation has its own row.
Each variable has its own column.
Each value has its own cell.

Percent Vacant

Now calculate the share of housing units that are occupied and the share that are vacant (stored as proportions between 0 and 1).

housing_2020 <-  raw_housing_2020 %>%
  pivot_wider(names_from = variable, values_from = value) %>% 
  rename(state = NAME,
         tot_housing_units = H1_001N,
         occupied = H1_002N,
         vacant = H1_003N) %>% 
  mutate(pct_occupied = round(occupied/tot_housing_units, 3),
         pct_vacant = round(vacant/tot_housing_units, 3))

GEOID	state	tot_housing_units	occupied	vacant	pct_occupied	pct_vacant
01	Alabama	2288330	2011947	276383	0.879	0.121
02	Alaska	326200	269148	57052	0.825	0.175
04	Arizona	3082000	2705878	376122	0.878	0.122
05	Arkansas	1365265	1199395	165870	0.879	0.121
06	California	14392140	13475623	916517	0.936	0.064
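A quick sanity check is worth running after a calculation like this. The sketch below re-types three of the rows above into a made-up object called `toy_housing` and confirms that occupied plus vacant units add up to the total before recomputing the vacant share.

```r
library(dplyr)
library(tibble)

# Three rows copied from the table above
toy_housing <- tribble(
  ~state,       ~tot_housing_units, ~occupied, ~vacant,
  "Alabama",               2288330,   2011947,  276383,
  "Alaska",                 326200,    269148,   57052,
  "California",           14392140,  13475623,  916517
)

checked <- toy_housing %>%
  mutate(parts_match = occupied + vacant == tot_housing_units,
         pct_vacant = round(vacant / tot_housing_units, 3))

checked
```

If `parts_match` is ever FALSE, the wrong variables were probably pulled or renamed.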

Plot each state

Bar plot

Use geom_col() to create a bar plot:

Within aes(), we’ll order the states by pct_vacant.
Rotate the state labels on the x-axis so we can read them.

vacant_housing_plot <- ggplot(data=housing_2020, 
                              aes(x=reorder(state,pct_vacant), 
                                  y=pct_vacant)) +
  geom_col() +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  labs(x = "State", y = "Percent Vacant Housing Units")
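An alternative to rotating the labels is to flip the chart so the state names sit on the vertical axis. This is only a sketch of that option, using the five rows shown earlier in a made-up object called `toy_housing`; `scales::label_percent()` displays the proportions as percentages on the axis.

```r
library(ggplot2)
library(tibble)

# Five rows copied from the table above
toy_housing <- tribble(
  ~state,       ~pct_vacant,
  "Alabama",          0.121,
  "Alaska",           0.175,
  "Arizona",          0.122,
  "Arkansas",         0.121,
  "California",       0.064
)

toy_plot <- ggplot(toy_housing,
                   aes(x = reorder(state, pct_vacant), y = pct_vacant)) +
  geom_col() +
  coord_flip() +                                       # states on the vertical axis
  scale_y_continuous(labels = scales::label_percent()) + # 0.121 shown as 12.1%
  labs(x = "State", y = "Vacant housing units")

toy_plot
```

With all 50 states a flipped chart is often easier to read than rotated labels, but either approach works.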

In-class Analysis

Use the get_decennial function to create a state dataframe with the following variables:

GEOID
State
Total Population (P1_001N)
Percent Hispanic or Latino (P2_002N/P1_001N)
Percent Black alone, not Hispanic or Latino (see if you can find the data to calculate this variable)

There are a lot of race/ethnicity variables. It is not easy to determine which one to use!

Create a bar chart of the percent Black population, with the states ordered by population.

Assignment 6a: Reading

Read the census questionnaire for 2020 (and previous years if you want to)
Read Black Is Over (Or, Special Black) by Tressie McMillan Cottom.

Assignment 6b: Tidycensus

Create a dataframe of estimated Median Household Income and selected race/ethnicity variables for every county in one state. Use this data to understand the relationship between race/ethnicity and income in this state. Explore the dataframe by:

looking at the data
calculating summary statistics
creating plots

Write a paragraph explaining at least 3 things you have learned about your state by exploring the data. Include plots and/or statistics to support your conclusions. (You can upload the plots separately or create a pdf with text and images)

Assignment 6b: specific instructions

Use the get_decennial function to create a dataframe of all counties in one state (pick any state) with the following variables:

GEOID
County
Total Population
Percent Hispanic or Latino
Percent White alone, not Hispanic or Latino
Percent Black alone, not Hispanic or Latino
Percent Asian alone, not Hispanic or Latino
(get more variables if you want!)

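Use the get_acs function to create a dataframe of the estimated median household income for all counties in the same state. Use the code below. We’ll learn more about ACS next week.

# geography = "county" plus a state argument returns every county in that state
# "PA" is only an example; replace it with the state you picked
raw_mhi_2020 <- get_acs(geography = "county",
                        state = "PA",
                        variables = c(mhi = "B19013_001"),
                        year = 2020)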

Join these two dataframes together. Explore as described in the assignment overview.
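A minimal sketch of the join, assuming both dataframes keep the `GEOID` column that tidycensus returns. The two counties are real Pennsylvania GEOIDs, but every number is made up, and the object names are hypothetical. `get_acs()` output stores the value in `estimate` and its margin of error in `moe`, so the sketch renames them before joining.

```r
library(dplyr)
library(tibble)

# Made-up decennial-style data (after pivoting and calculating percents)
toy_pop <- tribble(
  ~GEOID,  ~county,            ~tot_pop, ~pct_hispanic,
  "42001", "Adams County",       100000,          0.07,
  "42003", "Allegheny County",  1200000,          0.03
)

# Made-up ACS-style data: estimate and margin of error
toy_mhi <- tribble(
  ~GEOID,  ~estimate, ~moe,
  "42001",     65000, 2500,
  "42003",     62000,  900
)

# Join on GEOID, keeping one row per county
toy_joined <- toy_pop %>%
  left_join(toy_mhi %>% select(GEOID, mhi = estimate, mhi_moe = moe),
            by = "GEOID")

toy_joined
```

Joining on `GEOID` rather than on the county name avoids mismatches caused by small differences in how names are written.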
