Methods 1, Week 7

Outline

Today we talk about the US Census and the tidycensus package.

Homework questions and overview
US Census
tidycensus
In-class exercise
Homework

Homework questions and overview

tidycensus

Now that you’ve had some practice, we’re going to create a new project so you have a fresh start without your beginner scripts.

Download the new part2 folder and save it to your methods1 folder
Create new project in your methods1/part2 folder
Install the tidycensus package: install.packages("tidycensus")

The decennial US Census

The U.S. census counts every resident of the country every ten years (in years ending in zero).
The Constitution mandates the count.
It has been conducted every ten years since 1790.
The Census Bureau counts each resident where they live on April 1.
The census is not a count of citizens; it is a count of residents.

Uses of the census: Apportionment

Apportionment is the process of dividing the 435 seats in the U.S. House of Representatives among the 50 states, based on the state population counts.
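
A rough sketch of how seats can be divided by population, assuming the method of equal proportions (the method currently used for the House): every state starts with one seat, and each remaining seat goes to the state with the highest priority value, population / sqrt(n * (n + 1)), where n is the number of seats the state already holds. The function name `apportion`, the three toy states, and the 10-seat total are made up for illustration; they are not census figures.

```r
# Toy apportionment with the method of equal proportions.
# `pop` is a named vector of populations; `total_seats` is the number of seats.
apportion <- function(pop, total_seats) {
  seats <- rep(1L, length(pop))      # every state starts with one seat
  names(seats) <- names(pop)
  while (sum(seats) < total_seats) {
    # priority value for each state's next seat
    priority <- pop / sqrt(seats * (seats + 1))
    winner <- which.max(priority)    # state with the highest priority
    seats[winner] <- seats[winner] + 1L
  }
  seats
}

# Hypothetical populations (not real data)
toy_pop <- c(A = 700, B = 200, C = 100)
toy_seats <- apportion(toy_pop, total_seats = 10)
toy_seats
```

With these toy numbers, state A ends up with 7 seats, B with 2, and C with 1.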

Uses of the census: Redistricting

Provides population data to adjust or redraw electoral districts.

Uses of the census: Allocate federal money

Helps determine how federal funds are distributed across the country, including:

funding for schools, hospitals, roads, and public infrastructure.
Medicaid, Head Start, block grant programs for community mental health services, and the Supplemental Nutrition Assistance Program (SNAP).

Uses of the census:

Research
Maps

The 1-year and 5-year American Community Survey

The American Community Survey (ACS) is a demographics survey conducted by the US Census Bureau every year since 2005.

The Census Bureau sends surveys to 3.5 million randomly selected addresses every year.

ACS data are estimates. Every estimate has a margin of error, which is larger for areas with smaller populations.

1-year estimates are available for areas with a population of 65,000 or more.
5-year estimates are available for areas down to the block group.
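
A small sketch of how to read a margin of error, using a made-up estimate and MOE (the numbers are hypothetical, not real ACS values). ACS margins of error are published at the 90 percent confidence level, so dividing by 1.645 gives an approximate standard error.

```r
# Hypothetical ACS estimate and its published margin of error (MOE)
estimate <- 52000
moe <- 3100

# The 90 percent confidence interval is estimate plus or minus the MOE
lower <- estimate - moe
upper <- estimate + moe

# Approximate standard error and coefficient of variation
# (a larger CV means a less reliable estimate)
se <- moe / 1.645
cv <- se / estimate

c(lower = lower, upper = upper)
round(cv, 3)
```

A small area with the same estimate but a much larger MOE would have a wider interval and a larger CV.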

Comparison

Geographic levels

Smallest areas

Which census data should you use?

It always depends on your analysis questions!

Decennial vs ACS

The Decennial Census is an actual count, not an estimate. Use it when possible.
The ACS is useful for questions not included in the decennial census
- it becomes more useful later in the decade, as the decennial count ages.

If you use the ACS, never forget that it is an estimate.

Question changes

The questions and methods of collection change from census to census
- historical overview

Prison gerrymandering

The Census Bureau counts incarcerated people as residents of the district where they are confined.

A few states now reallocate incarcerated people from where they are incarcerated to their last residence.
Most do not.

Resources:

Prisoners of the census overview
Prison gerrymandering factsheet for Georgia
National Conference of State Legislatures (NCSL) 50-state overview of prison gerrymandering policy.

tidycensus

tidycensus is a package that imports census data directly from the U.S. Census Bureau's APIs as tidyverse-ready dataframes. It is very nice.

Major functions

There are two major functions implemented in tidycensus:

get_decennial(): grants access to the 2000, 2010, and 2020 decennial US Census APIs
get_acs(): grants access to the 1-year and 5-year American Community Survey APIs
- ACS 1-year for 2022 was released Sept 15, 2023

`get_decennial()`

get_decennial(
  geography,
  variables = NULL,
  table = NULL,
  cache_table = FALSE,
  year = 2020,
  sumfile = "pl",
  state = NULL,
  county = NULL,
  geometry = FALSE,
  output = "tidy",
  ...
)

find available geographies here

Help section

library(tidyverse)
library(tidycensus)
library(scales)

# look at the help section for the load_variables() function 
# run the line of code below in your console and look at the help section
?load_variables

HELP

List 2020 Census variables

The 2020 Census data release is still delayed because of COVID-19. Data available so far:

The 2020 Census Redistricting Data - pl
Demographic and Housing Characteristics - dhc
Demographic Profile - dp

To view all of the variables in any of these tables use the load_variables() function

# create table of all variables in the 2020 redistricting file
pl_2020 <- load_variables(2020, "pl", cache = TRUE)

cache = TRUE saves the variable list on your computer, so it loads faster the next time.

Data in the Redistricting Dataset

P1. Race
P2. Hispanic or Latino, and Not Hispanic or Latino by Race
P3. Race for the Population 18 Years and Over
P4. Hispanic or Latino, and Not Hispanic or Latino by Race for the Population 18 Years and Over
P5. Group Quarters Population by Major Group Quarters Type
H1. Occupancy Status
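
Variable names start with their table number, so one way to find the variables in a table is to filter on the name. The sketch below uses a small hand-typed stand-in for the output of load_variables(), with the same column names (name, label, concept); the labels are simplified, and `pl_vars` and `h1_vars` are names made up for this example.

```r
library(dplyr)
library(stringr)

# Hand-typed stand-in for load_variables(2020, "pl"); labels are simplified
pl_vars <- tibble(
  name    = c("P1_001N", "H1_001N", "H1_002N", "H1_003N"),
  label   = c("Total", "Total", "Total: Occupied", "Total: Vacant"),
  concept = c("RACE", "OCCUPANCY STATUS", "OCCUPANCY STATUS", "OCCUPANCY STATUS")
)

# Keep only the variables in table H1 (names that start with "H1_")
h1_vars <- pl_vars |>
  filter(str_starts(name, "H1_"))

h1_vars
```

The same filter() call should work on the real pl_2020 dataframe; you can also open it with View(pl_2020) and use the search box.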

Import Housing Units data

housing_units <- get_decennial(geography = "state",
                             variables = c(housing_units = "H1_001N"), 
                             year = 2020)

GEOID	NAME	variable	value
42	Pennsylvania	housing_units	5742828
06	California	housing_units	14392140

A question:

What percentage of housing units receive an American Community Survey each year?

The answer

# ACS questionnaires go to 3.5 million addresses each year
# the result is a proportion (multiply by 100 for a percentage)
acs_percent <- 3500000 / sum(housing_units$value)

acs_percent

[1] 0.02463108
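
The printed value is a proportion, so the answer is about 2.5%. A minimal sketch of formatting it with the scales package, typing in the value printed above (`acs_label` is a name made up for this example):

```r
library(scales)

# Value printed above: share of housing units that receive the ACS each year
acs_percent <- 0.02463108

# Format the proportion as a percentage with one decimal place
acs_label <- percent(acs_percent, accuracy = 0.1)
acs_label
# "2.5%"
```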

Import multiple columns

housing_vars <- c("H1_001N", "H1_002N", "H1_003N")

raw_housing_2020_tidy <- get_decennial(geography = "state",
                                       variables = housing_vars,
                                       year = 2020)

GEOID	NAME	variable	value
42	Pennsylvania	H1_001N	5742828
06	California	H1_001N	14392140
54	West Virginia	H1_001N	855635
49	Utah	H1_001N	1151414
36	New York	H1_001N	8488066
11	District of Columbia	H1_001N	350364

Tidy data

Each state has 3 rows, one for each of the 3 housing variables. This is called long format.

We want the data to be in wide format to make it easier to work with.

Wide format

housing_vars <- c("H1_001N", "H1_002N", "H1_003N")

raw_housing_2020 <- get_decennial(geography = "state",
                                  variables = housing_vars,
                                  year = 2020,
                                  output = "wide")

GEOID	NAME	H1_001N	H1_002N	H1_003N
42	Pennsylvania	5742828	5210598	532230
06	California	14392140	13475623	916517
54	West Virginia	855635	743442	112193
49	Utah	1151414	1057252	94162
36	New York	8488066	7715172	772894
11	District of Columbia	350364	312448	37916
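
If the data have already been imported in long format, the same reshaping can be done afterwards with pivot_wider() from tidyr. A minimal sketch, with two states typed in by hand from the tables above (`housing_long` and `housing_wide` are names made up for this example):

```r
library(dplyr)
library(tidyr)

# Two states in long format: one row per state per variable
housing_long <- tibble(
  GEOID    = rep(c("42", "06"), each = 3),
  NAME     = rep(c("Pennsylvania", "California"), each = 3),
  variable = rep(c("H1_001N", "H1_002N", "H1_003N"), times = 2),
  value    = c(5742828, 5210598, 532230, 14392140, 13475623, 916517)
)

# Wide format: one row per state, one column per variable
housing_wide <- housing_long |>
  pivot_wider(names_from = variable, values_from = value)

housing_wide
```

Asking for output = "wide" in get_decennial() is usually simpler, but pivot_wider() is handy when the long dataframe is all you have.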

Percent Vacant

Rename the variables and calculate percent occupied and percent vacant

housing_2020 <- raw_housing_2020 |>
  rename(state = NAME,
         tot_housing_units = H1_001N,
         occupied = H1_002N,
         vacant = H1_003N) |>
  mutate(pct_occupied = round(occupied / tot_housing_units, 3),
         pct_vacant = round(vacant / tot_housing_units, 3))

GEOID	state	tot_housing_units	occupied	vacant	pct_occupied	pct_vacant
42	Pennsylvania	5742828	5210598	532230	0.907	0.093
06	California	14392140	13475623	916517	0.936	0.064
54	West Virginia	855635	743442	112193	0.869	0.131
49	Utah	1151414	1057252	94162	0.918	0.082
36	New York	8488066	7715172	772894	0.909	0.091

Bar plot - code

Use geom_col to create a bar plot of percent vacant

ggplot(data=housing_2020, aes(x=state, y=pct_vacant)) +
  geom_col() +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  labs(x = "State", y = "Percent Vacant Housing Units")

Bar plot

Plot each state and reorder columns

Use the reorder() function to order the states by percent vacant
Format the y-axis as %

ggplot(data=housing_2020, aes(x=reorder(state,pct_vacant), y=pct_vacant)) +
  geom_col() +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  labs(x = "State", y = "Percent Vacant",
       title = "Proportion of Housing Units that are vacant")
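
To see what reorder() is doing, here is a small sketch using five states and their pct_vacant values from the table above. It returns a factor whose levels are sorted by the second argument, and ggplot draws the columns in level order (`low_to_high` and `high_to_low` are names made up for this example).

```r
# Five states and their pct_vacant values from the table above
state <- c("Pennsylvania", "California", "West Virginia", "Utah", "New York")
pct_vacant <- c(0.093, 0.064, 0.131, 0.082, 0.091)

# Levels sorted from lowest to highest percent vacant
low_to_high <- levels(reorder(state, pct_vacant))
low_to_high

# Negate the values to sort from highest to lowest instead
high_to_low <- levels(reorder(state, -pct_vacant))
high_to_low
```

In this sample, California comes first and West Virginia last in `low_to_high`; `high_to_low` is the reverse.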

Plot each state and reorder columns - code

Redistricting Race/Ethnicity data

Let’s look at the list of variables again.

P1. Race
P2. Hispanic or Latino, and Not Hispanic or Latino by Race
P3. Race for the Population 18 Years and Over
P4. Hispanic or Latino, and Not Hispanic or Latino by Race for the Population 18 Years and Over
P5. Group Quarters Population by Major Group Quarters Type
H1. Occupancy Status

Find the variable for:

Total Population
Hispanic or Latino
Asian Alone, NOT Hispanic or Latino

In-class Analysis

Use the get_decennial function to create a state dataframe with the following variables:

GEOID
State
Total Population (P1_001N)
Percent Hispanic or Latino (P2_002N/P1_001N)
Percent Black alone, not Hispanic or Latino (see if you can find the data to calculate this variable)

There are a lot of race/ethnicity variables. It is not easy to determine which one to use!

Create a bar chart of the percent Black population, with the states ordered by population.

Assignment 6a: Reading

Read the census questionnaire for 2020 (and previous years if you want to)
Read Black Is Over (Or, Special Black) by Tressie McMillan Cottom.

Assignment 6b: Tidycensus

Create a dataframe of estimated Median Household Income and selected race/ethnicity variables for every county in one state. Use this data to understand the relationship between race/ethnicity and income in this state. Explore the dataframe by:

looking at the data
calculating summary statistics
creating plots

Write a paragraph explaining at least 3 things you have learned about your state by exploring the data. Include plots and/or statistics to support your conclusions. (You can upload the plots separately or create a pdf with text and images)

See more instructions on the next slide

Assignment 6b: specific instructions

Use the get_decennial function to create a dataframe of all counties in one state (pick any state) with the following variables:

GEOID
County
Total Population
Percent Hispanic or Latino
Percent White alone, not Hispanic or Latino
Percent Black alone, not Hispanic or Latino
Percent Asian alone, not Hispanic or Latino
(get more variables if you want!)

Use the get_acs function to create a dataframe of the estimated median household income for all counties in the same state. Use the code below, replacing "GA" with your state. We’ll learn more about the ACS next week.

raw_mhi_2020 <- get_acs(geography = "county",
                        state = "GA",  # example only: use your own state
                        variables = c(mhi = "B19013_001"),
                        year = 2020,
                        survey = "acs5")

Join these two dataframes together. Explore as described in the assignment overview.
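
A minimal sketch of the join, assuming both dataframes have a GEOID column (get_decennial() and get_acs() both return one). The two small dataframes below are made-up stand-ins with placeholder GEOIDs and values, not real census data; get_acs() returns the columns GEOID, NAME, variable, estimate, and moe.

```r
library(dplyr)

# Made-up stand-in for your decennial county dataframe (placeholder values)
race_county <- tibble(
  GEOID     = c("00001", "00003"),
  county    = c("County A", "County B"),
  total_pop = c(1000, 2000)
)

# Made-up stand-in for the get_acs() output (placeholder values)
mhi_county <- tibble(
  GEOID    = c("00001", "00003"),
  NAME     = c("County A", "County B"),
  variable = "mhi",
  estimate = c(50000, 60000),
  moe      = c(4000, 3000)
)

# Keep the ACS estimate and its margin of error, then join on GEOID
county_data <- race_county |>
  left_join(
    mhi_county |> select(GEOID, mhi = estimate, mhi_moe = moe),
    by = "GEOID"
  )

county_data
```

Joining on GEOID rather than on the county name avoids mismatches when the two sources format names differently.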
