Methods 1, Week 7

Outline

Today we talk about the US Census and the tidycensus package.

  • Homework questions and overview

  • US Census

  • tidycensus

  • In-class exercise

  • Homework

Homework questions and overview

tidycensus


Now that you’ve had some practice, we’re going to create a new project so you can start fresh, without your beginner scripts.

  • Download the new part2 folder and save it to your methods1 folder
  • Create a new project in your methods1/part2 folder
  • Install the tidycensus package: install.packages("tidycensus")

The decennial US Census

  • The U.S. census counts every resident of the country every ten years (year ending in zero).
  • The Constitution mandates the count.
  • The census has been conducted every ten years since 1790.
  • The Census Bureau counts each resident where they live on April 1.
  • The census is not a count of citizens; it is a count of residents.

Uses of the census: Apportionment


  • The process of dividing the 435 seats in the U.S. House of Representatives among the 50 states, based on the state population counts.

Uses of the census: Redistricting


  • Provides population data to adjust or redraw electoral districts.

Uses of the census: Allocate federal money


Helps determine how federal funds are distributed across the country, including:

  • funding for schools, hospitals, roads, and public infrastructure.
  • Medicaid, Head Start, block grant programs for community mental health services, and the Supplemental Nutrition Assistance Program (SNAP).

Uses of the census:


  • Research
  • Maps

The 1-year and 5-year American Community Survey

The American Community Survey (ACS) is a demographic survey that the US Census Bureau has conducted every year since 2005.

  • The Census Bureau sends surveys to 3.5 million randomly selected addresses every year.

ACS data are estimates. Every ACS estimate has a margin of error, which is larger for areas with smaller populations.

  • 1-year estimates are available for areas with a population of 65,000 or more.
  • 5-year estimates are available for areas down to the block group.
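To make the margin-of-error point concrete, here is a minimal sketch in base R. The two areas and all of their numbers are made up for illustration; they are not real ACS output. ACS margins of error are published at the 90 percent confidence level, so the estimate plus or minus the margin of error gives an approximate 90 percent interval.

```r
# Hypothetical estimates and margins of error, for illustration only
acs_example <- data.frame(
  area     = c("large county", "small county"),
  estimate = c(60000, 52000),  # e.g. median household income (made up)
  moe      = c(900, 7500)      # made-up margins of error
)

# approximate 90% interval and the margin of error relative to the estimate
acs_example$lower   <- acs_example$estimate - acs_example$moe
acs_example$upper   <- acs_example$estimate + acs_example$moe
acs_example$moe_pct <- round(acs_example$moe / acs_example$estimate, 3)

acs_example
```

When the relative margin of error is large, as in the hypothetical small county, differences between areas may not be meaningful.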

Comparison

Geographic levels

Smallest areas

Which census data should you use?

It always depends on your analysis questions!

Decennial vs ACS

  • The Decennial Census is an actual count, not an estimate. Use it when possible.
  • The ACS is useful for questions not included in the decennial survey.
    • It becomes more useful later in the decade, as the decennial count gets older.

If you use the ACS, never forget that it is an estimate.

Question changes

Prison gerrymandering

The Census Bureau counts incarcerated people as residents of the district where they are confined.

  • A few states now reallocate incarcerated people from where they are incarcerated to their last residence.
  • Most do not.

Resources:

  • Prisoners of the census overview
  • Prison gerrymandering factsheet for Georgia
  • National Conference of State Legislatures (NCSL) 50-state overview of prison gerrymandering policy.

tidycensus

tidycensus is a package that imports census data directly from the U.S. Census as tidyverse-ready dataframes. It is very nice.

Major functions

There are two major functions implemented in tidycensus:

  • get_decennial(): grants access to the 2000, 2010, and 2020 decennial US Census APIs
  • get_acs(): grants access to the 1-year and 5-year American Community Survey APIs
    • ACS 1-year for 2022 was released Sept 15, 2023
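Both functions call the Census Bureau API, so you will usually need a Census API key before they work. The sketch below assumes you have already requested a free key from the Census Bureau; "YOUR_KEY_HERE" is a placeholder, not a real key.

```r
library(tidycensus)

# "YOUR_KEY_HERE" is a placeholder; paste in the key the Census Bureau emails you.
# install = TRUE saves the key to your .Renviron file so future sessions find it.
census_api_key("YOUR_KEY_HERE", install = TRUE)

# reload the environment file (or restart R), then check that the key is visible
readRenviron("~/.Renviron")
Sys.getenv("CENSUS_API_KEY")
```

You only need to do this once per computer.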

get_decennial()

get_decennial(
  geography,
  variables = NULL,
  table = NULL,
  cache_table = FALSE,
  year = 2020,
  sumfile = "pl",
  state = NULL,
  county = NULL,
  geometry = FALSE,
  output = "tidy"
)

find available geographies here

Help section


library(tidyverse)
library(tidycensus)
library(scales)

# look at the help section for the load_variables() function 
# run the line of code below in your console and look at the help section
?load_variables

HELP

List 2020 Census variables


The 2020 Census data release is still delayed because of COVID-19. Data available so far:

  • Redistricting Data (population counts) - pl
  • Demographic and Housing Characteristics - dhc
  • Demographic Profile - dp

To view all of the variables in any of these tables, use the load_variables() function.

# create table of all variables in the 2020 redistricting file
pl_2020 <- load_variables(2020, "pl", cache = TRUE)
  • cache = TRUE stores the variable list on your computer, so it loads faster the next time
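The variable table is long, so it helps to filter it rather than scroll. Here is a minimal sketch of one way to search it. It uses a small stand-in tibble (pl_example, a hypothetical name) so it runs without an API call; the real table returned by load_variables() has the same three columns (name, label, concept) but many more rows, and the labels here are abbreviated.

```r
library(dplyr)
library(stringr)
library(tibble)

# Stand-in for the table returned by load_variables(); labels are abbreviated
pl_example <- tribble(
  ~name,     ~label,            ~concept,
  "H1_001N", "Total",           "OCCUPANCY STATUS",
  "H1_002N", "Total, Occupied", "OCCUPANCY STATUS",
  "H1_003N", "Total, Vacant",   "OCCUPANCY STATUS",
  "P1_001N", "Total",           "RACE"
)

# keep only the rows whose concept mentions occupancy (case-insensitive)
occupancy_vars <- pl_example |>
  filter(str_detect(concept, regex("occupancy", ignore_case = TRUE)))

occupancy_vars
```

The same filter() call should work on pl_2020 itself; you can also open the table with View(pl_2020) and use the search box.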

Data in the Redistricting Dataset

  • P1. Race
  • P2. Hispanic or Latino, and Not Hispanic or Latino by Race
  • P3. Race for the Population 18 Years and Over
  • P4. Hispanic or Latino, and Not Hispanic or Latino by Race for the Population 18 Years and Over
  • P5. Group Quarters Population by Major Group Quarters Type
  • H1. Occupancy Status

Import Housing Units data

housing_units <- get_decennial(geography = "state",
                             variables = c(housing_units = "H1_001N"), 
                             year = 2020)


GEOID  NAME          variable       value
42     Pennsylvania  housing_units  5742828
06     California    housing_units  14392140

A question:


What percentage of housing units receive an American Community Survey each year?

The answer


# ACS questionnaires go to 3.5 million addresses each year
acs_percent <- 3500000/sum(housing_units$value)

acs_percent
[1] 0.02463108
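The result is a proportion, so roughly 2.5 percent of housing units receive an ACS questionnaire each year. As a small sketch, the scales package (already loaded above) can do the formatting; the value is copied from the output above so the example runs on its own.

```r
library(scales)

# value printed above
acs_percent <- 0.02463108

# format the proportion as a percentage with one decimal place
acs_label <- percent(acs_percent, accuracy = 0.1)
acs_label
```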

Import multiple columns

housing_vars = c("H1_001N", "H1_002N", "H1_003N")

raw_housing_2020_tidy = get_decennial(geography = "state", 
                   variables = housing_vars, 
                   year = 2020)


GEOID  NAME                  variable  value
42     Pennsylvania          H1_001N   5742828
06     California            H1_001N   14392140
54     West Virginia         H1_001N   855635
49     Utah                  H1_001N   1151414
36     New York              H1_001N   8488066
11     District of Columbia  H1_001N   350364

Tidy data

Each state has 3 rows, one for each of the 3 housing variables. This is called long format.

We want the data to be in wide format to make it easier to work with.

Wide format

housing_vars = c("H1_001N", "H1_002N", "H1_003N")

raw_housing_2020 = get_decennial(geography = "state", 
                   variables = housing_vars, 
                   year = 2020,
                   output = "wide")


GEOID  NAME                  H1_001N   H1_002N   H1_003N
42     Pennsylvania          5742828   5210598   532230
06     California            14392140  13475623  916517
54     West Virginia         855635    743442    112193
49     Utah                  1151414   1057252   94162
36     New York              8488066   7715172   772894
11     District of Columbia  350364    312448    37916
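If you already have a long-format dataframe, you can also reshape it yourself instead of downloading again. This is a minimal sketch using pivot_wider() from tidyr; housing_long is a hypothetical name, and its rows are typed in by hand from the Pennsylvania and California values shown above so the example runs without an API call.

```r
library(tidyr)
library(tibble)

# long-format rows for two states, values copied from the tables above
housing_long <- tribble(
  ~GEOID, ~NAME,          ~variable, ~value,
  "42",   "Pennsylvania", "H1_001N", 5742828,
  "42",   "Pennsylvania", "H1_002N", 5210598,
  "42",   "Pennsylvania", "H1_003N", 532230,
  "06",   "California",   "H1_001N", 14392140,
  "06",   "California",   "H1_002N", 13475623,
  "06",   "California",   "H1_003N", 916517
)

# one row per state, one column per variable
housing_wide <- housing_long |>
  pivot_wider(names_from = variable, values_from = value)

housing_wide
```

Asking for output = "wide" in get_decennial() is simpler when you know in advance that you want wide data.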

Percent Vacant

Rename the variables and calculate percent occupied and percent vacant. The results are stored as proportions between 0 and 1.

housing_2020 <-  raw_housing_2020 |>
  rename(state = NAME,
         tot_housing_units = H1_001N,
         occupied = H1_002N,
         vacant = H1_003N) |> 
  mutate(pct_occupied = round(occupied/tot_housing_units, 3),
         pct_vacant = round(vacant/tot_housing_units, 3))


GEOID  state          tot_housing_units  occupied  vacant  pct_occupied  pct_vacant
42     Pennsylvania   5742828            5210598   532230  0.907         0.093
06     California     14392140           13475623  916517  0.936         0.064
54     West Virginia  855635             743442    112193  0.869         0.131
49     Utah           1151414            1057252   94162   0.918         0.082
36     New York       8488066            7715172   772894  0.909         0.091

Bar plot - code

Use geom_col() to create a bar plot of percent vacant. By default, the states appear in alphabetical order.

ggplot(data=housing_2020, aes(x=state, y=pct_vacant)) +
  geom_col() +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  labs(x = "State", y = "Percent Vacant Housing Units") 

Bar plot

Plot each state and reorder columns

  • Use the reorder() function to sort the states by percent vacant instead of alphabetically
  • Format the y-axis as %
ggplot(data=housing_2020, aes(x=reorder(state,pct_vacant), y=pct_vacant)) +
  geom_col() +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  labs(x = "State", y = "Percent Vacant",
       title = "Proportion of Housing Units that are vacant") 
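To see what reorder() is doing inside aes(), here is a small base R sketch. The five states and their pct_vacant values are copied from the table shown earlier; asc and desc are hypothetical names used only for this illustration.

```r
# values copied from the pct_vacant column shown earlier
state      <- c("Pennsylvania", "California", "West Virginia", "Utah", "New York")
pct_vacant <- c(0.093, 0.064, 0.131, 0.082, 0.091)

# reorder() returns a factor whose levels are sorted by the second argument,
# and ggplot draws the bars in level order
asc <- levels(reorder(state, pct_vacant))
asc

# negate the values to sort from highest to lowest instead
desc <- levels(reorder(state, -pct_vacant))
desc
```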


Redistricting Race/Ethnicity data

Let’s look at the list of variables again.

  • P1. Race
  • P2. Hispanic or Latino, and Not Hispanic or Latino by Race
  • P3. Race for the Population 18 Years and Over
  • P4. Hispanic or Latino, and Not Hispanic or Latino by Race for the Population 18 Years and Over
  • P5. Group Quarters Population by Major Group Quarters Type
  • H1. Occupancy Status

Find the variable for:

  • Total Population
  • Hispanic or Latino
  • Asian Alone, NOT Hispanic or Latino

In-class Analysis

Use the get_decennial function to create a state dataframe with the following variables:

  • GEOID
  • State
  • Total Population (P1_001N)
  • Percent Hispanic or Latino (P2_002N/P1_001N)
  • Percent Black alone, not Hispanic or Latino (see if you can find the data to calculate this variable)

There are a lot of race/ethnicity variables. It is not easy to determine which one to use!

Create a bar chart of the percent Black population, with the states ordered by population.

Assignment 6a: Reading


Assignment 6b: Tidycensus

Create a dataframe of estimated Median Household Income and selected race/ethnicity variables for every county in one state. Use this data to understand the relationship between race/ethnicity and income in this state. Explore the dataframe by:

  • looking at the data
  • calculating summary statistics
  • creating plots

Write a paragraph explaining at least 3 things you have learned about your state by exploring the data. Include plots and/or statistics to support your conclusions. (You can upload the plots separately or create a pdf with text and images)

  • See more instructions on the next slide

Assignment 6b: specific instructions

Use the get_decennial function to create a dataframe of all counties in one state (pick any state) with the following variables:

  • GEOID
  • County
  • Total Population
  • Percent Hispanic or Latino
  • Percent White alone, not Hispanic or Latino
  • Percent Black alone, not Hispanic or Latino
  • Percent Asian alone, not Hispanic or Latino
  • (get more variables if you want!)

Use the get_acs function to create a dataframe of the estimated median household income for all counties in the same state. Use the code below, replacing "GA" with the state you picked. We’ll learn more about ACS next week.

raw_mhi_2020 = get_acs(geography = "county",
                       state = "GA",  # replace with the state you picked
                       variables = c(mhi = "B19013_001"),
                       year = 2020,
                       survey = "acs5")

Join these two dataframes together. Explore as described in the assignment overview.
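Here is a minimal sketch of the join step, assuming both dataframes keep the GEOID column. The dataframe names, county names, GEOIDs, and every number below are hypothetical stand-ins so the example runs without an API call. The only thing taken from tidycensus is the shape of the get_acs() output, which has GEOID, NAME, variable, estimate, and moe columns.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for your two dataframes; all values are made up
county_race <- tribble(
  ~GEOID,  ~county,    ~total_pop, ~pct_hispanic,
  "00001", "County A", 10000,      0.12,
  "00002", "County B", 25000,      0.05
)

county_mhi <- tribble(
  ~GEOID,  ~NAME,      ~variable, ~estimate, ~moe,
  "00001", "County A", "mhi",     50000,     3000,
  "00002", "County B", "mhi",     65000,     2500
)

# keep only the income columns, give them clear names, then join on GEOID
county_joined <- county_race |>
  left_join(
    county_mhi |> select(GEOID, mhi = estimate, mhi_moe = moe),
    by = "GEOID"
  )

county_joined
```

Joining on GEOID rather than on the county name avoids mismatches when the two sources format names differently.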