Methods 1, Week 6

tidycensus


Today we talk about the US Census and the tidycensus package.

  • tidycensus requires a key: get it here.
  • Create new project from new folder: methods1/class5
  • Install the tidycensus package: install.packages("tidycensus")

Outline

  • Homework questions and overview

  • US Census

  • tidycensus

  • In-class exercise

  • Homework

The decennial US Census

  • The U.S. census counts every resident of the country every ten years (in years ending in zero).
  • The Constitution mandates the count.
  • The count has been conducted every ten years since 1790.
  • The Census Bureau counts each resident where they live on April 1.
  • The census is a count of residents, not a count of citizens.

Uses of the census: Apportionment


  • The process of dividing the 435 seats in the U.S. House of Representatives among the 50 states, based on the state population counts.
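Since 1941 the seats have been divided with the method of equal proportions (Huntington-Hill): every state starts with one seat, and each remaining seat goes to the state with the highest priority value, population / sqrt(n * (n + 1)), where n is the number of seats the state already holds. A minimal sketch of that rule, assuming three made-up states and 10 seats instead of real census counts; apportion() is a hypothetical helper written for this slide, not a function from any package:

```r
# Hypothetical helper: method of equal proportions (Huntington-Hill)
apportion <- function(pop, seats) {
  # every state is guaranteed one seat
  alloc <- rep(1L, length(pop))
  names(alloc) <- names(pop)
  while (sum(alloc) < seats) {
    # priority value for each state's next seat
    priority <- pop / sqrt(alloc * (alloc + 1))
    winner <- which.max(priority)
    alloc[winner] <- alloc[winner] + 1L
  }
  alloc
}

# made-up populations, not census data
toy_pop <- c(A = 9000, B = 5000, C = 1000)
toy_seats <- apportion(toy_pop, seats = 10)
toy_seats
```

In this toy example state C keeps its single guaranteed seat, and the other nine seats are split between A and B according to their populations.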

Uses of the census: Redistricting


  • Provides population data to adjust or redraw electoral districts.

Uses of the census: Allocate federal money


Helps determine how federal funds are distributed across the country, including:

  • funding for schools, hospitals, roads, and public infrastructure.
  • Medicaid, Head Start, block grant programs for community mental health services, and the Supplemental Nutrition Assistance Program (SNAP).

Uses of the census:


  • Research
  • Maps

The 1-year and 5-year American Community Survey

The American Community Survey (ACS) is a demographic survey conducted by the US Census Bureau every year since 2005.

  • The Census Bureau sends surveys to about 3.5 million randomly selected addresses every year

ACS data are estimates. Every ACS estimate has a margin of error, and the margin is larger for areas with smaller populations.

  • 1-year estimates are available for areas with population >= 65,000 people.
  • 5-year estimates are available for areas down to the block group

Comparison

Geographic levels

Smallest areas

Which census data should you use?

It always depends on your analysis questions!

Decennial vs ACS

  • The Decennial Census is an actual count, not an estimate. Use it when possible.
  • The ACS is useful for questions not included in the decennial survey
    • It becomes more useful later in the decade, as the decennial count gets older.

If you use the ACS, never forget that its numbers are estimates.
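One way to keep the uncertainty in view is to turn each margin of error into an interval. get_acs() returns an estimate column and a moe column, and the published ACS margins of error are at the 90 percent confidence level. A minimal sketch, assuming two made-up counties (the names and numbers are hypothetical, not real ACS values):

```r
library(dplyr)
library(tibble)

# Hypothetical rows shaped like get_acs() output
acs_example <- tibble(
  NAME     = c("Large County", "Small County"),
  estimate = c(60000, 48000),
  moe      = c(1500, 6000)
)

acs_ci <- acs_example %>%
  mutate(
    lower  = estimate - moe,                 # lower bound of the 90% interval
    upper  = estimate + moe,                 # upper bound of the 90% interval
    se     = moe / 1.645,                    # standard error implied by a 90% MOE
    cv_pct = round(100 * se / estimate, 1)   # coefficient of variation, in percent
  )

acs_ci
```

The made-up small county has a much wider interval and a higher coefficient of variation, which is the pattern to expect for areas with smaller populations.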

Question changes

Prison gerrymandering

The Census Bureau counts incarcerated people as residents of the district where they are confined.

  • A few states now reallocate incarcerated people from where they are incarcerated to their last residence.
  • Most do not.

Resources:

  • Prisoners of the census overview
  • Prison gerrymandering factsheet for Georgia
  • National Conference of State Legislatures (NCSL) 50-state overview of prison gerrymandering policy.

tidycensus

tidycensus is a package that imports census data directly from the U.S. Census as tidyverse-ready dataframes. It is very nice.

Major functions

There are two major functions implemented in tidycensus:

  • get_decennial(): grants access to the 2000, 2010, and 2020 decennial US Census APIs
    • only the redistricting file - “pl” - is available from the 2020 decennial census
  • get_acs(): grants access to the 1-year and 5-year American Community Survey APIs

get_decennial()

find available geographies here

get_decennial(
  geography,
  variables = NULL,
  table = NULL,
  cache_table = FALSE,
  year = 2010,
  sumfile = "sf1",
  state = NULL,
  county = NULL
)

Install census api key

Install tidycensus and the census API key


Get Census API key to use tidycensus here.

In your console:

  • install.packages("tidycensus")

  • library(tidycensus)

  • census_api_key("put your census api key here", install = TRUE)
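With install = TRUE, tidycensus saves the key in your .Renviron file under the name CENSUS_API_KEY, so you only have to do this once. A small sketch of how you might check that the key is visible to R without printing it; key_is_set is just an illustrative variable name:

```r
# Re-read .Renviron so a newly installed key is visible without restarting R
renviron_path <- path.expand("~/.Renviron")
if (file.exists(renviron_path)) readRenviron(renviron_path)

# TRUE if the environment variable exists and is not empty
key_is_set <- nzchar(Sys.getenv("CENSUS_API_KEY"))

if (key_is_set) {
  message("Census API key found.")
} else {
  message("No Census API key found: run census_api_key() with install = TRUE.")
}
```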

Help section


library(tidyverse)
library(tidycensus)


# look at the help section for the load_variables() function 
# run the line of code below in your console and look at the help section
?load_variables

HELP

List 2020 Census variables


The 2020 Census data release has been badly delayed by COVID-19

  • Only the redistricting data are available so far
    • The 2020 Census Redistricting Data (P.L. 94-171) Summary Files
    • “pl”

# create a table of all variables in the 2020 redistricting file
pl_2020 <- load_variables(2020, "pl", cache = TRUE)

Redistricting Race/Ethnicity data

  • P1. Race
  • P2. Hispanic or Latino, and Not Hispanic or Latino by Race
  • P3. Race for the Population 18 Years and Over
  • P4. Hispanic or Latino, and Not Hispanic or Latino by Race for the Population 18 Years and Over
  • P5. Group Quarters Population by Major Group Quarters Type
  • H1. Occupancy Status
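The table that load_variables() returns has three columns (name, label, concept), so you can search it with filter() instead of scrolling. A minimal sketch, assuming a small stand-in table so it runs without an API key: only the three H1 rows are included and the label text is illustrative, not copied from the Census file.

```r
library(dplyr)
library(stringr)
library(tibble)

# Stand-in for the data frame returned by load_variables(2020, "pl")
pl_vars_example <- tibble(
  name    = c("H1_001N", "H1_002N", "H1_003N"),
  label   = c("Total", "Total: Occupied", "Total: Vacant"),  # illustrative labels
  concept = "OCCUPANCY STATUS"
)

# keep H1 variables whose label mentions "vacant" (case-insensitive)
vacant_vars <- pl_vars_example %>%
  filter(str_starts(name, "H1_"),
         str_detect(label, regex("vacant", ignore_case = TRUE)))

vacant_vars
```

The same filter() call should work on pl_2020, where searching the label and concept columns is a quick way to narrow down the race/ethnicity variables.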

Import Housing Units data

housing_units <- get_decennial(geography = "state",
                               variables = c(housing_units = "H1_001N"),
                               year = 2020)


GEOID  NAME          variable       value
42     Pennsylvania  housing_units  5742828
06     California    housing_units  14392140

A question:


What percentage of housing units receive an American Community Survey each year?

The answer


# ACS questionnaires go to 3.5 million addresses each year
# (the result is a proportion: multiply by 100 for a percentage)
acs_percent <- 3500000/sum(housing_units$value)

acs_percent
[1] 0.02463108

Import multiple columns

housing_vars <- c("H1_001N", "H1_002N", "H1_003N")

raw_housing_2020 <- get_decennial(geography = "state",
                                  variables = housing_vars,
                                  year = 2020) %>%
  arrange(NAME)


GEOID  NAME     variable  value
01     Alabama  H1_001N   2288330
01     Alabama  H1_002N   2011947
01     Alabama  H1_003N   276383
02     Alaska   H1_001N   326200
02     Alaska   H1_002N   269148
02     Alaska   H1_003N   57052

Tidy data

Each state has 3 rows, one for each of the 3 housing variables. This is called long format.

We want the data to be in wide format to make it tidy:

  • each row is one observation (here, a state), and each variable that describes the observation has its own column.

GEOID  NAME     H1_001N  H1_002N  H1_003N
01     Alabama  2288330  2011947  276383
02     Alaska   326200   269148   57052
04     Arizona  3082000  2705878  376122

pivot_wider()

We’ll use pivot_wider() to convert from long to wide format:

  • names_from = the column you want the new column-names to come from
  • values_from = the column you want the data to come from

Then rename the columns to be more descriptive

housing_2020 <- raw_housing_2020 %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  rename(tot_housing_units = H1_001N,
         occupied = H1_002N,
         vacant = H1_003N)

GEOID  NAME     tot_housing_units  occupied  vacant
01     Alabama  2288330            2011947   276383
02     Alaska   326200             269148    57052

Tidy data

There are three interrelated rules which make a dataset tidy:

  • Each observation has its own row.
  • Each variable has its own column.
  • Each value has its own cell.
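pivot_longer() is the reverse of pivot_wider(), which is handy when a plot or summary needs one row per state and variable again. A minimal sketch, assuming a two-state copy of the wide table typed in by hand so it runs without an API key; housing_wide, housing_long and housing_check are illustrative names:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Two states from the wide table, entered by hand
housing_wide <- tibble(
  GEOID             = c("01", "02"),
  NAME              = c("Alabama", "Alaska"),
  tot_housing_units = c(2288330, 326200),
  occupied          = c(2011947, 269148),
  vacant            = c(276383, 57052)
)

# back to long format: one row per state and variable
housing_long <- housing_wide %>%
  pivot_longer(cols = c(tot_housing_units, occupied, vacant),
               names_to = "variable", values_to = "value")

# wide format makes row-wise checks easy: occupied + vacant should equal the total
housing_check <- housing_wide %>%
  mutate(adds_up = occupied + vacant == tot_housing_units)

housing_long
housing_check
```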

Percent Vacant

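Now calculate the share of housing units that are occupied and the share that are vacant. The code stores these as proportions between 0 and 1, so 0.121 means 12.1 percent.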

housing_2020 <- raw_housing_2020 %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  rename(state = NAME,
         tot_housing_units = H1_001N,
         occupied = H1_002N,
         vacant = H1_003N) %>%
  mutate(pct_occupied = round(occupied/tot_housing_units, 3),
         pct_vacant = round(vacant/tot_housing_units, 3))


GEOID  state       tot_housing_units  occupied  vacant  pct_occupied  pct_vacant
01     Alabama     2288330            2011947   276383  0.879         0.121
02     Alaska      326200             269148    57052   0.825         0.175
04     Arizona     3082000            2705878   376122  0.878         0.122
05     Arkansas    1365265            1199395   165870  0.879         0.121
06     California  14392140           13475623  916517  0.936         0.064

Plot each state

Bar plot

Use geom_col to create a bar plot

  • within the aes() we’ll order states by pct_vacant
  • rotate the state labels in the x-axis so we can read them

vacant_housing_plot <- ggplot(data = housing_2020,
                              aes(x = reorder(state, pct_vacant),
                                  y = pct_vacant)) +
  geom_col() +
  # pct_vacant is a proportion, so label the axis ticks as percentages
  scale_y_continuous(labels = scales::label_percent()) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(x = "State", y = "Percent Vacant Housing Units")

# print the plot
vacant_housing_plot

In-class Analysis

Use the get_decennial function to create a state dataframe with the following variables:

  • GEOID
  • State
  • Total Population (P1_001N)
  • Percent Hispanic or Latino (P2_002N/P1_001N)
  • Percent Black alone, not Hispanic or Latino (see if you can find the data to calculate this variable)

There are a lot of race/ethnicity variables. It is not easy to determine which one to use!

Create a bar chart of the percent Black population, with the states ordered by population.

Assignment 6a: Reading


Assignment 6b: Tidycensus

Create a dataframe of estimated Median Household Income and selected race/ethnicity variables for every county in one state. Use this data to understand the relationship between race/ethnicity and income in this state. Explore the dataframe by:

  • looking at the data
  • calculating summary statistics
  • creating plots

Write a paragraph explaining at least 3 things you have learned about your state by exploring the data. Include plots and/or statistics to support your conclusions. (You can upload the plots separately or create a pdf with text and images)

Assignment 6b: specific instructions

Use the get_decennial function to create a dataframe of all counties in one state (pick any state) with the following variables:

  • GEOID
  • County
  • Total Population
  • Percent Hispanic or Latino
  • Percent White alone, not Hispanic or Latino
  • Percent Black alone, not Hispanic or Latino
  • Percent Asian alone, not Hispanic or Latino
  • (get more variables if you want!)

Use the get_acs function to create a dataframe of the estimated median household income for all counties in the same state. Use the code below, replacing "PA" with your state. We’ll learn more about ACS next week.

raw_mhi_2020 <- get_acs(geography = "county",
                        state = "PA",  # example only: use your own state
                        variables = c(mhi = "B19013_001"),
                        year = 2020)

Join these two dataframes together. Explore as described in the assignment overview.
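Both dataframes carry a GEOID column, which makes a safe join key (keep it as text so leading zeros survive). A minimal sketch of the join, assuming two tiny stand-in dataframes so it runs without an API key: the county names and GEOIDs are real Pennsylvania counties, but every number is made up, and county_race, county_mhi and county_joined are illustrative names.

```r
library(dplyr)
library(tibble)

# Stand-in for your get_decennial() result after pivoting and calculating shares
county_race <- tibble(
  GEOID        = c("42001", "42003"),
  county       = c("Adams County, Pennsylvania", "Allegheny County, Pennsylvania"),
  tot_pop      = c(100000, 1200000),   # made-up values
  pct_hispanic = c(0.07, 0.03)         # made-up values
)

# Stand-in for the get_acs() result (columns: GEOID, NAME, variable, estimate, moe)
county_mhi <- tibble(
  GEOID    = c("42001", "42003"),
  NAME     = c("Adams County, Pennsylvania", "Allegheny County, Pennsylvania"),
  variable = "mhi",
  estimate = c(70000, 65000),          # made-up values
  moe      = c(2500, 900)              # made-up values
)

# keep only the columns you need from the ACS data, then join on GEOID
county_joined <- county_race %>%
  left_join(county_mhi %>% select(GEOID, mhi = estimate, mhi_moe = moe),
            by = "GEOID")

county_joined
```

After joining, check that the row count still matches the number of counties and that no income values are missing.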