week9

Outline

  • Assignment 3
  • Lab 3 overview & Part 2 deliverables
  • Intro to tidycensus
  • Intro to spatial data and sf
  • Create a map with decennial Census data
  • Create a map with ACS data
  • Lab 3 deliverables detail

Assignment 3


Research Proposal:

One PDF (3-5 pages) describing the research question, methods and expected results of the research project.

  • Due next week: Nov 1

Lab 3, Part 2 Deliverables


Due November 8 (pushed back one week so you have time for in-class help next week)

  • Deliverable 1 R script to process housing occupancy data for every state from the 2020 decennial Census and create a leaflet map in your RStudio viewer.
  • Deliverable 2 R script to:
    • Process median rent data for every census tract in Brooklyn from the 2016-20 5-year American Community Survey
    • Calculate which tract is affordable for 2 people with minimum wage jobs
    • Create two leaflet maps to display median rent and rent affordability for 2 people with minimum wage jobs
  • Resubmit any part of Lab 3, Part 1 if you have corrections

tidycensus

Today we talk about how to use the tidycensus package to access Census data in R and prepare it for mapping and spatial analysis.

  • tidycensus:
    • delivers Census data to R users in a tidyverse-friendly way
      • Decennial Census data
      • American Community Survey data, 5-year and 1-year
      • geometry
    • it requires a Census API key: get it here.

lesson prep

  • install packages:
    • tidycensus: to import Census data
    • sf: to handle spatial data
    • scales: to format text
    • leaflet: to create an interactive map viewer in RStudio
    • install each package with install.packages("package_name")
    • or click Install in the Packages window to use the point-and-click installer
  • create a script to install your Census API key
    • create new R script, save it in your scripts folder as tidycensus_API_key.R
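library(tidycensus)

# install = TRUE saves the key to your .Renviron file for future sessions;
# overwrite = TRUE replaces any key already stored there
census_api_key("put your census api key here", 
               install = TRUE,
               overwrite = TRUE)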
  • run the script
    • On your computer you only need to install the API key once
    • Pratt computers may require you run this every time you log on
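
To see whether the key is available in your current session, a minimal check like the one below may help. It assumes tidycensus stores the key in the CENSUS_API_KEY environment variable, which is where census_api_key(install = TRUE) writes it; the messages are only illustrative.

```r
# Look up the key that census_api_key(install = TRUE) saved to .Renviron
key <- Sys.getenv("CENSUS_API_KEY")

if (nzchar(key)) {
  message("Census API key found.")
} else {
  # A freshly installed key is only picked up after restarting R
  # or re-reading the .Renviron file
  message("No key found: run tidycensus_API_key.R, then restart R ",
          "or run readRenviron(\"~/.Renviron\").")
}
```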

tidycensus documentation

Create a new script


Create a new script to process housing occupancy data from the 2020 decennial Census (this will be Deliverable 1):

  • New R Script
  • Save it as housing_occupancy_2020.R in the main_data/scripts/data_processing folder
  • Add the packages you will use in the script at the top of the script:
library(tidyverse)
library(tidycensus)
library(sf)
library(leaflet)
library(scales)

list of decennial census variables


  • Create a data frame of all available variables in the 2020 decennial Census
  • Only data for redistricting is available so far (boo, covid delays)
    • The 2020 Census Redistricting Data (P.L. 94-171) Summary Files
    • called “pl” in tidycensus
# create table of all variables in the 2020 redistricting file
decennial_vars_2020 <- load_variables(2020, "pl", cache = TRUE)
  • cache = TRUE: saves information to your computer so that this table will load faster next time. (recommended)

load_variables documentation

  • in your console, type ??load_variables to see the help section

decennial census variables

name      label                                            concept
H1_001N   !!Total:                                         OCCUPANCY STATUS
H1_002N   !!Total:!!Occupied                               OCCUPANCY STATUS
H1_003N   !!Total:!!Vacant                                 OCCUPANCY STATUS
P1_001N   !!Total:                                         RACE
P1_002N   !!Total:!!Population of one race:                RACE
P1_003N   !!Total:!!Population of one race:!!White alone   RACE

… (displaying the first 6 of 301 rows)


  • The returned data frame always has three columns:
    • name: refers to the Census variable ID
      • format = tableID_variableID
    • label: a descriptive data label for the variable
    • concept: refers to the topic of the data and often corresponds to a table of Census data.
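
As a rough sketch of how these three columns can be used, the example below retypes the six rows shown above into a small data frame (so it runs without an API call), splits each name into its table ID and variable ID, and keeps only the occupancy variables. The names vars, table_id, variable_id and occupancy_vars are made up for this illustration; in your own script you would work with decennial_vars_2020 instead.

```r
# The six rows displayed above, typed in by hand so the example runs offline
vars <- data.frame(
  name = c("H1_001N", "H1_002N", "H1_003N",
           "P1_001N", "P1_002N", "P1_003N"),
  label = c("!!Total:", "!!Total:!!Occupied", "!!Total:!!Vacant",
            "!!Total:", "!!Total:!!Population of one race:",
            "!!Total:!!Population of one race:!!White alone"),
  concept = c(rep("OCCUPANCY STATUS", 3), rep("RACE", 3))
)

# name follows the pattern tableID_variableID, so split on the underscore
vars$table_id <- sub("_.*$", "", vars$name)
vars$variable_id <- sub("^.*_", "", vars$name)

# keep only the variables from the occupancy table
occupancy_vars <- vars[vars$concept == "OCCUPANCY STATUS", ]
occupancy_vars$name
```

With the tidyverse loaded, filter(decennial_vars_2020, concept == "OCCUPANCY STATUS") should do the same filtering on the full table.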

Redistricting Race/Ethnicity data


There are 6 tables in the PL-2020 Census Redistricting Data:

  • P1. Race
  • P2. Hispanic or Latino, and Not Hispanic or Latino by Race
  • P3. Race for the Population 18 Years and Over
  • P4. Hispanic or Latino, and Not Hispanic or Latino by Race for the Population 18 Years and Over
  • P5. Group Quarters Population by Major Group Quarters Type
  • H1. Occupancy Status

Import Housing Units data

Use the get_decennial() function to create a data frame of housing units for every state:

  • Look at the decennial_vars_2020 data frame to find the variable IDs for:
    • housing units, occupied housing units, vacant housing units
raw_housing_units <- get_decennial(geography = "state",
                             variables = c(housing_units = "H1_001N",
                                           occupied_units = "H1_002N", 
                                           vacant_units = "H1_003N"), 
                             year = 2020,
                             output = "wide",
                             geometry = TRUE)


GEOID  NAME         housing_units  occupied_units  vacant_units  geometry
35     New Mexico          940859          829514        111345  MULTIPOLYGON (((-109.0502 3…
72     Puerto Rico        1598159         1340534        257625  MULTIPOLYGON (((-65.23805 1…
06     California        14392140        13475623        916517  MULTIPOLYGON (((-118.6044 3…
01     Alabama            2288330         2011947        276383  MULTIPOLYGON (((-88.05338 3…
13     Georgia            4410956         4020808        390148  MULTIPOLYGON (((-81.27939 3…

… (displaying the first 5 of 52 rows)

get_decennial explainer

Let’s break down the get_decennial() function

  • look it up in the help for more details: ??get_decennial
raw_housing_units <- get_decennial(geography = "state",
                             variables = c(housing_units = "H1_001N",
                                           occupied_units = "H1_002N", 
                                           vacant_units = "H1_003N"), 
                             year = 2020,
                             output = "wide",
                             geometry = TRUE)
  • geography: defines the geographic unit (find available geographies)
  • variables: list of variable IDs
    • we are renaming as we import with housing_units = "H1_001N"
  • year: defines the year (2000, 2010, 2020 are available for decennial)
  • output: "wide" returns a data frame where each variable is a column
    • the default option, "tidy", returns a data frame with one row for each unit-variable combination
  • geometry: imports as a spatial data frame with a geometry column
  • Note: because every argument is named, the order of the arguments above doesn't matter
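
To make the difference between the two output options concrete, here is a small offline sketch. It hand-builds a data frame in the tidy layout (columns GEOID, NAME, variable, value) from the New Mexico and California figures shown earlier, then reshapes it with pivot_wider() from tidyr into the wide layout. The object names tidy_units and wide_units are invented for this example.

```r
library(tidyr)

# Tidy layout: one row per state-variable combination
tidy_units <- data.frame(
  GEOID = rep(c("35", "06"), each = 3),
  NAME = rep(c("New Mexico", "California"), each = 3),
  variable = rep(c("housing_units", "occupied_units", "vacant_units"),
                 times = 2),
  value = c(940859, 829514, 111345,
            14392140, 13475623, 916517)
)

# Wide layout: one row per state, one column per variable
wide_units <- pivot_wider(tidy_units,
                          names_from = variable,
                          values_from = value)
wide_units
```

Wide output is convenient here because the next steps divide one column by another for each state.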

Answer a question


The ACS questionnaire is sent to 3.5 million addresses each year.

  • What proportion of US households get the questionnaire?
  • How would we answer this question with our data frame?
  • acs_percent <- 3500000/sum(raw_housing_units$housing_units)

Calculate new variables

Create a new data frame and calculate three new variables in your data frame:

  • use the filter() function to remove Alaska, Hawaii, and Puerto Rico
  • use the mutate() function to add the three new variables
    • pct_occupied = occupied_units/housing_units
    • pct_vacant = vacant_units/housing_units
    • percent_vacant_label = pct_vacant formatted with the percent() function from the scales package
housing_units <- raw_housing_units %>% 
  mutate(pct_occupied = occupied_units/housing_units,
         pct_vacant = vacant_units/housing_units,
         percent_vacant_label = percent(pct_vacant, accuracy = 1L)) %>% 
  filter(NAME != "Alaska",
         NAME != "Hawaii",
         NAME != "Puerto Rico")
  • so your data frame looks like this:
GEOID  NAME        housing_units  occupied_units  vacant_units  geometry                      pct_occupied  pct_vacant  percent_vacant_label
35     New Mexico         940859          829514        111345  MULTIPOLYGON (((-109.0502 3…     0.8816560   0.1183440  12%
06     California       14392140        13475623        916517  MULTIPOLYGON (((-118.6044 3…     0.9363182   0.0636818  6%
01     Alabama           2288330         2011947        276383  MULTIPOLYGON (((-88.05338 3…     0.8792207   0.1207793  12%
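
A quick sanity check on the new columns: occupied plus vacant units should equal total housing units, so the two shares should sum to 1. The sketch below retypes the three rows displayed above into a plain data frame (named check only for this example) so it runs without the Census API; the same checks could be run on your own housing_units data frame.

```r
# The three rows displayed above, without the geometry column
check <- data.frame(
  NAME = c("New Mexico", "California", "Alabama"),
  housing_units = c(940859, 14392140, 2288330),
  occupied_units = c(829514, 13475623, 2011947),
  vacant_units = c(111345, 916517, 276383)
)

check$pct_occupied <- check$occupied_units / check$housing_units
check$pct_vacant <- check$vacant_units / check$housing_units

# occupied + vacant should account for every housing unit
stopifnot(all(check$occupied_units + check$vacant_units == check$housing_units))

# so the two shares should add up to 1 (allowing for floating-point error)
stopifnot(isTRUE(all.equal(check$pct_occupied + check$pct_vacant,
                           rep(1, nrow(check)))))
```

If either stopifnot() call raises an error on your own data, the numerators in mutate() are the first thing to check.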

Make a map with leaflet

use st_transform() to reproject to WGS84 (leaflet requires WGS84, EPSG code 4326)

  • use epsg.io to find the EPSG code for any projection
# transform spatial data to WGS84 for leaflet
housing_units_4326 <- st_transform(housing_units, 4326)
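
To confirm that a transformation did what you expect, you can compare the coordinate reference system before and after with st_crs(). The sketch below uses a made-up square polygon (toy_square is hypothetical, not Census data) stored in NAD83 (EPSG 4269), the system that Census boundary files normally arrive in, so it runs without an API call.

```r
library(sf)

# A hypothetical square, defined by its corner coordinates (longitude, latitude)
corners <- rbind(c(-109, 31), c(-103, 31), c(-103, 37), c(-109, 37), c(-109, 31))
toy_square <- st_sf(NAME = "toy square",
                    geometry = st_sfc(st_polygon(list(corners)), crs = 4269))

# Reproject to WGS84, as leaflet expects
toy_square_4326 <- st_transform(toy_square, 4326)

# Compare the EPSG codes before and after
st_crs(toy_square)$epsg
st_crs(toy_square_4326)$epsg
```

Running st_crs(housing_units_4326)$epsg on your own data should likewise report 4326.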