week9

Outline

Assignment 3
Lab 3 overview & Part 2 deliverables
Intro to tidycensus
Intro to spatial data and sf
Create a map with decennial Census data
Create a map with ACS data
Lab 3 deliverables detail

Assignment 3

Research Proposal:

One PDF (3-5 pages) describing the research question, methods and expected results of the research project.

Due next week: Nov 1

Lab 3, Part 2 Deliverables

due November 8 (pushing this by one week so you have time for in-class help next week)

Deliverable 1 R script to process housing occupancy data for every state from the 2020 decennial Census and create a leaflet map in your R Studio viewer.
Deliverable 2 R script to:
- Process median rent data for every census tract in Brooklyn from the 2016-20 5-year American Community Survey
- Calculate which tract is affordable for 2 people with minimum wage jobs
- Create two leaflet maps to display median rent and rent affordability for 2 people with minimum wage jobs
Resubmit any part of Lab, part 1 if you have corrections

tidycensus

Today we talk about how to use the tidycensus package to access Census data in R and prepare it for mapping and spatial analysis.

tidycensus:
- delivers Census data to R users in a tidyverse-friendly way
  - Decennial Census data
  - American Community Survey data, 5-year and 1-year
  - geometry
- it requires a Census API key: get it here.

lesson prep

install packages:
- tidycensus: to import Census data
- sf: to handle spatial data
- scales: to format text
- leaflet: to create interactive map viewer in R Studio
  - install.packages("package_name")
  - or click Install in the Packages window for a user-interface
create a script to install your Census API key
- create new R script, save it in your scripts folder as tidycensus_API_key.R

census_api_key("put your census api key here", 
               install = TRUE,
               overwrite = TRUE)`

run the script
- On your computer you only need to install the API key once
- Pratt computers may require you run this every time you log on

tidycensus documentation

Overview
Basic usage
Available geography
You can also look at the ChangeLog to see what is available for the 2020 census.

Create a new script

Create a new script to process housing occupancy data from the 2020 decennial Census (this will be Deliverable 1:

New R Script
Save it as housing_occupancy_2020.R in the main_data/scripts/data_processing folder
Add the packages you will use in the script at the top of the script:

library(tidyverse)
library(tidycensus)
library(sf)
library(leaflet)
library(scales)

list of decennial census variables

Create a data frame of all available variables in the 2020 decennial Census
Only data for redistricting is available so far (boo, covid delays)
- The 2020 Census Redistricting Data (P.L. 94-171) Summary Files
- called “pl” in tidycensus

# create table of all variables in the 2020 redistricting file
decennial_vars_2020 <- load_variables(2020, "pl", cache = TRUE)

cache = TRUE: saves information to your computer so that this table will load faster next time. (recommended)

load_variables documentation

in your console, type ??load_variables to see the help section

decennial census variables

name	label	concept
H1_001N	!!Total:	OCCUPANCY STATUS
H1_002N	!!Total:!!Occupied	OCCUPANCY STATUS
H1_003N	!!Total:!!Vacant	OCCUPANCY STATUS
P1_001N	!!Total:	RACE
P1_002N	!!Total:!!Population of one race:	RACE
P1_003N	!!Total:!!Population of one race:!!White alone	RACE

….(displaying the first 6 of 301 rows)

The returned data frame always has three columns:

name:refers to the Census variable ID

format = tableID_variableID

label: a descriptive data label for the variable

concept: refers to the topic of the data and often corresponds to a table of Census data.

Redistricting Race/Ethnicity data

There are 6 tables in the PL-2020 Census Redistricting Data:

P1. Race
P2. Hispanic or Latino, and Not Hispanic or Latino by Race
P3. Race for the Population 18 Years and Over
P4. Hispanic or Latino, and Not Hispanic or Latino by Race for the Population 18 Years and Over
P5. Group Quarters Population by Major Group Quarters Type
H1. Occupancy Status

Import Housing Units data

Use the get_decennial() function to create a data frame of housing units for every county in New York:

Look at the decennial_vars_2020 data frame to find the variable IDs for:
- housing units, occupied housing units, vacant housing units

raw_housing_units <- get_decennial(geography = "state",
                             variables = c(housing_units = "H1_001N",
                                           occupied_units = "H1_002N", 
                                           vacant_units = "H1_003N"), 
                             year = 2020,
                             output = "wide",
                             geometry = TRUE)

GEOID	NAME	housing_units	occupied_units	vacant_units	geometry
35	New Mexico	940859	829514	111345	MULTIPOLYGON (((-109.0502 3…
72	Puerto Rico	1598159	1340534	257625	MULTIPOLYGON (((-65.23805 1…
06	California	14392140	13475623	916517	MULTIPOLYGON (((-118.6044 3…
01	Alabama	2288330	2011947	276383	MULTIPOLYGON (((-88.05338 3…
13	Georgia	4410956	4020808	390148	MULTIPOLYGON (((-81.27939 3…

….(displaying the first 5 of 52 rows)

get_decennial explainer

Let’s break down the get_decennial() function

look it up in the help for more details: ??get_decennial

raw_housing_units <- get_decennial(geography = "state",
                             variables = c(housing_units = "H1_001N",
                                           occupied_units = "H1_002N", 
                                           vacant_units = "H1_003N"), 
                             year = 2020,
                             output = "wide",
                             geometry = TRUE)

geography: defines the geographic unit (find available geographies)

variables: list of variable IDs

we are renaming as we import with housing_units = "H1_001N"

year: defines the year (2000, 2010, 2020 are available for decennial)

output: wide returns a data frame where each variable is a column

the default option is tidy returns a data frame with unique unit-variable combinations

geometry: imports as a spatial data frame with a geometry column

*Note: the order of the parameters above doesn’t matter

Answer a question

The ACS questionnaire is sent to 3.5 million addresses each year.

What proportion of US households get the questionnaire?
How would we answer this question with our data frame?

acs_percent <- 3500000/sum(raw_housing_units$housing_units)

Calculate new variables

Create a new data frame and calculate two new variables in your dataframe:

use the filter() function to remove Alaska, Hawaii, and Puerto Rico
use the mutate() function to add the two new variables
- percent vacant = occupied_units/housing_units
- percent occupied = vacant_units/housing_units
- percent_vacant_label = percent_vacant formatted with percent() function from scales package

housing_units <- raw_housing_units %>% 
 mutate(pct_occupied = occupied_units/housing_units,
        pct_vacant = vacant_units/housing_units,
        percent_vacant_label = percent(pct_vacant, accuracy = 1L)) %>% 
  filter(NAME != "Alaska",
         NAME != "Hawaii",
         NAME != "Puerto Rico")

so your data frame looks like this:

GEOID	NAME	housing_units	occupied_units	vacant_units	geometry	pct_occupied	pct_vacant	percent_vacant_label
35	New Mexico	940859	829514	111345	MULTIPOLYGON (((-109.0502 3…	0.8816560	0.1183440	12%
06	California	14392140	13475623	916517	MULTIPOLYGON (((-118.6044 3…	0.9363182	0.0636818	6%
01	Alabama	2288330	2011947	276383	MULTIPOLYGON (((-88.05338 3…	0.8792207	0.1207793	12%

Make a map with leaflet

use st_tranform() to project to WGS84 (WGS84 is required for leaflet)

use epsg.io to find the EPSG code for any projection

# transform spatial data to WGS84 for leaflet
housing_units_4326 <- st_transform(housing_units, 4326)