census_api_key("put your census api key here",
install = TRUE,
overwrite = TRUE)`week9
Outline
- Assignment 3
- Lab 3 overview & Part 2 deliverables
- Intro to tidycensus
- Intro to spatial data and sf
- Create a map with decennial Census data
- Create a map with ACS data
- Lab 3 deliverables detail
Assignment 3
Research Proposal:
One PDF (3-5 pages) describing the research question, methods and expected results of the research project.
- Due next week: Nov 1
Lab 3, Part 2 Deliverables
due November 8 (pushing this by one week so you have time for in-class help next week)
- Deliverable 1 R script to process housing occupancy data for every state from the 2020 decennial Census and create a leaflet map in your R Studio viewer.
- Deliverable 2 R script to:
- Process median rent data for every census tract in Brooklyn from the 2016-20 5-year American Community Survey
- Calculate which tract is affordable for 2 people with minimum wage jobs
- Create two leaflet maps to display median rent and rent affordability for 2 people with minimum wage jobs
- Resubmit any part of Lab, part 1 if you have corrections
tidycensus
Today we talk about how to use the tidycensus package to access Census data in R and prepare it for mapping and spatial analysis.
tidycensus:- delivers Census data to R users in a tidyverse-friendly way
- Decennial Census data
- American Community Survey data, 5-year and 1-year
- geometry
- it requires a Census API key: get it here.
- delivers Census data to R users in a tidyverse-friendly way
lesson prep
- install packages:
- tidycensus: to import Census data
- sf: to handle spatial data
- scales: to format text
- leaflet: to create interactive map viewer in R Studio
install.packages("package_name")- or click
Installin the Packages window for a user-interface
- create a script to install your Census API key
- create new R script, save it in your scripts folder as
tidycensus_API_key.R
- create new R script, save it in your scripts folder as
- run the script
- On your computer you only need to install the API key once
- Pratt computers may require you run this every time you log on
tidycensus documentation
- Overview
- Basic usage
- Available geography
- You can also look at the ChangeLog to see what is available for the 2020 census.
Create a new script
Create a new script to process housing occupancy data from the 2020 decennial Census (this will be Deliverable 1:
- New R Script
- Save it as
housing_occupancy_2020.Rin the main_data/scripts/data_processing folder - Add the packages you will use in the script at the top of the script:
library(tidyverse)
library(tidycensus)
library(sf)
library(leaflet)
library(scales)list of decennial census variables
- Create a data frame of all available variables in the 2020 decennial Census
- Only data for redistricting is available so far (boo, covid delays)
- The 2020 Census Redistricting Data (P.L. 94-171) Summary Files
- called “pl” in tidycensus
# create table of all variables in the 2020 redistricting file
decennial_vars_2020 <- load_variables(2020, "pl", cache = TRUE)cache = TRUE: saves information to your computer so that this table will load faster next time. (recommended)
load_variables documentation
- in your console, type
??load_variablesto see the help section
decennial census variables
| name | label | concept |
|---|---|---|
| H1_001N | !!Total: | OCCUPANCY STATUS |
| H1_002N | !!Total:!!Occupied | OCCUPANCY STATUS |
| H1_003N | !!Total:!!Vacant | OCCUPANCY STATUS |
| P1_001N | !!Total: | RACE |
| P1_002N | !!Total:!!Population of one race: | RACE |
| P1_003N | !!Total:!!Population of one race:!!White alone | RACE |
….(displaying the first 6 of 301 rows)
- The returned data frame always has three columns:
- name:refers to the Census variable ID
- format =
tableID_variableID- label: a descriptive data label for the variable
- concept: refers to the topic of the data and often corresponds to a table of Census data.
Redistricting Race/Ethnicity data
There are 6 tables in the PL-2020 Census Redistricting Data:
- P1. Race
- P2. Hispanic or Latino, and Not Hispanic or Latino by Race
- P3. Race for the Population 18 Years and Over
- P4. Hispanic or Latino, and Not Hispanic or Latino by Race for the Population 18 Years and Over
- P5. Group Quarters Population by Major Group Quarters Type
- H1. Occupancy Status
Import Housing Units data
Use the get_decennial() function to create a data frame of housing units for every county in New York:
- Look at the
decennial_vars_2020data frame to find the variable IDs for:- housing units, occupied housing units, vacant housing units
raw_housing_units <- get_decennial(geography = "state",
variables = c(housing_units = "H1_001N",
occupied_units = "H1_002N",
vacant_units = "H1_003N"),
year = 2020,
output = "wide",
geometry = TRUE)| GEOID | NAME | housing_units | occupied_units | vacant_units | geometry |
|---|---|---|---|---|---|
| 35 | New Mexico | 940859 | 829514 | 111345 | MULTIPOLYGON (((-109.0502 3… |
| 72 | Puerto Rico | 1598159 | 1340534 | 257625 | MULTIPOLYGON (((-65.23805 1… |
| 06 | California | 14392140 | 13475623 | 916517 | MULTIPOLYGON (((-118.6044 3… |
| 01 | Alabama | 2288330 | 2011947 | 276383 | MULTIPOLYGON (((-88.05338 3… |
| 13 | Georgia | 4410956 | 4020808 | 390148 | MULTIPOLYGON (((-81.27939 3… |
….(displaying the first 5 of 52 rows)
get_decennial explainer
Let’s break down the get_decennial() function
- look it up in the help for more details:
??get_decennial
raw_housing_units <- get_decennial(geography = "state",
variables = c(housing_units = "H1_001N",
occupied_units = "H1_002N",
vacant_units = "H1_003N"),
year = 2020,
output = "wide",
geometry = TRUE)
- geography: defines the geographic unit (find available geographies)
- variables: list of variable IDs
- we are renaming as we import with
housing_units = "H1_001N"- year: defines the year (2000, 2010, 2020 are available for decennial)
- output: wide returns a data frame where each variable is a column
- the default option is
tidyreturns a data frame with unique unit-variable combinations- geometry: imports as a spatial data frame with a geometry column
- *Note: the order of the parameters above doesn’t matter
Answer a question
The ACS questionnaire is sent to 3.5 million addresses each year.
- What proportion of US households get the questionnaire?
- How would we answer this question with our data frame?
acs_percent <- 3500000/sum(raw_housing_units$housing_units)
Calculate new variables
Create a new data frame and calculate two new variables in your dataframe:
- use the
filter()function to remove Alaska, Hawaii, and Puerto Rico - use the
mutate()function to add the two new variables- percent vacant = occupied_units/housing_units
- percent occupied = vacant_units/housing_units
- percent_vacant_label =
percent_vacantformatted withpercent()function from scales package
housing_units <- raw_housing_units %>%
mutate(pct_occupied = occupied_units/housing_units,
pct_vacant = vacant_units/housing_units,
percent_vacant_label = percent(pct_vacant, accuracy = 1L)) %>%
filter(NAME != "Alaska",
NAME != "Hawaii",
NAME != "Puerto Rico")- so your data frame looks like this:
| GEOID | NAME | housing_units | occupied_units | vacant_units | geometry | pct_occupied | pct_vacant | percent_vacant_label |
|---|---|---|---|---|---|---|---|---|
| 35 | New Mexico | 940859 | 829514 | 111345 | MULTIPOLYGON (((-109.0502 3… | 0.8816560 | 0.1183440 | 12% |
| 06 | California | 14392140 | 13475623 | 916517 | MULTIPOLYGON (((-118.6044 3… | 0.9363182 | 0.0636818 | 6% |
| 01 | Alabama | 2288330 | 2011947 | 276383 | MULTIPOLYGON (((-88.05338 3… | 0.8792207 | 0.1207793 | 12% |
Make a map with leaflet
use st_tranform() to project to WGS84 (WGS84 is required for leaflet)
- use epsg.io to find the EPSG code for any projection
# transform spatial data to WGS84 for leaflet
housing_units_4326 <- st_transform(housing_units, 4326)