Advanced GIS, Week 9

Outline

  • Assignment 3
  • Lab 3 overview & Part 2 deliverables
  • Intro to tidycensus
  • Intro to spatial data and sf
  • Create a map with decennial Census data
  • Create a map with ACS data
  • Lab 3 deliverables detail

Assignment 3


Research Proposal:

One PDF (3-5 pages) describing the research question, methods and expected results of the research project.

  • Due next week: Nov 1

Lab 3, Part 2 Deliverables


due November 8 (pushing this by one week so you have time for in-class help next week)

  • Deliverable 1 R script to process housing occupancy data for every state from the 2020 decennial Census and create a leaflet map in your R Studio viewer.
  • Deliverable 2 R script to:
    • Process median rent data for every census tract in Brooklyn from the 2016-20 5-year American Community Survey
    • Calculate which tract is affordable for 2 people with minimum wage jobs
    • Create two leaflet maps to display median rent and rent affordability for 2 people with minimum wage jobs
  • Resubmit any part of Lab, part 1 if you have corrections

tidycensus

Today we talk about how to use the tidycensus package to access Census data in R and prepare it for mapping and spatial analysis.

  • tidycensus:
    • delivers Census data to R users in a tidyverse-friendly way
      • Decennial Census data
      • American Community Survey data, 5-year and 1-year
      • geometry
    • it requires a Census API key: get it here.

lesson prep

  • install packages:
    • tidycensus: to import Census data
    • sf: to handle spatial data
    • scales: to format text
    • leaflet: to create interactive map viewer in R Studio
      • install.packages("package_name")
      • or click Install in the Packages window for a user-interface
  • create a script to install your Census API key
    • create new R script, save it in your scripts folder as tidycensus_API_key.R
census_api_key("put your census api key here", 
               install = TRUE,
               overwrite = TRUE)`
  • run the script
    • On your computer you only need to install the API key once
    • Pratt computers may require you run this every time you log on

tidycensus documentation

Create a new script


Create a new script to process housing occupancy data from the 2020 decennial Census (this will be Deliverable 1:

  • New R Script
  • Save it as housing_occupancy_2020.R in the main_data/scripts/data_processing folder
  • Add the packages you will use in the script at the top of the script:
library(tidyverse)
library(tidycensus)
library(sf)
library(leaflet)
library(scales)

list of decennial census variables


  • Create a data frame of all available variables in the 2020 decennial Census
  • Only data for redistricting is available so far (boo, covid delays)
    • The 2020 Census Redistricting Data (P.L. 94-171) Summary Files
    • called “pl” in tidycensus
# create table of all variables in the 2020 redistricting file
decennial_vars_2020 <- load_variables(2020, "pl", cache = TRUE)
  • cache = TRUE: saves information to your computer so that this table will load faster next time. (recommended)

load_variables documentation

  • in your console, type ??load_variables to see the help section

decennial census variables

name label concept
H1_001N !!Total: OCCUPANCY STATUS
H1_002N !!Total:!!Occupied OCCUPANCY STATUS
H1_003N !!Total:!!Vacant OCCUPANCY STATUS
P1_001N !!Total: RACE
P1_002N !!Total:!!Population of one race: RACE
P1_003N !!Total:!!Population of one race:!!White alone RACE

….(displaying the first 6 of 301 rows)


  • The returned data frame always has three columns:
    • name:refers to the Census variable ID
      • format = tableID_variableID
    • label: a descriptive data label for the variable
    • concept: refers to the topic of the data and often corresponds to a table of Census data.

Redistricting Race/Ethnicity data


There are 6 tables in the PL-2020 Census Redistricting Data:

  • P1. Race
  • P2. Hispanic or Latino, and Not Hispanic or Latino by Race
  • P3. Race for the Population 18 Years and Over
  • P4. Hispanic or Latino, and Not Hispanic or Latino by Race for the Population 18 Years and Over
  • P5. Group Quarters Population by Major Group Quarters Type
  • H1. Occupancy Status

Import Housing Units data

Use the get_decennial() function to create a data frame of housing units for every county in New York:

  • Look at the decennial_vars_2020 data frame to find the variable IDs for:
    • housing units, occupied housing units, vacant housing units
raw_housing_units <- get_decennial(geography = "state",
                             variables = c(housing_units = "H1_001N",
                                           occupied_units = "H1_002N", 
                                           vacant_units = "H1_003N"), 
                             year = 2020,
                             output = "wide",
                             geometry = TRUE)


GEOID NAME housing_units occupied_units vacant_units geometry
35 New Mexico 940859 829514 111345 MULTIPOLYGON (((-109.0502 3…
72 Puerto Rico 1598159 1340534 257625 MULTIPOLYGON (((-65.23805 1…
06 California 14392140 13475623 916517 MULTIPOLYGON (((-118.6044 3…
01 Alabama 2288330 2011947 276383 MULTIPOLYGON (((-88.05338 3…
13 Georgia 4410956 4020808 390148 MULTIPOLYGON (((-81.27939 3…

….(displaying the first 5 of 52 rows)

get_decennial explainer

Let’s break down the get_decennial() function

  • look it up in the help for more details: ??get_decennial
raw_housing_units <- get_decennial(geography = "state",
                             variables = c(housing_units = "H1_001N",
                                           occupied_units = "H1_002N", 
                                           vacant_units = "H1_003N"), 
                             year = 2020,
                             output = "wide",
                             geometry = TRUE)
  • geography: defines the geographic unit (find available geographies)
  • variables: list of variable IDs
    • we are renaming as we import with housing_units = "H1_001N"
  • year: defines the year (2000, 2010, 2020 are available for decennial)
  • output: wide returns a data frame where each variable is a column
    • the default option is tidy returns a data frame with unique unit-variable combinations
  • geometry: imports as a spatial data frame with a geometry column
    • *Note: the order of the parameters above doesn’t matter

Answer a question


The ACS questionnaire is sent to 3.5 million addresses each year.

  • What proportion of US households get the questionnaire?
  • How would we answer this question with our data frame?
  • acs_percent <- 3500000/sum(raw_housing_units$housing_units)

Calculate new variables

Create a new data frame and calculate two new variables in your dataframe:

  • use the filter() function to remove Alaska, Hawaii, and Puerto Rico
  • use the mutate() function to add the two new variables
    • percent vacant = occupied_units/housing_units
    • percent occupied = vacant_units/housing_units
    • percent_vacant_label = percent_vacant formatted with percent() function from scales package
housing_units <- raw_housing_units %>% 
 mutate(pct_occupied = occupied_units/housing_units,
        pct_vacant = vacant_units/housing_units,
        percent_vacant_label = percent(pct_vacant, accuracy = 1L)) %>% 
  filter(NAME != "Alaska",
         NAME != "Hawaii",
         NAME != "Puerto Rico")
  • so your data frame looks like this:
GEOID NAME housing_units occupied_units vacant_units geometry pct_occupied pct_vacant percent_vacant_label
35 New Mexico 940859 829514 111345 MULTIPOLYGON (((-109.0502 3… 0.8816560 0.1183440 12%
06 California 14392140 13475623 916517 MULTIPOLYGON (((-118.6044 3… 0.9363182 0.0636818 6%
01 Alabama 2288330 2011947 276383 MULTIPOLYGON (((-88.05338 3… 0.8792207 0.1207793 12%

Make a map with leaflet

use st_tranform() to project to WGS84 (WGS84 is required for leaflet)

  • use epsg.io to find the EPSG code for any projection
# transform spatial data to WGS84 for leaflet
housing_units_4326 <- st_transform(housing_units, 4326)
leaflet() %>%
  addProviderTiles(provider = "CartoDB.Positron") %>% 
  addPolygons(data = housing_units_4326,
              color = "#ababab",
              fillColor = "purple")

Leaflet explained

leaflet() %>%
  addProviderTiles(provider = "CartoDB.Positron") %>% 
  addPolygons(data = housing_units_4326,
              color = "#ababab",
              fillColor = "purple")
  • leaflet(): creates the map object
  • addProviderTiles(): adds the basemap layer
  • addPolygons(): adds the polygon layer with the following parameters:
    • data: the data frame to use
    • color: the color of the lines
    • fillColor: the color of the fill

Make a choropleth map

We’re going to make a choropleth map of the proportion of housing units that are vacant in each state.

Steps:

  • Create a histogram to see the range of the statistic
  • Define the color palette we’ll use
  • Add the color palette to the leaflet map
  • Add a popup to the map

Choropleth part 1: histogram

min(housing_units_4326$pct_vacant)
[1] 0.06368177
max(housing_units_4326$pct_vacant)
[1] 0.2119347
hist(housing_units_4326$pct_vacant)

Choropleth part 2: palette

  • We’ll use the max, min and histogram to define the breaks in the colors in our map
  • Then use the colorBin() function to define the palette
vacant_green_palette <- colorBin(palette = "Greens",
                          domain = housing_units_4326$pct_vacant,
                          bins = c(0, .05, .1, .15, .2, .25))
  • colorBin(): defines the palette used in a leaflet map:
    • palette(): defines a predefined gradient (see optional below for more information)
    • domain(): defines the variable to apply the palette to
    • bins(): defines the breaks in the palette - when the color changes
  • see colorBrewer color palettes
  • optional: see the Colors chapter of Leaflet for R to learn about:
    • the accepted names of color palettes
    • the 3 other color functions you can use

Choropleth part 3: map

leaflet() %>%
  addProviderTiles(provider = "CartoDB.Positron") %>% 
  addPolygons(data = housing_units_4326,
              color = "#ababab",
              fillColor = ~vacant_green_palette(pct_vacant),
              fillOpacity = .85,
              popup = paste0(housing_units_4326$NAME,
                             '<br/>', 'Percent Vacant: ', 
                             housing_units_4326$pct_vacant))

leaflet, learn more


This is as far as we’ll go with leaflet. To learn more, check out this simple tutorial and the other chapters in that book.

new functions in this lesson

  • census_api_key: (tidycensus) install api key for tidycensus (do this once)
  • load_variables(): (tidycensus) import dataframe of all variables in the Census
  • c(): combine values into a list
  • get_decennial(): (tidycensus) import data from decennial Census
  • get_acs(): (tidycensus) import data from ACS
  • st_transform(): (sf) project spatial dataframe to new projection
  • leaflet(): (leaflet) create interactive map, displayed in viewer
  • hist(): create a histogram of a variable
  • addProviderTiles(): (leaflet) add baselayer
  • addPolygons(): (leaflet) add polygon layer
  • colorBin(): (leaflet) define color palette for leaflet map based on a variable
  • paste0(): concatenate strings

Deliverable 1


Clean up the script you created above, housing_occupancy_2020.R.

Upload to CANVAS, Lab 3 Part 2.

Deliverable 2

  • Create a new script to create 2 maps using data from the American Community Survey:
    • median rent for every census tract in Brooklyn (or another county if you prefer)
    • affordability for two people working 40 hours/week making minimum wage
  • Save the script as Brooklyn_rent_affordability.R in the main_data/scripts/data_exploration folder
    • change the name if you’ll do the analysis in a different county

You will use the same steps as in the Percent Vacant example, using data from the 2016-20, 5-yr American Community Survey.

See the instructions in the following pages. When complete, submit your script on CANVAS.

D2: Load packages

library(tidyverse)
library(tidycensus)
library(sf)
library(leaflet)
library(scales)

D2: Download contract rent

  • Use the load_variables() function to see a list of all variables available in the 5-year ACS
    • acs201620 <- load_variables(2020, "acs5", cache = T)
  • View the acs201620 dataframe and search for “MEDIAN CONTRACT RENT” in the concept column
  • Note the variable ID for MEDIAN CONTRACT RENT
  • Use the get_acs() function to download MEDIAN CONTRACT RENT for every tract in Brooklyn
    • this will look similar the get_decennial function in the previous example with the following changes:
      • geography = "tract"
      • change the variable to contract rent
      • add two new parameters: state = "New York", "county = "Brooklyn,

D2: get_acs example

Your get_acs() should look like this, with the correct variable inserted

raw_contract_rent <- get_acs(geography = "tract",
  variables = c(contract_rent = "insert variable id here"),
  state = "New York",
  county = "Kings County",
  year = 2020,
  geometry = TRUE,
  output = "wide")
  • Notice that ACS data has 2 data columns
    • contract_rentE: this is the estimate
    • contract_rentM: this is the margin of error
  • For now, you can ignore the margin of error, but it is best practice to exclude areas with high margins of error from your analysis

D2: Calculate variables


Create 4 new columns using the mutate() function:

  • afforable_rent_2_people_min_wage = 2 * 40 * 15 * 4 * .3

    • this is a back-of-the-envelope calculation of how much 2 people with minimum wage jobs can afford to pay for rent
    • 2 people, working 40 hours at $15/hour for 4 weeks, allocating 30% of their income to rent
  • rent_affordability = afforable_rent_2_people_min_wage - contract_rentE

    • negative values indicate that the rent is NOT affordable
  • median_rent_label = dollar(contract_rentE)

  • rent_affordability_label = dollar(rent_affordability)

  • use st_transform() to project this spatial dataframe to WGS84

D2: Define palette

Define a palette for 2 maps:

  1. Choropleth of median rent for every census tract
  2. Choropleth of affordability for 2 people with minimum wage jobs
  • Look at the minimum, maximum, and histogram for each variable to determine the breaks to use in your palette for each map
    • Note: there are NA values in this data that we must explicitly ignore in the max and min calculations with na.rm()
      • ex. using the percent vacant data min(housing_units$pct_vacant, na.rm=TRUE)
  • Define the palette for your map with colorBin()
  • For the affordablity map, I recommend a diverging color palette like “RdBu”

D2: Create leaflet maps

  • Create two leaflet maps:
  1. Choropleth of median rent for every census tract
    • color by contract rent
    • create popup that displays the rent for each census tract
  2. Choropleth of affordability for 2 people with minimum wage jobs
    • color by rent affordability for 2 people with minimum wage jobs
    • create popup that displays the median rent for each census tract, and the difference between median rent and what 2 people with minimum wage jobs can afford

D2: Submit your script


When you are happy with your maps, clean up your script and upload it to CANVAS, Lab 3 Part 2.