Disclaimer: Most contents here are borrowed from Dr. Kyle Walker’s tidycensus GitHub webpages. This document is prepared for CP6521 Advanced GIS, a graduate-level city planning elective course at Georgia Tech in Spring 2019. For any question, contact the instructor, Yongsung Lee, Ph.D. via yongsung.lee(at)gatech.edu.
This document is also published on RPubs.
install.packages("tidyverse", dependencies = TRUE)
install.packages("sf", dependencies = TRUE)
install.packages("tigris", dependencies = TRUE)
install.packages("tidycensus", dependencies = TRUE)
install.packages("mapview", dependencies = TRUE)
install.packages("viridis", dependencies = TRUE)
library(tidyverse)
library(sf)
library(tigris)
library(tidycensus)
library(mapview)
library(viridis)
The below example is borrowed from tidycensus GitHub

1. Intro

What we do:

  1. Learn how to download the dicennial US Census and American Community Survey (ACS) 5-year estimatess

  2. Understand the simple feature class in R, sf.

  3. Create basic choropleth (i.e., thematic) maps using the US Census boudary shapefiles.

Why we do:

  1. The US Census provides rich information on neighborhoods, cities, and regions, which is critical to urban planning and any urban-related studies.

  2. Various data sets from the US Census contain both geospatial and non-geospatial information, which we need to approach with care. (For differences between spatial, geographic, and geospatial, read this StackExchange Q & A thread. More thorough discussion may come in following weeks.)

2. Basic Usage of tidycensus(original contents)

To obtain your Census API key: http://api.census.gov/data/key_signup.html Then, insert your key to the current environment

census_api_key("YOUR API KEY GOES HERE")

two main functions are:

get_decennial(): access to 1990, 2000, and 2010 decennial US Census

get_acs(): 5-year Amercian Communicty Survey

m90 <- get_decennial(geography = "state", variables = "H043A001", year = 1990)

head(m90)
## # A tibble: 6 x 4
##   GEOID NAME       variable value
##   <chr> <chr>      <chr>    <dbl>
## 1 01    Alabama    H043A001   325
## 2 02    Alaska     H043A001   559
## 3 04    Arizona    H043A001   438
## 4 05    Arkansas   H043A001   328
## 5 06    California H043A001   620
## 6 08    Colorado   H043A001   418

GEOID: state FIPS code

NAME: state name

variable: variable of interest: median gross rent

value: actual values

By defaluse,tidycensus returns tidy data frames, in which individual GEOID & variable combinations form unique keys. An alternative form is wide data frames with individual vairables in their own columns by output = "wide".

m90 %>%
  ggplot(aes(x = value, y = reorder(NAME, value))) + 
  geom_point()

Geography in tidycensus

Check the list on Dr. Walker’s GitHub.

Searching for variable

To load a lookup table of variable names, use load_variables.

Two arguments here: year (e.g., 2017) and dataset (e.g., "sf1", "sf3", or "acs5")

var_acs17 <- load_variables(2017, "acs5", cache = TRUE) 

#View(var_acs17)
head(var_acs17)
## # A tibble: 6 x 3
##   name      label                         concept                          
##   <chr>     <chr>                         <chr>                            
## 1 B00001_0~ Estimate!!Total               UNWEIGHTED SAMPLE COUNT OF THE P~
## 2 B00002_0~ Estimate!!Total               UNWEIGHTED SAMPLE HOUSING UNITS  
## 3 B01001_0~ Estimate!!Total               SEX BY AGE                       
## 4 B01001_0~ Estimate!!Total!!Male         SEX BY AGE                       
## 5 B01001_0~ Estimate!!Total!!Male!!Under~ SEX BY AGE                       
## 6 B01001_0~ Estimate!!Total!!Male!!5 to ~ SEX BY AGE
Working with ACS data

Compare two outputs below:

For all census tracts in Fulton County, GA

ga1: one varibale in a tidy data frame

ga2: two variables in a wide data frame

Note that ACS contains estimates from 1% sample of the US population. For each variable, ACS returns both an estimate and a margin of errors (By default moe_level is 90%). Check how these two values are stored differently by the output option.

ga1 <- get_acs(geography = "tract", year = 2017, 
              variables = c(medincome = "B19013_001"), 
              state = "GA", 
              county = "Fulton")

head(ga1)
## # A tibble: 6 x 5
##   GEOID       NAME                                  variable estimate   moe
##   <chr>       <chr>                                 <chr>       <dbl> <dbl>
## 1 13121000100 Census Tract 1, Fulton County, Georg~ medinco~   149450 23192
## 2 13121000200 Census Tract 2, Fulton County, Georg~ medinco~   116094 18910
## 3 13121000400 Census Tract 4, Fulton County, Georg~ medinco~    91857 21546
## 4 13121000500 Census Tract 5, Fulton County, Georg~ medinco~    85750 23402
## 5 13121000600 Census Tract 6, Fulton County, Georg~ medinco~    50969  8609
## 6 13121000700 Census Tract 7, Fulton County, Georg~ medinco~    71214 10372
ga2 <- get_acs(geography = "tract", year = 2017, 
              variables = c(medincome = "B19013_001", 
                            medhomevalue = "B25077_001"), 
              state = "GA", 
              county = "Fulton", output = "wide")

head(ga2)
## # A tibble: 6 x 6
##   GEOID   NAME            medincomeE medincomeM medhomevalueE medhomevalueM
##   <chr>   <chr>                <dbl>      <dbl>         <dbl>         <dbl>
## 1 131210~ Census Tract 1~      42159       9769        136500         20463
## 2 131210~ Census Tract 3~      84350      23612        228700         20385
## 3 131210~ Census Tract 3~      41372       8278        175900         25191
## 4 131210~ Census Tract 6~         NA         NA            NA            NA
## 5 131210~ Census Tract 8~      95524      17032        285100         32455
## 6 131210~ Census Tract 1~      30819       5281         92800         39929

Let’s create two plots, each preseting census tracts with either the highest or lowest median household incomes.

ga1 %>%
  mutate(NAME = gsub(", Fulton County, Georgia", "", NAME),             # clean the name field 
         rank = min_rank(desc(estimate))) %>%                           # create a rank variable in descending order 
  filter(rank<=10) %>%                                                  # keep top 10 wealthiest tracts  
  ggplot(aes(x = estimate, y = reorder(NAME, estimate))) +
  geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) +
  geom_point(color = "red", size = 3) +
  labs(title = "Household income by tract in Fulton County, Georgia",
       subtitle = "2013-2017 American Community Survey",
       y = "",
       x = "ACS estimate (bars represent margin of error)")

ga1 %>%
  mutate(NAME = gsub(", Fulton County, Georgia", "", NAME), 
         rank = min_rank(estimate)) %>%                                 # create a rank variable in descending order 
  filter(rank<=10) %>%                                                  # keep bottom 10 least-wealthy tracts  
  ggplot(aes(x = estimate, y = reorder(NAME, estimate))) +
  geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) +
  geom_point(color = "red", size = 3) +
  labs(title = "Household income by tract in Fulton County, Georgia",
       subtitle = "2013-2017 American Community Survey",
       y = "",
       x = "ACS estimate (bars represent margin of error)")

3. Spatial data in tidycensus(original contents)

Simple feature geometry, or sf: A relatively new R class that handles geospatial information along with non-geospatial information, compatible with tidyverse functions. In tidycensus, geometry = TRUE returns simple feature geometry for geographic units with variables.

options(tigris_use_cache = TRUE) # to cache downloaded data for future use 
# to setp up your own cache driver, run the following line and restart R
# tigris_cache_dir("PATH TO YOUR CUSTOM CACHE DIRECTORY") 

fulton <- get_acs(state = "GA", county = "Fulton", geography = "tract", 
                  variables = "B19013_001", year = 2017, geometry = TRUE)  # median household income 

head(fulton)
## Simple feature collection with 6 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.43234 ymin: 33.77138 xmax: -84.34816 ymax: 33.81313
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
##         GEOID                                   NAME   variable estimate
## 1 13121000100 Census Tract 1, Fulton County, Georgia B19013_001   149450
## 2 13121000200 Census Tract 2, Fulton County, Georgia B19013_001   116094
## 3 13121000400 Census Tract 4, Fulton County, Georgia B19013_001    91857
## 4 13121000500 Census Tract 5, Fulton County, Georgia B19013_001    85750
## 5 13121000600 Census Tract 6, Fulton County, Georgia B19013_001    50969
## 6 13121000700 Census Tract 7, Fulton County, Georgia B19013_001    71214
##     moe                       geometry
## 1 23192 MULTIPOLYGON (((-84.36737 3...
## 2 18910 MULTIPOLYGON (((-84.37421 3...
## 3 21546 MULTIPOLYGON (((-84.38757 3...
## 4 23402 MULTIPOLYGON (((-84.40086 3...
## 5  8609 MULTIPOLYGON (((-84.41522 3...
## 6 10372 MULTIPOLYGON (((-84.43234 3...

The main difference is geometry, a list column. It contains geographic information (e.g., geometry type and points) for individual observations, or census tracts here. It comes from the US Census Cartographic Boundary Shapefiles, which is a low-resolution version of the TIGER Shapefiles and helps faster processing (By default, ‘cb = TRUE’).

# for now, this is "un"projected 
fulton %>%
  ggplot(aes(fill = estimate, color = estimate)) + 
  geom_sf() +                                        # geom_sf helps plot sf objects on a 2D plane.    
  #coord_sf(crs = 26911) +                           # coordinate reference system, more details next week
  scale_fill_viridis(option = "magma", direction = -1) + 
  scale_color_viridis(option = "magma", direction = -1)

racevars <- c(White = "P005003", 
              Black = "P005004", 
              Asian = "P005006", 
              Hispanic = "P004003")

gwinnett <- get_decennial(geography = "tract", variables = racevars,           # By default, 2010 dicennial census  
                  state = "GA", county = "Gwinnett County", geometry = TRUE,   
                  summary_var = "P001001")                                     # this helps calculate percent 

head(gwinnett)
## Simple feature collection with 6 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.11389 ymin: 34.0433 xmax: -83.94986 ymax: 34.16787
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 6 x 6
##   GEOID  NAME     variable value summary_value                     geometry
##   <chr>  <chr>    <chr>    <dbl>         <dbl>          <MULTIPOLYGON [°]>
## 1 13135~ Census ~ White     6021          9882 (((-84.04025 34.11132, -84.~
## 2 13135~ Census ~ White     1725          3651 (((-84.02758 34.10506, -84.~
## 3 13135~ Census ~ White     5477          9513 (((-83.97996 34.08514, -83.~
## 4 13135~ Census ~ White     4543          6798 (((-84.0585 34.12659, -84.0~
## 5 13135~ Census ~ White     6809          9660 (((-84.04206 34.11135, -84.~
## 6 13135~ Census ~ White     3873          7056 (((-84.02768 34.10508, -84.~
gwinnett %>%
  mutate(pct = 100 * (value / summary_value)) %>%         # compute the percent of individual racial groups 
  ggplot(aes(fill = pct)) +
  facet_wrap(~variable) +                                 # create faceted plots by race 
  geom_sf(color = "white", size = 0.1) +
  #coord_sf(crs = 26915) + 
  scale_fill_viridis(direction = -1) #+

  #scale_color_viridis(direction = -1)                    # make the boundary color the same as that of fill  

Next, we will compare the difference between the Cartographic Boundary SHapefile and TIGER/Line shapefile with mapview, which allows interactive viewing of spatial data in R (FYI, visit its GitHub).

ny <- get_acs(geography = "tract", 
              variables = "B19013_001", 
              state = "NY", 
              county = "New York", 
              geometry = TRUE)

mapview(ny, zcol = "estimate", legend = TRUE)

To download a TIGER/Line shapefile, add cb = FALSE.

ny2 <- get_acs(geography = "tract", 
              variables = "B19013_001", 
              state = "NY", 
              county = "New York", 
              geometry = TRUE, 
              cb = FALSE)

st_erase <- function(x, y) {
  st_difference(x, st_union(st_combine(y)))            # clipping x intersection (y)^(complement) 
}

ny_water <- area_water("NY", "New York", class = "sf") # Download an area water shapefile into R

ny_water
## Simple feature collection with 13 features and 8 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -74.03457 ymin: 40.67955 xmax: -73.9086 ymax: 40.88221
## epsg (SRID):    4269
## proj4string:    +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
##    ANSICODE       HYDROID           FULLNAME MTFCC ALAND   AWATER
## 1      <NA>  110460984157           The Pond H2030     0    17866
## 2      <NA> 1102217391354 Upper New York Bay H2051     0  4732539
## 3      <NA>  110460984158           The Pool H2030     0     8634
## 4      <NA>  110460984167        Harlem Mere H2030     0    40588
## 5      <NA> 1105045104832           East Riv H3010     0  7968623
## 6      <NA> 1103717081254         Hudson Riv H3010     0 13240545
## 7      <NA>  110460984156           The Lake H2030     0    87276
## 8      <NA>  110460984195 Conservatory Water H2030     0     7813
## 9      <NA>  110460984159       Belvedere Lk H2030     0     4330
## 10     <NA>  110460984215      The Reservoir H2040     0   400599
##       INTPTLAT    INTPTLON                       geometry
## 1  +40.7657278 -73.9741203 POLYGON ((-73.97588 40.7665...
## 2  +40.6940748 -74.0153463 POLYGON ((-74.03294 40.6877...
## 3  +40.7946696 -73.9607674 POLYGON ((-73.96182 40.7947...
## 4  +40.7965616 -73.9519770 POLYGON ((-73.95429 40.797,...
## 5  +40.7516105 -73.9547365 POLYGON ((-73.99818 40.7084...
## 6  +40.7965246 -73.9799445 POLYGON ((-74.02495 40.7072...
## 7  +40.7763985 -73.9717336 POLYGON ((-73.97359 40.7771...
## 8  +40.7742073 -73.9671237 POLYGON ((-73.96765 40.7739...
## 9  +40.7796257 -73.9681243 POLYGON ((-73.9689 40.77979...
## 10 +40.7858797 -73.9620676 POLYGON ((-73.96677 40.7850...
ggplot(ny_water, aes(fill = AWATER)) +
  geom_sf()

ny_erase <- st_erase(ny2, ny_water)

mapview(ny_erase, zcol = "estimate", legend = TRUE)