tidycensus, tigris, & lehdrinstall.packages("tidyverse", dependencies = TRUE)
install.packages("sf", dependencies = TRUE)
install.packages("tigris", dependencies = TRUE)
install.packages("tidycensus", dependencies = TRUE)
install.packages("mapview", dependencies = TRUE)
install.packages("viridis", dependencies = TRUE)
library(tidyverse)
library(sf)
library(tigris)
library(tidycensus)
library(mapview)
library(viridis)
tidycensus GitHubWhat we do:
Learn how to download the dicennial US Census and American Community Survey (ACS) 5-year estimatess
Understand the simple feature class in R, sf.
Create basic choropleth (i.e., thematic) maps using the US Census boudary shapefiles.
Why we do:
The US Census provides rich information on neighborhoods, cities, and regions, which is critical to urban planning and any urban-related studies.
Various data sets from the US Census contain both geospatial and non-geospatial information, which we need to approach with care. (For differences between spatial, geographic, and geospatial, read this StackExchange Q & A thread. More thorough discussion may come in following weeks.)
tidycensus(original contents)To obtain your Census API key: http://api.census.gov/data/key_signup.html Then, insert your key to the current environment
census_api_key("YOUR API KEY GOES HERE")
two main functions are:
get_decennial(): access to 1990, 2000, and 2010 decennial US Census
get_acs(): 5-year Amercian Communicty Survey
m90 <- get_decennial(geography = "state", variables = "H043A001", year = 1990)
head(m90)
## # A tibble: 6 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 01 Alabama H043A001 325
## 2 02 Alaska H043A001 559
## 3 04 Arizona H043A001 438
## 4 05 Arkansas H043A001 328
## 5 06 California H043A001 620
## 6 08 Colorado H043A001 418
GEOID: state FIPS code
NAME: state name
variable: variable of interest: median gross rent
value: actual values
By defaluse,tidycensus returns tidy data frames, in which individual GEOID & variable combinations form unique keys. An alternative form is wide data frames with individual vairables in their own columns by output = "wide".
m90 %>%
ggplot(aes(x = value, y = reorder(NAME, value))) +
geom_point()
tidycensusCheck the list on Dr. Walker’s GitHub.
variableTo load a lookup table of variable names, use load_variables.
Two arguments here: year (e.g., 2017) and dataset (e.g., "sf1", "sf3", or "acs5")
var_acs17 <- load_variables(2017, "acs5", cache = TRUE)
#View(var_acs17)
head(var_acs17)
## # A tibble: 6 x 3
## name label concept
## <chr> <chr> <chr>
## 1 B00001_0~ Estimate!!Total UNWEIGHTED SAMPLE COUNT OF THE P~
## 2 B00002_0~ Estimate!!Total UNWEIGHTED SAMPLE HOUSING UNITS
## 3 B01001_0~ Estimate!!Total SEX BY AGE
## 4 B01001_0~ Estimate!!Total!!Male SEX BY AGE
## 5 B01001_0~ Estimate!!Total!!Male!!Under~ SEX BY AGE
## 6 B01001_0~ Estimate!!Total!!Male!!5 to ~ SEX BY AGE
Compare two outputs below:
For all census tracts in Fulton County, GA
ga1: one varibale in a tidy data frame
ga2: two variables in a wide data frame
Note that ACS contains estimates from 1% sample of the US population. For each variable, ACS returns both an estimate and a margin of errors (By default moe_level is 90%). Check how these two values are stored differently by the output option.
ga1 <- get_acs(geography = "tract", year = 2017,
variables = c(medincome = "B19013_001"),
state = "GA",
county = "Fulton")
head(ga1)
## # A tibble: 6 x 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 13121000100 Census Tract 1, Fulton County, Georg~ medinco~ 149450 23192
## 2 13121000200 Census Tract 2, Fulton County, Georg~ medinco~ 116094 18910
## 3 13121000400 Census Tract 4, Fulton County, Georg~ medinco~ 91857 21546
## 4 13121000500 Census Tract 5, Fulton County, Georg~ medinco~ 85750 23402
## 5 13121000600 Census Tract 6, Fulton County, Georg~ medinco~ 50969 8609
## 6 13121000700 Census Tract 7, Fulton County, Georg~ medinco~ 71214 10372
ga2 <- get_acs(geography = "tract", year = 2017,
variables = c(medincome = "B19013_001",
medhomevalue = "B25077_001"),
state = "GA",
county = "Fulton", output = "wide")
head(ga2)
## # A tibble: 6 x 6
## GEOID NAME medincomeE medincomeM medhomevalueE medhomevalueM
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 131210~ Census Tract 1~ 42159 9769 136500 20463
## 2 131210~ Census Tract 3~ 84350 23612 228700 20385
## 3 131210~ Census Tract 3~ 41372 8278 175900 25191
## 4 131210~ Census Tract 6~ NA NA NA NA
## 5 131210~ Census Tract 8~ 95524 17032 285100 32455
## 6 131210~ Census Tract 1~ 30819 5281 92800 39929
Let’s create two plots, each preseting census tracts with either the highest or lowest median household incomes.
ga1 %>%
mutate(NAME = gsub(", Fulton County, Georgia", "", NAME), # clean the name field
rank = min_rank(desc(estimate))) %>% # create a rank variable in descending order
filter(rank<=10) %>% # keep top 10 wealthiest tracts
ggplot(aes(x = estimate, y = reorder(NAME, estimate))) +
geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) +
geom_point(color = "red", size = 3) +
labs(title = "Household income by tract in Fulton County, Georgia",
subtitle = "2013-2017 American Community Survey",
y = "",
x = "ACS estimate (bars represent margin of error)")
ga1 %>%
mutate(NAME = gsub(", Fulton County, Georgia", "", NAME),
rank = min_rank(estimate)) %>% # create a rank variable in descending order
filter(rank<=10) %>% # keep bottom 10 least-wealthy tracts
ggplot(aes(x = estimate, y = reorder(NAME, estimate))) +
geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) +
geom_point(color = "red", size = 3) +
labs(title = "Household income by tract in Fulton County, Georgia",
subtitle = "2013-2017 American Community Survey",
y = "",
x = "ACS estimate (bars represent margin of error)")
tidycensus(original contents)Simple feature geometry, or sf: A relatively new R class that handles geospatial information along with non-geospatial information, compatible with tidyverse functions. In tidycensus, geometry = TRUE returns simple feature geometry for geographic units with variables.
options(tigris_use_cache = TRUE) # to cache downloaded data for future use
# to setp up your own cache driver, run the following line and restart R
# tigris_cache_dir("PATH TO YOUR CUSTOM CACHE DIRECTORY")
fulton <- get_acs(state = "GA", county = "Fulton", geography = "tract",
variables = "B19013_001", year = 2017, geometry = TRUE) # median household income
head(fulton)
## Simple feature collection with 6 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -84.43234 ymin: 33.77138 xmax: -84.34816 ymax: 33.81313
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## GEOID NAME variable estimate
## 1 13121000100 Census Tract 1, Fulton County, Georgia B19013_001 149450
## 2 13121000200 Census Tract 2, Fulton County, Georgia B19013_001 116094
## 3 13121000400 Census Tract 4, Fulton County, Georgia B19013_001 91857
## 4 13121000500 Census Tract 5, Fulton County, Georgia B19013_001 85750
## 5 13121000600 Census Tract 6, Fulton County, Georgia B19013_001 50969
## 6 13121000700 Census Tract 7, Fulton County, Georgia B19013_001 71214
## moe geometry
## 1 23192 MULTIPOLYGON (((-84.36737 3...
## 2 18910 MULTIPOLYGON (((-84.37421 3...
## 3 21546 MULTIPOLYGON (((-84.38757 3...
## 4 23402 MULTIPOLYGON (((-84.40086 3...
## 5 8609 MULTIPOLYGON (((-84.41522 3...
## 6 10372 MULTIPOLYGON (((-84.43234 3...
The main difference is geometry, a list column. It contains geographic information (e.g., geometry type and points) for individual observations, or census tracts here. It comes from the US Census Cartographic Boundary Shapefiles, which is a low-resolution version of the TIGER Shapefiles and helps faster processing (By default, ‘cb = TRUE’).
# for now, this is "un"projected
fulton %>%
ggplot(aes(fill = estimate, color = estimate)) +
geom_sf() + # geom_sf helps plot sf objects on a 2D plane.
#coord_sf(crs = 26911) + # coordinate reference system, more details next week
scale_fill_viridis(option = "magma", direction = -1) +
scale_color_viridis(option = "magma", direction = -1)
racevars <- c(White = "P005003",
Black = "P005004",
Asian = "P005006",
Hispanic = "P004003")
gwinnett <- get_decennial(geography = "tract", variables = racevars, # By default, 2010 dicennial census
state = "GA", county = "Gwinnett County", geometry = TRUE,
summary_var = "P001001") # this helps calculate percent
head(gwinnett)
## Simple feature collection with 6 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -84.11389 ymin: 34.0433 xmax: -83.94986 ymax: 34.16787
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 6 x 6
## GEOID NAME variable value summary_value geometry
## <chr> <chr> <chr> <dbl> <dbl> <MULTIPOLYGON [°]>
## 1 13135~ Census ~ White 6021 9882 (((-84.04025 34.11132, -84.~
## 2 13135~ Census ~ White 1725 3651 (((-84.02758 34.10506, -84.~
## 3 13135~ Census ~ White 5477 9513 (((-83.97996 34.08514, -83.~
## 4 13135~ Census ~ White 4543 6798 (((-84.0585 34.12659, -84.0~
## 5 13135~ Census ~ White 6809 9660 (((-84.04206 34.11135, -84.~
## 6 13135~ Census ~ White 3873 7056 (((-84.02768 34.10508, -84.~
gwinnett %>%
mutate(pct = 100 * (value / summary_value)) %>% # compute the percent of individual racial groups
ggplot(aes(fill = pct)) +
facet_wrap(~variable) + # create faceted plots by race
geom_sf(color = "white", size = 0.1) +
#coord_sf(crs = 26915) +
scale_fill_viridis(direction = -1) #+
#scale_color_viridis(direction = -1) # make the boundary color the same as that of fill
Next, we will compare the difference between the Cartographic Boundary SHapefile and TIGER/Line shapefile with mapview, which allows interactive viewing of spatial data in R (FYI, visit its GitHub).
ny <- get_acs(geography = "tract",
variables = "B19013_001",
state = "NY",
county = "New York",
geometry = TRUE)
mapview(ny, zcol = "estimate", legend = TRUE)
To download a TIGER/Line shapefile, add cb = FALSE.
st_difference: Geocomputation with R
area_water: tigris documentation)
ny2 <- get_acs(geography = "tract",
variables = "B19013_001",
state = "NY",
county = "New York",
geometry = TRUE,
cb = FALSE)
st_erase <- function(x, y) {
st_difference(x, st_union(st_combine(y))) # clipping x intersection (y)^(complement)
}
ny_water <- area_water("NY", "New York", class = "sf") # Download an area water shapefile into R
ny_water
## Simple feature collection with 13 features and 8 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -74.03457 ymin: 40.67955 xmax: -73.9086 ymax: 40.88221
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## First 10 features:
## ANSICODE HYDROID FULLNAME MTFCC ALAND AWATER
## 1 <NA> 110460984157 The Pond H2030 0 17866
## 2 <NA> 1102217391354 Upper New York Bay H2051 0 4732539
## 3 <NA> 110460984158 The Pool H2030 0 8634
## 4 <NA> 110460984167 Harlem Mere H2030 0 40588
## 5 <NA> 1105045104832 East Riv H3010 0 7968623
## 6 <NA> 1103717081254 Hudson Riv H3010 0 13240545
## 7 <NA> 110460984156 The Lake H2030 0 87276
## 8 <NA> 110460984195 Conservatory Water H2030 0 7813
## 9 <NA> 110460984159 Belvedere Lk H2030 0 4330
## 10 <NA> 110460984215 The Reservoir H2040 0 400599
## INTPTLAT INTPTLON geometry
## 1 +40.7657278 -73.9741203 POLYGON ((-73.97588 40.7665...
## 2 +40.6940748 -74.0153463 POLYGON ((-74.03294 40.6877...
## 3 +40.7946696 -73.9607674 POLYGON ((-73.96182 40.7947...
## 4 +40.7965616 -73.9519770 POLYGON ((-73.95429 40.797,...
## 5 +40.7516105 -73.9547365 POLYGON ((-73.99818 40.7084...
## 6 +40.7965246 -73.9799445 POLYGON ((-74.02495 40.7072...
## 7 +40.7763985 -73.9717336 POLYGON ((-73.97359 40.7771...
## 8 +40.7742073 -73.9671237 POLYGON ((-73.96765 40.7739...
## 9 +40.7796257 -73.9681243 POLYGON ((-73.9689 40.77979...
## 10 +40.7858797 -73.9620676 POLYGON ((-73.96677 40.7850...
ggplot(ny_water, aes(fill = AWATER)) +
geom_sf()
ny_erase <- st_erase(ny2, ny_water)
mapview(ny_erase, zcol = "estimate", legend = TRUE)