Some quick sample maps to motivate the lesson!
(colors are different because they’re not capped)
I used a helper function i had written (in a package called censusrx) to pull the tract information very quickly. But this is built in turn on packages to quickly query data – including spatial data, such as area geometries, from the US Census Bureau.
This tutorial quickly goes over essential libraries for doing spatial work in R and gives an example workflow working with census data.
sf
is the key R library for spatial R. You’ll likely
always want to load whenever doing mapping + spatial analysis
For pulling in census geometries directly from R,
tigris
is extremely useful. tidycensus
is less
straightforward to use, but can pull census estimates for many social
attributes, as well as geometries. Resource: Reference book on working
with Census Bureau data in R
You might see references to sp
or other libraries on
stackoverflow and elsewhere–sp
provided some similar
functionalities but is rapidly getting displaced by sf
. I
recommend trying to avoid older libraries that are falling out of favor
and using sf
or compatible libraries whenever
possible.
For visualization:
ggplot2
in the tidyverse is great for static
visualization!
mapview
is great for interactive ones.
More polished or modifiable interactive maps can be made with
leaflet
.
The sf
library will underlie most spatial work in R.
Some quick notes:
All functions are prefixed with st_
sf objects are basically data.frames that include a geometry column. The geometry column will represent the spatial features associated with each row.
The geometry column will be “sticky.” To drop it, you’ll have to turn the sf into a tibble or other non-spatial data.frame. This is often useful because grouping an sf object will cause geometries to union. This can very useful, but it can also be an expensive operation that slows down non-spatial analysis.
Get population for the five boroughs of NYC!
This requires looking up the state/county FIPS codes we want (FIPS codes are identifiers used by the Census Bureau to identify areas like counties, census tracts, etc.)
# county FIPS codes for the 5 boroughs of NYC
nyboros <- c("085", "005", "047", "061", "081")
# query population data with tidycensus
boro.pops <-
tidycensus::get_acs(
geography = 'county'
,state = 'ny'
,county = nyboros
,year = 2021
,variables = c(pop = "B01001_001"), # just population
)
# i like column names in lowercase
boro.pops <- boro.pops %>%
rename_with(tolower)
# get geometries for the 5 boroughs
boro.geos <-
tigris::counties(
state = 36
,year = 2021
) %>%
rename_with(tolower) %>%
select(geoid, aland) %>%
filter(geoid %in% paste0('36', nyboros))
# merge and make it an sf object
boro.popsf <- boro.pops %>%
left_join(boro.geos) %>%
st_sf()
we can make it prettier by adding water, using better color palette, neatening the legend, and removing the gridded background.
# use tigris to pull water areas.
wtr <- tigris::area_water(state = 36, county = nyboros) %>%
rename_with(tolower)
wtr <- wtr %>% filter(aland >= quantile(wtr$aland, .98))
# define a bkgd color
bkgd.color <- 'grey90'
ggplot(boro.popsf) +
geom_sf(
aes(fill = estimate)
,color = NA
) +
# add a layer for water areas that matches background color.
geom_sf(data = wtr
,fill = bkgd.color
,color = bkgd.color) +
# change color palette and neaten labels.
scale_fill_viridis_c(
'Population'
,labels = scales::comma
,option = 'B'
) +
theme_void() +
theme(panel.background =
element_rect(fill = bkgd.color)
) +
labs(
caption = 'Source: 2021, American Community Survey, 5-year estimates.'
)
If you’re doing spatial data manipulation, you might run into some issues caused by the curvature of the earth. As you probably know, maps are projected different ways to represent geography on a flat surface. The projection method is called a coordinate reference system (CRS).
The curvature of the earth is not dealt with in all spatial R
libraries. The lwgeom
package can extend sf to allow
“geodetic” functions, which do account for the curved surface when
calculating things like distance or area. See here.
Sf
is also being extended to allow geodetic operations. See
here.
In general, you’ll have to transform your spatial datasets to the same CRS to work with them together. Some functions will give a message/warning/error based on whether it expects long/lat or planar geometries.
A CRS can store a lot of information, but common ones can just be represented by a 4+ digit number, called an EPSG code. You can use the espg 4326 for a common long-lat crs, used by google maps and elsewhere.
Changing CRS can also help your maps by shifting to appropriate projections for the area you’re mapping.
Example of how using geodetic operations vs planar operations (and a non-areal-preserving CRS) will yield different results:
library(lwgeom)
boro.geos %>%
st_transform(32618) %>%
mutate(calculated.area =
st_area(geometry)
) %>%
st_transform(4326) %>%
mutate(calculated.geodetic.area =
st_geod_area(geometry)
) %>%
tibble() %>%
select(-geometry) %>%
mutate(difference =
calculated.geodetic.area - calculated.area)
## # A tibble: 5 × 5
## geoid aland calculated.area calculated.geodetic.area difference
## <chr> <dbl> [m^2] [m^2] [m^2]
## 1 36085 148982680 266246093. 266424173. 178079.
## 2 36081 281594049 469774022. 470038471. 264450.
## 3 36047 179684484 250691422. 250843279. 151857.
## 4 36061 58683562 87640496. 87694321. 53825.
## 5 36005 109233466 148502118. 148586736. 84618.
The first calculated area has a UTM CRS projection (represented by the EPSG code 32618). This map projection doesn’t preserve area, and our area calculation function doesn’t know how to compensate.
When we change the CRS to a longitude/latitude projection (using the
EPSG code 4326) and use a geodetic function to calculate area, we get a
different answer (see the difference
column).
The second answer will be more accurate because it accounts for the curvature of the earth and doesn’t base calculations on an areal-distorted map projection.