This introduces the database setup we're using to store and access some of our larger datasets.
We have a database on a Princeton server to store, share, and access data. This setup has some advantages over just sharing data through email or keeping copies on our local machines.
Separate from the Princeton setup, I also have a working database hosted on an Amazon server. This is an option if we want to depart from Princeton's institutional resources, and it would have the additional advantages of letting us link the database to more interactive visualizations and potentially letting us share versions of our datasets more widely.
To connect to the Princeton database, you will first have to get set up with the Princeton Research Computing folks. You will need a username and password, and you'll need to keep a VPN connection to the Princeton network open. Email them to get that set up:
https://researchcomputing.princeton.edu/about/contact/ask-research-computing
After that, you’re ready to pull and write data to the database. I wrote a short R library called ‘dblinkr’ that contains some helper functions for this purpose. This tutorial uses some of these convenience functions.
First, we can use a helper function from this library to connect to the Princeton server. Save your connection in your R environment.
library(dblinkr)
# set usr and pw to your username and password!
con <- princeton.db.connect(usr, pw)
## Loading required package: DBI
## Loading required package: RPostgreSQL
## connecting with Postgres driver
We’re in! The first thing we might want to do is check what’s there.
We can keep our tables organized into different “schema,” which are basically organizational subdivisions for our database.
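The listing below came from a helper call that isn't shown here; something along the lines of DBI's dbListObjects (my assumption, not necessarily the original command) produces this kind of output:

# list every schema and top-level table visible on the connection
DBI::dbListObjects(con)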
## table is_prefix
## 1 <Id> table = corelogic_sample FALSE
## 2 <Id> table = geography_columns FALSE
## 3 <Id> table = geometry_columns FALSE
## 4 <Id> table = spatial_ref_sys FALSE
## 5 <Id> table = raster_columns FALSE
## 6 <Id> table = raster_overviews FALSE
## 7 <Id> schema = pg_catalog TRUE
## 8 <Id> schema = information_schema TRUE
## 9 <Id> schema = divs TRUE
## 10 <Id> schema = regions_simplified TRUE
## 11 <Id> schema = xwalks TRUE
## 12 <Id> schema = regions TRUE
## 13 <Id> schema = public TRUE
## 14 <Id> schema = attr TRUE
Some of the above are artifacts from setting up PostGIS; the others are schema containing tables that I’ve already loaded in.
My tables so far are divided between “divs,” which holds spatial divisions (like highways, rivers, historical redlining, etc.); “regions,” which holds study areas (commuting zones, counties, census tracts, etc.); and “attr,” which stores non-spatial, attribute information.
We can see what’s in each schema with a convenience function from my dblinkr package:
## table is_prefix
## 1 <Id> schema = divs, table = redlining FALSE
## 2 <Id> schema = divs, table = rails_bts FALSE
## 3 <Id> schema = divs, table = water_tigris FALSE
## 4 <Id> schema = divs, table = hwys FALSE
## 5 <Id> schema = divs, table = places FALSE
## 6 <Id> schema = divs, table = hwyPlan1947 FALSE
Here we can see the tables in the “divs” schema. It’ll likely make sense to add more schema as we work to keep all the components of the project organized.
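If you'd rather skip the dblinkr helper, a plain-DBI sketch of the same listing (my own, not part of dblinkr) looks like:

# list the tables under one schema by passing a prefix
DBI::dbListObjects(con, prefix = DBI::Id(schema = "divs"))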
You can start working with what’s there by using R, SQL, or a combination. I know very minimal SQL, and it’s not necessary in order to use these things, but learning enough to write a line here or there can be convenient in some cases.
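For example, a one-off query can be sent straight through DBI (a quick sketch; attr.cts is the census-tract table used below):

# count the rows in the census-tract attribute table with raw SQL
DBI::dbGetQuery(con, "select count(*) from attr.cts;")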
First, let’s just pull in a whole table. This uses ‘dbplyr,’ a library that translates tidyverse R code into SQL. Our code will change slightly depending on whether or not we’re reading in spatial information.
The ‘dbplyr’ vignette can be found here: https://dbplyr.tidyverse.org/articles/dbplyr.html
# First, let's read using dbplyr:
suppressMessages(library(dplyr))
suppressMessages(library(dbplyr))
cts <- tbl(con, in_schema("attr", "cts"))
colnames(cts)[1:10]
## [1] "state" "county" "tract" "geoid" "cz"
## [6] "czname" "cbsa_id" "county_id" "row_id" "population"
# Now we can see the data, but it's still a ~reference~ to the database table,
# not a df or tibble in our R environment:
class(cts)
## [1] "tbl_PqConnection" "tbl_dbi" "tbl_sql" "tbl_lazy"
## [5] "tbl"
# It's generally a good idea to pull things into local memory as soon as they're
# down to a manageable size. Database references will be harder to
# work with. This is done as:
local.cts <- collect(cts)
class(local.cts)
## [1] "tbl_df" "tbl" "data.frame"
# You can also use dplyr to generate sql queries:
czpop <- cts %>% group_by(cz, czname) %>% summarise("commuting zone population" = sum(population, na.rm = T))
czpop %>% head()
## # Source: lazy query [?? x 3]
## # Database: postgres [km31@carto.princeton.edu:5432/shp]
## # Groups: cz
## cz czname `commuting zone population`
## <int> <chr> <int64>
## 1 NA <NA> 782195
## 2 7500 Deltona 664653
## 3 11101 Montgomery 395483
## 4 33400 Longview 314342
## 5 12902 Mount Sterling 44396
## 6 35300 Farmington 194161
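# The generated SQL below can be printed with dbplyr's show_query(); I'm
# assuming that's how it was produced here, since the original chunk isn't shown.
czpop %>% show_query()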
## <SQL>
## SELECT "cz", "czname", SUM("population") AS "commuting zone population"
## FROM attr.cts
## GROUP BY "cz", "czname"
One likely use case is pulling from an extremely large table that won’t fit into local memory all at once. We can pull in subsamples for exploratory analysis or iterate through identifiers to process the full dataset piece by piece (a sketch of the latter comes after the example below):
# Pull in just census-tracts for Philadelphia
phl.cts <- cts %>% filter(czname == "Philadelphia") %>% collect()
phl.cts %>% select(geoid, cz, czname, population) %>% head()
## # A tibble: 6 x 4
## geoid cz czname population
## <chr> <int> <chr> <int>
## 1 G3400010000100 19700 Philadelphia 2417
## 2 G3400010000200 19700 Philadelphia 3202
## 3 G3400010000300 19700 Philadelphia 4373
## 4 G3400010000400 19700 Philadelphia 2606
## 5 G3400010000500 19700 Philadelphia 3503
## 6 G3400010001100 19700 Philadelphia 2027
This will likely be more useful with, for example, the 380GB SafeGraph data!
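For the truly large case, one pattern (a sketch; it assumes czname is a reasonable identifier to chunk on) is to iterate over identifiers and only ever collect one piece at a time:

# pull the distinct commuting-zone names, then process one zone at a time
cz_names <- cts %>% distinct(czname) %>% collect() %>% pull(czname)
for (nm in cz_names) {
  chunk <- cts %>% filter(czname == nm) %>% collect()
  # ...summarise or write out `chunk` here, then move on to the next zone...
}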
I’ve mostly been using this setup for spatial data so far. The ‘dblinkr’ library referenced above has some helper functions for spatially subsetting from the database. dbplyr can’t do spatial subsets and doesn’t handle geometries well, so we switch to the sf library or the dblinkr helpers for this.
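The sf startup message below comes from attaching the library:

library(sf)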
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
## table is_prefix
## 1 <Id> schema = regions, table = cts2010 FALSE
## 2 <Id> schema = regions, table = states FALSE
## 3 <Id> schema = regions, table = cbsa FALSE
## 4 <Id> schema = regions, table = counties FALSE
## 5 <Id> schema = regions, table = czs_1990 FALSE
# use sf library to load a subset. SQL is used for the 'query' argument.
counties <- st_read(con, query = "select * from regions.counties where statefp='42';")
phl <- counties %>% filter(grepl("Philadelphia", name))
# use this region to subset from other, more memory-intensive spatial datasets.
# The helper function below gets all rows of a spatial table that fall within
# the area bounded by the supplied region.
phwys <- dblinkr::query.division(con, region = phl, "divs.hwys")
prl <- dblinkr::query.division(con, region = phl, "divs.redlining")
# make a cool map?
library(ggplot2)
ggplot() +
geom_sf(data = prl, aes(fill= holc_grade), color = "white") +
geom_sf(data = st_crop(phwys, prl), color = "black", size = 1) +
theme_void() +
labs(title = "hwys and redlining in philly")
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
(You can also supply SQL to subset a shapefile even if you’re reading just from a file, rather than from a database.)
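For instance, a sketch against a hypothetical local shapefile (the file and layer names here are made up; in OGR SQL the layer name is the file name without its extension):

pa_counties <- st_read("counties.shp", query = "select * from counties where statefp = '42'")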
As a final quick note, you can download pgAdmin to get a GUI for all of this. It can be really useful for checking that things are going as expected and for getting more of a visual view of the database. It was surprisingly painless for me to set it up and point it at the Princeton database.
As I mentioned, I’m not an expert in this stuff either, although I have found it useful and workable. Here are a couple of quirks and tips from working with SQL from R:

- Try to avoid capitalizing schema or table names when you upload them to the database. Capitalized names have to be wrapped in double quotes in queries, which lowercase names don’t, so it’s easier to keep everything lowercase.
- Writing spatial data to the database using sf::st_write is straightforward, except that the CRS doesn’t seem to transfer. My workaround has been to always set the CRS to 4326 before uploading so that everything is standard (see the sketch below). I’m not sure there is a better approach, but if you’re uploading spatial data, be aware of this.
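A minimal sketch of that upload pattern, assuming a hypothetical sf object my_sf and table name my_new_table:

# hypothetical upload: standardize to EPSG:4326 first, since the CRS doesn't
# seem to survive the write
my_sf <- st_transform(my_sf, 4326)
st_write(my_sf, dsn = con, layer = "my_new_table")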