This introduces the database setup we're using to store and access some of our larger datasets.
We have a database on a Princeton server to store, share, and access data. This setup has some advantages over just sharing data through email or keeping copies on our local machines.
Separate from the Princeton setup, I also have a working database hosted on an Amazon server. This is an option if we want to depart from Princeton's institutional resources, and it would have the additional advantages of letting us link the database to more interactive visualizations and potentially letting us share versions of our datasets more widely.
To connect to the Princeton database, you will first have to get set up with the Princeton Research Computing folks. You will need a username and password, and you'll need to keep a VPN connection to the Princeton network open. Email them to get that set up:
https://researchcomputing.princeton.edu/about/contact/ask-research-computing
After that, you’re ready to pull and write data to the database. I wrote a short R library called ‘dblinkr’ that contains some helper functions for this purpose. This tutorial uses some of these convenience functions.
First, we can use a helper function from this library to connect to the Princeton server. Save your connection in your R environment.
library(dblinkr)
# set usr and pw to your username and password!
con <- princeton.db.connect(usr, pw)
## Loading required package: DBI
## Loading required package: RPostgreSQL
## connecting with Postgres driver
We’re in! The first thing we might want to do is check what’s there.
We can keep our tables organized into different “schema,” which are basically organizational subdivisions for our database.
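The listing below came from a helper call that isn't shown here; something along the lines of DBI's dbListObjects (my assumption, not necessarily the original command) produces this kind of output:

# list every schema and top-level table visible on the connection
DBI::dbListObjects(con)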
## table is_prefix
## 1 <Id> table = corelogic_sample FALSE
## 2 <Id> table = geography_columns FALSE
## 3 <Id> table = geometry_columns FALSE
## 4 <Id> table = spatial_ref_sys FALSE
## 5 <Id> table = raster_columns FALSE
## 6 <Id> table = raster_overviews FALSE
## 7 <Id> schema = pg_catalog TRUE
## 8 <Id> schema = information_schema TRUE
## 9 <Id> schema = divs TRUE
## 10 <Id> schema = regions_simplified TRUE
## 11 <Id> schema = xwalks TRUE
## 12 <Id> schema = regions TRUE
## 13 <Id> schema = public TRUE
## 14 <Id> schema = attr TRUE
Some of the above are artifacts from setting up PostGIS; the others are schema containing tables that I’ve already loaded in.
My tables so far are divided between “divs,” which holds spatial divisions (like highways, rivers, historical redlining, etc.); “regions,” which holds study areas (commuting zones, counties, census tracts, etc.); and “attr,” which stores non-spatial, attribute information.
We can see what’s in each schema with a convenience function from my dblinkr package:
## table is_prefix
## 1 <Id> schema = divs, table = redlining FALSE
## 2 <Id> schema = divs, table = rails_bts FALSE
## 3 <Id> schema = divs, table = water_tigris FALSE
## 4 <Id> schema = divs, table = hwys FALSE
## 5 <Id> schema = divs, table = places FALSE
## 6 <Id> schema = divs, table = hwyPlan1947 FALSE
Here we can see the tables in the “divs” schema. It’ll likely make sense to add more schema as we work to keep all the components of the project organized.
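If you'd rather skip the dblinkr helper, a plain-DBI sketch of the same listing (my own, not part of dblinkr) looks like:

# list the tables under one schema by passing a prefix
DBI::dbListObjects(con, prefix = DBI::Id(schema = "divs"))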
You can start working with what’s there by using R, SQL, or a combination. I know very minimal SQL, and it’s not necessary in order to use these things, but learning enough to write a line here or there can be convenient in some cases.
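For example, a one-off query can be sent straight through DBI (a quick sketch; attr.cts is the census-tract table used below):

# count the rows in the census-tract attribute table with raw SQL
DBI::dbGetQuery(con, "select count(*) from attr.cts;")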
First, let’s just pull in a whole table. This uses ‘dbplyr,’ a library that translates tidyverse R code into SQL. Our code will change slightly depending on whether or not we’re reading in spatial information.
The ‘dbplyr’ vignette can be found here: https://dbplyr.tidyverse.org/articles/dbplyr.html
# First, let's read using dbplyr:
suppressMessages(library(dplyr))
suppressMessages(library(dbplyr))
cts <- tbl(con, in_schema("attr", "cts"))
colnames(cts)[1:10]
## [1] "state" "county" "tract" "geoid" "cz"
## [6] "czname" "cbsa_id" "county_id" "row_id" "population"
# Now we can see the data, but it's still a ~reference~ to the database table,
# not a df or tibble in our R environment:
class(cts)
## [1] "tbl_PqConnection" "tbl_dbi" "tbl_sql" "tbl_lazy"
## [5] "tbl"
# It's generally a good idea to pull things into local memory as soon as they're
# down to a manageable size. Database references will be harder to
# work with. This is done as:
local.cts <- collect(cts)
class(local.cts)
## [1] "tbl_df" "tbl" "data.frame"
# You can also use dplyr to generate sql queries:
czpop <- cts %>% group_by(cz, czname) %>% summarise("commuting zone population" = sum(population, na.rm = T))
czpop %>% head()
## # Source: lazy query [?? x 3]
## # Database: postgres [km31@carto.princeton.edu:5432/shp]
## # Groups: cz
## cz czname `commuting zone population`
## <int> <chr> <int64>
## 1 NA <NA> 782195
## 2 7500 Deltona 664653
## 3 11101 Montgomery 395483
## 4 33400 Longview 314342
## 5 12902 Mount Sterling 44396
## 6 35300 Farmington 194161
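# The generated SQL below can be printed with dbplyr's show_query(); I'm
# assuming that's how it was produced here, since the original chunk isn't shown.
czpop %>% show_query()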
## <SQL>
## SELECT "cz", "czname", SUM("population") AS "commuting zone population"
## FROM attr.cts
## GROUP BY "cz", "czname"
One likely use case is pulling from an extremely large table that won’t fit into local memory all at once. We can pull in subsamples for exploratory analysis or iterate through identifiers to process the full dataset piece by piece (a sketch of the latter comes after the example below):
# Pull in just census-tracts for Philadelphia
phl.cts <- cts %>% filter(czname == "Philadelphia") %>% collect()
phl.cts %>% select(geoid, cz, czname, population) %>% head()
## # A tibble: 6 x 4
## geoid cz czname population
## <chr> <int> <chr> <int>
## 1 G3400010000100 19700 Philadelphia 2417
## 2 G3400010000200 19700 Philadelphia 3202
## 3 G3400010000300 19700 Philadelphia 4373
## 4 G3400010000400 19700 Philadelphia 2606
## 5 G3400010000500 19700 Philadelphia 3503
## 6 G3400010001100 19700 Philadelphia 2027
This will likely be more useful with, for example, the 380GB SafeGraph data!
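For the truly large case, one pattern (a sketch; it assumes czname is a reasonable identifier to chunk on) is to iterate over identifiers and only ever collect one piece at a time:

# pull the distinct commuting-zone names, then process one zone at a time
cz_names <- cts %>% distinct(czname) %>% collect() %>% pull(czname)
for (nm in cz_names) {
  chunk <- cts %>% filter(czname == nm) %>% collect()
  # ...summarise or write out `chunk` here, then move on to the next zone...
}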
I’ve mostly been using this setup for spatial data so far. The ‘dblinkr’ library referenced above has some helper functions for spatially subsetting from the database. dbplyr can’t do spatial subsets and doesn’t handle geometries well, so we switch to the sf library or the dblinkr helpers for this.
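The sf startup message below comes from attaching the library:

library(sf)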
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
## table is_prefix
## 1 <Id> schema = regions, table = cts2010 FALSE
## 2 <Id> schema = regions, table = states FALSE
## 3 <Id> schema = regions, table = cbsa FALSE
## 4 <Id> schema = regions, table = counties FALSE
## 5 <Id> schema = regions, table = czs_1990 FALSE
# use sf library to load a subset. SQL is used for the 'query' argument.
counties <- st_read(con, query = "select * from regions.counties where statefp='42';")
phl <- counties %>% filter(grepl("Philadelphia", name))
# use this region to subset from other, more memory-intensive spatial datasets.
# The helper function below gets all rows of a spatial table that fall within
# the area bounded by the supplied region.
phwys <- dblinkr::query.division(con, region = phl, "divs.hwys")
prl <- dblinkr::query.division(con, region = phl, "divs.redlining")
# make a cool map?
library(ggplot2)
ggplot() +
geom_sf(data = prl, aes(fill= holc_grade), color = "white") +
geom_sf(data = st_crop(phwys, prl), color = "black", size = 1) +
theme_void() +
labs(title = "hwys and redlining in philly")
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
(You can also supply SQL to subset a shapefile even if you’re reading just from a file, rather than from a database.)
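For instance, a sketch against a hypothetical local shapefile (the file and layer names here are made up; in OGR SQL the layer name is the file name without its extension):

pa_counties <- st_read("counties.shp", query = "select * from counties where statefp = '42'")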
As a final quick note, you can download pgAdmin to get a GUI for all of this. It can be really useful for checking that things are going as expected and for getting more of a visual view of the database. It was surprisingly painless for me to set it up and point it at the Princeton database.
As I mentioned, I’m not an expert in this stuff either, although I have found it useful and workable. Here are a couple of quirks and tips from working with SQL from R:

- Try to avoid capitalizing schema or table names when you upload them to the database. Capitalized names have to be wrapped in double quotes in queries, which lowercase names don’t, so it’s easier to keep everything lowercase.
- Writing spatial data to the database using sf::st_write is straightforward, except that the CRS doesn’t seem to transfer. My workaround has been to always set the CRS to 4326 before uploading so that everything is standard (see the sketch below). I’m not sure there is a better approach, but if you’re uploading spatial data, be aware of this.
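A minimal sketch of that upload pattern, assuming a hypothetical sf object my_sf and table name my_new_table:

# hypothetical upload: standardize to EPSG:4326 first, since the CRS doesn't
# seem to survive the write
my_sf <- st_transform(my_sf, 4326)
st_write(my_sf, dsn = con, layer = "my_new_table")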