Roll your own sf with Geopackage (SQLite)

First some background.

sf::read_sf()

In sf, we read Geopackage (GPKG) using GDAL via the read_sf() function.

library(sf)
gfile <- system.file("gpkg/nc.gpkg", package = "sf", mustWork = TRUE)

read_sf(gfile)
## Simple feature collection with 100 features and 14 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
## # A tibble: 100 x 15
##     AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74 SID74 NWBIR74
##    <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl> <dbl>   <dbl>
##  1 0.114      1.44  1825    1825 Ashe  37009  37009        5  1091     1      10
##  2 0.061      1.23  1827    1827 Alle… 37005  37005        3   487     0      10
##  3 0.143      1.63  1828    1828 Surry 37171  37171       86  3188     5     208
##  4 0.07       2.97  1831    1831 Curr… 37053  37053       27   508     1     123
##  5 0.153      2.21  1832    1832 Nort… 37131  37131       66  1421     9    1066
##  6 0.097      1.67  1833    1833 Hert… 37091  37091       46  1452     7     954
##  7 0.062      1.55  1834    1834 Camd… 37029  37029       15   286     0     115
##  8 0.091      1.28  1835    1835 Gates 37073  37073       37   420     0     254
##  9 0.118      1.42  1836    1836 Warr… 37185  37185       93   968     4     748
## 10 0.124      1.43  1837    1837 Stok… 37169  37169       85  1612     1     160
## # … with 90 more rows, and 4 more variables: BIR79 <dbl>, SID79 <dbl>,
## #   NWBIR79 <dbl>, geom <MULTIPOLYGON [°]>

What happened there?

  • sf found a layer for us sf::st_layers(gfile), and chose the first one because we did not specify
  • it used GDAL to get all the attributes (fields), and the geometry and its crs from the entire layer
  • it converted the attributes to vectors that R understands (character, logical, integer, numeric)
  • it converted geometry to sf’s sfc class of list/s of list/s of matrices, and added a bbox and crs
  • then it returned a data frame with sf’s sf class

If we are more explicit, we can tell sf which layer we want.

sf::st_layers(gfile)
## Driver: GPKG 
## Available layers:
##   layer_name geometry_type features fields
## 1    nc.gpkg Multi Polygon      100     14
read_sf(gfile, "nc.gpkg")
## Simple feature collection with 100 features and 14 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
## # A tibble: 100 x 15
##     AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74 SID74 NWBIR74
##    <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl> <dbl>   <dbl>
##  1 0.114      1.44  1825    1825 Ashe  37009  37009        5  1091     1      10
##  2 0.061      1.23  1827    1827 Alle… 37005  37005        3   487     0      10
##  3 0.143      1.63  1828    1828 Surry 37171  37171       86  3188     5     208
##  4 0.07       2.97  1831    1831 Curr… 37053  37053       27   508     1     123
##  5 0.153      2.21  1832    1832 Nort… 37131  37131       66  1421     9    1066
##  6 0.097      1.67  1833    1833 Hert… 37091  37091       46  1452     7     954
##  7 0.062      1.55  1834    1834 Camd… 37029  37029       15   286     0     115
##  8 0.091      1.28  1835    1835 Gates 37073  37073       37   420     0     254
##  9 0.118      1.42  1836    1836 Warr… 37185  37185       93   968     4     748
## 10 0.124      1.43  1837    1837 Stok… 37169  37169       85  1612     1     160
## # … with 90 more rows, and 4 more variables: BIR79 <dbl>, SID79 <dbl>,
## #   NWBIR79 <dbl>, geom <MULTIPOLYGON [°]>

(There are no other layers, but there might be in other GDAL vector sources).

We can cut into the data source using SQL! You might know this

haha a GeoPackage is really just a SQLite database … so of course we can use SQL.

But no, we’re not there yet. We can issue SQL via GDAL to the GeoPackage. This happens in the C++ after the GDAL data source is opened, there’s a function called ExcecuteSQL() and it can be done on any GDAL vector data source, of any supported driver (a.k.a. format, i.e. MapInfo TAB, BNA, MSSQLServer, FlatGeoBuf, XLSX, etc).

(Ys <- read_sf(gfile, 
        query = "SELECT NAME, CNTY_ AS County_ID, geom FROM \"nc.gpkg\" WHERE (NAME LIKE 'Y%')")
)
## Simple feature collection with 2 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -82.50694 ymin: 35.69756 xmax: -80.44081 ymax: 36.27843
## geographic CRS: NAD27
## # A tibble: 2 x 3
##   NAME   County_ID                                                          geom
##   <chr>      <dbl>                                            <MULTIPOLYGON [°]>
## 1 Yadkin      1893 (((-80.49554 36.04327, -80.68958 36.04548, -80.87741 36.0524…
## 2 Yancey      1936 (((-82.27921 35.69756, -82.28158 35.7202, -82.32185 35.73985…

The only thing that’s a bit annoying (ok, here’s a list):

  • layer, we can’t just let sf pick the first one because it’s an integral part of the query string
  • the format of the layer is weird, here it’s the file name plus extension so we need to quote it
  • SELECT * is easy, but anything else will ignore the geometry for some drivers (so sometimes we have to be explicit)

The layer might be file without extension, or completely unrelated.

Technically the dialect of SQL in use is SQLite or OGRSQL, this is controlled by GDAL at a lower level (and will not be discussed further as I haven’t explored it). Just consider that there are several possible dialects, this special GDAL OGRSQL one, SQLite, or the native SQL used by a data source (say MSSQLServer, PostgreSQL, or any of the ODBC enabled databases via the ODBC driver).

Another feature we can invoke is the extent filter, this is a filtering (not crop) bounding box that is only applied if we run SQL (or if sf runs a dummy query for us, as here).

xlim <- c(-78, -76.2)
ylim <- c(34, 36)
## we have to have a WKT text version of these four numbers so a geometry
## must bre created as well as the bbox object (or we might have sf geom objects already)
(wktbbox <- sf::st_as_text(sf::st_as_sfc(sf::st_bbox(c(xmin = xlim[1L], ymin = ylim[1], 
                                                      xmax = xlim[2L], ymax = ylim[2L])))))
## [1] "POLYGON ((-78 34, -76.2 34, -76.2 36, -78 36, -78 34))"
filtered <- read_sf(gfile, wkt_filter = wktbbox)

That only read features that fall at least partly inside that bounding box.

plot(st_geometry(read_sf(gfile)), reset  = FALSE)
plot(filtered, col = rainbow(nrow(filtered), alpha = 0.5), add = TRUE)
## Warning in plot.sf(filtered, col = rainbow(nrow(filtered), alpha = 0.5), :
## ignoring all but the first attribute
abline(v = xlim, h = ylim)

How that impacts a data set can be quite subtle, note the filter is in addition to any SQL that is issued. Here we drop any county with a name starting "P*" and three more are dropped. (Pitt, Pamlico, and Pender).

plot(st_geometry(read_sf(gfile)), reset  = FALSE)

filtered2 <- read_sf(gfile, wkt_filter = wktbbox, 
                     query =   "SELECT * FROM \"nc.gpkg\" WHERE (NAME NOT LIKE 'P%')")
plot(filtered2, col = rainbow(nrow(filtered), alpha = 0.5), add = TRUE)
## Warning in plot.sf(filtered2, col = rainbow(nrow(filtered), alpha = 0.5), :
## ignoring all but the first attribute
## Warning in plot.sf(filtered2, col = rainbow(nrow(filtered), alpha = 0.5), : col
## is not of length 1 or nrow(x): colors will be recycled; use pal to specify a
## color palette
abline(v = xlim, h = ylim)

Roll your own sf-Geopackage with SQLite

ok finally to the point of this story …

GDAL is awesome, but we can avoid it entirely and hit this one particular format with SQL via SQLite, and invoke all the lazy tidyverse wonders.

To work in the db-tidyverse we need a DB connection.

library(DBI)
library(dplyr)
## haha guess what, geopackage is SQLite
db <- dbConnect(RSQLite::SQLite(), gfile)
dbListTables(db)
##  [1] "gpkg_contents"             "gpkg_extensions"          
##  [3] "gpkg_geometry_columns"     "gpkg_metadata"            
##  [5] "gpkg_metadata_reference"   "gpkg_spatial_ref_sys"     
##  [7] "gpkg_tile_matrix"          "gpkg_tile_matrix_set"     
##  [9] "nc.gpkg"                   "rtree_nc.gpkg_geom"       
## [11] "rtree_nc.gpkg_geom_node"   "rtree_nc.gpkg_geom_parent"
## [13] "rtree_nc.gpkg_geom_rowid"  "sqlite_sequence"

That looks scary, but what did we need above?

  • layer name
  • attributes in R vectors (the fields stored with each feature)
  • geometry, then converted into sf’s in-memory format sfc
  • a CRS for the geometry
  • a data frame with sf class

We’ll try to get each of these and match our query above that got the "Y*" counties. We won’t actually use this query text (because lazy tidyverse!).

“SELECT NAME, CNTY_ AS County_ID, geom FROM "nc.gpkg" WHERE (NAME LIKE ‘Y%’)”

Note we can write SQL if we want to via the from argument to tbl.src_dbi().

tbl(db, sql(“SELECT NAME, CNTY_ AS County_ID, geom FROM "nc.gpkg" WHERE (NAME LIKE ‘Y%’)”))

Using tidyverse idioms

What’s in that first table?

tbl(db, "gpkg_contents")
## # Source:   table<gpkg_contents> [?? x 10]
## # Database: sqlite 3.30.1 [/usr/local/lib/R/site-library/sf/gpkg/nc.gpkg]
##   table_name data_type identifier description last_change min_x min_y max_x
##   <chr>      <chr>     <chr>      <chr>       <chr>       <dbl> <dbl> <dbl>
## 1 nc.gpkg    features  nc.gpkg    ""          2016-09-28… -84.3  33.9 -75.5
## # … with 2 more variables: max_y <dbl>, srs_id <int>

That looks promising, so we grab the layer name.

## do whatever it is you need to get that string
(layername <- tbl(db, "gpkg_contents") %>% collect() %>% slice(1) %>% pull(table_name))
## [1] "nc.gpkg"

First, check what’s in that table because remember with SELECT we needed to explicitly include the name of the geometry.

Here it looks like we’ll get the geometry and the attributes together.

tbl(db, layername)
## # Source:   table<nc.gpkg> [?? x 16]
## # Database: sqlite 3.30.1 [/usr/local/lib/R/site-library/sf/gpkg/nc.gpkg]
##      fid        geom  AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID
##    <int>      <blob> <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int>
##  1     1 <raw 494 B> 0.114      1.44  1825    1825 Ashe  37009  37009        5
##  2     2 <raw 478 B> 0.061      1.23  1827    1827 Alle… 37005  37005        3
##  3     3 <raw 510 B> 0.143      1.63  1828    1828 Surry 37171  37171       86
##  4     4 <raw 696 B> 0.07       2.97  1831    1831 Curr… 37053  37053       27
##  5     5 <raw 606 B> 0.153      2.21  1832    1832 Nort… 37131  37131       66
##  6     6 <raw 414 B> 0.097      1.67  1833    1833 Hert… 37091  37091       46
##  7     7 <raw 446 B> 0.062      1.55  1834    1834 Camd… 37029  37029       15
##  8     8 <raw 334 B> 0.091      1.28  1835    1835 Gates 37073  37073       37
##  9     9 <raw 286 B> 0.118      1.42  1836    1836 Warr… 37185  37185       93
## 10    10 <raw 158 B> 0.124      1.43  1837    1837 Stok… 37169  37169       85
## # … with more rows, and 6 more variables: BIR74 <dbl>, SID74 <dbl>,
## #   NWBIR74 <dbl>, BIR79 <dbl>, SID79 <dbl>, NWBIR79 <dbl>

Now run the query, but using tidyverse helpers with dbplyr SQL translation.

(qu_table <-  tbl(db, layername) %>% 
  dplyr::filter(NAME %LIKE% 'Y%') %>% 
  dplyr::select(NAME, County_ID = CNTY_, geom))
## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.30.1 [/usr/local/lib/R/site-library/sf/gpkg/nc.gpkg]
##   NAME   County_ID        geom
##   <chr>      <dbl>      <blob>
## 1 Yadkin      1893 <raw 446 B>
## 2 Yancey      1936 <raw 590 B>

That looks promising so commit and collect.

nc_q <- qu_table %>% collect()

In memory now, let’s see if we can get the CRS from the DB.

db %>% dbListTables()
##  [1] "gpkg_contents"             "gpkg_extensions"          
##  [3] "gpkg_geometry_columns"     "gpkg_metadata"            
##  [5] "gpkg_metadata_reference"   "gpkg_spatial_ref_sys"     
##  [7] "gpkg_tile_matrix"          "gpkg_tile_matrix_set"     
##  [9] "nc.gpkg"                   "rtree_nc.gpkg_geom"       
## [11] "rtree_nc.gpkg_geom_node"   "rtree_nc.gpkg_geom_parent"
## [13] "rtree_nc.gpkg_geom_rowid"  "sqlite_sequence"
db %>% tbl("gpkg_metadata")  #NO
## # Source:   table<gpkg_metadata> [?? x 5]
## # Database: sqlite 3.30.1 [/usr/local/lib/R/site-library/sf/gpkg/nc.gpkg]
## # … with 5 variables: id <int>, md_scope <chr>, md_standard_uri <chr>,
## #   mime_type <chr>, metadata <chr>
db %>% tbl("gpkg_metadata_reference") # NO
## # Source:   table<gpkg_metadata_reference> [?? x 7]
## # Database: sqlite 3.30.1 [/usr/local/lib/R/site-library/sf/gpkg/nc.gpkg]
## # … with 7 variables: reference_scope <chr>, table_name <chr>,
## #   column_name <chr>, row_id_value <int>, timestamp <dbl>, md_file_id <int>,
## #   md_parent_id <int>
db %>% tbl("gpkg_spatial_ref_sys") ## oh yeah, looks good
## # Source:   table<gpkg_spatial_ref_sys> [?? x 6]
## # Database: sqlite 3.30.1 [/usr/local/lib/R/site-library/sf/gpkg/nc.gpkg]
##   srs_name   srs_id organization organization_coo… definition     description   
##   <chr>       <int> <chr>                    <int> <chr>          <chr>         
## 1 Undefined…     -1 NONE                        -1 "undefined"    undefined car…
## 2 Undefined…      0 NONE                         0 "undefined"    undefined geo…
## 3 NAD27        4267 EPSG                      4267 "GEOGCS[\"NAD… <NA>          
## 4 WGS 84 ge…   4326 EPSG                      4326 "GEOGCS[\"WGS… longitude/lat…

We could get the definition out, but we can also cheat and just see that it’s EPSG:4267.

Then convert the geometry to sf’s sfc column vector, and convert to sf data frame.

nc_q <- 
  nc_q %>% mutate(geom = sf::st_as_sfc(geom, crs = "EPSG:4267")) %>% 
  sf::st_as_sf()

nc_q
## Simple feature collection with 2 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -82.50694 ymin: 35.69756 xmax: -80.44081 ymax: 36.27843
## geographic CRS: NAD27
## # A tibble: 2 x 3
##   NAME   County_ID                                                          geom
##   <chr>      <dbl>                                            <MULTIPOLYGON [°]>
## 1 Yadkin      1893 (((-80.49554 36.04327, -80.68958 36.04548, -80.87741 36.0524…
## 2 Yancey      1936 (((-82.27921 35.69756, -82.28158 35.7202, -82.32185 35.73985…

Summary

To wrap up,

  1. we can use SQL with sf!

To get only counties starting with "Y*", issue a LIKE filter query.

gfile <- system.file("gpkg/nc.gpkg", package = "sf", mustWork = TRUE)
sf::read_sf(gfile, 
        query = "SELECT NAME, CNTY_ AS County_ID, geom FROM \"nc.gpkg\" WHERE (NAME LIKE 'Y%')")
## Simple feature collection with 2 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -82.50694 ymin: 35.69756 xmax: -80.44081 ymax: 36.27843
## geographic CRS: NAD27
## # A tibble: 2 x 3
##   NAME   County_ID                                                          geom
##   <chr>      <dbl>                                            <MULTIPOLYGON [°]>
## 1 Yadkin      1893 (((-80.49554 36.04327, -80.68958 36.04548, -80.87741 36.0524…
## 2 Yancey      1936 (((-82.27921 35.69756, -82.28158 35.7202, -82.32185 35.73985…
  1. We can not use SQL with the tidyverse, but get the same result!

Using RSQLite and dbplyr with tidyverse verbs, and some judicious harvesting of key pieces of data .

crs <- "EPSG:4267"
layername <- "nc.gpkg"
library(DBI); library(dplyr)
dbConnect(RSQLite::SQLite(), gfile) %>% 
  tbl(layername) %>% 
  dplyr::filter(NAME %LIKE% 'Y%') %>% 
  dplyr::select(NAME, County_ID = CNTY_, geom) %>% 
  dplyr::collect() %>% 
  mutate(geom = sf::st_as_sfc(geom, crs = crs)) %>% 
  sf::st_as_sf()
## Simple feature collection with 2 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -82.50694 ymin: 35.69756 xmax: -80.44081 ymax: 36.27843
## geographic CRS: NAD27
## # A tibble: 2 x 3
##   NAME   County_ID                                                          geom
##   <chr>      <dbl>                                            <MULTIPOLYGON [°]>
## 1 Yadkin      1893 (((-80.49554 36.04327, -80.68958 36.04548, -80.87741 36.0524…
## 2 Yancey      1936 (((-82.27921 35.69756, -82.28158 35.7202, -82.32185 35.73985…

Take care.