First some background.
In sf, we read Geopackage (GPKG) using GDAL via the read_sf() function.
library(sf)
gfile <- system.file("gpkg/nc.gpkg", package = "sf", mustWork = TRUE)
read_sf(gfile)
## Simple feature collection with 100 features and 14 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
## # A tibble: 100 x 15
## AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.114 1.44 1825 1825 Ashe 37009 37009 5 1091 1 10
## 2 0.061 1.23 1827 1827 Alle… 37005 37005 3 487 0 10
## 3 0.143 1.63 1828 1828 Surry 37171 37171 86 3188 5 208
## 4 0.07 2.97 1831 1831 Curr… 37053 37053 27 508 1 123
## 5 0.153 2.21 1832 1832 Nort… 37131 37131 66 1421 9 1066
## 6 0.097 1.67 1833 1833 Hert… 37091 37091 46 1452 7 954
## 7 0.062 1.55 1834 1834 Camd… 37029 37029 15 286 0 115
## 8 0.091 1.28 1835 1835 Gates 37073 37073 37 420 0 254
## 9 0.118 1.42 1836 1836 Warr… 37185 37185 93 968 4 748
## 10 0.124 1.43 1837 1837 Stok… 37169 37169 85 1612 1 160
## # … with 90 more rows, and 4 more variables: BIR79 <dbl>, SID79 <dbl>,
## # NWBIR79 <dbl>, geom <MULTIPOLYGON [°]>
What happened there?
sf::st_layers(gfile), and chose the first one because we did not specifysfc class of list/s of list/s of matrices, and added a bbox and crssf classIf we are more explicit, we can tell sf which layer we want.
sf::st_layers(gfile)
## Driver: GPKG
## Available layers:
## layer_name geometry_type features fields
## 1 nc.gpkg Multi Polygon 100 14
read_sf(gfile, "nc.gpkg")
## Simple feature collection with 100 features and 14 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## geographic CRS: NAD27
## # A tibble: 100 x 15
## AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.114 1.44 1825 1825 Ashe 37009 37009 5 1091 1 10
## 2 0.061 1.23 1827 1827 Alle… 37005 37005 3 487 0 10
## 3 0.143 1.63 1828 1828 Surry 37171 37171 86 3188 5 208
## 4 0.07 2.97 1831 1831 Curr… 37053 37053 27 508 1 123
## 5 0.153 2.21 1832 1832 Nort… 37131 37131 66 1421 9 1066
## 6 0.097 1.67 1833 1833 Hert… 37091 37091 46 1452 7 954
## 7 0.062 1.55 1834 1834 Camd… 37029 37029 15 286 0 115
## 8 0.091 1.28 1835 1835 Gates 37073 37073 37 420 0 254
## 9 0.118 1.42 1836 1836 Warr… 37185 37185 93 968 4 748
## 10 0.124 1.43 1837 1837 Stok… 37169 37169 85 1612 1 160
## # … with 90 more rows, and 4 more variables: BIR79 <dbl>, SID79 <dbl>,
## # NWBIR79 <dbl>, geom <MULTIPOLYGON [°]>
(There are no other layers, but there might be in other GDAL vector sources).
We can cut into the data source using SQL! You might know this
haha a GeoPackage is really just a SQLite database … so of course we can use SQL.
But no, we’re not there yet. We can issue SQL via GDAL to the GeoPackage. This happens in the C++ after the GDAL data source is opened, there’s a function called ExcecuteSQL() and it can be done on any GDAL vector data source, of any supported driver (a.k.a. format, i.e. MapInfo TAB, BNA, MSSQLServer, FlatGeoBuf, XLSX, etc).
(Ys <- read_sf(gfile,
query = "SELECT NAME, CNTY_ AS County_ID, geom FROM \"nc.gpkg\" WHERE (NAME LIKE 'Y%')")
)
## Simple feature collection with 2 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -82.50694 ymin: 35.69756 xmax: -80.44081 ymax: 36.27843
## geographic CRS: NAD27
## # A tibble: 2 x 3
## NAME County_ID geom
## <chr> <dbl> <MULTIPOLYGON [°]>
## 1 Yadkin 1893 (((-80.49554 36.04327, -80.68958 36.04548, -80.87741 36.0524…
## 2 Yancey 1936 (((-82.27921 35.69756, -82.28158 35.7202, -82.32185 35.73985…
The only thing that’s a bit annoying (ok, here’s a list):
The layer might be file without extension, or completely unrelated.
Technically the dialect of SQL in use is SQLite or OGRSQL, this is controlled by GDAL at a lower level (and will not be discussed further as I haven’t explored it). Just consider that there are several possible dialects, this special GDAL OGRSQL one, SQLite, or the native SQL used by a data source (say MSSQLServer, PostgreSQL, or any of the ODBC enabled databases via the ODBC driver).
Another feature we can invoke is the extent filter, this is a filtering (not crop) bounding box that is only applied if we run SQL (or if sf runs a dummy query for us, as here).
xlim <- c(-78, -76.2)
ylim <- c(34, 36)
## we have to have a WKT text version of these four numbers so a geometry
## must bre created as well as the bbox object (or we might have sf geom objects already)
(wktbbox <- sf::st_as_text(sf::st_as_sfc(sf::st_bbox(c(xmin = xlim[1L], ymin = ylim[1],
xmax = xlim[2L], ymax = ylim[2L])))))
## [1] "POLYGON ((-78 34, -76.2 34, -76.2 36, -78 36, -78 34))"
filtered <- read_sf(gfile, wkt_filter = wktbbox)
That only read features that fall at least partly inside that bounding box.
plot(st_geometry(read_sf(gfile)), reset = FALSE)
plot(filtered, col = rainbow(nrow(filtered), alpha = 0.5), add = TRUE)
## Warning in plot.sf(filtered, col = rainbow(nrow(filtered), alpha = 0.5), :
## ignoring all but the first attribute
abline(v = xlim, h = ylim)
How that impacts a data set can be quite subtle, note the filter is in addition to any SQL that is issued. Here we drop any county with a name starting "P*" and three more are dropped. (Pitt, Pamlico, and Pender).
plot(st_geometry(read_sf(gfile)), reset = FALSE)
filtered2 <- read_sf(gfile, wkt_filter = wktbbox,
query = "SELECT * FROM \"nc.gpkg\" WHERE (NAME NOT LIKE 'P%')")
plot(filtered2, col = rainbow(nrow(filtered), alpha = 0.5), add = TRUE)
## Warning in plot.sf(filtered2, col = rainbow(nrow(filtered), alpha = 0.5), :
## ignoring all but the first attribute
## Warning in plot.sf(filtered2, col = rainbow(nrow(filtered), alpha = 0.5), : col
## is not of length 1 or nrow(x): colors will be recycled; use pal to specify a
## color palette
abline(v = xlim, h = ylim)
ok finally to the point of this story …
GDAL is awesome, but we can avoid it entirely and hit this one particular format with SQL via SQLite, and invoke all the lazy tidyverse wonders.
To work in the db-tidyverse we need a DB connection.
library(DBI)
library(dplyr)
## haha guess what, geopackage is SQLite
db <- dbConnect(RSQLite::SQLite(), gfile)
dbListTables(db)
## [1] "gpkg_contents" "gpkg_extensions"
## [3] "gpkg_geometry_columns" "gpkg_metadata"
## [5] "gpkg_metadata_reference" "gpkg_spatial_ref_sys"
## [7] "gpkg_tile_matrix" "gpkg_tile_matrix_set"
## [9] "nc.gpkg" "rtree_nc.gpkg_geom"
## [11] "rtree_nc.gpkg_geom_node" "rtree_nc.gpkg_geom_parent"
## [13] "rtree_nc.gpkg_geom_rowid" "sqlite_sequence"
That looks scary, but what did we need above?
sfcWe’ll try to get each of these and match our query above that got the "Y*" counties. We won’t actually use this query text (because lazy tidyverse!).
“SELECT NAME, CNTY_ AS County_ID, geom FROM "nc.gpkg" WHERE (NAME LIKE ‘Y%’)”
Note we can write SQL if we want to via the from argument to tbl.src_dbi().
tbl(db, sql(“SELECT NAME, CNTY_ AS County_ID, geom FROM "nc.gpkg" WHERE (NAME LIKE ‘Y%’)”))
What’s in that first table?
tbl(db, "gpkg_contents")
## # Source: table<gpkg_contents> [?? x 10]
## # Database: sqlite 3.30.1 [/usr/local/lib/R/site-library/sf/gpkg/nc.gpkg]
## table_name data_type identifier description last_change min_x min_y max_x
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 nc.gpkg features nc.gpkg "" 2016-09-28… -84.3 33.9 -75.5
## # … with 2 more variables: max_y <dbl>, srs_id <int>
That looks promising, so we grab the layer name.
## do whatever it is you need to get that string
(layername <- tbl(db, "gpkg_contents") %>% collect() %>% slice(1) %>% pull(table_name))
## [1] "nc.gpkg"
First, check what’s in that table because remember with SELECT we needed to explicitly include the name of the geometry.
Here it looks like we’ll get the geometry and the attributes together.
tbl(db, layername)
## # Source: table<nc.gpkg> [?? x 16]
## # Database: sqlite 3.30.1 [/usr/local/lib/R/site-library/sf/gpkg/nc.gpkg]
## fid geom AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID
## <int> <blob> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int>
## 1 1 <raw 494 B> 0.114 1.44 1825 1825 Ashe 37009 37009 5
## 2 2 <raw 478 B> 0.061 1.23 1827 1827 Alle… 37005 37005 3
## 3 3 <raw 510 B> 0.143 1.63 1828 1828 Surry 37171 37171 86
## 4 4 <raw 696 B> 0.07 2.97 1831 1831 Curr… 37053 37053 27
## 5 5 <raw 606 B> 0.153 2.21 1832 1832 Nort… 37131 37131 66
## 6 6 <raw 414 B> 0.097 1.67 1833 1833 Hert… 37091 37091 46
## 7 7 <raw 446 B> 0.062 1.55 1834 1834 Camd… 37029 37029 15
## 8 8 <raw 334 B> 0.091 1.28 1835 1835 Gates 37073 37073 37
## 9 9 <raw 286 B> 0.118 1.42 1836 1836 Warr… 37185 37185 93
## 10 10 <raw 158 B> 0.124 1.43 1837 1837 Stok… 37169 37169 85
## # … with more rows, and 6 more variables: BIR74 <dbl>, SID74 <dbl>,
## # NWBIR74 <dbl>, BIR79 <dbl>, SID79 <dbl>, NWBIR79 <dbl>
Now run the query, but using tidyverse helpers with dbplyr SQL translation.
(qu_table <- tbl(db, layername) %>%
dplyr::filter(NAME %LIKE% 'Y%') %>%
dplyr::select(NAME, County_ID = CNTY_, geom))
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.30.1 [/usr/local/lib/R/site-library/sf/gpkg/nc.gpkg]
## NAME County_ID geom
## <chr> <dbl> <blob>
## 1 Yadkin 1893 <raw 446 B>
## 2 Yancey 1936 <raw 590 B>
That looks promising so commit and collect.
nc_q <- qu_table %>% collect()
In memory now, let’s see if we can get the CRS from the DB.
db %>% dbListTables()
## [1] "gpkg_contents" "gpkg_extensions"
## [3] "gpkg_geometry_columns" "gpkg_metadata"
## [5] "gpkg_metadata_reference" "gpkg_spatial_ref_sys"
## [7] "gpkg_tile_matrix" "gpkg_tile_matrix_set"
## [9] "nc.gpkg" "rtree_nc.gpkg_geom"
## [11] "rtree_nc.gpkg_geom_node" "rtree_nc.gpkg_geom_parent"
## [13] "rtree_nc.gpkg_geom_rowid" "sqlite_sequence"
db %>% tbl("gpkg_metadata") #NO
## # Source: table<gpkg_metadata> [?? x 5]
## # Database: sqlite 3.30.1 [/usr/local/lib/R/site-library/sf/gpkg/nc.gpkg]
## # … with 5 variables: id <int>, md_scope <chr>, md_standard_uri <chr>,
## # mime_type <chr>, metadata <chr>
db %>% tbl("gpkg_metadata_reference") # NO
## # Source: table<gpkg_metadata_reference> [?? x 7]
## # Database: sqlite 3.30.1 [/usr/local/lib/R/site-library/sf/gpkg/nc.gpkg]
## # … with 7 variables: reference_scope <chr>, table_name <chr>,
## # column_name <chr>, row_id_value <int>, timestamp <dbl>, md_file_id <int>,
## # md_parent_id <int>
db %>% tbl("gpkg_spatial_ref_sys") ## oh yeah, looks good
## # Source: table<gpkg_spatial_ref_sys> [?? x 6]
## # Database: sqlite 3.30.1 [/usr/local/lib/R/site-library/sf/gpkg/nc.gpkg]
## srs_name srs_id organization organization_coo… definition description
## <chr> <int> <chr> <int> <chr> <chr>
## 1 Undefined… -1 NONE -1 "undefined" undefined car…
## 2 Undefined… 0 NONE 0 "undefined" undefined geo…
## 3 NAD27 4267 EPSG 4267 "GEOGCS[\"NAD… <NA>
## 4 WGS 84 ge… 4326 EPSG 4326 "GEOGCS[\"WGS… longitude/lat…
We could get the definition out, but we can also cheat and just see that it’s EPSG:4267.
Then convert the geometry to sf’s sfc column vector, and convert to sf data frame.
nc_q <-
nc_q %>% mutate(geom = sf::st_as_sfc(geom, crs = "EPSG:4267")) %>%
sf::st_as_sf()
nc_q
## Simple feature collection with 2 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -82.50694 ymin: 35.69756 xmax: -80.44081 ymax: 36.27843
## geographic CRS: NAD27
## # A tibble: 2 x 3
## NAME County_ID geom
## <chr> <dbl> <MULTIPOLYGON [°]>
## 1 Yadkin 1893 (((-80.49554 36.04327, -80.68958 36.04548, -80.87741 36.0524…
## 2 Yancey 1936 (((-82.27921 35.69756, -82.28158 35.7202, -82.32185 35.73985…
To wrap up,
To get only counties starting with "Y*", issue a LIKE filter query.
gfile <- system.file("gpkg/nc.gpkg", package = "sf", mustWork = TRUE)
sf::read_sf(gfile,
query = "SELECT NAME, CNTY_ AS County_ID, geom FROM \"nc.gpkg\" WHERE (NAME LIKE 'Y%')")
## Simple feature collection with 2 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -82.50694 ymin: 35.69756 xmax: -80.44081 ymax: 36.27843
## geographic CRS: NAD27
## # A tibble: 2 x 3
## NAME County_ID geom
## <chr> <dbl> <MULTIPOLYGON [°]>
## 1 Yadkin 1893 (((-80.49554 36.04327, -80.68958 36.04548, -80.87741 36.0524…
## 2 Yancey 1936 (((-82.27921 35.69756, -82.28158 35.7202, -82.32185 35.73985…
Using RSQLite and dbplyr with tidyverse verbs, and some judicious harvesting of key pieces of data .
crs <- "EPSG:4267"
layername <- "nc.gpkg"
library(DBI); library(dplyr)
dbConnect(RSQLite::SQLite(), gfile) %>%
tbl(layername) %>%
dplyr::filter(NAME %LIKE% 'Y%') %>%
dplyr::select(NAME, County_ID = CNTY_, geom) %>%
dplyr::collect() %>%
mutate(geom = sf::st_as_sfc(geom, crs = crs)) %>%
sf::st_as_sf()
## Simple feature collection with 2 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -82.50694 ymin: 35.69756 xmax: -80.44081 ymax: 36.27843
## geographic CRS: NAD27
## # A tibble: 2 x 3
## NAME County_ID geom
## <chr> <dbl> <MULTIPOLYGON [°]>
## 1 Yadkin 1893 (((-80.49554 36.04327, -80.68958 36.04548, -80.87741 36.0524…
## 2 Yancey 1936 (((-82.27921 35.69756, -82.28158 35.7202, -82.32185 35.73985…
Take care.