Key advice if your vector spatial data is large
Item 2 comes first because it’s so important.
library(sf)
x <- read_sf("massive.shp")
plot(x) ## DON'T DO THIS!!!!
If you used sp you might be conditioned like me to do that because it was extremely convenient.
The problem is that your shapefile may be massive because it has a lot of geometries, or because it has a lot of attributes, and the sf plot method will draw your geometries once for every column (up to 16 by default); a slow plot multiplied by the number of columns is even slower.
Do this instead, but note that for very many shapes, especially complicated ones, this will still take some time:
plot(x[1]) ## just colour by the first column
plot(st_geometry(x)) ## ignore the columns
To be really safe, just plot the first few.
plot(x[1:6, 1])
plot(st_geometry(x)[1:6])
Explore the data so you know what columns there are and what they contain, and consider dropping anything you don’t need (we’ll look at upfront queries below).
x[1:6, ]
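Note that selecting columns by name from an sf data frame keeps the geometry column, so dropping unwanted fields is easy (a sketch only; these column names are hypothetical):
x <- x[c("NAME", "Type")] ## keep just the columns you need, the geometry column comes along automatically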
read_sf()
The sf package is fast, and there are two functions to read vector spatial data: read_sf() and st_read(). They are the same apart from the conversion of character data to factors; read_sf() uses stringsAsFactors = FALSE by default.
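To see the difference for yourself with the small nc.shp file that ships with sf:
library(sf)
f <- system.file("shape/nc.shp", package = "sf", mustWork = TRUE)
class(read_sf(f)$NAME) ## "character"
class(st_read(f, stringsAsFactors = TRUE, quiet = TRUE)$NAME) ## "factor"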
Don’t use raster::shapefile() or rgdal::readOGR(); the former is a wrapper for the latter for just one format, and both are slower than sf.
x <- read_sf("massive.shp")
Have a look at the size of the file/s. If it’s a shapefile there will be at least a .shp and a .dbf file. If either or both of these are large (> 25Mb) then there’s a reasonable time required to read the data and convert it in-memory into an sf data frame. If either or both are very large (> 250Mb) or stupid massive (> 1Gb) then the problems just get worse.
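A quick way to check from R (a sketch against the hypothetical massive.shp used above):
file.size("massive.shp") ## size of the geometry file in bytes
file.size("massive.dbf") ## size of the attribute (dBase) file in bytes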
The .dbf is a dBase file; you can read it and ignore the geometries with foreign::read.dbf(). This might be a good way to explore what you have so you know how to filter the data to what you want. But we can’t read the DBF with read.dbf() and only select a certain set of rows.
This is a great technique to explore a data set quickly and plan your attack if the thing is very large.
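For example, a sketch against the hypothetical massive.dbf (as.is = TRUE keeps character fields as characters rather than factors):
d <- foreign::read.dbf("massive.dbf", as.is = TRUE) ## attributes only, no geometry
dim(d) ## how many rows and fields
str(d) ## what the fields contain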
With sf, craft a query - this is SQL, so we also need a layer name. If it’s a shapefile it’s the file name without the extension, but use sf::st_layers() to find the name/s in general (many formats are better than .shp files, and allow different names as well as multiple layers).
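For example, to list the layer/s in the hypothetical massive.shp:
sf::st_layers("massive.shp") ## reports layer name/s, geometry type, feature and field counts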
There’s a special virtual column called FID, and it is numbered from 0 or 1 depending on the format (also, in some cases it’s not a row number, so we can’t be sure).
This will read 6 (or 5, depending on whether FID starts at 0 or 1) rows from the file, as well as all the attribute fields as columns.
x0 <- read_sf("massive.shp", query = "SELECT * FROM massive WHERE FID < 6")
Use names() and dim() in the usual ways to discover what data you have.
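For example:
names(x0) ## the columns the query returned
dim(x0) ## number of rows and columns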
If you know the names of the columns you can select just those - this helps with the “big because of many fields” problem. Select one or more column names.
x0 <- read_sf("massive.shp", query = "SELECT NAME, Type, status FROM massive WHERE FID < 6")
When you get a reasonable result, experiment with a greater number of rows (increase the FID threshold in the query).
There are a bunch of tricks with OGRSQL.
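For example, the OGR SQL dialect also understands LIKE and ORDER BY (as well as aggregate functions like the COUNT used below) - a sketch against the hypothetical massive layer:
x0 <- read_sf("massive.shp", query = "SELECT NAME, Type FROM massive WHERE NAME LIKE 'A%' ORDER BY NAME")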
Find out how many rows; here I use an actual shapefile, not a theoretical large one.
library(sf)
## Linking to GEOS 3.7.1, GDAL 2.4.2, PROJ 5.2.0
f <- system.file("shape/nc.shp", package = "sf", mustWork = TRUE)
read_sf(f, query = "SELECT Count(FID) AS number_of_rows FROM nc")
## Warning: no simple feature geometries present: returning a data.frame or
## tbl_df
## # A tibble: 1 x 1
## number_of_rows
## <int>
## 1 100
Find out the field names.
library(sf)
f <- system.file("shape/nc.shp", package = "sf", mustWork = TRUE)
read_sf(f, query = "SELECT * FROM nc WHERE FID = 1")
## Simple feature collection with 1 feature and 14 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -81.34754 ymin: 36.36536 xmax: -80.90344 ymax: 36.57286
## epsg (SRID): 4267
## proj4string: +proj=longlat +datum=NAD27 +no_defs
## # A tibble: 1 x 15
## AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl>
## 1 0.061 1.23 1827 1827 Alle… 37005 37005 3 487 0
## # … with 5 more variables: NWBIR74 <dbl>, BIR79 <dbl>, SID79 <dbl>,
## # NWBIR79 <dbl>, `_ogr_geometry_` <POLYGON [°]>
Count distinct values on a given field; unsurprisingly FIPS is distinct for every one of the 100 rows, but your data likely has different patterns. Note that these queries can take some time to run, as GDAL must scan over every feature and query its field values - but it likely won’t run out of memory because it doesn’t need to read the geometries at all.
library(sf)
f <- system.file("shape/nc.shp", package = "sf", mustWork = TRUE)
read_sf(f, query = "SELECT Count(DISTINCT FIPS) FROM nc")
## Warning: no simple feature geometries present: returning a data.frame or
## tbl_df
## # A tibble: 1 x 1
## COUNT_FIPS
## <int>
## 1 100
The above tips should help you home in on the data you need in your stupid massive shapefile/s.
For more control, try the vapour package, which has functions to read only attributes or only geometries, to apply SQL as in the sf package, and to skip_n and limit_n the features read in. The lists of attribute values and geometries are raw, but this gives the power to construct or use just the data required.
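A minimal sketch, assuming vapour’s vapour_read_attributes() and vapour_read_geometry() and their limit_n/skip_n arguments (function names as at the time of writing - check the package documentation):
library(vapour)
att <- vapour_read_attributes("massive.shp", limit_n = 10) ## a list of field vectors, no geometry read at all
geom <- vapour_read_geometry("massive.shp", skip_n = 10, limit_n = 5) ## raw WKB geometries for features 11-15, no attributes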