Measuring and simplifying spatial datasets in R

Datasets that have been read into R's workspace are held in memory (RAM) as R objects, and can be saved to disk in R's own data storage file type, .RData. This is an efficient way of storing spatial information, but even so, datasets stored in R can get quite large. This may cause problems such as depleting available memory or hard disk space on the computer. It is therefore wise to understand roughly how large spatial objects are, which also gives some insight into how long certain functions will take to run.

The purpose of this tutorial is to explain how to gain information about the size of R objects, and demonstrate how spatial objects can be simplified, potentially liberating computers from overload and making plotting and analysis functions much faster. This tutorial is an accompaniment to Cheshire and Lovelace (2014), a book chapter on spatial data and visualisation. A pre-print of it, containing the data used in the example, can be downloaded online (github.com/geocomPP/sdvwR).

Loading test data

Let us compare two datasets, one representing the boroughs of London, and the other taken from the GPS log of a bicycle-powered house move. First we load the data (assuming R's working directory is set to “sdvwR-master”, which has been downloaded and unzipped).

library(rgdal)  # load the rgdal package (R bindings to GDAL)

## Loading required package: sp
## rgdal: version: 0.8-13, (SVN revision 494)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 1.10.0, released 2013/04/24
## Path to GDAL shared files: /usr/share/gdal/1.10
## Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480]
## Path to PROJ.4 shared files: (autodetected)

shf2lds <- readOGR(dsn = "data/gps-trace.gpx", layer = "tracks")  # load track data

## OGR data source with driver: GPX 
## Source: "data/gps-trace.gpx", layer: "tracks"
## with 1 features and 12 fields
## Feature type: wkbMultiLineString with 2 dimensions

lnd <- readOGR(dsn = "data/", layer = "london_sport")  # load the London data

## OGR data source with driver: ESRI Shapefile 
## Source: "data/", layer: "london_sport"
## with 33 features and 4 fields
## Feature type: wkbPolygon with 2 dimensions

Let's plot these to see what they look like.

plot(shf2lds)


plot(lnd)


Measuring object size

In the absence of prior knowledge, which of the two objects would one expect to be larger? One could hypothesize that the London dataset would be larger based on its greater spatial extent, but how much larger? The answer in R is found in the function object.size:

object.size(shf2lds)

## 107464 bytes

object.size(lnd)

## 125544 bytes
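Byte counts become hard to read as objects grow. As a minimal sketch, object.size results can also be printed or formatted in larger units. The vector x below is a hypothetical stand-in, not one of the tutorial's datasets:

```r
# Hypothetical example object: a vector of 100,000 double-precision numbers
x <- numeric(1e5)

size <- object.size(x)
print(size)                    # size in bytes (the default)
print(size, units = "Kb")      # the same size expressed in kilobytes
format(size, units = "auto")   # lets R choose a sensible unit, returned as text
```

The same calls should work on the spatial objects, for example print(object.size(lnd), units = "Kb").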

In fact, the objects have similar sizes: the GPS dataset is surprisingly large. To see why, we can find out how many vertices (points connected by lines) are contained in each dataset. To do this we use fortify from the ggplot2 package (install it first if necessary, in the same way as rgdal).

library(ggplot2)  # provides the fortify function
shf2lds.f <- fortify(shf2lds)
nrow(shf2lds.f)

## [1] 6085


lnd.f <- fortify(lnd)

## Regions defined for each Polygons

nrow(lnd.f)

## [1] 1102

In the above block of code we performed two operations on each object: 1) flatten the dataset so that each vertex is allocated a unique row; 2) use nrow to count the resulting rows.

It is clear that the GPS data has about 5.5 times as many vertices as the London data (6085 compared with 1102), explaining its large size. Yet when plotted, the GPS data does not seem more detailed, implying that some of the vertices in the object are not needed for effective visualisation, since the individual nodes of the line are imperceptible.
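A rough back-of-the-envelope check supports this. Assuming each vertex is stored as two double-precision numbers of 8 bytes each, 6085 vertices need about 97 KB for the coordinates alone, which would account for most of the 107464 bytes reported for the GPS object. The sketch below uses a hypothetical random matrix in place of the real track:

```r
# Hypothetical stand-in for the GPS track: 6085 vertices, 2 coordinates each
n_vertices <- 6085
coords <- matrix(runif(n_vertices * 2), ncol = 2)

# Each double-precision number occupies 8 bytes
raw_bytes <- n_vertices * 2 * 8
raw_bytes            # 97360

# The matrix itself is slightly larger because of its header and attributes
object.size(coords)
```

By the same arithmetic, the 1102 London vertices would need under 18 KB, which suggests that much of that object's size comes from elsewhere, such as its attribute data and per-polygon structure.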

Simplifying geometries

Simplification can help to make a graphic more readable and less cluttered. Within the 'rgeos' package it is possible to use the gSimplify function to simplify spatial R objects:

library(rgeos)
shf2lds.simple <- gSimplify(shf2lds, tol = 0.001)
(object.size(shf2lds.simple)/object.size(shf2lds))[1]

## [1] 0.04608

plot(shf2lds.simple)
plot(shf2lds, col = "red", add = TRUE)  # overlay the original track in red

In the above block of code, gSimplify is given the object shf2lds and a tol (tolerance) argument of 0.001. The tolerance is expressed in the units of the data's coordinates, here decimal degrees, so much larger values may be needed for projected data. Next, we divide the size of the simplified object by that of the original (note the use of the / symbol). The output of 0.046 tells us that the new object is less than 5% of the original's size. We can see how this has happened by again counting the number of vertices. This time we use the coordinates and nrow functions together:

nrow(coordinates(shf2lds.simple)[[1]][[1]])

## [1] 44

The double square brackets may seem strange, but they offer a taste of how R 'sees' spatial data; do not worry about them for now. Of interest here is that the number of vertices has shrunk from over 6,000 to only 44, without losing much information about the shape of the line. To test this, try plotting the original and simplified tracks on your computer: when visualized using the plot function, the object shf2lds.simple retains the overall shape of the line and is virtually indistinguishable from the original.
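To give a feel for how so many vertices can be dropped, the sketch below implements the Douglas-Peucker algorithm, which the rgeos documentation names as the method behind gSimplify. This is an illustrative base R version only, not the rgeos code: the function names seg_dist and douglas_peucker, and the sine-wave test line, are hypothetical.

```r
# Distance from each row of pts to the line segment between points a and b
seg_dist <- function(pts, a, b) {
  d <- b - a
  len2 <- sum(d^2)
  if (len2 == 0) {
    # a and b coincide: fall back to plain point-to-point distance
    return(sqrt((pts[, 1] - a[1])^2 + (pts[, 2] - a[2])^2))
  }
  # Position of the closest point along the segment, clamped to [0, 1]
  t <- ((pts[, 1] - a[1]) * d[1] + (pts[, 2] - a[2]) * d[2]) / len2
  t <- pmin(1, pmax(0, t))
  sqrt((pts[, 1] - (a[1] + t * d[1]))^2 + (pts[, 2] - (a[2] + t * d[2]))^2)
}

# Recursively keep only the vertices lying further than tol from the
# straight line joining the first and last vertex of each stretch
douglas_peucker <- function(xy, tol) {
  n <- nrow(xy)
  if (n < 3) return(xy)
  inner <- xy[2:(n - 1), , drop = FALSE]
  d <- seg_dist(inner, xy[1, ], xy[n, ])
  i <- which.max(d)
  if (d[i] <= tol) {
    # Every intermediate vertex is within tolerance: keep the end points only
    return(xy[c(1, n), , drop = FALSE])
  }
  split <- i + 1  # index of the furthest vertex within xy
  left <- douglas_peucker(xy[1:split, , drop = FALSE], tol)
  right <- douglas_peucker(xy[split:n, , drop = FALSE], tol)
  rbind(left, right[-1, , drop = FALSE])  # drop the duplicated split vertex
}

# Usage: simplify a hypothetical 1000-vertex line
x <- seq(0, 2 * pi, length.out = 1000)
track <- cbind(x, sin(x))
track_simple <- douglas_peucker(track, tol = 0.01)
nrow(track)         # vertices before simplification
nrow(track_simple)  # vertices after simplification
```

A larger tol discards more vertices, which is why the choice of tolerance depends on the coordinate units and on the scale at which the map will be viewed.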

This example is rather contrived because even the larger object, lnd, is only about an eighth of a megabyte, negligible compared with the gigabytes of memory available to modern computers. However, it underlines a wider point: for visualizing small-scale maps, spatial data geometries can often be simplified to reduce processing time and memory use.