Any datasets that have been read into R's workspace
are represented in R's own data storage file type, .RData.
This is an efficient way of storing spatial information,
but even so, datasets stored in R can get quite large.
This may cause problems
such as depleting available memory (RAM) or hard
disk space on the computer. It is therefore wise to understand
roughly how large spatial objects are, as this also gives insight
into how long certain functions will take to run.
The purpose of this tutorial is to explain how to
gain information about the size of R objects, and
demonstrate how spatial objects can be simplified,
potentially liberating computers from overload
and making plotting and analysis functions much faster.
This tutorial is an accompaniment to
Cheshire and Lovelace (2014), a book chapter on spatial
data and visualisation. A pre-print of it,
containing the data used in the example, can be
downloaded online (github.com/geocomPP/sdvwR).
Let us compare two datasets, one representing the boroughs of London, and the other taken from the GPS log of a bicycle-powered house move. First we load the data (assuming R's working directory is set to “sdvwR-master”, which has been downloaded and unzipped).
library(rgdal) # load the rgdal package
## Loading required package: sp
## rgdal: version: 0.8-13, (SVN revision 494)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 1.10.0, released 2013/04/24
## Path to GDAL shared files: /usr/share/gdal/1.10
## Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480]
## Path to PROJ.4 shared files: (autodetected)
shf2lds <- readOGR(dsn = "data/gps-trace.gpx", layer = "tracks") # load track data
## OGR data source with driver: GPX
## Source: "data/gps-trace.gpx", layer: "tracks"
## with 1 features and 12 fields
## Feature type: wkbMultiLineString with 2 dimensions
lnd <- readOGR(dsn = "data/", "london_sport") # load the London data
## OGR data source with driver: ESRI Shapefile
## Source: "data/", layer: "london_sport"
## with 33 features and 4 fields
## Feature type: wkbPolygon with 2 dimensions
Let's plot these to see what they look like.
plot(shf2lds)
plot(lnd)
In the absence of prior knowledge, which of the two objects
would one expect to be larger? One could
hypothesize that the London dataset
would be larger based on its greater spatial extent, but how much larger?
The answer in R is given by the function object.size:
object.size(shf2lds)
## 107464 bytes
object.size(lnd)
## 125544 bytes
In fact, the objects have similar sizes: the GPS dataset is surprisingly large.
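Raw byte counts are hard to read at a glance. As a minimal sketch, assuming only base R, the value returned by object.size can be reported in larger units. The coords matrix below is a hypothetical stand-in for a spatial object, used so that the example runs without the tutorial data.

```r
# Hypothetical stand-in for a spatial object: 6,085 random x/y coordinate pairs
set.seed(1)
coords <- cbind(x = runif(6085), y = runif(6085))

size <- object.size(coords)                # size in bytes, class "object_size"
print(size, units = "Kb")                  # the same value reported in kilobytes
size_auto <- format(size, units = "auto")  # let R choose a suitable unit
size_auto
```

The same calls should work on shf2lds and lnd once they have been loaded.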
To see why, we can find out how
many vertices (points connected by lines) are contained in each
dataset. To do this we use fortify from the ggplot2 package
(use the same method used for rgdal, described above, to install it):
library(ggplot2) # provides fortify()
shf2lds.f <- fortify(shf2lds)
nrow(shf2lds.f)
## [1] 6085
lnd.f <- fortify(lnd)
## Regions defined for each Polygons
nrow(lnd.f)
## [1] 1102
In the above block of code we performed two
steps for each object: 1) flatten the dataset with fortify so that
each vertex is allocated a unique row; 2) use nrow
to count the result.
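To make the idea of "flattening" concrete, here is a conceptual sketch in base R. It is not how ggplot2 implements fortify; the nested list lines_list and its contents are hypothetical, chosen only to show how a set of coordinate matrices becomes a table with one row per vertex.

```r
# Hypothetical nested geometry: two lines, each stored as a matrix of x/y vertices
lines_list <- list(
  a = cbind(x = c(0, 1, 2), y = c(0, 1, 0)),
  b = cbind(x = c(5, 6), y = c(5, 6))
)

# Flatten: convert each matrix to a data frame, tag it with its id, then stack
flat <- do.call(rbind, lapply(names(lines_list), function(id) {
  data.frame(lines_list[[id]], id = id)
}))

nrow(flat)  # 5: one row per vertex (3 from line a, 2 from line b)
```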
It is clear that the GPS data has around 5.5 times as many vertices as the London data (6,085 against 1,102), explaining its large size. Yet when plotted, the GPS data does not seem more detailed, implying that some of the vertices in the object are not needed for effective visualisation, since the individual nodes of the line are imperceptible.
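The link between vertex count and memory use can be checked with a rough sketch. It assumes each object is nothing more than a two-column numeric matrix of coordinates, which is a simplification: real spatial objects also carry attribute tables, bounding boxes and projection information, so the figures will not match those reported by object.size above.

```r
# Vertex counts taken from the text (44 is the count after simplification)
n_vertices <- c(simplified = 44, london = 1102, gps = 6085)

# Assumption: a bare n x 2 numeric matrix stands in for each geometry
sizes <- sapply(n_vertices, function(n) {
  as.numeric(object.size(matrix(0, nrow = n, ncol = 2)))
})

sizes                         # bytes per stand-in object
round(sizes / n_vertices, 1)  # bytes per vertex; fixed overhead inflates small objects
```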
Simplification can help to make a graphic more readable and less cluttered. Within the 'rgeos' package it is possible to use the gSimplify function to simplify spatial R objects:
library(rgeos)
shf2lds.simple <- gSimplify(shf2lds, tol = 0.001)
(object.size(shf2lds.simple)/object.size(shf2lds))[1]
## [1] 0.04608
plot(shf2lds.simple)
plot(shf2lds, col = "red", add = TRUE)
In the above block of code, gSimplify is given the object shf2lds
and a tol argument of 0.001. The tolerance is expressed in the units of the
object's coordinates (decimal degrees here), so much larger values may be
needed for data that is projected.
Next, we divide the size of the simplified object by that of the original
(note the use of the / symbol). The output of 0.046 tells us that the new
object is less than 5% of its original size. We can see how this has happened
by again counting the number of vertices. This time we use the
coordinates and nrow functions together:
nrow(coordinates(shf2lds.simple)[[1]][[1]])
## [1] 44
The syntax of the double square brackets may seem strange,
but it provides a taster of how R 'sees' spatial data.
Do not worry about this for now.
Of interest here is that the number of vertices has shrunk, from over 6,000 to
only 44, without losing much information about the shape of the line.
To test this, try plotting the original and simplified tracks
on your computer: when visualized using the plot function,
the object shf2lds.simple retains the overall shape of the
line and is virtually indistinguishable from the original object.
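The rgeos documentation describes gSimplify as using the Douglas-Peucker algorithm. The sketch below is a plain base R illustration of that general technique, not the GEOS code that rgeos calls. The function names perp_dist and douglas_peucker are hypothetical, the test line is synthetic, and distances are measured to the infinite line through the two end points rather than to the segment, which is a common simplification.

```r
# Distance from each row of pts to the straight line through points a and b
perp_dist <- function(pts, a, b) {
  d <- b - a
  len <- sqrt(sum(d^2))
  # If a and b coincide, fall back to plain distance from a
  if (len == 0) return(sqrt(rowSums(sweep(pts, 2, a)^2)))
  abs(d[1] * (pts[, 2] - a[2]) - d[2] * (pts[, 1] - a[1])) / len
}

# Recursive Douglas-Peucker: keep a vertex only if it lies further than
# tol from the line joining the end points of the current stretch
douglas_peucker <- function(xy, tol) {
  n <- nrow(xy)
  if (n < 3) return(xy)
  inner <- xy[2:(n - 1), , drop = FALSE]
  d <- perp_dist(inner, xy[1, ], xy[n, ])
  i <- which.max(d)
  if (d[i] <= tol) return(xy[c(1, n), , drop = FALSE])
  k <- i + 1  # position of the furthest vertex within xy
  left <- douglas_peucker(xy[1:k, , drop = FALSE], tol)
  right <- douglas_peucker(xy[k:n, , drop = FALSE], tol)
  rbind(left[-nrow(left), , drop = FALSE], right)  # drop the shared vertex once
}

# Synthetic track: 200 vertices along one period of a sine curve
x <- seq(0, 2 * pi, length.out = 200)
track <- cbind(x = x, y = sin(x))
simple <- douglas_peucker(track, tol = 0.01)
c(original = nrow(track), simplified = nrow(simple))

# Vertices on a perfectly straight line collapse to the two end points
nrow(douglas_peucker(cbind(0:10, 0:10), tol = 0.001))  # 2
```

As with the tol argument of gSimplify, the tolerance here is in the same units as the coordinates, so a suitable value depends on the data.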
This example is rather contrived because even the larger object,
shf2lds, is only a tenth of a megabyte, negligible compared with the gigabytes of
memory available to modern computers. However, it underlines a wider point:
for visualizing small scale maps, spatial data geometries
can often be simplified to reduce processing time and
use of memory.