Clipping and aggregating with gIntersects

library(sp)
library(rgeos)
## rgeos version: 0.3-8, (SVN revision 460)
##  GEOS runtime version: 3.4.2-CAPI-1.8.2 r3921 
##  Polygon checking: TRUE

This short piece demonstrates how clipping spatial data can be done using the rgeos package in R. A more common method is to use the sp::over function, as detailed in another, larger tutorial see here.

The test datasets are contained in the data folder of a repository called Creating-maps-in-R. These, and the rgeos library needed for the function can be loaded with the following commands:

load("data/lnd.RData")
load("data/stations.RData")

Here, we use gIntersects, although we could equally use gContains, gWithin and other g... functions. The power of these commands can be seen by accessing the rgeos help pages, e.g. ?gOverlaps. gIntersects will output information for each point, telling us which polygon it interacts with (i.e. the polygon it is in).

Let’s take a look at how the function works on our test data:

int <- gIntersects(stations, lnd, byid = T) # find which stations intersect 
class(int) # it's outputed a matrix
dim(int) # with 33 rows (one for each zone) and 2532 cols (the points)
summary(int[,c(200,500)]) # not the output of this
plot(lnd)
points(stations[200,], col = "red") # note point id 200 is outside the zones
points(stations[500,], col = "green") # note point 500 is inside
which(int[,500] == T) # this tells us that point 500 intersects with zone 32
points(coordinates(lnd[32,]), col = "black") # test the previous statement

In the above code, only the first line actually ‘does’ anything in our workspace, by creating the object int. The proceeding lines are dedicated to exploring this object and what it means. Note that it is a matrix with columns corresponding to the points and rows corresponding to boroughs. The borough in which a particular point can be extracted from int as we shall see below. For the purposes of clipping, we are only interested in whether the point intersects with any of the boroughs. This is where the function apply, which is unique to R, comes into play:

clipped <- apply(int == F, MARGIN = 2, all)
plot(stations[which(clipped),]) # shows all stations we DO NOT want
stations.cl <- stations[which(!clipped),] # use ! to select the invers
points(stations.cl, col = "green") # check that it's worked

stations <- stations.cl; rm(stations.cl) # tidy up: we're only interested in clipped ones

The first line instructs R to look at each column (MARGIN = 2, we would use MARGIN = 1 for row-by-row analysis) and report back whether all of the values are false. This creates the inverse selection that we want, hence the use of ! to invert it. We test that the function works on a new object (often a good idea, to avoid overwriting useful data) with plots and, once content that the clip has worked, save the sample of points to our main stations object and remove the now duplicated stations.cl object.

Optional task: aggregation with gIntersects

As with clipping, we can also do spatial aggregation with the rgeos package. In some ways, this method makes explicit the steps taken in aggregate ‘under the hood’. The code is quite involved and intimidating, so feel free to skip this stage. Working through and thinking about it this alternative method may, however, pay dividends if you intend to perform more sophisticated spatial analysis in R.

int <- gIntersects(stations, lnd, byid = TRUE) # re-run the intersection query
head(apply(int, MARGIN = 2, FUN = which))
## 91 92 93 94 95 96 
##  6 10 10 10 10 10
b.indexes <- which(int, arr.ind = TRUE) # indexes that intersect
summary(b.indexes)
##       row             col       
##  Min.   : 1.00   Min.   :  1.0  
##  1st Qu.: 8.00   1st Qu.:183.5  
##  Median :15.00   Median :366.0  
##  Mean   :15.01   Mean   :366.0  
##  3rd Qu.:22.00   3rd Qu.:548.5  
##  Max.   :33.00   Max.   :731.0
b.names <- lnd$name[b.indexes[, 1]]
b.count <- aggregate(b.indexes ~ b.names, FUN = length)
head(b.count)
##                b.names row col
## 1 Barking and Dagenham  12  12
## 2               Barnet  31  31
## 3               Bexley  28  28
## 4                Brent  30  30
## 5              Bromley  48  48
## 6               Camden  14  14

The above code first extracts the index of the row (borough) for which the corresponding column is true and then converts this into names. The final object created, b.count contains the number of station points in each zone. According to this, Barking and Dagenham should contain 12 station points. It is important to check the output makes sense at every stage with R, so let’s check to see if this is indeed the case with a quick plot:

plot(lnd[grepl("Barking", lnd$name),])
points(stations)

Transport points in Barking and Dagenham

Now the fun part: count the points in the polygon and report back how many there are!

We have now seen how to load, join and clip data. The second half of this tutorial is concerned with visualisation of the results. For this, we will use ggplot2 and begin by looking at how it handles non-spatial data.