This article shows how to use R to simplify an ESRI shape File making use of the Ramer-Douglas-Peuker algorithm. I have added my own enhancements to prevent slivers and overlaps that occur with existing packages.
The code here takes the Australian post code shape files which are very detailed and simplifies so that each post code boundary is made up of a series of straight lines that approximate the original boundary. There are lots of reasons why one may want to do this. In my case I wanted to show information using post codes and the original data file (after “fortifying”) was >6 million records long. It took too long to plot anything.
Ramer-Douglas-Peucker (RDP) is a really clever algorithm for this. Google it. There are lots of implementations already available in the packages you already use to read shape files, but they all seem to have a problem in that a shared boundary gets RDPed differently as part of each boundary. This means that the simplified shapes don’t fit together anymore. There are a series of slivers and overlaps.
Ok, so here goes with my example.
Download the Australian post codes shape file (the one that corresponds to the 2011 Census). It is called 2011_POA_shape.zip and is available from the Australian Bureau of Statisitics. I can’t seem to find an address that I can download it from within R using download.file(), but clicking on the download arrow at the intersection of Row “Postal Areas” and column “ESRI shape file” starts the download.
You may need to register and set up an account to access the ABS “Data Packs” but it is free.
I assume you have downloaded and unzipped the file to ‘C:/RPubs/Simplify_Boundaries’
Read the shape file into R using the maptools library (fixing some defective polygons automatically) and plot what it looks like:
require(maptools)
setwd('C:/RPubs/Simplify_Boundaries')
POA <- readShapePoly('POA_2011_AUST', repair=TRUE, delete_null_obj=TRUE)
Then plot the starting position
plot(POA, col=POA$POA_NAME)
That looks like Australia. Apart from the West Coast of Tasmania which has vanished due to being wilderness with no postal delivery. Some observations
Just to show the range of sizes:
cat('Largest=', max(POA$AREA_SQKM), ' smallest=', min(POA$AREA_SQKM), '\n')
## Largest= 1203979 smallest= 0.1806006
So lets use the packaged function to create a simplified set of boundaries and display the result:
Thin <- thinnedSpatialPoly(POA, tol=0.1, topologyPreserve=TRUE)
#pick some post codes close to home
select <- c('2069','2070','2071','2072','2073','2074','2075','2076')
par(mfrow=c(1,2))
plot(POA[POA$POA_NAME %in% select,], col=POA$POA_NAME); title(main='Original')
plot(Thin[Thin$POA_NAME %in% select,], col=Thin$POA_NAME); title(main='Thinned')
Pymble in light blue no longer has a border with Killara in green.
Load the data.table package
require(data.table)
First define some constants. There are thousands of islands that don’t really contribute much to a large scale presentation, so we can blow away the small ones based on size. But if the whole post code is an island we want to keep it. So for example Scotland Island is part of Sydney’s suburbs and we don’t want it to disappear completely as a post code . Generally the rings within each boundary are in decreasing order of size, so we can achieve this by always keeping the first ring. Then a group of islands post code will still be represented. We occasionally may have a hole which contains a group of small post code inclusions. The small inclusions may disappear while the hole remains, but this is the only rare case that may matter.
#Define some constants to control the simplificiation
#units are generally degrees of Lat/Lon. 1dg approx=100Km
min.size <- 0.04 #diagonal of bounding box size below which rings are deleted
min.apply.first <- FALSE #do not apply the minimum to the first ring in each polygon
max.deviation <-0.002 #simplified boundary stays within 200m of original boundary
max.it <- 20 #limit on Ramer-Douglas-Peucker in case it infinite loops
Create a function to calculate the size of the diagonal of the bounding box. This is quicker than calculating the areas of polygons. We can save more time by not square rooting it before comparing. We then use this function to determine which of the smaller islands to drown (or we could wait for global warming ;-))
#Delete any Polygons that are less than the minimum size. The min.size is compared with
#the bounding box diagonal, so first need a function to calculate the Bounding Box diagonal squared
#when passed a matrix of (lon,lat)
BBdiag2 <- function(coords){
bb <- apply(coords,2,range)
return(sum((bb[2,]-bb[1,])^2))
}
#loop reversed so as not to affect the next values position in the list if the previous one is deleted
drop.list <- NULL
drop.ring.list <- NULL
min.size2 <- min.size^2
for(i in rev(seq(POA))){
for (j in rev(seq(POA@polygons[[i]]@Polygons))){
if (BBdiag2(POA@polygons[[i]]@Polygons[[j]]@coords) < min.size2){
if (j > 1 | (j==1 & min.apply.first)) {
#cat('Deleting ',i,' ', j,' ', POA@polygons[[i]]@Polygons[[j]]@area, '\n')
drop.ring.list <- c(drop.ring.list, i+j/1000)
POA@polygons[[i]]@Polygons[j] <- NULL
}
}
#list of polygons have had all their rings removed
if (length(POA@polygons[[i]]@Polygons) == 0) drop.list <- c(drop.list,i)
}
}
#Remove the Polygons that have been completely dropped. There should not be any if we always kept
#the first ring using min.apply.first=FALSE
POA<-POA[!seq(POA) %in% drop.list,]
cat(length(drop.list), 'areas are removed completely, ', length(drop.ring.list), ' smaller rings dropped\n')
## 0 areas are removed completely, 6086 smaller rings dropped
To handle the communication of boundary changes (by RDP) we put all the data in a single data.table and then when we select a boundary marker as an RDP “node”, make that a node for all the other rings that share the same border. You may want to comment back in the #cat line below so you know it is still working. Otherwise the code looks like nothing is happenning for a long while.
All.df <- NULL
for (i in seq(POA)){
for (j in seq(POA@polygons[[i]]@Polygons)){
#cat('Adding Ring ',i,'.',j,'\n')
All.df <<- rbindlist(list(All.df
,data.table(
Poly=i
,Ring=j
,Ord=seq(nrow(POA@polygons[[i]]@Polygons[[j]]@coords)) #PlotOrder
,Lon=POA@polygons[[i]]@Polygons[[j]]@coords[,1] #Longitude
,Lat=POA@polygons[[i]]@Polygons[[j]]@coords[,2] #Latitude
)), use.names=FALSE)
}
}
Add in a field to contain the number of Rings that share each point Exclude the Ord=1 (“Head”) of each ring to prevent double counting with its own “Tail”.
All.df[Ord!=1 ,N:= .N, by=.(Lon,Lat)]
## Poly Ring Ord Lon Lat N
## 1: 1 1 1 130.8339 -12.45741 NA
## 2: 1 1 2 130.8338 -12.45723 2
## 3: 1 1 3 130.8334 -12.45684 2
## 4: 1 1 4 130.8335 -12.45675 2
## 5: 1 1 5 130.8339 -12.45627 2
## ---
## 4412433: 2513 1 594 145.5302 -41.79029 2
## 4412434: 2513 1 595 145.5303 -41.79025 2
## 4412435: 2513 1 596 145.5304 -41.79022 2
## 4412436: 2513 1 597 145.5305 -41.79020 2
## 4412437: 2513 1 598 145.5305 -41.79018 2
table(All.df$N)
##
## 1 2 3 4
## 1314422 3082510 12318 164
The table(All.df$N) call shows that around a third of the boundary markers part of in a single ring.
We can think of these as the “Coast”. Around half the boundary markers are shared between two rings. These can be thought of as “Fence”. There are quite a lot of “Pegs” where 3 areas meet and a few pegs where 4 meet. In general we don’t want to move any Pegs. There is no real cost to keeping these few pegs as RDP nodes. We also need to keep the points where Fences meet the Coast and keep all these as RDP nodes as well. The head/tail can be kept as nodes although a few could be removed if we rotated each ring so its head was at a point that was already a node.
The code below contains a function to find an “Extreme” Point as well as the RDP algorithm for a given set of co-ordinates (which may or may not be a ring). The Extremity function without a from_index can be used to find a rotation point for islands if you don’t want to keep the original Head/Tail as nodes.
#Function to find the most extreme point in a shape (optionally from some given index point)
#but if no point is given, then from the centre of the bounding box.
Extremity <- function(x, y, from_index=NULL){
x0 <- ifelse(is.null(from_index), mean(range((x))), x[from_index])
y0 <- ifelse(is.null(from_index), mean(range((y))), y[from_index])
return(which.max((x-x0)^2 + (y-y0)^2))
}
#Function to implement Ramer-Douglas-Peucker algorithm to find additional nodes
#x,y = Long,Lat
#n = boolean array of starting nodes. The end points are always assumed to be nodes
#e2 = square of the permitted tolerance
#max.it = maximum number of iterations before assuming an infinite loop
#m = minimum additional nodes to add. For a ring this is 2 as Head=Tail and 2 more nodes make a triangle
RDP <- function(x, y, n=rep(FALSE, length(x)), e2=(0.05*max(c(diff(range(x)), diff(range(y)))))^2
,max.it=10, m=0, display=FALSE){
l <-length(x)
nn <- n #new nodes
nn[1]<-TRUE #end points are always nodes even if not sepcified in input
nn[l]<-TRUE
if(sum(nn)==2 & x[1]==x[l] & y[1]==y[l]){
#dealing with a ring where the start and end is the same so the perpendicular distance part of
#RDP fails. Instead we find the most extreme point from the start and make it another node
nn[Extremity(x, y, 1)]<-TRUE
}
#iterate until no more nodes added
it <- 1
while (it<=max.it){
#perpendicular distance calculation for all points at once
cum.n <- cumsum(nn)
y0 <- y[nn][cum.n] #applicable y0 value from node before each y
y1 <- y[nn][1+cum.n] #applicable y1 value from node above each y
x0 <- x[nn][cum.n] #applicable x0 value from node before each x
x1 <- x[nn][1+cum.n] #applicable x1 value from node above each x
D2 <- ((y1-y0)*x - (x1-x0)*y + x1*y0 - y1*x0)^2/((y1-y0)^2+(x1-x0)^2)
#which values are the maximum within each section
n1 <- as.integer(which(nn) -1 + tapply(D2, cum.n, function(x) which.max(x)[1]))
#drop NAs and zeros
n1<-n1[which(D2[n1]>0)]
#check against minimum nodes required and tolerance
n1 <- n1[order(-D2[n1])] #reorder by distance decreasing
n1 <- n1[(seq(n1) <= m+2-sum(nn)) | (D2[n1] > e2)] #drop below tolerance if beyond minimum reqd nodes
#n1 contains index value of new nodes to be added
nn[n1] <- TRUE
it <- it + 1 #iteration
if (length(n1)==0) break #No more nodes added
}
#produce some cool plots in action if display=TRUE
if (display){
plot(x=x, y=y, type='l')
points(x=x[nn], y=y[nn], col='red', pch=20)
lines(x=x[nn], y=y[nn], col='blue')
}
return(which(nn &! n))
}
The graph below shows RDP acting on the first ring or the first polygon successively
xx <-POA@polygons[[1]]@Polygons[[1]]@coords[,1]
yy <-POA@polygons[[1]]@Polygons[[1]]@coords[,2]
par(mfrow=c(1,3))
nn <- RDP(xx,yy, m=2, max.it=1, display=TRUE)
nn <- RDP(xx,yy, m=2, max.it=2, display=TRUE)
nn <- RDP(xx,yy, m=2, max.it=3, display=TRUE)
Apply RDP to all the shapes and share the node updates with the other rings.
#now simplify each polygon shape
max.dev2 <- max.deviation^2 #distance squared maximum error of new boundary from old
#Set the Key on the sharing table
setkey(All.df, Lon, Lat)
#loop through all the rings
for (i in seq(POA)){
for (j in seq(POA@polygons[[i]]@Polygons)){
#Where are we up to - uncomment to see progress
#cat('Calculating Ring ',i,'.',j,'\n')
#Get the co-ordinates from the All.df
coords <- as.matrix(All.df[Poly==i & Ring==j][order(Ord), .(Lon, Lat)])
#Sharing counts from All.df
N <- as.integer(All.df[Poly==i & Ring==j][order(Ord), N])
#number of marker points
l <- length(N)
#make the head the same as the tail (which was ignored in the sharing count)
N[1] <- N[l]
#which of these values are nodes already
#A "peg" with 3 or more rings
nodes <- N > 2
#This gets the first Fence (N=2) along the coast ie increase in sharing from the previous marker
nodes <- nodes | (N > N[c(l-1,seq(l-1))])
#This is the last fence post before the coast ie increase in sharing from the next marker
nodes <- nodes | (N > N[c(1+seq(l-1),2)])
#Force the head and tail to be nodes if they were not already
nodes[1] <- TRUE
nodes[l] <- TRUE
#useful for debugging
#plot(coords, type='l');points(coords[nodes,], col='red');lines(coords[nodes,], col='blue')
#get the RDP nodes with min of 2 additional nodes so shape is at least a triangle
nodes.new <- RDP(coords[,1], coords[,2], n=nodes , e2=max.dev2, m=2, max.it=max.it)
nodes.new <- union(1, nodes.new) #Ensure head/tail is marked as a node too
#mark the new nodes in the sharing table so other shapes can use them too
All.df[.(Lon=coords[nodes.new,1],Lat=coords[nodes.new,2]),N:=10] #10 = added node
}
}
#Now loop through it again, this time replacing the rings with node only rings
for (i in seq(POA)){
for (j in seq(POA@polygons[[i]]@Polygons)){
#cat('Calculating Ring ',i,'.',j,'\n')
#Put the new co-ordinates back in the Shape
POA@polygons[[i]]@Polygons[[j]]@coords <- as.matrix(All.df[Poly==i & Ring==j & N>2]
[order(Ord), .(Lon, Lat)])
}
}
Save the Simplified Data File and Load the Original data again for some comparisons.
writePolyShape(POA, fn='POA_2011_SIMPLE')
Orig <- readShapePoly('POA_2011_AUST', repair=TRUE, delete_null_obj=TRUE)
## Warning in .Map2PolyDF(Map, IDs = IDvar, proj4string = proj4string,
## force_ring = force_ring, : Null objects with the following indices
## deleted: 2514, 2515, 2516
par(mfrow=c(1,2))
system.time(plot(Orig, col=POA$POA_NAME)); title(main='Original')
## user system elapsed
## 6.38 0.02 6.60
system.time(plot(POA, col=POA$POA_NAME)); title(main='Simplified')
## user system elapsed
## 3.30 0.00 3.36
Compare the selected post codes:
select <- c('2069','2070','2071','2072','2073','2074','2075','2076')
par(mfrow=c(1,3))
plot(Orig[Orig$POA_NAME %in% select,], col=Orig$POA_NAME); title(main='Original')
plot(Thin[Thin$POA_NAME %in% select,], col=Thin$POA_NAME); title(main='Thinned')
plot(POA[POA$POA_NAME %in% select,], col=POA$POA_NAME); title(main='Simplified')
The slivers and overlaps have been removed, while the overall shapes are very close. The number of boundary markers in the file has been reduced considerably. The 1’s and 2’s are no longer required:
table(All.df$N)
##
## 1 2 3 4 10
## 1274677 2903272 12087 152 222249
Complete !