Simplify an ESRI Shape File

Executive Summary

This article shows how to use R to simplify an ESRI shape File making use of the Ramer-Douglas-Peuker algorithm. I have added my own enhancements to prevent slivers and overlaps that occur with existing packages.

Introduction

The code here takes the Australian post code shape files which are very detailed and simplifies so that each post code boundary is made up of a series of straight lines that approximate the original boundary. There are lots of reasons why one may want to do this. In my case I wanted to show information using post codes and the original data file (after “fortifying”) was >6 million records long. It took too long to plot anything.

Ramer-Douglas-Peucker (RDP) is a really clever algorithm for this. Google it. There are lots of implementations already available in the packages you already use to read shape files, but they all seem to have a problem in that a shared boundary gets RDPed differently as part of each boundary. This means that the simplified shapes don’t fit together anymore. There are a series of slivers and overlaps.

Ok, so here goes with my example.

Download ESRI shape file

Download the Australian post codes shape file (the one that corresponds to the 2011 Census). It is called 2011_POA_shape.zip and is available from the Australian Bureau of Statisitics. I can’t seem to find an address that I can download it from within R using download.file(), but clicking on the download arrow at the intersection of Row “Postal Areas” and column “ESRI shape file” starts the download.
You may need to register and set up an account to access the ABS “Data Packs” but it is free.

I assume you have downloaded and unzipped the file to ‘C:/RPubs/Simplify_Boundaries’

Read the shape file into R using the maptools library (fixing some defective polygons automatically) and plot what it looks like:

require(maptools)
setwd('C:/RPubs/Simplify_Boundaries')
POA <- readShapePoly('POA_2011_AUST', repair=TRUE, delete_null_obj=TRUE)

Then plot the starting position

plot(POA, col=POA$POA_NAME)

Why write my own code

That looks like Australia. Apart from the West Coast of Tasmania which has vanished due to being wilderness with no postal delivery. Some observations

Unordered list can use asterisks
Lots of detail - takes a long time to plot
Some very large post codes, but lots of tiny ones too
Some donuts where a regional area has a hole in which resides a small town
Lots of islands, some of which are their own post code (okay - you can’t see this but the dots off to the north west are Christmas Island and the Cocos Islands)

Just to show the range of sizes:

cat('Largest=', max(POA$AREA_SQKM), ' smallest=', min(POA$AREA_SQKM), '\n')

## Largest= 1203979  smallest= 0.1806006

So lets use the packaged function to create a simplified set of boundaries and display the result:

Thin <- thinnedSpatialPoly(POA, tol=0.1, topologyPreserve=TRUE)

#pick some post codes close to home
select <- c('2069','2070','2071','2072','2073','2074','2075','2076')
par(mfrow=c(1,2))
plot(POA[POA$POA_NAME %in% select,], col=POA$POA_NAME); title(main='Original')
plot(Thin[Thin$POA_NAME %in% select,], col=Thin$POA_NAME); title(main='Thinned')

Pymble in light blue no longer has a border with Killara in green.

The DIY approach

Load the data.table package

require(data.table)

First define some constants. There are thousands of islands that don’t really contribute much to a large scale presentation, so we can blow away the small ones based on size. But if the whole post code is an island we want to keep it. So for example Scotland Island is part of Sydney’s suburbs and we don’t want it to disappear completely as a post code . Generally the rings within each boundary are in decreasing order of size, so we can achieve this by always keeping the first ring. Then a group of islands post code will still be represented. We occasionally may have a hole which contains a group of small post code inclusions. The small inclusions may disappear while the hole remains, but this is the only rare case that may matter.

#Define some constants to control the simplificiation
#units are generally degrees of Lat/Lon. 1dg approx=100Km
min.size <- 0.04            #diagonal of bounding box size below which rings are deleted
min.apply.first <- FALSE    #do not apply the minimum to the first ring in each polygon
max.deviation <-0.002       #simplified boundary stays within 200m of original boundary 
max.it <- 20                #limit on Ramer-Douglas-Peucker in case it infinite loops

Create a function to calculate the size of the diagonal of the bounding box. This is quicker than calculating the areas of polygons. We can save more time by not square rooting it before comparing. We then use this function to determine which of the smaller islands to drown (or we could wait for global warming ;-))

#Delete any Polygons that are less than the minimum size.  The min.size is compared with 
#the bounding box diagonal, so first need a function to calculate the Bounding Box diagonal squared
#when passed a matrix of (lon,lat)
BBdiag2 <- function(coords){
    bb <- apply(coords,2,range)
    return(sum((bb[2,]-bb[1,])^2))
}

#loop reversed so as not to affect the next values position in the list if the previous one is deleted
drop.list <- NULL
drop.ring.list <- NULL
min.size2 <- min.size^2
for(i in rev(seq(POA))){
    for (j in rev(seq(POA@polygons[[i]]@Polygons))){
        if (BBdiag2(POA@polygons[[i]]@Polygons[[j]]@coords) < min.size2){
            if (j > 1 | (j==1 & min.apply.first)) {
                #cat('Deleting ',i,' ', j,' ', POA@polygons[[i]]@Polygons[[j]]@area, '\n')
                drop.ring.list <- c(drop.ring.list, i+j/1000) 
                POA@polygons[[i]]@Polygons[j] <- NULL 
            }
        }
        #list of polygons have had all their rings removed
        if (length(POA@polygons[[i]]@Polygons) == 0)  drop.list <- c(drop.list,i)
    }
}

#Remove the Polygons that have been completely dropped.  There should not be any if we always kept
#the first ring using min.apply.first=FALSE
POA<-POA[!seq(POA) %in% drop.list,]
cat(length(drop.list), 'areas are removed completely, ', length(drop.ring.list), ' smaller rings dropped\n')

## 0 areas are removed completely,  6086  smaller rings dropped

Create a node table

To handle the communication of boundary changes (by RDP) we put all the data in a single data.table and then when we select a boundary marker as an RDP “node”, make that a node for all the other rings that share the same border. You may want to comment back in the #cat line below so you know it is still working. Otherwise the code looks like nothing is happenning for a long while.

All.df <- NULL
for (i in seq(POA)){
    for (j in seq(POA@polygons[[i]]@Polygons)){
        #cat('Adding Ring ',i,'.',j,'\n')
        All.df <<- rbindlist(list(All.df
                                  ,data.table(
                                      Poly=i
                                      ,Ring=j
                                      ,Ord=seq(nrow(POA@polygons[[i]]@Polygons[[j]]@coords)) #PlotOrder
                                      ,Lon=POA@polygons[[i]]@Polygons[[j]]@coords[,1]        #Longitude
                                      ,Lat=POA@polygons[[i]]@Polygons[[j]]@coords[,2]        #Latitude
                                  )), use.names=FALSE)
    }
}

Add in a field to contain the number of Rings that share each point Exclude the Ord=1 (“Head”) of each ring to prevent double counting with its own “Tail”.

All.df[Ord!=1 ,N:= .N, by=.(Lon,Lat)]

##          Poly Ring Ord      Lon       Lat  N
##       1:    1    1   1 130.8339 -12.45741 NA
##       2:    1    1   2 130.8338 -12.45723  2
##       3:    1    1   3 130.8334 -12.45684  2
##       4:    1    1   4 130.8335 -12.45675  2
##       5:    1    1   5 130.8339 -12.45627  2
##      ---                                    
## 4412433: 2513    1 594 145.5302 -41.79029  2
## 4412434: 2513    1 595 145.5303 -41.79025  2
## 4412435: 2513    1 596 145.5304 -41.79022  2
## 4412436: 2513    1 597 145.5305 -41.79020  2
## 4412437: 2513    1 598 145.5305 -41.79018  2

table(All.df$N)

## 
##       1       2       3       4 
## 1314422 3082510   12318     164

The table(All.df$N) call shows that around a third of the boundary markers part of in a single ring.
We can think of these as the “Coast”. Around half the boundary markers are shared between two rings. These can be thought of as “Fence”. There are quite a lot of “Pegs” where 3 areas meet and a few pegs where 4 meet. In general we don’t want to move any Pegs. There is no real cost to keeping these few pegs as RDP nodes. We also need to keep the points where Fences meet the Coast and keep all these as RDP nodes as well. The head/tail can be kept as nodes although a few could be removed if we rotated each ring so its head was at a point that was already a node.

Ramer-Douglas-Peucker

The code below contains a function to find an “Extreme” Point as well as the RDP algorithm for a given set of co-ordinates (which may or may not be a ring). The Extremity function without a from_index can be used to find a rotation point for islands if you don’t want to keep the original Head/Tail as nodes.

#Function to find the most extreme point in a shape (optionally from some given index point)
#but if no point is given, then from the centre of the bounding box.
Extremity <- function(x, y, from_index=NULL){
    x0 <- ifelse(is.null(from_index), mean(range((x))), x[from_index])
    y0 <- ifelse(is.null(from_index), mean(range((y))), y[from_index])
    return(which.max((x-x0)^2 + (y-y0)^2))
}

#Function to implement Ramer-Douglas-Peucker algorithm to find additional nodes
#x,y = Long,Lat
#n = boolean array of starting nodes.  The end points are always assumed to be nodes
#e2 = square of the permitted tolerance 
#max.it = maximum number of iterations before assuming an infinite loop
#m = minimum additional nodes to add.  For a ring this is 2 as Head=Tail and 2 more nodes make a triangle
RDP <- function(x, y, n=rep(FALSE, length(x)), e2=(0.05*max(c(diff(range(x)), diff(range(y)))))^2
                ,max.it=10, m=0, display=FALSE){
    
    l <-length(x)
    nn <- n           #new nodes
    nn[1]<-TRUE       #end points are always nodes even if not sepcified in input
    nn[l]<-TRUE
    
    if(sum(nn)==2 & x[1]==x[l] & y[1]==y[l]){
        #dealing with a ring where the start and end is the same so the perpendicular distance part of 
        #RDP fails. Instead we find the most extreme point from the start and make it another node
        nn[Extremity(x, y, 1)]<-TRUE
    }
    
    #iterate until no more nodes added
    it <- 1
    while (it<=max.it){
        #perpendicular distance calculation for all points at once
        cum.n <- cumsum(nn)
        y0 <- y[nn][cum.n]   #applicable y0 value from node before each y
        y1 <- y[nn][1+cum.n] #applicable y1 value from node above each y
        x0 <- x[nn][cum.n]   #applicable x0 value from node before each x
        x1 <- x[nn][1+cum.n] #applicable x1 value from node above each x
        D2 <- ((y1-y0)*x - (x1-x0)*y + x1*y0 - y1*x0)^2/((y1-y0)^2+(x1-x0)^2)
        
        #which values are the maximum within each section
        n1 <- as.integer(which(nn) -1 + tapply(D2, cum.n, function(x) which.max(x)[1]))
        
        #drop NAs and zeros
        n1<-n1[which(D2[n1]>0)]
        
        #check against minimum nodes required and tolerance
        n1 <- n1[order(-D2[n1])]                            #reorder by distance decreasing
        n1 <- n1[(seq(n1) <= m+2-sum(nn)) | (D2[n1] > e2)]  #drop below tolerance if beyond minimum reqd nodes
        
        #n1 contains index value of new nodes to be added
        nn[n1] <- TRUE
        
        it <- it + 1 #iteration
        
        
        if (length(n1)==0) break  #No more nodes added
    }

    #produce some cool plots in action if display=TRUE
    if (display){
        plot(x=x, y=y, type='l')
        points(x=x[nn], y=y[nn], col='red', pch=20)
        lines(x=x[nn], y=y[nn], col='blue')
    }    
    
    return(which(nn &! n))
}

The graph below shows RDP acting on the first ring or the first polygon successively

xx <-POA@polygons[[1]]@Polygons[[1]]@coords[,1]
yy <-POA@polygons[[1]]@Polygons[[1]]@coords[,2]
par(mfrow=c(1,3))
nn <- RDP(xx,yy, m=2, max.it=1, display=TRUE)
nn <- RDP(xx,yy, m=2, max.it=2, display=TRUE)
nn <- RDP(xx,yy, m=2, max.it=3, display=TRUE)

Update the Shape File

Apply RDP to all the shapes and share the node updates with the other rings.

#now simplify each polygon shape
max.dev2 <- max.deviation^2  #distance squared maximum error of new boundary from old

#Set the Key on the sharing table
setkey(All.df, Lon, Lat)

#loop through all the rings
for (i in seq(POA)){
    for (j in seq(POA@polygons[[i]]@Polygons)){
        #Where are we up to - uncomment to see progress
        #cat('Calculating Ring ',i,'.',j,'\n')
        
        #Get the co-ordinates from the All.df
        coords <- as.matrix(All.df[Poly==i & Ring==j][order(Ord), .(Lon, Lat)])
        
        #Sharing counts from All.df
        N <- as.integer(All.df[Poly==i & Ring==j][order(Ord), N])
        
        #number of marker points
        l <- length(N)
        
        #make the head the same as the tail (which was ignored in the sharing count)
        N[1] <- N[l]
        
        #which of these values are nodes already
        #A "peg" with 3 or more rings
        nodes <- N > 2                         
        #This gets the first Fence (N=2) along the coast ie increase in sharing from the previous marker
        nodes <- nodes | (N > N[c(l-1,seq(l-1))]) 
        #This is the last fence post before the coast ie increase in sharing from the next marker
        nodes <- nodes | (N > N[c(1+seq(l-1),2)])
        #Force the head and tail to be nodes if they were not already
        nodes[1] <- TRUE
        nodes[l] <- TRUE
        
        #useful for debugging
        #plot(coords, type='l');points(coords[nodes,], col='red');lines(coords[nodes,], col='blue')
        
        #get the RDP nodes with min of 2 additional nodes so shape is at least a triangle
        nodes.new <- RDP(coords[,1], coords[,2], n=nodes , e2=max.dev2, m=2, max.it=max.it) 
        nodes.new <- union(1, nodes.new) #Ensure head/tail is marked as a node too
        
        #mark the new nodes in the sharing table so other shapes can use them too
        All.df[.(Lon=coords[nodes.new,1],Lat=coords[nodes.new,2]),N:=10] #10 = added node
    }
}

#Now loop through it again, this time replacing the rings with node only rings
for (i in seq(POA)){
    for (j in seq(POA@polygons[[i]]@Polygons)){
        #cat('Calculating Ring ',i,'.',j,'\n')
        #Put the new co-ordinates back in the Shape
        POA@polygons[[i]]@Polygons[[j]]@coords <- as.matrix(All.df[Poly==i & Ring==j & N>2]
                                                            [order(Ord), .(Lon, Lat)])
    }
}

Compare Output

Save the Simplified Data File and Load the Original data again for some comparisons.

writePolyShape(POA, fn='POA_2011_SIMPLE')
Orig <- readShapePoly('POA_2011_AUST', repair=TRUE, delete_null_obj=TRUE)

## Warning in .Map2PolyDF(Map, IDs = IDvar, proj4string = proj4string,
## force_ring = force_ring, : Null objects with the following indices
## deleted: 2514, 2515, 2516

par(mfrow=c(1,2))
system.time(plot(Orig, col=POA$POA_NAME)); title(main='Original')

##    user  system elapsed 
##    6.38    0.02    6.60

system.time(plot(POA, col=POA$POA_NAME)); title(main='Simplified')

##    user  system elapsed 
##    3.30    0.00    3.36

Compare the selected post codes:

select <- c('2069','2070','2071','2072','2073','2074','2075','2076')
par(mfrow=c(1,3))
plot(Orig[Orig$POA_NAME %in% select,], col=Orig$POA_NAME); title(main='Original')
plot(Thin[Thin$POA_NAME %in% select,], col=Thin$POA_NAME); title(main='Thinned')
plot(POA[POA$POA_NAME %in% select,], col=POA$POA_NAME); title(main='Simplified')

The slivers and overlaps have been removed, while the overall shapes are very close. The number of boundary markers in the file has been reduced considerably. The 1’s and 2’s are no longer required:

table(All.df$N)

## 
##       1       2       3       4      10 
## 1274677 2903272   12087     152  222249

Complete !