Spatially constrained clustering

There appears to be no direct implementation of spatially constrained clustering in R at present (see https://stat.ethz.ch/pipermail/r-sig-geo/2008-January/002998.html). In the absence of a translation in R of the very useful python library clusterPy, this script implements one simple form.

Spatially constrained clustering is needed when clusters are drawn from not-necessarily-spatial data, yet required to be spatially contiguous. Classic examples are demographic or real estate data.

Random data are used to allow insight into the reliability of the method. I presume that each cluster will contain only a few non-contiguous points, and then re-allocate these points following an initial clustering. Non-contiguous points are first identified, and then reallocated to the closest (minimal-distance) point within a contiguous cluster.

The approach only reallocates individually isolated points. Cases with 2 or more non-contiguous points will be left as is.

require(tripack)  # For Delaunay triangulation and neighbour lists
require(spatstat)
npts <- 100
set.seed(12)
xy <- data.frame(array(runif(npts * 2), dim = c(npts, 2)))
names(xy) <- c("x", "y")

Contiguity

Contiguity is assessed from nearest neighbours, defined as sharing borders between Delaunay triangulations.

tm <- tri.mesh(xy)
nbs <- neighbours(tm)

Not quite spatial data

To simulate a distance matrix representing data having only some (presumably) major portion of spatial variance, random noise is added to the distance matrix.

dmat <- dist(xy)
dmat <- dmat + 0.25 * rnorm(length(dmat))
hc <- hclust(dmat)
nclusts <- 8
clusts <- cutree(hc, nclusts)

Or using k-means clustering:

km <- kmeans(as.dist(dmat), centers = nc + nc.add, iter.max = 20, nstart = 5)
clusts <- as.vector(km$cluster)

Spatial constraint

Nearest neighbours

Find all points that are entirely surrounded by points from other clusters (that is, points that have no neighbours in the same cluster).

non.nbs <- lapply(1:npts, function(x) {
    cl.i <- clusts[nbs[[x]]]
    if (!clusts[x] %in% cl.i) {
        nlist = x
    } else {
        nlist = NULL
    }
    return(nlist)
})
non.nbs <- unlist(non.nbs)

Then find the neighbouring point to all non.nbs that is at minimal distance (as determined by dmat).

dmat <- as.matrix(dmat)
d.min <- lapply(non.nbs, function(x) {
    dnbs <- dmat[nbs[[x]], x]
    di <- which.min(dnbs)
    cbind(dnbs[di], clusts[nbs[[x]][di]])
})

Then reallocate those points. d.min is then a list of minimal distances and cluster IDS over all neighbours of each point in non.nbs. The first step extracts the cluster IDs:

new.clIDs <- unlist(lapply(d.min, function(x) x[, 2]))
clusts.adj <- clusts
clusts.adj[non.nbs] <- new.clIDs

It is then possible that reallocated points may still be isolated, and the entire routine should thus be looped within a while length (non.nbs) > 0 clause.

Plot

plotxy <- function(xy, clusts) {
    plot(xy, col = clusts, pch = 19, xlim = range(xy$x), ylim = range(xy$y))
    for (i in 1:nclusts) {
        indx <- which(clusts == i)
        xy.pp <- ppp(xy$x[indx], xy$y[indx])
        ch <- convexhull(xy.pp)
        plot(ch, col = NA, border = i, add = TRUE)
        text(mean(xy$x[indx]), mean(xy$y[indx]), pos = 4, labels = i, col = i, 
            cex = 2)
    }
}
par(mfrow = c(1, 2))
plotxy(xy, clusts)
title(main = "Before")
plotxy(xy, clusts.adj)
title(main = "After")

plot of chunk unnamed-chunk-8