There appears to be no direct implementation of spatially constrained clustering in R at present (see https://stat.ethz.ch/pipermail/r-sig-geo/2008-January/002998.html). In the absence of a translation in R of the very useful python library clusterPy, this script implements one simple form.
Spatially constrained clustering is needed when clusters are drawn from not-necessarily-spatial data, yet required to be spatially contiguous. Classic examples are demographic or real estate data.
Random data are used to allow insight into the reliability of the method. I presume that each cluster will contain only a few non-contiguous points, and then re-allocate these points following an initial clustering. Non-contiguous points are first identified, and then reallocated to the closest (minimal-distance) point within a contiguous cluster.
The approach only reallocates individually isolated points. Cases with 2 or more non-contiguous points will be left as is.
require(tripack) # For Delaunay triangulation and neighbour lists
require(spatstat)
npts <- 100
set.seed(12)
xy <- data.frame(array(runif(npts * 2), dim = c(npts, 2)))
names(xy) <- c("x", "y")
Contiguity is assessed from nearest neighbours, defined as sharing borders between Delaunay triangulations.
tm <- tri.mesh(xy)
nbs <- neighbours(tm)
To simulate a distance matrix representing data having only some (presumably) major portion of spatial variance, random noise is added to the distance matrix.
dmat <- dist(xy)
dmat <- dmat + 0.25 * rnorm(length(dmat))
hc <- hclust(dmat)
nclusts <- 8
clusts <- cutree(hc, nclusts)
Or using k-means clustering:
km <- kmeans(as.dist(dmat), centers = nc + nc.add, iter.max = 20, nstart = 5)
clusts <- as.vector(km$cluster)
Find all points that are entirely surrounded by points from other clusters (that is, points that have no neighbours in the same cluster).
non.nbs <- lapply(1:npts, function(x) {
cl.i <- clusts[nbs[[x]]]
if (!clusts[x] %in% cl.i) {
nlist = x
} else {
nlist = NULL
}
return(nlist)
})
non.nbs <- unlist(non.nbs)
Then find the neighbouring point to all non.nbs that is at minimal distance (as determined by dmat).
dmat <- as.matrix(dmat)
d.min <- lapply(non.nbs, function(x) {
dnbs <- dmat[nbs[[x]], x]
di <- which.min(dnbs)
cbind(dnbs[di], clusts[nbs[[x]][di]])
})
Then reallocate those points. d.min is then a list of minimal distances and cluster IDS over all neighbours of each point in non.nbs. The first step extracts the cluster IDs:
new.clIDs <- unlist(lapply(d.min, function(x) x[, 2]))
clusts.adj <- clusts
clusts.adj[non.nbs] <- new.clIDs
It is then possible that reallocated points may still be isolated, and the entire routine should thus be looped within a while length (non.nbs) > 0
clause.
plotxy <- function(xy, clusts) {
plot(xy, col = clusts, pch = 19, xlim = range(xy$x), ylim = range(xy$y))
for (i in 1:nclusts) {
indx <- which(clusts == i)
xy.pp <- ppp(xy$x[indx], xy$y[indx])
ch <- convexhull(xy.pp)
plot(ch, col = NA, border = i, add = TRUE)
text(mean(xy$x[indx]), mean(xy$y[indx]), pos = 4, labels = i, col = i,
cex = 2)
}
}
par(mfrow = c(1, 2))
plotxy(xy, clusts)
title(main = "Before")
plotxy(xy, clusts.adj)
title(main = "After")