Introduction

Tom Helmuth and I are still trying to work out the best approach to do our clustering. After working with some (x, y) points, we wanted to explore binary vectors since that will be more like our the data we’re actually going to be working with.

Again, we’ll load some libraries:

library('ggplot2')
library('cluster')
library('apcluster')

Creating some test data

The make_vector function generates a sequence of random bits with the specified number of ones. mutate_vector takes a binary vector and a percentage of bits to flip and mutates that vector.

sequence_length = 100

make_vector <- function(percent_ones) 
  replicate(sequence_length, ifelse(runif(1)[1]>percent_ones, 0, 1))

flip <- function(b) (1-b)
mutate_vector <- function (v, percent_flips)
  sapply(v, function(b) ifelse(runif(1)[1]<percent_flips, flip(b), b))

Make a bunch of vectors that will be the center of clusters obtained by mutating these center vectors:

The vectors p, q, and r should all be quite close to each other, and so clusters formed around them should also be fairly close, although some of the underlying structure may be extractable if one clusters agressively. I’ll refer to these as the “tight” group later on.

all_zeros = rep(0, sequence_length)
ten_percent_zeros = make_vector(0.1)
all_ones = rep(1, sequence_length)
u = make_vector(0.5)
v = make_vector(0.5)
w = make_vector(0.5)
p = make_vector(0.5)
q = mutate_vector(p, 0.01)
r = mutate_vector(p, 0.01)

Now generate a bunch of clusters:

mostly_zeros <- t(replicate(20, mutate_vector(all_zeros, 0.02)))
nearly_10_percent_zeros <- t(replicate(20, mutate_vector(ten_percent_zeros, 0.02)))
mostly_ones <- t(replicate(20, mutate_vector(all_ones, 0.1)))
near_u <- t(replicate(20, mutate_vector(u, 0.1)))
near_v <- t(replicate(20, mutate_vector(v, 0.1)))
near_w <- t(replicate(20, mutate_vector(w, 0.1)))
near_p <- t(replicate(20, mutate_vector(p, 0.01)))
near_q <- t(replicate(20, mutate_vector(q, 0.01)))
near_r <- t(replicate(20, mutate_vector(r, 0.01)))
all <- rbind(mostly_zeros, nearly_10_percent_zeros, mostly_ones, 
             near_u, near_v, near_w, near_p, near_q, near_r)

Running agnes and apcluster on all our data

agnes_all <- agnes(all)
plot(agnes_all, which.plots=2)

If we cut at height 5 we get usually get six clusters (but I have seen seven once):

sum(agnes_all$height>5) + 1
## [1] 6

These clusters are (roughly):

Applying apcluster generates seven (sometimes eight) clusters, as apcluster (with the default settings) separates out mostly_zeros and nearly_10_percent_zeros, which agnes pulled together.

ap_all <- apcluster(negDistMat(r=2), all, details=TRUE)
length(ap_all)
## [1] 7
heatmap(ap_all)

What if we cluster the “tight” group separately?

In the clusterings above near_p, near_q, and near_r were clustered together, which is (I think) what we want. Now what if we cluster them on their own, as we would if they were, for example, the clusters coming from tournament selection?

tight <- rbind(near_p, near_q, near_r)

If we apply agnes to just this group, there’s definitely some grouping in the dendrogram, but maybe not as clear as we want for making this point? (I’ll need to talk to Tom about this.)

agnes_tight <- agnes(tight)
plot(agnes_tight, which.plots=2)

That said, if we use the same height (5), we just get one cluster, which is good:

sum(agnes_tight$height>5) + 1
## [1] 1

Applying apcluster to this data gives us lots o’ clusters, like what we get with agnes if we set the height limit to some lower value like 1 or 2.

ap_tight <- apcluster(negDistMat(r=2), tight, details=TRUE)
heatmap(ap_tight)

length(ap_tight)
## [1] 6
sum(agnes_tight$height>2) + 1
## [1] 3

If we extract the input preference value p from running apcluster on all the data and use it as the input preference value when clustering the tight subset, then here again we usually only get one cluster (but I have seen two):

ap_all@p
## [1] -49
ap_tight_modified_p <- apcluster(negDistMat(r=2), tight, p=ap_all@p, details=TRUE)
length(ap_tight_modified_p)
## [1] 1

Note that setting the q parameter to 0 sometimes gives us multiple clusters; I’ve seen up to 3, although sometimes it just gives us 1. That does suggest that just setting q to 0 isn’t going to work as a strategy.

ap_tight_q0 <- apcluster(negDistMat(r=2), tight, q=0, details=TRUE)
length(ap_tight_q0)
## [1] 3

Conclusion

So we can clearly manipulate both agnes and apcluster to get the “right” answer for the tight group, i.e, to place them all in a single cluster.

The question, then, is how understandable and explicable are the decisions that need to be made? While I was really keen to use apcluster at first, trying to find “correct’ settings for either the p or q parameters, and then explain those, is gonna be a pain. It seems that using distance/height cutoffs in agnes will be a lot easier to explain, and easier to map across a variety of problems, etc.

So I’m kinda thinking that agnes and height cutoffs are the way to go.


Using Manhattan distance instead of Euclidean

Manhattan distance with agnes

Let’s try using Manhattan distance instead of Euclidean distance when using agnes.

agnes_all_manh <- agnes(all, metric= "manhattan")
plot(agnes_all_manh, which.plots=2)

As expected, the heights are now much higher between clusters. For example, where Euclidean distances between all_ones and all_zeros will be

dist(rbind(all_zeros, all_ones))
##          all_zeros
## all_ones        10

with Manhattan distances it will be

dist(rbind(all_zeros, all_ones), method = "manhattan")
##          all_zeros
## all_ones       100

If we use a larger height, we still see a similar number of clusters (here’s with height = 30:

sum(agnes_all_manh$height>30) + 1
## [1] 6

We threw around “10% of the test cases different” as being a potential cutoff for the clustering. For this data at least, that seems like it would be too fine-grained:

sum(agnes_all_manh$height>10) + 1
## [1] 75

I got 79 there, which is close to half the size of the dataset, which has the following number of points:

dim(all)[1]
## [1] 180

Of course, with real data 10% might be reasonable, or something like the 30% I used above might be better.

apcluster with Manhattan distances

Let’s try apcluster with Manhattan distances.

ap_all_manh <- apcluster(negDistMat(method="manhattan"), all, details=TRUE)
heatmap(ap_all_manh)

length(ap_all_manh)
## [1] 7

If we use Manhattan distances and q=0 on the “tight” data, we still get three (sometimes two) clusters. I suspect that the fact that Manhattan distances tend to be larger than Euclidean means it will be that much harder to get apcluster to see the “tight” group as a single cluster.

ap_tight_manh <- apcluster(negDistMat(method="manhattan"), tight, q=0, details=TRUE)
length(ap_tight_manh)
## [1] 3