The goal: produce k-means clusters of nearly equal size. Let's start by acknowledging that this isn't the point of k-means, but since we all understand that anyway, here goes:
library(magrittr); library(dplyr); library(ggplot2)
We get started by computing k-means the traditional way to create an incumbent solution:
## kmeans uses random starts; set.seed() first if you want this reproducible
data(mtcars)
k = 3
kdat = mtcars %>% select(mpg, wt)
kdat %>% kmeans(k) -> kclust
So far, so good. Now we'll compute the distance from every point to every center; these distances form the basis of the assignment step that follows.
## plain Euclidean distance between (x1, y1) and (x2, y2)
kdist = function(x1, y1, x2, y2){
  sqrt((x1 - x2)^2 + (y1 - y2)^2)
}
centers = kclust$centers
kdat %<>% mutate(D1 = kdist(mpg, wt, centers[1,1], centers[1,2]))
kdat %<>% mutate(D2 = kdist(mpg, wt, centers[2,1], centers[2,2]))
kdat %<>% mutate(D3 = kdist(mpg, wt, centers[3,1], centers[3,2]))
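As an aside: for a general k you'd want one distance column per center rather than a hand-written mutate per cluster. A minimal sketch of that, assuming (my convention, nothing sacred) the columns are named D1 through Dk:
## hypothetical generalization: one distance column per center
for(j in 1:k){
  kdat[[paste0("D", j)]] = kdist(kdat$mpg, kdat$wt,
                                 centers[j, 1], centers[j, 2])
}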
From here, we assign clusters. I think we should go with the 'little kids soccer' algorithm: go down the list, each cluster in turn picking its closest remaining point, until every cluster is full. This works 'OK' until you hit the last few picks.
kdat$assigned = 0                 # 0 = not yet assigned
kdat$index = 1:nrow(kdat)
working = kdat
nassign = trunc(nrow(kdat)/k)     # rounds down, so a few leftover rows may stay at 0
for(i in 1:nassign){
  ind1 = working$index[which.min(working$D1)]
  kdat$assigned[kdat$index == ind1] = 1
  working %<>% filter(index != ind1)
  ind2 = working$index[which.min(working$D2)]
  kdat$assigned[kdat$index == ind2] = 2
  working %<>% filter(index != ind2)
  ind3 = working$index[which.min(working$D3)]
  kdat$assigned[kdat$index == ind3] = 3
  working %<>% filter(index != ind3)
  ## The sorting hat says... GRYFFINDOR!!!
}
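The k = 3 draft above generalizes the same way; here's a sketch for arbitrary k, reusing the hypothetical D1..Dk columns from the earlier aside:
working = kdat
for(i in 1:nassign){
  for(j in 1:k){
    dcol = paste0("D", j)
    ind = working$index[which.min(working[[dcol]])]
    kdat$assigned[kdat$index == ind] = j
    working %<>% filter(index != ind)
  }
}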
The next step is to recompute the centroids. You can do this a number of ways; a fast one is to reuse kmeans with k = 1, like so:
kdat %>% filter(assigned == 1) %>%
  select(mpg, wt) %>%
  kmeans(1) %$% centers -> NewCenters
NewCenters %<>% rbind(kdat %>%
  filter(assigned == 2) %>%
  select(mpg, wt) %>%
  kmeans(1) %$% centers)
NewCenters %<>% rbind(kdat %>%
  filter(assigned == 3) %>%
  select(mpg, wt) %>%
  kmeans(1) %$% centers)
NewCenters %<>% as.data.frame()
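Another way, if the kmeans(1) trick feels too cute: group and average directly. A sketch that should produce the same centers (the filter drops any leftover unassigned rows; NewCenters2 is just my name for the result):
kdat %>%
  filter(assigned != 0) %>%
  group_by(assigned) %>%
  summarise(mpg = mean(mpg), wt = mean(wt)) %>%
  select(mpg, wt) %>%
  as.data.frame() -> NewCenters2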
The next step (which I'm not doing today, but which you should be able to figure out quickly) is to use the new centers as the incumbent solution in the assignment step above: recompute the distances, run the sorting algorithm again, and loop until a satisfactory solution is reached. My sense is that convergence will be rapid in practice.
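Sketched out, that outer loop might look like the following. To be clear about what's assumed: assign_step() is a hypothetical helper wrapping the distance and sorting code above (it isn't defined in this post), and the iteration cap is arbitrary:
cur_centers = kclust$centers
for(iter in 1:20){                       # hard cap; see the cycling caveat below
  kdat = assign_step(kdat, cur_centers)  # recompute D1..Dk, redo the draft
  kdat %>%
    filter(assigned != 0) %>%
    group_by(assigned) %>%
    summarise(mpg = mean(mpg), wt = mean(wt)) %>%
    select(mpg, wt) %>%
    as.matrix() -> upd
  if(max(abs(upd - cur_centers)) < 1e-6) break  # centers stopped moving
  cur_centers = upd
}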
You will notice that one point on the edge between the red and green clusters is clearly misclassified; iterating as just described should resolve this.
It is possible, likely even, for degeneracy to emerge, in the sense that the algorithm may cycle between candidate solutions. Again, it's a heuristic, and worth what you pay for.
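If cycling worries you, one cheap guard (again a sketch, using the hypothetical assign_step() from above) is to remember each assignment vector you've seen and stop when one repeats:
seen = character(0)
cur_centers = kclust$centers
repeat{
  kdat = assign_step(kdat, cur_centers)
  key = paste(kdat$assigned, collapse = "-")
  if(key %in% seen) break    # seen this assignment before: a cycle (or convergence)
  seen = c(seen, key)
  cur_centers = kdat %>% filter(assigned != 0) %>%
    group_by(assigned) %>%
    summarise(mpg = mean(mpg), wt = mean(wt)) %>%
    select(mpg, wt) %>% as.matrix()
}
Note that the same check doubles as a convergence test: a converged solution repeats its own assignment, so the loop exits either way.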
You can see how coercing the cluster sizes made the centroids migrate, significantly in the case of Cluster #3. Grey dots are the original centroids; black are the updated (equal-size) centroids.
kdat$assigned %<>% as.factor()
kdat %>% ggplot(aes(x = mpg, y = wt, color = assigned)) +
theme_minimal() + geom_point() +
geom_point(data = NewCenters, aes(x = mpg, y = wt),
color = "black", size = 4) +
geom_point(data = as.data.frame(centers),
aes(x = mpg, y = wt), color = "grey", size = 4)