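The outline of the algorithm is:

1. Fix the number of clusters at some integer greater than or equal to 2.
2. Start with a centroid for each cluster; initially these can simply be randomly chosen points.
3. Assign each point to its closest centroid; cluster membership follows from that assignment.
4. Recalculate the centroid positions and repeat from step 3.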

This approach, like most clustering methods, requires a defined distance metric, a fixed number of clusters, and an initial guess at the cluster centroids. There is no set approach to determining the initial configuration of centroids, but many algorithms simply select data points at random from your dataset as the initial centroids.
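To make these requirements concrete, here is a minimal sketch of the iteration in base R. It assumes Euclidean distance, and `simple_kmeans` is a hypothetical helper written for illustration only; it is not how `kmeans()` is implemented, and it does not guard against poor starting points.

```r
# Hypothetical helper: a bare-bones K-means loop, for illustration only.
# Assumes k >= 2 and squared Euclidean distance.
simple_kmeans <- function(data, k, max_iter = 100) {
  data <- as.matrix(data)
  # Initial guess: k distinct data points chosen at random
  centroids <- data[sample(nrow(data), k), , drop = FALSE]
  cluster <- rep(0L, nrow(data))
  for (iter in seq_len(max_iter)) {
    # Squared distance from every point to every centroid (n x k matrix)
    dists <- sapply(seq_len(k), function(j) {
      rowSums(sweep(data, 2, centroids[j, ])^2)
    })
    # Assign each point to its closest centroid
    new_cluster <- apply(dists, 1, which.min)
    # Stop once no point changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Recalculate each centroid as the mean of its assigned points
    for (j in seq_len(k)) {
      members <- data[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centroids, iter = iter)
}

# Try it on the same simulated data used in this section
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
fit <- simple_kmeans(data.frame(x, y), k = 3)
fit$cluster
fit$centers
```

Because the starting centroids are random, the cluster labels (and occasionally the grouping itself) can differ from run to run unless the seed is fixed.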

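The K-means algorithm produces a final estimate of the cluster centroids (that is, their coordinates) and an assignment of each point to one of the clusters.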

Illustrating the K-means algorithm

# Set the seed
set.seed(1234)

# Set the X & Y vectors
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)

# Create the Plot
plot(x, y, col = "blue", pch = 19, cex = 2)

# Set the text in the plot
text(x + 0.05, y + 0.05, labels = as.character(1:12))

[Figure: picking the initial centroids (k-means-clustering-picking-centroid.png)]

[Figure: assigning points to the nearest centroid (k-means-clustering-assigning-points.png)]

[Figure: recalculating the centroids (k-means-clustering-recalculate-centroids.png)]

[Figure: one full cycle of the algorithm (k-means-clustering-full-cycle.png)]

Stopping the algorithm

Using the kmeans() function

dataFrame <- data.frame(x, y)
kmeansObj <- kmeans(dataFrame, centers = 3)
names(kmeansObj)
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
# Check assigned cluster for data point
kmeansObj$cluster
##  [1] 3 3 3 3 1 1 1 1 2 2 2 2
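The fitted object can also be plotted directly. The following is a sketch of one way to do it, coloring each point by its assigned cluster and marking the estimated centroids with crosses; the plotting parameters are illustrative choices, not requirements.

```r
# Recreate the simulated data and fit so this block runs on its own
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x, y)
kmeansObj <- kmeans(dataFrame, centers = 3)

# Color each point by its assigned cluster
plot(x, y, col = kmeansObj$cluster, pch = 19, cex = 2)

# Overlay the estimated centroids as crosses in matching colors
points(kmeansObj$centers, col = 1:3, pch = 3, cex = 3, lwd = 3)

# Number of points per cluster, and iterations used
kmeansObj$size
kmeansObj$iter
```

Since `kmeans()` starts from randomly chosen rows, the numeric labels depend on the random seed. The `nstart` argument asks for several random starts and keeps the best one, which may help on less cleanly separated data.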
[Figure: final K-means solution (k-means-clustering-final-solution.png)]

Building heatmaps from K-means solutions

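set.seed(1234)

# Shuffle the rows so the original ordering does not reveal the clusters
dataMatrix <- as.matrix(dataFrame)[sample(1:12), ]
kmeansObj <- kmeans(dataMatrix, centers = 3)

# Two panels side by side
par(mfrow = c(1, 2))

# Left: the shuffled data, one row per observation
image(t(dataMatrix)[, nrow(dataMatrix):1], yaxt = "n", main = "Original Data")

# Right: the same rows reordered so members of each cluster sit together
image(t(dataMatrix)[, order(kmeansObj$cluster)], yaxt = "n", main = "Clustered Data")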
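The reordering in the second panel relies on `order(kmeansObj$cluster)`, which returns the row indices sorted by cluster label. A small sketch of what that does on its own, using the same simulated data (the variable names `ord` and `sortedLabels` are introduced here purely for illustration):

```r
# Recreate the shuffled data matrix and the fit
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataMatrix <- as.matrix(data.frame(x, y))[sample(1:12), ]
kmeansObj <- kmeans(dataMatrix, centers = 3)

# Row indices arranged so that rows with the same label are adjacent
ord <- order(kmeansObj$cluster)

# Labels before and after reordering
kmeansObj$cluster
sortedLabels <- kmeansObj$cluster[ord]
sortedLabels

# The reordered matrix that the "Clustered Data" panel draws from
dataMatrix[ord, ]
```

After reordering, each cluster occupies a contiguous block of rows, which is why the clustered heatmap shows bands of similar color.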