This is an R Markdown document I created for myself, with examples of cluster analysis. It is based on Coursera's Exploratory Data Analysis course.
From Wikipedia: clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. We will look at a bottom-up approach and a top-down approach, respectively:
One possible strategy is agglomerative hierarchical clustering (agglomerative because clusters are merged bottom-up, hierarchical because we build a hierarchy of clusters): each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
What is required: a defined distance between observations and a rule for merging clusters.
Produces: a tree (dendrogram) showing how close the observations are to each other.
Suppose one has the following data:
set.seed(1234)
# 12 points in three groups of four, centered at (1,1), (2,2) and (3,1)
x=rnorm(12, mean=rep(1:3, each=4), sd=0.2)
y=rnorm(12, mean=rep(c(1, 2, 1), each=4), sd=0.2)
which we can plot to see what's going on, labeling the points as well:
plot(x, y, col="blue", pch=19, cex=2)
text(x+0.05, y+0.05, labels=as.character(1:12))
The distance between each pair of points is computed by the dist function (Euclidean distance by default):
dataFrame=data.frame(x=x, y=y)
distances=dist(dataFrame)
print(distances)
## 1 2 3 4 5 6 7 8 9
## 2 0.34121
## 3 0.57494 0.24103
## 4 0.26382 0.52579 0.71862
## 5 1.69425 1.35818 1.11953 1.80667
## 6 1.65813 1.31960 1.08339 1.78081 0.08150
## 7 1.49823 1.16621 0.92569 1.60132 0.21110 0.21667
## 8 1.99149 1.69093 1.45649 2.02849 0.61704 0.69792 0.65063
## 9 2.13630 1.83168 1.67836 2.35676 1.18350 1.11500 1.28583 1.76461
## 10 2.06420 1.76999 1.63110 2.29239 1.23848 1.16550 1.32063 1.83518 0.14090
## 11 2.14702 1.85183 1.71074 2.37462 1.28154 1.21077 1.37370 1.86999 0.11624
## 12 2.05664 1.74663 1.58659 2.27232 1.07701 1.00777 1.17740 1.66224 0.10849
## 10 11
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11 0.08318
## 12 0.19129 0.20803
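dist supports metrics other than the Euclidean default through its method argument. A quick sketch (the variable name is mine; this object isn't used later):
manhattanDistances=dist(dataFrame, method="manhattan")
# other built-in options: "maximum", "canberra", "binary", "minkowski"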
Let's do some clustering!
clusters=hclust(distances)
plot(clusters)
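Note that hclust merges clusters using complete linkage by default (the distance between two clusters is the largest pairwise distance between their members). Other linkage rules can be chosen with the method argument; a small sketch, not used below:
# average linkage tends to give more balanced clusters;
# "single" and "ward.D2" are other common choices
clustersAvg=hclust(distances, method="average")
plot(clustersAvg)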
Yay. Now let's cut the tree into 5 groups:
plot(clusters)
rect.hclust(clusters, k=5, border="red")
groups=cutree(clusters, k=5)
print(groups)
## [1] 1 2 2 1 3 3 3 4 5 5 5 5
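Instead of fixing the number of groups, cutree can also cut the dendrogram at a given height via its h argument (the threshold 1.5 below is just an illustrative value):
# cut the tree at height 1.5 instead of asking for exactly k groups
groupsByHeight=cutree(clusters, h=1.5)
print(groupsByHeight)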
K-means clustering is the top-down counterpart: the number of clusters k is fixed in advance, and the algorithm partitions the observations so that each one belongs to the cluster with the nearest mean (centroid).
K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm, which is known to converge. In theory the number of iterations required could be large, but in practice it typically converges quickly.
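To make the two alternating steps concrete, here is a minimal hand-rolled sketch (my own toy function, assuming no cluster ever becomes empty; use the built-in kmeans in practice):
simpleKmeans=function(data, k, maxIter=100) {
    # start from k randomly chosen observations as centroids
    centroids=data[sample(nrow(data), k), ]
    for (i in 1:maxIter) {
        # assignment ("expectation") step: nearest centroid for each point
        d=as.matrix(dist(rbind(centroids, data)))[-(1:k), 1:k]
        assignment=apply(d, 1, which.min)
        # update ("maximization") step: move centroids to cluster means
        newCentroids=aggregate(data, by=list(assignment), FUN=mean)[, -1]
        if (all(as.matrix(newCentroids) == as.matrix(centroids))) break
        centroids=newCentroids
    }
    list(cluster=assignment, centers=centroids)
}
The alternation between these two steps is exactly the EM connection mentioned above.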
Suppose one has the same simulated data as before:
set.seed(1234)
x=rnorm(12, mean=rep(1:3, each=4), sd=0.2)
y=rnorm(12, mean=rep(c(1, 2, 1), each=4), sd=0.2)
which we can plot to see what's going on, labeling the points as well:
plot(x, y, col="blue", pch=19, cex=2)
text(x+0.05, y+0.05, labels=as.character(1:12))
We can cluster the data using the kmeans function:
dataFrame <- data.frame(x, y)
kmeansObj <- kmeans(dataFrame, centers=3)
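Because the centroids are initialized randomly and EM only finds a local optimum, a common safeguard is to run several random starts and keep the best one via the nstart argument (the value 20 and the variable name below are just for illustration; the plots that follow use the single-start kmeansObj from above):
# 20 random starts; the run with the lowest total
# within-cluster sum of squares is kept
kmeansObjMulti <- kmeans(dataFrame, centers=3, nstart=20)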
Let's plot the result:
# color each point by its assigned cluster
plot(x, y, col=kmeansObj$cluster, pch=19, cex=2)
# mark the three centroids with large crosses
points(kmeansObj$centers, col=1:3, pch=3, cex=3, lwd=3)
text(x+0.05, y+0.05, labels=as.character(1:12))
and look at the object we obtained:
names(kmeansObj)
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
kmeansObj$cluster
## [1] 3 3 3 3 1 1 1 1 2 2 2 2
So, as we can see, the first four elements are in cluster 3, elements 5 to 8 are in cluster 1, and the remaining elements are in cluster 2. Note that the numeric labels themselves are arbitrary; what matters is which points end up grouped together.
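The returned object also carries the within-cluster sums of squares (withinss, tot.withinss), which are useful for choosing k when it isn't known in advance. A rough sketch of the classic elbow heuristic (the range 1:6 is an arbitrary choice):
# total within-cluster sum of squares for k = 1..6
wss=sapply(1:6, function(k) kmeans(dataFrame, centers=k, nstart=20)$tot.withinss)
plot(1:6, wss, type="b", xlab="number of clusters k",
     ylab="total within-cluster sum of squares")
# look for the "elbow" where adding more clusters stops helping much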