Cluster analysis

This is an R Markdown document I put together for my own reference, with examples of cluster analysis. It is based on Coursera's Exploratory Data Analysis course.


Definition

From Wikipedia: clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. We will look at two strategies in turn: a bottom-up approach (hierarchical clustering) and a top-down approach (K-means).


One Strategy: bottom-up approach

One possible strategy is agglomerative hierarchical clustering (so called because we build up a hierarchy of clusters): each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy:

What is required: a defined distance metric and a merging (linkage) strategy for deciding which clusters to join.

Produces: a tree (dendrogram) showing how close the observations are to each other (a minimal sketch of both ingredients follows).
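
As a minimal sketch of those two ingredients (the data here is a throwaway matrix, not the example below; the function and argument names are standard base R, but the specific choices are only illustrative):

# A sketch: hierarchical clustering needs (1) a distance and (2) a merging rule.
m <- matrix(rnorm(20), ncol=2)        # throwaway 10x2 data, just for illustration
d <- dist(m, method="euclidean")      # (1) the distance metric; "manhattan" etc. also work
hc <- hclust(d, method="complete")    # (2) the merging (linkage) rule; "average", "single", ...
plot(hc)                              # the output: a dendrogram (tree)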


Example:

Suppose one has the following data:

set.seed(1234)                                        # for reproducibility
x <- rnorm(12, mean=rep(1:3, each=4), sd=0.2)         # x-means 1, 2, 3 (four points each)
y <- rnorm(12, mean=rep(c(1, 2, 1), each=4), sd=0.2)  # y-means 1, 2, 1 -> three true clusters

which we can plot to see what's going on, labeling the points as well:

plot(x, y, col="blue", pch=19, cex=2)
text(x+0.05, y+0.05, labels=as.character(1:12))

(Figure: scatterplot of the 12 simulated points, labeled 1 to 12)

The distance between pairs of points is computed by the dist function:

dataFrame <- data.frame(x=x, y=y)
distances <- dist(dataFrame)   # pairwise distances between the 12 points
print(distances)
##          1       2       3       4       5       6       7       8       9
## 2  0.34121                                                                
## 3  0.57494 0.24103                                                        
## 4  0.26382 0.52579 0.71862                                                
## 5  1.69425 1.35818 1.11953 1.80667                                        
## 6  1.65813 1.31960 1.08339 1.78081 0.08150                                
## 7  1.49823 1.16621 0.92569 1.60132 0.21110 0.21667                        
## 8  1.99149 1.69093 1.45649 2.02849 0.61704 0.69792 0.65063                
## 9  2.13630 1.83168 1.67836 2.35676 1.18350 1.11500 1.28583 1.76461        
## 10 2.06420 1.76999 1.63110 2.29239 1.23848 1.16550 1.32063 1.83518 0.14090
## 11 2.14702 1.85183 1.71074 2.37462 1.28154 1.21077 1.37370 1.86999 0.11624
## 12 2.05664 1.74663 1.58659 2.27232 1.07701 1.00777 1.17740 1.66224 0.10849
##         10      11
## 2                 
## 3                 
## 4                 
## 5                 
## 6                 
## 7                 
## 8                 
## 9                 
## 10                
## 11 0.08318        
## 12 0.19129 0.20803
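
As a quick aside (not in the original notes), dist() uses Euclidean distance by default, so any printed entry can be checked by hand, for example the pair (1, 2):

sqrt((x[1] - x[2])^2 + (y[1] - y[2])^2)   # should reproduce the "2  0.34121" entry above
as.matrix(distances)[2, 1]                # the same entry pulled out of the dist object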

Let's do some clustering!

clusters <- hclust(distances)   # agglomerative clustering (complete linkage by default)
plot(clusters)                  # draw the dendrogram

(Figure: dendrogram of the 12 points produced by hclust)
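
To tie the picture back to the object itself, here is a small exploratory sketch using standard components of an hclust result:

clusters$method   # the linkage used ("complete", hclust's default)
clusters$height   # heights at which successive merges happen (the vertical axis above)
clusters$merge    # which observations/clusters are joined at each step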

Yay. Now let's cut the tree into 5 groups, first highlighting them on the dendrogram:

plot(clusters)
rect.hclust(clusters, k=5, border="red")   # outline a 5-group cut in red

(Figure: dendrogram with the 5 groups outlined by red rectangles)

groups <- cutree(clusters, k=5)   # group membership for each of the 12 points
print(groups)
##  [1] 1 2 2 1 3 3 3 4 5 5 5 5
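
As a small extra sketch (not in the original), we can color the scatterplot by these group labels to see where the cut landed:

plot(x, y, col=groups, pch=19, cex=2)             # color each point by its cutree group
text(x+0.05, y+0.05, labels=as.character(1:12))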

Another Strategy: top-down approach

K-means clustering is a partitioning approach: one fixes the number of clusters in advance, picks initial guesses for the cluster centroids, assigns each point to its closest centroid, and then recomputes each centroid from the points assigned to it, repeating until the assignments stabilize.

K-means is a special case of a general procedure known as the Expectation-Maximization (EM) algorithm, which is known to converge. In theory the number of iterations required could be large, but in practice it typically converges quickly.
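
To make the iteration concrete, here is a rough sketch of a single assignment/update step on the x and y vectors from the first example (the starting centroids are made up for illustration, and the corner case of a centroid receiving no points is ignored; kmeans() below does all of this for us):

cx <- c(1, 2, 2.5); cy <- c(1.2, 2, 1)                 # arbitrary initial centroids (illustrative)
d1 <- sqrt((x - cx[1])^2 + (y - cy[1])^2)              # distance of every point to centroid 1
d2 <- sqrt((x - cx[2])^2 + (y - cy[2])^2)
d3 <- sqrt((x - cx[3])^2 + (y - cy[3])^2)
assignment <- apply(cbind(d1, d2, d3), 1, which.min)   # "E-step": assign each point to its closest centroid
newCx <- tapply(x, assignment, mean)                   # "M-step": recompute each centroid as the mean
newCy <- tapply(y, assignment, mean)                   # of the points currently assigned to it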


Example:

Suppose one has the same data as before:

set.seed(1234)                                        # same seed as before, so the same data
x <- rnorm(12, mean=rep(1:3, each=4), sd=0.2)
y <- rnorm(12, mean=rep(c(1, 2, 1), each=4), sd=0.2)

which we can plot to see what's going on, labeling the points as well:

plot(x, y, col="blue", pch=19, cex=2)
text(x+0.05, y+0.05, labels=as.character(1:12))

(Figure: scatterplot of the 12 simulated points, labeled 1 to 12)

We can cluster the data using the kmeans function, asking for 3 clusters:

dataFrame <- data.frame(x=x, y=y)
kmeansObj <- kmeans(dataFrame, centers=3)   # partition the 12 points into 3 clusters
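
Because the result depends on the random initial centroids, a common refinement (an aside, not used in the output below; the object name here is just illustrative) is to run several random starts and keep the best one:

kmeansObjStable <- kmeans(dataFrame, centers=3, nstart=20)   # 20 random starts; keeps the run with the smallest tot.withinss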

Let's plot the result, coloring the points by cluster and marking the centroids:

plot(x, y, col=kmeansObj$cluster, pch=19, cex=2)          # color each point by its assigned cluster
points(kmeansObj$centers, col=1:3, pch=3, cex=3, lwd=3)   # mark the 3 centroids with crosses
text(x+0.05, y+0.05, labels=as.character(1:12))

(Figure: points colored by K-means cluster, with the 3 centroids marked by crosses)

and look at the object we obtained:

names(kmeansObj)
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
kmeansObj$cluster
##  [1] 3 3 3 3 1 1 1 1 2 2 2 2

So, as we can see, the first four elements are in cluster 3, elements 5 to 8 are in cluster 1, and the remaining elements are in cluster 2. Note that the cluster numbers themselves are arbitrary labels; what matters is which points end up grouped together.
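
As a final small sketch, two more components of the kmeans object (listed in the names() output above) are worth a look:

kmeansObj$centers   # coordinates of the three estimated centroids
kmeansObj$size      # how many points ended up in each cluster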