Clustering Techniques

part of the Data Mining Series by Karen Mazidi

See more at my RPubs site.

Clustering

Clustering is an unsupervised learning technique that groups observations into homogeneous clusters.

This script explores data mining with two clustering algorithms: k-means and hierarchical clustering.

K-means clustering

K-means seeks to partition the observations into K clusters. The idea behind K-means is that we want clusters in which the within-cluster variation is small, that is, observations in the same cluster are similar to one another. How do we measure similarity? Often with Euclidean distance.
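For a quick sense of what Euclidean distance looks like in R, the dist() function computes pairwise distances (Euclidean by default). This snippet is for intuition only and is not needed for kmeans():

x = iris[1:2, 1:4]                             # first two observations, numeric features only
dist(x)                                        # dist() defaults to Euclidean distance
sqrt(sum((unlist(x[1,]) - unlist(x[2,]))^2))   # the same value, computed by hand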

Algorithm

Here is the algorithm for K-means clustering (a toy R sketch follows the list):

  1. Randomly assign each of the observations to one of the K clusters.
  2. For each cluster, compute a centroid, which is a vector of feature means.
  3. Assign each observation to the cluster whose centroid is closest.
  4. Repeat steps 2 and 3 until convergence.
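For intuition, here is a toy R sketch of those four steps. It is illustrative only and assumes no cluster goes empty during the iterations; everywhere else this script uses the real implementation, stats::kmeans().

simple_kmeans = function(x, k, iters=20) {
  x = as.matrix(x)
  cl = sample(k, nrow(x), replace=TRUE)    # step 1: random assignment
  for (i in 1:iters) {
    # step 2: each cluster's centroid is a vector of feature means
    # (one column per cluster)
    centroids = sapply(1:k, function(j) colMeans(x[cl==j, , drop=FALSE]))
    # step 3: reassign each observation to the closest centroid
    # (squared Euclidean distance to every centroid)
    d = apply(x, 1, function(row) colSums((centroids - row)^2))
    new_cl = apply(d, 2, which.min)
    if (all(new_cl == cl)) break           # step 4: stop at convergence
    cl = new_cl
  }
  cl
}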

Limitations

K-means is not guaranteed to find the globally optimal solution. A common remedy is to run the algorithm several times from different random initializations and keep the best result.
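R's kmeans() (introduced below) does this via its nstart argument: it repeats the clustering from nstart random starts and keeps the solution with the lowest total within-cluster sum of squares. A quick comparison (the seed is arbitrary, for reproducibility only):

set.seed(1)
km1 = kmeans(subset(iris, select=-Species), 3, nstart=1)
km25 = kmeans(subset(iris, select=-Species), 3, nstart=25)
km1$tot.withinss    # total within-cluster sum of squares from one start
km25$tot.withinss   # never worse with 25 starts, often better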

kmeans() is in the stats package, which is attached by default in R (so the calls above already work):

library(stats)

K-means on the iris data

For this first clustering we use a single feature, petal length.

km_model = kmeans(iris$Petal.Length, 3, nstart=10)  # K=3

See results in vector form

The cluster component is a vector of integers from 1 to K, indicating the cluster assigned to each observation.

km_model$cluster
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 2 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
## [106] 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2
## [141] 2 2 2 2 2 2 2 2 2 2

See the results in graph form

We have a pretty good cluster for petal length < 2, but the other two clusters overlap a little.

plot(iris$Petal.Length, col=(km_model$cluster+1))  # color each point by cluster; +1 skips black

(Plot: Petal.Length by observation index, colored by cluster assignment)

Clustering on more than one feature

Now let's cluster on all the features (minus the Species class, of course).

We see from the confusion matrix that the setosa cluster is clean, but the versicolor and virginica clusters are somewhat mixed. (The cluster numbers themselves are arbitrary, so the rows may appear in a different order on another run.)

km_model2 = kmeans(x = subset(iris, select=-Species), 3, nstart=10)  # drop the class label
table(km_model2$cluster, iris$Species)  # confusion matrix: clusters vs. true species
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         48        14
##   3      0          2        36

Hierarchical clustering

Hierarchical clustering creates a tree-like representation of the observations called a dendrogram. This structure lets us visualize every possible clustering, from 1 cluster up to n.

Algorithm

The clustering can be bottom-up or top-down. The following is the algorithm for bottom-up clustering, also called agglomerative clustering; a note on the linkage rule behind step 2 follows the list.

  1. Start with each point in its own cluster.
  2. Identify the closest two clusters.
  3. Merge these two clusters.
  4. Repeat steps 2 and 3 until there is only one big cluster.
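What "closest" means in step 2 depends on the linkage rule. In R's hclust() the method argument selects it, with complete linkage as the default:

d = dist(iris$Petal.Length)                 # pairwise Euclidean distances
hc_complete = hclust(d)                     # method="complete" is the default
hc_single = hclust(d, method="single")      # distance between nearest members
hc_average = hclust(d, method="average")    # mean pairwise distance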

Limitations and Recommendations

The scaling of variables is important: features measured on larger scales dominate the distance calculation. Another critical issue, just as in k-means, is the selection of the number of clusters. Feature selection to drive the clustering is a third critical issue. (A scaled example with cutree() follows the dendrogram below.)

hc = hclust(dist(iris$Petal.Length))  # complete linkage by default
plot(hc)                              # draw the dendrogram

(Plot: dendrogram from hierarchical clustering on Petal.Length)
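Following up on the limitations above: scale() standardizes the features and cutree() cuts the dendrogram at a chosen number of clusters. For illustration, here we cluster on all four scaled features and cut the tree at k = 3:

hc_all = hclust(dist(scale(subset(iris, select=-Species))))
groups = cutree(hc_all, k=3)     # cut the dendrogram into 3 clusters
table(groups, iris$Species)      # compare the groups to the true species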

A good tutorial on customizing the dendrogram visualization is at:

https://rstudio-pubs-static.s3.amazonaws.com/1876_df0bf890dd54461f98719b461d987c3d.html