part of the Data Mining Series by Karen Mazidi
See more at my RPubs site.
Clustering is an unsupervised learning technique that groups observations into homogeneous clusters.
This script explores data mining with two clustering algorithms: k-means and hierarchical clustering.
K-means seeks to partition the observations into k clusters. The idea behind K-means is that we want clusters in which the within-cluster variation is small, that is, the observations in a cluster are similar to one another. How do we measure similarity? Often with Euclidean distance.
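For reference, the usual squared-Euclidean definition of the within-cluster variation for a cluster $C_k$ (a standard formulation, not something specific to this script) is:

$$W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$$

K-means then seeks the assignment of observations to clusters that minimizes $\sum_{k=1}^{K} W(C_k)$.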
Here is the algorithm for K-means clustering:

1. Randomly assign each observation to one of the k clusters.
2. Compute the centroid (the vector of feature means) of each cluster.
3. Reassign each observation to the cluster whose centroid is closest.
4. Repeat steps 2 and 3 until the assignments stop changing.
K-means is not guaranteed to find the globally optimal solution, because the result depends on the random initial assignment. A common remedy is to run the algorithm several times from different random starts (the nstart argument below) and keep the run with the lowest total within-cluster variation.
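To see why multiple starts help, here is a minimal sketch on synthetic data (the seed, the group shift, and the choice of 25 starts are all arbitrary):

set.seed(1)
x = matrix(rnorm(300), ncol=2)   # 150 random points in 2 dimensions
x[1:50, ] = x[1:50, ] + 3        # shift one group so there is real structure
km1 = kmeans(x, 3, nstart=1)     # a single random start
km25 = kmeans(x, 3, nstart=25)   # keep the best of 25 random starts
km1$tot.withinss                 # total within-cluster sum of squares
km25$tot.withinss                # typically no larger than the single-start value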
K-means is in the stats library, which is loaded by default in R:
library(stats)
For this clustering, the petal length feature was chosen.
km_model = kmeans(iris$Petal.Length, 3, nstart=10) # K=3
The value cluster gives a vector of n integers between 1 and k, indicating the cluster assignment of each observation:
km_model$cluster
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 2 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
## [106] 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2
## [141] 2 2 2 2 2 2 2 2 2 2
We have a pretty good cluster for petal length < 2, but the other two clusters overlap a little.
plot(iris$Petal.Length, col=(km_model$cluster+1))  # color by cluster; +1 shifts the palette past black
Now let's cluster on all the features (minus the Species class, of course).
As the confusion matrix below shows, the cluster corresponding to setosa is perfect, but versicolor and virginica are somewhat mixed.
km_model2 = kmeans(x = subset(iris, select=-Species), 3, nstart=10)
table(km_model2$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         48        14
##   3      0          2        36
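We can also inspect the fitted centers; the centers component of a kmeans result holds the feature means of each cluster:

km_model2$centers  # one row per cluster, one column per feature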
Hierarchical clustering creates a tree-like representation of the observations called a dendrogram. This structure lets us visualize every possible clustering, from 1 cluster up to n.
The clustering can be built bottom-up or top-down. The following is the algorithm for bottom-up clustering, also called agglomerative clustering:

1. Begin with each of the n observations as its own cluster, and compute all pairwise dissimilarities (for example, Euclidean distances).
2. Fuse the two clusters that are most similar to each other.
3. Recompute the dissimilarities between the new cluster and all remaining clusters, using a linkage rule such as complete or average linkage.
4. Repeat steps 2 and 3 until all observations belong to a single cluster.
Several choices are critical here. The scaling of variables matters, since features measured on large scales dominate the distance calculations. Just as in k-means, the number of clusters (here, where to cut the tree) must be chosen. And the features selected to drive the clustering shape the result; a scaled example appears after the dendrogram below.
hc = hclust(dist(iris$Petal.Length))  # Euclidean distances, complete linkage (the defaults)
plot(hc)                              # plot the dendrogram
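The dendrogram itself does not assign observations to clusters; for that we cut the tree. A minimal sketch (cutree is in stats; k=3 and the use of scale are our choices, echoing the scaling point above):

# cut the dendrogram into 3 clusters and compare to the species labels
hc_clusters = cutree(hc, k=3)
table(hc_clusters, iris$Species)

# hierarchical clustering on all four features, scaled to mean 0 and sd 1
hc_all = hclust(dist(scale(subset(iris, select=-Species))))
table(cutree(hc_all, k=3), iris$Species)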
A good tutorial on customizing the dendrogram visualization is at:
https://rstudio-pubs-static.s3.amazonaws.com/1876_df0bf890dd54461f98719b461d987c3d.html