K-means Clustering

A Shiny App Illustration

Renaud Dufour

K-means is a distance-based method for cluster analysis in data mining.
It partitions a set of data points into groups whose members are as similar to one another as possible.
Each group, called a cluster, is represented by its center (the centroid).

Given K, the number of clusters, k-means clustering works as follows:

Select K points as initial centroids
Repeat
- Form K clusters by assigning each point to its closest centroid
- Recompute the centroid of each cluster
Until the convergence criterion is satisfied
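
The loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code of the Shiny app: the function names, the toy data, the use of squared Euclidean distance, and the stopping rule (assignments no longer change) are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean (L2) distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Select K points as initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Form K clusters by assigning each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Convergence criterion (assumed here): assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute the centroid of each cluster as the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical toy data: two groups of points in 2D.
toy_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 9.5), (8.5, 7.0)]
centroids, labels = kmeans(toy_points, k=2)
print(centroids)
print(labels)
```

The result depends on the initial centroids, so different seeds can give different partitions; a common remedy is to run the algorithm several times and keep the partition with the smallest within-cluster sum of squares.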
Different distance or similarity measures can be used (L1 norm, L2 norm, cosine similarity, ...)
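
For reference, the three measures named above can be written as follows. This is only a sketch with hypothetical function names. Note that cosine similarity grows as points become more alike, so it has to be turned into a dissimilarity (for example 1 minus the similarity) before being used for nearest-centroid assignment.

```python
import math


def l1_distance(p, q):
    """L1 (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def l2_distance(p, q):
    """L2 (Euclidean) distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_similarity(p, q):
    """Cosine of the angle between p and q (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


print(l1_distance((3.0, 4.0), (0.0, 0.0)))        # 7.0
print(l2_distance((3.0, 4.0), (0.0, 0.0)))        # 5.0
print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))  # 0.0 (orthogonal vectors)
```

The choice of measure also affects the centroid update: the mean minimizes the sum of squared L2 distances, whereas the coordinate-wise median minimizes the sum of L1 distances, which is the idea behind k-medians.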

The app illustrates k-means clustering on two datasets:
- the built-in R iris dataset
- a dataset dat1 involving embedded clusters
It lets the user change the following parameters:
- dataset to be used
- variables on which the clustering is to be performed (note: 2D clustering only)
- number of clusters
- type of kernel: linear or radial (RBF)
With a non-linear kernel, the data points are first projected into the kernel feature space, and clustering is then performed in that space.
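
One common way to cluster in the kernel space without building the projection explicitly is the kernel trick: the squared distance between a projected point and a cluster mean can be computed from kernel values alone. The sketch below illustrates that idea. It is not necessarily how the app is implemented, and the function names and the default `gamma` value are assumptions.

```python
import math


def linear_kernel(x, y):
    """Plain dot product: distances reduce to ordinary Euclidean geometry."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function kernel exp(-gamma * ||x - y||^2); gamma is assumed."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def feature_space_sq_distance(x, cluster, kernel):
    """Squared distance between phi(x) and the mean of phi over `cluster`,
    computed from kernel values only (phi is never built explicitly)."""
    n = len(cluster)
    self_term = kernel(x, x)
    cross_term = sum(kernel(x, c) for c in cluster) / n
    cluster_term = sum(kernel(c, d) for c in cluster for d in cluster) / (n * n)
    return self_term - 2.0 * cross_term + cluster_term


cluster = [(0.0, 0.0), (2.0, 0.0)]
point = (1.0, 3.0)
# Linear kernel: same as the squared distance from (1, 3) to the mean (1, 0).
print(feature_space_sq_distance(point, cluster, linear_kernel))  # 9.0
print(feature_space_sq_distance(point, cluster, rbf_kernel))
```

Replacing the Euclidean distance in the assignment step with this quantity gives a kernel version of k-means; with the linear kernel it coincides with the standard algorithm.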

The application can be accessed directly here.

Left panel: iris dataset, variables sepal.length and sepal.width, 3 clusters, linear kernel.
Right panel: dat1 dataset, variables x and y, 2 clusters, radial (RBF) kernel.

[Figure: clustering plots for the two panels]

The source code of the Shiny app is available in my GitHub repo.
More information on the k-means algorithm is available on Wikipedia. I also recommend the Cluster Analysis in Data Mining class on Coursera, which inspired this app.
Potential improvements include:
- using interactive graphics (rCharts, googleVis)
- computing clustering validation measures such as purity or normalized mutual information. Note that such external measures require knowing the true classes of the data points, which is the case for the two implemented datasets but not in general. Alternatively, one could use internal measures such as BetaCV.
- implementing other kernels and allowing the user to tune kernel parameters (currently, the parameter of the RBF kernel is determined internally using a heuristic)
- implementing alternative clustering techniques such as k-medians or k-medoids
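
As an illustration of the purity measure mentioned in the list above, here is a minimal sketch: each cluster is credited with the number of points from its majority true class, and the total is divided by the number of points. The function name and the example labels are hypothetical.

```python
from collections import Counter


def purity(cluster_labels, true_classes):
    """Fraction of points that belong to the majority true class of their cluster."""
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, true_classes):
        by_cluster.setdefault(cluster, Counter())[cls] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_labels)


# Hypothetical labels for six points: cluster 0 is pure, cluster 1 mixes two classes.
cluster_labels = [0, 0, 0, 1, 1, 1]
true_classes = ["a", "a", "a", "b", "b", "a"]
print(purity(cluster_labels, true_classes))  # 5 of 6 points match their cluster's majority class
```

A purity of 1 means every cluster contains points from a single class. Purity tends to increase with the number of clusters, so it is best compared across runs that use the same K.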
Feel free to contact me with any questions or suggestions!