Data Products: K-means Clustering of IRIS data

Kenneth D. Graves
Tue Feb 17 15:54:10 2015

An R Presentation on a Shiny application prepared for Coursera's Data Products course. The application lets the user select the number of clusters, k, on the IRIS dataset to demonstrate the parameter's effect on k-means grouping in R.

K-Means Clustering

K-means clustering aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean; that mean serves as a prototype of the cluster. Given an initial set of centers, the k-means algorithm alternates two steps:

  • For each center, identify the subset of training points that are closer to it than to any other center.
  • The mean of each grouping is computed, and this mean vector becomes the new center for that cluster.

The steps are iterated until convergence.
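
The two alternating steps can be sketched in a few lines of R. This is only an illustration of the idea, not the code behind R's built-in kmeans() (which uses the Hartigan-Wong algorithm by default). The function name simple_kmeans, the max_iter argument, and the choice of starting rows are assumptions made for this example.

```r
# Minimal sketch of the two alternating k-means steps (illustrative only).
# simple_kmeans and max_iter are hypothetical names, not part of base R.
simple_kmeans <- function(x, centers, max_iter = 100) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  k <- nrow(centers)
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 1: assign each point to its nearest center
    # (squared Euclidean distance, one column of d per center).
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    # Converged when no assignment changes.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 2: recompute each center as the mean of its assigned points.
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

x <- iris[, 1:4]
# Assumption: start from three arbitrarily chosen rows of the data.
fit <- simple_kmeans(x, centers = x[c(1, 51, 101), ])
print(table(fit$cluster))
```

The result depends on the starting centers, which is why kmeans() offers an nstart argument to try several random starts.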


The IRIS Dataset

To demonstrate the effect of the initial k parameter on R's k-means algorithm, the application uses Edgar Anderson's IRIS dataset. The data give sepal and petal length and width measurements, in centimeters, for three species: Iris setosa, versicolor, and virginica.

Here are the actual means of each of the three species:

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa         5.01        3.43         1.46        0.25
2 versicolor         5.94        2.77         4.26        1.33
3  virginica         6.59        2.97         5.55        2.03
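
The presentation does not show how the table was produced. One way to compute per-species means like these from R's built-in iris data frame is sketched below; the variable name species_means is an assumption for this example.

```r
# Per-species means of the four measurements in R's built-in iris data.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
# Round the numeric columns to two decimals for display.
species_means[, -1] <- round(species_means[, -1], 2)
print(species_means)
```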


Three Choices for Number of Cluster Centers

Different choices for the number of cluster centers group the underlying observations differently:

[Figure: k-means groupings of the iris observations for three choices of k]

K-means Demonstration Application

Our Shiny application lets the observer see the effect of different numbers of clusters on the grouping of the iris species. The following tables show the confusion matrices (rows are species, columns are cluster labels) for three such settings: k = 2, 3, and 5.

k = 2
              1  2
setosa        0 50
versicolor   47  3
virginica    50  0

k = 3
              1  2  3
setosa        0  0 50
versicolor   48  2  0
virginica    14 36  0

k = 5
              1  2  3  4  5
setosa       17 33  0  0  0
versicolor    0  0  0 23 27
virginica     0  0 32 17  1

As you can see, the choice of k is an important parameter in determining the accuracy of the grouping. There are advanced techniques for choosing the number of clusters optimally, but they are beyond the scope of this assignment.
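
Confusion matrices of this kind can be produced by cross-tabulating the known species against the cluster labels returned by kmeans(). The sketch below is not the application's own code. The seed, the nstart value, and the use of all four measurement columns are assumptions, so the cluster numbering and counts may differ from the tables shown earlier.

```r
# Cross-tabulate true species against k-means cluster labels for several k.
# Assumptions: seed, nstart, and clustering on all four measurement columns.
set.seed(42)
ks <- c(2, 3, 5)
confusion <- lapply(ks, function(k) {
  fit <- kmeans(iris[, 1:4], centers = k, nstart = 20)
  table(Species = iris$Species, Cluster = fit$cluster)
})
names(confusion) <- paste("k =", ks)
print(confusion)
```

Cluster labels are arbitrary, so a good grouping shows up as each row concentrating in one column rather than as counts on the diagonal.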

In the meantime, please feel free to use this application in all your k-means demonstrations.