Kenneth D. Graves
Tue Feb 17 15:54:10 2015
An R Presentation on a Shiny application prepared for Coursera's Data Products course. The application lets the user select the number of clusters, k, for the iris data set to demonstrate the parameter's effect on k-means grouping in R.
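K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster. Given an initial set of centers, the k-means algorithm alternates between two steps:
- Assignment step: assign each observation to the cluster whose mean is nearest.
- Update step: recompute each cluster mean as the centroid of the observations assigned to it.
The steps are iterated until convergence, that is, until the assignments no longer change.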
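The two alternating steps (assign each observation to its nearest center, then recompute each center as the mean of its observations) can be sketched in a few lines of R. This is only an illustration: `simple_kmeans`, `max_iter`, and the random choice of starting rows are assumptions made here, and R's built-in `kmeans()` uses the Hartigan-Wong algorithm by default rather than this plain Lloyd-style loop.

```r
# Minimal sketch of the two alternating k-means steps (Lloyd-style).
# simple_kmeans, max_iter and the starting-center choice are illustrative only.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Initial centers: k rows sampled at random from the data
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance from every row to every center
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    # Converged when no observation changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Update step: each center becomes the mean of its assigned observations
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

set.seed(42)
fit <- simple_kmeans(iris[, 1:4], k = 3)
table(iris$Species, fit$cluster)
```

Because the starting centers are random, the cluster labels (and sometimes the grouping itself) can differ from run to run unless a seed is set.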
To demonstrate the effect of the initial k parameter on R's k-means algorithm, the application uses Edgar Anderson's iris data set. The data give the measurements, in centimeters, of sepal and petal length and width for 50 flowers from each of three species: Iris setosa, versicolor, and virginica.
Here are the actual means of each of the three species:
| | Species | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|---|
| 1 | setosa | 5.01 | 3.43 | 1.46 | 0.25 |
| 2 | versicolor | 5.94 | 2.77 | 4.26 | 1.33 |
| 3 | virginica | 6.59 | 2.97 | 5.55 | 2.03 |
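These per-species means can be reproduced from R's built-in `iris` data frame. The snippet below is one possible way to do it; the variable name `species_means` is chosen here for illustration.

```r
# Per-species means of the four measurements, rounded to two decimals
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means[, -1] <- round(species_means[, -1], 2)
species_means
```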
Different choices for the number of cluster centers group the underlying observations differently.
Our Shiny application lets the observer see the effect of different numbers of clusters on the grouping of the iris species. The following tables show the confusion matrices (species in rows, cluster numbers in columns) for three such settings: k = 2, 3, and 5.
| Species | Cluster 1 | Cluster 2 |
|---|---|---|
| setosa | 0 | 50 |
| versicolor | 47 | 3 |
| virginica | 50 | 0 |
| Species | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| setosa | 0 | 0 | 50 |
| versicolor | 48 | 2 | 0 |
| virginica | 14 | 36 | 0 |
| Species | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 |
|---|---|---|---|---|---|
| setosa | 17 | 33 | 0 | 0 | 0 |
| versicolor | 0 | 0 | 0 | 23 | 27 |
| virginica | 0 | 0 | 32 | 17 | 1 |
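Tables like these can be produced by cross-tabulating the true species against the cluster labels returned by `kmeans()`. The sketch below assumes all four numeric columns are clustered; the seed and `nstart = 25` are additions for reproducibility and are not taken from the application itself. Cluster numbers are arbitrary, so the columns may appear in a different order, and for larger k the exact counts can differ from the tables above.

```r
# Cross-tabulate true species against k-means cluster labels for several k.
# set.seed and nstart = 25 are assumptions added for reproducibility.
set.seed(2015)
measurements <- iris[, 1:4]  # assumed: all four numeric columns are clustered

confusion <- lapply(c(2, 3, 5), function(k) {
  fit <- kmeans(measurements, centers = k, nstart = 25)
  table(Species = iris$Species, Cluster = fit$cluster)
})
names(confusion) <- c("k=2", "k=3", "k=5")
confusion
```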
As you can see, the choice of k is an important parameter in determining how accurately the clusters recover the species. There are more advanced techniques for choosing the number of clusters, but they are beyond the scope of this assignment.
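One simple way to put a number on "accuracy of grouping" is cluster purity: the share of observations that belong to the most common species within their cluster. The presentation does not name a metric, so purity is an assumption made here, and the `purity` helper is illustrative. The matrices are copied from the confusion tables above.

```r
# Confusion matrices copied from the tables above (rows: species, columns: clusters)
species <- c("setosa", "versicolor", "virginica")
k2 <- matrix(c(0, 50,
               47, 3,
               50, 0),
             nrow = 3, byrow = TRUE, dimnames = list(species, 1:2))
k3 <- matrix(c(0, 0, 50,
               48, 2, 0,
               14, 36, 0),
             nrow = 3, byrow = TRUE, dimnames = list(species, 1:3))
k5 <- matrix(c(17, 33, 0, 0, 0,
               0, 0, 0, 23, 27,
               0, 0, 32, 17, 1),
             nrow = 3, byrow = TRUE, dimnames = list(species, 1:5))

# Purity: sum of each cluster's largest species count, divided by all observations
purity <- function(tab) sum(apply(tab, 2, max)) / sum(tab)

purities <- sapply(list("k=2" = k2, "k=3" = k3, "k=5" = k5), purity)
round(purities, 3)
```

For these tables the purity is about 0.667 for k = 2, 0.893 for k = 3, and 0.880 for k = 5. Purity tends to rise as clusters get smaller, so it describes a given grouping rather than offering a way to pick k.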
In the meantime, please feel free to use this application in all your k-means demonstrations.
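For readers who want to build something similar, here is a minimal sketch of what such a Shiny app could look like. It is not the original application's code: the input and output IDs (`k`, `clusterPlot`, `confusion`), the slider range, the plotted columns, and `nstart = 10` are all assumptions made for illustration.

```r
library(shiny)

# Hypothetical UI: a slider for k, a scatter plot, and a confusion table
ui <- fluidPage(
  titlePanel("K-means on iris"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("k", "Number of clusters (k)", min = 2, max = 9, value = 3, step = 1)
    ),
    mainPanel(
      plotOutput("clusterPlot"),
      tableOutput("confusion")
    )
  )
)

server <- function(input, output) {
  # Re-run k-means on the four measurements whenever the slider changes
  fit <- reactive({
    kmeans(iris[, 1:4], centers = input$k, nstart = 10)
  })

  # Petal length vs. width, colored by cluster, with centers marked by crosses
  output$clusterPlot <- renderPlot({
    plot(iris$Petal.Length, iris$Petal.Width, col = fit()$cluster, pch = 19,
         xlab = "Petal.Length", ylab = "Petal.Width")
    points(fit()$centers[, c("Petal.Length", "Petal.Width"), drop = FALSE],
           pch = 4, cex = 3, lwd = 3)
  })

  # Species (rows) against cluster number (columns)
  output$confusion <- renderTable({
    as.data.frame.matrix(table(iris$Species, fit()$cluster))
  }, rownames = TRUE)
}

# Launch only in an interactive session
if (interactive()) shinyApp(ui, server)
```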