set.seed(1234)  # set a seed for the random number generator
data(iris)
ir3 <- kmeans(iris[ , -5], centers = 3, iter.max = 200)  # cluster without the species column
ir3
## K-means clustering with 3 clusters of sizes 50, 62, 38
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.006000    3.428000     1.462000    0.246000
## 2     5.901613    2.748387     4.393548    1.433871
## 3     6.850000    3.073684     5.742105    2.071053
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
## 
## Within cluster sum of squares by cluster:
## [1] 15.15100 39.82097 23.87947
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Clustering Methods - K-Means Clustering - Torgo, P. 123

table(ir3$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         48        14
##   3      0          2        36
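cm <- table(ir3$cluster, iris$Species)  # confusion matrix: clusters vs. species
1 - sum(diag(cm)) / sum(cm)  # error rate; assumes cluster i matches the i-th species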
## [1] 0.1066667
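
The error rate above reads the diagonal of the table, which only works when cluster 1 happens to correspond to the first species, cluster 2 to the second, and so on. k-means numbers its clusters arbitrarily, so a different seed may permute the rows. Below is a minimal sketch of a label-agnostic alternative, assuming each cluster is labelled with its majority species. That matching rule and the helper names err_majority and err_diag are our own assumptions, not taken from Torgo. The counts are copied from the table above.

```r
# Confusion matrix copied from the table() output above
cm <- matrix(c(50,  0,  0,
                0, 48, 14,
                0,  2, 36),
             nrow = 3, byrow = TRUE,
             dimnames = list(cluster = 1:3,
                             species = c("setosa", "versicolor", "virginica")))

# Error rate when each cluster is labelled with its majority species
# (hypothetical helper, not part of any package)
err_majority <- function(cm) 1 - sum(apply(cm, 1, max)) / sum(cm)

# Diagonal-based error rate, as computed in the text
err_diag <- function(cm) 1 - sum(diag(cm)) / sum(cm)

# Same clustering, but with cluster labels 1 and 2 swapped
cm_perm <- cm[c(2, 1, 3), ]

err_diag(cm)           # 0.1066667
err_diag(cm_perm)      # 0.76: the diagonal no longer lines up with the species
err_majority(cm)       # 0.1066667
err_majority(cm_perm)  # 0.1066667: unaffected by the relabelling
```

One caveat of the majority rule: nothing stops two clusters from being assigned to the same species, so it is better read as a purity measure than as a strict one-to-one matching.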

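An example of an internal validation metric is the silhouette coefficient. It is computed for each observation and takes values between -1 and 1: values near 1 indicate an observation that sits well inside its cluster, values near 0 an observation on the border between two clusters, and negative values an observation that is probably in the wrong cluster.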

library(cluster)
s <- silhouette(ir3$cluster, dist(iris[ , -5]))
plot(s)

Note that silhouette() requires the distance matrix of the dataset in addition to the cluster assignments. Its result is then passed to plot(), which produces the silhouette plot.
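
To see why the distance matrix is needed, it may help to compute the coefficient by hand. For observation i, let a(i) be its mean distance to the other members of its own cluster and b(i) its mean distance to the members of the nearest other cluster; the silhouette is then (b(i) - a(i)) / max(a(i), b(i)). The sketch below applies this to a small made-up one-dimensional dataset. The data and the helper sil_one are hypothetical, and the sketch ignores the special case of single-member clusters.

```r
# Toy data (hypothetical): two well-separated groups on a line
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
d  <- as.matrix(dist(x))   # full pairwise distance matrix

# Silhouette of observation i, computed directly from the definition
# (assumes every cluster has at least two members)
sil_one <- function(i, d, cl) {
  own <- cl == cl[i]
  own[i] <- FALSE                       # exclude the observation itself
  a <- mean(d[i, own])                  # mean distance within own cluster
  others <- setdiff(unique(cl), cl[i])
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))  # nearest other cluster
  (b - a) / max(a, b)
}

sil <- sapply(seq_along(x), sil_one, d = d, cl = cl)
round(sil, 4)
## [1] 0.8500 0.8889 0.8125 0.8125 0.8889 0.8500
```

All values are close to 1 because the two groups are far apart relative to their internal spread.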

As we can observe from the figure, the overall average silhouette coefficient of all 150 observations (or cases) is 0.55, which is a reasonable value (the closer to 1, the better).

In our figure, note that the best cluster is number 1, with an average silhouette of 0.8, while the other two clusters have lower scores.
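
The averages shown in the plot can also be obtained as numbers. The following is a minimal sketch, assuming the cluster package is installed. The argument nstart = 25 is our addition (it reruns k-means from 25 random starts and keeps the best solution), so the cluster numbering may differ from the output shown earlier.

```r
library(cluster)

set.seed(1234)
# nstart = 25 is an assumption added here to make the result less seed-dependent
km <- kmeans(iris[ , -5], centers = 3, iter.max = 200, nstart = 25)
s  <- silhouette(km$cluster, dist(iris[ , -5]))

# A silhouette object is a matrix with columns "cluster", "neighbor", "sil_width"
overall  <- mean(s[ , "sil_width"])                          # overall average
by_clust <- tapply(s[ , "sil_width"], s[ , "cluster"], mean) # average per cluster

round(overall, 2)
round(by_clust, 2)
```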

The silhouette coefficient can be used to compare different clustering solutions, or even to select the “ideal” number of clusters for a given method (Torgo, P. 125).

The following code illustrates this idea with the k-means method applied to the Iris dataset. We compute the average silhouette for each number of clusters from 2 to 6 and look for the highest value:

set.seed(1234)
d <- dist(iris[ , -5])
avgS <- c()
for(k in 2:6) {
  cl <- kmeans(iris[ , -5], centers=k, iter.max = 200)
  s <- silhouette(cl$cluster, d)
  avgS <- c(avgS, mean(s[ ,3]))
}
data.frame(nClus=2:6, Silh=avgS)
##   nClus      Silh
## 1     2 0.6810462
## 2     3 0.5528190
## 3     4 0.4965169
## 4     5 0.4609502
## 5     6 0.3664804
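
Reading the best value off the table can also be done in code. This short sketch copies the averages from the output above into a data frame (the names res and best_k are ours) and picks the row with the highest average silhouette.

```r
# Average silhouettes copied from the output above
res <- data.frame(nClus = 2:6,
                  Silh  = c(0.6810462, 0.5528190, 0.4965169, 0.4609502, 0.3664804))

# Number of clusters with the highest average silhouette
best_k <- res$nClus[which.max(res$Silh)]
best_k
## [1] 2
```

Note that the criterion favours 2 clusters even though the dataset contains 3 species. This is consistent with the table of clusters against species shown earlier, where setosa is perfectly separated while versicolor and virginica overlap.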

Please continue reading this chapter in Torgo, P.125 (bottom half).

Study another widely used partitioning algorithm: the k-medoids method.

This method is similar to the k-means method; however, there are some important differences. What are those differences?