AI4OPT - DE&M II - Clustering - Part 3

Clustering Methods - K-Means Clustering - Torgo, P. 123

table(ir3$cluster, iris$Species)

##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         48        14
##   3      0          2        36

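# Confusion matrix: rows are k-means clusters, columns are the true species
cm <- table(ir3$cluster, iris$Species)
# Error rate: share of observations off the diagonal
# (assumes cluster i corresponds to the i-th species)
1 - sum(diag(cm)) / sum(cm)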

## [1] 0.1066667
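The diagonal-based error rate is only meaningful when cluster numbers happen to line up with the species order, and k-means assigns those numbers arbitrarily. The sketch below is one possible workaround, not part of Torgo's text: it maps each cluster to its most frequent species before counting errors. The matrix values are copied from the output above; the helper names `majority_error` and `diagonal_error` are hypothetical.

```r
# Confusion matrix copied from the output above
# (rows: clusters, columns: species).
cm <- matrix(c(50,  0,  0,
                0, 48, 14,
                0,  2, 36),
             nrow = 3, byrow = TRUE,
             dimnames = list(cluster = c("1", "2", "3"),
                             species = c("setosa", "versicolor", "virginica")))

# Diagonal-based error rate, as computed in the text.
diagonal_error <- function(m) 1 - sum(diag(m)) / sum(m)

# Hypothetical helper: assign each cluster to its most frequent species,
# then count everything else in that row as an error.
majority_error <- function(m) 1 - sum(apply(m, 1, max)) / sum(m)

# Same clustering, but with the cluster numbers permuted.
cm_relabelled <- cm[c(2, 3, 1), ]

diagonal_error(cm)             # 0.1066667
majority_error(cm)             # 0.1066667
diagonal_error(cm_relabelled)  # 0.9866667
majority_error(cm_relabelled)  # 0.1066667
```

Note that the majority mapping can assign two clusters to the same species, so it is a rough check rather than a one-to-one matching.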

An example of an internal validation metric is the silhouette coefficient. The coefficient takes values between -1 and 1.
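For reference, the silhouette coefficient of an observation is commonly defined as (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The following base-R sketch computes it by hand on a toy dataset; the function name `silhouette_by_hand` is hypothetical, and the code assumes at least two clusters and Euclidean distances.

```r
# Silhouette width of every observation, computed by hand (base R only).
# Assumes `labels` contains at least two distinct clusters.
silhouette_by_hand <- function(x, labels) {
  d <- as.matrix(dist(x))          # pairwise Euclidean distances
  clusters <- unique(labels)
  sapply(seq_len(nrow(d)), function(i) {
    own <- labels == labels[i]
    if (sum(own) == 1) return(0)   # usual convention for singleton clusters
    # a: mean distance to the other members of the own cluster
    # (d[i, i] is 0, so dividing by sum(own) - 1 excludes the point itself)
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster
    others <- setdiff(clusters, labels[i])
    b <- min(sapply(others, function(cl) mean(d[i, labels == cl])))
    (b - a) / max(a, b)
  })
}

# Toy data: two tight, well-separated groups on a line.
x <- matrix(c(0, 1, 10, 11), ncol = 1)
labels <- c(1, 1, 2, 2)
sil <- silhouette_by_hand(x, labels)
round(sil, 3)
mean(sil)   # average silhouette of the whole clustering
```

Because the two groups are far apart relative to their spread, every value is close to 1.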

library(cluster)
s <- silhouette(ir3$cluster, dist(iris[ , -5]))
plot(s)

Note that the silhouette() function requires the distance matrix of the dataset, computed here with dist(). Its result was then passed to plot() to produce the silhouette plot.

As we can observe from the figure, the overall average silhouette coefficient of all 150 observations (or cases) is 0.55, which is a reasonable value (the closer to 1, the better).

In our figure, note that the best cluster is number 1, with an average silhouette of 0.8, while the other two clusters have lower scores.
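The averages shown in the plot can also be read off the silhouette object directly. The sketch below is self-contained, so it refits k-means rather than reusing ir3; the seed and the `nstart = 10` argument are assumptions made here to get a stable solution, not settings taken from the text.

```r
library(cluster)

set.seed(1234)  # assumed seed, for reproducibility only
# nstart = 10 is an assumption: several random starts reduce the risk
# of k-means stopping in a poor local optimum.
km <- kmeans(iris[, -5], centers = 3, iter.max = 200, nstart = 10)
s <- silhouette(km$cluster, dist(iris[, -5]))

# A silhouette object is a matrix with columns
# "cluster", "neighbor" and "sil_width".
cluster_avg <- tapply(s[, "sil_width"], s[, "cluster"], mean)
overall_avg <- mean(s[, "sil_width"])

cluster_avg   # one average silhouette per cluster
overall_avg   # overall average silhouette
```

The cluster numbers may differ from those in the figure, since k-means labels are arbitrary.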

Torgo, P. 125. The silhouette coefficient can be used to compare different clustering solutions or even to select the “ideal” number of clusters for a given method.

The following code illustrates this idea with the k-means method applied to the Iris dataset. We look for the best number of clusters, trying every value from 2 to 6:

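set.seed(1234)                      # make the random k-means starts reproducible
d <- dist(iris[ , -5])              # distance matrix on the four numeric columns
avgS <- c()
for(k in 2:6) {
  cl <- kmeans(iris[ , -5], centers=k, iter.max = 200)
  s <- silhouette(cl$cluster, d)
  avgS <- c(avgS, mean(s[ ,3]))     # column 3 holds the silhouette widths
}
data.frame(nClus=2:6, Silh=avgS)    # average silhouette for each k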

##   nClus      Silh
## 1     2 0.6810462
## 2     3 0.5528190
## 3     4 0.4965169
## 4     5 0.4609502
## 5     6 0.3664804
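The table above can also be read programmatically. The short sketch below copies the printed values into a data frame (the name `res` is hypothetical) and picks the number of clusters with the highest average silhouette.

```r
# Values copied from the output above.
res <- data.frame(nClus = 2:6,
                  Silh  = c(0.6810462, 0.5528190, 0.4965169,
                            0.4609502, 0.3664804))

# Number of clusters with the highest average silhouette.
best_k <- res$nClus[which.max(res$Silh)]
best_k   # 2
```

By this criterion the preferred solution has two clusters, even though the dataset contains three species; an internal metric only measures how compact and well separated the groups are.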

AI4OPT - DE&M II - Clustering - Part 3 - Practice

Charles Pierre

2022-09-29

Please continue reading this chapter in Torgo, P. 125 (bottom half).

Study another widely used partitioning algorithm: the k-medoids method.

This method is somewhat similar to the k-means method; however, there are some important differences. What are those differences?
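As a starting point for exploring the question, the sketch below runs pam() from the cluster package, one common k-medoids implementation, on the Iris data. The choice of k = 3 is an assumption made to mirror the k-means example; it is not prescribed by the text. One difference is visible right away: each medoid is an actual observation from the dataset.

```r
library(cluster)

# k-medoids via pam(); k = 3 is assumed here to mirror the k-means example.
pm <- pam(iris[, -5], k = 3)

pm$id.med     # row numbers of the observations chosen as medoids
pm$medoids    # the medoids themselves: real rows of the dataset

# Same external and internal checks as for k-means.
table(pm$clustering, iris$Species)
pm$silinfo$avg.width   # overall average silhouette
```

Comparing `pm$medoids` with the `centers` component of a kmeans() fit is one way to begin listing the differences.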