set.seed(1234) # set a seed for the random number generator
data(iris)
ir3 <- kmeans(iris[ , -5], centers=3, iter.max=200) # species column is not used
ir3
## K-means clustering with 3 clusters of sizes 50, 62, 38
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.006000 3.428000 1.462000 0.246000
## 2 5.901613 2.748387 4.393548 1.433871
## 3 6.850000 3.073684 5.742105 2.071053
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
##
## Within cluster sum of squares by cluster:
## [1] 15.15100 39.82097 23.87947
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Clustering Methods - K-Means Clustering - Torgo, P. 123
table(ir3$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 48 14
## 3 0 2 36
cm <- table(ir3$cluster, iris$Species)
1-sum(diag(cm))/sum(cm)
## [1] 0.1066667
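The error rate above reads the diagonal of cm, which works here only because cluster labels 1, 2, 3 happen to line up with the species columns; k-means labels are arbitrary. A minimal sketch of a label-independent alternative follows, assuming each cluster is assigned to its majority species (a purity-style error, not a computation from Torgo). The matrix is transcribed from the table above, and the names err_majority and cm_perm are illustrative.

```r
# Confusion matrix transcribed from the table above
# (rows: clusters, columns: species).
cm <- matrix(c(50,  0,  0,
                0, 48, 14,
                0,  2, 36),
             nrow = 3, byrow = TRUE,
             dimnames = list(cluster = as.character(1:3),
                             species = c("setosa", "versicolor", "virginica")))

# Assign each cluster to its most frequent species and count the rest as errors.
err_majority <- 1 - sum(apply(cm, 1, max)) / sum(cm)
err_majority

# Relabelling the clusters (permuting rows) changes the diagonal-based error
# but leaves the majority-based error unchanged.
cm_perm <- cm[c(3, 1, 2), ]
err_perm_diag     <- 1 - sum(diag(cm_perm)) / sum(cm_perm)
err_perm_majority <- 1 - sum(apply(cm_perm, 1, max)) / sum(cm_perm)
c(diag = err_perm_diag, majority = err_perm_majority)
```

One caveat of this sketch: nothing prevents two clusters from being assigned to the same species, so it measures cluster purity rather than a one-to-one matching.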
An example of an internal validation metric is the silhouette
coefficient. The coefficient takes values between -1 and 1; values
close to 1 indicate that an observation is well matched to its own
cluster.
library(cluster)
s <- silhouette(ir3$cluster, dist(iris[ , -5]))
plot(s)

Note that silhouette() requires the distance matrix of the dataset.
Its result is passed to plot(), which produces the silhouette plot.
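To make the coefficient concrete, here is a minimal sketch that computes it by hand and compares the result with silhouette(). It assumes the usual definition s(i) = (b - a) / max(a, b), where a is the mean distance from observation i to the other members of its cluster and b is the smallest mean distance from i to the members of any other cluster. The toy data x and cl are made up for illustration and are not from the text.

```r
library(cluster)

# Toy data: two tight groups on a line (hypothetical values).
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
D  <- as.matrix(dist(x))

sil_by_hand <- sapply(seq_along(x), function(i) {
  own <- setdiff(which(cl == cl[i]), i)        # other members of i's cluster
  a <- mean(D[i, own])                         # mean distance within own cluster
  b <- min(sapply(setdiff(unique(cl), cl[i]),  # nearest other cluster, by mean distance
                  function(g) mean(D[i, cl == g])))
  (b - a) / max(a, b)
})

# Column 3 of the silhouette object holds the silhouette widths.
s <- silhouette(cl, dist(x))
all.equal(as.numeric(sil_by_hand), as.numeric(s[, 3]))
```

The sketch does not handle clusters with a single member, for which silhouette() sets the width to 0.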
The average silhouette coefficient can be used to choose the number
of clusters. The following code illustrates this with the k-means
method applied to the Iris dataset. We check for the best number of
groups in the interval [2, 6]:
set.seed(1234)
d <- dist(iris[ , -5])
avgS <- c()
for(k in 2:6) {
  cl <- kmeans(iris[ , -5], centers=k, iter.max=200)
  s <- silhouette(cl$cluster, d)
  avgS <- c(avgS, mean(s[ , 3])) # column 3 holds the silhouette widths
}
data.frame(nClus=2:6, Silh=avgS)
## nClus Silh
## 1 2 0.6810462
## 2 3 0.5528190
## 3 4 0.4965169
## 4 5 0.4609502
## 5 6 0.3664804
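The selection step itself can also be done in code. The sketch below is a variant of the loop above, not the version from Torgo: it uses sapply() instead of growing a vector, and it adds nstart = 10 (an assumption made here) so that each k is fitted from several random starts and the result depends less on the seed. The names X, ks and best_k are illustrative.

```r
library(cluster)

data(iris)
X  <- iris[, -5]   # drop the species column
d  <- dist(X)
ks <- 2:6

set.seed(1234)
avgS <- sapply(ks, function(k) {
  # nstart = 10 is an added assumption: keep the best of 10 random starts
  cl <- kmeans(X, centers = k, iter.max = 200, nstart = 10)
  mean(silhouette(cl$cluster, d)[, 3])
})

# Number of clusters with the largest average silhouette width.
best_k <- ks[which.max(avgS)]
best_k
```

In the output shown above, the largest average is at 2 clusters, even though Iris contains three species; this agrees with the earlier table, where setosa is separated cleanly while versicolor and virginica overlap.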
Please continue reading this chapter in Torgo, P.125 (bottom
half).
Study another widely used partitioning algorithm: the k-medoids
method.
This method is somewhat similar to the k-means method; however,
there are some important differences. What are those differences?
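As a starting point for that comparison, here is a minimal sketch that runs k-medoids on the same data with pam() from the cluster package. The choice of k = 3 mirrors the k-means example and is an assumption made here. One difference is visible directly in the output: every medoid is an actual row of the dataset, whereas a k-means center is an average that need not coincide with any observation.

```r
library(cluster)

data(iris)
pm <- pam(iris[, -5], k = 3)   # species column is not used

pm$medoids                     # the representative observation of each cluster
pm$id.med                      # row indices of those observations in iris

# Check that each medoid is an actual data row.
all.equal(unname(pm$medoids), unname(as.matrix(iris[pm$id.med, -5])))

# Cross-tabulate against the species, as was done for k-means.
table(pm$clustering, iris$Species)
```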