Clustering - The Iris Dataset

In R, the k-means algorithm is implemented in the function, kmeans(). We apply this function to the Iris dataset. There are 3 species of these plants.

If we did not know the species information for the 150 plants, we could try to use a clustering algorithm to form 3 groups with these plants, expecting that the algorithm would allocate each cluster plants of the sae species.

set.seed(1234)  # Setting a seed for the random number generator
data(iris)
ir3 <- kmeans(iris[ , -5], centers=3, iter.max = 200)   # not using Species info
ir3

## K-means clustering with 3 clusters of sizes 50, 62, 38
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.006000    3.428000     1.462000    0.246000
## 2     5.901613    2.748387     4.393548    1.433871
## 3     6.850000    3.073684     5.742105    2.071053
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
## 
## Within cluster sum of squares by cluster:
## [1] 15.15100 39.82097 23.87947
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Below, we use external information that consists of the class labels of each of the 150 plants:

table(ir3$cluster, iris$Species)

##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         48        14
##   3      0          2        36

cm <- table(ir3$cluster, iris$Species)
1 - sum(diag(cm))/sum(cm)

## [1] 0.1066667

The clustering that was obtained above assigns 10.7% of the plants to the wrong cluster, according to the class label.

The silhouette coefficient takes values between -1 and 1.

The silhouette coefficient is implemented in R through the function, silhouette() available in the package, cluster.

library(cluster)
s <- silhouette(ir3$cluster, dist(iris[ , -5]))
plot(s)

The function above requires the distance matrix of the dataset used in the clustering dataset. The result of this function can be given to the plot() function.

Below, the kmeans method is applied to the Iris dataset. Also, the code checks for the best number of groups in the interval [2,6].

The silhouette coefficient can be used to compare different clustering solutions or even to select the ideal number of clusters for a given method. Below, we use a simple way of searching for the ideal number of clusters using a simple for() loop.

set.seed(1234)
d <- dist(iris[ , -5])
avgS <- c()
for(k in 2:6) {
  cl <- kmeans(iris[ , -5], centers = k, iter.max = 200)
  s <- silhouette(cl$cluster, d)
  avgS <- c(avgS, mean(s[ , 3]))
} 
data.frame(nClus=2:6, Silh=avgS)

##   nClus      Silh
## 1     2 0.6810462
## 2     3 0.5528190
## 3     4 0.4965169
## 4     5 0.4609502
## 5     6 0.3664804

Another partitioning algorithm is the k-medoids method. It revolves around the medoids as cluster centers instead of means. Instead of squared distances, the implementations of this algorithm typically use sums of dissimilarities as search criterion. This method is implemented in the function, pam(), of the cluster package:

library(cluster)
set.seed(1234)
pc <- pam(iris[ , -5], k=3)
(cm <-table(pc$clustering, iris$Species))

##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         48        14
##   3      0          2        36

100*(1-sum(diag(cm))/sum(cm))

## [1] 10.66667

pc$silinfo$avg.width

## [1] 0.552819

In the package, fpc, we have the function, pamk(), that also searches for the ideal number of clusters, but it does it with more flexibility in terms of options. This function allows use to supply a range of a possible number of clusters so that we can use different criteria to search for the best solution, using either the function, pam(), or the function, clara().

Note tha “asw” means average silhouette width.

library(fpc)

## Warning: package 'fpc' was built under R version 4.2.1

data(iris)
sol <- pamk(iris[ , -5], krange=2:10, criterion="asw", usepam=TRUE)
sol

## $pamobject
## Medoids:
##       ID Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   8          5.0         3.4          1.5         0.2
## [2,] 127          6.2         2.8          4.8         1.8
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2
## Objective function:
##     build      swap 
## 0.9901187 0.8622026 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"      
## 
## $nc
## [1] 2
## 
## $crit
##  [1] 0.0000000 0.6857882 0.5528190 0.4896972 0.4867481 0.4703951 0.3390116
##  [8] 0.3318516 0.2918520 0.2918482

The output of the function above is a list with three components: pamobject, with the object resulting from running pam() (or clara()) with the optimal number of clusters: nc, with this number of clusters; and crit, with the scores of the internal validation metric used for the different values of the number of clusters. In our example above, we have tried from 2 to 10 and selected the best using “asw.”

Below, the first statement obtains the distance matrix of the cases. The second statement obtains the clustering. The resulting hierarchy, known as the dendrogram, can be shown by applying the function, plot(), to the object resulting from the call to hclust().

We have used a negative value for the hang parameter so that the labels are all at the bottom, and then we use the labels parameter to show the actual species of the plants. Note that “cex” means decreasing the font size.

d <- dist(scale(iris[ , -5]))
h <- hclust(d)
plot(h, hang=-0.1, labels=iris[["Species"]], cex=0.5)
clus3 <- cutree(h, 3)
(cm <- table(clus3, iris$Species))

##      
## clus3 setosa versicolor virginica
##     1     49          0         0
##     2      1         21         2
##     3      0         29        48

100*(1-sum(diag(cm))/sum(cm))

## [1] 21.33333

plot(h, hang=-0.1, labels=iris[["Species"]], cex=0.5)
rect.hclust(h, k=3)

A dendrogram corresponds to a set of possible clustering solutions. Cuting this hierarchy at different heights corresponds to selecting a particular number of clusters (i.e.: a solution), depending on how many vertical lines of the dendrogram are crossed by the cut.

The function, cutree(), above, was used to perform the cutting by specifying the desired number of groups. The result of this function is a vector with the number of the cluster to which each observation was assigned, according to the dendrogram cut.

Below, we assume the lack of external information and resort to the silhouette coefficient to compare a few of these variants:

set.seed(1234)
d <- dist(scale(iris[ , -5]))
methds <- c('complete', 'single', 'average')
avgS <- matrix(NA, ncol=3, nrow=5,
               dimnames=list(2:6, methds))
for(k in 2:6)
  for(m in seq_along(methds)) {
    h <- hclust(d, meth=methds[m])
    c <- cutree(h, k)
    s <- silhouette(c, d)
    avgS[k-1, m] <- mean(s[ , 3])
    
  }
avgS

##    complete    single   average
## 2 0.4408121 0.5817500 0.5817500
## 3 0.4496185 0.5046456 0.4802669
## 4 0.4106071 0.4067465 0.4067465
## 5 0.3520630 0.3424089 0.3746013
## 6 0.3106991 0.2018867 0.3248248

Above, it seems the best scores are obtained with the single and average linkage methods.

Below, we look at the Divisive hierarchal methods. These methods are less common than the agglormerative methods. We will use the DIANA algorithm which comes from the cluster package.

The DIANA algorithm selects the cluster to be split in two during the iterative process of building the dendrogram by looking at the cluster diameter.

This diameter is estimated as the largest dissimilarity between any two observations of the cluster.

Once this group is identified, the observation in that group with the largest average dissimilarity to the other members of the group is selected.

Then all observations are allocated to either the cluster of this selected observation or to the old group (represented by its center), depending on which one is nearest.

This algorithm is implemented in the chunk of code below:

di <- diana(iris[ , -5], metric='euclidean', stand=TRUE)
di3 <- cutree(di, 3)
(cm <- table(di3, iris$Species))

##    
## di3 setosa versicolor virginica
##   1     50          0         0
##   2      0         11        33
##   3      0         39        17

100*(1-sum(diag(cm))/sum(cm))

## [1] 48

The Diana() function internally calculates the Euclidean distance between observations (the variables have already been standardized).

Above, we have been assuming that the order of the clusters corresponds to the order of the values in the target variable.

If we look at the numbers in the confusion matrix, we oberserve that-

cluster 2 should be regarded as the cluster of the virginica species, and

cluster 3 of the versicolor spiecies.

This means that we should swap rows 2 and 3 before calculating the error:

cm <- cm[c(1,3,2),]
100*(1-sum(diag(cm))/sum(cm))

## [1] 18.66667

This leads to a considerably lower error.

Below, we will work with an algorithm that deals with a density-based clustering method. (Look at Torgo, Page 130.) These methods search for regions in the input space where cases are packed tightly. The DBSCAN algorithm is used for this purpose.

The key idea of this method is to estimate the density of a single observation as the number of observations that are within a certain radius of the observation, The radius being a key parameter of the method known as reachability distance.

The dbscan() function from the package, fpc, contains an implementation of the density-based clustering algorithm that is discussed in Torgo, Page 130.

The dbscan() function contains two parameters: eps and MinPts. These parameters control the reachability distance and the minimum number of observations within this distance for a case to be considered a core point, respectively.

The code below illustrates the application of this method to the Iris dataset.

library(fpc)
d <- scale(iris[ ,-5])
db <- dbscan(d, eps=0.9, MinPts=5)
db

## dbscan Pts=150 MinPts=5 eps=0.9
##        0  1  2
## border 4  1  4
## seed   0 48 93
## total  4 49 97

table(db$cluster, iris$Species)

##    
##     setosa versicolor virginica
##   0      1          0         3
##   1     49          0         0
##   2      0         50        47

Cluster 0 contains the set of observations that the algorithm has decided to tag as too different from the remaining cases. With the parameter settings used, the algorithm has clustered the remaining cases in two groups.

Looking at the confusion matrix (above), it seems cluster 1 contains the setosas, with the exception of one that was put in the group of outliers.

Cluster 2 contains the plants of the other two species. When analyzing the silhouette coefficients (above) for different clustering settings, creating two groups often appeared as the best option.

Clustering - The Iris Dataset - Aug 27 2022

Charles Pierre

2022-08-27

In R, the distance functions applicable to numeric vectors are implemented in the function, dist( ). See below:

In R, the k-means algorithm is implemented in the function, kmeans(). We apply this function to the Iris dataset. There are 3 species of these plants.

If we did not know the species information for the 150 plants, we could try to use a clustering algorithm to form 3 groups with these plants, expecting that the algorithm would allocate each cluster plants of the sae species.

Below, we use external information that consists of the class labels of each of the 150 plants:

The clustering that was obtained above assigns 10.7% of the plants to the wrong cluster, according to the class label.

The silhouette coefficient takes values between -1 and 1.

The silhouette coefficient is implemented in R through the function, silhouette() available in the package, cluster.

The function above requires the distance matrix of the dataset used in the clustering dataset. The result of this function can be given to the plot() function.

Below, the kmeans method is applied to the Iris dataset. Also, the code checks for the best number of groups in the interval [2,6].

The silhouette coefficient can be used to compare different clustering solutions or even to select the ideal number of clusters for a given method. Below, we use a simple way of searching for the ideal number of clusters using a simple for() loop.

Note tha “asw” means average silhouette width.

Below, the first statement obtains the distance matrix of the cases. The second statement obtains the clustering. The resulting hierarchy, known as the dendrogram, can be shown by applying the function, plot(), to the object resulting from the call to hclust().

We have used a negative value for the hang parameter so that the labels are all at the bottom, and then we use the labels parameter to show the actual species of the plants. Note that “cex” means decreasing the font size.

A dendrogram corresponds to a set of possible clustering solutions. Cuting this hierarchy at different heights corresponds to selecting a particular number of clusters (i.e.: a solution), depending on how many vertical lines of the dendrogram are crossed by the cut.

The function, cutree(), above, was used to perform the cutting by specifying the desired number of groups. The result of this function is a vector with the number of the cluster to which each observation was assigned, according to the dendrogram cut.

Below, we assume the lack of external information and resort to the silhouette coefficient to compare a few of these variants:

Above, it seems the best scores are obtained with the single and average linkage methods.

Below, we look at the Divisive hierarchal methods. These methods are less common than the agglormerative methods. We will use the DIANA algorithm which comes from the cluster package.

The DIANA algorithm selects the cluster to be split in two during the iterative process of building the dendrogram by looking at the cluster diameter.

This diameter is estimated as the largest dissimilarity between any two observations of the cluster.

Once this group is identified, the observation in that group with the largest average dissimilarity to the other members of the group is selected.

Then all observations are allocated to either the cluster of this selected observation or to the old group (represented by its center), depending on which one is nearest.

This algorithm is implemented in the chunk of code below:

The Diana() function internally calculates the Euclidean distance between observations (the variables have already been standardized).

Above, we have been assuming that the order of the clusters corresponds to the order of the values in the target variable.

If we look at the numbers in the confusion matrix, we oberserve that-

cluster 2 should be regarded as the cluster of the virginica species, and

cluster 3 of the versicolor spiecies.

This means that we should swap rows 2 and 3 before calculating the error:

This leads to a considerably lower error.

Below, we will work with an algorithm that deals with a density-based clustering method. (Look at Torgo, Page 130.) These methods search for regions in the input space where cases are packed tightly. The DBSCAN algorithm is used for this purpose.

The key idea of this method is to estimate the density of a single observation as the number of observations that are within a certain radius of the observation, The radius being a key parameter of the method known as reachability distance.

The dbscan() function from the package, fpc, contains an implementation of the density-based clustering algorithm that is discussed in Torgo, Page 130.

The dbscan() function contains two parameters: eps and MinPts. These parameters control the reachability distance and the minimum number of observations within this distance for a case to be considered a core point, respectively.

The code below illustrates the application of this method to the Iris dataset.

Cluster 0 contains the set of observations that the algorithm has decided to tag as too different from the remaining cases. With the parameter settings used, the algorithm has clustered the remaining cases in two groups.

Looking at the confusion matrix (above), it seems cluster 1 contains the setosas, with the exception of one that was put in the group of outliers.

Cluster 2 contains the plants of the other two species. When analyzing the silhouette coefficients (above) for different clustering settings, creating two groups often appeared as the best option.

However, it is well known that in this dataset the versicolor and virginica species are very hard to differentiate.