In R, the distance functions applicable to numeric vectors are
implemented in the function, dist( ). See below:
set.seed(1234)
randDat <- matrix(rnorm(50), nrow=5)
dist(randDat) # Euclidean distance (default)
## 1 2 3 4
## 2 4.261667
## 3 4.038030 2.060117
## 4 3.456732 3.726399 4.037978
## 5 5.307253 4.415046 4.111230 4.814393
dist(randDat, method="manhattan")
## 1 2 3 4
## 2 11.382197
## 3 10.016795 4.536827
## 4 9.887932 8.845512 8.829131
## 5 14.683770 10.617871 9.091241 11.362705
dist(randDat, method="minkowski", p=4)
## 1 2 3 4
## 2 2.899494
## 3 2.875467 1.653824
## 4 2.208297 2.814135 3.453336
## 5 3.488531 3.192217 3.398721 3.643788
In R, the k-means algorithm is implemented in the function,
kmeans(). We apply this function to the Iris dataset. There are 3
species of these plants.
The silhouette coefficient takes values between -1 and 1.
The silhouette coefficient is implemented in R through the function,
silhouette() available in the package, cluster.
library(cluster)
s <- silhouette(ir3$cluster, dist(iris[ , -5]))
plot(s)

The function above requires the distance matrix of the dataset used
in the clustering dataset. The result of this function can be given to
the plot() function.
Below, the kmeans method is applied to the Iris dataset. Also, the
code checks for the best number of groups in the interval [2,6].
The silhouette coefficient can be used to compare different
clustering solutions or even to select the ideal number of clusters for
a given method. Below, we use a simple way of searching for the ideal
number of clusters using a simple for() loop.
set.seed(1234)
d <- dist(iris[ , -5])
avgS <- c()
for(k in 2:6) {
cl <- kmeans(iris[ , -5], centers = k, iter.max = 200)
s <- silhouette(cl$cluster, d)
avgS <- c(avgS, mean(s[ , 3]))
}
data.frame(nClus=2:6, Silh=avgS)
## nClus Silh
## 1 2 0.6810462
## 2 3 0.5528190
## 3 4 0.4965169
## 4 5 0.4609502
## 5 6 0.3664804
Another partitioning algorithm is the k-medoids method. It revolves
around the medoids as cluster centers instead of means. Instead of
squared distances, the implementations of this algorithm typically use
sums of dissimilarities as search criterion. This method is implemented
in the function, pam(), of the cluster package:
library(cluster)
set.seed(1234)
pc <- pam(iris[ , -5], k=3)
(cm <-table(pc$clustering, iris$Species))
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 48 14
## 3 0 2 36
100*(1-sum(diag(cm))/sum(cm))
## [1] 10.66667
pc$silinfo$avg.width
## [1] 0.552819
In the package, fpc, we have the function, pamk(), that also
searches for the ideal number of clusters, but it does it with more
flexibility in terms of options. This function allows use to supply a
range of a possible number of clusters so that we can use different
criteria to search for the best solution, using either the function,
pam(), or the function, clara().
Note tha “asw” means average silhouette width.
library(fpc)
## Warning: package 'fpc' was built under R version 4.2.1
data(iris)
sol <- pamk(iris[ , -5], krange=2:10, criterion="asw", usepam=TRUE)
sol
## $pamobject
## Medoids:
## ID Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] 8 5.0 3.4 1.5 0.2
## [2,] 127 6.2 2.8 4.8 1.8
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2
## Objective function:
## build swap
## 0.9901187 0.8622026
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
##
## $nc
## [1] 2
##
## $crit
## [1] 0.0000000 0.6857882 0.5528190 0.4896972 0.4867481 0.4703951 0.3390116
## [8] 0.3318516 0.2918520 0.2918482
The output of the function above is a list with three components:
pamobject, with the object resulting from running pam() (or clara())
with the optimal number of clusters: nc, with this number of clusters;
and crit, with the scores of the internal validation metric used for the
different values of the number of clusters. In our example above, we
have tried from 2 to 10 and selected the best using “asw.”
Below, the first statement obtains the distance matrix of the cases.
The second statement obtains the clustering. The resulting hierarchy,
known as the dendrogram, can be shown by applying the function, plot(),
to the object resulting from the call to hclust().
We have used a negative value for the hang parameter so that the
labels are all at the bottom, and then we use the labels parameter to
show the actual species of the plants. Note that “cex” means decreasing
the font size.
d <- dist(scale(iris[ , -5]))
h <- hclust(d)
plot(h, hang=-0.1, labels=iris[["Species"]], cex=0.5)
clus3 <- cutree(h, 3)
(cm <- table(clus3, iris$Species))
##
## clus3 setosa versicolor virginica
## 1 49 0 0
## 2 1 21 2
## 3 0 29 48
100*(1-sum(diag(cm))/sum(cm))
## [1] 21.33333
plot(h, hang=-0.1, labels=iris[["Species"]], cex=0.5)
rect.hclust(h, k=3)

A dendrogram corresponds to a set of possible clustering solutions.
Cuting this hierarchy at different heights corresponds to selecting a
particular number of clusters (i.e.: a solution), depending on how many
vertical lines of the dendrogram are crossed by the cut.
The function, cutree(), above, was used to perform the cutting by
specifying the desired number of groups. The result of this function is
a vector with the number of the cluster to which each observation was
assigned, according to the dendrogram cut.
Below, we look at the Divisive hierarchal methods. These methods are
less common than the agglormerative methods. We will use the DIANA
algorithm which comes from the cluster package.
The DIANA algorithm selects the cluster to be split in two during
the iterative process of building the dendrogram by looking at the
cluster diameter.
This diameter is estimated as the largest dissimilarity between any
two observations of the cluster.
Once this group is identified, the observation in that group with
the largest average dissimilarity to the other members of the group is
selected.
Then all observations are allocated to either the cluster of this
selected observation or to the old group (represented by its center),
depending on which one is nearest.
This algorithm is implemented in the chunk of code below:
di <- diana(iris[ , -5], metric='euclidean', stand=TRUE)
di3 <- cutree(di, 3)
(cm <- table(di3, iris$Species))
##
## di3 setosa versicolor virginica
## 1 50 0 0
## 2 0 11 33
## 3 0 39 17
100*(1-sum(diag(cm))/sum(cm))
## [1] 48
The Diana() function internally calculates the Euclidean distance
between observations (the variables have already been
standardized).
Above, we have been assuming that the order of the clusters
corresponds to the order of the values in the target variable.
If we look at the numbers in the confusion matrix, we oberserve
that-
cluster 2 should be regarded as the cluster of the virginica
species, and
cluster 3 of the versicolor spiecies.
This means that we should swap rows 2 and 3 before calculating the
error:
cm <- cm[c(1,3,2),]
100*(1-sum(diag(cm))/sum(cm))
## [1] 18.66667
This leads to a considerably lower error.
Below, we will work with an algorithm that deals with a
density-based clustering method. (Look at Torgo, Page 130.) These
methods search for regions in the input space where cases are packed
tightly. The DBSCAN algorithm is used for this purpose.
The key idea of this method is to estimate the density of a single
observation as the number of observations that are within a certain
radius of the observation, The radius being a key parameter of the
method known as reachability distance.
The dbscan() function from the package, fpc, contains an
implementation of the density-based clustering algorithm that is
discussed in Torgo, Page 130.
The dbscan() function contains two parameters: eps and MinPts. These
parameters control the reachability distance and the minimum number of
observations within this distance for a case to be considered a core
point, respectively.
The code below illustrates the application of this method to the
Iris dataset.
library(fpc)
d <- scale(iris[ ,-5])
db <- dbscan(d, eps=0.9, MinPts=5)
db
## dbscan Pts=150 MinPts=5 eps=0.9
## 0 1 2
## border 4 1 4
## seed 0 48 93
## total 4 49 97
table(db$cluster, iris$Species)
##
## setosa versicolor virginica
## 0 1 0 3
## 1 49 0 0
## 2 0 50 47
Cluster 0 contains the set of observations that the algorithm has
decided to tag as too different from the remaining cases. With the
parameter settings used, the algorithm has clustered the remaining cases
in two groups.
Looking at the confusion matrix (above), it seems cluster 1 contains
the setosas, with the exception of one that was put in the group of
outliers.
Cluster 2 contains the plants of the other two species. When
analyzing the silhouette coefficients (above) for different clustering
settings, creating two groups often appeared as the best option.
However, it is well known that in this dataset the versicolor and
virginica species are very hard to differentiate.