Cluster analysis, like PCA, is a method of exploratory multivariate analysis. Note, however, that although a PCA result can reveal patterns in the data, such as clusters of observations (individuals), PCA is not designed for that purpose.
There are several clustering methods of different types, such as hierarchical (agglomerative) methods, optimization methods (e.g., Tocher's), and partitioning methods (e.g., k-means).
All of them are based on distances among the \(n\) observations in a \(p\)-dimensional space. Hence, the presence (or absence) of correlated variables does not limit the effectiveness of a cluster analysis, as it does with PCA. More specifically, clusters can still be formed even if the variables are uncorrelated.
Many different distance metrics are available for measuring the distance between two individuals, say \(i\) and \(k\), based on numeric variables. The most common are the Euclidean distance:
\[d_{ik} = \sqrt{({\bf Y}_i - {\bf Y}_k)^T({\bf Y}_i - {\bf Y}_k)}\]
and the Mahalanobis distance:
\[D_{ik} = \sqrt{({\bf Y}_i - {\bf Y}_k)^T{\bf S}^{-1}({\bf Y}_i - {\bf Y}_k)}\] where \({\bf S}\) is an estimate of the covariance matrix of the \(p\) variables.
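As a quick illustration, both distances can be computed by hand. The vectors below are hypothetical, and the identity matrix stands in for \({\bf S}\), in which case the two distances coincide:
y_i <- c(38.8, 30.0, 22.9)  # hypothetical observation vectors
y_k <- c(45.0, 33.7, 31.7)
sqrt(sum((y_i - y_k)^2))  # Euclidean distance
S <- diag(3)  # hypothetical covariance matrix (identity, for illustration)
sqrt(t(y_i - y_k) %*% solve(S) %*% (y_i - y_k))  # Mahalanobis distance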
For binary variables, measures such as the Jaccard similarity index can be used. When the data set contains mixed variables, i.e., categorical, binary, and numeric, Gower's coefficient might be handy. These measures are available in the package cluster, among others.
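For instance, Gower's coefficient is implemented in cluster::daisy(); a minimal sketch with a toy mixed data frame (hypothetical values) would be:
library(cluster)
mixed <- data.frame(BW = c(22.9, 31.7, 36.2),  # numeric
                    resistant = factor(c('yes', 'no', 'yes')),  # binary
                    region = factor(c('A', 'A', 'B')))  # categorical
daisy(mixed, metric = 'gower')  # pairwise Gower dissimilarities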
The following data set contains observations on 20 genotypes of garlic for production variables, namely diameter of bulb, length of bulb, bulb weight, average number of bulbils per bulb, average number of leaves per plant, and leaf area.
path <- 'http://arsilva.weebly.com/uploads/2/1/0/0/21008856/alho.txt'
garlic <- read.table(path, header = TRUE)
colnames(garlic) <- c('Genotype', 'DIAM', 'LENGTH', 'BW', 'NbB', 'NL', 'LA')
garlic$Genotype <- 1:nrow(garlic)
str(garlic)
'data.frame': 20 obs. of 7 variables:
$ Genotype: int 1 2 3 4 5 6 7 8 9 10 ...
$ DIAM : num 38.8 45 45.6 48 44.4 ...
$ LENGTH : num 30 33.7 35.1 34.3 34 ...
$ BW : num 22.9 31.7 36.2 32.5 26 ...
$ NbB : num 29.8 22.2 51.4 41.5 15.5 ...
$ NL : num 6.5 7.5 6.75 6.75 7.25 7.25 6.5 6.5 7.5 7.5 ...
$ LA : num 95.3 108.5 93.4 109.2 121.4 ...
Because the variables have different scales, those with larger scales may dominate the distance values while not necessarily being the most discriminating. To avoid that, the variables should be standardized:
garlic_s <- scale(garlic[, -1])
apply(garlic_s, 2, mean) # means
    DIAM   LENGTH       BW      NbB       NL       LA
-2.4e-16 -6.2e-16  1.7e-16 -1.5e-17 -6.9e-18 -2.7e-17
apply(garlic_s, 2, sd) # standard deviations
  DIAM LENGTH     BW    NbB     NL     LA
     1      1      1      1      1      1
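As a check, the standardization performed by scale() can be reproduced by hand for one variable, say DIAM:
z <- (garlic$DIAM - mean(garlic$DIAM)) / sd(garlic$DIAM)  # centre and rescale
all.equal(z, as.numeric(garlic_s[, 'DIAM']))  # should be TRUE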
The matrix of Euclidean distance can be obtained with
d <- dist(garlic_s)
while the matrix of Mahalanobis distance can be obtained using the package biotools:
# install.packages('biotools')
library(biotools)
D <- sqrt( D2.dist(garlic_s, cov(garlic_s)) )
Note: by using the Mahalanobis distance, we take the correlations among the variables into account.
Both objects d and D contain the 190 pairwise distances among the 20 genotypes.
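That number is simply \(\binom{20}{2}\), the number of pairs of genotypes; the dist object can be inspected as follows:
choose(20, 2)  # 190 pairs
length(d)  # 190 stored distances
round(as.matrix(d)[1:4, 1:4], 2)  # first genotypes as a symmetric matrix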
Hierarchical methods such as UPGMA and Ward's are agglomerative, producing a tree (dendrogram) of objects that are successively merged during the clustering process.
h1 <- hclust(D, method = 'average')
h2 <- hclust(D, method = 'ward.D2')
par(mfrow = c(1, 2), cex = 0.7)
plot(h1, hang = -1, main = 'UPGMA')
plot(h2, hang = -1, main = 'Ward')
Clusters can be identified by placing a cutoff line on the y-axis. For instance, a line at height 4 on Ward's dendrogram produces seven clusters of genotypes, with genotypes 13 and 14 each forming a singleton cluster. Check those clusters with
rbind( G = garlic$Genotype, Cluster = cutree(h2, h = 4) )
        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
G          1    2    3    4    5    6    7    8    9    10    11    12    13
Cluster    1    2    3    4    5    4    1    1    5     5     5     4     6
        [,14] [,15] [,16] [,17] [,18] [,19] [,20]
G          14    15    16    17    18    19    20
Cluster     7     5     2     1     1     3     1
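The same clusters can be highlighted directly on the dendrogram with rect.hclust(); a short sketch (the border colour is an arbitrary choice):
plot(h2, hang = -1, main = 'Ward')
rect.hclust(h2, h = 4, border = 'red')  # draw boxes around the clusters at height 4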
Different cutoff points will produce different clustering results. There are, however, methods for determining the cutoff point objectively, such as Mojena's (1977) rule or statistics based on \(R^2\) and pseudo-F, the latter two based on maximizing the variability between clusters.
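A minimal sketch of Mojena's rule, assuming its usual formulation (cut at the first fusion height exceeding the mean plus k standard deviations of all fusion heights; k = 1.25 is a commonly cited choice, not the only one):
k_mojena <- 1.25  # commonly cited constant; other values are also used
cutoff <- mean(h2$height) + k_mojena * sd(h2$height)
cutoff  # objective cutoff height for Ward's dendrogram
cutree(h2, h = cutoff)  # resulting cluster memberships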
Tocher’s optimization method establishes mutually exclusive clusters without the need to define the number of clusters in advance. Furthermore, Tocher’s method can be used to determine the number of clusters in dendrograms. Clusters are established according to an objective function that adopts an optimization criterion: minimizing the average intra-cluster distance while maximizing the average inter-cluster distances.
toc <- tocher(D)
toc
Tocher's Clustering
Call: tocher.dist(d = D)
Cluster algorithm: original
Number of objects: 20
Number of clusters: 5
Most contrasting clusters: cluster 4 and cluster 5, with
average intercluster distance: 5.4
$`cluster 1`
[1] 9 10 15 5 11 8 1 18 20 17 7 16 6 2 19 12
$`cluster 2`
[1] 3
$`cluster 3`
[1] 4
$`cluster 4`
[1] 13
$`cluster 5`
[1] 14
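The printed clusters can also be retrieved programmatically. Assuming the tocher object stores them in a clusters component, as the printed output suggests (treat this accessor as an assumption), the cluster sizes could be obtained with:
sapply(toc$clusters, length)  # cluster sizes (clusters accessor is an assumption)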
Cophenetic distances are the distances between observations as represented by the clustering result. This means that different clustering algorithms may represent the original distance matrix differently.
D_h1 <- cophenetic(h1)
D_h2 <- cophenetic(h2)
D_toc <- cophenetic(toc)
The correlation between the cophenetic and the original distances is a way to evaluate the quality of the clustering.
cor(D, D_h1)
[1] 0.79
cor(D, D_h2)
[1] 0.58
cor(D, D_toc)
[1] 0.7
These results suggest that UPGMA, having the highest cophenetic correlation, should be preferred.
Is that cophenetic correlation significantly greater than zero? This question can be answered using the Mantel (1967) permutation test.
mantelTest(D, D_h1, nperm = 999)
Mantel's permutation test
Correlation: 0.79
p-value: 0.001, based on 999 matrix permutations
Alternative hypothesis: true correlation is greater than 0
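For illustration, the permutation idea behind the test can be sketched by hand, shuffling one matrix and recomputing the correlation (the seed and the number of permutations are arbitrary choices; mantelTest() above is the ready-made route):
set.seed(123)  # arbitrary seed for reproducibility
obs <- cor(D, D_h1)  # observed cophenetic correlation
perm <- replicate(999, {
  idx <- sample(20)  # random relabelling of the genotypes
  cor(as.dist(as.matrix(D)[idx, idx]), D_h1)
})
(sum(perm >= obs) + 1) / (999 + 1)  # one-sided p-value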
Unlike the previous methods, which operate on a distance matrix, k-means tries to partition the original data \({\bf Y}\) into k groups such that the sum of squares (SS) from the observations to the assigned cluster centres is minimized. The number k of partitions must be specified in advance; alternatively, the vectors of the centres can be passed as input. An \(R^2\)-like measure can be obtained from the SS and used to evaluate the clustering quality.
k <- kmeans(garlic_s, centers = 5)
k
K-means clustering with 5 clusters of sizes 8, 5, 2, 1, 4
Cluster means:
DIAM LENGTH BW NbB NL LA
1 0.17 0.25 -0.096 -0.59 0.77 0.318
2 -1.26 -1.28 -1.131 -0.50 -0.85 -0.883
3 -0.19 -0.29 0.214 0.65 -1.35 -0.594
4 1.25 1.56 2.091 2.27 1.90 3.255
5 1.02 0.86 0.977 0.91 -0.27 -0.048
Clustering vector:
[1] 2 1 5 5 1 5 2 2 1 1 1 1 4 5 1 1 3 2 3 2
Within cluster sum of squares by cluster:
[1] 11.3 3.8 1.5 0.0 9.1
(between_SS / total_SS = 77.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"