Cluster analysis, like PCA, is a method of exploratory multivariate analysis. Note, however, that although a PCA result can reveal patterns in the data, such as clusters of observations (individuals), PCA is not designed for that purpose.
There are several clustering methods of different types, such as hierarchical (agglomerative) methods, optimization methods (e.g., Tocher's), and partitioning methods (e.g., k-means).
All of them are based on distances among the \(n\) observations in a \(p\)-dimensional space. Hence, the presence (or absence) of correlated variables does not limit the effectiveness of a cluster analysis, as it does with PCA. More specifically, clusters can still be formed even if the variables are uncorrelated.
Many different distance metrics are available for measuring the distance between two individuals, say \(i\) and \(k\), based on numeric variables. The most common are the Euclidean distance:
\[d_{ik} = \sqrt{({\bf Y}_i - {\bf Y}_k)^T({\bf Y}_i - {\bf Y}_k)}\]
and the Mahalanobis distance:
\[D_{ik} = \sqrt{({\bf Y}_i - {\bf Y}_k)^T{\bf S}^{-1}({\bf Y}_i - {\bf Y}_k)}\] where \({\bf S}\) is an estimate of the covariance matrix of the \(p\) variables.
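As a quick illustration, both distances can be computed by hand. The vectors below are hypothetical, and the identity matrix stands in for \({\bf S}\), in which case the two distances coincide:
y_i <- c(38.8, 30.0, 22.9)  # hypothetical observation vectors
y_k <- c(45.0, 33.7, 31.7)
sqrt(sum((y_i - y_k)^2))  # Euclidean distance
S <- diag(3)  # hypothetical covariance matrix (identity, for illustration)
sqrt(t(y_i - y_k) %*% solve(S) %*% (y_i - y_k))  # Mahalanobis distance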
For binary variables, measures such as the Jaccard similarity index can be used. When the data set contains mixed variables, i.e., categorical, binary, and numeric, Gower's coefficient might be handy. These measures are available in the package cluster, among others.
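For instance, Gower's coefficient is implemented in cluster::daisy(); a minimal sketch with a toy mixed data frame (hypothetical values) would be:
library(cluster)
mixed <- data.frame(BW = c(22.9, 31.7, 36.2),  # numeric
                    resistant = factor(c('yes', 'no', 'yes')),  # binary
                    region = factor(c('A', 'A', 'B')))  # categorical
daisy(mixed, metric = 'gower')  # pairwise Gower dissimilarities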
The following data set contains observations on 20 genotypes of garlic for production variables, namely diameter of bulb, length of bulb, bulb weight, average number of bulbils per bulb, average number of leaves per plant, and leaf area.
path <- 'http://arsilva.weebly.com/uploads/2/1/0/0/21008856/alho.txt'
garlic <- read.table(path, header = TRUE)
colnames(garlic) <- c('Genotype', 'DIAM', 'LENGTH', 'BW', 'NbB', 'NL', 'LA')
garlic$Genotype <- 1:nrow(garlic)
str(garlic)
'data.frame': 20 obs. of 7 variables:
$ Genotype: int 1 2 3 4 5 6 7 8 9 10 ...
$ DIAM : num 38.8 45 45.6 48 44.4 ...
$ LENGTH : num 30 33.7 35.1 34.3 34 ...
$ BW : num 22.9 31.7 36.2 32.5 26 ...
$ NbB : num 29.8 22.2 51.4 41.5 15.5 ...
$ NL : num 6.5 7.5 6.75 6.75 7.25 7.25 6.5 6.5 7.5 7.5 ...
$ LA : num 95.3 108.5 93.4 109.2 121.4 ...
Because the variables have different scales, those with larger scales may dominate the distance values while not necessarily being the most discriminating. To avoid that, the variables should be standardized:
garlic_s <- scale(garlic[, -1])
apply(garlic_s, 2, mean) # means
    DIAM   LENGTH       BW      NbB       NL       LA
-2.4e-16 -6.2e-16  1.7e-16 -1.5e-17 -6.9e-18 -2.7e-17
apply(garlic_s, 2, sd) # standard deviations
  DIAM LENGTH     BW    NbB     NL     LA
     1      1      1      1      1      1
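As a check, the standardization performed by scale() can be reproduced by hand for one variable, say DIAM:
z <- (garlic$DIAM - mean(garlic$DIAM)) / sd(garlic$DIAM)  # centre and rescale
all.equal(z, as.numeric(garlic_s[, 'DIAM']))  # should be TRUE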
The matrix of Euclidean distance can be obtained with
d <- dist(garlic_s)
while the matrix of Mahalanobis distance can be obtained using the package biotools:
# install.packages('biotools')
library(biotools)
D <- sqrt( D2.dist(garlic_s, cov(garlic_s)) )
Note: by using the Mahalanobis distance, we take the correlations among the variables into account.
Both objects d and D contain the 190 pairwise distances among the 20 genotypes.
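That number is simply \(\binom{20}{2}\), the number of pairs of genotypes; the dist object can be inspected as follows:
choose(20, 2)  # 190 pairs
length(d)  # 190 stored distances
round(as.matrix(d)[1:4, 1:4], 2)  # first genotypes as a symmetric matrix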
Hierarchical methods such as UPGMA and Ward's are agglomerative, producing a tree (dendrogram) of objects that are successively merged during the clustering process.
h1 <- hclust(D, method = 'average')
h2 <- hclust(D, method = 'ward.D2')
par(mfrow = c(1, 2), cex = 0.7)
plot(h1, hang = -1, main = 'UPGMA')
plot(h2, hang = -1, main = 'Ward')
Clusters can be identified by placing a cutoff line on the y-axis. For instance, a line at height 4 on Ward's dendrogram produces seven clusters of genotypes, with genotypes 13 and 14 each forming a singleton cluster. Check those clusters with
rbind( G = garlic$Genotype, Cluster = cutree(h2, h = 4) )
        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
G          1    2    3    4    5    6    7    8    9    10    11    12    13
Cluster    1    2    3    4    5    4    1    1    5     5     5     4     6
        [,14] [,15] [,16] [,17] [,18] [,19] [,20]
G          14    15    16    17    18    19    20
Cluster     7     5     2     1     1     3     1
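The same clusters can be highlighted directly on the dendrogram with rect.hclust(); a short sketch (the border colour is an arbitrary choice):
plot(h2, hang = -1, main = 'Ward')
rect.hclust(h2, h = 4, border = 'red')  # draw boxes around the clusters at height 4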
Different cutoff points will produce different clustering results. There are, however, methods for determining the cutoff point objectively, such as Mojena's (1977) rule or statistics based on \(R^2\) and pseudo-F, the latter two based on maximizing the variability between clusters.
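A minimal sketch of Mojena's rule, assuming its usual formulation (cut at the first fusion height exceeding the mean plus k standard deviations of all fusion heights; k = 1.25 is a commonly cited choice, not the only one):
k_mojena <- 1.25  # commonly cited constant; other values are also used
cutoff <- mean(h2$height) + k_mojena * sd(h2$height)
cutoff  # objective cutoff height for Ward's dendrogram
cutree(h2, h = cutoff)  # resulting cluster memberships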
Tocher’s optimization method establishes mutually exclusive clusters without the need to define the number of clusters in advance. Furthermore, Tocher’s method can be used to determine the number of clusters in dendrograms. Clusters are established according to an objective function that adopts an optimization criterion: minimizing the average intra-cluster distance while maximizing the average inter-cluster distances.
toc <- tocher(D)
toc
Tocher's Clustering
Call: tocher.dist(d = D)
Cluster algorithm: original
Number of objects: 20
Number of clusters: 5
Most contrasting clusters: cluster 4 and cluster 5, with
average intercluster distance: 5.4
$`cluster 1`
[1] 9 10 15 5 11 8 1 18 20 17 7 16 6 2 19 12
$`cluster 2`
[1] 3
$`cluster 3`
[1] 4
$`cluster 4`
[1] 13
$`cluster 5`
[1] 14
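The printed clusters can also be retrieved programmatically. Assuming the tocher object stores them in a clusters component, as the printed output suggests (treat this accessor as an assumption), the cluster sizes could be obtained with:
sapply(toc$clusters, length)  # cluster sizes (clusters accessor is an assumption)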
Cophenetic distances are the distances between observations as represented by the clustering result. This means that different clustering algorithms may represent the original distance matrix differently.
D_h1 <- cophenetic(h1)
D_h2 <- cophenetic(h2)
D_toc <- cophenetic(toc)
The correlation between the cophenetic and the original distances is a way to evaluate the quality of the clustering.
cor(D, D_h1)
[1] 0.79
cor(D, D_h2)
[1] 0.58
cor(D, D_toc)
[1] 0.7
These results suggest that UPGMA, having the highest cophenetic correlation, should be preferred.
Is that cophenetic correlation significantly greater than zero? This question can be answered using the Mantel (1967) permutation test.
mantelTest(D, D_h1, nperm = 999)
Mantel's permutation test
Correlation: 0.79
p-value: 0.001, based on 999 matrix permutations
Alternative hypothesis: true correlation is greater than 0
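For illustration, the permutation idea behind the test can be sketched by hand, shuffling one matrix and recomputing the correlation (the seed and the number of permutations are arbitrary choices; mantelTest() above is the ready-made route):
set.seed(123)  # arbitrary seed for reproducibility
obs <- cor(D, D_h1)  # observed cophenetic correlation
perm <- replicate(999, {
  idx <- sample(20)  # random relabelling of the genotypes
  cor(as.dist(as.matrix(D)[idx, idx]), D_h1)
})
(sum(perm >= obs) + 1) / (999 + 1)  # one-sided p-value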
Unlike the previous methods, which operate on a distance matrix, k-means tries to partition the original data \({\bf Y}\) into k groups such that the sum of squares (SS) from the observations to the assigned cluster centres is minimized. The number k of partitions must be specified in advance; alternatively, the vectors of the centres can be passed as input. An \(R^2\)-like measure can be obtained from the SS and used to evaluate the clustering quality.
k <- kmeans(garlic_s, centers = 5)
k
K-means clustering with 5 clusters of sizes 8, 5, 2, 1, 4
Cluster means:
DIAM LENGTH BW NbB NL LA
1 0.17 0.25 -0.096 -0.59 0.77 0.318
2 -1.26 -1.28 -1.131 -0.50 -0.85 -0.883
3 -0.19 -0.29 0.214 0.65 -1.35 -0.594
4 1.25 1.56 2.091 2.27 1.90 3.255
5 1.02 0.86 0.977 0.91 -0.27 -0.048
Clustering vector:
[1] 2 1 5 5 1 5 2 2 1 1 1 1 4 5 1 1 3 2 3 2
Within cluster sum of squares by cluster:
[1] 11.3 3.8 1.5 0.0 9.1
(between_SS / total_SS = 77.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"