Hierarchical Clustering

Author

David McCabe

Published

February 9, 2026

TSB Optimise Proposal - An older method developed to understand the impact of design variables along the Pareto frontier captured by the GA's genetic optimisation engine. Rows are sorted/grouped in terms of similar design variables, and columns are sorted/grouped in terms of similar objective variables.

head(d, n=9)
    dgn.var.1  dgn.var.2 obj.var.1  obj.var.2
        <num>      <num>     <num>      <num>
1:  12.655175  0.8016304 6.3402695  11.853545
2:  -5.616493 -0.5840207 2.8233880  -5.032473
3:  13.114151  0.5083654 6.5620002  12.605785
4: -12.606379 -0.3669127 6.3058588 -12.239466
5:  19.196444  0.4122441 9.6004350  18.784200
6:   1.212050 -0.9268597 0.7629112   2.138910
7:  -6.130704 -1.1441838 3.1182804  -4.986520
8: -10.640974 -0.1725064 5.3211861 -10.468468
9:  -9.517473 -0.3145619 4.7613348  -9.202911
heatmap(as.matrix(d), Rowv = TRUE, Colv = NA) # cluster rows only: sort the designs as points in the 4-D design-parameter space

heatmap(as.matrix(d)) # default: cluster and reorder both rows and columns

Clustering both the 4-dimensional feature-space points and the 50-dimensional dual design space lets us examine the features and design parameters represented in the Pareto front. The heatmap shuffles the rows and also the columns without scrambling the data (the designs in the feature space and the features in the dual design space are the same - the rows and columns are just ordered according to the clustering). This is more useful for gene expression datasets with high numbers of features.
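As a sketch of what the clustering actually does to the matrix (assuming d is the table above; note that heatmap() additionally reorders each dendrogram by row/column means, so the exact leaf order can differ):

m <- as.matrix(d)
row_ord <- hclust(dist(m))$order    # dendrogram leaf order of the rows (designs)
col_ord <- hclust(dist(t(m)))$order # dendrogram leaf order of the columns (variables)
m[row_ord, col_ord]                 # same values as m - rows and columns merely permuted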

Basic Workflow

0. Here we prepare a trivial example

Let’s put the vertices of a simple 3-4-5 triangle into a data.table…

       id          x          y
   <char>      <num>      <num>
1:      A -0.1083825  0.5756526
2:      B  3.2772819 -0.4908377
3:      C -0.0689406  2.8234177
4:      D  0.5678667 -0.5379234
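The construction code was elided above; as a sketch, the printed table can be rebuilt directly from the values shown:

library(data.table)
# Rebuild the printed table verbatim (values copied from the output above):
d <- data.table(
  id = c("A", "B", "C", "D"),
  x  = c(-0.1083825, 3.2772819, -0.0689406, 0.5678667),
  y  = c( 0.5756526, -0.4908377, 2.8234177, -0.5379234)
)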
1. First we generate a distance matrix
dm <- dist(d[, .(x, y)], method = "euclidean")
print(dm)
         1        2        3
2 3.549665                  
3 2.248111 4.709723         
4 1.302829 2.709824 3.421131
2. Second we perform hierarchical clustering to produce a dendrogram
dendrogram <- hclust(dm, method = "complete")
dendrogram

Call:
hclust(d = dm, method = "complete")

Cluster method   : complete 
Distance         : euclidean 
Number of objects: 4 
3. Visualise and “slice the tree”
plot(dendrogram, labels = d[, id], main = "Have a nice day!")

clusters <- cutree(dendrogram, k = 2) # Cut the dendrogram at 2 clusters

d$cluster <- clusters

print(d)
       id          x          y cluster
   <char>      <num>      <num>   <int>
1:      A -0.1083825  0.5756526       1
2:      B  3.2772819 -0.4908377       2
3:      C -0.0689406  2.8234177       1
4:      D  0.5678667 -0.5379234       1
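cutree() can also slice the tree at a given height rather than into a fixed number of clusters; a quick sketch (the height of 3 is an illustrative value, not taken from the original):

# Cut at dendrogram height h instead of cluster count k;
# all merges above h = 3 are undone:
cutree(dendrogram, h = 3)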

Distance Metrics

The method of calculating distance is customisable in the dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2) function.

Method       Purpose
"euclidean"  Useful for high-dimensional data
"maximum"    Chebyshev (supremum) distance - https://en.wikipedia.org/wiki/Chebyshev_distance
"manhattan"  Taxicab geometry - https://en.wikipedia.org/wiki/Taxicab_geometry
"canberra"   A weighted version of the taxicab distance - https://en.wikipedia.org/wiki/Canberra_distance
"binary"     Proportion of discordant bits among those where at least one bit is on
"minkowski"  The generalised p-norm (p = 2 recovers Euclidean) - https://en.wikipedia.org/wiki/Minkowski_distance
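For example, on the triangle data above, swapping the method changes only the dissimilarity values - the rest of the workflow is unchanged (a minimal sketch):

# Same points, different metrics:
dist(d[, .(x, y)], method = "manhattan")        # taxicab distances
dist(d[, .(x, y)], method = "maximum")          # Chebyshev distances
dist(d[, .(x, y)], method = "minkowski", p = 3) # general p-norm, here p = 3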

Agglomeration Methods

The method of measuring inter-cluster distance is customisable in the hclust(d, method = "complete", members = NULL) function.

Method       Purpose
"ward.D"     Ward's minimum-variance criterion applied to the dissimilarities as given
"ward.D2"    Ward's criterion with the dissimilarities squared before updating
"single"     single linkage - distance between the two nearest points of the two clusters
"complete"   complete linkage - distance between the two furthest points of the two clusters
"average" (= UPGMA)   average linkage - mean of all pairwise distances \(d(x_i, x_j)\) taken over points \(x_i\) in one cluster and \(x_j\) in the other
"mcquitty" (= WPGMA)  weighted average linkage - https://en.wikipedia.org/wiki/WPGMA
"median" (= WPGMC)    distance between weighted (median) cluster centroids
"centroid" (= UPGMC)  distance between cluster centroids
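The linkage choice can reshape the tree noticeably. A quick comparison on the distance matrix from the workflow above (a sketch; the plot titles are illustrative):

# Same distance matrix dm, different agglomeration rules:
plot(hclust(dm, method = "single"),  labels = d[, id], main = "Single linkage")
plot(hclust(dm, method = "average"), labels = d[, id], main = "Average linkage (UPGMA)")
plot(hclust(dm, method = "ward.D2"), labels = d[, id], main = "Ward's criterion")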