We are going to identify clusters in the ‘mtcars’ dataset and experiment with different linkage methods. We aim to develop an understanding of how the choice of linkage affects the solution(s).

We begin by viewing the first six rows of the ‘mtcars’ dataset:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Using the ‘dist’ function with method = ‘euclidean’, we calculate the matrix of pairwise distances between all of the datapoints:

distance_mat <- dist(mtcars, method = 'euclidean')

If we were to examine the entire matrix, we would see that Mazda RX4 and Mazda RX4 Wag are very close together, while Mazda RX4 and Duster 360 are far apart. From the portion of the matrix visible at the start, most values appear to lie between 0 and 200.
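To check these claims for particular pairs, we could convert the ‘dist’ object into a full matrix and look up individual distances by name (an illustrative sketch; ‘d_full’ is just a working name):

d_full <- as.matrix(distance_mat)      # full symmetric matrix with car names
d_full["Mazda RX4", "Mazda RX4 Wag"]   # very small distance
d_full["Mazda RX4", "Duster 360"]      # much larger distance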

Out of curiosity, let’s see the spread of distances:

round(summary(distance_mat),1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.6    75.8   156.7   169.3   248.7   425.3
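For a visual impression of this spread, we could also draw a histogram of all the pairwise distances (an optional sketch, not run above):

# A 'dist' object is a numeric vector underneath, so we coerce it first.
hist(as.numeric(distance_mat),
     main = "Spread of pairwise Euclidean distances",
     xlab = "distance")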

The ‘hclust’ function takes two main arguments: a distance matrix (a dissimilarity structure as produced by ‘dist’) and the linkage (agglomeration) method to be used. The method should be (an unambiguous abbreviation of) one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC).

We begin with the average linkage method:

Hierar_cl_a <- hclust(distance_mat, method = "average")
Hierar_cl_a
## 
## Call:
## hclust(d = distance_mat, method = "average")
## 
## Cluster method   : average 
## Distance         : euclidean 
## Number of objects: 32

We can then plot the cluster dendrogram. Adding a horizontal line at h = 110, we can see three distinct clusters and an outlier. We add three rectangles to outline the three clusters:

plot(Hierar_cl_a)
abline(h = 110, col = "green")
rect.hclust(Hierar_cl_a, k = 3, border = "red")

The function ‘cutree’ cuts a tree (as produced by ‘hclust’) into several groups, either by specifying the desired number(s) of groups or the cut height(s). Its arguments are: tree, k (the number of groups) and h (the height at which the tree should be cut).

In this case we could have specified h = 160 rather than k = 3 and obtained the same three groups.

fit_a <- cutree(Hierar_cl_a, k = 3)
fit_a
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##                   1                   1                   1                   2 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##                   2                   2                   2                   1 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##                   1                   1                   1                   2 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##                   2                   2                   2                   2 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##                   2                   1                   1                   1 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##                   1                   2                   2                   2 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##                   2                   1                   1                   1 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##                   2                   1                   3                   1

We can see the cluster number for each row of data. NB Maserati Bora is on its own in cluster number 3. It’s easier to see the frequencies in each cluster using ‘table’:

table(fit_a)
## fit_a
##  1  2  3 
## 16 15  1
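As an aside, we can verify the earlier claim that cutting at h = 160 reproduces the k = 3 grouping (a minimal check; TRUE is expected but not shown here):

fit_h <- cutree(Hierar_cl_a, h = 160)  # cut by height instead of group count
all(fit_h == fit_a)                    # expected TRUE if the two cuts agree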

Now we repeat the clustering with the single linkage method (the distance between two clusters is the minimum distance between any datapoint in one cluster and any datapoint in the other):

Hierar_cl_s <- hclust(distance_mat, method = "single")
Hierar_cl_s
## 
## Call:
## hclust(d = distance_mat, method = "single")
## 
## Cluster method   : single 
## Distance         : euclidean 
## Number of objects: 32
plot(Hierar_cl_s)
abline(h = 70, col = "green")

fit_s <- cutree(Hierar_cl_s, k = 3 )
fit_s
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##                   1                   1                   1                   1 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##                   1                   1                   2                   1 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##                   1                   1                   1                   1 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##                   1                   1                   1                   1 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##                   1                   1                   1                   1 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##                   1                   1                   1                   2 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##                   1                   1                   1                   1 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##                   2                   1                   3                   1
table(fit_s)
## fit_s
##  1  2  3 
## 28  3  1
rect.hclust(Hierar_cl_s, k = 3, border = "red")

NB Maserati Bora is again on its own in cluster number 3. This time there is a small cluster of 3 vehicles and one large cluster containing the remaining 28 datapoints.
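Out of interest, we can list the members of that small cluster directly (from the output above, these should be Duster 360, Camaro Z28 and Ford Pantera L):

names(fit_s)[fit_s == 2]  # vehicles assigned to cluster 2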

Now we repeat the clustering with the complete linkage method (the distance between two clusters is the maximum distance between any datapoint in one cluster and any datapoint in the other):

Hierar_cl_c <- hclust(distance_mat, method = "complete")
Hierar_cl_c
## 
## Call:
## hclust(d = distance_mat, method = "complete")
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 32
plot(Hierar_cl_c)
abline(h = 230, col = "green")

fit_c <- cutree(Hierar_cl_c, k = 3 )
fit_c
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##                   1                   1                   1                   2 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##                   3                   2                   3                   1 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##                   1                   1                   1                   2 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##                   2                   2                   3                   3 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##                   3                   1                   1                   1 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##                   1                   2                   2                   3 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##                   3                   1                   1                   1 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##                   3                   1                   3                   1
table(fit_c)
## fit_c
##  1  2  3 
## 16  7  9
rect.hclust(Hierar_cl_c, k = 3, border = "red")

The results of ‘complete’ seem closer to those of ‘average’ than to ‘single’, and its clusters have more balanced frequencies (16, 7 and 9).

Finally we repeat the clustering with Ward’s linkage method (ward.D2, which at each step merges the pair of clusters giving the smallest increase in total within-cluster variance):

Hierar_cl_w <- hclust(distance_mat, method = "ward.D2")
Hierar_cl_w
## 
## Call:
## hclust(d = distance_mat, method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 32
plot(Hierar_cl_w)
abline(h = 270, col = "green")

fit_w <- cutree(Hierar_cl_w, k = 3 )
fit_w
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##                   1                   1                   1                   2 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##                   3                   2                   3                   1 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##                   1                   1                   1                   2 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##                   2                   2                   3                   3 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##                   3                   1                   1                   1 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##                   1                   2                   2                   3 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##                   3                   1                   1                   1 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##                   3                   1                   3                   1
table(fit_w)
## fit_w
##  1  2  3 
## 16  7  9
rect.hclust(Hierar_cl_w, k = 3, border = "red")

In this case the heights at which the final clusters merge seem greater, making each cluster more distinct from the others in the dendrogram.
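If we wanted to quantify how strongly structured each tree is, one option (assuming the ‘cluster’ package is installed) is the agglomerative coefficient, where values closer to 1 indicate a more clearly structured tree:

library(cluster)
# coef.hclust() computes the agglomerative coefficient of an hclust tree
# (closer to 1 = stronger clustering structure).
coef.hclust(Hierar_cl_a)
coef.hclust(Hierar_cl_s)
coef.hclust(Hierar_cl_c)
coef.hclust(Hierar_cl_w)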

Comparison tables of pairs of methods

table(fit_a,fit_s)
##      fit_s
## fit_a  1  2  3
##     1 16  0  0
##     2 12  3  0
##     3  0  0  1
table(fit_a,fit_c)
##      fit_c
## fit_a  1  2  3
##     1 16  0  0
##     2  0  7  8
##     3  0  0  1
table(fit_w,fit_c)
##      fit_c
## fit_w  1  2  3
##     1 16  0  0
##     2  0  7  0
##     3  0  0  9

All four methods keep the same 16 datapoints together: the 16-point cluster 1 is identical under average, complete and Ward, and is wholly contained in single’s large cluster 1.

Complete and Ward give identical solutions: the cross-table between them is diagonal, so every datapoint receives the same cluster label under both methods.
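We can confirm this identity with a one-line check (TRUE is expected, given the diagonal table above):

all(fit_c == fit_w)  # same cluster label for every car under both methods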

Single forms one huge cluster (28 points), one small cluster (3 points) and an outlier, whereas average splits the large cluster into two more equally sized clusters (16 and 15 points).

Complete forms a large 16-point cluster plus two similarly sized clusters (7 and 9 points), whereas average forms two large clusters of similar size (16 and 15 points) plus one outlier.

Given the variety of solutions, we can conclude that the choice of linkage method is very important.

The ‘appropriate’ linkage method will depend on the presence and position of outliers, the type of solution required, and so on.