We are going to identify clusters in the ‘mtcars’ dataset using hierarchical clustering and experiment with different bond (linkage) methods. The aim is to understand how the choice of bond method affects the solution(s).
We begin by viewing the first 6 rows of the ‘mtcars’ dataset:
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Using the ‘dist’ function with method = 'euclidean', we calculate the matrix of pairwise distances between all of the datapoints:
distance_mat <- dist(mtcars, method = 'euclidean')
If we were to examine the entire matrix, we would see that Mazda RX4 and Mazda RX4 Wag are very close together, whereas Mazda RX4 and Duster 360 are very far apart (judging by the values visible at the start of the matrix, most distances seem to lie between 0 and 200).
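One way to check individual distances without printing the whole matrix is to convert the ‘dist’ object to an ordinary matrix and index it by car name. This is a minimal sketch; the name ‘dist_full’ is our own and not part of the original analysis.

```r
# Recompute the distances so the snippet runs on its own
distance_mat <- dist(mtcars, method = "euclidean")

# as.matrix() expands the lower-triangle 'dist' object into a full
# 32 x 32 matrix with car names as row and column names
dist_full <- as.matrix(distance_mat)

# Look up the distance between two specific cars by name
dist_full["Mazda RX4", "Mazda RX4 Wag"]
dist_full["Mazda RX4", "Duster 360"]
```

The two Mazdas differ only in wt and qsec in the rows shown by head, which is why their distance is so small.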
Out of curiosity, let’s see the spread of distances:
round(summary(distance_mat),1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.6    75.8   156.7   169.3   248.7   425.3
The ‘hclust’ function takes two main arguments: d, a distance matrix (a dissimilarity structure as produced by dist), and method, the bond (agglomeration) method to be used. The method should be (an unambiguous abbreviation of) one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC).
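To make the difference between the methods concrete, the distance between two groups of cars can be computed by hand for three of them. The sketch below is illustrative only: the two groups (‘group_a’ and ‘group_b’) are arbitrary choices of ours, not clusters found by hclust.

```r
dist_full <- as.matrix(dist(mtcars, method = "euclidean"))

# Two hand-picked groups of cars (an arbitrary illustration)
group_a <- c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710")
group_b <- c("Duster 360", "Camaro Z28")

# All pairwise distances between a car in group_a and a car in group_b
between <- dist_full[group_a, group_b]

linkage <- c(
  single   = min(between),   # closest pair across the two groups
  complete = max(between),   # furthest pair across the two groups
  average  = mean(between)   # mean over all cross-group pairs (UPGMA)
)
round(linkage, 1)
```

By construction the single distance can never exceed the average distance, which in turn can never exceed the complete distance.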
We begin with the average bond method:
Hierar_cl_a <- hclust(distance_mat, method = "average")
Hierar_cl_a
##
## Call:
## hclust(d = distance_mat, method = "average")
##
## Cluster method : average
## Distance : euclidean
## Number of objects: 32
We can then plot the cluster dendrogram. Adding a horizontal line at h = 110, we can see three distinct clusters and an outlier. We add 3 rectangles to outline 3 clusters:
plot(Hierar_cl_a)
abline(h = 110, col = "green")
rect.hclust(Hierar_cl_a, k = 3, border = "red")
The function ‘cutree’ cuts a tree (resulting from hclust) into several groups, either by specifying the desired number(s) of groups or the cut height(s). Its arguments are: tree, k (the number of groups) and h (the height at which the tree should be cut).
In this case we could have used h = 160 rather than k = 3.
fit_a <- cutree(Hierar_cl_a, k = 3)
fit_a
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
##                   1                   1                   1                   2
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D
##                   2                   2                   2                   1
##            Merc 230            Merc 280           Merc 280C          Merc 450SE
##                   1                   1                   1                   2
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
##                   2                   2                   2                   2
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
##                   2                   1                   1                   1
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
##                   1                   2                   2                   2
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
##                   2                   1                   1                   1
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
##                   2                   1                   3                   1
This shows the cluster number for each row of data. Note that Maserati Bora is on its own in cluster 3. The frequencies in each cluster are easier to see using ‘table’:
table(fit_a)
## fit_a
## 1 2 3
## 16 15 1
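The relationship between the k and h arguments of ‘cutree’ can be checked from the merge heights stored in the tree. The sketch below picks a cut height halfway between the two merges that bracket a 3-cluster solution; ‘top_heights’, ‘h_mid’, ‘by_k’ and ‘by_h’ are names introduced here for illustration. It assumes the merge heights never decrease from one merge to the next, which holds for the average method.

```r
Hierar_cl_a <- hclust(dist(mtcars, method = "euclidean"), method = "average")

# Merge heights, largest first: the 1st merge joins the final 2 clusters,
# the 2nd reduces 3 clusters to 2, the 3rd reduces 4 clusters to 3
top_heights <- rev(Hierar_cl_a$height)[1:3]
top_heights

# Any height between the 3rd- and 2nd-largest merges leaves 3 clusters
h_mid <- mean(top_heights[2:3])

by_k <- cutree(Hierar_cl_a, k = 3)
by_h <- cutree(Hierar_cl_a, h = h_mid)

# Cross-tabulate: all counts fall on the diagonal if the two cuts agree
table(by_k, by_h)
```

Printing ‘top_heights’ also shows the full range of h values that would reproduce the k = 3 solution.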
Now we repeat the clustering with the single bond method, where the distance between two clusters is the minimum distance between any datapoint in one cluster and any datapoint in the other:
Hierar_cl_s <- hclust(distance_mat, method = "single")
Hierar_cl_s
##
## Call:
## hclust(d = distance_mat, method = "single")
##
## Cluster method : single
## Distance : euclidean
## Number of objects: 32
plot(Hierar_cl_s)
abline(h = 70, col = "green")
fit_s <- cutree(Hierar_cl_s, k = 3 )
fit_s
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
##                   1                   1                   1                   1
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D
##                   1                   1                   2                   1
##            Merc 230            Merc 280           Merc 280C          Merc 450SE
##                   1                   1                   1                   1
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
##                   1                   1                   1                   1
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
##                   1                   1                   1                   1
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
##                   1                   1                   1                   2
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
##                   1                   1                   1                   1
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
##                   2                   1                   3                   1
table(fit_s)
## fit_s
## 1 2 3
## 28 3 1
rect.hclust(Hierar_cl_s, k = 3, border = "red")
Note that Maserati Bora is on its own in cluster 3 again. This time, there is a small cluster of 3 vehicles and a large cluster containing the remaining 28 datapoints.
Now we repeat the clustering with the complete bond method, where the distance between two clusters is the maximum distance between any datapoint in one cluster and any datapoint in the other:
Hierar_cl_c <- hclust(distance_mat, method = "complete")
Hierar_cl_c
##
## Call:
## hclust(d = distance_mat, method = "complete")
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 32
plot(Hierar_cl_c)
abline(h = 230, col = "green")
fit_c <- cutree(Hierar_cl_c, k = 3 )
fit_c
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
##                   1                   1                   1                   2
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D
##                   3                   2                   3                   1
##            Merc 230            Merc 280           Merc 280C          Merc 450SE
##                   1                   1                   1                   2
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
##                   2                   2                   3                   3
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
##                   3                   1                   1                   1
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
##                   1                   2                   2                   3
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
##                   3                   1                   1                   1
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
##                   3                   1                   3                   1
table(fit_c)
## fit_c
## 1 2 3
## 16 7 9
rect.hclust(Hierar_cl_c, k = 3, border = "red")
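The results of ‘complete’ are closer to those of ‘average’ than to those of ‘single’: cluster 1 again contains 16 datapoints, but the remaining 16 are split into clusters of 7 and 9 rather than a cluster of 15 and an outlier.
Finally we repeat the clustering with the Ward bond method, which at each step merges the pair of clusters that gives the smallest increase in total within-cluster variance: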
Hierar_cl_w <- hclust(distance_mat, method = "ward.D2")
Hierar_cl_w
##
## Call:
## hclust(d = distance_mat, method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 32
plot(Hierar_cl_w)
abline(h = 270, col = "green")
fit_w <- cutree(Hierar_cl_w, k = 3 )
fit_w
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
##                   1                   1                   1                   2
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D
##                   3                   2                   3                   1
##            Merc 230            Merc 280           Merc 280C          Merc 450SE
##                   1                   1                   1                   2
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
##                   2                   2                   3                   3
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla
##                   3                   1                   1                   1
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28
##                   1                   2                   2                   3
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa
##                   3                   1                   1                   1
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E
##                   3                   1                   3                   1
table(fit_w)
## fit_w
## 1 2 3
## 16 7 9
rect.hclust(Hierar_cl_w, k = 3, border = "red")
In this case the heights at which the clusters merge seem greater, making each cluster appear more distinct from the others in the dendrogram.
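The quantity that Ward's method tries to keep small can be computed directly for any set of cluster labels. The sketch below uses a helper of our own, ‘within_ss’ (not a base R function), to total the within-cluster sum of squares for the k = 3 solution of each of the four methods used here.

```r
# Total within-cluster sum of squares: for each cluster, sum the squared
# deviations of every variable from that cluster's mean, then add them up
within_ss <- function(data, labels) {
  sum(sapply(split(data, labels), function(cluster) {
    sum(scale(cluster, center = TRUE, scale = FALSE)^2)
  }))
}

distance_mat <- dist(mtcars, method = "euclidean")
methods <- c("average", "single", "complete", "ward.D2")

# Cut each tree into 3 clusters and score the resulting labels
sapply(methods, function(m) {
  labels <- cutree(hclust(distance_mat, method = m), k = 3)
  within_ss(mtcars, labels)
})
```

Lower values mean tighter clusters. Ward's method is greedy, so it is not guaranteed to give the lowest value of the four, but keeping the growth of this quantity small is what drives each of its merges.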
We can compare pairs of methods by cross-tabulating their cluster assignments:
table(fit_a,fit_s)
##      fit_s
## fit_a  1  2  3
##     1 16  0  0
##     2 12  3  0
##     3  0  0  1
table(fit_a,fit_c)
##      fit_c
## fit_a  1  2  3
##     1 16  0  0
##     2  0  7  8
##     3  0  0  1
table(fit_w,fit_c)
##      fit_c
## fit_w  1  2  3
##     1 16  0  0
##     2  0  7  0
##     3  0  0  9
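Cluster numbers are arbitrary labels, so two solutions can agree even when the matching clusters carry different numbers. One label-independent check, offered here as an illustrative sketch rather than part of the original analysis, is the proportion of pairs of datapoints that two solutions treat the same way (the Rand index). ‘pair_agreement’ and ‘fit’ are names of our own.

```r
# Proportion of pairs of datapoints that two labelings treat the same way:
# either together in both solutions or apart in both (the Rand index)
pair_agreement <- function(a, b) {
  same_a <- outer(a, a, "==")   # TRUE where a pair shares a cluster in a
  same_b <- outer(b, b, "==")   # likewise for b
  pairs  <- upper.tri(same_a)   # count each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

distance_mat <- dist(mtcars, method = "euclidean")
fit <- lapply(
  c(average = "average", single = "single",
    complete = "complete", ward = "ward.D2"),
  function(m) cutree(hclust(distance_mat, method = m), k = 3)
)

pair_agreement(fit$ward, fit$complete)
pair_agreement(fit$average, fit$single)
```

A value of 1 means the two solutions partition the data identically, whatever the cluster numbers.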
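The average, complete and Ward methods all place the same 16 datapoints in cluster 1; single merges these 16 with 12 others to form a cluster of 28.
Complete and Ward give identical solutions: every datapoint receives the same cluster number under both methods, so the assignments match and not just the frequencies.
Single forms one huge cluster, a small cluster and an outlier, whereas average divides the datapoints into two clusters of similar size (16 and 15) and an outlier.
Complete forms a large 16-point cluster and two smaller clusters of similar size (7 and 9), whereas average forms two clusters of similar size and one outlier.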
Given the variety of solutions, we can conclude that the bond method is very important.
The ‘appropriate’ bond method will depend on the presence and position of outliers, the type of solution required, etc.