The given dataset is the famous Iris dataset, which comprises measurements of 150 iris flowers from three different species: Setosa, Versicolor, and Virginica. Each sample consists of four features: sepal length, sepal width, petal length, and petal width. Additionally, each sample is labeled with its corresponding species (i.e. Setosa, Versicolor, or Virginica). This dataset was made famous by R. Fisher in his 1936 paper, “The use of multiple measurements in taxonomic problems”.
The prime objective here is to perform a cluster analysis on the given dataset using hierarchical and non-hierarchical clustering methods.
Since it is already known that the dataset involves 3 types of iris flowers, the number of clusters is fixed at 3 for all hierarchical methods. For the non-hierarchical methods, on the other hand, the optimum value of k, the number of clusters, is determined from the data.
Agglomerative hierarchical clustering initially starts with as many clusters as there are objects. The most similar objects are grouped first, and these initial groups are merged according to their similarities. Eventually, as the similarity decreases, all subgroups are fused into a single cluster.
Single Linkage: Since single linkage joins clusters by the shortest link between them, the technique cannot discern poorly separated clusters. On the other hand, it is one of the few clustering methods that can delineate nonellipsoidal clusters. The tendency of single linkage to pick out long, string-like clusters is known as chaining. Chaining can be misleading if items at opposite ends of the chain are, in fact, quite dissimilar.
Complete Linkage: It is quite similar to single linkage, with one important exception: at each stage, the concept of nearest neighbor is replaced by that of farthest neighbor.
Average Linkage: It treats the distance between two clusters as the average distance between all pairs of items where one member of a pair belongs to each cluster.
Ward’s Linkage: It is based on minimizing the “loss of information” from joining two groups. This method is usually implemented with the loss of information taken to be an increase in an error sum of squares criterion.
Ward’s method is based on the notion that the clusters of multivariate observations are expected to be roughly elliptically shaped. It is a hierarchical precursor to nonhierarchical clustering methods that optimize some criterion for dividing data into a given number of elliptical groups.
McQuitty’s Linkage (WPGMA): When two clusters are merged, the distance from the new cluster to any other cluster is the simple average of the distances from the two merged clusters to that cluster.
Median Linkage: The distance between two clusters is the distance between their medians; when two clusters merge, the median of the new cluster is the midpoint of the medians of the merged clusters.
Centroid Linkage: The distance between two clusters is the distance between their centroids (mean vectors).
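The table below reports the sizes of the three clusters obtained for every combination of distance metric and linkage method. The following is a minimal sketch of how such a table can be produced (assuming, as in the k-means code later in this report, that the measurements are in a data frame `df` whose fifth column holds the species label):

dists <- c("euclidean", "maximum", "manhattan", "canberra", "minkowski")
links <- c("single", "complete", "average", "ward.D2", "mcquitty", "median", "centroid")
sizes <- matrix(NA, nrow = length(dists) * length(links), ncol = 3)
rnames <- character(nrow(sizes))
idx <- 1
for (d in dists) {
  D <- dist(df[, -5], method = d)               # distance matrix on the 4 features
  for (l in links) {
    cl <- cutree(hclust(D, method = l), k = 3)  # agglomerative tree, cut into 3 clusters
    sizes[idx, ] <- as.numeric(table(factor(cl, levels = 1:3)))
    rnames[idx] <- paste(d, "distance and", l, "linkage")
    idx <- idx + 1
  }
}
rownames(sizes) <- rnames
sizes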
## [,1] [,2] [,3]
## euclidean distance and single linkage 50 98 2
## euclidean distance and complete linkage 50 72 28
## euclidean distance and average linkage 50 64 36
## euclidean distance and ward.D2 linkage 50 64 36
## euclidean distance and mcquitty linkage 50 65 35
## euclidean distance and median linkage 50 63 37
## euclidean distance and centroid linkage 50 98 2
## maximum distance and single linkage 50 98 2
## maximum distance and complete linkage 50 65 35
## maximum distance and average linkage 50 90 10
## maximum distance and ward.D2 linkage 50 64 36
## maximum distance and mcquitty linkage 50 65 35
## maximum distance and median linkage 50 69 31
## maximum distance and centroid linkage 50 96 4
## manhattan distance and single linkage 50 99 1
## manhattan distance and complete linkage 50 66 34
## manhattan distance and average linkage 50 63 37
## manhattan distance and ward.D2 linkage 50 65 35
## manhattan distance and mcquitty linkage 50 51 49
## manhattan distance and median linkage 50 76 24
## manhattan distance and centroid linkage 50 98 2
## canberra distance and single linkage 44 6 100
## canberra distance and complete linkage 50 52 48
## canberra distance and average linkage 44 6 100
## canberra distance and ward.D2 linkage 50 52 48
## canberra distance and mcquitty linkage 44 6 100
## canberra distance and median linkage 44 6 100
## canberra distance and centroid linkage 44 6 100
## minkowski distance and single linkage 50 98 2
## minkowski distance and complete linkage 50 72 28
## minkowski distance and average linkage 50 64 36
## minkowski distance and ward.D2 linkage 50 64 36
## minkowski distance and mcquitty linkage 50 65 35
## minkowski distance and median linkage 50 63 37
## minkowski distance and centroid linkage 50 98 2
As the table above shows, most of the methods were not able to form three clusters of a reasonable size of about 50 each. This clearly suggests that these methods are not suitable when the groups are closely entangled. It should, however, be mentioned that for most of the methods the first cluster has size 50 and contains only the group “Iris-setosa”. This may be due to the distribution of sepal widths among the different classes.
We now examine the clustering methods for which three clusters of sizes close to 50 each are formed.
## [,1] [,2] [,3]
## manhattan distance and mcquitty linkage 50 51 49
## canberra distance and complete linkage 50 52 48
## canberra distance and ward.D2 linkage 50 52 48
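The confusion matrices below cross-tabulate the cluster labels against the true species for each of these three short-listed methods. A minimal sketch for one of them (same assumption about `df` as above):

# Cross-tabulate cluster membership against the true species
cl <- cutree(hclust(dist(df[, -5], method = "canberra"), method = "complete"), k = 3)
table(cl, df[, 5])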
## $`manhattan distance and mcquitty linkage`
## Iris-setosa Iris-versicolor Iris-virginica
## I 50 0 0
## II 0 47 4
## III 0 3 46
##
## $`canberra distance and complete linkage`
## Iris-setosa Iris-versicolor Iris-virginica
## I 50 0 0
## II 0 48 4
## III 0 2 46
##
## $`canberra distance and ward.D2 linkage`
## Iris-setosa Iris-versicolor Iris-virginica
## I 50 0 0
## II 0 48 4
## III 0 2 46
It appears that the three clustering methods, i.e. Manhattan distance with McQuitty linkage, Canberra distance with complete linkage, and Canberra distance with ward.D2 linkage, have almost the same number of misclassifications (7, 6, and 6 observations, respectively).
Divisive hierarchical clustering works in the opposite direction. Initially the group of objects is divided into two subgroups such that the objects in one subgroup are “far from” the objects in the other. These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects, that is, until each object forms its own group.
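The cluster-size table below repeats the exercise with divisive clustering for each distance metric. A minimal sketch using diana() from the cluster package (passing a precomputed distance matrix is an assumption about how the non-Euclidean metrics were handled, since diana() itself only supports the Euclidean and Manhattan metrics directly):

library(cluster)
# Divisive clustering on a precomputed Canberra distance matrix
dv <- diana(dist(df[, -5], method = "canberra"))
cl <- cutree(as.hclust(dv), k = 3)  # convert to hclust and cut into 3 clusters
table(cl)                           # cluster sizes
table(cl, df[, 5])                  # cross-tabulation against species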
## [,1] [,2] [,3]
## euclidean 53 60 37
## maximum 51 39 60
## manhattan 54 60 36
## canberra 50 49 51
## minkowski 53 60 37
It appears that the Canberra distance works well here, as the cluster sizes (50, 49, 51) are in tune with the true group sizes. However, the misclassification (9 observations) is higher than for the best three agglomerative methods.
## Iris-setosa Iris-versicolor Iris-virginica
## I 50 0 0
## II 0 45 4
## III 0 5 46
The k-means method involves selecting k nuclei around which k initial clusters are formed. Each of the remaining n−k individuals is assigned to the cluster whose nucleus is closest to it. When no further reallocation is needed, the k clusters constitute the final partitioning of the n individuals.
The main problem in this method is deciding on an optimum value of k. This is usually done by drawing a scree plot of the within-cluster variability against different values of k; the value of k at which an elbow appears is taken as the optimum number of clusters.
max_k <- 10
wss <- numeric(max_k)  # total within-cluster sum of squares for each k
for (i in 1:max_k) {
  km.out <- kmeans(df[, -5], centers = i)  # k-means on the four measurements
  wss[i] <- km.out$tot.withinss
}
plot(1:max_k, wss, xlab = "No. of clusters", ylab = "Within Sum of Squares",
     main = "Scree Plot", type = "l")
Based on the scree plot, the value of k is taken to be 2 and 3.
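The two fits printed below can be reproduced with calls of the following form (the seed is an assumption; k-means uses random starts, so the exact cluster numbering may differ between runs):

set.seed(1)  # assumed seed for reproducibility
km2 <- kmeans(df[, -5], centers = 2)
km3 <- kmeans(df[, -5], centers = 3)
km2  # printed output shown below
km3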
## K-means clustering with 2 clusters of sizes 97, 53
##
## Cluster means:
## sepal.length sepal.width petal.length petal.width
## 1 6.301031 2.886598 4.958763 1.6958763
## 2 5.005660 3.360377 1.562264 0.2886792
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1
##
## Within cluster sum of squares by cluster:
## [1] 123.79588 28.57283
## (between_SS / total_SS = 77.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
## K-means clustering with 3 clusters of sizes 38, 62, 50
##
## Cluster means:
## sepal.length sepal.width petal.length petal.width
## 1 6.850000 3.073684 5.742105 2.071053
## 2 5.901613 2.748387 4.393548 1.433871
## 3 5.006000 3.418000 1.464000 0.244000
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1
## [112] 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1
## [149] 1 2
##
## Within cluster sum of squares by cluster:
## [1] 23.87947 39.82097 15.24040
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
For k = 2, two clusters of sizes 97 (red) and 53 (blue) are formed. The blue cluster contains all 50 elements from setosa and only 3 elements from versicolor, whereas the red cluster contains all of virginica and the remaining versicolor.
For k = 3, three clusters of sizes 50 (red), 38 (green) and 62 (blue) are formed. The misclassification in both cases, k = 2 and k = 3, is quite high.
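The misclassification can be quantified by cross-tabulating the k-means labels against the true species (using the km2 and km3 objects from the sketch above):

table(km2$cluster, df[, 5])  # k = 2: clusters vs. true species
table(km3$cluster, df[, 5])  # k = 3: clusters vs. true species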
Hence, overall, the three agglomerative methods, i.e. Manhattan distance with McQuitty linkage, Canberra distance with complete linkage, and Canberra distance with ward.D2 linkage, perform best, with almost the same (and the lowest) number of misclassifications.