The given dataset is the famous Iris dataset, which comprises measurements of 150 iris flowers from three different species: Setosa, Versicolor, and Virginica. Each sample consists of four features: sepal length, sepal width, petal length, and petal width. Additionally, each sample is labeled with its corresponding species (i.e. Setosa, Versicolor, or Virginica). This dataset was made famous by R. Fisher in his 1936 paper, “The use of multiple measurements in taxonomic problems”.
The prime objective here is to perform a cluster analysis on the given dataset using hierarchical and non-hierarchical clustering methods.
Since it is already known that the dataset involves 3 types of iris flowers, the number of clusters is fixed at 3 for all hierarchical methods. For the non-hierarchical methods, on the other hand, the optimum value of k, the number of clusters, is determined from the data.
Agglomerative hierarchical clustering initially starts with as many clusters as there are objects. The most similar objects are grouped first, and these initial groups are merged according to their similarities. Eventually, as the similarity decreases, all subgroups are fused into a single cluster.
Single Linkage: Since single linkage joins clusters by the shortest link between them, the technique cannot discern poorly separated clusters. On the other hand, it is one of the few clustering methods that can delineate nonellipsoidal clusters. The tendency of single linkage to pick out long, string-like clusters is known as chaining. Chaining can be misleading if items at opposite ends of the chain are, in fact, quite dissimilar.
Complete Linkage: It is quite similar to single linkage, with one important exception: at each stage, the concept of nearest neighbor is replaced by that of farthest neighbor.
Average Linkage: It treats the distance between two clusters as the average distance between all pairs of items where one member of a pair belongs to each cluster.
Ward’s Linkage: It is based on minimizing the “loss of information” from joining two groups. This method is usually implemented with the loss of information taken to be an increase in an error sum of squares criterion.
Ward’s method is based on the notion that the clusters of multivariate observations are expected to be roughly elliptically shaped. It is a hierarchical precursor to nonhierarchical clustering methods that optimize some criterion for dividing data into a given number of elliptical groups.
McQuitty’s Linkage (WPGMA): When two clusters are merged, the distance from the new cluster to any other cluster is the simple average of the distances from the two merged clusters to that cluster.
Median Linkage: The distance between two clusters is the distance between their medians; when two clusters merge, the median of the new cluster is the midpoint of the medians of the merged clusters.
Centroid Linkage: The distance between two clusters is the distance between their centroids (mean vectors).
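The table below reports the sizes of the three clusters obtained for every combination of distance metric and linkage method. The following is a minimal sketch of how such a table can be produced (assuming, as in the k-means code later in this report, that the measurements are in a data frame `df` whose fifth column holds the species label):

dists <- c("euclidean", "maximum", "manhattan", "canberra", "minkowski")
links <- c("single", "complete", "average", "ward.D2", "mcquitty", "median", "centroid")
sizes <- matrix(NA, nrow = length(dists) * length(links), ncol = 3)
rnames <- character(nrow(sizes))
idx <- 1
for (d in dists) {
  D <- dist(df[, -5], method = d)               # distance matrix on the 4 features
  for (l in links) {
    cl <- cutree(hclust(D, method = l), k = 3)  # agglomerative tree, cut into 3 clusters
    sizes[idx, ] <- as.numeric(table(factor(cl, levels = 1:3)))
    rnames[idx] <- paste(d, "distance and", l, "linkage")
    idx <- idx + 1
  }
}
rownames(sizes) <- rnames
sizes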
## [,1] [,2] [,3]
## euclidean distance and single linkage 50 98 2
## euclidean distance and complete linkage 50 72 28
## euclidean distance and average linkage 50 64 36
## euclidean distance and ward.D2 linkage 50 64 36
## euclidean distance and mcquitty linkage 50 65 35
## euclidean distance and median linkage 50 63 37
## euclidean distance and centroid linkage 50 98 2
## maximum distance and single linkage 50 98 2
## maximum distance and complete linkage 50 65 35
## maximum distance and average linkage 50 90 10
## maximum distance and ward.D2 linkage 50 64 36
## maximum distance and mcquitty linkage 50 65 35
## maximum distance and median linkage 50 69 31
## maximum distance and centroid linkage 50 96 4
## manhattan distance and single linkage 50 99 1
## manhattan distance and complete linkage 50 66 34
## manhattan distance and average linkage 50 63 37
## manhattan distance and ward.D2 linkage 50 65 35
## manhattan distance and mcquitty linkage 50 51 49
## manhattan distance and median linkage 50 76 24
## manhattan distance and centroid linkage 50 98 2
## canberra distance and single linkage 44 6 100
## canberra distance and complete linkage 50 52 48
## canberra distance and average linkage 44 6 100
## canberra distance and ward.D2 linkage 50 52 48
## canberra distance and mcquitty linkage 44 6 100
## canberra distance and median linkage 44 6 100
## canberra distance and centroid linkage 44 6 100
## minkowski distance and single linkage 50 98 2
## minkowski distance and complete linkage 50 72 28
## minkowski distance and average linkage 50 64 36
## minkowski distance and ward.D2 linkage 50 64 36
## minkowski distance and mcquitty linkage 50 65 35
## minkowski distance and median linkage 50 63 37
## minkowski distance and centroid linkage 50 98 2
As the table above shows, most of the methods were not able to form three clusters of a reasonable size of about 50 each. This clearly suggests that these methods are not suitable when the groups are closely entangled. It should, however, be mentioned that for most of the methods the first cluster has size 50 and contains only the group “Iris-setosa”. This may be due to the distribution of sepal widths among the different classes.
We now examine the clustering methods for which three clusters of sizes close to 50 each are formed.
## [,1] [,2] [,3]
## manhattan distance and mcquitty linkage 50 51 49
## canberra distance and complete linkage 50 52 48
## canberra distance and ward.D2 linkage 50 52 48
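The confusion matrices below cross-tabulate the cluster labels against the true species for each of these three short-listed methods. A minimal sketch for one of them (same assumption about `df` as above):

# Cross-tabulate cluster membership against the true species
cl <- cutree(hclust(dist(df[, -5], method = "canberra"), method = "complete"), k = 3)
table(cl, df[, 5])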
## $`manhattan distance and mcquitty linkage`
## Iris-setosa Iris-versicolor Iris-virginica
## I 50 0 0
## II 0 47 4
## III 0 3 46
##
## $`canberra distance and complete linkage`
## Iris-setosa Iris-versicolor Iris-virginica
## I 50 0 0
## II 0 48 4
## III 0 2 46
##
## $`canberra distance and ward.D2 linkage`
## Iris-setosa Iris-versicolor Iris-virginica
## I 50 0 0
## II 0 48 4
## III 0 2 46
It appears that the three clustering methods, i.e. Manhattan distance with McQuitty linkage, Canberra distance with complete linkage, and Canberra distance with ward.D2 linkage, have almost the same number of misclassifications (7, 6, and 6 observations, respectively).
Divisive hierarchical clustering works in the opposite direction. Initially the group of objects is divided into two subgroups such that the objects in one subgroup are “far from” the objects in the other. These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects, that is, until each object forms its own group.
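The cluster-size table below repeats the exercise with divisive clustering for each distance metric. A minimal sketch using diana() from the cluster package (passing a precomputed distance matrix is an assumption about how the non-Euclidean metrics were handled, since diana() itself only supports the Euclidean and Manhattan metrics directly):

library(cluster)
# Divisive clustering on a precomputed Canberra distance matrix
dv <- diana(dist(df[, -5], method = "canberra"))
cl <- cutree(as.hclust(dv), k = 3)  # convert to hclust and cut into 3 clusters
table(cl)                           # cluster sizes
table(cl, df[, 5])                  # cross-tabulation against species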
## [,1] [,2] [,3]
## euclidean 53 60 37
## maximum 51 39 60
## manhattan 54 60 36
## canberra 50 49 51
## minkowski 53 60 37
It appears that the Canberra distance works well here, as the cluster sizes (50, 49, 51) are in tune with the true group sizes. However, the misclassification (9 observations) is higher than for the best three agglomerative methods.
## Iris-setosa Iris-versicolor Iris-virginica
## I 50 0 0
## II 0 45 4
## III 0 5 46
The k-means method involves selecting k nuclei around which k initial clusters are formed. Each of the remaining n−k individuals is assigned to the cluster whose nucleus is closest to it. When no further reallocation is needed, the k clusters constitute the final partitioning of the n individuals.
The main problem in this method is deciding on an optimum value of k. This is usually done by drawing a scree plot of the within-cluster variability against different values of k; the value of k at which an elbow appears is taken as the optimum number of clusters.
max_k <- 10
wss <- numeric(max_k)  # total within-cluster sum of squares for each k
for (i in 1:max_k) {
  km.out <- kmeans(df[, -5], centers = i)  # k-means on the four measurements
  wss[i] <- km.out$tot.withinss
}
plot(1:max_k, wss, xlab = "No. of clusters", ylab = "Within Sum of Squares",
     main = "Scree Plot", type = "l")
Based on the scree plot, the value of k is taken to be 2 and 3.
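The two fits printed below can be reproduced with calls of the following form (the seed is an assumption; k-means uses random starts, so the exact cluster numbering may differ between runs):

set.seed(1)  # assumed seed for reproducibility
km2 <- kmeans(df[, -5], centers = 2)
km3 <- kmeans(df[, -5], centers = 3)
km2  # printed output shown below
km3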
## K-means clustering with 2 clusters of sizes 97, 53
##
## Cluster means:
## sepal.length sepal.width petal.length petal.width
## 1 6.301031 2.886598 4.958763 1.6958763
## 2 5.005660 3.360377 1.562264 0.2886792
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1
##
## Within cluster sum of squares by cluster:
## [1] 123.79588 28.57283
## (between_SS / total_SS = 77.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
## K-means clustering with 3 clusters of sizes 38, 62, 50
##
## Cluster means:
## sepal.length sepal.width petal.length petal.width
## 1 6.850000 3.073684 5.742105 2.071053
## 2 5.901613 2.748387 4.393548 1.433871
## 3 5.006000 3.418000 1.464000 0.244000
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1
## [112] 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1
## [149] 1 2
##
## Within cluster sum of squares by cluster:
## [1] 23.87947 39.82097 15.24040
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
For k = 2, two clusters of sizes 97 (red) and 53 (blue) are formed. The blue cluster contains all 50 elements from setosa and only 3 elements from versicolor, whereas the red cluster contains all of virginica and the remaining versicolor.
For k = 3, three clusters of sizes 50 (red), 38 (green) and 62 (blue) are formed. The misclassification in both cases, k = 2 and k = 3, is quite high.
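The misclassification can be quantified by cross-tabulating the k-means labels against the true species (using the km2 and km3 objects from the sketch above):

table(km2$cluster, df[, 5])  # k = 2: clusters vs. true species
table(km3$cluster, df[, 5])  # k = 3: clusters vs. true species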
Hence, overall, the three agglomerative methods, i.e. Manhattan distance with McQuitty linkage, Canberra distance with complete linkage, and Canberra distance with ward.D2 linkage, perform best, with almost the same (and the lowest) number of misclassifications.