Complete all exercises and submit answers to VtopBeta.

Introduction

Working

  • Put each data point in its own cluster
  • Identify the closest two clusters and combine them into one cluster
  • Repeat the above step until all the data points are in a single cluster (see the sketch below)
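
In R, the hclust function performs exactly this agglomerative loop. A minimal sketch on a toy one-dimensional dataset (the data here is purely illustrative) shows the merge order it records:

x <- c(1, 2, 4, 8)    # toy data: four points on a line
hc <- hclust(dist(x)) # agglomerative clustering on their pairwise distances
hc$merge              # row i: the two clusters merged at step i (negative = an original point)
hc$height             # the distance at which each merge happened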

Determining clusters

  • Complete linkage clustering: the distance between two clusters is the maximum pairwise distance between a point in one cluster and a point in the other.
  • Single linkage clustering: the distance between two clusters is the minimum pairwise distance between a point in one cluster and a point in the other.
  • Mean linkage clustering: the distance between two clusters is the average of all pairwise distances between points in one cluster and points in the other (all three criteria are illustrated below).
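
As a small worked example (toy points and cluster names of my own choosing), the three criteria can be computed by hand for two clusters A and B:

A <- matrix(c(0, 0, 1, 0), ncol = 2, byrow = TRUE) # cluster A: points (0,0) and (1,0)
B <- matrix(c(4, 0, 6, 0), ncol = 2, byrow = TRUE) # cluster B: points (4,0) and (6,0)
pairwise <- as.matrix(dist(rbind(A, B)))[1:2, 3:4] # every A-to-B distance
max(pairwise)  # complete linkage: 6
min(pairwise)  # single linkage: 3
mean(pairwise) # mean (average) linkage: 4.5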

Datasets

Iris dataset for clustering (first five rows)

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         5.1          3.5           1.4          0.2   setosa
         4.9          3.0           1.4          0.2   setosa
         4.7          3.2           1.3          0.2   setosa
         4.6          3.1           1.5          0.2   setosa
         5.0          3.6           1.4          0.2   setosa

Clustering

We know there are 3 species of flowers.

Complete linkage clustering method

hclust requires the data to be provided in the form of a distance matrix, which dist produces. Note that hclust uses complete linkage by default.

library(ggplot2)
clusters <- hclust(dist(iris[, 3:4])) #dist is used to produce this matrix
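
To see what dist produces, we can coerce it to a full matrix and inspect a corner (a quick sanity check; dist computes Euclidean distances by default):

round(as.matrix(dist(iris[, 3:4]))[1:4, 1:4], 2) # pairwise distances between the first four flowers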

plot(clusters, 
     xlab = "Distance Matrix of Petal.Length and Petal.Width", 
     ylab = "Height", 
     main = "Dendogram for Iris Dataset using complete linkage")

Clusters - 3 or 4

Based on the dendrogram above, the best choices for the number of clusters are 3 or 4. We can cut the tree at the desired number of clusters using cutree.

clusterCutThree <- cutree(clusters, 3) #Creating three clusters
clusterCutFour <- cutree(clusters, 4) #Creating four clusters

#Comparing clusters with original species
table(clusterCutThree,iris$Species)
##                
## clusterCutThree setosa versicolor virginica
##               1     50          0         0
##               2      0         21        50
##               3      0         29         0
table(clusterCutFour,iris$Species)
##               
## clusterCutFour setosa versicolor virginica
##              1     50          0         0
##              2      0         21        31
##              3      0         29         0
##              4      0          0        19
#Creating a scatterplot
plot <- ggplot(iris, 
               aes(Petal.Length, Petal.Width, color = Species)) + geom_point()

plot + labs(title = "Scatter plot of Iris Dataset",
            x = "Petal Length",
            y = "Petal Width")

As shown in the above graph, the ideal number of clusters would be three, since there are three species.

It looks like the algorithm classified all the setosa flowers into cluster 1 and all the virginica flowers into cluster 2, but had trouble with versicolor, splitting it between clusters 2 and 3. This is because, as the scatter plot shows, versicolor and virginica have similar petal measurements.
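
A quick look at the per-species ranges (a small sanity check on the raw data) confirms that the versicolor and virginica petal measurements overlap, while setosa is well separated:

aggregate(iris[, 3:4], by = list(Species = iris$Species), FUN = range) # min and max per species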

Mean linkage method

clusters <- hclust(dist(iris[, 3:4]), 
                   method = 'average')
plot(clusters, 
     xlab = "Distance Matrix of Petal.Length and Petal.Width", 
     ylab = "Height", 
     main = "Dendogram for Iris Dataset using Mean linkage method")

Clusters - 3 or 5

We can see from the dendrogram that the two best choices for the number of clusters are 3 and 5. Let us use cutree to cut the tree into 3 and then 5 clusters.

clusterCutThree <- cutree(clusters, 3)
clusterCutFive <- cutree(clusters, 5)

table(clusterCutThree, iris$Species)
##                
## clusterCutThree setosa versicolor virginica
##               1     50          0         0
##               2      0         45         1
##               3      0          5        49
table(clusterCutFive, iris$Species)
##               
## clusterCutFive setosa versicolor virginica
##              1     50          0         0
##              2      0         39         1
##              3      0          5        41
##              4      0          6         0
##              5      0          0         8

We can see that with three clusters, this method did a much better job of clustering the data, misclassifying only 6 of the 150 data points.
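
We can verify this count programmatically; a minimal sketch, assuming each cluster is labelled with its majority species:

tab <- table(clusterCutThree, iris$Species)
sum(tab) - sum(apply(tab, 1, max)) # points outside their cluster's majority species: 6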

plot <- ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + 
        geom_point(col = clusterCutThree) + # points coloured by cluster: 1 = black, 2 = red, 3 = green
        scale_color_manual(values = c('black', 'red', 'green'))

plot + labs(title = "Scatter plot of Iris Dataset",
            x = "Petal Length",
            y = "Petal Width")

Single linkage method

clusters <- hclust(dist(iris[, 3:4]), 
                   method = 'single')
plot(clusters, 
     xlab = "Distance Matrix of Petal.Length and Petal.Width", 
     ylab = "Height", 
     main = "Dendogram for Iris Dataset using Single linkage method")

Clusters - 2

We can see that the best choice for the number of clusters is 2. Single linkage merges clusters through their closest points, so it tends to chain the overlapping versicolor and virginica groups into one cluster. Let us use cutree to bring it down to 2 clusters.

clusterCutTwo <- cutree(clusters, 2)

table(clusterCutTwo, iris$Species)
##              
## clusterCutTwo setosa versicolor virginica
##             1     50          0         0
##             2      0         50        50
plot <- ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + 
        geom_point(col = clusterCutTwo) + # points coloured by cluster: 1 = black, 2 = red
        scale_color_manual(values = c('black', 'red', 'green'))

plot + labs(title = "Scatter plot of Iris Dataset",
            x = "Petal Length",
            y = "Petal Width")

Inference

The best method is mean linkage clustering: at three clusters it recovered the three species most accurately, misclassifying only 6 of the 150 points, whereas complete linkage split versicolor across two clusters and single linkage could not separate versicolor from virginica at all.
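
As a closing check (a minimal sketch, not part of the exercises above), we can compare the three linkage methods head to head by cutting each tree into three clusters and counting the points that fall outside their cluster's majority species:

d <- dist(iris[, 3:4])
for (m in c("complete", "average", "single")) {
  cl <- cutree(hclust(d, method = m), k = 3)
  tab <- table(cl, iris$Species)
  cat(m, "linkage:", sum(tab) - sum(apply(tab, 1, max)), "misclassified points\n")
}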