Complete all exercises and submit answers to VtopBeta.

Introduction

Working

  • Put each data point in its own cluster
  • Identify the closest two clusters and combine them into one cluster
  • Repeat the above step until all the data points are in a single cluster (see the sketch below)
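
In R, the hclust function performs exactly this agglomerative loop. A minimal sketch on a toy one-dimensional dataset (the data here is purely illustrative) shows the merge order it records:

x <- c(1, 2, 4, 8)    # toy data: four points on a line
hc <- hclust(dist(x)) # agglomerative clustering on their pairwise distances
hc$merge              # row i: the two clusters merged at step i (negative = an original point)
hc$height             # the distance at which each merge happened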

Determining clusters

  • Complete linkage clustering: the distance between two clusters is the maximum pairwise distance between a point in one cluster and a point in the other.
  • Single linkage clustering: the distance between two clusters is the minimum pairwise distance between a point in one cluster and a point in the other.
  • Mean linkage clustering: the distance between two clusters is the average of all pairwise distances between points in one cluster and points in the other (all three criteria are illustrated below).
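
As a small worked example (toy points and cluster names of my own choosing), the three criteria can be computed by hand for two clusters A and B:

A <- matrix(c(0, 0, 1, 0), ncol = 2, byrow = TRUE) # cluster A: points (0,0) and (1,0)
B <- matrix(c(4, 0, 6, 0), ncol = 2, byrow = TRUE) # cluster B: points (4,0) and (6,0)
pairwise <- as.matrix(dist(rbind(A, B)))[1:2, 3:4] # every A-to-B distance
max(pairwise)  # complete linkage: 6
min(pairwise)  # single linkage: 3
mean(pairwise) # mean (average) linkage: 4.5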

Datasets

Iris dataset for clustering (first five rows)

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         5.1          3.5           1.4          0.2   setosa
         4.9          3.0           1.4          0.2   setosa
         4.7          3.2           1.3          0.2   setosa
         4.6          3.1           1.5          0.2   setosa
         5.0          3.6           1.4          0.2   setosa

Clustering

We know there are 3 species of flowers.

Complete linkage clustering method

hclust requires the data to be provided in the form of a distance matrix, which dist produces. Note that hclust uses complete linkage by default.

library(ggplot2)
clusters <- hclust(dist(iris[, 3:4])) #dist is used to produce this matrix
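
To see what dist produces, we can coerce it to a full matrix and inspect a corner (a quick sanity check; dist computes Euclidean distances by default):

round(as.matrix(dist(iris[, 3:4]))[1:4, 1:4], 2) # pairwise distances between the first four flowers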

plot(clusters, 
     xlab = "Distance Matrix of Petal.Length and Petal.Width", 
     ylab = "Height", 
     main = "Dendogram for Iris Dataset using complete linkage")

Clusters - 3 or 4

Based on the dendrogram above, the best choices for the number of clusters are 3 or 4. We can cut the tree at the desired number of clusters using cutree.

clusterCutThree <- cutree(clusters, 3) #Creating three clusters
clusterCutFour <- cutree(clusters, 4) #Creating four clusters

#Comparing clusters with original species
table(clusterCutThree,iris$Species)
##                
## clusterCutThree setosa versicolor virginica
##               1     50          0         0
##               2      0         21        50
##               3      0         29         0
table(clusterCutFour,iris$Species)
##               
## clusterCutFour setosa versicolor virginica
##              1     50          0         0
##              2      0         21        31
##              3      0         29         0
##              4      0          0        19
#Creating a scatterplot
plot <- ggplot(iris, 
               aes(Petal.Length, Petal.Width, color = Species)) + geom_point()

plot + labs(title = "Scatter plot of Iris Dataset",
            x = "Petal Length",
            y = "Petal Width")

As shown in the above graph, the ideal number of clusters would be three, since there are three species.

It looks like the algorithm classified all the setosa flowers into cluster 1 and all the virginica flowers into cluster 2, but had trouble with versicolor, splitting it between clusters 2 and 3. This is because, as the scatter plot shows, versicolor and virginica have similar petal measurements.
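
A quick look at the per-species ranges (a small sanity check on the raw data) confirms that the versicolor and virginica petal measurements overlap, while setosa is well separated:

aggregate(iris[, 3:4], by = list(Species = iris$Species), FUN = range) # min and max per species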

Mean linkage method

clusters <- hclust(dist(iris[, 3:4]), 
                   method = 'average')
plot(clusters, 
     xlab = "Distance Matrix of Petal.Length and Petal.Width", 
     ylab = "Height", 
     main = "Dendogram for Iris Dataset using Mean linkage method")

Clusters - 3 or 5

We can see from the dendrogram that the two best choices for the number of clusters are 3 and 5. Let us use cutree to cut the tree into 3 and then 5 clusters.

clusterCutThree <- cutree(clusters, 3)
clusterCutFive <- cutree(clusters, 5)

table(clusterCutThree, iris$Species)
##                
## clusterCutThree setosa versicolor virginica
##               1     50          0         0
##               2      0         45         1
##               3      0          5        49
table(clusterCutFive, iris$Species)
##               
## clusterCutFive setosa versicolor virginica
##              1     50          0         0
##              2      0         39         1
##              3      0          5        41
##              4      0          6         0
##              5      0          0         8

We can see that with three clusters, this method did a much better job of clustering the data, misclassifying only 6 of the 150 data points.
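
We can verify this count programmatically; a minimal sketch, assuming each cluster is labelled with its majority species:

tab <- table(clusterCutThree, iris$Species)
sum(tab) - sum(apply(tab, 1, max)) # points outside their cluster's majority species: 6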

plot <- ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + 
        geom_point(col = clusterCutThree) + # points coloured by cluster: 1 = black, 2 = red, 3 = green
        scale_color_manual(values = c('black', 'red', 'green'))

plot + labs(title = "Scatter plot of Iris Dataset",
            x = "Petal Length",
            y = "Petal Width")

Single linkage method

clusters <- hclust(dist(iris[, 3:4]), 
                   method = 'single')
plot(clusters, 
     xlab = "Distance Matrix of Petal.Length and Petal.Width", 
     ylab = "Height", 
     main = "Dendogram for Iris Dataset using Single linkage method")

Clusters - 2

We can see that the best choice for the number of clusters is 2. Single linkage merges clusters through their closest points, so it tends to chain the overlapping versicolor and virginica groups into one cluster. Let us use cutree to bring it down to 2 clusters.

clusterCutTwo <- cutree(clusters, 2)

table(clusterCutTwo, iris$Species)
##              
## clusterCutTwo setosa versicolor virginica
##             1     50          0         0
##             2      0         50        50
plot <- ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + 
        geom_point(col = clusterCutTwo) + # points coloured by cluster: 1 = black, 2 = red
        scale_color_manual(values = c('black', 'red', 'green'))

plot + labs(title = "Scatter plot of Iris Dataset",
            x = "Petal Length",
            y = "Petal Width")

Inference

The best method is mean linkage clustering: at three clusters it recovered the three species most accurately, misclassifying only 6 of the 150 points, whereas complete linkage split versicolor across two clusters and single linkage could not separate versicolor from virginica at all.
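
As a closing check (a minimal sketch, not part of the exercises above), we can compare the three linkage methods head to head by cutting each tree into three clusters and counting the points that fall outside their cluster's majority species:

d <- dist(iris[, 3:4])
for (m in c("complete", "average", "single")) {
  cl <- cutree(hclust(d, method = m), k = 3)
  tab <- table(cl, iris$Species)
  cat(m, "linkage:", sum(tab) - sum(apply(tab, 1, max)), "misclassified points\n")
}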