Complete all Exercises, and submit answers to VtopBeta
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
We know there are 3 species of flowers.
hclust requires the data to be provided in the form of a distance matrix, which dist() computes from the raw measurements.
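To make the distance-matrix requirement concrete, here is a small sketch that runs dist() on the petal columns of the five rows shown in the table above. The names first_petals and petal_dist are illustrative only, and Euclidean distance is simply the dist() default rather than a choice stated in the text.

```r
# Petal.Length and Petal.Width (columns 3 and 4) of the first five flowers
first_petals <- iris[1:5, 3:4]

# dist() returns the lower triangle of the pairwise distance matrix
# (Euclidean by default); this is the object type hclust() expects
petal_dist <- dist(first_petals)
print(petal_dist)

# Rows 1, 2 and 5 have identical petal measurements in the table,
# so the distances between them are 0
print(as.matrix(petal_dist)[c(1, 2, 5), c(1, 2, 5)])
```

Five flowers give 10 pairwise distances; the full dataset of 150 flowers gives 150 * 149 / 2 = 11175.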
library(ggplot2)
clusters <- hclust(dist(iris[, 3:4])) #dist is used to produce this matrix
plot(clusters,
     xlab = "Distance Matrix of Petal.Length and Petal.Width",
     ylab = "Height",
     main = "Dendrogram for Iris Dataset using complete linkage")
Based on the dendrogram above, the best choice for the number of clusters is either 3 or 4. We can cut the tree at the desired number of clusters using cutree.
clusterCutThree <- cutree(clusters, 3) #Creating three clusters
clusterCutFour <- cutree(clusters, 4) #Creating four clusters
#Comparing clusters with original species
table(clusterCutThree, iris$Species)
##
## clusterCutThree setosa versicolor virginica
##               1     50          0         0
##               2      0         21        50
##               3      0         29         0
table(clusterCutFour, iris$Species)
##
## clusterCutFour setosa versicolor virginica
##              1     50          0         0
##              2      0         21        31
##              3      0         29         0
##              4      0          0        19
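The two tables above come from cutting the same tree at two levels, so the four-cluster result should only split groups that already exist in the three-cluster result. The sketch below checks this by cross-tabulating the two cuts. The names tree, cut3, cut4 and nested are illustrative and not part of the original code.

```r
# Rebuild the complete-linkage tree (complete is the hclust() default)
tree <- hclust(dist(iris[, 3:4]))

cut3 <- cutree(tree, 3)
cut4 <- cutree(tree, 4)

# Rows: three-cluster labels, columns: four-cluster labels
cross <- table(cut3, cut4)
print(cross)

# In a hierarchy, every four-cluster group lies inside exactly one
# three-cluster group, so each column has a single non-zero cell
nested <- all(colSums(cross > 0) == 1)
print(nested)
```

This is why moving from 3 to 4 clusters cannot repair the mixed cluster by reshuffling flowers; it can only divide one existing cluster in two.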
#Creating a scatterplot
plot <- ggplot(iris,
aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
plot + labs(title = "Scatter plot of Iris Dataset",
x = "Petal Length",
y = "Petal Width")
As shown in the graph above, the ideal number of clusters would be three, since there are three species.
With three clusters, the algorithm placed all 50 setosa flowers in cluster 1 and all 50 virginica flowers in cluster 2, but it had trouble with versicolor, splitting it between clusters 2 and 3 (21 and 29 flowers). The scatterplot shows why: virginica and versicolor have similar petal measurements.
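The overlap between versicolor and virginica can also be checked numerically. The following sketch compares the minimum and maximum petal measurements per species. The variable names are illustrative, and comparing ranges is only one rough way to show overlap.

```r
# Min and max of each petal measurement, one column per species
length_range <- sapply(split(iris$Petal.Length, iris$Species), range)
width_range  <- sapply(split(iris$Petal.Width,  iris$Species), range)
rownames(length_range) <- c("min", "max")
rownames(width_range)  <- c("min", "max")
print(length_range)
print(width_range)

# TRUE when the largest versicolor value reaches into the virginica range
overlap_length <- length_range["max", "versicolor"] >= length_range["min", "virginica"]
overlap_width  <- width_range["max", "versicolor"]  >= width_range["min", "virginica"]

# TRUE when setosa sits entirely below versicolor on petal length
setosa_separate <- length_range["max", "setosa"] < length_range["min", "versicolor"]
print(c(overlap_length, overlap_width, setosa_separate))
```

Overlapping ranges on both measurements are consistent with the clustering confusing these two species while separating setosa cleanly.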
clusters <- hclust(dist(iris[, 3:4]),
method = 'average')
plot(clusters,
xlab = "Distance Matrix of Petal.Length and Petal.Width",
ylab = "Height",
main = "Dendrogram for Iris Dataset using Mean linkage method")
We can see that the two best choices for the number of clusters are 3 and 5. Let us use cutree to cut the tree into 3 and 5 clusters.
clusterCutThree <- cutree(clusters, 3)
clusterCutFive <- cutree(clusters, 5)
table(clusterCutThree, iris$Species)
##
## clusterCutThree setosa versicolor virginica
##               1     50          0         0
##               2      0         45         1
##               3      0          5        49
table(clusterCutFive, iris$Species)
##
## clusterCutFive setosa versicolor virginica
##              1     50          0         0
##              2      0         39         1
##              3      0          5        41
##              4      0          6         0
##              5      0          0         8
We can see that this time the algorithm did a much better job of clustering the data: with three clusters, it misassigned only 6 of the 150 flowers (1 virginica in cluster 2 and 5 versicolor in cluster 3).
#Coloring each point by its assigned cluster
plot <- ggplot(iris, aes(Petal.Length, Petal.Width, color = factor(clusterCutThree))) +
  geom_point() +
  scale_color_manual(values = c('black', 'red', 'green'), name = "Cluster")
plot + labs(title = "Scatter plot of Iris Dataset",
            x = "Petal Length",
            y = "Petal Width")
clusters <- hclust(dist(iris[, 3:4]),
method = 'single')
plot(clusters,
xlab = "Distance Matrix of Petal.Length and Petal.Width",
ylab = "Height",
main = "Dendrogram for Iris Dataset using Single linkage method")
We can see that the best choice for the number of clusters is 2. Let us use cutree to cut the tree into 2 clusters.
clusterCutTwo <- cutree(clusters, 2)
table(clusterCutTwo, iris$Species)
##
## clusterCutTwo setosa versicolor virginica
##             1     50          0         0
##             2      0         50        50
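One way to see why the linkage method changes the result is to measure the gap between two groups under each criterion. As commonly defined, single linkage uses the smallest pairwise distance between two groups, complete linkage the largest, and average (mean) linkage the mean of all pairwise distances. The sketch below applies these definitions to versicolor and virginica. Treating the species as ready-made clusters is an assumption made for illustration only, since hclust never sees the labels, and all variable names are hypothetical.

```r
petals <- as.matrix(iris[, 3:4])
versicolor <- petals[iris$Species == "versicolor", ]
virginica  <- petals[iris$Species == "virginica", ]

# All 50 x 50 Euclidean distances between the two groups
diff_length <- outer(versicolor[, 1], virginica[, 1], "-")
diff_width  <- outer(versicolor[, 2], virginica[, 2], "-")
between <- sqrt(diff_length^2 + diff_width^2)

# Inter-group distance under each linkage criterion
linkage_distance <- c(single   = min(between),
                      average  = mean(between),
                      complete = max(between))
print(linkage_distance)
```

Because single linkage looks only at the closest pair, two groups that touch at even one point appear close together, which is consistent with versicolor and virginica being merged into one cluster in the table above.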
#Coloring each point by its assigned cluster
plot <- ggplot(iris, aes(Petal.Length, Petal.Width, color = factor(clusterCutTwo))) +
  geom_point() +
  scale_color_manual(values = c('black', 'red'), name = "Cluster")
plot + labs(title = "Scatter plot of Iris Dataset",
            x = "Petal Length",
            y = "Petal Width")
The best method would be mean linkage clustering, because its three clusters match the three species most closely, with only 6 of the 150 flowers misassigned.
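The comparison between methods can be put on one scale by counting, for each method, how many flowers fall outside the majority species of their cluster. This is a minimal sketch: count_mismatches is a hypothetical helper that does not appear in the original code, and majority-vote counting is only one possible way to score a clustering against known labels.

```r
# Hypothetical helper: flowers that do not belong to the most common
# species in their cluster are counted as mismatches
count_mismatches <- function(cluster_ids, labels) {
  tab <- table(cluster_ids, labels)
  sum(tab) - sum(apply(tab, 1, max))
}

petal_dist <- dist(iris[, 3:4])
methods <- c("complete", "average", "single")

# Cut every tree into three clusters and score it against Species
mismatches <- sapply(methods, function(m) {
  tree <- hclust(petal_dist, method = m)
  count_mismatches(cutree(tree, 3), iris$Species)
})
print(mismatches)
```

For the complete and average linkage trees, the counts should agree with what can be read off the three-cluster tables above by hand.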