Background

Edgar Anderson was an American botanist. In 1929 he moved to Britain, where he studied and worked at the John Innes Horticultural Institute. It was while studying there that he created a data set containing the sepal lengths and widths and the petal lengths and widths of three different species of Iris. The statistician R.A. Fisher then used that data set as an example for statistical methods of classification. This project seeks to demonstrate how clustering algorithms can be used to predict an Iris flower’s species.

First Steps

We would normally need to load the data, but because this data set is so widely used, it ships with R’s built-in datasets package, so we can start using it immediately.

# Copy the built-in data set and inspect its structure
iris_df <- iris
str(iris_df)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

We can see here that the data set contains 150 observations. We also see that four of our variables are quantitative and one variable (Species) is categorical. Some manipulation will have to be done before we begin: we’re going to remove the Species variable for now and save it as a separate vector for later.

# Set the species labels aside and keep only the four numeric measurements
species <- iris_df$Species
iris_df <- iris_df[1:4]

Exploratory Visualizations

We’re now ready to look at the data.

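One simple way to do this is a pairs plot of the four measurements, colored by species. This is a minimal sketch; the plots in the original write-up are not shown here and may have differed.

# Pairwise scatter plots of the four measurements, colored by species
pairs(iris_df, col = species, pch = 19)
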
From these plots we can easily see two distinct clusters, but we know from viewing the structure of the data set above that there are three species of flowers.

Hierarchical Clustering

The first type of clustering we’ll try is hierarchical clustering with single linkage. This type of clustering is agglomerative, meaning it takes a bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged repeatedly as we move up the hierarchy. First, we’ll need to calculate the distance between every pair of observations; this is achieved with the dist() function, using “euclidean” for the method, which gives the ordinary straight-line distance between two points. Next, we’ll use the hclust() function with the method set to “single”, which measures the distance between two clusters as the distance between their two closest members. Since the function produces an entire tree of nested clusters, we’ll have to cut it to create the 3 clusters that represent the iris species. We’ll use cutree() for this.

set.seed(3949)
iris_dist <- dist(iris_df, method = "euclidean")
model_1 <- hclust(iris_dist, method = "single")
model_1_cut <- cutree(model_1, 3)
table(species, model_1_cut)
##             model_1_cut
## species       1  2  3
##   setosa     50  0  0
##   versicolor  0 50  0
##   virginica   0 48  2

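The discussion below mentions visualizations; one way to see what single linkage is doing (a sketch, not necessarily the figure from the original report) is to plot the dendrogram and outline the three-cluster cut.

# Plot the single-linkage dendrogram and outline the three clusters
plot(model_1, labels = FALSE, hang = -1, main = "Single linkage dendrogram")
rect.hclust(model_1, k = 3, border = 2:4)
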
By interpreting the confusion matrix and visualizations here, we see that this clustering method was good at isolating one group of flowers but had a hard time distinguishing between the other two. If we define cluster 1 as setosa, cluster 2 as versicolor, and cluster 3 as virginica, we end up with an accuracy of 68%: not a very good model at all. We’ll now try the “complete” method for hierarchical clustering. Complete linkage differs in how it measures the distance between two clusters: instead of the closest pair of members, it uses the distance between the two members that are farthest apart.

model_2 <- hclust(iris_dist, method = "complete")
model_2_cut <- cutree(model_2, 3)
table(species, model_2_cut)
##             model_2_cut
## species       1  2  3
##   setosa     50  0  0
##   versicolor  0 23 27
##   virginica   0 49  1

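As a quick check on the accuracy figures quoted in this write-up, we can apply a cluster-to-species mapping by hand and compare it against the true labels (a sketch; each mapping is simply the one that best matches its confusion matrix).

# Model 1: cluster 1 -> setosa, 2 -> versicolor, 3 -> virginica
mean(c("setosa", "versicolor", "virginica")[model_1_cut] == species)  # 0.68
# Model 2: cluster 1 -> setosa, 2 -> virginica, 3 -> versicolor
mean(c("setosa", "virginica", "versicolor")[model_2_cut] == species)  # 0.84
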
This model shows a dramatic improvement. Defining cluster 1 as setosa, cluster 2 as virginica and cluster 3 as versicolor, we have an accuracy of 84%. It may yet be possible to get better results, though, by using a different clustering algorithm.

K-Means Clustering

K-means clustering works by initializing ‘k’ means at random values in the data set. Clusters are then created by assigning every observation to its nearest mean; a centroid is calculated for each cluster, and that centroid becomes the new mean. The process repeats until convergence is reached. R’s stats package makes this an easy process: we’ll simply use the kmeans() function, set with 3 centers.

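Before calling the built-in function, here is a rough sketch of a single assign-and-update step of the idea just described. It is illustrative only; kmeans() below does the real work and iterates until convergence.

# One illustrative assignment/update iteration of the k-means idea (sketch only)
set.seed(1)
centers <- as.matrix(iris_df[sample(nrow(iris_df), 3), ])             # 3 random starting means
d <- as.matrix(dist(rbind(centers, as.matrix(iris_df))))[-(1:3), 1:3] # distance of each point to each mean
cl <- apply(d, 1, which.min)                                          # assign each point to its nearest mean
centers <- apply(iris_df, 2, function(x) tapply(x, cl, mean))         # recompute centroids as the new means
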
set.seed(3949)
model_3 <- kmeans(iris_df, centers = 3)
table(species, model_3$cluster)
##             
## species       1  2  3
##   setosa      0 50  0
##   versicolor 48  0  2
##   virginica  14  0 36

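The cluster assignments can be visualized with a scatter plot of the petal measurements (a sketch of one possible plot; petal length and width separate the species most clearly), with the fitted centers overlaid.

# Petal measurements colored by k-means cluster, with the cluster centers marked
plot(iris_df$Petal.Length, iris_df$Petal.Width, col = model_3$cluster,
     pch = 19, xlab = "Petal.Length", ylab = "Petal.Width")
points(model_3$centers[, c("Petal.Length", "Petal.Width")],
       col = 1:3, pch = 8, cex = 2)
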
It is clear from the confusion matrix and the visualization that using k-means gives us the best result of the three models. If we assign setosa to cluster 2, versicolor to cluster 1, and virginica to cluster 3, we get an accuracy of 89.3%.

Conclusion

Using the k-means method for clustering works well in separating out the different species in the Iris data set. From here, one could scale the data and then run k-means again, but considering that there are no major outliers in the data set and the values of the measurements are small and on similar scales, it is doubtful that this would produce a better model. Instead, other methods of clustering could be explored.
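
For anyone who does want to try scaling first, it only takes one extra step. This is a sketch; its results are not shown or evaluated here.

# Standardize each measurement (mean 0, sd 1) before clustering
set.seed(3949)
model_4 <- kmeans(scale(iris_df), centers = 3)
table(species, model_4$cluster)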