Simple Cluster Analysis Example Using the iris data from R Software

The Iris Dataset used in this example contains four features (length and width of sepals and petals) of 50 samples of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). These measures were used to create a linear discriminant model to classify the species. The dataset is often used in data mining, classification and clustering examples and to test algorithms.(http://www.lac.inpe.br/~rafael.santos/Docs/CAP394/WholeStory-Iris.html#:~:text=The%20Iris%20Dataset%20contains%20four,model%20to%20classify%20the%20species.)

Information about the original paper and usages of the dataset can be found in the UCI Machine Learning Repository – Iris Data Set. (http://archive.ics.uci.edu/dataset/53/iris)

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
clusters <- hclust(dist(iris[, 3:4]))
plot(clusters)

install.packages("ggplot2")
## The following package(s) will be installed:
## - ggplot2 [3.4.3]
## These packages will be installed into "~/Documents/MA Economics/MA Economics/renv/library/R-4.3/aarch64-apple-darwin20".
## 
## # Installing packages --------------------------------------------------------
## - Installing ggplot2 ...                        OK [linked from cache]
## Successfully installed 1 package in 2.5 milliseconds.
library(ggplot2)

ggplot(iris, aes(Petal.Length, Petal.Width, col=Species)) + geom_point() + ggtitle("MA Economics Class iris scatter plot")

set.seed(55)
cluster.iris <- kmeans(iris[, 3:4], 3, nstart = 20)
cluster.iris
## K-means clustering with 3 clusters of sizes 48, 50, 52
## 
## Cluster means:
##   Petal.Length Petal.Width
## 1     5.595833    2.037500
## 2     1.462000    0.246000
## 3     4.269231    1.342308
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 3 1 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 3 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1
## [149] 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 16.29167  2.02200 13.05769
##  (between_SS / total_SS =  94.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
table(cluster.iris$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1      0          2        46
##   2     50          0         0
##   3      0         48         4
cluster.iris$cluster <- as.factor(cluster.iris$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color=cluster.iris$cluster)) + geom_point() + ggtitle("MA Economics iris cluster plot")