K-Means Clustering

Overview

K-means clustering is an unsupervised machine learning algorithm that groups unlabeled data into clusters, so that points in the same cluster are more similar to each other than to points in other clusters. K is the number of clusters, and it is chosen before the algorithm runs.

In this presentation, we will look at how k-means clustering can be applied to the Iris dataset.

K-Means Algorithm

Randomly initialize K centroids (central points)
Assign each data point to its closest centroid
Form a cluster from the points assigned to each centroid
Update each centroid to the average position of all the points in its cluster
Repeat the assignment and update steps until no point changes cluster

Euclidean distance measures how close two points are; a smaller distance means the points are more similar. For two points \((x_1, y_1)\) and \((x_2, y_2)\): \[d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}\]
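
The loop described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the function name `simple_kmeans`, the `max_iter` cap, and the seed are hypothetical choices, and base R's `kmeans()` uses the Hartigan-Wong algorithm by default rather than this plain loop.

``` r
# Euclidean distance between the 2-D points (1, 2) and (4, 6)
d <- sqrt((4 - 1)^2 + (6 - 2)^2)  # 5

# Hypothetical helper: a plain assign-then-update k-means loop
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # pick k data points at random as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # squared Euclidean distance from every point to every centroid (n x k)
    dists <- sapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centroids[j, ])^2)
    })
    # assign each point to its closest centroid
    new_assignment <- max.col(-dists, ties.method = "first")

    # stop once no point changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # move each centroid to the mean of the points in its cluster
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

set.seed(1)  # assumed seed, only so the run is repeatable
fit <- simple_kmeans(iris[, c("Sepal.Length", "Petal.Length")], k = 3)
table(fit$cluster)
```

Ranking by squared distance gives the same closest centroid as ranking by distance, so the square root is skipped inside the loop.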

K-Means Formula

\[\underset{C}{\operatorname{arg\,min}} \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2\] where:

\(C_k\) is the set of points assigned to cluster \(k\),

\(\mu_k\) is the centroid of cluster \(k\),

\(\|x_i - \mu_k\|^2\) is the squared Euclidean distance between point \(x_i\) and the centroid of its cluster.
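
As a rough check on the formula, the double sum can be computed directly and compared with the total that `kmeans()` reports in `tot.withinss`. The helper name `wcss` and the seed are assumptions made for this sketch.

``` r
# Hypothetical helper: within-cluster sum of squares, the quantity being minimized
wcss <- function(x, cluster, centers) {
  x <- as.matrix(x)
  # subtract from each point the centroid of its assigned cluster, then square and sum
  sum((x - centers[cluster, , drop = FALSE])^2)
}

set.seed(42)  # assumed seed, only so the run is repeatable
features <- iris[, 1:4]
fit <- kmeans(features, centers = 3)

wcss(features, fit$cluster, fit$centers)
fit$tot.withinss  # kmeans() reports the same total
```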

Random Centroids

Let’s take a look at k-means clustering on the Iris dataset using the variables Sepal Length and Petal Length. All the data points are plotted, and the initial centroids are shown as + marks. Note that the initial centroids here are chosen at random from among the data points.
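
One way to draw starting centroids from the data is sketched below. The object names and the seed are assumptions; sampling from the unique rows is a precaution, since `kmeans()` rejects duplicate initial centers.

``` r
set.seed(123)  # assumed seed, only so the run is repeatable
points_2d <- iris[, c("Sepal.Length", "Petal.Length")]

# drop duplicated rows so the sampled centroids are guaranteed to be distinct
candidates <- unique(points_2d)
initial_centroids <- candidates[sample(nrow(candidates), 3), ]
initial_centroids

# the chosen points can be handed to kmeans() as its starting centers
fit <- kmeans(points_2d, centers = as.matrix(initial_centroids))
```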

Final Centroids

Next, let’s take a look at where the final centroids end up when the algorithm finishes and no points get reassigned.
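
The final centroids can be read from the fitted object and checked against the cluster means. This sketch assumes a fresh fit with an arbitrary seed, so its cluster numbering need not match the plots.

``` r
set.seed(42)  # assumed seed, only so the run is repeatable
features <- iris[, 1:4]
fit <- kmeans(features, centers = 3)

fit$centers  # final centroid coordinates, one row per cluster
fit$iter     # number of iterations run before stopping

# each final centroid is the mean of the points assigned to it
cluster_means <- aggregate(features, by = list(Cluster = fit$cluster), FUN = mean)
cluster_means
```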

Clusters

Finally, let’s take a look at the classes assigned to the data points based on their closest centroid.
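
To see what the assignment looks like in code, the sketch below recomputes each point's closest final centroid and compares the clusters with the true species. The variable names and seed are assumptions.

``` r
set.seed(42)  # assumed seed, only so the run is repeatable
features <- as.matrix(iris[, 1:4])
fit <- kmeans(features, centers = 3)

# Euclidean distance from every point to each final centroid (150 x 3 matrix)
dists <- sapply(1:3, function(j) {
  sqrt(rowSums(sweep(features, 2, fit$centers[j, ])^2))
})
closest <- max.col(-dists, ties.method = "first")

# compare the recomputed assignment with the one stored by kmeans()
all(closest == fit$cluster)

# how the clusters line up with the known species
table(Cluster = fit$cluster, Species = iris$Species)
```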

2D Plot Code


``` r
library(dplyr)  # provides %>%, group_by(), summarise(), slice_max()

# cluster on the four numeric measurements (the Species column is excluded)
iris_dataset <- iris[, 1:4]
kmeans_result <- kmeans(iris_dataset, centers = 3)

iris$Cluster <- as.factor(kmeans_result$cluster)

# find iris species majority in each cluster
species_map <- iris %>%
  group_by(Cluster, Species) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Cluster) %>%
  slice_max(order_by = n, n = 1, with_ties = FALSE) %>%  # keep exactly one species per cluster
  ungroup()

# map clusters to species names
cluster_labels <- setNames(species_map$Species, species_map$Cluster)
iris$ClusterLabel <- cluster_labels[as.character(iris$Cluster)]
```

2D Plot Code (Continued)


``` r
library(ggplot2)

# final centroid coordinates for the two plotted variables
final_centroids <- as.data.frame(kmeans_result$centers)[, c("Sepal.Length", "Petal.Length")]
final_centroids$Cluster <- as.factor(1:3)
final_centroids$ClusterLabel <- cluster_labels[as.character(final_centroids$Cluster)]

# points colored by majority-species label; centroids drawn as black + marks
iris_plot3 <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = ClusterLabel)) +
  geom_point(size = 2) +
  geom_point(data = final_centroids, aes(x = Sepal.Length, y = Petal.Length),
             color = "black", size = 4, shape = 3) +
  labs(title = "Iris K-Means Clustering (Final Centroids)",
       x = "Sepal Length", y = "Petal Length")

print(iris_plot3)
```

3D Clustering

We can also plot this data in three dimensions by adding the variable Petal Width as the z-axis. Left-click and drag to rotate the 3D plot.
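
The plotting library is not named here, so the sketch below assumes the `plotly` package, which is one common way to build a rotatable 3D scatter plot in R. The object names and seed are also assumptions, and the plotting step is skipped if `plotly` is not installed.

``` r
set.seed(42)  # assumed seed, only so the run is repeatable
iris_3d <- iris
iris_3d$Cluster <- as.factor(kmeans(iris[, 1:4], centers = 3)$cluster)

# assumption: plotly is used for the interactive 3D view
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(
    iris_3d,
    x = ~Sepal.Length, y = ~Petal.Length, z = ~Petal.Width,
    color = ~Cluster,
    type = "scatter3d", mode = "markers"
  )
  # print(fig) displays the plot in the viewer or browser
}
```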