library(tidyverse)
library(gganimate)

Understanding K-Means Clustering

If you are unfamiliar with K-means clustering, please read this section. K-means clustering is a popular unsupervised machine learning algorithm that groups the observations in a dataset into a fixed number (k) of clusters. In K-means clustering, you decide on the number k, which is the number of centroids, or cluster centers, the algorithm will find in the data.

The K-means algorithm starts by randomly selecting k observations to use as the starting centroids of the clusters. Then, for every data point, the algorithm determines which centroid it is nearest to (in terms of Euclidean distance) and assigns the data point to that centroid. After every data point has been assigned to its nearest centroid, each centroid is recalculated as the mean of the points that were just assigned to it. The algorithm then reassigns each data point to its nearest centroid, and this process repeats. The algorithm stops once the cluster assignments no longer change, which is when the algorithm has “converged”.
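As a quick illustration of a single pass of the algorithm, here is a minimal sketch of the assignment and update steps on a tiny made-up dataset (the objects toy_data, centroids, and assignments are only for illustration and are not part of the function built below):

#four two-dimensional points, two obvious groups
toy_data <- matrix(c(1, 1,
                     1.5, 2,
                     8, 8,
                     9, 10), ncol = 2, byrow = TRUE)

#start with k = 2 randomly chosen rows as the centroids
centroids <- toy_data[sample(1:nrow(toy_data), 2), ]

#assignment step: label each point with the index of its nearest centroid
assignments <- apply(toy_data, 1, function(point) {
    which.min(colSums((t(centroids) - point)^2))
})

#update step: recompute each centroid as the mean of its assigned points
#(this simple sketch does not handle the case of an empty cluster)
centroids <- do.call(rbind, lapply(1:2, function(j) {
    colMeans(toy_data[assignments == j, , drop = FALSE])
}))

Repeating these two steps until the assignments stop changing is the whole algorithm; the function below adds bookkeeping such as an iteration counter and the cluster sizes.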

Creating the K-Means Clustering Algorithm from Scratch

k_means <- function(data, k, pca = FALSE) {

    #option for Principal Component Analysis
    if(pca == TRUE){
        data <- princomp(data)
        data <- data$scores %>%
            as.data.frame() %>%
            select(Comp.1, Comp.2)
    }
    
    #randomly select the indices of k rows to use as starting
    #centers of the k clusters
    rand <- sample(1:nrow(data), k)

    #data frame with k observations that were randomly selected
    clusters <- data[rand,]

    #vectors that will hold the current and previous cluster assignments
    cluster_vec <- c()
    last_vec <- c(0)
    
    #iteration counter
    iter <- 1
    
    #algorithm will stop once stop is equal to 1
    stop <- 0

    while (stop == 0) {

        #loop through each observation
        for (i in 1:nrow(data)) {
    
            #find the euclidean distance of the ith observation to each of the clusters
            dist <- data[i,] %>%
                rbind(clusters) %>%
                dist()
            
            #find which cluster the ith observation has the smallest distance with
            i_cluster <- dist[1:k] %>%
                which.min()

            #add the cluster assignment for the ith observation to a vector
            #containing the cluster assignments of all observations
            cluster_vec[i] <- i_cluster

        }

        #check to see if the cluster assignments have changed at all since
        #the last iteration
        if (all(cluster_vec == last_vec)) {
            stop <-  1
        }

        #save the cluster assignments from this iteration to another object
        #so we can check to see if cluster assignments changed
        last_vec <- cluster_vec

        #group the observations into their assigned clusters and find the means
        #of all the columns to use as the new cluster centers
        clusters <- data %>%
            cbind(cluster_vec) %>%
            group_by(cluster_vec) %>%
            summarize_all(mean)

        #remove the first column that contains the cluster number
        clusters <- clusters[, -1]
        
        iter <- iter + 1
        
        #find the cluster sizes and final cluster means once the algorithm has converged
        if (stop == 1) {
            sizes <- data %>% 
                cbind(cluster_vec) %>% 
                count(cluster_vec) %>% 
                pull(n)
            
            clusters <- data %>%
                cbind(cluster_vec) %>%
                group_by(cluster_vec) %>%
                summarize_all(mean)
        
        }

    }

    result <- list("Sizes" = sizes, 
                   "Cluster Means" = clusters,
                   "Clustering Vector" = cluster_vec,
                   "Iterations" = iter)
    return(result)
}

Using the k_means Function

To demonstrate how the function created above works, we will use the well-known iris dataset as an example. The function works only with data frames that contain nothing but numeric variables, so the Species column is removed first.

iris2 <- iris %>% 
    select(-Species)

head(iris2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4

The K-means algorithm randomly selects k observations to use as the starting centroids, which means the final cluster assignments can differ depending on which observations are selected first. We use the set.seed() function to make sure the results stay the same for the examples.
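All of the randomness comes from the single sample() call that picks the starting rows, so fixing the seed fixes the starting centroids. A quick way to see this (output omitted) is to draw the starting rows directly:

#with the same seed, sample() returns the same row indices,
#so k_means() starts from the same centroids every time
set.seed(23)
sample(1:nrow(iris2), 3)

set.seed(23)
sample(1:nrow(iris2), 3)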

set.seed(23)
k_means(data = iris2, k = 3)
## $Sizes
## [1] 50 62 38
## 
## $`Cluster Means`
## # A tibble: 3 × 5
##   cluster_vec Sepal.Length Sepal.Width Petal.Length Petal.Width
##         <int>        <dbl>       <dbl>        <dbl>       <dbl>
## 1           1         5.01        3.43         1.46       0.246
## 2           2         5.90        2.75         4.39       1.43 
## 3           3         6.85        3.07         5.74       2.07 
## 
## $`Clustering Vector`
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
## 
## $Iterations
## [1] 7

In the output of the function, we can see that the K-means algorithm converged with clusters of 50, 62, and 38 observations. The cluster means for the three clusters are displayed in the form of a tibble. The clustering vector shows which cluster each observation was assigned to, in the same order as the rows of the original data frame. Finally, the number of iterations is the number of times the algorithm updated the cluster centroids before convergence. In this example, the algorithm took 7 iterations.
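As a sanity check, we can compare against the kmeans() function from base R's stats package. Its cluster labels are arbitrary, so the numbering may be permuted relative to ours, and because both implementations start from random centroids the results are not guaranteed to match exactly, but the cluster sizes should look broadly similar (output omitted):

#compare the cluster sizes with the built-in implementation
set.seed(23)
kmeans(iris2, centers = 3)$size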

Visualizing the Results

In the previous example, we used all 4 numeric variables in the iris dataset, so it is difficult to show the cluster assignments graphically. In this example, we only use Sepal.Length and Petal.Length, so we can show the results on a scatterplot.

iris3 <- iris %>% 
    select(Sepal.Length, Petal.Length)

head(iris3)
##   Sepal.Length Petal.Length
## 1          5.1          1.4
## 2          4.9          1.4
## 3          4.7          1.3
## 4          4.6          1.5
## 5          5.0          1.4
## 6          5.4          1.7
set.seed(78)
result <- k_means(iris3, 3)
result
## $Sizes
## [1] 41 51 58
## 
## $`Cluster Means`
## # A tibble: 3 × 3
##   cluster_vec Sepal.Length Petal.Length
##         <int>        <dbl>        <dbl>
## 1           1         6.84         5.68
## 2           2         5.01         1.49
## 3           3         5.87         4.39
## 
## $`Clustering Vector`
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 1 3 1 1 1 1 3 1 1 1 1
## [112] 1 1 3 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 1 1 1 3 1
## [149] 1 3
## 
## $Iterations
## [1] 5
#save the clustering assignments to an object
assignments <- result$`Clustering Vector`

iris3 %>% 
    cbind(assignments) %>% 
    ggplot(aes(x = Sepal.Length, y = Petal.Length, color = as.factor(assignments))) + 
    geom_point() + 
    theme_bw() + 
    labs(color = "Cluster")

Using only Sepal.Length and Petal.Length to create the cluster assignments, we can see that the results make sense. For example, the observations in cluster 1 are close together in terms of Euclidean distance as well as on the scatterplot, and those observations have larger values for Petal.Length and Sepal.Length.
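The pca argument of the function provides another way to reduce a dataset with more than two numeric variables to two dimensions: when pca = TRUE, the data are first passed through princomp() and the clustering is done on the scores of the first two principal components. A sketch of how this could be called on the full four-variable data (the object name pca_result is just for illustration, and the output is omitted since it depends on the randomly selected starting centroids):

#cluster the scores of the first two principal components
set.seed(23)
pca_result <- k_means(data = iris2, k = 3, pca = TRUE)
pca_result$Sizes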

Visualizing the K-Means Algorithm in Action

In order to create an animated graph that shows the cluster assignments at each iteration, we need a modified K-means function that saves the centroids and the cluster assignments from every iteration.

k_means_iterations <- function(data, k, pca = FALSE) {

    #option for Principal Component Analysis
    if(pca == TRUE){
        data <- princomp(data)
        data <- data$scores %>%
            as.data.frame() %>%
            select(Comp.1, Comp.2)
    }
    
    #randomly select the indices of k rows to use as starting
    #centers of the k clusters
    rand <- sample(1:nrow(data), k)

    #data frame with k observations that were randomly selected
    clusters <- data[rand,]

    #vectors that will hold the current and previous cluster assignments
    cluster_vec <- c()
    last_vec <- c(0)
    
    #iteration counter
    iter <- 1
    
    #algorithm will stop once stop is equal to 1
    stop <- 0
    
    #create empty data frame that will contain all the assignments for every iteration
    all_df <- data.frame()
    
    #create empty data frames that will contain the centroids for every iteration
    all_center_df <- data.frame()

    while (stop == 0) {

        #loop through each observation
        for (i in 1:nrow(data)) {
    
            #find the euclidean distance of the ith observation to each of the clusters
            dist <- data[i,] %>%
                rbind(clusters) %>%
                dist()
            
            #find which cluster the ith observation has the smallest distance with
            i_cluster <- dist[1:k] %>%
                which.min()

            #add the cluster assignment for the ith observation to a vector
            #containing the cluster assignments of all observations
            cluster_vec[i] <- i_cluster

        }

        #check to see if the cluster assignments have changed at all since
        #the last iteration
        if (all(cluster_vec == last_vec)) {
            stop <-  1
        }

        #save the cluster assignments from this iteration to another object
        #so we can check to see if cluster assignments change in the next iteration
        last_vec <- cluster_vec
        
        #save this iteration's clustering vector and original data to a new data frame
        df <- data %>%
            add_column(cluster = cluster_vec) %>%
            add_column(iteration = iter)

        #save this iteration's centroids to a new data frame with the iteration number too
        center_df <- clusters %>%
            add_column(cluster = c(1:k)) %>%
            add_column(iteration = iter)

        #add this iteration's cluster assignments to the data frame containing all the
        #assignments of all previous iterations
        all_df <- rbind(all_df, df)

        #add this iteration's centroids to the data frame containing all the 
        #centroids of all the previous iterations
        all_center_df <- rbind(all_center_df, center_df)

        #group the observations into their assigned clusters and find the means
        #of all the columns to use as the new cluster centers
        clusters <- data %>%
            cbind(cluster_vec) %>%
            group_by(cluster_vec) %>%
            summarize_all(mean)

        #remove the first column that contains the cluster number
        clusters <- clusters[, -1]
        
        #clear the data frames containing information on only the current iteration
        df <- data.frame()
        center_df <- data.frame()
        
        #add to the iteration counter
        iter <- iter + 1
        
        #find sizes and clusters once the algorithm is finished
        if (stop == 1) {
            sizes <- data %>% 
                cbind(cluster_vec) %>% 
                count(cluster_vec) %>% 
                pull(n)
            
            clusters <- data %>%
                cbind(cluster_vec) %>%
                group_by(cluster_vec) %>%
                summarize_all(mean)
        
        }

    }

    result <- list("Sizes" = sizes, 
                   "Cluster Means" = clusters,
                   "Clustering Vector" = cluster_vec,
                   "Iterations" = iter,
                   "All Assignments" = all_df,
                   "All Centroids" = all_center_df)
    return(result)
}
set.seed(54)
#call the function with k equal to 3
result <- k_means_iterations(data = iris3, k = 3)

#save the tibble with all of the assignments to a new data frame
graph_df <- result$`All Assignments`
#save the tibble with all the centroids to a new data frame
centers_df <- result$`All Centroids`
#use the results to make an animated graph of the assignments on each iteration
graph_df %>% 
  ggplot(aes(x = Sepal.Length,
             y = Petal.Length, 
             color = as.factor(cluster))) + 
  geom_point() + 
  geom_point(data = centers_df, size = 5, shape = 10) + 
  transition_states(iteration, transition_length = 0, state_length = 10) + 
  labs(title = "Iteration: {closest_state}",
       color = "Cluster") + 
  theme_bw()
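Printing the gganimate object renders the animation with the default settings. If you want to keep the result, the anim_save() function from gganimate writes the most recently rendered animation to disk; the file name below is just an example, and animate() can be used first if you want control over the number of frames or the frame rate:

#save the most recently rendered animation as a GIF
anim_save("kmeans_iterations.gif", animation = last_animation())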

In the graph above, we can see how the K-means clustering algorithm works. Using only the Sepal.Length and Petal.Length variables from the iris dataset, the function groups the observations into 3 clusters. In the first iteration, 3 random observations are chosen as the starting points for the 3 cluster centroids. Then, the distance from each observation to each of those 3 centroids is calculated, and each observation is assigned to its closest centroid. After the observations have been assigned to a cluster, the centroid of each of the three clusters is recalculated as the mean of all the observations in that cluster. The distances of the observations to the new cluster centroids are then recalculated, and the process continues until convergence, that is, until the cluster centroids no longer change. In the example above, this occurs after 8 iterations.