library(tidyverse)
library(gganimate)
If you are unfamiliar with K-means clustering, please read this section. K-means clustering is a popular unsupervised machine learning algorithm that groups the observations in a dataset into a fixed number (k) of clusters. You decide on the number k, which is the number of centroids, or cluster centers, the algorithm will find in the dataset.
The K-means algorithm starts by randomly selecting k centroids to use as the starting points for the clusters. Then, for every data point, the algorithm determines which centroid it is nearest to (in terms of Euclidean distance) and assigns the data point to that centroid. After each data point has been assigned to its nearest centroid, each centroid is updated to be the average of the points that were just assigned to it. The algorithm then reassigns every data point to its nearest centroid, and this assign-and-update process keeps repeating. The algorithm stops once the cluster assignments no longer change, which is when the algorithm has "converged".
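To make the assign-and-update step concrete, here is a minimal sketch of a single K-means iteration in base R on a small made-up data set. The toy values and object names below are purely illustrative and are not used anywhere else in this post.
#a small, made-up data set with two variables; rows 1 and 6 serve as the starting centroids
toy <- data.frame(x = c(1, 1.2, 0.8, 5, 5.3, 4.9),
                  y = c(2, 2.1, 1.9, 7, 7.2, 6.8))
centroids <- toy[c(1, 6), ]
#assignment step: label each point with the index of its nearest centroid
nearest <- apply(toy, 1, function(p) which.min(colSums((t(centroids) - p)^2)))
nearest
#update step: each centroid becomes the mean of the points assigned to it
new_centroids <- aggregate(toy, by = list(cluster = nearest), FUN = mean)
new_centroids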
k_means <- function(data, k, pca = FALSE) {
  #option for Principal Component Analysis
  if (pca == TRUE) {
    data <- princomp(data)
    data <- data$scores %>%
      as.data.frame() %>%
      select(Comp.1, Comp.2)
  }
  #randomly select the indices of k rows to use as starting
  #centers of the k clusters
  rand <- sample(1:nrow(data), k)
  #data frame with the k observations that were randomly selected
  clusters <- data[rand, ]
  #vectors that will contain the cluster assignments for each observation
  cluster_vec <- c()
  last_vec <- c(0)
  #iteration counter
  iter <- 1
  #algorithm will stop once stop is equal to 1
  stop <- 0
  while (stop == 0) {
    #loop through each observation
    for (i in 1:nrow(data)) {
      #find the euclidean distance of the ith observation to each of the clusters
      dist <- data[i, ] %>%
        rbind(clusters) %>%
        dist()
      #find which cluster the ith observation has the smallest distance to
      i_cluster <- dist[1:k] %>%
        which.min()
      #add the cluster assignment for the ith observation to the vector
      #containing the cluster assignments of all observations
      cluster_vec[i] <- i_cluster
    }
    #check to see if the cluster assignments have changed at all since
    #the last iteration
    if (all(cluster_vec == last_vec)) {
      stop <- 1
    }
    #save the cluster assignments from this iteration to another object
    #so we can check whether they change in the next iteration
    last_vec <- cluster_vec
    #group the observations into their assigned clusters and find the means
    #of all the columns to use as the new cluster centers
    clusters <- data %>%
      cbind(cluster_vec) %>%
      group_by(cluster_vec) %>%
      summarize_all(mean)
    #remove the first column that contains the cluster number
    clusters <- clusters[, -1]
    iter <- iter + 1
    #find the cluster sizes and final cluster means once the algorithm has converged
    if (stop == 1) {
      sizes <- data %>%
        cbind(cluster_vec) %>%
        count(cluster_vec) %>%
        pull(n)
      clusters <- data %>%
        cbind(cluster_vec) %>%
        group_by(cluster_vec) %>%
        summarize_all(mean)
    }
  }
  result <- list("Sizes" = sizes,
                 "Cluster Means" = clusters,
                 "Clustering Vector" = cluster_vec,
                 "Iterations" = iter)
  return(result)
}
To demonstrate how the function created above works, we will use the well-known iris dataset as an example. The function only works with data frames containing exclusively numeric variables, so the Species column is removed.
iris2 <- iris %>%
  select(-Species)
head(iris2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
The K-means algorithm randomly selects k observations to use as the starting centroids, which means that the final cluster assignments can differ depending on which observations are randomly selected first. We use the set.seed() function to keep the results reproducible in these examples.
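Before fixing the seed, one way to see the effect of the random start is to run the function twice with different seeds and cross-tabulate the two clustering vectors. This sketch is only illustrative; the seeds are arbitrary, and the two runs will often produce the same partition with the cluster labels permuted.
#run the function with two different random starts
set.seed(1)
run1 <- k_means(iris2, k = 3)
set.seed(2)
run2 <- k_means(iris2, k = 3)
#cross-tabulate the two clustering vectors to see how the labelings line up
table(run1$`Clustering Vector`, run2$`Clustering Vector`)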
set.seed(23)
k_means(data = iris2, k = 3)
## $Sizes
## [1] 50 62 38
##
## $`Cluster Means`
## # A tibble: 3 × 5
## cluster_vec Sepal.Length Sepal.Width Petal.Length Petal.Width
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 5.01 3.43 1.46 0.246
## 2 2 5.90 2.75 4.39 1.43
## 3 3 6.85 3.07 5.74 2.07
##
## $`Clustering Vector`
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
##
## $Iterations
## [1] 7
In the output of the function, we can see that the K-means algorithm converged with clusters of sizes 50, 62, and 38 observations. The cluster means for the three clusters are displayed in the form of a tibble. The clustering vector gives the cluster each observation belongs to, in the same order as the rows of the original data frame. Finally, the number of iterations displays the number of times the algorithm changed the cluster centroids before convergence. In this example, the algorithm took 7 iterations.
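As a rough sanity check (not part of the output above), the assignments can be compared with R's built-in kmeans() from the stats package. The built-in function uses a different algorithm (Hartigan-Wong by default) and its own random initialization, so the labels will usually be permuted and the partitions need not match exactly; a cross-tabulation of the two clustering vectors shows how closely they agree.
#compare our assignments with the built-in kmeans() from the stats package
set.seed(23)
ours <- k_means(iris2, k = 3)
builtin <- kmeans(iris2, centers = 3)
#a table that is (close to) diagonal up to a relabeling of the clusters
#indicates that the two partitions largely agree
table(ours = ours$`Clustering Vector`, builtin = builtin$cluster)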
In the previous example, we used all 4 numeric variables in the iris dataset, so it is difficult to show graphically how good the cluster assignments are. In this example, we only use Sepal.Length and Petal.Length, so we can show the results on a scatterplot.
iris3 <- iris %>%
  select(Sepal.Length, Petal.Length)
head(iris3)
## Sepal.Length Petal.Length
## 1 5.1 1.4
## 2 4.9 1.4
## 3 4.7 1.3
## 4 4.6 1.5
## 5 5.0 1.4
## 6 5.4 1.7
set.seed(78)
result <- k_means(iris3, 3)
result
## $Sizes
## [1] 41 51 58
##
## $`Cluster Means`
## # A tibble: 3 × 3
## cluster_vec Sepal.Length Petal.Length
## <int> <dbl> <dbl>
## 1 1 6.84 5.68
## 2 2 5.01 1.49
## 3 3 5.87 4.39
##
## $`Clustering Vector`
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 1 3 1 1 1 1 3 1 1 1 1
## [112] 1 1 3 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 1 1 1 1 1 3 1 1 1 3 1 1 1 3 1
## [149] 1 3
##
## $Iterations
## [1] 5
#save the clustering assignments to an object
assignments <- result$`Clustering Vector`
iris3 %>%
  cbind(assignments) %>%
  ggplot(aes(x = Sepal.Length, y = Petal.Length, color = as.factor(assignments))) +
  geom_point() +
  theme_bw() +
  labs(color = "Cluster")
Using only Sepal.Length and Petal.Length in the K-means algorithm, we can see that the resulting cluster assignments make sense. For example, observations in cluster 1 are close together in terms of Euclidean distance as well as on the scatterplot, and those observations have larger values for both Petal.Length and Sepal.Length.
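The pca argument of the function has not been demonstrated yet. Setting pca = TRUE replaces the data with its first two principal components (computed with princomp()) before clustering, which makes it possible to cluster a two-dimensional summary of all four variables and still plot the result. A brief sketch of how it might be used follows; the seed is an arbitrary choice, and the scores are recomputed outside the function for plotting, since the function itself only returns the cluster means.
set.seed(99)
#cluster the iris data on its first two principal components
pca_result <- k_means(data = iris2, k = 3, pca = TRUE)
pca_result$Sizes
#recompute the scores the function used internally so the assignments can be plotted
scores <- princomp(iris2)$scores %>%
  as.data.frame() %>%
  select(Comp.1, Comp.2)
scores %>%
  cbind(cluster = pca_result$`Clustering Vector`) %>%
  ggplot(aes(x = Comp.1, y = Comp.2, color = as.factor(cluster))) +
  geom_point() +
  theme_bw() +
  labs(color = "Cluster")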
In order to create an animated graph that shows the cluster assignments on each iteration, we must create another modified K-means function that saves the centroids and all the assignments on every iteration.
k_means_iterations <- function(data, k, pca = FALSE) {
  #option for Principal Component Analysis
  if (pca == TRUE) {
    data <- princomp(data)
    data <- data$scores %>%
      as.data.frame() %>%
      select(Comp.1, Comp.2)
  }
  #randomly select the indices of k rows to use as starting
  #centers of the k clusters
  rand <- sample(1:nrow(data), k)
  #data frame with the k observations that were randomly selected
  clusters <- data[rand, ]
  #vectors that will contain the cluster assignments for each observation
  cluster_vec <- c()
  last_vec <- c(0)
  #iteration counter
  iter <- 1
  #algorithm will stop once stop is equal to 1
  stop <- 0
  #create an empty data frame that will contain the assignments for every iteration
  all_df <- data.frame()
  #create an empty data frame that will contain the centroids for every iteration
  all_center_df <- data.frame()
  while (stop == 0) {
    #loop through each observation
    for (i in 1:nrow(data)) {
      #find the euclidean distance of the ith observation to each of the clusters
      dist <- data[i, ] %>%
        rbind(clusters) %>%
        dist()
      #find which cluster the ith observation has the smallest distance to
      i_cluster <- dist[1:k] %>%
        which.min()
      #add the cluster assignment for the ith observation to the vector
      #containing the cluster assignments of all observations
      cluster_vec[i] <- i_cluster
    }
    #check to see if the cluster assignments have changed at all since
    #the last iteration
    if (all(cluster_vec == last_vec)) {
      stop <- 1
    }
    #save the cluster assignments from this iteration to another object
    #so we can check whether they change in the next iteration
    last_vec <- cluster_vec
    #save this iteration's clustering vector and the original data to a new data frame
    df <- data %>%
      add_column(cluster = cluster_vec) %>%
      add_column(iteration = iter)
    #save this iteration's centroids, along with the iteration number, to a new data frame
    center_df <- clusters %>%
      add_column(cluster = c(1:k)) %>%
      add_column(iteration = iter)
    #add this iteration's cluster assignments to the data frame containing the
    #assignments of all previous iterations
    all_df <- rbind(all_df, df)
    #add this iteration's centroids to the data frame containing the
    #centroids of all the previous iterations
    all_center_df <- rbind(all_center_df, center_df)
    #group the observations into their assigned clusters and find the means
    #of all the columns to use as the new cluster centers
    clusters <- data %>%
      cbind(cluster_vec) %>%
      group_by(cluster_vec) %>%
      summarize_all(mean)
    #remove the first column that contains the cluster number
    clusters <- clusters[, -1]
    #clear the data frames containing information on only the current iteration
    df <- data.frame()
    center_df <- data.frame()
    #add to the iteration counter
    iter <- iter + 1
    #find the cluster sizes and final cluster means once the algorithm has converged
    if (stop == 1) {
      sizes <- data %>%
        cbind(cluster_vec) %>%
        count(cluster_vec) %>%
        pull(n)
      clusters <- data %>%
        cbind(cluster_vec) %>%
        group_by(cluster_vec) %>%
        summarize_all(mean)
    }
  }
  result <- list("Sizes" = sizes,
                 "Cluster Means" = clusters,
                 "Clustering Vector" = cluster_vec,
                 "Iterations" = iter,
                 "All Assignments" = all_df,
                 "All Centroids" = all_center_df)
  return(result)
}
set.seed(54)
#call the function with k equal to 3
result <- k_means_iterations(data = iris3, k = 3)
#save the tibble with all of the assignments to a new data frame
graph_df <- result$`All Assignments`
#save the tibble with all the centroids to a new data frame
centers_df <- result$`All Centroids`
#use the results to make an animated graph of the assignments on each iteration
graph_df %>%
  ggplot(aes(x = Sepal.Length,
             y = Petal.Length,
             color = as.factor(cluster))) +
  geom_point() +
  geom_point(data = centers_df, size = 5, shape = 10) +
  transition_states(iteration, transition_length = 0, state_length = 10) +
  labs(title = "Iteration: {closest_state}",
       color = "Cluster") +
  theme_bw()
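When knitting, gganimate renders the animation automatically. To control the rendering or to save the result to a file, the animation can also be rendered explicitly; a minimal sketch, assuming the plot above is first stored in an object (nframes, fps, and the file name are arbitrary choices):
#store the animated plot in an object instead of printing it directly
anim <- graph_df %>%
  ggplot(aes(x = Sepal.Length,
             y = Petal.Length,
             color = as.factor(cluster))) +
  geom_point() +
  geom_point(data = centers_df, size = 5, shape = 10) +
  transition_states(iteration, transition_length = 0, state_length = 10) +
  labs(title = "Iteration: {closest_state}",
       color = "Cluster") +
  theme_bw()
#render the animation with explicit frame settings and save it as a gif
animate(anim, nframes = 100, fps = 10)
anim_save("kmeans_iterations.gif")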
In the graph above, we can see how the K-means clustering algorithm works. Using the iris dataset with only the variables Sepal.Length and Petal.Length, the function clusters the observations into 3 clusters. In the first iteration, 3 random observations are chosen as the starting points for the 3 cluster centroids. Then, the distance to each of those centroids is calculated for every observation, and each observation is assigned to the closest centroid. After the observations have been assigned to a cluster, each of the three centroids is recalculated as the mean of all the observations in that cluster. The distance of each observation to the new cluster centroids is then recalculated, and the process continues until convergence, that is, until the cluster centroids no longer change. In this example, that occurs after 8 iterations.