Clustering in R

There are many functions available in R for performing clustering, including both base R functions and functions from packages in the R ecosystem. Some of the most commonly used functions for clustering in R include:

kmeans: This is a base R function that performs k-means clustering. It is widely used, and is relatively fast and efficient, especially for large datasets.

hclust: This is a base R function that performs hierarchical clustering, which is a type of clustering that builds a hierarchy of clusters by iteratively merging the most similar clusters. Hierarchical clustering is well-suited for exploring the structure of a dataset, and can be useful for identifying relationships between points that are not immediately apparent.

dbscan: This is a function from the fpc package that performs density-based clustering, which is a type of clustering that groups together points that are surrounded by a high density of other points. Density-based clustering is useful for identifying clusters of points that are well-separated from other clusters, and can be effective for datasets with noisy or irregularly shaped clusters.

pam: This is a function from the cluster package that performs k-medoids clustering (partitioning around medoids), which is a variant of k-means clustering that uses medoids (i.e., actual data points that are representative of their cluster) rather than means to represent the clusters. K-medoids clustering can be more robust to noise and outliers than k-means clustering, and is often used in situations where the data are categorical or binary.

These are just a few examples of the many functions available for clustering in R. There are many other functions available, and the best function to use will depend on the specific needs of your application.
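As a quick, illustrative sketch, here is how three of these functions might be called on the numeric columns of the built-in iris dataset used later in this post. The eps and MinPts values for dbscan and the choice of k = 3 are assumptions for illustration, not tuned settings:

# Hierarchical clustering: build the tree from a Euclidean distance
# matrix, then cut it into 3 groups
hc = hclust(dist(iris[, 1:4]))
hc_clusters = cutree(hc, k = 3)

# Density-based clustering with fpc; eps and MinPts are illustrative values
# install.packages("fpc")
library(fpc)
db = dbscan(iris[, 1:4], eps = 0.5, MinPts = 5)

# K-medoids clustering with pam from the cluster package
# install.packages("cluster")
library(cluster)
pm = pam(iris[, 1:4], k = 3)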

Clustering with K-means

K-means clustering is a method of clustering data into a user-specified number of distinct groups or clusters. The algorithm works by iteratively assigning each data point to the cluster with the nearest centroid, based on the features of the data point, and then recomputing each cluster centroid as the mean of all the points assigned to that cluster. This process is repeated until the clusters stabilize and points are no longer reassigned to different clusters.

One key advantage of k-means clustering is that it is relatively fast and efficient, especially for large datasets. It is also easy to implement, as it only requires a few simple steps. However, one limitation of k-means clustering is that it requires the user to specify the number of clusters in advance, which can be difficult if the data are not clearly separated into distinct groups. In addition, k-means clustering can be sensitive to the initial placement of the cluster centroids, and may not always converge to the same solution.

Despite these limitations, k-means clustering is a widely used and effective method for clustering data, and is often used as a baseline method for comparison with other clustering algorithms. It is particularly well-suited for datasets with continuous features and relatively well-defined clusters, and is often used in a variety of applications, including data mining, image processing, and machine learning.

The steps of k-means clustering are as follows:

1. Choose the number of clusters, k.
2. Initialize the cluster centroids, for example by picking k data points at random.
3. Assign each data point to the cluster with the nearest centroid.
4. Recompute each centroid as the mean of the points assigned to its cluster.
5. Repeat steps 3 and 4 until the assignments no longer change.

Overall, the goal of k-means clustering is to iteratively improve the cluster assignments and cluster centroids until the clusters stabilize and the points are no longer being reassigned to different clusters. The algorithm is typically considered to have converged when the assignments of the points to clusters do not change between successive iterations.
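To make the assignment and update steps concrete, here is a minimal from-scratch sketch of the k-means loop. It is for illustration only: it does not handle clusters that become empty, and in practice you would use the built-in kmeans function shown below.

simple_kmeans = function(x, k, max_iter = 100) {
  x = as.matrix(x)
  # Step 2: initialize centroids from k randomly chosen rows
  centroids = x[sample(nrow(x), k), , drop = FALSE]
  assignment = rep(0, nrow(x))
  for (iter in 1:max_iter) {
    # Step 3: assign each point to its nearest centroid (Euclidean distance)
    d = as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    new_assignment = max.col(-d)
    # Step 5: converged once no assignment changes
    if (all(new_assignment == assignment)) break
    assignment = new_assignment
    # Step 4: recompute each centroid as the mean of its assigned points
    for (j in 1:k) {
      centroids[j, ] = colMeans(x[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Example: simple_kmeans(iris[, 1:4], k = 3)$cluster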

The Iris dataset is a small dataset that consists of 150 observations of iris flowers, with four features (sepal length, sepal width, petal length, and petal width) for each flower. The dataset includes three species of iris (Iris setosa, Iris virginica, and Iris versicolor), and the goal is to use clustering algorithms to identify the different species based on their features.

The Iris dataset ships with base R, which makes it a convenient toy dataset for a simple clustering exercise:

# Load the Iris dataset
data(iris)

# Access the data frame
df = iris

head(df)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

The following code extracts the four features we want to use for clustering (sepal length, sepal width, petal length, and petal width) and then performs k-means clustering with 3 clusters. The kmeans function returns a list that includes the cluster assignment for each observation, which we can print out using the print function.

# The kmeans function comes from the stats package, which is part of
# base R and attached by default, so no installation is needed
library(stats)
# Extract the features we want to use for clustering
features = df[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# Perform k-means clustering with 3 clusters; the centroids are initialized
# randomly, so results can vary between runs unless you set a seed
kmeansoutput = kmeans(features, centers = 3)

# Print the cluster assignments
print(kmeansoutput$cluster)
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
## [112] 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
## [149] 2 1

This clusters the observations in the df dataset into three clusters and prints out the cluster assignment for each one. You can then use the plot function to visualize the clusters and compare the true species labels with the k-means assignments, like this:

# Sepal dimensions, colored by true species and then by k-means cluster
plot(df$Sepal.Length, df$Sepal.Width, col = df$Species)
plot(df$Sepal.Length, df$Sepal.Width, col = kmeansoutput$cluster)

# Petal dimensions, colored by true species and then by k-means cluster
plot(df$Petal.Length, df$Petal.Width, col = df$Species)
plot(df$Petal.Length, df$Petal.Width, col = kmeansoutput$cluster)

# Plot all pairwise combinations of the features, colored by cluster
plot(features, col = kmeansoutput$cluster)

This will create a matrix of scatterplots of the features, with the points colored according to their cluster assignments. You can use these plots to visualize the clusters and see how well the clustering algorithm has separated the points into different clusters.

You can also use the table function to count the number of observations in each cluster, and to cross-tabulate the cluster assignments against the true species labels:

# Count the number of observations in each cluster
table(kmeansoutput$cluster)
## 
##  1  2  3 
## 62 38 50
table(df$Species, kmeansoutput$cluster)
##             
##               1  2  3
##   setosa      0  0 50
##   versicolor 48  2  0
##   virginica  14 36  0
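
The cross-tabulation shows that setosa is perfectly separated into its own cluster, while versicolor and virginica are partly mixed between the remaining two clusters.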

How to find the optimum k?

There are several methods (Elbow, Silhouette, Cross-Validation) that can be used to determine the optimum number of clusters (k) for k-means clustering in R. Here is one of the most commonly used methods:

Elbow method: The elbow method is a heuristic that involves fitting the k-means model for a range of values of k, and then selecting the value of k at the “elbow” of the plot of within-cluster sum of squares (WCSS) against the number of clusters. The WCSS measures how far the points in each cluster are from that cluster's centroid, and the idea behind the elbow method is to select the value of k beyond which adding more clusters no longer yields a substantial reduction in WCSS.
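
Formally, if the clusters are $C_1, \dots, C_k$ with centroids $\mu_1, \dots, \mu_k$, the WCSS is

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,$$

which is the total within-cluster sum of squares that R's kmeans object reports as tot.withinss.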

To compute the within-cluster sum of squares (WCSS) for a range of candidate values of k in R, you can fit the kmeans function from the stats package for each value of k and extract the tot.withinss component from the returned object. Here is an example of how you can do this:

# Fit the k-means model for a range of values of k
# (setting nstart > 1 in kmeans would average over several random
# initializations and give more stable WCSS values)
wcss = sapply(1:10, function(k) {
  kmeans(features, centers = k, iter.max = 10)$tot.withinss
})

# Plot WCSS against k
plot(1:10, wcss, type = "b", xlab = "Number of clusters (k)", ylab = "Within-cluster sum of squares (WCSS)")
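
The silhouette method mentioned above can be sketched in a similar way. This assumes the cluster package is installed: for each candidate k we compute the average silhouette width, where a higher value indicates better-separated clusters.

# Average silhouette width for k = 2..10 (the silhouette is undefined for k = 1)
library(cluster)
d = dist(features)
avg_sil = sapply(2:10, function(k) {
  cl = kmeans(features, centers = k)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

# Plot average silhouette width against k; prefer the k with the highest value
plot(2:10, avg_sil, type = "b", xlab = "Number of clusters (k)", ylab = "Average silhouette width")

Since both approaches are heuristics, it is worth checking whether they point to the same value of k before settling on one.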

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.9     purrr_0.3.4    
## [5] readr_2.1.2     tidyr_1.2.0     tibble_3.1.8    ggplot2_3.4.0  
## [9] tidyverse_1.3.2
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.2    xfun_0.32           bslib_0.4.0        
##  [4] haven_2.5.0         gargle_1.2.0        colorspace_2.0-3   
##  [7] vctrs_0.5.1         generics_0.1.3      htmltools_0.5.3    
## [10] yaml_2.3.5          utf8_1.2.2          rlang_1.0.6        
## [13] jquerylib_0.1.4     pillar_1.8.1        withr_2.5.0        
## [16] glue_1.6.2          DBI_1.1.3           dbplyr_2.2.1       
## [19] readxl_1.4.1        modelr_0.1.9        lifecycle_1.0.3    
## [22] munsell_0.5.0       gtable_0.3.0        cellranger_1.1.0   
## [25] rvest_1.0.2         evaluate_0.16       knitr_1.39         
## [28] tzdb_0.3.0          fastmap_1.1.0       fansi_1.0.3        
## [31] highr_0.9           broom_1.0.0         backports_1.4.1    
## [34] scales_1.2.0        googlesheets4_1.0.1 cachem_1.0.6       
## [37] jsonlite_1.8.0      fs_1.5.2            hms_1.1.2          
## [40] digest_0.6.29       stringi_1.7.8       grid_4.1.0         
## [43] cli_3.4.1           tools_4.1.0         magrittr_2.0.3     
## [46] sass_0.4.2          crayon_1.5.1        pkgconfig_2.0.3    
## [49] ellipsis_0.3.2      xml2_1.3.3          reprex_2.0.2       
## [52] googledrive_2.0.0   lubridate_1.8.0     assertthat_0.2.1   
## [55] rmarkdown_2.15      httr_1.4.4          rstudioapi_0.13    
## [58] R6_2.5.1            compiler_4.1.0