Clustering in R

There are many functions available in R for performing clustering, including both base R functions and functions from packages in the R ecosystem. Some of the most commonly used functions for clustering in R include:

kmeans: This is a base R function that performs k-means clustering. It is widely used, and is relatively fast and efficient, especially for large datasets.

hclust: This is a base R function that performs hierarchical clustering, which is a type of clustering that builds a hierarchy of clusters by iteratively merging the most similar clusters. Hierarchical clustering is well-suited for exploring the structure of a dataset, and can be useful for identifying relationships between points that are not immediately apparent.

dbscan: This is a function from the fpc package that performs density-based clustering, which is a type of clustering that groups together points that are surrounded by a high density of other points. Density-based clustering is useful for identifying clusters of points that are well-separated from other clusters, and can be effective for datasets with noisy or irregularly shaped clusters.

pam: This is a function from the cluster package that performs k-medoids clustering (partitioning around medoids), which is a variant of k-means clustering that uses medoids (i.e., actual data points that are representative of their cluster) rather than means to represent the clusters. K-medoids clustering can be more robust to noise and outliers than k-means clustering, and is often used in situations where the data are categorical or binary.

These are just a few examples of the many functions available for clustering in R. There are many other functions available, and the best function to use will depend on the specific needs of your application.
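As a quick, illustrative sketch, here is how three of these functions might be called on the numeric columns of the built-in iris dataset used later in this post. The eps and MinPts values for dbscan and the choice of k = 3 are assumptions for illustration, not tuned settings:

# Hierarchical clustering: build the tree from a Euclidean distance
# matrix, then cut it into 3 groups
hc = hclust(dist(iris[, 1:4]))
hc_clusters = cutree(hc, k = 3)

# Density-based clustering with fpc; eps and MinPts are illustrative values
# install.packages("fpc")
library(fpc)
db = dbscan(iris[, 1:4], eps = 0.5, MinPts = 5)

# K-medoids clustering with pam from the cluster package
# install.packages("cluster")
library(cluster)
pm = pam(iris[, 1:4], k = 3)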

Clustering with K-means

K-means clustering is a method of clustering data into a user-specified number of distinct groups or clusters. The algorithm works by iteratively assigning each data point to the cluster with the nearest centroid, based on the features of the data point, and then recomputing each cluster centroid as the mean of all the points assigned to that cluster. This process is repeated until the clusters stabilize and points are no longer reassigned to different clusters.

One key advantage of k-means clustering is that it is relatively fast and efficient, especially for large datasets. It is also easy to implement, as it only requires a few simple steps. However, one limitation of k-means clustering is that it requires the user to specify the number of clusters in advance, which can be difficult if the data are not clearly separated into distinct groups. In addition, k-means clustering can be sensitive to the initial placement of the cluster centroids, and may not always converge to the same solution.

Despite these limitations, k-means clustering is a widely used and effective method for clustering data, and is often used as a baseline method for comparison with other clustering algorithms. It is particularly well-suited for datasets with continuous features and relatively well-defined clusters, and is often used in a variety of applications, including data mining, image processing, and machine learning.

The steps of k-means clustering are as follows:

1. Choose the number of clusters, k.
2. Initialize the cluster centroids, for example by picking k data points at random.
3. Assign each data point to the cluster with the nearest centroid.
4. Recompute each centroid as the mean of the points assigned to its cluster.
5. Repeat steps 3 and 4 until the assignments no longer change.

Overall, the goal of k-means clustering is to iteratively improve the cluster assignments and cluster centroids until the clusters stabilize and the points are no longer being reassigned to different clusters. The algorithm is typically considered to have converged when the assignments of the points to clusters do not change between successive iterations.
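To make the assignment and update steps concrete, here is a minimal from-scratch sketch of the k-means loop. It is for illustration only: it does not handle clusters that become empty, and in practice you would use the built-in kmeans function shown below.

simple_kmeans = function(x, k, max_iter = 100) {
  x = as.matrix(x)
  # Step 2: initialize centroids from k randomly chosen rows
  centroids = x[sample(nrow(x), k), , drop = FALSE]
  assignment = rep(0, nrow(x))
  for (iter in 1:max_iter) {
    # Step 3: assign each point to its nearest centroid (Euclidean distance)
    d = as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    new_assignment = max.col(-d)
    # Step 5: converged once no assignment changes
    if (all(new_assignment == assignment)) break
    assignment = new_assignment
    # Step 4: recompute each centroid as the mean of its assigned points
    for (j in 1:k) {
      centroids[j, ] = colMeans(x[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Example: simple_kmeans(iris[, 1:4], k = 3)$cluster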

The Iris dataset is a small dataset that consists of 150 observations of iris flowers, with four features (sepal length, sepal width, petal length, and petal width) for each flower. The dataset includes three species of iris (Iris setosa, Iris virginica, and Iris versicolor), and the goal is to use clustering algorithms to identify the different species based on their features.

The Iris dataset ships with base R, which makes it a convenient toy dataset for a simple clustering exercise:

# Load the Iris dataset
data(iris)

# Access the data frame
df = iris

head(df)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

The following code extracts the four features we want to use for clustering (sepal length, sepal width, petal length, and petal width) and then performs k-means clustering with 3 clusters. The kmeans function returns a list that includes the cluster assignment for each observation, which we can print out using the print function.

# The kmeans function comes from the stats package, which is part of
# base R and attached by default, so no installation is needed
library(stats)
# Extract the features we want to use for clustering
features = df[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# Perform k-means clustering with 3 clusters; the centroids are initialized
# randomly, so results can vary between runs unless you set a seed
kmeansoutput = kmeans(features, centers = 3)

# Print the cluster assignments
print(kmeansoutput$cluster)
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
## [112] 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
## [149] 2 1

This clusters the observations in the df dataset into three clusters and prints out the cluster assignment for each one. You can then use the plot function to visualize the clusters and compare the true species labels with the k-means assignments, like this:

# Sepal dimensions, colored by true species and then by k-means cluster
plot(df$Sepal.Length, df$Sepal.Width, col = df$Species)
plot(df$Sepal.Length, df$Sepal.Width, col = kmeansoutput$cluster)

# Petal dimensions, colored by true species and then by k-means cluster
plot(df$Petal.Length, df$Petal.Width, col = df$Species)
plot(df$Petal.Length, df$Petal.Width, col = kmeansoutput$cluster)

# Plot all pairwise combinations of the features, colored by cluster
plot(features, col = kmeansoutput$cluster)

This will create a matrix of scatterplots of the features, with the points colored according to their cluster assignments. You can use these plots to visualize the clusters and see how well the clustering algorithm has separated the points into different clusters.

You can also use the table function to count the number of observations in each cluster, and to cross-tabulate the cluster assignments against the true species labels:

# Count the number of observations in each cluster
table(kmeansoutput$cluster)
## 
##  1  2  3 
## 62 38 50
table(df$Species, kmeansoutput$cluster)
##             
##               1  2  3
##   setosa      0  0 50
##   versicolor 48  2  0
##   virginica  14 36  0
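
The cross-tabulation shows that setosa is perfectly separated into its own cluster, while versicolor and virginica are partly mixed between the remaining two clusters.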

How to find the optimum k?

There are several methods (Elbow, Silhouette, Cross-Validation) that can be used to determine the optimum number of clusters (k) for k-means clustering in R. Here is one of the most commonly used methods:

Elbow method: The elbow method is a heuristic that involves fitting the k-means model for a range of values of k, and then selecting the value of k at the “elbow” of the plot of within-cluster sum of squares (WCSS) against the number of clusters. The WCSS measures how far the points in each cluster are from that cluster's centroid, and the idea behind the elbow method is to select the value of k beyond which adding more clusters no longer yields a substantial reduction in WCSS.
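
Formally, if the clusters are $C_1, \dots, C_k$ with centroids $\mu_1, \dots, \mu_k$, the WCSS is

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,$$

which is the total within-cluster sum of squares that R's kmeans object reports as tot.withinss.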

To compute the within-cluster sum of squares (WCSS) for a range of candidate values of k in R, you can fit the kmeans function from the stats package for each value of k and extract the tot.withinss component from the returned object. Here is an example of how you can do this:

# Fit the k-means model for a range of values of k
# (setting nstart > 1 in kmeans would average over several random
# initializations and give more stable WCSS values)
wcss = sapply(1:10, function(k) {
  kmeans(features, centers = k, iter.max = 10)$tot.withinss
})

# Plot WCSS against k
plot(1:10, wcss, type = "b", xlab = "Number of clusters (k)", ylab = "Within-cluster sum of squares (WCSS)")
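
The silhouette method mentioned above can be sketched in a similar way. This assumes the cluster package is installed: for each candidate k we compute the average silhouette width, where a higher value indicates better-separated clusters.

# Average silhouette width for k = 2..10 (the silhouette is undefined for k = 1)
library(cluster)
d = dist(features)
avg_sil = sapply(2:10, function(k) {
  cl = kmeans(features, centers = k)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

# Plot average silhouette width against k; prefer the k with the highest value
plot(2:10, avg_sil, type = "b", xlab = "Number of clusters (k)", ylab = "Average silhouette width")

Since both approaches are heuristics, it is worth checking whether they point to the same value of k before settling on one.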

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.9     purrr_0.3.4    
## [5] readr_2.1.2     tidyr_1.2.0     tibble_3.1.8    ggplot2_3.4.0  
## [9] tidyverse_1.3.2
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.2    xfun_0.32           bslib_0.4.0        
##  [4] haven_2.5.0         gargle_1.2.0        colorspace_2.0-3   
##  [7] vctrs_0.5.1         generics_0.1.3      htmltools_0.5.3    
## [10] yaml_2.3.5          utf8_1.2.2          rlang_1.0.6        
## [13] jquerylib_0.1.4     pillar_1.8.1        withr_2.5.0        
## [16] glue_1.6.2          DBI_1.1.3           dbplyr_2.2.1       
## [19] readxl_1.4.1        modelr_0.1.9        lifecycle_1.0.3    
## [22] munsell_0.5.0       gtable_0.3.0        cellranger_1.1.0   
## [25] rvest_1.0.2         evaluate_0.16       knitr_1.39         
## [28] tzdb_0.3.0          fastmap_1.1.0       fansi_1.0.3        
## [31] highr_0.9           broom_1.0.0         backports_1.4.1    
## [34] scales_1.2.0        googlesheets4_1.0.1 cachem_1.0.6       
## [37] jsonlite_1.8.0      fs_1.5.2            hms_1.1.2          
## [40] digest_0.6.29       stringi_1.7.8       grid_4.1.0         
## [43] cli_3.4.1           tools_4.1.0         magrittr_2.0.3     
## [46] sass_0.4.2          crayon_1.5.1        pkgconfig_2.0.3    
## [49] ellipsis_0.3.2      xml2_1.3.3          reprex_2.0.2       
## [52] googledrive_2.0.0   lubridate_1.8.0     assertthat_0.2.1   
## [55] rmarkdown_2.15      httr_1.4.4          rstudioapi_0.13    
## [58] R6_2.5.1            compiler_4.1.0