1 Introduction
Clustering is an unsupervised machine learning technique that partitions data points into groups based on similarity or distance measures, and it is applied across a wide range of domains. R, a popular programming language for statistical computing, offers numerous packages that implement clustering methods. This paper reviews the clustering packages available in R, focusing on their classes, outcomes, and switching possibilities. We also discuss the theoretical aspects of these methods and provide recommendations for selecting a suitable clustering algorithm based on data characteristics and desired results.
2 Clustering Methods in R
This section discusses some widely used clustering packages in R, grouped into four categories: hierarchical clustering, partitioning methods, model-based clustering, and density-based clustering.
2.1 Hierarchical Clustering
Hierarchical clustering builds a tree-like structure called a dendrogram to represent the nested clustering structure of the data. R offers the following packages for hierarchical clustering:
stats: The stats package, part of the base R distribution, includes the hclust() function for agglomerative hierarchical clustering. It supports several linkage methods, including single, complete, average, and Ward’s method.
fastcluster: This package provides a faster implementation of hierarchical clustering algorithms, including a replacement for hclust() with the same interface. It handles large datasets more efficiently than the stats implementation.
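As a minimal sketch of the workflow described above, the following example runs hclust() on the built-in iris data. The choice of dataset, the standardization step, and k = 3 are illustrative assumptions, not recommendations from this review.

```r
# Agglomerative hierarchical clustering with stats::hclust().
x <- scale(iris[, 1:4])             # standardize the four numeric columns
d <- dist(x, method = "euclidean")  # pairwise distance matrix

hc_average <- hclust(d, method = "average")
hc_ward    <- hclust(d, method = "ward.D2")  # Ward's criterion on unsquared distances

# Cut the Ward dendrogram into three clusters (k = 3 is an illustrative choice).
groups <- cutree(hc_ward, k = 3)
table(groups)

# plot(hc_ward, labels = FALSE)  # uncomment to draw the dendrogram
```

If fastcluster is installed, calling fastcluster::hclust(d, method = "ward.D2") in place of hclust() should return an object of the same class, so the cutree() step stays unchanged.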
2.2 Partitioning Methods
Partitioning methods divide data points into a predefined number of non-overlapping clusters. The most popular partitioning method is k-means clustering. R offers the following packages for partitioning methods:
stats: The stats package, part of the base R distribution, provides the kmeans() function for implementing the k-means clustering algorithm.
cluster: This package offers additional partitioning methods such as PAM (Partitioning Around Medoids) and CLARA (Clustering Large Applications).
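The partitioning functions named above can be compared side by side. The sketch below assumes the iris data, k = 3, and arbitrary seed and restart values; it is meant to show the shape of the returned objects rather than a tuned analysis.

```r
library(cluster)  # recommended package shipped with standard R distributions

x <- scale(iris[, 1:4])

set.seed(42)                               # k-means starts from random centers
km <- kmeans(x, centers = 3, nstart = 25)  # keep the best of 25 random starts
pm <- pam(x, k = 3)                        # medoids are actual observations
cl <- clara(x, k = 3, samples = 20)        # PAM on subsamples, aimed at larger data

# Cross-tabulate the k-means and PAM assignments.
table(kmeans = km$cluster, pam = pm$clustering)

km$tot.withinss  # total within-cluster sum of squares
pm$medoids       # coordinates of the three medoid observations
```

Note that kmeans() stores labels in $cluster, whereas pam() and clara() store them in $clustering; cluster numbering is arbitrary, so the same group may carry different labels in each result.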
2.3 Model-based Clustering
Model-based clustering methods employ probabilistic models to define the clustering structure. Gaussian mixture models (GMM) are the most common model-based clustering methods. R offers the following packages for model-based clustering:
mclust: This package delivers model-based clustering using GMM with the Expectation-Maximization (EM) algorithm. It provides various covariance structures and model selection criteria such as the Bayesian Information Criterion (BIC).
flexmix: The flexmix package offers a general framework for finite mixture models, allowing users to define their own models for clustering.
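To illustrate model-based clustering with BIC-driven model selection, the following sketch fits Gaussian mixtures with mclust. It assumes the mclust package is installed from CRAN; the iris data and the range G = 1:5 are illustrative choices.

```r
library(mclust)  # assumed to be installed from CRAN

x <- iris[, 1:4]

# Fit Gaussian mixture models with 1 to 5 components by EM and
# keep the covariance structure and component count preferred by BIC.
fit <- Mclust(x, G = 1:5)

fit$modelName             # code of the selected covariance structure
fit$G                     # selected number of components
head(fit$classification)  # hard cluster assignments
head(round(fit$z, 3))     # posterior membership probabilities per component
```

Unlike the partitioning methods, the fitted object carries soft assignments in $z, which can be useful when cluster membership is uncertain.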
2.4 Density-based Clustering
Density-based clustering methods identify clusters as dense regions in the data space, separated by regions of lower density. The most widely known density-based clustering algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise). R offers the following packages for density-based clustering:
dbscan: This package provides an efficient implementation of the DBSCAN algorithm and its variants such as OPTICS (Ordering Points to Identify the Clustering Structure).
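fpc: The fpc package (Flexible Procedures for Clustering) includes its own dbscan() implementation alongside cluster validation utilities such as cluster.stats() and clusterboot(). To our knowledge it does not implement OPTICS, which is provided by the dbscan package.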
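A minimal density-based example follows, assuming the dbscan package is installed from CRAN. The values of eps and minPts are illustrative and untuned; in practice they depend strongly on the scale of the data.

```r
library(dbscan)  # assumed to be installed from CRAN

x <- as.matrix(iris[, 1:4])

# DBSCAN: points that belong to no dense region are labelled 0 (noise).
db <- dbscan(x, eps = 0.5, minPts = 5)
table(db$cluster)

# OPTICS computes a density-based ordering up to an upper radius eps;
# clusters are extracted in a second step with a threshold eps_cl <= eps.
op    <- optics(x, eps = 1, minPts = 5)
op_cl <- extractDBSCAN(op, eps_cl = 0.5)
table(op_cl$cluster)
```

If fpc is loaded at the same time, both packages export a function named dbscan(), so qualifying the call as dbscan::dbscan() or fpc::dbscan() avoids ambiguity. The fpc version names its density parameter MinPts rather than minPts.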
3 Switching Possibilities
Many R clustering packages interoperate, because their results are returned as standard objects that other packages accept. For instance, clustering results from the stats package can be visualized using the dendextend package for hierarchical clustering or the factoextra package for k-means clustering. Additionally, some packages wrap algorithms that are available in other packages. The fpc package, for example, provides kmeansruns() and pamk(), which call kmeans() and the pam() and clara() functions of cluster while estimating the number of clusters, making it easier to move between different clustering methods.
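The following sketch shows one way results from different methods can be brought into a common form. It uses only base R; the commented lines assume that factoextra and dendextend are installed and are included as optional illustrations.

```r
x <- scale(iris[, 1:4])

hc <- hclust(dist(x), method = "ward.D2")
set.seed(1)
km <- kmeans(x, centers = 3, nstart = 25)

# Both results reduce to an integer label vector of the same length,
# so they can be cross-tabulated regardless of the method that produced them.
labels_hc <- cutree(hc, k = 3)
labels_km <- km$cluster
agreement <- table(hclust = labels_hc, kmeans = labels_km)
agreement

# An hclust object converts to the generic "dendrogram" class,
# which dendrogram-oriented packages build on.
dend <- as.dendrogram(hc)
class(dend)

# Optional visualizations, if the packages are installed:
# factoextra::fviz_cluster(km, data = x)
# plot(dendextend::color_branches(dend, k = 3))
```

Because cluster numbering is arbitrary, agreement between two methods shows up as one dominant cell per row of the table rather than as a filled diagonal.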
References:
Kaufman, L., & Rousseeuw, P. J. (2009). Finding groups in data: An introduction to cluster analysis. John Wiley & Sons.
Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458), 611-631.
Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).
Hahsler, M., & Piekenbrock, M. (2021). dbscan: Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms. R package version 1.1-8. https://CRAN.R-project.org/package=dbscan
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., & Hornik, K. (2021). cluster: Cluster Analysis Basics and Extensions. R package version 2.1.2. https://CRAN.R-project.org/package=cluster
Grün, B., & Leisch, F. (2007). Fitting finite mixtures of generalized linear regressions in R. Computational Statistics & Data Analysis, 51(11), 5247-5252.