Introduction

Clustering techniques are essential in data science for identifying underlying patterns or structures within a dataset. In this paper, we analyze the theoretical features of three well-known clustering methods: K-means, Partitioning Around Medoids (PAM), and Clustering Large Applications (CLARA). Furthermore, we provide practical R implementations, with examples and plots, to illustrate the similarities and differences between these methods.

The K-means, PAM, and CLARA clustering algorithms are widely used in a variety of fields, including marketing, computer vision, and natural language processing. Each method has its unique characteristics, strengths, and weaknesses, which are essential to consider when selecting the appropriate algorithm for a given dataset.

K-means Clustering

K-means is a partition-based clustering algorithm that aims to minimize the sum of squared distances between data points and their corresponding cluster centroids. It starts by initializing centroids randomly, then iterates between assigning each point to its nearest centroid and updating each centroid to the mean of its assigned points, until the assignments no longer change.

library(cluster)
set.seed(123)

# Simulate two 2-D Gaussian clusters of 50 points each
data <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
              matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# K-means clustering
kmeans_result <- kmeans(data, centers = 2)
plot(data, col = kmeans_result$cluster, main = "K-means Clustering")
points(kmeans_result$centers, col = 1:2, pch = 8, cex = 2)
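
Because K-means depends on the random initial centroids, different runs can converge to different local optima. A minimal sketch of guarding against this with the nstart argument (the value 25 is a common heuristic, not a fixed rule):

# Run 25 random initializations and keep the best solution
kmeans_multi <- kmeans(data, centers = 2, nstart = 25)

# Total within-cluster sum of squares: the quantity K-means minimizes
kmeans_multi$tot.withinss

# Number of points assigned to each cluster
table(kmeans_multi$cluster)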

Partitioning Around Medoids (PAM)

PAM, also known as K-medoids, is an alternative to K-means that aims to minimize the sum of dissimilarities between data points and their corresponding cluster medoids. Medoids are actual data points that are representative of their cluster. PAM is less sensitive to outliers than K-means because its cluster centers are constrained to be observed points and the dissimilarities are not squared.

# PAM clustering
pam_result <- pam(data, k = 2)
plot(data, col = pam_result$clustering, main = "PAM Clustering")
# pam() returns the medoid coordinates directly in $medoids
points(pam_result$medoids, col = 1:2, pch = 8, cex = 2)
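
PAM's robustness can be illustrated by appending a single extreme point and comparing how the cluster centers react; a small sketch in which the outlier's coordinates are arbitrary:

# Append one extreme outlier (coordinates chosen purely for illustration)
data_out <- rbind(data, c(10, 10))

# The K-means centroid of the affected cluster is pulled toward the outlier
kmeans_out <- kmeans(data_out, centers = 2, nstart = 25)
kmeans_out$centers

# The PAM medoids, being actual data points, remain near the cluster cores
pam_out <- pam(data_out, k = 2)
pam_out$medoids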

Clustering LARge Applications (CLARA)

CLARA is an extension of PAM designed to handle large datasets efficiently. CLARA draws a random sample of the data, applies PAM to the sample, and repeats this process several times; the set of medoids with the lowest average dissimilarity over the entire dataset is retained. CLARA is faster than PAM for large datasets, but the clustering quality may be slightly lower because the medoids are chosen from a sample rather than from all observations.

# CLARA clustering
clara_result <- clara(data, k = 2)
plot(data, col = clara_result$clustering, main = "CLARA Clustering")
# clara() likewise returns the medoid coordinates in $medoids
points(clara_result$medoids, col = 1:2, pch = 8, cex = 2)
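
The trade-off between speed and quality is controlled by clara()'s samples and sampsize arguments (the number of samples drawn and the size of each). A sketch on a larger simulated dataset, with sizes chosen purely for illustration:

# Simulate a dataset large enough that running PAM directly would be slow
big_data <- rbind(matrix(rnorm(10000, sd = 0.3), ncol = 2),
                  matrix(rnorm(10000, mean = 1, sd = 0.3), ncol = 2))

# Draw 10 samples of 100 points, run PAM on each sample, and keep the
# medoids with the lowest average dissimilarity over the full dataset
clara_big <- clara(big_data, k = 2, samples = 10, sampsize = 100)
clara_big$objective  # average dissimilarity of the retained clustering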

Comparison of K-means, PAM, and CLARA

Theoretical Features

  1. Objective function: K-means minimizes the sum of squared distances between points and centroids, while PAM and CLARA minimize the sum of unsquared dissimilarities between points and medoids (see the sketch after this list).

  2. Centroid vs. medoid: K-means uses the mean of the assigned points as the cluster center, while PAM and CLARA use the medoid, which is an actual data point.

  3. Outlier sensitivity: K-means is more sensitive to outliers due to the squared distance metric, while PAM and CLARA are less sensitive because they use unsquared dissimilarities.

  4. Scalability: K-means is generally faster and more scalable than PAM for large datasets. CLARA is designed to address the scalability issue of PAM by using sampling techniques.
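
These objective values can be inspected directly on the fitted objects from the examples above; note that the K-means value is a sum of squared distances while the PAM and CLARA values are average dissimilarities, so the numbers are not directly comparable:

# K-means: total within-cluster sum of squared distances
kmeans_result$tot.withinss

# PAM: average dissimilarity after the swap phase
pam_result$objective["swap"]

# CLARA: average dissimilarity of the best sample's medoids
clara_result$objective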

# Comparison plots
par(mfrow = c(1, 3))

plot(data, col = kmeans_result$cluster, main = "K-means Clustering")
points(kmeans_result$centers, col = 1:2, pch = 8, cex = 2)

plot(data, col = pam_result$clustering, main = "PAM Clustering")
points(pam_result$medoids, col = 1:2, pch = 8, cex = 2)

plot(data, col = clara_result$clustering, main = "CLARA Clustering")
points(clara_result$medoids, col = 1:2, pch = 8, cex = 2)

par(mfrow = c(1, 1))  # restore the default single-panel layout

Conclusion

In summary, K-means, PAM, and CLARA are popular clustering algorithms with distinct theoretical features. K-means is faster and more scalable but is sensitive to outliers due to its reliance on the mean and the squared distance metric. PAM addresses the outlier sensitivity issue by using medoids and unsquared dissimilarities, but it is not as scalable as K-means. CLARA extends PAM to handle large datasets through sampling, trading off some clustering quality for increased scalability.

When selecting a clustering algorithm, it is essential to consider the specific characteristics of the dataset and the desired clustering properties. K-means might be suitable for large datasets without significant outliers, while PAM or CLARA might be preferred for datasets with outliers or when using non-Euclidean distance metrics. Further analysis can be conducted using other clustering techniques and comparing their performance to gain a deeper understanding of the most appropriate method for a given dataset.