R provides many functions for performing clustering, both in base R and in contributed packages. Some of the most commonly used clustering functions in R include:
kmeans: This is a base R function that performs k-means clustering. It is popular and widely used, and is relatively fast and efficient, especially for large datasets.
hclust: This is a base R function that performs hierarchical clustering, a type of clustering that builds a hierarchy of clusters by iteratively merging the most similar clusters. Hierarchical clustering is well suited to exploring the structure of a dataset, and can be useful for identifying relationships between points that are not immediately apparent.
dbscan: This is a function from the fpc package that performs density-based clustering, a type of clustering that groups together points that are surrounded by a high density of other points. Density-based clustering is useful for identifying clusters that are well separated from other clusters, and can be effective for datasets with noisy or irregularly shaped clusters.
pam: This is a function from the cluster package that performs k-medoids clustering (partitioning around medoids), a variant of k-means clustering that uses medoids (i.e., representative data points) rather than means to represent the clusters. K-medoids clustering can be more robust to noise and outliers than k-means clustering, and is often used in situations where the data are categorical or binary.
These are just a few examples of the many functions available for clustering in R. There are many other functions available, and the best function to use will depend on the specific needs of your application.
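To make the options above concrete, here is a minimal sketch of how each of these functions can be called on the iris measurements. This assumes the fpc and cluster packages are installed, and the eps and k values are illustrative choices, not tuned settings.
# A sketch of the other clustering approaches applied to the iris measurements
features = iris[, 1:4]
# Hierarchical clustering: build a dendrogram, then cut it into 3 clusters
hc = hclust(dist(features))
hc_clusters = cutree(hc, k = 3)
# Density-based clustering with the fpc package
# install.packages("fpc")
library(fpc)
db = dbscan(features, eps = 0.5, MinPts = 5)
# K-medoids clustering with the cluster package (implemented as pam)
# install.packages("cluster")
library(cluster)
pm = pam(features, k = 3)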
K-means clustering is a method of clustering data into a user-specified number of distinct groups or clusters. The algorithm works by iteratively assigning each data point to the nearest cluster, based on the features of the data point, and then updating the cluster centroids (i.e., the mean of all the points in the cluster) to be the center of the new cluster. This process is repeated until the clusters stabilize and the points are no longer reassigned to different clusters.
One key advantage of k-means clustering is that it is relatively fast and efficient, especially for large datasets. It is also easy to implement, as it only requires a few simple steps. However, one limitation of k-means clustering is that it requires the user to specify the number of clusters in advance, which can be difficult if the data are not clearly separated into distinct groups. In addition, k-means clustering can be sensitive to the initial placement of the cluster centroids, and may not always converge to the same solution.
Despite these limitations, k-means clustering is a widely used and effective method for clustering data, and is often used as a baseline method for comparison with other clustering algorithms. It is particularly well-suited for datasets with continuous features and relatively well-defined clusters, and is often used in a variety of applications, including data mining, image processing, and machine learning.
The steps of k-means clustering are as follows:
Specify the number of clusters: The first step in k-means clustering is to specify the number of clusters that you want to identify in the data. This is typically done by the user, and can be based on prior knowledge of the data or by using some form of model selection criteria (e.g., the elbow method).
Initialize the cluster centroids: The next step is to randomly initialize the cluster centroids. This is typically done by selecting k data points from the dataset at random, and using these points as the initial cluster centroids.
Assign each point to the nearest cluster: For each data point, the algorithm calculates the distance to each cluster centroid and assigns the point to the cluster with the nearest centroid.
Update the cluster centroids: The algorithm then updates the cluster centroids to be the mean of all the points in the cluster.
Repeat steps 3 and 4 until convergence: The algorithm repeats steps 3 and 4 until the clusters stabilize and the points are no longer being reassigned to different clusters.
Output the final clusters: Once the algorithm has converged, the final clusters are output, along with the cluster assignments for each data point.
Overall, the goal of k-means clustering is to iteratively improve the cluster assignments and cluster centroids until the clusters stabilize and the points are no longer being reassigned to different clusters. The algorithm is typically considered to have converged when the assignments of the points to clusters do not change between successive iterations.
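To make the steps above concrete, here is a from-scratch sketch of the k-means loop. It is for illustration only; in practice you would use the built-in kmeans function, and details such as empty-cluster handling are omitted for brevity.
# A from-scratch sketch of the k-means algorithm described above
simple_kmeans = function(x, k, max_iter = 100) {
  x = as.matrix(x)
  # Step 2: initialize centroids by picking k data points at random
  centroids = x[sample(nrow(x), k), , drop = FALSE]
  assignment = rep(0, nrow(x))
  for (iter in 1:max_iter) {
    # Step 3: assign each point to the nearest centroid (squared Euclidean distance)
    dists = sapply(1:k, function(j) colSums((t(x) - centroids[j, ])^2))
    new_assignment = max.col(-dists)
    # Step 5: stop when no assignment changes between iterations
    if (all(new_assignment == assignment)) break
    assignment = new_assignment
    # Step 4: move each centroid to the mean of its assigned points
    for (j in 1:k) centroids[j, ] = colMeans(x[assignment == j, , drop = FALSE])
  }
  list(cluster = assignment, centers = centroids)
}
# Example: cluster the iris measurements into 3 groups
# simple_kmeans(iris[, 1:4], k = 3)$cluster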
The Iris dataset is a small dataset that consists of 150 observations of iris flowers, with four features (sepal length, sepal width, petal length, and petal width) for each flower. The dataset includes three species of iris (Iris setosa, Iris virginica, and Iris versicolor), and the goal is to use clustering algorithms to identify the different species based on their features.
The Iris dataset ships with R and makes a good toy example for a simple clustering exercise. Here is how to load it:
# Load the Iris dataset
data(iris)
# Access the data frame
df = iris
head(df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
The code below extracts the four features that we want to use for clustering (sepal length, sepal width, petal length, and petal width) and then performs k-means clustering with 3 clusters. The kmeans function returns a list that includes the cluster assignment for each observation, which we can then print out using the print function.
# The stats package (which provides kmeans) is part of base R and is loaded
# by default, so no installation is needed
library(stats)
# Extract the features we want to use for clustering
features = df[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
# Perform k-means clustering with 3 clusters
kmeansoutput = kmeans(features, centers = 3)
# Print the cluster assignments
print(kmeansoutput$cluster)
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
## [112] 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
## [149] 2 1
This clusters the points in the df dataset into three clusters and prints out the cluster assignment for each point. You can then use the plot function to visualize the clusters, like this:
# Sepal dimensions, colored by true species and by k-means cluster
plot(df$Sepal.Length, df$Sepal.Width, col = df$Species)
plot(df$Sepal.Length, df$Sepal.Width, col = kmeansoutput$cluster)
# Petal dimensions, colored by true species and by k-means cluster
plot(df$Petal.Length, df$Petal.Width, col = df$Species)
plot(df$Petal.Length, df$Petal.Width, col = kmeansoutput$cluster)
# Plot all pairs of features, colored by cluster assignment
plot(features, col = kmeansoutput$cluster)
This will create a scatterplot of the features, with the points colored according to their cluster assignments. You can use this plot to visualize the clusters and see how well the clustering algorithm has separated the points into different clusters.
You can also use the table function to count the number of observations in each cluster:
# Count the number of observations in each cluster
table(kmeansoutput$cluster)
##
## 1 2 3
## 62 38 50
# Compare the cluster assignments to the true species labels
table(df$Species, kmeansoutput$cluster)
##
## 1 2 3
## setosa 0 0 50
## versicolor 48 2 0
## virginica 14 36 0
There are several methods (Elbow, Silhouette, Cross-Validation) that can be used to determine the optimum number of clusters (k) for k-means clustering in R. Here is one of the most commonly used methods:
Elbow method: The elbow method is a heuristic that involves fitting the k-means model for a range of values of k, and then selecting the value of k at the "elbow" of the plot of within-cluster sum of squares (WCSS) against the number of clusters. The WCSS is a measure of how far the points in a cluster are from the centroid of that cluster, and the idea behind the elbow method is to select the value of k that results in a significant reduction in WCSS while still retaining a reasonable number of clusters.
To compute the within-cluster sum of squares (WCSS) for a range of values of k, you can use the kmeans function from the stats package and extract the tot.withinss component from the returned object. Here is an example of how you can do this:
# Fit the k-means model for a range of values of k
wcss = sapply(1:10, function(k) {
kmeans(features, centers = k, iter.max = 10)$tot.withinss
})
# Plot WCSS against k
plot(1:10, wcss, type = "b", xlab = "Number of clusters (k)", ylab = "Within-cluster sum of squares (WCSS)")
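The silhouette method mentioned above can be applied in much the same way. Here is a minimal sketch that computes the average silhouette width for a range of values of k using the silhouette function from the cluster package (this assumes the cluster package is installed; larger average widths indicate better-separated clusters, and results will vary slightly between runs because of the random initialization of kmeans):
# Average silhouette width for k = 2..10 (silhouettes are undefined for k = 1)
library(cluster)
d = dist(features)
avg_sil = sapply(2:10, function(k) {
  cl = kmeans(features, centers = k)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
# Plot average silhouette width against k
plot(2:10, avg_sil, type = "b", xlab = "Number of clusters (k)", ylab = "Average silhouette width")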
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4
## [5] readr_2.1.2 tidyr_1.2.0 tibble_3.1.8 ggplot2_3.4.0
## [9] tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.2 xfun_0.32 bslib_0.4.0
## [4] haven_2.5.0 gargle_1.2.0 colorspace_2.0-3
## [7] vctrs_0.5.1 generics_0.1.3 htmltools_0.5.3
## [10] yaml_2.3.5 utf8_1.2.2 rlang_1.0.6
## [13] jquerylib_0.1.4 pillar_1.8.1 withr_2.5.0
## [16] glue_1.6.2 DBI_1.1.3 dbplyr_2.2.1
## [19] readxl_1.4.1 modelr_0.1.9 lifecycle_1.0.3
## [22] munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0
## [25] rvest_1.0.2 evaluate_0.16 knitr_1.39
## [28] tzdb_0.3.0 fastmap_1.1.0 fansi_1.0.3
## [31] highr_0.9 broom_1.0.0 backports_1.4.1
## [34] scales_1.2.0 googlesheets4_1.0.1 cachem_1.0.6
## [37] jsonlite_1.8.0 fs_1.5.2 hms_1.1.2
## [40] digest_0.6.29 stringi_1.7.8 grid_4.1.0
## [43] cli_3.4.1 tools_4.1.0 magrittr_2.0.3
## [46] sass_0.4.2 crayon_1.5.1 pkgconfig_2.0.3
## [49] ellipsis_0.3.2 xml2_1.3.3 reprex_2.0.2
## [52] googledrive_2.0.0 lubridate_1.8.0 assertthat_0.2.1
## [55] rmarkdown_2.15 httr_1.4.4 rstudioapi_0.13
## [58] R6_2.5.1 compiler_4.1.0