K-means clustering is a widely used unsupervised learning method that partitions a dataset into a pre-specified number of clusters. The goal is to group observations in such a way that those within the same cluster are as similar as possible, while observations in different clusters are as dissimilar as possible.
The method works by assigning each observation to one of k clusters, where each cluster is represented by its centroid (the mean of all observations in that cluster). The algorithm iteratively updates cluster assignments and centroids until the within-cluster variation is minimised.
In the following section, we will learn how to perform k-means clustering in R, how to choose the number of clusters, and how to interpret and visualise the resulting cluster structure.
In R, k-means clustering is performed using the kmeans()
function. This function partitions the data into a pre-specified number
of clusters by iteratively assigning observations to the nearest cluster
centroid.
Syntax:
kmeans(x, centers, nstart = 25)Where:
- x: Data matrix or data frame containing the observations (rows) and variables (columns)
- centers: Specifies the number of clusters (k) to form.
- nstart: Indicates how many random initial configurations are tried (higher values improve stability; 25 is commonly recommended).
The resulting object (km) contains several useful
components, including cluster assignments (km$cluster),
cluster centroids (km$centers), and within-cluster
variation measures (km$tot.withinss), which help assess the
quality of the clustering solution.
First we need to read in our data into R. Throughtout this example we
will use the wine data. These data are the results of a
chemical analysis of wines grown in the same region in Italy but derived
from three different cultivars. The analysis determined the quantities
of 13 constituents found in each of the three types of wines.
The attributes are:
The wine data is in a .txt format, so to read in the
data we can use the read.table() function in R.
wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")
colnames(wine) <- c("Cultivar","Alcohol","Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Flavanoids","Nonflavanoid phenols","Proanthocyanins","Color intensity","Hue","OD280/OD315 of diluted wines","Proline")
dim(wine)
#> [1] 178 14
head(wine, 5)
#> Cultivar Alcohol Malic acid Ash Alcalinity of ash Magnesium Total phenols
#> 1 1 14.23 1.71 2.43 15.6 127 2.80
#> 2 1 13.20 1.78 2.14 11.2 100 2.65
#> 3 1 13.16 2.36 2.67 18.6 101 2.80
#> 4 1 14.37 1.95 2.50 16.8 113 3.85
#> 5 1 13.24 2.59 2.87 21.0 118 2.80
#> Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue
#> 1 3.06 0.28 2.29 5.64 1.04
#> 2 2.76 0.26 1.28 4.38 1.05
#> 3 3.24 0.30 2.81 5.68 1.03
#> 4 3.49 0.24 2.18 7.80 0.86
#> 5 2.69 0.39 1.82 4.32 1.04
#> OD280/OD315 of diluted wines Proline
#> 1 3.92 1065
#> 2 3.40 1050
#> 3 3.17 1185
#> 4 3.45 1480
#> 5 2.93 735The wine dataset contains 178 observations of 14
variables, including the 13 measured quantities of chemicals and the
variable Cultivar, which indicates the type of grape from which the wine
was produced.
The measured attributes have very different ranges, so the data should be standardized before performing hierarchical clustering to ensure that all variables contribute equally to the analysis.
To determine the appropriate number of clusters in k-means
clustering, several methods can be used, including the elbow method, the
average silhouette method, and the gap statistic. In R, the first two
methods can be implemented easily using the fviz_nbclust()
function from the factoextra package, while the gap
statistic is available through the cluster package.
The elbow method evaluates the total within-cluster variation for different values of k. The “elbow” point, where the rate of decrease starts to slow down, suggests a suitable number of clusters.
library(factoextra)
#> Warning: package 'factoextra' was built under R version 4.5.3
#> Loading required package: ggplot2
#> Warning: package 'ggplot2' was built under R version 4.5.3
#> Welcome to factoextra!
#> Want to learn more? See two factoextra-related books at https://www.datanovia.com/en/product/practical-guide-to-principal-component-methods-in-r/
fviz_nbclust(x = wine_stand[,2:14],FUNcluster = kmeans, method = 'wss' )The silhouette method assesses how well observations are clustered by comparing within-cluster similarity to between-cluster dissimilarity. Higher average silhouette values indicate a better clustering structure.
fviz_nbclust(x = wine_stand[,2:14],FUNcluster = kmeans, method = 'silhouette' ) #Average silhouette methodThe gap statistic compares the observed within-cluster dispersion to that expected under a reference (random) distribution of the data. The optimal number of clusters is typically chosen as the smallest k such that the gap statistic is within one standard deviation of its maximum.
library(cluster)
#> Warning: package 'cluster' was built under R version 4.5.3
set.seed(123)
gap_stat <- clusGap(x = wine_stand[,2:14], FUN = kmeans, K.max = 15, nstart = 25, B = 50 )
fviz_gap_stat(gap_stat)Together, these approaches provide complementary information to help determine an appropriate number of clusters before fitting the final k-means model.
In addition to visual methods such as the elbow plot, silhouette
plot, and gap statistic, the NbClust package provides a
more comprehensive approach for determining the optimal number of
clusters. It evaluates a large number of clustering validity indices
(over 20 different criteria) and uses them to propose a suitable number
of clusters for the data.
Each index assesses clustering quality in a different way, based on measures such as within-cluster compactness and between-cluster separation. The final recommendation is typically based on a majority rule, i.e., the number of clusters that is most frequently suggested across all indices.
The output includes the suggested number of clusters from each index, as well as a summary indicating the most commonly selected solution. This makes NbClust a useful complementary tool when deciding on the number of clusters, especially when different graphical methods give conflicting results.
library(NbClust)
#> Warning: package 'NbClust' was built under R version 4.5.2
set.seed(1234)
nc <- NbClust(wine_stand[,2:14], min.nc=2, max.nc=15, method="kmeans")
#> *** : The Hubert index is a graphical method of determining the number of clusters.
#> In the plot of Hubert index, we seek a significant knee that corresponds to a
#> significant increase of the value of the measure i.e the significant peak in Hubert
#> index second differences plot.
#>
#> *** : The D index is a graphical method of determining the number of clusters.
#> In the plot of D index, we seek a significant knee (the significant peak in Dindex
#> second differences plot) that corresponds to a significant increase of the value of
#> the measure.
#>
#> *******************************************************************
#> * Among all indices:
#> * 2 proposed 2 as the best number of clusters
#> * 19 proposed 3 as the best number of clusters
#> * 1 proposed 14 as the best number of clusters
#> * 1 proposed 15 as the best number of clusters
#>
#> ***** Conclusion *****
#>
#> * According to the majority rule, the best number of clusters is 3
#>
#>
#> *******************************************************************
table(nc$Best.n[1,])
#>
#> 0 1 2 3 14 15
#> 2 1 2 19 1 1Since K-means cluster analysis starts with k randomly chosen
centroids, a different solution can be obtained each time the function
is invoked. Use the set.seed() function to guarantee that
the results are reproducible.
Additionally, this clustering approach can be sensitive to the
initial selection of centroids. The kmeans() function has
an nstart option that attempts multiple initial configurations and
reports on the best one. For example, adding nstart=25 will
generate 25 initial configurations. This approach is often
recommended.
set.seed(123)
wine_kmeans <- kmeans(wine_stand[,2:14], 3, iter.max = 10, nstart = 25)
# Print the results
print(wine_kmeans)
#> K-means clustering with 3 clusters of sizes 51, 62, 65
#>
#> Cluster means:
#> Alcohol Malic acid Ash Alcalinity of ash Magnesium Total phenols
#> 1 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047 -0.97657548
#> 2 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208 0.88274724
#> 3 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869 -0.07576891
#> Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue
#> 1 -1.21182921 0.72402116 -0.77751312 0.9388902 -1.1615122
#> 2 0.97506900 -0.56050853 0.57865427 0.1705823 0.4726504
#> 3 0.02075402 -0.03343924 0.05810161 -0.8993770 0.4605046
#> OD280/OD315 of diluted wines Proline
#> 1 -1.2887761 -0.4059428
#> 2 0.7770551 1.1220202
#> 3 0.2700025 -0.7517257
#>
#> Clustering vector:
#> [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2
#> [75] 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#> [112] 3 3 3 3 3 3 3 1 3 3 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>
#> Within cluster sum of squares by cluster:
#> [1] 326.3537 385.6983 558.6971
#> (between_SS / total_SS = 44.8 %)
#>
#> Available components:
#>
#> [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
#> [6] "betweenss" "size" "iter" "ifault"
# Number of observations per cluster
wine_kmeans$size
#> [1] 51 62 65
# Means of the variables within each cluster
wine_kmeans$centers
#> Alcohol Malic acid Ash Alcalinity of ash Magnesium Total phenols
#> 1 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047 -0.97657548
#> 2 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208 0.88274724
#> 3 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869 -0.07576891
#> Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue
#> 1 -1.21182921 0.72402116 -0.77751312 0.9388902 -1.1615122
#> 2 0.97506900 -0.56050853 0.57865427 0.1705823 0.4726504
#> 3 0.02075402 -0.03343924 0.05810161 -0.8993770 0.4605046
#> OD280/OD315 of diluted wines Proline
#> 1 -1.2887761 -0.4059428
#> 2 0.7770551 1.1220202
#> 3 0.2700025 -0.7517257We can visualize the results from the clustering using the
fviz_cluster function from the factoextra
package
fviz_cluster(wine_kmeans, data = wine_stand[,2:14],
palette = c("#2E9FDF", "#00AFBB", "#E7B800"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw()
)When the true group labels are available, clustering performance can be assessed using the Rand Index. This measure compares the similarity between the clustering solution and the known class labels by examining all pairs of observations and checking whether they are assigned consistently in both partitions.
The Rand Index takes values between 0 and 1, where a value of 1 indicates perfect agreement between the clustering results and the true classes, and values closer to 0 indicate poor agreement. In the wine example, it can be used to evaluate how well the clustering algorithm recovers the three known cultivars.
The Rand Index can be computed in R using the randIndex() function from the flexclust package.