K-means clustering

dr. Annelies Agten

2026-04-18

K-means clustering is a widely used unsupervised learning method that partitions a dataset into a pre-specified number of clusters. The goal is to group observations in such a way that those within the same cluster are as similar as possible, while observations in different clusters are as dissimilar as possible.

The method works by assigning each observation to one of k clusters, where each cluster is represented by its centroid (the mean of all observations in that cluster). The algorithm iteratively updates cluster assignments and centroids until the within-cluster variation is minimised.

In the following section, we will learn how to perform k-means clustering in R, how to choose the number of clusters, and how to interpret and visualise the resulting cluster structure.

K-means clustering in R

In R, k-means clustering is performed using the kmeans() function. This function partitions the data into a pre-specified number of clusters by iteratively assigning observations to the nearest cluster centroid.

Syntax:
kmeans(x, centers, nstart = 25)

Where:

The resulting object (km) contains several useful components, including cluster assignments (km$cluster), cluster centroids (km$centers), and within-cluster variation measures (km$tot.withinss), which help assess the quality of the clustering solution.

Read data

First we need to read in our data into R. Throughtout this example we will use the wine data. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

The attributes are:

The wine data is in a .txt format, so to read in the data we can use the read.table() function in R.

wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")

colnames(wine) <- c("Cultivar","Alcohol","Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Flavanoids","Nonflavanoid phenols","Proanthocyanins","Color intensity","Hue","OD280/OD315 of diluted wines","Proline")

dim(wine)
#> [1] 178  14

head(wine, 5)
#>   Cultivar Alcohol Malic acid  Ash Alcalinity of ash Magnesium Total phenols
#> 1        1   14.23       1.71 2.43              15.6       127          2.80
#> 2        1   13.20       1.78 2.14              11.2       100          2.65
#> 3        1   13.16       2.36 2.67              18.6       101          2.80
#> 4        1   14.37       1.95 2.50              16.8       113          3.85
#> 5        1   13.24       2.59 2.87              21.0       118          2.80
#>   Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity  Hue
#> 1       3.06                 0.28            2.29            5.64 1.04
#> 2       2.76                 0.26            1.28            4.38 1.05
#> 3       3.24                 0.30            2.81            5.68 1.03
#> 4       3.49                 0.24            2.18            7.80 0.86
#> 5       2.69                 0.39            1.82            4.32 1.04
#>   OD280/OD315 of diluted wines Proline
#> 1                         3.92    1065
#> 2                         3.40    1050
#> 3                         3.17    1185
#> 4                         3.45    1480
#> 5                         2.93     735

The wine dataset contains 178 observations of 14 variables, including the 13 measured quantities of chemicals and the variable Cultivar, which indicates the type of grape from which the wine was produced.

The measured attributes have very different ranges, so the data should be standardized before performing hierarchical clustering to ensure that all variables contribute equally to the analysis.

wine_stand <- as.data.frame(scale(wine)) # standardize data by subtracting the mean and deviding by the sd

Determining the number of clusters

To determine the appropriate number of clusters in k-means clustering, several methods can be used, including the elbow method, the average silhouette method, and the gap statistic. In R, the first two methods can be implemented easily using the fviz_nbclust() function from the factoextra package, while the gap statistic is available through the cluster package.

The elbow method evaluates the total within-cluster variation for different values of k. The “elbow” point, where the rate of decrease starts to slow down, suggests a suitable number of clusters.

library(factoextra)
#> Warning: package 'factoextra' was built under R version 4.5.3
#> Loading required package: ggplot2
#> Warning: package 'ggplot2' was built under R version 4.5.3
#> Welcome to factoextra!
#> Want to learn more? See two factoextra-related books at https://www.datanovia.com/en/product/practical-guide-to-principal-component-methods-in-r/

fviz_nbclust(x = wine_stand[,2:14],FUNcluster = kmeans, method = 'wss' )

The silhouette method assesses how well observations are clustered by comparing within-cluster similarity to between-cluster dissimilarity. Higher average silhouette values indicate a better clustering structure.

fviz_nbclust(x = wine_stand[,2:14],FUNcluster = kmeans, method = 'silhouette' ) #Average silhouette method

The gap statistic compares the observed within-cluster dispersion to that expected under a reference (random) distribution of the data. The optimal number of clusters is typically chosen as the smallest k such that the gap statistic is within one standard deviation of its maximum.

library(cluster)
#> Warning: package 'cluster' was built under R version 4.5.3

set.seed(123)
gap_stat <- clusGap(x = wine_stand[,2:14], FUN = kmeans, K.max = 15, nstart = 25, B = 50 )

fviz_gap_stat(gap_stat)

Together, these approaches provide complementary information to help determine an appropriate number of clusters before fitting the final k-means model.

In addition to visual methods such as the elbow plot, silhouette plot, and gap statistic, the NbClust package provides a more comprehensive approach for determining the optimal number of clusters. It evaluates a large number of clustering validity indices (over 20 different criteria) and uses them to propose a suitable number of clusters for the data.

Each index assesses clustering quality in a different way, based on measures such as within-cluster compactness and between-cluster separation. The final recommendation is typically based on a majority rule, i.e., the number of clusters that is most frequently suggested across all indices.

The output includes the suggested number of clusters from each index, as well as a summary indicating the most commonly selected solution. This makes NbClust a useful complementary tool when deciding on the number of clusters, especially when different graphical methods give conflicting results.

library(NbClust)
#> Warning: package 'NbClust' was built under R version 4.5.2


set.seed(1234)
nc <- NbClust(wine_stand[,2:14], min.nc=2, max.nc=15, method="kmeans")
#> *** : The Hubert index is a graphical method of determining the number of clusters.
#>                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
#>                 significant increase of the value of the measure i.e the significant peak in Hubert
#>                 index second differences plot. 
#> 
#> *** : The D index is a graphical method of determining the number of clusters. 
#>                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
#>                 second differences plot) that corresponds to a significant increase of the value of
#>                 the measure. 
#>  
#> ******************************************************************* 
#> * Among all indices:                                                
#> * 2 proposed 2 as the best number of clusters 
#> * 19 proposed 3 as the best number of clusters 
#> * 1 proposed 14 as the best number of clusters 
#> * 1 proposed 15 as the best number of clusters 
#> 
#>                    ***** Conclusion *****                            
#>  
#> * According to the majority rule, the best number of clusters is  3 
#>  
#>  
#> *******************************************************************

table(nc$Best.n[1,])
#> 
#>  0  1  2  3 14 15 
#>  2  1  2 19  1  1

Perform k-means clustering

Since K-means cluster analysis starts with k randomly chosen centroids, a different solution can be obtained each time the function is invoked. Use the set.seed() function to guarantee that the results are reproducible.

Additionally, this clustering approach can be sensitive to the initial selection of centroids. The kmeans() function has an nstart option that attempts multiple initial configurations and reports on the best one. For example, adding nstart=25 will generate 25 initial configurations. This approach is often recommended.

set.seed(123)
wine_kmeans <- kmeans(wine_stand[,2:14], 3, iter.max = 10, nstart = 25)

# Print the results
print(wine_kmeans)
#> K-means clustering with 3 clusters of sizes 51, 62, 65
#> 
#> Cluster means:
#>      Alcohol Malic acid        Ash Alcalinity of ash   Magnesium Total phenols
#> 1  0.1644436  0.8690954  0.1863726         0.5228924 -0.07526047   -0.97657548
#> 2  0.8328826 -0.3029551  0.3636801        -0.6084749  0.57596208    0.88274724
#> 3 -0.9234669 -0.3929331 -0.4931257         0.1701220 -0.49032869   -0.07576891
#>    Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity        Hue
#> 1 -1.21182921           0.72402116     -0.77751312       0.9388902 -1.1615122
#> 2  0.97506900          -0.56050853      0.57865427       0.1705823  0.4726504
#> 3  0.02075402          -0.03343924      0.05810161      -0.8993770  0.4605046
#>   OD280/OD315 of diluted wines    Proline
#> 1                   -1.2887761 -0.4059428
#> 2                    0.7770551  1.1220202
#> 3                    0.2700025 -0.7517257
#> 
#> Clustering vector:
#>   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#>  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2
#>  [75] 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#> [112] 3 3 3 3 3 3 3 1 3 3 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 
#> Within cluster sum of squares by cluster:
#> [1] 326.3537 385.6983 558.6971
#>  (between_SS / total_SS =  44.8 %)
#> 
#> Available components:
#> 
#> [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
#> [6] "betweenss"    "size"         "iter"         "ifault"

# Number of observations per cluster
wine_kmeans$size 
#> [1] 51 62 65

# Means of the variables within each cluster
wine_kmeans$centers      
#>      Alcohol Malic acid        Ash Alcalinity of ash   Magnesium Total phenols
#> 1  0.1644436  0.8690954  0.1863726         0.5228924 -0.07526047   -0.97657548
#> 2  0.8328826 -0.3029551  0.3636801        -0.6084749  0.57596208    0.88274724
#> 3 -0.9234669 -0.3929331 -0.4931257         0.1701220 -0.49032869   -0.07576891
#>    Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity        Hue
#> 1 -1.21182921           0.72402116     -0.77751312       0.9388902 -1.1615122
#> 2  0.97506900          -0.56050853      0.57865427       0.1705823  0.4726504
#> 3  0.02075402          -0.03343924      0.05810161      -0.8993770  0.4605046
#>   OD280/OD315 of diluted wines    Proline
#> 1                   -1.2887761 -0.4059428
#> 2                    0.7770551  1.1220202
#> 3                    0.2700025 -0.7517257

We can visualize the results from the clustering using the fviz_cluster function from the factoextra package

fviz_cluster(wine_kmeans, data = wine_stand[,2:14],
             palette = c("#2E9FDF", "#00AFBB", "#E7B800"), 
             geom = "point",
             ellipse.type = "convex", 
             ggtheme = theme_bw()
)

Performance of the clustering

When the true group labels are available, clustering performance can be assessed using the Rand Index. This measure compares the similarity between the clustering solution and the known class labels by examining all pairs of observations and checking whether they are assigned consistently in both partitions.

The Rand Index takes values between 0 and 1, where a value of 1 indicates perfect agreement between the clustering results and the true classes, and values closer to 0 indicate poor agreement. In the wine example, it can be used to evaluate how well the clustering algorithm recovers the three known cultivars.

The Rand Index can be computed in R using the randIndex() function from the flexclust package.

library(flexclust)
#> Warning: package 'flexclust' was built under R version 4.5.3

ct.km <- table(wine$Cultivar, wine_kmeans$cluster)
ct.km 
#>    
#>      1  2  3
#>   1  0 59  0
#>   2  3  3 65
#>   3 48  0  0

randIndex(ct.km)      
#>      ARI 
#> 0.897495