When conducting any sort of cluster analysis, the first step is visualizing the data.
- If there are two variables, we can make a scatter plot.
- If there are three or more variables, we can make a biplot (or biplots) using the relevant PCs.
Since the iris data set has 4 numeric columns we'll use to cluster the data, we'll use PCA to see if there are any obvious clusters. We'll use the fviz_screeplot() and fviz_pca() functions in the factoextra package.
iris2 |>
  # Principal component analysis on the standardized data (scale. = T)
  prcomp(scale. = T) |>
  # Plotting the scree plot
  fviz_screeplot(choice = "eigenvalue",
                 geom = "line")
From the scree plot, we should keep 2 PCs.
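If we want the exact numbers behind the scree plot, the get_eigenvalue() function in factoextra will display the eigenvalue and percent of variance for each PC. A quick sketch using the same pipeline:

iris2 |>
  prcomp(scale. = T) |>
  # Table of eigenvalues, percent of variance, and cumulative percent
  get_eigenvalue()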
It's always a good idea to plot the data to see if there are any obvious groups.
iris2 |>
  prcomp(scale. = T) |>
  # fviz_pca will create a biplot
  fviz_pca(geom = "point")
There appear to be at least 2 different groups in the data!
Before we can perform a cluster analysis, we need to rescale the data, similar to what we did with PCA.
With clustering, there are two commonly used options:
Standardizing: \[z = \frac{y - \mu}{\sigma}\]
Normalizing: \[v = \frac{y - \min(y)}{\max(y) - \min(y)}\]
Standardizing will make all of the variables have a mean of 0 and standard deviation of 1
Normalizing will place all of the values between 0 (min) and 1 (max)
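As a quick illustration of the difference, here are both formulas applied to a toy vector:

y <- c(2, 4, 6, 8)

# Standardizing: the result has mean 0 and sd 1
(y - mean(y)) / sd(y)

# Normalizing: the smallest value becomes 0 and the largest becomes 1
(y - min(y)) / (max(y) - min(y))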
If calculating means and standard deviations seems appropriate for the data, we standardize. If the data are not sufficiently summarized using means and standard deviations, we normalize.
For our example, we’ll standardize the data. But for a thorough analysis, often both are used separately and the results are compared.
iris_rescaled <-
  iris2 |>
  # Standardizing all the numeric columns at once using across()
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = ~ (. - mean(.)) / sd(.)
    )
  )
# Checking that the standard deviations are all 1
apply(iris_rescaled,
      MARGIN = 2,
      FUN = sd)
## sepal_length sepal_width petal_length petal_width
## 1 1 1 1
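If we had decided to normalize instead, only the function inside across() changes. A sketch (iris_normalized is just an illustrative name; we'll keep using iris_rescaled below):

iris_normalized <-
  iris2 |>
  # Normalizing all the numeric columns at once using across()
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = ~ (. - min(.)) / (max(.) - min(.))
    )
  )

# Checking that each column now ranges from 0 to 1
apply(iris_normalized,
      MARGIN = 2,
      FUN = range)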
To conduct k-means clustering, we use the kmeans() function. The arguments are:
- x = the rescaled data used for clustering
- centers = the number of clusters to form (k)
- iter.max = the maximum number of iterations to repeat before stopping
- nstart = how many times to run the algorithm before keeping the "best" result

iris_km2 <-
  kmeans(
    x = iris_rescaled,
    centers = 2,
    nstart = 20,
    iter.max = 10
  )
What objects are stored in iris_km2?
names(iris_km2)
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
cluster = Which cluster each row belongs to
iris2 |>
  mutate(cluster = iris_km2$cluster) |>
  slice_sample(n = 10)
## sepal_length sepal_width petal_length petal_width cluster
## 1 6.7 2.5 5.8 1.8 1
## 2 5.1 2.5 3.0 1.1 1
## 3 6.8 2.8 4.8 1.4 1
## 4 5.5 4.2 1.4 0.2 2
## 5 4.9 3.6 1.4 0.1 2
## 6 5.3 3.7 1.5 0.2 2
## 7 7.2 3.0 5.8 1.6 1
## 8 5.4 3.9 1.7 0.4 2
## 9 5.7 2.5 5.0 2.0 1
## 10 4.8 3.1 1.6 0.2 2
centers = the average of each cluster (centroid) for the rescaled data

iris_km2$centers
## sepal_length sepal_width petal_length petal_width
## 1 0.51 -0.43 0.65 0.63
## 2 -1.01 0.85 -1.30 -1.25
If we want to see how the clusters differ using the original scale of the variables, we can calculate the centers ourselves:
iris2 |>
  # Adding the cluster each iris is assigned to the data
  mutate(cluster = iris_km2$cluster) |>
  # Averaging all the numeric variables by cluster
  summarize(
    .by = cluster,
    across(
      .cols = where(is.numeric),
      .fns = mean
    )
  )
## cluster sepal_length sepal_width petal_length petal_width
## 1 2 5.0 3.4 1.5 0.25
## 2 1 6.3 2.9 4.9 1.68
It appears that the main difference between cluster 1 and cluster 2 is the petal size. The sepals aren’t that different (relative to the petal size anyway).
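Two other components are worth a quick look before we plot: size (how many flowers are in each cluster) and the sums of squares, which tell us how much of the total variability the clustering explains. A quick sketch:

# Number of flowers assigned to each cluster
iris_km2$size

# Proportion of the total sum of squares explained by the clustering
iris_km2$betweenss / iris_km2$totss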
We can quickly plot the results using the fviz_cluster() function:
fviz_cluster(
  iris_km2,
  geom = "point",       # Use points to represent each plant (use text if the row names are meaningful)
  data = iris2,         # The data to be plotted
  show.clust.cent = T,  # If you want to show the centroid of each cluster
  ellipse = T           # If you want an ellipse drawn around each cluster
) +
  labs(title = "Iris Data Clustered with K-means") +
  theme_bw() +
  theme(legend.position = "none")
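Under the hood, fviz_cluster() displays the first two principal components of the data, colored by cluster. A hand-rolled sketch of the same idea (stat_ellipse() draws normal-theory ellipses, so it won't necessarily match fviz_cluster()'s ellipses exactly):

# PC scores for each flower, plus its assigned cluster
prcomp(iris2, scale. = T)$x |>
  data.frame() |>
  mutate(cluster = factor(iris_km2$cluster)) |>
  ggplot(mapping = aes(x = PC1, y = PC2, color = cluster)) +
  geom_point() +
  stat_ellipse() +
  theme_bw() +
  theme(legend.position = "none")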
What if we use 3 clusters instead?

iris_km3 <-
  kmeans(
    iris_rescaled,
    centers = 3,
    nstart = 20,
    iter.max = 10
  )
# And let's plot the result to compare:
fviz_cluster(
  iris_km3,
  geom = "point",       # Use points to represent each plant
  data = iris2,         # The data to be plotted
  show.clust.cent = T,  # If you want to show the centroid of each cluster
  ellipse = T           # If you want an ellipse drawn around each cluster
) +
  labs(title = "Iris Data Clustered with k = 3") +
  theme_bw() +
  theme(legend.position = "none")
Which result should we use?
How do we determine the appropriate number of clusters?
We can use fviz_nbclust() to plot the results of several of our tools for determining what k should be. Its arguments are:

- x = the data set
- FUNcluster = the clustering method (kmeans)
- method = the measure used to determine the optimal k ("wss", "silhouette", "gap_stat")
- k.max = the largest k you'd consider

Let's start by creating an elbow plot.
fviz_nbclust(
  x = iris_rescaled,
  FUNcluster = kmeans,
  method = "wss",
  k.max = 10,
  nstart = 10
) +
  labs(title = "Choosing K for Iris Dataset Using WSS") +
  theme_bw()
Looks to be 2 or 3 clusters, since that's where the elbow of the plot occurs.
The downside of an elbow plot is that the within sum of squares always decreases as k increases, so there isn't an objective "best" choice from the plot.
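To see what the elbow plot is built from, we can compute the total within sum of squares for each k ourselves. A sketch of what fviz_nbclust(method = "wss") is doing:

set.seed(1234)

# Total within-cluster sum of squares for k = 1 through 10
sapply(
  1:10,
  FUN = \(k) kmeans(iris_rescaled, centers = k, nstart = 10)$tot.withinss
)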
We have two alternative plots we can create:
This still uses the fviz_nbclust() function; we just change to method = "silhouette" instead of "wss".
fviz_nbclust(
  x = iris_rescaled,
  FUNcluster = kmeans,
  method = "silhouette",
  k.max = 10
) +
  labs(title = "Choosing K for Iris Dataset Using Silhouette score") +
  theme_bw()
From the silhouette plot, the best choice is two clusters.
We can plot the silhouette values for each flower in the data; we just need to calculate the individual silhouette scores for a certain choice of k. We need to give the silhouette() function (in the cluster package) the cluster IDs from the clustering method and the distance matrix for the data (use dist(scaled data)). Then we can pipe the results into fviz_silhouette() to create a silhouette plot.
# Getting the silhouette score
silhouette(
  iris_km2$cluster,
  dist = dist(iris_rescaled)
) |>
  # Plotting the scores for each iris
  fviz_silhouette() +
  theme(legend.position = "none")
## cluster size ave.sil.width
## 1 1 100 0.53
## 2 2 50 0.68
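The silhouette object is just a matrix with a sil_width column, so we can also compute the overall average silhouette width, which is the value fviz_nbclust() plotted at k = 2 (sil_k2 is just an illustrative name):

sil_k2 <-
  silhouette(
    iris_km2$cluster,
    dist = dist(iris_rescaled)
  )

# Overall average silhouette width for k = 2
mean(sil_k2[, "sil_width"])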
Let’s create another silhouette plot for k = 3
# Calculate the silhouette score
silhouette(iris_km3$cluster,
           dist = dist(iris_rescaled)) |>
  # Pass the result to fviz_silhouette with no labels (no rownames)
  fviz_silhouette(label = F) +
  theme(legend.position = "none")
## cluster size ave.sil.width
## 1 1 53 0.39
## 2 2 50 0.64
## 3 3 47 0.35
Using k = 4
set.seed(1234)
iris_km4 <-
  kmeans(
    iris_rescaled,
    centers = 4,
    nstart = 20
  )

# Silhouette plot for k = 4
silhouette(
  iris_km4$cluster,
  dist = dist(iris_rescaled)
) |>
  fviz_silhouette(label = F)
## cluster size ave.sil.width
## 1 1 53 0.38
## 2 2 25 0.37
## 3 3 25 0.47
## 4 4 47 0.35
From all 4 plots, k = 2 had the highest average silhouette score, indicating that k = 2 seems to be the best choice for the number of clusters when using k-means clustering.
The last plot we'll make is a gap statistic plot.
The gap statistic generates fake data sets with the same dimensions and spread as the original data, but where there are no true clusters, and then compares the within-cluster variation in our data to the variation in the fake data.
We can use fviz_nbclust() to create the plot for the gap statistic:
set.seed(1234)
fviz_nbclust(
  x = iris_rescaled,
  FUNcluster = kmeans,
  nstart = 25,
  method = "gap_stat",
  nboot = 100
)
If we want to calculate the gap statistic ourselves, we have an alternative to the fviz_nbclust() function: we can use the clusGap() function in the cluster package and pipe it into the fviz_gap_stat() function in factoextra. The clusGap() function needs:
- x = the rescaled data set
- FUNcluster = how we are going to cluster the data (kmeans)
- K.max = the most clusters we are willing to consider
- B = the number of bootstraps to calculate per choice of k

# Setting the seed so everyone gets the same result
RNGversion("4.1.0")
set.seed(1234)
# Calculating the gap statistic results and passing them to the fviz_gap_stat
clusGap(
  x = iris_rescaled,
  FUNcluster = kmeans,
  K.max = 10,
  B = 100
) |>
  # Plotting the results using fviz_gap_stat
  fviz_gap_stat()
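If we'd rather pick k programmatically than read it off the plot, the maxSE() function in the cluster package applies standard-error-based rules to the gap results. A sketch, saving the clusGap() output first (gap_res is just an illustrative name):

set.seed(1234)
gap_res <-
  clusGap(
    x = iris_rescaled,
    FUNcluster = kmeans,
    K.max = 10,
    B = 100
  )

# "firstSEmax": the first k whose gap value is within one standard error
# of the first local maximum of the gap statistic
maxSE(gap_res$Tab[, "gap"], gap_res$Tab[, "SE.sim"], method = "firstSEmax")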
While we've visualized the results in 3 different ways, there are many, many more ways of attempting to determine the number of clusters in a data set. The NbClust() function will go through 30 different methods and report what they recommend:
#### Using NbClust() to find k from 30 different methods ####
opt_k <-
  NbClust(
    data = iris_rescaled,
    max.nc = 10,
    method = "kmeans",
    index = "alllong" # "alllong" will run all 30 different methods
  )
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 13 proposed 2 as the best number of clusters
## * 10 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
# Displaying the results in a table
k_choices <-
  opt_k$Best.nc |>
  data.frame() |>
  # Getting the first row (how many clusters it recommends)
  slice(1) |>
  # Transforming the columns into two rows
  pivot_longer(
    cols = everything(),
    names_to = "Method",
    values_to = "k"
  ) |>
  # Arranging in order of clusters recommended
  arrange(k)
# Making a graphical table for the knitted document
gt::gt(k_choices)
| Method | k |
|---|---|
| Hubert | 0 |
| Dindex | 0 |
| CH | 2 |
| DB | 2 |
| Silhouette | 2 |
| Duda | 2 |
| PseudoT2 | 2 |
| Beale | 2 |
| Ratkowsky | 2 |
| PtBiserial | 2 |
| Gap | 2 |
| McClain | 2 |
| Tau | 2 |
| Dunn | 2 |
| SDindex | 2 |
| KL | 3 |
| Hartigan | 3 |
| CCC | 3 |
| Scott | 3 |
| Marriot | 3 |
| TraceW | 3 |
| Rubin | 3 |
| Cindex | 3 |
| Ball | 3 |
| Frey | 3 |
| TrCovW | 4 |
| Friedman | 6 |
| Gamma | 10 |
| Gplus | 10 |
| SDbw | 10 |
# Bar plot to illustrate how many methods chose each value of k
ggplot(
  data = k_choices,
  mapping = aes(x = k)
) +
  geom_bar(
    fill = "steelblue",
    color = "black"
  ) +
  scale_x_continuous(
    breaks = 0:10,
    minor_breaks = NULL
  ) +
  scale_y_continuous(expand = c(0, 0, 0.05, 0))