When conducting any sort of cluster analysis, the first step is visualizing the data.
- If there are two variables, we can make a scatter plot.
- If there are three or more variables, we can make a biplot (or biplots) using the relevant PCs.
Since the iris data set has 4 numeric columns we'll use to cluster the data, we'll use PCA to see if there are any obvious clusters. We'll use the fviz_screeplot() and fviz_pca() functions in the factoextra package.
iris2 |>
  # Principal component analysis on the standardized data (scale. = T)
  prcomp(scale. = T) |>
  # Plotting the scree plot
  fviz_screeplot(choice = "eigenvalue",
                 geom = "line")
From the scree plot, we should keep 2 PCs.
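If we want the exact numbers behind the scree plot, the get_eigenvalue() function in factoextra will display the eigenvalue and percent of variance for each PC. A quick sketch using the same pipeline:

iris2 |>
  prcomp(scale. = T) |>
  # Table of eigenvalues, percent of variance, and cumulative percent
  get_eigenvalue()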
It's always a good idea to plot the data to see if there are any obvious groups.
iris2 |>
  prcomp(scale. = T) |>
  # fviz_pca will create a biplot
  fviz_pca(geom = "point")
There appear to be at least 2 different groups in the data!
Before we can perform a cluster analysis, we need to rescale the data, similar to what we did with PCA.
With clustering, there are two commonly used options:
Standardizing: \[z = \frac{y - \mu}{\sigma}\]
Normalizing: \[v = \frac{y - \min(y)}{\max(y) - \min(y)}\]
Standardizing will make all of the variables have a mean of 0 and standard deviation of 1
Normalizing will place all of the values between 0 (min) and 1 (max)
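As a quick illustration of the difference, here are both formulas applied to a toy vector:

y <- c(2, 4, 6, 8)

# Standardizing: the result has mean 0 and sd 1
(y - mean(y)) / sd(y)

# Normalizing: the smallest value becomes 0 and the largest becomes 1
(y - min(y)) / (max(y) - min(y))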
If calculating means and standard deviations seems appropriate for the data, we standardize. If the data are not sufficiently summarized using means and standard deviations, we normalize.
For our example, we’ll standardize the data. But for a thorough analysis, often both are used separately and the results are compared.
iris_rescaled <-
  iris2 |>
  # Standardizing all the numeric columns at once using across()
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = ~ (. - mean(.)) / sd(.)
    )
  )
# Checking that the standard deviations are all 1
apply(iris_rescaled,
      MARGIN = 2,
      FUN = sd)
## sepal_length sepal_width petal_length petal_width
## 1 1 1 1
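If we had decided to normalize instead, only the function inside across() changes. A sketch (iris_normalized is just an illustrative name; we'll keep using iris_rescaled below):

iris_normalized <-
  iris2 |>
  # Normalizing all the numeric columns at once using across()
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = ~ (. - min(.)) / (max(.) - min(.))
    )
  )

# Checking that each column now ranges from 0 to 1
apply(iris_normalized,
      MARGIN = 2,
      FUN = range)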
To conduct k-means clustering, we use the kmeans() function. The arguments are:
- x = the rescaled data used for clustering
- centers = the number of clusters to form (k)
- iter.max = the maximum number of iterations to repeat before stopping
- nstart = how many times to run the algorithm before keeping the "best" result

iris_km2 <-
  kmeans(
    x = iris_rescaled,
    centers = 2,
    nstart = 20,
    iter.max = 10
  )
What objects are stored in iris_km2?
names(iris_km2)
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
cluster = Which cluster each row belongs to
iris2 |>
  mutate(cluster = iris_km2$cluster) |>
  slice_sample(n = 10)
## sepal_length sepal_width petal_length petal_width cluster
## 1 6.7 2.5 5.8 1.8 1
## 2 5.1 2.5 3.0 1.1 1
## 3 6.8 2.8 4.8 1.4 1
## 4 5.5 4.2 1.4 0.2 2
## 5 4.9 3.6 1.4 0.1 2
## 6 5.3 3.7 1.5 0.2 2
## 7 7.2 3.0 5.8 1.6 1
## 8 5.4 3.9 1.7 0.4 2
## 9 5.7 2.5 5.0 2.0 1
## 10 4.8 3.1 1.6 0.2 2
centers = the average of each cluster (centroid) for the rescaled data

iris_km2$centers
## sepal_length sepal_width petal_length petal_width
## 1 0.51 -0.43 0.65 0.63
## 2 -1.01 0.85 -1.30 -1.25
If we want to see how the clusters differ using the original scale of the variables, we can calculate the centers ourselves:
iris2 |>
  # Adding the cluster each iris is assigned to the data
  mutate(cluster = iris_km2$cluster) |>
  # Averaging all the numeric variables by cluster
  summarize(
    .by = cluster,
    across(
      .cols = where(is.numeric),
      .fns = mean
    )
  )
## cluster sepal_length sepal_width petal_length petal_width
## 1 2 5.0 3.4 1.5 0.25
## 2 1 6.3 2.9 4.9 1.68
It appears that the main difference between cluster 1 and cluster 2 is the petal size. The sepals aren’t that different (relative to the petal size anyway).
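Two other components are worth a quick look before we plot: size (how many flowers are in each cluster) and the sums of squares, which tell us how much of the total variability the clustering explains. A quick sketch:

# Number of flowers assigned to each cluster
iris_km2$size

# Proportion of the total sum of squares explained by the clustering
iris_km2$betweenss / iris_km2$totss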
We can quickly plot the results using the fviz_cluster() function:
fviz_cluster(
  iris_km2,
  geom = "point",       # Use points to represent each plant (use text if the row names are meaningful)
  data = iris2,         # The data to be plotted
  show.clust.cent = T,  # If you want to show the centroid of each cluster
  ellipse = T           # If you want an ellipse drawn around each cluster
) +
  labs(title = "Iris Data Clustered with K-means") +
  theme_bw() +
  theme(legend.position = "none")
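Under the hood, fviz_cluster() displays the first two principal components of the data, colored by cluster. A hand-rolled sketch of the same idea (stat_ellipse() draws normal-theory ellipses, so it won't necessarily match fviz_cluster()'s ellipses exactly):

# PC scores for each flower, plus its assigned cluster
prcomp(iris2, scale. = T)$x |>
  data.frame() |>
  mutate(cluster = factor(iris_km2$cluster)) |>
  ggplot(mapping = aes(x = PC1, y = PC2, color = cluster)) +
  geom_point() +
  stat_ellipse() +
  theme_bw() +
  theme(legend.position = "none")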
What if we use 3 clusters instead?

iris_km3 <-
  kmeans(
    iris_rescaled,
    centers = 3,
    nstart = 20,
    iter.max = 10
  )
# And let's plot the result to compare:
fviz_cluster(
  iris_km3,
  geom = "point",       # Use points to represent each plant
  data = iris2,         # The data to be plotted
  show.clust.cent = T,  # If you want to show the centroid of each cluster
  ellipse = T           # If you want an ellipse drawn around each cluster
) +
  labs(title = "Iris Data Clustered with k = 3") +
  theme_bw() +
  theme(legend.position = "none")
Which result should we use?
How do we determine the appropriate number of clusters?
We can use fviz_nbclust() to plot the results of several of our tools for determining what k should be. Its arguments are:

- x = the data set
- FUNcluster = the clustering method (kmeans)
- method = the measure used to determine the optimal k ("wss", "silhouette", "gap_stat")
- k.max = the largest k you'd consider

Let's start by creating an elbow plot.
fviz_nbclust(
  x = iris_rescaled,
  FUNcluster = kmeans,
  method = "wss",
  k.max = 10,
  nstart = 10
) +
  labs(title = "Choosing K for Iris Dataset Using WSS") +
  theme_bw()
Looks to be 2 or 3 clusters, since that's where the elbow of the plot occurs.
The downside of an elbow plot is that the within sum of squares always decreases as k increases, so there isn't an objective "best" choice from the plot.
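To see what the elbow plot is built from, we can compute the total within sum of squares for each k ourselves. A sketch of what fviz_nbclust(method = "wss") is doing:

set.seed(1234)

# Total within-cluster sum of squares for k = 1 through 10
sapply(
  1:10,
  FUN = \(k) kmeans(iris_rescaled, centers = k, nstart = 10)$tot.withinss
)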
We have two alternative plots we can create:
This still uses the fviz_nbclust() function; we just change to method = "silhouette" instead of "wss".
fviz_nbclust(
  x = iris_rescaled,
  FUNcluster = kmeans,
  method = "silhouette",
  k.max = 10
) +
  labs(title = "Choosing K for Iris Dataset Using Silhouette score") +
  theme_bw()
From the silhouette plot, the best choice is two clusters.
We can plot the silhouette values for each flower in the data; we just need to calculate the individual silhouette scores for a certain choice of k. We need to give the silhouette() function (in the cluster package) the cluster IDs from the clustering method and the distance matrix for the data (use dist(scaled data)). Then we can pipe the results into fviz_silhouette() to create a silhouette plot.
# Getting the silhouette score
silhouette(
  iris_km2$cluster,
  dist = dist(iris_rescaled)
) |>
  # Plotting the scores for each iris
  fviz_silhouette() +
  theme(legend.position = "none")
## cluster size ave.sil.width
## 1 1 100 0.53
## 2 2 50 0.68
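The silhouette object is just a matrix with a sil_width column, so we can also compute the overall average silhouette width, which is the value fviz_nbclust() plotted at k = 2 (sil_k2 is just an illustrative name):

sil_k2 <-
  silhouette(
    iris_km2$cluster,
    dist = dist(iris_rescaled)
  )

# Overall average silhouette width for k = 2
mean(sil_k2[, "sil_width"])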
Let’s create another silhouette plot for k = 3
# Calculate the silhouette score
silhouette(iris_km3$cluster,
           dist = dist(iris_rescaled)) |>
  # Pass the result to fviz_silhouette with no labels (no rownames)
  fviz_silhouette(label = F) +
  theme(legend.position = "none")
## cluster size ave.sil.width
## 1 1 53 0.39
## 2 2 50 0.64
## 3 3 47 0.35
Using k = 4
set.seed(1234)
iris_km4 <-
  kmeans(
    iris_rescaled,
    centers = 4,
    nstart = 20
  )

# Silhouette plot for k = 4
silhouette(
  iris_km4$cluster,
  dist = dist(iris_rescaled)
) |>
  fviz_silhouette(label = F)
## cluster size ave.sil.width
## 1 1 53 0.38
## 2 2 25 0.37
## 3 3 25 0.47
## 4 4 47 0.35
From all 4 plots, k = 2 had the highest average silhouette score, indicating that k = 2 seems to be the best choice for the number of clusters when using k-means clustering.
The last plot we'll make is a gap statistic plot.
The gap statistic generates fake data sets with the same dimensions and spread as the original data, but where there are no true clusters, and then compares the within-cluster variation in our data to the variation in the fake data.
We can use fviz_nbclust() to create the plot for the gap statistic:
set.seed(1234)
fviz_nbclust(
  x = iris_rescaled,
  FUNcluster = kmeans,
  nstart = 25,
  method = "gap_stat",
  nboot = 100
)
If we want to calculate the gap statistic ourselves, we have an alternative to the fviz_nbclust() function: we can use the clusGap() function in the cluster package and pipe it into the fviz_gap_stat() function in factoextra. The clusGap() function needs:
- x = the rescaled data set
- FUNcluster = how we are going to cluster the data (kmeans)
- K.max = the most clusters we are willing to consider
- B = the number of bootstraps to calculate per choice of k

# Setting the seed so everyone gets the same result
RNGversion("4.1.0")
set.seed(1234)
# Calculating the gap statistic results and passing them to the fviz_gap_stat
clusGap(
  x = iris_rescaled,
  FUNcluster = kmeans,
  K.max = 10,
  B = 100
) |>
  # Plotting the results using fviz_gap_stat
  fviz_gap_stat()
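If we'd rather pick k programmatically than read it off the plot, the maxSE() function in the cluster package applies standard-error-based rules to the gap results. A sketch, saving the clusGap() output first (gap_res is just an illustrative name):

set.seed(1234)
gap_res <-
  clusGap(
    x = iris_rescaled,
    FUNcluster = kmeans,
    K.max = 10,
    B = 100
  )

# "firstSEmax": the first k whose gap value is within one standard error
# of the first local maximum of the gap statistic
maxSE(gap_res$Tab[, "gap"], gap_res$Tab[, "SE.sim"], method = "firstSEmax")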
While we've visualized the results in 3 different ways, there are many, many more ways of attempting to determine the number of clusters in a data set. The NbClust() function will go through 30 different methods and report what they recommend:
#### Using NbClust() to find k from 30 different methods ####
opt_k <-
  NbClust(
    data = iris_rescaled,
    max.nc = 10,
    method = "kmeans",
    index = "alllong" # "alllong" will run all 30 different methods
  )
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 13 proposed 2 as the best number of clusters
## * 10 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
# Displaying the results in a table
k_choices <-
  opt_k$Best.nc |>
  data.frame() |>
  # Getting the first row (how many clusters it recommends)
  slice(1) |>
  # Transforming the columns into two rows
  pivot_longer(
    cols = everything(),
    names_to = "Method",
    values_to = "k"
  ) |>
  # Arranging in order of clusters recommended
  arrange(k)
# Making a graphical table for the knitted document
gt::gt(k_choices)
| Method | k |
|---|---|
| Hubert | 0 |
| Dindex | 0 |
| CH | 2 |
| DB | 2 |
| Silhouette | 2 |
| Duda | 2 |
| PseudoT2 | 2 |
| Beale | 2 |
| Ratkowsky | 2 |
| PtBiserial | 2 |
| Gap | 2 |
| McClain | 2 |
| Tau | 2 |
| Dunn | 2 |
| SDindex | 2 |
| KL | 3 |
| Hartigan | 3 |
| CCC | 3 |
| Scott | 3 |
| Marriot | 3 |
| TraceW | 3 |
| Rubin | 3 |
| Cindex | 3 |
| Ball | 3 |
| Frey | 3 |
| TrCovW | 4 |
| Friedman | 6 |
| Gamma | 10 |
| Gplus | 10 |
| SDbw | 10 |
# Bar plot to illustrate how many methods chose each value of k
ggplot(
  data = k_choices,
  mapping = aes(x = k)
) +
  geom_bar(
    fill = "steelblue",
    color = "black"
  ) +
  scale_x_continuous(
    breaks = 0:10,
    minor_breaks = NULL
  ) +
  scale_y_continuous(expand = c(0, 0, 0.05, 0))