Assessing Clustering Tendency

Before clustering, we need to check whether the dataset contains meaningful clusters or not.

library(factoextra)
library(clustertend)

Let's take the iris dataset as an example.

head(iris,3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa

We start by excluding the column “Species” at position 5, generating some random data from the iris data, and standardising both data sets.

# Iris data set without the "Species" column
df <- iris[, -5]
# Random data generated uniformly within the range of each iris variable
random_df <- apply(df, 2, function(x) {
  runif(length(x), min(x), max(x))
})
random_df <- as.data.frame(random_df)
# Standardize both data sets
df <- scale(df)
random_df <- scale(random_df)

Visual Inspection of the data

library("factoextra")
# Plot faithful data set
fviz_pca_ind(prcomp(df), title = "PCA - Iris data",
habillage = iris$Species, palette = "jco",
geom = "point", ggtheme = theme_classic(),
legend = "bottom")

Now we plot the random data set.

# Plot the random data set
fviz_pca_ind(prcomp(random_df), title = "PCA - Random data",
             geom = "point", ggtheme = theme_classic())

It can be seen that the iris data set contains 3 real clusters. However, the randomly generated data set doesn’t contain any meaningful clusters.

Why it is important to assess clustering tendency

Let us apply k-means on the iris dataset.

library(factoextra)
set.seed(123)
# K-means on the iris dataset
km.res1 <- kmeans(df, 3)
fviz_cluster(list(data = df, cluster = km.res1$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())

K-means recovers the three real clusters in the iris data. Now let us apply k-means on the random data set.

# K-means on the random dataset
km.res2 <- kmeans(random_df, 3)
fviz_cluster(list(data = random_df, cluster = km.res2$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())

We can see that k-means imposes a partition on the random data even though it contains no real clusters. This is why it is important to assess the clustering tendency before applying any clustering algorithm.

Clusterability can be tested with the Hopkins statistic, as implemented in the clustertend package: a value close to 0.5 means the data is not clusterable, while a value close to 0 means it is clusterable. Let’s compute it for the iris dataset; n is the number of points to be sampled from the data.
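
For intuition, here is a minimal sketch of how such a statistic can be computed by hand. The helper hopkins_sketch() is purely illustrative (it is not the clustertend implementation) and assumes Euclidean distances and uniform sampling within each variable's range.

# Illustrative sketch of a Hopkins-type statistic (clustertend-style scale:
# values near 0 suggest clustered data, values near 0.5 suggest uniform data)
hopkins_sketch <- function(data, m = 10, seed = 123) {
  set.seed(seed)
  data <- as.matrix(data)
  n <- nrow(data)
  # m artificial points drawn uniformly within the range of each variable
  rand_pts <- apply(data, 2, function(x) runif(m, min(x), max(x)))
  # nearest-neighbour distance from each artificial point to the real data
  u <- apply(rand_pts, 1, function(p) min(sqrt(colSums((t(data) - p)^2))))
  # nearest-neighbour distance from each of m sampled real points to the rest
  idx <- sample(n, m)
  w <- sapply(idx, function(i) {
    min(sqrt(colSums((t(data[-i, , drop = FALSE]) - data[i, ])^2)))
  })
  sum(w) / (sum(u) + sum(w))
}
hopkins_sketch(df)        # iris: expected well below 0.5
hopkins_sketch(random_df) # random data: expected close to 0.5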

library(clustertend)
# Compute Hopkins statistic for iris dataset
set.seed(123)
hopkins(df, n = nrow(df)-1)
## $H
## [1] 0.1815219

The value is far below 0.5, so the iris data is clusterable. Let's do the same for the random data set.

# Compute Hopkins statistic for a random dataset
set.seed(123)
hopkins(random_df, n = nrow(random_df)-1)
## $H
## [1] 0.5045176

The value is close to 0.5, so the random data is not clusterable. We can also assess clusterability visually, using the ordered dissimilarity image (VAT approach).

fviz_dist(dist(df), show_labels = FALSE) +
  labs(title = "Iris data")

Blocks of low dissimilarity along the diagonal show that the iris data contains a clear cluster structure.

fviz_dist(dist(random_df), show_labels = FALSE) +
  labs(title = "Random data")

The ordered dissimilarity image shows no block structure, confirming that the random data is not clusterable.
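
Finally, note that factoextra's get_clust_tendency() combines both approaches. A sketch of its use is below; the choice of n = 50 is arbitrary, and depending on the package version the Hopkins statistic may be reported on the complementary scale (values close to 1 then indicate clusterable data), so check ?get_clust_tendency in your installed version.

# Hopkins statistic and ordered dissimilarity image in a single call
res <- get_clust_tendency(df, n = 50, graph = TRUE)
res$hopkins_stat  # the Hopkins statistic
res$plot          # the ordered dissimilarity image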