Before clustering, we need to check whether the dataset contains meaningful clusters at all.
library(factoextra)
library(clustertend)
Let's take the iris dataset as an example.
head(iris,3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
We start by excluding the column "Species" (at position 5), generating a random dataset with the same ranges as the iris variables, and standardizing both data sets.
# Iris data set without the Species column
df <- iris[, -5]
# Random data drawn uniformly within each column's range
random_df <- apply(df, 2, function(x) runif(length(x), min(x), max(x)))
random_df <- as.data.frame(random_df)
# Standardize both data sets
df <- scale(df)
random_df <- scale(random_df)
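As a quick optional sanity check (not part of the original workflow), we can verify that the scaling worked: every column should now have mean 0 and standard deviation 1.
# Column means should be ~0 and standard deviations ~1 after scaling
round(colMeans(df), 10)
apply(df, 2, sd)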
Visual Inspection of the data
library("factoextra")
# Plot faithful data set
fviz_pca_ind(prcomp(df), title = "PCA - Iris data",
habillage = iris$Species, palette = "jco",
geom = "point", ggtheme = theme_classic(),
legend = "bottom")
Now we plot the random data set.
# Plot the random data set
fviz_pca_ind(prcomp(random_df), title = "PCA - Random data",
             geom = "point", ggtheme = theme_classic())
It can be seen that the iris data set contains three real clusters, whereas the randomly generated data set does not contain any meaningful clusters.
Let us apply k-means to the iris dataset.
set.seed(123)
# K-means on the iris dataset
km.res1 <- kmeans(df, 3)
fviz_cluster(list(data = df, cluster = km.res1$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())
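As a quick illustrative check (not part of the original workflow), we can cross-tabulate the k-means assignments against the known species labels; each species should fall predominantly into one cluster.
# Compare k-means assignments with the true species labels
table(iris$Species, km.res1$cluster)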
k-means recovers the real cluster structure. Let us now apply k-means to the random data set.
# K-means on the random dataset
km.res2 <- kmeans(random_df, 3)
fviz_cluster(list(data = random_df, cluster = km.res2$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())
We can see that k-means imposes a classification even though the data contains no real clusters. This is why it is important to assess clustering tendency before running a clustering algorithm.
Clusterability can be tested with the Hopkins statistic: in the clustertend implementation, a value close to 0.5 means the data is not clusterable (essentially uniform), while a value close to 0 indicates a significant cluster structure. Let's compute it for the iris dataset; n is the number of points to be sampled from the data.
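Before calling the package function, here is a minimal sketch of one common formulation, following clustertend's convention that values near 0 indicate clusterable data. The function name hopkins_sketch is hypothetical, and the package's actual implementation may differ in details such as sampling and distance handling.
# Minimal sketch of the Hopkins statistic (clustertend convention):
# H = sum(w) / (sum(u) + sum(w)), where
#   u = nearest-neighbour distances of uniform random points to the data
#   w = nearest-neighbour distances of sampled real points to the data
hopkins_sketch <- function(data, n) {
  data <- as.matrix(data)
  # Euclidean nearest-neighbour distance from point p to the rows of ref
  nn_dist <- function(p, ref) min(sqrt(colSums((t(ref) - p)^2)))
  # n artificial points drawn uniformly within each column's range
  unif <- apply(data, 2, function(x) runif(n, min(x), max(x)))
  u <- apply(unif, 1, nn_dist, ref = data)
  # n real points sampled from the data; each point excluded from its own search
  idx <- sample(nrow(data), n)
  w <- sapply(idx, function(i) nn_dist(data[i, ], data[-i, , drop = FALSE]))
  sum(w) / (sum(u) + sum(w))
}
For clustered data, w is small relative to u, so H is pushed towards 0; for uniform data, u and w are comparable and H stays near 0.5.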
# Compute the Hopkins statistic for the iris dataset
set.seed(123)
hopkins(df, n = nrow(df)-1)
## $H
## [1] 0.1815219
The value is well below 0.5, so the iris data is clusterable. Let's do the same for the random data set.
# Compute Hopkins statistic for a random dataset
set.seed(123)
hopkins(random_df, n = nrow(random_df)-1)
## $H
## [1] 0.5045176
The value is close to 0.5, so the random data is not clusterable. We can also assess clustering tendency visually, by plotting the ordered dissimilarity matrix.
fviz_dist(dist(df), show_labels = FALSE) +
  labs(title = "Iris data")
Blocks of high similarity along the diagonal indicate that the iris data has a clear cluster structure.
fviz_dist(dist(random_df), show_labels = FALSE) +
  labs(title = "Random data")
Here the dissimilarities are spread uniformly, with no block structure, so the random data is not clusterable.
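For convenience, factoextra also bundles both checks into a single function, get_clust_tendency(), which returns the Hopkins statistic together with the ordered dissimilarity plot. A minimal sketch follows; note that, depending on the factoextra version, its Hopkins statistic may be defined as 1 minus the clustertend value, so check the package documentation for which direction indicates clusterability.
# Hopkins statistic and ordered dissimilarity image in one call
set.seed(123)
res <- get_clust_tendency(df, n = 50, graph = TRUE)
res$hopkins_stat # the Hopkins statistic
res$plot # the ordered dissimilarity image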