Unsupervised Techniques: Clustering

Packages Used

library(fpc)
library(factoextra)
library(NbClust)
library(cluster)
library(mclust)
library(e1071)
library(dbscan)

Dataset

We will use the Iris dataset to compare clustering algorithms.
The dataset consists of 150 observations on 4 variables: Sepal Length, Sepal Width, Petal Length and Petal Width.
The observations cover three species (Setosa, Versicolor and Virginica) with 50 samples each. The dataset was originally used to develop a linear discriminant model to distinguish the species from each other, and it is commonly used as a test case for statistical classification techniques.

Types of Clustering

Clustering is an unsupervised machine learning technique for identifying groups of observations with similar profiles according to a specified criterion. Similarity between observations is defined using distance measures such as the Euclidean distance, the Manhattan distance, and various correlation-based distances.
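
For illustration, here is a minimal sketch of how such distances can be computed in base R on a few iris observations (the helper object x exists only for this example):

x <- iris[1:3, -5]                 # first three observations, numeric columns only
dist(x, method = "euclidean")      # straight-line distance
dist(x, method = "manhattan")      # sum of absolute coordinate differences
as.dist(1 - cor(t(as.matrix(x))))  # a simple correlation-based distance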

There are different clustering methodologies, which can be grouped into five general strategies, each listed here with an example:

Partitioning Methods: K-Means Clustering

Hierarchical Methods: Hierarchical Clustering

Fuzzy Clustering: Fuzzy C-Means Clustering

Density-Based Clustering: DBSCAN Clustering

Model-Based Clustering: Normal Mixture Model Clustering

Before we move on to the clustering methodologies themselves, we need to find the optimal number of clusters, since several methods require the number of clusters to be specified in advance.

Elbow Method

We plot the within-cluster sum of squares against the number of clusters and look for a bend or "knee" in the plot, which is taken as an indicator of the appropriate number of clusters.

set.seed(123)
# Compute and plot wss for k = 2 to k = 15
k.max <- 15 # Maximal number of clusters
data <- iris[,-5]           # numeric measurements only
iris.scaled <- scale(data)  # standardized copy, used by later methods
wss <- sapply(1:k.max, 
              function(k){kmeans(data, k, nstart=10 )$tot.withinss})
plot(1:k.max, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")
abline(v = 3, lty =2)

fviz_nbclust(data, kmeans, method = c("wss"))+geom_vline(xintercept = 3, linetype = 2)

We find that three clusters are optimal using this method.

Average Silhouette Method

The average silhouette method (Kaufman and Rousseeuw, 1990) measures the quality of a clustering by determining how well each observation lies within its cluster; a high average silhouette width indicates good clustering. We plot the average silhouette width against the number of clusters and look for the maximum.

fviz_nbclust(data, kmeans, method = c("silhouette"))

We find that two clusters are optimal using this method.
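
For reference, the average silhouette width can also be computed directly with silhouette() from the cluster package; this is a minimal sketch comparing k = 2 and k = 3 (the helper objects d, km and sil exist only for this example):

# Sketch: average silhouette width for k = 2 and k = 3
d <- dist(data)
for (k in 2:3) {
  km <- kmeans(data, centers = k, nstart = 10)
  sil <- silhouette(km$cluster, d)
  cat("k =", k, "average silhouette width:", mean(sil[, "sil_width"]), "\n")
}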

Gap Statistic Method

The gap statistic method (Tibshirani, Walther, and Hastie, 2001) compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data, i.e. a distribution with no underlying clustering. Unlike the methods above, which rely on visual inspection of distance-based measures, it provides a formal statistical procedure.

set.seed(123)
gap_stat <- clusGap(iris.scaled, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
fviz_gap_stat(gap_stat)
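
The suggested number of clusters can also be read off numerically from the clusGap object via its print method; a small sketch (here using the "firstmax" rule, one of several available):

# Print the gap statistic table together with the suggested number of clusters
print(gap_stat, method = "firstmax")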

Using 30 Different Indices

The NbClust package in R computes 30 different indices for determining the number of clusters and suggests the best number by majority rule.

nb <- NbClust(iris.scaled, distance = "euclidean", min.nc = 2,
              max.nc = 10, method = "complete", index ="all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 2 proposed 2 as the best number of clusters 
## * 18 proposed 3 as the best number of clusters 
## * 3 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
fviz_nbclust(nb) + theme_minimal()
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 2 proposed  2 as the best number of clusters
## * 18 proposed  3 as the best number of clusters
## * 3 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

We find that three clusters are proposed by 18 of the indices, making three the majority choice.

K-Means Clustering

We first form two clusters, as suggested by the silhouette method, and find the following cluster formation. Only 3 observations are misclassified: Setosa falls entirely in Cluster 1, while Versicolor and Virginica are grouped together in Cluster 2.

irisCluster <- kmeans(iris[,-5], 2, nstart = 20)
table(irisCluster$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1     50          3         0
##   2      0         47        50
plotcluster(iris[,-5], irisCluster$cluster)

aggregate(iris[,-5],by=list(irisCluster$cluster),FUN=mean)
##   Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1       1     5.005660    3.369811     1.560377    0.290566
## 2       2     6.301031    2.886598     4.958763    1.695876
plot(iris[c("Sepal.Length", "Sepal.Width")], col=irisCluster$cluster)
points(irisCluster$centers[,c("Sepal.Length", "Sepal.Width")], col=1:3, pch=23, cex=3)

fviz_cluster(irisCluster, data = iris[,-5], geom = "point",
             stand = FALSE, ellipse.type = "norm")

We then form three clusters, as suggested by most of the previous methods, and find the following cluster formation. Now 16 observations are misclassified: Setosa falls entirely in Cluster 1, Versicolor mostly in Cluster 3, and Virginica mostly in Cluster 2.

irisCluster <- kmeans(iris[,-5], 3, nstart = 20)
table(irisCluster$cluster, iris$Species)
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0          2        36
##   3      0         48        14
plotcluster(iris[,-5], irisCluster$cluster)

aggregate(iris[,-5],by=list(irisCluster$cluster),FUN=mean)
##   Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1       1     5.006000    3.428000     1.462000    0.246000
## 2       2     6.850000    3.073684     5.742105    2.071053
## 3       3     5.901613    2.748387     4.393548    1.433871
plot(iris[c("Sepal.Length", "Sepal.Width")], col=irisCluster$cluster)
points(irisCluster$centers[,c("Sepal.Length", "Sepal.Width")], col=1:3, pch=23, cex=3)

fviz_cluster(irisCluster, data = iris[,-5], geom = "point",
             stand = FALSE, ellipse.type = "norm")

Hierarchical Clustering

Hierarchical clustering can be divided into agglomerative and divisive clustering. The former starts with each element as its own cluster and successively merges the most similar clusters until a single root cluster remains; the latter starts with one big cluster and splits it until every data point is in its own cluster.
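
The methods below are all agglomerative. For completeness, here is a minimal sketch of divisive clustering with diana() from the cluster package; it is not part of the comparison that follows:

# Sketch: divisive hierarchical clustering with cluster::diana()
iris.diana <- diana(iris.scaled)
plot(as.hclust(iris.diana), labels = FALSE, hang = -1)
iris.3diana <- cutree(as.hclust(iris.diana), k = 3)
table(iris.3diana, iris$Species)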

Method 1: Complete Linkage

This method computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2 and takes the maximum of these dissimilarities as the distance between the two clusters. It tends to produce compact clusters.

iris.dist=dist(iris.scaled)
iris.hclust2= hclust(iris.dist, method="complete")
plot(iris.hclust2, labels= FALSE, hang=-1)
rect.hclust(iris.hclust2,k=3, border=2:4)

iris.3clust2 = cutree(iris.hclust2,k=3)
plotcluster(iris[,-5], iris.3clust2)

Method 2: Single Linkage

This method computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2 and takes the smallest of these dissimilarities as the linkage criterion. It tends to produce loose, elongated clusters.

iris.dist=dist(iris.scaled)
iris.hclust3= hclust(iris.dist, method="single")
plot(iris.hclust3, labels= FALSE, hang=-1)
rect.hclust(iris.hclust3,k=3, border=2:4)

iris.3clust3 = cutree(iris.hclust3,k=3)
plotcluster(iris[,-5], iris.3clust3)

Method 3: Ward’s Method

This method minimizes the total within-cluster variance. At each step, the pair of clusters whose merger leads to the minimum increase in total within-cluster variance is merged.

iris.dist=dist(iris.scaled)
iris.hclust=hclust(iris.dist, method="ward.D")
plot(iris.hclust, labels= FALSE, hang=-1)
rect.hclust(iris.hclust,k=3, border=2:4)

iris.3clust = cutree(iris.hclust,k=3)
plotcluster(iris[,-5], iris.3clust)
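
As with k-means, we can gauge how well the three Ward clusters match the species by cross-tabulating the labels; a quick sketch:

# Cross-tabulate Ward's three-cluster solution against the species labels
table(iris.3clust, iris$Species)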

Fuzzy C-Means Clustering

In K-means, the data are divided into distinct clusters, and each element is assigned to exactly one cluster. This type of clustering is also known as hard clustering or non-fuzzy clustering.

Unlike K-means, fuzzy clustering is considered a soft clustering method, in which each element has a probability of belonging to each cluster. In other words, each element has a set of membership coefficients corresponding to its degree of belonging to each cluster.

Points close to the center of a cluster may belong to it to a higher degree than points at the edge of the cluster. The degree to which an element belongs to a given cluster is a numerical value in [0, 1].

The fuzzy c-means (FCM) algorithm is one of the most widely used fuzzy clustering algorithms. It was developed by Dunn in 1973, improved by Bezdek in 1981, and is frequently used in pattern recognition.

cm3 <- cmeans(iris.scaled, 3)
cm3
## Fuzzy c-means clustering with 3 clusters
## 
## Cluster centers:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1  -0.03832267 -0.81607171    0.3218358    0.231315
## 2  -1.00142059  0.84368087   -1.2803648   -1.234510
## 3   1.06560838  0.03725651    0.9668919    1.026295
## 
## Memberships:
##                  1           2            3
##   [1,] 0.005296245 0.991568914 0.0031348412
##   [2,] 0.119216534 0.828946583 0.0518368828
##   [3,] 0.047304040 0.929381973 0.0233139870
##   [4,] 0.089184354 0.869522462 0.0412931840
##   [5,] 0.016047918 0.973940846 0.0100112362
##   ... (membership rows [6,] through [150,] omitted for brevity)
## 
## Closest hard clustering:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 1 1 1 3 1 1 1 1 1 1 1 1 3 1 1 1 1
##  [71] 3 1 1 1 1 3 3 3 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 3 3
## [106] 3 1 3 3 3 3 3 3 1 3 3 3 3 3 1 3 1 3 1 3 3 1 3 3 3 3 3 3 1 1 3 3 3 1 3
## [141] 3 3 1 3 3 3 1 3 3 1
## 
## Available components:
## [1] "centers"     "size"        "cluster"     "membership"  "iter"       
## [6] "withinerror" "call"
fviz_cluster(list(data = iris.scaled, cluster=cm3$cluster), frame.type = "norm",
             frame.level = 0.68)
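
The closest hard clustering reported above can be compared with the species labels in the same way as the k-means solutions; a quick sketch:

# Compare the hard assignments from fuzzy c-means with the species labels
table(cm3$cluster, iris$Species)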

cm2 <- cmeans(iris.scaled, 2)
cm2
## Fuzzy c-means clustering with 2 clusters
## 
## Cluster centers:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1   -0.9711458   0.7388322   -1.2136548   -1.171447
## 2    0.5766253  -0.3628038    0.6899806    0.670941
## 
## Memberships:
##                  1           2
##   [1,] 0.990476315 0.009523685
##   [2,] 0.930783798 0.069216202
##   [3,] 0.970021507 0.029978493
##   [4,] 0.945155289 0.054844711
##   [5,] 0.978231918 0.021768082
##   ... (membership rows [6,] through [150,] omitted for brevity)
## 
## Closest hard clustering:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2
## 
## Available components:
## [1] "centers"     "size"        "cluster"     "membership"  "iter"       
## [6] "withinerror" "call"
fviz_cluster(list(data = iris.scaled, cluster=cm2$cluster), frame.type = "norm",
             frame.level = 0.68)

DBSCAN Clustering

Partitioning methods (K-means, PAM) and hierarchical clustering are suitable for finding spherical or convex clusters; in other words, they work well for compact, well-separated clusters. Moreover, they are severely affected by the presence of noise and outliers in the data.

DBSCAN, a density-based clustering algorithm introduced by Ester et al. (1996), can identify clusters of any shape in a data set containing noise and outliers.

The advantages of DBSCAN are:

Unlike K-means, DBSCAN does not require the user to specify the number of clusters in advance.

DBSCAN can find clusters of any shape; the clusters do not have to be circular.

DBSCAN can identify outliers.

Two important parameters are required for DBSCAN: epsilon ("eps") and minimum points ("MinPts"). The parameter eps defines the radius of the neighborhood around a point x, called the ε-neighborhood of x. The parameter MinPts is the minimum number of neighbors within the eps radius.

The method proposed for choosing optimal values of eps and MinPts consists of computing the k-nearest-neighbor distances for the points in the data matrix.

The idea is to calculate the average distance from every point to its k nearest neighbors, where k is specified by the user and corresponds to MinPts.

Next, these k-distances are plotted in ascending order. The aim is to determine the "knee", which corresponds to the optimal eps parameter.

A knee corresponds to a threshold where a sharp change occurs along the k-distance curve.

iris.mat <- as.matrix(iris[, 1:4])  # numeric columns only; avoids overwriting the built-in iris
dbscan::kNNdistplot(iris.mat, k = 4)
abline(h = 0.4, lty = 2)

set.seed(123)
# fpc package
res.fpc <- fpc::dbscan(iris.mat, eps = 0.4, MinPts = 4)
fviz_cluster(res.fpc, iris.mat, geom = "point")
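
The dbscan package offers a faster implementation of the same algorithm (its argument is spelled minPts rather than MinPts). A minimal sketch fitting the same model and checking agreement with the fpc fit:

# Same model via the dbscan package; cluster 0 marks noise points
res.db <- dbscan::dbscan(iris.mat, eps = 0.4, minPts = 4)
res.db
table(dbscan = res.db$cluster, fpc = res.fpc$cluster)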

Normal Mixture Model Clustering

Traditional clustering methods, such as hierarchical clustering and partitioning algorithms (k-means and others), are heuristic and are not based on formal models.

An alternative is model-based clustering, in which the data are considered as coming from a distribution that is a mixture of two or more components (i.e. clusters) (Fraley and Raftery, 2002, 2012).

Each component k (i.e. group or cluster) is modeled by the normal (Gaussian) distribution, which is characterized by the following parameters:

μk: the mean vector; Σk: the covariance matrix; and an associated mixing probability. Each point has a probability of belonging to each cluster.

mclust_result = Mclust(iris.scaled)
summary(mclust_result)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 2 components:
## 
##  log.likelihood   n df       BIC       ICL
##       -322.6936 150 29 -790.6956 -790.6969
## 
## Clustering table:
##   1   2 
##  50 100
mclust_result$G
## [1] 2
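
Mclust selects both the number of components and the covariance model by BIC; the comparison can be inspected directly, as in this short sketch using mclust's plot method and factoextra's fviz_mclust():

plot(mclust_result, what = "BIC")                             # BIC across covariance models and component counts
fviz_mclust(mclust_result, "classification", geom = "point")  # resulting classification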