Clustering using EM Algorithm

Model-based clustering.

Both hierarchical clustering and k-means clustering use a heuristic approach to construct clusters, and do not rely on a formal model.

Model-based clustering assumes a data model and applies the EM algorithm to estimate the most likely model components. The number of clusters is then chosen by comparing the fitted models, for example by BIC.

The quality of a clustering is measured by how well the assumed statistical distributions fit the variables in the dataset.

Gaussian mixture models

Probabilistic model-based clustering techniques are widely used and have shown promising results in many applications, including image segmentation, handwriting recognition, document clustering, topic modeling, and information retrieval.

Model-based clustering approaches attempt to optimize the fit between the observed data and some mathematical model using a probabilistic approach.

In model-based clustering, it is assumed that the data are generated by a mixture of probability distributions in which each component represents a different cluster.
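
As a worked form of this assumption, the mixture density can be written as below. The symbols are notation chosen here for illustration (G components, mixing proportions \pi_k, component densities f_k with parameters \theta_k); the Gaussian case is the one used in the rest of this document.

```latex
p(x) = \sum_{k=1}^{G} \pi_k \, f_k(x \mid \theta_k),
\qquad \pi_k \ge 0, \qquad \sum_{k=1}^{G} \pi_k = 1,
\qquad
f_k(x \mid \theta_k) = \mathcal{N}(x \mid \mu_k, \Sigma_k)
\ \text{in the Gaussian case.}
```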

The parameters of these generative models are typically estimated with the EM algorithm, the most widely used method for estimating the parameters of a finite mixture probability density.

The model-based clustering framework provides a principled way to deal with several problems in this approach, such as choosing the number of component densities (or clusters), the initial values of the parameters (the EM algorithm needs initial parameter values to get started), and the distributions of the component densities (e.g., Gaussian).

EM Algorithm Outline Steps

EM starts from a random or heuristic initialization. Assigning points to clusters requires the model parameters, while estimating the parameters requires the assignments; EM resolves this circularity by iterating two steps:

E-Step.

Determine the expected probability of assignment of data points to clusters with the use of current model parameters.

M-Step.

Determine the optimum model parameters of each mixture by using the assignment probabilities as weights.
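
For a Gaussian mixture with unconstrained covariances, the two steps can be written out as follows. The notation is an assumption made here: x_i is the i-th of n observations, z_{ik} is the probability that x_i belongs to cluster k, and \pi_k, \mu_k, \Sigma_k are the mixing proportion, mean, and covariance of cluster k. Constrained covariance models, such as those fitted by mclust, use a modified covariance update.

```latex
\text{E-step:}\quad
z_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
              {\sum_{j=1}^{G} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}

\text{M-step:}\quad
n_k = \sum_{i=1}^{n} z_{ik}, \qquad
\pi_k = \frac{n_k}{n}, \qquad
\mu_k = \frac{1}{n_k} \sum_{i=1}^{n} z_{ik} \, x_i, \qquad
\Sigma_k = \frac{1}{n_k} \sum_{i=1}^{n} z_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^{\top}
```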

Each cluster is modeled as a multivariate Gaussian distribution, and the model is specified by giving the following:

The number of clusters.
The fraction of all data points that are in each cluster.
Each cluster’s mean and its d-by-d covariance matrix.
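
The E-step and M-step described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the algorithm mclust runs: the function names em_gmm and dmvnorm_rows are hypothetical, the covariances are unconstrained, and the initialization uses k-means rather than hierarchical agglomeration. The simulated data are also made up for the example.

```r
# Minimal EM for a Gaussian mixture (illustrative sketch, base R only).
# em_gmm and dmvnorm_rows are hypothetical names, not part of mclust.

# Multivariate normal density evaluated at each row of X.
dmvnorm_rows <- function(X, mu, Sigma) {
  d <- ncol(X)
  centered <- sweep(X, 2, mu)                        # subtract the mean from each row
  maha <- rowSums((centered %*% solve(Sigma)) * centered)  # squared Mahalanobis distances
  exp(-0.5 * maha) / sqrt((2 * pi)^d * det(Sigma))
}

em_gmm <- function(X, G, max_iter = 200, tol = 1e-8) {
  X <- as.matrix(X)
  n <- nrow(X)
  # Heuristic initialization from a k-means partition (an assumption of this sketch).
  km <- kmeans(X, centers = G, nstart = 5)
  mu <- km$centers
  Sigma <- lapply(seq_len(G), function(k) cov(X[km$cluster == k, , drop = FALSE]))
  prop <- tabulate(km$cluster, nbins = G) / n
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # E-step: assignment probabilities under the current parameters.
    dens <- sapply(seq_len(G), function(k) prop[k] * dmvnorm_rows(X, mu[k, ], Sigma[[k]]))
    loglik <- sum(log(rowSums(dens)))   # log-likelihood at the current parameters
    z <- dens / rowSums(dens)
    # M-step: weighted estimates, using the assignment probabilities as weights.
    nk <- colSums(z)
    prop <- nk / n
    for (k in seq_len(G)) {
      mu[k, ] <- colSums(z[, k] * X) / nk[k]
      centered <- sweep(X, 2, mu[k, ])
      Sigma[[k]] <- crossprod(centered * sqrt(z[, k])) / nk[k]
    }
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(prop = prop, mean = mu, Sigma = Sigma, z = z,
       loglik = loglik, iterations = iter)
}

# Usage on simulated data: two well-separated clusters in two dimensions.
set.seed(1)
X <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 6), ncol = 2))
fit <- em_gmm(X, G = 2)
fit$prop   # fraction of points in each cluster
fit$mean   # one row of means per cluster
```

The returned list mirrors the three items listed above: the number of clusters is the argument G, prop holds the fraction of points in each cluster, and mean and Sigma hold each cluster's mean and d-by-d covariance matrix.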

library(mclust)

## Package 'mclust' version 5.4.10
## Type 'citation("mclust")' for citing this R package in publications.

library(EMCluster)

## Loading required package: MASS

## Loading required package: Matrix

Model-based clustering on the iris dataset using Mclust:

Mclust performs model-based clustering based on parameterized finite Gaussian mixture models. Models are estimated by the EM algorithm, initialized by hierarchical model-based agglomerative clustering. The optimal model is then selected according to BIC.

Analysis of Iris Data: Model-Based Clustering

mod2 <- Mclust(iris[,1:4], G = 3)
summary(mod2, parameters = TRUE)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VEV (ellipsoidal, equal shape) model with 3 components: 
## 
##  log-likelihood   n df       BIC       ICL
##        -186.074 150 38 -562.5522 -566.4673
## 
## Clustering table:
##  1  2  3 
## 50 45 55 
## 
## Mixing probabilities:
##         1         2         3 
## 0.3333333 0.3005423 0.3661243 
## 
## Means:
##               [,1]     [,2]     [,3]
## Sepal.Length 5.006 5.915044 6.546807
## Sepal.Width  3.428 2.777451 2.949613
## Petal.Length 1.462 4.204002 5.482252
## Petal.Width  0.246 1.298935 1.985523
## 
## Variances:
## [,,1]
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length   0.13320850  0.10938369  0.019191764 0.011585649
## Sepal.Width    0.10938369  0.15495369  0.012096999 0.010010130
## Petal.Length   0.01919176  0.01209700  0.028275400 0.005818274
## Petal.Width    0.01158565  0.01001013  0.005818274 0.010695632
## [,,2]
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length   0.22572159  0.07613348   0.14689934  0.04335826
## Sepal.Width    0.07613348  0.08024338   0.07372331  0.03435893
## Petal.Length   0.14689934  0.07372331   0.16613979  0.04953078
## Petal.Width    0.04335826  0.03435893   0.04953078  0.03338619
## [,,3]
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length   0.42943106  0.10784274   0.33452389  0.06538369
## Sepal.Width    0.10784274  0.11596343   0.08905176  0.06134034
## Petal.Length   0.33452389  0.08905176   0.36422115  0.08706895
## Petal.Width    0.06538369  0.06134034   0.08706895  0.08663823
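
The df and BIC values in the summary above can be checked by hand. The sketch below assumes the mclust convention BIC = 2 * log-likelihood - df * log(n), under which larger values are better, and counts the free parameters of a VEV model (variable volume, equal shape, variable orientation). The variable names are chosen here for illustration; the inputs are copied from the printed summary.

```r
# Values copied from the summary output above.
loglik <- -186.074
n <- 150   # observations
d <- 4     # variables
G <- 3     # mixture components

# Free parameters of a VEV model.
n_means  <- G * d                 # one mean vector per component
n_mixing <- G - 1                 # mixing proportions sum to one
n_volume <- G                     # "V": one volume per component
n_shape  <- d - 1                 # "E": one shape shared by all components
n_orient <- G * d * (d - 1) / 2   # "V": one orientation per component
df <- n_means + n_mixing + n_volume + n_shape + n_orient

# BIC in the sign convention assumed above (larger is better).
bic <- 2 * loglik - df * log(n)
df
bic
```

Because the log-likelihood is taken from rounded output, the recomputed BIC agrees with the printed value only up to rounding.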

Using prior:

By default, Mclust assumes no prior. The prior argument allows a conjugate prior on the means and variances to be specified through the function priorControl.

#mod3 <- Mclust(iris[,1:4], prior = priorControl())
#summary(mod3)

#mod4 <- Mclust(iris[,1:4], prior = priorControl(functionName="defaultPrior", shrinkage=0.1))
#summary(mod4)

# let BIC choose the model and the number of clusters
mb <- Mclust(iris[,-5])

# or specify the number of clusters
mb3 <- Mclust(iris[,-5], G = 3)

Optimal selected model (a character string denoting the model at which the optimal BIC occurs):

# optimal selected model
mb$modelName

## [1] "VEV"

# optimal number of clusters
mb$G

## [1] 2

# get probabilities, means, variances
summary(mb3, parameters = TRUE)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VEV (ellipsoidal, equal shape) model with 3 components: 
## 
##  log-likelihood   n df       BIC       ICL
##        -186.074 150 38 -562.5522 -566.4673
## 
## Clustering table:
##  1  2  3 
## 50 45 55 
## 
## Mixing probabilities:
##         1         2         3 
## 0.3333333 0.3005423 0.3661243 
## 
## Means:
##               [,1]     [,2]     [,3]
## Sepal.Length 5.006 5.915044 6.546807
## Sepal.Width  3.428 2.777451 2.949613
## Petal.Length 1.462 4.204002 5.482252
## Petal.Width  0.246 1.298935 1.985523
## 
## Variances:
## [,,1]
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length   0.13320850  0.10938369  0.019191764 0.011585649
## Sepal.Width    0.10938369  0.15495369  0.012096999 0.010010130
## Petal.Length   0.01919176  0.01209700  0.028275400 0.005818274
## Petal.Width    0.01158565  0.01001013  0.005818274 0.010695632
## [,,2]
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length   0.22572159  0.07613348   0.14689934  0.04335826
## Sepal.Width    0.07613348  0.08024338   0.07372331  0.03435893
## Petal.Length   0.14689934  0.07372331   0.16613979  0.04953078
## Petal.Width    0.04335826  0.03435893   0.04953078  0.03338619
## [,,3]
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length   0.42943106  0.10784274   0.33452389  0.06538369
## Sepal.Width    0.10784274  0.11596343   0.08905176  0.06134034
## Petal.Length   0.33452389  0.08905176   0.36422115  0.08706895
## Petal.Width    0.06538369  0.06134034   0.08706895  0.08663823

Compare the cluster assignments with the true species labels:

table(iris$Species, mb$classification)

##             
##               1  2
##   setosa     50  0
##   versicolor  0 50
##   virginica   0 50

# vs
table(iris$Species, mb3$classification)

##             
##               1  2  3
##   setosa     50  0  0
##   versicolor  0 45  5
##   virginica   0  0 50
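
The two cross-tabulations above can be summarized numerically. The sketch below re-enters the printed tables by hand and computes the adjusted Rand index in base R; the helper name ari_from_table is hypothetical. The mclust package also provides adjustedRandIndex() and classError() for this purpose, which are not used here so that the example runs without any package.

```r
# Adjusted Rand index from a contingency table (hypothetical helper, base R only).
ari_from_table <- function(tab) {
  choose2 <- function(x) x * (x - 1) / 2
  sum_cells <- sum(choose2(tab))
  sum_rows  <- sum(choose2(rowSums(tab)))
  sum_cols  <- sum(choose2(colSums(tab)))
  expected  <- sum_rows * sum_cols / choose2(sum(tab))
  (sum_cells - expected) / ((sum_rows + sum_cols) / 2 - expected)
}

species <- c("setosa", "versicolor", "virginica")

# Counts copied from table(iris$Species, mb$classification).
tab2 <- matrix(c(50,  0,
                  0, 50,
                  0, 50),
               nrow = 3, byrow = TRUE,
               dimnames = list(species, c("1", "2")))

# Counts copied from table(iris$Species, mb3$classification).
tab3 <- matrix(c(50,  0,  0,
                  0, 45,  5,
                  0,  0, 50),
               nrow = 3, byrow = TRUE,
               dimnames = list(species, c("1", "2", "3")))

ari2 <- ari_from_table(tab2)
ari3 <- ari_from_table(tab3)

# Share of observations outside the majority species of their cluster.
err3 <- 1 - sum(apply(tab3, 2, max)) / sum(tab3)

c(ari2 = ari2, ari3 = ari3, err3 = err3)
```

The three-cluster fit agrees with the species labels much more closely than the two-cluster fit selected by BIC, which merges versicolor and virginica into one cluster.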

plot(mb, what=c("classification"))

plot(mb3, what=c("classification"))

Estimated density and contour plots:

plot(mb, "density")

plot(mb3, "density")

Application on 3 Datasets

EMCluster provides four small datasets for testing and demonstration. Three of them, da1, da2, and da3, are used here; each is stored as a list.

Details of Synthetic Datasets

da1 has 500 observations in two dimensions x and y, and they are in 10 clusters given in da1$class.

da2 has 2,500 observations in two dimensions, too. The true parameters are given in da1$pi, da1$Mu, and da1$LTSigma.

There are 40 clusters given in da2$class for this dataset.

da3 is similar to da2, but with lower overlaps between clusters.

plot(da1$da)

x1 <- da1$da

mbx.1 = Mclust(x1)
mbx1 = Mclust(x1, 10)

mbx1$modelName

## [1] "VEE"

# optimal number of clusters
mbx.1$G

## [1] 9

# get probabilities, means, variances
summary(mbx1, parameters = TRUE)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VEE (ellipsoidal, equal shape and orientation) model with 10 components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -5697.546 500 41 -11649.89 -11670.29
## 
## Clustering table:
##  1  2  3  4  5  6  7  8  9 10 
## 49 39 90 38 42 35 57 51 41 58 
## 
## Mixing probabilities:
##          1          2          3          4          5          6          7 
## 0.09930060 0.07643034 0.18056886 0.07628735 0.08390980 0.06998730 0.11365066 
##          8          9         10 
## 0.10186509 0.08690034 0.11109965 
## 
## Means:
##        [,1]      [,2]     [,3]      [,4]      [,5]      [,6]      [,7]
## x  63.75779  17.53929 108.4392  187.9146 -27.68478 -45.57327 180.82467
## y -10.71563 190.19090 127.2173 -118.2376 -96.22916  62.66587  19.21047
##         [,8]     [,9]    [,10]
## x   52.82972 454.4464 362.8577
## y -148.44373 446.2849 361.8947
## 
## Variances:
## [,,1]
##          x        y
## x 429.6931 119.5638
## y 119.5638 553.9733
## [,,2]
##           x         y
## x 290.46111  80.82198
## y  80.82198 374.47125
## [,,3]
##          x        y
## x 876.5011  243.890
## y 243.8900 1130.012
## [,,4]
##           x         y
## x 345.58316  96.15991
## y  96.15991 445.53625
## [,,5]
##           x         y
## x 277.75695  77.28699
## y  77.28699 358.09265
## [,,6]
##          x        y
## x 385.7320 107.3315
## y 107.3315 497.2973
## [,,7]
##          x        y
## x 394.3108 109.7186
## y 109.7186 508.3574
## [,,8]
##           x         y
## x 326.90298  90.96207
## y  90.96207 421.45321
## [,,9]
##           x         y
## x 1200.4819  334.0389
## y  334.0389 1547.6975
## [,,10]
##          x         y
## x 983.2847  273.6029
## y 273.6029 1267.6804

plot(mbx1, what=c("classification"))

plot(mbx1, "density")

plot(da2$da)

x2 <- da2$da

mbx.2 = Mclust(x2)
mbx2 = Mclust(x2, 40)

mbx2$modelName

## [1] "EEV"

# optimal number of clusters
mbx.2$G

## [1] 8

# get probabilities, means, variances
summary(mbx2, parameters = TRUE)

## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EEV (ellipsoidal, equal volume and shape) model with 40 components: 
## 
##  log-likelihood    n  df       BIC       ICL
##       -10319.05 2500 161 -21897.78 -22552.76
## 
## Clustering table:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##  88  88 144  93 115 229 104  54 124 100  80  67 108  16  94   7  19  91  91  37 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
## 104  18  43 163  63  48  49  48  13  15  86  12   4   9  35  13   2   2  10  14 
## 
## Mixing probabilities:
##            1            2            3            4            5            6 
## 0.0294982820 0.0344021912 0.0575770502 0.0353921923 0.0452651852 0.0780010145 
##            7            8            9           10           11           12 
## 0.0414476728 0.0217166349 0.0475926263 0.0368884450 0.0310344836 0.0286318930 
##           13           14           15           16           17           18 
## 0.0427323360 0.0075742372 0.0372742043 0.0026426597 0.0074565349 0.0362917078 
##           19           20           21           22           23           24 
## 0.0360519479 0.0147586233 0.0424670371 0.0086358560 0.0234881034 0.0636497890 
##           25           26           27           28           29           30 
## 0.0249928415 0.0195844494 0.0200209925 0.0191658641 0.0070669467 0.0055274797 
##           31           32           33           34           35           36 
## 0.0351765006 0.0187315403 0.0017201890 0.0038719748 0.0140904770 0.0063078682 
##           37           38           39           40 
## 0.0008000696 0.0015907432 0.0043853145 0.0064960413 
## 
## Means:
##       [,1]     [,2]     [,3]      [,4]     [,5]     [,6]      [,7]      [,8]
## x 9.377612 5.439021 10.12150  7.230348 10.06642 12.16460 12.154130 14.696365
## y 5.398966 6.476544 10.49702 11.214918 12.00152 13.91635  6.333163  5.493335
##        [,9]    [,10]    [,11]    [,12]     [,13]    [,14]    [,15]    [,16]
## x 12.949193 13.36952 9.114686 14.83528  7.199516 11.99718 5.007453 8.045737
## y  9.019692 12.83218 7.907965 13.28458 13.169988 12.37493 8.494978 6.929531
##      [,17]     [,18]     [,19]    [,20]    [,21]     [,22]    [,23]     [,24]
## x 7.078594  6.157478  9.269853 5.312778 7.107803  8.747881 9.931345 10.842052
## y 7.664378 12.452539 14.965230 4.767730 9.539244 11.557376 5.337512  8.116546
##       [,25]     [,26]    [,27]    [,28]     [,29]    [,30]     [,31]    [,32]
## x  5.219525  6.998458 14.22631 11.94853 12.825932 8.077504  8.285024 11.77862
## y 13.745859 14.133844 14.57954 10.83940  8.021631 4.704905 10.206519 13.97735
##      [,33]     [,34]     [,35]    [,36]      [,37]    [,38]    [,39]    [,40]
## x 3.473921  4.032722 13.041216 9.879166  0.8771717 13.96137 6.007711 5.384331
## y 9.423857 11.164934  6.111048 8.766057 10.4078374  8.96435 9.211041 5.350121
## 
## Variances:
## [,,1]
##            x          y
## x 0.09089949 0.07713374
## y 0.07713374 0.34056782
## [,,2]
##            x          y
## x  0.2914174 -0.1257185
## y -0.1257185  0.1400499
## [,,3]
##             x           y
## x  0.35236901 -0.05351601
## y -0.05351601  0.07909831
## [,,4]
##            x          y
## x 0.10051223 0.09086916
## y 0.09086916 0.33095509
## [,,5]
##           x         y
## x 0.2457101 0.1436475
## y 0.1436475 0.1857572
## [,,6]
##             x           y
## x  0.33477302 -0.08580683
## y -0.08580683  0.09669429
## [,,7]
##            x          y
## x  0.3003963 -0.1198559
## y -0.1198559  0.1310711
## [,,8]
##            x          y
## x 0.34863467 0.06221336
## y 0.06221336 0.08283264
## [,,9]
##           x         y
## x 0.2723507 0.1353798
## y 0.1353798 0.1591166
## [,,10]
##           x         y
## x 0.1139555 0.1057090
## y 0.1057090 0.3175118
## [,,11]
##            x          y
## x  0.3091420 -0.1131727
## y -0.1131727  0.1223253
## [,,12]
##            x          y
## x 0.35798392 0.03602839
## y 0.03602839 0.07348339
## [,,13]
##            x          y
## x 0.08159311 0.05949366
## y 0.05949366 0.34987420
## [,,14]
##           x         y
## x 0.2479988 0.1431508
## y 0.1431508 0.1834685
## [,,15]
##            x          y
## x  0.3266094 -0.0961236
## y -0.0961236  0.1048579
## [,,16]
##            x          y
## x 0.35850795 0.03389223
## y 0.03389223 0.07295936
## [,,17]
##            x          y
## x 0.35048043 0.05810757
## y 0.05810757 0.08098689
## [,,18]
##            x          y
## x 0.09945273 0.08950937
## y 0.08950937 0.33201458
## [,,19]
##             x           y
## x  0.35996090 -0.02704966
## y -0.02704966  0.07150641
## [,,20]
##             x           y
## x  0.08085131 -0.05779217
## y -0.05779217  0.35061600
## [,,21]
##             x           y
## x  0.07048621 -0.02088927
## y -0.02088927  0.36098110
## [,,22]
##             x           y
## x  0.08511608 -0.06687475
## y -0.06687475  0.34635123
## [,,23]
##            x          y
## x  0.3071500 -0.1147878
## y -0.1147878  0.1243173
## [,,24]
##            x          y
## x  0.3114502 -0.1112274
## y -0.1112274  0.1200171
## [,,25]
##           x         y
## x 0.3278207 0.0947084
## y 0.0947084 0.1036466
## [,,26]
##           x         y
## x 0.1854140 0.1435754
## y 0.1435754 0.2460533
## [,,27]
##            x          y
## x  0.1517596 -0.1320625
## y -0.1320625  0.2797077
## [,,28]
##           x         y
## x 0.1904220 0.1445424
## y 0.1445424 0.2410453
## [,,29]
##           x         y
## x 0.2503525 0.1425999
## y 0.1425999 0.1811148
## [,,30]
##           x         y
## x 0.1130983 0.1048769
## y 0.1048769 0.3183690
## [,,31]
##            x          y
## x  0.1548514 -0.1335161
## y -0.1335161  0.2766159
## [,,32]
##           x         y
## x 0.1068358 0.0983587
## y 0.0983587 0.3246315
## [,,33]
##           x         y
## x 0.2102951 0.1466411
## y 0.1466411 0.2211723
## [,,34]
##            x          y
## x 0.35818726 0.03521579
## y 0.03521579 0.07328005
## [,,35]
##            x          y
## x 0.07176199 0.02837856
## y 0.02837856 0.35970532
## [,,36]
##           x         y
## x 0.2625748 0.1390651
## y 0.1390651 0.1688925
## [,,37]
##          x         y
## x 0.207721 0.1465230
## y 0.146523 0.2237463
## [,,38]
##            x          y
## x  0.2588308 -0.1402705
## y -0.1402705  0.1726365
## [,,39]
##             x           y
## x  0.33329191 -0.08782505
## y -0.08782505  0.09817540
## [,,40]
##             x           y
## x  0.36179408 -0.01412568
## y -0.01412568  0.06967323

plot(mbx2, what=c("classification"))

## Warning in mclust2Dplot(data = data[, dimens, drop = FALSE], what =
## "classification", : more symbols needed to show classification

## Warning in mclust2Dplot(data = data[, dimens, drop = FALSE], what =
## "classification", : more colors needed to show classification

plot(da3$da)

x3 <- da3$da


2022-07-23
