The Expectation-Maximization (EM) algorithm is an iterative method for estimating the parameters of statistical models with hidden (latent) variables. It alternates between computing the expected values of those variables under the current parameters (E-step) and re-estimating the parameters by maximizing the expected complete-data log-likelihood given those values (M-step). The steps repeat until convergence; each iteration is guaranteed not to decrease the observed-data likelihood, although the algorithm may converge to a local rather than a global maximum.
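To make the two steps concrete, here is a minimal sketch of EM for a two-component, one-dimensional Gaussian mixture fitted to the waiting times. It is an illustration only and not how mclust is implemented: the function name `em_gmm_1d`, the quartile-based starting values, and the stopping tolerance are all assumptions made for this example.

```r
# Minimal EM sketch for a two-component 1-D Gaussian mixture.
# Hypothetical helper: name, starting values and tolerance are illustrative.
em_gmm_1d <- function(x, max_iter = 200, tol = 1e-8) {
  # Crude starting values: lower and upper quartiles as component means
  mu     <- unname(quantile(x, c(0.25, 0.75)))
  sigma  <- rep(sd(x), 2)
  pro    <- c(0.5, 0.5)
  loglik <- numeric(0)

  for (iter in seq_len(max_iter)) {
    # E-step: weighted component densities, then posterior probabilities
    dens   <- cbind(pro[1] * dnorm(x, mu[1], sigma[1]),
                    pro[2] * dnorm(x, mu[2], sigma[2]))
    loglik <- c(loglik, sum(log(rowSums(dens))))
    resp   <- dens / rowSums(dens)

    # M-step: responsibility-weighted updates of all parameters
    nk    <- colSums(resp)
    pro   <- nk / length(x)
    mu    <- colSums(resp * x) / nk
    sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)

    # Stop once the log-likelihood has stopped improving
    if (iter > 1 && abs(loglik[iter] - loglik[iter - 1]) < tol) break
  }
  list(pro = pro, mean = mu, sd = sigma, loglik = loglik)
}

fit <- em_gmm_1d(faithful$waiting)
fit$pro    # estimated mixing proportions
fit$mean   # estimated component means
```

Plotting `fit$loglik` against the iteration number shows the non-decreasing log-likelihood described above.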
library(mclust)
## Warning: package 'mclust' was built under R version 4.5.3
## Package 'mclust' version 6.1.2
## Type 'citation("mclust")' for citing this R package in publications.
data <- faithful
head(data)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
str(data)
## 'data.frame': 272 obs. of 2 variables:
## $ eruptions: num 3.6 1.8 3.33 2.28 4.53 ...
## $ waiting : num 79 54 74 62 85 55 88 85 51 85 ...
summary(data)
##    eruptions        waiting
##  Min.   :1.600   Min.   :43.0
##  1st Qu.:2.163   1st Qu.:58.0
##  Median :4.000   Median :76.0
##  Mean   :3.488   Mean   :70.9
##  3rd Qu.:4.454   3rd Qu.:82.0
##  Max.   :5.100   Max.   :96.0
plot(faithful$eruptions, faithful$waiting,
     main = "Old Faithful Geyser",
     xlab = "Eruption Time",
     ylab = "Waiting Time",
     pch = 19, col = "darkblue")
hist(faithful$eruptions, col = "skyblue", main = "Histogram Eruptions")
hist(faithful$waiting, col = "salmon", main = "Histogram Waiting Time")
The faithful dataset contains two variables: eruption duration in minutes (eruptions) and waiting time until the next eruption in minutes (waiting). The scatter plot shows a clear positive relationship: longer eruptions tend to be followed by longer waiting times.
The distribution of both variables appears bimodal, indicating the presence of two distinct groups: short eruptions with shorter waiting times, and long eruptions with longer waiting times. This suggests that the data likely come from a mixture of two underlying distributions, making it well-suited for clustering using the Expectation-Maximization (EM) algorithm.
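The two variables are on very different scales: eruptions ranges from 1.6 to 5.1 minutes, while waiting ranges from 43 to 96 minutes. Without standardization, the variable with the larger numeric range (waiting) can dominate the fit, particularly for the spherical covariance models that Mclust considers, so we first scale both variables to zero mean and unit variance.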
data_scaled <- scale(data)
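As a quick check of what `scale()` does with its default arguments, the sketch below repeats the standardization by hand and compares the two results. The object names `x_raw`, `x_scaled`, and `x_manual` are introduced here for illustration only.

```r
# Use the two numeric columns only, in case a cluster label has been added
x_raw <- as.matrix(faithful[, c("eruptions", "waiting")])

# scale() defaults: subtract each column mean, divide by each column sd
x_scaled <- scale(x_raw)

# The same transformation written out by hand
x_manual <- sweep(x_raw, 2, colMeans(x_raw), "-")
x_manual <- sweep(x_manual, 2, apply(x_raw, 2, sd), "/")

round(colMeans(x_scaled), 10)   # column means after scaling
apply(x_scaled, 2, sd)          # column standard deviations after scaling
max(abs(x_scaled - x_manual))   # largest difference between the two versions
```

The centering and scaling constants are stored as the attributes `"scaled:center"` and `"scaled:scale"` of the result, which is useful later for converting fitted parameters back to minutes.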
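Next, we fit a Gaussian mixture model (GMM), which represents the data as a weighted combination of several Gaussian distributions. This lets us capture hidden subpopulations and assign each observation a probability of belonging to each cluster, which is more flexible than hard clustering methods. Called without further arguments, Mclust() fits a range of component counts and covariance structures by EM and keeps the model with the best BIC.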
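For reference, the mixture density and the E-step quantity can be written as follows. The notation is an assumption chosen to match mclust's naming: $G$ components, mixing proportions $\pi_k$ (reported as `pro`), and posterior probabilities $z_{ik}$.

```latex
p(x_i) = \sum_{k=1}^{G} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{G} \pi_k = 1

z_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
              {\sum_{j=1}^{G} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
```

Each observation is assigned to the component with the largest $z_{ik}$, which is what makes the clustering probabilistic rather than hard.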
model <- Mclust(data_scaled)
summary(model)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EEE (ellipsoidal, equal volume, shape and orientation) model with 3
## components:
##
##  log-likelihood   n df       BIC      ICL
##       -380.5144 272 11 -822.6927 -866.328
##
## Clustering table:
##   1   2   3
##  41  97 134
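The histograms suggested two groups, yet the selected model has three components. One way to examine this is to look at the BIC values behind the selection and to compare against a fit restricted to two components. This is a sketch under stated assumptions: the object names `x`, `fit_auto`, and `fit_two` are illustrative, and mclust reports BIC so that larger values are better.

```r
library(mclust)

# Standardize the two numeric columns, as in the analysis above
x <- scale(faithful[, c("eruptions", "waiting")])

fit_auto <- Mclust(x)          # components and covariance model chosen by BIC
fit_two  <- Mclust(x, G = 2)   # two components; covariance model chosen by BIC

summary(fit_auto$BIC)          # best few models according to BIC

# Selected covariance model and BIC for each fit
c(auto = fit_auto$modelName, two = fit_two$modelName)
c(auto = fit_auto$bic, two = fit_two$bic)
```

If the two BIC values are close, the choice between two and three clusters is not strongly supported by the data and should be discussed rather than taken for granted.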
clusters <- model$classification
faithful$cluster <- as.factor(clusters)
head(faithful)
##   eruptions waiting cluster
## 1     3.600      79       1
## 2     1.800      54       2
## 3     3.333      74       1
## 4     2.283      62       2
## 5     4.533      85       3
## 6     2.883      55       2
table(faithful$cluster)
##
##   1   2   3
##  41  97 134
The fitted GMM assigns the faithful data to three clusters: cluster 1 contains 41 observations, cluster 2 contains 97, and cluster 3 contains 134.
plot(faithful$eruptions, faithful$waiting,
     col = faithful$cluster,
     pch = 19,
     main = "Clustering Result (EM - GMM)",
     xlab = "Eruption Time",
     ylab = "Waiting Time")
plot(model, what = "classification")
plot(model, what = "uncertainty")
The EM-based GMM identifies three clusters in the faithful data: a well-separated group of short eruptions with short waiting times, and two overlapping groups representing longer eruptions with moderate to long waiting times. The ellipse contours indicate each cluster’s Gaussian distribution, showing that while one cluster is clearly distinct, the other two have some overlap and uncertainty in their boundaries.
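The uncertainty shown in the plot can also be inspected numerically. The sketch below assumes a refit of the same model under the illustrative name `fit`. It uses the posterior probability matrix `z` stored in the fitted object and recomputes the uncertainty of each observation as one minus its largest posterior probability.

```r
library(mclust)

# Refit on the standardized two-column data (same call as in the analysis)
fit <- Mclust(scale(faithful[, c("eruptions", "waiting")]))

post <- fit$z                          # one row per observation, one column per cluster
unc  <- 1 - apply(post, 1, max)        # uncertainty = 1 - largest posterior probability
hard <- apply(post, 1, which.max)      # cluster with the largest posterior probability

all.equal(unname(unc), unname(fit$uncertainty))   # matches the stored uncertainty
all(unname(hard) == unname(fit$classification))   # matches the stored classification

# The five observations whose cluster assignment is least certain
worst <- order(unc, decreasing = TRUE)[1:5]
round(cbind(post[worst, ], uncertainty = unc[worst]), 3)
```

Observations with high uncertainty are the ones lying where two component ellipses overlap.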
model$parameters
## $pro
## [1] 0.1663753 0.3563687 0.4772559
##
## $mean
##                [,1]      [,2]     [,3]
## eruptions 0.2683357 -1.270568 0.855194
## waiting   0.4845402 -1.206768 0.732183
##
## $variance
## $variance$modelName
## [1] "EEE"
##
## $variance$d
## [1] 2
##
## $variance$G
## [1] 3
##
## $variance$sigma
## , , 1
##
##            eruptions    waiting
## eruptions 0.05999763 0.03061616
## waiting   0.03061616 0.18243359
##
## , , 2
##
##            eruptions    waiting
## eruptions 0.05999763 0.03061616
## waiting   0.03061616 0.18243359
##
## , , 3
##
##            eruptions    waiting
## eruptions 0.05999763 0.03061616
## waiting   0.03061616 0.18243359
##
##
## $variance$Sigma
##            eruptions    waiting
## eruptions 0.05999763 0.03061616
## waiting   0.03061616 0.18243359
##
## $variance$cholSigma
##            eruptions    waiting
## eruptions -0.2449441 -0.1249924
## waiting    0.0000000  0.4084244
model$parameters$pro
## [1] 0.1663753 0.3563687 0.4772559
model$parameters$mean
##                [,1]      [,2]     [,3]
## eruptions 0.2683357 -1.270568 0.855194
## waiting   0.4845402 -1.206768 0.732183
model$parameters$variance
## $modelName
## [1] "EEE"
##
## $d
## [1] 2
##
## $G
## [1] 3
##
## $sigma
## , , 1
##
##            eruptions    waiting
## eruptions 0.05999763 0.03061616
## waiting   0.03061616 0.18243359
##
## , , 2
##
##            eruptions    waiting
## eruptions 0.05999763 0.03061616
## waiting   0.03061616 0.18243359
##
## , , 3
##
##            eruptions    waiting
## eruptions 0.05999763 0.03061616
## waiting   0.03061616 0.18243359
##
##
## $Sigma
##            eruptions    waiting
## eruptions 0.05999763 0.03061616
## waiting   0.03061616 0.18243359
##
## $cholSigma
##            eruptions    waiting
## eruptions -0.2449441 -0.1249924
## waiting    0.0000000  0.4084244
The GMM identifies three clusters with mixing proportions of approximately 16.6%, 35.6%, and 47.7%, so the third cluster is the largest (though just short of a majority) and the first is the smallest. Because the model was fitted to standardized data, the cluster means are expressed in standard-deviation units relative to the overall mean. They show a clear separation: cluster 2 represents short eruptions with short waiting times (both means are negative, i.e. below average), while clusters 1 and 3 correspond to moderate and long eruptions with increasing waiting times. Under the “EEE” covariance structure, all clusters share a single covariance matrix (equal volume, shape, and orientation), so the components differ only in their centers and mixing proportions.
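Since the model was fitted to `data_scaled`, the reported means are not in minutes. A minimal sketch of the back-transformation follows: multiply by the column standard deviations and add the column means that `scale()` stored as attributes. The means are copied from the `model$parameters$mean` output above, and the object names are illustrative.

```r
# Recover the centering and scaling constants used by scale()
x_scaled <- scale(faithful[, c("eruptions", "waiting")])
center   <- attr(x_scaled, "scaled:center")
spread   <- attr(x_scaled, "scaled:scale")

# Cluster means in standardized units, copied from the output above
means_scaled <- matrix(
  c(0.2683357, 0.4845402,    # cluster 1
    -1.270568, -1.206768,    # cluster 2
    0.855194,  0.732183),    # cluster 3
  nrow = 2,
  dimnames = list(c("eruptions", "waiting"), paste0("cluster", 1:3))
)

# Undo the standardization row by row (each row is one variable)
means_raw <- means_scaled * spread + center
round(means_raw, 2)   # cluster centers in minutes
```

With a fitted model object available, the same arithmetic applies directly to `model$parameters$mean` in place of the copied values.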