In this lesson, we cover expectation-maximization theory and then apply and evaluate expectation-maximization-based clustering.
To run the code in M04_Lesson_03.Rmd you may need additional packages.
install.packages("mclust");
require(mclust)
## Loading required package: mclust
## Package 'mclust' version 5.1
## Type 'citation("mclust")' for citing this R package in publications.
We’ll be using the “diabetes” data that comes with the mclust package.
data(diabetes)
summary(diabetes)
## class glucose insulin sspg
## Chemical:36 Min. : 70 Min. : 45.0 Min. : 10.0
## Normal :76 1st Qu.: 90 1st Qu.: 352.0 1st Qu.:118.0
## Overt :33 Median : 97 Median : 403.0 Median :156.0
## Mean :122 Mean : 540.8 Mean :186.1
## 3rd Qu.:112 3rd Qu.: 558.0 3rd Qu.:221.0
## Max. :353 Max. :1568.0 Max. :748.0
Data References
G.M. Reaven and R.G. Miller, Diabetologia 16:17-24 (1979).
The expectation-maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. Expectation-maximization is perhaps the most often used algorithm for unsupervised learning.
It is based upon probabilistic model building with partial observations.
Likelihood Function
\[ P(Model|Data)=\frac{P(Data|Model)P(Model)}{P(Data)} \]
The likelihood \(P(Data|Model)\) measures how well a model explains the data; by Bayes’ rule, the posterior \(P(Model|Data)\) combines the likelihood with the prior \(P(Model)\). Maximizing the likelihood finds the model parameters that make the observed data most probable.
expectation-maximization (EM) algorithm
Suppose we flip a coin and get the following - 0,1,1,0,0,1,1,0,0,1. We may choose a Bernoulli distribution
\[ Pr(X=k) = p^k (1-p)^{1-k}\!\quad \text{for }k\in\{0,1\}. \]
And estimate the most likely p.
Or suppose we have data on the heights of people in cm (both male and female). Heights follow a normal distribution, but men are (on average) taller than women, so this suggests a mixture model of two Gaussian distributions.
\[ \pi \, \mathcal{N}(\mu_1,\sigma_1) + (1-\pi) \, \mathcal{N}(\mu_2,\sigma_2), \qquad 0 \le \pi \le 1 \]
Mixture model of two Gaussian distributions
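To make this concrete, here is a small simulation sketch; the means, standard deviations, and 50/50 split are made-up illustrative values, not estimates from real height data.
# Simulate heights from a two-component Gaussian mixture (illustrative values)
set.seed(1)
n <- 1000
female <- rbinom(n, 1, 0.5)                      # latent component indicator
height <- ifelse(female == 1,
                 rnorm(n, mean = 165, sd = 6),   # component 1
                 rnorm(n, mean = 178, sd = 6))   # component 2
hist(height, breaks = 30, main = "Mixture of two Gaussians")
The two subpopulations overlap substantially; EM-based clustering is designed to recover exactly this kind of hidden group structure.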
maximum likelihood idea:
\[ \mathrm{P}\left(\bigcap_{i=1}^n A_i\right)=\prod_{i=1}^n \mathrm{P}(A_i). \]
This is the multiplication rule for independent events.
Suppose we flip a coin and get the following - 0,1,1,0,0,1,1,0,0,1.
\[ L(p) = P(0,1,1,0,0,1,1,0,0,1|p) \]
\[ = P(0|p)P(1|p)...P(1|p) \]
\[ = p^5(1-p)^5 \]
We choose the \(p\) which maximizes \(L(p)\). Equivalently, we maximize the log-likelihood \(l(p)=\log L(p)\), which is easier to differentiate.
\[ l(p) = \log L(p) = 5\log(p) + 5\log(1-p) \]
Setting the derivative to zero at the maximum,
\[ l'(p) = \frac{5}{p} - \frac{5}{1-p} = 0 \]
\[ \Rightarrow p = \frac{1}{2} \]
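As a quick check (a sketch, not part of the original lesson code), we can maximize this log-likelihood numerically in R and confirm the closed-form answer:
# Log-likelihood of the observed flips 0,1,1,0,0,1,1,0,0,1 (5 heads, 5 tails)
flips <- c(0, 1, 1, 0, 0, 1, 1, 0, 0, 1)
loglik <- function(p) sum(dbinom(flips, size = 1, prob = p, log = TRUE))
# Maximize over p in (0, 1); the optimum is at (or numerically very near) p = 0.5
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum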
Given a statistical model which generates a set \(\mathbf{X}\) of observed data, a set of unobserved latent data or missing values \(\mathbf{Z}\), and a vector of unknown parameters \(\boldsymbol\theta\), along with a likelihood function \(L(\boldsymbol\theta; \mathbf{X}, \mathbf{Z}) = p(\mathbf{X}, \mathbf{Z}|\boldsymbol\theta)\), the maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the observed data
\[ L(\boldsymbol\theta; \mathbf{X}) = p(\mathbf{X}|\boldsymbol\theta) = \sum_{\mathbf{Z}} p(\mathbf{X},\mathbf{Z}|\boldsymbol\theta) \]
Create an initial model, \(\theta_0\).
That is, decide on a model form and start with arbitrary parameters.
The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:
Expectation step (E step): Calculate the expected value of the log likelihood function, with respect to the conditional distribution of \(\mathbf{Z}\) given \(\mathbf{X}\) under the current estimate of the parameters \(\boldsymbol\theta^{(t)}\):
\[ Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z}|\mathbf{X},\boldsymbol\theta^{(t)}}\left[ \log L (\boldsymbol\theta;\mathbf{X},\mathbf{Z}) \right] \, \]
Maximization step (M step): Find the parameter that maximizes this quantity:
\[ \boldsymbol\theta^{(t+1)} = \underset{\boldsymbol\theta}{\operatorname{arg\,max}} \ Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) \, \]
Repeat the above EM steps until reaching a local maximum.
A very common “mixture model” is a mixture of Gaussians. For example, let \(\mathbf{x} = (\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_n)\) be a sample of \(n\) independent observations from a mixture of two multivariate normal distributions of dimension d, and let \(\mathbf{z} = (z_1,z_2,\ldots,z_n)\) be the latent variables that determine the component from which the observation originates.
\[ X_i |(Z_i = 1) \sim \mathcal{N}_d(\boldsymbol{\mu}_1,\Sigma_1) \quad\text{and}\quad X_i |(Z_i = 2) \sim \mathcal{N}_d(\boldsymbol{\mu}_2,\Sigma_2) \]
where
\[ \operatorname{P} (Z_i = 1 ) = \tau_1 \quad\text{and}\quad \operatorname{P} (Z_i=2) = \tau_2 = 1-\tau_1 \]
The aim is to estimate the unknown parameters representing the “mixing” value between the Gaussians and the means and covariances of each:
\[ \theta = \big( \boldsymbol{\tau},\boldsymbol{\mu}_1,\boldsymbol{\mu}_2,\Sigma_1,\Sigma_2 \big) \]
where the incomplete-data likelihood function is
\[ L(\theta;\mathbf{x}) = \prod_{i=1}^n \sum_{j=1}^2 \tau_j \ f(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j), \]
and the complete-data likelihood function is
\[ L(\theta;\mathbf{x},\mathbf{z}) = P(\mathbf{x},\mathbf{z} \vert \theta) = \prod_{i=1}^n \sum_{j=1}^2 \mathbb{I}(z_i=j) \ f(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j) \tau_j \]
or
\[ L(\theta;\mathbf{x},\mathbf{z}) = \exp \left\{ \sum_{i=1}^n \sum_{j=1}^2 \mathbb{I}(z_i=j) \big[ \log \tau_j -\tfrac{1}{2} \log |\Sigma_j| -\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top\Sigma_j^{-1} (\mathbf{x}_i-\boldsymbol{\mu}_j) -\tfrac{d}{2} \log(2\pi) \big] \right\}. \]
where \(\mathbb{I}\) is an indicator function and f is the probability density function of a multivariate normal.
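For this two-component Gaussian mixture, the E and M steps take the standard closed forms (stated here for completeness, writing \(T_{j,i}\) for the membership probabilities, or “responsibilities”). The E step computes, under the current parameters \(\theta^{(t)}\),
\[ T_{j,i}^{(t)} = \operatorname{P}(Z_i = j \mid X_i = \mathbf{x}_i;\theta^{(t)}) = \frac{\tau_j^{(t)} f(\mathbf{x}_i;\boldsymbol{\mu}_j^{(t)},\Sigma_j^{(t)})}{\tau_1^{(t)} f(\mathbf{x}_i;\boldsymbol{\mu}_1^{(t)},\Sigma_1^{(t)}) + \tau_2^{(t)} f(\mathbf{x}_i;\boldsymbol{\mu}_2^{(t)},\Sigma_2^{(t)})}, \]
and the M step updates the parameters with responsibility-weighted averages:
\[ \tau_j^{(t+1)} = \frac{1}{n}\sum_{i=1}^n T_{j,i}^{(t)}, \qquad \boldsymbol{\mu}_j^{(t+1)} = \frac{\sum_{i=1}^n T_{j,i}^{(t)}\,\mathbf{x}_i}{\sum_{i=1}^n T_{j,i}^{(t)}}, \qquad \Sigma_j^{(t+1)} = \frac{\sum_{i=1}^n T_{j,i}^{(t)}\,(\mathbf{x}_i-\boldsymbol{\mu}_j^{(t+1)})(\mathbf{x}_i-\boldsymbol{\mu}_j^{(t+1)})^\top}{\sum_{i=1}^n T_{j,i}^{(t)}}. \]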
Expectation-maximization clustering probabilistically assigns data to different clusters. This is sometimes called “soft clustering” (as opposed to “hard clustering,” in which each observation belongs to exactly one cluster). A fairly tall person may be 55% likely to be a “man” and 45% likely to be a “woman.” This property is useful for data whose items may share classes. For example, a song may be mostly “country” but a little bit “rock and roll.”
R provides the Mclust function in the mclust library for estimating the parameters of a statistical model with the expectation-maximization (EM) algorithm.
We’ll go over a simple example, but mclust has many tools for normal mixture modeling using the EM algorithm: model-based clustering, classification, density estimation, and Bayesian approaches to maximum likelihood estimation (Bayesian probability will be covered in a future module).
Given a model (a mixture of Gaussians, Binomial, etc.), we have the model parameters \(\theta\) (for example, the component means, variances, and mixing proportions), the latent cluster assignments \(Z\), and the membership probabilities \(T\).
The expectation-maximization steps alternate between computing \(T\) (and hence \(Z\)) from the current parameters and re-estimating the parameters from \(T\).
Iterate until convergence, usually when \(Z\) and \(T\) don’t change much or after a fixed number of steps.
Mclust(data, G = NULL, modelNames = NULL, prior = NULL, control = emControl(), initialization = NULL, warn = FALSE, …)
data - A numeric vector, matrix, or data frame of observations. Categorical variables are not allowed. If a matrix or data frame, rows correspond to observations and columns correspond to variables.
G - An integer vector specifying the numbers of mixture components (clusters) for which the BIC is to be calculated. The default is G=1:9.
modelNames - A vector of character strings indicating the models to be fitted in the EM phase of clustering. By default, all models appropriate for the dimension of the data are fitted.
The help file for mclustModelNames describes the available models.
prior - The default assumes no prior, but this argument allows specification of a conjugate prior on the means and variances through the function priorControl.
control - A list of control parameters for EM. The defaults are set by the call emControl().
initialization - A list containing zero or more of the following components:
hcPairs - A matrix of merge pairs for hierarchical clustering such as produced by function hc. For multivariate data, the default is to compute a hierarchical clustering tree by applying function hc with modelName = “VVV” to the data or a subset as indicated by the subset argument. The hierarchical clustering results are used to start EM. For univariate data, the default is to use quantiles to start EM.
subset - A logical or numeric vector specifying a subset of the data to be used in the initial hierarchical clustering phase.
noise - A logical or numeric vector indicating an initial guess as to which observations are noise in the data. If numeric the entries should correspond to row indexes of the data. If supplied, a noise term will be added to the model in the estimation.
warn - A logical value indicating whether or not certain warnings (usually related to singularity) should be issued. The default is to suppress these warnings.
… Catches unused arguments in indirect or list calls via do.call.
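For example, to restrict the search to 2-5 components and a couple of covariance structures, a call might look like the following (an illustrative sketch, not the call used later in this lesson; it assumes X is a numeric data frame of observations, as constructed from the diabetes data below):
# Illustrative call: limit the candidate numbers of clusters and the models
fit2 <- Mclust(X, G = 2:5, modelNames = c("EII", "VVV"))
summary(fit2)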
Mclust returns an object of class “Mclust” providing the optimal (according to BIC) mixture model estimation.
The details of the output components are as follows:
call - The matched call
data - The input data matrix.
modelName - A character string denoting the model at which the optimal BIC occurs.
n - The number of observations in the data.
d - The dimension of the data.
G - The optimal number of mixture components.
BIC - All BIC values.
bic - Optimal BIC value.
loglik - The loglikelihood corresponding to the optimal BIC.
df - The number of estimated parameters.
hypvol - The hypervolume parameter for the noise component if required, otherwise set to NULL (see hypvol).
parameters - A list with the following components:
pro - A vector whose kth component is the mixing proportion for the kth component of the mixture model. If missing, equal proportions are assumed.
mean - The mean for each component. If there is more than one component, this is a matrix whose kth column is the mean of the kth component of the mixture model.
variance - A list of variance parameters for the model. The components of this list depend on the model specification. See the help file for mclustVariance for details.
z - A matrix whose [i,k]th entry is the probability that observation i in the test data belongs to the kth class.
classification - The classification corresponding to z, i.e. map(z).
uncertainty - The uncertainty associated with the classification.
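As a sketch of how these components are accessed (assuming fit <- Mclust(X) as in the example below):
# Inspect a fitted Mclust object (fit is created in the example below)
fit$modelName              # covariance model selected by BIC
fit$G                      # selected number of mixture components
fit$parameters$pro         # mixing proportions
fit$parameters$mean        # component means (one column per component)
head(fit$z)                # soft assignments: P(observation i is in cluster k)
head(fit$classification)   # hard assignments, i.e. map(fit$z)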
The Bayesian information criterion (BIC) is used by mclust to assess the fit of a model. The BIC, also called the Schwarz criterion (SBC, SBIC), is a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred. It is based, in part, on the likelihood function and is similar to the Akaike information criterion (AIC), which is founded on information theory: it offers a relative estimate of the information lost when a given model is used to represent the process that generates the data. In doing so, it deals with the trade-off between the goodness of fit of the model and the complexity of the model.
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC try to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC. The BIC was developed by Gideon E. Schwarz and published in a 1978 paper, where he gave a Bayesian argument for adopting it.
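Concretely, for a model with \(k\) estimated parameters, maximized likelihood \(\hat{L}\), and \(n\) observations, the two criteria are
\[ \mathrm{BIC} = k\log(n) - 2\log\hat{L}, \qquad \mathrm{AIC} = 2k - 2\log\hat{L}, \]
so the BIC penalty exceeds the AIC penalty whenever \(\log(n) > 2\), i.e. for \(n > e^2 \approx 7.4\). Note that mclust reports BIC on the flipped scale \(2\log\hat{L} - k\log(n)\), so in mclust output larger (less negative) BIC values indicate better models; that is why the model ranked best in the output below has the largest BIC.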
mclust also offers a related criterion, the Integrated Complete-data Likelihood (ICL), to assess classification fits.
We now perform EM clustering on the diabetes data using mclust. Note that the summary command applied to an Mclust object reports the selected model, its log-likelihood, BIC, ICL, and a clustering table, as shown below.
data(diabetes)
head(diabetes)
## class glucose insulin sspg
## 1 Normal 80 356 124
## 2 Normal 97 289 117
## 3 Normal 105 319 143
## 4 Normal 90 356 199
## 5 Normal 90 323 240
## 6 Normal 86 381 157
class.d = diabetes$class
table(class.d)
## class.d
## Chemical Normal Overt
## 36 76 33
X = diabetes[,-1]
head(X)
## glucose insulin sspg
## 1 80 356 124
## 2 97 289 117
## 3 105 319 143
## 4 90 356 199
## 5 90 323 240
## 6 86 381 157
clPairs(X, class.d)
fit <- Mclust(X)
fit
## 'Mclust' model object:
## best model: ellipsoidal, varying volume, shape, and orientation (VVV) with 3 components
summary(fit)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3 components:
##
## log.likelihood n df BIC ICL
## -2307.883 145 29 -4760.091 -4776.086
##
## Clustering table:
## 1 2 3
## 82 33 30
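Because the true diabetes classes are known here, we can compare them with the fitted clustering. A short sketch using the class.d vector defined above (adjustedRandIndex is provided by mclust):
# Cross-tabulate the known classes against the EM cluster labels
table(class.d, fit$classification)
# Chance-corrected agreement: 1 = perfect, about 0 = random labelling
adjustedRandIndex(class.d, fit$classification)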
The plot command for EM produces the following four plots:
# 1: BIC
plot(fit, what = "BIC")
#table(class, fit$BIC)
#2: classification
plot(fit, what = "classification")
#table(class, fit$classification)
# 3: uncertainty
plot(fit, what = "uncertainty")
#table(class, fit$uncertainty)
# 4: density
plot(fit, what = "density")
#table(class, fit$density)
The mclustBIC function from mclust computes BIC for parameterized Gaussian mixture models fitted by the EM algorithm, initialized by model-based hierarchical clustering.
BIC = mclustBIC(X)
summary(BIC)
## Best BIC values:
## VVV,3 VVE,3 EVE,4
## BIC -4760.091 -4775.53693 -4793.26143
## BIC diff 0.000 -15.44628 -33.17079
plot(BIC) # Only BIC plot
# plot(BIC,what = "BIC")
# The following also just produce the BIC plot: plot(BIC, what = "classification"), plot(BIC, what = "uncertainty"), plot(BIC, what = "density")
mclustICL from mclust uses ICL (Integrated Complete-data Likelihood) for parameterized Gaussian mixture models fitted by the EM algorithm, initialized by model-based hierarchical clustering.
ICL = mclustICL(X)
summary(ICL)
## Best ICL values:
## VVV,3 VVE,3 EVE,4
## ICL -4776.086 -4793.27143 -4809.16868
## ICL diff 0.000 -17.18553 -33.08278
plot(ICL) # Only ICL plot
Suppose you have two coins. To match the code below, let the probability of observing a head with the first coin be \(a\) and with the second coin be \(b\). We don’t observe which coin was flipped, only the resulting sequence of heads and tails. Let \(X_i \in \{0,1\}\) be the result (tails or heads) of flip \(i\).
Let \(C_i \in \{0,1\}\) represent which coin was used: \(C_i = 0\) means the first coin (head probability \(a\)) was used, and \(C_i = 1\) means the second coin (head probability \(b\)) was used. The first coin is chosen with probability \(p\).
Since either the first or second coin was used for flip \(i\), the probability of the result at flip \(i\), is:
\[ P(X_i) = P(X_i|C_i=0)P(C_i=0) + P(X_i|C_i=1)P(C_i=1) \]
Note that if the probability of using coin one doesn’t change during the sequence of flips, then \(P(C_i=0)\) is just a constant \(P(C=0) = p\), and the probability of using coin two is \(1-p\).
Note also that if we wanted to model a sequence of observed heights using the Galton data directly, we would use a mixture of two Gaussian distributions representing the height distributions by gender, and the probability of getting “coin one or two” (i.e. a male or female) would be roughly 50%.
#-------------------------------- Expectation -----------------------------
expectation <- function(sample,p,a,b)
{
p_expectation <- (p*dbinom(sample,1,a)) / ( p*dbinom(sample,1,a) + (1-p)*dbinom(sample,1,b) )
return(p_expectation)
}
#-------------------------------- Maximization --------------------------------
maximization <- function(sample,epart){
# estimate p
p_temp <- mean(epart)
# estimate a and b
a_temp <- sum(sample*epart) / sum(epart)
b_temp <- sum(sample*(1-epart)) / sum(1-epart)
list(p_temp,a_temp,b_temp)
}
#---------------------------- Expectation Maximization Algorithm -------------------------
EM <- function(sample,p_inits,a_inits,b_inits,maxit=1000,tol=1e-6)
{
# Estimation of parameter(Initial)
flag <- 0
p_cur <- p_inits; a_cur <- a_inits; b_cur <- b_inits
# Iterate between expectation and maximization parts
for(i in 1:maxit){
cur <- c(p_cur,a_cur,b_cur)
new <- maximization(sample,expectation(sample, p_cur, a_cur, b_cur))
p_new <- new[[1]]; a_new <- new[[2]]; b_new <- new[[3]]
new_step <- c(p_new,a_new,b_new)
# Stop iteration if the difference between the current and new estimates is less than a tolerance level
if( all(abs(cur - new_step) < tol) ){ flag <- 1; break}
# Otherwise continue iteration
p_cur <- p_new; a_cur <- a_new; b_cur <- b_new
}
if(!flag) warning("Didn't converge\n")
list(p_cur, a_cur, b_cur)
}
#------------------------------ Calculating Information matrix ----------------------------
Info.Mat.function <- function(sample, p.est, a.est, b.est){
expectation.est <- expectation(sample,p.est, a.est, b.est)
info.mat <- matrix(rep(0,9),3,3)
info.mat[1,1] <- - sum(expectation.est)/(p.est^2) - sum((1-expectation.est))/((1-p.est)^2)
info.mat[2,2] <- - sum(expectation.est*sample)/(a.est^2) - sum(expectation.est*(1-sample))/((1-a.est)^2)
info.mat[3,3] <- - sum((1-expectation.est)*sample)/(b.est^2) - sum((1-expectation.est)*(1-sample))/((1-b.est)^2)
return(-info.mat)
}
#------------------ Now Generate sample data --------------------------------
n <- 10000
p_true <- 0.85 # prob of using first coin
a_true <- 0.50 # the first coin has P(heads) = 0.50
b_true <- 0.70 # the second coin has P(heads) = 0.70
true <- c(p_true,a_true,b_true)
u <- ifelse(runif(n)<p_true, rbinom(n,1,a_true),rbinom(n,1,b_true))
# Set parameter estimates
p_init = 0.70; a_init = 0.70; b_init = 0.60
#--------------------Return EM Algorithm function and calculate Confidence Interval-----------------------------
output <- EM(u,p_init,a_init,b_init)
# Confidence Intervals
sd.out <- sqrt(diag(solve(Info.Mat.function(u,output[[1]],output[[2]],output[[3]]))))
data.frame("Truth" = true, "EM Estimate" = unlist(output), "Lower CI" = unlist(output) - qnorm(.975)*sd.out, "Upper CI" = unlist(output) + qnorm(.975)*sd.out)
## Truth EM.Estimate Lower.CI Upper.CI
## 1 0.85 0.6863229 0.6772290 0.6954169
## 2 0.50 0.5605037 0.5487615 0.5722460
## 3 0.70 0.4505061 0.4330945 0.4679177
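The estimates above are far from the true values even though the algorithm converged. This is expected: with a single sequence of independent flips, a two-coin Bernoulli mixture is identified only through the overall head probability \(p\,a + (1-p)\,b\), so many different combinations of \(p\), \(a\), and \(b\) fit the data equally well. A quick check (a sketch, not part of the original code):
# Overall head probability implied by the truth, by the EM estimates,
# and observed in the simulated flips -- all three should be close
p_true * a_true + (1 - p_true) * b_true                         # truth
output[[1]] * output[[2]] + (1 - output[[1]]) * output[[3]]     # EM estimates
mean(u)                                                         # observed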
Answer the following questions:
* How did you choose a model for EM?
* Evaluate the model performance.
Answer the following questions:
How do the clustering approaches compare on the same data?
Write up your report as an .Rmd file.
Data Mining Algorithms In R/Clustering/Expectation Maximization (EM) [https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Expectation_Maximization_(EM)]
C. Fraley, A. E. Raftery, T. B. Murphy and L. Scrucca (2012). mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report No. 597, Department of Statistics, University of Washington.
C. Fraley and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97:611-631.
C. Fraley and A. E. Raftery (2005, revised 2009). Bayesian regularization for normal mixture estimation and model-based clustering. Technical Report, Department of Statistics, University of Washington.
C. Fraley and A. E. Raftery (2007). Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification 24:155-181.