The term EM was introduced by Dempster, Laird, and Rubin (1977) in "Maximum Likelihood from Incomplete Data via the EM Algorithm", which gave the first proofs of general results about the behavior of the algorithm together with a large number of applications (http://web.mit.edu/6.435/www/Dempster77.pdf). The underlying idea had been in use for many years before that; Orchard and Woodbury (1972), with their missing information principle, provided its theoretical foundation. Common applications include fitting mixture models, learning Bayes net parameters with latent data, and learning hidden Markov models.
In statistics, an expectation-maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.
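Schematically, the iteration is just a small loop. The sketch below is a minimal, generic version in R, where e_step and m_step are hypothetical placeholders for the model-specific computations (the concrete versions for the worked example are defined later in this post).

em_sketch <- function(e_step, m_step, theta0, tol = 1e-6, maxit = 1000){
  theta <- theta0
  for(i in 1:maxit){
    stats <- e_step(theta)       # E-step: expected sufficient statistics given the current theta
    theta_new <- m_step(stats)   # M-step: maximize the expected complete-data log-likelihood
    if(all(abs(theta_new - theta) < tol)) break
    theta <- theta_new
  }
  theta
}

With the estep and mstep functions defined for the example below, em_sketch(estep, mstep, theta0 = 0.5) would run essentially the same iteration as the explicit loop shown there.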
More applications can be found in Kevin Murphy's textbook, Machine Learning: A Probabilistic Perspective.
Here we shall present only a simple numerical example to give the flavour of the method: the example used in Dempster's 1977 paper. There, citing Rao (1965, pp. 368-369), the authors present data in which 197 animals are distributed multinomially into four categories, so that the observed data consist of the four cell counts; a genetic model for the population specifies the cell probabilities.
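In the paper's formulation of this genetic model, the four cell probabilities are written in terms of a single parameter theta (the same theta that the E- and M-step functions below work with):

$$
\left( \frac{1}{2} + \frac{\theta}{4},\; \frac{1-\theta}{4},\; \frac{1-\theta}{4},\; \frac{\theta}{4} \right),
$$

so estimating the cell probabilities really amounts to estimating theta.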
The observed counts are n1 = 125, n2 = 18, n3 = 20 and n4 = 34, and we need to estimate the cell probabilities p1, p2, p3 and p4. What if there are missing data, and there are five groups instead of four?
# naive estimates: the observed cell proportions
p1=125/197
p2=18/197
p3=20/197
p4=34/197
my_prob <- c(p1,p2,p3,p4)
# simulate 10 multinomial experiments of 197 animals each with these probabilities
number_of_experiments <- 10
number_of_samples <- 197
experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob)
experiments
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 125 117 135 129 116 112 133 124 127 129
## [2,] 21 16 14 18 15 17 17 16 20 12
## [3,] 23 20 20 21 16 25 16 26 15 18
## [4,] 28 44 28 29 50 43 31 31 35 38
df=data.frame(experiments)/number_of_samples
par(mfrow=c(2,5))
for(i in 1:10) {
barplot(df[,i],ylim=c(0,1))
}
The key is to find the likelihood function: once we have the likelihood function of the complete-data representation of the model, we can apply the EM algorithm.
To illustrate the EM algorithm, we represent Rao's data as incomplete data from a five-category multinomial population with cell probabilities (p1, p2, p3, p4, p5), the idea being to split the first of the original four categories into two categories. The expectation step estimates the sufficient statistics of the complete data, given the observed data. In this case (n3, n4, n5) are known to be (18, 20, 34), so the only sufficient statistics that have to be estimated are z1 and z2, where z1 + z2 = n1 = 125. Estimating z1 and z2 using the current estimate of theta gives the E-step computation.
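Concretely, the first observed cell (probability 1/2 + theta/4) is split into sub-cells with probabilities 1/2 and theta/4, so given the current theta the conditional expectations are

$$
\hat{z}_2 = 125 \cdot \frac{\theta/4}{1/2 + \theta/4}, \qquad \hat{z}_1 = 125 - \hat{z}_2 ,
$$

which is exactly what the function below computes.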
# define expectation (E-step) function:
# split the 125 animals in the first observed cell between its two sub-cells,
# in proportion to their probabilities 0.5 and 0.25*theta
estep <- function(theta){
  z2 <- 125*(0.25*theta/(0.5+0.25*theta))
  # z1 <- 125*(0.5/(0.5+0.25*theta))
  return(z2)
}
The maximization step then takes the estimated complete data (z1, z2, 18, 20, 34) and estimates theta by maximum likelihood as though the estimated complete data were the observed data, yielding the maximization function.
# define maximization (M-step) function:
# complete-data MLE of theta given the estimated count z2
mstep <- function(z2){
  theta <- (z2+34)/(z2+34+18+20)
  return(theta)
}
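To see where this formula comes from: under the five-category model the complete-data log-likelihood, as a function of theta and up to terms that do not involve theta, is

$$
\ell_c(\theta) = (z_2 + 34)\log\theta + (18 + 20)\log(1-\theta),
$$

which is maximized at theta = (z2 + 34)/(z2 + 34 + 18 + 20), the expression used in mstep.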
The EM algorithm for this example is defined by cycling back and forth between the expectation and maximization steps. The idea is that we have missing data z2 and a parameter theta, both of which need to be estimated, but only one log-likelihood function, so we go back and forth between the two steps to optimize the pair (z2, theta).
# set initial values for the iteration
cur_theta <- 0.5
maxit <- 1000
tol <- 1e-6
flag <- 0
# start iteration
for(i in 1:maxit){
  new_z2 <- estep(cur_theta)
  new_theta <- mstep(new_z2)
  cur_z2 <- new_z2
  # Stop iteration if the difference between the current and new estimates is less than a tolerance level
  if( all(abs(cur_theta - new_theta) < tol) ){ flag <- 1; break }
  # Otherwise continue iteration
  cur_theta <- new_theta
}
if(!flag) warning("Didn't converge\n")
final_theta = cur_theta
paste0("Final Theta = ", format(round(cur_theta, 4), nsmall = 4))
## [1] "Final Theta = 0.6268"
paste0("Final Z2 = ", format(round(cur_z2, 4), nsmall = 4))
## [1] "Final Z2 = 29.8277"
# cell probabilities estimated from the observed four-category data only
p1=125/197
p2=18/197
p3=20/197
p4=34/197
my_prob <- c(p1,p2,p3,p4)
# a single large experiment to visualize these estimates
number_of_experiments <- 1
number_of_samples <- 1000
experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob)
experiments
## [,1]
## [1,] 651
## [2,] 84
## [3,] 105
## [4,] 160
df=data.frame(experiments)/number_of_samples
par(mfrow=c(1,1))
x<-df[,1]
names(x)<- c("p1","p2","p3","p4")
barplot(x,ylim=c(0,1),main="Estimation with observed data only")
# reconstruct the five-category cell probabilities from the EM estimate of theta:
# (1/2, theta/4, (1-theta)/4, (1-theta)/4, theta/4)
p1=0.5
p2=0.25*final_theta
p3=0.25-p2
p4=p3
p5=p2
my_prob <- c(p1,p2,p3,p4,p5)
number_of_experiments <- 10
number_of_samples <- 1000
experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob)
experiments
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 501 485 525 486 506 500 520 485 496 497
## [2,] 155 172 153 147 168 158 140 163 158 140
## [3,] 103 118 87 89 96 89 97 101 98 113
## [4,] 89 83 85 104 95 93 97 101 111 99
## [5,] 152 142 150 174 135 160 146 150 137 151
df=data.frame(experiments)/number_of_samples
par(mfrow=c(2,5))
for(i in 1:10) {
barplot(df[,i],ylim=c(0,1))
}
par(mfrow=c(1,1))
number_of_experiments <- 1
number_of_samples <- 1000
experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob)
experiments
## [,1]
## [1,] 517
## [2,] 160
## [3,] 81
## [4,] 96
## [5,] 146
df=data.frame(experiments)/number_of_samples
x<-df[,1]
names(x)<- c("p1","p2","p3","p4","p5")
barplot(x,ylim=c(0,1),main="Estimation with missing data with EM Algorithm")