Applying the EM Algorithm: Missing Data (or Latent Parameters)

The term EM was introduced in Dempster, Laird, and Rubin (1977), “Maximum Likelihood from Incomplete Data via the EM Algorithm”, which gave the first proofs of general results about the algorithm’s behaviour along with a large number of applications (http://web.mit.edu/6.435/www/Dempster77.pdf). The underlying idea had been in use for many years before that; Orchard and Woodbury (1972), with their missing information principle, provided its theoretical foundation. Common applications include fitting mixture models, learning Bayesian network parameters with latent data, and learning hidden Markov models.

In statistics, the expectation-maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved latent variables. The EM iteration alternates between an expectation (E) step, which forms the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found in the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

More applications can be found in Kevin Murphy’s textbook, Machine Learning: A Probabilistic Perspective.

Here we shall present only a simple numerical example to give the flavour of the method: the example from Dempster’s 1977 paper, which in turn cites Rao (1965, pp. 368-369). Rao presents data in which 197 animals are distributed multinomially into four categories, so that the observed data consist of the four cell counts (n1, n2, n3, n4) = (125, 18, 20, 34). A genetic model for the population specifies the cell probabilities as (1/2 + theta/4, (1-theta)/4, (1-theta)/4, theta/4) for some unknown theta.
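It helps to have the genetic model’s cell probabilities available as a function. The snippet below is a small sketch of ours (the helper name cell_probs is not from the paper); it simply encodes the four model probabilities and checks that they sum to one for any theta.

# cell probabilities of the four-category genetic model
# (a sketch; cell_probs is our own helper name)
cell_probs <- function(theta) {
  c(0.5 + theta/4, (1 - theta)/4, (1 - theta)/4, theta/4)
}
sum(cell_probs(0.5))  # always 1, for any theta in (0, 1)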

Observed Data - Animals in 4 Groups

Given n1=125, n2=18, n3=20, n4=34, we need to estimate the parameters p1, p2, p3 and p4. But what if there are missing data, and there are actually five groups instead of four?

p1=125/197
p2=18/197
p3=20/197
p4=34/197

my_prob <- c(p1,p2,p3,p4)
number_of_experiments <- 10
number_of_samples <- 197

# simulate 10 multinomial experiments with 197 animals each
experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob)
experiments
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]  125  117  135  129  116  112  133  124  127   129
## [2,]   21   16   14   18   15   17   17   16   20    12
## [3,]   23   20   20   21   16   25   16   26   15    18
## [4,]   28   44   28   29   50   43   31   31   35    38
df <- data.frame(experiments)/number_of_samples  # convert counts to relative frequencies

par(mfrow=c(2,5))

for(i in 1:10) {
  barplot(df[,i],ylim=c(0,1))
}

Expectation-Maximization Algorithm

The key is to find the likelihood function: once we have the likelihood of the complete-data representation of the model, we can apply the EM algorithm.
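As a concrete illustration (our own sketch, using the five-category representation introduced below), the complete-data log-likelihood for this example depends on theta, up to an additive constant, only through the counts in the theta/4 and (1-theta)/4 cells; maximizing it gives exactly the M-step used later.

# complete-data log-likelihood up to an additive constant, assuming
# cell probabilities (1/2, theta/4, (1-theta)/4, (1-theta)/4, theta/4)
# and complete counts (z1, z2, 18, 20, 34); the z1 term does not involve theta
complete_loglik <- function(theta, z2) {
  (z2 + 34) * log(theta) + (18 + 20) * log(1 - theta)
}
# its maximizer in theta is (z2 + 34) / (z2 + 34 + 18 + 20),
# which is precisely the M-step below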

Expectation Step

To illustrate the EM algorithm, we represent Rao’s data as incomplete data from a five-category multinomial population with cell probabilities (1/2, theta/4, (1-theta)/4, (1-theta)/4, theta/4), the idea being to split the first of the original four categories (probability 1/2 + theta/4) into two categories with probabilities 1/2 and theta/4. The expectation step estimates the sufficient statistics of the complete data, given the observed data. In this case (n3, n4, n5) are known to be (18, 20, 34), so the only sufficient statistics that have to be estimated are z1 and z2, where z1 + z2 = n1 = 125. Estimating z1 and z2 using the current estimate of theta gives the E-step computation: z2 = 125 * (theta/4) / (1/2 + theta/4) and z1 = 125 - z2. For example, at theta = 0.5 we get z2 = 125 * 0.125/0.625 = 25 and z1 = 100.

# define the expectation function

estep <- function(theta){
  # E-step: expected count in the theta/4 part of the split first cell
  z2 <- 125 * (0.25 * theta / (0.5 + 0.25 * theta))
  # z1 <- 125 * (0.5 / (0.5 + 0.25 * theta))  # complement: z1 + z2 = 125
  z2
}
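For example, starting from theta = 0.5:

estep(0.5)  # returns 25: the 125 animals split as (z1, z2) = (100, 25)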

Maximization Step

The maximization step then takes the estimated complete data (z1, z2, 18, 20, 34) and estimates theta by maximum likelihood as though the estimated complete data were the observed data, yielding the maximization function:

mstep <- function(z2){
  # M-step: ML estimate of theta given the completed counts (z2, 18, 20, 34)
  (z2 + 34)/(z2 + 34 + 18 + 20)
}
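Feeding in the expected count from the first E-step:

mstep(25)  # (25 + 34)/97, approximately 0.608: the first updated theta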

Iteration between Expectation-Maximization Steps

The EM algorithm for this example is defined by cycling back and forth between the expectation and maximization steps. The idea is that the missing count z2 and the parameter theta both need to be estimated, but we have only one log-likelihood function, so we alternate between the two steps to optimize the pair (z2, theta).

# set initial values for the iteration

cur_theta <- 0.5
maxit <- 1000
tol <- 1e-6
flag <- 0

# start iteration

for(i in 1:maxit){
  new_z2 <- estep(cur_theta)
  new_theta <- mstep(new_z2)
  cur_z2 <- new_z2
  
  # stop if the change between successive theta estimates is below the tolerance
  if( abs(cur_theta - new_theta) < tol ){ flag <- 1; break }
  
  # otherwise continue iterating
  cur_theta <- new_theta
}
if(!flag) warning("Didn't converge\n")

final_theta <- cur_theta
paste0("Final Theta = ", format(round(cur_theta, 4), nsmall = 4))
## [1] "Final Theta = 0.6268"
paste0("Final Z2 = ", format(round(cur_z2, 4), nsmall = 4))
## [1] "Final Z2 = 29.8277"

Estimation of Probability Mass Functions of Observed Data

p1=125/197
p2=18/197
p3=20/197
p4=34/197

my_prob <- c(p1,p2,p3,p4)


number_of_experiments <- 1
number_of_samples <- 1000

experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob)
experiments
##      [,1]
## [1,]  651
## [2,]   84
## [3,]  105
## [4,]  160
df <- data.frame(experiments)/number_of_samples

par(mfrow=c(1,1))

x<-df[,1]
names(x)<- c("p1","p2","p3","p4")

barplot(x,ylim=c(0,1),main="Estimation with observed data only")

Estimation of Probability Mass Functions of Complete Data - Animals Actually in 5 Groups

p1=0.5
p2=0.25*final_theta
p3=0.25-p2
p4=p3
p5=p2
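# (added sketch) collapsing the split first cell recovers the model's
# fitted four-cell probabilities at the EM estimate; compare these with
# the raw frequencies 125/197, 18/197, 20/197, 34/197
fitted4 <- c(p1 + p2, p3, p4, p5)
round(fitted4, 4)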

my_prob <- c(p1,p2,p3,p4,p5)
number_of_experiments <- 10
number_of_samples <- 1000

experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob)
experiments
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]  501  485  525  486  506  500  520  485  496   497
## [2,]  155  172  153  147  168  158  140  163  158   140
## [3,]  103  118   87   89   96   89   97  101   98   113
## [4,]   89   83   85  104   95   93   97  101  111    99
## [5,]  152  142  150  174  135  160  146  150  137   151
df <- data.frame(experiments)/number_of_samples


par(mfrow=c(2,5))

for(i in 1:10) {
  barplot(df[,i],ylim=c(0,1))
}

par(mfrow=c(1,1))

number_of_experiments <- 1
number_of_samples <- 1000

experiments <- rmultinom(n=number_of_experiments, size=number_of_samples, prob=my_prob)
experiments
##      [,1]
## [1,]  517
## [2,]  160
## [3,]   81
## [4,]   96
## [5,]  146
df <- data.frame(experiments)/number_of_samples
x<-df[,1]
names(x)<- c("p1","p2","p3","p4","p5")

barplot(x,ylim=c(0,1),main="Estimation with missing data with EM Algorithm")
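Finally, a quick check of our own: compare the simulated relative frequencies with the model probabilities used to generate them.

# compare estimated relative frequencies against the model probabilities
round(rbind(estimated = x, model = my_prob), 3)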