The implementation of the EM algorithm for a Gaussian mixture distribution and the effect of different initial values on convergence

First, I implemented the EM algorithm for a Gaussian mixture distribution, using a latent variable Z. Below are the functions for the E-step and the M-step; I simply apply the formulas from the textbook.
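
For reference, the updates implemented below are the standard textbook ones for a K-component Gaussian mixture; in particular the covariance update uses the second moment minus the outer product of the mean, which is what the M-step code computes.

E-step (responsibilities):

$$\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

M-step (with $N_k = \sum_n \gamma_{nk}$):

$$\pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k}\sum_{n} \gamma_{nk}\, x_n, \qquad \Sigma_k = \frac{1}{N_k}\sum_{n} \gamma_{nk}\, x_n x_n^T - \mu_k \mu_k^T$$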

library(mvtnorm)
Estep <- function(data,mix.pi,mu,sigma,K){
  # E-step: compute the responsibility gamma[n,k] of component k for point n
  result <- apply(data, 1, function(xt){
    gammar_i <- sapply(1:K,function(k) {
      mix.pi[k] * dmvnorm(xt, mu[,k], sigma[,,k])   # pi_k * N(x | mu_k, sigma_k)
    });
    gammar_i / sum(gammar_i);   # normalize over the K components
  });
  t(result);   # N x K matrix of responsibilities
}

Mstep <- function(gamma,data,K,N,degree) {
  gamma.sum <- apply(gamma,2,sum)                 # N_k, effective number of points per component
  new.mix <- gamma.sum/N;                         # updated mixing rates
  new.mu <- t(t(t(data) %*% gamma) / gamma.sum);  # weighted means, one column per component
  new.sigma <- array(0, dim=c(degree, degree, K));
  for(k in 1:K) {
    sig <- matrix(0,degree,degree);
    for(n in 1:N) {
      sig <- sig + gamma[n, k] * (data[n,] %*% t(data[n,]));
    }
    # covariance as the weighted second moment minus mu_k mu_k^T
    new.sigma[,,k] <- sig / gamma.sum[k] - new.mu[,k] %*% t(new.mu[,k])
  }
  list(new.mix, new.mu, new.sigma);
}

So I apply the Estep and Mstep functions repeatedly to find the parameters of the Gaussian mixture distribution, iterating until the estimates converge.
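
Since the same iteration loop is reused for each experiment below, it could also be wrapped in a small helper. The sketch below shows one way to do that (run_em is just an illustrative name and its argument list is an assumption, not part of the original code); the experiments that follow keep the explicit loop.

run_em <- function(data, mix.pi, mu, sigma, K, N, degree, max.iter = 1000, tol = 1e-5) {
  errorlist <- rep(NA, max.iter)
  for (i in 1:max.iter) {
    gamma  <- Estep(data, mix.pi, mu, sigma, K)
    result <- Mstep(gamma, data, K, N, degree)
    # squared change of all parameters between successive iterations
    error <- sum((result[[1]] - mix.pi)^2) + sum((result[[2]] - mu)^2) + sum((result[[3]] - sigma)^2)
    errorlist[i] <- error
    mix.pi <- result[[1]]; mu <- result[[2]]; sigma <- result[[3]]
    if (error < tol) break
  }
  list(mix.pi = mix.pi, mu = mu, sigma = sigma, errorlist = errorlist[1:i])
}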

This time I do not use real data; instead I use R functions to sample from a known Gaussian mixture distribution, so that the true parameters are known and the parameters found by the EM algorithm can be compared with them.

I generate the true parameters as below.

set.seed(1010);

K <- 3;  #number of the clusters
degree <- 2;  #dimension of the data (2D this time)
N <- 1000; #number of samples

mix.pi.true <- runif(K); 
mix.pi.true <- mix.pi.true/sum(mix.pi.true);  #true mixture rate

mu.true <- matrix(runif(K*degree,min=-2,max=2), nrow=degree, ncol=K);  #true mean

sigma.true <- array(0, dim=c(degree, degree, K));
for(k in 1:K){
  sigma.true[,,k] <- diag(1,nrow=degree, ncol=degree); #true covariance matrix (diagonal only)
}

Then I use the parameters above to generate the data as below. First I sample N times from a multinomial distribution to decide which cluster generates each point; then, for each multinomial draw, I sample from the corresponding multivariate Gaussian using the rmvnorm function.

data <-  t(apply(rmultinom(N,1,mix.pi.true),2,function(num) {
  maxindex <- which.max(num)
  rmvnorm(1, mu.true[,maxindex], sigma.true[,,maxindex])
}));
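
An equivalent way to do this (just an alternative sketch; the rest of the analysis uses the data generated above) is to draw the component labels directly with sample() and then sample from the matching Gaussian:

labels <- sample(1:K, N, replace = TRUE, prob = mix.pi.true)   # one component label per point
data.alt <- t(sapply(labels, function(k) rmvnorm(1, mu.true[,k], sigma.true[,,k])))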

First, let us see what the data look like.

plot(data)

[Figure: scatter plot of the generated 2D data]

Then I will use the EM algorithm to fit the parameters.

(1) Fixed (data-based) initial values

First I use the trivial choice as the initial parameters, namely estimates computed from the data. The initial guess of the mixing rate is 1/K for every cluster, the initial guess of the mean for every cluster is the mean of the data, and the initial guess of the covariance for every cluster is the covariance of the data. These values alone would not work well because every cluster would be identical, so I add a small random perturbation.

n.trial <- 1000
mix.pi <- rep(1/K,K);   # equal mixing rates
mu <- matrix(apply(data,2,mean), nrow=degree, ncol=K);
mu <- mu + 0.1*matrix(runif(K*degree), nrow=degree, ncol=K);

sigma <- array(0, dim=c(degree, degree, K));
for(k in 1:K){
  sigma[,,k] <- var(data) + 0.1*diag(runif(degree));
}

Then I learn the parameters. The number of iterations serves as the measure of time.

count <- 0
errorlist <-  rep(0,n.trial)
for(i in 1:n.trial){
  count <- count +1
  gamma <- Estep(data,mix.pi,mu,sigma,K);
  result <- Mstep(gamma,data,K,N,degree);
  new.mix.pi <- result[[1]]; new.mu <- result[[2]]; new.sigma <- result[[3]];
  # squared change of all parameters between successive iterations
  error <- sum((new.mix.pi-mix.pi)^2) + sum((new.mu-mu)^2) + sum((new.sigma-sigma)^2)
  errorlist[i] <- error
  mix.pi <- new.mix.pi;
  mu <- new.mu;
  sigma <- new.sigma;
  if (error < 1e-5) break;   # stop once the parameters barely change
}

Then we can compare the parameters learned from the data with the true parameters. We can also plot the error against the iteration count.

sum((mix.pi-mix.pi.true)^2);
## [1] 0.1787
sum((mu-mu.true)^2);
## [1] 11.04
sum((sigma-sigma.true)^2);
## [1] 0.4786
plot(errorlist[1:count],log="y")

[Figure: error per iteration (log scale), data-based initial values]

We can see that the error does indeed trend toward 0, but in later iterations it decreases much more slowly than at the start.
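
Note that the error plotted here measures how much the parameters change between iterations, not how well the model fits. Another common convergence check, shown only as a sketch (the helper loglik is illustrative and is not used in the loops above), is to monitor the log-likelihood, which EM is guaranteed to increase at every iteration:

loglik <- function(data, mix.pi, mu, sigma, K) {
  # log-likelihood of the data under the current mixture parameters
  sum(apply(data, 1, function(xt) {
    log(sum(sapply(1:K, function(k) mix.pi[k] * dmvnorm(xt, mu[,k], sigma[,,k]))))
  }))
}
loglik(data, mix.pi, mu, sigma, K)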

(2) Big random initial values

Next I use big random numbers instead of the data-based estimates as the initial values.

n.trial <- 1000
mix.pi <- runif(K);
mix.pi <- mix.pi/sum(mix.pi);   # normalize so the mixing rates sum to 1

mu <- matrix(runif(K*degree), nrow=degree, ncol=K);

sigma <- array(0, dim=c(degree, degree, K));
for(k in 1:K){
  sigma[,,k] <- diag(runif(degree));
}

count <- 0
errorlist <-  rep(0,n.trial)
for(i in 1:n.trial){
  count <- count +1
  gamma <- Estep(data,mix.pi,mu,sigma,K);
  result <- Mstep(gamma,data,K,N,degree);
  new.mix.pi <- result[[1]]; new.mu <- result[[2]]; new.sigma <- result[[3]];
  error <-sum((new.mix.pi-mix.pi)^2) + sum((new.mu-mu)^2) + sum((new.sigma-sigma)^2)
  errorlist[i] <- error
  mix.pi <- new.mix.pi;
  mu <- new.mu;
  sigma <- new.sigma;
  if (error < 1e-5) break;
}

Again we can compare the parameters learned from the data with the true parameters and plot the error against the iteration count.

sum((mix.pi-mix.pi.true)^2);
## [1] 0.02299
sum((mu-mu.true)^2);
## [1] 0.7169
sum((sigma-sigma.true)^2);
## [1] 1.244
plot(errorlist[1:count],log="y")

[Figure: error per iteration (log scale), big random initial values]

It takes more iterations for the error to drop below 2e-5, but the convergence is smoother. We can also see that progress slows down as the iteration count grows.

(3) Small random initial values

Last, I use small random numbers instead of the data-based estimates as the initial values.

n.trial <- 1000
mix.pi <- runif(K);
mix.pi <- mix.pi/sum(mix.pi);   # normalize so the mixing rates sum to 1

mu <- 0.1*matrix(runif(K*degree), nrow=degree, ncol=K);

sigma <- array(0, dim=c(degree, degree, K));
for(k in 1:K){
  sigma[,,k] <- 0.1*diag(runif(degree));
}

count <- 0
errorlist <-  rep(0,n.trial)
for(i in 1:n.trial){
  count <- count +1
  gamma <- Estep(data,mix.pi,mu,sigma,K);
  result <- Mstep(gamma,data,K,N,degree);
  new.mix.pi <- result[[1]]; new.mu <- result[[2]]; new.sigma <- result[[3]];
  error <-sum((new.mix.pi-mix.pi)^2) + sum((new.mu-mu)^2) + sum((new.sigma-sigma)^2)
  errorlist[i] <- error
  mix.pi <- new.mix.pi;
  mu <- new.mu;
  sigma <- new.sigma;
  if (error < 1e-5) break;
}

Again we can compare the parameters learned from the data with the true parameters and plot the error against the iteration count.

sum((mix.pi-mix.pi.true)^2);
## [1] 0.9792
sum((mu-mu.true)^2);
## [1] 14.74
sum((sigma-sigma.true)^2);
## [1] 3.401
plot(errorlist[1:count],log="y")

[Figure: error per iteration (log scale), small random initial values]

This time it takes far fewer steps to converge, and the decrease is also smooth, but the error relative to the true parameters is much larger. Again, progress slows down as the iteration count grows.

So from the three types of initial values I find that it may be best to choose big random numbers (of the same order as the true values) as the initial values. In this run they gave the smallest error relative to the true parameters without needing too many iterations.
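
One caveat about these error numbers: a Gaussian mixture is only identified up to a permutation of its components, so comparing mu[,k] with mu.true[,k] column by column can overstate the error when EM recovers the clusters in a different order. A small sketch (assuming K = 3 as above; not part of the experiments) that evaluates the mean error under the best-matching relabeling:

perms <- matrix(c(1,2,3, 1,3,2, 2,1,3, 2,3,1, 3,1,2, 3,2,1), ncol = 3, byrow = TRUE)  # all orderings of 3 components
mu.errors <- apply(perms, 1, function(p) sum((mu[,p] - mu.true)^2))
min(mu.errors)   # squared error of the means under the best-matching permutation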