First, I implemented the EM algorithm for a Gaussian mixture distribution using a latent variable Z. Below are the functions for the E-step and the M-step; I simply apply the formulation from the textbook.
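For reference, the textbook updates that Estep and Mstep implement are the standard ones for a K-component Gaussian mixture: responsibilities in the E-step, and weighted moment estimates in the M-step (the covariance update is written in the same "second moment minus outer product of the mean" form used in the code below).

$$\gamma_{nk} = \frac{\pi_k\,\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n \mid \mu_j, \Sigma_j)}, \qquad N_k = \sum_{n=1}^{N}\gamma_{nk}$$

$$\pi_k^{\mathrm{new}} = \frac{N_k}{N}, \qquad \mu_k^{\mathrm{new}} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}\,x_n, \qquad \Sigma_k^{\mathrm{new}} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}\,x_n x_n^{\top} - \mu_k^{\mathrm{new}}\,(\mu_k^{\mathrm{new}})^{\top}$$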
library(mvtnorm)
## Warning: package 'mvtnorm' was built under R version 3.0.3
Estep <- function(data, mix.pi, mu, sigma, K){
  ## responsibilities: for each point, pi_k * N(x | mu_k, Sigma_k), normalized over k
  result <- apply(data, 1, function(xt){
    gamma_i <- sapply(1:K, function(k){
      mix.pi[k] * dmvnorm(xt, mu[,k], sigma[,,k])
    })
    gamma_i / sum(gamma_i)
  })
  t(result)   # N x K matrix of responsibilities
}
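The E-step above can underflow if a point is far from every component, because all the densities round to zero and the normalizer becomes 0/0. If that ever happens, a more numerically stable variant can work in log space using dmvnorm's log argument; the sketch below is an optional alternative (Estep.log is a hypothetical name, not used elsewhere in this post):

Estep.log <- function(data, mix.pi, mu, sigma, K){
  result <- apply(data, 1, function(xt){
    ## log(pi_k) + log N(x | mu_k, Sigma_k) for every component
    log.g <- sapply(1:K, function(k) log(mix.pi[k]) + dmvnorm(xt, mu[,k], sigma[,,k], log=TRUE))
    g <- exp(log.g - max(log.g))   # log-sum-exp trick to avoid underflow
    g / sum(g)
  })
  t(result)
}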
Mstep <- function(gamma, data, K, N, degree){
  gamma.sum <- colSums(gamma)                     # effective number of points per component
  new.mix <- gamma.sum / N                        # updated mixing rates
  new.mu <- t(t(t(data) %*% gamma) / gamma.sum)   # weighted means, one column per component
  new.sigma <- array(0, dim=c(degree, degree, K))
  for(k in 1:K){
    sig <- matrix(0, degree, degree)
    for(n in 1:N){
      sig <- sig + gamma[n,k] * (data[n,] %*% t(data[n,]))   # weighted second moment
    }
    ## covariance = E[x x^T] - mu mu^T under the responsibilities of component k
    new.sigma[,,k] <- sig / gamma.sum[k] - new.mu[,k] %*% t(new.mu[,k])
  }
  list(new.mix, new.mu, new.sigma)
}
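As a quick sanity check (a hypothetical smoke test, not part of the analysis below), each row of the matrix returned by Estep should sum to one, and the mixing rates returned by Mstep should also sum to one:

test.data <- matrix(rnorm(20), ncol=2)                # 10 random points in 2D
test.sigma <- array(diag(2), dim=c(2, 2, 2))
test.gamma <- Estep(test.data, c(0.5, 0.5), matrix(0, 2, 2), test.sigma, K=2)
stopifnot(all(abs(rowSums(test.gamma) - 1) < 1e-12))  # responsibilities sum to 1 per point
test.fit <- Mstep(test.gamma, test.data, K=2, N=10, degree=2)
stopifnot(abs(sum(test.fit[[1]]) - 1) < 1e-12)        # mixing rates sum to 1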
So I will use the Estep and Mstep functions to find the parameters of the Gaussian mixture distribution, applying the two steps repeatedly until convergence.
This time I do not use real data. Instead, I use R to sample from a known Gaussian mixture distribution, so the true parameters are known and I can compare the parameters found by the EM algorithm against them.
I generate the data as follows.
set.seed(1010);
K <- 3        # number of clusters
degree <- 2   # dimension of the data (2D in this example)
N <- 1000     # number of samples
mix.pi.true <- runif(K)
mix.pi.true <- mix.pi.true/sum(mix.pi.true)   # true mixing rates
mu.true <- matrix(runif(K*degree, min=-2, max=2), nrow=degree, ncol=K)   # true means
sigma.true <- array(0, dim=c(degree, degree, K))
for(k in 1:K){
  sigma.true[,,k] <- diag(1, nrow=degree, ncol=degree)   # true covariance matrices (diagonal)
}
Then I use the parameters above to generate the data. First I sample N times from a multinomial distribution to decide which cluster generates each point; then, for each draw, I sample the point itself from the corresponding multivariate Gaussian with rmvnorm.
data <- t(apply(rmultinom(N, 1, mix.pi.true), 2, function(num){
  maxindex <- which.max(num)   # index of the component chosen for this point
  rmvnorm(1, mu.true[,maxindex], sigma.true[,,maxindex])
}))
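An equivalent way to draw the same kind of sample, shown here only as an optional sketch, is to sample the component labels directly with sample() and keep them; the stored labels (labels.true, a name not used elsewhere in this post) would also allow colouring the scatter plot by true component:

labels.true <- sample(1:K, N, replace=TRUE, prob=mix.pi.true)   # component label per point
data.alt <- t(sapply(labels.true, function(k) rmvnorm(1, mu.true[,k], sigma.true[,,k])))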
First, let us see what the data look like.
plot(data)
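Optionally, the true component means can be overlaid on the scatter plot to make the clusters easier to see (a small addition, not in the original script):

points(t(mu.true), pch=4, cex=2, lwd=2)   # mark the true means with crosses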
Then I use the EM algorithm to fit the parameters.
(1) Fixed numbers (estimated from the data)
First I use the trivial choice as the initial parameters, namely values estimated from the data. The initial guess for the mixing rates is 1/K for every component, the initial mean of every cluster is the mean of the whole data set, and the initial covariance of every cluster is the covariance of the whole data set. These values alone would not work well because all K components start out identical, so I add a small random perturbation.
n.trial <- 1000
mix.pi <- rep(1/K, K)   # equal starting mixing rates
mu <- matrix(apply(data,2,mean), nrow=degree, ncol=K);
mu <- mu + 0.1*matrix(runif(K*degree), nrow=degree, ncol=K);
sigma <- array(0, dim=c(degree, degree, K));
for(k in 1:K){
  sigma[,,k] <- var(data) + 0.1*diag(runif(degree))
}
Then I learn the parameters. Time is measured by the number of EM iterations.
count <- 0
errorlist <- rep(0, n.trial)
for(i in 1:n.trial){
  count <- count + 1
  gamma <- Estep(data, mix.pi, mu, sigma, K)     # E-step: responsibilities
  result <- Mstep(gamma, data, K, N, degree)     # M-step: updated parameters
  new.mix.pi <- result[[1]]; new.mu <- result[[2]]; new.sigma <- result[[3]]
  ## squared change in all parameters, used as the convergence measure
  error <- sum((new.mix.pi - mix.pi)^2) + sum((new.mu - mu)^2) + sum((new.sigma - sigma)^2)
  errorlist[i] <- error
  mix.pi <- new.mix.pi
  mu <- new.mu
  sigma <- new.sigma
  if (error < 1e-5) break
}
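The stopping rule above is based on how much the parameters change between iterations. A complementary diagnostic, sketched here as an optional addition (loglik.gmm is a hypothetical helper, not part of the original code), is the log-likelihood, which EM should increase at every iteration:

loglik.gmm <- function(data, mix.pi, mu, sigma, K){
  ## log-likelihood of the data under the current mixture parameters
  sum(apply(data, 1, function(xt){
    log(sum(sapply(1:K, function(k) mix.pi[k] * dmvnorm(xt, mu[,k], sigma[,,k]))))
  }))
}
loglik.gmm(data, mix.pi, mu, sigma, K)   # evaluate at the fitted parameters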
Now we can compare the parameters learned from the data with the true parameters, and plot the error against the iteration count.
sum((mix.pi-mix.pi.true)^2);
## [1] 0.1787
sum((mu-mu.true)^2);
## [1] 11.04
sum((sigma-sigma.true)^2);
## [1] 0.4786
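One caveat when reading these numbers: the squared errors compare component k of the fit directly with component k of the truth, so label switching (the fitted components coming out in a different order) can inflate them even when the fit itself is reasonable. A small optional check is to compare the means under the best of all K! relabelings (perms below is a hypothetical helper, not part of the original code):

perms <- function(v){
  if (length(v) <= 1) return(list(v))
  unlist(lapply(seq_along(v), function(i){
    lapply(perms(v[-i]), function(p) c(v[i], p))
  }), recursive=FALSE)
}
min(sapply(perms(1:K), function(p) sum((mu[,p] - mu.true)^2)))   # mean error under the best matching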
plot(errorlist[1:count],log="y")
We can see that the error does trend towards 0, but it decreases much more slowly than in the early iterations.
(2) Big random numbers
Next I use big random numbers instead of the data-based estimates as the initial values.
n.trial <- 1000
mix.pi <- runif(K)
mix.pi <- mix.pi/sum(mix.pi)   # normalize so the mixing rates sum to 1
mu <- matrix(runif(K*degree), nrow=degree, ncol=K);
sigma <- array(0, dim=c(degree, degree, K));
for(k in 1:K){
  sigma[,,k] <- diag(runif(degree))
}
count <- 0
errorlist <- rep(0, n.trial)
for(i in 1:n.trial){
  count <- count + 1
  gamma <- Estep(data, mix.pi, mu, sigma, K)
  result <- Mstep(gamma, data, K, N, degree)
  new.mix.pi <- result[[1]]; new.mu <- result[[2]]; new.sigma <- result[[3]]
  error <- sum((new.mix.pi - mix.pi)^2) + sum((new.mu - mu)^2) + sum((new.sigma - sigma)^2)
  errorlist[i] <- error
  mix.pi <- new.mix.pi
  mu <- new.mu
  sigma <- new.sigma
  if (error < 1e-5) break
}
Again, we compare the learned parameters with the true ones and plot the error against the iteration count.
sum((mix.pi-mix.pi.true)^2);
## [1] 0.02299
sum((mu-mu.true)^2);
## [1] 0.7169
sum((sigma-sigma.true)^2);
## [1] 1.244
plot(errorlist[1:count],log="y")
It takes more iterations for the error to fall below the stopping threshold, but it converges more smoothly. We can also see that the decrease slows down as the iteration count grows.
(3) Small random numbers
Last, I use small random numbers instead of the data-based estimates as the initial values.
n.trial <- 1000
mix.pi <- runif(K)
mix.pi <- mix.pi/sum(mix.pi)   # normalize so the mixing rates sum to 1
mu <- 0.1*matrix(runif(K*degree), nrow=degree, ncol=K);
sigma <- array(0, dim=c(degree, degree, K));
for(k in 1:K){
  sigma[,,k] <- 0.1*diag(runif(degree))
}
count <- 0
errorlist <- rep(0, n.trial)
for(i in 1:n.trial){
  count <- count + 1
  gamma <- Estep(data, mix.pi, mu, sigma, K)
  result <- Mstep(gamma, data, K, N, degree)
  new.mix.pi <- result[[1]]; new.mu <- result[[2]]; new.sigma <- result[[3]]
  error <- sum((new.mix.pi - mix.pi)^2) + sum((new.mu - mu)^2) + sum((new.sigma - sigma)^2)
  errorlist[i] <- error
  mix.pi <- new.mix.pi
  mu <- new.mu
  sigma <- new.sigma
  if (error < 1e-5) break
}
Once more, we compare the learned parameters with the true ones and plot the error against the iteration count.
sum((mix.pi-mix.pi.true)^2);
## [1] 0.9792
sum((mu-mu.true)^2);
## [1] 14.74
sum((sigma-sigma.true)^2);
## [1] 3.401
plot(errorlist[1:count],log="y")
This time it takes far fewer steps to converge and the decrease is also smooth, but the error relative to the true parameters is much larger. As before, the decrease per iteration slows down as the count grows.
So, of the three kinds of initial values, it seems best to choose big random numbers (of the same order as the true values): that run achieved the smallest error relative to the true parameters without needing too many iterations.
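A natural follow-up, sketched below as an assumption rather than something run above, is to combine several random initializations and keep the run with the highest log-likelihood (using the hypothetical loglik.gmm helper sketched earlier); this reduces the sensitivity to any single starting point:

best.ll <- -Inf
for(r in 1:5){
  ## random restart: big random starting values, same scale as the true parameters
  mix.pi <- runif(K); mix.pi <- mix.pi/sum(mix.pi)
  mu <- matrix(runif(K*degree, min=-2, max=2), nrow=degree, ncol=K)
  sigma <- array(0, dim=c(degree, degree, K))
  for(k in 1:K) sigma[,,k] <- diag(runif(degree) + 0.5)
  for(i in 1:n.trial){
    gamma <- Estep(data, mix.pi, mu, sigma, K)
    result <- Mstep(gamma, data, K, N, degree)
    error <- sum((result[[1]] - mix.pi)^2) + sum((result[[2]] - mu)^2) + sum((result[[3]] - sigma)^2)
    mix.pi <- result[[1]]; mu <- result[[2]]; sigma <- result[[3]]
    if (error < 1e-5) break
  }
  ll <- loglik.gmm(data, mix.pi, mu, sigma, K)
  if (ll > best.ll) { best.ll <- ll; best.fit <- list(mix.pi, mu, sigma) }   # keep the best run
}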