EM Algorithm

Gerry Alfa Dito

Soal dan Jawaban

  1. Pendugaan parameter pada data genetika (Rao 1973, hal 369), diasumsikan data phenotype \(y = (y_{1}, y_{2}, y_{3}, y_{4}) = (125, 18, 20, 34)\) memiliki distribusi multinomial dengan peluang

\[ {\frac{1}{2}+\frac{\theta}{4},\frac{1-\theta}{4},\frac{1-\theta}{4},\frac{\theta}{4}} \]

sehingga fungsi peluang bagi y adalah

\[ g(y|\theta) = \frac{y_{1}+y_{2}+y_{3}+y_{4}}{y_{1}! y_{2}! y_{3}! y_{4}!} \left(\frac{1}{2}+\frac{\theta}{4} \right) ^{y_{1}} \left(\frac{1-\theta}{4}\right)^{y_{2}} \left(\frac{1-\theta}{4}\right)^{y_{3}} \left(\frac{\theta}{4}\right)^{y_{4}} \]

Fungsi log-likelihood bagi \(y\) adalah

\[\log{L(\theta,y)} = y_{1} \log{2+\theta} + (y_{2} + y_{3}) \log{1-\theta} + y_{4} \log{\theta} \]

Misalkan y data taklengkap dari \(x = (x_1,\ldots, x_5)\) dengan peluang multinomial sebesar

\[ {\frac{1}{2},\frac{\theta}{4},\frac{1-\theta}{4},\frac{1-\theta}{4},\frac{\theta}{4}} \] di mana \(y_1 = x_1 + x_2, y_2 = x_3, y_3 = x_4\) dan \(y_4= x_5\), sehingga diperoleh fungsi peluang x sebagai berikut

\[ f(x|\theta) = \frac{x_{1}+x_{2}+x_{3}+x_{4}+x_{5}}{x_{1}! x_{2}! x_{3}! x_{4}! x_{5}!} \left(\frac{1}{2}\right) ^{x_{1}} \left(\frac{\theta}{4} \right) ^{x_{2}} \left(\frac{1-\theta}{4}\right)^{x_{3}} \left(\frac{1-\theta}{4}\right)^{x_{4}} \left(\frac{\theta}{4}\right)^{x_{5}} \] Fungsi log-likelihood dari \(x\) adalah

\[ \log{L(\theta,x)} = (x_2+ x_5)\log{\theta} + (x_3+x_4) \log{ 1-\theta} \]

Dugalah data hilang \(x_2\) dengan EM algorithm!

Jawab:

  • Tahap E

\[ Q(\theta) = E(x_2+ x_5|y,\theta^{(0)})\log{\theta} + E(x_3+ x_4|y,\theta^{(0)}) \log{1-\theta} \]

\[ = [ E(x_2|y,\theta^{(0)})+ x_5] \log{\theta} + (x_3+ x_4) \log{1-\theta} \]

Pendugaan x2 diperoleh sebagai berikut:

\[ \hat{𝑥}_{2}^{𝑝} = E(x_2|y,\theta^{(0)}) \]

karena \(x_1 + x_2 =y_1\), distribusi bersyarat dari \(x_2|y_1\) adalah binomial dengan parameter \(y_1=125\) dan peluang

\[ p=\frac{\frac{1}{4} \theta^{𝑝}}{\frac{1}{2}+\frac{1}{4} \theta^{𝑝}} \]

jadi

\[ \hat{𝑥}_{2}^{𝑝} = E(x_2|y,\theta^{(0)})=y_1p \]

\[ \hat{𝑥}_{2}^{𝑝}=125 \frac{\frac{1}{4} \theta^{𝑝}}{\frac{1}{2}+\frac{1}{4} \theta^{𝑝}} \]

  • Tahap M

\[ Q(\theta) = [ E(x_2|y,\theta^{(0)})+ x_5] \log{\theta} + (x_3+ x_4) \log{1-\theta} \]

$$ = [++ x_5] + (x_3+ x_4)

$$ Memaksimumkan \(Q(\theta)\) sehingga diperoleh hasil

\[ \frac{dQ(\theta)}{d\theta}=0 \]

\[ \frac{[\hat{x_2}+ x_5]}{\theta} - \frac{(x_3+ x_4)} {1-\theta} =0 \]

\[ \hat{\theta} = \frac{\hat{x_2}+ x_5}{\hat{x_2}+x_3+x_4+x_5} \]

expectation <- function(theta){
  125*( ( (1/4)*theta ) / ( (1/2) + (1/4)*theta) )
}

maximization <- function(x2){
  (x2 + x5) / (x2 + x3 + x4 + x5)
}
x2=0
x3<- 18
x4<- 20
x5<- 34
niter <- 100
theta0 <- 2
save_iter <- data.frame("iter"=0,"theta"=theta0,"x2"=x2)
for (i in 1:niter){
  x2 <- expectation(theta0)
  theta <- maximization(x2)
  criteria <- abs(theta-theta0)
  theta0 <- theta
  save_iter <- rbind(save_iter,c(i,theta0,x2))
  if (criteria<10^-7){
    break
  }
}

tail(save_iter)
##    iter     theta       x2
## 5     4 0.6270186 29.88175
## 6     5 0.6268477 29.83508
## 7     6 0.6268250 29.82889
## 8     7 0.6268220 29.82807
## 9     8 0.6268216 29.82796
## 10    9 0.6268215 29.82795