Exercise 1

1a

We have \[ Y \sim \mathrm{Binomial}(99,\theta), \] and \[ X = x \quad \text{if } 4(x-1)\leq Y < 4x, \] for \( x=1,2,\dots,25 \).

The E-step: we observe \( X \) but not \( Y \), so we take the conditional expectation of the complete-data log-likelihood given \( X = x \). Given \( X = x \), \( Y \) lies within \( [4(x-1), 4x) \), so we have: \[ Q(\theta \mid \tilde\theta) = E\left[ \log L(\theta; Y) \mid X = x, \tilde\theta \right], \qquad \log L(\theta; y) = \log\left( {99 \choose y} \theta^y (1-\theta)^{99-y} \right). \] Since \( \log L(\theta; y) \) is linear in \( y \) apart from the term \( \log {99 \choose y} \), which does not involve \( \theta \), the E-step only requires the conditional expectation of \( Y \): \[ \tilde y = E[Y \mid X = x, \tilde\theta] = \frac { \sum_{4(x-1)\leq y < 4x} y {99 \choose y} {\tilde\theta} ^ y (1-\tilde\theta)^{99-y} } {\sum_{4(x-1)\leq y < 4x} {99 \choose y} {\tilde\theta} ^ y (1-\tilde\theta)^{99-y} } \]

The M-step maximizes \( Q(\theta \mid \tilde\theta) \) over \( \theta \), which for the binomial amounts to dividing \( \tilde y \) by 99, giving us: \[ \theta_{new} = \frac{1}{99} \frac { \sum_{4(x-1)\leq y < 4x} y {99 \choose y} {\tilde\theta} ^ y (1-\tilde\theta)^{99-y} } {\sum_{4(x-1)\leq y < 4x} {99 \choose y} {\tilde\theta} ^ y (1-\tilde\theta)^{99-y} } \]
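
Spelled out, the M-step maximizer follows from setting the derivative of \( Q \) to zero: \[ \frac{\partial Q}{\partial \theta} = \frac{\tilde y}{\theta} - \frac{99-\tilde y}{1-\theta} = 0 \quad\Longrightarrow\quad \theta_{new} = \frac{\tilde y}{99}. \]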

1b

# Iterate a function n times. Return the iterated parameters.
iterate <- function(func, init, n, ...) {
    l <- init
    x <- init
    for (i in 1:n) {
        x <- func(x, ...)
        l <- rbind(l, x)
    }
    return(l)
}

# One EM update for theta: theta_new = E[Y | X = x, theta] / 99
em.step <- function(theta, x) {
    y <- (4 * (x - 1)):(4 * x - 1)
    top <- sum(y * choose(99, y) * theta^y * (1 - theta)^(99 - y))
    bottom <- sum(choose(99, y) * theta^y * (1 - theta)^(99 - y))

    return((1/99) * top/bottom)
}

print(res <- iterate(em.step, init = 0.2, n = 20, x = 21), digits = 10)
##           [,1]
## l 0.2000000000
## x 0.8087044751
## x 0.8223049516
## x 0.8234143400
## x 0.8235078061
## x 0.8235157006
## x 0.8235163675
## x 0.8235164239
## x 0.8235164286
## x 0.8235164290
## x 0.8235164291
## x 0.8235164291
## x 0.8235164291
## x 0.8235164291
## x 0.8235164291
## x 0.8235164291
## x 0.8235164291
## x 0.8235164291
## x 0.8235164291
## x 0.8235164291
## x 0.8235164291

1c

Using the quantities from 1a, we can also find \( \theta \) numerically with nlm, minimizing the negative \( y \)-weighted log-likelihood \( -\sum_{4(x-1)\leq y < 4x} y \log\left( {99 \choose y} \theta^y (1-\theta)^{99-y} \right) \) from two different starting values:

lik <- function(theta, x) {
    y <- (4 * (x - 1)):(4 * x - 1)
    return(-sum(y * log(choose(99, y) * theta^y * (1 - theta)^(99 - y))))
}
nlm(f = lik, p = 0.2, x = 21)
## $minimum
## [1] 749.7
## 
## $estimate
## [1] 0.8234
## 
## $gradient
## [1] 1.137e-07
## 
## $code
## [1] 1
## 
## $iterations
## [1] 10
nlm(f = lik, p = 0.6, x = 21)
## $minimum
## [1] 749.7
## 
## $estimate
## [1] 0.8234
## 
## $gradient
## [1] 2.274e-07
## 
## $code
## [1] 1
## 
## $iterations
## [1] 8

Exercise 2

# setwd('~/Documents/Universiteit')
data <- scan("Xmix.txt")

2a

Since \( \sigma=1 \), this mixture has the density: \[ p(x)=\sum_{j=1}^4 w_j \phi(x-\mu_j) \]

To do EM, we use augmented data: \( P(Y_i=j)=p_j=w_j \) and \( X_i \mid Y_i = j \sim f(\cdot;\mu_j) \) for \( i=1,\dots,n \), where \( f(x;\mu_j)=\phi(x-\mu_j) \).

Now, in the E-step, \[ \tilde \alpha_{i,j} = P(Y_i=j \mid X_i)=\frac{\tilde p_j f(X_i;\tilde \mu_j)}{\sum_c \tilde p_c f(X_i;\tilde \mu_c)} \]

So, for the M-step: \[ p_j^{new} = \frac {1} {n} \sum_{i=1}^n\tilde \alpha_{i,j} \]
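
This update follows from maximizing the expected complete-data log-likelihood in the weights under the constraint \( \sum_j p_j = 1 \); with a Lagrange multiplier \( \lambda \), \[ \frac{\partial}{\partial p_j}\left( \sum_{i=1}^n \sum_{j'} \tilde\alpha_{i,j'} \log p_{j'} - \lambda \Big(\sum_{j'} p_{j'} - 1\Big) \right) = \frac{1}{p_j}\sum_{i=1}^n \tilde\alpha_{i,j} - \lambda = 0, \] and summing the resulting \( p_j = \frac{1}{\lambda}\sum_i \tilde\alpha_{i,j} \) over \( j \) gives \( \lambda = n \).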

X <- data
mu <- c(0, 5, 3, 7)
w <- c(0.1, 0.25, 0.1, 0.25)
sigma <- 1

f <- function(x, mu) dnorm(x - mu)  # component density: standard normal centred at mu
alpha <- function(i, j, w, mu) w[j] * f(X[i], mu[j])/sum(w * f(X[i], mu))  # responsibility of component j for observation i

# One EM update for the weights w, with mu held fixed
w.step <- function(w, mu) {
    total <- c(0, 0, 0, 0)
    for (j in 1:4) {
        # loop since alpha() doesn't take vectors
        for (i in 1:4) {
            total[j] <- total[j] + alpha(i, j, w, mu)
        }
    }
    return(total/4)
}

w
## [1] 0.10 0.25 0.10 0.25
w.step(w, mu)
## [1] 0.371020 0.202999 0.421605 0.004377
iterate(w.step, init = w, n = 20, mu = mu)
##     [,1]      [,2]   [,3]      [,4]
## l 0.1000 2.500e-01 0.1000 2.500e-01
## x 0.3710 2.030e-01 0.4216 4.377e-03
## x 0.3627 9.837e-02 0.5389 4.796e-05
## x 0.3425 4.852e-02 0.6089 5.384e-07
## x 0.3304 2.365e-02 0.6459 5.986e-09
## x 0.3245 1.143e-02 0.6640 6.604e-11
## x 0.3218 5.499e-03 0.6727 7.258e-13
## x 0.3206 2.641e-03 0.6768 7.964e-15
## x 0.3200 1.268e-03 0.6788 8.733e-17
## x 0.3197 6.082e-04 0.6797 9.574e-19
## x 0.3196 2.918e-04 0.6801 1.049e-20
## x 0.3195 1.399e-04 0.6803 1.150e-22
## x 0.3195 6.713e-05 0.6804 1.261e-24
## x 0.3195 3.220e-05 0.6805 1.382e-26
## x 0.3195 1.544e-05 0.6805 1.514e-28
## x 0.3195 7.408e-06 0.6805 1.660e-30
## x 0.3195 3.553e-06 0.6805 1.819e-32
## x 0.3195 1.704e-06 0.6805 1.994e-34
## x 0.3195 8.174e-07 0.6805 2.185e-36
## x 0.3195 3.921e-07 0.6805 2.395e-38
## x 0.3195 1.881e-07 0.6805 2.625e-40

So if we fix \( \mu \) at these values, two of the weights go to 0, and we recognize only two groups: one around 0 and one around 3.

2b

Now only \( \sigma = 1 \) is known. We take the \( \mu \)-values from 2a as starting values and add an update for \( \mu \) to the EM algorithm from 2a:

\[ \mu_j^{new} = \operatorname{argmax}_\mu \sum_{i=1}^n \tilde \alpha_{i,j} \log f(X_i ; \mu ), \] where \( f(x;\mu)=\phi(x-\mu) \).

Since \( \log f(X_i;\mu) = -\tfrac{1}{2}(X_i-\mu)^2 + Const. \), the objective is, up to an additive constant:

\[ -\tfrac{1}{2} \sum_{i=1}^n \tilde \alpha_{i,j} (X_i - \mu )^2 \]

Maximizing gives us:

\[ \mu_j^{new} =\frac { \sum_{i=1}^n \tilde \alpha_{i,j} X_i} {\sum_{i=1}^n \tilde \alpha_{i,j} } \]
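
This is just the stationarity condition \( \sum_{i=1}^n \tilde\alpha_{i,j}(X_i - \mu) = 0 \), solved for \( \mu \).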

X <- data
mu <- c(0, 5, 3, 7)
w <- c(0.1, 0.25, 0.1, 0.25)
sigma <- 1

# One EM update for the component means mu, given the current weights w
mu.step <- function(w, mu) {
    total_top <- c(0, 0, 0, 0)
    total_bottom <- c(0, 0, 0, 0)

    for (j in 1:4) {
        # loop since alpha() doesn't take vectors
        for (i in 1:4) {
            a <- alpha(i, j, w, mu)
            total_top[j] <- total_top[j] + a * X[i]
            total_bottom[j] <- total_bottom[j] + a
        }
    }
    return(total_top/total_bottom)
}

both.step <- function(params) {
    w_new <- w.step(params$w, params$mu)
    mu_new <- mu.step(params$w, params$mu)
    list(w = w_new, mu = mu_new)
}

n <- 10
result <- iterate(both.step, init = list(w = w, mu = mu), n = n)
# w:
t(data.frame(result[, 1]))
##       [,1]   [,2]   [,3]     [,4]
## l   0.1000 0.2500 0.1000 0.250000
## x   0.3710 0.2030 0.4216 0.004377
## x.1 0.3041 0.2092 0.4823 0.004393
## x.2 0.2662 0.2122 0.5172 0.004433
## x.3 0.2542 0.2121 0.5293 0.004427
## x.4 0.2518 0.2114 0.5324 0.004410
## x.5 0.2514 0.2107 0.5335 0.004395
## x.6 0.2514 0.2101 0.5341 0.004383
## x.7 0.2514 0.2097 0.5346 0.004373
## x.8 0.2514 0.2093 0.5350 0.004365
## x.9 0.2514 0.2089 0.5353 0.004358
# mu:
t(data.frame(result[, 2]))
##        [,1]  [,2]  [,3]  [,4]
## l    0.0000 5.000 3.000 7.000
## x   -0.1153 3.956 2.240 4.119
## x.1 -0.5180 3.836 2.199 3.912
## x.2 -0.8472 3.784 2.181 3.828
## x.3 -0.9681 3.765 2.179 3.793
## x.4 -0.9923 3.759 2.181 3.776
## x.5 -0.9962 3.756 2.184 3.768
## x.6 -0.9968 3.756 2.186 3.763
## x.7 -0.9969 3.755 2.187 3.760
## x.8 -0.9970 3.755 2.189 3.759
## x.9 -0.9970 3.756 2.190 3.758

2c

require(mixtools)
set.seed(1)

res <- normalmixEM(X, k = 4, sd.constr = c("s", "s", "s", "s"), maxit = 1e+05, 
    maxrestarts = 1000)
## number of iterations= 1600

res$lambda  # weights
## [1] 0.4322 0.1849 0.1599 0.2231
res$mu
## [1]  3.32425  2.50919 -0.04579  6.46792
res$sigma
## [1] 1.013 1.013 1.013 1.013

2d

Let some large bound \( L \) be given. We will choose parameter values for which the likelihood is larger than this \( L \).

Set \( \mu_1=X_1 \). Then the first component contributes \( \frac{w_1}{\sigma_1}\phi(0) \) to the mixture density at \( X_1 \), and \( \phi(0) \approx 0.3989 > 0.3 \), so \( p(X_1) \geq \frac{0.3\,w_1}{\sigma_1} \) for any fixed \( w_1 > 0 \). The densities at the other observations are bounded below by positive constants that do not depend on \( \sigma_1 \) (the components \( j \geq 2 \) are left unchanged), so by choosing \( \sigma_1 \) small enough, e.g. by letting \( \sigma_1 \to 0 \), the likelihood becomes larger than \( L \). Hence the likelihood is unbounded.
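
A small numerical check of this argument, as a sketch: the weights w0 and the means of components 2-4 below are arbitrary illustrative values; all that matters is that \( \mu_1 = X_1 \) and that \( \sigma_1 \) shrinks.

# Log-likelihood of a 4-component normal mixture (numerical check for 2d)
loglik.mix <- function(X, w, mu, sigma) {
    sum(log(sapply(X, function(x) sum(w * dnorm(x, mu, sigma)))))
}

w0 <- c(0.25, 0.25, 0.25, 0.25)  # arbitrary positive weights
mu0 <- c(X[1], 3, 5, 7)  # first component centred exactly on X_1
for (s in c(1, 0.01, 1e-04, 1e-06)) {
    # shrinking sigma_1 drives the log-likelihood above any bound
    cat("sigma_1 =", s, "log-likelihood =", loglik.mix(X, w0, mu0, c(s, 1, 1, 1)), "\n")
}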

2e

set.seed(1)
res <- normalmixEM(X, sigma = c(0.01, 1, 1, 1), k = 4, maxit = 1e+05, maxrestarts = 1000)
## One of the variances is going to zero;  trying new starting values.
## number of iterations= 1248

res$lambda  # weights
## [1] 0.19404 0.57416 0.17287 0.05893
res$mu
## [1] 0.2352 3.1387 6.0552 7.4905
res$sigma
## [1] 1.1748 0.9999 0.8659 0.5047

With these starting values normalmixEM detects that one of the variances is going to 0 and restarts from new starting values. So this is indeed, as asked, a starting point from which the algorithm diverges; fortunately the divergence is detected and handled.

2f

require(flexmix)
set.seed(1)

res <- flexmix(X ~ 1, k = 4)
# weights:
res@prior
## [1] 0.2189 0.2838 0.2563 0.2410
# mu and sigma:
parameters(res)
##                  Comp.1 Comp.2 Comp.3 Comp.4
## coef.(Intercept)  2.982  3.382  3.601  3.321
## sigma             2.238  2.278  2.272  2.279

2g

The answers of c, e and f are compatible in the sense that each is a legitimate fit of the mixture model to the data. They differ (substantially) because the likelihood has many local maxima: the data do not point to one clear-cut solution, and the EM algorithms only guarantee convergence to some local maximum, not to the global one.
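
One way to compare the fits directly is by the log-likelihoods they attain. A sketch, assuming the three fits had been kept under separate (hypothetical) names res.c, res.e and res.f rather than all overwriting res:

res.c$loglik  # mixtools stores the final log-likelihood in $loglik
res.e$loglik
logLik(res.f)  # flexmix objects have a logLik() method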