Theory

1. Consider a mixture of two Normal distributions \(N(0, 1)\) and \(N(\mu, 1)\) in which the probability of the first component is known to be 1/2, and \(\mu\) is unknown. Prove that an improper prior on \(\mu\) results in an improper posterior regardless of the number of observations.

For each observation, the density is \(p(x_i|\mu) = \frac{1}{2} (\frac{1}{\sqrt{2 \pi} } e^{-\frac{x_i^2}{2}} + \frac{1}{\sqrt{2 \pi} } e^{-\frac{(x_i-\mu)^2}{2}})\). Dropping the second (non-negative) term in each factor gives a lower bound on the likelihood that does not depend on \(\mu\):

\(\begin{aligned} P(X|\mu) &= \prod_{i=1}^n \frac{1}{2} (\frac{1}{\sqrt{2 \pi} } e^{-\frac{x_i^2}{2}} + \frac{1}{\sqrt{2 \pi} } e^{-\frac{(x_i-\mu)^2}{2}}) \\ &\geq \prod_{i=1}^n \frac{1}{2} \frac{1}{\sqrt{2 \pi} } e^{-\frac{x_i^2}{2}} = 2^{-n} (2 \pi)^{-\frac{n}{2}} e^{-\sum_{i=1}^n x_i^2/2} > 0. \\ \end{aligned}\)

\(\begin{aligned} \int P(\mu|X)\, d\mu &\propto \int P(X|\mu) P(\mu)\, d\mu \\ &\geq 2^{-n} (2 \pi)^{-\frac{n}{2}} e^{-\sum_{i=1}^n x_i^2/2} \int P(\mu)\, d\mu \\ &= \infty. \\ \end{aligned}\)

Therefore, whenever the prior on \(\mu\) is improper, i.e. \(\int P(\mu)\, d\mu = \infty\), the posterior has an infinite normalizing constant and is itself improper, regardless of the number of observations \(n\).
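As a quick numerical sanity check (an illustration only, on a small simulated dataset; the proof above does not depend on it), the log-likelihood never falls below the \(\mu\)-free lower bound used above:

set.seed(1)
x = rnorm(20)  # hypothetical data; the bound holds for any data
loglik = function(mu) sum(log(0.5 * dnorm(x) + 0.5 * dnorm(x, mean = mu)))
log_lower_bound = sum(log(0.5 * dnorm(x)))                    # log of the mu-free bound
all(sapply(seq(-50, 50, by = 1), loglik) >= log_lower_bound)  # TRUE for every mu on the grid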

2. Consider a mixture of two Normal distributions \(N(\mu_1, \sigma_1^2)\) and \(N(\mu_2, \sigma_2^2)\) in which independent flat priors are placed on the probability of the first component, \(\mu_1\), \(\mu_2\), \(\log(\sigma_1)\), and \(\log(\sigma_2)\), respectively. Suppose no restrictions are placed on the parameters. Describe why any attempt to identify a posterior mode in this case will fail to converge to a meaningful point in the parameter space.

In this case, denote the probability of the first component by \(\omega\).

Then, \(p(x_i|\omega, \mu_1, \mu_2, \sigma_1, \sigma_2) = \omega \frac{1}{\sqrt{2 \pi} \sigma_1} exp\{-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\} + (1-\omega)\frac{1}{\sqrt{2 \pi} \sigma_2} exp\{-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}\}\).

\(\begin{aligned} \int \int P(\mu_1, \mu_2|X, \omega, \sigma_1, \sigma_2)\, d\mu_1 d\mu_2 &\propto \int \int \prod_{i=1}^n p(x_i|\omega, \mu_1, \mu_2, \sigma_1, \sigma_2)\, p(\mu_1)\, p(\mu_2)\, d\mu_1 d\mu_2 \\ &\geq \int \int (\frac{\omega}{\sqrt{2 \pi}\, \sigma_1})^n exp\{-\frac{\sum_i (x_i-\mu_1)^2}{2\sigma_1^2}\} + (\frac{1-\omega}{\sqrt{2 \pi}\, \sigma_2})^n exp\{-\frac{\sum_i (x_i-\mu_2)^2}{2\sigma_2^2}\}\, d\mu_1 d\mu_2 \\ &= \int (\frac{\omega}{\sqrt{2 \pi}\, \sigma_1})^n exp\{-\frac{\sum_i (x_i-\mu_1)^2}{2\sigma_1^2}\}\, d\mu_1 \cdot \int d\mu_2 + \int (\frac{1-\omega}{\sqrt{2 \pi}\, \sigma_2})^n exp\{-\frac{\sum_i (x_i-\mu_2)^2}{2\sigma_2^2}\}\, d\mu_2 \cdot \int d\mu_1 \\ &= \infty, \\ \end{aligned}\)

where the flat priors \(p(\mu_j) \propto 1\) are absorbed into the proportionality, the inequality keeps only the two terms of the expanded product in which every observation is attributed to the same component, and \(\int d\mu_j = \infty\).

Therefore the posterior cannot be normalized: the first term of the lower bound is constant in \(\mu_2\) and the second is constant in \(\mu_1\), so the posterior density does not decay as either mean tends to \(\pm\infty\); it flattens out at a strictly positive level. Along these unbounded flat ridges the gradient vanishes without the density falling off, so a mode-seeking algorithm can drift indefinitely. Moreover, with the variances unrestricted and flat priors on \(\log \sigma_j\), the posterior density is unbounded: setting \(\mu_1 = x_1\) and letting \(\sigma_1 \to 0\) drives it to \(+\infty\). Any search for a posterior mode is therefore pulled along the ridges or toward these degenerate configurations, and fails to converge to a meaningful point in the parameter space.
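A small numerical illustration of these flat directions (again only an illustration, with hypothetical values for the other parameters): as \(\mu_2\) grows with everything else held fixed, the log-likelihood levels off at a finite constant instead of decaying to \(-\infty\), which is exactly why integrating over \(\mu_2\) under a flat prior diverges.

set.seed(2)
x = rnorm(20)  # hypothetical data
# mixture log-likelihood as a function of mu_2, with omega = 0.5, mu_1 = 0, sigma_1 = sigma_2 = 1 fixed
loglik2 = function(mu2) sum(log(0.5 * dnorm(x, 0, 1) + 0.5 * dnorm(x, mu2, 1)))
sapply(c(10, 100, 1000, 1e6), loglik2)  # levels off at sum(log(0.5 * dnorm(x, 0, 1)))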

Computation

In this problem we will consider a mixture of linear regression models. The data for this problem is contained in \(mix\_reg.txt\), with variable x denoting the predictor and y the (continuous) response.

Load data first.

set.seed(666)
setwd("C:/Users/Wei/Documents/Purdue STAT 695 Bayesian Data Analysis/HW5")
data = read.csv(file="mix_reg.txt", header=TRUE)

Set predictor and response.

X = data$x
y = data$y[order(X)]   # reorder the response to match the sorted predictor
X = sort(X)            # sort the predictor so fitted curves plot left to right
X = cbind(1, X)        # design matrix with an intercept column

1. Fit the standard linear regression model \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\), where \(\epsilon_i \overset{\mathrm{i.i.d.}}{\sim} N(0, \sigma^2)\). Specify a conjugate prior on \(\theta =(\beta_0, \beta_1, \sigma^2)\), making sure to explain your choice of hyperparameters. Plot the observed data, posterior predictive mean trend, and 90% central credible intervals on a grid over [0, 10] with increments of 0.1.

Specify a conjugate normal-inverse-gamma prior of the form \(p(\beta, \sigma^2) = p(\beta|\sigma^2)\, p(\sigma^2)\), where \(\beta = (\beta_0, \beta_1)^T\), \(\sigma^2\) follows an inverse gamma distribution \(Inv\ Gamma(a, b)\), and \(\beta|\sigma^2\) follows a normal distribution \(N(\mu, \sigma^2 \Lambda_0^{-1})\) with \(\mu = 0\). The hyperparameters are chosen to be weakly informative, since we have little prior knowledge about the coefficients or the noise level: \(a = b = 1\), prior mean zero, and unit prior precision \(\Lambda_0 = I_2\).

\(p(\beta|\sigma^2) \propto \sigma^{-2} exp(-\frac{1}{2\sigma^2}(\beta-\mu)^T \Lambda_0(\beta-\mu))\)

Then, the posterior distribution can be expressed as

\(\begin{aligned} p(\beta,\sigma^2|y, X) &\propto p(y|X, \beta, \sigma^2)\, p(\beta|\sigma^2)\, p(\sigma^2) \\ &\propto (\sigma^2)^{-\frac{n}{2}}exp(-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta))\ \sigma^{-2} exp(-\frac{1}{2\sigma^2}(\beta-\mu)^T \Lambda_0 (\beta-\mu))\ (\sigma^2)^{-(a+1)} exp(-\frac{b}{\sigma^2}). \\ \end{aligned}\)
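For reference, standard normal-inverse-gamma algebra (not needed by the sampler below, but it makes the conjugacy explicit) gives a posterior in the same family with updated hyperparameters:

\(\begin{aligned} \beta|\sigma^2, y &\sim N(\mu_n, \sigma^2 \Lambda_n^{-1}), \qquad \sigma^2|y \sim Inv\ Gamma(a_n, b_n), \\ \Lambda_n &= X^TX + \Lambda_0, \qquad \mu_n = \Lambda_n^{-1}(X^Ty + \Lambda_0 \mu), \\ a_n &= a + \frac{n}{2}, \qquad b_n = b + \frac{1}{2}(y^Ty + \mu^T \Lambda_0 \mu - \mu_n^T \Lambda_n \mu_n). \\ \end{aligned}\)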

We can now code the log posterior. To simplify sampling, we work with the log of the variance, which maps \(\sigma^2\) to the whole real line \((-\infty, \infty)\); this change of variables contributes a Jacobian term to the prior.
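Concretely, if \(\phi = \log \sigma^2\), the prior density of the transformed parameter picks up a Jacobian factor,

\(p(\phi) = p_{\sigma^2}(e^{\phi}) \left|\frac{d\sigma^2}{d\phi}\right| = p_{\sigma^2}(e^{\phi})\, e^{\phi},\)

so the log prior gains an additive \(\phi\) term (the pars[3] term added in the code below).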

shape = 1
scale = 1
std = 1

library(invgamma)
log_priors = function(pars) {
    # independent N(0, std^2) priors on beta_0 and beta_1, and an inverse gamma
    # prior on sigma^2 = exp(pars[3]); the trailing pars[3] is the Jacobian of
    # the log-variance transform
    dnorm(pars[1], mean=0, sd=std, log=T) + 
    dnorm(pars[2], mean=0, sd=std, log=T) +
    dinvgamma(exp(pars[3]), shape = shape, scale = scale, log = TRUE) + pars[3]
}

log_likelihood = function(pars) {
    beta = pars[1:2] # beta_0, beta_1
    sigma = sqrt(exp(pars[3]))
    sum(dnorm(y, mean = X %*% beta, sd = sigma, log = T))
}

log_posterior = function(pars) log_priors(pars) + log_likelihood(pars)

Use numerical optimization to find the posterior mode (used as the initial value) and the Hessian there.

optimal = optim(rep(1, 3), log_posterior, control=list(fnscale=-1), hessian=TRUE)

initialValue = optimal$par
Hessian = optimal$hessian

Set up our Metropolis–Hastings sampler with a symmetric normal random-walk proposal whose covariance is the inverse of the negative Hessian at the mode.

burnIn = 50000
iterations = 2 * burnIn

chains = array(dim=c(iterations + 1, 3))
chains[1, ] = initialValue
Rchol = chol(-Hessian)

BarProgress = txtProgressBar(min = 1, max = nrow(chains), style = 3)
for (i in 1: iterations) {
    # propose from N(current, (-Hessian)^{-1}) without forming the inverse:
    # solving R z = u with R = chol(-Hessian) gives z ~ N(0, (-Hessian)^{-1})
    proposal = chains[i, ] + backsolve(Rchol, rnorm(3))

    # work on the log scale to avoid overflow; the proposal is symmetric, so the
    # acceptance probability reduces to the posterior ratio
    log_acceptance_prob = log_posterior(proposal)  -  log_posterior(chains[i, ])
    chains[i + 1, ] = chains[i, ]
    if (log(runif(1)) < log_acceptance_prob)
        chains[i + 1, ] = proposal
    #setTxtProgressBar(BarProgress, i) # not suitable in Rmd output, useful when run interactively
}
close(BarProgress)
pars_draws = chains[-(1: burnIn), ]
print(paste("Acceptance rate", round(nrow(unique(chains)) / nrow(chains), 4)))
## [1] "Acceptance rate 0.452"

Transform the log-variance draws back to the variance scale.

pars_draws[, 3] = exp(pars_draws[, 3])

As the figure below shows, the 90% predictive band is very wide, and a single regression line misses the obvious structure in the data; the residuals are far from normally distributed.

qt = array(NA, c(nrow(X), 3))
for (i in 1:nrow(X)) {
    beta = cbind(pars_draws[, 1], pars_draws[, 2])
    std = sqrt(pars_draws[, 3])                        # posterior draws of sigma
    # posterior predictive draws of y at x_i: mean part plus observation noise
    y_samples = beta %*% X[i, ] + rnorm(nrow(pars_draws), sd=std)
    qt[i, ] = quantile(y_samples, c(0.05, 0.5, 0.95))  # 90% central interval and median
}
plot(X[,2], y, xlab='X', ylab='y', main="Bayesian Linear Regression")
lines(X[,2], qt[, 2], col='red')
lines(X[,2], qt[, 1], col='blue')
lines(X[,2], qt[, 3], col='blue')
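The problem statement asks for predictions on a grid over [0, 10] with increments of 0.1 rather than at the observed \(x\) values; a minimal sketch of that version, reusing the same posterior draws, is below (x_grid and grid_qt are illustrative names, not part of the original code).

x_grid = seq(0, 10, by = 0.1)
grid_qt = sapply(x_grid, function(x0) {
    # posterior predictive draws at grid point x0: linear mean plus observation noise
    mu = pars_draws[, 1] + pars_draws[, 2] * x0
    quantile(mu + rnorm(nrow(pars_draws), sd = sqrt(pars_draws[, 3])), c(0.05, 0.5, 0.95))
})
# e.g. lines(x_grid, grid_qt[2, ], col='red') overlays the predictive median on the plot above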

2. Now fit a mixture of two linear regression models to the data. Under this mixture model, component \(i = 1, 2\) has probability \(\pi_i\) and parameter vector \(\theta_i\), and the response for a given predictor is generated by first selecting a regression component with its associated probability \(\pi_i\), and then using its parameters \(\theta_i\) to generate a response. Use your previous conjugate prior for the parameter vectors \(\theta_i\), and place a Dirichlet prior over the probability parameters. Implement a Gibbs sampler to obtain posterior draws of these parameters for this data, and discuss the performance of your sampler. As before, provide a visualization of the posterior predictions under this mixture model for a grid of x-values, and compare and contrast this model’s utility with that of the previous standard linear regression.

Set the hyperparameters. Based on the figure above, we model the data as a mixture of two regressions: one purely linear, the other including a quadratic term.

n = nrow(data)
shape = 1
scale = 1000
std = 1000
X = cbind(X, X[, 2]^2)

The problem asks for a Dirichlet prior on the component probabilities; with two components, a Dirichlet\((\alpha_1, \alpha_2)\) prior on \((\pi_1, 1-\pi_1)\) is exactly a Beta\((\alpha_1, \alpha_2)\) prior on \(\pi_1\), so we place a Beta(1, 1) prior on the first component's probability, parameterized through a logit (sigmoid) transform.

sigmoid = function(x) 1 / (1 + exp(-x))

log_priors = function(pars) {
    # N(0, std^2) priors on the regression coefficients, inverse gamma priors on
    # the two scale parameters exp(pars[6:7]), and a Beta(1, 1) prior on lambda;
    # the extra terms are the Jacobians of the log and logit transforms
    lambda = sigmoid(pars[8])
    sum(dnorm(pars[1:5], mean=0, sd=std, log=T)) + 
    sum(dinvgamma(exp(pars[6:7]), shape = shape, scale = scale, log = TRUE)) + sum(pars[6:7]) +
    dbeta(lambda, 1, 1, log=TRUE) + log(lambda) + log(1 - lambda)
}

# complete-data log likelihood, conditional on the current assignments Z (used as a global)
log_likelihood = function(pars) {
    mu1 = c(pars[1:2], 0)
    mu2 = pars[3:5]
    sd1 = exp(pars[6])
    sd2 = exp(pars[7])
    lambda = sigmoid(pars[8])
    sum((dnorm(y, mean=X %*% mu1, sd=sd1, log=T) + log(lambda)) * Z) + 
    sum((dnorm(y, mean=X %*% mu2, sd=sd2, log=T) + log(1 - lambda)) * (1 - Z))
}

log_posterior = function(pars) log_priors(pars) + log_likelihood(pars)

We use Metropolis–Hastings within the Gibbs scheme. Given the current parameters, we update the assignments \(Z\) by giving each observation to the component under which it is more likely (a hard assignment), and then, conditional on \(Z\) and \(y\), we draw new values of \(\beta\), \(\sigma\), and \(\lambda\) with random-walk MH steps; a fully stochastic alternative update for \(Z\) is sketched after the sampling loop below.

pars = rep(0.5, 8)
Z = rbinom(n, 1, 0.5)
burnIn = 1000
iterations = 2 * burnIn

for (i in 1: 10) {
    # re-center the chain at the conditional posterior mode given the current Z
    optimal = optim(pars, log_posterior, control=list(fnscale=-1), hessian=TRUE)
    pars = optimal$par
    log_posterior_raw = log_posterior(pars)    # kept only for monitoring
    chains = array(dim=c(iterations + 1, 8))
    chains[1, ] = pars
    for (j in 1: iterations) {
        # random-walk MH step with a fixed proposal step size
        proposal = chains[j, ] + rnorm(8, sd=0.1)

        # work on the log scale; the proposal is symmetric
        log_acceptance_prob = log_posterior(proposal)  -  log_posterior(chains[j, ])
        chains[j + 1, ] = chains[j, ]
        if (log(runif(1)) < log_acceptance_prob)
            chains[j + 1, ] = proposal
    }
    pars_draws = chains[-(1: burnIn), ]
    #print(paste(i, "th round: ", "Acceptance rate", round(nrow(unique(chains)) / nrow(chains), 4)))
    pars = as.numeric(tail(pars_draws, 1))     # last retained draw becomes the current state

    # hard assignment: give each observation to the component under which it is more likely
    group_1 = dnorm(y, X %*% c(pars[1:2], 0), sd=exp(pars[6]), log=T)
    group_2 = dnorm(y, X %*% pars[3:5], sd=exp(pars[7]), log=T)
    Z = as.numeric(group_1 > group_2)
    log_posterior_update = log_posterior(pars) # kept only for monitoring
}
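For reference, a fully stochastic Gibbs update for \(Z\) would draw each \(z_i\) from its full conditional Bernoulli distribution rather than using the hard assignment above; a minimal sketch, reusing pars, X, y, and n from the loop (resp is an illustrative name):

lambda = sigmoid(pars[8])
log_p1 = log(lambda) + dnorm(y, X %*% c(pars[1:2], 0), sd = exp(pars[6]), log = TRUE)
log_p2 = log(1 - lambda) + dnorm(y, X %*% pars[3:5], sd = exp(pars[7]), log = TRUE)
resp = as.numeric(1 / (1 + exp(log_p2 - log_p1)))  # P(z_i = 1 | y_i, current parameters)
Z = rbinom(n, 1, resp)                             # sample assignments instead of taking the argmax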

The figure below shows a clear improvement in fit, and the component assignments \(Z\) are sensible.

plot(X[Z == 1, 2], y[Z == 1], xlim=range(X[, 2]), ylim=c(-10, 60), col="red", pch=19, xlab='X', ylab='y')
points(X[Z == 0, 2], y[Z == 0], col="black", pch=15)

qt = array(NA, c(n, 2, 3))
pars_draws[, 6] = exp(pars_draws[, 6])   # back-transform to the standard-deviation scale
pars_draws[, 7] = exp(pars_draws[, 7])

for (i in 1:n) {
    # component 1: linear mean with its own noise level
    beta = cbind(pars_draws[, 1:2], 0)
    sd1 = pars_draws[, 6]                # already on the sd scale; no extra sqrt needed
    y_samples = beta %*% X[i, ] + rnorm(nrow(pars_draws), sd=sd1)
    qt[i, 1, ] = quantile(y_samples, c(0.05, 0.5, 0.95))

    # component 2: quadratic mean with its own noise level
    beta = pars_draws[, 3:5]
    sd2 = pars_draws[, 7]
    y_samples = beta %*% X[i, ] + rnorm(nrow(pars_draws), sd=sd2)
    qt[i, 2, ] = quantile(y_samples, c(0.05, 0.5, 0.95))
}

plot(X[, 2], y, xlab='X', ylab='y', main='Bayesian Mixture Model')
lines(X[, 2], qt[, 1, 1], col='red')
lines(X[, 2], qt[, 1, 2], col='red')
lines(X[, 2], qt[, 1, 3], col='red')
lines(X[, 2], qt[, 2, 1], col='blue')
lines(X[, 2], qt[, 2, 2], col='blue')
lines(X[, 2], qt[, 2, 3], col='blue')
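As in the first part, the assignment asks for predictions on a grid of \(x\)-values; a minimal sketch of the same two component-wise predictive bands evaluated on a grid over [0, 10] with increments of 0.1 (x_grid, grid_design, qt1, and qt2 are illustrative names, and pars_draws is reused from above):

x_grid = seq(0, 10, by = 0.1)
grid_design = cbind(1, x_grid, x_grid^2)   # intercept, linear, and quadratic columns
qt1 = apply(grid_design, 1, function(x_row)
    quantile(cbind(pars_draws[, 1:2], 0) %*% x_row +
             rnorm(nrow(pars_draws), sd = pars_draws[, 6]), c(0.05, 0.5, 0.95)))
qt2 = apply(grid_design, 1, function(x_row)
    quantile(pars_draws[, 3:5] %*% x_row +
             rnorm(nrow(pars_draws), sd = pars_draws[, 7]), c(0.05, 0.5, 0.95)))
# e.g. lines(x_grid, qt1[2, ], col='red'); lines(x_grid, qt2[2, ], col='blue')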