For each observation, the density is \(p(x_i|\mu) = \frac{1}{2} \left(\frac{1}{\sqrt{2 \pi}} e^{-\frac{x_i^2}{2}} + \frac{1}{\sqrt{2 \pi}} e^{-\frac{(x_i-\mu)^2}{2}}\right)\), so
\(\begin{aligned} P(X|\mu) &= \frac{1}{2^n} \prod_{i=1}^n \left(\frac{1}{\sqrt{2 \pi}} e^{-\frac{x_i^2}{2}} + \frac{1}{\sqrt{2 \pi}} e^{-\frac{(x_i-\mu)^2}{2}}\right) \\ &> 2^{-n} (2 \pi)^{-\frac{n}{2}} e^{-\sum_{i=1}^n x_i^2/2}, \\ \end{aligned}\)
since each factor is bounded below by its first term, which does not involve \(\mu\).
\(\begin{aligned} \int P(\mu|X) d\mu &\propto \int P(X|\mu) P(\mu)d\mu \\ &> 2^{-n} (2 \pi)^{-\frac{n}{2}} e^{-\sum_{i=1}^n x_i^2/2} \int P(\mu)d\mu \\ &= \infty \\ \end{aligned}\)
whenever \(\int P(\mu)d\mu = \infty\). Therefore, whenever the prior on \(\mu\) is improper, the posterior is improper as well.
Now consider the general two-component mixture; denote the mixing weight of the first component by \(\omega\).
Then \(p(x_i|\omega, \mu_1, \mu_2, \sigma_1, \sigma_2) = \omega \frac{1}{\sqrt{2 \pi} \sigma_1} \exp\left\{-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\right\} + (1-\omega)\frac{1}{\sqrt{2 \pi} \sigma_2} \exp\left\{-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}\right\}\); the weights \(\omega\) and \(1-\omega\) already sum to one, so there is no extra factor of \(\frac{1}{2}\).
\(\begin{aligned} \int\!\!\int P(\mu_1, \mu_2|X, \omega, \sigma_1, \sigma_2)\, d\mu_1 d\mu_2 &\propto \int\!\!\int \prod_{i=1}^n p(x_i|\omega, \mu_1, \mu_2, \sigma_1, \sigma_2)\, p(\mu_1) p(\mu_2)\, d\mu_1 d\mu_2 \\ &\ge \int\!\!\int \left(\frac{\omega}{\sqrt{2 \pi}\, \sigma_1}\right)^n \exp\left\{-\frac{\sum_i (x_i-\mu_1)^2}{2\sigma_1^2}\right\} + \left(\frac{1-\omega}{\sqrt{2 \pi}\, \sigma_2}\right)^n \exp\left\{-\frac{\sum_i (x_i-\mu_2)^2}{2\sigma_2^2}\right\} d\mu_1 d\mu_2 \\ &= \int \left(\frac{\omega}{\sqrt{2 \pi}\, \sigma_1}\right)^n \exp\left\{-\frac{\sum_i (x_i-\mu_1)^2}{2\sigma_1^2}\right\} d\mu_1 \cdot \int d\mu_2 + \int \left(\frac{1-\omega}{\sqrt{2 \pi}\, \sigma_2}\right)^n \exp\left\{-\frac{\sum_i (x_i-\mu_2)^2}{2\sigma_2^2}\right\} d\mu_2 \cdot \int d\mu_1, \\ \end{aligned}\)
where the second line uses \(\prod_i (a_i + b_i) \ge \prod_i a_i + \prod_i b_i\) for nonnegative terms, and the priors are flat: \(p(\mu_1) \propto 1\), \(p(\mu_2) \propto 1\).
Each Gaussian integral in the last line is finite and strictly positive, while \(\int d\mu_1 = \int d\mu_2 = \infty\) under the flat priors, so \(\int\!\!\int P(\mu_1, \mu_2|X, \omega, \sigma_1, \sigma_2)\, d\mu_1 d\mu_2 = \infty\).
Thus the posterior cannot be normalized: with improper priors on the component means, the posterior of the mixture model is improper, and posterior inference (e.g., MCMC) cannot converge to a valid distribution.
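A quick numerical illustration of the bound (a minimal sketch with simulated data x_sim, not the homework data): the log likelihood never falls below the \(\mu\)-free constant, so its integral against a flat prior over the whole real line diverges.
# minimal sketch (assumed simulated data): the mixture log likelihood is bounded below by a constant in mu
set.seed(1)
x_sim = rnorm(20)                           # hypothetical data
log_lik_mu = function(mu) sum(log(0.5 * dnorm(x_sim) + 0.5 * dnorm(x_sim, mean = mu)))
lower_bound = sum(log(0.5 * dnorm(x_sim)))  # log of 2^{-n} (2 pi)^{-n/2} exp(-sum x_i^2 / 2)
sapply(c(0, 10, 100, 1e6), log_lik_mu) - lower_bound  # always >= 0, approaching 0 as |mu| grows
Since the likelihood stays above a fixed positive constant for every \(\mu\), its integral over \((-\infty, \infty)\) is infinite.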
Load data first.
set.seed(666)
setwd("C:/Users/Wei/Documents/Purdue STAT 695 Bayesian Data Analysis/HW5")
data = read.csv(file="mix_reg.txt", header=TRUE)
Set predictor and response.
X = data$x
y = data$y[order(X)]  # reorder the response to match the sorted predictor
X = sort(X)           # sort x so that lines() can draw the fitted curves later
X = cbind(1, X)       # prepend an intercept column
Specify a prior of the form \(p(\beta, \sigma^2) \propto p(\beta|\sigma^2)\, p(\sigma^2)\), where \(\beta = (\beta_0, \beta_1)\). Here \(p(\sigma^2)\) is an inverse-gamma distribution \(\text{Inv-Gamma}(a, b)\), and \(p(\beta|\sigma^2)\) is normal with mean \(\mu_0 = 0\) and precision \(\sigma^{-2}\Lambda_0\), with \(\Lambda_0 = I_2\):
\(p(\beta|\sigma^2) \propto \sigma^{-2} \exp\left(-\frac{1}{2\sigma^2}(\beta-\mu_0)^T \Lambda_0(\beta-\mu_0)\right)\)
Then, the posterior distribution can be expressed as
\(\begin{aligned} p(\beta,\sigma^2|y, X) &\propto p(y|X, \beta, \sigma^2)\, p(\beta|\sigma^2)\, p(\sigma^2) \\ &\propto (\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)\right)\, \sigma^{-2} \exp\left(-\frac{1}{2\sigma^2}(\beta-\mu_0)^T \Lambda_0 (\beta-\mu_0)\right)\, p_{\text{Inv-Gamma}}(\sigma^2; a, b) \\ \end{aligned}\)
Then we can build the model's log posterior. To simplify sampling, we work with the log of the variance so that the parameter is unconstrained on \((-\infty, \infty)\).
shape = 1
scale = 1
std = 1
library(invgamma)
# log prior: for simplicity, independent N(0, std^2) priors on beta_0 and beta_1
# (not scaled by sigma^2), and an inverse-gamma prior on the variance exp(pars[3])
log_priors = function(pars) {
  dnorm(pars[1], mean=0, sd=std, log=T) +
  dnorm(pars[2], mean=0, sd=std, log=T) +
  dinvgamma(exp(pars[3]), shape, scale = scale, log = TRUE)
}
# log likelihood: y ~ N(X beta, sigma^2) with sigma^2 = exp(pars[3])
log_likelihood = function(pars) {
  beta = pars[1:2]            # beta_0, beta_1
  sigma = sqrt(exp(pars[3]))  # pars[3] is the log variance
  sum(dnorm(y, mean = X %*% beta, sd = sigma, log = T))
}
log_posterior = function(pars) log_priors(pars) + log_likelihood(pars)
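One caveat: because the sampler works on \(\theta = \log\sigma^2\), placing the inverse-gamma prior on \(\sigma^2\) itself strictly requires adding the log-Jacobian \(\log|d\sigma^2/d\theta| = \theta\) to the log prior. A minimal sketch of the adjusted target (an optional variant, not what the chunk above uses):
# optional variant: include the change-of-variables Jacobian for the log-variance scale
log_posterior_with_jacobian = function(pars) {
  log_posterior(pars) + pars[3]  # + log|d sigma^2 / d log(sigma^2)| = pars[3]
}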
Use numerical optimization to find a starting value (the posterior mode) and the Hessian there.
optimal = optim(rep(1, 3), log_posterior, control=list(fnscale=-1), hessian=TRUE)  # fnscale = -1 makes optim maximize
initialValue = optimal$par
Hessian = optimal$hessian
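Since the next step takes the Cholesky factor of \(-H\), it can be worth checking that the Hessian at the mode is negative definite; a minimal sanity check (an optional addition, not in the original workflow):
# optional check: -Hessian must be positive definite before chol(-Hessian)
stopifnot(all(eigen(-Hessian, symmetric = TRUE, only.values = TRUE)$values > 0))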
Set up our MH sampler with a symmetric normal proposal whose covariance is \((-H)^{-1}\), i.e., the curvature of the log posterior at the mode.
burnIn = 50000
iterations = 2 * burnIn
chains = array(dim=c(iterations + 1, 3))
chains[1, ] = initialValue
Rchol = chol(-Hessian)  # upper-triangular R with t(R) %*% R = -Hessian
BarProgress = txtProgressBar(min = 1, max = nrow(chains), style = 3)
for (i in 1: iterations) {
  # solve the triangular system rather than storing the inverse of the matrix
  proposal = chains[i, ] + backsolve(Rchol, rnorm(3))
  # work on the log scale to avoid overflow; the symmetric proposal cancels in the ratio
  log_acceptance_prob = log_posterior(proposal) - log_posterior(chains[i, ])
  chains[i + 1, ] = chains[i, ]
  if (log(runif(1)) < log_acceptance_prob)
    chains[i + 1, ] = proposal
  #setTxtProgressBar(BarProgress, i) # not suitable in rmd, useful when run interactively
}
close(BarProgress)
pars_draws = chains[-(1: burnIn), ]
print(paste("Acceptance rate", round(nrow(unique(chains)) / nrow(chains), 4)))  # proportion of distinct states approximates the acceptance rate
## [1] "Acceptance rate 0.452"
Transform the log-variance draws back to the original (variance) scale.
pars_draws[, 3] = exp(pars_draws[, 3])
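Before using the draws it is worth a quick look at mixing; a minimal sketch of trace plots (an optional diagnostic, not part of the original analysis):
# optional diagnostic: trace plots of the retained draws
par(mfrow = c(1, 3))
for (k in 1:3) plot(pars_draws[, k], type = "l", xlab = "iteration", ylab = paste("parameter", k))
par(mfrow = c(1, 1))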
As the figure below shows, the posterior predictive interval is far too wide to be acceptable, and the residuals are clearly not normally distributed.
# posterior predictive 5%, 50%, and 95% quantiles at each observed x
qt = array(NA, c(200, 3))
for (i in 1:200) {
  beta = cbind(pars_draws[, 1], pars_draws[, 2])
  std = sqrt(pars_draws[, 3])  # note: this reuses (and overwrites) the name of the prior sd hyperparameter
  y_samples = beta %*% X[i, ] + rnorm(burnIn + 1, sd = std)
  qt[i, ] = quantile(y_samples, c(0.05, 0.5, 0.95))
}
plot(X[,2], y, xlab='X', ylab='y', main="Bayesian Linear Regression")
lines(X[,2], qt[, 2], col='red')
lines(X[,2], qt[, 1], col='blue')
lines(X[,2], qt[, 3], col='blue')
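To back up the claim about the residuals, here is a minimal sketch of a check using the posterior median coefficients (the names beta_hat and residuals_hat are introduced only for this check):
# optional check: residuals from the posterior-median fit against a normal QQ line
beta_hat = apply(pars_draws[, 1:2], 2, median)
residuals_hat = y - X %*% beta_hat
qqnorm(residuals_hat)
qqline(residuals_hat)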
From the figure above, we plan to model the data as a mixture of two regressions: one component is a plain linear regression, the other includes a quadratic term. First, set the hyperparameters.
n = nrow(data)
shape = 1
scale = 1000
std = 1000
X = cbind(X, X[, 2]^2)  # append the quadratic term x^2
The natural prior for the mixing weights is a Dirichlet distribution, but with two components the Dirichlet reduces to a Beta distribution, so we use a Beta prior on the weight instead.
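A quick numerical sanity check of that equivalence (a minimal sketch; the two-component Dirichlet density is written out by hand):
# For K = 2, Dirichlet(a1, a2) on (w, 1 - w) has the same density as Beta(a1, a2) on w
a1 = 2; a2 = 3; w = 0.3
dirichlet_2d = w^(a1 - 1) * (1 - w)^(a2 - 1) / beta(a1, a2)
all.equal(dirichlet_2d, dbeta(w, a1, a2))  # TRUE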
sigmoid = function(x) 1 / (1 + exp(-x))
# log prior: normal priors on the five regression coefficients, inverse-gamma
# priors on the two scales exp(pars[6:7]), and a Beta(1, 1) prior on the mixing weight
log_priors = function(pars) {
  sum(dnorm(pars[1:5], mean=0, sd=std, log=T)) +
  sum(dinvgamma(exp(pars[6:7]), shape, scale=scale, log=TRUE)) +
  dbeta(sigmoid(pars[8]), 1, 1, log=TRUE)
}
# complete-data log likelihood given the current assignments Z
log_likelihood = function(pars) {
  mu1 = c(pars[1:2], 0)      # component 1: linear, quadratic coefficient fixed at 0
  mu2 = pars[3:5]            # component 2: intercept, linear, and quadratic coefficients
  sd1 = exp(pars[6])         # pars[6] and pars[7] are log standard deviations
  sd2 = exp(pars[7])
  lambda = sigmoid(pars[8])  # mixing weight of component 1
  sum((dnorm(y, mean=X %*% mu1, sd=sd1, log=T) + log(lambda)) * Z) +
  sum((dnorm(y, mean=X %*% mu2, sd=sd2, log=T) + log(1 - lambda)) * (1 - Z))
}
log_posterior = function(pars) log_priors(pars) + log_likelihood(pars)
Write MH sampling within a Gibbs-style loop. In each round we re-optimize and then run an MH chain for \((\beta, \sigma, \lambda)\) conditional on the current assignments \(Z\) and the data \(y\), and then update \(Z\) by assigning each observation to the component under which it is more likely.
pars = rep(0.5, 8)
Z = rbinom(n, 1, 0.5)
burnIn = 1000
iterations = 2 * burnIn
for (i in 1: 10) {
  # re-optimize the parameters given the current assignments Z
  optimal = optim(pars, log_posterior, control=list(fnscale=-1), hessian=TRUE)
  pars = optimal$par
  log_posterior_raw = log_posterior(pars)  # value at the mode; could be used to monitor the outer loop
  chains = array(dim=c(iterations + 1, 8))
  chains[1, ] = pars
  for (j in 1: iterations) {
    # simple symmetric random-walk proposal
    proposal = chains[j, ] + rnorm(8, sd=0.1)
    # work on the log scale to avoid overflow; the symmetric proposal cancels in the ratio
    log_acceptance_prob = log_posterior(proposal) - log_posterior(chains[j, ])
    chains[j + 1, ] = chains[j, ]
    if (log(runif(1)) < log_acceptance_prob)
      chains[j + 1, ] = proposal
  }
  pars_draws = chains[-(1: burnIn), ]
  #print(paste(i, "th round: ", "Acceptance rate", round(nrow(unique(chains)) / nrow(chains), 4)))
  pars = as.numeric(tail(pars_draws, 1))
  group_1 = dnorm(y, X %*% c(pars[1:2], 0), sd=exp(pars[6]), log=T)
  group_2 = dnorm(y, X %*% pars[3:5], sd=exp(pars[7]), log=T)
  Z = as.numeric(group_1 > group_2)  # key step: hard-assign each point to the more likely component
  log_posterior_update = log_posterior(pars)  # can be compared with log_posterior_raw across rounds
}
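A standard Gibbs step would instead draw each \(Z_i\) from its full conditional rather than hard-assigning it. A minimal sketch of that alternative (an assumption about how one might soften the assignment; the names lambda_hat, prob_1, and Z_stochastic are introduced only here):
# alternative: sample Z_i from its full conditional (the component-1 responsibility)
lambda_hat = sigmoid(pars[8])
log_w1 = log(lambda_hat) + group_1
log_w2 = log(1 - lambda_hat) + group_2
prob_1 = plogis(log_w1 - log_w2)  # stable evaluation of w1 / (w1 + w2)
Z_stochastic = rbinom(n, 1, prob_1)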
The figures below show a clear improvement in model fit, and the component assignments are sensible.
# plot component-1 points in red and component-2 points in black
# (multiplying by Z or 1 - Z collapses the other component's points to the origin)
plot(X[, 2] * Z, y * Z, ylim=c(-10, 60), col="red", pch=19, xlab='X', ylab='y')
points(X[, 2] * (1 - Z), y * (1 - Z), col="black", pch=15)
qt = array(NA, c(200, 2, 3))
pars_draws[, 6] = exp(pars_draws[, 6])  # back-transform the log standard deviations
pars_draws[, 7] = exp(pars_draws[, 7])
for (i in 1:200) {
  # component 1: linear fit
  beta = cbind(pars_draws[, 1:2], 0)
  std = pars_draws[, 6]                 # exp(pars[6]) is already the standard deviation
  y_samples = beta %*% X[i, ] + rnorm(burnIn + 1, sd = std)
  qt[i, 1, ] = quantile(y_samples, c(0.05, 0.5, 0.95))
  # component 2: quadratic fit
  beta = pars_draws[, 3:5]
  std = pars_draws[, 7]                 # use the second component's standard deviation
  y_samples = beta %*% X[i, ] + rnorm(burnIn + 1, sd = std)
  qt[i, 2, ] = quantile(y_samples, c(0.05, 0.5, 0.95))
}
plot(X[, 2], y, xlab='X', ylab='y', main='Bayesian Mixture Model')
lines(X[, 2], qt[, 1, 1], col='red')
lines(X[, 2], qt[, 1, 2], col='red')
lines(X[, 2], qt[, 1, 3], col='red')
lines(X[, 2], qt[, 2, 1], col='blue')
lines(X[, 2], qt[, 2, 2], col='blue')
lines(X[, 2], qt[, 2, 3], col='blue')