References

BST 249 Materials

Escobar, M. D. and West, M. (1995), “Bayesian Density Estimation and Inference Using Mixtures”, \(\textit{Journal of the American Statistical Association}\), Vol. 90, No. 430, pp. 577-588.

Load packages

library(assertthat)
library(ggplot2)
library(tidyverse)
library(MCMCpack)

Aims

Here we aim to generate samples from the posterior distribution in a Dirichlet Process (DP) Mixture model. The setting is \[\begin{align*} F &\sim \text{DP}(M = 1, G_0 = N(0,1))\\ \theta_1,\theta_2, \dots, \theta_n|F &\overset{\text{iid}}{\sim} F\\ X_i|\theta_i &\sim N(\theta_i,1) \end{align*}\]

We are interested in generating samples from \(F|{\boldsymbol{X}}\) where \({\boldsymbol{X}}= (X_1, X_2, \dots, X_{50})\).

A quick look at the prior distribution of \(F\)

Since the prior is \(F\sim \text{DP}(M = 1, G_0 = N(0,1))\), we can generate samples of \(F\) using any of the five constructive definitions illustrated at http://rpubs.com/xihaoli/DP-for-G. Here we use the stick-breaking approach, as follows.

## Define a one-time stick-breaking generator
## Returns a list: [[1]] atom locations drawn from G0, [[2]] stick-breaking weights
StickBreakingGenerator <- function(num, M, G0.generator) {
  theta.vector <- G0.generator(n = num)      # atom locations theta_k ~ G0
  v.vector <- rbeta(num, 1, M)               # stick-breaking proportions v_k ~ Beta(1, M)
  w.vector <- rep(NA, num)
  remaining <- 1                             # length of stick not yet broken off
  for (i in 1:num){
    w.vector[i] <- remaining * v.vector[i]   # w_k = v_k * prod_{j<k} (1 - v_j)
    remaining <- remaining - w.vector[i]
  }
  return(list(theta.vector, w.vector))
}
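
Before moving on, here is a quick sanity check (a minimal sketch using the generator above): the truncated weights should sum to just under 1, since the leftover stick mass after num breaks has expectation \((M/(M+1))^{\text{num}}\), which is negligible for num = 1000.

## Sanity check: the truncated weights sum to (almost) 1
set.seed(1)
sb <- StickBreakingGenerator(num = 1000, M = 1, G0.generator = rnorm)
sum(sb[[2]])     # close to 1; expected leftover mass is (1/2)^1000
length(sb[[1]])  # 1000 atom locations drawn from G0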

## Define a stick-breaking sampler:
## draws n realizations of F and returns the probability mass F assigns to
## each interval A_j defined by the cutoffs (truncated at num atoms)
StickBreakingSampler <- function(n, M, cutoff, num=1000, G0.generator=rnorm){
  output <- matrix(NA, nrow = n, ncol = length(cutoff) + 1)
  cutoff <- c(-Inf, cutoff, Inf)  # partition the real line into intervals
  for(i in 1:n){
    SB.sample <- StickBreakingGenerator(num=num, M=M, G0.generator=G0.generator)
    for(j in 1:ncol(output)){
      # F(A_j): total weight of atoms falling in (cutoff[j], cutoff[j+1]]
      output[i,j] <- sum(SB.sample[[2]][SB.sample[[1]] > cutoff[j] & SB.sample[[1]] <= cutoff[j+1]])
    }
  }
  return(output)
}

## Sample 1 vector
M <- 1
Prior1 <- StickBreakingSampler(n = 1, M = M, cutoff = c(-3,-2,-1,0,1,2,3))
Prior1
##               [,1]           [,2]             [,3]      [,4]      [,5]         [,6]         [,7] [,8]
## [1,] 3.937587e-230 0.000007703455 0.00000007882224 0.1492728 0.8501038 0.0006155712 4.591116e-15    0
sum(Prior1)
## [1] 1
## Sample many
Prior_n <- StickBreakingSampler(n = 10^4, M = M, cutoff = c(-3,-2,-1,0,1,2,3))
## Construct df for visualization
df_prior <- as.data.frame(Prior_n)
colnames(df_prior) <- paste0("G_A", seq_len(ncol(df_prior)))
df_prior$iter <- seq_len(nrow(df_prior))
df_prior_long <- gather(data = df_prior, key = key, value = value, -iter)

## Marginal distribution of G_Ai
ggplot(data = df_prior_long, mapping = aes(x = key, y = value)) +
  geom_boxplot() +
  theme_bw() + theme(legend.key = element_blank())

## Observe several realized vectors (line connects values from a single vector)
ggplot(data = subset(df_prior_long, iter <= 5),
       mapping = aes(x = key, y = value, group = iter, color = factor(iter))) +
  geom_line() +
  geom_point() +
  theme_bw() + theme(legend.key = element_blank())

We can see that, on average, the prior of \(F\) is symmetric around 0, since the base distribution \(G_0 = N(0,1)\) is symmetric.
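
We can also verify this numerically (a quick check, not in the original write-up): since \(E[F(A)] = G_0(A)\) under the DP prior, the column means of Prior_n should be close to the \(N(0,1)\) probabilities of the corresponding intervals.

## Check: E[F(A_j)] = G0(A_j) under the prior
round(colMeans(Prior_n), 3)
round(diff(pnorm(c(-Inf, -3, -2, -1, 0, 1, 2, 3, Inf))), 3)  # N(0,1) interval probabilities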

Data generation

Now we generate the observed \(X\) according to a mixture of two normal distributions \[X \sim \frac{1}{2}\times N(2,1) + \frac{1}{2} \times N(-2,1).\]

# Data Generation
set.seed(123)
X <- c(rnorm(25,2,1), rnorm(25,-2,1))
length(X)
## [1] 50
hist(X, breaks = 25)

mean(X[1:25])
## [1] 1.96667
sd(X[1:25])
## [1] 0.9467324
mean(X[26:50])
## [1] -1.897863
sd(X[26:50])
## [1] 0.9188734

Decomposition of the problem

Our goal is to sample from \(F|{\boldsymbol{X}}\). To get this, note that \[\begin{align*} p(F|\boldsymbol{X}) &= \int p(F|\boldsymbol{X}, \boldsymbol{\theta})\, p(\boldsymbol{\theta}|\boldsymbol{X})\,d\boldsymbol{\theta}\\ &=\int p(F|\boldsymbol{\theta})\, p(\boldsymbol{\theta}|\boldsymbol{X})\,d\boldsymbol{\theta}, \end{align*}\] where the second equality holds because \(F\) is conditionally independent of \(\boldsymbol{X}\) given \(\boldsymbol{\theta}\).

Since \(F \sim \text{DP}(M = 1, G_0 = N(0,1))\) and \(\theta_1,\theta_2, \dots, \theta_n|F \overset{\text{iid}}{\sim} F\), by conjugacy the posterior is \[F|{\boldsymbol{\theta}}\sim \text{DP}\left(M+n,\ \frac{MG_0 + \sum_{i=1}^n\delta_{\theta_i}}{M+n}\right),\] which is tractable using any of the sampling techniques for a DP (stick-breaking, etc.). Thus, the problem comes down to generating samples from \(p({\boldsymbol{\theta}}|{\boldsymbol{X}})\). Here, an MCMC-type approach (i.e., the Gibbs sampling scheme of Escobar and West (1995)) can be applied.
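
For the first piece, note that with \(M = 1\) and \(n = 50\), the posterior base measure is the mixture \[\frac{1}{51}G_0 + \frac{1}{51}\sum_{i=1}^{50}\delta_{\theta_i},\] so a draw from it comes from \(G_0\) with probability \(1/51\) and otherwise equals one of the current \(\theta_i\)'s, each with probability \(1/51\); this is what G0.posterior.generator implements later on.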

Set-ups for Gibbs sampling algorithm

To use the Gibbs sampler, we need to derive the full conditionals \(\theta_i|{\boldsymbol{\theta}}_{-i},{\boldsymbol{X}}\), where \({\boldsymbol{\theta}}_{-i} = (\theta_1, \dots, \theta_{i-1}, \theta_{i+1},\dots, \theta_n)\). Also let \(\{{\boldsymbol{\theta}}_{-i}\} = \{\theta_1, \dots, \theta_{i-1}, \theta_{i+1},\dots, \theta_n\}\) denote the set of all the \(\theta\)'s except \(\theta_i\). Due to the discreteness of the DP, sampling each full conditional can be decomposed into a two-step procedure.

Step 1. Generate a Bernoulli random variable indicating whether \(\theta_i \in \{{\boldsymbol{\theta}}_{-i}\}\) given \({\boldsymbol{X}}, {\boldsymbol{\theta}}_{-i}\), with probability \[\Pr(\theta_i \in \{{\boldsymbol{\theta}}_{-i}\}|{\boldsymbol{X}}, {\boldsymbol{\theta}}_{-i}) = \frac{\sum_{j\neq i}f_{\theta_j}(x_i)}{\sum_{j\neq i}f_{\theta_j}(x_i) + M\int f_{\theta}(x_i)\,dG_0(\theta)}.\] Note that for \(G_0 = N(0,1)\), \(\theta_i \sim G_0\), \(X_i|\theta_i \sim N(\theta_i,1)\), the joint distribution of \((X_i, \theta_i)\) is \[\begin{eqnarray} \left(\begin{array}{c} X_i\\\theta_i\end{array}\right) \sim N_2 \left(\left(\begin{array}{c}0\\0\end{array}\right), \left(\begin{array}{cc} 2 & 1\\ 1 & 1\end{array}\right)\right). \end{eqnarray}\]

Thus, \(M\int f_{\theta}(x_i)\,dG_0(\theta) = M \cdot f_{N(0,2)}(x_i)\), since the marginal distribution of \(X_i\) is \(N(0,2)\).
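
This identity is easy to verify numerically (a small check, with x.test an arbitrary test point):

## Numerical check: the integral of f_theta(x) dG0(theta) equals the N(0,2) density at x
x.test <- 1.5
integrate(function(t) dnorm(x.test, mean = t, sd = 1) * dnorm(t), -Inf, Inf)$value
dnorm(x.test, mean = 0, sd = sqrt(2))  # should match the integral above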

Step 2. According to the outcome of Step 1: if \(\theta_i \in \{{\boldsymbol{\theta}}_{-i}\}\), then generate \(\theta_i|\theta_i \in \{{\boldsymbol{\theta}}_{-i}\},{\boldsymbol{X}}, {\boldsymbol{\theta}}_{-i}\) based on \[\Pr(\theta_i=\theta_j|\theta_i \in \{{\boldsymbol{\theta}}_{-i}\},{\boldsymbol{X}}, {\boldsymbol{\theta}}_{-i}) \propto f_{\theta_j}(x_i),\] which can be drawn from the Categorical distribution (i.e., a Multinomial distribution with \(n = 1\)) with probabilities normalized over all \(n-1\) of the other \(\theta_j\)'s.

If \(\theta_i \notin \{{\boldsymbol{\theta}}_{-i}\}\), then generate \(\theta_i|\theta_i \notin \{{\boldsymbol{\theta}}_{-i}\},{\boldsymbol{X}}, {\boldsymbol{\theta}}_{-i}\) based on \[p(\theta_i|\theta_i \notin \{{\boldsymbol{\theta}}_{-i}\},{\boldsymbol{X}}, {\boldsymbol{\theta}}_{-i}) \propto f_{G_0}(\theta_i) \cdot f_{\theta_i}(x_i).\] By the conditional density of the bivariate normal above, we have \(\theta_i|X_i \sim N(\frac{X_i}{2}, \frac{1}{2})\). Thus, we can generate the new \(\theta_i\) from \(N(x_i/2, 1/2)\).
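
Similarly, we can confirm numerically (another small check) that \(f_{G_0}(\theta) \cdot f_{\theta}(x)\) is proportional to the \(N(x/2, 1/2)\) density, i.e. their ratio is constant in \(\theta\):

## Proportionality check: the ratio below is constant in theta
x.test <- 1.5
theta.grid <- seq(-3, 3, by = 1)
dnorm(theta.grid) * dnorm(x.test, mean = theta.grid, sd = 1) /
  dnorm(theta.grid, mean = x.test/2, sd = sqrt(1/2))  # constant = dnorm(x.test, 0, sqrt(2))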

Combining Steps 1 and 2, we have derived the full conditional distribution of each \(\theta_i\). Using the Gibbs sampling algorithm, given \({\boldsymbol{\theta}}^{(r)} = (\theta_1^{(r)}, \dots, \theta_n^{(r)})\), we first update \(\theta_1|{\boldsymbol{\theta}}_{-1}^{(r)}\) to get \(\theta_1^{(r+1)}\), then update \(\theta_2|\theta_1^{(r+1)}, \theta_3^{(r)}, \dots, \theta_n^{(r)}\) to get \(\theta_2^{(r+1)}\), and keep updating until we update \(\theta_{n}|{\boldsymbol{\theta}}_{-n}^{(r+1)}\) to get \(\theta_n^{(r+1)}\).

M <- 1
# Initialize the theta vector with draws from U(-3,3)
theta.init <- runif(50,-3,3)

normalized.vector <- function(vector){
  return(vector/sum(vector))  # normalize to sum to 1
}

## One full-conditional update of theta_i given X and the remaining thetas
## (M is read from the enclosing environment)
theta.update.individual <- function(X, index, theta.vector){
  # Step 1: probability that theta_i ties with one of the existing thetas
  prob.in <- sum(dnorm(X[index], mean = theta.vector, sd = 1)) /
    (sum(dnorm(X[index], mean = theta.vector, sd = 1)) + M*dnorm(X[index], 0, sd = sqrt(2)))
  indicator <- rbernoulli(n = 1, prob.in)
  if (indicator == TRUE){
    # Step 2a: pick an existing theta_j with probability proportional to f_{theta_j}(x_i)
    category <- which(rmultinom(n = 1, size = 1,
                                prob = normalized.vector(dnorm(X[index], mean = theta.vector, sd = 1))) == 1)
    theta.out <- theta.vector[category]
  }else{
    # Step 2b: draw a fresh value from theta_i | x_i ~ N(x_i/2, 1/2)
    theta.out <- rnorm(n = 1, mean = X[index]/2, sd = sqrt(1/2))
  }
  return(theta.out)
}
## Example: one update of theta_1 given theta_2, ..., theta_50
theta.update.individual(X, 1, theta.init[2:50])
## [1] 2.878932
## Gibbs sampler for theta: one full sweep over i = 1, ..., N
theta.update <- function(X, theta.previous){
  assert_that(length(X) == length(theta.previous))
  N <- length(X)
  theta.temp <- c(theta.previous, rep(NA, N))  # length-2N buffer: old values first, new values appended
  for (i in 1:N){
    # positions (i+1):(i+N-1) hold theta_{-i}: old theta_{i+1}, ..., theta_N
    # followed by the already-updated theta_1, ..., theta_{i-1}
    theta.temp[N+i] <- theta.update.individual(X, i, theta.temp[(i+1):(i+N-1)])
  }
  return(theta.temp[(N+1):(N*2)])
}

## One Gibbs sweep starting from the initial values
theta.update(X, theta.init)
##  [1]  2.726842965  1.140042609  2.726842965  1.140042609  2.611798820  2.724547431  2.724547431  0.715538900
##  [9]  2.348364703  0.003884288  2.348364703  1.392811233  1.392811233  2.724547431  0.715538900  2.056375913
## [17]  2.724547431  0.126814355  1.140042609  1.717689310  1.140042609  0.715538900  2.726842965  1.717689310
## [25]  2.348364703 -2.453736001 -2.148558553 -0.543150284 -1.525657932 -0.543150284 -2.453736001 -2.148558553
## [33] -1.129786788 -2.453736001 -2.453736001 -2.539853008 -1.525657932 -2.148558553 -2.539853008 -2.539853008
## [41] -2.148558553 -1.613029308 -2.539853008  1.140042609  0.126814355 -2.148558553 -1.613029308 -2.453736001
## [49] -0.801479428 -2.539853008
## Simulation: run the Gibbs sampler for 10^5 + 10^4 sweeps
theta.matrix <- matrix(NA, nrow = length(X), ncol = 10^5 + 10^4)
theta.matrix[,1] <- theta.init
for (i in 2:ncol(theta.matrix)){
    theta.matrix[,i] <- theta.update(X, theta.matrix[,i-1])
}
# Remove burn-in (discard the first 10^4 sweeps)
theta.final <- theta.matrix[,10001:(10^5+10^4)]
# Thinning: keep every 10th sweep, leaving 10^4 retained draws
theta.final <- theta.final[,c(1:10000)*10]
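
As a simple diagnostic (an addition, not part of the original analysis), we can track the number of distinct values in each retained draw of \({\boldsymbol{\theta}}\), i.e. the number of occupied clusters; with \(M = 1\) and \(n = 50\) this should typically be small.

## Diagnostic: number of occupied clusters (distinct theta values) per retained draw
n.clusters <- apply(theta.final, 2, function(theta) length(unique(theta)))
table(n.clusters)
plot(n.clusters, type = "l", xlab = "Retained draw", ylab = "Number of clusters")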

Generate posterior samples of \(F|{\boldsymbol{X}}\)

Once we have obtained posterior samples of \({\boldsymbol{\theta}}|{\boldsymbol{X}}\) as above, we can use the stick-breaking approach to generate \(F|{\boldsymbol{\theta}}\) from \[F|{\boldsymbol{\theta}}\sim \text{DP}\left(M+n,\ \frac{MG_0 + \sum_{i=1}^n\delta_{\theta_i}}{M+n}\right).\]

# Get samples of F based on the posterior theta:
# generator for the posterior base measure (M*G0 + sum_i delta_theta_i)/(M + N);
# the equal weights 1/(N+1) below hard-code M = 1
G0.posterior.generator <- function(n, G0=rnorm, theta.vector){
  N <- length(theta.vector)
  prob <- rep(1/(N+1), (N+1))  # weight 1/(N+1) on G0 and on each theta_i
  out <- rep(NA,n)
  for (i in 1:n){
    category <- which(rmultinom(n=1,size=1,prob=prob)==1)-1  # 0 codes "draw from G0"
    if (category == 0){
      out[i] <- G0(n=1)                 # fresh draw from the base distribution
    }else{
      out[i] <- theta.vector[category]  # reuse an existing posterior theta
    }
  }
  return(out)
}
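
Note that G0.posterior.generator hard-codes \(M = 1\) through the equal weights \(1/(N+1)\). A vectorized sketch that handles a general \(M\) (a hypothetical alternative, not part of the original code) could use sample() directly:

## Hypothetical generalization: weight M on G0 and weight 1 on each theta_i
## (sample() normalizes the prob vector internally)
G0.posterior.generator2 <- function(n, M = 1, G0 = rnorm, theta.vector){
  N <- length(theta.vector)
  category <- sample(0:N, size = n, replace = TRUE, prob = c(M, rep(1, N)))
  out <- numeric(n)
  fresh <- category == 0            # 0 codes "fresh draw from G0"
  out[fresh] <- G0(n = sum(fresh))
  out[!fresh] <- theta.vector[category[!fresh]]
  return(out)
}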

## Sample 1 vector (note that the posterior DP has concentration M + n,
## so the stick-breaking weights use Beta(1, M + n))
M <- 1
n.obs <- length(X)
Posterior1 <- StickBreakingSampler(n = 1, M = M + n.obs, cutoff = c(-3,-2,-1,0,1,2,3), G0.generator=function(n){G0.posterior.generator(n,theta.vector=theta.final[,1])})
Posterior1
##      [,1] [,2]      [,3]         [,4]        [,5]          [,6]        [,7] [,8]
## [1,]    0    0 0.9923196 6.321893e-97 0.005840333 1.844588e-223 0.001840045    0
sum(Posterior1)
## [1] 1
## Sample many
Posterior_n <- matrix(NA, nrow = 10^4, ncol = dim(Posterior1)[2])
for (i in 1:10^4){
  Posterior_n[i,] <- StickBreakingSampler(n = 1, M = M + n.obs, cutoff = c(-3,-2,-1,0,1,2,3), G0.generator=function(n){G0.posterior.generator(n,theta.vector=theta.final[,i])})
}
## Construct df for visualization
df_post <- as.data.frame(Posterior_n)
colnames(df_post) <- paste0("G_A", seq_len(ncol(df_post)))
df_post$iter <- seq_len(nrow(df_post))
df_post_long <- gather(data = df_post, key = key, value = value, -iter)
## Marginal distribution of G_Ai
ggplot(data = df_post_long, mapping = aes(x = key, y = value)) +
  geom_boxplot() +
  theme_bw() + theme(legend.key = element_blank())

## Observe several realized vectors (line connects values from a single vector)
ggplot(data = subset(df_post_long, iter <= 5),
       mapping = aes(x = key, y = value, group = iter, color = factor(iter))) +
  geom_line() +
  geom_point() +
  theme_bw() + theme(legend.key = element_blank())

From the posterior samples of \(F|{\boldsymbol{X}}\), we can see that the posterior has been updated to capture the bimodal data-generating mechanism of \(X\): probability mass shifts toward the intervals containing \(\pm 2\).
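
To quantify this (a small follow-up check), we can compare the posterior mean probability of each interval with its prior counterpart; the mass should move from the center toward the intervals containing \(\pm 2\):

## Posterior vs. prior mean probability per interval
round(colMeans(Posterior_n), 3)
round(colMeans(Prior_n), 3)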

Conclusion

In this example, we implemented a Dirichlet Process Mixture model to learn the data-generating mechanism. The prior on the unknown distribution \(F\) is a Dirichlet Process, and the posterior is updated accordingly via Gibbs sampling. To conclude, Bayesian nonparametrics provides a very flexible way to model data at a reasonable computational cost.