Hierarchical Mixture Models & Likelihoods

We write the mixture model as \[f(x) = \sum_{k=1}^{K}w_kg_k(x)\] We introduce an indicator \(C\), a discrete random variable taking values in \(\{1, 2, \ldots, K\}\).

Then \(X \mid C \sim g_C(x)\) and \(Pr(C=k) = w_k\), so marginalizing over \(C\) recovers the mixture density:

\[f(x)=\sum_{k=1}^{K}f(x|C=k)Pr(C=k)=\sum_{k=1}^{K}w_kg_k(x)\]

Setting up the hierarchical problem:

\(X \mid C \sim g_C(x)\) and \(Pr(C=k) = w_k\)

For each observation:

Randomly sample \(C_i\) from \(\{1,\ldots,K\}\) with probabilities \(w_1,\ldots,w_K\).
Given the sampled value of \(C_i\), draw \(x_i \sim g_{C_i}\); that is, \(x_i\) is sampled from component \(C_i\).

# Generate n observations from a mixture of two Gaussian 
# distributions
n     = 50           # Size of the sample to be generated
w     = c(0.6, 0.4)  # Weights
mu    = c(0, 5)      # Means
sigma = c(1, 2)      # Standard deviations
cc    = sample(1:2, n, replace=TRUE, prob=w)  # Component indicators C_i
x     = rnorm(n, mu[cc], sigma[cc])           # x_i drawn from component C_i
    
# Plot f(x) along with the observations 
# just sampled
xx = seq(-5, 12, length=200)
yy = w[1]*dnorm(xx, mu[1], sigma[1]) + 
     w[2]*dnorm(xx, mu[2], sigma[2])
par(mar=c(4,4,1,1)+0.1)
plot(xx, yy, type="l", ylab="Density", xlab="x", las=1, lwd=2)
points(x, y=rep(0,n), pch=1, col=cc)

Observed data likelihood

\(x_1,x_2,...,x_n\) are observations that have been collected. We assume that the \(x_i\)s are independent and identically distributed and \(x_i \sim f\), where \(f(x) = \sum_{k=1}^{K}w_kg_k(x|\theta_k)\).

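The likelihood function is \[L(w_1,\ldots,w_K,\theta_1,\ldots,\theta_K)=\prod_{i=1}^{n}\sum_{k=1}^{K}w_kg_k(x_i|\theta_k)\] This expression is difficult to work with directly: expanding the product of sums yields \(K^n\) terms, and its logarithm is a sum of logarithms of sums, which does not separate across the parameters.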
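Although the observed-data likelihood is awkward to manipulate analytically, it is straightforward to evaluate numerically. The following is a minimal sketch, assuming Gaussian components (so that \(\theta_k = (\mu_k, \sigma_k)\)); the helper name `obs_loglik` and the sample values are illustrative and not part of the original notes.

```r
# Observed-data log-likelihood for a K-component Gaussian mixture.
# obs_loglik is a hypothetical helper name; Gaussian components are assumed.
obs_loglik = function(x, w, mu, sigma) {
  # n x K matrix whose (i, k) entry is w_k * g_k(x_i | mu_k, sigma_k)
  dens = sapply(seq_along(w), function(k) w[k] * dnorm(x, mu[k], sigma[k]))
  dens = matrix(dens, nrow = length(x))   # keep the matrix shape when n = 1
  # Sum over components inside the log, then sum the logs over observations
  sum(log(rowSums(dens)))
}

# Illustrative data, using the same parameter values as the simulation above
x_demo = c(-0.3, 0.8, 4.9, 6.2)
ll = obs_loglik(x_demo, w = c(0.6, 0.4), mu = c(0, 5), sigma = c(1, 2))
```

Working on the log scale avoids the numerical underflow that the raw product over \(n\) observations would cause for large samples.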

Complete data likelihood

\[x_i|C_i \sim g_{C_i}(x_i), \quad Pr(C_i=k) = w_k\] where \(C_1,C_2,\ldots,C_n\) are i.i.d.

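We write the likelihood as \[L(w_1,\ldots,w_K,\theta_1,\ldots,\theta_K,C_1,\ldots,C_n)=\prod_{i=1}^{n}\prod_{k=1}^{K}\left[w_kg_k(x_i|\theta_k)\right]^{\mathbb{I}(C_i=k)}\]

where \(\mathbb{I}(C_i=k) = 1\) if \(C_i=k\), and \(\mathbb{I}(C_i=k) = 0\) otherwise. Only the factor for the component that generated \(x_i\) survives, so the likelihood splits into a part that depends on the \(\theta_k\) and a part that depends on the weights:

\[L(w_1,\ldots,w_K,\theta_1,\ldots,\theta_K,C_1,\ldots,C_n)=\left[\prod_{i=1}^{n}\prod_{k=1}^{K}\left[g_k(x_i|\theta_k)\right]^{\mathbb{I}(C_i=k)}\right]\left[\prod_{k=1}^{K}\prod_{i=1}^{n}w_k^{\mathbb{I}(C_i=k)}\right]\]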
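Because the indicator exponents select exactly one component per observation, the logarithm of the complete-data likelihood is a plain sum with no logarithm of a sum. The sketch below illustrates this under the assumption of Gaussian components; `comp_loglik` and the sample values are hypothetical names and numbers chosen for illustration.

```r
# Complete-data log-likelihood for a Gaussian mixture, given the labels cc.
# comp_loglik is a hypothetical helper name; Gaussian components are assumed.
comp_loglik = function(x, cc, w, mu, sigma) {
  # Each observation contributes only the term of its own component:
  # log w_{C_i} + log g_{C_i}(x_i)
  sum(log(w[cc])) + sum(dnorm(x, mu[cc], sigma[cc], log = TRUE))
}

# Illustrative data and labels
x_demo  = c(-0.3, 0.8, 4.9, 6.2)
cc_demo = c(1, 1, 2, 2)
cl = comp_loglik(x_demo, cc_demo, w = c(0.6, 0.4), mu = c(0, 5), sigma = c(1, 2))
```

The two sums in the function body correspond to the two factors of the likelihood: one involving only the weights and one involving only the component parameters.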

Identifiability in Mixture Models

Label Switching Example - mixture of two Gaussians

\[f_1(x)=(0.7)\frac{1}{\sqrt{2\pi}(1)}\exp\left[-\frac{1}{2}\left (\frac{x-(0)}{(1)}\right)^2\right]+(0.3)\frac{1}{\sqrt{2\pi}(2)}\exp\left[-\frac{1}{2}\left(\frac{x-(1)}{(2)}\right)^2\right]\] where we have \(w_1=0.7,\space w_2=0.3,\space \mu_1=0,\space \mu_2=1,\space \sigma_1=1,\space \sigma_2=2\)

\[f_2(x)=(0.3)\frac{1}{\sqrt{2\pi}(2)}\exp\left[-\frac{1}{2}\left (\frac{x-(1)}{(2)}\right)^2\right]+(0.7)\frac{1}{\sqrt{2\pi}(1)}\exp\left[-\frac{1}{2}\left(\frac{x-(0)}{(1)}\right)^2\right]\] where we have \(w_1=0.3,\space w_2=0.7,\space \mu_1=1,\space \mu_2=0,\space \sigma_1=2,\space \sigma_2=1\)
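The two parameterizations differ only in which component is labeled 1 and which is labeled 2, so they define the same density. A quick numerical check, as a sketch (the helper `mix2` and the evaluation grid are illustrative choices):

```r
# Density of a two-component Gaussian mixture; mix2 is a hypothetical helper.
mix2 = function(x, w, mu, sigma) {
  w[1] * dnorm(x, mu[1], sigma[1]) + w[2] * dnorm(x, mu[2], sigma[2])
}

xx_id = seq(-8, 10, length = 200)
f1 = mix2(xx_id, w = c(0.7, 0.3), mu = c(0, 1), sigma = c(1, 2))
f2 = mix2(xx_id, w = c(0.3, 0.7), mu = c(1, 0), sigma = c(2, 1))

# Largest pointwise difference between the two parameterizations
max_diff = max(abs(f1 - f2))
```

Since `max_diff` is zero up to floating-point rounding, no amount of data can distinguish the two labelings.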

Another example

\[f_1(x)=(0.7)\frac{1}{\sqrt{2\pi}(1)}\exp\left[-\frac{1}{2}\left (\frac{x-(0)}{(1)}\right)^2\right]+(0.3)\frac{1}{\sqrt{2\pi}(2)}\exp\left[-\frac{1}{2}\left(\frac{x-(1)}{(2)}\right)^2\right]\]

\[f_2(x)=(0.7)\frac{1}{\sqrt{2\pi}(1)}\exp\left[-\frac{1}{2}\left (\frac{x-(0)}{(1)}\right)^2\right]+(0.2)\frac{1}{\sqrt{2\pi}(2)}\exp\left[-\frac{1}{2}\left(\frac{x-(1)}{(2)}\right)^2\right]+(0.1)\frac{1}{\sqrt{2\pi}(2)}\exp\left[-\frac{1}{2}\left(\frac{x-(1)}{(2)}\right)^2\right]\]

\[f_3(x)=(0.7)\frac{1}{\sqrt{2\pi}(1)}\exp\left[-\frac{1}{2}\left (\frac{x-(0)}{(1)}\right)^2\right]+(0.3)\frac{1}{\sqrt{2\pi}(2)}\exp\left[-\frac{1}{2}\left(\frac{x-(1)}{(2)}\right)^2\right]+(0)\frac{1}{\sqrt{2\pi}(3)}\exp\left[-\frac{1}{2}\left(\frac{x-(100)}{(3)}\right)^2\right]\]
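Here \(f_2\) splits the second component of \(f_1\) into two identical copies with weights 0.2 and 0.1, and \(f_3\) appends a third component with weight zero, so all three expressions describe the same density even though they have different numbers of components. The sketch below checks this numerically; the grid and variable names are illustrative.

```r
xx_id = seq(-8, 10, length = 200)

# Two components
f1 = 0.7 * dnorm(xx_id, 0, 1) + 0.3 * dnorm(xx_id, 1, 2)
# Second component split into two identical pieces (0.2 + 0.1 = 0.3)
f2 = 0.7 * dnorm(xx_id, 0, 1) + 0.2 * dnorm(xx_id, 1, 2) + 0.1 * dnorm(xx_id, 1, 2)
# Extra component with zero weight
f3 = 0.7 * dnorm(xx_id, 0, 1) + 0.3 * dnorm(xx_id, 1, 2) + 0 * dnorm(xx_id, 100, 3)

# Largest pointwise differences relative to f1
d12 = max(abs(f1 - f2))
d13 = max(abs(f1 - f3))
```

Both differences are zero up to floating-point rounding.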

Quiz Questions

Consider a random sample \((3.5, 9.7, 8.2, 6.4, 7.1)\) composed of \(n = 5\) observations from the mixture with density:

\[f(x)=w\lambda_1e^{-\lambda_1x}+(1-w)\lambda_2e^{-\lambda_2x}\]

What is the complete-data likelihood associated with the indicator vector \((1, 1, 2, 1, 2)\)?

Recall the definition of the complete-data likelihood: it is (proportional to) the joint distribution of the indicator variables and the observations. Since the indicator vector is \((1, 1, 2, 1, 2)\), there are 3 observations in group 1 and 2 observations in group 2. So, the first term must be \(w^3(1 - w)^2\).

Now, the distribution of the data given the indicators is a product over the two components. For the first component, we have 3 exponential densities (associated with the first, second and fourth observations, which add up to 19.6), and for the second component we have 2 exponential densities (associated with the third and fifth observations, which add up to 15.3).

Therefore, the complete-data likelihood is: \[w^3(1-w)^2\lambda_1^3e^{-19.6\lambda_1}\lambda_2^2e^{-15.3\lambda_2}\]
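The arithmetic can be checked in R. This is a sketch only: the parameter values passed to the functions are arbitrary test points, and `comp_lik_exp` and `closed_form` are hypothetical helper names. The closed form uses the weight term \(w^3(1-w)^2\) derived above.

```r
x_quiz  = c(3.5, 9.7, 8.2, 6.4, 7.1)
cc_quiz = c(1, 1, 2, 1, 2)

# Number of observations and sum of observations in each component
n_k = tabulate(cc_quiz, nbins = 2)
s_k = c(sum(x_quiz[cc_quiz == 1]), sum(x_quiz[cc_quiz == 2]))

# Complete-data likelihood computed term by term from the definition
comp_lik_exp = function(w, lambda) {
  ww = c(w, 1 - w)
  prod(ww[cc_quiz] * dexp(x_quiz, rate = lambda[cc_quiz]))
}

# Closed form written in terms of the counts (3, 2) and sums (19.6, 15.3)
closed_form = function(w, lambda) {
  w^3 * (1 - w)^2 *
    lambda[1]^3 * exp(-19.6 * lambda[1]) *
    lambda[2]^2 * exp(-15.3 * lambda[2])
}

# Compare at an arbitrary parameter value
same = isTRUE(all.equal(comp_lik_exp(0.4, c(0.5, 0.2)), closed_form(0.4, c(0.5, 0.2))))
```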

Consider a random sample \((3.5, 9.7, 7.1)\) composed of \(n = 3\) observations from the mixture with density: \[f(x)=w\lambda_1e^{-\lambda_1x}+(1-w)\lambda_2e^{-\lambda_2x}\]

The observed data likelihood \(L_O(w,\lambda_1,\lambda_2;x)\) for this sample is given by the product

\[\left\{w\lambda_1e^{-3.5\lambda_1}+(1-w)\lambda_2e^{-3.5\lambda_2}\right\}\times\] \[\left\{w\lambda_1e^{-9.7\lambda_1}+(1-w)\lambda_2e^{-9.7\lambda_2}\right\}\times\] \[\left\{w\lambda_1e^{-7.1\lambda_1}+(1-w)\lambda_2e^{-7.1\lambda_2}\right\}\]

The observed-data likelihood is proportional to the joint distribution of the data. Because the observations are independent (this is a random sample), it equals \(f(3.5) \times f(9.7) \times f(7.1)\).
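As a sketch, the same product can be evaluated in R for any choice of parameters. The names `f_mix` and `obs_lik`, and the parameter values used in the comparison, are illustrative assumptions.

```r
x_obs = c(3.5, 9.7, 7.1)

# Mixture density of two exponentials with rates lambda1 and lambda2
f_mix = function(x, w, lambda1, lambda2) {
  w * dexp(x, rate = lambda1) + (1 - w) * dexp(x, rate = lambda2)
}

# Observed-data likelihood: product of the mixture density over the sample
obs_lik = function(w, lambda1, lambda2) {
  prod(f_mix(x_obs, w, lambda1, lambda2))
}

# Evaluate at an arbitrary parameter value
lo_val = obs_lik(0.4, 0.5, 0.2)
```

Unlike the complete-data likelihood, each factor here is a sum over both components, because the indicators are not observed.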