We write the mixture model as \[f(x) = \sum_{k=1}^{K}w_kg_k(x)\] We introduce an indicator \(C\), a discrete random variable taking values in \(\{1, 2, \ldots, K\}\).

Then \(X \mid C = k \sim g_k(x)\) and \(Pr(C=k) = w_k\), so marginalizing over \(C\) recovers the mixture density:

\[f(x)=\sum_{k=1}^{K}f(x \mid C=k)Pr(C=k)=\sum_{k=1}^{K}w_kg_k(x)\]

Setting up the hierarchical problem:

\(X \mid C \sim g_C(x)\) and \(Pr(C=k) = w_k\)

For each observation:

  1. Randomly sample \(C_i\) from \(\{1,\ldots,K\}\) with probabilities \(w_1,\ldots,w_K\).
  2. Given the sampled value of \(C_i\), draw \(x_i \sim g_{C_i}\). That is, \(x_i\) is sampled from component \(C_i\).
# Generate n observations from a mixture of two Gaussian 
# distributions
n     = 50           # Size of the sample to be generated
w     = c(0.6, 0.4)  # Weights
mu    = c(0, 5)      # Means
sigma = c(1, 2)      # Standard deviations
cc    = sample(1:2, n, replace=TRUE, prob=w)  # Component indicators C_i
x     = rnorm(n, mu[cc], sigma[cc])           # x_i drawn from component C_i

# Plot f(x) along with the observations 
# just sampled
xx = seq(-5, 12, length.out=200)
yy = w[1]*dnorm(xx, mu[1], sigma[1]) + 
     w[2]*dnorm(xx, mu[2], sigma[2])
par(mar=c(4,4,1,1)+0.1)
plot(xx, yy, type="l", ylab="Density", xlab="x", las=1, lwd=2)
points(x, y=rep(0,n), pch=1, col=cc)

Observed data likelihood

\(x_1,x_2,...,x_n\) are observations that have been collected. We assume that the \(x_i\)s are independent and identically distributed and \(x_i \sim f\), where \(f(x) = \sum_{k=1}^{K}w_kg_k(x|\theta_k)\).

The likelihood function is \[L(w_1,\ldots,w_K,\theta_1,\ldots,\theta_K)=\prod_{i=1}^{n}\sum_{k=1}^{K}w_kg_k(x_i|\theta_k)\] This expression is difficult to work with directly: the sum sits inside the product, so the log-likelihood does not separate into per-component terms.
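Although the observed-data likelihood is awkward analytically, it is easy to evaluate numerically. The following is a minimal sketch for Gaussian components; the helper name `obs_loglik` and the values in `x_demo` are made up for illustration and are not part of the notes.

```r
# Observed-data log-likelihood of a Gaussian mixture.
# `obs_loglik` is an illustrative helper name, not from the original notes.
obs_loglik <- function(x, w, mu, sigma) {
  # Column k holds w_k * g_k(x_i | mu_k, sigma_k) for every observation i
  dens <- sapply(seq_along(w), function(k) w[k] * dnorm(x, mu[k], sigma[k]))
  dens <- matrix(dens, nrow = length(x))  # keep an n x K matrix even when n = 1
  # Sum over components inside the log, then sum the logs over observations
  sum(log(rowSums(dens)))
}

x_demo <- c(-0.5, 0.3, 4.8, 6.1)  # made-up observations for illustration
obs_loglik(x_demo, w = c(0.6, 0.4), mu = c(0, 5), sigma = c(1, 2))
```

Working on the log scale avoids underflow from multiplying many small densities, but the sum over components still has to be taken before the logarithm.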

Complete data likelihood

\[x_i \mid C_i \sim g_{C_i}(x_i), \quad Pr(C_i=k) = w_k\] where \(C_1,C_2,\ldots,C_n\) are i.i.d.

We write the likelihood as \[L(w_1,\ldots,w_K,\theta_1,\ldots,\theta_K,C_1,\ldots,C_n)=\prod_{i=1}^{n}\prod_{k=1}^{K}\left[w_kg_k(x_i|\theta_k)\right]^{\mathbb{I}(C_i=k)}\]

where \(\mathbb{I}(C_i=k) = 1\) if \(C_i=k\), and \(\mathbb{I}(C_i=k) = 0\) otherwise. The product separates into one factor that involves only the component densities and one that involves only the weights:

\[L(w_1,\ldots,w_K,\theta_1,\ldots,\theta_K,C_1,\ldots,C_n)=\left[\prod_{i=1}^{n}\prod_{k=1}^{K}g_k(x_i|\theta_k)^{\mathbb{I}(C_i=k)}\right]\left[\prod_{k=1}^{K}\prod_{i=1}^{n}w_k^{\mathbb{I}(C_i=k)}\right]\]
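Because the indicator keeps, for each observation, only the term of its own component, the complete-data log-likelihood is a sum of two simple pieces. A minimal sketch for Gaussian components follows; the helper name `comp_loglik` and the demo values are assumptions made for illustration.

```r
# Complete-data log-likelihood of a Gaussian mixture.
# `comp_loglik` is an illustrative helper name, not from the original notes.
comp_loglik <- function(x, cc, w, mu, sigma) {
  weight_part  <- sum(log(w[cc]))                              # sum_i log w_{C_i}
  density_part <- sum(dnorm(x, mu[cc], sigma[cc], log = TRUE)) # sum_i log g_{C_i}(x_i)
  weight_part + density_part
}

x_demo  <- c(-0.5, 0.3, 4.8, 6.1)  # made-up observations
cc_demo <- c(1, 1, 2, 2)           # made-up component indicators
comp_loglik(x_demo, cc_demo, w = c(0.6, 0.4), mu = c(0, 5), sigma = c(1, 2))
```

Indexing the parameter vectors by `cc` plays the role of the indicator exponent: every other component's term is raised to the power zero and drops out.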

Identifiability in Mixture Models

Label Switching Example - mixture of two Gaussians

\[f_1(x)=(0.7)\frac{1}{\sqrt{2\pi}(1)}\exp\left[-\frac{1}{2}\left (\frac{x-(0)}{(1)}\right)^2\right]+(0.3)\frac{1}{\sqrt{2\pi}(2)}\exp\left[-\frac{1}{2}\left(\frac{x-(1)}{(2)}\right)^2\right]\] where we have \(w_1=0.7,\space w_2=0.3,\space \mu_1=0,\space \mu_2=1,\space \sigma_1=1,\space \sigma_2=2\)

\[f_2(x)=(0.3)\frac{1}{\sqrt{2\pi}(2)}\exp\left[-\frac{1}{2}\left (\frac{x-(1)}{(2)}\right)^2\right]+(0.7)\frac{1}{\sqrt{2\pi}(1)}\exp\left[-\frac{1}{2}\left(\frac{x-(0)}{(1)}\right)^2\right]\] where we have \(w_1=0.3,\space w_2=0.7,\space \mu_1=1,\space \mu_2=0,\space \sigma_1=2,\space \sigma_2=1\)
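The two parameterizations differ only in which component is called "1" and which is called "2", so they define the same density. A quick numerical check, using a hypothetical helper `mix_density` and an arbitrary evaluation grid:

```r
# Evaluate a Gaussian mixture density on a grid.
# `mix_density` is an illustrative helper name, not from the original notes.
mix_density <- function(x, w, mu, sigma) {
  dens <- numeric(length(x))
  for (k in seq_along(w)) dens <- dens + w[k] * dnorm(x, mu[k], sigma[k])
  dens
}

xx <- seq(-6, 8, length.out = 200)  # arbitrary grid
f1 <- mix_density(xx, w = c(0.7, 0.3), mu = c(0, 1), sigma = c(1, 2))
f2 <- mix_density(xx, w = c(0.3, 0.7), mu = c(1, 0), sigma = c(2, 1))

all.equal(f1, f2)  # TRUE: swapping the labels leaves the density unchanged
```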

Another example

\[f_1(x)=(0.7)\frac{1}{\sqrt{2\pi}(1)}\exp\left[-\frac{1}{2}\left (\frac{x-(0)}{(1)}\right)^2\right]+(0.3)\frac{1}{\sqrt{2\pi}(2)}\exp\left[-\frac{1}{2}\left(\frac{x-(1)}{(2)}\right)^2\right]\]

\[f_2(x)=(0.7)\frac{1}{\sqrt{2\pi}(1)}\exp\left[-\frac{1}{2}\left (\frac{x-(0)}{(1)}\right)^2\right]+(0.2)\frac{1}{\sqrt{2\pi}(2)}\exp\left[-\frac{1}{2}\left(\frac{x-(1)}{(2)}\right)^2\right]+(0.1)\frac{1}{\sqrt{2\pi}(2)}\exp\left[-\frac{1}{2}\left(\frac{x-(1)}{(2)}\right)^2\right]\]

\[f_3(x)=(0.7)\frac{1}{\sqrt{2\pi}(1)}\exp\left[-\frac{1}{2}\left (\frac{x-(0)}{(1)}\right)^2\right]+(0.3)\frac{1}{\sqrt{2\pi}(2)}\exp\left[-\frac{1}{2}\left(\frac{x-(1)}{(2)}\right)^2\right]+(0)\frac{1}{\sqrt{2\pi}(3)}\exp\left[-\frac{1}{2}\left(\frac{x-(100)}{(3)}\right)^2\right]\]
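Here \(f_2\) splits the second component into two identical copies and \(f_3\) adds a third component with zero weight, so all three expressions describe the same density even though they use different numbers of components. A short numerical check on an arbitrary grid:

```r
xx <- seq(-6, 8, length.out = 200)  # arbitrary evaluation grid

g1 <- dnorm(xx, 0, 1)    # component with mean 0,   sd 1
g2 <- dnorm(xx, 1, 2)    # component with mean 1,   sd 2
g3 <- dnorm(xx, 100, 3)  # component with mean 100, sd 3 (weight 0 in f3)

f1 <- 0.7 * g1 + 0.3 * g2
f2 <- 0.7 * g1 + 0.2 * g2 + 0.1 * g2  # duplicated component
f3 <- 0.7 * g1 + 0.3 * g2 + 0 * g3    # empty component

all.equal(f1, f2)  # TRUE
all.equal(f1, f3)  # TRUE
```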

Quiz Questions

  1. Consider a random sample \((3.5, 9.7, 8.2, 6.4, 7.1)\) composed of \(n = 5\) observations from the mixture with density:

\[f(x)=w\lambda_1e^{-\lambda_1x}+(1-w)\lambda_2e^{-\lambda_2x}\]

What is the complete-data likelihood associated with the indicator vector \((1, 1, 2, 1, 2)\)?

Recall the definition of the complete-data likelihood: it is (proportional to) the joint distribution of the indicator variables and the observation. Since the indicator vector is \((1, 1, 2, 1, 2)\), there are 3 observations in group 1 and 2 observations in group 2. So, the first term must be \(w^3(1 - w)^2\).

Now, the distribution of the data given the indicators is a product over the two components. For the first component, we have 3 exponential densities (associated with the first, second and fourth observations, which add up to 19.6), and for the second component we have 2 exponential densities (associated with the third and fifth observations, which add up to 15.3).

Therefore, the complete-data likelihood is:

\[w^3(1-w)^2\lambda_1^3e^{-(3.5+9.7+6.4)\lambda_1}\lambda_2^2e^{-(8.2+7.1)\lambda_2}\] \[=w^3(1-w)^2\lambda_1^3e^{-19.6\lambda_1}\lambda_2^2e^{-15.3\lambda_2}\]
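As a sanity check, the closed form with weight term \(w^3(1-w)^2\) can be compared against a direct product of weight times exponential density over the five observations. The function names and the parameter values used in the comparison are arbitrary choices for illustration.

```r
x  <- c(3.5, 9.7, 8.2, 6.4, 7.1)  # sample from the question
cc <- c(1, 1, 2, 1, 2)            # indicator vector from the question

# Direct evaluation: product over i of w_{C_i} * g_{C_i}(x_i)
complete_lik <- function(w, lambda1, lambda2) {
  ww  <- c(w, 1 - w)
  lam <- c(lambda1, lambda2)
  prod(ww[cc] * dexp(x, rate = lam[cc]))
}

# Closed form obtained by collecting terms
closed_form <- function(w, lambda1, lambda2) {
  w^3 * (1 - w)^2 * lambda1^3 * exp(-19.6 * lambda1) *
    lambda2^2 * exp(-15.3 * lambda2)
}

# Arbitrary parameter values, chosen only for the comparison
all.equal(complete_lik(0.4, 0.2, 0.1), closed_form(0.4, 0.2, 0.1))  # TRUE
```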

  2. Consider a random sample \((3.5, 9.7, 7.1)\) composed of \(n = 3\) observations from the mixture with density:\[f(x)=w\lambda_1e^{-\lambda_1x}+(1-w)\lambda_2e^{-\lambda_2x}\]

The observed data likelihood \(L_O(w,\lambda_1,\lambda_2;x)\) for this sample is given by the product

\[\left\{w\lambda_1e^{-3.5\lambda_1}+(1-w)\lambda_2e^{-3.5\lambda_2}\right\}\times\] \[\left\{w\lambda_1e^{-9.7\lambda_1}+(1-w)\lambda_2e^{-9.7\lambda_2}\right\}\times\] \[\left\{w\lambda_1e^{-7.1\lambda_1}+(1-w)\lambda_2e^{-7.1\lambda_2}\right\}\]

The observed-data likelihood is the joint density of the data alone, without the indicators. Because the observations are independent (this is a random sample), it equals \(f(3.5) \times f(9.7) \times f(7.1)\).
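The product above can be evaluated directly for any parameter values. This sketch reads both components as exponential densities, as in the displayed likelihood; the function names and the parameter values in the example call are arbitrary.

```r
x <- c(3.5, 9.7, 7.1)  # sample from the question

# Mixture density: w * Exp(lambda1) + (1 - w) * Exp(lambda2)
mix_dens <- function(x, w, lambda1, lambda2) {
  w * dexp(x, rate = lambda1) + (1 - w) * dexp(x, rate = lambda2)
}

# Observed-data likelihood: product of the mixture density over observations
obs_lik <- function(w, lambda1, lambda2) {
  prod(mix_dens(x, w, lambda1, lambda2))
}

obs_lik(w = 0.4, lambda1 = 0.2, lambda2 = 0.1)  # arbitrary parameter values
```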

  3. Consider a random sample \((-0.3, 4.1, 3.6, 7.5, 1.9, 2.7)\) composed of \(n = 6\) observations from the mixture with density:

\[f(x)=w_1\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{x^2}{2}\right]+w_2\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{(x-2)^2}{2}\right]+w_3\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{(x-4)^2}{2}\right]\]

\[f(x)=w_1\frac{1}{\sqrt{2\pi}(1)}\exp\left[-\frac{1}{2}\frac{(x-0)^2}{(1)}\right]+w_2\frac{1}{\sqrt{2\pi}(1)}\exp\left[-\frac{1}{2}\frac{(x-2)^2}{(1)}\right]+w_3\frac{1}{\sqrt{2\pi}(1)}\exp\left[-\frac{1}{2}\frac{(x-4)^2}{(1)}\right]\]

Write an expression that is proportional to the complete-data likelihood associated with the indicator vector \((1, 2, 2, 3, 1, 2)\).

Recall the definition of the complete-data likelihood: it is (proportional to) the joint distribution of the indicator variables and the observation. Since the indicator vector is \((1,2, 2, 3, 1, 2)\), there are 2 observations in group 1, 3 observations in group 2 and 1 observation in group 3. So, the first term must be \(w_1^2w_2^3w_3\).

The expression becomes \[w_1^2w_2^3w_3 \exp\left\{-\frac{(-0.3)^2+(4.1-2)^2+ (3.6-2)^2+(7.5-4)^2+(1.9)^2+(2.7-2)^2}{2}\right\}\]

\[=w_1^2w_2^3w_3 \exp\left\{-11.705\right\}\]
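The exponent can be checked by subtracting from each observation the mean of its assigned component. The normalizing factor \((2\pi)^{-6/2}\) does not depend on the parameters, which is why it is dropped from the proportional expression.

```r
x  <- c(-0.3, 4.1, 3.6, 7.5, 1.9, 2.7)  # sample from the question
cc <- c(1, 2, 2, 3, 1, 2)               # indicator vector from the question
mu <- c(0, 2, 4)                        # component means

table(cc)                 # 2, 3 and 1 observations in components 1, 2 and 3
sum((x - mu[cc])^2) / 2   # 11.705, the value in the exponent
```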

  4. Consider a location mixture of normals

\[f(x) = \sum_{k=1}^{K}w_k\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x-\mu_k)^2}{2\sigma^2}\right]\]

True or False: The following 3 constraints make all parameters fully identifiable:

  1. The means \(\mu_1,\ldots,\mu_K\) should all be different.
  2. No weight \(w_k\) is allowed to be zero.
  3. The components are ordered based on the values of their means, i.e., the component with the smallest \(\mu_k\) is labeled component 1, the one with the second smallest \(\mu_k\) is labeled component 2, etc.

This is TRUE. The three constraints are enough to ensure identifiability. The last one addresses label switching, while the first two address identifiability of the number of components in the mixture.
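The ordering constraint can be imposed after the fact by sorting the components by their means. The following is a minimal sketch with a hypothetical helper `relabel_by_mean`; it assumes the means are distinct, as the first constraint requires.

```r
# Relabel components so that the means are in increasing order.
# `relabel_by_mean` is an illustrative helper name, not from the original notes.
relabel_by_mean <- function(w, mu) {
  ord <- order(mu)  # permutation that sorts the means
  list(w = w[ord], mu = mu[ord])
}

# Two labelings of the same two-component location mixture
a <- relabel_by_mean(w = c(0.7, 0.3), mu = c(0, 1))
b <- relabel_by_mean(w = c(0.3, 0.7), mu = c(1, 0))

identical(a, b)  # TRUE: both labelings map to the same ordered representation
```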