Consider a Dirichlet Process (DP) with concentration parameter \(\alpha\) and baseline distribution \(G_0\).
Let \(z_1,z_2,\ldots\) and \(\vartheta_1,\vartheta_2.\ldots\) are independent sequences of independent and identically distributed (i.i.d.) random variables such that \[ z_r\sim\textsf{Beta}(1,\alpha)\,,\quad r=1,2,\ldots \qquad\text{and}\qquad \vartheta_\ell\sim G_0\,,\quad \ell=1,2,\ldots \]
According to the constructive definition of the DP, a realization \(G\) from \(\textsf{DP}(\alpha,G_0)\) is given by \[ G(\cdot) = \sum_{\ell=1}^\infty \omega_\ell\,\delta_{\vartheta_\ell}(\cdot) \] where:
It stick-breaking process provides a way to understand the infinite-dimensional nature of the DP, by generating a sequence of weights \(\omega_1,\omega_2,\ldots\) of a discrete probability distribution.
Start with a stick of length 1 representing the total probability to be distributed among the \(\vartheta_1,\vartheta_2,\ldots\)
\(z_1\) is the portion of the original stick assigned to \(\vartheta_1\), so that \(\omega_1 = z_1\).
The remaining part of the stick has length \(1-z_1\).
\(z_2\) is the portion of the remaining stick assigned to \(\vartheta_2\), so that \(\omega_2 = z_2(1-z_1)\).
The remaining part of the stick has length \((1-z_1)(1-z_2)\).
Continue breaking.
Under this setup, we say that \(\omega_1,\omega_2,\ldots\textsf{SB}(\alpha)\).
It can be shown that \(\sum_{\ell=1}^\infty \omega_\ell = 1\).
The DP generates distributions that can be represented as countable mixtures of point masses.
As a matter of fact, any distribution on \(\mathbb{R}^d\) can be approximated arbitrarily well using a countable mixture of point masses.
The constructive definition points out to the fact that the DP generates discrete distributions.
The DP constructive definition yields another method to simulate from the DP up to a truncation approximation: \[ G(\cdot) \approx \sum_{\ell=1}^J p_\ell\,\delta_{\vartheta_\ell}(\cdot)\,, \] where \(p_\ell = \omega_\ell\), for \(\ell=1,\ldots,J\), and \(p_J = 1 - \sum_{\ell=1}^{J-1} p_\ell\).
It can be shown that \[ \textsf{E}\left( \sum_{\ell=1}^{J} \omega_\ell\right) = 1 - \left( \frac{\alpha}{\alpha+1} \right)^J\,. \] Thus, \(J\) can be chosen such that \((\alpha/(\alpha+1))^J=\epsilon\), for small \(\epsilon\), i.e., \(J=\log(\epsilon)/\log(\alpha/(\alpha+1))\).
The following correspond to a realization from a \(\textsf{DP}(\alpha, G_0\)), for \(\alpha= 25\) and \(G_0 = \textsf{N}(0,1)\). The solid black line corresponds to \(G_0\), while the blue line represents the corresponding realization.
In the left panel, the spiked lines are located at 1000 draws from \(G_0\) with heights given by the (truncated) stick-breaking weights. These spikes are then summed up to generate a sample path.
# simulation of several G such that G ~ DP(alpha = 25, G_0 = N(0,1))
# using the constructive definition of the DP
alpha <- 25
eps <- 1e-6
J <- round(log(eps)/log(alpha/(alpha+1)))
G0 <- function(x) pnorm(x)
set.seed(1)
vartheta <- rnorm(n = J)
z <- rbeta(n = J, shape1 = 1, shape2 = alpha)
omega <- NULL
omega[1] <- z[1]
for (j in 2:(J-1))
omega[j] <- z[j]*prod(1 - z[1:(j-1)])
omega[J] <- 1 - sum(omega[1:(J-1)])
# plot
par(mfrow = c(1,2), mar = c(3,3,1.4,1.4), mgp = c(1.75,0.75,0))
# panel 1: atoms
plot(x = vartheta, y = omega, type = "h", xlim = c(-3,3), xlab = "x", ylab = expression(omega), col = 4)
# panel 2: G ~ DP(alpha = 25, G_0 = N(0,1))
x <- seq(from = -3, to = 3, len = 1000)
G <- NULL
for (j in 1:length(x))
G[j] <- sum(omega[vartheta <= x[j]])
plot(x = x, y = G, type = "l", xlim = c(-3,3), col = 4)
curve(expr = G0(x), from = -3, to = 3, n = 1000, lwd = 2, add = TRUE)