Model formulation

Consider the location-scale Normal **Dirichlet Process Mixture Model** (DPMM): \[ f(y \mid G) = \int \textsf{N}(y \mid \theta, \sigma^2) \, \textsf{d}G(\theta,\sigma^2), \] where the mixing distribution \(G\) follows a Dirichlet Process (DP) prior: \[ G \mid \alpha, \mu, \kappa, \beta \sim \textsf{DP}(\alpha, G_0 = \textsf{N}(\mu,\sigma^2/\kappa)\cdot\textsf{IG}(c,\beta)), \] with \(c\) fixed.

We assume the following prior distributions: \[ \alpha \sim \textsf{G}(a_{\alpha}, b_{\alpha}),\quad \mu \sim \textsf{N}(a_{\mu}, b_{\mu}),\quad \kappa \sim \textsf{G}(a_{\kappa}, b_{\kappa}),\quad \beta \sim \textsf{G}(a_{\beta}, b_{\beta}). \]

Thus, the hierarchical representation of this semiparametric DPMM is: \[ \begin{align*} y_i \mid \theta_i, \sigma^2_i &\overset{\text{ind}}{\sim} \textsf{N}(y_i \mid \theta_i, \sigma^2_i), \qquad i = 1, \dots, n, \\ (\theta_i,\sigma^2_i) \mid G &\overset{\text{i.i.d.}}{\sim} G, \qquad i = 1, \dots, n, \\ G \mid \alpha, \mu, \kappa,\beta &\sim \textsf{DP}(\alpha, G_0 = \textsf{N}(\mu,\sigma^2/\kappa)\cdot\textsf{IG}(c,\beta)), \\ \alpha, \mu, \kappa, \beta &\sim p(\alpha) \, p(\mu) \, p(\kappa) \, p(\beta), \end{align*} \] where the independent prior distributions \(p(\alpha)\), \(p(\mu)\), \(p(\kappa)\), and \(p(\beta)\) are specified as above.
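
To build intuition for this hierarchical model, the following sketch draws one synthetic dataset from the prior using the sequential Chinese Restaurant Process (CRP) representation introduced below; the hyperparameter values (`alpha`, `mu`, `kappa`, `cc`, `beta`) are illustrative only and are not the ones used for the analysis.

# A minimal sketch: one prior draw from the DPMM via its sequential (CRP) representation
# NOTE: hyperparameter values here are illustrative only
set.seed(1)
n <- 200; alpha <- 1; mu <- 20; kappa <- 0.5; cc <- 2; beta <- 4
xi     <- integer(n)
theta  <- numeric(0)
sigma2 <- numeric(0)
y_sim  <- numeric(n)
for (i in 1:n) {
  K <- length(theta)
  # weights: size of each existing cluster, or alpha for a new cluster
  w <- c(tabulate(xi[seq_len(i - 1)], nbins = K), alpha)
  k <- sample(K + 1, size = 1, prob = w)
  if (k == K + 1) {
    sigma2 <- c(sigma2, 1 / rgamma(1, shape = cc, rate = beta))               # varsigma^2 ~ IG(c, beta)
    theta  <- c(theta, rnorm(1, mean = mu, sd = sqrt(sigma2[k] / kappa)))     # vartheta ~ N(mu, varsigma^2/kappa)
  }
  xi[i]    <- k
  y_sim[i] <- rnorm(1, mean = theta[k], sd = sqrt(sigma2[k]))
}
hist(y_sim, freq = FALSE, nclass = 30, col = "lightgray", border = "lightgray",
     xlab = "y", ylab = "Density", main = "One prior draw from the DPMM")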

To study inference under this model, we consider the galaxy dataset: velocities (km/second) for 82 galaxies, drawn from six well-separated conic sections of the Corona Borealis region.

# data
# https://rdrr.io/cran/rebmix/man/galaxy.html
data(galaxy, package = "rebmix")
y <- galaxy$Velocity
# histogram
hist(x = y, freq = FALSE, nclass = 30, col = "lightgray", border = "lightgray", 
     xlab = "y", ylab = "Density", main = "Galaxy dataset")


Cluster assignment indicators

By integrating \(G\) out, the Chinese Restaurant Process (CRP) representation leads to a mixture model with a finite but random number of occupied clusters, where the cluster assignments \(\xi_i\) determine which cluster each observation \(y_i\) belongs to. The model can be rewritten in a hierarchical form as follows (a short illustration of the prior number of clusters implied by the CRP appears after the list):

  1. Observation model: Each observation \(y_i\) follows a Normal distribution with a mean \(\vartheta_{\xi_i}\) and variance \(\varsigma^2_{\xi_i}\) corresponding to its assigned cluster: \[ y_i \mid \xi_i, \vartheta_{\xi_i}, \varsigma^2_{\xi_i} \overset{\text{ind}}{\sim} \textsf{N}(\vartheta_{\xi_i}, \varsigma^2_{\xi_i}), \quad i = 1, \dots, n. \]

  2. Cluster assignments prior: We use the CRP prior to define the distribution of cluster assignments \(\xi_i\): \[ p(\xi_i \mid \boldsymbol{\xi}_{-i}, \alpha) = \begin{cases} \frac{n_{-i,k}}{n - 1 + \alpha}, & \text{if } \xi_i = k \qquad\quad \,\, \text{ (existing cluster)}, \\ \frac{\alpha}{n - 1 + \alpha}, & \text{if } \xi_i = K_{-i} + 1 \,\,\, \text{ (new cluster)}. \end{cases} \]

  3. Cluster parameters prior: Each cluster-specific mean and variance \((\vartheta_k,\varsigma^2_k)\) is drawn from the base measure \(G_0\), which is a Normal-Inverse-Gamma distribution: \[ \vartheta_k \mid \varsigma^2_k, \mu, \kappa \overset{\text{ind}}{\sim} \textsf{N}(\mu, \varsigma^2_k/\kappa), \quad \varsigma^2_k \mid \beta \overset{\text{i.i.d.}}{\sim} \textsf{IG}(c,\beta), \quad k = 1, \dots, K. \]

  4. Priors for model parameters: To complete the hierarchical model, we place priors on the rest of model parameters: \[ \alpha \sim \textsf{G}(a_{\alpha}, b_{\alpha}),\quad \mu \sim \textsf{N}(a_{\mu}, b_{\mu}),\quad \kappa \sim \textsf{G}(a_{\kappa}, b_{\kappa}),\quad \beta \sim \textsf{G}(a_{\beta}, b_{\beta}). \]
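
As a quick illustration of what the CRP prior in step 2 implies, the prior expected number of occupied clusters is \(\textsf{E}(K \mid \alpha, n) = \sum_{i=1}^{n} \alpha/(\alpha + i - 1)\). The snippet below evaluates this for the galaxy sample size and a few illustrative values of \(\alpha\) (not the values used in the analysis).

# Prior expected number of clusters under the CRP: E(K | alpha, n) = sum_{i=1}^n alpha / (alpha + i - 1)
# (the alpha values below are illustrative only)
n <- length(y)  # 82 galaxies
sapply(c(0.5, 1, 2, 5), function(alpha) sum(alpha / (alpha + 0:(n - 1))))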


Joint posterior distribution

By integrating out \(G\), the joint posterior distribution of the model parameters is:

\[ \begin{aligned} p(\boldsymbol{\xi}, \boldsymbol{\vartheta}, \boldsymbol{\varsigma}^2, \alpha, \mu, \kappa, \beta \mid \boldsymbol{y}) &\propto \prod_{i=1}^n \textsf{N}(y_i \mid \vartheta_{\xi_i}, \varsigma^2_{\xi_i}) \cdot \prod_{k=1}^K \textsf{N}(\vartheta_k \mid \mu, \varsigma^2_k / \kappa) \cdot \textsf{IG}(\varsigma^2_k \mid c, \beta) \\ &\quad \cdot p(\boldsymbol{\xi} \mid \alpha) \cdot \textsf{G}(\alpha \mid a_\alpha, b_\alpha) \cdot \textsf{N}(\mu \mid a_\mu, b_\mu) \cdot \textsf{G}(\kappa \mid a_\kappa, b_\kappa) \cdot \textsf{G}(\beta \mid a_\beta, b_\beta). \end{aligned} \]
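
Although not needed for the sampler itself, a small helper that evaluates this unnormalized log posterior is convenient for debugging the full conditional updates. The sketch below is one possible implementation (the name `log_joint` and its interface are ours); the partition term uses the standard CRP probability \(p(\boldsymbol{\xi} \mid \alpha) = \alpha^{K}\, \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{k=1}^{K} \Gamma(n_k)\).

# Sketch (not part of the sampler below): unnormalized log joint posterior,
# useful for debugging the full conditional updates.
# Assumes cluster labels in xi are contiguous, i.e. 1, ..., K with no gaps.
log_joint <- function(y, xi, theta, sigma2, alpha, mu, kappa, beta, cc,
                      a_alpha, b_alpha, a_mu, b_mu, a_kappa, b_kappa, a_beta, b_beta) {
  n   <- length(y)
  n_k <- tabulate(xi)
  K   <- length(n_k)
  # likelihood
  ll <- sum(dnorm(y, mean = theta[xi], sd = sqrt(sigma2[xi]), log = TRUE))
  # base measure: N(theta_k | mu, sigma2_k / kappa) * IG(sigma2_k | c, beta)
  lg0 <- sum(dnorm(theta, mean = mu, sd = sqrt(sigma2 / kappa), log = TRUE)) +
    sum(dgamma(1 / sigma2, shape = cc, rate = beta, log = TRUE) - 2 * log(sigma2))
  # CRP prior on the partition: alpha^K * Gamma(alpha) / Gamma(alpha + n) * prod_k Gamma(n_k)
  lcrp <- K * log(alpha) + lgamma(alpha) - lgamma(alpha + n) + sum(lgamma(n_k))
  # hyperpriors
  lhp <- dgamma(alpha, a_alpha, b_alpha, log = TRUE) +
    dnorm(mu, a_mu, sqrt(b_mu), log = TRUE) +
    dgamma(kappa, a_kappa, b_kappa, log = TRUE) +
    dgamma(beta, a_beta, b_beta, log = TRUE)
  ll + lg0 + lcrp + lhp
}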

We now derive the full conditional distributions for posterior inference via Gibbs sampling.


Full conditional for the cluster assignments \(\xi_i\)

We want to derive: \[ p(\xi_i = k \mid \boldsymbol{\xi}_{-i}, \boldsymbol{\vartheta}, \boldsymbol{\varsigma}^2, \alpha, \mu, \kappa, \beta, \mathbf{y}) \propto p(\xi_i = k \mid \boldsymbol{\xi}_{-i}, \alpha) \cdot p(y_i \mid \vartheta_k, \varsigma^2_k) \]

Assigning \(\xi_i\) to an existing cluster \(k \in \{1, \dots, K_{-i} \}\): \[ p(\xi_i = k \mid \cdots) \propto \frac{n_{-i,k}}{n - 1 + \alpha} \cdot \textsf{N}(y_i \mid \vartheta_k, \varsigma^2_k) \]

Assigning \(\xi_i\) to a new cluster \(k^\ast = K_{-i} + 1\): In this case, the new cluster has no associated \((\vartheta_{k^\ast}, \varsigma^2_{k^\ast})\), so we must integrate them out under the base measure \(G_0\). That is: \[ p(y_i \mid \mu, \kappa, \beta) = \int \textsf{N}(y_i \mid \vartheta, \varsigma^2) \cdot \textsf{N}(\vartheta \mid \mu, \varsigma^2/\kappa) \cdot \textsf{IG}(\varsigma^2 \mid c, \beta) \, \textsf{d}\vartheta \, \textsf{d}\varsigma^2 = \textsf{t}_{2c} \left(y_i \mid \mu, \beta (1 + 1/\kappa)/c \right), \] where \(\textsf{t}_{\nu}(\cdot \mid m, s^2)\) denotes the Student-t density with \(\nu\) degrees of freedom, location \(m\), and squared scale \(s^2\). So: \[ p(\xi_i = K_{-i} + 1 \mid \cdots) \propto \frac{\alpha}{n - 1 + \alpha} \cdot \textsf{t}_{2c} \left(y_i \mid \mu, \beta (1 + 1/\kappa)/c \right). \] If \(\xi_i\) is assigned to a new cluster, we must sample a new value for \((\vartheta_{K+1},\varsigma^2_{K+1})\) from its posterior distribution: \[ p(\vartheta,\varsigma^2 \mid y_i, \mu, \kappa, \beta) = p(\vartheta \mid y_i, \varsigma^2, \mu, \kappa) \, p(\varsigma^2 \mid y_i, \mu, \kappa, \beta) = \textsf{N}(\vartheta \mid \mu^*, \sigma^{*2}) \, \textsf{IG}(\varsigma^2 \mid c^*,\beta^*), \] where \[ \mu^* = \frac{\kappa \mu + y_i}{\kappa + 1}, \quad \sigma^{*2} = \frac{\varsigma^2}{\kappa + 1}, \quad c^* = c + \frac{1}{2}, \quad \beta^* = \beta + \frac{1}{2}\,\frac{(y_i - \mu)^2}{1 + 1/\kappa}. \]
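
As a quick sanity check on this marginal predictive, the following sketch compares a Monte Carlo approximation of the integral with the scaled Student-t density; the parameter values and the number of draws are arbitrary.

# Monte Carlo check of the marginal predictive under G_0 (illustrative values)
set.seed(42)
mu <- 0; kappa <- 2; beta <- 3; cc <- 2; y0 <- 1.5
S  <- 1e6
s2 <- 1 / rgamma(S, shape = cc, rate = beta)       # varsigma^2 ~ IG(c, beta)
th <- rnorm(S, mean = mu, sd = sqrt(s2 / kappa))   # vartheta | varsigma^2 ~ N(mu, varsigma^2/kappa)
mc_est <- mean(dnorm(y0, mean = th, sd = sqrt(s2)))
scale2 <- beta * (1 + 1/kappa) / cc
t_dens <- dt((y0 - mu) / sqrt(scale2), df = 2 * cc) / sqrt(scale2)
c(mc_est, t_dens)  # should agree up to Monte Carlo error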


Full conditional for the cluster means \(\vartheta_k\)

We want to derive: \[ p(\vartheta_k \mid \{y_i : \xi_i = k\}, \varsigma^2_k, \mu, \kappa) \propto \prod_{i : \xi_i = k} \textsf{N}(y_i \mid \vartheta_k, \varsigma^2_k) \cdot \textsf{N}(\vartheta_k \mid \mu, \varsigma^2_k / \kappa). \] This corresponds to: \[ \vartheta_k \mid \cdots \sim \textsf{N} \left( \frac{\kappa \mu + n_k \bar{y}_k}{\kappa + n_k}, \, \frac{\varsigma^2_k}{\kappa + n_k} \right), \] where \(n_k = \sum_{i} \mathbb{1}(\xi_i = k)\) is the number of observations assigned to cluster \(k\) and \(\bar{y}_k = \frac{1}{n_k} \sum_{i: \xi_i = k} y_i\) is their sample mean.
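
For completeness, here is a brief sketch of the completing-the-square step behind this update; the same algebra underlies the full conditional for \(\mu\) below: \[ \prod_{i : \xi_i = k} \textsf{N}(y_i \mid \vartheta_k, \varsigma^2_k) \cdot \textsf{N}(\vartheta_k \mid \mu, \varsigma^2_k / \kappa) \propto \exp\left\{ -\frac{1}{2\varsigma^2_k}\left[ \sum_{i : \xi_i = k} (y_i - \vartheta_k)^2 + \kappa(\vartheta_k - \mu)^2 \right] \right\} \propto \exp\left\{ -\frac{\kappa + n_k}{2\varsigma^2_k}\left( \vartheta_k - \frac{\kappa\mu + n_k\bar{y}_k}{\kappa + n_k} \right)^2 \right\}, \] which is the kernel of the Normal distribution stated above.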


Full conditional for the cluster variances \(\varsigma^2_k\)

We want to derive: \[ p(\varsigma^2_k \mid \{y_i : \xi_i = k\}, \vartheta_k, \mu, \kappa, \beta) \propto \prod_{i : \xi_i = k} \textsf{N}(y_i \mid \vartheta_k, \varsigma^2_k) \cdot \textsf{N}(\vartheta_k \mid \mu, \varsigma^2_k / \kappa) \cdot \textsf{IG}(\varsigma_k^2 \mid c, \beta). \] This corresponds to: \[ \varsigma_k^2 \mid \cdots \sim \textsf{IG} \left( c + \frac{n_k + 1}{2}, \, \beta + \frac{\kappa (\vartheta_k - \mu)^2 + \sum_{i : \xi_i = k} (y_i - \vartheta_k)^2}{2} \right). \]


Full conditional for the global mean \(\mu\)

We want to derive: \[ p(\mu \mid \boldsymbol{\vartheta}, \boldsymbol{\varsigma}^2, \kappa) \propto \prod_{k=1}^K \textsf{N}(\vartheta_k \mid \mu, \varsigma_k^2 / \kappa) \cdot \textsf{N}(\mu \mid a_\mu, b_\mu). \] This corresponds to \(\mu \mid \cdots \sim \textsf{N}(M,V)\), where: \[ M = \frac{\frac{1}{b_\mu} \, a_\mu + \sum_{k=1}^K \frac{\kappa}{\varsigma_k^2} \, \vartheta_k}{\frac{1}{b_\mu} + \sum_{k=1}^K \frac{\kappa}{\varsigma_k^2}}, \qquad V = \frac{1}{\frac{1}{b_\mu} + \sum_{k=1}^K \frac{\kappa}{\varsigma_k^2}}. \]


Full conditional for \(\kappa\)

We want to derive: \[ p(\kappa \mid \boldsymbol{\vartheta}, \boldsymbol{\varsigma}^2, \mu) \propto \prod_{k=1}^K \textsf{N}(\vartheta_k \mid \mu, \varsigma_k^2 / \kappa) \cdot \textsf{G}(\kappa \mid a_\kappa, b_\kappa). \] This corresponds to: \[ \kappa \mid \cdots \sim \textsf{G}\left( a_\kappa + \frac{K}{2}, \, b_\kappa + \frac{1}{2} \sum_{k=1}^K \frac{(\vartheta_k - \mu)^2}{\varsigma_k^2} \right). \]


Full conditional for \(\beta\)

We want to derive: \[ p(\beta \mid \boldsymbol{\varsigma}^2) \propto \prod_{k=1}^K \textsf{IG}(\varsigma_k^2 \mid c, \beta) \cdot \textsf{G}(\beta \mid a_\beta, b_\beta). \] This corresponds to: \[ \beta \mid \cdots \sim \textsf{G}\left( a_\beta + cK, \, b_\beta + \sum_{k=1}^K \frac{1}{\varsigma_k^2} \right). \]


Implementation

sample_xi <- function(y, xi, theta, sigma2, alpha, mu, kappa, beta, cc) {
  n <- length(y)
  
  for (i in 1:n) {
    # Temporarily remove current assignment
    xi_new <- xi
    xi_new[i] <- 0
    
    # Relabel clusters
    unique_xi  <- unique(xi_new[-i])
    unique_xi  <- unique_xi[order(unique_xi)]
    xi_new[-i] <- as.numeric(factor(xi_new[-i], levels = unique_xi, labels = seq_along(unique_xi)))
    
    K_new <- max(c(0, xi_new))  # in case all elements are zero
    
    # Keep only active cluster parameters
    theta <- theta[unique_xi]
    sigma2 <- sigma2[unique_xi]
    
    log_probs <- numeric(K_new + 1)
    
    # Probabilities for existing clusters
    for (k in 1:K_new) {
      n_k <- sum(xi_new == k)
      log_probs[k] <- log(n_k) + dnorm(y[i], mean = theta[k], sd = sqrt(sigma2[k]), log = TRUE)
    }
    
    # Predictive for new cluster: Student-t(2cc, location = mu, scale^2 = beta(1 + 1/kappa)/cc)
    df <- 2 * cc
    scale2 <- beta * (1 + 1/kappa) / cc
    log_probs[K_new + 1] <- log(alpha) + dt((y[i] - mu)/sqrt(scale2), df = df, log = TRUE) - 0.5 * log(scale2)
    
    # Sample new assignment
    new_cluster <- sample(1:(K_new + 1), size = 1, prob = exp(log_probs - max(log_probs)))
    xi_new[i] <- new_cluster
    
    # If new cluster
    if (new_cluster == (K_new + 1)) {
      # Sample sigma2
      shape <- cc + 0.5
      rate <- beta + 0.5 * (y[i] - mu)^2 / (1 + 1/kappa)
      new_sigma2 <- 1 / rgamma(1, shape = shape, rate = rate)
      
      # Sample theta
      mu_star <- (kappa * mu + y[i]) / (kappa + 1)
      tau2_star <- new_sigma2 / (kappa + 1)
      new_theta <- rnorm(1, mean = mu_star, sd = sqrt(tau2_star))
      
      theta <- c(theta, new_theta)
      sigma2 <- c(sigma2, new_sigma2)
    }
    
    xi <- xi_new
  }
  
  return(list(xi = xi, theta = theta, sigma2 = sigma2))
}
sample_theta <- function(y, xi, sigma2, mu, kappa) 
{
  K <- max(xi)
  theta <- numeric(K)
  
  for (k in 1:K) {
    y_k <- y[xi == k]
    n_k <- length(y_k)
    sum_k <- sum(y_k)
    
    tau2_star <- sigma2[k] / (kappa + n_k)
    mu_star <- (kappa * mu + sum_k) / (kappa + n_k)
    
    theta[k] <- rnorm(1, mean = mu_star, sd = sqrt(tau2_star))
  }
  
  return(theta)
}
sample_sigma2 <- function(y, xi, theta, mu, kappa, beta, cc)
{
  K <- max(xi)
  sigma2 <- numeric(K)
  
  for (k in 1:K) {
    y_k  <- y[xi == k]
    n_k  <- length(y_k)

    shape <- cc + 0.5 * (n_k + 1)
    rate  <- beta + 0.5 * (sum((y_k - theta[k])^2) + kappa * (theta[k] - mu)^2)
    
    sigma2[k] <- 1 / rgamma(1, shape = shape, rate = rate)
  }
  
  return(sigma2)
}
sample_mu <- function(theta, sigma2, kappa, a_mu, b_mu) 
{
  K <- length(theta)
  precision_terms <- kappa / sigma2

  V_mu <- 1 / (1 / b_mu + sum(precision_terms))
  M_mu <- V_mu * (a_mu / b_mu + sum(precision_terms * theta))
  
  return(rnorm(1, mean = M_mu, sd = sqrt(V_mu)))
}
sample_kappa <- function(theta, sigma2, mu, a_kappa, b_kappa) 
{
  K <- length(theta)

  shape <- a_kappa + 0.5 * K
  rate  <- b_kappa + 0.5 * sum((theta - mu)^2 / sigma2)
  
  return(rgamma(1, shape = shape, rate = rate))
}
sample_beta <- function(sigma2, cc, a_beta, b_beta) 
{
  K <- length(sigma2)
  
  shape <- a_beta + cc * K
  rate  <- b_beta + sum(1 / sigma2)
  
  return(rgamma(1, shape = shape, rate = rate))
}
sample_alpha <- function(alpha, xi, n, a_alpha, b_alpha) 
{
  K <- length(unique(xi))
  eta <- rbeta(1, shape1 = alpha + 1, shape2 = n)
  delta <- (a_alpha + K - 1)/(a_alpha + K - 1 + n*(b_alpha - log(eta)))
  
  if (runif(1) < delta) {
    return(rgamma(1, shape = a_alpha + K, rate = b_alpha - log(eta)))
  } else {
    return(rgamma(1, shape = a_alpha + K - 1, rate = b_alpha - log(eta)))
  }
}
gibbs_sampler_bnp <- function(y, iter, burn,
                              a_alpha, b_alpha,
                              a_mu, b_mu,
                              a_kappa, b_kappa,
                              a_beta, b_beta,
                              cc) {
  n <- length(y)
  max_K <- floor(n / 2)

  # Storage
  B <- iter - burn
  xi_store     <- matrix(NA, nrow = B, ncol = n)
  mu_store     <- numeric(B)
  kappa_store  <- numeric(B)
  beta_store   <- numeric(B)
  alpha_store  <- numeric(B)
  theta_store  <- vector("list", B)
  sigma2_store <- vector("list", B)
  ll_store     <- numeric(iter)     
  
  # Initialization
  alpha <- rgamma(1, a_alpha, b_alpha)
  mu    <- rnorm(1, a_mu, sqrt(b_mu))
  kappa <- rgamma(1, a_kappa, b_kappa)
  beta  <- rgamma(1, a_beta, b_beta)

  xi <- sample(1:max_K, n, replace = TRUE)
  xi <- as.numeric(factor(xi, levels = unique(xi), labels = seq_along(unique(xi))))
  K <- length(unique(xi))

  sigma2 <- 1 / rgamma(K, shape = cc, rate = beta)
  theta  <- rnorm(K, mean = mu, sd = sqrt(sigma2 / kappa))

  for (t in 1:iter) {
    # Sample xi (updates theta and sigma2 if a new cluster is formed)
    xi_update <- sample_xi(y, xi, theta, sigma2, alpha, mu, kappa, beta, cc)
    xi     <- xi_update$xi
    theta  <- xi_update$theta
    sigma2 <- xi_update$sigma2

    # Sample cluster parameters
    theta  <- sample_theta(y, xi, sigma2, mu, kappa)
    sigma2 <- sample_sigma2(y, xi, theta, mu, kappa, beta, cc)

    # Sample hyperparameters
    mu     <- sample_mu(theta, sigma2, kappa, a_mu, b_mu)
    kappa  <- sample_kappa(theta, sigma2, mu, a_kappa, b_kappa)
    beta   <- sample_beta(sigma2, cc, a_beta, b_beta)
    alpha  <- sample_alpha(alpha, xi, n, a_alpha, b_alpha)
    
    ll_store[t] <- sum(dnorm(y, mean = theta[xi], sd = sqrt(sigma2[xi]), log = TRUE))

    # Store post-burn-in samples
    if (t > burn) {
      i <- t - burn
      xi_store[i, ]     <- xi
      mu_store[i]       <- mu
      kappa_store[i]    <- kappa
      beta_store[i]     <- beta
      alpha_store[i]    <- alpha
      theta_store[[i]]  <- theta
      sigma2_store[[i]] <- sigma2
    }

    # Progress update
    if (t %% (iter / 10) == 0) {
      cat("Iteration", t, "/", iter, "| Clusters:", length(unique(xi)), "\n")
    }
  }

  return(list(
    xi     = xi_store,
    theta  = theta_store,
    sigma2 = sigma2_store,
    mu     = mu_store,
    kappa  = kappa_store,
    beta   = beta_store,
    alpha  = alpha_store,
    ll     = ll_store
  ))
}
# Hyperparameters for location-scale BNP model
a_alpha <- 1
b_alpha <- 1
a_mu    <- mean(y)
b_mu    <- 2 * var(y)
a_kappa <- 2
b_kappa <- 1
a_beta  <- 2
b_beta  <- 1
cc      <- 2

# Gibbs sampling settings
iter <- 25000
burn <- 5000

# Run Gibbs sampler
samples <- gibbs_sampler_bnp(y, iter, burn,
                             a_alpha, b_alpha,
                             a_mu, b_mu,
                             a_kappa, b_kappa,
                             a_beta, b_beta,
                             cc)

# Save output
save(samples, file = "samples_bnp_location_scale.RData")

Taking the logarithm of the likelihood at iteration \(t\) of the sampler, we obtain:
\[ \log p(\boldsymbol{y} \mid \boldsymbol{\xi}^{(t)}, \boldsymbol{\vartheta}^{(t)}, \boldsymbol{\varsigma}^{2\,(t)}) = \sum_{i=1}^{n} \log \textsf{N}(y_i \mid \vartheta_{\xi_i}^{(t)}, \varsigma_{\xi_i}^{2\,(t)}), \quad t = 1,\ldots,T, \] where \(T\) is the total number of iterations; the trace below therefore includes the burn-in period.

# Plot log-likelihood trace
par(mfrow = c(1,1), mar = c(3,3,1.4,1.4), mgp = c(1.75,0.75,0))
plot(samples$ll, type = "p", pch = ".", col = "blue", 
     main = "Log-Likelihood Trace Plot",
     xlab = "Iteration", ylab = "Log-Likelihood")


Inference on the number of clusters

We estimate the posterior distribution of the number of clusters by computing the number of occupied clusters \(K^{(b)}\) in \(\boldsymbol{\xi}^{(b)}\) at each saved iteration (since labels are kept contiguous, \(K^{(b)} = \max_i \xi_i^{(b)}\)). The normalized frequencies approximate \(p(K \mid \boldsymbol{y})\), which we visualize with a spike plot.

# Compute the number of clusters K at each iteration
K_values <- apply(samples$xi, 1, function(x) length(unique(x)))

# Compute the posterior distribution of K
K_table <- table(K_values) / length(K_values)

par(mfrow = c(1,1), mar = c(3,3,1.4,1.4), mgp = c(1.75,0.75,0))
plot(as.numeric(names(K_table)), K_table, type = "h", lwd = 3, col = "blue",
     main = "Posterior distribution of number of clusters",
     xlab = "Number of clusters", ylab = "Density",
     yaxt = "n")  # Suppress default y-axis

axis(2, at = pretty(K_table), labels = pretty(K_table))


Inference on the density function

In a DPMM, the number of clusters \(K\) varies across posterior samples. The posterior density estimate can therefore be computed as: \[ \hat{f}(x) = \frac{1}{B} \sum_{b=1}^B \sum_{k=1}^{K^{(b)}} \frac{n_k^{(b)}}{n} \, \textsf{N}\left(x \mid \vartheta_k^{(b)}, \varsigma_k^{2\,(b)}\right). \] This estimator combines posterior samples to estimate the population density while incorporating uncertainty in both the number of clusters and the parameter values.

posterior_density_estimate <- function(samples, x_seq) {
  B <- length(samples$theta)
  n <- ncol(samples$xi)
  M <- length(x_seq)
  
  # Store density estimates
  FE <- matrix(NA, nrow = B, ncol = M)
  
  for (b in 1:B) {
    xi_b <- samples$xi[b, ]
    theta_b <- samples$theta[[b]]
    sigma2_b <- samples$sigma2[[b]]
    
    # Unique cluster IDs
    cluster_labels <- sort(unique(xi_b))
    cluster_counts <- as.numeric(table(xi_b)[as.character(cluster_labels)])
    
    # Cluster probabilities
    weight_b <- cluster_counts / sum(cluster_counts)
    
    for (i in 1:M) {
      FE[b, i] <- sum(weight_b * dnorm(x_seq[i], mean = theta_b, sd = sqrt(sigma2_b)))
    }
  }
  
  f_hat <- colMeans(FE)                           # posterior mean density
  f_inf <- apply(FE, 2, quantile, probs = 0.025)  # lower bound of 95% credible band
  f_sup <- apply(FE, 2, quantile, probs = 0.975)  # upper bound of 95% credible band
  
  return(list(f_hat = f_hat, f_inf = f_inf, f_sup = f_sup))
}
# Define a sequence of x values for density estimation
x_seq <- seq(min(y) - 1, max(y) + 1, length.out = 250)

# Compute posterior density estimate and credible intervals
density_estimate <- posterior_density_estimate(samples, x_seq)

# Extract results
f_hat <- density_estimate$f_hat
f_inf <- density_estimate$f_inf
f_sup <- density_estimate$f_sup

# Plot the histogram
par(mfrow = c(1,1), mar = c(3,3,1.4,1.4), mgp = c(1.75,0.75,0))
hist(y, breaks = 30, freq = FALSE, col = "lightgray", border = "lightgray",
     main = "Posterior predictive density estimate",
     xlab = "x", ylab = "Density", ylim = c(0, max(f_sup)))

# Add the 95% credible interval as a shaded region
polygon(c(x_seq, rev(x_seq)), c(f_inf, rev(f_sup)), col = rgb(0, 0, 1, 0.2), border = NA)

# Overlay the posterior density estimate as a blue line
lines(x_seq, f_hat, col = "blue", lwd = 2)


Inference on the cluster assignments

The co-clustering matrix \(\mathbf{A} = [a_{i,j}]\) is an \(n \times n\) symmetric matrix that represents the posterior probability that observations \(i\) and \(j\) belong to the same cluster. It is defined as: \[ a_{i,j} = \textsf{Pr}(\xi_i = \xi_j \mid \boldsymbol{y}) \approx \frac{1}{B} \sum_{b=1}^{B} I\left\{ \xi_i^{(b)} = \xi_j^{(b)} \right\}, \] where \(I(A)\) is the indicator function that equals 1 if \(A\) is true and 0 otherwise.

Since the event \(\{\xi_i = \xi_j\}\) is symmetric in \(i\) and \(j\), \(\mathbf{A}\) is symmetric: \[ \textsf{Pr}(\xi_i = \xi_j \mid \boldsymbol{y}) = \textsf{Pr}(\xi_j = \xi_i \mid \boldsymbol{y}). \] Furthermore, the diagonal elements satisfy \(a_{i,i} = 1\) because each observation always belongs to the same cluster as itself.

# Compute co-clustering matrix A
compute_co_clustering_matrix <- function(samples) {
  n <- ncol(samples$xi)
  B <- nrow(samples$xi)
  A <- matrix(0, nrow = n, ncol = n)
  
  for (b in 1:B) {
    xi_b <- samples$xi[b, ]
    for (k in unique(xi_b)) {
      cluster_members <- which(xi_b == k)
      A[cluster_members, cluster_members] <- A[cluster_members, cluster_members] + 1
    }
  }
  
  A <- A / B
  diag(A) <- 1
  
  return(A)
}
# Compute A and reorder it according to a point-estimate partition
A <- compute_co_clustering_matrix(samples)

# Convert A to dissimilarity
D <- 1 - A

suppressMessages(suppressWarnings(library(mclust)))

# Apply classical multidimensional scaling
mds_coords <- cmdscale(as.dist(D), k = 4)

# Run Mclust on the embedded coordinates
clustering_result <- Mclust(mds_coords, G = 8, verbose = FALSE)

# Point-estimate partition (used only to order the rows/columns of A)
components <- clustering_result$classification

# Sort observations by the estimated partition
A <- A[order(components), order(components)]

# Visualize co-clustering matrix
par(mar = c(2.75, 2.75, 0.5, 0.5), mgp = c(1.7, 0.7, 0))
colorscale <- c("white", rev(heat.colors(100)))
corrplot::corrplot(A, is.corr = FALSE, method = "color", col = colorscale,
                   tl.pos = "n", addgrid.col = NA)