Histogram and Density Plot

This dataset contains only one variable, a response that takes values between -5 and 4. From the density plot and the histogram, it appears that there are two populations. The problem is that there is no information about which population each observation belongs to: the dataset has no covariates (variables) that might explain why there are groups. This is a typical case where a mixture model comes in handy.

[Figure: histogram and density plot of the response variable]
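The original data is not shown here. As a rough sketch, data with this shape can be simulated from a two-component normal mixture and plotted the same way; the means, standard deviation, and weights below are assumptions chosen to loosely match the plots (n = 200 matches the number of observed nodes JAGS reports further down), not the actual generating values:

set.seed(42)
n <- 200
# Hypothetical latent group membership, then a normal draw per group
z_true <- sample(1:2, size = n, replace = TRUE, prob = c(0.4, 0.6))
y <- rnorm(n, mean = c(-2.0, 1.5)[z_true], sd = 1.0)

hist(y, breaks = 30, freq = FALSE, main = "Histogram of y")
plot(density(y), main = "Density estimate of y")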

Mixture Model

Because the group memberships are unobserved, they are called latent variables. By fitting a mixture model of two normal distributions to the data, you can treat the memberships as parameters in a hierarchical model and perform Bayesian inference for them. In this case, you might use a Dirichlet prior for the weight vector ω and conjugate priors for the population-specific parameters in θ. This model yields posterior distributions for z (the population membership of each observation), ω (the population weights), and θ (the population-specific parameters of the component densities f_j). Among the parameters to monitor, it is worth looking at a few of the z's defined in the model (here z[1], z[31], z[49], and z[6] have been picked).
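Concretely, marginalizing over the latent membership z[i], the two-component mixture density for each observation (with a shared variance, matching the JAGS model below) is

p(y_i | ω, θ) = ω_1 N(y_i | μ_1, σ²) + ω_2 N(y_i | μ_2, σ²),

while conditionally on z[i] = j the observation is simply y_i ~ N(μ_j, σ²), which is the form used in the JAGS code.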

mod_string <- " model{
# Likelihood of the data
  for (i in 1:length(y)) {
    y[i] ~ dnorm(mu[z[i]], prec)
    # Prior for z (latent variable) coming from a categorical distribution
    # The probability of being in these 2 categories are saved in
    # the probability vector omega
    z[i] ~ dcat(omega)
  }
  mu[1] ~ dnorm(-1.0, 1.0/100.0)
  # mu2 constrained to be larger than m1:
  mu[2] ~ dnorm(-1.0, 1.0/100.0) T(mu[1],) # ensures mu[1] < mu[2]
  # Prior for the variance (precision)
  prec ~ dgamma(1.0/2.0, 1.0*1.0/2.0)
  sig = sqrt(1.0/prec)
  # Prior distribution for probability vector is the Dirichlet distrib.
  omega ~ ddirich(c(1.0, 1.0))
} "

library("rjags")

set.seed(11)
data_jags = list(y=y)
# Select some z's to monitor
params = c("mu", "sig", "omega", "z[1]", "z[31]", "z[49]", "z[6]") 
mod <- jags.model(textConnection(mod_string), data = data_jags,
                  n.chains = 3)

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 200
   Unobserved stochastic nodes: 204
   Total graph size: 614

Initializing model

update(mod, 1e3)
mod_sim <- coda.samples(model = mod, variable.names = params, n.iter = 5e3)
mod_csim <- as.mcmc(do.call(rbind, mod_sim))
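Before interpreting the results, it is worth checking that the chains have converged. A minimal sketch using standard coda diagnostics (this step is assumed here, not shown in the original output):

# Trace plots to visually inspect mixing
plot(mod_sim)

# Gelman-Rubin diagnostic: values near 1 suggest convergence
gelman.diag(mod_sim, multivariate = FALSE)

# Autocorrelation and effective sample size of the combined chains
autocorr.diag(mod_csim)
effectiveSize(mod_csim)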

Model Results

If you look at a posterior summary of these Markov chains and compare the estimates with the histogram, you see that mu[1] has a posterior mean of about -2.1, which looks about right for the first normal distribution, and mu[2] has a posterior mean of about 1.5, which also seems reasonable. So it looks like the model has identified the two populations. The two normal distributions share a standard deviation of about 1.1, and the estimated probability of being in group 1 is about 39%, versus about 61% for group 2.
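The posterior summary below can be obtained with coda's summary method applied to the simulation output:

summary(mod_sim)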


Iterations = 2001:7000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

            Mean       SD  Naive SE Time-series SE
mu[1]    -2.1213 0.167262 1.366e-03      2.680e-03
mu[2]     1.4905 0.126084 1.029e-03      1.597e-03
omega[1]  0.3875 0.040890 3.339e-04      5.208e-04
omega[2]  0.6125 0.040890 3.339e-04      5.208e-04
sig       1.1370 0.074266 6.064e-04      9.983e-04
z[1]      1.0073 0.084937 6.935e-04      7.462e-04
z[31]     1.5733 0.494609 4.038e-03      4.407e-03
z[49]     1.7990 0.400761 3.272e-03      3.462e-03
z[6]      1.9999 0.008165 6.667e-05      6.667e-05

2. Quantiles for each variable:

            2.5%     25%     50%     75%   97.5%
mu[1]    -2.4485 -2.2332 -2.1221 -2.0126 -1.7824
mu[2]     1.2388  1.4084  1.4910  1.5742  1.7334
omega[1]  0.3101  0.3597  0.3868  0.4148  0.4688
omega[2]  0.5312  0.5852  0.6132  0.6403  0.6899
sig       1.0081  1.0844  1.1319  1.1821  1.2997
z[1]      1.0000  1.0000  1.0000  1.0000  1.0000
z[31]     1.0000  1.0000  2.0000  2.0000  2.0000
z[49]     1.0000  2.0000  2.0000  2.0000  2.0000
z[6]      2.0000  2.0000  2.0000  2.0000  2.0000

Latent Variables

Groupings that were not explicit in the data can be identified visually with density plots of the latent variables. In this case, almost all of the posterior probability for z[1] falls on group 1, so the first observation is clearly a member of group 1. The posterior for z[6] shows that this observation is clearly a member of group 2. Observation 31 is more ambiguous: the posterior distribution of z[31] puts a little more probability on group 2 than on group 1. Observation 49 belongs fairly clearly to group 2.
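A minimal sketch of how these density plots and the corresponding membership probabilities can be produced with coda from the combined chains (the column names follow the parameters monitored above):

z_cols <- c("z[1]", "z[31]", "z[49]", "z[6]")

# Density plots of the sampled group memberships
par(mfrow = c(2, 2))
densplot(mod_csim[, z_cols])

# Posterior probability that each observation belongs to group 2,
# estimated as the fraction of MCMC samples in which z equals 2
colMeans(mod_csim[, z_cols] == 2)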