NCM Workshop on Introduction to Statistical Learning Techniques

Author

Minerva Mukhopadhyay

Bayesian Variable Selection in Linear Regression

(A) Variable Selection in Linear Regression

  • Regression is a well-established and widely used tool in many fields of science for investigating the linear dependence of a response variable on a set of predictor (independent) variables.

  • In building a good regression model, a common question is how much relevant information should be included. In the regression context this means retaining only the important predictors that are actually related to \(Y\); this is the problem of variable selection.

  • Consider a regression setup with \(n\) observations on a response \(y\), and a set of \(p\) regressors \(x_1\), \(x_2\), \(\ldots\), \(x_p\). The linear model assumes that the \(i\)-th observation of the response, \(y_i\), is related to \(x_{1,i}, \ldots, x_{p,i}\) as \[y_{i}=\beta_0+\beta_1 x_{1,i}+\beta_2 x_{2,i} + \cdots + \beta_p x_{p,i} + e_{i}, \]

    where \(\beta_j\) is the \(j^{th}\) regression coefficient for \(j \in \{1, \ldots, p\}\) and \(e_{1}, \ldots, e_{n}\) are iid normal with mean \(0\) and variance \(\sigma^{2}\).

Example: The Residential Building data set is available from the UCI Machine Learning Repository. It contains the construction costs and sale prices of \(n = 372\) single-family residential apartments in Tehran, Iran. The data include 8 physical and financial regressors and 19 economic regressors collected at 5 time lags before construction, i.e., \(p = 103\) variables in total. A heatmap of the covariates is given below:

  • What happens if we include all the variables in the regression model?

    • There may be variables that are not involved in the data generating process (DGP) of \(Y\), and hence have insignificant regression coefficients. Suppose \({\bf x}_{1} = \{ x_{1}, \ldots, x_{k}\}\) is the set of variables involved in the DGP of \(Y\) and \({\bf x}_{2} = \{ x_{k+1}, \ldots, x_{p} \}\) is the set of variables that are additionally included.
    • The variance of the least-squares estimate \(\widehat{\boldsymbol{\mu}}_{\text{OLS}}\) of the regression mean \(\boldsymbol{\mu}\) is \(\sigma^{2} X\left(X^{\top} X\right)^{-1}X^{\top} = \sigma^{2} P_{X}\). It is easy to see that \(P_{X} = P_{X_{1}} + P_{\bf Z}\), where \(P_{\bf Z}\) is the projection matrix onto the column space of \({\bf Z} = (I - P_{X_{1}}) X_{2}\), and \(P_{X_{i}}\) is the projection matrix onto the column space of \(X_{i}\), the design matrix corresponding to \({\bf x}_{i}\), \(i = 1,2\).
    • Therefore, if we consider variance of any component of \(\widehat{\boldsymbol{\mu}}\), say \(\widehat{\mu}_{1}\), then \(\text{var} \left( \widehat{\mu}_{1} \right) = \sigma^{2} \left( P_{X_{1}}^{1,1} + P_{\bf Z}^{1,1} \right)\). The additional component \(\sigma^{2} P_{\bf Z}^{1,1}\) is the noise, which is added due to the involvement of \({\bf x}_{2}\).
    • Next, consider the situation where the covariates suffer from multicollinearity, a phenomenon which is very common in the high-dimensional setting. Suppose \(X_{1}\) and \(X_{2}\) are nearly collinear. Then \({\bf Z} = (I - P_{X_{1}}) X_{2}\) has very small singular values, which makes the added noise variance extremely large.
    • Let us consider the residential building dataset.
[1] "Least squares"
[1] "Average MAE and variance of the estimate of a test point over 500 repetitions are 0.992 and 0.89, respectively."
[1] "LASSO selects only one variable"
[1] "Average MAE and variance of the estimate of a test point over 500 repetitions are 0.987 and 0.87, respectively."
  • We use half of the observations for training and the remaining half for prediction. Over \(500\) repetitions, we record the mean absolute error (MAE) and the point estimate of \(\mu_{1}\) obtained by LASSO (which selects only \(1\) variable) and by least squares (LS). The average MAE of LS and LASSO are \(0.992\) and \(0.987\), respectively, while the variances of \(\widehat{\mu}_{1}\) are \(0.89\) (LS) and \(0.87\) (LASSO). An R sketch of this experiment is given after this list.
    • From these results it is clear that variable selection need not hurt the predictive power of the model; rather, it makes the prediction less noisy.
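
The experiment above can be sketched in R as follows. This is not the original analysis script: simulated data stand in for the Residential Building covariates and the glmnet package is assumed for the LASSO fit; only the structure of the experiment (repeated half/half splits, MAE on the held-out half) follows the description above.

    ## Sketch of the LS-vs-LASSO comparison; simulated data stand in for the real covariates
    library(glmnet)
    set.seed(1)
    n <- 200; p <- 50
    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X[, 1:3] %*% rep(1, 3) + rnorm(n))

    R <- 100                                  # number of train/test repetitions
    mae_ls <- mae_lasso <- numeric(R)
    for (r in 1:R) {
      tr <- sample(n, n / 2)                  # half for training, half for prediction
      fit_ls <- lm(y[tr] ~ X[tr, ])
      pred_ls <- drop(cbind(1, X[-tr, ]) %*% coef(fit_ls))
      fit_la <- cv.glmnet(X[tr, ], y[tr])     # LASSO with cross-validated penalty
      pred_la <- drop(predict(fit_la, X[-tr, ], s = "lambda.min"))
      mae_ls[r]    <- mean(abs(y[-tr] - pred_ls))
      mae_lasso[r] <- mean(abs(y[-tr] - pred_la))
    }
    c(LS = mean(mae_ls), LASSO = mean(mae_lasso))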

(B) Why Bayesian? What is different in Bayesian paradigm?

  • What is Probability?

    • It is the degree of belief.
  • What is a parameter \(\theta\) ? Prior and Posterior.

    • Since we don’t know the value of a parameter, it is a random quantity to a Bayesian.

    • We model our uncertainty with a probability distribution, \(\Pi(\theta)\).

    • \(\Pi(\theta)\) is called a prior distribution.

    • Prior because it represents the statistician’s belief about \(\theta\) before observing the data.

    • The distribution of \(\theta\) after seeing the data is called the posterior distribution \(\pi(\theta \mid \text{Data})\).

    • The posterior is the conditional distribution of the parameter given the data.

  • Subjectivity.

    • Subjectivity is the most frequent objection to Bayesian methods.

    • The prior distribution influences the conclusions.

    • However, the influence of the prior vanishes as the sample size increases, provided the prior has sufficiently wide support.

  • What can be achieved by obtaining the posterior distribution of \(\theta\)?

    • Point estimate of \(\theta\), for example \(E(\theta \mid \text{Data})\).

    • Uncertainty quantification: a credible region for \(\theta\), e.g., an interval with endpoints \(a<b\) such that \(P(a< \theta \leq b \mid \text{Data}) \geq 0.95\).

    • Testing hypotheses: computing, e.g., \(P(\theta \in \Theta_{0} \mid \text{Data})\) for \(H_{0} : \theta \in \Theta_{0}\). A small conjugate-prior illustration in R is given after this list.
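
As a toy illustration of these three uses of the posterior (not part of the regression setting of these notes), consider a Beta prior on a success probability \(\theta\) updated with Binomial data; the posterior is again a Beta distribution, and the quantities listed above can be read off directly in R:

    a <- b <- 1                              # Beta(1, 1), i.e. uniform, prior belief about theta
    x <- 7; n <- 10                          # data: 7 successes in 10 trials
    a_post <- a + x; b_post <- b + n - x     # posterior is Beta(a + x, b + n - x)

    a_post / (a_post + b_post)               # point estimate E(theta | Data)
    qbeta(c(0.025, 0.975), a_post, b_post)   # 95% equal-tailed credible interval
    pbeta(0.5, a_post, b_post)               # P(theta <= 0.5 | Data), e.g. for H0: theta <= 0.5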

(C) Bayesian variable selection (BVS)

  • There are several ways in which BVS can be performed. We will, in particular, consider two dominant approaches:

    • Maximum a-posteriori (MAP) approach

    • Shrinkage prior based approach

(C1) Maximum a-posteriori (MAP) approach

  • Recall the problem of variable selection in linear regression. The linear regression model with a response \(y\) and covariates \(\{x_{1}, \ldots, x_{p}\}\) assumes that the \(i\)-th observation of the response \(y_i\) is related to that of the covariates \(x_{1,i}, \ldots, x_{p,i}\) as

\[y_{i}=\beta_0+\beta_1 x_{1,i}+\beta_2 x_{2,i} + \cdots + \beta_p x_{p,i} +e_{i}, \]

where \(\beta_j\) is the \(j^{th}\) regression coefficient for \(j \in \{1, \ldots, p\}\) and \(e_{1}, \ldots, e_{n} \stackrel{iid}{\sim} N( 0, \sigma^{2})\).

  • Given \(p\) covariates, the space of all models, \(\mathcal{M}\), directly maps to the space of all subsets of \(\{1,\ldots,p\}\) and has cardinality \(2^{p}\).

  • Let \(\boldsymbol{\gamma}\) be a subset of \(\{1,\ldots,p\}\). Then, there exists a model \(M_{\boldsymbol{\gamma}}\) consisting of the covariates corresponding to \(\boldsymbol{\gamma}\), say \({\bf X}_{\boldsymbol{\gamma}}\). For example, if \(\boldsymbol{\gamma} = \{1,2\}\), then under \(M_{\boldsymbol{\gamma}}\), we have

    \[y_{i}=\beta_0+\beta_1 x_{1,i}+\beta_2 x_{2,i} +e_{i}. \]

  • The set of parameters involved in \(M_{\boldsymbol{\gamma}}\) is denoted by \(\boldsymbol{\theta}_{\boldsymbol{\gamma}}\).

    • Thus, if \(\boldsymbol{\gamma} = \{1,2\}\), then \(\boldsymbol{\theta}_{\boldsymbol{\gamma}} = \{ \beta_{0}, \beta_{1}, \beta_{2}, \sigma^{2} \}\).
  • As the Bayesian paradigm suggests, one places a prior distribution on all unknown quantities in the model.

    • Let \(\pi\left(\boldsymbol{\theta}_{\boldsymbol{\gamma}}\right)\) be a prior density on \(\boldsymbol{\theta}_{\boldsymbol{\gamma}}\), and \(P(M_{\boldsymbol{\gamma}})\) be the prior on the model \(M_{\boldsymbol{\gamma}}\).
  • Then the integrated likelihood is calculated as

\[m_{\boldsymbol{\gamma}}({\bf y}_{n})=\int_{\boldsymbol{\theta}_{\boldsymbol{\gamma}}} f\left({\bf y}_{n}|M_{\boldsymbol{\gamma}},\boldsymbol{\theta}_{\boldsymbol{\gamma}} \right) \pi \left( \boldsymbol{\theta}_{\boldsymbol{\gamma}} \right) d{\boldsymbol{\theta}_{\boldsymbol{\gamma}}} .\]

Integrated likelihood is the expected likelihood of the model \(M_{\boldsymbol{\gamma}}\) over the prior distribution, and can be interpreted as the probability (density) of the response given the model (integrating out the effect of a particular choice of \(\theta_{\boldsymbol{\gamma}}\)).

  • Next the posterior probability of \(M_{\boldsymbol{\gamma}}\) is obtained as \[ P(M_{\boldsymbol{\gamma}}|{\bf y}_{n},{\bf X}_{\boldsymbol{\gamma}}) = \frac{P(M_{\boldsymbol{\gamma}}) m_{\boldsymbol{\gamma}}({\bf y}_{n})}{\sum_{\boldsymbol{\gamma^{\prime}}} P(M_{\boldsymbol{\gamma^{\prime}}}) m_{\boldsymbol{\gamma^{\prime}}}({\bf y}_{n})} \propto P(M_{\boldsymbol{\gamma}}) m_{\boldsymbol{\gamma}}({\bf y}_{n}), \]

    by an application of the Bayes theorem.

  • The maximum a-posteriori (MAP) approach selects the model with highest posterior probability as the best model.

Common choices of priors:

  • Let us denote the regression coefficients specific to the model \(M_{\boldsymbol{\gamma}}\) by \(\boldsymbol{\beta}_{\boldsymbol{\gamma}}\). For example, if \(\boldsymbol{\gamma} = \{1,2\}\), then \(\boldsymbol{\beta}_{\boldsymbol{\gamma}} = \{ \beta_{1}, \beta_{2}\}\). Apart from \(\boldsymbol{\beta}_{\boldsymbol{\gamma}}\), the other two parameters, \(\{ \beta_{0}, \sigma^{2} \}\), are common to all the models.

  • For the parameters that are common to all the models, one can place a non-informative prior. (WHY?)

  • However, proper priors are required for the model-specific parameters.

  • Choices of priors for the common parameters and the model space:

    • On \(\beta_{0}\), a uniform prior on \(\mathbb{R}\) is usually imposed.

    • On \(\sigma^{2}\) one may assign a conjugate inverse-Gamma prior \(\sigma^{2} \sim IG(a_{\sigma}, b_{\sigma})\).

    • In the limiting case \(a_{\sigma} \to 0\) and \(b_{\sigma} \to 0\), the density becomes \(\pi(\sigma^{2}) \propto 1/ \sigma^{2}\) (which is not a proper density). This improper prior on \(\sigma^{2}\) is called the Jeffreys prior, and is invariant under log-transformation.

    • If no a-priori information about the best model is available, one may impose a uniform distribution on the model space, so that \(P(M_{\boldsymbol{\gamma}}) = 2^{-p}\).

    • If one wishes to penalize larger models (as done in classical penalized likelihood methods), then one may impose a Bernoulli prior on the model space as

      \[ P(M_{\boldsymbol{\gamma}} ) \propto \omega_{n}^{|\boldsymbol{\gamma}|} (1- \omega_{n})^{p - | \boldsymbol{\gamma} |}\]

      for some \(0 <\omega_{n} \ll 1/2\).

    • Observe that the above prior penalizes higher-dimensional models. The extent of the penalty depends on the choice of \(\omega_{n}\); a short numerical illustration is given after this list of priors.

  • Zellner’s \(g\)-prior:

    • In Zellner’s \(g\)-prior setup, one places a conjugate prior on \(\boldsymbol{\beta}_{\boldsymbol{\gamma}}\) as \(\boldsymbol{\beta}_{\boldsymbol{\gamma}} \sim N \left({\bf 0}, g_{n} \sigma^{2} \left( {\bf X}_{\boldsymbol{\gamma}}^{\top} {\bf X}_{\boldsymbol{\gamma}} \right)^{-1} \right)\), where \(g_{n}\) is a hyperparameter.
  • Spike and Slab prior:

    • Another popular choice of prior for the regression parameters is the spike and slab prior, where one imposes a mixture prior on each regression parameter as follows:

      \[ \pi(\beta_{j} | \boldsymbol{\gamma} ) = I(j \notin \boldsymbol{\gamma}) \pi^{\text{spike}} (\beta_{j}) + I(j \in \boldsymbol{\gamma}) \pi^{\text{slab}} (\beta_{j}), ~~~ \text{for}~~ j = 1, \ldots, p. \]

    • Here \(\pi^{\text{spike}}\) and \(\pi^{\text{slab}}\) are probability mass functions or densities. Typically, \(\pi^{\text{spike}}\) is sharply peaked around zero with small variance, capturing the coefficients that are not significant, while \(\pi^{\text{slab}}\) is a more dispersed distribution covering plausible values away from zero.

    • A special case takes the spike distribution to be degenerate at \(0\) and the slab distribution to be a zero-mean normal with large variance.

    • Another common choice is to take both the spike and the slab distributions to be normal, with the variance of the spike set to a very small constant \(\sigma^{2} \nu_{n}^{2}\) and the variance of the slab set to a large constant \(\sigma^{2}\tau_{n}^{2}\).
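
Returning to the Bernoulli model prior mentioned above, the short numerical sketch below (my own illustration; the choice \(\omega_{n} = 1/\sqrt{p}\) is purely illustrative) shows how quickly the log prior mass assigned to a model decays with its dimension \(|\boldsymbol{\gamma}|\):

    p <- 103; w <- 1 / sqrt(p)               # omega_n = 1/sqrt(p), an illustrative choice
    k <- 0:5                                 # model dimensions |gamma|
    data.frame(dimension = k,
               log_prior = k * log(w) + (p - k) * log(1 - w))   # log P(M_gamma) up to a constant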

(C2) Shrinkage prior based approach:

  • A different approach to inducing sparsity is not to use model indicators, but instead to specify a prior directly on \(\beta_{j}\) that approximates the “spike and slab” shape. For example,

    \[ \beta_{j} \stackrel{iid}{\sim} DE \left(0,\sigma \lambda^{-1} \right) \]

  • The posterior mode resulting from this choice of prior is obtained by solving the minimization problem

    \[ \arg\min_{\boldsymbol{\beta}} \left\{ \| {\bf y} - X \boldsymbol{\beta}\|_{2}^{2} + \lambda \| \boldsymbol{\beta} \|_{1} \right\}, \]

    which resembles the frequentist LASSO estimator. Hence this method is called the Bayesian LASSO.

  • The prior should work by shrinking \(\beta_{j}\) towards zero if there is no evidence in the data for non-zero values (i.e., the likelihood is concentrated around zero).

  • Conversely, there should be practically no shrinkage for coefficients whose non-zero values are supported by the data.

  • Unlike the MAP approach, here the posterior distribution of \(\boldsymbol{\beta}\) determines the optimal model; consequently, the method does not directly select variables.

  • However, a variable selection rule can be formed by retaining those covariates whose posterior point estimate satisfies \(|\widehat{\beta}_{j}|>c\) for some threshold \(c > 0\). Alternatively, one can select those covariates for which the \(95\%\) credible region of \(\beta_{j}\) does not include \(0\).

  • The Laplace distribution is equivalent to a continuous scale mixture of Gaussian distributions (a quick numerical check is given after this list), as

    \[ \frac{\lambda}{2} \exp\{ - \lambda |\beta|\} = \int_{0}^{\infty} \frac{1}{\sqrt{2\pi \tau^{2}}} \exp \left\{ - \frac{\beta^{2}}{2 \tau^{2}}\right\} \times \frac{\lambda^{2}}{2} \exp \left\{ - \frac{\lambda^{2} \tau^{2}}{2 }\right\} d \tau^{2}. \]

  • This opens up the possibility of generalizing the Bayesian LASSO by choosing different distributions for the mixing scale parameter \(\tau^{2}\), and also by using adaptive shrinkage, i.e., parameters \(\tau_{1}^{2},\ldots, \tau_{p}^{2}\) instead of a common \(\tau^{2}\).
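
The scale-mixture identity above can be checked numerically in R. The sketch below integrates the Normal density against the \(\text{Exp}(\lambda^{2}/2)\) mixing density at a single value of \(\beta\) and compares the result with the Laplace density (the values \(\lambda = 2\) and \(\beta = 0.7\) are arbitrary):

    lambda <- 2; beta <- 0.7
    mix <- integrate(function(t2) dnorm(beta, 0, sqrt(t2)) * dexp(t2, rate = lambda^2 / 2),
                     lower = 0, upper = Inf)$value
    c(mixture = mix, laplace = lambda / 2 * exp(-lambda * abs(beta)))   # the two values agree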

(D) Computation

A major aspect of Bayesian inference is computation, both analytical and numerical.

Let us understand the computational issues of the above-mentioned methods of Bayesian variable selection.

(D1) Calculation of posteriors.

Consider the problem of variable selection using the MAP approach with (a) Zellner’s \(g\)-prior, and (b) Spike-Slab prior.

(a) Posterior probability of a model for Zellner’s \(g\) -prior:

  • We first calculate the marginal likelihood of the model \(M_{\boldsymbol{\gamma}}\):

    \[ m_{\boldsymbol{\gamma}} ({\bf y}_{n}) = \int_{\sigma^{2}} \int_{\boldsymbol{\beta}_{\boldsymbol{\gamma}}} \int_{\beta_{0}}\frac{| X_{\boldsymbol{\gamma}}^{\top} X_{\boldsymbol{\gamma}} |^{1/2} }{ (2\pi)^{(n + p_{\boldsymbol{\gamma}} )/2} g_{n}^{p_{\boldsymbol{\gamma}}/2} \sigma^{ (n + p_{\boldsymbol{\gamma}} +2)}} \exp \left[ - \frac{1}{2 \sigma^{2}} \left\{ \| {\bf y}_{n} - \beta_{0} {\bf 1} - X_{\boldsymbol{\gamma}} \boldsymbol{\beta}_{\boldsymbol{\gamma}} \|^{2} + g_{n}^{-1} \boldsymbol{\beta}_{\boldsymbol{\gamma}}^{\top} X_{\boldsymbol{\gamma}}^{\top} X_{\boldsymbol{\gamma}} \boldsymbol{\beta}_{\boldsymbol{\gamma}} \right\} \right] d\beta_{0} d \boldsymbol{\beta}_{\boldsymbol{\gamma}} d \sigma^{2}, \]

    where \(p_{\boldsymbol{\gamma}} = |\boldsymbol{\gamma}|\).

  • If the covariates are centered, after tedious algebra one gets:

    \[ m_{\boldsymbol{\gamma}} ({\bf y}_{n}) \propto \frac{1}{ (1+g_{n})^{p_{\boldsymbol{\gamma}}/2}} \times \left\{ \frac{1}{1+g_{n}}\sum_{i=1}^{n} (y_{i} - \bar{y}_{n} )^{2} + \frac{g_{n}}{1+g_{n}} {\bf y}_{n}^{\top} \left( I_{n} - P_{\boldsymbol{\gamma}} \right) {\bf y}_{n} \right\}^{- (n-1)/2}, \]

    where \(P_{\boldsymbol{\gamma}}\) is the orthogonal projection matrix onto the column space of \([{\bf 1} : X_{\boldsymbol{\gamma}}]\).

  • Therefore the posterior probability of \(M_{\boldsymbol{\gamma}}\) is obtained as \[ P \left( M_{\boldsymbol{\gamma}} \mid {\bf y}_{n}, X \right) \propto P( M_{\boldsymbol{\gamma}} ) m_{\boldsymbol{\gamma}} ({\bf y}_{n}). \] An R sketch of this computation is given below.
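
The sketch below (my own R translation, not code from the notes) evaluates the log of the marginal likelihood displayed above, up to a constant common to all models, for a given index set \(\boldsymbol{\gamma}\), and adds the log of the Bernoulli model prior; the defaults \(g_{n} = n\) and \(\omega_{n} = 1/\sqrt{p}\) match the choices used later in Section (E).

    ## log m_gamma(y) under the g-prior, up to an additive constant common to all models
    log_marg_g <- function(y, X, gamma, g = length(y)) {
      n   <- length(y)
      tss <- sum((y - mean(y))^2)                            # total sum of squares
      if (length(gamma) == 0) return(-(n - 1) / 2 * log(tss))
      Xg  <- X[, gamma, drop = FALSE]
      sse <- sum(residuals(lm(y ~ Xg))^2)                    # y' (I - P_gamma) y
      -(length(gamma) / 2) * log(1 + g) -
        (n - 1) / 2 * log(tss / (1 + g) + g / (1 + g) * sse)
    }

    ## log posterior model score: add the Bernoulli model prior, log P(M_gamma), up to a constant
    log_post_g <- function(y, X, gamma, g = length(y), w = 1 / sqrt(ncol(X))) {
      log_marg_g(y, X, gamma, g) +
        length(gamma) * log(w) + (ncol(X) - length(gamma)) * log(1 - w)
    }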

(b) Posterior probability of a model for spike and slab prior:

  • We first calculate the marginal likelihood of the model \(M_{\boldsymbol{\gamma}}\):

    \[ m_{\boldsymbol{\gamma}} ({\bf y}_{n}) = \int_{\sigma^{2}} \int_{\boldsymbol{\beta}} \int_{\beta_{0}} \frac{|D_{\boldsymbol{\gamma}}|^{1/2} |D_{-\boldsymbol{\gamma}}|^{1/2}}{ (2\pi)^{(n + p )/2} \sigma^{ (n + p +2)}} \exp \left[ - \frac{1}{2 \sigma^{2}} \left\{ \| {\bf y}_{n} - \beta_{0} {\bf 1} - X_{\boldsymbol{\gamma}} \boldsymbol{\beta}_{\boldsymbol{\gamma}} \|^{2} + \boldsymbol{\beta}_{\boldsymbol{\gamma}}^{\top} D_{\boldsymbol{\gamma}} \boldsymbol{\beta}_{\boldsymbol{\gamma}} + \boldsymbol{\beta}_{-\boldsymbol{\gamma}}^{\top} D_{-\boldsymbol{\gamma}} \boldsymbol{\beta}_{-\boldsymbol{\gamma}} \right\} \right] d\beta_{0} d \boldsymbol{\beta} d \sigma^{2}, \]

    where \(D_{\boldsymbol{\gamma}}\) is a diagonal matrix with diagonal entries \(\tau_{n}^{-2}\) and \(D_{-\boldsymbol{\gamma}}\) is a diagonal matrix with diagonal entries \(\nu_{n}^{-2}\), so that the slab components of \(\boldsymbol{\beta}\) have prior variance \(\sigma^{2} \tau_{n}^{2}\) and the spike components have prior variance \(\sigma^{2} \nu_{n}^{2}\).

  • If the covariates are centered, after tedious algebra one gets:

    \[ m_{\boldsymbol{\gamma}} ({\bf y}_{n}) \propto \frac{|D_{\boldsymbol{\gamma}}|^{1/2}}{|X_{\boldsymbol{\gamma}}^{\top} X_{\boldsymbol{\gamma}} + D_{\boldsymbol{\gamma}} |^{1/2}} \times \left[ {\bf y}_{n}^{\top} \left\{ I_{n} - X_{\boldsymbol{\gamma}} \left( X_{\boldsymbol{\gamma}}^{\top} X_{\boldsymbol{\gamma}} + D_{\boldsymbol{\gamma}} \right)^{-1} X_{\boldsymbol{\gamma}}^{\top} \right\} {\bf y}_{n} - n\bar{y}_{n}^{2} \right]^{- (n-1)/2}. \]

  • Therefore the posterior probability of \(M_{\boldsymbol{\gamma}}\) is obtained as \[ P \left( M_{\boldsymbol{\gamma}} \mid {\bf y}_{n}, X \right) \propto P( M_{\boldsymbol{\gamma}} ) m_{\boldsymbol{\gamma}} ({\bf y}_{n}). \] A companion R sketch is given at the end of this subsection.

  • There is no need to calculate the constant of proportionality.
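
The companion sketch below (again my own R translation) evaluates the log of the spike-and-slab marginal displayed above, up to a model-independent constant, assuming centered covariates and \(D_{\boldsymbol{\gamma}} = \tau_{n}^{-2} I\); the default \(\tau_{n}^{2} = 100\) is an arbitrary illustrative value.

    ## log m_gamma(y) under the spike-and-slab prior, up to an additive constant
    log_marg_ss <- function(y, X, gamma, tau2 = 100) {
      n <- length(y)
      if (length(gamma) == 0) return(-(n - 1) / 2 * log(sum((y - mean(y))^2)))
      Xg <- scale(X[, gamma, drop = FALSE], scale = FALSE)   # centered covariates
      D  <- diag(1 / tau2, length(gamma))                    # D_gamma = tau_n^{-2} I
      M  <- chol2inv(chol(crossprod(Xg) + D))                # (X_gamma' X_gamma + D_gamma)^{-1}
      quad <- sum(y * (y - Xg %*% (M %*% crossprod(Xg, y)))) - n * mean(y)^2
      as.numeric(0.5 * determinant(D)$modulus -
                 0.5 * determinant(crossprod(Xg) + D)$modulus -
                 (n - 1) / 2 * log(quad))
    }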

(D2) How to explore the model space?

  • The cardinality of the model space is the same as that of the set of all subsets of \(\{1, \ldots, p\}\), i.e., \(2^{p}\).

  • It is therefore infeasible to exhaustively explore the model space even when \(p\) is moderately large. For example, with \(p= 103\) in the residential building data, the cardinality of the model space is \(\approx 10^{31}\).

  • One possible way forward is to apply a stochastic search variable selection algorithm.

Stochastic Search Variable Selection (SSVS)

  • The SSVS algorithms rely on neighborhood-based Metropolis-Hastings random walks, as follows:

  • [Step 1: Initialize] Set the values of the hyperparameters, for example, \(g_{n}\), \(\nu_{n}\), \(\tau_{n}\), etc. Further, set an initial model \(M_{\boldsymbol{\gamma}^{(0)}}\).

  • [Step 2: MCMC updates] Suppose that the chain is at the model \(M_{\boldsymbol{\gamma}^{(m)}}\) for the \(m\)-th iteration.

    • [Step 2a: Define Neighborhood] Let \(N( \boldsymbol{\gamma}^{(m)})\) be the neighborhood of \(M_{\boldsymbol{\gamma}^{(m)}}\), which comprises all models that can be obtained from \(M_{\boldsymbol{\gamma}^{(m)}}\) by an addition (A), removal (R) or swap (S) move.

    • [Step 2b: Define transition function] Let \(T( \boldsymbol{\gamma}^{(m)}, \cdot)\) be the proposal transition function supported on \(N(\boldsymbol{\gamma}^{(m)})\). The transition function can be uniform over the neighborhood, or one can first select a move type (uniformly) and then take a simple random sample from the corresponding part of the neighborhood, etc.

    • [Step 2c: Proposal model] Choose a proposal model \(M_{\boldsymbol{\gamma}^{\prime}}\) randomly according to \(T(\boldsymbol{\gamma}^{(m)}, \cdot)\).

    • [Step 2d: Metropolis-Hastings step] Accept \(M_{\boldsymbol{\gamma}^{\prime}}\) with probability \(\alpha(\boldsymbol{\gamma}^{(m)}, \boldsymbol{\gamma}^{\prime})\), where

      • \[ \alpha ( \boldsymbol{\gamma}^{(m)}, \boldsymbol{\gamma}^{\prime})= \min \displaystyle\left\{1, \frac{P(M_{\boldsymbol{\gamma}^{\prime}}| {\bf y}) \times T( \boldsymbol{\gamma}^{\prime} , \boldsymbol{\gamma}^{(m)} )}{P(M_{\boldsymbol{\gamma}^{(m)} }| {\bf y} ) \times T( \boldsymbol{\gamma}^{(m)}, \boldsymbol{\gamma}^{\prime} )} \right\}. \]

      • If the proposal is accepted, set \(\boldsymbol{\gamma}^{(m+1)} = \boldsymbol{\gamma}^{\prime}\); otherwise set \(\boldsymbol{\gamma}^{(m+1)} = \boldsymbol{\gamma}^{(m)}\). The resultant model is denoted by \(M_{\boldsymbol{\gamma}^{(m+1)}}\).

  • This random walk induces an irreducible, aperiodic and positive recurrent Markov chain with stationary distribution \(P(M_{\boldsymbol{\gamma}}\mid {\bf y}, X)\).

  • Thus, one would expect that after sufficiently many iterations of this algorithm, the high posterior probability models would be visited frequently. A minimal R sketch of this random walk is given after this list.

  • How long??? That’s a tough question, and can be answered elsewhere.
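
The sketch below is my own minimal R version of SSVS. For brevity it uses only single-coordinate addition/removal moves (the swap move is omitted) with a uniform proposal over the \(p\) coordinates, so the transition function is symmetric and the Metropolis-Hastings ratio reduces to the ratio of posterior scores; log_post can be, for instance, the log_post_g() sketch from (D1).

    ## SSVS sketch: add/remove moves only, uniform proposal over the p coordinates
    ssvs <- function(y, X, log_post, n_iter = 2000, gamma0 = integer(0)) {
      p <- ncol(X)
      gamma <- gamma0
      lp <- log_post(y, X, gamma)
      visits <- vector("list", n_iter)
      for (m in 1:n_iter) {
        j <- sample(p, 1)                                    # pick one coordinate uniformly
        prop <- if (j %in% gamma) setdiff(gamma, j) else sort(c(gamma, j))
        lp_prop <- log_post(y, X, prop)
        ## T(gamma, gamma') = T(gamma', gamma) = 1/p, so alpha depends only on the posterior ratio
        if (log(runif(1)) < lp_prop - lp) { gamma <- prop; lp <- lp_prop }
        visits[[m]] <- gamma
      }
      visits
    }

    ## Example use: tabulate the most frequently visited models
    ## out <- ssvs(y, X, log_post_g)
    ## head(sort(table(sapply(out, paste, collapse = ",")), decreasing = TRUE))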

(D3) What if the posterior distribution is not in an analytic form?

  • In the above cases, we were fortunate to obtain the posterior probabilities of the models in analytic form.

  • In most cases, calculation of the marginal likelihood is infeasible or extremely cumbersome; consider, for example, the case of the double-exponential prior.

  • In such cases, one has to decide on the best model from the posterior distribution of \(\boldsymbol{\beta}\).

  • One can use Gibbs sampling in such cases.

  • Heuristically, the idea of Gibbs sampling is that, given a joint distribution one might hope that sampling iteratively from its conditional distributions would ultimately provide a sample from the joint distribution.

  • This can be analyzed by considering the Markov chain that results from the procedure, and it can be shown that under suitable conditions this is indeed the case.

Let us write an algorithm to obtain samples from the posterior distribution for Bayesian LASSO.

  • Recall the Bayesian LASSO formulation:

    • \({\bf y} \mid X, \beta_{0}, \boldsymbol{\beta}, \sigma^{2} \sim N(\beta_{0} {\bf 1} + X\boldsymbol{\beta}, \sigma^{2} I)\).

    • \(\beta_{j} \mid \lambda, \sigma^{2} \stackrel{iid}{\sim} DE(0, \sigma \lambda^{-1})\), for \(j = 1, \ldots, p\).

    • \(\pi(\beta_{0}, \sigma^{2}) \propto 1/ \sigma^{2}\).

  • Further, recall the representation of the Laplace distribution as a continuous scale mixture of Gaussians:

    \[ \frac{\lambda}{2} \exp\{ - \lambda |\beta|\} = \int_{0}^{\infty} \frac{1}{\sqrt{2\pi \tau^{2}}} \exp \left\{ - \frac{\beta^{2}}{2 \tau^{2}}\right\} \times \frac{\lambda^{2}}{2} \exp \left\{ - \frac{\lambda^{2} \tau^{2}}{2 }\right\} d \tau^{2}. \]

  • This now enables us to have an alternative (hierarchical) prior formulation as below:

    • \({\bf y} \mid X, \beta_{0}, \boldsymbol{\beta}, \sigma^{2} \sim N(\beta_{0} {\bf 1} + X\boldsymbol{\beta}, \sigma^{2} I)\).

    • \(\beta_{j} \mid \sigma^{2}, \tau_{j}^{2} \stackrel{ind}{\sim} N(0, \sigma^{2} \tau_{j}^{2})\), for \(j = 1, \ldots, p\).

    • \(\tau_{j}^{2} \stackrel{iid}{\sim} \text{Exp} (\lambda^{2}/2)\), for \(j = 1, \ldots, p\).

    • \(\pi(\beta_{0}, \sigma^{2}) \propto 1/ \sigma^{2}\).

  • Using the above hierarchical setup, the following Gibbs sampling algorithm can be developed; a minimal R sketch is given after the algorithm.

  • The trick is to write the full conditional posterior of all the parameters, i.e., to treat all other parameters as fixed while updating one parameter in the joint distribution of likelihood and prior:

    \[ f({\bf y} \mid \beta_{0}, \boldsymbol{\beta}, \sigma^{2}) \pi( \boldsymbol{\beta} \mid \boldsymbol{\tau}, \sigma^{2}) \pi( \boldsymbol{\tau} \mid \sigma^{2}) \pi(\beta_{0}, \sigma^{2}). \]

    • [Step 1: Initialization] Set the values of the hyperparameter \(\lambda\) and initial values of all the parameters \(\beta_{0}\), \(\boldsymbol{\beta}\), \(\boldsymbol{\tau}\) and \(\sigma^{2}\).

    • [Step 2: MCMC Updates] At the \(m\)-th iteration, let the updated values of the parameters be \(\boldsymbol{\beta}^{(m)}, \boldsymbol{\tau}^{(m)}, \beta_{0}^{(m)}, \sigma^{2(m)}\), \(m=1,2, \ldots\).

      • [Updating \(\beta_{0}\)] Generate \(\beta_{0}^{(m+1)}\) from the conditional posterior \(\beta_{0} \mid \text{rest} \sim N \left( \bar{y}_{n}, \sigma^{2(m)}/n \right)\) (assuming centered covariates).

      • [Updating \(\boldsymbol{\beta}\)] Let \(D_{\tau} = \text{diag} (\tau_{1}^{2}, \ldots, \tau_{p}^{2} )\). Then the conditional posterior of \(\boldsymbol{\beta}\) is \(N \left( A^{-1} X^{\top} {\bf y}, \sigma^{2} A^{-1} \right)\), where \(A = X^{\top} X + D_{\tau}^{-1}\).

      • [Updating \(\sigma^{2}\)] The conditional distribution of \(\sigma^{2}\) is inverse gamma with shape parameter \((n-1)/2 + p/2\) and scale parameter \(\left( {\bf y} - \bar{y} {\bf 1} - X \boldsymbol{\beta} \right)^{\top} \left( {\bf y} - \bar{y} {\bf 1} - X \boldsymbol{\beta} \right) /2 + \boldsymbol{\beta}^{\top} D_{\tau}^{-1} \boldsymbol{\beta}/ 2\).

      • [Updating \(\boldsymbol{\tau}\)] For each \(j= 1, \ldots, p\), the conditional posterior of \(\tau_{j}^{-2}\) is inverse Gaussian with mean parameter \(\lambda \sigma/|\beta_{j}|\) and shape parameter \(\lambda^{2}\).
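
A minimal R sketch of this Gibbs sampler is given below. It assumes centered covariates (so that the \(\beta_{0}\) update above applies), keeps the hyperparameter \(\lambda\) fixed, and uses statmod::rinvgauss for the inverse-Gaussian draws; the starting values and number of iterations are arbitrary illustrative choices.

    ## Gibbs sampler sketch for the Bayesian LASSO hierarchy above
    library(statmod)                                          # for rinvgauss()

    blasso_gibbs <- function(y, X, lambda = 1, n_iter = 2000) {
      n <- nrow(X); p <- ncol(X)
      X <- scale(X, scale = FALSE)                            # center the covariates
      beta <- rep(0, p); tau2 <- rep(1, p); sigma2 <- var(y)  # initial values
      draws <- matrix(NA, n_iter, p)
      for (m in 1:n_iter) {
        beta0 <- rnorm(1, mean(y), sqrt(sigma2 / n))                       # [beta_0 | rest]
        A     <- crossprod(X) + diag(1 / tau2, p)                          # A = X'X + D_tau^{-1}
        A_inv <- chol2inv(chol(A))
        beta  <- drop(A_inv %*% crossprod(X, y - beta0) +
                      t(chol(sigma2 * A_inv)) %*% rnorm(p))                # [beta | rest]
        resid <- y - beta0 - drop(X %*% beta)
        sigma2 <- 1 / rgamma(1, shape = (n - 1) / 2 + p / 2,
                             rate = sum(resid^2) / 2 + sum(beta^2 / tau2) / 2)   # [sigma^2 | rest]
        tau2 <- 1 / rinvgauss(p, mean = lambda * sqrt(sigma2) / abs(beta),
                              shape = lambda^2)                            # [1/tau_j^2 | rest]
        draws[m, ] <- beta
      }
      draws                                                   # posterior draws of beta
    }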

(E) Application to Simulated and Real Data.

  1. Simulated Data:
  • n = 100, p = 50.

  • Covariates are generated from a multivariate normal distribution with mean \({\bf 0}\) and an AR(1) variance-covariance matrix with autocorrelation parameter \(0.5\).

  • There are 3 active covariates, each with regression coefficient \(1\), and \(\sigma^{2} = 1\).

  • Results of \(g\)-prior with SSVS:

    • \(g_{n} = n\), \(\omega_{n} = 1/\sqrt{p}\): results are sensitive to the choice of these hyperparameters.

    • Among the \(1000\) posterior samples, the true model is visited \(367\) times, a sub-model of the true model (containing 2 out of 3 true covariates) is visited \(33\) times, and all other models are visited less than \(20\) times.

  • Results of Bayesian LASSO: the means of the posterior samples of the regression coefficients are given below:

  • The (empirical) posterior distribution of the regression coefficients corresponding to the true covariates and all other covariates (together) is given below:

  2. Residential Building Data:
  • The (Frequentist) LASSO estimation gives:

  • The mean of posterior samples of Bayesian LASSO provides:

  • The \(g\)-prior with SSVS selects the singleton model with covariate ‘\(X_{8}\)’ \(257\) times out of \(1000\), and no other model is selected more than \(15\) times.