\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\prob}[1]{\operatorname{P}\left(#1\right)}\) \(\newcommand{\Var}[1]{\operatorname{Var}\left(#1\right)}\) \(\newcommand{\E}[1]{\operatorname{E}\left(#1\right)}\) \(\newcommand{\Cov}[1]{\operatorname{Cov}\left(#1\right)}\) \(\newcommand{\Cor}[1]{\operatorname{Corr}\left(#1\right)}\) \(\newcommand{\sd}[1]{\operatorname{sd}\left(#1\right)}\) \(\newcommand{\prob}[1]{\operatorname{P}\left(#1\right)}\) \(\DeclareMathOperator*{\minf}{minimize \quad}\) \(\newcommand{\mininlineeq}[3]{\mbox{minimize}_{#1}\quad#2\qquad\mbox{subject to }#3}\) \(\newcommand{\mininline}[2]{\mbox{minimize}_{#1}\quad#2}\) \(\def\KL{\mathrm{KL}}\) \(\def\LB{\mathrm{LB}}\) \(\def\E{\mathrm{E}}\) \(\def\l{\phi}\) \(\def\t{\theta}\) \(\def\veps{\epsilon}\) \(\def\N{\mathrm{N}}\)

Introduction

Bayesian shrinkage, also known as regularization in the context of linear models, is a central tool in modern statistical and machine learning methods. Many applications involve high-dimensional sparse signal recovery problems.

Regularization is a machine learning technique that allows for a trade-off between model complexity and out-of-sample performance, that is, between bias and variance. To fix ideas, consider regularization in the context of a linear model, where an output \(y\) is generated by \[\begin{equation}\label{eq:lm} y = x^T\beta + \epsilon,~~~\epsilon\sim p(\epsilon). \end{equation}\] Assuming normally distributed errors, \(p(\epsilon) = N(0,\sigma^2_{\epsilon})\), the corresponding regularized maximum likelihood estimate solves \[ \mininlineeq{\beta}{\lVert y- X\beta\rVert ^2_2 }{\sum_{i=1}^{p}\phi(\beta_i) \le s.} \] Here, \(y\) is the vector of observed outputs, \(X\) is a design matrix, and \(\beta\) are the model parameters. Each \(\beta_i\) has a regularization penalty \(\phi(\beta_i)\), and \(s\) is a hyper-parameter that controls the bias-variance trade-off.

Regularization can be viewed as a constraint on the model space. The techniques were originally applied to solve ill-posed problems, where a slight change in the initial data could significantly alter the solution. Regularization techniques were then proposed for reconstructing the parameters of a physical system, modeled by a linear operator, from a set of observations. It had long been believed that ill-conditioned problems offered little practical value, until Tikhonov published his seminal paper (Tikhonov 1943) on regularization. Tihonov (1963) proposed methods for solving regularized problems of the form \[ \minf_\beta \lVert y- X\beta\rVert ^p_p + \lambda\lVert \beta - \beta^{(0)}\rVert ^q_q. \] Here \(\lambda\) is the weight on the regularization penalty and the \(\ell_q\)-norm is defined by \(\lVert \beta\rVert _q^q = \sum_i |\beta_i|^q\). This optimization problem is a Lagrangian form of the constrained problem with \(\phi(\beta_i) = |\beta_i - \beta_i^{(0)}|^q\).

Bayesian Regularization Duality

From a Bayesian perspective, regularization is performed by defining a prior distribution over the model parameters. A Bayesian linear regression model is defined as \[y = x^T\beta+ \epsilon,~~\epsilon\sim N(0,\sigma^2_{\epsilon}),~~~\beta\sim p(\beta\mid \tau).\] The negative log of the posterior distribution is then given by \[- \log p(\beta\mid X,y) = \dfrac{1}{2\sigma_{\epsilon}^2} \sum_{i} (y_i - x_i^T\beta)^2 - \log p(\beta\mid \tau) + \mathrm{const}.\]

A regularized maximum a posteriori probability (MAP) estimator can be found by minimizing the negative log-posterior \[\label{eqn:reg} \begin{aligned} & \hat \beta_{\mathrm{MAP}} = \underset{\beta \in R^p}{\text{argmin}} & & \lVert y-X\beta\rVert _2^2+ \phi_{\tau}(\beta), \end{aligned}\] where \(\phi_{\tau}(\beta)\propto -\log p(\beta \mid \tau)\). The penalty term is interpreted as the negative log of the prior distribution and is parameterized by the hyper-parameters \(\tau\). The resulting MAP estimator is equivalent to the classical approach of constraining the search space by adding a penalty.

Table 1 provides penalty functions and their corresponding prior distributions, including lasso, ridge, Cauchy and horseshoe.

Table 1: Prior distributions and corresponding penalty functions (negative log-probability)

Ridge: prior \(p(\beta_i\mid \tau) = \frac{1}{\sqrt{2 \pi } \tau }\exp\left(-\frac{\beta_i^2}{2 \tau ^2}\right)\), penalty \(\phi_{\tau}(\beta_i) = \dfrac{1}{2\tau^2}\beta_i^2\)
Lasso: prior \(p(\beta_i\mid \tau) = \frac{1}{2 \tau }\exp\left(-\frac{|\beta_i|}{\tau}\right)\), penalty \(\phi_{\tau}(\beta_i) = \dfrac{|\beta_i|}{\tau}\)
Cauchy: prior \(p(\beta_i\mid \tau) = \frac{\tau }{\pi \left(\tau ^2+ \beta_i^2\right)}\), penalty \(\phi_{\tau}(\beta_i) = \log\left( \tau ^2+ \beta_i^2\right)\)
Horseshoe: prior bounded above by \(\pi\sqrt{\dfrac{\pi}{2}}\log \left ( 1 + \frac{2 \tau^2}{\beta_i^2} \right )\), penalty \(\phi_{\tau}(\beta_i) = - \log \log \left ( 1 + \frac{2 \tau^2}{\beta_i^2} \right )\)
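To see how these penalties differ, a minimal sketch in R plots the four penalty functions from Table 1 for a fixed scale; the value \(\tau = 1\) and the plotting range are chosen only for illustration, and the additive constants dropped from the negative log-priors do not change the shapes.

# Penalty functions from Table 1 for a fixed scale tau = 1
# (additive constants from the negative log-priors are dropped).
tau  = 1
beta = seq(-3, 3, length.out = 400)
pen.ridge     = beta^2 / (2 * tau^2)
pen.lasso     = abs(beta) / tau
pen.cauchy    = log(tau^2 + beta^2)
pen.horseshoe = -log(log(1 + 2 * tau^2 / beta^2))
plot(beta, pen.ridge, type = "l", ylim = c(-3, 5), xlab = "beta", ylab = "penalty")
lines(beta, pen.lasso, lty = 2)
lines(beta, pen.cauchy, lty = 3)
lines(beta, pen.horseshoe, lty = 4)
legend("top", legend = c("ridge", "lasso", "Cauchy", "horseshoe"), lty = 1:4)

The plot makes the qualitative difference visible: the ridge and lasso penalties grow without bound, while the Cauchy and horseshoe penalties grow only logarithmically in \(|\beta_i|\), so large signals are shrunk far less.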

A typical approach in Bayesian analysis is to define normal scale mixture priors, which are constructed as a hierarchical model of the form \[\label{eq:scalemix} \beta_i \mid \lambda_i,\tau \sim N(0,\tau^2\lambda_i^2),~~~p(\lambda^2,\sigma^2,\tau) = p(\lambda^2)p(\sigma^2)p(\tau).\]

While the classical approach requires solving an optimization problem, the Bayesian approach requires calculating integrals. In conjugate models, e.g. when both the likelihood and the prior are normal (ridge), those integrals can be calculated analytically, but this is not possible in the general case. Efficient numerical techniques for sampling from the posterior distribution are then required. (George and McCulloch 1993) proposed a Gibbs sampler for finding the posterior of the following model \[\beta_i\mid \gamma_i \sim (1-\gamma_i)N(0,\tau^2_i) + \gamma_iN(0,c_i^2\tau_i^2),~~~ p(\gamma_i=1) = p_i,\] where \(\tau_i\) is chosen to be small, so that for \(\gamma_i = 0\) the estimated \(\beta_i\) is close to zero, and \(c_i\) is large, so that for \(\gamma_i = 1\) the estimated \(\beta_i\) is not shrunk. Variable selection is then performed by calculating the posterior distribution over \(\gamma\), \[p(\gamma\mid X,y ) \propto p(y\mid X,\gamma)p(\gamma).\]

(Carlin and Polson 1991) proposed Gibbs sampling MCMC for the class of scale mixtures of normals, taking the form \[\epsilon_j \mid \sigma,\lambda_j \sim N(0,\lambda_j\sigma^2),~~~\lambda_j \sim p(\lambda_j).\] A small simulation below illustrates this construction; we then turn to the lasso and to ridge regression.
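As a concrete illustration of this class of error models (an aside, not one of the models used below), an inverse-gamma mixing distribution for \(\lambda_j\) yields Student-t errors. The short simulation below checks this by comparing the mixture draws with the corresponding t quantiles; the choices \(\nu = 5\), \(\sigma = 1\) and the sample size are illustrative only.

# Scale mixture of normals: inverse-gamma mixing weights give Student-t errors.
# Illustrative values: nu = 5 degrees of freedom, sigma = 1, 1e5 draws.
set.seed(1)
nu = 5; sigma = 1
lambda = 1 / rgamma(1e5, nu / 2, nu / 2)        # lambda_j ~ InverseGamma(nu/2, nu/2)
eps = rnorm(1e5, 0, sqrt(lambda * sigma^2))     # eps_j | lambda_j ~ N(0, lambda_j * sigma^2)
qqplot(qt(ppoints(1e5), df = nu), eps,
       xlab = "t(5) quantiles", ylab = "mixture draws")
abline(0, 1)

Up to Monte Carlo noise, the points fall on the 45-degree line, since the marginal distribution of this mixture is exactly \(t_{\nu}\).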

Lasso

From a Bayesian perspective, the lasso (Tibshirani 1996) is equivalent to specifying a double exponential (Laplace) prior distribution (Carlin and Polson 1991) for each parameter \(\beta_i\), with \(\sigma^2\) fixed, \[p(\beta_i \mid b) = \dfrac{1}{2b}\exp(-|\beta_i|/b).\] Bayes rule combines the normal likelihood with the Laplace prior to yield \[-\log p(\beta \mid X,y, b) \propto \lVert y-X\beta\rVert _2^2 + \dfrac{2\sigma^2}{b}\lVert \beta\rVert _1.\] For \(b>0\), the posterior mode is equivalent to the lasso estimate with \(\lambda = 2\sigma^2/b\). A large prior scale \(b\) corresponds to a small penalty weight \(\lambda\) in the lasso objective function.

The Laplace prior used in the lasso can be represented as a scale mixture of normal distributions (Andrews and Mallows 1974; Carlin and Polson 1991) \[\begin{aligned} \beta_i \mid \sigma^2,\tau \sim &N(0,\tau^2\sigma^2)\\ \tau^2 \mid \alpha \sim &\exp (\alpha^2/2)\\ \sigma^2 \sim & \pi(\sigma^2).\end{aligned}\] The equivalence with the lasso penalty is obtained by integrating out \(\tau^2\), \[p(\beta_i\mid \sigma^2,\alpha) = \int_{0}^{\infty} \dfrac{1}{\sqrt{2\pi \sigma^2\tau^2}}\exp\left(-\dfrac{\beta_i^2}{2\sigma^2\tau^2}\right)\dfrac{\alpha^2}{2}\exp\left(-\dfrac{\alpha^2\tau^2}{2}\right)d\tau^2 = \dfrac{\alpha}{2\sigma}\exp\left(-\dfrac{\alpha}{\sigma}|\beta_i|\right).\] Thus the marginal is a Laplace distribution with location 0 and scale \(\sigma/\alpha\).
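The identity can also be verified numerically. The sketch below, with illustrative values \(\sigma = 1\), \(\alpha = 2\) and \(\beta_i = 0.7\), integrates the mixture over \(\tau^2\) and compares the result to the Laplace density.

# Numerical check that the normal scale mixture integrates to the Laplace density.
# Illustrative values: sigma = 1, alpha = 2, beta = 0.7.
sigma = 1; alpha = 2; beta = 0.7
mix = integrate(function(tau2)
        dnorm(beta, 0, sqrt(sigma^2 * tau2)) * dexp(tau2, rate = alpha^2 / 2),
      lower = 0, upper = Inf)$value
lap = alpha / (2 * sigma) * exp(-alpha * abs(beta) / sigma)
c(mixture = mix, laplace = lap)   # both approximately exp(-1.4) = 0.2466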

(Carlin and Polson 1991; Carlin, Polson, and Stoffer 1992; Park and Casella 2008) used the representation of the Laplace prior as a scale mixture of normals to develop a Gibbs sampler that iteratively samples from \(\beta \mid b,y\) and \(b\mid \beta,y\) to estimate the joint posterior distribution of \((\beta, b)\). Thus, we do not need to apply cross-validation to find the optimal value of \(b\); the Bayesian algorithm does it “automatically”. Given data \(D = (X,y)\), where \(X\) is the \(n\times p\) matrix of standardized regressors and \(y\) is the \(n\)-vector of outputs, we can implement a Gibbs sampler for this model using the scale mixture of normals representation of the Laplace prior: \[\begin{aligned} \beta \mid \sigma^2,\tau_1,\ldots,\tau_p \sim & N(0,\sigma^2D_{\tau})\\ D_{\tau} = & \mathrm{diag}(\tau_1^2,\ldots,\tau_p^2)\\ \tau_i^2 \mid \lambda \sim &\exp (\lambda^2/2)\\ p(\sigma^2) \propto & \ 1/\sigma^2.\end{aligned}\]

Then the complete conditionals required for Gibbs sampling are given by \[\begin{aligned} \beta \mid D,D_{\tau} \sim & N(A^{-1}X^Ty,\sigma^2A^{-1}), ~~A = X^TX + D^{-1}_{\tau}\\ \sigma^2 \mid \beta,D,D_{\tau} \sim & \mathrm{InverseGamma}\left((n-1)/2+p/2,(y-X\beta)^T(y-X\beta)/2 + \beta^TD_{\tau}^{-1}\beta/2\right)\\ 1/\tau_j^2 \mid \beta_j,\lambda \sim & \mathrm{InverseGaussian}\left(\sqrt{\dfrac{\lambda^2\sigma^2}{\beta_j^2}},\lambda^2\right)\end{aligned}\] The formulas above assume that \(X\) is standardized, i.e. the observations for each feature are scaled to have mean 0 and standard deviation one, and that \(y\) is centered, \(y = y - \bar y\).

You can use empirical values to initialize the parameters as follows \[\begin{aligned} \beta = & (X^TX + I)^{-1}X^Ty\\ r = & y - X\beta\\ \sigma^2 = & r^Tr/n\\ \tau^{-2} = &1/(\beta \odot \beta)\\ \lambda = & p \sqrt{\sigma^2} / \sum|\beta|.\end{aligned}\] Here \(n\) is the number of rows (observations) and \(p\) is the number of columns (inputs) in the matrix \(X\).
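Putting the complete conditionals and this initialization together, a minimal sketch of the sampler in R might look as follows. The function name lasso.gibbs is ours; statmod::rinvgauss is used for the inverse-Gaussian draws and MASS::mvrnorm for the multivariate normal draws, and \(\lambda\) is kept fixed at its empirical value for simplicity (a fully Bayesian treatment would place a prior on \(\lambda\) and update it as well).

library(MASS)      # mvrnorm for the multivariate normal draws
library(statmod)   # rinvgauss for the inverse-Gaussian draws

lasso.gibbs = function(X, y, iter = 5000, burn = 1000) {
  n = nrow(X); p = ncol(X)
  # empirical initialization, as suggested above
  beta    = solve(t(X) %*% X + diag(p), t(X) %*% y)
  r       = y - X %*% beta
  sigma2  = sum(r * r) / n
  invtau2 = 1 / (beta * beta)
  lambda  = p * sqrt(sigma2) / sum(abs(beta))   # kept fixed in this sketch
  beta.samples   = matrix(NA, iter, p)
  sigma2.samples = numeric(iter)
  for (t in 1:iter) {
    # beta | y, sigma^2, D_tau ~ N(A^{-1} X'y, sigma^2 A^{-1}), A = X'X + D_tau^{-1}
    A    = t(X) %*% X + diag(as.vector(invtau2))
    Ainv = solve(A)
    beta = mvrnorm(1, mu = Ainv %*% t(X) %*% y, Sigma = sigma2 * Ainv)
    # sigma^2 | y, beta, D_tau ~ InverseGamma((n-1)/2 + p/2, e'e/2 + beta' D_tau^{-1} beta/2)
    e      = y - X %*% beta
    sigma2 = 1 / rgamma(1, (n - 1) / 2 + p / 2,
                        sum(e * e) / 2 + sum(beta^2 * invtau2) / 2)
    # 1/tau_j^2 | beta_j, lambda ~ InverseGaussian(sqrt(lambda^2 sigma^2 / beta_j^2), lambda^2)
    invtau2 = rinvgauss(p, mean = sqrt(lambda^2 * sigma2 / beta^2), shape = lambda^2)
    beta.samples[t, ] = beta
    sigma2.samples[t] = sigma2
  }
  keep = (burn + 1):iter
  list(beta = beta.samples[keep, ], sigma2 = sigma2.samples[keep])
}

With standardized inputs and centered outputs, e.g. fit = lasso.gibbs(scale(X), y - mean(y)), the column-wise posterior medians of fit$beta can be compared with the classical lasso estimates.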

Ridge

When the prior is normal, \(\beta_i \sim N(0,\sigma_{\beta}^2)\), the posterior mode is equivalent to the ridge estimate (Hoerl and Kennard 1970). The variance of the prior is inversely proportional to the penalty weight in ridge regression, \(\lambda\propto 1/\sigma_{\beta}^2\).

Thus, the lasso and ridge estimates are both maximum a posteriori (MAP) estimates, under Laplace and normal priors respectively.

Given the design matrix \(X\) and observed output values \(y = (y_1,\ldots,y_n)\), and assuming \(\epsilon \sim N(0,\sigma^2)\), the MLE is given by the solution to the optimization problem \[\mininline{\beta}{\lVert X\beta - y\rVert_2^2}\] and the solution is given by \[\hat\beta = \left(X^TX\right)^{-1}X^Ty.\] However, when the matrix \(X\) is close to being rank-deficient, \(X^TX\) will be ill-conditioned, which means that the problem of estimating \(\beta\) will also be ill-conditioned. For a linear model, we can quantify the sensitivity to a perturbation in \(y\) by \[\dfrac{\lVert \Delta \beta\rVert}{\lVert\beta\rVert} \le \dfrac{\kappa(X^TX)}{\cos\theta}\dfrac{\lVert\Delta (X^Ty)\rVert}{\lVert X^Ty\rVert},\] where \(\theta\) is the angle between \(X^Ty\) and the range of \(X^TX\), and \(\kappa(X^TX)\) is the condition number, the ratio of the largest to the smallest eigenvalue of \(X^TX\).

A simple example arises when \(y\) is nearly orthogonal to \(x\): \[x = \left[\begin{array}{c} 1 \\ 0 \end{array} \right], \ y^{(1)} = \left[\begin{array}{c} \epsilon \\ 1 \end{array}\right].\] The solution to the problem is \(\hat\beta^{(1)} = \epsilon\); but the solution for \[x = \left[\begin{array}{c} 1 \\ 0 \end{array} \right], \ y^{(2)} = \left[\begin{array}{c} -\epsilon \\ 1 \end{array}\right]\] is \(\hat\beta^{(2)} = -\epsilon\). Note that \(\lVert y^{(1)}- y^{(2)}\rVert / \lVert y^{(1)}\rVert \approx 2\epsilon\) is small, but \(|\hat{\beta}^{(1)} - \hat{\beta}^{(2)}|/|\hat{\beta}^{(1)}| = 2\) is huge.

Another case in which a least squares problem is ill-conditioned is when the columns of \(X\) are close to being linearly dependent, which happens, for example, when input variables are correlated. Consider the example \[X = \left( \begin{array}{cc} 1 & 1 \\ 1 & \epsilon +1 \\ \end{array} \right),~~~y = \left( \begin{array}{c} 2\\\delta+2 \end{array} \right).\] The MLE estimate is given by \[\hat\beta = \left(2-\frac{\delta }{\epsilon },\frac{\delta }{\epsilon }\right).\] For \(\delta = 0\), we have \(\hat \beta^{(1)}= (2,0)\), but for \(\delta = \epsilon\), we have \(\hat \beta^{(2)} = (1,1)\), with both \(\epsilon\) and \(\delta\) arbitrarily small. We can calculate the condition number analytically: \[\kappa(X^TX) = \frac{\epsilon ^2+(\epsilon +2) \sqrt{\epsilon ^2+4}+2 \epsilon +4}{\epsilon ^2-(\epsilon +2) \sqrt{\epsilon ^2+4}+2 \epsilon +4}.\] It goes to infinity as \(\epsilon\) goes to zero. Since the condition number is the ratio of eigenvalues, \[\kappa(X^TX) = \dfrac{\lambda_{\mathrm{\max}}(X^TX)}{\lambda_{\mathrm{\min}}(X^TX)},\] and in our case \(\lambda_{\mathrm{\min}}(X^TX)\) is close to zero, we can improve the condition number by shifting the spectrum, \(\lambda(A+\alpha I) = \lambda(A) + \alpha\), so that \[\kappa(X^TX+\alpha I) = \dfrac{\lambda_{\mathrm{\max}}(X^TX)+\alpha}{\lambda_{\mathrm{\min}}(X^TX)+\alpha}.\]
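A short sketch illustrates the spectrum shift numerically on the near-collinear design above; the values \(\epsilon = 10^{-3}\) and \(\alpha = 0.1\) are chosen only for illustration.

# Condition number of X'X before and after the spectrum shift
# (illustrative values eps = 1e-3 and alpha = 0.1).
eps   = 1e-3
X     = matrix(c(1, 1, 1, 1 + eps), nrow = 2)   # columns are nearly collinear
XtX   = t(X) %*% X
alpha = 0.1
kappa(XtX, exact = TRUE)                        # enormous: the problem is ill-conditioned
kappa(XtX + alpha * diag(2), exact = TRUE)      # much smaller after the shift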

The spectrum shift addresses the issue of numerical instability when \(X^T X\) is ill-conditioned, which is always the case when \(p\) is large relative to \(n\). The solution is then given by \[\hat{\beta} = ( X^T X + \lambda I )^{-1} X^T y.\] The corresponding objective function that leads to this regularized solution is \[\label{eq:ridge} \mathop{\mathrm{minimize \quad}}_\beta \lVert y- X\beta\rVert _2^2 + \lambda\lVert \beta\rVert _2^2.\]
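The closed-form solution can be checked against a direct numerical minimization of the ridge objective. The sketch below does this on simulated data; the data-generating coefficients and \(\lambda = 2\) are illustrative choices.

# The closed-form ridge solution coincides with the minimizer of the
# penalized objective (simulated data and lambda = 2 are illustrative).
set.seed(1)
n = 50; p = 3; lambda = 2
X = matrix(rnorm(n * p), n, p)
y = X %*% c(1, -2, 0.5) + rnorm(n)
beta.closed = solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
ridge.obj   = function(b) sum((y - X %*% b)^2) + lambda * sum(b^2)
beta.optim  = optim(rep(0, p), ridge.obj, method = "BFGS")$par
cbind(closed.form = as.vector(beta.closed), numerical = beta.optim)  # the two columns agree closely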

An alternative, constrained formulation is \[\begin{aligned} \label{eq:tikhonov-constr} \mathop{\mathrm{minimize \quad}}_\beta \lVert y- X\beta\rVert _2^2 \qquad\mbox{subject to}\quad \lVert \beta\rVert _2^2 \le s.\end{aligned}\]

We can think of the constraint as a budget on the size of \(\beta\). In statistics this problem is known as ridge regression.

Gibbs Sampler for Ridge

The likelihood is given by \[ y = x^T\beta + \epsilon,~~~\epsilon\sim N(0,\phi^{-1}), \] where \(\phi\) is the error precision. Again, we assume that each input variable \(j\) has been centered and rescaled so that \[ x_j = \dfrac{x_j - \bar x_j}{\sqrt{\sum_i \left(x_{ij} - \bar x_j\right)^2}}. \]

The Bayesian formulation of ridge regression assumes a normal prior on the parameters. We start with a reference prior on the intercept, \(p(\beta_0\mid \phi )\propto \phi^{-1}\), and a normal prior on the rest, \[ \beta_i \mid \phi,k \sim \mathrm{N}\left(0, 1/(\phi k)\right). \] The log of the posterior is then, up to an additive constant, \[ \log p(\beta\mid X,y) = -\dfrac{\phi}{2}\left(\sum_{i} (y_i - x_i^T\beta)^2 + k \lVert \beta \rVert^2\right). \] The problem is convex and we can find an analytical expression for the mode of the log-posterior, which is called the MAP (maximum a posteriori) estimator: \[ -\dfrac{1}{\phi}\dfrac{\partial \log p(\beta\mid X,y)}{\partial \beta} = X^T\left(X\beta - y\right) + k\beta. \] Setting this gradient to zero leads to the ridge estimator \[ \hat\beta_{\mathrm{ridge}} = \left(X^TX + kI\right)^{-1}X^Ty. \] Compare it to the classical ridge estimator, found by minimizing the penalized residual sum of squares

\[ \sum_{i} (y_i - x_i^T\beta)^2 + \lambda \lVert \beta \rVert^2. \] Thus, the MAP estimator from the Bayes model is the same as the classical ridge estimator with \(\lambda = k\).

We will use the longley data set from the datasets package. The data set contains 16 observations on 7 variables:

- GNP.deflator: GNP implicit price deflator
- GNP: Gross National Product
- Unemployed: number of unemployed
- Armed.Forces: number of people in the armed forces
- Population: population size
- Year: year
- Employed: number of employed

library(MASS)
data("longley")
plot(longley, pch=16, cex=0.5)

Let’s estimate a linear model using ridge regression

longley.ridge = lm.ridge(Employed ~ ., data=longley, lambda=seq(0, 0.1, 0.0001))
plot(longley.ridge)

The trace plot shows how the coefficients shrink as a function of the penalty size \(\lambda\)

select(longley.ridge)
## modified HKB estimator is 0.004275357 
## modified L-W estimator is 0.03229531 
## smallest value of GCV  at 0.0028

Or we can select the penalty manually using leave-one-out cross-validation

k = seq(0, 0.1, 0.0001)
n.k = length(k); n = nrow(longley)
cv.lambda = matrix(NA, n, n.k)
# squared prediction error for observation i when it is left out
# and the model is fit with penalty k[j]
rmse.ridge = function(data, i, j, k) {
  m.ridge = lm.ridge(Employed ~ ., data = data, lambda = k[j], subset = -i)
  yhat = scale(data[i, 1:6, drop = FALSE], center = m.ridge$xm,
               scale = m.ridge$scales) %*% m.ridge$coef + m.ridge$ym
  (yhat - data$Employed[i])^2
}
for (i in 1:n) {
  for (j in 1:n.k) {
    cv.lambda[i, j] = rmse.ridge(longley, i, j, k)
  }
}
cv.error = apply(cv.lambda, 2, mean)
plot(k, cv.error, type = "l")

The best performance is achieved at

k[which.min(cv.error)]
## [1] 0.0028

Now the best model is

best.lambda = k[which.min(cv.error)]
longley.RReg = lm.ridge(Employed ~ ., data=longley, lambda=best.lambda)
knitr::kable(coef(longley.RReg))
x
(Intercept) -2950.3475198
GNP.deflator -0.0005381
GNP -0.0182264
Unemployed -0.0176111
Armed.Forces -0.0096073
Population -0.1185103
Year 1.5578560

Since we have already shown that the classical ridge estimator is equivalent to the MAP estimator of the Bayes model, we can extend the Bayes model to include a prior on \(k\), so that the amount of regularization is selected as part of the inference process. We will consider a slightly more general problem.

We consider a more general model and will assume that the prior precision of the \(\beta\) coefficients does not depend on \(\phi\). Instead of assuming the same precision for all of the \(\beta_i\), we will allow different scales and model them via a precision matrix \(C\): \[ \beta_i \mid k_i \sim N(0,k_i^{-1}). \] We define \(C = \mathrm{diag}(k_1,k_2,\ldots,k_p)\).

To maintain conjugacy, we select a Gamma prior on \(k_i\), \[ k_i \sim \mathrm{Gamma}(\kappa/2,h/2). \] We also assume a Gamma prior for the error precision \(\phi\), \[ \phi \sim \mathrm{Gamma}(a/2,b/2). \]

Then the conditional posteriors are \[ \beta \mid y,\phi,C \sim N(b,B^{-1}) \] with \(B = C + \phi X^TX\) and \(b = B^{-1}\phi X^Ty\) (multivariate normal-normal update). If we know \(y\), \(\beta\) and \(C\), the conditional posterior for the error precision \(\phi\) depends on the data only through the residual \(e = y - X\beta\), and the updated Gamma for \(\phi\) is \[ \phi \mid y, \beta, C \sim \mathrm{Gamma}((a+n)/2,(b+e^Te)/2), \] and finally the posterior for \(k_i\) is \[ k_i \mid \beta_i \sim \mathrm{Gamma}((\kappa+1)/2,(h+\beta_i^2)/2). \]

Now let’s put it all together. We will scale the data so we do not need to worry about the intercept.

set.seed(7)
longley.s = as.data.frame(scale(longley))
knitr::kable(colMeans(longley.s))
x
GNP.deflator 0
GNP 0
Unemployed 0
Armed.Forces 0
Population 0
Year 0
Employed 0
knitr::kable(head(longley.s))
GNP.deflator GNP Unemployed Armed.Forces Population Year Employed
1947 -1.7310992 -1.5434331 -0.8960348 -1.4609267 -1.4111352 -1.5753151 -1.4219946
1948 -1.2214414 -1.2905329 -0.9292089 -1.6534776 -1.2639263 -1.3652731 -1.1944868
1949 -1.2492409 -1.3043364 0.5229601 -1.4235660 -1.0998977 -1.1552311 -1.4652752
1950 -1.1287763 -1.0372705 0.1687464 -1.3747098 -0.9337126 -0.9451891 -1.1759787
1951 -0.5079204 -0.5908091 -1.1710587 0.7074273 -0.7689652 -0.7351470 -0.5968163
1952 -0.3318568 -0.4094719 -1.3497707 1.4187163 -0.5971736 -0.5251050 -0.4777947
# split the data
train = sample(1:nrow(longley.s),as.integer(0.8*nrow(longley.s)))
m.ridge = lm.ridge(Employed ~ ., data = longley.s, lambda=0.0028, subset = train)

Then we define our sampling functions

# sample the diagonal precision matrix C given shape k and rates h
sample.c = function(k,h,p) {
  C = matrix(0,nrow=p, ncol = p)
  for (i in 1:p) {
    C[i,i] = rgamma(1,k/2,h[i]/2)
  }
  return(C)
}

ridge.gibbs = function(x,y,k,h,a,b, iter=20000, burn = 2000) {
  p = ncol(x)
  n = nrow(x)
  # Initialize by drawing from the priors
  C = sample.c(k,rep(h,p),p)
  phi = rgamma(1,a/2,b/2)
  beta.samples = matrix(NA,nrow = iter,ncol = p)
  phi.samples = matrix(NA,nrow = iter,ncol = 1)
  k.samples = matrix(NA,nrow = iter,ncol = p)
  for (i in 1:iter) {
    # beta | y, phi, C ~ N(B^{-1} phi X'y, B^{-1}), B = C + phi X'X
    B = C + t(x)%*%x*phi
    bm = solve(B,t(x)%*%y*phi)
    beta = mvrnorm(n = 1, mu=bm, Sigma=solve(B), tol = 1e-6)
    # phi | y, beta, C ~ Gamma((a+n)/2, (b + e'e)/2)
    e = y - x%*%beta
    e2 = sum(e*e)
    phi = rgamma(1,(a+n)/2,(b+e2)/2)
    # k_i | beta_i ~ Gamma((kappa+1)/2, (h + beta_i^2)/2)
    C = sample.c(k+1,h+beta*beta,p)
    beta.samples[i,] = beta
    phi.samples[i,] = phi
    k.samples[i,] = diag(C)
  }
  keep = burn:iter
  colnames(beta.samples) = colnames(x)
  colnames(k.samples) = colnames(x)
  return(list(beta.samples = beta.samples[keep,], 
              phi.samples = phi.samples[keep,], 
              k.samples = k.samples[keep,]))
}

Let’s run our model and look at the results

x = as.matrix(longley.s[train,1:6])
y = as.matrix(longley.s[train,7])
m.ridge.gibbs = ridge.gibbs(x,y,1,1,1,1)
knitr::kable(data.frame(classical.ridge = coef(m.ridge)[-1],
                        gibbs.median = apply(m.ridge.gibbs$beta.samples,2,median),
                        gibbs.mean = apply(m.ridge.gibbs$beta.samples,2,mean))
)
classical.ridge gibbs.median gibbs.mean
GNP.deflator -0.0016536 -0.0619692 -0.0614489
GNP -0.5158391 -0.1292012 -0.1300381
Unemployed -0.4685953 -0.3967750 -0.3965108
Armed.Forces -0.1903741 -0.1784846 -0.1781146
Population -0.2347315 -0.3983877 -0.3981136
Year 2.1118864 1.9012120 1.9014959
par(mfrow = c(2,3))
hist(m.ridge.gibbs$beta.samples[,'GNP'], main="", xlab= 'beta for GNP')
abline(v=coef(m.ridge)[3], col='red', lwd=2)
hist(m.ridge.gibbs$beta.samples[,'Year'], main="", xlab= 'beta for Year')
abline(v=coef(m.ridge)[7], col='red', lwd=2)
hist(m.ridge.gibbs$beta.samples[,'Population'], main="", xlab= 'beta for Population')
abline(v=coef(m.ridge)[6], col='red', lwd=2)

hist(m.ridge.gibbs$k.samples[,'GNP'], main = "", xlab= 'precision for GNP')
hist(m.ridge.gibbs$k.samples[,'Year'], main = "", xlab= 'precision for Year')
hist(m.ridge.gibbs$k.samples[,'Population'], main = "", xlab= 'precision for Population')

beta.gibbs = apply(m.ridge.gibbs$beta.samples,2,median)
beta.ridge = coef(m.ridge)[-1]
xtest = as.matrix(longley.s[-train,1:6])
ytest = longley.s[-train,7]

d = rbind(Gibbs = c(mean((y - x%*%beta.gibbs)^2), mean((ytest - xtest%*%beta.gibbs)^2)),
          Ridge = c(mean((y- x%*%beta.ridge)^2),mean((ytest- xtest%*%beta.ridge)^2)))
colnames(d) = c("in-sample","out-of-sample")
knitr::kable(d, caption = "Mean Squared Error")
Mean Squared Error
in-sample out-of-sample
Gibbs 0.0041609 0.0070724
Ridge 0.0039279 0.0058506

Case Study: Ridge and Lasso from glmnet for the Longley Data

First, we load the data and fit a linear model using ridge regression. We will use the glmnet package, which has two key parameters, alpha and lambda. The alpha parameter controls the mix of L1 and L2 regularization: when alpha=0 we have ridge regression and when alpha=1 we have lasso. The lambda parameter controls the strength of the regularization. The objective function minimized by glmnet is \[ \mininline{\beta}{\frac{1}{2n}\lVert y- X\beta\rVert _2^2 + \lambda\left(\frac{1-\alpha}{2}\lVert \beta\rVert _2^2 + \alpha\lVert \beta\rVert _1\right)} \]

# load the package
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# load data
data(longley)
x <- as.matrix(longley[,1:6])
y <- as.matrix(longley[,7])
# fit model
fit <- glmnet(x, y, family="gaussian", alpha=0, lambda=0.001)
# summarize the fit
summary(fit)
##           Length Class     Mode   
## a0        1      -none-    numeric
## beta      6      dgCMatrix S4     
## df        1      -none-    numeric
## dim       2      -none-    numeric
## lambda    1      -none-    numeric
## dev.ratio 1      -none-    numeric
## nulldev   1      -none-    numeric
## npasses   1      -none-    numeric
## jerr      1      -none-    numeric
## offset    1      -none-    logical
## call      6      -none-    call   
## nobs      1      -none-    numeric
# make predictions
predictions <- predict(fit, x, type="link")
# summarize accuracy
mse <- mean((y - predictions)^2)
print(mse)
## [1] 0.05919831

Now we find the lasso fit

# load the package
library(lars)
## Loaded lars 1.3
# load data
data(longley)
x <- as.matrix(longley[,1:6])
y <- as.matrix(longley[,7])
# fit model
fit <- lars(x, y, type="lasso")
# summarize the fit
summary(fit)
## LARS/LASSO
## Call: lars(x = x, y = y, type = "lasso")
##    Df     Rss        Cp
## 0   1 185.009 1976.7120
## 1   2   6.642   59.4712
## 2   3   3.883   31.7832
## 3   4   3.468   29.3165
## 4   5   1.563   10.8183
## 5   4   1.339    6.4068
## 6   5   1.024    5.0186
## 7   6   0.998    6.7388
## 8   7   0.907    7.7615
## 9   6   0.847    5.1128
## 10  7   0.836    7.0000
# select a step with a minimum error
best_step <- fit$df[which.min(fit$RSS)]
# make predictions
predictions <- predict(fit, x, s=best_step, type="fit")$fit
# summarize accuracy
mse <- mean((y - predictions)^2)
print(mse)
## [1] 0.06400169

Now we can combine both the ridge and lasso penalties in a single model, called the elastic net. The objective function is the same as above, \[ \mininline{\beta}{\frac{1}{2n}\lVert y- X\beta\rVert _2^2 + \lambda\left(\frac{1-\alpha}{2}\lVert \beta\rVert _2^2 + \alpha\lVert \beta\rVert _1\right)} \] now with \(0 < \alpha < 1\).

# load the package
library(glmnet)
# load data
data(longley)
x <- as.matrix(longley[,1:6])
y <- as.matrix(longley[,7])
# fit model
fit <- glmnet(x, y, family="gaussian", alpha=0.5, lambda=0.001)
# summarize the fit
summary(fit)
##           Length Class     Mode   
## a0        1      -none-    numeric
## beta      6      dgCMatrix S4     
## df        1      -none-    numeric
## dim       2      -none-    numeric
## lambda    1      -none-    numeric
## dev.ratio 1      -none-    numeric
## nulldev   1      -none-    numeric
## npasses   1      -none-    numeric
## jerr      1      -none-    numeric
## offset    1      -none-    logical
## call      6      -none-    call   
## nobs      1      -none-    numeric
# make predictions
predictions <- predict(fit, x, type="link")
# summarize accuracy
mse <- mean((y - predictions)^2)
print(mse)
## [1] 0.0590839

References

Andrews, David F, and Colin L Mallows. 1974. “Scale Mixtures of Normal Distributions.” Journal of the Royal Statistical Society. Series B (Methodological), 99–102.
Carlin, Bradley P, and Nicholas G Polson. 1991. “Inference for Nonconjugate Bayesian Models Using the Gibbs Sampler.” Canadian Journal of Statistics 19 (4): 399–405.
Carlin, Bradley P, Nicholas G Polson, and David S Stoffer. 1992. “A Monte Carlo Approach to Nonnormal and Nonlinear State-Space Modeling.” Journal of the American Statistical Association 87 (418): 493–500.
George, Edward I, and Robert E McCulloch. 1993. “Variable Selection via Gibbs Sampling.” Journal of the American Statistical Association 88 (423): 881–89.
Hoerl, Arthur E, and Robert W Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1): 55–67.
Park, Trevor, and George Casella. 2008. “The Bayesian Lasso.” Journal of the American Statistical Association 103 (482): 681–86.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological), 267–88.
Tihonov, Andrei Nikolajevits. 1963. “Solution of Incorrectly Formulated Problems and the Regularization Method.” Soviet Math. 4: 1035–38.
Tikhonov, Andrey Nikolayevich. 1943. “On the Stability of Inverse Problems.” In Dokl. Akad. Nauk SSSR, 39:195–98.