
Loading

library(ggplot2)
library(tidyr)


Setting

Here we examine how LASSO and Ridge regression can be viewed as Bayesian maximum a posteriori (MAP) estimation under specific priors.

Consider a setting with a single data point \(y\) and a single parameter \(\mu\):

\(\mu \sim \pi(\mu)\)

\(Y \sim N(\mu, 1)\)

For MAP estimation, we maximize the posterior, which is proportional to the likelihood times the prior:

\(\pi(\mu | y) \propto e^{-\frac{(y - \mu)^{2}}{2}} \pi(\mu)\)
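
Before introducing the LASSO and ridge priors, a minimal numerical sketch (added here for illustration, not part of the original material) may help: with an assumed observation \(y = 1\) and, purely for concreteness, a standard normal prior on \(\mu\), the MAP estimate is the minimizer of the negative log posterior.

## Minimal sketch (assumed example): numerical MAP estimation with y = 1 and,
## purely for illustration, a standard normal prior on mu.
## Negative log posterior (up to constants): likelihood term + prior term.
neg_log_post <- function(mu, y) {
    (y - mu)^2 / 2 + mu^2 / 2
}
## optimize() minimizes over a bounded interval; extra arguments are passed on.
optimize(neg_log_post, interval = c(-5, 5), y = 1)$minimum
## Approximately 0.5, i.e., y / 2, the posterior mode under this prior.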

LASSO

\(\mu\) can be considered to have a Laplace prior with location 0 and rate \(\lambda\) (scale \(1/\lambda\)), i.e., \(\pi(\mu) \propto e^{-\lambda |\mu|}\).

MAP estimation is then equivalent to minimizing the negative log posterior, which (up to an additive constant) is the following objective function.

\(\frac{(\mu - y)^{2}}{2} + \lambda |\mu|\)

## Negative log likelihood kernel (up to a constant)
ll_kernel <- function(mu, y) {
    (mu - y)^2 / 2
}

## Negative log prior kernel (up to a constant)
lp_kernel <- function(mu, lambda) {
    lambda * abs(mu)
}

## Observed data point is 1 (mu_hat_MLE = 1)
data1 <- expand.grid(y = 1,
                     lambda = c(0, 0.1, 0.5, 1, 1.5, 2),
                     mu = seq(-0.5, 2, length.out = 100))

## Fit functions
data1$ll <- with(data1, ll_kernel(mu, y))
data1$lp <- with(data1, lp_kernel(mu, lambda))
## Objective function to minimize
data1$obj <- with(data1, ll + lp)

## Factor lambda for easy plotting
data1$lambda <- factor(data1$lambda)
## Long format for easier plotting
data1_long <- gather(data1, key = "fun", value = "value",
                     -y, -lambda, -mu)

## Plot each component (ll, lp) and the objective (obj) against mu, faceted by lambda
ggplot(data = data1_long, mapping = aes(x = mu, y = value,
                                        color = fun,
                                        group = fun)) +
    geom_line(data = subset(data1_long, fun != "obj"),
              alpha = 1/3) +
    geom_line(data = subset(data1_long, fun == "obj"),
              size = 1, alpha = 1) +
    geom_vline(xintercept = c(0,1), alpha = 1/5) +
    facet_wrap(~ lambda) +
    labs(title = "LASSO objective function at different lambda values") +
    theme_bw() + theme(legend.key = element_blank())

As \(\lambda\) increases, the kink of the negative log prior at \(\mu = 0\) takes over and creates a minimum at exactly \(\mu = 0\). Once \(\lambda\) is at least \(|y| = 1\), the MAP estimate becomes 0 and stays there. This is what induces sparsity.
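
In this one-dimensional setting the LASSO MAP estimate has a known closed form, the soft-thresholding rule \(\hat{\mu} = \mathrm{sign}(y)\,\max(|y| - \lambda, 0)\). The sketch below (added here as a check, not part of the original material) compares it with a numerical minimizer of the objective.

## Closed-form LASSO MAP estimate (soft-thresholding); added as a check
soft_threshold <- function(y, lambda) {
    sign(y) * max(abs(y) - lambda, 0)
}

## Compare the closed form with a numerical minimizer of the objective (y = 1)
sapply(c(0, 0.5, 1, 2), function(lambda) {
    ## LASSO objective with y = 1: (mu - y)^2 / 2 + lambda * |mu|
    obj <- function(mu) (mu - 1)^2 / 2 + lambda * abs(mu)
    c(lambda      = lambda,
      closed_form = soft_threshold(1, lambda),
      numerical   = optimize(obj, interval = c(-2, 2))$minimum)
})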

Ridge regression

\(\mu\) can be considered to have a \(N(0, \lambda^{-1})\) prior.

MAP estimation is then equivalent to minimizing the negative log posterior; multiplying by 2 (which does not change the minimizer) gives the following objective function.

\((\mu - y)^{2} + \lambda \mu^{2}\)

## Negative log likelihood kernel (times 2, up to a constant)
ll_kernel <- function(mu, y) {
    (mu - y)^2
}

## Negative log prior kernel (times 2, up to a constant)
lp_kernel <- function(mu, lambda) {
    lambda * mu^2
}

## Observed data point is 1 (mu_hat_MLE = 1)
data1 <- expand.grid(y = 1,
                     lambda = c(0, 0.1, 0.5, 1, 1.5, 2),
                     mu = seq(-0.5, 2, length.out = 100))

## Fit functions
data1$ll <- with(data1, ll_kernel(mu, y))
data1$lp <- with(data1, lp_kernel(mu, lambda))
## Objective function to minimize
data1$obj <- with(data1, ll + lp)

## Factor lambda for easy plotting
data1$lambda <- factor(data1$lambda)
## Long format for easier plotting
data1_long <- gather(data1, key = "fun", value = "value",
                     -y, -lambda, -mu)

## Plot each component (ll, lp) and the objective (obj) against mu, faceted by lambda
ggplot(data = data1_long, mapping = aes(x = mu, y = value,
                                        color = fun,
                                        group = fun)) +
    geom_line(data = subset(data1_long, fun != "obj"),
              alpha = 1/3) +
    geom_line(data = subset(data1_long, fun == "obj"),
              size = 1, alpha = 1) +
    geom_vline(xintercept = c(0,1), alpha = 1/5) +
    facet_wrap(~ lambda) +
    labs(title = "Ridge objective function at different lambda values") +
    theme_bw() + theme(legend.key = element_blank())

As \(\lambda\) increases, the negative log prior takes over and pulls the minimum toward \(\mu = 0\). This happens only gradually: the estimate is shrunk toward zero but never becomes exactly zero, so ridge regression induces shrinkage without sparsity.
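
For the ridge objective the minimizer also has a closed form: setting the derivative \(2(\mu - y) + 2\lambda\mu\) to zero gives \(\hat{\mu} = y / (1 + \lambda)\), which shrinks toward zero as \(\lambda\) grows but never reaches it for finite \(\lambda\). The quick numerical check below is added here as a sketch, not part of the original material.

## Closed-form ridge MAP estimate; added as a check
ridge_map <- function(y, lambda) {
    y / (1 + lambda)
}

## Compare the closed form with a numerical minimizer of the objective (y = 1)
sapply(c(0, 0.5, 1, 2), function(lambda) {
    ## Ridge objective with y = 1: (mu - y)^2 + lambda * mu^2
    obj <- function(mu) (mu - 1)^2 + lambda * mu^2
    c(lambda      = lambda,
      closed_form = ridge_map(1, lambda),
      numerical   = optimize(obj, interval = c(-2, 2))$minimum)
})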