BST249 Materials

```
library(ggplot2)
library(tidyr)
```
Here we consider how LASSO and Ridge regression can be viewed as Bayesian maximum a posteriori (MAP) estimation under specific priors. Consider a setting with a single data point \(y\) and a single parameter \(\mu\):
\(\mu \sim \pi(\mu)\)
\(Y \sim N(\mu, 1)\)
For MAP estimation, we maximize the posterior:
\(\pi(\mu | y) \propto e^{-\frac{(y - \mu)^{2}}{2}} \pi(\mu)\)
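Equivalently, maximizing the posterior is the same as minimizing its negative log; dropping additive constants gives the generic objective

\[
-\log \pi(\mu \mid y) = \frac{(y - \mu)^{2}}{2} - \log \pi(\mu) + \mathrm{const},
\]

so each choice of prior corresponds to a penalty added to the squared-error loss.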
For the LASSO, \(\mu\) can be given a Laplace prior centered at 0 with rate \(\lambda\), i.e., \(\pi(\mu) \propto e^{-\lambda |\mu|}\).
The resulting optimization is minimization of the following objective function.
\(\frac{(\mu - y)^{2}}{2} + \lambda |\mu|\)
```
## Log likelihood kernel
ll_kernel <- function(mu, y) {
  (mu - y)^2 / 2
}
## Log prior kernel
lp_kernel <- function(mu, lambda) {
  lambda * abs(mu)
}
## Observed data point is 1 (mu_hat_MLE = 1)
data1 <- expand.grid(y = 1,
                     lambda = c(0, 0.1, 0.5, 1, 1.5, 2),
                     mu = seq(-0.5, 2, length.out = 100))
## Fit functions
data1$ll <- with(data1, ll_kernel(mu, y))
data1$lp <- with(data1, lp_kernel(mu, lambda))
## Objective function to minimize
data1$obj <- with(data1, ll + lp)
## Factor lambda for easy plotting
data1$lambda <- factor(data1$lambda)
## Long format for easier plotting
data1_long <- gather(data1, key = "fun", value = "value",
                     -y, -lambda, -mu)
##
ggplot(data = data1_long, mapping = aes(x = mu, y = value,
                                        color = fun,
                                        group = fun)) +
  geom_line(data = subset(data1_long, fun != "obj"),
            alpha = 1/3) +
  geom_line(data = subset(data1_long, fun == "obj"),
            size = 1, alpha = 1) +
  geom_vline(xintercept = c(0, 1), alpha = 1/5) +
  facet_wrap(~ lambda) +
  labs(title = "LASSO objective function at different lambda values") +
  theme_bw() + theme(legend.key = element_blank())
```
As \(\lambda\) increases, the pointy shape of the penalty (the negative log-prior) takes over and creates a minimum at the kink \(\mu = 0\). Once \(\lambda\) is large enough (here \(\lambda \ge |y| = 1\)), the MAP estimate becomes exactly 0 and stays there. This is what induces sparsity.
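To check this numerically, here is a minimal sketch (the names `lasso_obj`, `map_numeric`, and `map_closed` are illustrative, not part of the code above) that minimizes the LASSO objective with `optimize()` for the same \(y = 1\) and compares the result with the closed-form soft-thresholding solution \(\mathrm{sign}(y)\max(|y| - \lambda, 0)\):

```
## Minimal sketch: numerical MAP estimates vs. the soft-thresholding formula
lasso_obj <- function(mu, y, lambda) (mu - y)^2 / 2 + lambda * abs(mu)

y <- 1
lambdas <- c(0, 0.1, 0.5, 1, 1.5, 2)

## Numerical minimizer for each lambda (searching a wide interval)
map_numeric <- sapply(lambdas, function(l) {
  optimize(lasso_obj, interval = c(-5, 5), y = y, lambda = l)$minimum
})
## Closed-form LASSO solution: soft thresholding of the MLE y
map_closed <- sign(y) * pmax(abs(y) - lambdas, 0)

round(cbind(lambda = lambdas, numeric = map_numeric, closed_form = map_closed), 3)
```

For \(\lambda \ge |y| = 1\) both columns are (numerically) 0, matching the panels above where the minimum sits exactly at \(\mu = 0\).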
For Ridge regression, \(\mu\) can instead be given a \(N(0, \lambda^{-1})\) prior.
The resulting optimization is minimization of the following objective function (the negative log-posterior multiplied by 2, dropping constants).
\((\mu - y)^{2} + \lambda \mu^{2}\)
```
## Log likelihood kernel
ll_kernel <- function(mu, y) {
  (mu - y)^2
}
## Log prior kernel
lp_kernel <- function(mu, lambda) {
  lambda * mu^2
}
## Observed data point is 1 (mu_hat_MLE = 1)
data1 <- expand.grid(y = 1,
                     lambda = c(0, 0.1, 0.5, 1, 1.5, 2),
                     mu = seq(-0.5, 2, length.out = 100))
## Fit functions
data1$ll <- with(data1, ll_kernel(mu, y))
data1$lp <- with(data1, lp_kernel(mu, lambda))
## Objective function to minimize
data1$obj <- with(data1, ll + lp)
## Factor lambda for easy plotting
data1$lambda <- factor(data1$lambda)
## Long format for easier plotting
data1_long <- gather(data1, key = "fun", value = "value",
                     -y, -lambda, -mu)
##
ggplot(data = data1_long, mapping = aes(x = mu, y = value,
                                        color = fun,
                                        group = fun)) +
  geom_line(data = subset(data1_long, fun != "obj"),
            alpha = 1/3) +
  geom_line(data = subset(data1_long, fun == "obj"),
            size = 1, alpha = 1) +
  geom_vline(xintercept = c(0, 1), alpha = 1/5) +
  facet_wrap(~ lambda) +
  labs(title = "Ridge objective function at different lambda values") +
  theme_bw() + theme(legend.key = element_blank())
```
As \(\lambda\) increases, the quadratic penalty (the negative log-prior) pulls the minimum toward \(\mu = 0\), but only gradually: the estimate is shrunk toward 0 yet never set exactly to 0, so Ridge regression shrinks without inducing sparsity.
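By contrast with the LASSO, the Ridge objective has a simple closed form: setting the derivative of \((\mu - y)^{2} + \lambda \mu^{2}\) to zero gives \(\hat{\mu} = y / (1 + \lambda)\). A minimal sketch analogous to the LASSO check above (again with illustrative names, for \(y = 1\)):

```
## Minimal sketch: Ridge MAP estimates shrink smoothly as y / (1 + lambda)
ridge_obj <- function(mu, y, lambda) (mu - y)^2 + lambda * mu^2

y <- 1
lambdas <- c(0, 0.1, 0.5, 1, 1.5, 2)

## Numerical minimizer for each lambda
map_numeric <- sapply(lambdas, function(l) {
  optimize(ridge_obj, interval = c(-5, 5), y = y, lambda = l)$minimum
})
## Closed-form Ridge solution
map_closed <- y / (1 + lambdas)

round(cbind(lambda = lambdas, numeric = map_numeric, closed_form = map_closed), 3)
```

The estimate approaches 0 only as \(\lambda \to \infty\); for any finite \(\lambda\) it is shrunk but nonzero, unlike the LASSO case.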