```
library(ggplot2)
library(tidyr)
BST249 Materials
Consider a setting with a single data point y with a single parameter μ.
Here we consider how LASSO and Ridge regression can be considered as Bayesian maximum a posteriori (MAP) estimation given specific priors.
μ∼π(μ)
Y∼N(μ,1)
For MAP, maximize the following
π(μ|y)∝e−(y−μ)22π(μ)
μ can be considered to have Laplace(0,λ) prior.
The resulting optimization is minimization of the following objective function.
(μ−y)22+λ|μ|
## Log likelihood kernel
ll_kernel <- function(mu, y) {
(mu - y)^2 / 2
}
## Log prior kernel
lp_kernel <- function(mu, lambda) {
lambda * abs(mu)
}
## Observed data point is 1 (mu_hat_MLE = 1)
data1 <- expand.grid(y = 1,
lambda = c(0, 0.1, 0.5, 1, 1.5, 2),
mu = seq(-0.5, 2, length.out = 100))
## Fit functions
data1$ll <- with(data1, ll_kernel(mu, y))
data1$lp <- with(data1, lp_kernel(mu, lambda))
## Objective function to minimize
data1$obj <- with(data1, ll + lp)
## Factor lambda for easy plotting
data1$lambda <- factor(data1$lambda)
## Long format for easier plotting
data1_long <- gather(data1, key = "fun", value = "value",
-y, -lambda, -mu)
##
ggplot(data = data1_long, mapping = aes(x = mu, y = value,
color = fun,
group = fun)) +
geom_line(data = subset(data1_long, fun != "obj"),
alpha = 1/3) +
geom_line(data = subset(data1_long, fun == "obj"),
size = 1, alpha = 1) +
geom_vline(xintercept = c(0,1), alpha = 1/5) +
facet_wrap(~ lambda) +
labs(title = "LASSO objective function at different lambda values") +
theme_bw() + theme(legend.key = element_blank())
As λ increases, the pointy shape of the log prior takes over to create a minimum point at μ=0. At some large enough λ value, the MAP estimate becomes 0 and stays there. This will induce sparcity.
μ can be considered to have N(0,λ−1) prior.
The resulting optimization is minimization of the following objective function.
(μ−y)2+λμ2
## Log likelihood kernel
ll_kernel <- function(mu, y) {
(mu - y)^2
}
## Log prior kernel
lp_kernel <- function(mu, lambda) {
lambda * mu^2
}
## Observed data point is 1 (mu_hat_MLE = 1)
data1 <- expand.grid(y = 1,
lambda = c(0, 0.1, 0.5, 1, 1.5, 2),
mu = seq(-0.5, 2, length.out = 100))
## Fit functions
data1$ll <- with(data1, ll_kernel(mu, y))
data1$lp <- with(data1, lp_kernel(mu, lambda))
## Objective function to minimize
data1$obj <- with(data1, ll + lp)
## Factor lambda for easy plotting
data1$lambda <- factor(data1$lambda)
## Long format for easier plotting
data1_long <- gather(data1, key = "fun", value = "value",
-y, -lambda, -mu)
##
ggplot(data = data1_long, mapping = aes(x = mu, y = value,
color = fun,
group = fun)) +
geom_line(data = subset(data1_long, fun != "obj"),
alpha = 1/3) +
geom_line(data = subset(data1_long, fun == "obj"),
size = 1, alpha = 1) +
geom_vline(xintercept = c(0,1), alpha = 1/5) +
facet_wrap(~ lambda) +
labs(title = "Ridge objective function at different lambda values") +
theme_bw() + theme(legend.key = element_blank())
As λ increases, the the log prior takes over to pull the minimum point toward μ=0. This only gradually happens causing shrinkage only.