Loading

library(ggplot2)
library(tidyr)

References

BST249 Materials

Setting

Consider a setting with a single data point y and a single parameter μ.

Here we examine how LASSO and Ridge regression can be viewed as Bayesian maximum a posteriori (MAP) estimation under specific priors.

$$\mu \sim \pi(\mu)$$

$$Y \sim N(\mu, 1)$$

For MAP estimation, we maximize the following posterior kernel.

$$\pi(\mu \mid y) \propto e^{-\frac{(y - \mu)^2}{2}} \pi(\mu)$$
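Since the logarithm is monotone, maximizing this posterior is equivalent to minimizing its negative logarithm. Dropping additive constants gives the generic form of the objective functions used in the sections below:

$$\hat{\mu}_{\text{MAP}} = \arg\max_{\mu} \pi(\mu \mid y) = \arg\min_{\mu} \left\{ \frac{(y - \mu)^2}{2} - \log \pi(\mu) \right\}$$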

LASSO

μ can be considered to have a Laplace(0, 1/λ) prior (equivalently, a Laplace prior with rate λ).
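Concretely, this prior density and its negative log are

$$\pi(\mu) = \frac{\lambda}{2} e^{-\lambda |\mu|}, \qquad -\log \pi(\mu) = \lambda |\mu| + \text{const}$$

so the penalty term in the objective below is exactly the negative log prior.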

The resulting optimization is the minimization of the following objective function (the negative log posterior kernel).

$$\frac{(\mu - y)^2}{2} + \lambda |\mu|$$
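In this single-observation case the minimizer has a closed form. Examining the derivative on each side of zero (the objective is not differentiable at μ = 0) gives the soft-thresholding solution:

$$\hat{\mu}_{\text{MAP}} =
\begin{cases}
y - \lambda & \text{if } y > \lambda \\
0 & \text{if } |y| \le \lambda \\
y + \lambda & \text{if } y < -\lambda
\end{cases}
= \operatorname{sign}(y)\, \max(|y| - \lambda,\, 0)$$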

## Negative log likelihood kernel
ll_kernel <- function(mu, y) {
    (mu - y)^2 / 2
}

## Negative log prior kernel
lp_kernel <- function(mu, lambda) {
    lambda * abs(mu)
}

## Observed data point is 1 (mu_hat_MLE = 1)
data1 <- expand.grid(y = 1,
                     lambda = c(0, 0.1, 0.5, 1, 1.5, 2),
                     mu = seq(-0.5, 2, length.out = 100))

## Evaluate the kernels on the grid
data1$ll <- with(data1, ll_kernel(mu, y))
data1$lp <- with(data1, lp_kernel(mu, lambda))
## Objective function to minimize
data1$obj <- with(data1, ll + lp)

## Factor lambda for easy plotting
data1$lambda <- factor(data1$lambda)
## Long format for easier plotting
data1_long <- gather(data1, key = "fun", value = "value",
                     -y, -lambda, -mu)

## Plot the kernels and the objective function, faceted by lambda
ggplot(data = data1_long, mapping = aes(x = mu, y = value,
                                        color = fun,
                                        group = fun)) +
    geom_line(data = subset(data1_long, fun != "obj"),
              alpha = 1/3) +
    geom_line(data = subset(data1_long, fun == "obj"),
              size = 1, alpha = 1) +
    geom_vline(xintercept = c(0,1), alpha = 1/5) +
    facet_wrap(~ lambda) +
    labs(title = "LASSO objective function at different lambda values") +
    theme_bw() + theme(legend.key = element_blank())

As λ increases, the sharp kink of the log prior at μ = 0 takes over and eventually creates the minimum at μ = 0. In this single-observation case the MAP estimate is exactly 0 once λ ≥ |y| (here λ ≥ 1) and stays there for all larger λ. This is how the Laplace prior induces sparsity.
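We can check this numerically. The sketch below minimizes the same LASSO objective with optimize() for each λ and compares the result with the closed-form soft-thresholding solution; lasso_obj() and soft_threshold() are illustrative helpers added here, not part of the original code, and optimize() returns an approximate minimizer, so estimates that are essentially zero will appear as very small numbers rather than exact zeros.

## Numerical check: minimize the LASSO objective over the same grid range
## and compare with the closed-form soft-thresholding solution.
lasso_obj <- function(mu, y, lambda) {
    (mu - y)^2 / 2 + lambda * abs(mu)
}
soft_threshold <- function(y, lambda) {
    sign(y) * max(abs(y) - lambda, 0)
}
lambdas <- c(0, 0.1, 0.5, 1, 1.5, 2)
data.frame(lambda = lambdas,
           numerical = sapply(lambdas, function(l) {
               optimize(lasso_obj, interval = c(-0.5, 2), y = 1, lambda = l)$minimum
           }),
           closed_form = sapply(lambdas, function(l) soft_threshold(1, l)))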

Ridge regression

μ can be considered to have a $N(0, \lambda^{-1})$ prior (mean 0, variance 1/λ).
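Concretely, the prior kernel and its negative log are

$$\pi(\mu) \propto e^{-\frac{\lambda \mu^2}{2}}, \qquad -\log \pi(\mu) = \frac{\lambda \mu^2}{2} + \text{const}$$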

The resulting optimization is the minimization of the following objective function (twice the negative log posterior kernel, so the factors of 1/2 drop out).

$$(\mu - y)^2 + \lambda \mu^2$$
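Unlike the LASSO objective, this one is differentiable everywhere, so setting the derivative to zero gives the minimizer directly:

$$2(\mu - y) + 2\lambda\mu = 0 \quad\Longrightarrow\quad \hat{\mu}_{\text{MAP}} = \frac{y}{1 + \lambda}$$

For any finite λ this is nonzero whenever y ≠ 0.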

## Negative log likelihood kernel (scaled by 2)
ll_kernel <- function(mu, y) {
    (mu - y)^2
}

## Negative log prior kernel (scaled by 2)
lp_kernel <- function(mu, lambda) {
    lambda * mu^2
}

## Observed data point is 1 (mu_hat_MLE = 1)
data1 <- expand.grid(y = 1,
                     lambda = c(0, 0.1, 0.5, 1, 1.5, 2),
                     mu = seq(-0.5, 2, length.out = 100))

## Evaluate the kernels on the grid
data1$ll <- with(data1, ll_kernel(mu, y))
data1$lp <- with(data1, lp_kernel(mu, lambda))
## Objective function to minimize
data1$obj <- with(data1, ll + lp)

## Factor lambda for easy plotting
data1$lambda <- factor(data1$lambda)
## Long format for easier plotting
data1_long <- gather(data1, key = "fun", value = "value",
                     -y, -lambda, -mu)

## Plot the kernels and the objective function, faceted by lambda
ggplot(data = data1_long, mapping = aes(x = mu, y = value,
                                        color = fun,
                                        group = fun)) +
    geom_line(data = subset(data1_long, fun != "obj"),
              alpha = 1/3) +
    geom_line(data = subset(data1_long, fun == "obj"),
              size = 1, alpha = 1) +
    geom_vline(xintercept = c(0,1), alpha = 1/5) +
    facet_wrap(~ lambda) +
    labs(title = "Ridge objective function at different lambda values") +
    theme_bw() + theme(legend.key = element_blank())

As λ increases, the log prior takes over and pulls the minimum toward μ = 0, but only gradually: the minimizer y/(1 + λ) derived above shrinks toward zero yet never reaches it for any finite λ. Ridge therefore induces shrinkage but not sparsity.
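The same numerical check as for LASSO shows shrinkage without exact zeros; ridge_obj() below is an illustrative helper, not part of the original code.

## Numerical check: minimize the Ridge objective over the same grid range
## and compare with the closed-form solution y / (1 + lambda).
ridge_obj <- function(mu, y, lambda) {
    (mu - y)^2 + lambda * mu^2
}
lambdas <- c(0, 0.1, 0.5, 1, 1.5, 2)
data.frame(lambda = lambdas,
           numerical = sapply(lambdas, function(l) {
               optimize(ridge_obj, interval = c(-0.5, 2), y = 1, lambda = l)$minimum
           }),
           closed_form = 1 / (1 + lambdas))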