Consider the simplest possible regression model \[ Y_i = \beta_0 + \epsilon_i \] where \(\epsilon_i\), \(i = 1,\dots,n\), are independent and identically distributed random variables with \(E(\epsilon_i)=0\) and \(Var(\epsilon_i)=\sigma^2\). The ridge estimator of \(\beta_0\) solves
\[\min_{b}\Big[\sum_{i = 1}^n(Y_i - b)^2+\lambda b^2 \Big]\] for some \(\lambda\ge0\). In the special case \(\lambda=0\), the solution is of course the OLS estimator.
Show that the solution to this problem is given by \(\hat{\beta}_0^{ridge}=\sum_{i = 1}^nY_i/(n+\lambda)\). Compare this to the OLS estimator \(\hat{\beta}_0^{OLS}=\overline{Y}\).
Suppose that \(\beta_0=1\) and \(\epsilon_i\sim N(0,\sigma^2)\) with \(\sigma^2=4\). Generate a sample of size n = 10 from this model and compute \(\hat{\beta}_0^{ridge}\) for a grid of \(\lambda\) values over the interval \([0, 20]\).
Repeat part (b), say, 1000 times, so that you end up with 1000 estimates of \(\beta_0\) for all the \(\lambda\) values that you have picked. For each value of \(\lambda\), compute \(bias^2[\hat{\beta}_0^{ridge}]\), \(Var[\hat{\beta}_0^{ridge}]\) and \(MSE[\hat{\beta}_0^{ridge}]=bias^2[\hat{\beta}_0^{ridge}]+Var[\hat{\beta}_0^{ridge}]\).
Plot \(bias^2[\hat{\beta}_{0}^{ridge}]\), \(Var[\hat{\beta}_{0}^{ridge}]\) and \(MSE[\hat{\beta}_{0}^{ridge}]\) as functions of \(\lambda\) and interpret the result.
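For part (a), a short sketch of the derivation: the objective is a convex quadratic in \(b\), so setting its derivative to zero gives the minimizer,
\[ \frac{d}{db}\Big[\sum_{i=1}^n (Y_i - b)^2 + \lambda b^2\Big] = -2\sum_{i=1}^n (Y_i - b) + 2\lambda b = 0 \;\Longrightarrow\; (n + \lambda)\, b = \sum_{i=1}^n Y_i , \]
so \(\hat{\beta}_0^{ridge} = \sum_{i=1}^n Y_i/(n+\lambda) = \frac{n}{n+\lambda}\,\overline{Y}\). Compared with the OLS estimator \(\hat{\beta}_0^{OLS} = \overline{Y}\), the ridge estimator shrinks the sample mean toward zero, and the shrinkage factor \(n/(n+\lambda)\) decreases as \(\lambda\) grows.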
In part (b), I set alpha = 0 in the glmnet call, because alpha = 0 corresponds to ridge regression (the default, alpha = 1, gives the lasso).
library(glmnet)     # ridge and lasso fitting
library(tidyverse)  # map_df, tibble, dplyr, tidyr and ggplot2

# coefficients decay like 1/j^2, giving an approximately sparse model
generateBeta <- function(beta_length) {
  4 / seq(beta_length)^2
}

# true regression function: linear in x with the approximately sparse beta
f_y_x <- function(x) {
  beta <- generateBeta(dim(x)[2])
  x %*% beta
}
# x0 is a fixed evaluation point; x_eval is a one-row matrix whose entries all equal x0
run_simulation <- function(x_generator, x0 = 0.2, lambdas = seq(0, 20, 1)) {
  # generate the sample
  x <- x_generator()
  y_exp <- f_y_x(x)
  y <- y_exp + rnorm(length(y_exp), sd = 2)  # sigma^2 = 4, so sd = 2
  # build the evaluation point
  x_eval <- matrix(x0, ncol = dim(x)[2])
  # fit a ridge regression (alpha = 0) for each lambda and record the fit at x_eval
  map_df(lambdas, ~{
    model <- glmnet(x, y, alpha = 0, lambda = .x)
    tibble(
      lambda = .x,
      f_hat = as.numeric(predict(model, newx = x_eval)),
      error = as.numeric(f_y_x(x_eval)) - f_hat
    )
  })
}
# summarise the Monte Carlo runs: squared bias, variance and MSE of f_hat for each lambda
visualize_simulation_results <- function(simulation_results) {
  group_by(simulation_results, lambda) %>%
    summarise(bias2 = mean(error)^2, var = var(f_hat)) %>%
    mutate(MSE = bias2 + var) %>%
    pivot_longer(bias2:MSE, names_to = "metric") %>%
    mutate(metric = factor(metric, levels = c("bias2", "var", "MSE"))) %>%
    ggplot(aes(lambda, value, color = metric)) +
    geom_line(size = 1)
}
# generate a design matrix of n observations of p independent standard normal predictors;
# glmnet needs at least two columns in x, hence p = 2 by default
generate_independent_predictors <- function(n = 10, p = 2) {
  matrix(rnorm(n * p), nrow = n, ncol = p)
}
# a single simulation run, to inspect the output format (the 1000 repetitions follow below)
run_simulation(x_generator = generate_independent_predictors)
## # A tibble: 21 × 3
## lambda f_hat error
## <dbl> <dbl> <dbl>
## 1 0 -1.19 2.19
## 2 1 -1.22 2.22
## 3 2 -1.25 2.25
## 4 3 -1.27 2.27
## 5 4 -1.29 2.29
## 6 5 -1.30 2.30
## 7 6 -1.31 2.31
## 8 7 -1.32 2.32
## 9 8 -1.33 2.33
## 10 9 -1.34 2.34
## # … with 11 more rows
# repeat the simulation 1000 times
n_sim <- 1000
simulation_results2 <- map_df(
seq(n_sim),
run_simulation,
x_generator = generate_independent_predictors
)
visualize_simulation_results(simulation_results2)
We can see from the plot that, as \(\lambda\) increases, the variance decreases while the squared bias increases; the MSE combines the two, which is the usual bias-variance trade-off of ridge regression.
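As a quick check, here is a small sketch (it assumes the simulation_results2 object created above and a recent version of dplyr for slice_min()) that extracts the \(\lambda\) attaining the smallest Monte Carlo MSE:

# summarise the Monte Carlo runs per lambda and keep the lambda with the smallest MSE
simulation_results2 %>%
  group_by(lambda) %>%
  summarise(bias2 = mean(error)^2, var = var(f_hat)) %>%
  mutate(MSE = bias2 + var) %>%
  slice_min(MSE, n = 1)

For the intercept-only estimator of part (a), the trade-off can also be written in closed form: \(Var[\hat{\beta}_0^{ridge}] = n\sigma^2/(n+\lambda)^2\) and \(bias[\hat{\beta}_0^{ridge}] = -\lambda\beta_0/(n+\lambda)\), so the variance shrinks and the squared bias grows as \(\lambda\) increases, in line with the plot.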
Let X and Y be two random variables with zero mean. The population version of the optimization problem that defines the first principal component of the two variables is
\[ \max_{u_1, u_2} Var(u_1 X + u_2 Y)\]
subject to \[ u_1^2 + u_2^2 = 1.\] The following questions ask you to examine some insightful special cases.
Suppose that Var(X) > Var(Y) and Cov(X, Y) = E(XY) = 0. Derive the first principal component vector. Draw an illustrative picture and explain the result intuitively. (Hint: expand the variance formula and substitute the constraint. Then carry out the maximization.)
Suppose that Var(X) = Var(Y) = 1 (principal component analysis is often performed after standardization) and Cov(X, Y) = E(XY) = 0. Show that in this case any vector \((u_1, u_2)\) with length 1 is a principal component vector (i.e., it solves the problem above). Explain this puzzling result intuitively. (A picture can help.)
Fix a unit vector \((u_1, u_2)\), meaning \[ \sqrt{u_1^2 + u_2^2} = 1 \;\Longrightarrow\; u_1^2 + u_2^2 = 1.\]
The goal is to pick \(u_1\) and \(u_2\) to maximize \(Var(u_1 X + u_2 Y)\), where \(u_1\) is the first component of the unit vector and \(u_2\) is the second.
I first expanded \(Var(u_1 X + u_2 Y)\) using the properties of the variance, then substituted the constraint to eliminate \(u_2\), and took the derivative of the resulting expression with respect to \(u_1\). Setting the derivative to zero gives the candidate vectors \((1, 0), (-1, 0), (0, 1)\) and \((0, -1)\). Since \(Var(X) > Var(Y)\), the objective is largest at \((\pm 1, 0)\): the first principal component points along the \(X\) axis, the direction in which the data vary the most. The figure below gives a graphical illustration, and the expansion is written out after it.
knitr::include_graphics("https://raw.githubusercontent.com/ghazalayobi/ml/main/lasso-regression.png")
In part (b), \(Var(X) = Var(Y) = 1\) and \(Cov(X, Y) = 0\), so the objective takes the same value for every unit vector: the constraint set is the circle of radius 1 centred at the origin, and moving around that circle does not change the variance of \(u_1 X + u_2 Y\). Any unit vector \((u_1, u_2)\) is therefore a principal component vector. Intuitively, after standardization the data cloud is spherical, so no direction explains more variance than any other. The figure below gives a graphical illustration, and the computation is written out after it.
knitr::include_graphics("https://raw.githubusercontent.com/ghazalayobi/ml/main/ridge-regression.png")
ISLR Exercise 3 in Section 6.8: Suppose we estimate the regression coefficients in a linear regression model by minimizing \[\sum_{i=1}^{n}\Big(y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_{j} x_{ij}\Big)^2\] subject to \[\sum_{j=1}^{p}|\beta_{j}| \le s\] for a particular value of s. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer. Part (a) asks what happens to the training RSS as we increase s from 0.
Repeat (a) for test RSS.
Repeat (a) for variance.
Repeat (a) for (squared) bias.
Repeat (a) for the irreducible error.
The equation above states that when we run the lasso, our aim is to find the set of coefficient estimates that leads to the smallest RSS, subject to the constraint that there is a budget s for how large \(\sum_{j=1}^{p}|\beta_{j}|\) can be.
As we increase s from 0, the training RSS will (iv) steadily decrease. Increasing s loosens the constraint on \(\beta\) and makes the model more flexible: the feasible set grows, so the minimized training RSS can only fall or stay the same. At s = 0 all coefficients are forced to zero, and once s is large enough to contain the least squares solution the training RSS reaches its OLS minimum. (A small numerical illustration is sketched at the end of this exercise.)
As we increase s from 0, the test RSS will (ii) decrease initially and then eventually start increasing, forming a U shape. As s increases and the constraint loosens, the flexibility of the model increases; the test RSS falls until the model begins to overfit, and beyond that point it rises again.
As we increase s from 0, the variance will (iii) steadily increase. Increasing s loosens the constraint, which corresponds to less shrinkage (a smaller \(\lambda\)); the model becomes more flexible, and more flexible fits vary more from sample to sample, so the variance increases.
As we increase s from 0, the squared bias will (iv) steadily decrease. The mechanism is the same as in part (c): the increase in flexibility reduces the bias. Moreover, once the least squares solution lies within the constraint region, the squared bias stops declining.
As we increase s from 0, the irreducible error will (v) remain constant. By definition, the irreducible error is the error that no model can remove; it does not depend on the model or its parameters, so it is independent of s.
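To illustrate part (a) numerically, here is a small sketch on simulated data. glmnet parameterizes the lasso by \(\lambda\) rather than by the budget s; a smaller \(\lambda\) corresponds to a larger s, which we can measure as the \(L_1\) norm of the fitted coefficients. The training RSS falls steadily as the budget grows:

# simulated example: training RSS as a function of the lasso budget s
set.seed(1)
n <- 100; p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- x %*% c(3, -2, 1, 0, 0) + rnorm(n)
fit <- glmnet(x, y, alpha = 1)  # alpha = 1 gives the lasso; glmnet picks its own lambda sequence
train_rss <- apply(predict(fit, newx = x), 2, function(y_hat) sum((y - y_hat)^2))
s_used <- colSums(abs(as.matrix(coef(fit))[-1, ]))  # L1 norm of the fitted coefficients at each lambda
tibble(s = s_used, train_rss = train_rss) %>%
  ggplot(aes(s, train_rss)) +
  geom_line() +
  labs(x = "s (L1 norm of fitted coefficients)", y = "training RSS")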