A fractional power makes the shrinkage too weak to be useful in practice. The regularisation term penalises the model's fit in an effort to reduce overfitting; with a fractional (\(p < 1\)) power, we may see only a small regularisation effect, so there is little point in doing this.
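Concretely, the general \(l_p\)-penalised least-squares objective (a standard textbook form, written out here for reference) is
\[
\hat{\beta} = \arg\min_{\beta} \, \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{k} \lvert \beta_j \rvert^{p},
\]
where \(p = 1\) recovers the LASSO and \(p = 2\) ridge regression; as an aside, for \(p < 1\) the penalty is also non-convex, which makes the optimisation itself harder.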
The loss function used by LASSO does not necessarily carry over to other models. This means that the subset of features selected by LASSO will not necessarily be the best subset for other models.
More research needed here, but maybe the No Free Lunch Theorem applies?
sim_linear_data = function (k=4, n=3) {
  # Simulate a linear model with k predictors and n rows.
  # Assume the x's come from a standard normal distribution
  set.seed(20)
  x = rnorm(k*n, 0, 1)
  # Reshape the x's into an n-by-k matrix and prepend a constant column
  # to create the design matrix:
  X = cbind(1, matrix(x, ncol=k))
  # Pick j < k non-zero betas; the remaining coefficients are zero
  num_non_zero = floor(runif(1, 1, k))
  betas = c(runif(num_non_zero), rep(0, k + 1 - num_non_zero))
  # Assume epsilon comes from a standard normal distribution
  epsilon = rnorm(n, 0, 1)
  # Calculate the response vector
  y = as.vector(X %*% betas + epsilon)
  # Return the true betas, the noise, the design matrix and the response
  list(betas, epsilon, X, y)
}
results = sim_linear_data(20, 100)
true_betas = results[[1]]
true_epsilon = results[[2]]
# Drop the constant column: glmnet fits its own intercept, and a
# zero-variance column cannot be standardised
X = results[[3]][, -1]
y = results[[4]]
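As a quick optional sanity check (these calls are purely illustrative), we can confirm the shapes of the simulated objects and count the truly non-zero betas:
dim(X)
length(y)
sum(true_betas != 0)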
We use the glmnet package to fit the LASSO and select the penalty \(\lambda\) by k-fold cross-validation (CV):
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0-2
fit = glmnet(X, y)
plot(fit)
Each curve corresponds to one variable, tracing the path of its coefficient against the \(l_1\)-norm of the whole coefficient vector as \(\lambda\) varies.
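If we want the numbers behind the picture, coef() on the fitted object returns the coefficients at any \(\lambda\) (the value 0.5 below is arbitrary, chosen only for illustration):
coef(fit, s = 0.5)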
We perform cross-validation and plot the CV curve, which shows the mean cross-validated error as \(\lambda\) varies. Note that the default number of folds is 10:
set.seed(21)
cv = cv.glmnet(X, y)
plot(cv)
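The two vertical dashed lines in the CV plot mark lambda.min (the \(\lambda\) that minimises the cross-validated error) and lambda.1se (the largest \(\lambda\) within one standard error of that minimum); both are stored on the CV object:
cv$lambda.min
cv$lambda.1se
# coefficients of the model selected by the one-standard-error rule
coef(cv, s = "lambda.1se")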
The question is ambiguous here; I'm interpreting it as asking us to try different numbers of CV folds (other than the default 10), though other readings are possible.
# Function for plotting num_CVs CV plots, starting from 5-fold and
# increasing the number of folds by 5 each time
plot_CVs = function (num_CVs, X, y) {
  set.seed(22)
  numrows = ceiling(num_CVs/2)
  par(mfrow=c(numrows, 2))
  for (i in 1:num_CVs) {
    # cv.glmnet does all the heavy lifting here; you can also pull the
    # coefficients out and inspect them (glmnet has great documentation on this)
    plot(cv.glmnet(X, y, nfolds=5*i))
  }
}
plot_CVs(6, X, y)
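If the grid of plots is hard to compare by eye, a minimal sketch like the one below (using the same fold counts as plot_CVs; the loop is my own addition) prints the selected \(\lambda\) for each fold count instead:
set.seed(22)
for (i in 1:6) {
  cv_i = cv.glmnet(X, y, nfolds = 5 * i)
  cat("nfolds =", 5 * i, ": lambda.min =", cv_i$lambda.min, "\n")
}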
# We will use the functions we declared earlier to explore different dataset sizes quickly.
wrapper_function = function(num_predictors, dataset_size, number_of_CVs_to_test) {
  # Simulate a dataset
  simulation = sim_linear_data(k=num_predictors, n=dataset_size)
  # Drop the constant column, as before
  X = simulation[[3]][, -1]
  y = simulation[[4]]
  # Pass into the cross-validation plotting function
  plot_CVs(number_of_CVs_to_test, X, y)
}
wrapper_function(10, 100000, 4)
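For contrast, we could rerun the same experiment on a much smaller dataset (the sizes below are arbitrary) to see how the CV curves behave when data are scarce:
wrapper_function(10, 100, 4)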
Since this part is optional, we skip it.