In this problem, you will generate a simulated dataset, and then use subset selection and LASSO to identify the most important predictors. In class we introduced subset selection as a method to select an optimal set of predictors. Specifically, given \(p\) predictors and the model \[\begin{equation} Y \simeq \beta_0 + \beta_1X_1 + ... + \beta_pX_p, \end{equation}\] subset selection identifies a set of predictors \(M\) such that \(\beta_j\) can take any value if \(j \in M\) and \(\beta_j=0\) if \(j\notin M\). Earlier in the class, we showed that linear regression can be used to fit polynomials by treating \(X^j\) (that is, \(X\) raised to the power \(j\), for \(j=2...p\)) as independent predictors: \[\begin{equation} Y \simeq \beta_0 + \beta_1X + \beta_2X^2 ... + \beta_pX^p \end{equation}\] In this problem you will first generate synthetic data for a specific choice of a polynomial. You will then show that the polynomial can be re-inferred from the synthetic data using best subset selection or LASSO. Make sure to read through the R-lab in section 6.5 of ISLR to see how to implement subset selection and LASSO in R.
set.seed(1) for reproducibility. Then use the function
rnorm() to generate a vector of \(n=100\) random predictor values \(X\) as well as a vector of \(n=00\) random noise values \(\epsilon\), both following the standard
normal distribution \(N(0,1)\). Then
generate a vector of response values \(Y\) according to the model: \[\begin{equation}
Y = 2 + X + 3X^2 - 0.5X^3 + \epsilon
\label{eqn:model}
\end{equation}\] Finally store vectors \(X\) and \(Y\) in a data frame df with
with column names x and y (hint: use
data.frame() to create the data frame.set.seed(1)
n <- 100
X <- rnorm(n)
epsilon <- rnorm(n)
Y <- 2 + X + 3 * X^2 - 0.5 * X^3 + epsilon
df <- data.frame(x = X, y = Y)
regsubsets() to perform best subset
selection to identify the model containing the best combination of the
predictors \(X,\,X^2,\,...X^{10}\).
Perform best subset selection and store the results in the variable
regfit.full. What powers of \(X\) were selected for the model with \(k=3\) predictors, and is this what you
expected? (hint: use library(leaps) to load the required
library. A \(p\)th degree polynomial
model can be specified using the formula
y~poly(x, p, raw=TRUE) in R. Alternatively it can be
specified using y~x+I(x^2)+I(x^3)+.... The command
summary() show what predictors were selected).library(leaps)
regfit.full <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = df, nvmax = 10)
regfit.sum <- summary(regfit.full)
print(regfit.sum)
## Subset selection object
## Call: regsubsets.formula(y ~ poly(x, 10, raw = TRUE), data = df, nvmax = 10)
## 10 Variables (and intercept)
## Forced in Forced out
## poly(x, 10, raw = TRUE)1 FALSE FALSE
## poly(x, 10, raw = TRUE)2 FALSE FALSE
## poly(x, 10, raw = TRUE)3 FALSE FALSE
## poly(x, 10, raw = TRUE)4 FALSE FALSE
## poly(x, 10, raw = TRUE)5 FALSE FALSE
## poly(x, 10, raw = TRUE)6 FALSE FALSE
## poly(x, 10, raw = TRUE)7 FALSE FALSE
## poly(x, 10, raw = TRUE)8 FALSE FALSE
## poly(x, 10, raw = TRUE)9 FALSE FALSE
## poly(x, 10, raw = TRUE)10 FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
## poly(x, 10, raw = TRUE)1 poly(x, 10, raw = TRUE)2
## 1 ( 1 ) " " "*"
## 2 ( 1 ) " " "*"
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)3 poly(x, 10, raw = TRUE)4
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)5 poly(x, 10, raw = TRUE)6
## 1 ( 1 ) " " " "
## 2 ( 1 ) "*" " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) " " " "
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)7 poly(x, 10, raw = TRUE)8
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) " " "*"
## 10 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)9 poly(x, 10, raw = TRUE)10
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) " " "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
regfit.sum=summary(regfit.full) computes these three values
for all models. For each of the measures \(C_p\), \(BIC\) or the adjusted \(R^2\), do: 1) create a plot showing how the
value depends on the number of predictors, 2) find the optimal value of
\(k\) according to the measure, and 3)
report the coefficients and compare them to the true model. (hint: use
names() to find what values are stored in
regfit.sum. These can then be accessed using, for example
regfit.sum$cp. which.min() and
which.max() are useful for finding the index in an array
with the largest or smallest value. The command \(coef()\) is uefull for getting the
coefficients.)plot(regfit.sum$cp, type = 'b', xlab = "Number of Predictors", ylab = "Cp")
plot(regfit.sum$bic, type = 'b', xlab = "Number of Predictors", ylab = "BIC")
plot(regfit.sum$adjr2, type = 'b', xlab = "Number of Predictors", ylab = "Adjusted R^2")
k_cp <- which.min(regfit.sum$cp)
k_bic <- which.min(regfit.sum$bic)
k_adjr2 <- which.max(regfit.sum$adjr2)
coef(regfit.full, k_cp)
## (Intercept) poly(x, 10, raw = TRUE)1 poly(x, 10, raw = TRUE)2
## 2.07200775 1.38745596 2.84575641
## poly(x, 10, raw = TRUE)3 poly(x, 10, raw = TRUE)5
## -0.94202574 0.08072292
coef(regfit.full, k_bic)
## (Intercept) poly(x, 10, raw = TRUE)1 poly(x, 10, raw = TRUE)2
## 2.0615072 0.9752803 2.8762090
## poly(x, 10, raw = TRUE)3
## -0.4823614
coef(regfit.full, k_adjr2)
## (Intercept) poly(x, 10, raw = TRUE)1 poly(x, 10, raw = TRUE)2
## 2.07200775 1.38745596 2.84575641
## poly(x, 10, raw = TRUE)3 poly(x, 10, raw = TRUE)5
## -0.94202574 0.08072292
regfit.fwd <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = df, nvmax = 10, method = "forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(y ~ poly(x, 10, raw = TRUE), data = df, nvmax = 10,
## method = "forward")
## 10 Variables (and intercept)
## Forced in Forced out
## poly(x, 10, raw = TRUE)1 FALSE FALSE
## poly(x, 10, raw = TRUE)2 FALSE FALSE
## poly(x, 10, raw = TRUE)3 FALSE FALSE
## poly(x, 10, raw = TRUE)4 FALSE FALSE
## poly(x, 10, raw = TRUE)5 FALSE FALSE
## poly(x, 10, raw = TRUE)6 FALSE FALSE
## poly(x, 10, raw = TRUE)7 FALSE FALSE
## poly(x, 10, raw = TRUE)8 FALSE FALSE
## poly(x, 10, raw = TRUE)9 FALSE FALSE
## poly(x, 10, raw = TRUE)10 FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: forward
## poly(x, 10, raw = TRUE)1 poly(x, 10, raw = TRUE)2
## 1 ( 1 ) " " "*"
## 2 ( 1 ) " " "*"
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)3 poly(x, 10, raw = TRUE)4
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" " "
## 9 ( 1 ) "*" " "
## 10 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)5 poly(x, 10, raw = TRUE)6
## 1 ( 1 ) " " " "
## 2 ( 1 ) "*" " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)7 poly(x, 10, raw = TRUE)8
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)9 poly(x, 10, raw = TRUE)10
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" " "
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
regfit.bwd <- regsubsets(y ~ poly(x, 10, raw = TRUE), data = df, nvmax = 10, method = "backward")
summary(regfit.bwd)
## Subset selection object
## Call: regsubsets.formula(y ~ poly(x, 10, raw = TRUE), data = df, nvmax = 10,
## method = "backward")
## 10 Variables (and intercept)
## Forced in Forced out
## poly(x, 10, raw = TRUE)1 FALSE FALSE
## poly(x, 10, raw = TRUE)2 FALSE FALSE
## poly(x, 10, raw = TRUE)3 FALSE FALSE
## poly(x, 10, raw = TRUE)4 FALSE FALSE
## poly(x, 10, raw = TRUE)5 FALSE FALSE
## poly(x, 10, raw = TRUE)6 FALSE FALSE
## poly(x, 10, raw = TRUE)7 FALSE FALSE
## poly(x, 10, raw = TRUE)8 FALSE FALSE
## poly(x, 10, raw = TRUE)9 FALSE FALSE
## poly(x, 10, raw = TRUE)10 FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: backward
## poly(x, 10, raw = TRUE)1 poly(x, 10, raw = TRUE)2
## 1 ( 1 ) " " "*"
## 2 ( 1 ) " " "*"
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)3 poly(x, 10, raw = TRUE)4
## 1 ( 1 ) " " " "
## 2 ( 1 ) "*" " "
## 3 ( 1 ) "*" " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)5 poly(x, 10, raw = TRUE)6
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)7 poly(x, 10, raw = TRUE)8
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " "*"
## 6 ( 1 ) " " "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) " " "*"
## 10 ( 1 ) "*" "*"
## poly(x, 10, raw = TRUE)9 poly(x, 10, raw = TRUE)10
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) "*" " "
## 5 ( 1 ) "*" " "
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
cv.glmnet() (imported using
library(glmnet)) with alpha=1 to fit a LASSO
model to the synthetic data and store it in cv.out. Plot
how the cross-validation error depends on \(\lambda\). How many predictors were
selected? Is the LASSO model similar to the true model? (hint: An
example of how to fit a model using cv.glmnet() can be
found on page 278. As input, cv.glmnet() needs the design
matrix with the intercept column removed, as demonstrated on page 274.
Both the \(\lambda\) values and the
corresponding cross-validation error are stored in cv.out;
use names() to find them! It may be useful to use the
argument log='x' in plot())library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
x_matrix <- model.matrix(y ~ poly(x, 10, raw = TRUE), data = df)[, -1] # Remove intercept
y_response <- df$y
cv.out <- cv.glmnet(x_matrix, y_response, alpha = 1)
plot(cv.out)
best_lambda <- cv.out$lambda.min
lasso_model <- glmnet(x_matrix, y_response, alpha = 1, lambda = best_lambda)
coef(lasso_model)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 2.1482869810
## poly(x, 10, raw = TRUE)1 0.8852714744
## poly(x, 10, raw = TRUE)2 2.7158832671
## poly(x, 10, raw = TRUE)3 -0.4596297829
## poly(x, 10, raw = TRUE)4 .
## poly(x, 10, raw = TRUE)5 .
## poly(x, 10, raw = TRUE)6 0.0036926467
## poly(x, 10, raw = TRUE)7 .
## poly(x, 10, raw = TRUE)8 0.0007278168
## poly(x, 10, raw = TRUE)9 .
## poly(x, 10, raw = TRUE)10 .
In the previous problem you performed subset selection using three approaches: best subset selection, forward stepwise selection, and backward stepwise selection. For each approach, you obtained the models \(M_0\), …, \(M_p\) containing \(0, 1, 2, ..., p\) predictors. In this problem, you will compare the theoretical properties of the three approaches.
The best subset selection will have the smallest training RSS because it evaluates all possible predictor combinations. Forward and backward stepwise selection may match the best subset model’s RSS, but usually will not, because they are limited.
Any method could perform best, as it depends on which predictors go best with the actual data pattern. None of these methods is guaranteed to have the smallest test RSS, but they can produce similar test errors if they select similar predictors.
True: In forward stepwise selection, each (k+1) predictor model includes all k-predictors + 1, so each k-predictor model is a subset of the next.
False: Backward stepwise doesn’t guarantee that each k-predictor model is a subset of the previous (k+1) predictor model, because predictors are removed sequentially.
False: Forward and backward stepwise models do not follow the same paths, so the predictors chosen by one method are not necessarily subsets of the other.
False: Forward stepwise predictors are not guaranteed to be subsets of those chosen by backward stepwise selection.The two methods can select different predictors at each step, so there’s no subset relationship.
True: Best subset selection includes each k-predictor model as a subset of the (k+1) model since it explores all possible combinations.
In this problem, you will study the fat dataset from the
library faraway. To load this data, use:
#install.packages("faraway")
library(faraway)
data(fat)
Here the first line installs the package and only needs to be
executed once. This dataset contains biometric data for 252 individuals.
Our goal is to predict the body fat (stored in brozek)
using all other predictors except for siri,
density and free. These predictors can be
removed using
df=subset(fat, select = -c(siri,density,free))
We also remove one datapoint with an anomalously short individual, and split the data into a training and testing dataset
df=df[df$height>50,]
set.seed(1)
train=sample(1:nrow(df), nrow(df)/2)
df.tr = df[train,]
df.te = df[-train,]
lm.fit <- lm(brozek ~ ., data = df.tr)
train.mse <- mean(lm.fit$residuals^2)
pred.te <- predict(lm.fit, newdata = df.te)
test.mse <- mean((df.te$brozek - pred.te)^2)
cv.out, with \(\lambda\)
chosen by cross-validation. What is the optimal value of \(\lambda\)? Compute and report the training
and testing MSE. (hint: use cv.glmnet() as in Problem 1e
but with a different value of alpha. The optimal value of
\(\lambda\) is stored in
cv.out$lambda.min. When making predictions, use the
function predict() with the model cv.out; the
argument newx is used to specify the matrix of (possibly
new) predictor values as input, and the argument s is used
to specify what lambda to use for the prediction).library(glmnet)
x <- model.matrix(brozek ~ ., df.tr)[, -1]
y <- df.tr$brozek
cv.out <- cv.glmnet(x, y, alpha = 0)
best.lambda <- cv.out$lambda.min
ridge.train.pred <- predict(cv.out, newx = x, s = best.lambda)
ridge.train.mse <- mean((y - ridge.train.pred)^2)
x.te <- model.matrix(brozek ~ ., df.te)[, -1]
ridge.test.pred <- predict(cv.out, newx = x.te, s = best.lambda)
ridge.test.mse <- mean((df.te$brozek - ridge.test.pred)^2)
cv.out.lasso <- cv.glmnet(x, y, alpha = 1)
lasso.train.pred <- predict(cv.out.lasso, newx = x, s = cv.out.lasso$lambda.min)
lasso.train.mse <- mean((y - lasso.train.pred)^2)
lasso.test.pred <- predict(cv.out.lasso, newx = x.te, s = cv.out.lasso$lambda.min)
lasso.test.mse <- mean((df.te$brozek - lasso.test.pred)^2)
lasso.coef <- coef(cv.out.lasso, s = cv.out.lasso$lambda.min)
num.nonzero <- sum(lasso.coef != 0)
regfit.full. Plot the \(C_p\), BIC and adjusted \(R^2\) as a function of the number of
predictors \(k\). What number of
predictors minimize \(BIC\)? What
predictors does this model use? (hint: See problem 1c)regfit.full <- regsubsets(brozek ~ ., data = df.tr, nvmax = ncol(df.tr) - 1)
regfit.sum <- summary(regfit.full)
plot(regfit.sum$cp, xlab = "Number of Predictors", ylab = "Cp", type = "l")
plot(regfit.sum$bic, xlab = "Number of Predictors", ylab = "BIC", type = "l")
plot(regfit.sum$adjr2, xlab = "Number of Predictors", ylab = "Adjusted R^2", type = "l")
best.bic.k <- which.min(regfit.sum$bic)
best.bic.predictors <- coef(regfit.full, id = best.bic.k)
regsubsets():predict.regsubsets <- function(object, newdata, id, ...) {
form <- as.formula(object$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(object, id = id)
xvars <- names(coefi)
mat[, xvars] %*% coefi
}