Question 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

2A: The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Answer: (iii) The lasso’s advantage over least squares is rooted in the bias-variance trade-off. Relative to least squares, the lasso is less flexible: as λ increases, the variance of the fit decreases while its bias increases. When the decrease in variance exceeds the small increase in bias, the lasso produces more accurate predictions.
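
A minimal simulation sketch of this trade-off (assuming the glmnet package; the sparse data-generating model and the λ grid below are purely illustrative): as λ grows, the variance of the lasso predictions at a fixed test point falls while the squared bias rises.

library(glmnet)
set.seed(1)

p <- 20; n <- 50
beta <- c(rep(2, 3), rep(0, p - 3))       # only 3 of 20 predictors matter
x0 <- matrix(rnorm(p), nrow = 1)          # a fixed test point
f0 <- drop(x0 %*% beta)                   # its true mean response

lambdas <- c(10, 1, 0.1, 0.01)            # from heavy to light shrinkage
preds <- replicate(200, {                 # 200 independent training sets
  x <- matrix(rnorm(n * p), n, p)
  y <- drop(x %*% beta) + rnorm(n)
  fit <- glmnet(x, y, alpha = 1, lambda = lambdas)
  drop(predict(fit, newx = x0))           # lasso prediction at x0 for each lambda
})

(rowMeans(preds) - f0)^2                  # squared bias: larger for larger lambda
apply(preds, 1, var)                      # variance: smaller for larger lambda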

2B: Repeat (a) for ridge regression relative to least squares.

Answer: (iii) Ridge regression behaves qualitatively like the lasso. As the tuning parameter λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. When the number of variables p is almost as large as the number of observations n, the least squares estimates are extremely variable, and if p > n, least squares does not even have a unique solution, whereas ridge regression can still perform well by trading a small increase in bias for a large decrease in variance. Thus, ridge regression works best in situations where the least squares estimates have high variance.
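
A small sketch of the p > n point (again assuming glmnet; the simulated data are purely illustrative): least squares cannot produce a unique fit, while ridge regression still returns estimates for every coefficient.

library(glmnet)
set.seed(2)

n <- 20; p <- 50                      # more predictors than observations
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - x[, 2] + rnorm(n)

ols <- lm(y ~ x)                      # rank-deficient least squares fit
sum(is.na(coef(ols)))                 # many coefficients are NA: no unique solution

ridge <- glmnet(x, y, alpha = 0, lambda = 1)
sum(is.na(as.matrix(coef(ridge))))    # 0: ridge estimates all the coefficients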

2C: Repeat (a) for non-linear methods relative to least squares.

Answer: (ii) Non-linear methods are more flexible than least squares and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias.
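
A toy base-R illustration (simulated data): when the true relationship is non-linear, a more flexible fit trades a little extra variance for a large drop in bias and typically wins on test error.

set.seed(3)
n <- 200
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)               # truly non-linear signal
train <- sample(n, n / 2)

lin  <- lm(y ~ x, subset = train)              # rigid linear fit (high bias)
flex <- lm(y ~ poly(x, 5), subset = train)     # flexible polynomial fit (low bias)

mean((y[-train] - predict(lin,  data.frame(x = x[-train])))^2)   # linear test MSE
mean((y[-train] - predict(flex, data.frame(x = x[-train])))^2)   # usually noticeably smaller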

Question 5

It is well-known that ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso may give quite different coefficient values to correlated variables. We will now explore this property in a very simple setting.

Suppose that n = 2, p = 2, x11 = x12, x21 = x22. Furthermore, suppose that y1 + y2 = 0 and x11 + x21 = 0 and x12 + x22 = 0, so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: β̂0 = 0.

5A: Write out the ridge regression optimization problem in this setting.

Answer: Writing x1 = x11 = x12 and x2 = x21 = x22, the ridge regression problem seeks to minimize \[(y_1 - \hat{\beta}_1x_1 - \hat{\beta}_2x_1)^2 + (y_2 - \hat{\beta}_1x_2 - \hat{\beta}_2x_2)^2 + \lambda(\hat{\beta}_1^2 + \hat{\beta}_2^2).\]

5B: Argue that in this setting, the ridge coefficient estimates satisfy β̂1 = β̂2.

Answer: By taking the derivatives of the above expression with respect to β̂1 and β̂2 and setting them equal to 0, we obtain respectively

\[\hat{\beta}_1(x_1^2 + x_2^2 + \lambda) + \hat{\beta}_2(x_1^2 + x_2^2) = y_1x_1 + y_2x_2\]

and

\[\hat{\beta}_1(x_1^2 + x_2^2) + \hat{\beta}_2(x_1^2 + x_2^2 + \lambda) = y_1x_1 + y_2x_2\]

Subtracting the second equation from the first gives λ(β̂1 − β̂2) = 0, so for any λ > 0 we must have β̂1 = β̂2.
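
A quick numerical check of the system above, using arbitrary illustrative values for x1, y1, and λ (any values satisfying the stated constraints would do):

x1 <- 2; x2 <- -2          # x11 = x12 = 2, so x21 = x22 = -2
y1 <- 3; y2 <- -3          # y1 + y2 = 0
lambda <- 0.5

# The 2x2 linear system from the two derivative conditions above
A <- matrix(c(x1^2 + x2^2 + lambda, x1^2 + x2^2,
              x1^2 + x2^2,          x1^2 + x2^2 + lambda),
            nrow = 2, byrow = TRUE)
b <- rep(y1*x1 + y2*x2, 2)

solve(A, b)                # the two ridge coefficients come out equal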

5C: Write out the lasso optimization problem in this setting.

Answer: The lasso optimization problem seeks to minimize

\[(y_1 - \hat{\beta}_1x_1 - \hat{\beta}_2x_1)^2 + (y_2 - \hat{\beta}_1x_2 - \hat{\beta}_2x_2)^2 + \lambda(|\hat{\beta}_1| + |\hat{\beta}_2|).\]

5D: Argue that in this setting, the lasso coefficients β̂1 and β̂2 are not unique—in other words, there are many possible solutions to the optimization problem in (c). Describe these solutions.

Answer: Substituting x11 = x12 = a, x21 = x22 = −a, y1 = b, and y2 = −b (consistent with the constraints above), the optimization problem reduces to minimizing

\[2[b - a(\beta_1 + \beta_2)]^2 + \lambda(|\beta_1| + |\beta_2|).\]

For β1, β2 ≠ 0, taking the derivative with respect to either coefficient and setting it to 0, we get:

\[4a[b - a(\beta_1 + \beta_2)] = \pm\lambda,\]

where the sign on the right matches the sign of the corresponding coefficient.

This condition involves β1 and β2 only through their sum β1 + β2, so it pins down the sum but not the individual coefficients. Geometrically, the contours of the squared-error term are lines of the form β1 + β2 = c, which are parallel to one edge of the diamond-shaped lasso constraint |β1| + |β2| ≤ s. The objective is therefore minimized along an entire edge of the constraint region rather than at a single point: any pair with β1 + β2 equal to the optimal sum, with β1 and β2 of the appropriate common sign, solves the problem, so the lasso coefficients are not unique.
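
A numerical sketch of this non-uniqueness, evaluating the reduced lasso objective from part (c) on a grid (the values of a, b, and λ are arbitrary illustrations): all near-optimal pairs share the same sum β1 + β2, while each coefficient individually ranges over an interval.

a <- 2; b <- 3; lambda <- 0.5

# Reduced lasso objective 2[b - a(beta1 + beta2)]^2 + lambda(|beta1| + |beta2|)
obj <- function(b1, b2) 2 * (b - a * (b1 + b2))^2 + lambda * (abs(b1) + abs(b2))

grid <- expand.grid(b1 = seq(-1, 2, by = 0.01), b2 = seq(-1, 2, by = 0.01))
grid$val <- obj(grid$b1, grid$b2)

best <- grid[grid$val <= min(grid$val) + 1e-8, ]  # all (near-)minimizers on the grid
range(best$b1 + best$b2)   # the sum is pinned down to a single value
range(best$b1)             # ...but beta1 alone varies over a whole interval
nrow(best)                 # many distinct solutions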

Question 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

9A: Split the data set into a training set and a test set.
library(ISLR)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(glmnet)  # needed for cv.glmnet() in parts (c) and (d)
library(pls)     # needed for pcr() and plsr() in parts (e) and (f)
data(College)
set.seed(9)

inTrain <- createDataPartition(College$Apps, p=0.70, list=FALSE)
train <- College[inTrain,]
test <- College[-inTrain,]
9B: Fit a linear model using least squares on the training set, and report the test error obtained.
model <- lm(Apps~., data = train)
lm.pred <- predict(model, newdata = test) # predictions on the held-out test set
lin_info <- mean((test$Apps-lm.pred)^2) # test MSE
print(paste("The test MSE for linear model is:", lin_info))
## [1] "The test MSE for linear model is: 1166319.68208018"
9C: Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
train.mat <- model.matrix(Apps~., data=train)
test.mat <- model.matrix(Apps~., data=test)
grid <- 10 ^ seq(4, -2, length=100)
mod.ridge <- cv.glmnet(train.mat, train[, "Apps"], alpha=0, lambda=grid, thresh=1e-12)
lambda.best <- mod.ridge$lambda.min
ridge.pred <- predict(mod.ridge, newx=test.mat, s=lambda.best)
x <- mean((test[, "Apps"] - ridge.pred)^2)
print(paste("The test MSE for ridge regression is:", x, "which is smaller than linear regression"))
## [1] "The test MSE for ridge regression is: 1166284.2097576 which is smaller than linear regression"
9D: Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
mod.lasso <- cv.glmnet(train.mat, train[, "Apps"], alpha=1, lambda=grid, thresh=1e-12)
lambda.best <- mod.lasso$lambda.min
lasso.pred <- predict(mod.lasso, newx=test.mat, s=lambda.best)
z <- mean((test[, "Apps"] - lasso.pred)^2)
print(paste("The test MSE for lasso is:", z, "which is smaller than linear and ridge regression"))
## [1] "The test MSE for lasso is: 1166245.89868265 which is smaller than linear and ridge regression"
9E: Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pcr.fit <- pcr(Apps~., data=train, scale=T, validation="CV")
pcr.pred <- predict(pcr.fit, test, ncomp=5)
y <- mean((test[, "Apps"] - c(pcr.pred))^2)
print(paste("The test MSE for PCR is:", y))
## [1] "The test MSE for PCR is: 1562880.06238225"
9F: Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls.fit <- plsr(Apps~., data=train, scale=T, validation="CV")
pls.pred <- predict(pls.fit, test, ncomp=6)
p <- mean((test[, "Apps"] - c(pls.pred))^2) # coerce predictions to a numeric vector before computing the test MSE
print(paste("The test MSE for PLS is:", p))
9G: Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
test.avg <- mean(test[, "Apps"])
lm.test.r2 <- 1 - mean((test[,"Apps"] - lm.pred)^2) / mean((test[,"Apps"] - test.avg)^2)
ridge.test.r2 <- 1 - mean((test[,"Apps"] - ridge.pred)^2) / mean((test[,"Apps"] - test.avg)^2)
lasso.test.r2 <- 1 - mean((test[,"Apps"] - lasso.pred)^2) / mean((test[,"Apps"] - test.avg)^2)
pcr.test.r2 <- 1 - mean((test[,"Apps"] - c(pcr.pred))^2) / mean((test[,"Apps"] - test.avg)^2)
pls.test.r2 <- 1 - mean((test[,"Apps"] - c(pls.pred))^2) / mean((test[,"Apps"] - test.avg)^2)
rbind(c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
      c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2))

Comment: On this train/test split, least squares, ridge, and lasso give nearly identical results: test MSEs of roughly 1.17 million and test R² values of about 0.902, so each explains about 90% of the variance in the number of applications on the test set. PCR with five components does somewhat worse (test MSE of about 1.56 million). Overall, the number of applications can be predicted quite accurately, and there is little difference among the test errors of the least squares and shrinkage approaches.