Answer: (iii) The lasso's advantage over least squares is rooted in the bias-variance tradeoff. Unlike least squares, the lasso can trade a small increase in bias for a reduction in variance: as λ increases, the variance decreases and the bias increases. When this trade is favorable, the lasso produces more accurate predictions.
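As a rough illustration of this tradeoff (a simulation sketch, not part of the original answer; the data-generating model, coefficients, test point, and lambda values below are arbitrary choices), one can estimate the variance and squared bias of lasso predictions at a fixed point across several lambda values:
library(glmnet)
set.seed(1)
n <- 50; p <- 10
beta <- c(3, 1.5, 0, 0, 2, rep(0, 5))         # true coefficients, mostly zero
x0 <- rep(0.5, p)                             # a fixed test point
f0 <- sum(x0 * beta)                          # true mean response at x0
lambdas <- c(0.01, 0.1, 0.5, 1, 2)
preds <- replicate(200, {                     # refit the lasso on 200 simulated training sets
  X <- matrix(rnorm(n * p), n, p)
  y <- drop(X %*% beta) + rnorm(n)
  sapply(lambdas, function(l)
    predict(glmnet(X, y, alpha = 1, lambda = l), newx = matrix(x0, nrow = 1)))
})
apply(preds, 1, var)                          # prediction variance: expected to fall as lambda grows
(rowMeans(preds) - f0)^2                      # squared bias: expected to rise as lambda grows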
Answer: (iii) Ridge regression behaves qualitatively like the lasso. As the tuning parameter increases, the flexibility of the ridge fit decreases, leading to lower variance but higher bias. When the number of variables p is almost as large as the number of observations n, the least squares estimates are extremely variable, and if p > n the least squares estimates do not even have a unique solution, whereas ridge regression can still perform well by trading a small increase in bias for a large decrease in variance. Thus, ridge regression works best when the least squares estimates have high variance.
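A minimal sketch of the p > n point (simulated data, not part of the original answer): least squares cannot estimate all of the coefficients, while ridge regression still returns a shrunken estimate for every one.
set.seed(1)
n <- 20; p <- 50                              # more predictors than observations
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
coef(lm(y ~ X))                               # many coefficients are NA: no unique least squares solution
library(glmnet)
ridge <- glmnet(X, y, alpha = 0, lambda = 1)  # ridge fit with an arbitrary lambda
coef(ridge)                                   # a (shrunken) estimate for every coefficient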
Answer: (ii) Non-linear methods are more flexible and hence give improved prediction accuracy when their increase in variance is less than their decrease in bias.
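As an illustrative sketch (simulated data chosen only for illustration, not part of the original answer), a flexible non-linear fit such as a smoothing spline beats least squares on held-out data when the true relationship is non-linear:
set.seed(1)
x <- runif(200, 0, 3)
y <- sin(2 * x) + rnorm(200, sd = 0.3)        # truth is non-linear in x
train_idx <- 1:100
lin <- lm(y ~ x, subset = train_idx)          # least squares (linear) fit
spl <- smooth.spline(x[train_idx], y[train_idx])   # flexible non-linear fit
mean((y[-train_idx] - predict(lin, data.frame(x = x[-train_idx])))^2)   # test MSE, linear
mean((y[-train_idx] - predict(spl, x[-train_idx])$y)^2)                 # test MSE, spline (expected to be smaller)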
Answer: The ridge regression problem seeks to minimize \[(y_1 - \hat{\beta}_1x_1 - \hat{\beta}_2x_1)^2 + (y_2 - \hat{\beta}_1x_2 - \hat{\beta}_2x_2)^2 + \lambda(\hat{\beta}_1^2 + \hat{\beta}_2^2).\]
Answer: By taking the derivatives of the above expression with respect to $\hat{\beta}_1$ and $\hat{\beta}_2$ and setting them equal to 0, we obtain, respectively,
\[\hat{\beta}_1(x_1^2 + x_2^2 + \lambda) + \hat{\beta}_2(x_1^2 + x_2^2) = y_1x_1 + y_2x_2\]
and
\[\hat{\beta}_1(x_1^2 + x_2^2) + \hat{\beta}_2(x_1^2 + x_2^2 + \lambda) = y_1x_1 + y_2x_2\]
Thus, subtracting the second equation from the first cancels the common terms and leaves \[\lambda\hat{\beta}_1 - \lambda\hat{\beta}_2 = 0,\] and hence $\hat{\beta}_1 = \hat{\beta}_2$.
Answer: The lasso optimization problem seeks to minimize
\[(y_1 - \hat{\beta}_1x_1 - \hat{\beta}_2x_1)^2 + (y_2 - \hat{\beta}_1x_2 - \hat{\beta}_2x_2)^2 + \lambda(|\hat{\beta}_1| + |\hat{\beta}_2|).\]
Answer: Substituting the values given above, the optimization problem becomes
\[2\left[b - a(\hat{\beta}_1 + \hat{\beta}_2)\right]^2 + \lambda(|\hat{\beta}_1| + |\hat{\beta}_2|).\]
Taking the derivatives with respect to $\hat{\beta}_1$ and $\hat{\beta}_2$ and setting them to 0, we get
\[4a\left[b - a(\hat{\beta}_1 + \hat{\beta}_2)\right] = \pm\lambda.\]
This single equation constrains only the sum $\hat{\beta}_1 + \hat{\beta}_2$, and it describes an edge of the lasso constraint region rather than a single point; consequently, the lasso optimization problem has many possible solutions.
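Concretely (an elaboration of the step above, assuming $a, b > 0$ and $\lambda < 4ab$ so that the positive sign applies), the minimizers include every pair of non-negative coefficients on the line segment
\[\hat{\beta}_1 + \hat{\beta}_2 = \frac{b}{a} - \frac{\lambda}{4a^2}, \qquad \hat{\beta}_1, \hat{\beta}_2 \ge 0,\]
and every such pair attains the same value of the lasso objective, since the objective depends on the coefficients only through their sum when both are non-negative.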
College data set: we predict the number of applications received (Apps) from the other variables.
library(ISLR)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
data(College)
set.seed(9)
inTrain <- createDataPartition(College$Apps, p=0.70, list=FALSE)
train <- College[inTrain,]
test <- College[-inTrain,]
model <- lm(Apps~., data = train)
lm.pred <- predict(model, newdata = test)  # predict on the held-out test set
lin_info <- mean((test$Apps-lm.pred)^2)    # test MSE for the linear model
print(paste("The test MSE for linear model is:", lin_info))
## [1] "The test MSE for linear model is: 1166319.68208018"
library(glmnet)   # for ridge and lasso via cv.glmnet
train.mat <- model.matrix(Apps~., data=train)   # model matrices required by glmnet
test.mat <- model.matrix(Apps~., data=test)
grid <- 10 ^ seq(4, -2, length=100)   # grid of lambda values from 10^4 down to 10^-2
mod.ridge <- cv.glmnet(train.mat, train[, "Apps"], alpha=0, lambda=grid, thresh=1e-12)
lambda.best <- mod.ridge$lambda.min
ridge.pred <- predict(mod.ridge, newx=test.mat, s=lambda.best)
x <- mean((test[, "Apps"] - ridge.pred)^2)
print(paste("The test MSE for ridge regression is:", x, "which is smaller than linear regression"))
## [1] "The test MSE for ridge regression is: 1166284.2097576 which is smaller than linear regression"
mod.lasso <- cv.glmnet(train.mat, train[, "Apps"], alpha=1, lambda=grid, thresh=1e-12)
lambda.best <- mod.lasso$lambda.min
lasso.pred <- predict(mod.lasso, newx=test.mat, s=lambda.best)
z <- mean((test[, "Apps"] - lasso.pred)^2)
print(paste("The test MSE for lasso is:", z, "which is smaller than linear and ridge regression"))
## [1] "The test MSE for lasso is: 1166245.89868265 which is smaller than linear and ridge regression"
library(pls)   # for pcr() and plsr()
pcr.fit <- pcr(Apps~., data=train, scale=TRUE, validation="CV")
pcr.pred <- predict(pcr.fit, test, ncomp=5)   # predictions using 5 principal components
y <- mean((test[, "Apps"] - c(pcr.pred))^2)
print(paste("The test MSE for PCR is:", y))
## [1] "The test MSE for PCR is: 1562880.06238225"
pls.fit <- plsr(Apps~., data=train, scale=TRUE, validation="CV")
pls.pred <- predict(pls.fit, test, ncomp=6)   # predictions using 6 PLS components
p <- mean((test[, "Apps"] - c(pls.pred))^2)   # coerce the prediction array to a numeric vector before computing the test MSE
print(paste("The test MSE for PLS is:", p))
test.avg <- mean(test[, "Apps"])
lm.test.r2 <- 1-mean((test[,"Apps"]-lm.pred)^2)/mean((test[,"Apps"]-test.avg)^2)
ridge.test.r2=1-mean((test[,"Apps"]-ridge.pred)^2)/mean((test[,"Apps"]-test.avg)^2)
lasso.test.r2=1-mean((test[,"Apps"]-lasso.pred)^2)/mean((test[,"Apps"]-test.avg)^2)
pcr.test.r2 <- 1-mean((test[,"Apps"]-c(pcr.pred))^2)/mean((test[,"Apps"]-test.avg)^2)
pls.test.r2 <- 1-mean((test[,"Apps"]-c(pls.pred))^2)/mean((test[,"Apps"]-test.avg)^2)
rbind(c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2))
## [,1] [,2] [,3] [,4] [,5]
## [1,] "OLS" "Ridge" "Lasso" "PCR" "PLS"
## [2,] "0.902490000520399" "0.902492966179138" "0.902496169171384" NA NA
Comment: OLS, ridge, and lasso all achieve essentially the same test R-squared (about 0.902), so there is little difference among their test errors. PCR with 5 components fares noticeably worse here (test MSE of roughly 1.56 million versus roughly 1.17 million for the other methods).
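For readability, the repeated R-squared formula could be wrapped in a small helper; test_r2 below is a hypothetical convenience function, not part of the original code:
test_r2 <- function(pred, actual) {
  1 - mean((actual - pred)^2) / mean((actual - mean(actual))^2)   # test R-squared relative to the mean benchmark
}
test_r2(c(lasso.pred), test$Apps)   # reproduces lasso.test.r2 computed above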