Load libraries and attach data sets
library(ISLR2)
library(glmnet)
library(pls)
library(leaps)
library(tidyverse)
attach(College)
attach(Boston)
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Justify your answer:
The lasso minimizes the RSS plus a shrinkage penalty on the absolute values of the coefficients, so it is less flexible than least squares. As lambda increases, the variance decreases at a faster rate than the bias increases: the larger shrinkage penalty pushes coefficients toward (and eventually to) zero, which reduces the prediction variance at the cost of a small increase in bias. When that decrease in variance outweighs the increase in bias, the bias-variance trade-off makes the lasso more accurate than least squares.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Justify your answer:
Ridge regression’s justification is very similar to the lasso’s, except that its penalty shrinks coefficients toward zero without ever setting them exactly to zero, so no predictors are removed from the model. Over a range of lambda values there is still a larger reduction in variance than increase in bias, so the bias-variance trade-off may again make ridge regression a better fit than least squares; the sketch below illustrates the shrinkage behavior of both penalties.
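As a purely illustrative aside (not part of the assignment data), the minimal sketch below fits ridge and lasso coefficient paths with glmnet on a small simulated data set; the object names x_sim, y_sim, ridge_path, and lasso_path are made up for this example. It simply shows coefficients shrinking toward zero (ridge) or to exactly zero (lasso) as lambda grows, which is the mechanism behind the variance reduction described above.
set.seed(1)
n <- 100
p <- 10
x_sim <- matrix(rnorm(n * p), n, p)
y_sim <- drop(x_sim %*% c(3, -2, 1.5, rep(0, p - 3)) + rnorm(n))
ridge_path <- glmnet(x_sim, y_sim, alpha = 0) # alpha = 0 gives the ridge (L2) penalty
lasso_path <- glmnet(x_sim, y_sim, alpha = 1) # alpha = 1 gives the lasso (L1) penalty
op <- par(mfrow = c(1, 2))
plot(ridge_path, xvar = "lambda") # ridge: coefficients shrink toward zero
plot(lasso_path, xvar = "lambda") # lasso: coefficients reach exactly zero
par(op)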
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Justify your answer:
Non-linear methods make fewer assumptions because they do not assume a linear relationship between the predictors and the response, so they are more flexible than least squares. When the relationship is in fact non-linear, the decrease in bias from that extra flexibility is more substantial than the increase in variance, so the bias-variance trade-off may make such a model better than least squares, as the toy example below illustrates.
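A hypothetical illustration of that trade-off (again on simulated data, with made-up object names): a rigid linear fit to a non-linear signal carries high bias, while a more flexible polynomial fit trades a little variance for a large drop in bias and ends up with the lower test MSE.
set.seed(1)
x_nl <- runif(200, -3, 3)
y_nl <- sin(x_nl) + rnorm(200, sd = 0.3) # true relationship is non-linear
train_id <- sample(200, 150)
fit_linear <- lm(y_nl ~ x_nl, subset = train_id) # inflexible fit: high bias here
fit_flex <- lm(y_nl ~ poly(x_nl, 5), subset = train_id) # flexible fit: lower bias
test_nl <- data.frame(x_nl = x_nl[-train_id], y_nl = y_nl[-train_id])
mean((predict(fit_linear, test_nl) - test_nl$y_nl)^2) # larger test MSE
mean((predict(fit_flex, test_nl) - test_nl$y_nl)^2) # smaller test MSE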
College data set. I split the data 75/25 into training and test sets.
set.seed(22)
train_num = sample(nrow(College), (nrow(College)*.75))
train <- College[train_num, ]
test <- College[-train_num, ]
The linear model MSE is 894367.7.
fit_lm <- lm(Apps ~ ., data = train)
pred_lm <- predict(fit_lm, test)
LS_MSE <- mean((pred_lm - test$Apps)^2)
LS_MSE
## [1] 894367.7
The ridge regression MSE is 894344.5.
set.seed(22)
train_mat <- model.matrix(Apps ~ ., data = train)
test_mat <- model.matrix(Apps ~ ., data = test)
grid <- 10 ^ seq(10, -2, length = 100)
fit_ridge <- glmnet(train_mat, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
cv_ridge <- cv.glmnet(train_mat, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam <- cv_ridge$lambda.min
ridge_pred <- predict(fit_ridge, s = bestlam, newx = test_mat)
ridge_MSE <- mean((ridge_pred - test$Apps)^2)
ridge_MSE
## [1] 894344.5
The lasso model MSE is 894286.5. This model kept 18 non-zero coefficients, though several were near zero.
set.seed(22)
fit_lasso <- glmnet(train_mat, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv_lasso <- cv.glmnet(train_mat, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam <- cv_lasso$lambda.min
lasso_pred <- predict(fit_lasso, s = bestlam, newx = test_mat)
lasso_MSE <- mean((lasso_pred - test$Apps)^2)
lasso_coef <- predict(fit_lasso, type = 'coefficient', s = bestlam)
lasso_MSE
## [1] 894286.5
lasso_coef[lasso_coef!=0]
## [1] -544.02484392 -505.69565764 1.61674471 -1.17471944 52.63154083
## [6] -19.04816319 0.10540607 0.03969968 -0.07265414 0.12794426
## [11] -0.16297511 0.07056082 -9.10044055 -2.68997846 15.52020545
## [16] 1.64990949 0.09487090 10.02553687
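Indexing the sparse coefficient matrix drops the variable names, so the printout above is hard to read. A small follow-up like the line below (reusing lasso_coef from above) would show each coefficient next to its predictor name, rounded so the near-zero ones stand out.
round(as.matrix(lasso_coef), 4) # named coefficients; near-zero entries are easy to spot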
The PCR model MSE is 1550146 with an M value of 6.
set.seed(22)
fit_pcr = pcr(Apps~., data = train, scale = TRUE, validation = 'CV')
validationplot(fit_pcr, val.type = 'MSEP')
pcr_pred <- predict(fit_pcr,test, ncomp = 6)
pcr_MSE <- mean((pcr_pred-test$Apps)^2)
pcr_MSE
## [1] 1550146
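The value M = 6 was read off the validation plot. If you prefer to extract it programmatically, something along these lines should work with the fitted fit_pcr object, using the MSEP helper from the pls package (cv_errors is a made-up name).
cv_errors <- MSEP(fit_pcr)$val["CV", 1, ] # cross-validated MSEP for 0, 1, ... components
which.min(cv_errors) - 1 # subtract 1 because the first entry is the intercept-only model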
The PLS model MSE is 840735.5 with an M value of 8.
set.seed(22)
fit_pls = plsr(Apps~., data = train, scale = TRUE, validation = 'CV')
validationplot(fit_pls, val.type = 'MSEP')
pls_pred <- predict(fit_pls,test, ncomp = 8)
pls_MSE <- mean((pls_pred-test$Apps)^2)
pls_MSE
## [1] 840735.5
There was not a large difference in test MSE between most of the methods; PLS and PCR were the exceptions. PLS had the lowest MSE at 840735.5. The ridge and lasso methods also performed slightly better than least squares, but worse than PLS. PCR performed noticeably worse than all of the other models, and its error only approached the others when enough components were included that it was essentially replicating the least squares fit.
all_MSE = c(LS_MSE, ridge_MSE, lasso_MSE, pcr_MSE, pls_MSE)
names(all_MSE) = c("ls", "ridge", "lasso", "pcr", "pls")
all_MSE
## ls ridge lasso pcr pls
## 894367.7 894344.5 894286.5 1550146.4 840735.5
which.min(all_MSE)
## lasso
## 3
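One way to quantify how close most of these errors are is to express each MSE relative to the least squares baseline, for example with the line below (reusing the all_MSE vector from above; negative values are improvements over least squares).
round(100 * (all_MSE / all_MSE["ls"] - 1), 2) # percent change in test MSE vs. least squares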
Boston data set.
#Explore Data
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
#Split data into train and test
set.seed(22)
train_num = sample(nrow(Boston), (nrow(Boston)*.75))
train <- Boston[train_num, ]
test <- Boston[-train_num, ]
fit_lm <- lm(crim ~ ., data = train)
pred_lm <- predict(fit_lm, test)
LS_MSE <- mean((pred_lm - test$crim)^2)
LS_MSE
## [1] 20.28627
set.seed(22)
train_mat <- model.matrix(crim ~ ., data = train)
test_mat <- model.matrix(crim ~ ., data = test)
grid <- 10 ^ seq(10, -2, length = 100)
fit_ridge <- glmnet(train_mat, train$crim, alpha = 0, lambda = grid, thresh = 1e-12)
cv_ridge <- cv.glmnet(train_mat, train$crim, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam <- cv_ridge$lambda.min
ridge_pred <- predict(fit_ridge, s = bestlam, newx = test_mat)
ridge_MSE <- mean((ridge_pred - test$crim)^2)
ridge_MSE
## [1] 20.00783
set.seed(22)
fit_lasso <- glmnet(train_mat, train$crim, alpha = 1, lambda = grid, thresh = 1e-12)
cv_lasso <- cv.glmnet(train_mat, train$crim, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam <- cv_lasso$lambda.min
lasso_pred <- predict(fit_lasso, s = bestlam, newx = test_mat)
lasso_MSE <- mean((lasso_pred - test$crim)^2)
lasso_MSE
## [1] 19.78296
set.seed(22)
fit_pcr = pcr(crim~., data = train, scale = TRUE, validation = 'CV')
validationplot(fit_pcr, val.type = 'MSEP')
pcr_pred <- predict(fit_pcr,test, ncomp = 7)
pcr_MSE <- mean((pcr_pred-test$crim)^2)
pcr_MSE
## [1] 22.40217
set.seed(22)
fit_pls = plsr(crim~., data = train, scale = TRUE, validation = 'CV')
validationplot(fit_pls, val.type = 'MSEP')
pls_pred <- predict(fit_pls,test, ncomp = 3)
pls_MSE <- mean((pls_pred-test$crim)^2)
pls_MSE
## [1] 21.37494
The three models that perform best on this data set are least squares (MSE 20.29), ridge (MSE 20.01), and lasso (MSE 19.78), with the lasso best by a slight margin. PCR (MSE 22.40) and PLS (MSE 21.37) perform only slightly worse, with PCR being the worst of the five.
all_MSE = c(LS_MSE, ridge_MSE, lasso_MSE, pcr_MSE, pls_MSE)
names(all_MSE) = c("ls", "ridge", "lasso", "pcr", "pls")
all_MSE
## ls ridge lasso pcr pls
## 20.28627 20.00783 19.78296 22.40217 21.37494
which.min(all_MSE)
## lasso
## 3
The lasso model does retain all of the features in this data set; however, several of the coefficients are so close to zero that the model is effectively not using those predictors. A sketch for inspecting them follows.
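To check which coefficients those are, something along these lines (reusing fit_lasso and bestlam from the Boston lasso fit above; boston_lasso_coef is a made-up name) would print the fitted coefficients with the near-zero entries visible.
boston_lasso_coef <- predict(fit_lasso, type = "coefficients", s = bestlam)
round(as.matrix(boston_lasso_coef), 4) # effectively-zero predictors show up as 0.0000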