- For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
- The lasso, relative to least squares, is:
- More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
False. The lasso is less flexible than least squares: its penalty term decreases variance at the cost of increased bias. The lasso gives improved prediction accuracy only when the increase in bias is less than the decrease in variance, so the claim about increased flexibility is wrong.
- More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
False (my answer is justified below).
- Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
True (my answer is justified below).
- Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
False (my answer is justified below).
Justification: As in my explanation for (a) i., the lasso has an advantage over least squares in terms of the bias-variance trade-off. When the least squares estimates have high variance, the lasso can reduce that variance in exchange for a small increase in bias, which can yield more accurate predictions. In addition, the lasso's penalty term, which uses the absolute value of the coefficients, can drive coefficients exactly to zero, making the model easier to interpret.
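For reference, the penalty term mentioned above is the absolute-value (L1) penalty in the standard lasso criterion (textbook form, added here only for context):

$$\min_{\beta}\;\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \;+\; \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert$$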
- Repeat (a) for ridge regression relative to least squares.
Answer: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. [True]
Both ridge and lasso regression have an advantage over least squares based on the bias-variance trade-off. As lambda increases, ridge regression becomes less flexible, which decreases the variance and increases the bias.
Even a small change in the training data can produce a large change in the least squares coefficient estimates, which translates into high variance.
Ridge regression can perform well by trading a small increase in bias for a large decrease in variance.
Comparing ridge with the lasso, the lasso's absolute-value penalty can drive coefficients exactly to zero, which allows for easier interpretation; ridge shrinks coefficients but keeps all of them in the model.
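For contrast, ridge regression uses a squared (L2) penalty in place of the absolute value, which shrinks coefficients toward zero but does not set them exactly to zero (standard form, shown for reference):

$$\min_{\beta}\;\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \;+\; \lambda \sum_{j=1}^{p}\beta_j^2$$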
- Repeat (a) for non-linear methods relative to least squares.
Answer: ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Non-linear models are more flexible in the shapes of the curves they can fit to the data, and they give improved prediction accuracy when their increase in variance is less than their decrease in bias. Non-linear methods (such as decision trees, which are covered later in this class) can describe relationships in complex data sets more effectively.
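As a small illustration (a sketch on simulated data, not part of the assignment data), a flexible polynomial fit can reduce bias relative to a straight line when the true relationship is non-linear; whether this improves test accuracy depends on how much variance the extra flexibility adds:

set.seed(0)
sim.x <- runif(100)
sim.y <- sin(2 * pi * sim.x) + rnorm(100, sd = 0.2)   # truly non-linear relationship
fit.lin  <- lm(sim.y ~ sim.x)                         # least squares line: high bias here
fit.poly <- lm(sim.y ~ poly(sim.x, 5))                # more flexible polynomial fit
c(linear = mean(resid(fit.lin)^2), poly5 = mean(resid(fit.poly)^2))  # training MSEs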
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.3
head(College)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## Agnes Scott College Yes 417 349 137 60 89
## Alaska Pacific University Yes 193 146 55 16 44
## Albertson College Yes 587 479 158 38 62
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Agnes Scott College 510 63 12960 5450 450
## Alaska Pacific University 249 869 7560 4120 800
## Albertson College 678 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Agnes Scott College 875 92 97 7.7 37 19016
## Alaska Pacific University 1500 76 72 11.9 2 10922
## Albertson College 675 67 73 9.4 11 9727
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
## Agnes Scott College 59
## Alaska Pacific University 15
## Albertson College 55
- Split the data set into a training set and a test set.
I will split the data 50/50.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#make this example reproducible
set.seed(1)
#create ID column
College = as.data.frame(College)
College$id <- 1:nrow(College)
train <- College %>% dplyr::sample_frac(.50)
test <- dplyr::anti_join(College, train, by = 'id')
# drop id
College = College[,-c(19)]
test = test[,-c(19)]
train = train[,-c(19)]
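An equivalent 50/50 split can be done with base R indexing; this is just a sketch, kept under different names so it does not overwrite the dplyr-based split used below:

idx    <- sample(nrow(College), nrow(College) %/% 2)
train2 <- College[idx, ]
test2  <- College[-idx, ]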
- Fit a linear model using least squares on the training set, and report the test error obtained.
Use the number of applications (Apps) as the response, with all other variables as predictors.
lm.fit = lm(Apps ~., data = train)
lm.pred = predict(lm.fit, test)
# Calculate the mean squared error
mse = mean((test$Apps - lm.pred)^2)
cat("Test MSE:", mse, "\n")
## Test MSE: 1135758
# Calculate the mean absolute error
mae = mean(abs(test$Apps - lm.pred))
cat("Test MAE:", mae, "\n")
## Test MAE: 650.8525
- Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.1.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.1.3
## Loaded glmnet 4.1-6
# convert the test and train to matrix
train.m = model.matrix(Apps~., data = train)
test.m = model.matrix(Apps ~., data = test)
grid = 10^seq(10, -2, length = 100) # grid of lambda values from 10^10 down to 10^-2
mod.ridge = cv.glmnet(train.m, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
lambda.best = mod.ridge$lambda.min
cat("The best lambda value is", lambda.best, "\n")
## The best lambda value is 0.01
plot(mod.ridge)
ridge.pred = predict(mod.ridge, newx=test.m, s=lambda.best)
# Calculate the mean squared error
mse = mean((test$Apps - ridge.pred)^2)
cat("Test MSE:", mse, "\n")
## Test MSE: 1135714
# Calculate the mean absolute error
mae = mean(abs(test$Apps - ridge.pred))
cat("Test MAE:", mae, "\n")
## Test MAE: 650.8396
The ridge test MSE is slightly lower than the OLS test MSE; the same holds for the test MAE.
- Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
mod.lasso = cv.glmnet(train.m, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
lambda.best = mod.lasso$lambda.min
cat("The best lambda value is", lambda.best, "\n")
## The best lambda value is 0.01
plot(mod.lasso)
lasso.pred = predict(mod.lasso, newx=test.m, s=lambda.best)
# Calculate the mean squared error
mse = mean((test$Apps - lasso.pred)^2)
cat("Test MSE:", mse, "\n")
## Test MSE: 1135659
# Calculate the mean absolute error
mae = mean(abs(test$Apps - lasso.pred))
cat("Test MAE:", mae, "\n")
## Test MAE: 650.8153
The lasso test MSE is slightly lower than the OLS test MSE, while the test MAE is nearly identical for both.
The coefficients are as follows:
mod.lasso = glmnet(model.matrix(Apps ~., data = College), College$Apps, alpha =1)
predict(mod.lasso, s = lambda.best, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -471.39372052
## (Intercept) .
## PrivateYes -491.04485137
## Accept 1.57033288
## Enroll -0.75961467
## Top10perc 48.14698892
## Top25perc -12.84690695
## F.Undergrad 0.04149116
## P.Undergrad 0.04438973
## Outstate -0.08328388
## Room.Board 0.14943472
## Books 0.01532293
## Personal 0.02909954
## PhD -8.39597537
## Terminal -3.26800340
## S.F.Ratio 14.59298267
## perc.alumni -0.04404771
## Expend 0.07712632
## Grad.Rate 8.28950241
My results may be due to the cross-validation: cv.glmnet already performs k-fold cross-validation (10 folds by default), so a different seed or number of folds could select a slightly different lambda.
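As a quick sensitivity check (a sketch; the fold count and seed here are my own choices, not part of the original analysis), the lasso cross-validation can be re-run with a different number of folds to see whether the selected lambda changes much:

set.seed(2)
mod.lasso.cv5 <- cv.glmnet(train.m, train$Apps, alpha = 1, lambda = grid,
                           nfolds = 5, thresh = 1e-12)
mod.lasso.cv5$lambda.min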
- Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
## Warning: package 'pls' was built under R version 4.1.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.fit = pcr(Apps ~., data = train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred = predict(pcr.fit, test, ncomp=10)
# Calculate the mean squared error
mse = mean((test$Apps - pcr.pred)^2)
cat("Test MSE:", mse, "\n")
## Test MSE: 1723100
# Calculate the mean absolute error
mae = mean(abs(test$Apps - pcr.pred))
cat("Test MAE:", mae, "\n")
## Test MAE: 814.0476
summary(pcr.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 4054 2422 2432 2117 2022 1959
## adjCV 4288 4051 2415 2426 2061 1999 1948
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1957 1955 1911 1852 1853 1856 1861
## adjCV 1947 1947 1900 1840 1843 1846 1851
## 14 comps 15 comps 16 comps 17 comps
## CV 1877 1679 1338 1269
## adjCV 1876 1649 1323 1256
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.20 57.78 65.31 70.99 76.37 81.27 84.8 87.85
## Apps 13.44 70.93 71.07 79.87 81.15 82.25 82.3 82.33
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.62 92.91 94.98 96.74 97.79 98.72 99.42
## Apps 83.38 84.76 84.80 84.84 85.11 85.14 90.55
## 16 comps 17 comps
## X 99.88 100.00
## Apps 93.42 93.89
Looking at the validation plot (an "elbow plot"), the CV MSEP for PCR levels off around M = 10, and in the summary M = 10 gives a local minimum of the CV error (the global CV minimum is at the full 17 components). I therefore use M = 10 for prediction.
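Rather than eyeballing the plot, the CV errors can be pulled out of the fitted object directly (a sketch, assuming pcr.fit from above; MSEP() is the pls package's accessor for cross-validated MSEP values):

cv.msep <- MSEP(pcr.fit, estimate = "CV")$val[1, 1, ]
which.min(cv.msep)   # global CV minimum (here at the full 17 components)
cv.msep["10 comps"]  # CV MSEP at the M = 10 elbow used for prediction above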
- Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls.fit = plsr(Apps ~., data = train, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")
pls.pred = predict(pls.fit, test, ncomp = 10)
# Calculate the mean squared error
mse = mean((test$Apps - pls.pred)^2)
cat("Test MSE:", mse, "\n")
## Test MSE: 1131661
# Calculate the mean absolute error
mae = mean(abs(test$Apps - pls.pred))
cat("Test MAE:", mae, "\n")
## Test MAE: 654.9752
summary(pls.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 2215 2010 1762 1610 1480 1330
## adjCV 4288 2209 2001 1748 1582 1460 1316
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1303 1292 1289 1286 1282 1277 1270
## adjCV 1290 1279 1275 1273 1268 1264 1258
## 14 comps 15 comps 16 comps 17 comps
## CV 1270 1270 1270 1270
## adjCV 1257 1257 1257 1257
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 27.21 50.73 63.06 65.52 70.20 74.20 78.62 80.81
## Apps 75.39 81.24 86.97 91.14 92.62 93.43 93.56 93.68
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.29 87.17 89.15 91.37 92.58 94.42 96.98
## Apps 93.76 93.79 93.83 93.86 93.88 93.89 93.89
## 16 comps 17 comps
## X 98.78 100.00
## Apps 93.89 93.89
With M = 10, the PLS test MSE is slightly lower than the OLS test MSE (though the test MAE is slightly higher). M = 10 was chosen because the CV MSEP appears to plateau starting around that value.
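Instead of eyeballing the plateau, the pls package can also choose M with a one-standard-error rule; this is a sketch of that alternative, not what was used above:

selectNcomp(pls.fit, method = "onesigma", plot = TRUE)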
- Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Let's look at this visually:
test.avg = mean(test[, "Apps"])
lm.test.r2 = 1 - mean((test$Apps - lm.pred)^2) /mean((test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((test$Apps - ridge.pred)^2)/mean((test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((test$Apps - lasso.pred)^2) /mean((test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((test$Apps - pcr.pred)^2) /mean((test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((test$Apps - pls.pred)^2) /mean((test[, "Apps"] - test.avg)^2)
yuh = rbind(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2)
yuh = as.data.frame(yuh)
head(yuh)
## V1
## lm.test.r2 0.9015413
## ridge.test.r2 0.9015452
## lasso.test.r2 0.9015499
## pcr.test.r2 0.8506248
## pls.test.r2 0.9018965
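Rather than re-typing the values for plotting, a small helper function (hypothetical, just a sketch using the prediction objects above) could compute all five test R^2 values in one call:

test.r2 <- function(pred, actual) {
  pred <- as.numeric(pred)   # pcr/pls predictions come back as arrays
  1 - mean((actual - pred)^2) / mean((actual - mean(actual))^2)
}
sapply(list(lm = lm.pred, ridge = ridge.pred, lasso = lasso.pred,
            pcr = pcr.pred, pls = pls.pred),
       test.r2, actual = test$Apps)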
data <- read.table(header=TRUE, text="
Model_Name Value
lm.test.r2 0.9015413
ridge.test.r2 0.9015452
lasso.test.r2 0.9015499
pcr.test.r2 0.8506248
pls.test.r2 0.9018965 ")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.3
ggplot(data, aes(y=Value, x=Model_Name)) +
geom_bar(position='dodge', stat='identity') +
geom_text(aes(label=Value), vjust=1.6, color="white", size=3.5)+
theme_minimal()
The test R^2 (the proportion of variance in the test set explained by each model) measures how well each model predicts the held-out data. The graph shows the test R^2 for each model. Most of the values are very close to each other, with the exception of PCR. The best model was PLS, but only by a few ten-thousandths.
PCR has the smallest test R^2, at about 0.85.
- Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(leaps)
## Warning: package 'leaps' was built under R version 4.1.3
library(glmnet)
library(MASS)
## Warning: package 'MASS' was built under R version 4.1.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
Let's try best subset selection:
set.seed(1)
predict.regsubsets = function(object, newdata, id, ...) {
form = as.formula(object$call[[2]])
mat = model.matrix(form, newdata)
coefi = coef(object, id = id)
mat[, names(coefi)] %*% coefi
}
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
for (j in 1:p) {
pred = predict(best.fit, Boston[folds == i, ], id = j)
cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
}
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))
plot(rmse.cv, pch = 19, type = "b")
which.min(rmse.cv)
## [1] 9
rmse.cv[which.min(rmse.cv)]
## [1] 6.543281
Best subset selection yielded a cross-validated RMSE of about 6.54, with the minimum at a 9-variable model.
Best subset selection chooses the subset of predictor variables that provides the most accurate and parsimonious model. Specifically, it fits all possible regression models for each subset size of the available predictors and selects the model that performs best according to some criterion.
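To see which predictors the 9-variable model favored by the cross-validation actually uses, best subset selection can be refit on the full Boston data (a sketch, separate from the CV procedure above):

best.full <- regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(best.full, id = 9)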
Ridge regression is a regularization (model tuning) method that is particularly useful when the data suffer from multicollinearity.
set.seed(2)
x = model.matrix(crim ~. -1, data = Boston)
y = Boston$crim
cv.ridge = cv.glmnet(x, y, type.measure = "mse", alpha =0)
plot(cv.ridge)
bestlam = cv.ridge$lambda.min
bestlam
## [1] 0.5374992
The best lambda is about 0.537 (note that this value is the tuning parameter, not the MSE). Ridge regression is particularly useful when there are many predictor variables in the data set and these predictors are highly correlated. In such cases, the traditional linear regression model can suffer from high variance and overfitting, which can lead to poor generalization performance on new data.
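The cross-validated error at that lambda can be read out of the cv.glmnet object (a sketch; cvm holds the mean cross-validated MSE because type.measure = "mse" above):

sqrt(cv.ridge$cvm[cv.ridge$lambda == bestlam])   # CV RMSE of ridge at the chosen lambda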
Lasso regression is a regularization technique that can both improve prediction accuracy and perform variable selection.
set.seed(51)
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.lasso = cv.glmnet(x,y,type.measure = "mse")
plot(cv.lasso)
bestlamLasso = cv.lasso$lambda.min
bestlamLasso
## [1] 0.02220953
coef(cv.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.7799525
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.1920089
## tax .
## ptratio .
## black .
## lstat .
## medv .
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])
## [1] 7.720478
The lasso's cross-validated RMSE at lambda.1se is about 7.72, somewhat higher than the best-subset RMSE above, but the resulting model is far sparser.
Lasso regression is useful when dealing with high-dimensional data, i.e., when the number of predictors is large relative to the sample size. Lasso regression can help identify the most important predictors and exclude those that do not contribute significantly to the model.
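Note that coef(cv.lasso) above reports the fit at lambda.1se by default, which is why only rad is non-zero; the fit at the CV-minimizing lambda generally keeps more predictors (a quick check, shown for comparison):

coef(cv.lasso, s = "lambda.min")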
library(pls)
pcr.fit = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.235 7.240 6.784 6.783 6.789 6.811
## adjCV 8.61 7.231 7.236 6.779 6.774 6.783 6.804
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.804 6.683 6.714 6.717 6.725 6.680 6.612
## adjCV 6.796 6.674 6.704 6.706 6.713 6.667 6.598
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
With M = 10 components, the cross-validated RMSEP (adjCV) is about 6.71, which is comparable to the cross-validated RMSEs of the other methods.
PCR works by transforming the original predictors into a set of new, uncorrelated variables called principal components (PCs) using PCA. The PCs are then used as the predictors in a linear regression model to predict the response variable.
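As a minimal check (a sketch), the components PCR regresses on are ordinary principal components of the standardized predictor matrix, so their variance-explained figures should match the X rows of the PCR summary:

pc <- prcomp(Boston[, -1], scale. = TRUE)   # drop the response, crim
summary(pc)$importance[, 1:5]               # proportion of variance, cf. 47.70% at 1 comp above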
- Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.
I will propose the lasso model as it tends to be the most interpretable:
out=glmnet(x,y,alpha=1,lambda =grid)
lasso.coef = predict(out, type = "coefficients", s= bestlamLasso)[1:14,]
lasso.coef
## (Intercept) zn indus chas nox rm
## 15.060859666 0.040656013 -0.069249939 -0.666111834 -8.854980673 0.351038670
## age dis rad tax ptratio black
## 0.000000000 -0.904538130 0.550358063 -0.001745966 -0.235371992 -0.007534167
## lstat medv
## 0.127094579 -0.180777239
The lasso tends to have a low test error; here it was not the lowest among the models I tried (which may be due to the particular data split and cross-validation folds), but it is the easiest to interpret.
The fitted lasso model is: crim = 15.061 + 0.041(zn) - 0.069(indus) - 0.666(chas) - 8.855(nox) + 0.351(rm) - 0.905(dis) + 0.550(rad) - 0.002(tax) - 0.235(ptratio) - 0.008(black) + 0.127(lstat) - 0.181(medv), with age dropped (its coefficient is exactly zero).
- Does your chosen model involve all of the features in the data set? Why or why not?
No, my model does not use all of the features in the data set (age is dropped). This is a perk of the lasso.
In Lasso regression, the L1 penalty is used to constrain the magnitude of the coefficients of the predictor variables in the model. This penalty shrinks the coefficients towards zero, and as the strength of the penalty is increased, some coefficients may be forced to exactly zero.
When a coefficient is shrunk to exactly zero, it means that the corresponding predictor variable is excluded from the model entirely. This is the key feature of Lasso regression that makes it useful for feature selection and variable reduction.
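This shrinkage-to-zero behaviour can be seen directly in the coefficient paths of the glmnet fit from part (a) (reusing the object out defined above):

plot(out, xvar = "lambda", label = TRUE)   # coefficients are driven exactly to zero as lambda grows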