Ch. 6: Exercises 2, 9, 11
(a) For the lasso relative to least squares, the correct answer is iii. The lasso is less flexible than least squares: by shrinking coefficient estimates, and setting some exactly to zero, it reduces variance at the cost of some additional bias. Since least squares estimates often exhibit high variance, prediction accuracy improves whenever the increase in bias is smaller than the decrease in variance.
Repeat (a) for ridge regression relative to least squares. Option iii is again the correct answer. Like the lasso, ridge regression shrinks the coefficient estimates, reducing variance while potentially increasing bias; it is therefore less flexible than ordinary least squares.
Repeat (a) for non-linear methods relative to least squares. The correct answer is ii. Non-linear methods are more flexible than least squares and typically have lower bias (at the cost of higher variance), so they improve prediction accuracy when the decrease in bias outweighs the increase in variance.
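A small simulation can illustrate these bias-variance claims (a sketch, not part of the exercise; the sample size, true coefficients, and the fixed lambda of 0.5 are arbitrary illustrative choices): repeatedly draw training sets from a sparse linear model, fit least squares and the lasso, and compare the coefficient estimates across replications.
library(glmnet)
set.seed(1)
n <- 50; p <- 10
beta <- c(3, 1.5, rep(0, p - 2))        # sparse true model
ols.b1 <- lasso.b1 <- numeric(200)
for (r in 1:200) {
  X <- matrix(rnorm(n * p), n, p)
  y <- drop(X %*% beta) + rnorm(n)
  ols.b1[r] <- coef(lm(y ~ X))[2]                               # OLS estimate of beta_1
  lasso.b1[r] <- coef(glmnet(X, y, alpha = 1, lambda = 0.5))[2] # lasso estimate of beta_1
}
var(ols.b1); var(lasso.b1)    # lasso estimates vary less across training sets
mean(ols.b1); mean(lasso.b1)  # but are biased toward zero (true value is 3)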
library(ISLR)
attach(College)
x=model.matrix(Apps~.,College)[,-1]
y=College$Apps
set.seed(10)
train=sample(1:nrow(x), nrow(x)/2)
test=(-train)
College.train = College[train, ]
College.test = College[test, ]
y.test=y[test]
lin_mod<-lm(Apps~., data=College, subset=train)
summary(lin_mod)
##
## Call:
## lm(formula = Apps ~ ., data = College, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5139.5 -473.3 -21.1 353.2 7402.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -629.36179 639.35741 -0.984 0.325579
## PrivateYes -647.56836 192.17056 -3.370 0.000832 ***
## Accept 1.68912 0.05038 33.530 < 2e-16 ***
## Enroll -1.02383 0.27721 -3.693 0.000255 ***
## Top10perc 48.19124 8.10714 5.944 6.42e-09 ***
## Top25perc -10.51538 6.44952 -1.630 0.103865
## F.Undergrad 0.01992 0.05364 0.371 0.710574
## P.Undergrad 0.04213 0.05348 0.788 0.431373
## Outstate -0.09489 0.02674 -3.549 0.000436 ***
## Room.Board 0.14549 0.07243 2.009 0.045277 *
## Books 0.06660 0.31115 0.214 0.830623
## Personal 0.05663 0.09453 0.599 0.549475
## PhD -10.11489 7.11588 -1.421 0.156027
## Terminal -2.29300 8.03546 -0.285 0.775528
## S.F.Ratio 22.07117 18.70991 1.180 0.238897
## perc.alumni 2.08121 6.00673 0.346 0.729179
## Expend 0.07654 0.01672 4.577 6.45e-06 ***
## Grad.Rate 9.99706 4.49821 2.222 0.026857 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1092 on 370 degrees of freedom
## Multiple R-squared: 0.9395, Adjusted R-squared: 0.9367
## F-statistic: 338 on 17 and 370 DF, p-value: < 2.2e-16
pred <-predict(lin_mod, College.test)
t_error<-mean((College.test$Apps-pred)^2)
t_error
## [1] 1020100
The test MSE obtained for the least squares model is 1020100.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
grid=10^seq(10,-2,length=100)
ridge_mod=glmnet(x[train,],y[train],alpha=0,lambda=grid)
summary(ridge_mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
cv.college.out=cv.glmnet(x[train,],y[train],alpha=0)
bestlam=cv.college.out$lambda.min
bestlam
## [1] 411.3927
ridge.pred=predict(ridge_mod,s=bestlam,newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 985020.1
The test MSE for the ridge model is 985020.1.
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
summary(lasso.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.out$lambda.min
bestlam
## [1] 24.66235
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 1008145
out=glmnet(x,y,alpha=1,lambda = grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:18,]
lasso.coef[lasso.coef!=0]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -6.324960e+02 -4.087012e+02 1.436837e+00 -1.410240e-01 3.143012e+01
## Top25perc P.Undergrad Outstate Room.Board Personal
## -8.606536e-01 1.480293e-02 -5.342495e-02 1.205819e-01 4.379135e-05
## PhD Terminal S.F.Ratio perc.alumni Expend
## -5.121245e+00 -3.371192e+00 2.717231e+00 -1.039648e+00 6.838161e-02
## Grad.Rate
## 4.700317e+00
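As a quick check on the number of non-zero coefficients requested in part (d), they can be counted directly from the coefficient vector above:
sum(lasso.coef != 0)        # 16 non-zero entries, including the intercept
sum(lasso.coef[-1] != 0)    # 15 of the 17 predictors survive the shrinkage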
The test MSE for the lasso model is 1008145. The lasso retains 15 of the 17 predictors; F.Undergrad and Books are the only coefficients shrunk exactly to zero.
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.college=pcr(Apps~., data=College.train,scale=TRUE,validation="CV")
summary(pcr.college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4347 4345 2371 2391 2104 1949 1898
## adjCV 4347 4345 2368 2396 2085 1939 1891
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1899 1880 1864 1861 1870 1873 1891
## adjCV 1893 1862 1857 1853 1862 1865 1885
## 14 comps 15 comps 16 comps 17 comps
## CV 1903 1727 1295 1260
## adjCV 1975 1669 1283 1249
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.6794 56.94 64.38 70.61 76.27 80.97 84.48 87.54
## Apps 0.9148 71.17 71.36 79.85 81.49 82.73 82.79 83.70
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.50 92.89 94.96 96.81 97.97 98.73 99.39
## Apps 83.86 84.08 84.11 84.11 84.16 84.28 93.08
## 16 comps 17 comps
## X 99.86 100.00
## Apps 93.71 93.95
validationplot(pcr.college, val.type="MSEP")
pcr.pred=predict(pcr.college,x[test,],ncomp=10)
mean((pcr.pred-y.test)^2)
## [1] 1422699
M = 10 components were chosen: it gives the lowest CV error (1861) on the plateau from 8 to 14 components, and the CV error only drops further at M = 16 and 17, where PCR is essentially the full least squares fit. Ten components also explain about 93% of the variance in the predictors. The resulting test MSE is 1422699.
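The location of the strict CV minimum can also be found programmatically (a sketch, assuming the pls package's RMSEP() accessor; here it returns M = 17, i.e. no dimension reduction, which is why a smaller M at the local minimum of the plateau was preferred):
cv.rmsep <- RMSEP(pcr.college, estimate = "CV")
which.min(cv.rmsep$val) - 1   # subtract 1 because the array starts at 0 components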
pls.college=plsr(Apps~., data=College.train,scale=TRUE, validation="CV")
validationplot(pls.college, val.type="MSEP")
summary(pls.college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4347 2178 1872 1734 1615 1453 1359
## adjCV 4347 2171 1867 1726 1586 1427 1341
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1347 1340 1329 1317 1310 1305 1305
## adjCV 1330 1324 1314 1302 1296 1291 1291
## 14 comps 15 comps 16 comps 17 comps
## CV 1305 1307 1307 1307
## adjCV 1291 1292 1293 1293
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 24.27 38.72 62.64 65.26 69.01 73.96 78.86 82.18
## Apps 76.96 84.31 86.80 91.48 93.37 93.75 93.81 93.84
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 85.35 87.42 89.18 91.41 92.70 94.58 97.16
## Apps 93.88 93.91 93.93 93.94 93.95 93.95 93.95
## 16 comps 17 comps
## X 98.15 100.00
## Apps 93.95 93.95
pls.pred=predict(pls.college,x[test,],ncomp=9)
mean((pls.pred-y.test)^2)
## [1] 1049868
For PLS, M = 9 components were used: the cross-validation error is essentially flat beyond 9 components (CV RMSEP of 1329 at 9 components versus a minimum of 1305 at 12), and 9 components already explain 85.35% of the predictor variance. The resulting test MSE is 1049868.
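An alternative to eyeballing the validation plot is the one-standard-error heuristic (a sketch, assuming a pls version of at least 2.5-0, where selectNcomp() was introduced):
ncomp.1se <- selectNcomp(pls.college, method = "onesigma", plot = TRUE)
ncomp.1se   # smallest M whose CV error is within one SE of the minimum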
test.avg = mean(College.test[, "Apps"])
lm.test.r2 = 1 - mean((College.test[, "Apps"] - pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((College.test[, "Apps"] - ridge.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((College.test[, "Apps"] - lasso.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((pcr.pred-y.test)^2) /mean((College.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((pls.pred-y.test)^2) /mean((College.test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2), names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main="Test R-squared")
detach(College)
From the barplot of test R², the PCR model is the least accurate on the test set, while ridge regression attains the lowest test MSE (985020.1). The other four models predict Apps with broadly similar accuracy.
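For reference, the five test MSEs can be gathered into a single sorted table (this reuses the prediction objects computed above):
test.mse <- c(OLS = t_error,
              Ridge = mean((ridge.pred - y.test)^2),
              Lasso = mean((lasso.pred - y.test)^2),
              PCR = mean((pcr.pred - y.test)^2),
              PLS = mean((pls.pred - y.test)^2))
sort(test.mse)   # Ridge < Lasso < OLS < PLS < PCR on this split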
library(leaps)
library(MASS)
set.seed(1)
attach(Boston)
predict.regsubsets = function(object, newdata, id, ...) {
    # regsubsets() provides no predict() method, so we define one
    form = as.formula(object$call[[2]])  # recover the formula from the original call
    mat = model.matrix(form, newdata)    # build the design matrix for the new data
    coefi = coef(object, id = id)        # coefficients of the id-variable model
    mat[, names(coefi)] %*% coefi        # predictions
}
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
for (j in 1:p) {
pred = predict(best.fit, Boston[folds == i, ], id = j)
cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
}
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")
which.min(mean.cv.errors)
## [1] 9
mean.cv.errors[which.min(mean.cv.errors)]
## [1] 42.81453
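To see which variables the 9-predictor model actually uses, best subset selection can be refit on the full data set and the coefficients of the 9-variable model extracted (a sketch; the variables could be read equally well from any of the per-fold fits):
reg.best = regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(reg.best, id = 9)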
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.lasso = cv.glmnet(x, y, type.measure = "mse")
plot(cv.lasso)
coef(cv.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 2.176491
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.150484
## tax .
## ptratio .
## black .
## lstat .
## medv .
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])
## [1] 7.921353
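Note that coef() on a cv.glmnet object uses lambda.1se by default, which is why only rad keeps a non-zero coefficient above. For comparison with the best subset CV error, the corresponding cross-validated MSE (the square of the RMSE just printed) can be read off directly:
cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se]   # about 62.7 = 7.921353^2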
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.ridge = cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(cv.ridge)
coef(cv.ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.523899542
## zn -0.002949852
## indus 0.029276741
## chas -0.166526007
## nox 1.874769665
## rm -0.142852604
## age 0.006207995
## dis -0.094547258
## rad 0.045932737
## tax 0.002086668
## ptratio 0.071258052
## black -0.002605281
## lstat 0.035745604
## medv -0.023480540
sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se])
## [1] 7.669133
pcr.crime = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.crime)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.175 7.180 6.724 6.731 6.727 6.727
## adjCV 8.61 7.174 7.179 6.721 6.725 6.724 6.724
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.722 6.614 6.618 6.607 6.598 6.553 6.488
## adjCV 6.718 6.609 6.613 6.602 6.592 6.546 6.481
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
Based on cross-validated MSE, best subset selection performs best, with a CV error of about 42.8, compared with roughly 58.8 for ridge and 62.7 for the lasso (both evaluated at lambda.1se; these are the squares of the RMSEs reported above). PCR's CV error keeps decreasing essentially all the way to M = 13 components, which is just the full least squares fit, so it offers no useful dimension reduction here.
The chosen model is therefore the best subset selection model with 9 predictors. It does not involve all 13 features: by dropping four predictors it reduces variance relative to the full least squares fit while keeping the cross-validated MSE low, which is exactly the balance of low variance and good predictive accuracy we want.