Questions #2, 9, 11
2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
MULTIPLE CHOICE ANSWERS:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
(a) The correct answer is iii. The lasso is less flexible than least squares, and its advantage comes from the bias-variance trade-off: expected test MSE decomposes into variance plus squared bias plus irreducible error, so the lasso improves prediction accuracy when its reduction in variance outweighs its small increase in bias. The lasso also shrinks coefficient estimates toward zero, setting some exactly to zero, which produces simpler, easier-to-interpret models.
(b) The correct answer is iii. Like the lasso, ridge regression is less flexible than least squares and relies on the same bias-variance trade-off, accepting a small increase in bias in exchange for a larger decrease in variance. Unlike the lasso, it shrinks coefficients toward zero without setting any exactly to zero.
(c) The correct answer is ii. Non-linear methods are more flexible than least squares and have less bias, so they give improved prediction accuracy when their increase in variance is less than their decrease in bias.
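To make the shrinkage claim in (a) concrete, here is a minimal sketch on simulated data (the variable names n, p, X, beta, and lasso.path are illustrative, not part of the assignment): the coefficient paths show every estimate shrinking toward zero, and some hitting exactly zero, as lambda grows.
library(glmnet)
#Simulate a sparse linear model: only two predictors truly matter
set.seed(1)
n = 100; p = 10
X = matrix(rnorm(n * p), n, p)
beta = c(3, -2, rep(0, p - 2))
y = as.vector(X %*% beta + rnorm(n))
#alpha = 1 fits the lasso; plot coefficient paths against log(lambda)
lasso.path = glmnet(X, y, alpha = 1)
plot(lasso.path, xvar = "lambda", label = TRUE)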
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.1.3
library(leaps)
## Warning: package 'leaps' was built under R version 4.1.3
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.1.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.1.3
## Loaded glmnet 4.1-4
library(pls)
## Warning: package 'pls' was built under R version 4.1.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
sum(is.na(College$Apps)) #check for missing observations
## [1] 0
college = College
attach(college)
#Reproducibility
set.seed(13)
#Matrices
x = model.matrix(Apps~.,
college)[, -1]
y = college$Apps
#splitting the data
train = sample(1:nrow(x),
nrow(x) / 2)
test = (-train)
xtest = x[test,]
xtrain = x[train,]
ytest = y[test]
ytrain = y[train]
college.train = college[train, ]
college.test = college[test, ]
The test MSE for the least squares linear model is 1,062,514.
lm.fit = lm(Apps~.,
data = college.train)
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = college.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5350.6 -407.8 -43.6 325.9 7218.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -776.48813 581.48126 -1.335 0.18258
## PrivateYes -467.94850 210.11642 -2.227 0.02654 *
## Accept 1.67710 0.05402 31.048 < 2e-16 ***
## Enroll -1.14544 0.25916 -4.420 1.30e-05 ***
## Top10perc 45.40894 8.20656 5.533 5.96e-08 ***
## Top25perc -10.03632 6.35221 -1.580 0.11497
## F.Undergrad 0.06213 0.04903 1.267 0.20595
## P.Undergrad 0.07724 0.04238 1.823 0.06914 .
## Outstate -0.08890 0.02775 -3.203 0.00148 **
## Room.Board 0.14797 0.06968 2.124 0.03436 *
## Books 0.02191 0.33728 0.065 0.94824
## Personal 0.06113 0.08837 0.692 0.48948
## PhD -9.78906 6.33935 -1.544 0.12340
## Terminal -3.67940 7.12700 -0.516 0.60598
## S.F.Ratio 32.77911 17.70983 1.851 0.06498 .
## perc.alumni 0.93641 6.16345 0.152 0.87933
## Expend 0.08741 0.01758 4.972 1.02e-06 ***
## Grad.Rate 8.12168 4.46196 1.820 0.06954 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1068 on 370 degrees of freedom
## Multiple R-squared: 0.9406, Adjusted R-squared: 0.9379
## F-statistic: 344.6 on 17 and 370 DF, p-value: < 2.2e-16
lm.pred = predict(lm.fit,
college.test)
lm.err = mean((college.test$Apps - lm.pred)^2)
The test MSE for the ridge regression model is 1,172,800.
grid = 10^seq(10, -2, length = 100) #lambda grid from 10^10 down to 10^-2
ridge.mod = glmnet(xtrain, #fit ridge regression model
ytrain,
alpha = 0,
lambda = grid)
summary(ridge.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
dim(coef(ridge.mod))
## [1] 18 100
set.seed(13)
cv.out = cv.glmnet(xtrain,
ytrain,
alpha = 0)
plot(cv.out)
bestlam = cv.out$lambda.min
bestlam
## [1] 407.0529
#lambda with the smallest cross-validation error is about 407
ridge.pred = predict(ridge.mod,
s = bestlam,
newx = xtest)
ridge.err = mean((ridge.pred - ytest)^2)
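For comparison with the lasso coefficient extraction below, we could also inspect the ridge coefficients at the CV-chosen lambda using the same predict() pattern; a sketch, illustrating that ridge shrinks all 17 predictors but sets none exactly to zero:
#Ridge coefficients at the CV-chosen lambda: all predictors stay nonzero
predict(ridge.mod, type = "coefficients", s = bestlam)[1:18, ]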
The test MSE for the lasso model is 1,055,889.
set.seed(13)
lasso.mod = glmnet(xtrain,
ytrain,
alpha = 1,
lambda = grid)
summary(lasso.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
cv.out = cv.glmnet(xtrain,
ytrain,
alpha = 1)
plot(cv.out)
bestlam = cv.out$lambda.min
bestlam #lambda with the smallest cross-validation error is about 20
## [1] 20.25912
lasso.pred = predict(lasso.mod,
s = bestlam,
newx = xtest)
lasso.err = mean((lasso.pred - ytest)^2)
out = glmnet(x,
y,
alpha = 1,
lambda = grid)
lasso.coef = predict(out,
type = "coefficients",
s = bestlam)[1:18,]
lasso.coef[lasso.coef != 0] #15 of 17 predictors retained; F.Undergrad and Books are dropped
## (Intercept) PrivateYes Accept Enroll Top10perc
## -5.922287e+02 -4.295925e+02 1.463333e+00 -2.259221e-01 3.467654e+01
## Top25perc P.Undergrad Outstate Room.Board Personal
## -3.110529e+00 2.303473e-02 -5.976986e-02 1.263825e-01 1.822314e-03
## PhD Terminal S.F.Ratio perc.alumni Expend
## -5.777387e+00 -3.284070e+00 5.098129e+00 -9.176608e-01 7.012809e-02
## Grad.Rate
## 5.354180e+00
set.seed(13)
pcr.fit = pcr(Apps~.,
data = college.train,
scale = T,
validation = "CV")
summary(pcr.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4290 4143 2285 2290 1938 1858 1827
## adjCV 4290 4145 2279 2286 1876 1838 1817
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1827 1775 1747 1737 1746 1742 1740
## adjCV 1819 1755 1740 1726 1738 1734 1727
## 14 comps 15 comps 16 comps 17 comps
## CV 1742 1729 1398 1331
## adjCV 1734 1714 1378 1314
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.815 56.30 63.57 69.42 74.94 80.01 84.00 87.65
## Apps 9.051 73.44 73.60 83.11 83.88 84.02 84.03 85.64
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.78 92.95 95.05 96.76 97.81 98.79 99.40
## Apps 85.72 86.03 86.17 86.32 86.87 86.99 90.83
## 16 comps 17 comps
## X 99.83 100.00
## Apps 93.69 94.06
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred = predict(pcr.fit,
xtest,
ncomp = 10) #CV error levels off around 10 components before the near-full models
pcr.err = mean((pcr.pred - ytest)^2)
PLS explains more of the variance as components are added, but the gains quickly become marginal: at 7 components the CV error is 1363 with 75.79% of the predictor variance explained, and adding further components only lowers the CV error slightly, to about 1331 by 16 components, where it is essentially at its minimum. We therefore predict with 16 components.
set.seed(13)
pls.fit = plsr(Apps~.,
data = college.train,
scale = T,
validation = "CV")
summary(pls.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4290 2112 1887 1676 1665 1567 1461
## adjCV 4290 2104 1885 1666 1633 1532 1436
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1363 1352 1337 1334 1335 1335 1334
## adjCV 1344 1334 1320 1317 1318 1318 1316
## 14 comps 15 comps 16 comps 17 comps
## CV 1332 1332 1331 1331
## adjCV 1314 1314 1314 1314
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.73 49.37 61.82 64.43 67.98 73.18 75.79 79.75
## Apps 78.00 83.25 88.08 91.45 93.18 93.64 93.86 93.93
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 81.98 84.98 87.93 90.50 92.21 94.32 96.91
## Apps 93.99 94.02 94.04 94.05 94.05 94.06 94.06
## 16 comps 17 comps
## X 98.74 100.00
## Apps 94.06 94.06
validationplot(pls.fit, val.type = "MSEP")
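As a cross-check on reading the validation plot by eye, newer versions of the pls package provide selectNcomp(), which suggests a parsimonious number of components; this optional sketch assumes a pls version (>= 2.5-0) that includes the function.
#Automated component selection via the one-sigma heuristic
selectNcomp(pls.fit, method = "onesigma", plot = TRUE)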
pls.pred = predict(pls.fit,
xtest,
ncomp = 16) #CV error is essentially at its minimum by 16 components
pls.err = mean((pls.pred - ytest)^2)
Comparing the test R-squared of each model shows that PCR has the lowest accuracy, while the other models perform very similarly. Comparing the test MSEs, the lasso has the lowest error, so it would be a good choice of model.
barplot(c(lm.err, ridge.err, lasso.err, pcr.err, pls.err),
names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
col = c("#eb8060", "#b9e38d", "#a1e9f0", "#d9b1f0", "#f2c6de"), #one color per method
xlab = "Regression Methods",
ylab = "Test Error",
main = "Test Errors for All Methods")
test.avg = mean(college.test$Apps)
lm.test.r2 = 1 - mean((college.test$Apps - lm.pred)^2) /
mean((college.test$Apps - test.avg)^2)
ridge.test.r2 = 1 - mean((college.test$Apps - ridge.pred)^2) /
mean((college.test$Apps - test.avg)^2)
lasso.test.r2 = 1 - mean((college.test$Apps - lasso.pred)^2) /
mean((college.test$Apps - test.avg)^2)
pcr.test.r2 = 1 - mean((college.test$Apps - pcr.pred)^2) /
mean((college.test$Apps - test.avg)^2)
pls.test.r2 = 1 - mean((college.test$Apps - pls.pred)^2) /
mean((college.test$Apps - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2),
names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
col = c("#eb8060", "#b9e38d", "#a1e9f0", "#d9b1f0", "#f2c6de"), #one color per method
xlab = "Regression Methods",
ylab = "Test R^2",
main = "Test R-squared")
str(Boston)
## 'data.frame': 506 obs. of 13 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
sum(is.na(Boston$crim))
## [1] 0
#Predict function for regsubsets objects (leaps does not supply a predict method)
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]]) #recover the formula from the call
  mat <- model.matrix(form, newdata) #build the design matrix for newdata
  coefi <- coef(object, id = id) #coefficients of the id-variable model
  xvars <- names(coefi)
  mat[, xvars] %*% coefi #predictions
}
#Matrices
x = model.matrix(crim~.,
data = Boston)[,-1]
y = Boston$crim
Best Subset
set.seed(13)
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
#Loops
for (i in 1:k) {
  best.fit = regsubsets(crim ~ .,
                        data = Boston[folds != i, ],
                        nvmax = p)
  for (j in 1:p) {
    pred = predict(best.fit,
                   Boston[folds == i, ],
                   id = j)
    cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
  }
}
mean.cv.errors = apply(cv.errors, 2, mean)
mean.cv.errors
## [1] 46.06337 44.30723 44.44411 43.94068 44.04942 43.62439 43.29353 43.43413
## [9] 43.58565 43.64144 43.54290 43.53425
plot(mean.cv.errors, type = "b",
xlab = "Number of Variables",
ylab = "CV Error")
which.min(mean.cv.errors) #7
## [1] 7
points(7, mean.cv.errors[7],
col = "red",
cex = 2,
pch = 20)
mean.cv.errors[which.min(mean.cv.errors)]
## [1] 43.29353
Lasso
#Reproducibility
set.seed(13)
cv.lasso = cv.glmnet(x,
y,
type.measure = "mse") #alpha defaults to 1, i.e. the lasso
plot(cv.lasso)
coef(cv.lasso) #coefficients at lambda.1se, the default for cv.glmnet objects
## 13 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.7799525
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.1920089
## tax .
## ptratio .
## lstat .
## medv .
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se]) #7.715462
## [1] 7.715462
Ridge Regression
set.seed(13)
cv.ridge = cv.glmnet(x,
y,
type.measure = "mse",
alpha = 0)
plot(cv.ridge)
coef(cv.ridge) #coefficients at lambda.1se, the default for cv.glmnet objects
## 13 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.713143934
## zn -0.002994073
## indus 0.028593701
## chas -0.157594395
## nox 1.825119224
## rm -0.138551061
## age 0.006029211
## dis -0.091316045
## rad 0.043640145
## tax 0.001994740
## ptratio 0.068418435
## lstat 0.034406354
## medv -0.022668857
sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se]) #7.754428
## [1] 7.754428
PCR
set.seed(13)
pcr.fit = pcr(crim~.,
data = Boston,
scale = T,
validation = "CV")
summary(pcr.fit)
## Data: X dimension: 506 12
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.265 7.259 6.864 6.841 6.790 6.775
## adjCV 8.61 7.262 7.256 6.861 6.838 6.786 6.771
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 6.651 6.658 6.648 6.636 6.595 6.525
## adjCV 6.645 6.653 6.642 6.630 6.587 6.517
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 49.93 63.64 72.94 80.21 86.83 90.26 92.79 94.99
## crim 29.39 29.55 37.39 37.85 38.85 39.23 41.73 41.82
## 9 comps 10 comps 11 comps 12 comps
## X 96.78 98.33 99.48 100.00
## crim 42.12 42.43 43.58 44.93
validationplot(pcr.fit, val.type = "MSEP")
The best subset selection model had the lowest cross-validation error, with a CV MSE of about 43.3 using seven variables. The lasso and ridge fits, evaluated at lambda.1se, have CV RMSEs of 7.72 and 7.75 (MSEs of roughly 59.5 and 60.1), and PCR only approaches best subset's error when it keeps all 12 components, at which point it is equivalent to least squares.
The chosen model is therefore the seven-variable best subset model: it attains the lowest cross-validation MSE while using only about half of the predictors, which keeps variance low and the model easy to interpret.
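To see which predictors the chosen model uses, one could refit best subset selection on the full Boston data and extract the seven-variable coefficients; a sketch, reusing p from the cross-validation setup above:
#Coefficients of the chosen 7-variable model refit on the full data
best.full = regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(best.full, id = 7)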