knitr::opts_chunk$set(echo = TRUE)
✨ Problem 2
For parts (a) through (c), indicate which of i. through iv. is correct:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Justify your answer.
(a) The lasso, relative to least squares, is:
Response: (iii) - The lasso adds an L1 penalty term to the least squares objective function, shrinking coefficients toward zero (and setting some exactly to zero). Because the lasso is less flexible than least squares, it typically increases bias while decreasing variance, so it gives improved prediction accuracy when the increase in bias is less than the decrease in variance.
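For reference, the lasso estimates minimize

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,$$

where larger values of $\lambda$ impose more shrinkage and hence less flexibility.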
(b) Repeat (a) for ridge regression relative to least squares.
Response: (iii) - Ridge regression adds an L2 penalty term to the least squares objective function, similar to the lasso. It is less flexible than least squares, increasing bias while decreasing variance, and it improves prediction accuracy when the increase in bias is outweighed by the reduction in variance.
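Ridge regression replaces the L1 penalty with a squared L2 penalty, minimizing

$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,$$

which shrinks coefficients toward zero but, unlike the lasso, never sets them exactly to zero.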
(c) Repeat (a) for non-linear methods relative to least squares.
Response: (ii) - Non-linear methods are more flexible than least squares, so they typically decrease bias while increasing variance. They improve prediction accuracy when the increase in variance is less than the decrease in bias; a small illustration follows.
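As a minimal sketch (not part of the prompt; the Auto data from ISLR2 is used purely as an example), a polynomial regression is a more flexible, non-linear alternative to a straight-line fit:

# Illustrative only: a degree-5 polynomial is more flexible than a linear fit
fit.linear <- lm(mpg ~ horsepower, data = ISLR2::Auto)
fit.poly <- lm(mpg ~ poly(horsepower, 5), data = ISLR2::Auto)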
✨ Problem 9
In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
library(ISLR2)
attach(College)
# Predictor matrix without the intercept column, and the response
x <- model.matrix(Apps ~ ., College)[, -1]
y <- College$Apps
# Randomly assign half of the observations to the training set
set.seed(10)
train <- sample(1:nrow(x), nrow(x) / 2)
test <- (-train)
College.train <- College[train, ]
College.test <- College[test, ]
y.test <- y[test]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lsfit <- lm(Apps ~ ., data = College, subset = train)
summary(lsfit)
##
## Call:
## lm(formula = Apps ~ ., data = College, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5139.5 -473.3 -21.1 353.2 7402.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -629.36179 639.35741 -0.984 0.325579
## PrivateYes -647.56836 192.17056 -3.370 0.000832 ***
## Accept 1.68912 0.05038 33.530 < 2e-16 ***
## Enroll -1.02383 0.27721 -3.693 0.000255 ***
## Top10perc 48.19124 8.10714 5.944 6.42e-09 ***
## Top25perc -10.51538 6.44952 -1.630 0.103865
## F.Undergrad 0.01992 0.05364 0.371 0.710574
## P.Undergrad 0.04213 0.05348 0.788 0.431373
## Outstate -0.09489 0.02674 -3.549 0.000436 ***
## Room.Board 0.14549 0.07243 2.009 0.045277 *
## Books 0.06660 0.31115 0.214 0.830623
## Personal 0.05663 0.09453 0.599 0.549475
## PhD -10.11489 7.11588 -1.421 0.156027
## Terminal -2.29300 8.03546 -0.285 0.775528
## S.F.Ratio 22.07117 18.70991 1.180 0.238897
## perc.alumni 2.08121 6.00673 0.346 0.729179
## Expend 0.07654 0.01672 4.577 6.45e-06 ***
## Grad.Rate 9.99706 4.49821 2.222 0.026857 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1092 on 370 degrees of freedom
## Multiple R-squared: 0.9395, Adjusted R-squared: 0.9367
## F-statistic: 338 on 17 and 370 DF, p-value: < 2.2e-16
predApp <- predict(lsfit, College.test)
testError <- mean((College.test$Apps - predApp)^2) # test MSE
testError
## [1] 1020100
**Test Error = 1,020,100**
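For scale, the square root of the test MSE expresses the typical prediction error in numbers of applications:

sqrt(testError) # about 1010, i.e., typical errors of roughly 1,000 applications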
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-9
# Grid of lambda values from 10^10 down to 10^-2; alpha = 0 gives ridge regression
grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x[train, ], y[train], alpha = 0, lambda = grid)
summary(ridge.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
# Choose lambda by cross-validation on the training set
cv.college.out <- cv.glmnet(x[train, ], y[train], alpha = 0)
bestlam <- cv.college.out$lambda.min
bestlam
## [1] 411.3927
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x[test, ])
mean((ridge.pred - y.test)^2)
## [1] 985020.1
**Test Error = 985,020.10**
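As a quick check (output omitted here), the coefficients at the CV-selected λ can be compared with the least squares fit to see how much ridge shrinks them:

predict(ridge.mod, type = "coefficients", s = bestlam)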
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1, lambda = grid) # alpha = 1: lasso
summary(lasso.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)
bestlam <- cv.out$lambda.min
bestlam
## [1] 24.66235
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test, ])
mean((lasso.pred - y.test)^2)
## [1] 1008145
**Test Error = 1,008,145**
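The prompt also asks for the number of non-zero coefficient estimates, which is not reported above; a sketch for extracting them at the selected λ (results not shown here):

# 18 rows: the intercept plus the 17 predictors
lasso.coef <- predict(lasso.mod, type = "coefficients", s = bestlam)[1:18, ]
lasso.coef[lasso.coef != 0] # the surviving coefficient estimates
sum(lasso.coef != 0) - 1 # number of non-zero coefficients, excluding the intercept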
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
# Install the pls package (only needed once)
options(repos = c(CRAN = "https://cran.rstudio.com/"))
install.packages("pls")
##
## The downloaded binary packages are in
## /var/folders/rn/t9sf6jsn5cd7nqt_3m6r_r9r0000gn/T//RtmpiB7sRh/downloaded_packages
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
# Principal components regression with standardized predictors and 10-fold CV
pcr.college <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
summary(pcr.college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4347 4345 2371 2391 2104 1949 1898
## adjCV 4347 4345 2368 2396 2085 1939 1891
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1899 1880 1864 1861 1870 1873 1891
## adjCV 1893 1862 1857 1853 1862 1865 1885
## 14 comps 15 comps 16 comps 17 comps
## CV 1903 1727 1295 1260
## adjCV 1975 1669 1283 1249
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.6794 56.94 64.38 70.61 76.27 80.97 84.48 87.54
## Apps 0.9148 71.17 71.36 79.85 81.49 82.73 82.79 83.70
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.50 92.89 94.96 96.81 97.97 98.73 99.39
## Apps 83.86 84.08 84.11 84.11 84.16 84.28 93.08
## 16 comps 17 comps
## X 99.86 100.00
## Apps 93.71 93.95
validationplot(pcr.college, val.type = "MSEP")
pcr.pred <- predict(pcr.college, x[test, ], ncomp = 10)
mean((pcr.pred - y.test)^2)
## [1] 1422699
**Test Error = 1,422,699, M = 10**

M = 10 sits at a local minimum of the CV error curve; the global CV minimum occurs at M = 17, but using all 17 components simply reproduces the least squares fit, so a smaller M is reported here.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
# Partial least squares with standardized predictors and 10-fold CV
pls.college <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(pls.college, val.type="MSEP")
summary(pls.college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4347 2178 1872 1734 1615 1453 1359
## adjCV 4347 2171 1867 1726 1586 1427 1341
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1347 1340 1329 1317 1310 1305 1305
## adjCV 1330 1324 1314 1302 1296 1291 1291
## 14 comps 15 comps 16 comps 17 comps
## CV 1305 1307 1307 1307
## adjCV 1291 1292 1293 1293
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 24.27 38.72 62.64 65.26 69.01 73.96 78.86 82.18
## Apps 76.96 84.31 86.80 91.48 93.37 93.75 93.81 93.84
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 85.35 87.42 89.18 91.41 92.70 94.58 97.16
## Apps 93.88 93.91 93.93 93.94 93.95 93.95 93.95
## 16 comps 17 comps
## X 98.15 100.00
## Apps 93.95 93.95
pls.pred <- predict(pls.college, x[test, ], ncomp = 9)
mean((pls.pred - y.test)^2)
## [1] 1049868
**Test Error = 1,049,868, M = 9**

(The code above uses ncomp = 9; the CV error curve is essentially flat beyond nine components.)
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Model | Test Error |
---|---|
Ridge Regression | 985,020 (best performance) |
Lasso Regression | 1,008,145 |
Linear Regression | 1,020,100 |
PLS | 1,049,868 (M = 9) |
PCR | 1,422,699 (M = 10, worst performance) |
Response: The test MSEs look large, but they are on the squared scale of the response. Taking the square root of the best (ridge) error gives √985,020 ≈ 992, i.e., typical prediction errors of roughly 1,000 applications, which is moderately accurate given that many colleges in the data receive thousands of applications. Ridge regression performed best, and all methods except PCR performed similarly, suggesting that most of the predictors contribute meaningfully and that the assumed linear relationships are reasonable for this dataset. PCR with M = 10 lagged behind because dropping the last principal components discarded directions useful for predicting Apps.
✨ Problem 11
We will now try to predict per-capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(ISLR2)
attach(Boston)
# Predictor matrix without the intercept column, and the response
x <- model.matrix(crim ~ ., Boston)[, -1]
y <- Boston$crim
# Randomly assign half of the observations to the training set
set.seed(10)
train <- sample(1:nrow(x), nrow(x) / 2)
test <- (-train)
Boston.train <- Boston[train, ]
Boston.test <- Boston[test, ]
y.test <- y[test]
Ridge Regression
library(glmnet)
grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x[train, ], y[train], alpha = 0, lambda = grid) # alpha = 0: ridge
summary(ridge.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1200 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
cv.boston.out <- cv.glmnet(x[train, ], y[train], alpha = 0)
bestlam <- cv.boston.out$lambda.min
bestlam
## [1] 0.5060121
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x[test, ])
mean((ridge.pred - y.test)^2)
## [1] 56.03423
Test Error = 56.03
Linear
lsfit <- lm(crim ~ ., data = Boston, subset = train)
summary(lsfit)
##
## Call:
## lm(formula = crim ~ ., data = Boston, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.339 -2.001 -0.250 1.016 59.921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.851945 7.849743 1.255 0.2107
## zn 0.038563 0.023453 1.644 0.1014
## indus -0.093307 0.107353 -0.869 0.3856
## chas -1.258217 1.338569 -0.940 0.3482
## nox -4.490898 5.785471 -0.776 0.4384
## rm 0.066139 0.622549 0.106 0.9155
## age 0.002521 0.019938 0.126 0.8995
## dis -0.626845 0.319291 -1.963 0.0508 .
## rad 0.564707 0.095397 5.920 1.11e-08 ***
## tax -0.002446 0.006107 -0.401 0.6891
## ptratio -0.254045 0.222763 -1.140 0.2552
## lstat 0.167313 0.083650 2.000 0.0466 *
## medv -0.143073 0.068243 -2.097 0.0371 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.336 on 240 degrees of freedom
## Multiple R-squared: 0.5188, Adjusted R-squared: 0.4947
## F-statistic: 21.56 on 12 and 240 DF, p-value: < 2.2e-16
pred.crim <- predict(lsfit, Boston.test)
test.error <- mean((Boston.test$crim - pred.crim)^2)
test.error
## [1] 55.17084
Test Error = 55.17
Lasso
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1, lambda = grid) # alpha = 1: lasso
summary(lasso.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1200 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)
bestlam <- cv.out$lambda.min
bestlam
## [1] 0.04830131
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test, ])
mean((lasso.pred - y.test)^2)
## [1] 55.692
Test Error = 55.69
PCR
library(pls)
pcr.boston <- pcr(crim ~ ., data = Boston.train, scale = TRUE, validation = "CV")
summary(pcr.boston)
## Data: X dimension: 253 12
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 7.521 6.155 6.111 5.807 5.681 5.635 5.646
## adjCV 7.521 6.153 6.109 5.803 5.677 5.632 5.643
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 5.504 5.378 5.377 5.391 5.371 5.320
## adjCV 5.503 5.370 5.371 5.386 5.364 5.314
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.65 61.88 71.98 79.65 86.07 89.97 92.68 94.95
## crim 33.28 34.39 40.90 43.35 44.56 44.71 47.38 50.19
## 9 comps 10 comps 11 comps 12 comps
## X 96.87 98.31 99.47 100.00
## crim 50.21 50.32 50.91 51.88
validationplot(pcr.boston, val.type = "MSEP")
pcr.pred <- predict(pcr.boston, x[test, ], ncomp = 10)
mean((pcr.pred - y.test)^2)
## [1] 57.42611
Test Error = 57.43, M = 10
PLS
pls.boston <- plsr(crim ~ ., data = Boston.train, scale = TRUE, validation = "CV")
validationplot(pls.boston, val.type="MSEP")
summary(pls.boston)
## Data: X dimension: 253 12
## Y dimension: 253 1
## Fit method: kernelpls
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 7.521 5.993 5.562 5.445 5.42 5.409 5.410
## adjCV 7.521 5.988 5.555 5.438 5.41 5.399 5.399
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 5.412 5.407 5.403 5.401 5.402 5.402
## adjCV 5.400 5.396 5.392 5.390 5.391 5.391
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.13 58.29 65.50 73.47 80.99 84.08 89.11 91.99
## crim 37.99 47.63 50.18 51.18 51.43 51.72 51.82 51.86
## 9 comps 10 comps 11 comps 12 comps
## X 94.43 96.65 98.11 100.00
## crim 51.87 51.88 51.88 51.88
pls.pred <- predict(pls.boston, x[test, ], ncomp = 9)
mean((pls.pred - y.test)^2)
## [1] 55.13179
Test Error = 55.13, M = 9
Model | Test Error |
---|---|
PLS Regression | 55.13 (best performance, M = 9) |
Linear Regression | 55.17 |
Lasso Regression | 55.69 |
Ridge Regression | 56.03 |
PCR | 57.43 (worst performance, M = 10) |
Response: All five regression methods achieved very similar test errors on the Boston crime data, ranging from 55.13 to 57.43, with PLS performing marginally best and PCR worst. The negligible differences suggest that regularization offers little advantage over ordinary least squares for this dataset. At the same time, the errors are fairly high in absolute terms: their square roots correspond to typical prediction errors of about 7.4-7.6 in per-capita crime rate, which is large relative to the crime rates of most towns, so accurately predicting crime remains challenging with these methods.
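The prompt also mentions best subset selection, which is not fit above; a minimal sketch of how it could be evaluated on the same validation split, using the leaps package (results not run here):

library(leaps)
# Fit the best subset of each size from 1 to 12 predictors on the training data
regfit.best <- regsubsets(crim ~ ., data = Boston.train, nvmax = 12)
# Compute validation-set MSE for each subset size
test.mat <- model.matrix(crim ~ ., data = Boston.test)
val.errors <- rep(NA, 12)
for (i in 1:12) {
  coefi <- coef(regfit.best, id = i)
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((Boston.test$crim - pred)^2)
}
which.min(val.errors) # subset size with the lowest validation error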
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Response: I would recommend ordinary least squares regression for this dataset. Even though PLS had a slightly lower test error (55.13 vs 55.17), the difference is too small to matter, and linear regression is much simpler to interpret and implement.
To better validate this choice, I would use 10-fold cross-validation rather than a single train/test split, since it gives a more reliable estimate of out-of-sample performance. The near-identical results across methods indicate that basic linear regression works just as well as the more complicated approaches for predicting crime rates in this data.
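A sketch of that strategy using cv.glm from the boot package (illustrative; not run here):

library(boot)
# glm() with the default gaussian family fits the same model as lm(),
# but is compatible with boot::cv.glm for K-fold cross-validation
glm.fit <- glm(crim ~ ., data = Boston)
set.seed(10)
cv.glm(Boston, glm.fit, K = 10)$delta[1] # 10-fold CV estimate of test MSE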
(c) Does your chosen model involve all of the features in the data set? Why or why not?
Response: Yes, my chosen model uses all features in the dataset. The similar performance of ordinary least squares and the lasso (which automatically removes unimportant variables) suggests that most predictors contribute meaningfully to predicting crime rates; notably, the cross-validated λ for the lasso was very small (about 0.048), so it applied almost no shrinkage. Since removing features didn't improve performance, it is better to keep all variables to capture the full range of factors that influence crime.
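This can be checked directly by listing which coefficients the lasso kept at the selected λ (output not shown here):

# 13 rows: the intercept plus the 12 predictors in the ISLR2 Boston data
lasso.coef <- predict(lasso.mod, type = "coefficients", s = bestlam)[1:13, ]
lasso.coef[lasso.coef != 0] # predictors retained by the lasso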