ISLR Chapter 6: Linear Model Selection and Regularization

Assignment 5 - STA 6543

Olivia Shipley

2025-07-29

knitr::opts_chunk$set(echo = TRUE)

✨ Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Justify your answer.

(a) The lasso, relative to least squares, is:

Response: (iii) - The lasso adds an L1 penalty term to the least squares objective function. Since the lasso is less flexible, it typically increases bias but decreases variance. The model will give improved prediction accuracy when this trade-off is favorable.

(b) Repeat (a) for ridge regression relative to least squares.

Response: (iii) - The ridge regression adds an L2 penalty term to the least squares object function, similar to the lasso. It is less flexible than least squares, increasing bias while decreasing variance. It improves predication accuracy when the bias increase is outweighed by the variance reduction.

(c) Repeat (a) for non-linear methods relative to least squares.

Response: (ii) - Non-linear methods are more flexible than least squares. They improve prediction accuracy when the increase in variance is less than the decrease in bias.

✨ Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

library(ISLR2)
attach(College)

x = model.matrix(Apps~., College)[,-1]
y = College$Apps

set.seed(10)
train = sample(1:nrow(x), nrow(x)/2)
test = (-train)

College.train = College[train, ]
College.test = College[test, ]

y.test = y[test]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

lsfit <- lm(Apps~., data = College, subset = train)
summary(lsfit)

## 
## Call:
## lm(formula = Apps ~ ., data = College, subset = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5139.5  -473.3   -21.1   353.2  7402.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -629.36179  639.35741  -0.984 0.325579    
## PrivateYes  -647.56836  192.17056  -3.370 0.000832 ***
## Accept         1.68912    0.05038  33.530  < 2e-16 ***
## Enroll        -1.02383    0.27721  -3.693 0.000255 ***
## Top10perc     48.19124    8.10714   5.944 6.42e-09 ***
## Top25perc    -10.51538    6.44952  -1.630 0.103865    
## F.Undergrad    0.01992    0.05364   0.371 0.710574    
## P.Undergrad    0.04213    0.05348   0.788 0.431373    
## Outstate      -0.09489    0.02674  -3.549 0.000436 ***
## Room.Board     0.14549    0.07243   2.009 0.045277 *  
## Books          0.06660    0.31115   0.214 0.830623    
## Personal       0.05663    0.09453   0.599 0.549475    
## PhD          -10.11489    7.11588  -1.421 0.156027    
## Terminal      -2.29300    8.03546  -0.285 0.775528    
## S.F.Ratio     22.07117   18.70991   1.180 0.238897    
## perc.alumni    2.08121    6.00673   0.346 0.729179    
## Expend         0.07654    0.01672   4.577 6.45e-06 ***
## Grad.Rate      9.99706    4.49821   2.222 0.026857 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1092 on 370 degrees of freedom
## Multiple R-squared:  0.9395, Adjusted R-squared:  0.9367 
## F-statistic:   338 on 17 and 370 DF,  p-value: < 2.2e-16

predApp <- predict(lsfit, College.test)
testError <- mean((College.test$Apps-predApp)^2)
testError

## [1] 1020100

**Test Error = 1,020,100.00 **

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-9

grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid)
summary(ridge.mod)

##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1700   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric

cv.college.out=cv.glmnet(x[train,],y[train] ,alpha=0)
bestlam=cv.college.out$lambda.min
bestlam

## [1] 411.3927

ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
mean((ridge.pred-y.test)^2)

## [1] 985020.1

**Test Error = 985,020.10**

Fit a lasso model on the training set, with λ chosen by cross- validation. Report the test error obtained, along with the number of non-zero coeﬀicient estimates.

lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
summary(lasso.mod)

##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1700   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric

cv.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.out$lambda.min
bestlam

## [1] 24.66235

lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)

## [1] 1008145

**Test Error = 1,008,145.00**

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

options(repos = c(CRAN = "https://cran.rstudio.com/"))
install.packages("pls")

## 
## The downloaded binary packages are in
##  /var/folders/rn/t9sf6jsn5cd7nqt_3m6r_r9r0000gn/T//RtmpiB7sRh/downloaded_packages

library(pls)

## 
## Attaching package: 'pls'

## The following object is masked from 'package:stats':
## 
##     loadings

pcr.college=pcr(Apps~., data=College.train,scale=TRUE,validation="CV")
summary(pcr.college)

## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4347     4345     2371     2391     2104     1949     1898
## adjCV         4347     4345     2368     2396     2085     1939     1891
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1899     1880     1864      1861      1870      1873      1891
## adjCV     1893     1862     1857      1853      1862      1865      1885
##        14 comps  15 comps  16 comps  17 comps
## CV         1903      1727      1295      1260
## adjCV      1975      1669      1283      1249
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X     32.6794    56.94    64.38    70.61    76.27    80.97    84.48    87.54
## Apps   0.9148    71.17    71.36    79.85    81.49    82.73    82.79    83.70
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.50     92.89     94.96     96.81     97.97     98.73     99.39
## Apps    83.86     84.08     84.11     84.11     84.16     84.28     93.08
##       16 comps  17 comps
## X        99.86    100.00
## Apps     93.71     93.95

validationplot(pcr.college, val.type="MSEP")

pcr.pred=predict(pcr.college,x[test,],ncomp=10)
mean((pcr.pred-y.test)^2)

## [1] 1422699

Test Error = 1,422,699.00, M = 10

(f) Fit a PLS model on the training set, with M chosen by cross- validation. Report the test error obtained, along with the value of M selected by cross-validation.

pls.college=plsr(Apps~., data=College.train,scale=TRUE, validation="CV")
validationplot(pls.college, val.type="MSEP")

summary(pls.college)

## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4347     2178     1872     1734     1615     1453     1359
## adjCV         4347     2171     1867     1726     1586     1427     1341
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1347     1340     1329      1317      1310      1305      1305
## adjCV     1330     1324     1314      1302      1296      1291      1291
##        14 comps  15 comps  16 comps  17 comps
## CV         1305      1307      1307      1307
## adjCV      1291      1292      1293      1293
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       24.27    38.72    62.64    65.26    69.01    73.96    78.86    82.18
## Apps    76.96    84.31    86.80    91.48    93.37    93.75    93.81    93.84
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       85.35     87.42     89.18     91.41     92.70     94.58     97.16
## Apps    93.88     93.91     93.93     93.94     93.95     93.95     93.95
##       16 comps  17 comps
## X        98.15    100.00
## Apps     93.95     93.95

pls.pred=predict(pls.college,x[test,],ncomp=9)
mean((pls.pred-y.test)^2)

## [1] 1049868

Test Error = 1,049,868, M = 11

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Model	Test Error
Ridge Regression	985,020 (best performance)
Linear Regression	1,020,100
Lasso Regression	1,008,145
PLS	1,049,868 (M=11)
PCR	1,422,699 (M=10, worst performance)

Response: It is difficult to accuratly predict the number of college applications received without knowing the scale of the response variable. The test errors are large, and taking the square root gives you a sense of typical prediction errors, but more metrics are needed to make an accurate prediction. Ridge regression performed best, but all models had a similar performance, suggesting that most of the predictors contribute meaningfully to the model and that the assumed linear relationships are reasonable for this dataset.

✨ Problem 11

We will now try to predict per-capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(ISLR2)
attach(Boston)
x=model.matrix(crim~.,Boston)[,-1]
y=Boston$crim
set.seed(10)
train=sample(1:nrow(x), nrow(x)/2)
test=(-train)
Boston.train = Boston[train, ]
Boston.test = Boston[test, ]
y.test=y[test]

Ridge Regression

library(glmnet)
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid)
summary(ridge.mod)

##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1200   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric

cv.boston.out=cv.glmnet(x[train,],y[train] ,alpha=0)
bestlam=cv.boston.out$lambda.min
bestlam

## [1] 0.5060121

ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
mean((ridge.pred-y.test)^2)

## [1] 56.03423

Test Error = 56.03

Linear

Lsfit<-lm(crim~., data=Boston, subset=train)
summary(Lsfit)

## 
## Call:
## lm(formula = crim ~ ., data = Boston, subset = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.339 -2.001 -0.250  1.016 59.921 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.851945   7.849743   1.255   0.2107    
## zn           0.038563   0.023453   1.644   0.1014    
## indus       -0.093307   0.107353  -0.869   0.3856    
## chas        -1.258217   1.338569  -0.940   0.3482    
## nox         -4.490898   5.785471  -0.776   0.4384    
## rm           0.066139   0.622549   0.106   0.9155    
## age          0.002521   0.019938   0.126   0.8995    
## dis         -0.626845   0.319291  -1.963   0.0508 .  
## rad          0.564707   0.095397   5.920 1.11e-08 ***
## tax         -0.002446   0.006107  -0.401   0.6891    
## ptratio     -0.254045   0.222763  -1.140   0.2552    
## lstat        0.167313   0.083650   2.000   0.0466 *  
## medv        -0.143073   0.068243  -2.097   0.0371 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.336 on 240 degrees of freedom
## Multiple R-squared:  0.5188, Adjusted R-squared:  0.4947 
## F-statistic: 21.56 on 12 and 240 DF,  p-value: < 2.2e-16

pred.crim<-predict(Lsfit, Boston.test)
test.error<-mean((Boston.test$crim-pred.crim)^2)
test.error

## [1] 55.17084

Test Error = 55.17

Lasso

lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
summary(lasso.mod)

##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1200   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric

cv.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.out$lambda.min
bestlam

## [1] 0.04830131

lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)

## [1] 55.692

Test Error = 55.69

PCR

library(pls)
pcr.boston=pcr(crim~., data=Boston.train,scale=TRUE,validation="CV")
summary(pcr.boston)

## Data:    X dimension: 253 12 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           7.521    6.155    6.111    5.807    5.681    5.635    5.646
## adjCV        7.521    6.153    6.109    5.803    5.677    5.632    5.643
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       5.504    5.378    5.377     5.391     5.371     5.320
## adjCV    5.503    5.370    5.371     5.386     5.364     5.314
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.65    61.88    71.98    79.65    86.07    89.97    92.68    94.95
## crim    33.28    34.39    40.90    43.35    44.56    44.71    47.38    50.19
##       9 comps  10 comps  11 comps  12 comps
## X       96.87     98.31     99.47    100.00
## crim    50.21     50.32     50.91     51.88

validationplot(pcr.boston, val.type="MSEP")

pcr.pred=predict(pcr.boston,x[test,],ncomp=10)
mean((pcr.pred-y.test)^2)

## [1] 57.42611

Test Error = 57.43, M = 10

PLS

pls.boston=plsr(crim~., data=Boston.train,scale=TRUE, validation="CV")
validationplot(pls.boston, val.type="MSEP")

summary(pls.boston)

## Data:    X dimension: 253 12 
##  Y dimension: 253 1
## Fit method: kernelpls
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           7.521    5.993    5.562    5.445     5.42    5.409    5.410
## adjCV        7.521    5.988    5.555    5.438     5.41    5.399    5.399
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       5.412    5.407    5.403     5.401     5.402     5.402
## adjCV    5.400    5.396    5.392     5.390     5.391     5.391
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.13    58.29    65.50    73.47    80.99    84.08    89.11    91.99
## crim    37.99    47.63    50.18    51.18    51.43    51.72    51.82    51.86
##       9 comps  10 comps  11 comps  12 comps
## X       94.43     96.65     98.11    100.00
## crim    51.87     51.88     51.88     51.88

pls.pred=predict(pls.boston,x[test,],ncomp=9)
mean((pls.pred-y.test)^2)

## [1] 55.13179

Test Error = 55.13, M = 9

Model	Test Error
PLS Regression	55.13 (best performance, M=9)
Linear Regression	55.17
Lasso Regression	55.69
Ridge Regression	56.03
PCR	57.43 (worst performance, M=10)

Response: Based on the Boston crime prediction analysis, all five regression methods achieved very similar test errors ranging from 55.13 to 57.43, with PLS regression performing marginally best and PCR worst. The negligible differences between approaches suggest that regularization techniques offer little advantage over ordinary least squares for this dataset, though the relatively high test errors (corresponding to prediction errors of about 7.4-7.6 crime incidents per capita) indicate that accurately predicting crime rates remains challenging even with these methods.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

Response: I would recommend ordinary least squares regression for this dataset. Even though PLS had a slightly lower test error (55.13 vs 55.17), the difference is too small to matter, and linear regression is much simpler to interpret and implement.

To better validate this choice, I would use 10-fold cross-validation instead of just one train-test split, which gives a more reliable estimate of how well the model actually performs. Since all the methods performed almost identically, it shows that basic linear regression works just as well as the more complicated approaches for predicting crime rates in this data.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

Response: Yes, my chosen model uses all features in the dataset. The similar performance between ordinary least squares and Lasso regression (which automatically removes unimportant variables) suggests that most predictors contribute meaningfully to predicting crime rates. Since removing features didn’t improve performance, it’s better to keep all variables to capture the full range of factors that influence crime.