Assignment 5

Question 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

A: iii is the correct answer. The lasso is less flexible than least squares: shrinking the coefficient estimates (and setting some exactly to zero, which removes non-essential variables) reduces variance at the expense of a small increase in bias, whereas the least squares estimates have low bias but higher variance. Prediction accuracy improves when the increase in bias is smaller than the decrease in variance.
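
For reference, the lasso coefficient estimates minimize the penalized criterion

    RSS + λ Σ_j |β_j|,

so larger values of λ give a less flexible, more heavily shrunken fit, with some coefficients set exactly to zero.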

(b) Repeat (a) for ridge regression relative to least squares.

A: iii is the correct answer. Like the lasso, ridge regression shrinks the coefficient estimates, decreasing variance at the cost of some added bias, so it is less flexible than least squares; it improves prediction accuracy when the increase in bias is smaller than the decrease in variance.
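
Ridge regression differs only in its penalty, minimizing

    RSS + λ Σ_j β_j²,

which shrinks the coefficients toward zero but never sets them exactly to zero.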

(c) Repeat (a) for non-linear methods relative to least squares.

A: ii is the correct answer. Non-linear methods are more flexible than least squares and hence have lower bias; they give improved prediction accuracy when their increase in variance is smaller than their decrease in bias.

Question 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

library(ISLR2)
attach(College)
# Design matrix without the intercept column, and the response
x = model.matrix(Apps ~ ., College)[, -1]
y = College$Apps
set.seed(10)
# Randomly assign half of the observations to the training set
train = sample(1:nrow(x), nrow(x)/2)
test = (-train)
College.train = College[train, ]
College.test = College[test, ]
y.test = y[test]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

# Least squares fit on the training set
lm.fit <- lm(Apps ~ ., data = College, subset = train)
summary(lm.fit)

Call:
lm(formula = Apps ~ ., data = College, subset = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-5139.5  -473.3   -21.1   353.2  7402.7 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -629.36179  639.35741  -0.984 0.325579    
PrivateYes  -647.56836  192.17056  -3.370 0.000832 ***
Accept         1.68912    0.05038  33.530  < 2e-16 ***
Enroll        -1.02383    0.27721  -3.693 0.000255 ***
Top10perc     48.19124    8.10714   5.944 6.42e-09 ***
Top25perc    -10.51538    6.44952  -1.630 0.103865    
F.Undergrad    0.01992    0.05364   0.371 0.710574    
P.Undergrad    0.04213    0.05348   0.788 0.431373    
Outstate      -0.09489    0.02674  -3.549 0.000436 ***
Room.Board     0.14549    0.07243   2.009 0.045277 *  
Books          0.06660    0.31115   0.214 0.830623    
Personal       0.05663    0.09453   0.599 0.549475    
PhD          -10.11489    7.11588  -1.421 0.156027    
Terminal      -2.29300    8.03546  -0.285 0.775528    
S.F.Ratio     22.07117   18.70991   1.180 0.238897    
perc.alumni    2.08121    6.00673   0.346 0.729179    
Expend         0.07654    0.01672   4.577 6.45e-06 ***
Grad.Rate      9.99706    4.49821   2.222 0.026857 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1092 on 370 degrees of freedom
Multiple R-squared:  0.9395,    Adjusted R-squared:  0.9367 
F-statistic:   338 on 17 and 370 DF,  p-value: < 2.2e-16
pred.app <- predict(lm.fit, College.test)
test.error <- mean((College.test$Apps - pred.app)^2)
test.error
[1] 1020100
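
Since 1010² = 1,020,100, this test MSE corresponds to a root-mean-squared prediction error of about 1,010 applications (sqrt(test.error)).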

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-10
# Grid of λ values from 10^10 down to 10^-2; alpha = 0 fits ridge regression
grid = 10^seq(10, -2, length = 100)
ridge.mod = glmnet(x[train, ], y[train], alpha = 0, lambda = grid)
summary(ridge.mod)
          Length Class     Mode   
a0         100   -none-    numeric
beta      1700   dgCMatrix S4     
df         100   -none-    numeric
dim          2   -none-    numeric
lambda     100   -none-    numeric
dev.ratio  100   -none-    numeric
nulldev      1   -none-    numeric
npasses      1   -none-    numeric
jerr         1   -none-    numeric
offset       1   -none-    logical
call         5   -none-    call   
nobs         1   -none-    numeric
# Choose λ by cross-validation on the training set
cv.college.out = cv.glmnet(x[train, ], y[train], alpha = 0)
bestlam = cv.college.out$lambda.min
bestlam
[1] 411.3927
ridge.pred = predict(ridge.mod, s = bestlam, newx = x[test, ])
mean((ridge.pred - y.test)^2)
[1] 985020.1
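
The ridge test MSE (≈ 985,020) is slightly lower than the least squares test MSE (≈ 1,020,100).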

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

# alpha = 1 fits the lasso
lasso.mod = glmnet(x[train, ], y[train], alpha = 1, lambda = grid)
summary(lasso.mod)
          Length Class     Mode   
a0         100   -none-    numeric
beta      1700   dgCMatrix S4     
df         100   -none-    numeric
dim          2   -none-    numeric
lambda     100   -none-    numeric
dev.ratio  100   -none-    numeric
nulldev      1   -none-    numeric
npasses      1   -none-    numeric
jerr         1   -none-    numeric
offset       1   -none-    logical
call         5   -none-    call   
nobs         1   -none-    numeric
cv.out = cv.glmnet(x[train, ], y[train], alpha = 1)
bestlam = cv.out$lambda.min
bestlam
[1] 24.66235
lasso.pred = predict(lasso.mod, s = bestlam, newx = x[test, ])
mean((lasso.pred - y.test)^2)
[1] 1008145
# Refit on the full data set and extract the coefficients at the chosen λ
out = glmnet(x, y, alpha = 1, lambda = grid)
lasso.coef = predict(out, type = "coefficients", s = bestlam)[1:18, ]
lasso.coef[lasso.coef != 0]
  (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
-6.324960e+02 -4.087012e+02  1.436837e+00 -1.410240e-01  3.143012e+01 
    Top25perc   P.Undergrad      Outstate    Room.Board      Personal 
-8.606536e-01  1.480293e-02 -5.342495e-02  1.205819e-01  4.379135e-05 
          PhD      Terminal     S.F.Ratio   perc.alumni        Expend 
-5.121245e+00 -3.371192e+00  2.717231e+00 -1.039648e+00  6.838161e-02 
    Grad.Rate 
 4.700317e+00 
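
Counting the non-zero entries above (excluding the intercept), the lasso keeps 15 of the 17 predictors, dropping F.Undergrad and Books. This can be checked directly:

sum(lasso.coef != 0) - 1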

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)

Attaching package: 'pls'
The following object is masked from 'package:stats':

    loadings
# PCR on the training set with standardized predictors; M chosen by 10-fold CV
pcr.college = pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
summary(pcr.college)
Data:   X dimension: 388 17 
    Y dimension: 388 1
Fit method: svdpc
Number of components considered: 17

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV            4347     4345     2371     2391     2104     1949     1898
adjCV         4347     4345     2368     2396     2085     1939     1891
       7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
CV        1899     1880     1864      1861      1870      1873      1891
adjCV     1893     1862     1857      1853      1862      1865      1885
       14 comps  15 comps  16 comps  17 comps
CV         1903      1727      1295      1260
adjCV      1975      1669      1283      1249

TRAINING: % variance explained
      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X     32.6794    56.94    64.38    70.61    76.27    80.97    84.48    87.54
Apps   0.9148    71.17    71.36    79.85    81.49    82.73    82.79    83.70
      9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
X       90.50     92.89     94.96     96.81     97.97     98.73     99.39
Apps    83.86     84.08     84.11     84.11     84.16     84.28     93.08
      16 comps  17 comps
X        99.86    100.00
Apps     93.71     93.95
validationplot(pcr.college, val.type="MSEP")
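
From the RMSEP table and the validation plot, the CV error levels off at around M = 10 components and only drops substantially again near M = 16-17, which is essentially the full least-squares fit, so M = 10 is used: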

pcr.pred = predict(pcr.college, x[test, ], ncomp = 10)
mean((pcr.pred - y.test)^2)
[1] 1422699

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

# PLS on the training set with standardized predictors; M chosen by 10-fold CV
pls.college = plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(pls.college, val.type="MSEP")

summary(pls.college)
Data:   X dimension: 388 17 
    Y dimension: 388 1
Fit method: kernelpls
Number of components considered: 17

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV            4347     2178     1872     1734     1615     1453     1359
adjCV         4347     2171     1867     1726     1586     1427     1341
       7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
CV        1347     1340     1329      1317      1310      1305      1305
adjCV     1330     1324     1314      1302      1296      1291      1291
       14 comps  15 comps  16 comps  17 comps
CV         1305      1307      1307      1307
adjCV      1291      1292      1293      1293

TRAINING: % variance explained
      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X       24.27    38.72    62.64    65.26    69.01    73.96    78.86    82.18
Apps    76.96    84.31    86.80    91.48    93.37    93.75    93.81    93.84
      9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
X       85.35     87.42     89.18     91.41     92.70     94.58     97.16
Apps    93.88     93.91     93.93     93.94     93.95     93.95     93.95
      16 comps  17 comps
X        98.15    100.00
Apps     93.95     93.95
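
The CV error is essentially flat beyond about M = 9 components, so M = 9 is used: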
pls.pred = predict(pls.college, x[test, ], ncomp = 9)
mean((pls.pred - y.test)^2)
[1] 1049868

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

# Test R-squared for each method: 1 - RSS/TSS on the test set
test.avg = mean(College.test[, "Apps"])
lm.test.r2 = 1 - mean((College.test[, "Apps"] - pred.app)^2) / mean((College.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((College.test[, "Apps"] - ridge.pred)^2) / mean((College.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((College.test[, "Apps"] - lasso.pred)^2) / mean((College.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((pcr.pred - y.test)^2) / mean((College.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((pls.pred - y.test)^2) / mean((College.test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2),
        names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main = "Test R-squared")
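
All five approaches predict the number of applications fairly accurately, and four of the five test errors are close: ridge is lowest (≈ 985,020), followed by lasso (≈ 1,008,145), least squares (≈ 1,020,100), and PLS (≈ 1,049,868), while PCR with M = 10 is noticeably worse (≈ 1,422,699). The test R-squared barplot tells the same story: every method except PCR explains a similarly high fraction of the variance in the test set.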

detach(College)

Question 11

We will now try to predict the per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(leaps)
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:ISLR2':

    Boston
set.seed(1)
attach(Boston)
# regsubsets() has no predict() method, so define one for use in cross-validation
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}

k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
# 10-fold cross-validation over best subset models of every size
for (i in 1:k) {
    best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred = predict(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
    }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")

which.min(mean.cv.errors)
[1] 9
mean.cv.errors[which.min(mean.cv.errors)]
[1] 42.81453
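
To see which nine predictors the selected model uses, one can refit best subsets on the full data and extract the coefficients at size 9 (a sketch; output omitted, and reg.best is just an illustrative name):

reg.best = regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(reg.best, 9)
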
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
# The lasso (glmnet's default alpha = 1), with λ chosen by cross-validation
cv.lasso = cv.glmnet(x, y, type.measure = "mse")
plot(cv.lasso)

coef(cv.lasso)
14 x 1 sparse Matrix of class "dgCMatrix"
            lambda.1se
(Intercept)   2.176491
zn            .       
indus         .       
chas          .       
nox           .       
rm            .       
age           .       
dis           .       
rad           0.150484
tax           .       
ptratio       .       
black         .       
lstat         .       
medv          .       
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])
[1] 7.921353
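
At the one-standard-error λ, the lasso retains only rad; its CV RMSE of about 7.92 corresponds to an MSE of roughly 62.7.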
# Ridge regression (alpha = 0), reusing the same x and y
cv.ridge = cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(cv.ridge)

coef(cv.ridge)
14 x 1 sparse Matrix of class "dgCMatrix"
              lambda.1se
(Intercept)  1.523899542
zn          -0.002949852
indus        0.029276741
chas        -0.166526007
nox          1.874769665
rm          -0.142852604
age          0.006207995
dis         -0.094547258
rad          0.045932737
tax          0.002086668
ptratio      0.071258052
black       -0.002605281
lstat        0.035745604
medv        -0.023480540
sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se])
[1] 7.669133
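
Ridge keeps all 13 predictors with small coefficients; its CV RMSE of about 7.67 (MSE ≈ 58.8) is slightly better than the lasso's one-standard-error fit.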
# PCR on the full data set with standardized predictors; M chosen by 10-fold CV
pcr.crime = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.crime)
Data:   X dimension: 506 13 
    Y dimension: 506 1
Fit method: svdpc
Number of components considered: 13

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV            8.61    7.175    7.180    6.724    6.731    6.727    6.727
adjCV         8.61    7.174    7.179    6.721    6.725    6.724    6.724
       7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
CV       6.722    6.614    6.618     6.607     6.598     6.553     6.488
adjCV    6.718    6.609    6.613     6.602     6.592     6.546     6.481

TRAINING: % variance explained
      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
      9 comps  10 comps  11 comps  12 comps  13 comps
X       95.40     97.04     98.46     99.52     100.0
crim    42.55     42.78     43.04     44.13      45.4
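
For PCR, the cross-validated RMSEP decreases essentially monotonically and is smallest at M = 13 components, i.e. the full least-squares fit, so PCR achieves no useful dimension reduction on this data set.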

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

Based on cross-validated MSE, the best subset selection model performed best: its 9-variable fit had the lowest CV error (MSE ≈ 42.8, i.e. RMSE ≈ 6.5), compared with CV RMSEs of about 7.92 for the lasso and 7.67 for ridge at their one-standard-error values of λ, while PCR needed all 13 components to minimize its CV error.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

No. The chosen model is the 9-variable best subset model, so it does not involve all 13 predictors. Dropping the remaining four variables reduces variance without materially increasing bias, which is why the 9-variable model attains a lower cross-validation MSE than the full model; the goal is low estimated test error, not using every available feature.