Questions #2, 9, 11

0.1 Lasso, Ridge Regression, and Non-Linear Models vs Least Squares

2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

MULTIPLE CHOICE ANSWERS:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  (a) The lasso, relative to least squares, is:

The correct answer is iii. The lasso is less flexible than least squares, and its advantage rests on the bias-variance trade-off: it accepts a small increase in bias in exchange for a larger decrease in variance, so prediction accuracy improves whenever the increase in bias is less than the decrease in variance. The lasso also shrinks coefficient estimates toward zero (some exactly to zero), which produces simpler, easier-to-interpret models.
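
As a quick illustration of that shrinkage (a hedged sketch on simulated data, not part of the original answer), a glmnet coefficient path shows every estimate being pulled toward zero, and some set exactly to zero, as λ grows:

library(glmnet)
set.seed(1)
#simulate 100 observations on 10 predictors, only 3 of which matter
x.sim = matrix(rnorm(100 * 10), 100, 10)
y.sim = drop(x.sim %*% c(3, -2, 1, rep(0, 7)) + rnorm(100))
#alpha = 1 requests the lasso; plot the coefficient paths against log(lambda)
plot(glmnet(x.sim, y.sim, alpha = 1), xvar = "lambda", label = TRUE)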

  (b) Repeat (a) for ridge regression relative to least squares.

The correct answer is iii. Like the lasso, ridge regression is less flexible than least squares and relies on the same bias-variance trade-off: a small increase in bias in exchange for a larger reduction in variance. Unlike the lasso, however, ridge shrinks coefficients toward zero without setting any exactly to zero.

  (c) Repeat (a) for non-linear methods relative to least squares.

The correct answer is ii. Non-linear methods are more flexible than least squares and therefore have lower bias; they give improved prediction accuracy when their increase in variance is less than their decrease in bias.
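
A small simulated example (an illustrative sketch, not part of the original answer) makes the trade-off concrete: when the true relationship is curved, the flexible fit's decrease in bias outweighs its increase in variance on test data:

set.seed(1)
x.sim = runif(200, -2, 2)
y.sim = sin(2 * x.sim) + rnorm(200, sd = 0.3) #the truth is non-linear
tr = sample(200, 100)
#least squares line vs. a flexible smoothing spline, both fit on the training half
fit.lm = lm(y.sim ~ x.sim, subset = tr)
fit.sp = smooth.spline(x.sim[tr], y.sim[tr])
mse.lm = mean((y.sim[-tr] - predict(fit.lm, data.frame(x.sim = x.sim[-tr])))^2)
mse.sp = mean((y.sim[-tr] - predict(fit.sp, x.sim[-tr])$y)^2)
c(linear = mse.lm, spline = mse.sp) #the spline's test MSE is far lower here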

0.2 Applications Received from College Data Set

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.
  (a) Split the data set into a training set and a test set.
library(ISLR2)
library(leaps)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
sum(is.na(College$Apps)) #check for missing observations
## [1] 0
college = College
attach(college)
#Reproducibility
set.seed(13)

#Matrices
x = model.matrix(Apps~.,
                 college)[, -1]
y = college$Apps

#splitting the data
train = sample(1:nrow(x),
               nrow(x) / 2)
test = (-train)
xtest = x[test,]
xtrain = x[train,]
ytest = y[test]
ytrain = y[train]
college.train = college[train, ]
college.test = college[test, ]
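
A quick sanity check that the 50/50 split behaved as expected (a small addition to the original code):

c(train = nrow(college.train), test = nrow(college.test)) #388 and 389 rows
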
  (b) Fit a linear model using least squares on the training set, and report the test error obtained.

The test MSE for the linear model fit by least squares is 1,062,514.

lm.fit = lm(Apps~.,
            data = college.train)
summary(lm.fit)
## 
## Call:
## lm(formula = Apps ~ ., data = college.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5350.6  -407.8   -43.6   325.9  7218.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -776.48813  581.48126  -1.335  0.18258    
## PrivateYes  -467.94850  210.11642  -2.227  0.02654 *  
## Accept         1.67710    0.05402  31.048  < 2e-16 ***
## Enroll        -1.14544    0.25916  -4.420 1.30e-05 ***
## Top10perc     45.40894    8.20656   5.533 5.96e-08 ***
## Top25perc    -10.03632    6.35221  -1.580  0.11497    
## F.Undergrad    0.06213    0.04903   1.267  0.20595    
## P.Undergrad    0.07724    0.04238   1.823  0.06914 .  
## Outstate      -0.08890    0.02775  -3.203  0.00148 ** 
## Room.Board     0.14797    0.06968   2.124  0.03436 *  
## Books          0.02191    0.33728   0.065  0.94824    
## Personal       0.06113    0.08837   0.692  0.48948    
## PhD           -9.78906    6.33935  -1.544  0.12340    
## Terminal      -3.67940    7.12700  -0.516  0.60598    
## S.F.Ratio     32.77911   17.70983   1.851  0.06498 .  
## perc.alumni    0.93641    6.16345   0.152  0.87933    
## Expend         0.08741    0.01758   4.972 1.02e-06 ***
## Grad.Rate      8.12168    4.46196   1.820  0.06954 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1068 on 370 degrees of freedom
## Multiple R-squared:  0.9406, Adjusted R-squared:  0.9379 
## F-statistic: 344.6 on 17 and 370 DF,  p-value: < 2.2e-16
lm.pred = predict(lm.fit,
                  college.test)
lm.err = mean((college.test$Apps - lm.pred)^2)
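
Printing the stored error gives the value reported above:

lm.err #test MSE of the least squares fit, approximately 1,062,514
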
  (c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

The test MSE for the ridge regression model is 1,172,800.

grid = 10^seq(10, -2, length = 100)
ridge.mod = glmnet(xtrain, #fit ridge regression model
                   ytrain,
                   alpha = 0,
                   lambda = grid)
summary(ridge.mod)
##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1700   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric
dim(coef(ridge.mod))
## [1]  18 100
set.seed(13)
cv.out = cv.glmnet(xtrain,
                   ytrain,
                   alpha = 0)
plot(cv.out)

bestlam = cv.out$lambda.min
bestlam
## [1] 407.0529
#the value of λ giving the smallest cross-validation error is about 407

ridge.pred = predict(ridge.mod,
                     s = bestlam,
                     newx = xtest)
ridge.err = mean((ridge.pred - ytest)^2)
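
As an optional robustness check (an addition, not part of the original analysis), the more conservative lambda.1se can be evaluated the same way:

ridge.pred.1se = predict(ridge.mod,
                         s = cv.out$lambda.1se,
                         newx = xtest)
mean((ridge.pred.1se - ytest)^2) #typically somewhat higher than at lambda.min
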
  (d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

The test MSE for the lasso model is 1,055,889. At the selected λ, 15 of the 17 predictor coefficients are non-zero (F.Undergrad and Books are shrunk exactly to zero).

set.seed(13)
lasso.mod = glmnet(xtrain,
                   ytrain,
                   alpha = 1,
                   lambda = grid)
summary(lasso.mod)
##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1700   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric
cv.out = cv.glmnet(xtrain,
                   ytrain,
                   alpha = 1)
plot(cv.out)

bestlam = cv.out$lambda.min
bestlam #the value of λ giving the smallest cross-validation error is about 20
## [1] 20.25912
lasso.pred = predict(lasso.mod,
                     s = bestlam,
                     newx = xtest)
lasso.err = mean((lasso.pred - ytest)^2)
out = glmnet(x,
             y,
             alpha = 1,
             lambda = grid)
lasso.coef = predict(out,
                     type = "coefficients",
                     s = bestlam)[1:18,]
lasso.coef[lasso.coef != 0]
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -5.922287e+02 -4.295925e+02  1.463333e+00 -2.259221e-01  3.467654e+01 
##     Top25perc   P.Undergrad      Outstate    Room.Board      Personal 
## -3.110529e+00  2.303473e-02 -5.976986e-02  1.263825e-01  1.822314e-03 
##           PhD      Terminal     S.F.Ratio   perc.alumni        Expend 
## -5.777387e+00 -3.284070e+00  5.098129e+00 -9.176608e-01  7.012809e-02 
##     Grad.Rate 
##  5.354180e+00
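
To report the count the question asks for directly (a small addition):

sum(lasso.coef != 0) - 1 #15 non-zero predictor coefficients, excluding the intercept
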
  (e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

The cross-validation error reaches its minimum at M = 17 components (CV RMSEP 1331), but using all 17 components simply reproduces the full least squares fit. The curve first levels off around M = 10 (CV RMSEP 1737), so M = 10 is used for the prediction below to retain some dimension reduction.

set.seed(13)
pcr.fit = pcr(Apps~.,
              data = college.train,
              scale = T,
              validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4290     4143     2285     2290     1938     1858     1827
## adjCV         4290     4145     2279     2286     1876     1838     1817
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1827     1775     1747      1737      1746      1742      1740
## adjCV     1819     1755     1740      1726      1738      1734      1727
##        14 comps  15 comps  16 comps  17 comps
## CV         1742      1729      1398      1331
## adjCV      1734      1714      1378      1314
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      31.815    56.30    63.57    69.42    74.94    80.01    84.00    87.65
## Apps    9.051    73.44    73.60    83.11    83.88    84.02    84.03    85.64
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.78     92.95     95.05     96.76     97.81     98.79     99.40
## Apps    85.72     86.03     86.17     86.32     86.87     86.99     90.83
##       16 comps  17 comps
## X        99.83    100.00
## Apps     93.69     94.06
validationplot(pcr.fit, val.type = "MSEP")

pcr.pred = predict(pcr.fit,
                   xtest,
                   ncomp = 10)
pcr.err = mean((pcr.pred - ytest)^2)
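
Printing the stored error gives the PCR test error for part (e):

pcr.err #test MSE of the PCR fit with M = 10 components
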
  (f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

The cross-validation error drops quickly and levels off after about seven components (CV RMSEP 1363, with 75.79% of the predictor variance and 93.86% of the Apps variance explained); adding further components produces only marginal changes. The minimum CV RMSEP of 1331 is reached by M = 16, which is the number of components used for the prediction below.

set.seed(13)
pls.fit = plsr(Apps~.,
               data = college.train,
               scale = T,
               validation = "CV")
summary(pls.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4290     2112     1887     1676     1665     1567     1461
## adjCV         4290     2104     1885     1666     1633     1532     1436
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1363     1352     1337      1334      1335      1335      1334
## adjCV     1344     1334     1320      1317      1318      1318      1316
##        14 comps  15 comps  16 comps  17 comps
## CV         1332      1332      1331      1331
## adjCV      1314      1314      1314      1314
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.73    49.37    61.82    64.43    67.98    73.18    75.79    79.75
## Apps    78.00    83.25    88.08    91.45    93.18    93.64    93.86    93.93
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       81.98     84.98     87.93     90.50     92.21     94.32     96.91
## Apps    93.99     94.02     94.04     94.05     94.05     94.06     94.06
##       16 comps  17 comps
## X        98.74    100.00
## Apps     94.06     94.06
validationplot(pls.fit, val.type = "MSEP")

pls.pred = predict(pls.fit,
                   xtest,
                   ncomp = 16)
pls.err = mean((pls.pred - ytest)^2)
  (g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Comparing the test R² of each model shows that all five predict the number of applications quite accurately; PCR has the lowest accuracy, while the other four are very similar. Comparing test MSE, the lasso has the lowest error, so it would be a reasonable choice here.
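
Gathering the five test MSEs into one named vector (a small addition) makes the comparison explicit before plotting:

errs = c(OLS = lm.err, Ridge = ridge.err, Lasso = lasso.err,
         PCR = pcr.err, PLS = pls.err)
round(errs)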

barplot(c(lm.err, ridge.err, lasso.err, pcr.err, pls.err),
        names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
        col = c("#eb8060", "#b9e38d", "#a1e9f0", "#d9b1f0", "#f3c9a0"), #one color per bar
        xlab = "Regression Methods",
        ylab = "Test Error",
        main = "Test Errors for All Methods")

test.avg = mean(college.test$Apps)
lm.test.r2 = 1 - mean((college.test$Apps - lm.pred)^2) /
  mean((college.test$Apps - test.avg)^2)
ridge.test.r2 = 1 - mean((college.test$Apps - ridge.pred)^2) /
  mean((college.test$Apps - test.avg)^2)
lasso.test.r2 = 1 - mean((college.test$Apps - lasso.pred)^2) /
  mean((college.test$Apps - test.avg)^2)
pcr.test.r2 = 1 - mean((college.test$Apps - pcr.pred)^2) /
  mean((college.test$Apps - test.avg)^2)
pls.test.r2 = 1 - mean((college.test$Apps - pls.pred)^2) /
  mean((college.test$Apps - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2),
        names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
        col = c("#eb8060", "#b9e38d", "#a1e9f0", "#d9b1f0", "#f3c9a0"), #one color per bar
        xlab = "Regression Methods",
        ylab = "Test R^2",
        main = "Test R-squared")

0.3 Crime Rate from Boston Data Set

11. We will now try to predict per capita crime rate in the Boston data set.
  (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
str(Boston)
## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
sum(is.na(Boston$crim))
## [1] 0
#Predict Function
#regsubsets() has no predict() method, so define one
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]]) #recover the model formula
  mat <- model.matrix(form, newdata)   #design matrix for the new data
  coefi <- coef(object, id = id)       #coefficients of the id-variable model
  xvars <- names(coefi)
  mat[, xvars] %*% coefi               #predictions
}

#Matrices
x = model.matrix(crim~.,
                 data = Boston)[,-1]
y = Boston$crim

Best Subset

set.seed(13)
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)

#Loops
for (i in 1:k) {
  best.fit = regsubsets(crim ~ .,
                        data = Boston[folds!=i,],
                        nvmax = p)
  for (j in 1:p) {
    pred = predict(best.fit,
                   Boston[folds== i,],
                   id = j)
    cv.errors[i,j] = mean((Boston$crim[folds == i] - pred)^2)
  }
}


mean.cv.errors = apply(cv.errors, 2, mean)
mean.cv.errors
##  [1] 46.06337 44.30723 44.44411 43.94068 44.04942 43.62439 43.29353 43.43413
##  [9] 43.58565 43.64144 43.54290 43.53425
plot(mean.cv.errors, type = "b",
     xlab = "Number of Variables",
     ylab = "CV Error")
which.min(mean.cv.errors) #7
## [1] 7
points(7, mean.cv.errors[7],
       col = "red",
       cex = 2,
       pch = 20)

mean.cv.errors[which.min(mean.cv.errors)]
## [1] 43.29353
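
To see which predictors the selected seven-variable model uses, best subset can be refit on the full data (an addition to the original code):

reg.full = regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(reg.full, id = 7) #coefficients of the 7-variable model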

Lasso

#Reproducibility
set.seed(13)
cv.lasso = cv.glmnet(x,
                     y,
                     type.measure = "mse")
plot(cv.lasso)

coef(cv.lasso)
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                    s1
## (Intercept) 1.7799525
## zn          .        
## indus       .        
## chas        .        
## nox         .        
## rm          .        
## age         .        
## dis         .        
## rad         0.1920089
## tax         .        
## ptratio     .        
## lstat       .        
## medv        .
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se]) #CV RMSE at lambda.1se
## [1] 7.715462
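
Note that coef() on a cv.glmnet object defaults to the conservative lambda.1se, which is why only rad survives above. The less penalized lambda.min keeps more predictors (an optional check, not in the original code):

coef(cv.lasso, s = "lambda.min")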

Ridge Regression

set.seed(13)
cv.ridge = cv.glmnet(x,
                     y,
                     type.measure = "mse",
                     alpha = 0)
plot(cv.ridge)

coef(cv.ridge)
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  0.713143934
## zn          -0.002994073
## indus        0.028593701
## chas        -0.157594395
## nox          1.825119224
## rm          -0.138551061
## age          0.006029211
## dis         -0.091316045
## rad          0.043640145
## tax          0.001994740
## ptratio      0.068418435
## lstat        0.034406354
## medv        -0.022668857
sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se]) #CV RMSE at lambda.1se
## [1] 7.754428

PCR

set.seed(13)
pcr.fit = pcr(crim~.,
              data = Boston,
              scale = T,
              validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 506 12 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.265    7.259    6.864    6.841    6.790    6.775
## adjCV         8.61    7.262    7.256    6.861    6.838    6.786    6.771
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       6.651    6.658    6.648     6.636     6.595     6.525
## adjCV    6.645    6.653    6.642     6.630     6.587     6.517
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       49.93    63.64    72.94    80.21    86.83    90.26    92.79    94.99
## crim    29.39    29.55    37.39    37.85    38.85    39.23    41.73    41.82
##       9 comps  10 comps  11 comps  12 comps
## X       96.78     98.33     99.48    100.00
## crim    42.12     42.43     43.58     44.93
validationplot(pcr.fit, val.type = "MSEP")
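
The CV-minimizing M can also be extracted programmatically (a small addition; the table above shows the minimum at 12 components):

rmsep.cv = RMSEP(pcr.fit)$val["CV", 1, ]
which.min(rmsep.cv) - 1 #subtract 1: the first entry is the intercept-only model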

  (b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

The best subset selection model with 7 predictors performs well, with a cross-validated MSE of about 43.3, the lowest over all subset sizes considered. This is comparable to PCR with 12 components (CV RMSEP 6.525, i.e. MSE ≈ 42.6) and better than the lasso and ridge fits at lambda.1se (CV RMSE ≈ 7.72 and 7.75, i.e. MSE ≈ 59.5 and 60.1), so the best subset model is proposed.

  (c) Does your chosen model involve all of the features in the data set? Why or why not?

No. The chosen best subset model uses only 7 of the 12 predictors. Dropping the remaining variables reduces variance at the cost of only a small increase in bias, which is why the 7-variable model attains the lowest cross-validated MSE; including every feature would add noise variables without improving predictive accuracy.