HW 5

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.0.3

library(glmnet)

## Warning: package 'glmnet' was built under R version 4.0.4

## Loading required package: Matrix

## Loaded glmnet 4.1-1

library(pls)

## Warning: package 'pls' was built under R version 4.0.4

## 
## Attaching package: 'pls'

## The following object is masked from 'package:stats':
## 
##     loadings

library(miscTools)

## Warning: package 'miscTools' was built under R version 4.0.4

library(MASS)
library(leaps)

## Warning: package 'leaps' was built under R version 4.0.4

QUESTION 2: For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer. (a) The lasso, relative to least squares, is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Reason: Lasso will limit predictors’ coefficients to 0 which reduces variance, and a small increase in bias. We get a better error rate when increase in bias is less relative to the decrease in variance. (b) Repeat (a) for ridge regression relative to least squares. iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Reason: Similar to (a), Ridge regression will limit the correfficients by shrinking them (as opposed to 0 in Lasso). This will reduce variance but have a high bias. It is less flexible and best in situations where OLS estimates have a high variance (c) Repeat (a) for non-linear methods relative to least squares. ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Reason: This is non linear and hence, more flexible. We will have increased accuracy when increase in variance is less relative to decrease in bias.

QUESTION 9: In this exercise, we will predict the number of applications received using the other variables in the College data set.

data("College")

Split the data set into a training set and a test set.

set.seed(1)
train <- sample(1:nrow(College), nrow(College)/2)
test <- (-train)

coll.train <- College[train,]
coll.test <- College[test,]

Fit a linear model using least squares on the training set, and report the test error obtained.

coll.lm <- lm(Apps ~ ., coll.train)

lm.pred <- predict(coll.lm, coll.test)
mean((coll.test$Apps - lm.pred)^2)

## [1] 1135758

Test error is 1135758

Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

set.seed(1)

mat.train <- model.matrix(Apps ~ ., coll.train)
mat.test <- model.matrix(Apps ~ ., coll.test)

grid <- 10^seq(10, -2, length = 100)

coll.cvridge <- cv.glmnet(mat.train, coll.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)

coll.bestlamridge <- coll.cvridge$lambda.min
coll.bestlamridge

## [1] 0.01

ridge.pred = predict(coll.cvridge, mat.test, s = coll.bestlamridge)
mean((ridge.pred - coll.test$Apps)^2)

## [1] 1135714

Test error is 1135714 which is slightly better than LM test error of 1135758

Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.

set.seed(1)

mat.train <- model.matrix(Apps ~ ., coll.train)
mat.test <- model.matrix(Apps ~ ., coll.test)

grid <- 10^seq(10, -2, length = 100)

coll.lasso = cv.glmnet(mat.train, coll.train$Apps, alpha=1, lambda = grid, thresh=1e-12)

coll.lassobestlam = coll.lasso$lambda.min
coll.lassobestlam

## [1] 0.01

lasso.pred <- predict(coll.lasso, mat.test, s = coll.lassobestlam)
mean((lasso.pred - coll.test$Apps)^2)

## [1] 1135659

The test error is 1135659 which is lower and better compared to both ridge and LM models

predict(coll.lasso, s = coll.lassobestlam, type = "coefficients")

## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -7.900392e+02
## (Intercept)  .           
## PrivateYes  -3.070090e+02
## Accept       1.779328e+00
## Enroll      -1.469512e+00
## Top10perc    6.672205e+01
## Top25perc   -2.230441e+01
## F.Undergrad  9.259049e-02
## P.Undergrad  9.408441e-03
## Outstate    -1.083495e-01
## Room.Board   2.115144e-01
## Books        2.912138e-01
## Personal     6.120206e-03
## PhD         -1.547169e+01
## Terminal     6.409278e+00
## S.F.Ratio    2.282635e+01
## perc.alumni  1.130513e+00
## Expend       4.856698e-02
## Grad.Rate    7.488072e+00

Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(1)

coll.pcr = pcr(Apps ~ ., data = coll.train, scale = TRUE, validation = 'CV')

validationplot(coll.pcr, val.type = 'MSEP')

summary(coll.pcr)

## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4288     4006     2373     2372     2069     1961     1919
## adjCV         4288     4007     2368     2369     1999     1948     1911
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1919     1921     1876      1832      1832      1836      1837
## adjCV     1912     1915     1868      1821      1823      1827      1827
##        14 comps  15 comps  16 comps  17 comps
## CV         1853      1759      1341      1270
## adjCV      1850      1733      1326      1257
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       32.20    57.78    65.31    70.99    76.37    81.27     84.8    87.85
## Apps    13.44    70.93    71.07    79.87    81.15    82.25     82.3    82.33
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.62     92.91     94.98     96.74     97.79     98.72     99.42
## Apps    83.38     84.76     84.80     84.84     85.11     85.14     90.55
##       16 comps  17 comps
## X        99.88    100.00
## Apps     93.42     93.89

Lowest CV error at M = 10 hence use ncomp = 10

pcr.pred <- predict(coll.pcr, coll.test, ncomp = 10)
mean((pcr.pred - coll.test$Apps)^2)

## [1] 1723100

The test error rate 1723100 is much higher compared to all the models above

Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(1)

coll.pls <- plsr(Apps ~ ., data = coll.train, scale=TRUE, validation='CV')

validationplot(coll.pls)

pls.pred <- predict(coll.pls, coll.test, ncomp = 10)
mean((pls.pred - coll.test$Apps)^2)

## [1] 1131661

The test error is 1131661 which is better than 1723100 from pcr above

Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

LM 1135758 Ridge 1135714 Lasso 1135659 PCR 1723100 PLS 1131661

PCR has the worst MSE PLS has the best MSE in comparison, the MSEs of PLS, Lasso, Ridge, and LM are similar and not much different, however, PLS has the best MSE of all followed by Lasso

rSquared(coll.test$Apps, resid = as.matrix(coll.test$Apps - lm.pred))

##           [,1]
## [1,] 0.9015413

rSquared(coll.test$Apps, resid = as.matrix(coll.test$Apps - ridge.pred))

##           1
## 1 0.9015452

rSquared(coll.test$Apps, resid = as.matrix(coll.test$Apps - lasso.pred))

##           1
## 1 0.9015499

rSquared(coll.test$Apps, resid = as.matrix(coll.test$Apps - pcr.pred))

##           [,1]
## [1,] 0.8506248

rSquared(coll.test$Apps, resid = as.matrix(coll.test$Apps - pls.pred))

##           [,1]
## [1,] 0.9018965

LM 0.9015413 Ridge 0.9015452 Lasso 0.9015499 PCR 0.8506248 PLS 0.9018965

PCR has the worst R squared and also the highest MSE making PCR PCR has an R squared of 85.06 which means it only explains 85.06% of variance in apps

While the R-squared for the rest of the models is similar, PLS has the best R-squared nd it also has the lowest MSE.

QUESTION 11: We will now try to predict per capita crime rate in the Boston data set. (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

data("Boston")

set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
test <- (-train)

bos.train <- Boston[train,]
bos.test <- Boston[test,]

Best Subset Selection

set.seed(1)

predict.regsubsets <- function(object, newdata, id, ...) 
{
    form <- as.formula(object$call[[2]])
    mat <- model.matrix(form, newdata)
    coefi <- coef(object, id = id)
    xvars <- names(coefi)
    mat[, xvars] %*% coefi
}

k = 10
p = ncol(Boston) - 1

folds <- sample(1:k, nrow(Boston), replace = TRUE)

cv.errors <- matrix(NA, k, p, dimnames = list(NULL, paste(1:p)))


for (j in 1:k) 
{
    best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = p)
    for (i in 1:p) 
    {
        pred <- predict(best.fit, Boston[folds == j, ], id = i)
        cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
    }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors)

mean.cv.errors

##        1        2        3        4        5        6        7        8 
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948 
##        9       10       11       12       13 
## 42.53822 42.73416 42.52367 42.46014 42.50125

The lowest CV error is 12, and the MSE at 12 is 42.46014

bos.lm <- lm(crim ~ ., bos.train)

boslm.pred <- predict(bos.lm, bos.test)
mean((bos.test$crim - boslm.pred)^2)

## [1] 41.54639

MSE for LM is 41.54639

Lasso

set.seed(1)

bosmat.train <- model.matrix(crim ~ ., bos.train)
bosmat.test <- model.matrix(crim ~ ., bos.test)

grid <- 10^seq(10, -2, length = 100)

bos.lasso = cv.glmnet(bosmat.train, bos.train$crim, alpha=1, lambda = grid, thresh=1e-12)

bos.lassobestlam = bos.lasso$lambda.min
bos.lassobestlam

## [1] 0.05336699

boslasso.pred <- predict(bos.lasso, bosmat.test, s = bos.lassobestlam)
mean((boslasso.pred - bos.test$crim)^2)

## [1] 40.97987

MSE for Lasso is 40.97987

Ridge

set.seed(1)

bosmat.train <- model.matrix(crim ~ ., bos.train)
bosmat.test <- model.matrix(crim ~ ., bos.test)

grid <- 10^seq(10, -2, length = 100)

bos.cvridge <- cv.glmnet(bosmat.train, bos.train$crim, alpha = 0, lambda = grid, thresh = 1e-12)

bos.bestlamridge <- bos.cvridge$lambda.min
bos.bestlamridge

## [1] 0.2848036

bosridge.pred = predict(bos.cvridge, bosmat.test, s = bos.bestlamridge)
mean((bosridge.pred - bos.test$crim)^2)

## [1] 41.10444

MSE for Ridge is 41.10444

PCR

set.seed(1)

bos.pcr = pcr(crim ~ ., data = bos.train, scale = TRUE, validation = 'CV')

validationplot(bos.pcr, val.type = 'MSEP')

summary(bos.pcr)

## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.275    7.555    7.549    7.093    6.926    6.932    6.977
## adjCV        9.275    7.550    7.544    7.088    6.920    6.926    6.968
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.976    6.871    6.845     6.848     6.797     6.764     6.700
## adjCV    6.967    6.859    6.843     6.842     6.786     6.751     6.686
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.51     60.4    69.86    77.08    82.80    87.68    91.24    93.56
## crim    34.94     35.2    42.83    45.47    45.57    45.58    45.75    47.59
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.47     97.08     98.48     99.54    100.00
## crim    47.68     48.75     49.31     50.14     51.37

bospcr.pred <- predict(bos.pcr, bos.test, ncomp = 13)
mean((bospcr.pred - bos.test$crim)^2)

## [1] 41.54639

The MSE for PCR with M = 13 is 41.54639

bospcr.pred <- predict(bos.pcr, bos.test, ncomp = 4)
mean((bospcr.pred - bos.test$crim)^2)

## [1] 43.91107

The MSE for PCR with M = 4 is 43.91107

MSE summary Best Subset 42.46014 LM 41.54639 Lasso 40.97987 Ridge 41.10444 PCR 41.54639

Lasso has the lowest MSE Best Subset model has the highest MSE

Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

rSquared(bos.test$crim, resid = as.matrix(bos.test$crim - boslm.pred))

##           [,1]
## [1,] 0.3297973

rSquared(bos.test$crim, resid = as.matrix(bos.test$crim - bosridge.pred))

##           1
## 1 0.3369265

rSquared(bos.test$crim, resid = as.matrix(bos.test$crim - boslasso.pred))

##           1
## 1 0.3389361

rSquared(bos.test$crim, resid = as.matrix(bos.test$crim - bospcr.pred))

##           [,1]
## [1,] 0.2916516

The best model out of all is lasso in terms of MSE since it has the lowest MSE The best model in terms of R-squared is also Lasso with the highest R-squared of 33.89361

Does your chosen model involve all of the features in the data set? Why or why not?

as.matrix(coef(bos.lasso, bos.lasso$lambda.min))

##                         1
## (Intercept) 18.7499830804
## (Intercept)  0.0000000000
## zn           0.0364732124
## indus       -0.1248293847
## chas        -0.4697623428
## nox         -8.1724091087
## rm           0.1123257531
## age         -0.0005500647
## dis         -0.8323606739
## rad          0.5306677665
## tax          0.0000000000
## ptratio     -0.3791108368
## black       -0.0129846301
## lstat        0.2578373987
## medv        -0.1591511818

Lasso does not use all features. Lasso specializes in feature reduction. Tax has a coefficient of 0 which means 12 out of 13 features are used and tax is dropped out of the model

HW 5

Farheen Ali

4/1/2021