library(ISLR)
library(glmnet)
library(pls)
library(leaps)
library(MASS)
2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Answer: iii. The lasso shrinks the coefficient estimates (and can set some exactly to zero), so it is less flexible than least squares; the penalty adds bias but reduces variance, and prediction accuracy improves when the increase in bias is smaller than the decrease in variance.
(b) Repeat (a) for ridge regression relative to least squares.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Like the lasso, ridge regression shrinks the coefficient estimates toward zero, trading a small increase in bias for a larger reduction in variance.
(c) Repeat (a) for non-linear methods relative to least squares.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Non-linear methods relax the linearity assumption, lowering bias at the cost of higher variance, so they help when the decrease in bias outweighs the increase in variance.
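All three answers come down to the bias-variance trade-off: shrinkage methods sacrifice a little bias for a larger reduction in variance, while non-linear methods do the opposite. The small simulation below is only an illustrative sketch of that trade-off (the dimensions, correlation structure, coefficient values, noise level, and object names are arbitrary choices, not part of the exercise): with many correlated predictors and a modest training set, the less flexible ridge fit typically attains a lower test MSE than least squares.
library(MASS)     # mvrnorm() for correlated predictors
library(glmnet)
set.seed(1)
p <- 30; n <- 50
Sigma <- 0.8 ^ abs(outer(1:p, 1:p, "-"))    # AR(1)-style correlation among predictors
X <- mvrnorm(2 * n, mu = rep(0, p), Sigma = Sigma)
beta <- rep(0.3, p)                         # many small, equal effects
y <- drop(X %*% beta + rnorm(2 * n, sd = 3))
tr <- 1:n; te <- (n + 1):(2 * n)
ols.fit <- lm(y[tr] ~ X[tr, ])
ols.mse <- mean((cbind(1, X[te, ]) %*% coef(ols.fit) - y[te])^2)
cv.r <- cv.glmnet(X[tr, ], y[tr], alpha = 0)
ridge.mse <- mean((predict(cv.r, s = "lambda.min", newx = X[te, ]) - y[te])^2)
c(ols = ols.mse, ridge = ridge.mse)         # ridge is usually the smaller of the two here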
9. In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
data("College")
attach(College)
set.seed(2)
train <- sample(1:nrow(College), nrow(College)/2)
test <- -train
y.test <- Apps[test]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit <- lm(Apps ~., data=College, subset = train)
lm.pred <- predict(lm.fit, College[test,])
lm.error <- mean((lm.pred - y.test)^2)
lm.error
## [1] 1093608
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
x <- model.matrix(Apps~., data=College)[, -1]
x.train <- x[train,]
y <- College$Apps
y.train <- y[train]
set.seed(7)
ridge.fit <- glmnet(x.train,y.train,alpha=0)
cv.ridge.fit <- cv.glmnet(x.train,y.train,alpha=0)
plot(cv.ridge.fit)
bestlam <- cv.ridge.fit$lambda.min
bestlam
## [1] 424.2704
ridge.pred <- predict(ridge.fit, s=bestlam, newx=x[test,])
ridge.error <- mean((ridge.pred-y.test)^2)
ridge.error
## [1] 1138030
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
set.seed(2)
lasso.fit <- glmnet(x.train,y.train,alpha=1)
cv.lasso.fit<- cv.glmnet(x.train,y.train,alpha=1)
bestlam.lasso <- cv.lasso.fit$lambda.min
bestlam.lasso
## [1] 15.97351
lasso.pred <- predict(lasso.fit, s=bestlam.lasso, newx=x[test,])
lasso.error <- mean((lasso.pred-y.test)^2)
lasso.error
## [1] 1045922
lasso.c <- predict(lasso.fit, type="coefficients", s=bestlam.lasso)[1:18,]
length(lasso.c[lasso.c != 0])   # number of non-zero coefficients (including the intercept)
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
pcr.fit <- pcr(Apps~., data=College, subset=train, scale=TRUE, validation ="CV")
summary(pcr.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4440 4463 2362 2385 2092 1835 1832
## adjCV 4440 4465 2359 2384 2066 1824 1823
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1829 1821 1753 1764 1771 1772 1784
## adjCV 1818 1813 1746 1757 1764 1764 1777
## 14 comps 15 comps 16 comps 17 comps
## CV 1778 1747 1401 1338
## adjCV 1773 1701 1385 1322
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.4816 57.40 64.84 70.54 75.80 80.23 84.04 87.64
## Apps 0.1398 72.45 72.50 80.45 84.74 84.77 85.06 85.16
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.87 93.29 95.33 96.98 98.05 98.75 99.37
## Apps 85.93 86.16 86.21 86.32 86.40 86.61 92.49
## 16 comps 17 comps
## X 99.85 100.00
## Apps 93.65 94.32
validationplot(pcr.fit, val.type = "MSEP")
* The lowest cross-validation error occurs with M = 17 components, i.e. the full model, so PCR performs no dimension reduction here and its test error matches the least-squares fit from (b).
pcr.pred <- predict(pcr.fit, x[test,], ncomp=17)
pcr.error <- mean((pcr.pred-y.test)^2)
pcr.error
## [1] 1093608
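As a quick programmatic check (a sketch using the pls package's RMSEP() accessor; not part of the original output), the CV-selected M can be read straight off the fitted object. Based on the RMSEP table above it should return 17.
cv.rmsep <- RMSEP(pcr.fit, estimate = "CV")$val[1, 1, ]
which.min(cv.rmsep) - 1   # subtract 1: the first entry is the intercept-only model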
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
pls.fit <- plsr(Apps~., data=College, subset=train, scale=TRUE, validation="CV")
summary(pls.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4440 2153 1701 1711 1650 1498 1413
## adjCV 4440 2147 1677 1704 1615 1470 1392
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1378 1358 1356 1342 1338 1337 1337
## adjCV 1361 1342 1339 1326 1322 1321 1321
## 14 comps 15 comps 16 comps 17 comps
## CV 1338 1338 1338 1338
## adjCV 1322 1322 1322 1322
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.76 31.99 61.88 65.00 68.68 73.82 78.33 81.38
## Apps 77.92 87.80 88.38 92.06 93.59 93.92 94.01 94.09
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.24 85.59 87.76 90.49 92.69 94.27 97.03
## Apps 94.20 94.27 94.30 94.31 94.32 94.32 94.32
## 16 comps 17 comps
## X 99.21 100.00
## Apps 94.32 94.32
validationplot(pls.fit, val.type = "MSEP")
* The lowest cross-validation error occurs with M = 12 components; the CV curve is essentially flat beyond that point, so little is gained by adding more.
pls.pred <- predict(pls.fit,x[test,],ncomp=12)
pls.error <- mean((pls.pred-y.test)^2)
pls.error
## [1] 1085346
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
errors <- c(lm.error, ridge.error, lasso.error, pcr.error, pls.error)
names(errors) <- c("lm", "ridge", "lasso", "pcr", "pls")
barplot(sort(errors))
print(sort(errors))
## lasso pls lm pcr ridge
## 1045922 1085346 1093608 1093608 1138030
The lasso gives the lowest test MSE, followed by PLS; least squares and PCR are identical (PCR uses all M = 17 components, so it reduces to the least-squares fit), and ridge regression is slightly worse. The differences among the five approaches are small, and all of them predict the number of applications with comparable accuracy.
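To put these MSEs on a more interpretable scale, one rough gauge (a small sketch, not part of the original write-up) is the test R-squared for each method: one minus the test MSE divided by the variance of Apps in the test set.
test.var <- mean((y.test - mean(y.test))^2)   # variance of Apps in the test set
round(1 - sort(errors) / test.var, 3)         # test R^2 for each method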
11. We will now try to predict per capita crime rate in the Boston data set.
data(Boston)
attach(Boston)
# Split the data into train and test sets
set.seed(4)
train <- sample(1:nrow(Boston), nrow(Boston)*0.70)
test <- -train
y.test <- crim[test]
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
Least squares
lm.fit <- lm(crim~., data=Boston, subset=train)
summary(lm.fit)
##
## Call:
## lm(formula = crim ~ ., data = Boston, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.262 -2.330 -0.452 1.065 73.765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.835e+01 8.809e+00 2.083 0.038043 *
## zn 5.297e-02 2.463e-02 2.150 0.032242 *
## indus -4.926e-02 1.045e-01 -0.471 0.637746
## chas -5.041e-01 1.526e+00 -0.330 0.741398
## nox -1.106e+01 6.734e+00 -1.643 0.101354
## rm 3.983e-01 7.073e-01 0.563 0.573711
## age 4.431e-03 2.129e-02 0.208 0.835250
## dis -1.139e+00 3.429e-01 -3.322 0.000991 ***
## rad 6.346e-01 1.112e-01 5.707 2.51e-08 ***
## tax -4.745e-03 6.555e-03 -0.724 0.469591
## ptratio -3.449e-01 2.375e-01 -1.453 0.147257
## black 2.775e-04 4.356e-03 0.064 0.949251
## lstat 6.032e-02 9.378e-02 0.643 0.520535
## medv -2.537e-01 7.344e-02 -3.455 0.000620 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.66 on 340 degrees of freedom
## Multiple R-squared: 0.451, Adjusted R-squared: 0.43
## F-statistic: 21.48 on 13 and 340 DF, p-value: < 2.2e-16
lm.pred <- predict(lm.fit, Boston[test,])
lm.error <- mean((lm.pred - y.test)^2)
lm.error
## [1] 36.58822
Best subset selection.
# Choose the model using cross-validation and the best subset method
set.seed(1)
# regsubsets() has no predict() method, so define one: rebuild the model matrix
# for the new data and multiply by the coefficients of the id-th best model
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
k = 10
folds <- sample(1:k, nrow(Boston), replace = TRUE)
cv.errors <- matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))
for (j in 1:k) {
best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
for (i in 1:13) {
pred <- predict(best.fit, Boston[folds == j, ], id = i)
cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
}
}
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948
## 9 10 11 12 13
## 42.53822 42.73416 42.52367 42.46014 42.50125
plot(mean.cv.errors, type="b")
points(which.min(mean.cv.errors), min(mean.cv.errors), pch=20, col='red', cex=2)
min(mean.cv.errors)
## [1] 42.46014
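Following the usual best-subset workflow, one can refit on the full Boston data and inspect the coefficients of the CV-chosen model size (12 variables here); this is a quick follow-up sketch rather than part of the original output.
reg.best <- regsubsets(crim ~ ., data = Boston, nvmax = 13)
coef(reg.best, which.min(mean.cv.errors))   # predictors retained by the 12-variable model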
Lasso.
x <- model.matrix(crim~., Boston)[, -1]
y <- Boston$crim
lasso.fit <- glmnet(x[train, ], y[train], alpha=1)
cv.lasso.fit <- cv.glmnet(x[train, ], y[train], alpha=1)
plot(cv.lasso.fit)
bestlam.lasso <- cv.lasso.fit$lambda.min
bestlam.lasso
## [1] 0.1013094
lasso.pred <- predict(lasso.fit, s=bestlam.lasso, newx=x[test,])
lasso.error <- mean((lasso.pred-y[test])^2)
lasso.error
## [1] 36.80542
Ridge regression
ridge.fit <- glmnet(x[train, ], y[train], alpha=0)
cv.ridge.fit <- cv.glmnet(x[train, ], y[train], alpha=0)
plot(cv.ridge.fit)
bestlam.ridge <- cv.ridge.fit$lambda.min
bestlam.ridge
## [1] 0.5533799
ridge.pred <- predict(ridge.fit, s=bestlam.ridge, newx=x[test,])
ridge.error <- mean((ridge.pred-y[test])^2)
ridge.error
## [1] 36.17571
PCR (Principal Component Regression)
set.seed(4)
pcr.fit <- pcr(crim~., data=Boston, subset=train, scale=TRUE, validation ="CV")
summary(pcr.fit)
## Data: X dimension: 354 13
## Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.834 7.362 7.365 7.027 7.028 7.042 7.020
## adjCV 8.834 7.358 7.361 7.022 7.022 7.037 7.013
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 7.033 7.007 6.975 6.979 6.984 6.934 6.842
## adjCV 7.024 6.980 6.962 6.973 6.970 6.917 6.826
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.99 60.42 69.63 76.41 82.83 87.86 91.23 93.43
## crim 31.27 31.40 37.67 37.79 37.79 38.97 39.32 40.69
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.53 97.05 98.52 99.57 100.0
## crim 41.47 41.50 41.92 43.60 45.1
validationplot(pcr.fit, val.type = "MSEP")
* The lowest cross-validation error again occurs with all M = 13 components, so this PCR fit coincides with the least-squares model.
pcr.pred <- predict(pcr.fit, x[test,], ncomp=13)
pcr.error <- mean((pcr.pred-y.test)^2)
pcr.error
## [1] 36.58822
# Note: the last entry is a 10-fold CV estimate on the full data set, not a test-set MSE
errors <- c(lm.error, ridge.error, lasso.error, pcr.error, min(mean.cv.errors))
names(errors) <- c("lm", "ridge", "lasso", "pcr", "cross+best")
print(sort(errors))
## ridge pcr lm lasso cross+best
## 36.17571 36.58822 36.58822 36.80542 42.46014
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
As computed above, ridge regression gives the lowest held-out test error (MSE ≈ 36.18), with PCR, least squares, and the lasso close behind; note that the PCR fit with all 13 components is identical to least squares, and that the best-subset figure is a 10-fold cross-validation estimate on the full data set rather than a test-set error, so it is not directly comparable. I would therefore propose the ridge regression model.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
Yes. Ridge regression does not perform variable selection: the penalty shrinks the coefficients toward zero but never sets them exactly to zero, so the chosen model retains all 13 predictors.
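As a quick confirmation (a sketch, not part of the original answer), printing the ridge coefficients at the selected lambda shows that every one of the 13 predictors keeps a non-zero, shrunken coefficient.
predict(ridge.fit, type = "coefficients", s = bestlam.ridge)   # all 13 predictors retained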