
Let’s examine linear model fitting procedures other than least squares. Why? Alternative fitting procedures can yield better prediction accuracy (especially when \(p\) is large relative to \(n\)) and more interpretable models.

Subset Selection

To perform best subset selection, we fit a separate least squares regression for each possible combination of the p predictors.

  1. Let \(\mathcal{M}_0\) denote the null model containing no predictors. This model simply predicts the sample mean for each observation.
  2. For \(k = 1, 2, \ldots p\):
    1. Fit all \(\binom{p}{k}\) models that contain exactly k predictors
    2. Pick the best among these \(\binom{p}{k}\) models, and call it \(\mathcal{M}_k\). Here best is defined as having the smallest RSS, or largest \(R^2\).
  3. Select a single best model from among \(\mathcal{M}_0, \ldots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\).

Since best subset selection scales as \(2^p\), it becomes computationally infeasible for \(p > 20\) or so. A brute-force sketch of the procedure is below.
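
A minimal brute-force sketch of this procedure, assuming a data frame dat whose columns are the response and the predictors (best_subset, dat, and response are illustrative names; leaps::regsubsets(), used in Lab 1 below, is the practical tool):

# Brute-force best subset selection: returns M_0, ..., M_p as lm fits
best_subset <- function(dat, response = "y") {
  preds <- setdiff(names(dat), response)
  p <- length(preds)
  models <- vector("list", p + 1)
  models[[1]] <- lm(reformulate("1", response), data = dat)  # M_0: null model
  for (k in 1:p) {
    combos <- combn(preds, k, simplify = FALSE)              # all (p choose k) subsets
    fits <- lapply(combos, function(v) lm(reformulate(v, response), data = dat))
    rss <- vapply(fits, deviance, numeric(1))                # deviance() = RSS for lm
    models[[k + 1]] <- fits[[which.min(rss)]]                # M_k: smallest RSS
  }
  models  # then compare M_0, ..., M_p by CV error, Cp/AIC, BIC, or adjusted R^2
}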

Stepwise Selection

Forward Stepwise Selection

While best subset selection considers all \(2^p\) possible models, forward stepwise selection considers a much smaller set of models. It begins with the null model, then adds predictors until there are no more to add. At each step, the variable providing the greatest additional improvement to the fit is added to the model.

  • Algorithm for Forward stepwise selection
  1. Let \(\mathcal{M}_0\) denote the null model with no predictors.
  2. For \(k = 0, \ldots, p-1\):
    1. Consider all \(p-k\) models that augment the predictors in \(\mathcal{M}_k\) with one additional predictor.
    2. Choose the best among these \(p-k\) models, and call it \(\mathcal{M}_{k+1}\). Here best is defined as having the smallest RSS or highest \(R^2\).
  3. Select a single best model from among \(\mathcal{M}_0, \ldots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\).

Whereas best subset selection fits \(2^p\) models, forward stepwise selection fits the null model plus \(p-k\) models in the \(k\)th iteration, for \(k = 0, \ldots, p-1\). This amounts to \(1 + \sum_{k = 0}^{p - 1} (p - k) = 1 + p(p + 1) / 2\) models; the two counts are compared numerically below.
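
For concreteness, the two model counts at \(p = 20\), computed from the formulas above:

# Number of models fit when p = 20
p <- 20
2^p                  # best subset:      1,048,576 models
1 + p * (p + 1) / 2  # forward stepwise: 211 models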

Backward Stepwise Selection

Like forward stepwise selection, but it begins with the full model containing all \(p\) predictors and iteratively removes the least useful predictor, one at a time.

Choosing the Optimal Model

The model containing all the predictors will always have the lowest training RSS and the highest \(R^2\), since these quantities only measure training error. Training RSS and \(R^2\) are therefore poor guides to the test error, and we need another way to pick the best model. To estimate the test error, we can either estimate it indirectly by adjusting the training error for the bias due to overfitting, or estimate it directly using a validation set or cross-validation.

\(C_p\), AIC, BIC, and Adjusted \(R^2\)

Least squares fits by minimizing the training RSS, and in general more predictors decrease the training RSS. Let’s discuss methods for adjusting the training error.

\(C_p\)

\[ C_p = \frac{1}{n} (\text{RSS} + 2 d \hat\sigma^2) \] Here \(d\) is the number of predictors and \(\hat\sigma^2\) is an estimate of the variance of the error \(\epsilon\) associated with each response measurement, typically obtained from the full model containing all predictors. The \(2 d \hat\sigma^2\) term adds a penalty that grows with the number of predictors, offsetting the fact that the training RSS always decreases as predictors are added.
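
A small helper sketching the calculation, assuming fit is an lm object and sigma2_hat is the error variance estimate from the full model (cp_stat is an illustrative name; summary(regsubsets(...))$cp in Lab 1 reports the equivalent statistic):

# Cp = (RSS + 2 * d * sigma2_hat) / n for a fitted lm model
cp_stat <- function(fit, sigma2_hat) {
  rss <- sum(residuals(fit)^2)
  d <- length(coef(fit)) - 1   # number of predictors, excluding the intercept
  n <- length(residuals(fit))
  (rss + 2 * d * sigma2_hat) / n
}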

Akaike information criterion (AIC)

The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of least squares, it’s the same form as \(C_p\).

Bayesian information criterion

Similar in form to \(C_p\), but replaces the \(2d\hat\sigma^2\) term with \(\log(n)\, d \hat\sigma^2\). Since \(\log n > 2\) for any \(n > 7\), the BIC statistic places a heavier penalty on models with many variables.
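
Checking the crossover point mentioned above:

log(7)  # 1.945910 -- penalty still lighter than Cp/AIC's factor of 2
log(8)  # 2.079442 -- heavier from n = 8 onwards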

Adjusted \(R^2\)

Recall \(R^2 = 1 - \text{RSS} / \text{TSS}\), where \(\text{TSS} = \sum_i (y_i - \bar y)^2\) is the total sum of squares (how much of the variance is accounted for by the model?).

\[ \text{Adjusted } R^2 = 1 - \frac{\text{RSS} / (n - d - 1)} {\text{TSS} / (n - 1)} \] We want the model with the largest adjusted \(R^2\); unlike \(R^2\), this statistic penalizes the number of predictors through the \(n - d - 1\) term. The intuition is that once all the correct predictors have been added to the model, any further variables just add noise: they decrease RSS only slightly while \(d\) grows, so \(\text{RSS}/(n - d - 1)\) increases and the adjusted \(R^2\) falls.
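
A one-line sketch of the formula (adj_r2 is an illustrative name; in Lab 1 below, summary(regsubsets(...))$adjr2 reports this for each model size):

# Adjusted R^2 from RSS, TSS, the number of observations n, and d predictors
adj_r2 <- function(rss, tss, n, d) {
  1 - (rss / (n - d - 1)) / (tss / (n - 1))
}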

Validation and Cross-Validation

Directly estimate the test error with held-out data. This approach can be used with many types of models and does not require assumptions about the underlying data or an estimate of the error variance. When in doubt, be parsimonious (choose the simplest model).

6.2 Shrinkage Methods

Shrinkage methods constrain, or regularize, the coefficient estimates towards zero. It will be shown that this can significantly reduce their variance.

Ridge Regression

Recall that ordinary least squares minimizes \(\text{RSS} = \sum_{i=1}^n \big(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\big)^2\). Ridge regression is very similar, but minimizes \(\text{RSS} + \lambda \sum_{j = 1}^p \beta_j^2\), where \(\lambda \geq 0\) is a tuning parameter to be determined separately. Ideally, only the important coefficients retain sizeable values, while the unimportant ones are shrunk towards zero as \(\lambda\) grows. Because the regularization term penalizes the size of the coefficients, it is best to standardize the predictors so they are all on the same scale. Ridge regression’s advantage over ordinary least squares is rooted in the bias-variance trade-off: as \(\lambda\) increases, model flexibility decreases, increasing the bias but decreasing the variance. Since the test MSE is the sum of the variance, the squared bias, and the irreducible error, the goal is to find the \(\lambda\) that minimizes that sum. Ridge regression performs best in situations where the least squares estimates have high variance.

The Lasso

Ridge regression will never remove a predictor entirely (unless \(\lambda = \infty\)). The lasso has a very similar form, minimizing \(\text{RSS} + \lambda \sum_{j = 1}^p |\beta_j|\): the \(\ell_2\) penalty has been replaced by an \(\ell_1\) penalty. The lasso can force some coefficients exactly to zero, yielding sparser and more interpretable models. The lasso can remove variables because of the geometry of the \(\ell_1\) norm: its constraint region has corners on the axes (where at least one coefficient is zero), while the \(\ell_2\) constraint region is a sphere, which the least squares contours practically never touch at a point where a coefficient is exactly zero. In the special case where \(\mathbf{X}\) is the identity matrix, the closed-form solutions show that ridge regression shrinks every coefficient by the same proportion, while the lasso shrinks each coefficient towards zero by a constant amount (soft-thresholding), setting coefficients smaller than that amount exactly to zero. A small sketch of this special case follows.
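
A minimal sketch of that identity-design special case, where the least squares estimates equal the observed \(y_j\) (ridge_shrink and lasso_shrink are illustrative names):

# Identity-design special case: ridge rescales, the lasso soft-thresholds
ridge_shrink <- function(b, lambda) b / (1 + lambda)
lasso_shrink <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)

b <- c(-3, -0.4, 0.1, 2)     # least squares estimates in this special case
ridge_shrink(b, lambda = 1)  # every coefficient shrunk by the same proportion
lasso_shrink(b, lambda = 1)  # -2.5, 0.0, 0.0, 1.5 -- small coefficients set to zero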

Selecting the Tuning Parameter

What is the best \(\lambda\)? Choose a grid of \(\lambda\) values, compute the cross-validation error for each value of \(\lambda\), select the \(\lambda\) with the smallest error, and refit the model on all the observations using that \(\lambda\). A sketch is below.
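
A sketch using glmnet’s built-in cross-validation, assuming the predictor matrix x and response y constructed at the start of Lab 2 below (alpha = 0 gives ridge regression, alpha = 1 the lasso):

library(glmnet)
grid <- 10^seq(4, -2, length.out = 100)              # candidate lambda values
cv.fit <- cv.glmnet(x, y, alpha = 0, lambda = grid)  # 10-fold CV by default
cv.fit$lambda.min                                    # lambda with the smallest CV error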

6.3 Dimension Reduction Methods

The variance-controlling methods discussed so far use all of the original predictors. Now we explore approaches that transform the predictors and then fit a least squares model on the transformed variables. Let \(Z_1, \ldots, Z_M\) represent \(M < p\) linear combinations of the original \(p\) predictors; that is, \(Z_m = \sum_{j = 1}^p \phi_{jm} X_j\) for some constants \(\phi_{1m}, \phi_{2m}, \ldots, \phi_{pm}\), \(m = 1, \ldots, M\). We can then fit the linear regression model \[ y_i = \theta_0 + \sum_{m = 1}^M \theta_m z_{im} + \epsilon_i, \quad i = 1, \ldots, n\] Note the regression coefficients are now \(\theta_0, \theta_1, \ldots, \theta_M\). If the constants \(\phi_{1m}, \phi_{2m}, \ldots, \phi_{pm}\) are chosen wisely, such dimension reduction can outperform least squares regression on the original predictors. We have reduced the problem from estimating \(p + 1\) coefficients to estimating \(M + 1\), with the implied original coefficients \(\beta_j = \sum_{m = 1}^M \theta_m \phi_{jm}\). All dimension reduction methods work in two steps: first, the transformed predictors \(Z_1, \ldots, Z_M\) are obtained; second, the model is fit using these \(M\) predictors. The methods differ in how \(Z_1, \ldots, Z_M\) (or equivalently the \(\phi_{jm}\)) are chosen.
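
A generic sketch of this two-step recipe, assuming an \(n \times p\) predictor matrix X, a response y, and a \(p \times M\) loading matrix phi chosen by some method such as PCA or PLS (dim_reduce_fit is an illustrative name):

# Step 1: form Z = X %*% phi. Step 2: least squares of y on Z_1, ..., Z_M.
# The implied coefficients on the original scale are beta_j = sum_m theta_m * phi_jm.
dim_reduce_fit <- function(X, y, phi) {
  Z <- X %*% phi
  fit <- lm(y ~ Z)
  theta <- coef(fit)[-1]                       # theta_1, ..., theta_M
  list(fit = fit, beta = drop(phi %*% theta))
}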

Principal Components Analysis and Regression

PCA is a technique for reducing the dimension of \(\mathbf X_{n \times p}\). The first principal component direction of the data is the direction along which the observations vary the most. The book shows a line with the observations projected onto it: the first principal component is the direction for which the projected observations have the highest possible variance. With two predictors this can be summarized mathematically by \[ Z_1 = \phi_{11} \times (X_1 - \bar X_1) + \phi_{21} \times (X_2 - \bar X_2)\] The first principal component is the linear combination of \(X_1\) and \(X_2\), subject to \(\phi_{11}^2 + \phi_{21}^2 = 1\), with the highest variance. The second principal component \(Z_2\) is the linear combination of the variables that is uncorrelated with \(Z_1\) and has the highest variance subject to this constraint. The zero-correlation condition corresponds to an orthogonality condition between the principal components, \(Z_2 \perp Z_1\) (the dot product \(Z_1 \cdot Z_2 = 0\)).
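
A quick way to inspect these quantities is base R’s prcomp(), assuming the predictor matrix x built at the start of Lab 2 (the pcr() function from the pls package, used in Lab 3, adds the regression step):

pc <- prcomp(x, center = TRUE, scale. = TRUE)
pc$rotation[, 1]           # loadings phi_{j1} of the first principal component
sum(pc$rotation[, 1]^2)    # normalization constraint: equals 1
head(pc$x[, 1])            # scores z_{i1}: projections with maximal variance
cor(pc$x[, 1], pc$x[, 2])  # successive components are uncorrelated (~0)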

Partial Least Squares

PLS works like PCR, but identifies the new features \(Z_1, \ldots, Z_M\) in a supervised way, using the response \(Y\) to find directions that both approximate the original features and are related to the response. In other words, PLS attempts to find directions that help explain both the response and the predictors. After standardizing the predictors, PLS computes the first direction \(Z_1\) by setting each \(\phi_{j1}\) equal to the coefficient from the simple linear regression of \(Y\) onto \(X_j\). One can show this coefficient is proportional to the correlation between \(Y\) and \(X_j\); hence, in computing \(Z_1 = \sum_{j=1}^p \phi_{j1} X_j\), PLS places the highest weight on the variables that are most strongly related to the response. To identify the second PLS direction, we first adjust each of the variables for \(Z_1\) by regressing each variable on \(Z_1\) and taking the residuals, which can be interpreted as the remaining information not explained by \(Z_1\); the process then repeats on these residuals.
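
A sketch of the first PLS direction following that recipe, assuming x and y from Lab 2 (plsr() from the pls package, used in Lab 3, is the real implementation):

xs <- scale(x)                                          # standardize the predictors
phi1 <- apply(xs, 2, function(xj) coef(lm(y ~ xj))[2])  # slope of y regressed on each X_j
z1 <- xs %*% phi1                                       # first PLS direction Z_1
head(z1)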

Lab 1: Subset Selection Methods

library(ISLR); library(leaps)
Hitters <- na.omit(Hitters)

# perform best subset selection
regfit.full <- regsubsets(Salary ~ ., Hitters)
summary(regfit.full) # asterisk means variable included in this model
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "  
## 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"  
##          CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 ) "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 ) " "  " "    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 ) " "  "*"    " "     "*"       "*"     " "     " "    " "
reg.summary <- summary(regfit.full)
names(reg.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
df <- tibble(
  rsq = reg.summary$rsq,
  adjr2 = reg.summary$adjr2,
  cp = reg.summary$cp,
  bic = reg.summary$bic,
  x = seq_along(rsq)
)
df %>%
  gather(stat, score, -x) %>%
  ggplot(aes(x, score, color = stat)) +
  geom_line() 

plot(regfit.full, scale = "r2")

plot(regfit.full, scale = "adjr2")

plot(regfit.full, scale = "Cp")

plot(regfit.full, scale = "bic")

regfit.fwd <- regsubsets(Salary ~ ., Hitters, method = "forward")
regfit.bwd <- regsubsets(Salary ~ ., Hitters, method = "backward")
plot(regfit.bwd, scale = "r2")

plot(regfit.bwd, scale = "adjr2")

plot(regfit.bwd, scale = "Cp")

plot(regfit.bwd, scale = "bic")

set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Hitters), rep = T)
test <- !train

predict.regsubsets <- function(object, newdata, id, ...) {
  # regsubsets objects have no predict() method, so define one:
  form <- formula(object$call[[2]])   # recover the formula used in the fit
  mat <- model.matrix(form, newdata)  # design matrix for the new data
  coefi <- coef(object, id = id)      # coefficients of the model with `id` variables
  xvars <- names(coefi)
  mat[, xvars] %*% coefi              # predictions
}

regfit.best <- regsubsets(Salary ~ ., Hitters[train, ], nvmax = 19)
test.mat <- model.matrix(Salary ~., Hitters[test, ])
val.errors <- rep(NA, 19)
for(ii in 1:19) {
  coefii <- coef(regfit.best, id = ii)
  pred <- test.mat[,names(coefii)] %*% coefii
  val.errors[ii] <- mean((Hitters$Salary[test] - pred)^2)
}
which.min(val.errors)
## [1] 10
# best model has 10 variables
coef(regfit.best, 10)
## (Intercept)       AtBat        Hits       Walks      CAtBat       CHits 
## -80.2751499  -1.4683816   7.1625314   3.6430345  -0.1855698   1.1053238 
##      CHmRun      CWalks     LeagueN   DivisionW     PutOuts 
##   1.3844863  -0.7483170  84.5576103 -53.0289658   0.2381662
# use cross-validation with 10 folds 
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(Hitters), replace = T)
cv.error <- matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))

for(jj in 1:k) {
  best.fit <- regsubsets(Salary ~., Hitters[folds != jj, ], nvmax = 19)
  for(ii in 1:19) {
    pred <- predict(best.fit, Hitters[folds == jj,], id = ii)
    cv.error[jj, ii] <- mean((Hitters$Salary[folds == jj] - pred)^2)
  }
}
which.min(apply(cv.error, 2, mean))
## 11 
## 11
reg.best <- regsubsets(Salary ~., Hitters, nvmax = 19)
coef(reg.best, 11)
##  (Intercept)        AtBat         Hits        Walks       CAtBat 
##  135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914 
##        CRuns         CRBI       CWalks      LeagueN    DivisionW 
##    1.4553310    0.7852528   -0.8228559   43.1116152 -111.1460252 
##      PutOuts      Assists 
##    0.2894087    0.2688277

Lab 2: Ridge Regression and Lasso

x <- model.matrix(Salary ~ ., Hitters)[,-1]
y <- Hitters$Salary
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loaded glmnet 2.0-16
grid <- 10^seq(10, -2, length = 100) # lambda ranges from 0.01 - 10 billion
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)
# glmnet standardizes variables by default
dim(coef(ridge.mod))
## [1]  20 100
# 100 vectors of regression coefficients for each lambda

# predict a new set of coefficients for lambda = 50
predict(ridge.mod, s = 50, type = "coefficients")[1:20,]
##   (Intercept)         AtBat          Hits         HmRun          Runs 
##  4.876610e+01 -3.580999e-01  1.969359e+00 -1.278248e+00  1.145892e+00 
##           RBI         Walks         Years        CAtBat         CHits 
##  8.038292e-01  2.716186e+00 -6.218319e+00  5.447837e-03  1.064895e-01 
##        CHmRun         CRuns          CRBI        CWalks       LeagueN 
##  6.244860e-01  2.214985e-01  2.186914e-01 -1.500245e-01  4.592589e+01 
##     DivisionW       PutOuts       Assists        Errors    NewLeagueN 
## -1.182011e+02  2.502322e-01  1.215665e-01 -3.278600e+00 -9.496680e+00
# use validation set
set.seed(1)
train <- sample(1:nrow(x), nrow(x)/2)
test <- -train
y.test <- y[test]

ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = grid, thresh = 1e-12)
ridge.pred <- predict(ridge.mod, s = 4, newx = x[test,])
mean((ridge.pred - y.test)^2) # test MSE with lambda = 4
## [1] 101036.8
mean((mean(y[train]) - y.test)^2) # test MSE with no predictors
## [1] 193253.1
ridge.pred <- predict(ridge.mod, s = 1e10, newx = x[test,])
mean((ridge.pred - y.test)^2) # test MSE with giant lambda. same as null model
## [1] 193253.1
ridge.pred <- predict(ridge.mod, s = 0, newx = x[test,])
mean((ridge.pred - y.test)^2) # ordinary least squares has higher test MSE
## [1] 114723.6
# run 10-fold CV to find the best lambda
set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha = 0)
plot(cv.out)

(bestlam <- cv.out$lambda.min)
## [1] 211.7416
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x[test,])
mean((ridge.pred - y.test)^2) # lowest possible test MSE
## [1] 96015.51
# now refit the model with all the observations and examine the coefficients
out <- glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)[1:20,]
##  (Intercept)        AtBat         Hits        HmRun         Runs 
##   9.88487157   0.03143991   1.00882875   0.13927624   1.11320781 
##          RBI        Walks        Years       CAtBat        CHits 
##   0.87318990   1.80410229   0.13074383   0.01113978   0.06489843 
##       CHmRun        CRuns         CRBI       CWalks      LeagueN 
##   0.45158546   0.12900049   0.13737712   0.02908572  27.18227527 
##    DivisionW      PutOuts      Assists       Errors   NewLeagueN 
## -91.63411282   0.19149252   0.04254536  -1.81244470   7.21208394
lasso.mod <- glmnet(x[train,], y[train], alpha = 1, lambda = grid)
plot(lasso.mod)

set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha = 1)
plot(cv.out)

bestlam <- cv.out$lambda.min
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test,])
mean((lasso.pred - y.test)^2) # similar test MSE as ridge regression
## [1] 100743.4
out <- glmnet(x, y, alpha = 1, lambda = grid)
(lasso.coef <- predict(out, type = "coefficients", s = bestlam)[1:20,])
##  (Intercept)        AtBat         Hits        HmRun         Runs 
##   18.5394844    0.0000000    1.8735390    0.0000000    0.0000000 
##          RBI        Walks        Years       CAtBat        CHits 
##    0.0000000    2.2178444    0.0000000    0.0000000    0.0000000 
##       CHmRun        CRuns         CRBI       CWalks      LeagueN 
##    0.0000000    0.2071252    0.4130132    0.0000000    3.2666677 
##    DivisionW      PutOuts      Assists       Errors   NewLeagueN 
## -103.4845458    0.2204284    0.0000000    0.0000000    0.0000000
# 12 of 19 predictors are noise

Lab 3: PCR and PLS Regression

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(2)
pcr.fit <- pcr(Salary ~., data = Hitters, scale = T, validation = "CV")
summary(pcr.fit) # shows cross-validated RMSEP for each number of principal components
## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV             452    348.9    352.2    353.5    352.8    350.1    349.1
## adjCV          452    348.7    351.8    352.9    352.1    349.3    348.0
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       349.6    350.9    352.9     353.8     355.0     356.2     363.5
## adjCV    348.5    349.8    351.6     352.3     353.4     354.5     361.6
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        355.2     357.4     347.6     350.1     349.2     352.6
## adjCV     352.8     355.2     345.5     347.6     346.7     349.8
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         38.31    60.16    70.84    79.03    84.29    88.63    92.26
## Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         94.96    96.28     97.26     97.98     98.65     99.15     99.47
## Salary    46.75    46.86     47.76     47.82     47.85     48.10     50.40
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.75     99.89     99.97     99.99    100.00
## Salary     50.55     53.01     53.85     54.61     54.61
validationplot(pcr.fit, val.type = "MSEP")

# check fit on test data
set.seed(1)
pcr.fit <- pcr(Salary ~ ., data = Hitters, 
               subset = train, 
               scale = T,
               validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")

pcr.pred <- predict(pcr.fit, x[test,], ncomp = 7)
mean((pcr.pred - y.test)^2)
## [1] 96556.22
pcr.fit <- pcr(y ~ x, scale = T, ncomp = 7)
summary(pcr.fit)
## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 7
## TRAINING: % variance explained
##    1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X    38.31    60.16    70.84    79.03    84.29    88.63    92.26
## y    40.63    41.58    42.17    43.22    44.90    46.48    46.69
set.seed(1)
pls.fit <- plsr(Salary ~ ., data = Hitters, subset = train, scale = T, validation = "CV")
summary(pls.fit)
## Data:    X dimension: 131 19 
##  Y dimension: 131 1
## Fit method: kernelpls
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           464.6    394.2    391.5    393.1    395.0    415.0    424.0
## adjCV        464.6    393.4    390.2    391.1    392.9    411.5    418.8
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       424.5    415.8    404.6     407.1     412.0     414.4     410.3
## adjCV    418.9    411.4    400.7     402.2     407.2     409.3     405.6
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        406.2     408.6     410.5     408.8     407.8     410.2
## adjCV     401.8     403.9     405.6     404.1     403.2     405.5
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         38.12    53.46    66.05    74.49    79.33    84.56    87.09
## Salary    33.58    38.96    41.57    42.43    44.04    45.59    47.05
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         90.74    92.55     93.94     97.23     97.88     98.35     98.85
## Salary    47.53    48.42     49.68     50.04     50.54     50.78     50.92
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.11     99.43     99.78     99.99    100.00
## Salary     51.04     51.11     51.15     51.16     51.18
pls.pred <- predict(pls.fit, x[test,], ncomp = 2)
mean((pls.pred - y.test)^2)
## [1] 101417.5
pls.fit <- plsr(Salary ~ ., data = Hitters, scale = T, ncomp = 2)
summary(pls.fit) 
## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: kernelpls
## Number of components considered: 2
## TRAINING: % variance explained
##         1 comps  2 comps
## X         38.08    51.03
## Salary    43.05    46.40
# two pls components explain about as much response variance as seven pcr components,
# because pls seeks directions that explain variance in both the predictors and the response

6.8 Exercises

  1. Best subset selection has lowest training MSE
  2. Unknown. Have to compare forward and backward selection.
  3. False, False, True, True, False
  1. Lasso reduces variance at a cost of increased bias
  2. Ridge regression reduces variance at a cost of increased bias
  3. Non-linear methods decrease bias at a cost of increased variance
  1. Regarding the lasso as \(s\) increases from 0: pg260. Ref pg 224
  1. the training RSS steadily decreases
  2. but the test RSS decreases then increases in a U shape
  3. The variance steadily increases
  4. and the bias steadily decreases
  5. The irreducible error remains constant regardless of the model parameters
  1. Regarding ridge regression, as \(\lambda\) increases from 0:
  1. the training RSS steadily increases
  2. the test RSS initially decreases, then eventually increases in a U shape
  3. the variance steadily decreases
  4. the bias steadily increases
  5. the irreducible error remains constant
  1. For ridge regression with \(n = p = 2\), \(x_{11} = x_{12} = x_1\), and \(x_{21} = x_{22} = x_2\), the objective expands as
\[
\begin{aligned}
& (y_1 - \hat\beta_1 x_{11} - \hat\beta_2 x_{12})^2 + (y_2 - \hat\beta_1 x_{21} - \hat\beta_2 x_{22})^2 + \lambda (\hat\beta_1^2 + \hat\beta_2^2) \\
=\ & (y_1 - x_1 (\hat\beta_1 + \hat\beta_2))^2 + (y_2 - x_2 (\hat\beta_1 + \hat\beta_2))^2 + \lambda (\hat\beta_1^2 + \hat\beta_2^2) \\
=\ & y_1^2 - 2 y_1 x_1 (\hat\beta_1 + \hat\beta_2) + x_1^2 (\hat\beta_1 + \hat\beta_2)^2 + y_2^2 - 2 y_2 x_2 (\hat\beta_1 + \hat\beta_2) + x_2^2 (\hat\beta_1 + \hat\beta_2)^2 + \lambda (\hat\beta_1^2 + \hat\beta_2^2)
\end{aligned}
\]

plotFit <- function(mod) {
  plot(mod, scale = "r2")
  plot(mod, scale = "adjr2")
  plot(mod, scale = "Cp")
  plot(mod, scale = "bic")
}

x <- rnorm(100)
err <- runif(100)
beta <- c(5, 19, -4.5, pi)
y <- beta[1] + beta[2]*x + beta[3]*x^2 + beta[4]*x^3 + err
df <- tibble(x = x, y = y)
regfit.full <- regsubsets(y ~ poly(x, degree = 10), data = df)
regfit.fwd <- regsubsets(y ~ poly(x, degree = 10), data = df, method = "forward")
regfit.bwd <- regsubsets(y ~ poly(x, degree = 10), data = df, method = "backward")

plotFit(regfit.bwd)

coef(regfit.full, id = 3)
##           (Intercept) poly(x, degree = 10)1 poly(x, degree = 10)2 
##               3.30962             246.08910              49.08779 
## poly(x, degree = 10)3 
##              38.34104
coef(regfit.fwd, id = 3)
##           (Intercept) poly(x, degree = 10)1 poly(x, degree = 10)2 
##               3.30962             246.08910              49.08779 
## poly(x, degree = 10)3 
##              38.34104
coef(regfit.bwd, id = 3)
##           (Intercept) poly(x, degree = 10)1 poly(x, degree = 10)2 
##               3.30962             246.08910              49.08779 
## poly(x, degree = 10)3 
##              38.34104
grid <- seq(1e-5, 100, length.out = 100)
cv.lasso <- cv.glmnet(poly(x, degree = 10), y, lambda = grid, alpha = 1)
plot(cv.lasso)

(bestlam <- cv.lasso$lambda.min)
## [1] 1e-05
predict(cv.lasso, type = "coefficients", s = bestlam)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)   3.30961995
## 1           246.08899754
## 2            49.08769396
## 3            38.34094135
## 4             0.10449777
## 5             0.58210431
## 6            -0.05610677
## 7            -0.48965239
## 8             0.30743114
## 9             0.37958681
## 10            0.05628639
set.seed(1)
MSE <- vector("numeric", 5)
train <- sample(c(TRUE,TRUE,FALSE), nrow(College), replace = T)
x.train <- model.matrix(Apps ~ ., College)[train, -1]
x.test <-  model.matrix(Apps ~ ., College)[!train, -1]  
y.train <- College$Apps[train]
y.test <- College$Apps[!train]
# linear model
lm.fit <- glm(Apps ~ ., data = College, subset = train)
lm.pred <- predict(lm.fit, newdata = College[!train,])
MSE[1] <- mean((y.test - lm.pred)^2)

# ridge regression
ridge.fit <- cv.glmnet(x.train, y.train, alpha = 0)
ridge.pred <- predict(ridge.fit, newx = x.test, s = ridge.fit$lambda.min)
MSE[2] <- mean((y.test - ridge.pred)^2)

# lasso
lasso.fit <- cv.glmnet(x.train, y.train, alpha = 1)
lasso.pred <- predict(lasso.fit, newx = x.test, s = lasso.fit$lambda.min)
MSE[3] <- mean((y.test - lasso.pred)^2)

# PCR
pcr.fit <- pcr(Apps ~ ., data = College, 
               subset = train, 
               scale = T,
               validation = "CV")
summary(pcr.fit) # 4 pcr components explain 88% of variance, adjCV 1243
## Data:    X dimension: 522 17 
##  Y dimension: 522 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3561     3531     1680     1688     1276     1250     1243
## adjCV         3561     3535     1679     1691     1243     1242     1241
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1238     1225     1154      1148      1151      1153      1154
## adjCV     1236     1233     1152      1146      1149      1151      1152
##        14 comps  15 comps  16 comps  17 comps
## CV         1154      1160      1044      1030
## adjCV      1152      1158      1041      1027
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       31.61    57.86    64.99    70.73    76.08    80.92    84.61
## Apps     3.40    78.19    78.30    88.11    88.13    88.26    88.35
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       87.80    90.85     93.26     95.38     97.12     98.19     98.95
## Apps    88.36    90.01     90.13     90.13     90.16     90.20     90.24
##       15 comps  16 comps  17 comps
## X        99.46     99.82    100.00
## Apps     90.24     92.26     92.59
pcr.pred <- predict(pcr.fit, x.test, ncomp = 4)
MSE[4] <- mean((y.test - pcr.pred)^2)

# PLS
pls.fit <- plsr(Apps ~ ., data = College, 
                subset = train, scale = T, validation = "CV")
summary(pls.fit) # 4 pls components explain 91% variance, adjCV 1124
## Data:    X dimension: 522 17 
##  Y dimension: 522 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3561     1516     1245     1150     1129     1087     1067
## adjCV         3561     1514     1246     1148     1124     1080     1061
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1058     1050     1050      1051      1054      1052      1051
## adjCV     1053     1046     1046      1046      1049      1047      1047
##        14 comps  15 comps  16 comps  17 comps
## CV         1051      1052      1052      1052
## adjCV      1047      1047      1047      1047
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       26.45    41.83    63.36    66.56    70.32    74.17    77.61
## Apps    82.46    88.24    90.31    91.11    91.88    92.44    92.51
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       80.58    82.93     85.69     88.57     91.53     93.22     94.42
## Apps    92.54    92.56     92.57     92.58     92.59     92.59     92.59
##       15 comps  16 comps  17 comps
## X        97.27     98.89    100.00
## Apps     92.59     92.59     92.59
pls.pred <- predict(pls.fit, x.test, ncomp = 4)
MSE[5] <- mean((pls.pred - y.test)^2)

plot(MSE)

# the ordinary linear model performs best, followed by lasso. PCR and PLS not so good