Let’s examine other linear model fitting procedures besides least squares. Why?
Prediction Accuracy: If n is not much larger than p, the least squares fit can have high variance, resulting in overfitting and poor predictions. And if \(p > n\) then least squares is no longer viable because there is no longer a unique solution. By constraining or shrinking the estimated coefficients, the variance can be reduced at the cost of a negligible increase in bias.
Model Interpretability: removing variables that are irrelevant to the response yields a simpler, more interpretable model.
To perform best subset selection, we fit a separate least squares regression for each possible combination of the p predictors.
Since the number of candidate models in best subset selection scales as \(2^p\), it becomes computationally infeasible for \(p > 20\).
While best subset selection considers all \(2^p\) possible models, forward stepwise selection considers a much smaller set of models. It begins with the null model, then adds predictors one at a time until all \(p\) are in the model. At each step, the variable providing the greatest additional improvement to the fit is added to the model.
Whereas best subset selection fits \(2^p\) models, forward stepwise selection fits one null model, along with \(p-k\) models in the \(k\)th iteration, for \(k = 0, \ldots, p-1\). This amounts to \(1 + \sum_{k = 0}^{p - 1} (p - k) = 1 + p(p + 1) / 2\) models in total.
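A quick sanity check of this count, using \(p = 19\) as in the Hitters data from the lab below:
# forward stepwise fits 1 + p + (p - 1) + ... + 1 = 1 + p(p + 1)/2 models in total
p <- 19
1 + sum(p - 0:(p - 1)) # 191
1 + p * (p + 1) / 2    # 191
2^p                    # 524288 models for best subset selection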
Backward stepwise selection is like forward, but starts with the full model containing all p predictors and removes the least useful predictor at each step.
The model with the most predictors will always have the lowest training RSS (it has the fewest residual degrees of freedom), so the training RSS and \(R^2\) are poor guides to the test error and we need another way to pick the best model. To estimate the test error, we can indirectly assess it by adjusting the training error for the bias caused by overfitting, or directly assess it using a validation set or cross-validation approach.
Least squares fits by minimizing the training RSS, and in general more predictors decrease the training RSS. Let’s discuss methods for adjusting the training error.
\[ C_p = \frac{1}{n} (\text{RSS} + 2 d \hat\sigma^2) \] where \(d\) is the number of predictors, and \(\hat\sigma^2\) is an estimate of the variance of the error \(\epsilon\) associated with each response measurement, typically computed from the full model. The \(2 d \hat\sigma^2\) term penalizes each additional predictor, compensating for the fact that the training RSS always decreases as predictors are added.
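A minimal hand computation of this \(C_p\) estimate (illustrative only; the mtcars data and model below are not part of the lab):
# compute Cp = (RSS + 2 d sigma2.hat) / n by hand on an illustrative dataset
fit.sub <- lm(mpg ~ wt + hp, data = mtcars)  # candidate model with d = 2 predictors
fit.full <- lm(mpg ~ ., data = mtcars)       # full model used to estimate sigma^2
n <- nrow(mtcars); d <- 2
rss <- sum(resid(fit.sub)^2)
sigma2.hat <- summary(fit.full)$sigma^2      # estimate of Var(epsilon) from the full model
(Cp <- (rss + 2 * d * sigma2.hat) / n)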
The AIC criterion is defined for a large class of models fit by maximum likelihood. In the case of least squares, it is proportional to \(C_p\), so the two criteria select the same model.
Similar form to \(C_p\), but replaces \(2d\hat\sigma^2\) with a \(\log (n)d \hat\sigma^2\) term. Since \(\log n > 2\) for any \(n > 7\), the BIC statistic places heavier penalties on larger models.
Recall \(R^2 = 1 - \text{RSS} / \text{TSS}\), where \(\text{TSS} = \sum_i (y_i - \bar y)^2\) is the total sum of squares; \(R^2\) measures how much of the variance is accounted for by the model.
\[ \text{Adjusted } R^2 = 1 - \frac{\text{RSS} / (n - d - 1)} {\text{TSS} / (n - 1)} \] We want the model with the largest adjusted \(R^2\). Unlike \(R^2\), adjusted \(R^2\) pays a price for unnecessary predictors through the \(n - d - 1\) term. The intuition is that once all the correct predictors have been added to the model, any additional ones just add noise and decrease RSS only minimally, while \(n - d - 1\) shrinks, so \(\text{RSS}/(n - d - 1)\) increases and adjusted \(R^2\) falls.
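The same illustrative mtcars model can be used to check this formula against R's built-in value:
# adjusted R^2 by hand, compared with lm's own value
fit.sub <- lm(mpg ~ wt + hp, data = mtcars)  # d = 2 predictors
n <- nrow(mtcars); d <- 2
rss <- sum(resid(fit.sub)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
1 - (rss / (n - d - 1)) / (tss / (n - 1))
summary(fit.sub)$adj.r.squared               # matches the hand computation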
Directly estimate the test error with held out data. Can be used on many types of models and does not require assumptions about the underlying data or estimating the error variance. When in doubt be parsimonious! (Choose the simplest model)
Shrinkage methods constrain or regularize the coefficient estimates towards zero. It will be shown that this can significantly reduce their variance.
Recall that ordinary least squares minimizes \(\text{RSS} = \|\mathbf{y} - \mathbf{X} \beta\|^2\). Ridge regression is very similar, but it minimizes \(\text{RSS} + \lambda \sum_{j = 1}^p \beta_j^2\), where \(\lambda \geq 0\) is a tuning parameter to be determined separately. Ideally, only the important coefficients retain sizable values, and the unimportant ones are shrunk towards zero by \(\lambda\). Because the regularization term penalizes the size of the coefficients, it is best to standardize the predictors so they are all on the same scale. Ridge regression's advantage over least squares is rooted in the bias-variance trade-off. As \(\lambda\) increases, the flexibility of the fit decreases, increasing the bias but decreasing the variance. Since the test MSE depends on both the bias and the variance, the goal is to find the \(\lambda\) that minimizes their sum. Ridge regression performs best in situations where the least squares estimates have high variance.
### The Lasso
Ridge regression will never remove a predictor entirely (unless \(\lambda = \infty\)). The lasso has a very similar form: it minimizes \(\text{RSS} + \lambda \sum_{j = 1}^p |\beta_j|\) (the \(\ell_2\) penalty has been replaced by the \(\ell_1\) norm). The lasso can force some coefficients exactly to zero, yielding sparser and more interpretable models. Geometrically, the \(\ell_1\) constraint region has corners on the axes (where at least one coefficient is zero), while the \(\ell_2\) constraint region is a sphere, so the ridge solution practically never lands exactly on an axis. In the special case where \(\mathbf{X}\) is the identity matrix, the closed-form solutions show that ridge regression shrinks every coefficient by the same proportion, while the lasso shrinks each coefficient towards zero by a similar constant amount (soft thresholding) and sets the ones that are already small to exactly zero.
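A small numeric sketch of that identity-design special case, using the closed-form solutions (the toy y vector and lambda value are illustrative only):
# when X = I, ridge gives y_j / (1 + lambda); the lasso soft-thresholds at lambda / 2
y.toy <- c(3, 1.5, 0.4, -0.2, -2)
lambda <- 1
ridge.beta <- y.toy / (1 + lambda)                           # same proportional shrinkage for every coefficient
lasso.beta <- sign(y.toy) * pmax(abs(y.toy) - lambda / 2, 0) # small coefficients set exactly to zero
rbind(y.toy, ridge.beta, lasso.beta)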
What is the best \(\lambda\)? Choose a grid of \(\lambda\) values, compute the cross-validation error for each value, then select the \(\lambda\) with the smallest cross-validation error and refit the model on all of the observations using that value.
The variance-controlling methods discussed so far use all of the original predictors \(X_1, \ldots, X_p\). Now we explore approaches that transform the predictors and then fit a least squares model on the transformed variables. Let \(Z_1, \ldots, Z_M\) represent \(M < p\) linear combinations of the original \(p\) predictors. That is, \(Z_m = \sum_{j = 1}^p \phi_{jm} X_j\) for some constants \(\phi_{1m}, \phi_{2m}, \ldots, \phi_{pm}\), \(m = 1, \ldots, M\). We can then fit the linear regression model \[ y_i = \theta_0 + \sum_{m = 1}^M \theta_m z_{im} + \epsilon_i, \quad i = 1, \ldots, n\] Note the regression coefficients are now \(\theta_0, \theta_1, \ldots, \theta_M\). If the constants \(\phi_{1m}, \phi_{2m}, \ldots, \phi_{pm}\) are chosen wisely, such dimension reduction can outperform least squares regression. We have reduced the problem from estimating \(p + 1\) coefficients to \(M + 1\), with the original coefficients constrained to \(\beta_j = \sum_{m = 1}^M \theta_m \phi_{jm}\). All dimension reduction methods work in two steps. First, the transformed predictors \(Z_1, \ldots, Z_M\) are obtained. Second, the model is fit using these \(M\) predictors. However, the selection of the \(Z_m\) (or equivalently the \(\phi_{jm}\)) can be achieved in different ways.
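A minimal sketch of the two-step procedure and the implied \(\beta_j = \sum_m \theta_m \phi_{jm}\), using PCA loadings for the \(\phi_{jm}\) (mtcars and the chosen columns are illustrative only):
# step 1: build M = 2 linear combinations Z_m = sum_j phi_jm X_j
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")]))
y.toy <- mtcars$mpg
phi <- prcomp(X)$rotation[, 1:2]  # p x M matrix of constants phi_jm
Z <- X %*% phi                    # n x M transformed predictors
# step 2: least squares on the M transformed predictors
theta <- coef(lm(y.toy ~ Z))
# implied coefficients on the original (standardized) predictors: beta_j = sum_m theta_m phi_jm
phi %*% theta[-1]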
PCA is a technique for reducing the dimension of the \(n \times p\) data matrix \(\mathbf X\). The first principal component direction of the data is the direction along which the observations vary the most. The book shows a line with the observations projected onto it: the first principal component is the direction for which the projected observations have the highest possible variance. With two predictors this can be written as \[ Z_1 = \phi_{11} (X_1 - \bar X_1) + \phi_{21} (X_2 - \bar X_2)\] The first principal component is the linear combination of \(X_1\) and \(X_2\) with \(\phi_{11}^2 + \phi_{21}^2 = 1\) that has the highest variance. The second principal component \(Z_2\) is the linear combination of the variables that is uncorrelated with \(Z_1\) and has the highest variance subject to that constraint. The zero-correlation condition means the principal components are orthogonal, \(Z_2 \perp Z_1\) (the dot product \(Z_1 \cdot Z_2 = 0\)).
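The unit-norm and zero-correlation constraints are easy to verify numerically (again purely illustrative, not part of the lab):
# verify the PCA constraints: unit-norm loadings, uncorrelated components
X <- scale(as.matrix(mtcars[, c("wt", "hp")]), scale = FALSE)  # centered X1, X2
pc <- prcomp(X)
colSums(pc$rotation^2)   # phi_11^2 + phi_21^2 = 1 for each component
cor(pc$x[, 1], pc$x[, 2])  # Z1 and Z2 are uncorrelated (essentially 0)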
PLS works like PCR, but identifies the new features \(Z_1, \ldots, Z_M\) in a supervised way, using the response \(Y\) to find new features that both approximate the old ones and are related to the response. That is, PLS attempts to find directions that help explain both the response and the predictors. After standardizing the predictors, PLS computes the first direction \(Z_1\) by setting each \(\phi_{j1}\) equal to the coefficient from the simple linear regression of \(Y \sim X_j\). One can show this coefficient is proportional to the correlation between \(Y\) and \(X_j\). Hence, in computing \(Z_1 = \sum_{j=1}^p \phi_{j1} X_j\), PLS places the highest weight on the variables that are most strongly related to the response. To identify the second PLS direction, we first adjust each of the variables for \(Z_1\) by regressing each variable on \(Z_1\) and taking the residuals, which can be interpreted as the remaining information not explained by \(Z_1\).
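A rough sketch of the first PLS direction computed this way; up to scale it should agree with the first component from pls::plsr (the mtcars data and column choice are illustrative assumptions):
# phi_j1 = slope of the simple regression of y on the standardized X_j
library(pls)
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))
y.toy <- mtcars$mpg
phi1 <- apply(X, 2, function(xj) coef(lm(y.toy ~ xj))[2])
Z1 <- X %*% phi1  # heaviest weight on the predictors most correlated with the response
pls1 <- plsr(y.toy ~ X, ncomp = 1)
cor(Z1, pls1$scores[, 1])  # should be close to +/- 1 (same direction up to sign)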
library(ISLR); library(leaps); library(tidyverse)  # tidyverse provides tibble/gather/ggplot used below
Hitters <- na.omit(Hitters)
# perform best subset selection
regfit.full <- regsubsets(Salary ~ ., Hitters)
summary(regfit.full) # asterisk means variable included in this model
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " "
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*"
## CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1 ( 1 ) "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " "*" " " " " " "
## 4 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 5 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 6 ( 1 ) "*" " " " " "*" "*" " " " " " "
## 7 ( 1 ) " " " " " " "*" "*" " " " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " " " " "
reg.summary <- summary(regfit.full)
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
df <- tibble(
  rsq = reg.summary$rsq,
  adjr2 = reg.summary$adjr2,
  cp = reg.summary$cp,
  bic = reg.summary$bic,
  x = seq_along(rsq)
)
df %>%
  gather(stat, score, -x) %>%
  ggplot(aes(x, score, color = stat)) +
  geom_line()
plot(regfit.full, scale = "r2")
plot(regfit.full, scale = "adjr2")
plot(regfit.full, scale = "Cp")
plot(regfit.full, scale = "bic")
regfit.fwd <- regsubsets(Salary ~ ., Hitters, method = "forward")
regfit.bwd <- regsubsets(Salary ~ ., Hitters, method = "backward")
plot(regfit.bwd, scale = "r2")
plot(regfit.bwd, scale = "adjr2")
plot(regfit.bwd, scale = "Cp")
plot(regfit.bwd, scale = "bic")
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Hitters), rep = T)
test <- !train
# predict() method for regsubsets objects, used by the validation-set and CV loops below
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
regfit.best <- regsubsets(Salary ~ ., Hitters[train, ], nvmax = 19)
test.mat <- model.matrix(Salary ~., Hitters[test, ])
val.errors <- rep(NA, 19)
for(ii in 1:19) {
  coefii <- coef(regfit.best, id = ii)
  pred <- test.mat[, names(coefii)] %*% coefii
  val.errors[ii] <- mean((Hitters$Salary[test] - pred)^2)
}
which.min(val.errors)
## [1] 10
# best model has 10 variables
coef(regfit.best, 10)
## (Intercept) AtBat Hits Walks CAtBat CHits
## -80.2751499 -1.4683816 7.1625314 3.6430345 -0.1855698 1.1053238
## CHmRun CWalks LeagueN DivisionW PutOuts
## 1.3844863 -0.7483170 84.5576103 -53.0289658 0.2381662
# use cross-validation with 10 folds
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(Hitters), replace = T)
cv.error <- matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
for(jj in 1:k) {
  best.fit <- regsubsets(Salary ~ ., Hitters[folds != jj, ], nvmax = 19)
  for(ii in 1:19) {
    pred <- predict(best.fit, Hitters[folds == jj, ], id = ii)
    cv.error[jj, ii] <- mean((Hitters$Salary[folds == jj] - pred)^2)
  }
}
which.min(apply(cv.error, 2, mean))
## 11
## 11
reg.best <- regsubsets(Salary ~., Hitters, nvmax = 19)
coef(reg.best, 11)
## (Intercept) AtBat Hits Walks CAtBat
## 135.7512195 -2.1277482 6.9236994 5.6202755 -0.1389914
## CRuns CRBI CWalks LeagueN DivisionW
## 1.4553310 0.7852528 -0.8228559 43.1116152 -111.1460252
## PutOuts Assists
## 0.2894087 0.2688277
x <- model.matrix(Salary ~ ., Hitters)[,-1]
y <- Hitters$Salary
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
## Loading required package: foreach
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loaded glmnet 2.0-16
grid <- 10^seq(10, -2, length = 100) # lambda ranges from 0.01 - 10 billion
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)
# glmnet standardizes variables by default
dim(coef(ridge.mod))
## [1] 20 100
# a 20 x 100 matrix: one vector of 20 regression coefficients for each of the 100 lambda values
# predict a new set of coefficients for lambda = 50
predict(ridge.mod, s = 50, type = "coefficients")[1:20,]
## (Intercept) AtBat Hits HmRun Runs
## 4.876610e+01 -3.580999e-01 1.969359e+00 -1.278248e+00 1.145892e+00
## RBI Walks Years CAtBat CHits
## 8.038292e-01 2.716186e+00 -6.218319e+00 5.447837e-03 1.064895e-01
## CHmRun CRuns CRBI CWalks LeagueN
## 6.244860e-01 2.214985e-01 2.186914e-01 -1.500245e-01 4.592589e+01
## DivisionW PutOuts Assists Errors NewLeagueN
## -1.182011e+02 2.502322e-01 1.215665e-01 -3.278600e+00 -9.496680e+00
# use validation set
set.seed(1)
train <- sample(1:nrow(x), nrow(x)/2)
test <- -train
y.test <- y[test]
ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = grid, thresh = 1e-12)
ridge.pred <- predict(ridge.mod, s = 4, newx = x[test,])
mean((ridge.pred - y.test)^2) # test MSE with lambda = 4
## [1] 101036.8
mean((mean(y[train]) - y.test)^2) # test MSE with no predictors
## [1] 193253.1
ridge.pred <- predict(ridge.mod, s = 1e10, newx = x[test,])
mean((ridge.pred - y.test)^2) # test MSE with giant lambda. same as null model
## [1] 193253.1
ridge.pred <- predict(ridge.mod, s = 0, newx = x[test,])
mean((ridge.pred - y.test)^2) # lambda = 0 approximates ordinary least squares: higher test MSE than lambda = 4
## [1] 114723.6
# run 10-fold CV to find the best lambda
set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha = 0)
plot(cv.out)
(bestlam <- cv.out$lambda.min)
## [1] 211.7416
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x[test,])
mean((ridge.pred - y.test)^2) # test MSE at the CV-chosen lambda: lower than with lambda = 4
## [1] 96015.51
# now refit the model with all the observations and examine the coefficients
out <- glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)[1:20,]
## (Intercept) AtBat Hits HmRun Runs
## 9.88487157 0.03143991 1.00882875 0.13927624 1.11320781
## RBI Walks Years CAtBat CHits
## 0.87318990 1.80410229 0.13074383 0.01113978 0.06489843
## CHmRun CRuns CRBI CWalks LeagueN
## 0.45158546 0.12900049 0.13737712 0.02908572 27.18227527
## DivisionW PutOuts Assists Errors NewLeagueN
## -91.63411282 0.19149252 0.04254536 -1.81244470 7.21208394
lasso.mod <- glmnet(x[train,], y[train], alpha = 1, lambda = grid)
plot(lasso.mod)
set.seed(1)
cv.out <- cv.glmnet(x[train,], y[train], alpha = 1)
plot(cv.out)
bestlam <- cv.out$lambda.min
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test,])
mean((lasso.pred - y.test)^2) # similar test MSE as ridge regression
## [1] 100743.4
out <- glmnet(x, y, alpha = 1, lambda = grid)
(lasso.coef <- predict(out, type = "coefficients", s = bestlam)[1:20,])
## (Intercept) AtBat Hits HmRun Runs
## 18.5394844 0.0000000 1.8735390 0.0000000 0.0000000
## RBI Walks Years CAtBat CHits
## 0.0000000 2.2178444 0.0000000 0.0000000 0.0000000
## CHmRun CRuns CRBI CWalks LeagueN
## 0.0000000 0.2071252 0.4130132 0.0000000 3.2666677
## DivisionW PutOuts Assists Errors NewLeagueN
## -103.4845458 0.2204284 0.0000000 0.0000000 0.0000000
# the lasso sets 12 of the 19 coefficient estimates exactly to zero
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(2)
pcr.fit <- pcr(Salary ~., data = Hitters, scale = T, validation = "CV")
summary(pcr.fit) # show CV root-MSE (RMSEP) values for each number of principal components
## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 452 348.9 352.2 353.5 352.8 350.1 349.1
## adjCV 452 348.7 351.8 352.9 352.1 349.3 348.0
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 349.6 350.9 352.9 353.8 355.0 356.2 363.5
## adjCV 348.5 349.8 351.6 352.3 353.4 354.5 361.6
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 355.2 357.4 347.6 350.1 349.2 352.6
## adjCV 352.8 355.2 345.5 347.6 346.7 349.8
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 38.31 60.16 70.84 79.03 84.29 88.63 92.26
## Salary 40.63 41.58 42.17 43.22 44.90 46.48 46.69
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 94.96 96.28 97.26 97.98 98.65 99.15 99.47
## Salary 46.75 46.86 47.76 47.82 47.85 48.10 50.40
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 99.75 99.89 99.97 99.99 100.00
## Salary 50.55 53.01 53.85 54.61 54.61
validationplot(pcr.fit, val.type = "MSEP")
# check fit on test data
set.seed(1)
pcr.fit <- pcr(Salary ~ ., data = Hitters,
subset = train,
scale = T,
validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred <- predict(pcr.fit, x[test,], ncomp = 7)
mean((pcr.pred - y.test)^2)
## [1] 96556.22
pcr.fit <- pcr(y ~ x, scale = T, ncomp = 7)
summary(pcr.fit)
## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 7
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 38.31 60.16 70.84 79.03 84.29 88.63 92.26
## y 40.63 41.58 42.17 43.22 44.90 46.48 46.69
set.seed(1)
pls.fit <- plsr(Salary ~ ., data = Hitters, subset = train, scale = T, validation = "CV")
summary(pls.fit)
## Data: X dimension: 131 19
## Y dimension: 131 1
## Fit method: kernelpls
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 464.6 394.2 391.5 393.1 395.0 415.0 424.0
## adjCV 464.6 393.4 390.2 391.1 392.9 411.5 418.8
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 424.5 415.8 404.6 407.1 412.0 414.4 410.3
## adjCV 418.9 411.4 400.7 402.2 407.2 409.3 405.6
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 406.2 408.6 410.5 408.8 407.8 410.2
## adjCV 401.8 403.9 405.6 404.1 403.2 405.5
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 38.12 53.46 66.05 74.49 79.33 84.56 87.09
## Salary 33.58 38.96 41.57 42.43 44.04 45.59 47.05
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 90.74 92.55 93.94 97.23 97.88 98.35 98.85
## Salary 47.53 48.42 49.68 50.04 50.54 50.78 50.92
## 15 comps 16 comps 17 comps 18 comps 19 comps
## X 99.11 99.43 99.78 99.99 100.00
## Salary 51.04 51.11 51.15 51.16 51.18
pls.pred <- predict(pls.fit, x[test,], ncomp = 2)
mean((pls.pred - y.test)^2)
## [1] 101417.5
pls.fit <- plsr(Salary ~ ., data = Hitters, scale = T, ncomp = 2)
summary(pls.fit)
## Data: X dimension: 263 19
## Y dimension: 263 1
## Fit method: kernelpls
## Number of components considered: 2
## TRAINING: % variance explained
## 1 comps 2 comps
## X 38.08 51.03
## Salary 43.05 46.40
# two PLS components explain nearly as much variance in Salary (46.40%) as seven PCR components (46.69%),
# because PLS seeks directions that explain variance in both the predictors and the response
$$ \begin{eqnarray}
& (y_1 - x_1(\beta_1 - \beta_2))^2 + (y_2 - x_2(\beta_1 - \beta_2))^2 + \lambda(\beta_1^2 + \beta_2^2) \\
& y_1^2 - 2 y_1 x_1(\beta_1 - \beta_2) + x_1^2(\beta_1 - \beta_2)^2 + y_2^2 - 2 y_2 x_2(\beta_1 - \beta_2) + x_2^2(\beta_1 - \beta_2)^2 + \lambda(\beta_1^2 + \beta_2^2)
\end{eqnarray} $$
plotFit <- function(mod) {
  plot(mod, scale = "r2")
  plot(mod, scale = "adjr2")
  plot(mod, scale = "Cp")
  plot(mod, scale = "bic")
}
x <- rnorm(100)
err <- runif(100)
beta <- c(5, 19, -4.5, pi)
y <- beta[1] + beta[2]*x + beta[3]*x^2 + beta[4]*x^3 + err  # cubic polynomial in x plus noise
df <- tibble(x = x, y = y)
regfit.full <- regsubsets(y ~ poly(x, degree = 10), data = df)
regfit.fwd <- regsubsets(y ~ poly(x, degree = 10), data = df, method = "forward")
regfit.bwd <- regsubsets(y ~ poly(x, degree = 10), data = df, method = "backward")
plotFit(regfit.bwd)
coef(regfit.full, id = 3)
## (Intercept) poly(x, degree = 10)1 poly(x, degree = 10)2
## 3.30962 246.08910 49.08779
## poly(x, degree = 10)3
## 38.34104
coef(regfit.fwd, id = 3)
## (Intercept) poly(x, degree = 10)1 poly(x, degree = 10)2
## 3.30962 246.08910 49.08779
## poly(x, degree = 10)3
## 38.34104
coef(regfit.bwd, id = 3)
## (Intercept) poly(x, degree = 10)1 poly(x, degree = 10)2
## 3.30962 246.08910 49.08779
## poly(x, degree = 10)3
## 38.34104
grid <- seq(1e-5, 100, length.out = 100)
cv.lasso <- cv.glmnet(poly(x, degree = 10), y, lambda = grid, alpha = 1)
plot(cv.lasso)
(bestlam <- cv.lasso$lambda.min)
## [1] 1e-05
predict(cv.lasso, type = "coefficients", s = bestlam)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 3.30961995
## 1 246.08899754
## 2 49.08769396
## 3 38.34094135
## 4 0.10449777
## 5 0.58210431
## 6 -0.05610677
## 7 -0.48965239
## 8 0.30743114
## 9 0.37958681
## 10 0.05628639
set.seed(1)
MSE <- vector("numeric", 5)
train <- sample(c(TRUE,TRUE,FALSE), nrow(College), replace = T)
x.train <- model.matrix(Apps ~ ., College)[train, -1]
x.test <- model.matrix(Apps ~ ., College)[!train, -1]
y.train <- College$Apps[train]
y.test <- College$Apps[!train]
# linear model
lm.fit <- glm(Apps ~ ., data = College, subset = train)
lm.pred <- predict(lm.fit, newdata = College[!train,])
MSE[1] <- mean((y.test - lm.pred)^2)
# ridge regression
ridge.fit <- cv.glmnet(x.train, y.train, alpha = 0)
ridge.pred <- predict(ridge.fit, newx = x.test, s = ridge.fit$lambda.min)
MSE[2] <- mean((y.test - ridge.pred)^2)
# lasso
lasso.fit <- cv.glmnet(x.train, y.train, alpha = 1)
lasso.pred <- predict(lasso.fit, newx = x.test, s = lasso.fit$lambda.min)
MSE[3] <- mean((y.test - lasso.pred)^2)
# PCR
pcr.fit <- pcr(Apps ~ ., data = College,
subset = train,
scale = T,
validation = "CV")
summary(pcr.fit) # 4 pcr components explain 88% of variance, adjCV 1243
## Data: X dimension: 522 17
## Y dimension: 522 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3561 3531 1680 1688 1276 1250 1243
## adjCV 3561 3535 1679 1691 1243 1242 1241
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1238 1225 1154 1148 1151 1153 1154
## adjCV 1236 1233 1152 1146 1149 1151 1152
## 14 comps 15 comps 16 comps 17 comps
## CV 1154 1160 1044 1030
## adjCV 1152 1158 1041 1027
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 31.61 57.86 64.99 70.73 76.08 80.92 84.61
## Apps 3.40 78.19 78.30 88.11 88.13 88.26 88.35
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 87.80 90.85 93.26 95.38 97.12 98.19 98.95
## Apps 88.36 90.01 90.13 90.13 90.16 90.20 90.24
## 15 comps 16 comps 17 comps
## X 99.46 99.82 100.00
## Apps 90.24 92.26 92.59
pcr.pred <- predict(pcr.fit, x.test, ncomp = 4)
MSE[4] <- mean((y.test - pcr.pred)^2)
# PLS
pls.fit <- plsr(Apps ~ ., data = College,
subset = train, scale = T, validation = "CV")
summary(pls.fit) # 4 pls components explain 91% variance, adjCV 1124
## Data: X dimension: 522 17
## Y dimension: 522 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3561 1516 1245 1150 1129 1087 1067
## adjCV 3561 1514 1246 1148 1124 1080 1061
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1058 1050 1050 1051 1054 1052 1051
## adjCV 1053 1046 1046 1046 1049 1047 1047
## 14 comps 15 comps 16 comps 17 comps
## CV 1051 1052 1052 1052
## adjCV 1047 1047 1047 1047
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 26.45 41.83 63.36 66.56 70.32 74.17 77.61
## Apps 82.46 88.24 90.31 91.11 91.88 92.44 92.51
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 80.58 82.93 85.69 88.57 91.53 93.22 94.42
## Apps 92.54 92.56 92.57 92.58 92.59 92.59 92.59
## 15 comps 16 comps 17 comps
## X 97.27 98.89 100.00
## Apps 92.59 92.59 92.59
pls.pred <- predict(pls.fit, x.test, ncomp = 4)
MSE[5] <- mean((pls.pred - y.test)^2)
plot(MSE)
# the ordinary linear model performs best, followed by lasso. PCR and PLS not so good