Statement iii, "Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance," is correct for the lasso, because:
The lasso shrinks the coefficient estimates towards zero; that is, it adds the shrinkage penalty \(\lambda\sum_{j=1}^{p}|\beta_{j}|\) to the residual sum of squares, which pushes the coefficient estimates \(\beta_1\), \(\beta_2\), …, \(\beta_p\) toward zero. This shrinkage can substantially reduce their variance at the cost of a small increase in bias, and therefore improves prediction accuracy whenever the reduction in variance dominates. In addition, depending on the value of \(\lambda\), the lasso can produce a model involving any number of variables.
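For reference, the quantity the lasso minimizes is the residual sum of squares plus this penalty:
\[\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda\sum_{j=1}^{p}|\beta_j|\]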
Ridge regression is similar to the lasso, and the same statement is correct for it:
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
The reasoning is very similar to part (a); the only difference is that in ridge regression the shrinkage penalty term is \(\lambda\sum_{j=1}^{p}\beta_{j}^2\), which shrinks the coefficient estimates of the less useful predictors toward zero but never exactly to zero.
Statement ii, "More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias," is correct for non-linear methods, because:
In a non-linear model, \(y = f(x) + \varepsilon\), the functional part of the model is not linear with respect to the unknown parameters \(\beta_0, \beta_1, \ldots\), and the method of least squares is used to estimate those parameters. Since non-linear methods make fewer assumptions about the form of \(f\), the function \(f(x)\) is estimated using non-linear relationships between the predictors and the response. This decreases the bias but increases the variance, so prediction accuracy improves when the decrease in bias outweighs the increase in variance.
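The analyses that follow assume the packages below are loaded (a minimal setup sketch inferred from the function calls used later; the original document may load them elsewhere).
library(ISLR)      # College data set
library(MASS)      # Boston data set
library(glmnet)    # ridge regression and the lasso
library(pls)       # pcr() and plsr()
library(leaps)     # regsubsets()
library(dplyr)     # mutate(), select(), rename()
library(reshape2)  # melt()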
College data set.
#Splitting the data set with a 70-30 split
set.seed(777)
train = sample(1:dim(College)[1], round(dim(College)[1] * 0.7))
test <- -train
train.college <- College[train, ]
test.college <- College[test, ]
lm.college.fit <- lm(Apps ~ ., data = train.college)
lm.college.pred <- predict(lm.college.fit, test.college)
lm.college.mse <- mean((lm.college.pred - test.college$Apps)^2)
lm.college.mse
## [1] 1346878
The test error is 1346878 (1.346878e+06).
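Since the same test-MSE calculation is repeated for every model below, it could be wrapped in a small helper; test_mse is a hypothetical name not used in the original code, shown only as a sketch.
#Hypothetical helper: mean squared prediction error on the test set
test_mse <- function(pred, actual) {
  mean((pred - actual)^2)
}
#e.g. test_mse(lm.college.pred, test.college$Apps) reproduces the value above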
The glmnet() function has an alpha argument that determines what type of model is fit. If alpha=0 then a ridge regression model is fit. By default the glmnet() function performs ridge regression for an automatically selected range of \(\lambda\) values. However, here I have chosen to implement the function over a grid of values ranging from \(\lambda\)= \(10^{10}\) to \(\lambda\) = \(10^{-2}\), essentially covering the full range of scenarios from the null model containing only the intercept, to the least squares fit.
set.seed(777)
grid <- 10 ^ seq(10, -2, length = 100)
#Convert the train and test data sets into matrices, as they contain a categorical variable.
train.college.mat <- model.matrix(Apps ~ ., data = train.college)
test.college.mat <- model.matrix(Apps ~ ., data = test.college)
#Fit the ridge regression using the above train matrix with the above grid
ridge.college.fit <- glmnet(train.college.mat, train.college$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
#Fit the ridge regression using cross-validation for the same train matrix with the above grid
ridge.college.cv <- cv.glmnet(train.college.mat, train.college$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
#Find the minimum lambda
ridge.college.bestlambda <- ridge.college.cv$lambda.min
ridge.college.bestlambda
## [1] 0.01
The best \(\lambda\) is 0.01.
ridge.college.pred <- predict(ridge.college.fit, s = ridge.college.bestlambda, newx = test.college.mat)
ridge.college.mse <- mean((ridge.college.pred - test.college$Apps)^2)
ridge.college.mse
## [1] 1346842
The test error is 1346842 (1.346842e+06), which is slightly lower than that of the linear model.
The glmnet() function has an alpha argument that determines what type of model is fit. If alpha=1 then a lasso regression model is fit.
#Fit the lasso regression
lasso.college.fit <- glmnet(train.college.mat, train.college$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
#Fit the lasso regression using cross-validation for the same train matrix with the above grid
lasso.college.cv <- cv.glmnet(train.college.mat, train.college$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
#Find the minimum lambda
lasso.college.bestlambda <- lasso.college.cv$lambda.min
lasso.college.bestlambda
## [1] 0.01
The best \(\lambda\) is 0.01.
lasso.college.pred <- predict(lasso.college.fit, s = lasso.college.bestlambda, newx = test.college.mat)
lasso.college.mse <- mean((lasso.college.pred - test.college$Apps)^2)
lasso.college.mse
## [1] 1346806
The test error is 1346806 (1.346806e+06), which is slightly lower than both the linear model and the ridge model.
The coefficient estimates are:
predict(lasso.college.fit, s = lasso.college.bestlambda, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -3.082576e+02
## (Intercept) .
## PrivateYes -4.252529e+02
## Accept 1.699384e+00
## Enroll -1.220297e+00
## Top10perc 4.655390e+01
## Top25perc -1.217544e+01
## F.Undergrad 8.967979e-02
## P.Undergrad 3.431486e-02
## Outstate -1.205147e-01
## Room.Board 1.119671e-01
## Books 2.541690e-02
## Personal -7.734325e-03
## PhD -6.091927e+00
## Terminal -5.324239e+00
## S.F.Ratio 5.835626e+00
## perc.alumni 3.990627e-01
## Expend 1.186827e-01
## Grad.Rate 9.111712e+00
All 17 predictors have non-zero coefficient estimates, so at \(\lambda\) = 0.01 the lasso does not drop any variables; the only zero entry is the redundant second intercept row created by model.matrix().
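As a quick check (a sketch reusing the fit above), the non-zero estimates can be counted directly; the first two rows of the coefficient matrix are the intercept entries (one from glmnet(), one from the intercept column added by model.matrix()), so they are dropped before counting.
lasso.college.coef <- as.matrix(predict(lasso.college.fit, s = lasso.college.bestlambda, type = "coefficients"))
#Count predictors with non-zero coefficient estimates (drops the two intercept rows)
sum(lasso.college.coef[-(1:2), ] != 0)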
set.seed(777)
pcr.college.fit <- pcr(Apps ~ ., data = train.college, scale = TRUE, validation = "CV")
summary(pcr.college.fit)
## Data: X dimension: 544 17
## Y dimension: 544 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4006 4012 2215 2213 1786 1749 1666
## adjCV 4006 4014 2212 2215 1746 1745 1661
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1648 1642 1603 1599 1607 1614 1615
## adjCV 1640 1638 1597 1594 1602 1608 1609
## 14 comps 15 comps 16 comps 17 comps
## CV 1619 1531 1211 1136
## adjCV 1615 1498 1200 1128
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.9439 57.85 65.04 70.77 76.17 81.04 84.56 87.92
## Apps 0.5369 70.31 70.56 81.81 82.47 84.01 84.48 84.54
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.87 93.20 95.16 96.94 97.97 98.78 99.39
## Apps 85.51 85.67 85.79 85.80 85.89 86.05 91.95
## 16 comps 17 comps
## X 99.84 100.00
## Apps 93.21 93.72
validationplot(pcr.college.fit, val.type = "MSEP")
From the cross-validation plot above, the selected value of M is 17, so there is no dimension reduction (the lowest MSEP occurs at 17 components).
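This can be confirmed programmatically from the fitted object (a short sketch; MSEP() is from the pls package and its val array begins with the intercept-only model, so 1 is subtracted from the index).
pcr.cv.mse <- MSEP(pcr.college.fit, estimate = "CV")$val[1, 1, ]
#Index of the smallest CV MSEP, minus 1 to convert to a component count
which.min(pcr.cv.mse) - 1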
Running the prediction with the number of components set to 17 (that is, M = 17).
pcr.college.pred <- predict(pcr.college.fit, test.college, ncomp = 17)
pcr.college.mse <- mean((pcr.college.pred - test.college$Apps)^2)
pcr.college.mse
## [1] 1346878
The test error is 1.346878e+06, that’s 1346878, which is same as compare to the Linear Model with least squares in (b). This is because when M=p and all the linear combinations (\(Z_m\)) are linearly independent, then poses no constraints. In this case, no dimension occurs, and so fitting is equivalent to performing least squares on the original p predictors (reference from ISLR book).
set.seed(777)
pls.college.fit <- plsr(Apps ~ ., data = train.college, scale = TRUE, validation = "CV")
summary(pls.college.fit)
## Data: X dimension: 544 17
## Y dimension: 544 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4006 2001 1576 1532 1456 1333 1194
## adjCV 4006 1997 1565 1527 1435 1313 1183
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1171 1152 1144 1140 1139 1138 1135
## adjCV 1162 1144 1136 1132 1131 1130 1128
## 14 comps 15 comps 16 comps 17 comps
## CV 1136 1136 1136 1136
## adjCV 1128 1128 1128 1128
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.78 34.0 63.06 65.98 69.99 74.39 78.60 81.10
## Apps 76.41 86.6 87.66 91.10 92.82 93.40 93.49 93.56
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.27 86.67 89.72 91.28 92.90 94.41 97.25
## Apps 93.64 93.68 93.70 93.72 93.72 93.72 93.72
## 16 comps 17 comps
## X 98.34 100.00
## Apps 93.72 93.72
validationplot(pls.college.fit, val.type = "MSEP")
The exact value of M is hard to read from the cross-validation plot above, so let's extract the cross-validation MSEs for every number of components.
#MASS masks dplyr::select, so point select to dplyr's version
select <- dplyr::select
pls.model.cse <- MSEP(pls.college.fit, estimate = "CV")$val %>%
  reshape2::melt() %>%
  mutate(M = 0:(nrow(.) - 1)) %>%
  select(M, value) %>%
  rename(CV_MSE = value)
pls.model.cse
## M CV_MSE
## 1 0 16050223
## 2 1 4005362
## 3 2 2483826
## 4 3 2346362
## 5 4 2119328
## 6 5 1777693
## 7 6 1425126
## 8 7 1372166
## 9 8 1327818
## 10 9 1309405
## 11 10 1299834
## 12 11 1297801
## 13 12 1295053
## 14 13 1289172
## 15 14 1290366
## 16 15 1290629
## 17 16 1290838
## 18 17 1290708
From the MSE table above, M = 13 gives the lowest cross-validation MSE (1289172).
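The same value can be read off programmatically (a one-line check using the table built above).
#Component count with the lowest cross-validation MSE
pls.model.cse$M[which.min(pls.model.cse$CV_MSE)]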
Running the prediction with the number of components set to 13 (that is, M = 13).
pls.college.pred <- predict(pls.college.fit, test.college, ncomp = 13)
pls.college.mse <- mean((pls.college.pred - test.college$Apps)^2)
pls.college.mse
## [1] 1342229
The test error is 1342229 (1.342229e+06), which is smaller than that of all the other regression models.
To compare the results obtained from the five approaches, we calculate the test \(R^2\) for each model and compare them.
\[R^2 = 1 - \frac{RSS}{TSS}\]
Where RSS is sum of squares of residuals and TSS is total sum of squares.
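This can be written as a small helper; test_r2 is a hypothetical name, equivalent to the line-by-line calculations below.
#Hypothetical helper: test R^2 = 1 - RSS/TSS
test_r2 <- function(pred, actual) {
  rss <- sum((actual - pred)^2)
  tss <- sum((actual - mean(actual))^2)
  1 - rss / tss
}
#e.g. test_r2(pls.college.pred, test.college$Apps)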
total.sum.squares <- sum((test.college$Apps - mean(test.college$Apps))^2)
lm.res.sum.squares <- sum((test.college$Apps - lm.college.pred)^2)
ridge.res.sum.squares <- sum((test.college$Apps - ridge.college.pred)^2)
lasso.res.sum.squares <- sum((test.college$Apps - lasso.college.pred)^2)
pcr.res.sum.squares <- sum((test.college$Apps - pcr.college.pred)^2)
pls.res.sum.squares <- sum((test.college$Apps - pls.college.pred)^2)
lm.r2 <- 1 - (lm.res.sum.squares / total.sum.squares)
ridge.r2 <- 1 - (ridge.res.sum.squares / total.sum.squares)
lasso.r2 <- 1 - (lasso.res.sum.squares / total.sum.squares)
pcr.r2 <- 1 - (pcr.res.sum.squares / total.sum.squares)
pls.r2 <- 1 - (pls.res.sum.squares / total.sum.squares)
df.mse.r2 <- data.frame(method = c("Least Square", "Ridge", "Lasso", "PCR", "PLS"),
Test_MSE = c(lm.college.mse, ridge.college.mse, lasso.college.mse, pcr.college.mse, pls.college.mse),
Test_R2 = c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2))
df.mse.r2
## method Test_MSE Test_R2
## 1 Least Square 1346878 0.8925678
## 2 Ridge 1346842 0.8925707
## 3 Lasso 1346806 0.8925736
## 4 PCR 1346878 0.8925678
## 5 PLS 1342229 0.8929387
Across all five approaches the differences in test MSE and test \(R^2\) are very marginal, and each approach achieves an \(R^2\) of approximately 89.26%, meaning the models explain about 89.26% of the variance in the number of college applications received. Among the five approaches, PLS has the lowest test MSE and a marginally higher \(R^2\).
Boston dataset.
#Below code has been referenced and/or sourced from the Lab section of the ISLR book
set.seed(1)
#Since regsubsets() does not have a predict() method, the function below will be used for that purpose
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
#Try to choose among the models of different sizes using cross-validation.
#Perform best subset selection within each of the k training sets.
#Create a vector that allocates each observation to one of k = 10 folds, and
# create a matrix in which we will store the results.
k = 10
folds <- sample(1:k, nrow(Boston), replace = TRUE)
cv.errors <- matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))
#Predict for each model size (using our new predict() method)
#Compute the test errors on the appropriate subset, and store them in
#the appropriate slot in the matrix cv.errors.
for (j in 1:k) {
  best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
  for (i in 1:13) {
    pred <- predict(best.fit, Boston[folds == j, ], id = i)
    cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
  }
}
#use the apply() function to average over the columns of the matrix of cv.errors
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948
## 9 10 11 12 13
## 42.53822 42.73416 42.52367 42.46014 42.50125
#Plot the number of variables with the mean cv errors
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")
We see that cross-validation selects a 12-variable model using the best subset selection approach, with an estimated MSE of 42.46014.
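The selected size and its error can be read directly from the mean.cv.errors vector computed above.
#Model size with the lowest cross-validation error, and that error
which.min(mean.cv.errors)
min(mean.cv.errors)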
Boston dataset.
set.seed(1)
x <- model.matrix(crim ~ ., Boston)[, -1]
y <- Boston$crim
#Alpha = 1 is for the lasso approach with glmnet()
cv.out.lasso <- cv.glmnet(x, y, alpha = 1, type.measure = "mse")
plot(cv.out.lasso)
bestlambda.boston.lasso <- cv.out.lasso$lambda.min
sprintf("λ = %f", bestlambda.boston.lasso)
## [1] "λ = 0.056309"
cv.out.lasso
##
## Call: cv.glmnet(x = x, y = y, type.measure = "mse", alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.0563 50 42.52 13.53 11
## 1se 3.0758 7 55.33 17.14 1
lasso.coef <- predict(cv.out.lasso, type="coefficients", s = bestlambda.boston.lasso )[1:14 ,]
lasso.coef
## (Intercept) zn indus chas nox rm
## 12.319178080 0.035726832 -0.068876055 -0.577832639 -6.631559462 0.208676937
## age dis rad tax ptratio black
## 0.000000000 -0.768388824 0.512333871 0.000000000 -0.179631375 -0.007551172
## lstat medv
## 0.124630014 -0.154550130
We see that cross-validation selects an 11-variable model (11 variables have non-zero coefficients) using the lasso approach with λ = 0.056309, and the estimated MSE is 42.52.
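The count of non-zero predictors can be confirmed from the coefficient vector above (excluding the intercept).
#Number of predictors with non-zero lasso coefficients
sum(lasso.coef[-1] != 0)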
Boston dataset.
set.seed(1)
#Alpha = 0 is for the ridge approach with glmnet()
cv.out.ridge <- cv.glmnet(x, y, alpha = 0, type.measure = "mse")
plot(cv.out.ridge)
bestlambda.boston.ridge <- cv.out.ridge$lambda.min
sprintf("λ = %f", bestlambda.boston.ridge)
## [1] "λ = 0.537499"
cv.out.ridge
##
## Call: cv.glmnet(x = x, y = y, type.measure = "mse", alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.54 100 42.71 13.71 13
## 1se 56.31 50 55.80 17.00 13
ridge.coef <- predict(cv.out.ridge, type="coefficients", s = bestlambda.boston.ridge )[1:14 ,]
ridge.coef
## (Intercept) zn indus chas nox rm
## 9.063048626 0.033002416 -0.082046152 -0.737684583 -5.393098481 0.335972073
## age dis rad tax ptratio black
## 0.001962473 -0.702123641 0.422779054 0.003400607 -0.135911587 -0.008483285
## lstat medv
## 0.142613436 -0.139604127
We see that cross-validation selects a 13-variable model (all 13 variables have non-zero coefficients, since ridge regression does not set coefficients exactly to zero) using the ridge approach with λ = 0.537499, and the estimated MSE is 42.71.
Boston dataset.
set.seed(1)
pcr.boston.fit <- pcr(crim ~ ., data=Boston, scale=TRUE, validation ="CV")
summary(pcr.boston.fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.250 7.253 6.833 6.815 6.826 6.847
## adjCV 8.61 7.245 7.247 6.825 6.803 6.818 6.838
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.837 6.710 6.735 6.723 6.714 6.696 6.624
## adjCV 6.827 6.698 6.724 6.710 6.702 6.682 6.609
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
validationplot(pcr.boston.fit, val.type= "MSEP")
From the plot above, the lowest cross-validation error occurs at 13 components, so the PCR approach selects a 13-component model (that is, M = 13, with no dimension reduction). The pcr() summary reports the root mean squared error; to obtain the usual MSE we must square this quantity. Since the CV error at 13 components is 6.624, the cross-validation MSE is \(6.624^2\) = 43.877376.
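The squaring can also be done directly on the cross-validated RMSEP values (a sketch; element 14 of the component dimension corresponds to 13 components, after the intercept-only entry).
#Cross-validated RMSEP at 13 components, squared to give the CV MSE
pcr.boston.rmsep <- RMSEP(pcr.boston.fit, estimate = "CV")$val[1, 1, 14]
pcr.boston.rmsep^2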
Boston dataset.
set.seed(1)
pls.boston.fit <- plsr(crim ~ ., data=Boston, scale=TRUE, validation ="CV")
summary(pcr.boston.fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.250 7.253 6.833 6.815 6.826 6.847
## adjCV 8.61 7.245 7.247 6.825 6.803 6.818 6.838
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.837 6.710 6.735 6.723 6.714 6.696 6.624
## adjCV 6.827 6.698 6.724 6.710 6.702 6.682 6.609
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
validationplot(pls.boston.fit, val.type= "MSEP")
From the plot above, the lowest cross-validation error occurs at 13 components, so the PLS approach selects a 13-component model (that is, M = 13, with no dimension reduction). The plsr() summary reports the root mean squared error; to obtain the usual MSE we must square this quantity. Since the CV error at 13 components is 6.624, the cross-validation MSE is \(6.624^2\) = 43.877376, the same as for PCR.
Comparing the cross-validation errors from all the approaches in (a), the lowest error (42.46014) comes from the best subset selection approach, whose final model contains 12 variables.
No. The chosen model (best subset selection) involves only 12 of the 13 features in the Boston data set, because that model gives the lowest cross-validation error.
best.boston.fit <- regsubsets(crim ~ ., data = Boston ,nvmax = ncol(Boston) - 1)
reg.summary <- summary(best.boston.fit)
reg.summary$rsq
## [1] 0.3912567 0.4207965 0.4286123 0.4334892 0.4392738 0.4440173 0.4476594
## [8] 0.4504606 0.4524408 0.4530572 0.4535605 0.4540031 0.4540104
Looking at the \(R^2\) values as well, the statistic increases from 39.13% (1 variable) to 45.40% (12 variables), and there is essentially no further increase between 12 and 13 variables. So the chosen model uses 12 features.
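To list the 12 predictors in the chosen model, the coefficients of the 12-variable fit can be extracted from the best.boston.fit object above (a short sketch; the intercept is dropped from the names).
#Names of the 12 predictors selected by best subset selection
names(coef(best.boston.fit, id = 12))[-1]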