iii: Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
The lasso limits the number of predictors by forcing some coefficients to exactly 0, reducing variance at the cost of an increase in bias. We therefore get a better test error rate when the increase in bias is smaller than the decrease in variance.
Ridge regression behaves similarly, except that it only shrinks the coefficients towards 0, whereas the lasso sets some of them exactly to 0. The shrinkage again trades a reduction in variance for an increase in bias, so ridge regression works best in situations where the OLS estimates have high variance.
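To make the contrast concrete, here is a minimal sketch on simulated data (not the College data used below) showing that the lasso sets some coefficients exactly to zero while ridge only shrinks them; the variable names and the choice of lambda = 0.5 are purely illustrative.
# Sketch on simulated data: 20 predictors, only 3 with true signal
library(glmnet)
set.seed(1)
x_sim <- matrix(rnorm(100 * 20), 100, 20)
y_sim <- x_sim[, 1] - 2 * x_sim[, 2] + 0.5 * x_sim[, 3] + rnorm(100)
# Same penalty strength, different penalties
lasso_sim <- glmnet(x_sim, y_sim, alpha = 1, lambda = 0.5)
ridge_sim <- glmnet(x_sim, y_sim, alpha = 0, lambda = 0.5)
# The lasso zeroes out many coefficients; ridge keeps every coefficient non-zero, just smaller
sum(coef(lasso_sim) != 0)
sum(coef(ridge_sim) != 0)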
# Load ISLR library and check college dataset structure
library(ISLR)
?College
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
# set the seed first
set.seed(1)
# Get the total observations count into n
n = nrow(College)
# create a train vector with half of the observations
train <- sample(1:n, n/2)
test <- -train
# create two datasets train and test
College.train <- College[train, ]
College.test <- College[test, ]
# Just double check the size of train dataset
nrow(College.train)
## [1] 388
# Fit the least squares model on the training set with Apps (number of applications received) as the response
fit.lm <- lm(Apps ~ ., data = College.train)
# Use the predict function to get predictions on the test set
pred.lm <- predict(fit.lm, College.test)
# Square the prediction errors and average them to get the test MSE
MSE_lm = mean((pred.lm - College.test$Apps)^2)
MSE_lm
## [1] 1135758
# Load the library glmnet
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.2
## Loading required package: Matrix
## Loaded glmnet 4.0-2
# Build the model matrices for the training and test sets
train.mat <- model.matrix(Apps ~ ., data = College.train)
test.mat <- model.matrix(Apps ~ ., data = College.test)
# set the grid
grid <- 10 ^ seq(10, -2, length = 100)
# Fit the ridge regression model on training dataset
fit.ridge <- glmnet(train.mat, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
# Cross-validation on train dataset
cv.ridge <- cv.glmnet(train.mat, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
# Get the best lambda from cross-validation
bestlam.ridge <- cv.ridge$lambda.min
# Print best lambda
bestlam.ridge
## [1] 0.01
# Get the test MSE for ridge regression
pred.ridge <- predict(fit.ridge, s = bestlam.ridge, newx = test.mat)
MSE_ridge = mean((pred.ridge - College.test$Apps)^2)
MSE_ridge
## [1] 1135714
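Note that the selected lambda (0.01) is the smallest value in our grid, so the ridge fit is essentially unpenalized, which explains why its test MSE is almost identical to the least squares MSE. As a quick check, here is a sketch (using the same train.mat and test.mat objects from above) that lets cv.glmnet choose its own lambda sequence instead:
# Sketch: re-run cross-validation with glmnet's default lambda sequence instead of our grid
set.seed(1)
cv.ridge.default <- cv.glmnet(train.mat, College.train$Apps, alpha = 0)
cv.ridge.default$lambda.min
# Test MSE at the lambda chosen from the default sequence
pred.ridge.default <- predict(cv.ridge.default, s = "lambda.min", newx = test.mat)
mean((pred.ridge.default - College.test$Apps)^2)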
# Fit the lasso using the same train and test matrices; set alpha = 1 and find the best lambda via cross-validation
fit.lasso <- glmnet(train.mat, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv.lasso <- cv.glmnet(train.mat, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam.lasso <- cv.lasso$lambda.min
bestlam.lasso
## [1] 0.01
pred.lasso <- predict(fit.lasso, s = bestlam.lasso, newx = test.mat)
MSE_lasso = mean((pred.lasso - College.test$Apps)^2)
MSE_lasso
## [1] 1135659
# Get all the coefficients
predict(fit.lasso, s = bestlam.lasso, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -7.900392e+02
## (Intercept) .
## PrivateYes -3.070090e+02
## Accept 1.779328e+00
## Enroll -1.469512e+00
## Top10perc 6.672205e+01
## Top25perc -2.230441e+01
## F.Undergrad 9.259049e-02
## P.Undergrad 9.408441e-03
## Outstate -1.083495e-01
## Room.Board 2.115144e-01
## Books 2.912138e-01
## Personal 6.120206e-03
## PhD -1.547169e+01
## Terminal 6.409278e+00
## S.F.Ratio 2.282635e+01
## perc.alumni 1.130513e+00
## Expend 4.856698e-02
## Grad.Rate 7.488072e+00
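With such a small lambda, the lasso keeps essentially every predictor (the second "(Intercept)" row is the intercept column of the model matrix, which glmnet drops, hence the dot). A quick sketch, using the objects already defined above, to count the retained predictors instead of scanning the table:
# Sketch: count the non-zero predictor coefficients at the chosen lambda
lasso.coef <- predict(fit.lasso, s = bestlam.lasso, type = "coefficients")
sum(lasso.coef != 0) - 1   # subtract 1 for the intercept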
# Load PLS library
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
# Fit PCR on the training data; scale the predictors and use cross-validation
fit.pcr <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
# Build the CV plot
validationplot(fit.pcr, val.type = "MSEP")
The lowest cross-validation error occurs at M = 17, but that uses all of the components and so provides no dimension reduction. The plot above shows a local minimum at M = 10, where the CV error is already close to its lowest value, so we use M = 10 to compute the test MSE.
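Rather than reading M off the plot, we can also pull the cross-validated errors out of the fitted object. This is a sketch: the indexing assumes the default array layout returned by pls::RMSEP, and selectNcomp is assumed to be available in the installed pls version.
# Sketch: extract the CV RMSEP for 0..17 components and find the minimum programmatically
cv.rmsep <- RMSEP(fit.pcr)$val["CV", 1, ]
which.min(cv.rmsep) - 1   # subtract 1 because the first entry is the intercept-only model
# pls also offers a one-standard-error heuristic for a more parsimonious M
selectNcomp(fit.pcr, method = "onesigma")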
summary(fit.pcr)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 4054 2422 2432 2117 2022 1959
## adjCV 4288 4051 2415 2426 2061 1999 1948
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1957 1955 1911 1852 1853 1856 1861
## adjCV 1947 1947 1900 1840 1843 1846 1851
## 14 comps 15 comps 16 comps 17 comps
## CV 1877 1679 1338 1269
## adjCV 1876 1649 1323 1256
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.20 57.78 65.31 70.99 76.37 81.27 84.8 87.85
## Apps 13.44 70.93 71.07 79.87 81.15 82.25 82.3 82.33
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.62 92.91 94.98 96.74 97.79 98.72 99.42
## Apps 83.38 84.76 84.80 84.84 85.11 85.14 90.55
## 16 comps 17 comps
## X 99.88 100.00
## Apps 93.42 93.89
# Predict on the test set with M = 10 components and compute the test MSE
pred.pcr <- predict(fit.pcr, College.test, ncomp = 10)
MSE_pcr = mean((pred.pcr - College.test$Apps)^2)
MSE_pcr
## [1] 1723100
# Fit the PLS model with scale = TRUE and plot the cross-validation results
fit.pls <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pls, val.type = "MSEP")
From the chart above, the cross-validation error is already low by M = 10 and decreases only gradually after that. Since using all 17 components would amount to no dimension reduction, we use M = 10 to compute the test MSE.
# Get the PLS MSE
pred.pls <- predict(fit.pls, College.test, ncomp = 10)
MSE_pls = mean((pred.pls - College.test$Apps)^2)
MSE_pls
## [1] 1131661
# Get all the MSEs into a matrix and print it
MSE_all <- matrix(c(MSE_lm,MSE_ridge,MSE_lasso,MSE_pcr,MSE_pls))
dimnames(MSE_all) = list(c("MSE_lm","MSE_ridge","MSE_lasso","MSE_pcr","MSE_pls"))
MSE_all
## [,1]
## MSE_lm 1135758
## MSE_ridge 1135714
## MSE_lasso 1135659
## MSE_pcr 1723100
## MSE_pls 1131661
Based on the MSEs above, PCR (principal components regression) gives the largest test MSE. The remaining models have very similar MSEs, with partial least squares (PLS) performing slightly best.
Let's confirm this by also computing the R-squared values on the test data set.
# Load miscTools library
library(miscTools)
# Calculate R2 for all Regression model predictions for test data set
R2_lm <- rSquared(College.test$Apps, resid = as.matrix(College.test$Apps - pred.lm))
R2_ridge <- rSquared(College.test$Apps, resid = as.matrix(College.test$Apps - pred.ridge))
R2_lasso <- rSquared(College.test$Apps, resid = as.matrix(College.test$Apps - pred.lasso))
R2_pcr <- rSquared(College.test$Apps, resid = as.matrix(College.test$Apps - pred.pcr))
R2_pls <- rSquared(College.test$Apps, resid = as.matrix(College.test$Apps - pred.pls))
# Get all the RSquared into a matrix and print it
R2_all <- matrix(c(R2_lm,R2_ridge,R2_lasso,R2_pcr,R2_pls))
dimnames(R2_all) = list(c("R2_lm","R2_ridge","R2_lasso","R2_pcr","R2_pls"))
R2_all
## [,1]
## R2_lm 0.9015413
## R2_ridge 0.9015452
## R2_lasso 0.9015499
## R2_pcr 0.8506248
## R2_pls 0.9018965
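As a sanity check, the test R-squared can also be computed directly as 1 - RSS/TSS without the miscTools package; the sketch below does this for the PLS predictions and should match R2_pls above.
# Sketch: test R-squared for PLS computed by hand
rss <- sum((College.test$Apps - pred.pls)^2)
tss <- sum((College.test$Apps - mean(College.test$Apps))^2)
1 - rss / tss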
Based on the MSEs and the R-squared values, PCR is the worst fit of all; the remaining models differ little in either measure.
So we can conclude that all models except PCR predict the number of college applications received with high accuracy.
library(MASS)
## Warning: package 'MASS' was built under R version 3.6.2
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Load the leaps library to use the subset selection methods
library(leaps)
Let's start with the best subset selection method.
set.seed(1)
# Create our own predict method for regsubsets, since leaps does not provide one
predict.regsubsets <- function(object, newdata, id, ...) {
  # Recover the model formula from the original regsubsets call
  form <- as.formula(object$call[[2]])
  # Build the model matrix for the new data
  mat <- model.matrix(form, newdata)
  # Coefficients of the best model with 'id' predictors
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  # Multiply the matching columns by the coefficients to get predictions
  mat[, xvars] %*% coefi
}
# set k to 10 for 10-fold cross validation
k = 10
p = ncol(Boston) - 1
# Create the folds by randomly assigning each observation to one of the k folds
folds <- sample(1:k, nrow(Boston), replace = TRUE)
# Initialize the matrix of CV errors (one row per fold, one column per model size)
cv.errors <- matrix(NA, k, p, dimnames = list(NULL, paste(1:p)))
# Loop through the folds: fit best subset selection on the training folds and
# compute the test MSE of each model size on the held-out fold
for (j in 1:k) {
  best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = p)
  for (i in 1:p) {
    pred <- predict(best.fit, Boston[folds == j, ], id = i)
    cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
  }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "mean.cv.errors")
The lowest CV error is at 12 variables.
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948
## 9 10 11 12 13
## 42.53822 42.73416 42.52367 42.46014 42.50125
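A short sketch to identify the best model size programmatically instead of reading it off the printed vector:
# Sketch: model size with the lowest mean CV error, and the corresponding error
which.min(mean.cv.errors)
min(mean.cv.errors)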
For the 12-variable model with the lowest CV error, the estimated MSE is 42.46014.
MSE_ss_11 = 42.46014
Let's fit the lasso model.
x <- model.matrix(crim ~ ., Boston)[, -1]
y <- Boston$crim
cv.out <- cv.glmnet(x, y, alpha = 1, type.measure = "mse")
plot(cv.out)
cv.out
##
## Call: cv.glmnet(x = x, y = y, type.measure = "mse", alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Measure SE Nonzero
## min 0.051 43.11 14.16 11
## 1se 3.376 56.89 17.29 1
Cross-validation selects λ = 0.051, with a CV estimate of the test MSE equal to 43.11.
MSE_lasso_11 = 43.11
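Rather than hardcoding these numbers, the chosen lambda and its CV error can be pulled straight out of the cv.glmnet object; a sketch for the lasso fit above (the same pattern applies to the ridge fit below):
# Sketch: extract lambda.min and the CV estimate of the MSE at that lambda
cv.out$lambda.min
min(cv.out$cvm)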
Let's fit the ridge regression model now.
cv.out <- cv.glmnet(x, y, alpha = 0, type.measure = "mse")
plot(cv.out)
cv.out
##
## Call: cv.glmnet(x = x, y = y, type.measure = "mse", alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Measure SE Nonzero
## min 0.54 43.48 14.33 13
## 1se 74.44 57.69 16.76 13
Cross-validation selects λ = 0.54, with a CV estimate of the test MSE equal to 43.48.
MSE_ridge_11 = 43.48
Let's fit PCR now.
pcr.fit <- pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.198 7.198 6.786 6.762 6.790 6.821
## adjCV 8.61 7.195 7.195 6.780 6.753 6.784 6.813
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.822 6.689 6.712 6.720 6.712 6.664 6.593
## adjCV 6.812 6.679 6.701 6.708 6.700 6.651 6.580
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
Let's build the validation plot.
validationplot(pcr.fit, val.type = "MSEP")
The validation plot shows that the lowest CV error requires all 13 components, so PCR gives no dimension reduction here, although we would ideally expect only a few components to be needed. Since the CV error has essentially leveled off by 4 components, and those components already explain a reasonable share of the variance, we use M = 4 for the MSE calculation.
pcr.pred <- predict(pcr.fit, Boston, ncomp = 4)
MSE_pcr_11 = mean((pcr.pred - Boston$crim)^2)
MSE_pcr_11
## [1] 44.5939
The MSE for the PCR model with M = 4 is 44.5939; note that this is computed on the full data set that the model was fit on, not on a held-out test set.
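For a fairer comparison with the other CV-based estimates, we could instead take the cross-validated error at M = 4 from the pcr object itself; this is a sketch, and the indexing assumes the default structure returned by pls::RMSEP.
# Sketch: CV estimate of the MSE at 4 components (squaring the CV RMSEP)
RMSEP(pcr.fit)$val["CV", 1, "4 comps"]^2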
# Get all the MSEs into a matrix and print it
MSE_all_11 <- matrix(c(MSE_ss_11,MSE_ridge_11,MSE_lasso_11,MSE_pcr_11))
dimnames(MSE_all_11) = list(c("MSE_subsets","MSE_ridge","MSE_lasso","MSE_pcr"))
MSE_all_11
## [,1]
## MSE_subsets 42.46014
## MSE_ridge 43.48000
## MSE_lasso 43.11000
## MSE_pcr 44.59390
Based on the MSEs above, each taken at its lowest cross-validation error, best subset selection performs best on this data set.
reg.best <- regsubsets(crim ~ ., data = Boston, nvmax = 13)
coef(reg.best, 12)
## (Intercept) zn indus chas nox
## 16.985713928 0.044673247 -0.063848469 -0.744367726 -10.202169211
## rm dis rad tax ptratio
## 0.439588002 -0.993556631 0.587660185 -0.003767546 -0.269948860
## black lstat medv
## -0.007518904 0.128120290 -0.198877768
The chosen model does not include every feature in the data set because not all predictors are strongly related to the response; using all of them would likely hurt test performance by overfitting the model.
The full data set has 13 predictors, and the CV error is lowest at 12, so we keep 12 variables in the final best subset model. The predictor "age" (proportion of owner-occupied units built prior to 1940) did not make it into the final model, as the sketch below confirms. In addition, the coefficients for tax and black are close to 0, so those predictors have little influence on the response.
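A small sketch to confirm which predictor is excluded from the 12-variable model, instead of checking the coefficient listing by eye:
# Sketch: the predictor that does not appear in the chosen 12-variable model
all.preds <- setdiff(names(Boston), "crim")
setdiff(all.preds, names(coef(reg.best, 12)))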