iii: Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
The lasso limits the number of predictors by forcing some coefficients to exactly 0, reducing variance at the cost of an increase in bias. We therefore get a better test error rate when the increase in bias is smaller than the decrease in variance.
Ridge regression behaves similarly, except that it only shrinks the coefficients towards 0, whereas the lasso sets some of them exactly to 0. The shrinkage again trades a reduction in variance for an increase in bias, so ridge regression works best in situations where the OLS estimates have high variance.
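To make the contrast concrete, here is a minimal sketch on simulated data (not the College data used below) showing that the lasso sets some coefficients exactly to zero while ridge only shrinks them; the variable names and the choice of lambda = 0.5 are purely illustrative.
# Sketch on simulated data: 20 predictors, only 3 with true signal
library(glmnet)
set.seed(1)
x_sim <- matrix(rnorm(100 * 20), 100, 20)
y_sim <- x_sim[, 1] - 2 * x_sim[, 2] + 0.5 * x_sim[, 3] + rnorm(100)
# Same penalty strength, different penalties
lasso_sim <- glmnet(x_sim, y_sim, alpha = 1, lambda = 0.5)
ridge_sim <- glmnet(x_sim, y_sim, alpha = 0, lambda = 0.5)
# The lasso zeroes out many coefficients; ridge keeps every coefficient non-zero, just smaller
sum(coef(lasso_sim) != 0)
sum(coef(ridge_sim) != 0)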
# Load ISLR library and check college dataset structure
library(ISLR)
?College
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
# set the seed first
set.seed(1)
# Get the total observations count into n
n = nrow(College)
# create a train vector with half of the observations
train <- sample(1:n, n/2)
test <- -train
# create two datasets train and test
College.train <- College[train, ]
College.test <- College[test, ]
# Just double check the size of train dataset
nrow(College.train)
## [1] 388
# Fit the least squares model on the training set with Apps (number of applications received) as the response
fit.lm <- lm(Apps ~ ., data = College.train)
# Use the predict function to get predictions on the test set
pred.lm <- predict(fit.lm, College.test)
# Square the prediction errors and average them to get the test MSE
MSE_lm = mean((pred.lm - College.test$Apps)^2)
MSE_lm
## [1] 1135758
# Load the library glmnet
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.2
## Loading required package: Matrix
## Loaded glmnet 4.0-2
# Build the model matrices for the training and test sets
train.mat <- model.matrix(Apps ~ ., data = College.train)
test.mat <- model.matrix(Apps ~ ., data = College.test)
# set the grid
grid <- 10 ^ seq(10, -2, length = 100)
# Fit the ridge regression model on training dataset
fit.ridge <- glmnet(train.mat, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
# Cross-validation on train dataset
cv.ridge <- cv.glmnet(train.mat, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
# Get the best lambda from cross-validation
bestlam.ridge <- cv.ridge$lambda.min
# Print best lambda
bestlam.ridge
## [1] 0.01
# Get the test MSE for ridge regression
pred.ridge <- predict(fit.ridge, s = bestlam.ridge, newx = test.mat)
MSE_ridge = mean((pred.ridge - College.test$Apps)^2)
MSE_ridge
## [1] 1135714
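Note that the selected lambda (0.01) is the smallest value in our grid, so the ridge fit is essentially unpenalized, which explains why its test MSE is almost identical to the least squares MSE. As a quick check, here is a sketch (using the same train.mat and test.mat objects from above) that lets cv.glmnet choose its own lambda sequence instead:
# Sketch: re-run cross-validation with glmnet's default lambda sequence instead of our grid
set.seed(1)
cv.ridge.default <- cv.glmnet(train.mat, College.train$Apps, alpha = 0)
cv.ridge.default$lambda.min
# Test MSE at the lambda chosen from the default sequence
pred.ridge.default <- predict(cv.ridge.default, s = "lambda.min", newx = test.mat)
mean((pred.ridge.default - College.test$Apps)^2)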
# Fit the lasso using the same train and test matrices; set alpha = 1 and find the best lambda via cross-validation
fit.lasso <- glmnet(train.mat, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv.lasso <- cv.glmnet(train.mat, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam.lasso <- cv.lasso$lambda.min
bestlam.lasso
## [1] 0.01
pred.lasso <- predict(fit.lasso, s = bestlam.lasso, newx = test.mat)
MSE_lasso = mean((pred.lasso - College.test$Apps)^2)
MSE_lasso
## [1] 1135659
# Get all the coefficients
predict(fit.lasso, s = bestlam.lasso, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -7.900392e+02
## (Intercept) .
## PrivateYes -3.070090e+02
## Accept 1.779328e+00
## Enroll -1.469512e+00
## Top10perc 6.672205e+01
## Top25perc -2.230441e+01
## F.Undergrad 9.259049e-02
## P.Undergrad 9.408441e-03
## Outstate -1.083495e-01
## Room.Board 2.115144e-01
## Books 2.912138e-01
## Personal 6.120206e-03
## PhD -1.547169e+01
## Terminal 6.409278e+00
## S.F.Ratio 2.282635e+01
## perc.alumni 1.130513e+00
## Expend 4.856698e-02
## Grad.Rate 7.488072e+00
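With such a small lambda, the lasso keeps essentially every predictor (the second "(Intercept)" row is the intercept column of the model matrix, which glmnet drops, hence the dot). A quick sketch, using the objects already defined above, to count the retained predictors instead of scanning the table:
# Sketch: count the non-zero predictor coefficients at the chosen lambda
lasso.coef <- predict(fit.lasso, s = bestlam.lasso, type = "coefficients")
sum(lasso.coef != 0) - 1   # subtract 1 for the intercept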
# Load PLS library
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
# Fit PCR on the training data; scale the predictors and use cross-validation
fit.pcr <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
# Build the CV plot
validationplot(fit.pcr, val.type = "MSEP")
The lowest cross-validation error occurs at M = 17, but that uses all of the components and so provides no dimension reduction. The plot above shows a local minimum at M = 10, where the CV error is already close to its lowest value, so we use M = 10 to compute the test MSE.
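Rather than reading M off the plot, we can also pull the cross-validated errors out of the fitted object. This is a sketch: the indexing assumes the default array layout returned by pls::RMSEP, and selectNcomp is assumed to be available in the installed pls version.
# Sketch: extract the CV RMSEP for 0..17 components and find the minimum programmatically
cv.rmsep <- RMSEP(fit.pcr)$val["CV", 1, ]
which.min(cv.rmsep) - 1   # subtract 1 because the first entry is the intercept-only model
# pls also offers a one-standard-error heuristic for a more parsimonious M
selectNcomp(fit.pcr, method = "onesigma")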
summary(fit.pcr)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 4054 2422 2432 2117 2022 1959
## adjCV 4288 4051 2415 2426 2061 1999 1948
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1957 1955 1911 1852 1853 1856 1861
## adjCV 1947 1947 1900 1840 1843 1846 1851
## 14 comps 15 comps 16 comps 17 comps
## CV 1877 1679 1338 1269
## adjCV 1876 1649 1323 1256
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.20 57.78 65.31 70.99 76.37 81.27 84.8 87.85
## Apps 13.44 70.93 71.07 79.87 81.15 82.25 82.3 82.33
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.62 92.91 94.98 96.74 97.79 98.72 99.42
## Apps 83.38 84.76 84.80 84.84 85.11 85.14 90.55
## 16 comps 17 comps
## X 99.88 100.00
## Apps 93.42 93.89
# Predict on the test set with M = 10 components and compute the test MSE
pred.pcr <- predict(fit.pcr, College.test, ncomp = 10)
MSE_pcr = mean((pred.pcr - College.test$Apps)^2)
MSE_pcr
## [1] 1723100
# Fit the PLS model with scale = TRUE and plot the cross-validation results
fit.pls <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pls, val.type = "MSEP")
From the chart above, the cross-validation error is already low by M = 10 and decreases only gradually after that. Since using all 17 components would amount to no dimension reduction, we use M = 10 to compute the test MSE.
# Get the PLS MSE
pred.pls <- predict(fit.pls, College.test, ncomp = 10)
MSE_pls = mean((pred.pls - College.test$Apps)^2)
MSE_pls
## [1] 1131661
# Get all the MSEs into a matrix and print it
MSE_all <- matrix(c(MSE_lm,MSE_ridge,MSE_lasso,MSE_pcr,MSE_pls))
dimnames(MSE_all) = list(c("MSE_lm","MSE_ridge","MSE_lasso","MSE_pcr","MSE_pls"))
MSE_all
## [,1]
## MSE_lm 1135758
## MSE_ridge 1135714
## MSE_lasso 1135659
## MSE_pcr 1723100
## MSE_pls 1131661
Based on the MSEs above, PCR (principal components regression) gives the largest test MSE. The remaining models have very similar MSEs, with partial least squares (PLS) performing slightly best.
Let's confirm this by also computing the R-squared values on the test data set.
# Load miscTools library
library(miscTools)
# Calculate R2 for all Regression model predictions for test data set
R2_lm <- rSquared(College.test$Apps, resid = as.matrix(College.test$Apps - pred.lm))
R2_ridge <- rSquared(College.test$Apps, resid = as.matrix(College.test$Apps - pred.ridge))
R2_lasso <- rSquared(College.test$Apps, resid = as.matrix(College.test$Apps - pred.lasso))
R2_pcr <- rSquared(College.test$Apps, resid = as.matrix(College.test$Apps - pred.pcr))
R2_pls <- rSquared(College.test$Apps, resid = as.matrix(College.test$Apps - pred.pls))
# Get all the RSquared into a matrix and print it
R2_all <- matrix(c(R2_lm,R2_ridge,R2_lasso,R2_pcr,R2_pls))
dimnames(R2_all) = list(c("R2_lm","R2_ridge","R2_lasso","R2_pcr","R2_pls"))
R2_all
## [,1]
## R2_lm 0.9015413
## R2_ridge 0.9015452
## R2_lasso 0.9015499
## R2_pcr 0.8506248
## R2_pls 0.9018965
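As a sanity check, the test R-squared can also be computed directly as 1 - RSS/TSS without the miscTools package; the sketch below does this for the PLS predictions and should match R2_pls above.
# Sketch: test R-squared for PLS computed by hand
rss <- sum((College.test$Apps - pred.pls)^2)
tss <- sum((College.test$Apps - mean(College.test$Apps))^2)
1 - rss / tss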
Based on the MSEs and the R-squared values, PCR is the worst fit of all; the remaining models differ little in either measure.
So we can conclude that all models except PCR predict the number of college applications received with high accuracy.
library(MASS)
## Warning: package 'MASS' was built under R version 3.6.2
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Load the leaps library to use the subset selection methods
library(leaps)
Let's start with the best subset selection method.
set.seed(1)
# Create our own predict method for regsubsets, since leaps does not provide one
predict.regsubsets <- function(object, newdata, id, ...) {
  # Recover the model formula from the original regsubsets call
  form <- as.formula(object$call[[2]])
  # Build the model matrix for the new data
  mat <- model.matrix(form, newdata)
  # Coefficients of the best model with 'id' predictors
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  # Multiply the matching columns by the coefficients to get predictions
  mat[, xvars] %*% coefi
}
# set k to 10 for 10-fold cross validation
k = 10
p = ncol(Boston) - 1
# Create the folds by randomly assigning each observation to one of the k folds
folds <- sample(1:k, nrow(Boston), replace = TRUE)
# Initialize the matrix of CV errors (one row per fold, one column per model size)
cv.errors <- matrix(NA, k, p, dimnames = list(NULL, paste(1:p)))
# Loop through the folds: fit best subset selection on the training folds and
# compute the test MSE of each model size on the held-out fold
for (j in 1:k) {
  best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = p)
  for (i in 1:p) {
    pred <- predict(best.fit, Boston[folds == j, ], id = i)
    cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
  }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "mean.cv.errors")
The lowest CV error is at 12 variables.
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948
## 9 10 11 12 13
## 42.53822 42.73416 42.52367 42.46014 42.50125
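A short sketch to identify the best model size programmatically instead of reading it off the printed vector:
# Sketch: model size with the lowest mean CV error, and the corresponding error
which.min(mean.cv.errors)
min(mean.cv.errors)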
For the 12-variable model with the lowest CV error, the estimated MSE is 42.46014.
MSE_ss_11 = 42.46014
Let's fit the lasso model.
x <- model.matrix(crim ~ ., Boston)[, -1]
y <- Boston$crim
cv.out <- cv.glmnet(x, y, alpha = 1, type.measure = "mse")
plot(cv.out)
cv.out
##
## Call: cv.glmnet(x = x, y = y, type.measure = "mse", alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Measure SE Nonzero
## min 0.051 43.11 14.16 11
## 1se 3.376 56.89 17.29 1
Cross-validation selects λ = 0.051, with a CV estimate of the test MSE equal to 43.11.
MSE_lasso_11 = 43.11
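Rather than hardcoding these numbers, the chosen lambda and its CV error can be pulled straight out of the cv.glmnet object; a sketch for the lasso fit above (the same pattern applies to the ridge fit below):
# Sketch: extract lambda.min and the CV estimate of the MSE at that lambda
cv.out$lambda.min
min(cv.out$cvm)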
Let's fit the ridge regression model now.
cv.out <- cv.glmnet(x, y, alpha = 0, type.measure = "mse")
plot(cv.out)
cv.out
##
## Call: cv.glmnet(x = x, y = y, type.measure = "mse", alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Measure SE Nonzero
## min 0.54 43.48 14.33 13
## 1se 74.44 57.69 16.76 13
Cross-validation selects λ = 0.54, with a CV estimate of the test MSE equal to 43.48.
MSE_ridge_11 = 43.48
Let's fit PCR now.
pcr.fit <- pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.198 7.198 6.786 6.762 6.790 6.821
## adjCV 8.61 7.195 7.195 6.780 6.753 6.784 6.813
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.822 6.689 6.712 6.720 6.712 6.664 6.593
## adjCV 6.812 6.679 6.701 6.708 6.700 6.651 6.580
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
Let's build the validation plot.
validationplot(pcr.fit, val.type = "MSEP")
The validation plot shows that the lowest CV error requires all 13 components, so PCR gives no dimension reduction here, although we would ideally expect only a few components to be needed. Since the CV error has essentially leveled off by 4 components, and those components already explain a reasonable share of the variance, we use M = 4 for the MSE calculation.
pcr.pred <- predict(pcr.fit, Boston, ncomp = 4)
MSE_pcr_11 = mean((pcr.pred - Boston$crim)^2)
MSE_pcr_11
## [1] 44.5939
The MSE for the PCR model with M = 4 is 44.5939; note that this is computed on the full data set that the model was fit on, not on a held-out test set.
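For a fairer comparison with the other CV-based estimates, we could instead take the cross-validated error at M = 4 from the pcr object itself; this is a sketch, and the indexing assumes the default structure returned by pls::RMSEP.
# Sketch: CV estimate of the MSE at 4 components (squaring the CV RMSEP)
RMSEP(pcr.fit)$val["CV", 1, "4 comps"]^2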
# Get all the MSEs into a matrix and print it
MSE_all_11 <- matrix(c(MSE_ss_11,MSE_ridge_11,MSE_lasso_11,MSE_pcr_11))
dimnames(MSE_all_11) = list(c("MSE_subsets","MSE_ridge","MSE_lasso","MSE_pcr"))
MSE_all_11
## [,1]
## MSE_subsets 42.46014
## MSE_ridge 43.48000
## MSE_lasso 43.11000
## MSE_pcr 44.59390
Based on the MSEs above, each taken at its lowest cross-validation error, best subset selection performs best on this data set.
reg.best <- regsubsets(crim ~ ., data = Boston, nvmax = 13)
coef(reg.best, 12)
## (Intercept) zn indus chas nox
## 16.985713928 0.044673247 -0.063848469 -0.744367726 -10.202169211
## rm dis rad tax ptratio
## 0.439588002 -0.993556631 0.587660185 -0.003767546 -0.269948860
## black lstat medv
## -0.007518904 0.128120290 -0.198877768
The chosen model does not include every feature in the data set because not all predictors are strongly related to the response; using all of them would likely hurt test performance by overfitting the model.
The full data set has 13 predictors, and the CV error is lowest at 12, so we keep 12 variables in the final best subset model. The predictor "age" (proportion of owner-occupied units built prior to 1940) did not make it into the final model, as the sketch below confirms. In addition, the coefficients for tax and black are close to 0, so those predictors have little influence on the response.
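A small sketch to confirm which predictor is excluded from the 12-variable model, instead of checking the coefficient listing by eye:
# Sketch: the predictor that does not appear in the chosen 12-variable model
all.preds <- setdiff(names(Boston), "crim")
setdiff(all.preds, names(coef(reg.best, 12)))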