2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. ("Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance") is correct for the lasso, because:

The lasso shrinks the coefficient estimates towards zero by adding the shrinkage penalty \(\lambda\sum_{j=1}^{p}|\beta_{j}|\) to the least squares criterion, which pulls the coefficient estimates \(\beta_1, \beta_2, \ldots, \beta_p\) towards zero. This shrinkage can substantially reduce their variance at the cost of a small increase in bias, so prediction accuracy improves when the decrease in variance exceeds the increase in bias. In addition, depending on the value of \(\lambda\), the lasso can produce a model involving any number of variables.
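Formally, the lasso coefficient estimates minimise the penalised least squares criterion (as given in ISLR):
\[\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| \;=\; RSS + \lambda\sum_{j=1}^{p}|\beta_j|,\]
so a larger \(\lambda\) shrinks the estimates more strongly (and can set some of them exactly to zero), trading a small increase in bias for a larger decrease in variance.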

(b) Repeat (a) for ridge regression relative to least squares.

Ridge regression behaves similarly to the lasso here, and the same statement is correct:
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

The reasoning is the same as in part (a); the only difference is that ridge regression uses the shrinkage penalty \(\lambda\sum_{j=1}^{p}\beta_{j}^2\), which shrinks the coefficients towards zero but does not shrink the coefficient estimates of the less useful predictors exactly to 0.
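For comparison, the corresponding ridge criterion (also from ISLR) is
\[RSS + \lambda\sum_{j=1}^{p}\beta_j^{2},\]
whose \(\ell_2\) penalty shrinks all coefficients towards zero but, unlike the lasso's \(\ell_1\) penalty, never sets any of them exactly to zero.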

(c) Repeat (a) for non-linear methods relative to least squares.

ii. ("More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias") is correct for non-linear methods, because:

In a non-linear model, \(y = f(x) + \varepsilon\), the functional part of the model is not linear in the unknown parameters \(\beta_0, \beta_1, \ldots\). Because non-linear methods make fewer assumptions about the form of \(f(x)\), they can capture non-linear relationships between the predictors and the response. This decreases the bias but increases the variance, so prediction accuracy improves when the decrease in bias outweighs the increase in variance.
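This trade-off can be made precise with the expected test MSE decomposition from ISLR: for a test point \(x_0\),
\[E\big[(y_0 - \hat f(x_0))^2\big] = \mathrm{Var}\big(\hat f(x_0)\big) + \big[\mathrm{Bias}\big(\hat f(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon),\]
so a more flexible (non-linear) method helps only when its reduction in squared bias exceeds its increase in variance.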

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

#Split the data set with a 70-30 train-test split
library(ISLR)   # College data set (if not already loaded)
set.seed(777)
train <- sample(1:nrow(College), round(nrow(College) * 0.7))
test <- -train
train.college <- College[train, ]
test.college <- College[test, ]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

lm.college.fit <- lm(Apps ~ ., data = train.college)
lm.college.pred <- predict(lm.college.fit, test.college)
lm.college.mse <- mean((lm.college.pred - test.college$Apps)^2)
lm.college.mse
## [1] 1346878

The test error (test MSE) for the least squares model is 1,346,878.

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

The glmnet() function has an alpha argument that determines which type of model is fit: alpha = 0 fits a ridge regression model. glmnet() can select a range of \(\lambda\) values automatically, but here I have chosen to supply a grid of values ranging from \(\lambda = 10^{10}\) to \(\lambda = 10^{-2}\), essentially covering the full range of scenarios from the null model containing only the intercept down to the least squares fit.

library(glmnet)   # glmnet() and cv.glmnet() (if not already loaded)
set.seed(777)
grid <- 10 ^ seq(10, -2, length = 100)
#Convert the train and test data sets into model matrices, since they contain a categorical variable (Private)
train.college.mat <- model.matrix(Apps ~ ., data = train.college)
test.college.mat <- model.matrix(Apps ~ ., data = test.college)
#Fit the ridge regression on the training matrix over the grid of lambda values
ridge.college.fit <- glmnet(train.college.mat, train.college$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
#Cross-validate the ridge regression on the same training matrix and grid
ridge.college.cv <- cv.glmnet(train.college.mat, train.college$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
#Find the lambda that minimises the cross-validation error
ridge.college.bestlambda <- ridge.college.cv$lambda.min
ridge.college.bestlambda
## [1] 0.01

The best \(\lambda\) is 0.01.

ridge.college.pred <- predict(ridge.college.fit, s = ridge.college.bestlambda, newx = test.college.mat)
ridge.college.mse <- mean((ridge.college.pred - test.college$Apps)^2)
ridge.college.mse
## [1] 1346842

The test error is 1,346,842, slightly lower than the least squares model from (b).
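Note that the selected \(\lambda = 0.01\) is the smallest value in the grid, so the cross-validation curve may still be decreasing at the edge of the grid. A quick sanity check (a sketch only; ridge.college.cv2 is a new object and this was not run above) is to let cv.glmnet() choose its own \(\lambda\) sequence and compare:

#Sketch: let cv.glmnet() pick its own lambda sequence and compare with the grid-based choice of 0.01
ridge.college.cv2 <- cv.glmnet(train.college.mat, train.college$Apps, alpha = 0)
ridge.college.cv2$lambda.min

If the default sequence also selects a comparably small \(\lambda\), the conclusion above is unchanged.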

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

The glmnet() function has an alpha argument that determines what type of model is fit. If alpha=1 then a lasso regression model is fit.

#Fit the lasso regression
lasso.college.fit <- glmnet(train.college.mat, train.college$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
#Fit the lasso regression using cross-validation for the same train matrix with the above grid
lasso.college.cv <- cv.glmnet(train.college.mat, train.college$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
#Find the lambda that minimises the cross-validation error
lasso.college.bestlambda <- lasso.college.cv$lambda.min
lasso.college.bestlambda
## [1] 0.01

The best \(\lambda\) is 0.01.

lasso.college.pred <- predict(lasso.college.fit, s = lasso.college.bestlambda, newx = test.college.mat)
lasso.college.mse <- mean((lasso.college.pred - test.college$Apps)^2)
lasso.college.mse
## [1] 1346806

The test error is 1,346,806, slightly lower than both the least squares and ridge models.

The coefficient estimates are:

predict(lasso.college.fit, s = lasso.college.bestlambda, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -3.082576e+02
## (Intercept)  .           
## PrivateYes  -4.252529e+02
## Accept       1.699384e+00
## Enroll      -1.220297e+00
## Top10perc    4.655390e+01
## Top25perc   -1.217544e+01
## F.Undergrad  8.967979e-02
## P.Undergrad  3.431486e-02
## Outstate    -1.205147e-01
## Room.Board   1.119671e-01
## Books        2.541690e-02
## Personal    -7.734325e-03
## PhD         -6.091927e+00
## Terminal    -5.324239e+00
## S.F.Ratio    5.835626e+00
## perc.alumni  3.990627e-01
## Expend       1.186827e-01
## Grad.Rate    9.111712e+00

At this very small value of \(\lambda\), all 17 predictors have non-zero coefficient estimates; the only zero entry is the second (Intercept) row, which is the redundant intercept column created by model.matrix() (glmnet() fits its own intercept and ignores it). So the lasso has not removed any variables here.
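The number of non-zero coefficients can also be counted programmatically (a short sketch using the objects fitted above; the name lasso.college.coef is mine):

#Sketch: count non-zero lasso coefficients, excluding the two intercept rows shown above
lasso.college.coef <- predict(lasso.college.fit, s = lasso.college.bestlambda, type = "coefficients")
sum(lasso.college.coef[-c(1, 2), ] != 0)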

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)   # pcr(), plsr() and validationplot() (if not already loaded)
set.seed(777)
pcr.college.fit <- pcr(Apps ~ ., data = train.college, scale = TRUE, validation = "CV")
summary(pcr.college.fit)
## Data:    X dimension: 544 17 
##  Y dimension: 544 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4006     4012     2215     2213     1786     1749     1666
## adjCV         4006     4014     2212     2215     1746     1745     1661
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1648     1642     1603      1599      1607      1614      1615
## adjCV     1640     1638     1597      1594      1602      1608      1609
##        14 comps  15 comps  16 comps  17 comps
## CV         1619      1531      1211      1136
## adjCV      1615      1498      1200      1128
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X     31.9439    57.85    65.04    70.77    76.17    81.04    84.56    87.92
## Apps   0.5369    70.31    70.56    81.81    82.47    84.01    84.48    84.54
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.87     93.20     95.16     96.94     97.97     98.78     99.39
## Apps    85.51     85.67     85.79     85.80     85.89     86.05     91.95
##       16 comps  17 comps
## X        99.84    100.00
## Apps     93.21     93.72
validationplot(pcr.college.fit, val.type = "MSEP")

From the cross-validation plot (and the RMSEP table above), the selected value of M is 17, so there is no dimension reduction: the lowest cross-validation error occurs with all 17 components.

Running the prediction with 17 components (that is, M = 17):

pcr.college.pred <- predict(pcr.college.fit, test.college, ncomp = 17)
pcr.college.mse <- mean((pcr.college.pred - test.college$Apps)^2)
pcr.college.mse
## [1] 1346878

The test error is 1,346,878, identical to the least squares linear model in (b). This is because when M = p and all the linear combinations \(Z_1, \ldots, Z_M\) are linearly independent, PCR imposes no constraints: no dimension reduction occurs, and the fit is equivalent to performing least squares on the original p predictors (see ISLR).
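As a quick check of this equivalence (a sketch using the predictions already computed), the largest difference between the PCR predictions with all 17 components and the least squares predictions from (b) should be essentially zero:

#Sketch: PCR with M = p should reproduce the least squares predictions up to numerical error
max(abs(as.numeric(pcr.college.pred) - as.numeric(lm.college.pred)))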

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(777)
pls.college.fit <- plsr(Apps ~ ., data = train.college, scale = TRUE, validation = "CV")
summary(pls.college.fit)
## Data:    X dimension: 544 17 
##  Y dimension: 544 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4006     2001     1576     1532     1456     1333     1194
## adjCV         4006     1997     1565     1527     1435     1313     1183
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1171     1152     1144      1140      1139      1138      1135
## adjCV     1162     1144     1136      1132      1131      1130      1128
##        14 comps  15 comps  16 comps  17 comps
## CV         1136      1136      1136      1136
## adjCV      1128      1128      1128      1128
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.78     34.0    63.06    65.98    69.99    74.39    78.60    81.10
## Apps    76.41     86.6    87.66    91.10    92.82    93.40    93.49    93.56
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       83.27     86.67     89.72     91.28     92.90     94.41     97.25
## Apps    93.64     93.68     93.70     93.72     93.72     93.72     93.72
##       16 comps  17 comps
## X        98.34    100.00
## Apps     93.72     93.72
validationplot(pls.college.fit, val.type = "MSEP")

The exact value of M is hard to read off the cross-validation plot, so let us extract the cross-validation MSEs for every number of components.

library(dplyr)   # %>%, mutate(), select(), rename() (if not already loaded)
#Re-assign select locally in case dplyr::select is masked by another package (e.g. MASS)
select <- dplyr::select
pls.model.cse <- MSEP(pls.college.fit, estimate = "CV")$val %>%
                reshape2::melt() %>%
                mutate(M = 0:(nrow(.) - 1)) %>%
                select(M, value) %>%
                rename(CV_MSE = value)
pls.model.cse
##     M   CV_MSE
## 1   0 16050223
## 2   1  4005362
## 3   2  2483826
## 4   3  2346362
## 5   4  2119328
## 6   5  1777693
## 7   6  1425126
## 8   7  1372166
## 9   8  1327818
## 10  9  1309405
## 11 10  1299834
## 12 11  1297801
## 13 12  1295053
## 14 13  1289172
## 15 14  1290366
## 16 15  1290629
## 17 16  1290838
## 18 17  1290708

From the MSE table above, M = 13 gives the lowest cross-validation MSE (1,289,172).
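Equivalently, M can be read off programmatically from the pls.model.cse table built above (a one-line sketch):

#Sketch: the value of M with the lowest cross-validation MSE
pls.model.cse$M[which.min(pls.model.cse$CV_MSE)]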

Running the prediction with 13 components (that is, M = 13):

pls.college.pred <- predict(pls.college.fit, test.college, ncomp = 13)
pls.college.mse <- mean((pls.college.pred - test.college$Apps)^2)
pls.college.mse
## [1] 1342229

The test error is 1,342,229, which is smaller than that of all the other models fitted above.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

To compare the results from all five approaches, we compute the test \(R^2\) for each:
\[R^2 = 1 - \frac{RSS}{TSS}\]
where RSS is the residual sum of squares and TSS is the total sum of squares.
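The calculation below can also be wrapped in a small helper function (a sketch; the name test_r2 is mine) so the same arithmetic is not repeated five times:

#Sketch of a helper: test R^2 = 1 - RSS/TSS for a vector of predictions
test_r2 <- function(pred, actual) {
  1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
}
#e.g. test_r2(lm.college.pred, test.college$Apps)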

total.sum.squares <- sum((test.college$Apps - mean(test.college$Apps))^2)

lm.res.sum.squares <- sum((test.college$Apps - lm.college.pred)^2)
ridge.res.sum.squares <- sum((test.college$Apps - ridge.college.pred)^2)
lasso.res.sum.squares <- sum((test.college$Apps - lasso.college.pred)^2)
pcr.res.sum.squares <- sum((test.college$Apps - pcr.college.pred)^2)
pls.res.sum.squares <- sum((test.college$Apps - pls.college.pred)^2)

lm.r2 <- 1 - (lm.res.sum.squares / total.sum.squares)
ridge.r2 <- 1 - (ridge.res.sum.squares / total.sum.squares)
lasso.r2 <- 1 - (lasso.res.sum.squares / total.sum.squares)
pcr.r2 <- 1 - (pcr.res.sum.squares / total.sum.squares)
pls.r2 <- 1 - (pls.res.sum.squares / total.sum.squares)

df.mse.r2 <- data.frame(method = c("Least Square", "Ridge", "Lasso", "PCR", "PLS"), 
           Test_MSE = c(lm.college.mse, ridge.college.mse, lasso.college.mse, pcr.college.mse, pls.college.mse), 
           Test_R2 = c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2)) 

df.mse.r2
##         method Test_MSE   Test_R2
## 1 Least Square  1346878 0.8925678
## 2        Ridge  1346842 0.8925707
## 3        Lasso  1346806 0.8925736
## 4          PCR  1346878 0.8925678
## 5          PLS  1342229 0.8929387

All five approaches give very similar test MSEs and test \(R^2\) values, with \(R^2 \approx 89.3\%\) in every case; that is, each model explains roughly 89% of the variance in the number of applications in the test set, so the number of applications can be predicted quite accurately. Among the five approaches, PLS has the lowest test MSE and a marginally higher test \(R^2\).

11. We will now try to predict per capita crime rate in the Boston dataset.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Best subset selection approach on the Boston dataset.

#The code below is referenced and/or adapted from the Lab sections of the ISLR book
library(MASS)    # Boston data set (if not already loaded)
library(leaps)   # regsubsets() (if not already loaded)
set.seed(1)

#regsubsets() has no predict() method, so define one for use below
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
#Try to choose among the models of different sizes using cross-validation.
#Perform best subset selection within each of the k training sets.

#Create a vector that allocates each observation to one of k = 10 folds, and
# create a matrix in which we will store the results. 
k = 10
folds <- sample(1:k, nrow(Boston), replace = TRUE)
cv.errors <- matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))

#Predict for each model size (using our new predict() method)
#Compute the test errors on the appropriate subset, and store them in 
#the appropriate slot in the matrix cv.errors.
for (j in 1:k) {
  best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
  for (i in 1:13) {
    pred <- predict(best.fit, Boston[folds == j, ], id = i)
    cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
  }
}

#use the apply() function to average over the columns of the matrix of cv.errors
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
##        1        2        3        4        5        6        7        8 
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948 
##        9       10       11       12       13 
## 42.53822 42.73416 42.52367 42.46014 42.50125
#Plot the number of variables with the mean cv errors
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")

We see that cross-validation selects a 12-variable model using the best subset selection approach, with an estimated cross-validation MSE of 42.46014.
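The chosen model size can also be extracted programmatically from mean.cv.errors (a one-line sketch):

#Sketch: the model size with the lowest mean cross-validation error
which.min(mean.cv.errors)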

Lasso approach on the Boston dataset.

set.seed(1)
x <- model.matrix(crim ~ ., Boston)[, -1]
y <- Boston$crim
#Alpha = 1 fits the lasso with glmnet()
cv.out.lasso <- cv.glmnet(x, y, alpha = 1, type.measure = "mse")
plot(cv.out.lasso)

bestlambda.boston.lasso <- cv.out.lasso$lambda.min
sprintf("λ = %f", bestlambda.boston.lasso)
## [1] "λ = 0.056309"
cv.out.lasso
## 
## Call:  cv.glmnet(x = x, y = y, type.measure = "mse", alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure    SE Nonzero
## min 0.0563    50   42.52 13.53      11
## 1se 3.0758     7   55.33 17.14       1
lasso.coef <- predict(cv.out.lasso, type="coefficients", s = bestlambda.boston.lasso )[1:14 ,]
lasso.coef
##  (Intercept)           zn        indus         chas          nox           rm 
## 12.319178080  0.035726832 -0.068876055 -0.577832639 -6.631559462  0.208676937 
##          age          dis          rad          tax      ptratio        black 
##  0.000000000 -0.768388824  0.512333871  0.000000000 -0.179631375 -0.007551172 
##        lstat         medv 
##  0.124630014 -0.154550130

We see that cross-validation selects an 11-variable model (11 variables have non-zero coefficients) using the lasso, with \(\lambda = 0.056309\) and an estimated cross-validation MSE of 42.52.

Ridge approach on the Boston dataset.

set.seed(1)
#Alpha = 0 fits ridge regression with glmnet()
cv.out.ridge <- cv.glmnet(x, y, alpha = 0, type.measure = "mse")
plot(cv.out.ridge)

bestlambda.boston.ridge <- cv.out.ridge$lambda.min
sprintf("λ = %f", bestlambda.boston.ridge)
## [1] "λ = 0.537499"
cv.out.ridge
## 
## Call:  cv.glmnet(x = x, y = y, type.measure = "mse", alpha = 0) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure    SE Nonzero
## min   0.54   100   42.71 13.71      13
## 1se  56.31    50   55.80 17.00      13
ridge.coef <- predict(cv.out.ridge, type="coefficients", s = bestlambda.boston.ridge )[1:14 ,]
ridge.coef
##  (Intercept)           zn        indus         chas          nox           rm 
##  9.063048626  0.033002416 -0.082046152 -0.737684583 -5.393098481  0.335972073 
##          age          dis          rad          tax      ptratio        black 
##  0.001962473 -0.702123641  0.422779054  0.003400607 -0.135911587 -0.008483285 
##        lstat         medv 
##  0.142613436 -0.139604127

We see that cross-validation keeps all 13 variables (all coefficients are non-zero, as expected since ridge regression does not set coefficients exactly to zero), with \(\lambda = 0.537499\) and an estimated cross-validation MSE of 42.71.

PCR approach on the Boston dataset.

set.seed(1)
pcr.boston.fit <- pcr(crim ~ ., data=Boston, scale=TRUE, validation ="CV")
summary(pcr.boston.fit)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.250    7.253    6.833    6.815    6.826    6.847
## adjCV         8.61    7.245    7.247    6.825    6.803    6.818    6.838
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.837    6.710    6.735     6.723     6.714     6.696     6.624
## adjCV    6.827    6.698    6.724     6.710     6.702     6.682     6.609
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4
validationplot(pcr.boston.fit, val.type= "MSEP")

From the plot above, the lowest cross-validation error occurs at 13 components, so the PCR approach selects a 13-variable model (M = 13, i.e. no dimension reduction). pcr() reports the root mean squared error, so to obtain the usual MSE we square this quantity: with a CV RMSEP of 6.624 at 13 components, the cross-validation MSE is \(6.624^2 = 43.877376\).

PLS approach on the Boston dataset.

set.seed(1)
pls.boston.fit <- plsr(crim ~ ., data=Boston, scale=TRUE, validation ="CV")
summary(pcr.boston.fit)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.250    7.253    6.833    6.815    6.826    6.847
## adjCV         8.61    7.245    7.247    6.825    6.803    6.818    6.838
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.837    6.710    6.735     6.723     6.714     6.696     6.624
## adjCV    6.827    6.698    6.724     6.710     6.702     6.682     6.609
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4
validationplot(pls.boston.fit, val.type= "MSEP")

From the plot above, the lowest cross-validation error again occurs at 13 components, so the PLS approach also selects a 13-variable model (M = 13, i.e. no dimension reduction). As with PCR, plsr() reports the root mean squared error, so squaring the CV value at 13 components gives a cross-validation MSE of \(6.624^2 = 43.877376\) based on the table shown, the same as PCR. (Note that the summary printed above was produced from pcr.boston.fit rather than pls.boston.fit, which is why the two tables are identical; the PLS cross-validation error should instead be read from summary(pls.boston.fit), and is expected to be close to, though not necessarily identical to, the PCR value.)

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

Comparing the cross-validation errors from all the approaches in (a), the lowest cross-validation MSE (42.46014) comes from the best subset selection approach, which has 12 variables in the final model.

(c) Does your chosen model involve all of the features in the dataset? Why or why not?

No. The chosen model (from best subset selection) involves only 12 of the 13 features in the Boston data set, because that is the model size that gives the lowest cross-validation error.

best.boston.fit <- regsubsets(crim ~ ., data = Boston ,nvmax = ncol(Boston) - 1)
reg.summary <- summary(best.boston.fit)
reg.summary$rsq
##  [1] 0.3912567 0.4207965 0.4286123 0.4334892 0.4392738 0.4440173 0.4476594
##  [8] 0.4504606 0.4524408 0.4530572 0.4535605 0.4540031 0.4540104

Checking the training \(R^2\) values as well, \(R^2\) increases from 39.13% with 1 variable to 45.40% with 12 variables, and there is essentially no further increase from 12 to 13 variables. So the 12-feature model captures almost all of the explainable variation.
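To see which 12 features the chosen model actually uses, we can extract the coefficients of the 12-variable fit from best.boston.fit (a sketch; the omitted variable is simply whichever one regsubsets() drops at this size):

#Sketch: coefficients of the 12-variable best subset model fitted on the full Boston data
coef(best.boston.fit, 12)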