Exercise 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is: 3. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

(b) Repeat (a) for ridge regression relative to least squares.

  1. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

- As the shrinkage parameter in ridge regression increases, flexibility of ridge decreases. This leads to an increase in bias and decreases the variance and prediction accuracy of ridge regression will improve on the least square, when its increase in bias is less than its decrease in variance. Using least square leads to low bias and high variance, specially when the number of samples and variables are similar in size and prediction accuracy could be low.

(c) Repeat (a) for non-linear methods relative to least squares.

  1. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.


Exercise 9

College data set

(a) Split the data set into a training set and a test set.

data("College")
sum(is.na(College))
## [1] 0

50 % training - testing set split

set.seed(52)
train=sample(1:nrow(College),nrow(College)/2)
train_col = College[train,]
test_col = College[-train,]
Fit logistic regression model on training set
lm.fit = lm(Apps ~ ., data = train_col )

Predict defaul from the validation set

pred.lm = predict(lm.fit, newdata = test_col)
lm.mse=mean((test_col$Apps - pred.lm)^2)
lm.mse
## [1] 1338031

(b) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

build a matrix and vector for train and test splits

train_mat = model.matrix(Apps ~ ., train_col)[, -1] 
test.mat = model.matrix(Apps ~ ., test_col)[, -1] 
train_vec = train_col$Apps
test_vec = test_col$Apps
grid <- 10^seq(10, -2, length = 1000)
ridge.mod = cv.glmnet(train_mat, train_vec, alpha = 0, lambda = grid, thresh = 1e-12) 
best_lam = ridge.mod$lambda.min
best_lam
## [1] 20.6688
ridge.pred <- predict(ridge.mod , s = best_lam, newx = test.mat)
ridge.mse = mean (( ridge.pred - test_vec)^2)
ridge.mse
## [1] 1430793

The test MSE is higher than the OLS test MSE(1338031)

(c) Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.

grid <- 10^seq(10, -2, length = 1000)
lasso.mod = cv.glmnet(train_mat, train_vec, alpha = 1, lambda = grid, thresh = 1e-12) 
best_lam_lasso = lasso.mod$lambda.min
best_lam_lasso
## [1] 12.9155
lasso.pred <- predict(lasso.mod , s = best_lam_lasso, newx = test.mat)
lasso.mse = mean (( lasso.pred - test_vec)^2)
lasso.mse
## [1] 1416085

(e) Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed (1)
pcr.fit <- pcr(Apps ~ ., data = train_col , scale = TRUE , validation = "CV")
validationplot(pcr.fit, val.type="MSEP")

pcr.pred = predict(pcr.fit, test.mat, ncomp=9)
pcr.mse=mean((test_vec - pcr.pred)^2) 
pcr.mse
## [1] 3048364

(f) Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed (1)
pls.fit <- plsr(Apps ~ ., data = train_col , scale = TRUE , validation = "CV")
validationplot(pls.fit, val.type="MSEP")

pls.pred = predict(pls.fit, test.mat, ncomp=5)
pls.mse=mean((test_vec - pls.pred)^2) 
pls.mse
## [1] 2146014

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

MSE = c(lm.mse, ridge.mse, lasso.mse, pcr.mse, pls.mse)
Model = c("Linear Reg", "Ridge Reg", "Lasso", "PCR", "PLS")

all = data.frame(Model, MSE)
(all = all[order(MSE, Model),])

Exercise 11

We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Best subset
predict.regsubsets <- function(object , newdata , id, ...) {
  form <- as.formula(object$call [[2]])
  mat <- model.matrix(form , newdata)
  coefi <- coef(object , id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}

k <- 10
n <- nrow(Boston)
p = ncol(Boston) - 1
folds = sample(rep (1:k, length = n))
cv.errors <- matrix(NA, k, p, dimnames = list(NULL , paste (1:p)))

for (j in 1:k) {
  best.fit = regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = n)
  for (i in 1:p) {
    pred = predict(best.fit , Boston[folds == j, ], id = i)
    cv.errors [j, i] = mean (( Boston$crim[folds == j] - pred)^2)
  }
}

rmse.cv.err = sqrt(apply(cv.errors, 2, mean))
plot(rmse.cv.err, pch = 1, type = "b")

folds
##   [1]  8  3  9 10  6  1  1  2  2  6  3  2  3  6  3  4  8  7  4  7  3  2  3  2  8
##  [26]  3  1  2 10  9  8  5  9 10 10  5  3 10 10  6  1  8 10  4  3  9  9  8  6  2
##  [51]  9  6  7  2  7  1  9  5  3  6  6  5 10  3  8  2  8  2  6  8  9  2  1  6  5
##  [76]  5  6  5  6 10  1  6  1  2  3  2 10  9  4  4  8 10 10  2  1  1  5  5 10  6
## [101]  8  2  5  3  9  1  8  4  3  6  7  9  6  7  5 10  7  3  8  2  2  8  4  6  8
## [126]  2  1  2  4  7  1  3  2  4  7  7  8 10  3  4  2 10 10  1  5  5  3  8  1  6
## [151]  5  4  6  4  1  6  8  5  6  7 10 10  8  5  4  8  4  2  5  3  5  3 10  8  5
## [176]  2  1  1 10  5  4  7  9  6  1  7  5  6  9  4  9  3  7  3  7 10  3  4  9  9
## [201]  8  5  6  7  7  2  4  3  9  6  4  6  2  2  4  7  2  9  9  7 10  4  4  3  5
## [226]  4  2  8  7  9  4 10  3  5  9  6  6  1  7  9  6  5  5  7  7  8  7  1  5  2
## [251]  5  9 10  4  3  4  9  1  5  5  1  2 10  5  2 10  8  6  5  1  9 10 10  9  4
## [276]  6  5  6  9  3  2  1  5  8  2 10  8  6  1  8  3  3  1  4  7  7  8  5  8  3
## [301]  7  2  9  9  9 10  6  4  6  6  1  4  9 10  9  4  8  9  9  4  1  7  1  1  7
## [326]  7  9  9  1  1  4  9 10  1  6  1 10  6  7  8  4  1  2 10  1 10 10  3  9  9
## [351]  7  3  2  1  7  1  8 10  8  8  2  1  4  1  9  7  7  7  1  5  9  7  5  4  2
## [376]  7  8 10  4  2  4  6  4  5  8  3  3  3  5  8  3  4  6  3  7  8  7 10  2  7
## [401]  6  5  8  7  2  3  1  3  2  1  2  2  2  7  4  9  7  3  9  9  2  5  8  5  2
## [426]  6  4  6 10  7  4  9  1  6  4 10  4  7 10  3  2  8  4  6  8  6  3  8  5  3
## [451]  4  5  1  9  3  3  6  4 10  2  3  1  5  9  4  3  1  9  5 10  7  7  9  9  3
## [476]  8  2  6  3  8  6  7 10  6  6  5 10  1  8  4  5  2 10  8  1  7  5  4  5  3
## [501] 10  1  8  5  8 10
coef(best.fit, 10)
##   (Intercept)            zn         indus           nox           dis 
##  21.824249741   0.053352858  -0.093824399 -10.682834604  -1.153005385 
##           rad           tax       ptratio         black         lstat 
##   0.622543420  -0.004911094  -0.273150005  -0.007863557   0.110571831 
##          medv 
##  -0.216655711
which.min(rmse.cv.err)
## 9 
## 9
best_rmse=rmse.cv.err[which.min(rmse.cv.err)]
best_rmse
##        9 
## 6.514147

create matrix and vector for Lasso and

x = model.matrix(crim ~ .-1, data = Boston)
y = Boston$crim

Lasso

lasso.cv = cv.glmnet(x, y, type.measure = "mse")
plot(lasso.cv)

coef(lasso.cv)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                    s1
## (Intercept) 1.0894283
## zn          .        
## indus       .        
## chas        .        
## nox         .        
## rm          .        
## age         .        
## dis         .        
## rad         0.2643196
## tax         .        
## ptratio     .        
## black       .        
## lstat       .        
## medv        .
lasso_rmse=sqrt(lasso.cv$cvm[lasso.cv$lambda == lasso.cv$lambda.1se])
lasso_rmse
## [1] 7.402467

Ridge Regression

ridge.cv = cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(ridge.cv)

coef(ridge.cv)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  1.761587688
## zn          -0.002865786
## indus        0.026514040
## chas        -0.142145708
## nox          1.684835359
## rm          -0.130992458
## age          0.005612849
## dis         -0.084630196
## rad          0.039946869
## tax          0.001830673
## ptratio      0.063756646
## black       -0.002278831
## lstat        0.031682619
## medv        -0.020839190
ridge_rmse=sqrt(ridge.cv$cvm[ridge.cv$lambda == ridge.cv$lambda.1se])
ridge_rmse
## [1] 7.735648

PCR

pcr.fit.b = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit.b)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.195    7.190    6.781    6.756    6.773    6.805
## adjCV         8.61    7.193    7.188    6.776    6.749    6.768    6.798
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.798    6.669    6.685     6.687     6.679     6.630     6.562
## adjCV    6.790    6.660    6.677     6.677     6.669     6.619     6.550
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4
rmse_pcr=sqrt(pcr.fit.b$validation$adj[13])
rmse_pcr
## [1] 6.361026

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

Summary of CV RMSE for the different models
RMSE = c(best_rmse, ridge_rmse, lasso_rmse, rmse_pcr)
Models = c("Best Subset", "Ridge Reg", "Lasso", "PCR")

all_models = data.frame(Models, RMSE)
(all_models = all_models[order(RMSE, Models),])

Based on the CV RMSE values for the different models, the PRC and best subset models seem to perform better on this data set.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

Though the PCR model with 13 features has the lowest RMSE, I would prefer the model chosen with best subset approach as it has only 10 features, making it a simpler model with slightly lower RMSE relative to PCR, but much lower RMSE compared to Lasso and Ridge models.

coef(best.fit, 10)
##   (Intercept)            zn         indus           nox           dis 
##  21.824249741   0.053352858  -0.093824399 -10.682834604  -1.153005385 
##           rad           tax       ptratio         black         lstat 
##   0.622543420  -0.004911094  -0.273150005  -0.007863557   0.110571831 
##          medv 
##  -0.216655711