Statistical Learning: Homework 2

Problem 1: k-fold cross-validation.

Problem 5.3, page 198.

Explain how k-fold cross-validation is implemented.
What are the advantages and disadvantages of k-fold cross validation relative to:
- The validation set approach?
- LOOCV?

Problem 2: Cross-validation on simulated data

Problem 5.8, page 200-201.

Generate a simulated data set:

set.seed (1)
eps <- rnorm(100)
x   <- rnorm(100)
y   <- x - 2*x^2 + eps

In this data set, n = 100, p = 1. Write out the model used to generate the data in equation form: \(y = x-2x^2\)

Create a scatterplot of X against Y:

plot(x,y)

The scatter plot shows an upside down parabola, indicating a quadratic relationship between \(Y\) and \(X\).

Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:

\(Y = \beta_0 + \beta_1 x + \epsilon\)
\(Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon\)
\(Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon\)
\(Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \epsilon\)

library(boot)
library(broom)

# create data frame:
my.data   <- data.frame(y, x)
cv.error  <- rep (0, 4) # initialize the cross validation error vector cv.error 
glm.fit   <- list()     # initialize the glm.fit list
powers    <- 1:4
# xpoly     <- function(i) poly(x, i, raw = T)
for (i in powers){
  glm.fit[[i]] <- glm(y ~ poly(x, i, raw = T), data = my.data)
  cv.error[i] <- cv.glm(my.data, glm.fit[[i]])$delta[1]
  cat("Fitting model number", i, "\n")
  print(tidy(glm.fit[[i]]))
  cat("------------------------------ \n\n")
}

## Fitting model number 1 
##                  term   estimate std.error statistic      p.value
## 1         (Intercept) -1.7373176 0.2455663 -7.074739 2.250992e-10
## 2 poly(x, i, raw = T)  0.2955984 0.2574537  1.148162 2.536968e-01
## ------------------------------ 
## 
## Fitting model number 2 
##                   term   estimate  std.error  statistic      p.value
## 1          (Intercept)  0.1108125 0.11730063   0.944688 3.471654e-01
## 2 poly(x, i, raw = T)1  0.9998146 0.09932268  10.066327 9.641512e-17
## 3 poly(x, i, raw = T)2 -2.0021237 0.08043332 -24.891719 6.544936e-44
## ------------------------------ 
## 
## Fitting model number 3 
##                   term    estimate  std.error   statistic      p.value
## 1          (Intercept)  0.10255628 0.11784276   0.8702808 3.863179e-01
## 2 poly(x, i, raw = T)1  1.14371980 0.19402953   5.8945657 5.577118e-08
## 3 poly(x, i, raw = T)2 -1.96706790 0.09018675 -21.8110512 5.660371e-39
## 4 poly(x, i, raw = T)3 -0.06382361 0.07389037  -0.8637609 3.898723e-01
## ------------------------------ 
## 
## Fitting model number 4 
##                   term   estimate  std.error statistic      p.value
## 1          (Intercept)  0.1910793 0.13904404  1.374236 1.726019e-01
## 2 poly(x, i, raw = T)1  1.2440514 0.21108494  5.893606 5.731997e-08
## 3 poly(x, i, raw = T)2 -2.2415022 0.24703741 -9.073533 1.578905e-14
## 4 poly(x, i, raw = T)3 -0.1339425 0.09429305 -1.420491 1.587385e-01
## 5 poly(x, i, raw = T)4  0.0835760 0.07006355  1.192860 2.358950e-01
## ------------------------------

cv.error

## [1] 6.3517417 0.8392001 0.8475700 0.8737288

Repeat (c) using another random seed.

We redo part (c) using seed number 2:

set.seed (2)
x <- rnorm (100)
y <- x - 2*x^2 + rnorm (100)
my.data <- data.frame(y, x, x2 = x^2, x3 = x^3, x4 = x^4)

## Fitting model number 1 
##                  term  estimate std.error statistic      p.value
## 1         (Intercept) -2.643414 0.3071591 -8.606008 1.270675e-13
## 2 poly(x, i, raw = T)  0.818052 0.2659888  3.075512 2.723382e-03
## ------------------------------ 
## 
## Fitting model number 2 
##                   term    estimate  std.error   statistic      p.value
## 1          (Intercept) -0.04358682 0.13351347  -0.3264601 7.447796e-01
## 2 poly(x, i, raw = T)1  0.94929472 0.08593825  11.0462416 7.517248e-19
## 3 poly(x, i, raw = T)2 -1.94657385 0.06698738 -29.0588150 1.149731e-49
## ------------------------------ 
## 
## Fitting model number 3 
##                   term    estimate  std.error   statistic      p.value
## 1          (Intercept) -0.03784714 0.13674936  -0.2767628 7.825572e-01
## 2 poly(x, i, raw = T)1  0.98672964 0.19271204   5.1202283 1.569412e-06
## 3 poly(x, i, raw = T)2 -1.94955821 0.06870552 -28.3755706 1.785800e-48
## 4 poly(x, i, raw = T)3 -0.01248854 0.05747270  -0.2172952 8.284394e-01
## ------------------------------ 
## 
## Fitting model number 4 
##                   term    estimate  std.error   statistic      p.value
## 1          (Intercept) -0.03007907 0.16417497 -0.18321350 8.550210e-01
## 2 poly(x, i, raw = T)1  0.98183587 0.20180148  4.86535506 4.528389e-06
## 3 poly(x, i, raw = T)2 -1.96900750 0.23512000 -8.37447896 4.863204e-13
## 4 poly(x, i, raw = T)3 -0.01033050 0.06292451 -0.16417293 8.699438e-01
## 5 poly(x, i, raw = T)4  0.00451020 0.05211799  0.08653825 9.312207e-01
## ------------------------------

It seems that model #2 is the best model. The \(x^3\) and \(x^4\) terms do not seem to contribute explaining the variation in the response variable \(y\).

cv.error

## [1] 9.858301 1.004410 1.018030 1.035601

The new results are close to, but not exactly the same as what we got in (c), since the seed number is different.

Smallest LOOCV error?

We define the function min.point to overlay the plot of MSE with the minimum value:

min.point <- function(x, sign){
  # superimpose minimum/maximum value on a plot
  min.idx <- which.min(x)
  x.min <- x[min.idx]
  points(min.idx, sign*x.min, col ="red", cex = 2, pch = 20)
}

We will use this function again in chapter 6. Now, we plot the cross validation error values against each model:

plot(powers, cv.error, type = "l", main = "MSE in each model",
     xlab = "Model Number", ylab = "MSE")
min.point(cv.error, 1)

Compare results from fitting each of the models in (c) using least squares with the conclusions drawn based on the cross-validation results:

Based on the cross-validation results, the minimum error is obtained in model number 2, which matches with our underlying model and agrees with the least squares results. Let’s take one more look at the result for number 2:

tidy(glm.fit[[2]])

##                   term    estimate  std.error   statistic      p.value
## 1          (Intercept) -0.04358682 0.13351347  -0.3264601 7.447796e-01
## 2 poly(x, i, raw = T)1  0.94929472 0.08593825  11.0462416 7.517248e-19
## 3 poly(x, i, raw = T)2 -1.94657385 0.06698738 -29.0588150 1.149731e-49

With p-values less than \(10^{-19}\), the coefficient estimates are statistically significant, and also close to the true coefficients.

Problem 3: LASSO, Ridge Regression, and Least Squares

Problem 6.2(a,b), page 259-260.

The LASSO, relative to least squares, is more flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Ridge regression, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Problem 4: Best Subset Selection

Problem 6.8, page 262.

Generate a predictor X and a noise vector of length n = 100.

set.seed(1234)
x <- rnorm(100)
eps <- rnorm(100)

Generate a response vector Y of length n = 100 according to the model \(Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon\)

betas <- matrix(c(1, 2, 3, 4))
y <- cbind(1, x, x^2, x^3)%*%betas + eps

Perform best subset selection in order to choose the best model containing the predictors \(X\), \(X^2\),…,\(X^{10}\). What is the best model obtained according to \(C_p\), BIC, and adjusted \(R^2\)? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained.

We need the following libraries:

library(leaps)
library(glmnet)
library(reshape2)
library(ggplot2)
library(dplyr)

We define the function subset.select.reg that runs subset selection regression with 3 methods: exhaustive, forward selection, and backward selection; shows grid of variables used in each regression; then plots RSS, negative Adjust \(R^2\), \(C_p\), and BIC against number of variables

Then, we perform the subset selection with the three methods: exhaustive search, forward stepwise selection and backwards stepwise selection:

regs <- lapply(c("exhaustive", "forward", "backward"), function(a) subset.select.reg(a, x, y))

Since this is a relatively small set of data, the runtime does not differ much among the different methods. In all methods, model #3 yields the best value for \(C_p\), BIC, and adjusted \(R^2\). Although the actual min value of RSS seen in model #8, the gain is not significant. Let’s take a look at the coefficient estimates in #3 of the exhaustive method:

coef(regs[[1]], 3)

## (Intercept)   xpoly(x)1   xpoly(x)2   xpoly(x)3 
##    1.132470    1.912586    2.893627    4.032305

Now fit a lasso model to the simulated data.

We define the function my.lasso that fits a lasso model using glmnet, and select the optimal value of \(\lambda\) using cv.glmnet, plots of the cross-validation error as a function of \(\log\lambda\)

my.lasso(x,y)

## LASSO MSE:[1] 1.051734
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                    1
## (Intercept) 1.183271
## 1           1.841550
## 2           2.833649
## 3           4.027686
## 4           .       
## 5           .       
## 6           .       
## 7           .       
## 8           .       
## 9           .       
## 10          .

The best model associated with the best \(\lambda\) is the model with four coefficients \(\beta_0, \beta_1, \beta_2\) and \(\beta_3\), with a prediction error of 1.051734.

Now generate a response vector Y according to the model \(Y = \beta_0 + \beta_7 x^7\) and perform best subset selection and the lasso. Discuss the results obtained.

Best Subset Selection:

We redefine y, then rerun our previous analysis:

betas <- matrix(c(5,3))
new.y <- cbind(1, x^7)%*%betas
# new.data <- data.frame(new.y, x)
new.regs <- lapply(c("exhaustive", "forward", "backward"), function(a) subset.select.reg(a, x, new.y))

coef(regs[[2]], 3)

## (Intercept)   xpoly(x)1   xpoly(x)2   xpoly(x)3 
##    1.132470    1.912586    2.893627    4.032305

LASSO:

my.lasso(x, new.y)

## LASSO MSE:[1] 91.61023
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                    1
## (Intercept) 5.898808
## 1           .       
## 2           .       
## 3           .       
## 4           .       
## 5           .       
## 6           .       
## 7           2.904022
## 8           .       
## 9           .       
## 10          .

LASSO yields the underlying model with coefficient estimate rather close to the true values: \(\beta_0 = 5\) and \(\beta_1 = 3\). The mean squared error is 91.61023, which is better than any of the subset selection regression fit.

Problem 5: Collection of fits on real data

Problem 6.9, page 263.

library(leaps)
library(glmnet)
library(pls)
library(ISLR)
attach(College)

a) Split the data set into a training set and a test set:

n <- nrow(College)
set.seed(1234)
train <- sample(n, floor(0.8*n)) # randomly select 80% of data to train
test  <- setdiff(1:n, train) # leftover indices
College$Private <- as.numeric(College$Private)
College.train <- College[train,]
College.test  <- College[test,]
Apps.train    <- College.train$Apps
Apps.test     <- College.test$Apps

b) Fit a linear model using least squares on the training set, and report the test error obtained.

We first define the predict method of subsetreg:

predict.regsubsets <- function(object, newdata, id){
  # Gareth James textbook: page 249
  form  <- as.formula(object$call[[2]])
  mat   <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[,xvars]%*% coefi
}

Subset Selection:

subset.regs <- regsubsets(Apps ~ ., data = College.train)
subset.summary <- summary(subset.regs)
subset.pred <- predict(subset.regs, College.test, 1)
mean((subset.pred - College.test$Apps)^2)

## [1] 1106664

c) Ridge Regression:

x.new <- as.matrix(College.test[, -2])
lasso.ridge(x,y,0)

## LASSO MSE:[1] 981795.8
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -1.235857e+03
## Private     -4.746100e+02
## Accept       9.949284e-01
## Enroll       4.386823e-01
## Top10perc    2.577701e+01
## Top25perc    3.567258e-01
## F.Undergrad  7.478257e-02
## P.Undergrad  4.658016e-02
## Outstate    -2.848505e-02
## Room.Board   2.232910e-01
## Books        7.952077e-02
## Personal     2.644455e-02
## PhD         -3.922425e+00
## Terminal    -4.258764e+00
## S.F.Ratio    1.758636e+01
## perc.alumni -6.213019e+00
## Expend       8.285774e-02
## Grad.Rate    9.762872e+00

d) LASSO:

lasso.ridge(x,y, 1)

## LASSO MSE:[1] 891847.1
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -417.55816302
## Private     -406.46603056
## Accept         1.57615553
## Enroll        -0.83775182
## Top10perc     47.79311905
## Top25perc    -12.49563501
## F.Undergrad    0.05497218
## P.Undergrad    0.05440484
## Outstate      -0.09621718
## Room.Board     0.16168730
## Books          .         
## Personal       0.07431700
## PhD           -8.44513961
## Terminal      -2.62660814
## S.F.Ratio     24.97293446
## perc.alumni    3.13570081
## Expend         0.08914764
## Grad.Rate      7.16145639

e) Principal Component Regression:

set.seed (1234)
pcr.fit <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation ="CV")
summary(pcr.fit)

## Data:    X dimension: 621 17 
##  Y dimension: 621 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4043     3982     2152     2155     1744     1708     1714
## adjCV         4043     3982     2148     2153     1734     1697     1708
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1693     1651     1592      1583      1585      1586      1593
## adjCV     1699     1637     1587      1580      1582      1582      1589
##        14 comps  15 comps  16 comps  17 comps
## CV         1595      1409      1241      1186
## adjCV      1591      1391      1232      1178
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X      31.667    57.38    64.43    70.33    75.51    80.42    84.03
## Apps    4.478    72.90    73.09    82.49    83.54    83.54    83.73
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       87.51    90.53     92.99     95.13     96.89     97.98     98.74
## Apps    85.18    85.58     85.79     85.82     85.87     85.87     85.93
##       15 comps  16 comps  17 comps
## X        99.37     99.84    100.00
## Apps     91.40     92.75     93.21

validationplot(pcr.fit ,val.type = "MSEP")

pcr.pred <- predict(pcr.fit, x.new, ncomp = 16)
mean((pcr.pred - Apps.test)^2)

## [1] 904152.9

f) Partial Least Square:

pls.fit <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
summary(pls.fit)

## Data:    X dimension: 621 17 
##  Y dimension: 621 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4043     1950     1691     1536     1460     1302     1234
## adjCV         4043     1946     1693     1530     1444     1285     1224
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1217     1205     1193      1190      1191      1190      1189
## adjCV     1208     1197     1185      1182      1183      1182      1181
##        14 comps  15 comps  16 comps  17 comps
## CV         1189      1188      1188      1188
## adjCV      1181      1180      1180      1180
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       26.09    43.86    62.89    65.67    68.68    73.90    77.98
## Apps    77.84    84.27    87.59    90.71    92.59    92.94    93.01
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       81.20    82.69     84.83     86.62     88.51     91.99     95.55
## Apps    93.06    93.15     93.18     93.20     93.21     93.21     93.21
##       15 comps  16 comps  17 comps
## X        97.14     98.91    100.00
## Apps     93.21     93.21     93.21

validationplot(pls.fit ,val.type = "MSEP")

pls.pred <- predict(pls.fit, x.new, ncomp = 7)
mean((pls.pred - Apps.test)^2)

## [1] 907293.3

g) Comments:

With the minimum prediction error of roughly 900,000 applications, I would conclude that our predictions are not very accurate. The prediction errors of LASSO, ridge regression, PCR, and PLS are close to each other (around 900,000 applications). The best prediction error of the linear models (1,100,000) is still 200,000 applications more than in other methods’ errors.