Problem 5.3, page 198.
Problem 5.8, page 200-201.
set.seed (1)
eps <- rnorm(100)
x <- rnorm(100)
y <- x - 2*x^2 + eps
In this data set, n = 100, p = 1. Write out the model used to generate the data in equation form: \(y = x-2x^2\)
plot(x,y)
The scatter plot shows an upside down parabola, indicating a quadratic relationship between \(Y\) and \(X\).
library(boot)
library(broom)
# create data frame:
my.data <- data.frame(y, x)
cv.error <- rep (0, 4) # initialize the cross validation error vector cv.error
glm.fit <- list() # initialize the glm.fit list
powers <- 1:4
# xpoly <- function(i) poly(x, i, raw = T)
for (i in powers){
glm.fit[[i]] <- glm(y ~ poly(x, i, raw = T), data = my.data)
cv.error[i] <- cv.glm(my.data, glm.fit[[i]])$delta[1]
cat("Fitting model number", i, "\n")
print(tidy(glm.fit[[i]]))
cat("------------------------------ \n\n")
}
## Fitting model number 1
## term estimate std.error statistic p.value
## 1 (Intercept) -1.7373176 0.2455663 -7.074739 2.250992e-10
## 2 poly(x, i, raw = T) 0.2955984 0.2574537 1.148162 2.536968e-01
## ------------------------------
##
## Fitting model number 2
## term estimate std.error statistic p.value
## 1 (Intercept) 0.1108125 0.11730063 0.944688 3.471654e-01
## 2 poly(x, i, raw = T)1 0.9998146 0.09932268 10.066327 9.641512e-17
## 3 poly(x, i, raw = T)2 -2.0021237 0.08043332 -24.891719 6.544936e-44
## ------------------------------
##
## Fitting model number 3
## term estimate std.error statistic p.value
## 1 (Intercept) 0.10255628 0.11784276 0.8702808 3.863179e-01
## 2 poly(x, i, raw = T)1 1.14371980 0.19402953 5.8945657 5.577118e-08
## 3 poly(x, i, raw = T)2 -1.96706790 0.09018675 -21.8110512 5.660371e-39
## 4 poly(x, i, raw = T)3 -0.06382361 0.07389037 -0.8637609 3.898723e-01
## ------------------------------
##
## Fitting model number 4
## term estimate std.error statistic p.value
## 1 (Intercept) 0.1910793 0.13904404 1.374236 1.726019e-01
## 2 poly(x, i, raw = T)1 1.2440514 0.21108494 5.893606 5.731997e-08
## 3 poly(x, i, raw = T)2 -2.2415022 0.24703741 -9.073533 1.578905e-14
## 4 poly(x, i, raw = T)3 -0.1339425 0.09429305 -1.420491 1.587385e-01
## 5 poly(x, i, raw = T)4 0.0835760 0.07006355 1.192860 2.358950e-01
## ------------------------------
cv.error
## [1] 6.3517417 0.8392001 0.8475700 0.8737288
We redo part (c) using seed number 2:
set.seed (2)
x <- rnorm (100)
y <- x - 2*x^2 + rnorm (100)
my.data <- data.frame(y, x, x2 = x^2, x3 = x^3, x4 = x^4)
## Fitting model number 1
## term estimate std.error statistic p.value
## 1 (Intercept) -2.643414 0.3071591 -8.606008 1.270675e-13
## 2 poly(x, i, raw = T) 0.818052 0.2659888 3.075512 2.723382e-03
## ------------------------------
##
## Fitting model number 2
## term estimate std.error statistic p.value
## 1 (Intercept) -0.04358682 0.13351347 -0.3264601 7.447796e-01
## 2 poly(x, i, raw = T)1 0.94929472 0.08593825 11.0462416 7.517248e-19
## 3 poly(x, i, raw = T)2 -1.94657385 0.06698738 -29.0588150 1.149731e-49
## ------------------------------
##
## Fitting model number 3
## term estimate std.error statistic p.value
## 1 (Intercept) -0.03784714 0.13674936 -0.2767628 7.825572e-01
## 2 poly(x, i, raw = T)1 0.98672964 0.19271204 5.1202283 1.569412e-06
## 3 poly(x, i, raw = T)2 -1.94955821 0.06870552 -28.3755706 1.785800e-48
## 4 poly(x, i, raw = T)3 -0.01248854 0.05747270 -0.2172952 8.284394e-01
## ------------------------------
##
## Fitting model number 4
## term estimate std.error statistic p.value
## 1 (Intercept) -0.03007907 0.16417497 -0.18321350 8.550210e-01
## 2 poly(x, i, raw = T)1 0.98183587 0.20180148 4.86535506 4.528389e-06
## 3 poly(x, i, raw = T)2 -1.96900750 0.23512000 -8.37447896 4.863204e-13
## 4 poly(x, i, raw = T)3 -0.01033050 0.06292451 -0.16417293 8.699438e-01
## 5 poly(x, i, raw = T)4 0.00451020 0.05211799 0.08653825 9.312207e-01
## ------------------------------
It seems that model #2 is the best model. The \(x^3\) and \(x^4\) terms do not seem to contribute explaining the variation in the response variable \(y\).
cv.error
## [1] 9.858301 1.004410 1.018030 1.035601
The new results are close to, but not exactly the same as what we got in (c), since the seed number is different.
We define the function min.point to overlay the plot of MSE with the minimum value:
min.point <- function(x, sign){
# superimpose minimum/maximum value on a plot
min.idx <- which.min(x)
x.min <- x[min.idx]
points(min.idx, sign*x.min, col ="red", cex = 2, pch = 20)
}
We will use this function again in chapter 6. Now, we plot the cross validation error values against each model:
plot(powers, cv.error, type = "l", main = "MSE in each model",
xlab = "Model Number", ylab = "MSE")
min.point(cv.error, 1)
Based on the cross-validation results, the minimum error is obtained in model number 2, which matches with our underlying model and agrees with the least squares results. Let’s take one more look at the result for number 2:
tidy(glm.fit[[2]])
## term estimate std.error statistic p.value
## 1 (Intercept) -0.04358682 0.13351347 -0.3264601 7.447796e-01
## 2 poly(x, i, raw = T)1 0.94929472 0.08593825 11.0462416 7.517248e-19
## 3 poly(x, i, raw = T)2 -1.94657385 0.06698738 -29.0588150 1.149731e-49
With p-values less than \(10^{-19}\), the coefficient estimates are statistically significant, and also close to the true coefficients.
Problem 6.2(a,b), page 259-260.
The LASSO, relative to least squares, is more flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Ridge regression, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Problem 6.8, page 262.
set.seed(1234)
x <- rnorm(100)
eps <- rnorm(100)
betas <- matrix(c(1, 2, 3, 4))
y <- cbind(1, x, x^2, x^3)%*%betas + eps
We need the following libraries:
library(leaps)
library(glmnet)
library(reshape2)
library(ggplot2)
library(dplyr)
We define the function subset.select.reg that runs subset selection regression with 3 methods: exhaustive, forward selection, and backward selection; shows grid of variables used in each regression; then plots RSS, negative Adjust \(R^2\), \(C_p\), and BIC against number of variables
Then, we perform the subset selection with the three methods: exhaustive search, forward stepwise selection and backwards stepwise selection:
regs <- lapply(c("exhaustive", "forward", "backward"), function(a) subset.select.reg(a, x, y))
Since this is a relatively small set of data, the runtime does not differ much among the different methods. In all methods, model #3 yields the best value for \(C_p\), BIC, and adjusted \(R^2\). Although the actual min value of RSS seen in model #8, the gain is not significant. Let’s take a look at the coefficient estimates in #3 of the exhaustive method:
coef(regs[[1]], 3)
## (Intercept) xpoly(x)1 xpoly(x)2 xpoly(x)3
## 1.132470 1.912586 2.893627 4.032305
We define the function my.lasso that fits a lasso model using glmnet, and select the optimal value of \(\lambda\) using cv.glmnet, plots of the cross-validation error as a function of \(\log\lambda\)
my.lasso(x,y)
## LASSO MSE:[1] 1.051734
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.183271
## 1 1.841550
## 2 2.833649
## 3 4.027686
## 4 .
## 5 .
## 6 .
## 7 .
## 8 .
## 9 .
## 10 .
The best model associated with the best \(\lambda\) is the model with four coefficients \(\beta_0, \beta_1, \beta_2\) and \(\beta_3\), with a prediction error of 1.051734.
We redefine y, then rerun our previous analysis:
betas <- matrix(c(5,3))
new.y <- cbind(1, x^7)%*%betas
# new.data <- data.frame(new.y, x)
new.regs <- lapply(c("exhaustive", "forward", "backward"), function(a) subset.select.reg(a, x, new.y))
coef(regs[[2]], 3)
## (Intercept) xpoly(x)1 xpoly(x)2 xpoly(x)3
## 1.132470 1.912586 2.893627 4.032305
my.lasso(x, new.y)
## LASSO MSE:[1] 91.61023
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.898808
## 1 .
## 2 .
## 3 .
## 4 .
## 5 .
## 6 .
## 7 2.904022
## 8 .
## 9 .
## 10 .
LASSO yields the underlying model with coefficient estimate rather close to the true values: \(\beta_0 = 5\) and \(\beta_1 = 3\). The mean squared error is 91.61023, which is better than any of the subset selection regression fit.
Problem 6.9, page 263.
library(leaps)
library(glmnet)
library(pls)
library(ISLR)
attach(College)
n <- nrow(College)
set.seed(1234)
train <- sample(n, floor(0.8*n)) # randomly select 80% of data to train
test <- setdiff(1:n, train) # leftover indices
College$Private <- as.numeric(College$Private)
College.train <- College[train,]
College.test <- College[test,]
Apps.train <- College.train$Apps
Apps.test <- College.test$Apps
We first define the predict method of subsetreg:
predict.regsubsets <- function(object, newdata, id){
# Gareth James textbook: page 249
form <- as.formula(object$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(object, id = id)
xvars <- names(coefi)
mat[,xvars]%*% coefi
}
Subset Selection:
subset.regs <- regsubsets(Apps ~ ., data = College.train)
subset.summary <- summary(subset.regs)
subset.pred <- predict(subset.regs, College.test, 1)
mean((subset.pred - College.test$Apps)^2)
## [1] 1106664
x.new <- as.matrix(College.test[, -2])
lasso.ridge(x,y,0)
## LASSO MSE:[1] 981795.8
## 18 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.235857e+03
## Private -4.746100e+02
## Accept 9.949284e-01
## Enroll 4.386823e-01
## Top10perc 2.577701e+01
## Top25perc 3.567258e-01
## F.Undergrad 7.478257e-02
## P.Undergrad 4.658016e-02
## Outstate -2.848505e-02
## Room.Board 2.232910e-01
## Books 7.952077e-02
## Personal 2.644455e-02
## PhD -3.922425e+00
## Terminal -4.258764e+00
## S.F.Ratio 1.758636e+01
## perc.alumni -6.213019e+00
## Expend 8.285774e-02
## Grad.Rate 9.762872e+00
lasso.ridge(x,y, 1)
## LASSO MSE:[1] 891847.1
## 18 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -417.55816302
## Private -406.46603056
## Accept 1.57615553
## Enroll -0.83775182
## Top10perc 47.79311905
## Top25perc -12.49563501
## F.Undergrad 0.05497218
## P.Undergrad 0.05440484
## Outstate -0.09621718
## Room.Board 0.16168730
## Books .
## Personal 0.07431700
## PhD -8.44513961
## Terminal -2.62660814
## S.F.Ratio 24.97293446
## perc.alumni 3.13570081
## Expend 0.08914764
## Grad.Rate 7.16145639
set.seed (1234)
pcr.fit <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation ="CV")
summary(pcr.fit)
## Data: X dimension: 621 17
## Y dimension: 621 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4043 3982 2152 2155 1744 1708 1714
## adjCV 4043 3982 2148 2153 1734 1697 1708
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1693 1651 1592 1583 1585 1586 1593
## adjCV 1699 1637 1587 1580 1582 1582 1589
## 14 comps 15 comps 16 comps 17 comps
## CV 1595 1409 1241 1186
## adjCV 1591 1391 1232 1178
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 31.667 57.38 64.43 70.33 75.51 80.42 84.03
## Apps 4.478 72.90 73.09 82.49 83.54 83.54 83.73
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 87.51 90.53 92.99 95.13 96.89 97.98 98.74
## Apps 85.18 85.58 85.79 85.82 85.87 85.87 85.93
## 15 comps 16 comps 17 comps
## X 99.37 99.84 100.00
## Apps 91.40 92.75 93.21
validationplot(pcr.fit ,val.type = "MSEP")
pcr.pred <- predict(pcr.fit, x.new, ncomp = 16)
mean((pcr.pred - Apps.test)^2)
## [1] 904152.9
pls.fit <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 621 17
## Y dimension: 621 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4043 1950 1691 1536 1460 1302 1234
## adjCV 4043 1946 1693 1530 1444 1285 1224
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1217 1205 1193 1190 1191 1190 1189
## adjCV 1208 1197 1185 1182 1183 1182 1181
## 14 comps 15 comps 16 comps 17 comps
## CV 1189 1188 1188 1188
## adjCV 1181 1180 1180 1180
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 26.09 43.86 62.89 65.67 68.68 73.90 77.98
## Apps 77.84 84.27 87.59 90.71 92.59 92.94 93.01
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 81.20 82.69 84.83 86.62 88.51 91.99 95.55
## Apps 93.06 93.15 93.18 93.20 93.21 93.21 93.21
## 15 comps 16 comps 17 comps
## X 97.14 98.91 100.00
## Apps 93.21 93.21 93.21
validationplot(pls.fit ,val.type = "MSEP")
pls.pred <- predict(pls.fit, x.new, ncomp = 7)
mean((pls.pred - Apps.test)^2)
## [1] 907293.3
With the minimum prediction error of roughly 900,000 applications, I would conclude that our predictions are not very accurate. The prediction errors of LASSO, ridge regression, PCR, and PLS are close to each other (around 900,000 applications). The best prediction error of the linear models (1,100,000) is still 200,000 applications more than in other methods’ errors.