In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.
Use the rnorm() function to generate a predictor \(X\) of length \(n = 100\), as well as a noise vector \(\epsilon\) of length \(n = 100\).
set.seed(16); n=100
x <- rnorm(n) #default mean = 0, and default std dev = 1... accepting those
error <- rnorm(n)
Generate a response vector \(Y\) of length \(n=100\) according to the model \[Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon \]
b0 <- 17; b1 <- 3; b2 <- .8; b3 <- -2 #set coefficients
y <- b0 + b1*x + b2*x^2 + b3*x^3 + error #f(x)
Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors \(X, X^2,\ldots,X^{10}\). What is the best model obtained according to \(C_p\), \(BIC\), and adjusted \(R^2\)? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both \(X\) and \(Y\).
(8c) Answer If we define ‘best’ as the most marginal reduction in error for each variable added, then all models indicate by their shape that 3 variables provide the best fit. The model with coefficients is: \[Y = 16.973 + 3.007X + 0.842X^2 - 1.986X^3 \]
library(leaps)
#Create data.frame with x^1,...x^10
x.new <- x
for(i in 2:10){
x.new <- cbind(x.new,x^i)
}
colnames(x.new) <- paste("x", 1:ncol(x.new), sep="")
data.8 <- data.frame(cbind(y, x.new))
regfit.8 <- regsubsets(y~ ., data=data.8, nvmax=10)
regsum.8 <- summary(regfit.8)
par(mfrow=c(2,2))
plot(regsum.8$cp, type="l", col=4, xlab = "# Variables", ylab = "Mallows Cp")
points(which.min(regsum.8$cp),regsum.8$cp[which.min(regsum.8$cp)], col=4, pch = 15, cex=2)
plot(regsum.8$bic, type="l", col=6, xlab = "# Variables", ylab = "Bayes Information Criterion")
points(which.min(regsum.8$bic),regsum.8$bic[which.min(regsum.8$bic)], col=6, pch = 16, cex=2)
plot(regsum.8$adjr2, type="l", col=3, xlab = "# Variables", ylab = "Adjusted R Squared")
points(which.max(regsum.8$adjr2),regsum.8$adjr2[which.max(regsum.8$adjr2)], col=3, pch = 17, cex=2)
coef(regfit.8,which.min(regsum.8$bic))
## (Intercept) x1 x2 x3
## 16.9733268 3.0071671 0.8415477 -1.9856216
Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?
(8d) Answer The performance of both forward and backward stepwise selection methods yields the same recommended model as that in part(c): \[Y = 16.973 + 3.007X + 0.842X^2 - 1.986X^3 \]
regfit.8.fwd <- regsubsets(y~ ., data=data.8, nvmax=10, method="forward")
regsum.8.fwd <- summary(regfit.8.fwd)
par(mfrow=c(2,2))
plot(regsum.8.fwd$cp, type="l", col=4, main="Forward Stepwise Selection", xlab = "# Variables", ylab = "Mallows Cp")
points(which.min(regsum.8.fwd$cp),regsum.8.fwd$cp[which.min(regsum.8.fwd$cp)], col=4, pch = 15, cex=2)
plot(regsum.8.fwd$bic, type="l", col=6, main="Forward Stepwise Selection", xlab = "# Variables", ylab = "Bayes Information Criterion")
points(which.min(regsum.8.fwd$bic),regsum.8.fwd$bic[which.min(regsum.8.fwd$bic)], col=6, pch = 16, cex=2)
plot(regsum.8.fwd$adjr2, type="l", col=3, main="Forward Stepwise Selection", xlab = "# Variables", ylab = "Adjusted R Squared")
points(which.max(regsum.8.fwd$adjr2),regsum.8.fwd$adjr2[which.max(regsum.8.fwd$adjr2)], col=3, pch = 17, cex=2)
coef(regfit.8.fwd,which.min(regsum.8.fwd$bic))
## (Intercept) x1 x2 x3
## 16.9733268 3.0071671 0.8415477 -1.9856216
regfit.8.bkwd <- regsubsets(y~ ., data=data.8, nvmax=10, method="backward")
regsum.8.bkwd <- summary(regfit.8.bkwd)
par(mfrow=c(2,2))
plot(regsum.8.bkwd$cp, type="l", col=4, main="backward Stepwise Selection", xlab = "# Variables", ylab = "Mallows Cp")
points(which.min(regsum.8.bkwd$cp),regsum.8.bkwd$cp[which.min(regsum.8.bkwd$cp)], col=4, pch = 15, cex=2)
plot(regsum.8.bkwd$bic, type="l", col=6, main="backward Stepwise Selection", xlab = "# Variables", ylab = "Bayes Information Criterion")
points(which.min(regsum.8.bkwd$bic),regsum.8.bkwd$bic[which.min(regsum.8.bkwd$bic)], col=6, pch = 16, cex=2)
plot(regsum.8.bkwd$adjr2, type="l", col=3, main="backward Stepwise Selection", xlab = "# Variables", ylab = "Adjusted R Squared")
points(which.max(regsum.8.bkwd$adjr2),regsum.8.bkwd$adjr2[which.max(regsum.8.bkwd$adjr2)], col=3, pch = 17, cex=2)
coef(regfit.8.bkwd,which.min(regsum.8.bkwd$bic))
## (Intercept) x1 x2 x3
## 16.9733268 3.0071671 0.8415477 -1.9856216
Now fit a lasso model to the simulated data, again using \(X, X^2,\ldots,X^{10}\) as predictors. Use cross-validation to select the optimal value of \(\lambda\). Create plots of the cross-validation error as a function of \(\lambda\). Report the resulting coefficient estimates, and discuss the results obtained.
(8e) Answer The resulting coefficents are similar to the previous approaches, with the addition of a 4th coeffiecient. Proposed model: \[Y = 17.007 + 2.924X + 0.809X^2 - 1.963X^3 + 0.070X^9 \]
#load library and proceed. Note: alpha is the elastic net parameter, alpha \in [0,1]
# alpha=0 selects ridge regression, alpha=1 selects lasso
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
#default for cv.glmnet() is 10 fold cross validation
set.seed(32)
grid.8 <- 10^seq(10,-2,length= nrow(data.8))
train.8 <- sample(1:nrow(data.8), nrow(data.8)/2) #create training set
test.8 <- (-train.8)
y.test.8 <- y[test.8]
x.8.e <- model.matrix(y~.,data.8)[,-1] #x matrix
y.8.e <- data.8[,1] #y vector
cv.8.out <- cv.glmnet(x.8.e[train.8,], y.8.e[train.8], alpha=1)
plot(cv.8.out)
lambda.lasso <- cv.8.out$lambda.min
log(lambda.lasso) #confirm min aligns with min on plot
## [1] -4.087388
#refit on full data set with proposed lambda
out.8.e <- glmnet(x.8.e, y.8.e, alpha = 1, lambda=grid.8)
predict(out.8.e, type="coefficients", s=lambda.lasso)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.700699e+01
## x1 2.924235e+00
## x2 8.086233e-01
## x3 -1.962989e+00
## x4 .
## x5 .
## x6 .
## x7 .
## x8 .
## x9 7.019081e-08
## x10 .
Now generate a response vector \(Y\) according to the model \[ Y= \beta_0 + \beta_7X^7 + \epsilon \] and perform best subset selection and the lasso. Discuss the results obtained.
(8f) Answer Using the best subsets approach, Mallows \(C_p\) and Adjusted \(R^2\) both propose using 4 coefficients for \(X^2, X^4, X^7,\) and \(X^9\) respectively. Bayes (bic) proposed 1 coefficient. In all three the recommended coefficients for intercept and \(X^7\) were very close to the values I set in the actual function for \(Y\).
The lasso approach correctly identified that the intercept and \(\beta_7X^7\) were the variables to use. However it’s estimates for these were farther off than the best subsets approach. The intercept I set is 17, and the predicted intercept using lasso is just over 19. The \(\beta_7\) coefficient which I set to be 1.4, is proposed by lasso to be set at 1.33.
b7 <- 1.4 #define coefficient b7, not defined in previous problem
yf <- b0 + b7*x^7+ error #f(x)
#Create updated data.frame with x^1,...x^10
data.8.f <- data.frame(cbind(yf, x.new))
#best subsets approach
regfit.8.f <- regsubsets(yf~ ., data=data.8.f, nvmax=10)
regsum.8.f <- summary(regfit.8.f)
par(mfrow=c(2,2))
plot(regsum.8.f$cp, type="l", col=4, main = "Y=B0 + B7X^7 + err | Best Subset", xlab = "# Variables", ylab = "Mallows Cp")
points(which.min(regsum.8.f$cp),regsum.8.f$cp[which.min(regsum.8.f$cp)], col=4, pch = 15, cex=2)
plot(regsum.8.f$bic, type="l", col=6, main = "Y=B0 + B7X^7 + err | Best Subset", xlab = "# Variables", ylab = "Bayes Information Criterion")
points(which.min(regsum.8.f$bic),regsum.8.f$bic[which.min(regsum.8.f$bic)], col=6, pch = 16, cex=2)
plot(regsum.8.f$adjr2, type="l", col=3, main = "Y=B0 + B7X^7 + err | Best Subset", xlab = "# Variables", ylab = "Adjusted R Squared")
points(which.max(regsum.8.f$adjr2),regsum.8.f$adjr2[which.max(regsum.8.f$adjr2)], col=3, pch = 17, cex=2)
coef(regfit.8.f,which.min(regsum.8.f$cp))
## (Intercept) x2 x4 x7 x9
## 16.774779768 0.622476173 -0.173878611 1.390280971 0.001273715
coef(regfit.8.f,which.min(regsum.8.f$bic))
## (Intercept) x7
## 17.007998 1.400317
coef(regfit.8.f,which.max(regsum.8.f$adjr2))
## (Intercept) x2 x4 x7 x9
## 16.774779768 0.622476173 -0.173878611 1.390280971 0.001273715
#lasso approach (use same seed as in previous)
set.seed(32)
grid.8.f <- 10^seq(10,-2,length= nrow(data.8.f))
train.8.f <- sample(1:nrow(data.8.f), nrow(data.8.f)/2) #create training set
test.8.f <- (-train.8.f)
y.test.8.f <- yf[test.8.f]
x.8.f <- model.matrix(yf~.,data.8.f)[,-1] #x matrix
y.8.f <- data.8.f[,1] #y vector
cv.8.f.out <- cv.glmnet(x.8.f[train.8.f,], y.8.f[train.8.f], alpha=1)
plot(cv.8.f.out, main = "Y=B0 + B7X^7 + err | Lasso", xlim=c(0,7), ylim= c(570000,590000))
lambda.lasso.f <- cv.8.f.out$lambda.min
lambda.lasso.f
## [1] 24.00218
log(lambda.lasso.f) #confirm min aligns with min on plot
## [1] 3.178145
#refit on full data set with proposed lambda
out.8.f <- glmnet(x.8.f, y.8.f, alpha = 1, lambda=grid.8.f)
predict(out.8.f, type="coefficients", s=lambda.lasso.f)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 19.083907
## x1 .
## x2 .
## x3 .
## x4 .
## x5 .
## x6 .
## x7 1.337827
## x8 .
## x9 .
## x10 .
In this exercise, we will predict the number of applications received using the other variables in the College data set.
Split the data set into a training set and a test set.
library(ISLR)
head(College) #review College data set
## Private Apps Accept Enroll Top10perc
## Abilene Christian University Yes 1660 1232 721 23
## Adelphi University Yes 2186 1924 512 16
## Adrian College Yes 1428 1097 336 22
## Agnes Scott College Yes 417 349 137 60
## Alaska Pacific University Yes 193 146 55 16
## Albertson College Yes 587 479 158 38
## Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University 52 2885 537 7440
## Adelphi University 29 2683 1227 12280
## Adrian College 50 1036 99 11250
## Agnes Scott College 89 510 63 12960
## Alaska Pacific University 44 249 869 7560
## Albertson College 62 678 41 13500
## Room.Board Books Personal PhD Terminal
## Abilene Christian University 3300 450 2200 70 78
## Adelphi University 6450 750 1500 29 30
## Adrian College 3750 400 1165 53 66
## Agnes Scott College 5450 450 875 92 97
## Alaska Pacific University 4120 800 1500 76 72
## Albertson College 3335 500 675 67 73
## S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University 18.1 12 7041 60
## Adelphi University 12.2 16 10527 56
## Adrian College 12.9 30 8735 54
## Agnes Scott College 7.7 37 19016 59
## Alaska Pacific University 11.9 2 10922 15
## Albertson College 9.4 11 9727 55
dim(College)
## [1] 777 18
#9(a) create training and test sets
set.seed(1971)
train.rows.9 <- sample (1:nrow(College), nrow(College)/2)
test.rows.9 <- (-train.rows.9) #just realized what this is doing... by negating, it simply says don't include those rows
train.9 <- College[train.rows.9,]
test.9 <- College[test.rows.9,]
Fit a linear model using least squares on the training set, and report the test error obtained.
(9b) Answer Test error is 1.257161\(\times 10^6\)
lm.9 <- lm(Apps ~ ., data = train.9) #create model
pred.9 <- predict(lm.9, test.9) #create predictions on test set
mean((pred.9 - test.9$Apps)^2) #Mean Sq Error Linear Model
## [1] 1257161
Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
(9c) Answer Test error is 1.257143\(\times 10^6\). This is very similar to what I got with least squares. Many iterations would need to be done to determine if one approach was better than the other.
grid.9 <- 10^seq(10,-2,length= 100)
ridge.train <- model.matrix(Apps ~.,data = train.9)
ridge.test <- model.matrix(Apps ~.,data = test.9)
#find lambda w/cross validation
ridge.cv <- cv.glmnet(ridge.train, train.9$Apps, alpha=0, lambda=grid.9, thresh=1e-12)
#create prediction using lambda found w/cross validation
ridge.fit <- glmnet(ridge.train, train.9$Apps, alpha=0, lambda=grid.9, thresh=1e-12)
ridge.pred <- predict(ridge.fit, s= ridge.cv$lambda.min, newx = ridge.test)
mean((ridge.pred - test.9$Apps)^2) #Mean Sq Error ridge
## [1] 1257143
Fit a lasso model on the training set, with \(\lambda\) chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.
(9d) Answer Test error is 1.416073 \(\times 10^6\). This is considerably greater than for least squares or ridge regression. Again, a number of iterations would need to be done to make conclusions about the best modeling approach with any certainty. The coefficients used aligned with the following predictors: ’’’Private,Accept,Top10perc, andExpend. Note that the proposed coefficient associated withExpend``` is very small. That predictor may not be necessary to making good predictions with this data set.
#find lambda w/cross validation
lasso.cv <- cv.glmnet(ridge.train, train.9$Apps, alpha=1, lambda=grid.9, thresh=1e-12)
#create prediction using lambda found w/cross validation
lasso.fit <- glmnet(ridge.train, train.9$Apps, alpha=1, lambda=grid.9, thresh=1e-12)
lasso.pred <- predict(lasso.fit, s= lasso.cv$lambda.min, newx = ridge.test)
mean((lasso.pred - test.9$Apps)^2) #Mean Sq Error lasso
## [1] 1416073
#find non-zero coefficients
lasso.coef= predict(lasso.fit, s= lasso.cv$lambda.min, type="coefficients")
lasso.coef
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -6.652029e+02
## (Intercept) .
## PrivateYes -2.568027e+01
## Accept 1.450968e+00
## Enroll .
## Top10perc 2.242047e+01
## Top25perc .
## F.Undergrad .
## P.Undergrad .
## Outstate .
## Room.Board .
## Books .
## Personal .
## PhD .
## Terminal .
## S.F.Ratio .
## perc.alumni .
## Expend 9.602596e-03
## Grad.Rate .
#lasso.coef[lasso.coef!=0] #only pull the non-zero coeffic, but can't tell which coeffic is which