STAA577 HW03 | Applied Problems

(Applied Q8)

In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.

(8a) Question

Use the rnorm() function to generate a predictor \(X\) of length \(n = 100\), as well as a noise vector \(\epsilon\) of length \(n = 100\).

set.seed(16); n=100
x <- rnorm(n) #default mean = 0, and default std dev = 1... accepting those
error <- rnorm(n)

(8b) Question

Generate a response vector \(Y\) of length \(n=100\) according to the model \[Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon \]

b0 <- 17; b1 <- 3; b2 <- .8; b3 <- -2 #set coefficients
y <- b0 + b1*x + b2*x^2 + b3*x^3 + error #f(x)

(8c) Question

Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors \(X, X^2,\ldots,X^{10}\). What is the best model obtained according to \(C_p\), \(BIC\), and adjusted \(R^2\)? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both \(X\) and \(Y\).

(8c) Answer If we define ‘best’ as the most marginal reduction in error for each variable added, then all models indicate by their shape that 3 variables provide the best fit. The model with coefficients is: \[Y = 16.973 + 3.007X + 0.842X^2 - 1.986X^3 \]

library(leaps)

#Create data.frame with x^1,...x^10
x.new <- x
for(i in 2:10){
    x.new <- cbind(x.new,x^i)
}
colnames(x.new) <- paste("x", 1:ncol(x.new), sep="")
data.8 <- data.frame(cbind(y, x.new))
regfit.8 <- regsubsets(y~ ., data=data.8, nvmax=10) 
regsum.8 <- summary(regfit.8)

par(mfrow=c(2,2))

plot(regsum.8$cp, type="l", col=4, xlab = "# Variables", ylab = "Mallows Cp") 
points(which.min(regsum.8$cp),regsum.8$cp[which.min(regsum.8$cp)], col=4, pch = 15, cex=2)

plot(regsum.8$bic, type="l", col=6, xlab = "# Variables", ylab = "Bayes Information Criterion")
points(which.min(regsum.8$bic),regsum.8$bic[which.min(regsum.8$bic)], col=6, pch = 16, cex=2)

plot(regsum.8$adjr2, type="l", col=3, xlab = "# Variables", ylab = "Adjusted R Squared")
points(which.max(regsum.8$adjr2),regsum.8$adjr2[which.max(regsum.8$adjr2)], col=3, pch = 17, cex=2)

coef(regfit.8,which.min(regsum.8$bic))

## (Intercept)          x1          x2          x3 
##  16.9733268   3.0071671   0.8415477  -1.9856216

(8d) Question

Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?

(8d) Answer The performance of both forward and backward stepwise selection methods yields the same recommended model as that in part(c): \[Y = 16.973 + 3.007X + 0.842X^2 - 1.986X^3 \]

regfit.8.fwd <- regsubsets(y~ ., data=data.8, nvmax=10, method="forward") 
regsum.8.fwd <- summary(regfit.8.fwd)

par(mfrow=c(2,2))

plot(regsum.8.fwd$cp, type="l", col=4, main="Forward Stepwise Selection", xlab = "# Variables", ylab = "Mallows Cp") 
points(which.min(regsum.8.fwd$cp),regsum.8.fwd$cp[which.min(regsum.8.fwd$cp)], col=4, pch = 15, cex=2)

plot(regsum.8.fwd$bic, type="l", col=6, main="Forward Stepwise Selection", xlab = "# Variables", ylab = "Bayes Information Criterion")
points(which.min(regsum.8.fwd$bic),regsum.8.fwd$bic[which.min(regsum.8.fwd$bic)], col=6, pch = 16, cex=2)

plot(regsum.8.fwd$adjr2, type="l", col=3, main="Forward Stepwise Selection", xlab = "# Variables", ylab = "Adjusted R Squared")
points(which.max(regsum.8.fwd$adjr2),regsum.8.fwd$adjr2[which.max(regsum.8.fwd$adjr2)], col=3, pch = 17, cex=2)

coef(regfit.8.fwd,which.min(regsum.8.fwd$bic))

## (Intercept)          x1          x2          x3 
##  16.9733268   3.0071671   0.8415477  -1.9856216

regfit.8.bkwd <- regsubsets(y~ ., data=data.8, nvmax=10, method="backward") 
regsum.8.bkwd <- summary(regfit.8.bkwd)

par(mfrow=c(2,2))

plot(regsum.8.bkwd$cp, type="l", col=4, main="backward Stepwise Selection", xlab = "# Variables", ylab = "Mallows Cp") 
points(which.min(regsum.8.bkwd$cp),regsum.8.bkwd$cp[which.min(regsum.8.bkwd$cp)], col=4, pch = 15, cex=2)

plot(regsum.8.bkwd$bic, type="l", col=6, main="backward Stepwise Selection", xlab = "# Variables", ylab = "Bayes Information Criterion")
points(which.min(regsum.8.bkwd$bic),regsum.8.bkwd$bic[which.min(regsum.8.bkwd$bic)], col=6, pch = 16, cex=2)

plot(regsum.8.bkwd$adjr2, type="l", col=3, main="backward Stepwise Selection", xlab = "# Variables", ylab = "Adjusted R Squared")
points(which.max(regsum.8.bkwd$adjr2),regsum.8.bkwd$adjr2[which.max(regsum.8.bkwd$adjr2)], col=3, pch = 17, cex=2)

coef(regfit.8.bkwd,which.min(regsum.8.bkwd$bic))

## (Intercept)          x1          x2          x3 
##  16.9733268   3.0071671   0.8415477  -1.9856216

(8e) Question

Now fit a lasso model to the simulated data, again using \(X, X^2,\ldots,X^{10}\) as predictors. Use cross-validation to select the optimal value of \(\lambda\). Create plots of the cross-validation error as a function of \(\lambda\). Report the resulting coefficient estimates, and discuss the results obtained.

(8e) Answer The resulting coefficents are similar to the previous approaches, with the addition of a 4th coeffiecient. Proposed model: \[Y = 17.007 + 2.924X + 0.809X^2 - 1.963X^3 + 0.070X^9 \]

#load library and proceed. Note: alpha is the elastic net parameter, alpha \in [0,1]
#       alpha=0 selects ridge regression, alpha=1 selects lasso
library(glmnet)

## Loading required package: Matrix

## Loading required package: foreach

## Loaded glmnet 2.0-16

#default for cv.glmnet() is 10 fold cross validation
set.seed(32)
grid.8 <- 10^seq(10,-2,length= nrow(data.8))
train.8 <- sample(1:nrow(data.8), nrow(data.8)/2) #create training set
test.8 <- (-train.8)
y.test.8 <-  y[test.8]
               
x.8.e <- model.matrix(y~.,data.8)[,-1] #x matrix
y.8.e <- data.8[,1] #y vector

cv.8.out <- cv.glmnet(x.8.e[train.8,], y.8.e[train.8], alpha=1)
plot(cv.8.out)

lambda.lasso <- cv.8.out$lambda.min
log(lambda.lasso) #confirm min aligns with min on plot

## [1] -4.087388

#refit on full data set with proposed lambda
out.8.e <- glmnet(x.8.e, y.8.e, alpha = 1, lambda=grid.8)
predict(out.8.e, type="coefficients", s=lambda.lasso)

## 11 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  1.700699e+01
## x1           2.924235e+00
## x2           8.086233e-01
## x3          -1.962989e+00
## x4           .           
## x5           .           
## x6           .           
## x7           .           
## x8           .           
## x9           7.019081e-08
## x10          .

(8f) Question

Now generate a response vector \(Y\) according to the model \[ Y= \beta_0 + \beta_7X^7 + \epsilon \] and perform best subset selection and the lasso. Discuss the results obtained.

(8f) Answer Using the best subsets approach, Mallows \(C_p\) and Adjusted \(R^2\) both propose using 4 coefficients for \(X^2, X^4, X^7,\) and \(X^9\) respectively. Bayes (bic) proposed 1 coefficient. In all three the recommended coefficients for intercept and \(X^7\) were very close to the values I set in the actual function for \(Y\).

The lasso approach correctly identified that the intercept and \(\beta_7X^7\) were the variables to use. However it’s estimates for these were farther off than the best subsets approach. The intercept I set is 17, and the predicted intercept using lasso is just over 19. The \(\beta_7\) coefficient which I set to be 1.4, is proposed by lasso to be set at 1.33.

b7 <- 1.4 #define coefficient b7, not defined in previous problem
yf <- b0 + b7*x^7+ error #f(x)

#Create updated data.frame with x^1,...x^10
data.8.f <- data.frame(cbind(yf, x.new))

#best subsets approach
regfit.8.f <- regsubsets(yf~ ., data=data.8.f, nvmax=10) 
regsum.8.f <- summary(regfit.8.f)

par(mfrow=c(2,2))

plot(regsum.8.f$cp, type="l", col=4, main = "Y=B0 + B7X^7 + err | Best Subset", xlab = "# Variables", ylab = "Mallows Cp") 
points(which.min(regsum.8.f$cp),regsum.8.f$cp[which.min(regsum.8.f$cp)], col=4, pch = 15, cex=2)

plot(regsum.8.f$bic, type="l", col=6, main = "Y=B0 + B7X^7 + err | Best Subset", xlab = "# Variables", ylab = "Bayes Information Criterion")
points(which.min(regsum.8.f$bic),regsum.8.f$bic[which.min(regsum.8.f$bic)], col=6, pch = 16, cex=2)

plot(regsum.8.f$adjr2, type="l", col=3, main = "Y=B0 + B7X^7 + err | Best Subset", xlab = "# Variables", ylab = "Adjusted R Squared")
points(which.max(regsum.8.f$adjr2),regsum.8.f$adjr2[which.max(regsum.8.f$adjr2)], col=3, pch = 17, cex=2)

coef(regfit.8.f,which.min(regsum.8.f$cp))

##  (Intercept)           x2           x4           x7           x9 
## 16.774779768  0.622476173 -0.173878611  1.390280971  0.001273715

coef(regfit.8.f,which.min(regsum.8.f$bic))

## (Intercept)          x7 
##   17.007998    1.400317

coef(regfit.8.f,which.max(regsum.8.f$adjr2))

##  (Intercept)           x2           x4           x7           x9 
## 16.774779768  0.622476173 -0.173878611  1.390280971  0.001273715

#lasso approach (use same seed as in previous)
set.seed(32)
grid.8.f <- 10^seq(10,-2,length= nrow(data.8.f))
train.8.f <- sample(1:nrow(data.8.f), nrow(data.8.f)/2) #create training set
test.8.f <- (-train.8.f)
y.test.8.f <-  yf[test.8.f]

x.8.f <- model.matrix(yf~.,data.8.f)[,-1] #x matrix
y.8.f <- data.8.f[,1] #y vector

cv.8.f.out <- cv.glmnet(x.8.f[train.8.f,], y.8.f[train.8.f], alpha=1)
plot(cv.8.f.out, main = "Y=B0 + B7X^7 + err | Lasso", xlim=c(0,7), ylim= c(570000,590000))

lambda.lasso.f <- cv.8.f.out$lambda.min
lambda.lasso.f

## [1] 24.00218

log(lambda.lasso.f) #confirm min aligns with min on plot

## [1] 3.178145

#refit on full data set with proposed lambda
out.8.f <- glmnet(x.8.f, y.8.f, alpha = 1, lambda=grid.8.f)
predict(out.8.f, type="coefficients", s=lambda.lasso.f)

## 11 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept) 19.083907
## x1           .       
## x2           .       
## x3           .       
## x4           .       
## x5           .       
## x6           .       
## x7           1.337827
## x8           .       
## x9           .       
## x10          .

(Applied Q9)

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(9a) Question

Split the data set into a training set and a test set.

library(ISLR)
head(College) #review College data set

##                              Private Apps Accept Enroll Top10perc
## Abilene Christian University     Yes 1660   1232    721        23
## Adelphi University               Yes 2186   1924    512        16
## Adrian College                   Yes 1428   1097    336        22
## Agnes Scott College              Yes  417    349    137        60
## Alaska Pacific University        Yes  193    146     55        16
## Albertson College                Yes  587    479    158        38
##                              Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University        52        2885         537     7440
## Adelphi University                  29        2683        1227    12280
## Adrian College                      50        1036          99    11250
## Agnes Scott College                 89         510          63    12960
## Alaska Pacific University           44         249         869     7560
## Albertson College                   62         678          41    13500
##                              Room.Board Books Personal PhD Terminal
## Abilene Christian University       3300   450     2200  70       78
## Adelphi University                 6450   750     1500  29       30
## Adrian College                     3750   400     1165  53       66
## Agnes Scott College                5450   450      875  92       97
## Alaska Pacific University          4120   800     1500  76       72
## Albertson College                  3335   500      675  67       73
##                              S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University      18.1          12   7041        60
## Adelphi University                12.2          16  10527        56
## Adrian College                    12.9          30   8735        54
## Agnes Scott College                7.7          37  19016        59
## Alaska Pacific University         11.9           2  10922        15
## Albertson College                  9.4          11   9727        55

dim(College)

## [1] 777  18

#9(a) create training and test sets
set.seed(1971)
train.rows.9 <- sample (1:nrow(College), nrow(College)/2)
test.rows.9 <- (-train.rows.9)  #just realized what this is doing... by negating, it simply says don't include those rows

train.9 <- College[train.rows.9,]
test.9 <- College[test.rows.9,]

(9b) Question

Fit a linear model using least squares on the training set, and report the test error obtained.

(9b) Answer Test error is 1.257161\(\times 10^6\)

lm.9 <- lm(Apps ~ ., data = train.9) #create model
pred.9 <- predict(lm.9, test.9) #create predictions on test set
mean((pred.9 - test.9$Apps)^2) #Mean Sq Error Linear Model

## [1] 1257161

(9c) Question

Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.

(9c) Answer Test error is 1.257143\(\times 10^6\). This is very similar to what I got with least squares. Many iterations would need to be done to determine if one approach was better than the other.

grid.9 <- 10^seq(10,-2,length= 100) 
ridge.train <- model.matrix(Apps ~.,data = train.9)
ridge.test <- model.matrix(Apps ~.,data = test.9)

#find lambda w/cross validation
ridge.cv <- cv.glmnet(ridge.train, train.9$Apps, alpha=0, lambda=grid.9, thresh=1e-12)

#create prediction using lambda found w/cross validation
ridge.fit <- glmnet(ridge.train, train.9$Apps, alpha=0, lambda=grid.9, thresh=1e-12)
ridge.pred <- predict(ridge.fit, s= ridge.cv$lambda.min, newx = ridge.test)
mean((ridge.pred - test.9$Apps)^2) #Mean Sq Error ridge

## [1] 1257143

(9d) Question

Fit a lasso model on the training set, with \(\lambda\) chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.

(9d) Answer Test error is 1.416073 \(\times 10^6\). This is considerably greater than for least squares or ridge regression. Again, a number of iterations would need to be done to make conclusions about the best modeling approach with any certainty. The coefficients used aligned with the following predictors: ’’’Private,Accept,Top10perc, andExpend. Note that the proposed coefficient associated withExpend``` is very small. That predictor may not be necessary to making good predictions with this data set.

#find lambda w/cross validation
lasso.cv <- cv.glmnet(ridge.train, train.9$Apps, alpha=1, lambda=grid.9, thresh=1e-12)

#create prediction using lambda found w/cross validation
lasso.fit <- glmnet(ridge.train, train.9$Apps, alpha=1, lambda=grid.9, thresh=1e-12)
lasso.pred <- predict(lasso.fit, s= lasso.cv$lambda.min, newx = ridge.test)
mean((lasso.pred - test.9$Apps)^2) #Mean Sq Error lasso

## [1] 1416073

#find non-zero coefficients
lasso.coef= predict(lasso.fit, s= lasso.cv$lambda.min, type="coefficients")
lasso.coef

## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -6.652029e+02
## (Intercept)  .           
## PrivateYes  -2.568027e+01
## Accept       1.450968e+00
## Enroll       .           
## Top10perc    2.242047e+01
## Top25perc    .           
## F.Undergrad  .           
## P.Undergrad  .           
## Outstate     .           
## Room.Board   .           
## Books        .           
## Personal     .           
## PhD          .           
## Terminal     .           
## S.F.Ratio    .           
## perc.alumni  .           
## Expend       9.602596e-03
## Grad.Rate    .

#lasso.coef[lasso.coef!=0]  #only pull the non-zero coeffic, but can't tell which coeffic is which

STAA577 HW03 | Applied Problems | Exercises 6.8

Jennifer Mahoney

April 17, 2018