2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Answer:

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

When the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions. Consider a simple special case with n = p, and X a diagonal matrix with 1’s on the diagonal and 0’s in all off-diagonal elements. Let us also assume that we are performing regression without an intercept. The lasso amounts to finding the coefficients such that \(\sum_{j=1}^p ( y_{j} - \beta _{j} )^{2} + \lambda \sum_{j=1}^p | \beta _{j} |\) is minimized. The lasso shrinks each least squares coefficient towards zero by a constant amount, \(\lambda /2\). The least squares coefficients that are less than \(\lambda /2\) in absolute value are shrunken entirely to zero. The type of shrinkage performed by the lasso in this simple setting is known as soft-thresholding. The fact that some lasso coefficients are shrunken entirely to zero explains why the lasso performs feature selection.
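
A minimal sketch of this soft-thresholding rule in the n = p orthonormal setting above (the helper name soft_threshold is ours, not from the text):

soft_threshold = function(beta_ls, lambda) {
  # shrink each least squares coefficient toward zero by lambda/2;
  # coefficients within lambda/2 of zero are set exactly to zero
  sign(beta_ls) * pmax(abs(beta_ls) - lambda / 2, 0)
}
soft_threshold(c(-3, -0.4, 0.2, 1.5), lambda = 1)
## [1] -2.5  0.0  0.0  1.0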

(b) Repeat (a) for ridge regression relative to least squares.

Answer:

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

As \(\lambda\) increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. In general, in situations where the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance. This means that a small change in the training data can cause a large change in the least squares coefficient estimates. In particular, when the number of variables p is almost as large as the number of observations n, the least squares estimates will be extremely variable. And if p > n, then the least squares estimates do not even have a unique solution, whereas ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance. Hence, ridge regression works best in situations where the least squares estimates have high variance.
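
For contrast, in the same special case as in (a), ridge regression shrinks every least squares coefficient by the same proportion, \(y_j/(1+\lambda)\), so coefficients approach zero but are never set exactly to zero. A minimal sketch (the helper name ridge_shrink is ours):

ridge_shrink = function(beta_ls, lambda) {
  # proportional shrinkage: no coefficient is ever exactly zeroed
  beta_ls / (1 + lambda)
}
ridge_shrink(c(-3, -0.4, 0.2, 1.5), lambda = 1)
## [1] -1.50 -0.20  0.10  0.75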

(c) Repeat (a) for non-linear methods relative to least squares.

Answer:

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

If we have a relationship Y = f(X) + \(\epsilon\), a non-linear method is more flexible because it makes fewer assumptions about the functional form of f. When the true f is substantially non-linear, the resulting decrease in bias outweighs the increase in variance, giving higher prediction accuracy; when f is close to linear, the extra flexibility buys little bias reduction and the added variance can hurt.
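
An illustrative sketch on simulated data (our own construction, not from the exercise): when the true f is non-linear, a more flexible fit trades a small increase in variance for a large drop in bias.

set.seed(1)
x.sim = runif(100)
y.sim = sin(2 * pi * x.sim) + rnorm(100, sd = 0.3)
fit.linear = lm(y.sim ~ x.sim)         # rigid: badly biased for this f
fit.flex = lm(y.sim ~ poly(x.sim, 4))  # flexible: lower bias, somewhat higher variance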

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

library(ISLR)  # College data set
attach(College)
set.seed(1)
# 80/20 train/test split
index = sample(1:nrow(College), 0.8 * nrow(College))
train = College[index, ]
test = College[-index, ]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

lm.fit = lm(Apps ~ ., data = train)
summary(lm.fit)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5555.2  -404.6    19.9   310.3  7577.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -630.58238  435.56266  -1.448 0.148209    
## PrivateYes  -388.97393  148.87623  -2.613 0.009206 ** 
## Accept         1.69123    0.04433  38.153  < 2e-16 ***
## Enroll        -1.21543    0.20873  -5.823 9.41e-09 ***
## Top10perc     50.45622    5.88174   8.578  < 2e-16 ***
## Top25perc    -13.62655    4.67321  -2.916 0.003679 ** 
## F.Undergrad    0.08271    0.03632   2.277 0.023111 *  
## P.Undergrad    0.06555    0.03367   1.947 0.052008 .  
## Outstate      -0.07562    0.01987  -3.805 0.000156 ***
## Room.Board     0.14161    0.05130   2.760 0.005947 ** 
## Books          0.21161    0.25184   0.840 0.401102    
## Personal       0.01873    0.06604   0.284 0.776803    
## PhD           -9.72551    4.91228  -1.980 0.048176 *  
## Terminal      -0.48690    5.43302  -0.090 0.928620    
## S.F.Ratio     18.26146   13.83984   1.319 0.187508    
## perc.alumni    1.39008    4.39572   0.316 0.751934    
## Expend         0.05764    0.01254   4.595 5.26e-06 ***
## Grad.Rate      5.89480    3.11185   1.894 0.058662 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 993.8 on 603 degrees of freedom
## Multiple R-squared:  0.9347, Adjusted R-squared:  0.9328 
## F-statistic: 507.5 on 17 and 603 DF,  p-value: < 2.2e-16
lm.pred = predict(lm.fit, newdata = test, type = "response")
lm.mse = mean((lm.pred - test$Apps)^2)
lm.mse
## [1] 1567324

The test error obtained from the model using least squares is 1567324.

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)  # ridge regression and the lasso
x = model.matrix(Apps ~ ., train)      # training design matrix (keeps the intercept column)
x.test = model.matrix(Apps ~ ., test)  # test design matrix
grid = 10^seq(10, -2, length = 100)
set.seed(1)
ridge.mod = glmnet(x, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
cv.out = cv.glmnet(x, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
plot(cv.out)

bestlam=cv.out$lambda.min
bestlam
## [1] 0.01
ridge.pred = predict(ridge.mod, s = bestlam, newx = x.test)
ridge.mse = mean((ridge.pred - test$Apps)^2)
ridge.mse
## [1] 1567278

The test error obtained from the ridge regression is 1567278, slightly lower than that of the linear regression model. Note that the selected λ = 0.01 is the smallest value in the grid, so the ridge fit is nearly identical to least squares.

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

x = model.matrix(Apps ~ ., train)
x.test = model.matrix(Apps ~ ., test)
grid = 10^seq(10, -2, length = 100)
set.seed(1)
lasso.mod = glmnet(x, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv.out = cv.glmnet(x, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.01
lasso.pred = predict(lasso.mod, s = bestlam, newx = x.test)
lasso.mse = mean((lasso.pred - test$Apps)^2)
lasso.mse
## [1] 1567261
predict(lasso.mod, s=bestlam, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -630.61389530
## (Intercept)    .         
## PrivateYes  -388.94711918
## Accept         1.69115587
## Enroll        -1.21488255
## Top10perc     50.44933775
## Top25perc    -13.62126672
## F.Undergrad    0.08264172
## P.Undergrad    0.06554432
## Outstate      -0.07560901
## Room.Board     0.14159590
## Books          0.21154724
## Personal       0.01871734
## PhD           -9.72479696
## Terminal      -0.48590436
## S.F.Ratio     18.25562704
## perc.alumni    1.38634947
## Expend         0.05763637
## Grad.Rate      5.89356487

The test error obtained from the lasso model is 1567261, lower than both the ridge regression and linear regression errors, though only marginally. At this small λ essentially no shrinkage occurs: all 17 coefficients are non-zero, so no variables are dropped. (The second, zero (Intercept) row in the output is the redundant intercept column of the model matrix, which glmnet leaves at zero.)
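
To confirm the count programmatically (using the objects fitted above):

lasso.coefs = predict(lasso.mod, s = bestlam, type = "coefficients")
sum(lasso.coefs != 0) - 1  # subtract 1 for the intercept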

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)  # pcr() and plsr()
set.seed(1)
pcr.fit = pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")

pcr.fit$ncomp  # number of components fitted; the validation plot shows the CV minimum at M = 17
## [1] 17
pcr.pred=predict(pcr.fit,test,ncomp=17)
pcr.mse = mean((pcr.pred-test$Apps)^2)
pcr.mse
## [1] 1567324

Cross-validation selects M = 17, i.e. all of the components. With M = 17 PCR performs no dimension reduction and is equivalent to least squares, which is why the test error of 1567324 matches the linear model from (b).
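
As a cross-check on M, recent versions of the pls package provide selectNcomp(), which applies a one-standard-error rule to the CV curve and often favors a smaller model than the absolute CV minimum; a hedged sketch (the exact result depends on the CV folds):

selectNcomp(pcr.fit, method = "onesigma", plot = TRUE)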

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(1)
pls.fit=plsr(Apps~., data=train,scale=TRUE, validation="CV")
validationplot(pls.fit,val.type="MSEP")

pls.fit$ncomp  # number of components fitted; the validation plot shows the CV minimum at M = 17
## [1] 17
pls.pred=predict(pls.fit,test,ncomp=17)
pls.mse = mean((pls.pred-test$Apps)^2)
pls.mse
## [1] 1567324

Cross-validation again selects M = 17. With all 17 components, PLS likewise reproduces the least squares fit, giving the same test error of 1567324.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

model.names=c("LM", "Ridge", "Lasso", "PCR", "PLS")
all.errors = c(lm.mse, ridge.mse, lasso.mse, pcr.mse, pls.mse)
data.frame(model.names, all.errors)
##   model.names all.errors
## 1          LM    1567324
## 2       Ridge    1567278
## 3       Lasso    1567261
## 4         PCR    1567324
## 5         PLS    1567324

From the above comparison, the lasso gives the smallest test error, but the differences among the five approaches are negligible. This is expected: the cross-validated λ (0.01) barely shrinks the ridge and lasso coefficients, and PCR and PLS retain all 17 components, so every model is essentially equivalent to least squares.
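
To put these numbers in context ("how accurately can we predict?"), one can compute the fraction of test-set variance in Apps that a model explains; a minimal sketch for the linear model:

test.avg = mean(test$Apps)
1 - lm.mse / mean((test$Apps - test.avg)^2)  # test-set R^2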

11. We will now try to predict per capita crime rate in the Boston data set.

detach(College)
library(MASS)  # Boston data set
attach(Boston)

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Best subset selection

library(leaps)  # regsubsets()

# regsubsets() provides no predict() method, so define one: rebuild the
# model matrix from the original formula and multiply by the coefficients
# of the id-th model.
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id = id)
  xvars = names(coefi)
  mat[, xvars] %*% coefi
}
regfit.full = regsubsets(crim ~ ., Boston, nvmax = 13)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., Boston, nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   " "   "*" 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     "*"   " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   " "   "*" 
## 8  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
k=10
set.seed(1)
folds=sample(1:k,nrow(Boston),replace=TRUE)
cv.errors=matrix(NA,k,13, dimnames=list(NULL, paste(1:13)))
for(j in 1:k){
  best.fit=regsubsets(crim~.,data=Boston[folds!=j,],nvmax=13)
  for(i in 1:13){
    pred=predict(best.fit,Boston[folds==j,],id=i)
    cv.errors[j,i]=mean( (Boston$crim[folds==j]-pred)^2)
  }
}
mean.cv.errors=apply(cv.errors,2,mean)
mean.cv.errors
##        1        2        3        4        5        6        7        8 
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948 
##        9       10       11       12       13 
## 42.53822 42.73416 42.52367 42.46014 42.50125
which.min(mean.cv.errors)
## 12 
## 12
par(mfrow=c(1,1))
plot(mean.cv.errors,type='b')

bss.error = min(mean.cv.errors)
bss.error
## [1] 42.46014

The lowest cross-validation error, 42.46014, is achieved by the 12-variable model.
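
To inspect the 12-variable model that cross-validation selects, its coefficients can be extracted from the full-data fit (regfit.full was fitted above with nvmax = 13):

coef(regfit.full, id = which.min(mean.cv.errors))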

Lasso

x=model.matrix(crim~.,Boston)
y=Boston$crim
grid=10^seq(10,-2,length=100)
set.seed(1)
lasso.mod = glmnet(x,y,alpha=1, lambda = grid, thresh=1e-12)
cv.out=cv.glmnet(x,y,alpha=1, lambda = grid, thresh=1e-12)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.05336699
plot(cv.out)

lasso.pred = predict(lasso.mod, s = bestlam, newx = x)
lasso.mse = mean((lasso.pred - y)^2)
lasso.mse
## [1] 40.47009

The mean squared error obtained by the lasso model is 40.47009. Since the predictions are made on the same observations used to fit the model, this is a training error rather than a test error.
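
To see which predictors the lasso retains at this λ (relevant to part (c) below), one can inspect the coefficients; output omitted here:

predict(lasso.mod, s = bestlam, type = "coefficients")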

Ridge regression

x=model.matrix(crim~.,Boston)
y=Boston$crim
grid=10^seq(10,-2,length=100)
set.seed(1)
ridge.mod = glmnet(x,y,alpha=0, lambda = grid, thresh=1e-12)
cv.out=cv.glmnet(x,y,alpha=0, lambda = grid, thresh=1e-12)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.1232847
plot(cv.out)

ridge.pred=predict(ridge.mod,s=bestlam,newx=x)
ridge.mse = mean((ridge.pred-y)^2)
ridge.mse
## [1] 40.36313

The mean squared error obtained by the ridge regression model is 40.36313; as with the lasso, this is computed on the full data set and is therefore a training error.

Principal Components Regression

set.seed(1)
pcr.fit=pcr(crim~., data=Boston,scale=TRUE,validation="CV") 
validationplot(pcr.fit,val.type="MSEP")

pcr.fit$ncomp  # number of components fitted (here all 13)
## [1] 13
pcr.pred=predict(pcr.fit,Boston,ncomp=13)
pcr.mse = mean((pcr.pred-y)^2)
pcr.mse
## [1] 40.31607

Comparison of models

model.names = c("Best subset selection", "Lasso", "Ridge regression", "PCR")
error.matrix = c(bss.error, lasso.mse, ridge.mse, pcr.mse)
data.frame(model.names, error.matrix)
##             model.names error.matrix
## 1 Best subset selection     42.46014
## 2                 Lasso     40.47009
## 3      Ridge regression     40.36313
## 4                   PCR     40.31607

Comparing the four models, PCR yields the smallest error. The comparison is not entirely fair, however: the best subset figure is a cross-validated error, while the lasso, ridge, and PCR figures were computed on the same data used to fit the models and therefore understate the true test error.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

The PCR model gave the smallest mean squared error (40.32), with the other models close behind. However, only the best subset figure is cross-validated; the lasso, ridge, and PCR errors were computed on the full data set, so they are training errors. Splitting the data into training and test sets would give a more honest picture of how each model performs. Note also that cross-validation within PCR chose M = 13, i.e. all of the components, so the PCR fit performs no dimension reduction and is equivalent to least squares; its apparent advantage should be treated with caution until validated on held-out data.
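
A minimal sketch of the suggested train/test evaluation (the 80/20 split and the seed are arbitrary choices of ours):

set.seed(1)
idx = sample(1:nrow(Boston), 0.8 * nrow(Boston))
b.train = Boston[idx, ]
b.test = Boston[-idx, ]
pcr.b = pcr(crim ~ ., data = b.train, scale = TRUE, validation = "CV")
pcr.b.pred = predict(pcr.b, b.test, ncomp = 13)
mean((pcr.b.pred - b.test$crim)^2)  # honest held-out test MSE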

(c) Does your chosen model involve all of the features in the data set? Why or why not?

Yes. Cross-validation chose M = 13 components, and each principal component is a linear combination of all 13 predictors, so every feature enters the model (with M = 13, PCR in fact performs no dimension reduction at all). Further variations could be attempted by adding interaction effects or polynomial terms, to check whether such terms make the model better or worse, as sketched below.
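
For example, one could compare the plain fit against a model with a polynomial term and an interaction; the particular terms below are arbitrary illustrations, not recommendations:

lm.rich = lm(crim ~ . + I(medv^2) + lstat:rad, data = Boston)
summary(lm.rich)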