Question 2

2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Answer

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

The lasso is less flexible than least squares (for any λ > 0), and it gives improved prediction accuracy whenever the resulting increase in bias is smaller than the decrease in variance.

This is because the lasso selects the \(\hat{\beta}\) that minimizes \(\mathrm{RSS}+\lambda\sum^{p}_{j=1}|\beta_{j}|\), instead of simply minimizing the RSS as in least squares.

Since the shrinkage penalty \(\lambda\sum^{p}_{j=1}|\beta_{j}|\) is small when \(\beta_1, \beta_2, \ldots, \beta_p\) are close to zero, it tends to shrink the estimates towards zero: for a given λ > 0, the optimal lasso \(\hat\beta\) will be closer to zero than the least squares \(\hat\beta\). For a larger λ, the shrinkage term's importance is higher relative to the RSS, so the shrinkage increases. This shrinkage is what reduces the variance of the predictions, at the cost of a small increase in bias.
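To make the shrinkage concrete, here is a minimal sketch on simulated data (the names x.sim, y.sim, and fit.sim are illustrative, not part of the exercise): as λ grows, the lasso pulls the coefficient estimates towards zero and eventually sets some exactly to zero.

# Minimal sketch of lasso shrinkage on simulated data.
library(glmnet)
set.seed(1)
x.sim = matrix(rnorm(100 * 5), ncol = 5)                 # 5 predictors
y.sim = drop(x.sim %*% c(3, -2, 1, 0, 0)) + rnorm(100)   # sparse true model
fit.sim = glmnet(x.sim, y.sim, alpha = 1)                # alpha = 1: lasso
coef(fit.sim, s = 0.01)   # small lambda: close to the least squares fit
coef(fit.sim, s = 1)      # larger lambda: shrunken, some exactly zero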

(b) Repeat (a) for ridge regression relative to least squares.

Answer

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

As \(\lambda\) increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. In general, in situations where the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance. This means that a small change in the training data can cause a large change in the least squares coefficient estimates. In particular, when the number of variables p is almost as large as the number of observations n, the least squares estimates will be extremely variable. And if \(p > n\), then the least squares estimates do not even have a unique solution, whereas ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance. Hence, ridge regression works best in situations where the least squares estimates have high variance.
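As a small hedged illustration of the \(p > n\) point (simulated data; the names n, p, x.hd, and y.hd are ours): least squares returns NA for coefficients it cannot estimate, while the ridge fit remains fully defined.

# Sketch: p > n. Least squares is not unique; ridge is still defined.
library(glmnet)
set.seed(1)
n = 20; p = 50
x.hd = matrix(rnorm(n * p), n, p)
y.hd = rnorm(n)
sum(is.na(coef(lm(y.hd ~ x.hd))))                # many coefficients are NA (aliased)
coef(glmnet(x.hd, y.hd, alpha = 0, lambda = 1))  # all p + 1 estimates defined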

(c) Repeat (a) for non-linear methods relative to least squares.

Answer

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

If we have the relationship \(Y = f(X) + \epsilon\), a non-linear method is more flexible because it makes fewer assumptions about the functional form of f (it does not assume linearity). When f is better approximated by non-linear relationships between the predictors and the response, this leads to a decrease in bias that outweighs any increase in variance, and hence to higher prediction accuracy.
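A small sketch of this trade-off on simulated data (the names x.nl and y.nl and the degree-5 polynomial are illustrative choices): when the true f is non-linear, the more flexible fit achieves a lower test MSE than the linear fit.

# Sketch: non-linear f. The flexible fit has lower test MSE than the linear fit.
set.seed(1)
x.nl = runif(200, -2, 2)
y.nl = sin(2 * x.nl) + rnorm(200, sd = 0.3)
train.nl = sample(200, 100)
fit.lin  = lm(y.nl ~ x.nl, subset = train.nl)
fit.poly = lm(y.nl ~ poly(x.nl, 5), subset = train.nl)
mean((y.nl[-train.nl] - predict(fit.lin,  data.frame(x.nl = x.nl[-train.nl])))^2)
mean((y.nl[-train.nl] - predict(fit.poly, data.frame(x.nl = x.nl[-train.nl])))^2)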

Question 9

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

library(ISLR)
attach(College)
set.seed(1)
index = sample(1:nrow(College), 0.8 * nrow(College))
college.train = College[index,]
college.test = College[-index,]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

college.lm.fit = lm(Apps ~ ., data = college.train)
summary(college.lm.fit)
## 
## Call:
## lm(formula = Apps ~ ., data = college.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5555.2  -404.6    19.9   310.3  7577.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -630.58238  435.56266  -1.448 0.148209    
## PrivateYes  -388.97393  148.87623  -2.613 0.009206 ** 
## Accept         1.69123    0.04433  38.153  < 2e-16 ***
## Enroll        -1.21543    0.20873  -5.823 9.41e-09 ***
## Top10perc     50.45622    5.88174   8.578  < 2e-16 ***
## Top25perc    -13.62655    4.67321  -2.916 0.003679 ** 
## F.Undergrad    0.08271    0.03632   2.277 0.023111 *  
## P.Undergrad    0.06555    0.03367   1.947 0.052008 .  
## Outstate      -0.07562    0.01987  -3.805 0.000156 ***
## Room.Board     0.14161    0.05130   2.760 0.005947 ** 
## Books          0.21161    0.25184   0.840 0.401102    
## Personal       0.01873    0.06604   0.284 0.776803    
## PhD           -9.72551    4.91228  -1.980 0.048176 *  
## Terminal      -0.48690    5.43302  -0.090 0.928620    
## S.F.Ratio     18.26146   13.83984   1.319 0.187508    
## perc.alumni    1.39008    4.39572   0.316 0.751934    
## Expend         0.05764    0.01254   4.595 5.26e-06 ***
## Grad.Rate      5.89480    3.11185   1.894 0.058662 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 993.8 on 603 degrees of freedom
## Multiple R-squared:  0.9347, Adjusted R-squared:  0.9328 
## F-statistic: 507.5 on 17 and 603 DF,  p-value: < 2.2e-16
college.lm.pred = predict(college.lm.fit, newdata = college.test, type = "response")
college.lm.mse = mean((college.lm.pred - college.test$Apps)^2)
college.lm.mse
## [1] 1567324

The test MSE obtained from the least squares regression model is 1567324.
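Since the MSE is on a squared scale, the root mean squared error, which is on the scale of Apps itself (about 1252 applications here), can be easier to interpret:

sqrt(college.lm.mse)  # RMSE, on the scale of the response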

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)
x=model.matrix(Apps~.,college.train)        # training design matrix
x.test=model.matrix(Apps~.,college.test)    # test design matrix
grid=10^seq(10,-2,length=100)
set.seed(1)
ridge.mod = glmnet(x,college.train$Apps,alpha=0, lambda = grid, thresh=1e-12)
cv.out=cv.glmnet(x,college.train$Apps,alpha=0, lambda = grid, thresh=1e-12)
plot(cv.out)

bestlam=cv.out$lambda.min
bestlam
## [1] 0.01

The best λ value obtained by cross-validation, used for the predictions below, is 0.01.

ridge.pred=predict(ridge.mod,s=bestlam,newx=x.test)
ridge.mse = mean((ridge.pred-college.test$Apps)^2)
ridge.mse
## [1] 1567278

The test MSE obtained from ridge regression is 1567278, slightly lower than that of the least squares model.
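To see that ridge shrinks the coefficients without zeroing any of them, they can be inspected at the selected λ (output omitted here):

predict(ridge.mod, s = bestlam, type = "coefficients")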

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

x=model.matrix(Apps~.,college.train)
x.test=model.matrix(Apps~.,college.test)
grid=10^seq(10,-2,length=100)
set.seed(1)
lasso.mod = glmnet(x,college.train$Apps,alpha=1, lambda = grid, thresh=1e-12)
cv.out=cv.glmnet(x,college.train$Apps,alpha=1, lambda = grid, thresh=1e-12)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.01
lasso.pred=predict(lasso.mod,s=bestlam,newx=x.test)
lasso.mse = mean((lasso.pred-college.test$Apps)^2)
lasso.mse
## [1] 1567261
predict(lasso.mod, s=bestlam, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -630.61389530
## (Intercept)    .         
## PrivateYes  -388.94711918
## Accept         1.69115587
## Enroll        -1.21488255
## Top10perc     50.44933775
## Top25perc    -13.62126672
## F.Undergrad    0.08264172
## P.Undergrad    0.06554432
## Outstate      -0.07560901
## Room.Board     0.14159590
## Books          0.21154724
## Personal       0.01871734
## PhD           -9.72479696
## Terminal      -0.48590436
## S.F.Ratio     18.25562704
## perc.alumni    1.38634947
## Expend         0.05763637
## Grad.Rate      5.89356487

The test MSE obtained from the lasso model is 1567261, which is lower than both the ridge regression and least squares models. All 17 coefficient estimates in the lasso model are non-zero: the selected λ of 0.01 is too small to force any of them exactly to zero.
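Rather than counting by eye, a short sketch to count the non-zero estimates programmatically (lasso.coefs is a new illustrative name; the subtraction removes the intercept):

lasso.coefs = predict(lasso.mod, s = bestlam, type = "coefficients")
sum(lasso.coefs != 0) - 1   # subtract 1 for the intercept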

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
set.seed(1)
pcr.fit=pcr(Apps~., data=college.train,scale=TRUE,validation="CV") 
validationplot(pcr.fit,val.type="MSEP")

pcr.fit$ncomp
## [1] 17

Cross-validation selects M = 17: the model uses all of the predictors, so dimensionality is not reduced at all. (Strictly, pcr.fit$ncomp reports the number of components fitted; the validation plot above is what shows the CV error reaching its minimum at M = 17.)
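As a check, the CV-minimizing M can also be read off programmatically, a sketch using the pls package's RMSEP accessor (the first element of the returned array is the intercept-only model, hence the subtraction):

cv.errors = RMSEP(pcr.fit)$val["CV", 1, ]   # CV RMSEP for 0 to 17 components
which.min(cv.errors) - 1                    # subtract 1 for the intercept column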

pcr.pred=predict(pcr.fit,college.test,ncomp=17)
pcr.mse = mean((pcr.pred-college.test$Apps)^2)
pcr.mse
## [1] 1567324

Applying M = 17, as chosen by cross-validation, the test error is 1567324. With all 17 components retained, PCR is equivalent to least squares, so the test error matches the linear model exactly.

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(1)
pls.fit=plsr(Apps~., data=college.train,scale=TRUE, validation="CV")
validationplot(pls.fit,val.type="MSEP")

pls.fit$ncomp
## [1] 17
pls.pred=predict(pls.fit,college.test,ncomp=17)
pls.mse = mean((pls.pred-college.test$Apps)^2)
pls.mse
## [1] 1567324

The partial least squares model with M = 17, as chosen by cross-validation, gives a test MSE of 1567324; as with PCR, retaining all 17 components reproduces the least squares fit.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

model.name=c("OLS", "Ridge", "Lasso", "PCR", "PLS")
model.error = c(college.lm.mse, ridge.mse, lasso.mse, pcr.mse, pls.mse)
data.frame(model.name, model.error)
##   model.name model.error
## 1        OLS     1567324
## 2      Ridge     1567278
## 3      Lasso     1567261
## 4        PCR     1567324
## 5        PLS     1567324

From the table above, there is not much difference in the test errors among the five models. The lasso has the lowest error, followed closely by ridge; OLS, PCR, and PLS are identical because the latter two retain all 17 components. Given the size of the data set, the selected models and test performances would shift if the random seed were changed.
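To put these errors on an interpretable scale, a quick sketch of the test R² for the lasso, i.e. the fraction of the variance of Apps explained on the held-out data:

1 - lasso.mse / mean((college.test$Apps - mean(college.test$Apps))^2)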

Question 11

11. We will now try to predict per capita crime rate in the Boston data set.

detach(College)
library(MASS)
attach(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Best Subset Selection

# Splitting the data into training and testing subsets
library(leaps)
set.seed(123)
index = sample(nrow(Boston), 0.7*nrow(Boston))
boston.train<-Boston[index,]
boston.test<-Boston[-index,]
boston.bsm = regsubsets(crim ~ .,data = boston.train, nvmax = 13)
summary(boston.bsm)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = boston.train, nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   " "   "*" 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   " "   "*" 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 6  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   " "   "*" 
## 8  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" "*" "*"     "*"   " "   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" " "   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
boston.test.mat = model.matrix(crim ~ ., data = boston.test)
val.errors = rep(NA, 13)
for (i in 1:13) {
    coefi = coef(boston.bsm, id = i)
    pred = boston.test.mat[, names(coefi)] %*% coefi
    val.errors[i] = mean((pred - boston.test$crim)^2)
}
plot(val.errors, xlab = "Number of predictors in model", ylab = "Test MSE", pch = 19, type = "b")

which.min(val.errors)
## [1] 9
coef(boston.bsm,which.min(val.errors))
##   (Intercept)            zn         indus           nox           dis 
##  21.440371381   0.048443492  -0.111243337 -10.221435882  -1.031313010 
##           rad       ptratio         black         lstat          medv 
##   0.600206094  -0.338229109  -0.008369759   0.095696510  -0.224527167
boston.bsm.MSE = val.errors[9]
boston.bsm.MSE
## [1] 18.43048

Best subset selection fits, for each model size from 1 to 13, the best model containing exactly that many predictors: the first model contains the single best predictor, the second the best pair, and so on. Among the 13 candidate models, the one with 9 predictors gives the lowest test MSE, so it is the final model selected by the validation-set approach.
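For reuse, the validation loop above can be wrapped in a predict method for regsubsets objects, a helper in the spirit of the ISLR lab (a sketch; regsubsets does not ship with a predict method of its own):

predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])   # recover the formula used in the fit
  mat = model.matrix(form, newdata)     # build the test design matrix
  coefi = coef(object, id = id)         # coefficients of the size-id model
  mat[, names(coefi)] %*% coefi         # predictions
}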

Ridge Regression

boston.train.mat=model.matrix(crim~.,boston.train)
boston.test.mat=model.matrix(crim~.,boston.test)
grid=10^seq(10,-2,length=100)
set.seed(1)
bridge.mod = glmnet(boston.train.mat,boston.train$crim,alpha=0, lambda = grid, thresh=1e-12)
cv.out=cv.glmnet(boston.train.mat,boston.train$crim,alpha=0, lambda = grid, thresh=1e-12)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.3764936
plot(cv.out)

ridge.model.1<-glmnet(boston.train.mat,boston.train$crim,alpha=0,lambda=bestlam)
coef(ridge.model.1)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 12.018114390
## (Intercept)  .          
## zn           0.041192138
## indus       -0.086492370
## chas        -0.217088779
## nox         -6.516975373
## rm           0.500309671
## age          0.001354451
## dis         -0.797671315
## rad          0.516507292
## tax          0.001781834
## ptratio     -0.215594446
## black       -0.008820569
## lstat        0.129538348
## medv        -0.196469657
ridge.pred=predict(ridge.model.1,s=bestlam,newx=boston.test.mat)
ridge.MSE = mean((ridge.pred-boston.test$crim)^2)
ridge.MSE
## [1] 17.74828

Ridge regression is fitted over a grid of λ values. The best λ chosen by cross-validation is 0.3764936, which gives a test MSE of 17.74828.

Lasso Regression

lasso.model<-glmnet(boston.train.mat,boston.train$crim,alpha=1,lambda=grid)   # alpha = 1 specifies the lasso penalty
boston.cv.lasso<-cv.glmnet(boston.train.mat,boston.train$crim,alpha=1,lambda=grid)

plot(boston.cv.lasso)

boston.bestlam.lasso<-boston.cv.lasso$lambda.min
boston.bestlam.lasso
## [1] 0.2848036
lasso.model.1<-glmnet(boston.train.mat,boston.train$crim,alpha=1,lambda=boston.bestlam.lasso)
coef(lasso.model.1)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) 13.2006283803
## (Intercept)  .           
## zn           0.0430902248
## indus       -0.0852333649
## chas        -0.1999509852
## nox         -7.2443713108
## rm           0.5148991428
## age          0.0014044451
## dis         -0.8398083931
## rad          0.5410200875
## tax          0.0007837926
## ptratio     -0.2363806602
## black       -0.0086461246
## lstat        0.1263695340
## medv        -0.2062445016
boston.pred.newlasso<-predict(lasso.model,s=boston.bestlam.lasso,newx=boston.test.mat)

lasso.MSE<-mean((boston.test$crim-boston.pred.newlasso)^2)
lasso.MSE
## [1] 17.84529

For the lasso, cross-validation selects λ = 0.2848036, giving a test MSE of 17.84529.

PCR

boston.pcr.model<-pcr(crim~.,data=boston.train,scale=TRUE,validation="CV")
summary(boston.pcr.model)
## Data:    X dimension: 354 13 
##  Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.525    8.000    7.979    7.549    7.535    7.547    7.580
## adjCV        9.525    7.997    7.976    7.544    7.531    7.541    7.573
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       7.568    7.556    7.493     7.498     7.503     7.454     7.378
## adjCV    7.559    7.555    7.481     7.485     7.491     7.439     7.363
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.58    61.36    70.50    77.35    83.36    88.23    91.30    93.45
## crim    29.82    30.41    37.81    38.42    38.54    38.72    39.17    40.09
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.54     97.16     98.50     99.54    100.00
## crim    41.44     41.62     41.62     43.09     44.25
validationplot(boston.pcr.model,val.type="MSEP")

boston.pcr.model$ncomp
## [1] 13
boston.pcr.13.model= pcr(crim~.,data=boston.train,scale=TRUE,ncomp=13)
summary(boston.pcr.13.model)
## Data:    X dimension: 354 13 
##  Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 13
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.58    61.36    70.50    77.35    83.36    88.23    91.30    93.45
## crim    29.82    30.41    37.81    38.42    38.54    38.72    39.17    40.09
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.54     97.16     98.50     99.54    100.00
## crim    41.44     41.62     41.62     43.09     44.25
boston.predict.pcr<-predict(boston.pcr.model,boston.test,ncomp=13)
pcr.MSE<-mean((boston.test$crim-boston.predict.pcr)^2)
pcr.MSE
## [1] 18.4379
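
Because M = 13 retains every component, this PCR fit is equivalent to ordinary least squares on all 13 predictors; a quick sketch to check this (boston.lm is a new illustrative name, and its test MSE should match 18.4379 up to rounding):

boston.lm = lm(crim ~ ., data = boston.train)
mean((boston.test$crim - predict(boston.lm, boston.test))^2)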

Comparison of models

model.names = c("Best subset selection", "Lasso", "Ridge regression", "PCR")
error.matrix = c(boston.bsm.MSE, lasso.MSE, ridge.MSE, pcr.MSE)
data.frame(model.names, error.matrix)
##             model.names error.matrix
## 1 Best subset selection     18.43048
## 2                 Lasso     17.84529
## 3      Ridge regression     17.74828
## 4                   PCR     18.43790

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error.

The ridge regression model has the lowest validation-set MSE (17.75) of the models tried. The others are not far behind: all of the test errors lie in a narrow range, from roughly 17.7 to 18.4.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

Yes. The ridge regression model includes all of the features in the data set: the ridge penalty shrinks the coefficient estimates towards zero but never sets any of them exactly to zero, so it performs no variable selection.
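A quick sketch to confirm this from the fitted object (the only zero entry in the coefficient matrix is the redundant intercept column introduced by model.matrix):

sum(coef(ridge.model.1) == 0)   # 1: just the duplicated intercept row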