Chapter # 6

Problem 2

(a) The lasso, relative to least squares, is:
  1. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  2. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  3. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  4. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Answer: iii Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Because the lasso model selects the beta or the coefficients values in such a way that the overall cross validation error decreased to a small extent. For λ>0 the coefficients will tend to reach 0. This process of shrinkage will reduce the variance of the predictions but with a small increase in bias. So there is a trade-off.The lasso limits the number of predictors, thus it reduces the inherent variance at the cost of an increase in bias.

(b) Repeat (a) for ridge regression relative to least squares.
Ridge Regression vs Least Squares
Answer: iii.Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

for the same reasons as part (a). The only real difference here is in the ridge objective function RSS+λ∑pi=1β2i, where the shrinkage term for ridge regression is slightly different to that of the lasso. RSS+λ∑pi=1β2i this is the ridge objective function and the shrinkage term λ is slightly different from the lasso regression. Ridge regression do not shrink coefficients that are useless t exactly 0 like lasso but just like Lasso the shrinkage will reduce the variance at the cost of an increase in bias.

(c) Repeat (a) for non-linear methods relative to least squares.
Answer:ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Non-linear methods are more flexible methods since they are not built on more assumptions. Since we do not consider any assumption the non linear methods have low bias that outweighs any increment in the variance. This is why there will be a high accuracy in the predictions.

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.4
## Loading required package: Matrix
## Loaded glmnet 4.1-1
library(pls)
## Warning: package 'pls' was built under R version 4.0.4
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
df <- College
dim(df)
## [1] 777  18
sum(is.na(df))
## [1] 0
str(df)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
set.seed(3)
index = sample(1:nrow(df), 0.7*nrow(df))
train = df[index,]
test = df[-index,]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
OLS Regression
lm.fit <- lm(Apps ~ ., data = train)
summary(lm.fit)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3377.8  -432.6   -64.3   350.0  6230.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -638.64835  494.48369  -1.292  0.19708    
## PrivateYes  -596.53900  164.68933  -3.622  0.00032 ***
## Accept         1.32983    0.06126  21.709  < 2e-16 ***
## Enroll        -0.27428    0.21965  -1.249  0.21234    
## Top10perc     52.60987    7.00024   7.515 2.46e-13 ***
## Top25perc    -17.61480    5.44793  -3.233  0.00130 ** 
## F.Undergrad    0.04737    0.03878   1.221  0.22248    
## P.Undergrad    0.04116    0.03435   1.198  0.23133    
## Outstate      -0.07459    0.02326  -3.206  0.00143 ** 
## Room.Board     0.18067    0.05973   3.025  0.00261 ** 
## Books         -0.09611    0.26777  -0.359  0.71979    
## Personal       0.02455    0.07656   0.321  0.74859    
## PhD          -10.13790    5.51802  -1.837  0.06674 .  
## Terminal      -5.64889    6.08928  -0.928  0.35400    
## S.F.Ratio     23.73129   15.22420   1.559  0.11965    
## perc.alumni   -6.46646    5.04895  -1.281  0.20085    
## Expend         0.12574    0.01798   6.992 8.29e-12 ***
## Grad.Rate     11.01750    3.54346   3.109  0.00198 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1042 on 525 degrees of freedom
## Multiple R-squared:  0.9216, Adjusted R-squared:  0.919 
## F-statistic: 362.8 on 17 and 525 DF,  p-value: < 2.2e-16
ols.pred <- predict(lm.fit, test)
(ols_mse <- mean((ols.pred - test$Apps)^2))
## [1] 1409723
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
set.seed(1)
x_train = model.matrix(Apps~.,train)[,-1]
y_train = train$Apps


x_test = model.matrix(Apps~.,test)[,-1]
y_test = test$Apps
set.seed(1)
cv.out=cv.glmnet(x_train,y_train,alpha=0, lambda = 10^seq(2, -2, length = 100))
plot(cv.out)

bestlam=cv.out$lambda.min
bestlam
## [1] 29.83647
mod.ridge <- glmnet(y = y_train,
                           x = x_train,
                           alpha = 0, 
                           lambda = 10^seq(2,-2, length = 100))
ridge.pred <- predict(mod.ridge, s = bestlam, newx = x_test)

(ridge_mse <- mean((ridge.pred - y_test)^2))
## [1] 1567301
(d) Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates
set.seed(1)
cv.out=cv.glmnet(x_train,y_train,alpha=1,lambda = 10^seq(2, -2, length = 100)) # for lasso we put alpha = 1
plot(cv.out)

bestlam =cv.out$lambda.min


lasso.mod=glmnet(x_train,y_train,alpha=1,lambda=10^seq(2,-2, length = 100))
lasso.pred=predict(lasso.mod ,s=bestlam ,newx=x_test)
mean((lasso.pred - y_test)^2)
## [1] 1451043
lasso.coef <- predict(lasso.mod, type = "coefficients", s = cv.out$lambda.min)

round(lasso.coef, 3)
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                    1
## (Intercept) -753.803
## PrivateYes  -583.680
## Accept         1.281
## Enroll         .    
## Top10perc     47.297
## Top25perc    -13.561
## F.Undergrad    0.018
## P.Undergrad    0.037
## Outstate      -0.065
## Room.Board     0.171
## Books         -0.024
## Personal       0.012
## PhD           -8.738
## Terminal      -5.494
## S.F.Ratio     20.345
## perc.alumni   -7.179
## Expend         0.121
## Grad.Rate      9.905
(e) Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(5)
model.pcr <- pcr(Apps ~ .,data = train, scale = T, validation = "CV")

validationplot(model.pcr, val.type="MSEP")

model.pcr.mse <- MSEP(model.pcr, estimate = "CV")
# the minimum we got at 17 components

pcr.pred=predict(model.pcr,x_test,ncomp =17)
mean((pcr.pred-y_test)^2)
## [1] 1409723
(f) Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)


pls.fit=plsr(Apps~.,data=train,scale=TRUE,validation ="CV")
summary (pls.fit)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3665     1593     1355     1231     1205     1154     1115
## adjCV         3665     1592     1355     1229     1199     1145     1110
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1096     1093     1090      1088      1088      1089      1089
## adjCV     1092     1089     1087      1085      1085      1085      1086
##        14 comps  15 comps  16 comps  17 comps
## CV         1089      1089      1089      1089
## adjCV      1086      1086      1086      1086
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       26.10    42.12    63.55    67.17    70.56    74.76    78.01    81.33
## Apps    81.44    87.06    89.38    90.32    91.37    91.90    92.07    92.11
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       84.04     87.16     89.65     92.05     93.58     94.52     96.94
## Apps    92.13     92.14     92.15     92.15     92.15     92.16     92.16
##       16 comps  17 comps
## X        98.36    100.00
## Apps     92.16     92.16
validationplot(pls.fit,val.type="MSEP")

model.pls.msep <- MSEP(pls.fit, estimate = "CV")
# we got the minimum at 17 components again
pls.pred=predict(pls.fit,x_test,ncomp =17)
mean((pls.pred -y_test)^2)
## [1] 1409723
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Answer: 1409723,1567301, 1451043, 1409723, 1409723 these are the test errors that are obtained from different models.

test.erros = c(1409723,1567301, 1451043, 1409723, 1409723)
min(test.erros)
## [1] 1409723

The minimum test error obtained is 1409723 from three models a linear model using least squares, PCR model, PLS model. The two models that got different and greater test errors than OLS,PCR, and PLS are ridge and lasso which got 1567301, 1451043 respectively.

Problem 11

We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(MASS)
boston <- Boston
str(boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
library(leaps) # for resubsets() to perform best subset regression
## Warning: package 'leaps' was built under R version 4.0.4
predict.regsubsets=function(object , newdata ,id ,...){
 form=as.formula (object$call [[2]])
 mat=model.matrix(form ,newdata )
 coefi=coef(object ,id=id)
 xvars=names(coefi)
 mat[,xvars]%*%coefi
}
regfit.best=regsubsets(crim~.,data=boston,nvmax=13)
k=10
set.seed(1)
folds=sample (1:k,nrow(boston),replace=TRUE)
cv.errors=matrix(NA, k,13, dimnames=list(NULL,paste(1:13)))
for(j in 1:k){
 best.fit=regsubsets (crim~.,data=boston[folds!=j,],nvmax=13)
 for(i in 1:13){
  pred=predict.regsubsets(best.fit ,boston [folds ==j,],id=i)
  cv.errors[j,i]= mean( ( boston$crim[ folds==j]-pred)^2)
 }
}
mean.cv.errors=apply(cv.errors ,2, mean)
mean.cv.errors
##        1        2        3        4        5        6        7        8 
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948 
##        9       10       11       12       13 
## 42.53822 42.73416 42.52367 42.46014 42.50125
which.min(mean.cv.errors) # we gor 12 th one 
## 12 
## 12
par(mfrow=c(1,1))
plot(mean.cv.errors ,type='b')

min(mean.cv.errors)
## [1] 42.46014

The minimum of CV errors is 42.46014

Lasso Regression

set.seed(1)
x_data = model.matrix(crim~.,boston)[,-1]
y_data = boston$crim
set.seed(1)
cv.out=cv.glmnet(x_data,y_data,alpha=1,lambda = 10^seq(2, -2, length = 100)) 
# for lasso we put alpha = 1
plot(cv.out)

bestlam =cv.out$lambda.min


lasso.mod=glmnet(x_data,y_data,alpha=1,lambda=10^seq(2,-2, length = 100))
lasso.pred=predict(lasso.mod ,s=bestlam ,newx=x_data)
mean((lasso.pred - y_data)^2)
## [1] 40.46977

The mean square error for this model is 40.46977

Ridge Regression

set.seed(1)
cv.out=cv.glmnet(x_data,y_data,alpha=0, lambda = 10^seq(2, -2, length = 100))
plot(cv.out)

bestlam=cv.out$lambda.min
bestlam
## [1] 0.1123324
mod.ridge <- glmnet(y = y_data,
                           x = x_data,
                           alpha = 0, 
                           lambda = 10^seq(2,-2, length = 100))
ridge.pred <- predict(mod.ridge, s = bestlam, newx = x_data)

(ridge_mse <- mean((ridge.pred - y_data)^2))
## [1] 40.35765

The mean square error for this model is 40.35765

PCR

set.seed(5)
model.pcr <- pcr(crim ~ .,data = boston, scale = T, validation = "CV")

validationplot(model.pcr, val.type="MSEP")

model.pcr.mse <- MSEP(model.pcr, estimate = "CV")
# the minimum we got at 13 components
model.pcr.mse
## (Intercept)      1 comps      2 comps      3 comps      4 comps      5 comps  
##       74.13        52.00        51.89        45.53        45.42        45.47  
##     6 comps      7 comps      8 comps      9 comps     10 comps     11 comps  
##       46.11        46.14        44.27        44.57        44.50        44.39  
##    12 comps     13 comps  
##       43.64        42.60
pcr.pred=predict(model.pcr,x_data,ncomp =13)
mean((pcr.pred-y_data)^2)
## [1] 40.31607

The mean square error for this model is 40.31607

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

Answer: From all the models I got the least mean square error for PCR and the MSE is 40.31. But the validation set errors and the cross-validation errors we obtained or not accurate test errors because we used the full data set for training and used it for our predictions as well.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

Answer: Yes I chose the PCR model and it has all the 13 variables selected. But we used all the variables in our data set that means we are simply performing the least squares and there is no dimension reduction happened. However the cv error is some what similar to the 3 component model also. So using a small component number for our model would be sufficient.