Chapter 6

Problem 2.

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

A) (iii)
Justification:
Relative to least squares, the lasso is an alternative fitting procedure that uses shrinkage: a penalty with tuning parameter \(\lambda\) shrinks the coefficient estimates, setting some of them exactly to zero. The goal is to reduce variance, and thereby improve prediction accuracy, at the cost of a small increase in bias. Because only the predictors that meaningfully influence the response are retained, the lasso is less flexible than least squares and has lower variance, so (iii) is correct.
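Concretely, the lasso estimates minimize the penalized residual sum of squares

\[
\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert \;=\; RSS + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert,
\]

and it is the \(\ell_1\) penalty that forces some \(\hat{\beta}_j\) exactly to zero once \(\lambda\) is large enough.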

(b) Repeat (a) for ridge regression relative to least squares. A) (iii)
Justification:
Relative to least squares, ridge regression is an alternative fitting procedure that uses shrinkage: a penalty with tuning parameter \(\lambda\) shrinks the coefficient estimates toward zero (but not exactly to zero). The goal is to reduce variance, and thereby improve prediction accuracy, at the cost of a small increase in bias. Because the coefficients are constrained, ridge regression is less flexible than least squares and has lower variance, so (iii) is correct.
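Concretely, ridge regression minimizes

\[
\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2} + \lambda\sum_{j=1}^{p}\beta_j^{2} \;=\; RSS + \lambda\sum_{j=1}^{p}\beta_j^{2},
\]

where the \(\ell_2\) penalty shrinks the coefficients toward zero but, unlike the lasso, never sets them exactly to zero.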

(c) Repeat (a) for non-linear methods relative to least squares. A) (ii)
Justification:
Non-linear methods make fewer assumptions than least squares: they do not impose a linear functional form on the relationship between the response and the predictors, which reduces bias and makes them more flexible. A non-linear fit improves prediction accuracy when the decrease in bias outweighs the accompanying increase in variance, so (ii) is correct.
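This is the usual bias-variance trade-off: for a test point \(x_0\), the expected test MSE decomposes as

\[
E\big(y_0 - \hat{f}(x_0)\big)^{2} = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^{2} + \mathrm{Var}(\varepsilon),
\]

so the extra flexibility of a non-linear method pays off only while the reduction in squared bias exceeds the accompanying increase in variance.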

Problem 9.

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

mycollege <- College

# build the model matrix of predictors (dropping the intercept column) and the response vector
set.seed(12)
x = model.matrix(Apps~.,mycollege)[,-1]
y = mycollege$Apps

# split the data into training and test sets (70:30 split)
coltrain = sample(1:nrow(x), nrow(x)/1.43)  # row indexes for the training set
coltest = (-coltrain)                       # row indexes for the test set
ytest = y[coltest]

The training and test sets are split in a 70:30 ratio.
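The divisor 1.43 is simply \(1/0.7\); an equivalent and perhaps clearer way to draw the 70% training sample (a minimal sketch with the same effect, not used in the fits below) is:

coltrain = sample(1:nrow(x), floor(0.7 * nrow(x)))  # 70% of the rows (543 of 777) for training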

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

col_lm <- lm(Apps ~ ., data = mycollege[coltrain,])
summary(col_lm)
## 
## Call:
## lm(formula = Apps ~ ., data = mycollege[coltrain, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2859.5  -427.9   -33.5   339.5  6074.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   86.95605  460.53700   0.189  0.85031    
## PrivateYes  -670.11120  156.80692  -4.273 2.29e-05 ***
## Accept         1.43861    0.05969  24.103  < 2e-16 ***
## Enroll        -0.43665    0.21050  -2.074  0.03854 *  
## Top10perc     45.82557    6.17884   7.417 4.86e-13 ***
## Top25perc    -12.03688    4.83039  -2.492  0.01301 *  
## F.Undergrad    0.04186    0.03689   1.135  0.25698    
## P.Undergrad    0.04527    0.03373   1.342  0.18011    
## Outstate      -0.06886    0.02151  -3.201  0.00145 ** 
## Room.Board     0.09760    0.05198   1.878  0.06099 .  
## Books         -0.01949    0.28463  -0.068  0.94544    
## Personal      -0.08695    0.06873  -1.265  0.20641    
## PhD           -9.30044    5.21205  -1.784  0.07493 .  
## Terminal      -4.76292    5.56306  -0.856  0.39230    
## S.F.Ratio      6.62384   14.38187   0.461  0.64530    
## perc.alumni   -6.66355    4.58755  -1.453  0.14695    
## Expend         0.10279    0.01791   5.740 1.60e-08 ***
## Grad.Rate      9.64039    3.20526   3.008  0.00276 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 946.1 on 525 degrees of freedom
## Multiple R-squared:  0.929,  Adjusted R-squared:  0.9267 
## F-statistic: 404.3 on 17 and 525 DF,  p-value: < 2.2e-16
#prediction error(MSE)
col_lm_test <- predict(col_lm,mycollege[coltest,])

lm_MSE <- mean((col_lm_test-ytest)^2)

lm_MSE
## [1] 1637411

Linear model using least squares: test error \(MSE = 1637411\).
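Throughout this problem the test error is the mean squared prediction error on the held-out observations,

\[
\text{test } MSE = \frac{1}{n_{\text{test}}}\sum_{i \in \text{test}}\big(y_i - \hat{y}_i\big)^{2}.
\]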

(c) Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.

#selecting lambda using 10-fold cross validation
set.seed(12)
grid=10^seq(10,-2,length=100)

ridge_cv <- cv.glmnet(x[coltrain,],y[coltrain],alpha=0,lambda = grid )
plot(ridge_cv)

bestlam=ridge_cv$lambda.min

#predicting the MSE using the best lambda
ridge_pred=predict(ridge_cv,s=bestlam,newx=x[coltest,])
ridge_MSE <- mean((ridge_pred -ytest)^2)
ridge_MSE
## [1] 1737203
# check whether ridge at lambda = 0 gives the same fit as least squares
ridge_pred=predict(ridge_cv,s=0,newx=x[coltest,])
mean((ridge_pred -ytest)^2)
## [1] 1637852

A) Ridge regression with \(\lambda\) selected by cross-validation has a higher test MSE than the least squares linear model. Test error \(MSE = 1737203\).

(d) Fit a lasso model on the training set, with \(\lambda\) chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

#selecting lambda using 10-fold cross validation
set.seed(12)
grid=10^seq(10,-2,length=100)

lasso_cv <- cv.glmnet(x[coltrain,],y[coltrain],alpha=1,lambda=grid,standardize=TRUE,thresh=1e-12)
plot(lasso_cv)

lasso_best=lasso_cv$lambda.min

#predicting the MSE using the best lambda
lasso_pred=predict(lasso_cv,s=lasso_best,newx=x[coltest,])
lasso_MSE <- mean((lasso_pred -ytest)^2)
lasso_MSE
## [1] 1716777
#getting the coefficients which are non zero
lasso_best=glmnet(x,y,alpha=1,lambda=lasso_cv$lambda.min)
lasso_coef=predict(lasso_best,type="coefficients",s=lasso_cv$lambda.min)[1:18,]
length(lasso_coef[lasso_coef!=0])
## [1] 16
# length(lasso_coef[lasso_coef==0])
lasso_coef
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -553.21870806 -458.45853912    1.49894031   -0.34176381   39.26867154 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
##   -6.32553495    0.00000000    0.03387459   -0.06843036    0.13491214 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##    0.00000000    0.01149799   -6.73104428   -3.10748806    8.57911039 
##   perc.alumni        Expend     Grad.Rate 
##   -0.72236835    0.07249811    6.31855075

A) The lasso with \(\lambda\) selected by cross-validation has a higher test MSE than the least squares linear model, but a lower one than ridge regression. Test error \(MSE = 1716777\).

The number of non-zero coefficient estimates (\(\beta_j \neq 0\)) is \(16\) out of 18 (including the intercept).

(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(12)
pcr_fit=pcr(Apps~., data=mycollege, subset=coltrain, scale=TRUE, validation ="CV")
validationplot(pcr_fit, val.type="MSEP")

From the plot, the cross-validated MSE drops sharply up to about \(M = 5\) and shows little further reduction after that, but the lowest value occurs around \(M = 16\), so \(M = 16\) is used below.
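As a cross-check (a minimal sketch assuming the pls package's MSEP() accessor; cv_msep is just a scratch name introduced here), the minimizing number of components can be read off numerically instead of from the plot:

# cv_msep: cross-validated MSEP for each model size; the first entry is the intercept-only model
cv_msep = MSEP(pcr_fit)$val["CV", 1, ]
which.min(cv_msep) - 1   # number of components with the lowest CV error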

pcr_pred=predict(pcr_fit,x[coltest,],ncomp =16)
pcr_MSE <- mean((pcr_pred - ytest)^2)
pcr_MSE
## [1] 1762233

PCR test \(MSE = 1762233\), which is still higher than the least squares fit and also higher than the lasso and ridge MSEs.

pcr_fit_f=pcr(Apps~., data=mycollege, scale=TRUE, ncomp=16)
summary(pcr_fit)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3499     3400     1719     1683     1336     1307     1257
## adjCV         3499     3399     1717     1679     1325     1301     1254
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1226     1224     1172      1173      1178      1170      1171
## adjCV     1220     1220     1169      1172      1176      1169      1169
##        14 comps  15 comps  16 comps  17 comps
## CV         1172      1165     980.9     970.8
## adjCV      1170      1164     979.0     968.7
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      31.939    58.46    64.99    70.84    76.26     81.3    84.84    88.15
## Apps    5.969    76.38    77.63    85.99    86.65     87.6    88.41    88.44
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       91.06     93.47     95.49     97.14     98.16     98.99     99.47
## Apps    89.16     89.23     89.23     89.40     89.45     89.45     89.63
##       16 comps  17 comps
## X        99.84     100.0
## Apps     92.67      92.9

(f) Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(12)
pls_fit=plsr(Apps~.,data=mycollege,subset=coltrain,scale=TRUE,validation ="CV")
summary(pls_fit)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3499     1542     1288     1143     1097     1052    992.9
## adjCV         3499     1540     1289     1142     1094     1047    989.7
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV         977    975.3    974.6     975.2     973.6     972.3     971.6
## adjCV      975    973.1    972.4     972.9     971.3     970.0     969.4
##        14 comps  15 comps  16 comps  17 comps
## CV        971.2     970.8     970.8     970.8
## adjCV     969.0     968.7     968.6     968.7
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       26.96    44.86    64.10    67.53    70.89    74.27    78.00    81.31
## Apps    81.01    86.93    89.73    90.84    91.97    92.68    92.81    92.85
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       84.10     86.75     90.09     91.62     93.39     95.17     97.15
## Apps    92.87     92.88     92.89     92.90     92.90     92.90     92.90
##       16 comps  17 comps
## X        99.06     100.0
## Apps     92.90      92.9
validationplot(pls_fit,val.type="MSEP")

pls.pred=predict(pls_fit,x[coltest,],ncomp =13)
pls_MSE <- mean((pls.pred -ytest)^2)
pls_MSE
## [1] 1635841

\(M = 13\) is selected: the cross-validated error is essentially flat beyond 13 components (971.6 vs. 970.8 at 15-17 components), while nearly all of the explainable variance in Apps is already captured.
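As a sanity check on this choice (a sketch assuming the pls package's selectNcomp() helper is available), the one-standard-error rule can be used to pick a parsimonious \(M\) automatically:

# smallest M whose CV error is within one standard error of the overall minimum
selectNcomp(pls_fit, method = "onesigma", plot = FALSE)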

pls_fit_f=plsr(Apps~.,data=mycollege,scale=TRUE,ncomp=13)
summary(pls_fit_f)
## Data:    X dimension: 777 17 
##  Y dimension: 777 1
## Fit method: kernelpls
## Number of components considered: 17
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.76    40.33    62.59    64.97    66.87    71.33    75.39    79.37
## Apps    78.01    85.14    87.67    90.73    92.63    92.72    92.77    92.82
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.36     85.04     87.92     90.65     92.69     95.50     96.87
## Apps    92.87     92.89     92.90     92.91     92.92     92.92     92.92
##       16 comps  17 comps
## X        98.65    100.00
## Apps     92.92     92.92

PLS test \(MSE = 1635841\), which is the lowest of all the models fitted so far.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

# MSE analysis
df_MSE <- data.frame(model=c("Lm","Ridge","Lasso","PCR","PLS"),
                     MSE  = c(lm_MSE,ridge_MSE,lasso_MSE,pcr_MSE,pls_MSE))

hc <- df_MSE %>%
  hchart('column', hcaes(x = model, y = MSE))
hc %>% 
  hc_add_theme(hc_theme_economist()) %>%
  hc_title(text = "test MSE across various models")
# R^2 analysis
avg_test = mean(ytest)
sum_of_sq_tot= mean((avg_test - ytest)^2)
lm_r2 = round(1 - (lm_MSE/sum_of_sq_tot),4)*100
ridge_r2 = round(1 - (ridge_MSE/sum_of_sq_tot),4)*100
lasso_r2 = round(1 - (lasso_MSE/sum_of_sq_tot),4)*100
pcr_r2 = round(1 - (pcr_MSE/sum_of_sq_tot),4)*100
pls_r2 = round(1 - (pls_MSE/sum_of_sq_tot),4)*100

df_r2 <- data.frame(model=c("Lm(92.33%)","Ridge(91.86%)","Lasso(91.96%)","PCR(91.75%)","PLS(92.34%)"),
                     r2  = c(lm_r2,ridge_r2,lasso_r2,pcr_r2,pls_r2))

hc <- df_r2 %>%
  hchart('column', hcaes(x = model, y = r2))
hc %>% 
  hc_add_theme(hc_theme_economist()) %>%
  hc_title(text = "% Variation explained across various models")

A) Analysis: Of all the models, PLS has the lowest test \(MSE\), and every model explains roughly 90% or more of the variation in the number of college applications received. The least squares linear model and partial least squares give the best predictions and the lowest test errors, so there is not much difference among the test errors of the five approaches.
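For reference, the percentage of variation explained reported above is the test-set \(R^2\),

\[
R^{2}_{\text{test}} = 1 - \frac{\text{test } MSE}{\frac{1}{n_{\text{test}}}\sum_{i \in \text{test}}\big(y_i - \bar{y}_{\text{test}}\big)^{2}},
\]

which is exactly what the code above computes from each model's test MSE.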

Problem 11.

We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

myboston <- Boston
set.seed(452)
# predict() method for regsubsets objects (regsubsets does not provide one)
predict.regsubsets <- function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    xvars = names(coefi)
    mat[, xvars] %*% coefi
}
# Cross-validation with k = 10 folds:
# allocate each observation to one of the k folds and create a matrix to store the fold-wise MSEs
k=10
folds=sample (1:k,nrow(myboston),replace=TRUE)
cv.errors=matrix(NA,k,13, dimnames=list(NULL,paste(1:13)))

# run best subset selection on each set of 9 training folds and collect the MSEs on the held-out fold
for (j in 1:k) {
    best.fit = regsubsets(crim ~ ., data = myboston[folds != j, ], nvmax = 13)
    for (i in 1:13) {
        pred = predict(best.fit, myboston[folds == j, ], id = i)
        cv.errors[j, i] = mean((myboston$crim[folds == j] - pred)^2)
    }
}

mean_errors = apply(cv.errors, 2, mean)
mean_errors
##        1        2        3        4        5        6        7        8 
## 45.74325 44.85431 44.63879 44.83412 44.55958 44.50076 44.27417 43.77496 
##        9       10       11       12       13 
## 43.46779 43.46229 43.34575 43.44321 43.44798
plot(mean_errors, type = "b", xlab = "Number of variables", ylab = "mean CV error")

The cross-validation error is lowest for the model with 11 variables: mean CV error \(= 43.346\).
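To see which 11 variables the chosen model uses (a minimal sketch that refits best subset selection on the full data set; best_full is a scratch name introduced here):

# best_full: refit on the full data and inspect the 11-variable model's coefficients
best_full = regsubsets(crim ~ ., data = myboston, nvmax = 13)
coef(best_full, 11)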

set.seed(12)
x = model.matrix(crim~.,myboston)[,-1]
y = myboston$crim

# split the data into training and test sets (70:30 split)
bostrain = sample(1:nrow(x), nrow(x)/1.43)  # row indexes for the training set
bostest = (-bostrain)                       # row indexes for the test set
ytest = y[bostest]


#selecting lambda using 10-fold cross validation
set.seed(12)
grid=10^seq(10,-2,length=100)

lasso_cv <- cv.glmnet(x[bostrain,],y[bostrain],alpha=1,lambda=grid,standardize=TRUE,thresh=1e-12)
plot(lasso_cv)

lasso_best=lasso_cv$lambda.min

#predicting the MSE using the best lambda
lasso_pred=predict(lasso_cv,s=lasso_best,newx=x[bostest,])
lasso_MSE <- mean((lasso_pred -ytest)^2)
lasso_MSE
## [1] 12.90178
#getting the coefficients which are non zero
lasso_best=glmnet(x,y,alpha=1,lambda=lasso_cv$lambda.min)
lasso_coef=predict(lasso_best,type="coefficients",s=lasso_cv$lambda.min)[1:13,]
length(lasso_coef[lasso_coef!=0])
## [1] 11
# length(lasso_coef[lasso_coef==0])
lasso_coef
##  (Intercept)           zn        indus         chas          nox           rm 
## 11.368044740  0.034360583 -0.062583150 -0.555543117 -5.684442646  0.148783449 
##          age          dis          rad          tax      ptratio        black 
##  0.000000000 -0.714300559  0.506525182  0.000000000 -0.156258741 -0.007558842 
##        lstat 
##  0.121609089

The lasso has shrunk the coefficients of age and tax exactly to zero, leaving a 10-predictor model, with test \(MSE = 12.90178\). The selected \(\lambda \approx 0.07\) is small, indicating that the fit is close to the least squares model.
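The selected penalty can be confirmed directly from the cross-validation object (a minimal check):

lasso_cv$lambda.min  # selected lambda, roughly 0.07 here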

#selecting lambda using 10-fold cross validation
set.seed(12)
grid=10^seq(10,-2,length=100)

ridge_cv <- cv.glmnet(x[bostrain,],y[bostrain],alpha=0,lambda = grid )
plot(ridge_cv)

bestlam=ridge_cv$lambda.min

#predicting the MSE using the best lambda
ridge_pred=predict(ridge_cv,s=bestlam,newx=x[bostest,])
ridge_MSE <- mean((ridge_pred -ytest)^2)
ridge_MSE
## [1] 13.42227

Ridge regression test \(MSE = 13.42227\). The selected \(\lambda = 0.163\) is again small, so the ridge fit, and hence its MSE, is close to that of least squares.

set.seed(12)
pcr_fit=pcr(crim~., data=myboston, subset=bostrain, scale=TRUE, validation ="CV")
validationplot(pcr_fit, val.type="MSEP")

pcr_pred=predict(pcr_fit,x[bostest,],ncomp =8)
pcr_MSE <- mean((pcr_pred - ytest)^2)
pcr_MSE
## [1] 13.4287

PCR test \(MSE = 13.4287\): reducing to 8 components gives PCR its best MSE, but it still falls short of the lasso's performance.

pls_fit=plsr(crim~.,data=myboston,subset=bostrain,scale=TRUE,validation ="CV")
summary(pls_fit)
## Data:    X dimension: 353 13 
##  Y dimension: 353 1
## Fit method: kernelpls
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.653    7.962    7.672    7.618    7.628    7.591    7.587
## adjCV        9.653    7.959    7.663    7.602    7.613    7.578    7.570
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       7.576    7.542    7.551     7.545     7.549     7.549     7.549
## adjCV    7.557    7.525    7.533     7.529     7.532     7.532     7.532
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.89    57.99    61.63    68.66    76.33    81.06    84.96    87.03
## crim    32.96    39.66    42.06    42.68    43.06    43.41    43.58    43.68
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       89.64     94.13     96.60     98.21    100.00
## crim    43.70     43.70     43.71     43.71     43.71
validationplot(pls_fit,val.type="MSEP")

pls.pred=predict(pls_fit,x[bostest,],ncomp =8)
pls_MSE <- mean((pls.pred -ytest)^2)
pls_MSE
## [1] 14.56908

PLS test \(MSE = 14.56908\): with 8 components PLS attains its best MSE, but it too falls short of the lasso's performance.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

A) After evaluating the cross-validation results and test MSEs for the ridge, lasso, PCR, and PLS models, the model that predicts per capita crime rate best is the lasso with 10 predictors.

(c) Does your chosen model involve all of the features in the data set? Why or why not? A) Not all features are included as we have applied shrinkage methods to improve the prediction and based on the final model selected is Lasso where the coefficients for \(\beta_{tax}\) and \(\beta_{age}\) are reduced to zero.