For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
(a) The lasso is less flexible than the least squares method and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. (iii)
(b) Like the lasso, ridge regression is less flexible than the least squares method and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. (iii)
(c) In contrast to the two previous methods, non-linear methods are more flexible than the least squares method and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias. (ii)
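These answers all follow from the bias-variance decomposition of the expected test MSE at a point \(x_0\):
\[
E\big(y_0 - \hat{f}(x_0)\big)^2 = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon).
\]
Shrinking or constraining the fit (lasso, ridge) trades a small increase in bias for a larger decrease in variance, while a more flexible non-linear fit makes the opposite trade.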
In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
library(ISLR)
attach(College)
x=model.matrix(Apps~.,College)[,-1]
y=College$Apps
set.seed(10)
train=sample(1:nrow(x), nrow(x)/2)
test=(-train)
College.train = College[train, ]
College.test = College[test, ]
y.test=y[test]
The data were split roughly in half, so the training set and the test set each contain about 50% of the observations.
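A quick check of the resulting split sizes (College has 777 rows, so this sketch should report 388 training and 389 test observations):
# Sketch: confirm the sizes of the training and test sets.
dim(College.train)
dim(College.test)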
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
LS.fit<-lm(Apps~., data=College, subset=train)
summary(LS.fit)
##
## Call:
## lm(formula = Apps ~ ., data = College, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5387.1 -468.0 -13.5 394.4 6645.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -676.07070 657.29073 -1.029 0.30435
## PrivateYes -577.50016 218.45328 -2.644 0.00855 **
## Accept 1.67767 0.05299 31.660 < 2e-16 ***
## Enroll -0.55248 0.29906 -1.847 0.06549 .
## Top10perc 66.28284 8.79947 7.533 3.85e-13 ***
## Top25perc -23.12334 7.10800 -3.253 0.00125 **
## F.Undergrad -0.05959 0.05733 -1.039 0.29928
## P.Undergrad 0.02930 0.05744 0.510 0.61025
## Outstate -0.11885 0.02990 -3.974 8.49e-05 ***
## Room.Board 0.22169 0.07652 2.897 0.00399 **
## Books -0.05290 0.38960 -0.136 0.89207
## Personal 0.09498 0.09145 1.039 0.29967
## PhD -13.88234 8.02488 -1.730 0.08448 .
## Terminal 0.95034 8.07090 0.118 0.90633
## S.F.Ratio 21.41748 21.54419 0.994 0.32081
## perc.alumni 8.82698 6.39055 1.381 0.16803
## Expend 0.06876 0.01630 4.217 3.11e-05 ***
## Grad.Rate 11.24407 4.37044 2.573 0.01048 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1133 on 370 degrees of freedom
## Multiple R-squared: 0.93, Adjusted R-squared: 0.9268
## F-statistic: 289.1 on 17 and 370 DF, p-value: < 2.2e-16
pred.app<-predict(LS.fit, College.test)
test.error<-mean((College.test$Apps-pred.app)^2)
test.error
## [1] 1016996
(c) Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.5.3
## Loading required package: Matrix
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.5.3
## Loaded glmnet 2.0-16
#Ridge Model
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid)
summary(ridge.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
#Best Lambda
cv.college.out=cv.glmnet(x[train,],y[train] ,alpha=0)
bestlam=cv.college.out$lambda.min
bestlam
## [1] 429.864
ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 906938.4
The test MSE for the ridge regression model is 906,938.4, an improvement over the least squares test MSE of 1,016,996.
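The ridge coefficient estimates at the selected \(\lambda\) can be inspected the same way as for the lasso in part (d); a sketch (output not shown):
# Sketch: ridge coefficients at the CV-selected lambda. All 17 predictors
# remain non-zero, since ridge shrinks coefficients but never zeroes them.
predict(ridge.mod, type = "coefficients", s = bestlam)[1:18, ]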
(d) Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
#Lasso Model
lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
summary(lasso.mod)
## Length Class Mode
## a0 100 -none- numeric
## beta 1700 dgCMatrix S4
## df 100 -none- numeric
## dim 2 -none- numeric
## lambda 100 -none- numeric
## dev.ratio 100 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## call 5 -none- call
## nobs 1 -none- numeric
#Best Lambda
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.out$lambda.min
bestlam
## [1] 23.48036
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 930235
# Lasso Coefficients
out=glmnet(x,y,alpha=1,lambda = grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:18,]
lasso.coef[lasso.coef!=0]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -6.216869e+02 -4.143091e+02 1.443950e+00 -1.638136e-01 3.230157e+01
## Top25perc P.Undergrad Outstate Room.Board Personal
## -1.464598e+00 1.701264e-02 -5.512815e-02 1.221390e-01 5.212088e-04
## PhD Terminal S.F.Ratio perc.alumni Expend
## -5.297377e+00 -3.347805e+00 3.356347e+00 -1.006902e+00 6.885043e-02
## Grad.Rate
## 4.875837e+00
The test MSE for the lasso model is 930,235. At the selected \(\lambda\), the lasso zeroes out the coefficients for F.Undergrad and Books, leaving 15 non-zero coefficient estimates (plus the intercept).
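A one-line sketch to confirm the count programmatically:
# Sketch: count the non-zero coefficient estimates, excluding the intercept.
sum(lasso.coef != 0) - 1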
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.college=pcr(Apps~., data=College.train,scale=TRUE,validation="CV")
summary(pcr.college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4193 4171 2422 2321 2341 2054 1987
## adjCV 4193 4172 2418 2315 2342 2042 1979
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1987 1959 1844 1848 1872 1878 1898
## adjCV 1982 1953 1835 1840 1863 1869 1888
## 14 comps 15 comps 16 comps 17 comps
## CV 1883 1666 1377 1380
## adjCV 1898 1637 1361 1364
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 31.510 56.98 64.18 70.42 76.06 81.04 84.66
## Apps 1.913 68.37 71.21 71.45 78.19 80.06 80.06
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 87.95 90.78 93.13 95.25 96.86 97.95 98.74
## Apps 80.79 83.34 83.35 83.37 83.37 83.64 84.66
## 15 comps 16 comps 17 comps
## X 99.42 99.87 100
## Apps 91.09 92.95 93
validationplot(pcr.college, val.type="MSEP")
# Test error with M = 10 components
pcr.pred=predict(pcr.college,x[test,],ncomp=10)
mean((pcr.pred-y.test)^2)
## [1] 1371446
The number of components selected for this model is M = 10. The CV error is actually lowest at 16 components (CV = 1377), but it plateaus around 9-10 components (CV of roughly 1844-1848), and 10 components already explain 83.35% of the variance in Apps, so M = 10 gives a more parsimonious model. The test MSE for the PCR model is 1,371,446.
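A sketch of how the CV curve can be interrogated programmatically (the pls package also provides selectNcomp() for a principled choice; output not shown):
# Sketch: CV RMSEP by number of components; the first entry is the
# 0-component (intercept-only) model, hence the "- 1".
cv.rmsep = RMSEP(pcr.college)$val["CV", 1, ]
which.min(cv.rmsep) - 1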
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
#PLS Model
pls.college=plsr(Apps~., data=College.train,scale=TRUE, validation="CV")
validationplot(pls.college, val.type="MSEP")
summary(pls.college)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4193 2221 1901 1768 1697 1573 1398
## adjCV 4193 2213 1893 1763 1677 1547 1380
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1370 1359 1358 1363 1364 1364 1361
## adjCV 1355 1345 1344 1348 1349 1349 1346
## 14 comps 15 comps 16 comps 17 comps
## CV 1359 1359 1358 1358
## adjCV 1345 1344 1344 1343
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 25.52 38.49 62.37 66.41 70.45 73.47 77.99
## Apps 74.33 83.18 85.86 89.52 91.82 92.82 92.88
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 80.21 84.21 87.98 89.49 91.24 92.24 95.13
## Apps 92.95 92.98 92.98 92.99 92.99 93.00 93.00
## 15 comps 16 comps 17 comps
## X 96.53 98.31 100
## Apps 93.00 93.00 93
# Test error with M = 9 components
pls.pred=predict(pls.college,x[test,],ncomp=9)
mean((pls.pred-y.test)^2)
## [1] 1046329
In contrast to the PCR model, the PLS model uses M = 9 components: the CV error essentially bottoms out there (CV = 1358) and barely changes with further components, while 9 components already explain 92.98% of the variance in Apps, giving a more parsimonious model. The test MSE for the PLS model is 1,046,329.
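The same sketch applies to the PLS fit (output not shown):
# Sketch: CV RMSEP for the PLS fit; the curve flattens after about 9 components.
RMSEP(pls.college)$val["CV", 1, ]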
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
test.avg = mean(College.test[, "Apps"])
lm.test.r2 = 1 - mean((College.test[, "Apps"] - pred.app)^2) /mean((College.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((College.test[, "Apps"] - ridge.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((College.test[, "Apps"] - lasso.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((pcr.pred-y.test)^2) /mean((College.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((pls.pred-y.test)^2) /mean((College.test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2), col="yellow", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main="Test R-squared")
detach(College)
Comparing the test \(R^2\) values shows that all five models predict the number of college applications fairly accurately. PCR has the lowest test \(R^2\) (about 0.85), followed by PLS (about 0.87). Comparing the test MSEs (table below) shows that ridge regression has the lowest test error, so it may give the best model here; a sketch after the table converts these errors to the scale of the response.
| Model | Test error |
|---|---|
| OLS | 1,016,996 |
| Ridge | 906,938.4 |
| Lasso | 930,235 |
| PCR | 1,371,446 |
| PLS | 1,046,329 |
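Taking square roots of the test MSEs gives typical prediction errors of roughly 950 to 1,170 applications (a quick sketch):
# Sketch: test RMSEs on the scale of Apps, from the MSEs in the table above.
test.mse = c(OLS = 1016996, Ridge = 906938.4, Lasso = 930235, PCR = 1371446, PLS = 1046329)
round(sqrt(test.mse))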
We will now try to predict the per capita crime rate in the Boston data set.
(a) Try out some of the regression approaches explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(leaps)
library(MASS)
set.seed(1)
attach(Boston)
predict.regsubsets = function(object, newdata, id, ...) {
    # Rebuild the formula used in the original regsubsets() call
    form = as.formula(object$call[[2]])
    # Construct the design matrix for the new data
    mat = model.matrix(form, newdata)
    # Coefficients of the model with `id` variables, applied to matching columns
    coefi = coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
# 10-fold CV: for each fold, fit best subsets on the other folds and
# compute the held-out MSE for every model size
for (i in 1:k) {
    best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred = predict(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
    }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")
which.min(mean.cv.errors)
## [1] 9
mean.cv.errors[which.min(mean.cv.errors)]
## [1] 43.47287
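For comparability with the RMSE figures below, note that \(\sqrt{43.47} \approx 6.59\). A sketch of refitting best subset selection on the full data to inspect the chosen 9-variable model (output not shown):
# Sketch: refit best subset selection on all of Boston and extract the
# coefficients of the 9-variable model chosen by cross-validation.
reg.best = regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(reg.best, 9)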
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.lasso = cv.glmnet(x, y, type.measure = "mse")
plot(cv.lasso)
coef(cv.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.0894283
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.2643196
## tax .
## ptratio .
## black .
## lstat .
## medv .
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])
## [1] 7.405176
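Note that coef() on a cv.glmnet object reports coefficients at lambda.1se by default, which is why only rad survives above. For comparison, the CV RMSE at lambda.min can be extracted the same way (a sketch; value not shown):
# Sketch: CV RMSE at lambda.min rather than lambda.1se.
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.min])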
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.ridge = cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(cv.ridge)
coef(cv.ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.017516870
## zn -0.002805664
## indus 0.034405928
## chas -0.225250601
## nox 2.249887495
## rm -0.162546004
## age 0.007343331
## dis -0.114928730
## rad 0.059813843
## tax 0.002659110
## ptratio 0.086423005
## black -0.003342067
## lstat 0.044495212
## medv -0.029124577
sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se])
## [1] 7.456946
pcr.crime = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.crime)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.170 7.163 6.733 6.728 6.740 6.765
## adjCV 8.61 7.169 7.162 6.730 6.723 6.737 6.760
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.760 6.634 6.652 6.642 6.652 6.607 6.546
## adjCV 6.754 6.628 6.644 6.635 6.643 6.598 6.536
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 93.45 95.40 97.04 98.46 99.52 100.0
## crim 42.47 42.55 42.78 43.04 44.13 45.4
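As with the College models, the CV curve can be visualized (a sketch):
# Sketch: CV MSEP curve for the Boston PCR fit.
validationplot(pcr.crime, val.type = "MSEP")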
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Judging by cross-validation, best subset selection achieves a CV MSE of about 43.5 with 9 variables (RMSE of about 6.59). In RMSE terms, the lasso gives 7.41 and ridge 7.46 (both at lambda.1se), while PCR reaches its lowest CV RMSE of 6.546 using all 13 components. Best subset selection and PCR therefore perform best on this data set, with the penalized methods close behind.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
No. The model I would select is the best subset model: it has a low cross-validated error and uses only 9 of the 13 predictors. The ridge model retains all 13 predictors (shrunken, but none exactly zero), while the lasso at lambda.1se keeps only rad. The PCR model attains the lowest CV error, but it uses all 13 components, so it offers no reduction toward a more parsimonious model.