Assignment #5

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

More flexible and hence will give improved prediction ac- curacy when its increase in bias is less than its decrease in variance.
More flexible and hence will give improved prediction accu- racy when its increase in variance is less than its decrease in bias.
Less flexible and hence will give improved prediction accu- racy when its increase in bias is less than its decrease in variance.
Less flexible and hence will give improved prediction accu- racy when its increase in variance is less than its decrease in bias.

The lasso, relative to least squares, is: iii is the correct response. Lasso is better than least squares because it can balance bias and variance well. When least squares gives too much variation, Lasso can reduce it by accepting a bit more bias, resulting in more accurate predictions. Also, Lasso makes it easier to pick the important variables, which makes it easier to understand compared to methods like ridge regression.
Repeat (a) for ridge regression relative to least squares: iii is the correct choice. Ridge regression and lasso outperform least squares by balancing bias and variance. As λ increases, ridge regression’s flexibility decreases, reducing variance but increasing bias. Understanding this λ-variance-bias relationship is key. Least squares coefficients can lead to significant variance with minor changes in training data, while ridge regression can trade off a small bias increase for substantial variance reduction, performing well under such conditions. Ridge regression is particularly effective when least squares estimates have high variance. One notable distinction is that lasso conducts variable selection, making interpretation easier.
Repeat (a) for non-linear methods relative to least squares: ii is the correct option. Non-linear models offer greater flexibility and have lower bias compared to least squares models.

In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)

Split the data set into a training set and a test set.

attach(College)
x=model.matrix(Apps~.,College)[,-1]
y=College$Apps
set.seed(10)
train=sample(1:nrow(x), nrow(x)/2)
test=(-train)
College.train = College[train, ]
College.test = College[test, ]
y.test=y[test]

Fit a linear model using least squares on the training set, and report the test error obtained.

pls.fit<-lm(Apps~., data=College, subset=train)
summary(pls.fit)

## 
## Call:
## lm(formula = Apps ~ ., data = College, subset = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5139.5  -473.3   -21.1   353.2  7402.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -629.36179  639.35741  -0.984 0.325579    
## PrivateYes  -647.56836  192.17056  -3.370 0.000832 ***
## Accept         1.68912    0.05038  33.530  < 2e-16 ***
## Enroll        -1.02383    0.27721  -3.693 0.000255 ***
## Top10perc     48.19124    8.10714   5.944 6.42e-09 ***
## Top25perc    -10.51538    6.44952  -1.630 0.103865    
## F.Undergrad    0.01992    0.05364   0.371 0.710574    
## P.Undergrad    0.04213    0.05348   0.788 0.431373    
## Outstate      -0.09489    0.02674  -3.549 0.000436 ***
## Room.Board     0.14549    0.07243   2.009 0.045277 *  
## Books          0.06660    0.31115   0.214 0.830623    
## Personal       0.05663    0.09453   0.599 0.549475    
## PhD          -10.11489    7.11588  -1.421 0.156027    
## Terminal      -2.29300    8.03546  -0.285 0.775528    
## S.F.Ratio     22.07117   18.70991   1.180 0.238897    
## perc.alumni    2.08121    6.00673   0.346 0.729179    
## Expend         0.07654    0.01672   4.577 6.45e-06 ***
## Grad.Rate      9.99706    4.49821   2.222 0.026857 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1092 on 370 degrees of freedom
## Multiple R-squared:  0.9395, Adjusted R-squared:  0.9367 
## F-statistic:   338 on 17 and 370 DF,  p-value: < 2.2e-16

pred.app<-predict(pls.fit, College.test)
test.error<-mean((College.test$Apps-pred.app)^2)
test.error

## [1] 1020100

MSE for linear model using the least squares is 1,020,100

Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

chooseCRANmirror(ind=1) 
install.packages('glmnet')

## 
## The downloaded binary packages are in
##  /var/folders/qq/bfgnq9g96sz_3lrzwxn_zp3c0000gn/T//RtmpphC54P/downloaded_packages

library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-8

grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid)
summary(ridge.mod)

##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1700   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric

cv.college.out=cv.glmnet(x[train,],y[train] ,alpha=0)
bestlam=cv.college.out$lambda.min
bestlam

## [1] 411.3927

ridge.pred=predict(ridge.mod,s=bestlam,newx=x[test,])
mean((ridge.pred-y.test)^2)

## [1] 985020.1

The ridge regression model performs better on this dataset than ordinary least squares (OLS). Ridge regression is especially useful when OLS struggles with high variability in the data. However, interpreting the relationship between variables becomes tricky with ridge regression because it includes all variables but adjusts their importance, making it harder to understand their impact on the outcome.

Fit a lasso model on the training set, with λ chosen by cross- validation. Report the test error obtained, along with the num- ber of non-zero coefficient estimates.

lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
summary(lasso.mod)

##           Length Class     Mode   
## a0         100   -none-    numeric
## beta      1700   dgCMatrix S4     
## df         100   -none-    numeric
## dim          2   -none-    numeric
## lambda     100   -none-    numeric
## dev.ratio  100   -none-    numeric
## nulldev      1   -none-    numeric
## npasses      1   -none-    numeric
## jerr         1   -none-    numeric
## offset       1   -none-    logical
## call         5   -none-    call   
## nobs         1   -none-    numeric

cv.out=cv.glmnet(x[train,],y[train],alpha=1)
bestlam=cv.out$lambda.min
bestlam

## [1] 24.66235

lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)

## [1] 1008145

out=glmnet(x,y,alpha=1,lambda = grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:18,]
lasso.coef[lasso.coef!=0]

##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -6.324960e+02 -4.087012e+02  1.436837e+00 -1.410240e-01  3.143012e+01 
##     Top25perc   P.Undergrad      Outstate    Room.Board      Personal 
## -8.606536e-01  1.480293e-02 -5.342495e-02  1.205819e-01  4.379135e-05 
##           PhD      Terminal     S.F.Ratio   perc.alumni        Expend 
## -5.121245e+00 -3.371192e+00  2.717231e+00 -1.039648e+00  6.838161e-02 
##     Grad.Rate 
##  4.700317e+00

MSE for the lasso model is 1,008,145

Fit a PCR model on the training set, with M chosen by cross- validation. Report the test error obtained, along with the value of M selected by cross-validation.

install.packages('pls')

## 
## The downloaded binary packages are in
##  /var/folders/qq/bfgnq9g96sz_3lrzwxn_zp3c0000gn/T//RtmpphC54P/downloaded_packages

library(pls)

## 
## Attaching package: 'pls'

## The following object is masked from 'package:stats':
## 
##     loadings

pcr.college=pcr(Apps~., data=College.train,scale=TRUE,validation="CV")
summary(pcr.college)

## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4347     4345     2371     2391     2104     1949     1898
## adjCV         4347     4345     2368     2396     2085     1939     1891
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1899     1880     1864      1861      1870      1873      1891
## adjCV     1893     1862     1857      1853      1862      1865      1885
##        14 comps  15 comps  16 comps  17 comps
## CV         1903      1727      1295      1260
## adjCV      1975      1669      1283      1249
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X     32.6794    56.94    64.38    70.61    76.27    80.97    84.48    87.54
## Apps   0.9148    71.17    71.36    79.85    81.49    82.73    82.79    83.70
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.50     92.89     94.96     96.81     97.97     98.73     99.39
## Apps    83.86     84.08     84.11     84.11     84.16     84.28     93.08
##       16 comps  17 comps
## X        99.86    100.00
## Apps     93.71     93.95

validationplot(pcr.college, val.type="MSEP")

pcr.pred=predict(pcr.college,x[test,],ncomp=10)
mean((pcr.pred-y.test)^2)

## [1] 1422699

Just by looking at the graph of the model, it seems the MSE is lowest with around 10 components. The summary confirms that using 10 components gives the lowest error and explains most of the variance in both predictors (92.89%) and the response variable (84.08%).

Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

pls.college=plsr(Apps~., data=College.train,scale=TRUE, validation="CV")
validationplot(pls.college, val.type="MSEP")

summary(pls.college)

## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4347     2178     1872     1734     1615     1453     1359
## adjCV         4347     2171     1867     1726     1586     1427     1341
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1347     1340     1329      1317      1310      1305      1305
## adjCV     1330     1324     1314      1302      1296      1291      1291
##        14 comps  15 comps  16 comps  17 comps
## CV         1305      1307      1307      1307
## adjCV      1291      1292      1293      1293
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       24.27    38.72    62.64    65.26    69.01    73.96    78.86    82.18
## Apps    76.96    84.31    86.80    91.48    93.37    93.75    93.81    93.84
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       85.35     87.42     89.18     91.41     92.70     94.58     97.16
## Apps    93.88     93.91     93.93     93.94     93.95     93.95     93.95
##       16 comps  17 comps
## X        98.15    100.00
## Apps     93.95     93.95

pls.pred=predict(pls.college,x[test,],ncomp=9)
mean((pls.pred-y.test)^2)

## [1] 1049868

The PLS model resulted with 11 components because this had the lowest CV at 1310 with high variance explained at 89.18.

Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five ap- proaches?

test.avg = mean(College.test[, "Apps"])
lm.test.r2 = 1 - mean((College.test[, "Apps"] - pred.app)^2) /mean((College.test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((College.test[, "Apps"] - ridge.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((College.test[, "Apps"] - lasso.pred)^2) /mean((College.test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((pcr.pred-y.test)^2) /mean((College.test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((pls.pred-y.test)^2) /mean((College.test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2), names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"), main="Test R-squared")

Judging by the test r2 values, these models aren’t too bad in terms of accuracy. Except for the PCR model, all others show high accuracy.

We will now try to predict per capita crime rate in the Boston data set.

Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

install.packages('leaps')

## 
## The downloaded binary packages are in
##  /var/folders/qq/bfgnq9g96sz_3lrzwxn_zp3c0000gn/T//RtmpphC54P/downloaded_packages

library(leaps)

library(MASS)
set.seed(1)
attach(Boston)
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}

k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
    best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred = predict(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
    }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "# of variables", ylab = "CV error")

which.min(mean.cv.errors)

## [1] 9

mean.cv.errors[which.min(mean.cv.errors)]

## [1] 42.81453

x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.lasso = cv.glmnet(x, y, type.measure = "mse")
plot(cv.lasso)

coef(cv.lasso)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                   s1
## (Intercept) 2.176491
## zn          .       
## indus       .       
## chas        .       
## nox         .       
## rm          .       
## age         .       
## dis         .       
## rad         0.150484
## tax         .       
## ptratio     .       
## black       .       
## lstat       .       
## medv        .

sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])

## [1] 7.921353

x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.ridge = cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(cv.ridge)

coef(cv.ridge)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  1.523899542
## zn          -0.002949852
## indus        0.029276741
## chas        -0.166526007
## nox          1.874769665
## rm          -0.142852604
## age          0.006207995
## dis         -0.094547258
## rad          0.045932737
## tax          0.002086668
## ptratio      0.071258052
## black       -0.002605281
## lstat        0.035745604
## medv        -0.023480540

sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se])

## [1] 7.669133

pcr.crime = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.crime)

## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.175    7.180    6.724    6.731    6.727    6.727
## adjCV         8.61    7.174    7.179    6.721    6.725    6.724    6.724
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.722    6.614    6.618     6.607     6.598     6.553     6.488
## adjCV    6.718    6.609    6.613     6.602     6.592     6.546     6.481
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4

Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross- validation, or some other reasonable alternative, as opposed to using training error. Based on the MSE, the best subset selection model had the lowest cross-validation error with the MSE of 42.8.
Does your chosen model involve all of the features in the data set? Why or why not? The selected model is the best subset selection model, featuring 9 predictors and the lowest Mean Squared Error (MSE). By excluding four out of the 13 predictors, we aim to minimize variance. This model is designed to achieve both low variance and MSE, while maintaining good accuracy.

Assignment #5

MK

2024-03-22