Chapter 06 (page 259): 2, 9, 11

2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

False: The lasso reduces the model's flexibility through its penalty term, decreasing the variance and increasing the bias. The lasso gives improved prediction accuracy specifically when the increase in bias is less than the decrease in variance, but it is less flexible than least squares, not more.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

False (my answer is justified below).

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

True (my answer is justified below).

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

False (my answer is justified below).

Justification: As in my explanation for (a) i., the lasso has an advantage over least squares when we analyze the bias-variance trade-off. When the least squares estimates have high variance, the lasso can reduce that variance, though in turn it will increase the bias; when the decrease in variance outweighs the increase in bias, this yields more accurate predictions. In addition, the lasso's penalty term, which uses the absolute value of the coefficients, can drive coefficients to exactly zero, which allows for easier interpretation.

(b) Repeat (a) for ridge regression relative to least squares.

Answer: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. [True]

Both ridge and lasso regression have an advantage over least squares based on the bias-variance trade-off. As λ increases, ridge regression's flexibility decreases, which decreases the variance and increases the bias.

Even a small change in the training data can produce a large change in the least squares coefficient estimates; that is, they have high variance.

Ridge regression, however, can perform well by trading a small increase in bias for a large decrease in variance.

Comparing ridge with the lasso, the lasso's penalty term, which uses the absolute value of the coefficients, can drive coefficients to exactly zero, which allows for easier interpretation. A small illustration of this difference is sketched below.
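As a quick illustration (a minimal sketch on simulated data, assuming the glmnet package is installed; the variable names x.sim and y.sim are made up for demonstration), ridge keeps every coefficient non-zero while the lasso zeroes many of them out:

library(glmnet)

set.seed(123)
n <- 100; p <- 20
x.sim <- matrix(rnorm(n * p), n, p)                      # 20 predictors, only 3 matter
y.sim <- x.sim[, 1] - 2 * x.sim[, 2] + 0.5 * x.sim[, 3] + rnorm(n)

ridge.sim <- glmnet(x.sim, y.sim, alpha = 0)             # ridge: L2 penalty
lasso.sim <- glmnet(x.sim, y.sim, alpha = 1)             # lasso: L1 penalty

sum(coef(ridge.sim, s = 0.5)[-1] != 0)  # ridge keeps all 20 coefficients non-zero
sum(coef(lasso.sim, s = 0.5)[-1] != 0)  # the lasso drives many of them to exactly zero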

(c) Repeat (a) for non-linear methods relative to least squares.

Answer: ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Non-linear models are more flexible in the shapes of the curves that they can fit to the data. They provide higher prediction accuracy when their increase in variance is less than their decrease in bias. It is also worth noting that non-linear models (such as decision trees, which are covered later in this class) can more effectively describe relationships in complex data sets.

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.3
head(College)
##                              Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University     Yes 1660   1232    721        23        52
## Adelphi University               Yes 2186   1924    512        16        29
## Adrian College                   Yes 1428   1097    336        22        50
## Agnes Scott College              Yes  417    349    137        60        89
## Alaska Pacific University        Yes  193    146     55        16        44
## Albertson College                Yes  587    479    158        38        62
##                              F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University        2885         537     7440       3300   450
## Adelphi University                  2683        1227    12280       6450   750
## Adrian College                      1036          99    11250       3750   400
## Agnes Scott College                  510          63    12960       5450   450
## Alaska Pacific University            249         869     7560       4120   800
## Albertson College                    678          41    13500       3335   500
##                              Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University     2200  70       78      18.1          12   7041
## Adelphi University               1500  29       30      12.2          16  10527
## Adrian College                   1165  53       66      12.9          30   8735
## Agnes Scott College               875  92       97       7.7          37  19016
## Alaska Pacific University        1500  76       72      11.9           2  10922
## Albertson College                 675  67       73       9.4          11   9727
##                              Grad.Rate
## Abilene Christian University        60
## Adelphi University                  56
## Adrian College                      54
## Agnes Scott College                 59
## Alaska Pacific University           15
## Albertson College                   55
(a) Split the data set into a training set and a test set.

I will split 50/50

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#make this example reproducible
set.seed(1)

#create ID column
College = as.data.frame(College)
College$id <- 1:nrow(College)

train <- College %>% dplyr::sample_frac(.50)
test  <- dplyr::anti_join(College, train, by = 'id')

# drop id
College = College[,-c(19)]
test = test[,-c(19)]
train = train[,-c(19)]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.

We use the number of applications (Apps) as the response, with all other variables as predictors.

lm.fit = lm(Apps ~., data = train)
lm.pred = predict(lm.fit, test)

# Calculate the mean squared error
mse = mean((test$Apps - lm.pred)^2)
cat("Test MSE:", mse, "\n")
## Test MSE: 1135758
# Calculate the mean absolute error
mae = mean(abs(test$Apps - lm.pred))
cat("Test MAE:", mae, "\n")
## Test MAE: 650.8525
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.1.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.1.3
## Loaded glmnet 4.1-6
# convert the test and train to matrix

train.m = model.matrix(Apps~., data = train)
test.m = model.matrix(Apps ~., data = test)
grid = 10^seq(10, -2, length = 100) # give me ridge reg, over range of values of lambda

mod.ridge = cv.glmnet(train.m, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
lambda.best = mod.ridge$lambda.min

cat("The best lambda value is", lambda.best, "\n")
## The best lambda value is 0.01
plot(mod.ridge)

ridge.pred = predict(mod.ridge, newx=test.m, s=lambda.best)

# Calculate the mean squared error
mse = mean((test$Apps - ridge.pred)^2)
cat("Test MSE:", mse, "\n")
## Test MSE: 1135714
# Calculate the mean absolute error
mae = mean(abs(test$Apps - ridge.pred))
cat("Test MAE:", mae, "\n")
## Test MAE: 650.8396

The test MSE for ridge regression is just slightly lower than the test MSE for OLS; the same holds for the test MAE.

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
mod.lasso = cv.glmnet(train.m, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
lambda.best = mod.lasso$lambda.min

cat("The best lambda value is", lambda.best, "\n")
## The best lambda value is 0.01
plot(mod.lasso)

lasso.pred = predict(mod.lasso, newx=test.m, s=lambda.best)

# Calculate the mean squared error
mse = mean((test$Apps - lasso.pred)^2)
cat("Test MSE:", mse, "\n")
## Test MSE: 1135659
# Calculate the mean absolute error
mae = mean(abs(test$Apps - lasso.pred))
cat("Test MAE:", mae, "\n")
## Test MAE: 650.8153

The test MSE for the lasso is just slightly lower than the test MSE for OLS, while the test MAE is nearly identical for the two.

The coefficients are as follows:

mod.lasso = glmnet(model.matrix(Apps ~., data = College), College$Apps, alpha =1)
predict(mod.lasso, s = lambda.best, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -471.39372052
## (Intercept)    .         
## PrivateYes  -491.04485137
## Accept         1.57033288
## Enroll        -0.75961467
## Top10perc     48.14698892
## Top25perc    -12.84690695
## F.Undergrad    0.04149116
## P.Undergrad    0.04438973
## Outstate      -0.08328388
## Room.Board     0.14943472
## Books          0.01532293
## Personal       0.02909954
## PhD           -8.39597537
## Terminal      -3.26800340
## S.F.Ratio     14.59298267
## perc.alumni   -0.04404771
## Expend         0.07712632
## Grad.Rate      8.28950241

My results may be due to my cross-validation: cv.glmnet performs 10-fold cross-validation by default, and a different number of folds or a different fold assignment could select a slightly different λ. A sketch of how to control this is below.
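For instance (a minimal sketch, assuming the train.m matrix and grid from above are still in scope), cv.glmnet lets us set the number of folds, or fix the fold assignment for full reproducibility:

set.seed(1)
# use 5 folds instead of the default 10
mod.lasso.5fold = cv.glmnet(train.m, train$Apps, alpha = 1, lambda = grid, nfolds = 5)

# or supply the fold assignment explicitly
foldid = sample(rep(1:10, length.out = nrow(train.m)))
mod.lasso.fixed = cv.glmnet(train.m, train$Apps, alpha = 1, lambda = grid, foldid = foldid)
mod.lasso.fixed$lambda.min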

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
## Warning: package 'pls' was built under R version 4.1.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
pcr.fit = pcr(Apps ~., data = train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")

pcr.pred = predict(pcr.fit, test, ncomp=10)

# Calculate the mean squared error
mse = mean((test$Apps - pcr.pred)^2)
cat("Test MSE:", mse, "\n")
## Test MSE: 1723100
# Calculate the mean absolute error
mae = mean(abs(test$Apps - pcr.pred))
cat("Test MAE:", mae, "\n")
## Test MAE: 814.0476
summary(pcr.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4288     4054     2422     2432     2117     2022     1959
## adjCV         4288     4051     2415     2426     2061     1999     1948
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1957     1955     1911      1852      1853      1856      1861
## adjCV     1947     1947     1900      1840      1843      1846      1851
##        14 comps  15 comps  16 comps  17 comps
## CV         1877      1679      1338      1269
## adjCV      1876      1649      1323      1256
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       32.20    57.78    65.31    70.99    76.37    81.27     84.8    87.85
## Apps    13.44    70.93    71.07    79.87    81.15    82.25     82.3    82.33
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.62     92.91     94.98     96.74     97.79     98.72     99.42
## Apps    83.38     84.76     84.80     84.84     85.11     85.14     90.55
##       16 comps  17 comps
## X        99.88    100.00
## Apps     93.42     93.89

Looking at the validation plot as an "elbow plot", the MSEP for PCR levels off around M = 10. From the summary, M = 10 gives a local minimum in the CV error; the error only drops substantially again when nearly all 17 components are used, at which point PCR offers little advantage over least squares. A way to extract these CV errors programmatically is sketched below.
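The per-component cross-validated error can also be pulled out of the fit directly (a small sketch, assuming the pcr.fit object from above; MSEP() is part of the pls package):

# cross-validated MSEP for 0, 1, ..., 17 components
cv.msep = MSEP(pcr.fit, estimate = "CV")$val[1, 1, ]
cv.msep

# component count at the overall minimum (subtract 1 for the intercept-only entry)
which.min(cv.msep) - 1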

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls.fit = plsr(Apps ~., data = train, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")

pls.pred = predict(pls.fit, test, ncomp = 10)

# Calculate the mean squared error
mse = mean((test$Apps - pls.pred)^2)
cat("Test MSE:", mse, "\n")
## Test MSE: 1131661
# Calculate the mean absolute error
mae = mean(abs(test$Apps - pls.pred))
cat("Test MAE:", mae, "\n")
## Test MAE: 654.9752
summary(pls.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4288     2215     2010     1762     1610     1480     1330
## adjCV         4288     2209     2001     1748     1582     1460     1316
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1303     1292     1289      1286      1282      1277      1270
## adjCV     1290     1279     1275      1273      1268      1264      1258
##        14 comps  15 comps  16 comps  17 comps
## CV         1270      1270      1270      1270
## adjCV      1257      1257      1257      1257
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       27.21    50.73    63.06    65.52    70.20    74.20    78.62    80.81
## Apps    75.39    81.24    86.97    91.14    92.62    93.43    93.56    93.68
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       83.29     87.17     89.15     91.37     92.58     94.42     96.98
## Apps    93.76     93.79     93.83     93.86     93.88     93.89     93.89
##       16 comps  17 comps
## X        98.78    100.00
## Apps     93.89     93.89

With M = 10, the test MSE for PLS is just slightly lower than the OLS test MSE. This value of M was chosen because the MSEP appears to plateau starting at roughly that number of components; a more formal selection rule is sketched below.
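The pls package also provides a formal way to choose M (a sketch, assuming the pls.fit object from above; selectNcomp() is available in recent versions of pls):

# "one sigma" heuristic: the smallest M whose CV error is within one standard
# error of the overall minimum
selectNcomp(pls.fit, method = "onesigma", plot = TRUE)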

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Let's look at this visually:

test.avg = mean(test[, "Apps"])  # mean of Apps in the test set (for the total sum of squares)

# test-set R^2 = 1 - RSS/TSS for each model
lm.test.r2 = 1 - mean((test$Apps - lm.pred)^2) / mean((test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((test$Apps - ridge.pred)^2) / mean((test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((test$Apps - lasso.pred)^2) / mean((test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((test$Apps - pcr.pred)^2) / mean((test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((test$Apps - pls.pred)^2) / mean((test[, "Apps"] - test.avg)^2)

# collect the five test R^2 values into one data frame
yuh = rbind(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2)
yuh = as.data.frame(yuh)
head(yuh)
##                      V1
## lm.test.r2    0.9015413
## ridge.test.r2 0.9015452
## lasso.test.r2 0.9015499
## pcr.test.r2   0.8506248
## pls.test.r2   0.9018965
data <- read.table(header=TRUE, text="
Model_Name Value
lm.test.r2  0.9015413           
ridge.test.r2   0.9015452           
lasso.test.r2   0.9015499           
pcr.test.r2 0.8506248           
pls.test.r2 0.9018965   ")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.3
ggplot(data, aes(y=Value, x=Model_Name)) +
  geom_bar(position='dodge', stat='identity') +
  geom_text(aes(label=Value), vjust=1.6, color="white", size=3.5)+
  theme_minimal()

Using the test R^2 (the proportion of variance in the test-set responses explained by each model), we can determine how well each model fits our data. The graph shows this R^2 per model. Most of them are very close to each other, with the exception of PCR. Our best model was PLS, but only by a very small margin.

PCR has the smallest test R^2, at about 0.85.
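Since the same MSE, MAE, and R^2 calculations were repeated for every model, a small helper could compute all three at once (a sketch, assuming the prediction vectors computed above are still in scope; test.metrics is a hypothetical name):

# test MSE, MAE and R^2 for a vector of predictions (hypothetical helper)
test.metrics = function(pred, actual = test$Apps) {
  c(MSE = mean((actual - pred)^2),
    MAE = mean(abs(actual - pred)),
    R2  = 1 - mean((actual - pred)^2) / mean((actual - mean(actual))^2))
}

sapply(list(lm = lm.pred, ridge = ridge.pred, lasso = lasso.pred,
            pcr = pcr.pred, pls = pls.pred), test.metrics)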

11. We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(leaps)
## Warning: package 'leaps' was built under R version 4.1.3
library(glmnet)
library(MASS)
## Warning: package 'MASS' was built under R version 4.1.3
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

Let's try best subset selection:

set.seed(1)
# predict() method for regsubsets objects: rebuild the design matrix for newdata
# and multiply by the coefficients of the id-th best model
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}

k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
# 10-fold cross-validation: for each fold, fit best subsets of every size on the
# other folds and record the held-out MSE for each model size
for (i in 1:k) {
    best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred = predict(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
    }
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))  # CV RMSE for each model size
plot(rmse.cv, pch = 19, type = "b")

which.min(rmse.cv)
## [1] 9
rmse.cv[which.min(rmse.cv)]
## [1] 6.543281

Best subset selection yielded a cross-validated RMSE of about 6.54, achieved with 9 predictors.

Best subset selection selects the subset of predictor variables that provides the most accurate and parsimonious model. Specifically, it involves fitting all possible regression models over subsets of the available predictors and selecting the model that performs best according to some criterion. An alternative to cross-validation is sketched below.
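For comparison (a sketch, assuming the leaps package loaded above), best subset selection on the full data set can also be scored with an analytic criterion such as BIC instead of cross-validation:

# best subsets of every size on the full Boston data, scored by BIC
reg.full = regsubsets(crim ~ ., data = Boston, nvmax = ncol(Boston) - 1)
reg.summary = summary(reg.full)
which.min(reg.summary$bic)                 # model size with the lowest BIC
coef(reg.full, which.min(reg.summary$bic)) # its coefficients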

Ridge regression:

Ridge regression is a model tuning method that is used to analyse any data that suffers from multicollinearity.

set.seed(2)
x = model.matrix(crim ~. -1, data = Boston)
y = Boston$crim
cv.ridge = cv.glmnet(x, y, type.measure = "mse", alpha =0)
plot(cv.ridge)

bestlam = cv.ridge$lambda.min
bestlam
## [1] 0.5374992

The λ chosen by cross-validation is about 0.537 (note that this is the value of the tuning parameter, not the MSE). Ridge regression is particularly useful when there are many predictor variables in the data set and when these predictors are highly correlated. In such cases, the ordinary least squares model can suffer from high variance and overfitting, which can lead to poor generalization performance on new data. The cross-validated error itself can be pulled out of the fit as sketched below.
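A small sketch, assuming the cv.ridge object from above: the cvm component of a cv.glmnet fit holds the mean cross-validated MSE at each λ.

# cross-validated MSE (and RMSE) at the best lambda for ridge
cv.mse.ridge = cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.min]
cv.mse.ridge
sqrt(cv.mse.ridge)  # on the same scale as the RMSE reported for best subset selection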

Lasso

Lasso regression is a regularization technique. It is used in place of plain least squares both to improve prediction accuracy and to perform variable selection.

set.seed(51)
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.lasso = cv.glmnet(x,y,type.measure = "mse")
plot(cv.lasso)

bestlamLasso = cv.lasso$lambda.min
bestlamLasso
## [1] 0.02220953
coef(cv.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                    s1
## (Intercept) 1.7799525
## zn          .        
## indus       .        
## chas        .        
## nox         .        
## rm          .        
## age         .        
## dis         .        
## rad         0.1920089
## tax         .        
## ptratio     .        
## black       .        
## lstat       .        
## medv        .
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])
## [1] 7.720478

Note that coef(cv.lasso) and the RMSE above are evaluated at lambda.1se (the most heavily penalized model within one standard error of the minimum), which keeps only rad; the cross-validated RMSE there is about 7.72, somewhat higher than the RMSE from best subset selection.

Lasso regression is useful when dealing with high-dimensional data, i.e., when the number of predictors is large relative to the sample size. Lasso regression can help identify the most important predictors and exclude those that do not contribute significantly to the model.
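To see the variable selection explicitly (a sketch, assuming the cv.lasso fit from above), we can count the non-zero coefficients at both of the λ values that cv.glmnet reports:

# coef() on a cv.glmnet object defaults to lambda.1se; compare with lambda.min
sum(coef(cv.lasso, s = "lambda.min")[-1] != 0)  # predictors kept at lambda.min
sum(coef(cv.lasso, s = "lambda.1se")[-1] != 0)  # predictors kept at lambda.1se (sparser)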

PCR

library(pls)
pcr.fit = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.235    7.240    6.784    6.783    6.789    6.811
## adjCV         8.61    7.231    7.236    6.779    6.774    6.783    6.804
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.804    6.683    6.714     6.717     6.725     6.680     6.612
## adjCV    6.796    6.674    6.704     6.706     6.713     6.667     6.598
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4

At M = 10, the cross-validated error (adjCV RMSEP) from the summary above is about 6.71, which is comparable to, though slightly higher than, the RMSE of about 6.54 from best subset selection.

PCR works by transforming the original predictors into a set of new, uncorrelated variables called principal components (PCs) using PCA. The PCs are then used as the predictors in a linear regression model to predict the response variable.
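This two-step view can be reproduced by hand (a sketch, assuming M = 10 and the Boston data loaded above; pcr() does the same thing internally, up to scaling details):

# step 1: PCA on the standardized predictors (everything except crim, column 1)
pc = prcomp(Boston[, -1], scale. = TRUE)

# step 2: least squares of crim on the first 10 principal component scores
pcr.by.hand = lm(Boston$crim ~ pc$x[, 1:10])
summary(pcr.by.hand)$r.squared  # should be close to the "% variance explained" for crim at 10 comps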

(b) Propose a model (or set of models) that seems to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

I will propose the lasso model as it tends to be the most interpretable:

# refit the lasso on the full Boston data over the lambda grid from question 9
out = glmnet(x, y, alpha = 1, lambda = grid)
lasso.coef = predict(out, type = "coefficients", s= bestlamLasso)[1:14,]
lasso.coef
##  (Intercept)           zn        indus         chas          nox           rm 
## 15.060859666  0.040656013 -0.069249939 -0.666111834 -8.854980673  0.351038670 
##          age          dis          rad          tax      ptratio        black 
##  0.000000000 -0.904538130  0.550358063 -0.001745966 -0.235371992 -0.007534167 
##        lstat         medv 
##  0.127094579 -0.180777239

The lasso tends to have among the lowest test errors of the models considered. In this run it did not actually achieve the lowest cross-validated error (which may have to do with the particular data split and fold assignment), but its sparse set of coefficients makes it easier to interpret.

Reading off the coefficients above, the fitted lasso model is: crim = 15.061 + 0.0407(zn) - 0.0692(indus) - 0.666(chas) - 8.855(nox) + 0.351(rm) - 0.905(dis) + 0.550(rad) - 0.0017(tax) - 0.235(ptratio) - 0.0075(black) + 0.127(lstat) - 0.181(medv), with age dropped from the model.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

No, my model does not use all of the features in the data set; age is dropped. This is a perk of the lasso.

In Lasso regression, the L1 penalty is used to constrain the magnitude of the coefficients of the predictor variables in the model. This penalty shrinks the coefficients towards zero, and as the strength of the penalty is increased, some coefficients may be forced to exactly zero.

When a coefficient is shrunk to exactly zero, it means that the corresponding predictor variable is excluded from the model entirely. This is the key feature of Lasso regression that makes it useful for feature selection and variable reduction.
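This behavior can be seen directly from the glmnet fit in part (b) (a sketch, assuming the out object from above): the coefficient paths show each coefficient hitting exactly zero once λ is large enough.

# coefficient paths as a function of log(lambda); each curve eventually reaches exactly zero
plot(out, xvar = "lambda", label = TRUE)

# number of non-zero coefficients at a few penalty strengths
colSums(as.matrix(coef(out, s = c(0.01, 0.1, 1)))[-1, , drop = FALSE] != 0)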