library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(readxl)
library(tinytex)
library(ISLR)
library(ISLR2)
## 
## Attaching package: 'ISLR2'
## 
## The following objects are masked from 'package:ISLR':
## 
##     Auto, Credit
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:ISLR2':
## 
##     Boston
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(class)
library(e1071)
library(boot)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## 
## The following object is masked from 'package:boot':
## 
##     melanoma
## 
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-7
library(pls)
## 
## Attaching package: 'pls'
## 
## The following object is masked from 'package:caret':
## 
##     R2
## 
## The following object is masked from 'package:stats':
## 
##     loadings
library(leaps)

PROBLEM 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

LASSO vs. LEAST SQUARES

For the lasso relative to least squares, statement iii is correct. The book states that “the Lasso shrinks the coefficient estimates towards zero. However, in the case of the lasso, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter \(\lambda\) is sufficiently large…the lasso performs variable selection. As a result, models generated from the lasso are generally much easier to interpret.” It goes on to explain that when the least squares estimates have high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions. In other words, the lasso is less flexible than least squares, and it gives improved prediction accuracy precisely when its increase in bias is smaller than its decrease in variance.
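
For reference, the lasso coefficient estimates minimize the penalized residual sum of squares

\[
\mathrm{RSS} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert ,
\]

so the larger \(\lambda\) is, the more coefficient estimates are forced exactly to zero.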

RIDGE REGRESSION vs. LEAST SQUARES

Ridge regression relative to least squares also corresponds to statement iii. The book states that ridge’s “advantage over least squares is rooted in the bias-variance trade-off. As \(\lambda\) increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.” When the true relationship is close to linear, the least squares estimates have low bias but can have high variance, so small changes in the training data can produce large changes in the least squares coefficient estimates. Ridge regression can still perform well in that setting “by trading off a small increase in bias for a large decrease in variance.” In this respect ridge behaves much like the lasso, but the constraint it places on the coefficients is different.
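
Ridge regression minimizes the analogous criterion with an \(\ell_2\) penalty,

\[
\mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 ,
\]

which shrinks the coefficients toward zero but, unlike the lasso, never sets any of them exactly to zero.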

NON-LINEAR vs. LEAST SQUARES

For non-linear methods relative to least squares, statement ii is correct. A non-linear method is less constrained, which helps reduce bias when fitting the data, but the added flexibility can introduce a substantial amount of variance. If the true relationship is non-linear, the decrease in bias outweighs the increase in variance and the predictions become more accurate.

PROBLEM 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

  (a) Split the data set into a training set and a test set.

The College data set in ISLR2 is described as statistics for a large number of US Colleges from the 1995 issue of US News and World Report.

A data frame with 777 observations on the following 18 variables:

  - Private: A factor with levels No and Yes indicating private or public university
  - Apps: Number of applications received
  - Accept: Number of applications accepted
  - Enroll: Number of new students enrolled
  - Top10perc: Pct. new students from top 10% of H.S. class
  - Top25perc: Pct. new students from top 25% of H.S. class
  - F.Undergrad: Number of fulltime undergraduates
  - P.Undergrad: Number of parttime undergraduates
  - Outstate: Out-of-state tuition
  - Room.Board: Room and board costs
  - Books: Estimated book costs
  - Personal: Estimated personal spending
  - PhD: Pct. of faculty with Ph.D.’s
  - Terminal: Pct. of faculty with terminal degree
  - S.F.Ratio: Student/faculty ratio
  - perc.alumni: Pct. alumni who donate
  - Expend: Instructional expenditure per student
  - Grad.Rate: Graduation rate

# Split the College data 60/40 into training and test sets
set.seed(1)
sample1 <- sample(c(TRUE, FALSE), nrow(College), replace=TRUE, prob = c(.6,.4))
train <- College[sample1, ]
test <- College[!sample1, ]
dim(train)
## [1] 472  18
dim(test)
## [1] 305  18
  (b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit.college <- lm(Apps ~ ., data = train)
summary(lm.fit.college)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2573.3  -423.5   -51.1   307.6  6762.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.471e+02  4.923e+02  -0.908 0.364325    
## PrivateYes  -6.778e+02  1.762e+02  -3.847 0.000136 ***
## Accept       1.191e+00  6.572e-02  18.116  < 2e-16 ***
## Enroll      -1.281e-01  2.272e-01  -0.564 0.573054    
## Top10perc    5.197e+01  7.163e+00   7.255 1.75e-12 ***
## Top25perc   -1.713e+01  5.675e+00  -3.019 0.002683 ** 
## F.Undergrad  7.921e-02  3.890e-02   2.036 0.042314 *  
## P.Undergrad -5.039e-03  4.706e-02  -0.107 0.914772    
## Outstate    -9.835e-03  2.430e-02  -0.405 0.685853    
## Room.Board   1.073e-01  6.230e-02   1.722 0.085775 .  
## Books        1.339e-01  3.250e-01   0.412 0.680506    
## Personal     3.000e-02  7.844e-02   0.383 0.702247    
## PhD         -1.036e+01  6.303e+00  -1.644 0.100868    
## Terminal     1.455e+00  6.830e+00   0.213 0.831348    
## S.F.Ratio    3.405e+00  1.538e+01   0.221 0.824845    
## perc.alumni -7.875e+00  5.093e+00  -1.546 0.122735    
## Expend       5.961e-02  1.465e-02   4.069 5.57e-05 ***
## Grad.Rate    8.848e+00  3.598e+00   2.459 0.014291 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1005 on 454 degrees of freedom
## Multiple R-squared:  0.9253, Adjusted R-squared:  0.9225 
## F-statistic: 330.7 on 17 and 454 DF,  p-value: < 2.2e-16
lm.pred <- predict(lm.fit.college, test)
lm.MSE <- mean((lm.pred - test$Apps)^2)
lm.MSE
## [1] 1622925

The test MSE for the least squares model is 1622925.

  (c) Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
train.rm <- model.matrix(Apps ~ ., data = train)   # design matrices for glmnet
test.rm <- model.matrix(Apps ~ ., data = test)
grid <- 10^seq(10, -2, length = 100)               # lambda grid from 10^10 down to 10^-2

par(mfrow = c(1, 2))
ridgemodel <- glmnet(train.rm, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)  # alpha = 0 fits ridge
plot(ridgemodel)

cv.ridge  <- cv.glmnet(train.rm, train[, "Apps"], alpha=0)
plot(cv.ridge)

lambest <- cv.ridge$lambda.min
lambest
## [1] 337.6929

TEST ERROR (reported below as test \(R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}\))

pred <- predict(ridgemodel,test.rm,s=lambest)
rss <- sum((pred-test$Apps)^2)
tss <- sum((test$Apps-mean(test$Apps))^2)
test.rsq <- 1-(rss/tss)
test.rsq
## [1] 0.8458411
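
The exercise asks for the test error itself, so the test MSE for the ridge fit can also be computed from the same predictions. A minimal sketch, assuming the pred and test objects from the chunk above are still in the workspace (ridge.MSE is simply a new name used here):

ridge.MSE <- mean((pred - test$Apps)^2)  # test MSE for ridge at lambda.min
ridge.MSE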
  (d) Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lassomod <- cv.glmnet(train.rm, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)  # alpha = 1 fits the lasso
lassbest <- lassomod$lambda.min
plot(lassomod)

TEST ERROR (again reported as test \(R^2\))

pred <- predict(lassomod,test.rm,s=lassbest)
rss <- sum((pred-test$Apps)^2)
tss <- sum((test$Apps-mean(test$Apps))^2)
test.rsq <- 1-(rss/tss)
test.rsq
## [1] 0.9066181

COEFFICIENTS EQUAL TO ZERO

sum(coef(lassomod)[,1]==0)
## [1] 14
names(coef(lassomod)[, 1][coef(lassomod)[, 1] == 0])
##  [1] "(Intercept)" "PrivateYes"  "Enroll"      "Top25perc"   "P.Undergrad"
##  [6] "Outstate"    "Room.Board"  "Books"       "Personal"    "PhD"        
## [11] "Terminal"    "S.F.Ratio"   "perc.alumni" "Grad.Rate"
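
Note that coef() on a cv.glmnet object defaults to lambda.1se, so the zero counts above are taken at that value of \(\lambda\) rather than at lambda.min. A short sketch, using the objects already defined, of counting the non-zero estimates at lambda.min instead:

lasso.coef <- predict(lassomod, type = "coefficients", s = lassbest)  # coefficients at lambda.min
sum(lasso.coef != 0)                                                  # non-zero estimates, intercept included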
  (e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
pcr.fit <- pcr(Apps~.,data=train, scale=TRUE, validation="CV")
summary(pcr.fit)
## Data:    X dimension: 472 17 
##  Y dimension: 472 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3614     3542     1721     1714     1265     1266     1255
## adjCV         3614     3544     1719     1716     1241     1259     1252
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1255     1254     1175      1165      1170      1176      1180
## adjCV     1253     1254     1172      1162      1167      1173      1177
##        14 comps  15 comps  16 comps  17 comps
## CV         1180      1188      1074      1068
## adjCV      1176      1185      1070      1064
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      31.695    58.25    65.13    70.86    76.21    81.16    84.76    88.00
## Apps    5.324    77.87    78.23    88.62    88.65    88.65    88.76    88.76
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       91.01     93.38      95.5     97.25     98.30     98.98     99.47
## Apps    90.16     90.29      90.3     90.34     90.34     90.37     90.38
##       16 comps  17 comps
## X        99.82    100.00
## Apps     92.21     92.53

Cross-validation selects M = 17: the CV RMSEP is lowest (1068) when all 17 components are used, so the test \(R^2\) below is computed with ncomp = 17.
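
As a sanity check, the selected M can be read off programmatically from the cross-validated RMSEP. A sketch, assuming the pcr.fit object fitted above:

cv.rmsep <- RMSEP(pcr.fit)$val["CV", 1, ]  # cross-validated RMSEP for 0 through 17 components
which.min(cv.rmsep) - 1                    # subtract 1 because the first entry is the intercept-only model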

pred <- predict(pcr.fit,test,ncomp=17)
rss <- sum((pred-test$Apps)^2)
tss <- sum((test$Apps-mean(test$Apps))^2)
test.rsq <- 1-(rss/tss)
test.rsq
## [1] 0.9096891
  (f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
pls.fit <- plsr(Apps~.,data=train, scale=TRUE, validation="CV")
summary(pls.fit)
## Data:    X dimension: 472 17 
##  Y dimension: 472 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3614     1555     1273     1163     1141     1110     1086
## adjCV         3614     1552     1277     1160     1137     1111     1079
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1073     1064     1066      1066      1068      1068      1068
## adjCV     1068     1060     1062      1062      1064      1064      1064
##        14 comps  15 comps  16 comps  17 comps
## CV         1068      1068      1068      1068
## adjCV      1064      1064      1064      1064
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       26.90    43.71    63.83    67.12    71.61    74.29    76.87    79.98
## Apps    82.26    88.02    90.43    91.15    91.72    92.41    92.47    92.49
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       83.16     85.64     87.28     88.41     92.52     94.25     96.82
## Apps    92.50     92.52     92.52     92.53     92.53     92.53     92.53
##       16 comps  17 comps
## X        98.87    100.00
## Apps     92.53     92.53

Cross-validation selects M = 8: the CV RMSEP reaches its minimum (1064) at 8 components and is essentially flat beyond that, so the test \(R^2\) below is computed with ncomp = 8.
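
As a cross-check, pls::selectNcomp() can suggest M directly; its "onesigma" method picks the smallest M whose CV error is within one standard error of the minimum, which is one way to justify stopping at 8 components rather than using all 17. A hedged sketch, assuming a version of the pls package that provides selectNcomp():

selectNcomp(pls.fit, method = "onesigma")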

pred <- predict(pls.fit,test,ncomp=8)
rss <- sum((pred-test$Apps)^2)
tss <- sum((test$Apps-mean(test$Apps))^2)
test.rsq <- 1-(rss/tss)
test.rsq
## [1] 0.9067511
  (g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Ridge regression produced the lowest test \(R^2\) (about 0.85), while the lasso, PCR, and PLS all gave very similar test \(R^2\) values of roughly 0.90 to 0.91, so there is little difference among those three; the least squares fit can be put on the same scale as shown in the sketch below. That said, I am not sure the 60/40 training/test split was ideally balanced, because the test errors remain sizeable for every approach.
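
To compare least squares with the other four methods, its test \(R^2\) can be recovered from the MSE reported in part (b). A small sketch, assuming lm.MSE and the College test set from above are still in the workspace (lm.rsq is just a new name used here):

lm.rsq <- 1 - lm.MSE / mean((test$Apps - mean(test$Apps))^2)  # test R^2 for the least squares fit
lm.rsq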

PROBLEM 11

We will now try to predict per capita crime rate in the Boston data set.

Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Boston <- na.omit(Boston)
set.seed(1)
pcr.fit <-  pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.250    7.253    6.833    6.815    6.826    6.847
## adjCV         8.61    7.245    7.247    6.825    6.803    6.818    6.838
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.837    6.710    6.735     6.723     6.714     6.696     6.624
## adjCV    6.827    6.698    6.724     6.710     6.702     6.682     6.609
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4
validationplot(pcr.fit,val.type="MSEP")

set.seed(1)
sample2 <- sample(c(TRUE, FALSE), nrow(Boston), replace=TRUE, prob = c(.6,.4))
train <- Boston[sample2, ]
test <- Boston[!sample2, ]
dim(train)
## [1] 314  14
dim(test)
## [1] 192  14
regfit.full <- regsubsets(crim ~ ., Boston)   # best subset selection (by default up to 8-variable models)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., Boston)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 ) " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 ) " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 ) " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 ) "*" " "   " "  " " " " " " "*" "*" " " " "     " "   " "   "*" 
## 5  ( 1 ) "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 6  ( 1 ) "*" " "   " "  "*" " " " " "*" "*" " " " "     "*"   " "   "*" 
## 7  ( 1 ) "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   " "   "*" 
## 8  ( 1 ) "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*"
reg.summary=summary(regfit.full)
names(reg.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
set.seed(1)
# regsubsets has no predict() method, so define one for use in cross-validation
predict.regsubsets <- function(object, newdata, id, ...) {
    form <- as.formula(object$call[[2]])   # formula used in the original regsubsets() call
    mat <- model.matrix(form, newdata)     # design matrix for the new data
    coefi <- coef(object, id = id)         # coefficients of the best model with id predictors
    mat[, names(coefi)] %*% coefi          # predictions
}

k <- 10
p <- ncol(Boston) - 1                              # number of candidate predictors
folds <- sample(rep(1:k, length = nrow(Boston)))   # assign each observation to one of k folds
cv.errors <- matrix(NA, k, p)
for (i in 1:k) {
    best.fit <- regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred <- predict(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] <- mean((Boston$crim[folds == i] - pred)^2)
    }
}
rmse.cv <- sqrt(apply(cv.errors, 2, mean))         # cross-validated RMSE for each subset size
plot(rmse.cv, pch = 19, type = "b")

A subset of nine predictors gives the lowest cross-validated RMSE, at roughly 6.55; see the quick check below.
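
A quick confirmation of the subset size with the smallest cross-validated RMSE, using the rmse.cv vector computed above:

which.min(rmse.cv)          # subset size with the lowest cross-validated RMSE
rmse.cv[which.min(rmse.cv)] # the corresponding RMSE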

LASSO

set.seed(1)
x <- model.matrix(crim ~ . - 1, data = Boston)     # predictor matrix without an intercept column
y <- Boston$crim
cv.lasso <- cv.glmnet(x, y, type.measure = "mse")  # alpha defaults to 1, i.e. the lasso
plot(cv.lasso)

besLasso = cv.lasso$lambda.min
besLasso
## [1] 0.05630926
coef(cv.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                    s1
## (Intercept) 1.0894283
## zn          .        
## indus       .        
## chas        .        
## nox         .        
## rm          .        
## age         .        
## dis         .        
## rad         0.2643196
## tax         .        
## ptratio     .        
## black       .        
## lstat       .        
## medv        .
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])
## [1] 7.438669

This cross-validated RMSE (about 7.44, at lambda.1se) is higher than the best-subset cross-validated RMSE of roughly 6.55, so on this criterion the lasso does not improve on best subset selection here; see the sketch below for the corresponding value at lambda.min.
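
For a like-for-like comparison at the value of \(\lambda\) actually selected, the cross-validated RMSE at lambda.min can be extracted in the same way; a sketch using the cv.lasso object above:

sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.min])  # CV RMSE at lambda.min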