Question 2

A

Answer 3. As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. So as long as the decrease in variance outweighs the increase in bias, ridge regression will be an improvement over least squares.
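
To make the tradeoff concrete, here is a minimal simulation sketch (not part of the assignment): on a made-up dataset with nearly as many predictors as observations, where least squares has high variance, ridge test MSE typically dips below the least-squares level at intermediate λ before the bias takes over.

library(glmnet)
set.seed(1)
#simulated data: n barely above p, so least squares is high-variance
n = 50; p = 40
x = matrix(rnorm(n * p), n, p)
beta = rnorm(p, sd = 0.5)
y = as.vector(x %*% beta) + rnorm(n, sd = 3)
x.test = matrix(rnorm(1000 * p), 1000, p)
y.test = as.vector(x.test %*% beta) + rnorm(1000, sd = 3)
#fit ridge over a lambda grid and compute test MSE at each lambda
grid = 10 ^ seq(4, -2, length = 100)
ridge = glmnet(x, y, alpha = 0, lambda = grid)
test.mse = apply(predict(ridge, newx = x.test), 2,
                 function(pred) mean((pred - y.test) ^ 2))
#the dip below the lambda ~ 0 (least squares) level is the variance
#reduction winning against the added bias
plot(log(grid), test.mse, type = "l", xlab = "log(lambda)", ylab = "test MSE")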

B

Answer 3 again, for the same reasoning as above.

C

Answer 2. As flexibility increases, variance increases but bias decreases, and a sufficiently flexible fit can drive the bias nearly to zero.

Question 9

A

#packages: ISLR (College), MASS (Boston), leaps (regsubsets), glmnet, pls
library(ISLR); library(MASS); library(leaps); library(glmnet); library(pls)

college = College
set.seed(1)
#80/20 train/test split
train = sample(nrow(college), nrow(college) * .8)
#create model matrices of test and train data
test.mat = model.matrix(Apps ~ ., data = college[-train, ])
train.mat = model.matrix(Apps ~ ., data = college[train, ])

B

The best linear model has 9 predictors (Private, Accept, Enroll, Top10perc, Top25perc, Outstate, Room.Board, PhD, and Expend), and it has a test MSE of 1,540,670.

#make best linear model for each number of allowed predictors
regfit.best = regsubsets(Apps ~ ., data = college[train, ], nvmax = ncol(college) - 1)
#create validation error holding variable
val.err = c()
#for each possible number of variables in model
for (i in 1:(ncol(college) - 1)){
  #create coefficient matrix for given number of coefficients
  coefi = coef(regfit.best, id = i)
  #multiply test data matrix by coefficient matrix
  pred = test.mat[, names(coefi)] %*% coefi
  #calculate error of prediction from true value
  val.err[i] = mean((college$Apps[-train] - pred) ^ 2)
}
coef(regfit.best, id = which.min(val.err))
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
##   70.78303705 -484.84140298    1.68905627   -0.74490718   50.34688616 
##     Top25perc      Outstate    Room.Board           PhD        Expend 
##  -12.98083070   -0.07794772    0.17470599   -9.04974055    0.05173806
min(val.err)
## [1] 1540670
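
regsubsets has no predict() method, which is why the loop above multiplies the test model matrix by the coefficients by hand. The usual workaround (as in the ISLR labs) is a small predict method; a sketch:

#wrap the manual matrix multiply in a predict() method for regsubsets
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])   #recover the formula used in the fit
  mat = model.matrix(form, newdata)     #build the design matrix for newdata
  coefi = coef(object, id = id)         #coefficients of the id-variable model
  mat[, names(coefi)] %*% coefi         #predictions
}
#with this in place the loop body reduces to:
#pred = predict(regfit.best, college[-train, ], id = i)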

C

The best ridge regression model has a cross-validated λ of 362.979 and a test MSE of 1,440,058.

#create ridge model
grid = 10 ^ seq(10, -2, length = 100)
ridge.mod = glmnet(train.mat, college$Apps[train], alpha = 0, lambda = grid)
plot(ridge.mod)

# use cross validation to determine best lambda
cv.ridge = cv.glmnet(train.mat, college$Apps[train], alpha = 0)
bestlam_ridge = cv.ridge$lambda.min
#predict response variable in test set
ridge.pred = predict(ridge.mod, s = bestlam_ridge, newx = test.mat)
mean((ridge.pred - college$Apps[-train])^2)
## [1] 1440058
bestlam_ridge
## [1] 362.9786
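
For contrast with the lasso output in D, the ridge coefficients at the chosen λ can be extracted the same way; none are exactly zero, only shrunken (a quick sketch reusing the fit above):

#ridge keeps every predictor, shrinking coefficients toward (never to) zero
ridge.coef = predict(ridge.mod, type = "coefficients", s = bestlam_ridge)
ridge.coef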

D

The best lasso model has a cross-validated λ of 2.13 and a test MSE of 1,553,875. It does not remove any predictors! Might as well go with the ridge model.

#create lasso model
grid = 10 ^ seq(10, -2, length = 100)
lasso.mod = glmnet(train.mat, college$Apps[train], alpha = 1, lambda = grid)
# use cross validation to determine best lambda
cv.lasso = cv.glmnet(train.mat, college$Apps[train], alpha = 1)
plot(cv.lasso)

bestlam_lasso = cv.lasso$lambda.min
#predict response variable in test set
lasso.pred = predict(lasso.mod, s = bestlam_lasso, newx = test.mat)
mean((lasso.pred - college[-train, 2])^2)
## [1] 1553875
bestlam_lasso
## [1] 2.125973
lasso.coef = predict(lasso.mod, type = "coefficients", s = bestlam_lasso)[1:19,]
lasso.coef
##   (Intercept)   (Intercept)    PrivateYes        Accept        Enroll 
## -639.39381436    0.00000000 -382.93087479    1.67486464   -1.09334415 
##     Top10perc     Top25perc   F.Undergrad   P.Undergrad      Outstate 
##   48.95030919  -12.47188598    0.06757611    0.06439575   -0.07269958 
##    Room.Board         Books      Personal           PhD      Terminal 
##    0.13917008    0.19758359    0.01597466   -9.61038299   -0.24087098 
##     S.F.Ratio   perc.alumni        Expend     Grad.Rate 
##   17.04076312    0.58266086    0.05695104    5.62894373
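
A side note on the output: the second, zero-valued (Intercept) row appears because model.matrix already contains an intercept column, which glmnet shrinks to zero alongside its own intercept. The "no predictors removed" claim can be verified by counting the nonzero coefficients (sketch):

#drop the two intercept rows, then count the predictors the lasso kept
sum(lasso.coef[-(1:2)] != 0)
#expect 17 here, matching the printout above: every predictor is retained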

E

Cross-validation shows that MSE levels out after 5 components and doesn't begin decreasing again until 16 or more components are introduced, at which point we are hardly reducing the number of variables at all. However, using 5 components results in a test MSE (2,579,981) that is significantly worse than the linear, ridge regression, and lasso models. Using all 17 components still results in a test MSE of 1,567,324, which is worse than all the other models.

set.seed(1)
pcr.fit = pcr(Apps ~ ., data = college[train,] , scale = TRUE, validation = "CV")
#From summary and validation plot we go with 5 or 17 components
summary(pcr.fit)
## Data:    X dimension: 621 17 
##  Y dimension: 621 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3837     3756     2050     2056     1660     1574     1578
## adjCV         3837     3756     2047     2056     1642     1562     1576
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1564     1522     1509      1496      1501      1503      1496
## adjCV     1563     1515     1506      1493      1498      1500      1493
##        14 comps  15 comps  16 comps  17 comps
## CV         1497      1497      1199      1118
## adjCV      1494      1487      1187      1109
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.003    57.06    64.13    70.03    75.36    80.38    84.09    87.44
## Apps    4.441    72.01    72.02    81.86    83.67    83.77    84.01    85.14
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.47     92.83     94.91     96.78     97.86     98.72     99.36
## Apps    85.42     85.76     85.76     85.76     85.89     85.95     89.93
##       16 comps  17 comps
## X        99.83    100.00
## Apps     92.89     93.47
validationplot(pcr.fit, val.type = "MSEP")

pcr.pred_5 = predict(pcr.fit, college[-train,], ncomp = 5)
mean((pcr.pred_5 - college$Apps[-train])^2)
## [1] 2579981
pcr.pred_17 = predict(pcr.fit, college[-train,], ncomp = 17)
mean((pcr.pred_17 - college$Apps[-train])^2)
## [1] 1567324

F

Cross-validation error levels off around 9 components; using 9 components gives a test MSE of 1,553,403.

set.seed(1)
pls.fit = plsr(Apps ~ ., data = college[train,], scale = TRUE, validation = 'CV')
#From the summary and plot we go with 9 components
summary(pls.fit)
## Data:    X dimension: 621 17 
##  Y dimension: 621 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3837     1865     1623     1441     1380     1244     1166
## adjCV         3837     1862     1623     1437     1363     1222     1154
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1147     1133     1123      1122      1122      1121      1118
## adjCV     1136     1125     1115      1114      1114      1112      1110
##        14 comps  15 comps  16 comps  17 comps
## CV         1118      1118      1118      1118
## adjCV      1109      1110      1109      1109
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.55    45.38    62.59    65.08    67.55    72.02    75.93    80.46
## Apps    77.30    83.57    87.51    90.88    92.88    93.15    93.24    93.31
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.51     85.43     87.83     91.09     92.73     95.12     96.95
## Apps    93.39     93.42     93.45     93.46     93.47     93.47     93.47
##       16 comps  17 comps
## X        97.97    100.00
## Apps     93.47     93.47
validationplot(pls.fit, val.type = "MSEP")

pls.pred = predict(pls.fit, college[-train,], ncomp = 9)
mean((pls.pred - college$Apps[-train]) ^ 2)
## [1] 1553403

G

The results are surprising. I expected the regularization or dimension-reduction models to improve on the linear model, but only ridge regression provides a slight improvement, at the cost of interpretability. The ridge regression model performed the best with a test MSE of 1,440,058. However, the other models were not far behind. All of the models can predict the number of applications a college will receive fairly accurately. For example, the PLS model explains 93.4% of the variance in Apps on the training data.
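
The 93.4% figure comes from the training-set variance explained reported by summary(pls.fit); a test-set analogue can be computed directly from the predictions above (a quick sketch, reusing the College objects before they are overwritten in Question 11):

#test-set R^2 = 1 - RSS/TSS on the held-out colleges
test.apps = college$Apps[-train]
tss = mean((test.apps - mean(test.apps)) ^ 2)
1 - mean((ridge.pred - test.apps) ^ 2) / tss  #ridge
1 - mean((lasso.pred - test.apps) ^ 2) / tss  #lasso
1 - mean((pls.pred - test.apps) ^ 2) / tss    #PLS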

Question 11

A

Linear (best subset), ridge, lasso, PCR, and PLS models are considered. The best linear model has a test MSE of 68.934 and uses all variables but two (age and rm). Hopefully the lasso will reduce the number of variables used, but if this data set is like the last one, it is possible that all predictors are relevant and the model will not perform well if many are removed. The ridge model is not able to improve on the linear model. The lasso adds a predictor back in (age, while dropping rm) and does not improve on the linear model either! PCR performs the worst, using 8 components. PLS uses only 6 components and performs better than lasso, ridge, and PCR, but still not better than the linear model.

#create data set and train/test split
boston = Boston
set.seed(1)
train = sample(nrow(boston), nrow(boston) * .8)
#create matrix of test and train data
test.mat = model.matrix(crim ~ ., data = boston[-train, ])
train.mat = model.matrix(crim ~ ., data = boston[train, ])
#make best linear model for each number of allowed predictors
regfit.best = regsubsets(crim ~ ., data = boston[train, ], nvmax = ncol(boston) - 1)
#create validation error holding variable
val.err = c()
#for each possible number of variables in model
for (i in 1:(ncol(boston) - 1)){
  #create coefficient matrix for given number of coefficients
  coefi = coef(regfit.best, id = i)
  #multiply test data matrix by coefficient matrix
  pred = test.mat[, names(coefi)] %*% coefi
  #calculate error of prediction from true value
  val.err[i] = mean((boston$crim[-train] - pred) ^ 2)
}
coef(regfit.best, id = which.min(val.err))
##  (Intercept)           zn        indus         chas          nox          dis 
## 19.594690827  0.034530108 -0.066045929 -0.571926856 -8.713860017 -0.814462590 
##          rad          tax      ptratio        black        lstat         medv 
##  0.532628847 -0.003413639 -0.271039381 -0.013805225  0.124256352 -0.140387354
min(val.err)
## [1] 68.93374
#create ridge model
grid = 10 ^ seq(10, -2, length = 100)
ridge.mod = glmnet(train.mat, boston$crim[train], alpha = 0, lambda = grid)
plot(ridge.mod)

# use cross validation to determine best lambda
cv.ridge = cv.glmnet(train.mat, boston$crim[train], alpha = 0)
bestlam_ridge = cv.ridge$lambda.min
#predict response variable in test set
ridge.pred = predict(ridge.mod, s = bestlam_ridge, newx = test.mat)
mean((ridge.pred - boston$crim[-train])^2)
## [1] 70.70392
#create lasso model
grid = 10 ^ seq(10, -2, length = 100)
lasso.mod = glmnet(train.mat, boston$crim[train], alpha = 1, lambda = grid)
# use cross validation to determine best lambda
cv.lasso = cv.glmnet(train.mat, boston$crim[train], alpha = 1)
plot(cv.lasso)

bestlam_lasso = cv.lasso$lambda.min
#predict response variable in test set
lasso.pred = predict(lasso.mod, s = bestlam_lasso, newx = test.mat)
mean((lasso.pred - boston$crim[-train]) ^ 2)
## [1] 69.59026
bestlam_lasso
## [1] 0.04430978
lasso.coef = predict(lasso.mod, type = "coefficients", s = bestlam_lasso)[1:15,]
lasso.coef
##   (Intercept)   (Intercept)            zn         indus          chas 
## 14.8727740946  0.0000000000  0.0277909498 -0.0691412360 -0.4052993621 
##           nox            rm           age           dis           rad 
## -6.2411295032  0.0000000000  0.0015367305 -0.6302413982  0.4678393633 
##           tax       ptratio         black         lstat          medv 
## -0.0001228092 -0.2013041366 -0.0137917745  0.1249240378 -0.1134321951
set.seed(1)
pcr.fit = pcr(crim ~ ., data = boston[train,] , scale = TRUE, validation = "CV")
#From summary and validation plot we go with 8 components
summary(pcr.fit)
## Data:    X dimension: 404 13 
##  Y dimension: 404 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.062    6.718    6.711    6.222    6.174    6.190    6.222
## adjCV        8.062    6.714    6.707    6.215    6.168    6.184    6.215
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.212    6.103    6.114     6.108     6.126     6.104     6.035
## adjCV    6.203    6.093    6.104     6.097     6.116     6.092     6.023
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.21    60.84    70.12    76.82    82.98    87.85    91.09    93.38
## crim    31.86    32.09    42.14    42.99    43.02    43.03    43.28    45.29
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.28     96.99     98.38     99.48    100.00
## crim    45.61     45.91     45.93     46.70     47.96
validationplot(pcr.fit, val.type = "MSEP")

pcr.pred_8 = predict(pcr.fit, boston[-train,], ncomp = 8)
mean((pcr.pred_8 - boston$crim[-train]) ^ 2)
## [1] 72.75513
set.seed(1)
pls.fit = plsr(crim ~ ., data = boston[train,], scale = TRUE, validation = 'CV')
#From the summary and plot we go with 6 components
summary(pls.fit)
## Data:    X dimension: 404 13 
##  Y dimension: 404 1
## Fit method: kernelpls
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.062    6.519    6.125    6.088    6.087    6.060    6.037
## adjCV        8.062    6.515    6.118    6.076    6.074    6.049    6.025
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.054    6.034    6.032     6.034     6.035     6.035     6.035
## adjCV    6.039    6.022    6.020     6.022     6.022     6.023     6.023
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.71    57.30    62.55    71.54    77.84    80.23    82.98    86.82
## crim    36.12    44.77    46.64    47.24    47.52    47.85    47.93    47.95
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       90.16     94.06     96.78     98.51    100.00
## crim    47.96     47.96     47.96     47.96     47.96
validationplot(pls.fit, val.type = "MSEP")

pls.pred = predict(pls.fit, boston[-train,], ncomp = 6)
mean((pls.pred - boston$crim[-train]) ^ 2)
## [1] 69.37375
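
Collecting the five test MSEs in one place makes the comparison in B easier to see (a convenience sketch using the objects computed above):

#side-by-side test MSEs for the Boston crim models
mse = c(best.subset = min(val.err),
        ridge = mean((ridge.pred - boston$crim[-train]) ^ 2),
        lasso = mean((lasso.pred - boston$crim[-train]) ^ 2),
        pcr = mean((pcr.pred_8 - boston$crim[-train]) ^ 2),
        pls = mean((pls.pred - boston$crim[-train]) ^ 2))
sort(mse)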

B

I propose using the linear model for this data set. It provides the best MSE and the best interpretability. It could also benefit from feature transformations, to see whether the square root, log, or square of a feature fits the data better. I am very surprised.
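
As a starting point for the transformation idea, one could compare a few candidate transforms of a single predictor by test MSE. This is purely illustrative; the choice of medv is arbitrary, not from the analysis above:

#hypothetical illustration: swap medv for transformed versions in the full
#linear model and compare held-out MSE
forms = list(raw  = crim ~ .,
             log  = crim ~ . - medv + log(medv),
             sqrt = crim ~ . - medv + sqrt(medv),
             sq   = crim ~ . - medv + I(medv ^ 2))
sapply(forms, function(f) {
  fit = lm(f, data = boston[train, ])
  mean((predict(fit, boston[-train, ]) - boston$crim[-train]) ^ 2)
})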

C

The chosen linear model removes two features (age and rm), since that model has the lowest validation MSE compared to linear models with more or fewer features. This leaves it with fewer features than the lasso model, making it the most interpretable model.