Answer 3. As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. So as long as the decrease in variance outweighs the increase in bias, ridge regression will give better prediction accuracy than least squares.
Also answer 3, by the same reasoning as above: the lasso is less flexible than least squares, so it trades an increase in bias for a decrease in variance.
Answer 2. Non-linear methods are more flexible than least squares, so variance increases while bias decreases; they give better prediction accuracy when the decrease in bias outweighs the increase in variance.
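To make the bias-variance trade-off in the ridge answer concrete, here is a minimal sketch on simulated data (none of these objects belong to the assignment): test MSE is traced over a grid of λ values, and the curve shows how prediction error responds as shrinkage first reduces variance and eventually adds too much bias.
library(glmnet)
set.seed(1)
n = 100; p = 20
x = matrix(rnorm(n * p), n, p)
beta = c(rep(1, 5), rep(0, p - 5))
y = drop(x %*% beta + rnorm(n))
x.test = matrix(rnorm(n * p), n, p)
y.test = drop(x.test %*% beta + rnorm(n))
#fit ridge regression over a decreasing grid of lambda values
lam.grid = 10 ^ seq(2, -3, length = 50)
ridge.sim = glmnet(x, y, alpha = 0, lambda = lam.grid)
#one column of predictions per lambda
pred.sim = predict(ridge.sim, newx = x.test)
test.mse.sim = colMeans((y.test - pred.sim) ^ 2)
plot(log10(ridge.sim$lambda), test.mse.sim, type = "b", xlab = "log10(lambda)", ylab = "Test MSE")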
#load the packages that provide the data sets and fitting functions used below
library(ISLR)   #College data
library(MASS)   #Boston data
library(leaps)  #regsubsets()
library(glmnet) #glmnet(), cv.glmnet()
library(pls)    #pcr(), plsr()
college = College
set.seed(1)
train = sample(nrow(college), nrow(college) * .8)
#create matrix of test and train data
test.mat = model.matrix(Apps ~ ., data = college[-train, ])
train.mat = model.matrix(Apps ~ ., data = college[train, ])
The best linear model has 9 predictors (Private, Accept, Enroll, Top10perc, Top25perc, Outstate, Room.Board, PhD, and Expend), and it has a test MSE of 1,540,670.
#make best linear model for each number of allowed predictors
regfit.best = regsubsets(Apps ~ ., data = college[train, ], nvmax = ncol(college) - 1)
#create validation error holding variable
val.err = c()
#for each possible number of variables in model
for (i in 1:(ncol(college) - 1)){
#create coefficient matrix for given number of coefficients
coefi = coef(regfit.best, id = i)
#multiply test data matrix by coefficient matrix
pred = test.mat[, names(coefi)]%*%coefi
#calculate error of prediction from true value
val.err[i] = mean((college$Apps[-train] - pred) ^ 2)
}
coef(regfit.best, which.min(val.err))
## (Intercept) PrivateYes Accept Enroll Top10perc
## 70.78303705 -484.84140298 1.68905627 -0.74490718 50.34688616
## Top25perc Outstate Room.Board PhD Expend
## -12.98083070 -0.07794772 0.17470599 -9.04974055 0.05173806
min(val.err)
## [1] 1540670
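The loop above multiplies the test model matrix by the selected coefficients by hand because regsubsets() has no predict() method. A reusable helper that wraps the same step (a sketch, not required for the exercise):
predict.regsubsets = function(object, newdata, id, ...){
  #recover the formula from the regsubsets() call
  form = as.formula(object$call[[2]])
  #build the design matrix for the new data
  mat = model.matrix(form, newdata)
  #coefficients of the best model with id predictors
  coefi = coef(object, id = id)
  #predictions: design matrix times coefficients
  mat[, names(coefi)] %*% coefi
}
#e.g. predict(regfit.best, college[-train, ], id = which.min(val.err))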
The best ridge regression model has a cross-validated λ of 362.98 and a test MSE of 1,440,058.
#create ridge model
grid = 10 ^ seq(10, -2, length = 100)
ridge.mod = glmnet(train.mat, college[train, 2], alpha = 0, lambda = grid)
plot(ridge.mod)
# use cross validation to determine best lambda
cv.ridge = cv.glmnet(train.mat, college[train, 2], alpha = 0)
bestlam_ridge = cv.ridge$lambda.min
#predict response variable in test set
ridge.pred = predict(ridge.mod, s = bestlam_ridge, newx = test.mat)
mean((ridge.pred - college[-train, 2])^2)
## [1] 1440058
bestlam_ridge
## [1] 362.9786
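For comparison with the lasso output below, the ridge coefficients at the chosen λ can be extracted in the same way (a sketch; output omitted). Ridge shrinks coefficients toward zero but never sets them exactly to zero.
ridge.coef = predict(ridge.mod, type = "coefficients", s = bestlam_ridge)[1:19, ]
ridge.coef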
The best lasso model has a λ of 2.13 and a test MSE of 1,553,875. At this λ it does not remove any predictors, so it offers no interpretability advantage; we might as well go with the ridge model.
#create lasso model
grid = 10 ^ seq(10, -2, length = 100)
lasso.mod = glmnet(train.mat, college[train, 2], alpha = 1, lambda = grid)
# use cross validation to determine best lambda
cv.lasso = cv.glmnet(train.mat, college[train, 2], alpha = 1)
plot(cv.lasso)
bestlam_lasso = cv.lasso$lambda.min
#predict response variable in test set
lasso.pred = predict(lasso.mod, s = bestlam_lasso, newx = test.mat)
mean((lasso.pred - college[-train, 2])^2)
## [1] 1553875
bestlam_lasso
## [1] 2.125973
lasso.coef = predict(lasso.mod, type = "coefficients", s = bestlam_lasso)[1:19,]
lasso.coef
## (Intercept) (Intercept) PrivateYes Accept Enroll
## -639.39381436 0.00000000 -382.93087479 1.67486464 -1.09334415
## Top10perc Top25perc F.Undergrad P.Undergrad Outstate
## 48.95030919 -12.47188598 0.06757611 0.06439575 -0.07269958
## Room.Board Books Personal PhD Terminal
## 0.13917008 0.19758359 0.01597466 -9.61038299 -0.24087098
## S.F.Ratio perc.alumni Expend Grad.Rate
## 17.04076312 0.58266086 0.05695104 5.62894373
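A one-line check of the claim that nothing is dropped (the first two entries are the duplicated intercept produced by keeping the intercept column from model.matrix):
#count the predictors with non-zero lasso coefficients (17 of 17 here)
sum(lasso.coef[-(1:2)] != 0)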
Cross-validation shows that the error levels out after 5 components and does not begin decreasing again until 16 or more components are included; with 17 components (one per predictor) PCR is not reducing the dimension at all. However, using 5 components gives a test MSE (2,579,981) that is significantly worse than the linear, ridge regression, and lasso models, and even using all 17 components still gives a test MSE of 1,567,324, which is worse than all of the other models.
set.seed(1)
pcr.fit = pcr(Apps ~ ., data = college[train, ], scale = TRUE, validation = "CV")
#From summary and validation plot we go with 5 or 17 components
summary(pcr.fit)
## Data: X dimension: 621 17
## Y dimension: 621 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3837 3756 2050 2056 1660 1574 1578
## adjCV 3837 3756 2047 2056 1642 1562 1576
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1564 1522 1509 1496 1501 1503 1496
## adjCV 1563 1515 1506 1493 1498 1500 1493
## 14 comps 15 comps 16 comps 17 comps
## CV 1497 1497 1199 1118
## adjCV 1494 1487 1187 1109
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.003 57.06 64.13 70.03 75.36 80.38 84.09 87.44
## Apps 4.441 72.01 72.02 81.86 83.67 83.77 84.01 85.14
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.47 92.83 94.91 96.78 97.86 98.72 99.36
## Apps 85.42 85.76 85.76 85.76 85.89 85.95 89.93
## 16 comps 17 comps
## X 99.83 100.00
## Apps 92.89 93.47
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred_5 = predict(pcr.fit, college[-train,], ncomp = 5)
mean((pcr.pred_5 - college$Apps[-train])^2)
## [1] 2579981
pcr.pred_17 = predict(pcr.fit, college[-train,], ncomp = 17)
mean((pcr.pred_17 - college$Apps[-train])^2)
## [1] 1567324
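Rather than reading the validation table by eye, the component count that minimises the cross-validated error can be pulled out programmatically, e.g. with the RMSEP() accessor from the pls package (a sketch):
#index of the smallest adjusted-CV RMSEP, minus 1 for the intercept-only column
which.min(RMSEP(pcr.fit)$val["adjCV", 1, ]) - 1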
Cross-validation shows the error essentially levels off at 9 components, so we use 9 components, which gives a test MSE of 1,553,403.
set.seed(1)
pls.fit = plsr(Apps ~ ., data = college[train,], scale = TRUE, validation = 'CV')
#From the summary and plot we go with 9 components
summary(pls.fit)
## Data: X dimension: 621 17
## Y dimension: 621 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3837 1865 1623 1441 1380 1244 1166
## adjCV 3837 1862 1623 1437 1363 1222 1154
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1147 1133 1123 1122 1122 1121 1118
## adjCV 1136 1125 1115 1114 1114 1112 1110
## 14 comps 15 comps 16 comps 17 comps
## CV 1118 1118 1118 1118
## adjCV 1109 1110 1109 1109
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.55 45.38 62.59 65.08 67.55 72.02 75.93 80.46
## Apps 77.30 83.57 87.51 90.88 92.88 93.15 93.24 93.31
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.51 85.43 87.83 91.09 92.73 95.12 96.95
## Apps 93.39 93.42 93.45 93.46 93.47 93.47 93.47
## 16 comps 17 comps
## X 97.97 100.00
## Apps 93.47 93.47
validationplot(pls.fit, val.type = "MSEP")
pls.pred = predict(pls.fit, college[-train,], ncomp = 9)
mean((pls.pred - college$Apps[-train]) ^ 2)
## [1] 1553403
The results are surprising. I expected the regularization and dimension-reduction models to improve on the linear model, but only ridge regression shows a slight improvement, and it comes at the cost of interpretability. The ridge regression model performed best with a test MSE of 1,440,058, though the other models were not far behind. All of the models predict the number of applications a college will receive fairly accurately; for example, the PLS model explains 93.4% of the variance in Apps on the training data.
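If a test-set measure of explained variance is wanted, it can be computed directly from the held-out predictions; for example, test R² for the PLS model (a sketch, value not reproduced here):
#test R^2 = 1 - test RSS / test TSS
test.avg = mean(college$Apps[-train])
1 - mean((pls.pred - college$Apps[-train]) ^ 2) / mean((college$Apps[-train] - test.avg) ^ 2)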
Linear (best subset), ridge, lasso, PCR, and PLS models are considered for predicting crim. The best linear model has a test MSE of 68.93 and uses all but two of the predictors. Hopefully the lasso will reduce the number of predictors used, but if this data set is like the last one, it is possible that all predictors are relevant and the model will not perform well if many are removed. The ridge model is not able to improve on the linear model. The lasso keeps more predictors than the best-subset model (it zeroes only rm) and does not improve on the linear model either. PCR performs the worst and uses 8 components. PLS uses only 6 components and performs better than the lasso, ridge, and PCR models, but still not better than the linear model.
#create data set and train/test split
boston = Boston
set.seed(1)
train = sample(nrow(boston), nrow(boston) * .8)
#create matrix of test and train data
test.mat = model.matrix(crim ~ ., data = boston[-train, ])
train.mat = model.matrix(crim ~ ., data = boston[train, ])
#make best linear model for each number of allowed predictors
regfit.best = regsubsets(crim ~ ., data = boston[train, ], nvmax = ncol(boston) - 1)
#create validation error holding variable
val.err = c()
#for each possible number of variables in model
for (i in 1:(ncol(boston) - 1)){
#create coefficient matrix for given number of coefficients
coefi = coef(regfit.best, id = i)
#multiply test data matrix by coefficient matrix
pred = test.mat[, names(coefi)]%*%coefi
#calculate error of prediction from true value
val.err[i] = mean((boston$crim[-train] - pred) ^ 2)
}
coef(regfit.best, which.min(val.err))
## (Intercept) zn indus chas nox dis
## 19.594690827 0.034530108 -0.066045929 -0.571926856 -8.713860017 -0.814462590
## rad tax ptratio black lstat medv
## 0.532628847 -0.003413639 -0.271039381 -0.013805225 0.124256352 -0.140387354
min(val.err)
## [1] 68.93374
#create ridge model
grid = 10 ^ seq(10, -2, length = 100)
ridge.mod = glmnet(train.mat, boston$crim[train], alpha = 0, lambda = grid)
plot(ridge.mod)
# use cross validation to determine best lambda
cv.ridge = cv.glmnet(train.mat, boston$crim[train], alpha = 0)
bestlam_ridge = cv.ridge$lambda.min
#predict response variable in test set
ridge.pred = predict(ridge.mod, s = bestlam_ridge, newx = test.mat)
mean((ridge.pred - boston$crim[-train])^2)
## [1] 70.70392
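cv.glmnet() also reports a more conservative λ chosen by the one-standard-error rule; its test MSE could be checked the same way (a sketch, value not shown):
#predict with the 1-SE lambda instead of lambda.min
ridge.pred_1se = predict(ridge.mod, s = cv.ridge$lambda.1se, newx = test.mat)
mean((ridge.pred_1se - boston$crim[-train]) ^ 2)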
#create lasso model
grid = 10 ^ seq(10, -2, length = 100)
lasso.mod = glmnet(train.mat, boston$crim[train], alpha = 1, lambda = grid)
# use cross validation to determine best lambda
cv.lasso = cv.glmnet(train.mat, boston$crim[train], alpha = 1)
plot(cv.lasso)
bestlam_lasso = cv.lasso$lambda.min
#predict response variable in test set
lasso.pred = predict(lasso.mod, s = bestlam_lasso, newx = test.mat)
mean((lasso.pred - boston$crim[-train]) ^ 2)
## [1] 69.59026
bestlam_lasso
## [1] 0.04430978
lasso.coef = predict(lasso.mod, type = "coefficients", s = bestlam_lasso)[1:15,]
lasso.coef
## (Intercept) (Intercept) zn indus chas
## 14.8727740946 0.0000000000 0.0277909498 -0.0691412360 -0.4052993621
## nox rm age dis rad
## -6.2411295032 0.0000000000 0.0015367305 -0.6302413982 0.4678393633
## tax ptratio black lstat medv
## -0.0001228092 -0.2013041366 -0.0137917745 0.1249240378 -0.1134321951
set.seed(1)
pcr.fit = pcr(crim ~ ., data = boston[train, ], scale = TRUE, validation = "CV")
#From summary and validation plot we go with 8 components
summary(pcr.fit)
## Data: X dimension: 404 13
## Y dimension: 404 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.062 6.718 6.711 6.222 6.174 6.190 6.222
## adjCV 8.062 6.714 6.707 6.215 6.168 6.184 6.215
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.212 6.103 6.114 6.108 6.126 6.104 6.035
## adjCV 6.203 6.093 6.104 6.097 6.116 6.092 6.023
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.21 60.84 70.12 76.82 82.98 87.85 91.09 93.38
## crim 31.86 32.09 42.14 42.99 43.02 43.03 43.28 45.29
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.28 96.99 98.38 99.48 100.00
## crim 45.61 45.91 45.93 46.70 47.96
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred_8 = predict(pcr.fit, boston[-train,], ncomp = 8)
mean((pcr.pred_8 - boston$crim[-train]) ^ 2)
## [1] 72.75513
set.seed(1)
pls.fit = plsr(crim ~ ., data = boston[train,], scale = TRUE, validation = 'CV')
#From the summary and plot we go with 6 components
summary(pls.fit)
## Data: X dimension: 404 13
## Y dimension: 404 1
## Fit method: kernelpls
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.062 6.519 6.125 6.088 6.087 6.060 6.037
## adjCV 8.062 6.515 6.118 6.076 6.074 6.049 6.025
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.054 6.034 6.032 6.034 6.035 6.035 6.035
## adjCV 6.039 6.022 6.020 6.022 6.022 6.023 6.023
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.71 57.30 62.55 71.54 77.84 80.23 82.98 86.82
## crim 36.12 44.77 46.64 47.24 47.52 47.85 47.93 47.95
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 90.16 94.06 96.78 98.51 100.00
## crim 47.96 47.96 47.96 47.96 47.96
validationplot(pls.fit, val.type = "MSEP")
pls.pred = predict(pls.fit, boston[-train,], ncomp = 6)
mean((pls.pred - boston$crim[-train]) ^ 2)
## [1] 69.37375
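Gathering the Boston test MSEs computed above makes the comparison in the next paragraph easier to see (a convenience sketch that assumes the objects from the chunks above are still in the workspace):
data.frame(model = c("best subset", "ridge", "lasso", "pcr (8 comps)", "pls (6 comps)"),
           test.mse = c(min(val.err),
                        mean((ridge.pred - boston$crim[-train]) ^ 2),
                        mean((lasso.pred - boston$crim[-train]) ^ 2),
                        mean((pcr.pred_8 - boston$crim[-train]) ^ 2),
                        mean((pls.pred - boston$crim[-train]) ^ 2)))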
I propose using the linear (best subset) model for this data set: it gives both the lowest test MSE and the best interpretability. It could also benefit from feature transformations, to see whether the square root, log, or square of a feature fits the data better. I am very surprised by this result.
The best linear model removes two features (age and rm), since that model has the lowest validation MSE among the candidate models with more or fewer features. This leaves it with fewer features than the lasso, making it the most interpretable model.
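As a possible next step (not fitted here), transformed features can be screened simply by adding terms to the formula; the particular transformations below are hypothetical examples, not choices made in this assignment:
#hypothetical example: add log, square-root, and squared terms to the linear model
lm.trans = lm(crim ~ . + log(dis) + sqrt(lstat) + I(medv ^ 2), data = boston[train, ])
#compare its test MSE against the models above
mean((predict(lm.trans, boston[-train, ]) - boston$crim[-train]) ^ 2)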