iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
The LASSO works best when the OLS estimates have very high variance. The LASSO fit has higher bias than OLS but lower variance, so the bias/variance trade-off works in its favor when the OLS variance is very high. Its reduced flexibility is a benefit precisely because it reduces variability.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Ridge regression behaves very similarly to the LASSO here: it provides the most benefit when the variance of the OLS estimates is very high, again because of the bias/variance trade-off.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Non-linear methods are more flexible than OLS and therefore have higher variance but lower bias. They improve prediction accuracy when the decrease in bias outweighs the increase in variance.
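All three answers rest on the same decomposition of the expected test error at a point x0 (stated here in plain notation as a reminder, not as new material):
E[(y0 - fhat(x0))^2] = Var(fhat(x0)) + [Bias(fhat(x0))]^2 + Var(epsilon)
A less (or more) flexible method improves on least squares only when its net effect on the first two terms is a reduction.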
# Split the College data (ISLR package) into equal-sized training and test sets
set.seed(1)
set.training <- sample(nrow(College), nrow(College)/2)
training.College <- College[set.training, ]
test.College <- College[-set.training, ]
# Least squares fit of Apps on all of the other predictors
lm.College <- lm(Apps ~ ., data = training.College)
summary(lm.College)
##
## Call:
## lm(formula = Apps ~ ., data = training.College)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5741.2 -479.5 15.3 359.6 7258.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.902e+02 6.381e+02 -1.238 0.216410
## PrivateYes -3.070e+02 2.006e+02 -1.531 0.126736
## Accept 1.779e+00 5.420e-02 32.830 < 2e-16 ***
## Enroll -1.470e+00 3.115e-01 -4.720 3.35e-06 ***
## Top10perc 6.673e+01 8.310e+00 8.030 1.31e-14 ***
## Top25perc -2.231e+01 6.533e+00 -3.415 0.000708 ***
## F.Undergrad 9.269e-02 5.529e-02 1.676 0.094538 .
## P.Undergrad 9.397e-03 5.493e-02 0.171 0.864275
## Outstate -1.084e-01 2.700e-02 -4.014 7.22e-05 ***
## Room.Board 2.115e-01 7.224e-02 2.928 0.003622 **
## Books 2.912e-01 3.985e-01 0.731 0.465399
## Personal 6.133e-03 8.803e-02 0.070 0.944497
## PhD -1.548e+01 6.681e+00 -2.316 0.021082 *
## Terminal 6.415e+00 7.290e+00 0.880 0.379470
## S.F.Ratio 2.283e+01 2.047e+01 1.115 0.265526
## perc.alumni 1.134e+00 6.083e+00 0.186 0.852274
## Expend 4.857e-02 1.619e-02 2.999 0.002890 **
## Grad.Rate 7.490e+00 4.397e+00 1.703 0.089324 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1083 on 370 degrees of freedom
## Multiple R-squared: 0.9389, Adjusted R-squared: 0.9361
## F-statistic: 334.3 on 17 and 370 DF, p-value: < 2.2e-16
# Predict on the held-out test set and compute the test MSE
pred.College.lm <- predict(lm.College, newdata = test.College)
error.College.lm <- mean((test.College$Apps - pred.College.lm)^2)
error.College.lm
## [1] 1135758
The test MSE for this model is 1135758. The training R-squared of 0.9389 suggests that roughly 94% of the variation in Apps is explained by the model.
Since the least squares model uses all of the predictors, it may not be the most parsimonious model for the data.
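As a quick sanity check (a sketch using only the objects defined above), the test-set R-squared can also be computed directly from the predictions; the 0.9389 reported by summary() is a training-set figure.
# Sketch: test-set R^2 for the least squares fit (training R^2 above is 0.9389)
test.SSE <- sum((test.College$Apps - pred.College.lm)^2)
test.SST <- sum((test.College$Apps - mean(test.College$Apps))^2)
1 - test.SSE / test.SST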
set.seed(1)
# Model matrices for glmnet; drop the intercept column added by model.matrix()
x.College <- model.matrix(Apps ~ ., data = training.College)[ ,-1]
y.College <- training.College$Apps
x.test.College <- model.matrix(Apps ~ ., data = test.College)[ ,-1]
y.test.College <- test.College$Apps
# Cross-validate the ridge penalty (alpha = 0) to choose lambda
cv.College <- cv.glmnet(x.College, y.College, alpha = 0)
bestlam.College <- cv.College$lambda.min
bestlam.College
## [1] 405.8404
plot(cv.College)
ridge.College <- glmnet(x.College, y.College, alpha = 0, lambda = bestlam.College)
dim(coef(ridge.College))
## [1] 18 1
pred.ridge.College <- predict(ridge.College, s = bestlam.College, newx = x.test.College)
mean((pred.ridge.College - y.test.College)^2)
## [1] 976897.6
err.ridge.College <- mean((pred.ridge.College - y.test.College)^2)
err.ridge.College
## [1] 976897.6
coef.ridge.College <- predict(ridge.College, type = 'coefficient', s = bestlam.College)
coef.ridge.College
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -2.028526e+03
## PrivateYes -2.881497e+02
## Accept 1.129842e+00
## Enroll 3.843311e-01
## Top10perc 3.047525e+01
## Top25perc -3.273388e-01
## F.Undergrad 5.542918e-02
## P.Undergrad 2.051701e-02
## Outstate -3.895379e-02
## Room.Board 2.631481e-01
## Books 4.149720e-01
## Personal -3.236328e-02
## PhD -7.849164e+00
## Terminal -1.025938e+00
## S.F.Ratio 2.767028e+01
## perc.alumni -5.364354e+00
## Expend 5.879543e-02
## Grad.Rate 8.625249e+00
The test MSE for the ridge regression model is 976897.6.
This is quite a bit lower than that of the least squares model, but the fit still retains all of the predictors.
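A hedged alternative worth noting: cv.glmnet also records lambda.1se, the largest lambda whose CV error is within one standard error of the minimum. Refitting with it would shrink the coefficients harder at a small cost in CV error; a sketch using the cross-validation object above:
# Sketch: ridge predictions at the more conservative lambda.1se instead of lambda.min
lam.1se.College <- cv.College$lambda.1se
pred.ridge.1se <- predict(cv.College, s = lam.1se.College, newx = x.test.College)
mean((pred.ridge.1se - y.test.College)^2)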
# Cross-validate the lasso penalty (alpha = 1) to choose lambda
cv.lasso.College <- cv.glmnet(x.College, y.College, alpha = 1)
bestlam.lasso.College <- cv.lasso.College$lambda.min
bestlam.lasso.College
## [1] 1.97344
plot(cv.lasso.College)
lasso.College <- glmnet(x.College, y.College, alpha = 1, lambda = bestlam.lasso.College)
dim(coef(lasso.College))
## [1] 18 1
pred.lasso.College <- predict(lasso.College, s = bestlam.lasso.College, newx = x.test.College)
err.lasso.College <- mean((pred.lasso.College - y.test.College)^2)
err.lasso.College
## [1] 1116402
coef.lasso.College <- predict(lasso.College, type = 'coefficient', s = bestlam.lasso.College)
coef.lasso.College
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -763.46736152
## PrivateYes -313.76789127
## Accept 1.76295802
## Enroll -1.32011619
## Top10perc 64.98040141
## Top25perc -20.93406828
## F.Undergrad 0.07157355
## P.Undergrad 0.01197052
## Outstate -0.10493165
## Room.Board 0.20915417
## Books 0.29265420
## Personal 0.00351509
## PhD -14.48803559
## Terminal 5.33334392
## S.F.Ratio 21.67584610
## perc.alumni 0.51564913
## Expend 0.04812700
## Grad.Rate 7.01799293
The test MSE for the LASSO model is 1116402.
The LASSO does not outperform ridge regression here, but its error is still smaller than that of the least squares model.
At the chosen lambda this LASSO fit does not shrink any of the coefficients all the way to zero.
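That claim can be checked directly from the coefficient matrix already extracted above (a quick sketch):
# Count how many lasso coefficients (excluding the intercept) are exactly zero
sum(coef.lasso.College[-1, 1] == 0)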
set.seed(1)
# Principal components regression (pls package): standardized predictors, 10-fold CV
pcr.College <- pcr(Apps ~ ., data = training.College, scale = TRUE, validation = "CV")
summary(pcr.College)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 4006 2373 2372 2069 1961 1919
## adjCV 4288 4007 2368 2369 1999 1948 1911
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1919 1921 1876 1832 1832 1836 1837
## adjCV 1912 1915 1868 1821 1823 1827 1827
## 14 comps 15 comps 16 comps 17 comps
## CV 1853 1759 1341 1270
## adjCV 1850 1733 1326 1257
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.20 57.78 65.31 70.99 76.37 81.27 84.8 87.85
## Apps 13.44 70.93 71.07 79.87 81.15 82.25 82.3 82.33
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.62 92.91 94.98 96.74 97.79 98.72 99.42
## Apps 83.38 84.76 84.80 84.84 85.11 85.14 90.55
## 16 comps 17 comps
## X 99.88 100.00
## Apps 93.42 93.89
validationplot(pcr.College, validation = "MSEP")
pred.pcr.College <- predict(pcr.College, x.test.College, ncomp = 4)
mean((pred.pcr.College - y.test.College)^2)
## [1] 2168577
coef.pcr.College <- pcr(Apps ~ ., data = training.College, scale = TRUE, ncomp = 4)
summary(coef.pcr.College)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 4
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps
## X 32.20 57.78 65.31 70.99
## Apps 13.44 70.93 71.07 79.87
The PCR model with the lowest cross-validated error (RMSEP) uses all 17 components. Since that is no different from running a least squares regression on all of the predictors, we can turn to the validation plot to choose a smaller model.
Inspection of the validation plot shows that the biggest drop in MSEP occurs when the model contains about 3-4 components, followed by another modest drop around 15 components.
To avoid a model that is too complex to interpret, M = 4 appears to be the best balance. At M = 4, 79.87% of the variation in the number of applications is explained, and there is no standout improvement on that figure until M = 15.
The test MSE for M = 4 is 2168577.
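The component count with the lowest cross-validated error can also be read off programmatically rather than from the plot; a sketch using the pls package's RMSEP() accessor (the first column of the returned array is the intercept-only model, so it is dropped):
# Sketch: which number of components minimizes the cross-validated RMSEP
cv.rmsep.College <- RMSEP(pcr.College, estimate = "CV")
which.min(cv.rmsep.College$val["CV", 1, -1])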
set.seed(1)
# Partial least squares regression: standardized predictors, 10-fold CV
pls.College <- plsr(Apps ~ ., data = training.College, scale = TRUE, validation = "CV")
summary(pls.College)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 2217 2019 1761 1630 1533 1347
## adjCV 4288 2211 2012 1749 1605 1510 1331
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1309 1303 1286 1283 1283 1277 1271
## adjCV 1296 1289 1273 1270 1270 1264 1258
## 14 comps 15 comps 16 comps 17 comps
## CV 1270 1270 1270 1270
## adjCV 1258 1257 1257 1257
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 27.21 50.73 63.06 65.52 70.20 74.20 78.62 80.81
## Apps 75.39 81.24 86.97 91.14 92.62 93.43 93.56 93.68
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.29 87.17 89.15 91.37 92.58 94.42 96.98
## Apps 93.76 93.79 93.83 93.86 93.88 93.89 93.89
## 16 comps 17 comps
## X 98.78 100.00
## Apps 93.89 93.89
validationplot(pls.College, validation = "MSEP")
pred.pls.College <- predict(pls.College, x.test.College, ncomp = 6)
mean((pred.pls.College - y.test.College)^2)
## [1] 1066991
coef.pls.College <- plsr(Apps ~ ., data = training.College, scale = TRUE, ncomp = 6)
summary(coef.pls.College)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 6
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 27.21 50.73 63.06 65.52 70.20 74.20
## Apps 75.39 81.24 86.97 91.14 92.62 93.43
The PLS model with the lowest cross-validated error (RMSEP) uses about 14 components. Since that may still be more complex than we would like, we can look at the validation plot, which shows that the biggest drop in MSEP has occurred by the time the model contains 6 components; beyond that the error is essentially flat as more components are added. Because of this we can confidently choose M = 6, which is consistent with the cross-validation results.
A PLS model with 6 components explains approximately 93.43% of the variation in Apps, and there is not much improvement on that figure as the number of components increases.
The test MSE for M = 6 is 1066991.
Test MSEs:
Least Squares: 1135758
Ridge: 976897.6
LASSO: 1116402
PCR (M = 4): 2168577
PLS (M = 6): 1066991
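For convenience, the five test errors can also be gathered into a single table (a sketch; the named error objects were computed in the chunks above, and the PCR and PLS errors are recomputed from their stored predictions):
# Collect the College test MSEs from the fitted models above into one data frame
data.frame(model = c("Least Squares", "Ridge", "LASSO", "PCR (M = 4)", "PLS (M = 6)"),
           test.MSE = c(error.College.lm,
                        err.ridge.College,
                        err.lasso.College,
                        mean((pred.pcr.College - y.test.College)^2),
                        mean((pred.pls.College - y.test.College)^2)))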
Looking only at test MSE, ridge regression appears to be the best choice because it has the lowest error.
Taking the complexity of the models into consideration, PLS may be the better choice: its MSE is not much higher than that of the ridge regression, and the model is much simpler to interpret since it uses only 6 components rather than all 17 predictors in the data set. Using a PLS model is further justified when we consider that roughly 93% of the variation in Apps is explained by the model.
detach(College)
Split Data into Training and Test Data
set.seed(1)
# Split the Boston data (MASS package) into equal-sized training and test sets
set.training.Boston <- sample(nrow(Boston), nrow(Boston)/2)
training.Boston <- Boston[set.training.Boston, ]
test.Boston <- Boston[-set.training.Boston, ]
Set Up X & Y
set.seed(1)
x.Boston <- model.matrix(crim ~ ., data = training.Boston)[ ,-1]
y.Boston <- training.Boston$crim
x.test.Boston <- model.matrix(crim ~ ., data = test.Boston)[ ,-1]
y.test.Boston <- test.Boston$crim
Ridge Regression
# Cross-validate the ridge penalty; alpha = 0 fits ridge regression (alpha = 1 would fit the lasso)
cv.ridge.Boston <- cv.glmnet(x.Boston, y.Boston, alpha = 0)
bestlam.ridge.Boston <- cv.ridge.Boston$lambda.min
bestlam.ridge.Boston
## [1] 0.05148183
plot(cv.ridge.Boston)
ridge.Boston <- glmnet(x.Boston, y.Boston, alpha = 0, lambda = bestlam.ridge.Boston)
dim(coef(ridge.Boston))
## [1] 14 1
pred.ridge.Boston <- predict(ridge.Boston, s = bestlam.ridge.Boston, newx = x.test.Boston)
err.ridge.Boston <- mean((pred.ridge.Boston - y.test.Boston)^2)
err.ridge.Boston
## [1] 40.99319
coef.ridge.Boston <- predict(ridge.Boston, type = 'coefficient', s = bestlam.ridge.Boston)
coef.ridge.Boston
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 18.9622903072
## zn 0.0368274919
## indus -0.1257403173
## chas -0.4800347940
## nox -8.2790505636
## rm 0.1165681662
## age -0.0006106162
## dis -0.8428452301
## rad 0.5341757612
## tax -0.0001555180
## ptratio -0.3827058704
## black -0.0129713964
## lstat 0.2578264241
## medv -0.1607986419
MSE: 41.47312
Several of the coefficients are shrunk close to zero, but since ridge regression does not remove any predictors from the model, it may not be the best fit when a simpler model is wanted.
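To see which predictors are shrunk the most under the ridge penalty, the coefficients can be sorted by absolute size (a small sketch using the coefficient matrix above):
# Sort the Boston ridge coefficients (excluding the intercept) by absolute size
sort(abs(coef.ridge.Boston[-1, 1]))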
LASSO
# Cross-validate the lasso penalty (alpha = 1) to choose lambda
cv.lasso.Boston <- cv.glmnet(x.Boston, y.Boston, alpha = 1)
bestlam.lasso.Boston <- cv.lasso.Boston$lambda.min
bestlam.lasso.Boston
## [1] 0.005029826
plot(cv.lasso.Boston)
lasso.Boston <- glmnet(x.Boston, y.Boston, alpha = 1, lambda = bestlam.lasso.Boston)
dim(coef(lasso.Boston))
## [1] 14 1
pred.lasso.Boston <- predict(lasso.Boston, s = bestlam.lasso.Boston, newx = x.test.Boston)
err.lasso.Boston <- mean((pred.lasso.Boston - y.test.Boston)^2)
err.lasso.Boston
## [1] 41.47312
coef.lasso.Boston <- predict(lasso.Boston, type = 'coefficient', s = bestlam.lasso.Boston)
coef.lasso.Boston
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 23.236959105
## zn 0.042603207
## indus -0.125417411
## chas -0.681840713
## nox -10.567296772
## rm 0.387379613
## age -0.006668022
## dis -1.069374865
## rad 0.612088100
## tax -0.004311220
## ptratio -0.472344094
## black -0.012666419
## lstat 0.264631900
## medv -0.208913686
MSE: 40.89965
The model is reduced to 11 components after LASSO brings 2 of the coefficients to zero.
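The number of coefficients actually set to zero at the chosen lambda can be verified directly from the coefficient matrix above (a quick sketch):
# Count how many Boston lasso coefficients (excluding the intercept) are exactly zero
sum(coef.lasso.Boston[-1, 1] == 0)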
PCR
set.seed(1)
pcr.Boston <- pcr(crim ~ ., data = training.Boston, scale = TRUE, validation = "CV")
summary(pcr.Boston)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.275 7.555 7.549 7.093 6.926 6.932 6.977
## adjCV 9.275 7.550 7.544 7.088 6.920 6.926 6.968
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.976 6.871 6.845 6.848 6.797 6.764 6.700
## adjCV 6.967 6.859 6.843 6.842 6.786 6.751 6.686
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.51 60.4 69.86 77.08 82.80 87.68 91.24 93.56
## crim 34.94 35.2 42.83 45.47 45.57 45.58 45.75 47.59
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.47 97.08 98.48 99.54 100.00
## crim 47.68 48.75 49.31 50.14 51.37
validationplot(pcr.Boston, validation = "MSEP")
pred.pcr.Boston <- predict(pcr.Boston, x.test.Boston, ncomp = 1)
mean((pred.pcr.Boston - y.test.Boston)^2)
## [1] 48.24988
coef.pcr.Boston <- pcr(crim ~ ., data = training.Boston, scale = TRUE, ncomp = 1)
summary(coef.pcr.Boston)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 1
## TRAINING: % variance explained
## 1 comps
## X 48.51
## crim 34.94
pred.pcr.Boston4 <- predict(pcr.Boston, x.test.Boston, ncomp = 4)
mean((pred.pcr.Boston4 - y.test.Boston)^2)
## [1] 43.91107
coef.pcr.Boston4 <- pcr(crim ~ ., data = training.Boston, scale = TRUE, ncomp = 4)
summary(coef.pcr.Boston4)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 4
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps
## X 48.51 60.4 69.86 77.08
## crim 34.94 35.2 42.83 45.47
MSE (1): 48.24988
MSE (4): 43.91107
The validation plot shows that there is a large drop in MSE when only 1 component is added to the model, and then there is another smaller drop when 4 components have been added.
After comparing the MSE for the two different M’s, the M = 4 model appears to be the better fit.
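Rather than checking M = 1 and M = 4 separately, the test MSE can also be computed for every component count at once; predict() on a pcr fit returns one slice per requested value of ncomp (a sketch using the objects above):
# Test MSE for each choice of M from 1 to 13 components
pred.pcr.all <- predict(pcr.Boston, x.test.Boston, ncomp = 1:13)
apply(pred.pcr.all, 3, function(p) mean((p - y.test.Boston)^2))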
PLS
set.seed(1)
pls.Boston <- plsr(crim ~ ., data = training.Boston, scale = TRUE, validation = "CV")
summary(pls.Boston)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: kernelpls
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.275 7.328 6.842 6.784 6.741 6.718 6.695
## adjCV 9.275 7.322 6.834 6.768 6.728 6.707 6.683
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.722 6.707 6.696 6.701 6.700 6.700 6.700
## adjCV 6.706 6.693 6.683 6.688 6.686 6.686 6.686
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.10 57.46 62.68 69.65 77.95 80.82 84.40 87.65
## crim 38.94 47.55 49.73 50.61 50.85 51.17 51.29 51.35
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 90.47 93.07 96.66 98.26 100.00
## crim 51.37 51.37 51.37 51.37 51.37
validationplot(pls.Boston, validation = "MSEP")
pred.pls.Boston <- predict(pls.Boston, x.test.Boston, ncomp = 2)
mean((pred.pls.Boston - y.test.Boston)^2)
## [1] 42.74086
coef.pls.Boston <- plsr(crim ~ ., data = training.Boston, scale = TRUE, ncomp = 2)
summary(coef.pls.Boston)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: kernelpls
## Number of components considered: 2
## TRAINING: % variance explained
## 1 comps 2 comps
## X 48.10 57.46
## crim 38.94 47.55
MSE: 42.74086
There is a large drop in cross-validated MSE over the first two components; after that, adding further components yields only marginal improvement.
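A more automatic way to pick the number of PLS components, assuming a reasonably recent version of the pls package is available, is selectNcomp(), which applies a one-standard-error or randomization rule to the cross-validation curve; a sketch:
# Choose the smallest number of components within one SE of the CV optimum
selectNcomp(pls.Boston, method = "onesigma")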
Test MSEs:
Ridge: 41.47312
LASSO: 40.89965
PCR (M = 4): 43.91107
PLS (M = 2): 42.74086
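As with the College data, the Boston test errors can be collected into a single table for easier comparison (a sketch using the objects computed above; the PCR and PLS errors are recomputed from their stored predictions):
# Collect the Boston test MSEs into one data frame
data.frame(model = c("Ridge", "LASSO", "PCR (M = 4)", "PLS (M = 2)"),
           test.MSE = c(err.ridge.Boston,
                        err.lasso.Boston,
                        mean((pred.pcr.Boston4 - y.test.Boston)^2),
                        mean((pred.pls.Boston - y.test.Boston)^2)))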
The LASSO has the lowest test MSE. Combined with the fact that it uses only 11 of the 13 predictors, it may be a good choice if accuracy is the most important factor.
On the other hand, PLS is similar in accuracy but adds only two components to the model, so it might be the best alternative if simplicity is the most important factor.
I chose two different models based on different criteria; I would need more information on the purpose of the model to make a firmer decision between them.
With that said, neither of the chosen models uses all 13 predictors.
detach(Boston)