For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
- a. The lasso, relative to least squares, is:
  - i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  - ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  - iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  - iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
- b. Repeat (a) for ridge regression relative to least squares.
- c. Repeat (a) for non-linear methods relative to least squares.
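As background for all three parts, the expected test error at a point \(x_0\) splits into variance, squared bias, and irreducible noise. This is the standard textbook identity, stated here as a reminder rather than as a worked answer:

\[
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \operatorname{Var}\left(\hat{f}(x_0)\right) + \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \operatorname{Var}(\varepsilon)
\]

A method improves on least squares only when its net change to the first two terms is a reduction. The term \(\operatorname{Var}(\varepsilon)\) is the same for every method.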
In this exercise, we will predict the number of applications received using the other variables in the College data set.
- Split the data set into a training set and a test set.
- Fit a linear model using least squares on the training set, and report the test error obtained.
##
## Call:
## lm(formula = Apps ~ ., data = College[train, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -4963.4 -447.7 -42.8 338.3 7790.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -685.74538 470.15497 -1.459 0.145178
## PrivateYes -531.17790 159.50713 -3.330 0.000918 ***
## Accept 1.58651 0.04501 35.246 < 2e-16 ***
## Enroll -0.89994 0.19981 -4.504 7.93e-06 ***
## Top10perc 54.46747 6.25983 8.701 < 2e-16 ***
## Top25perc -17.40449 5.09522 -3.416 0.000676 ***
## F.Undergrad 0.06185 0.03609 1.714 0.087071 .
## P.Undergrad 0.04396 0.03491 1.259 0.208365
## Outstate -0.07995 0.02191 -3.649 0.000284 ***
## Room.Board 0.16431 0.05679 2.893 0.003941 **
## Books 0.16979 0.28981 0.586 0.558185
## Personal 0.01522 0.07075 0.215 0.829784
## PhD -10.58598 5.49279 -1.927 0.054390 .
## Terminal -1.95091 6.00164 -0.325 0.745239
## S.F.Ratio 22.79666 14.82745 1.537 0.124673
## perc.alumni -0.44579 4.77616 -0.093 0.925666
## Expend 0.07988 0.01372 5.821 9.22e-09 ***
## Grad.Rate 9.62036 3.40940 2.822 0.004924 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1092 on 642 degrees of freedom
## Multiple R-squared: 0.9297, Adjusted R-squared: 0.9278
## F-statistic: 499.5 on 17 and 642 DF, p-value: < 2.2e-16
## [1] 521452.1
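The split, fit, and test-error steps can be sketched in a few lines of Python. This is a minimal illustration on synthetic one-predictor data, not the College fit shown above. The 85% training fraction, the seed, and the helper names `fit_simple_ols` and `mse` are assumptions made for the example.

```python
import random

def fit_simple_ols(x, y):
    """Closed-form least squares for y = b0 + b1 * x (one predictor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def mse(y_true, y_pred):
    """Mean squared error, the quantity reported as the test error."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

rng = random.Random(1)
# Synthetic stand-in data (hypothetical; NOT the College data).
x = [rng.uniform(0, 10) for _ in range(200)]
y = [3.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

# Random split: 85% of the rows for training, the rest held out.
idx = list(range(len(x)))
rng.shuffle(idx)
cut = len(idx) * 85 // 100
train, test = idx[:cut], idx[cut:]

# Fit on the training rows only, then score on the held-out rows.
b0, b1 = fit_simple_ols([x[i] for i in train], [y[i] for i in train])
test_mse = mse([y[i] for i in test], [b0 + b1 * x[i] for i in test])
print(f"slope = {b1:.3f}, test MSE = {test_mse:.3f}")
```

The key point is that the coefficients come from the training rows only, while the reported error uses rows the fit never saw.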
- Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
## (Intercept) PrivateYes Accept Enroll
## Min. :1 Min. :0.0000 Min. : 90.0 Min. : 35.0
## 1st Qu.:1 1st Qu.:0.0000 1st Qu.: 618.2 1st Qu.: 244.0
## Median :1 Median :1.0000 Median : 1127.0 Median : 438.0
## Mean :1 Mean :0.7242 Mean : 2073.3 Mean : 797.8
## 3rd Qu.:1 3rd Qu.:1.0000 3rd Qu.: 2447.8 3rd Qu.: 910.2
## Max. :1 Max. :1.0000 Max. :26330.0 Max. :6392.0
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.00 Min. : 199 Min. : 1.0
## 1st Qu.:15.75 1st Qu.: 42.00 1st Qu.: 1000 1st Qu.: 94.0
## Median :24.00 Median : 55.00 Median : 1720 Median : 351.0
## Mean :28.01 Mean : 56.46 Mean : 3795 Mean : 881.1
## 3rd Qu.:36.00 3rd Qu.: 70.00 3rd Qu.: 4270 3rd Qu.: 1017.2
## Max. :96.00 Max. :100.00 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2580 Min. :1880 Min. : 96.0 Min. : 250.0
## 1st Qu.: 7338 1st Qu.:3590 1st Qu.: 450.0 1st Qu.: 873.8
## Median :10147 Median :4205 Median : 500.0 Median :1200.0
## Mean :10519 Mean :4357 Mean : 546.4 Mean :1348.2
## 3rd Qu.:13126 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700.0
## Max. :21700 Max. :7425 Max. :2340.0 Max. :6800.0
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.0 Min. : 25.00 Min. : 2.90 Min. : 0.00
## 1st Qu.: 62.0 1st Qu.: 71.00 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.0 Median : 83.00 Median :13.60 Median :21.00
## Mean : 72.9 Mean : 79.95 Mean :14.12 Mean :23.04
## 3rd Qu.: 86.0 3rd Qu.: 92.00 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.0 Max. :100.00 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10
## 1st Qu.: 6806 1st Qu.: 54
## Median : 8520 Median : 66
## Mean : 9737 Mean : 66
## 3rd Qu.:10895 3rd Qu.: 78
## Max. :56233 Max. :118
## [1] 513191.6
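Choosing \(\lambda\) by cross-validation means refitting the ridge model on each fold for every candidate \(\lambda\) and keeping the value with the lowest held-out error. The sketch below does this in plain Python on synthetic data. The grid, the 5 folds, and the helper names are assumptions. It also does not standardize predictors, so its \(\lambda\) values are not comparable to those behind the output above.

```python
import random

def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def ridge_fit(X, y, lam):
    """Ridge on centered data: solve (X'X + lam * I) beta = X'y; the intercept is not penalized."""
    n, p = len(X), len(X[0])
    mx = [sum(row[j] for row in X) / n for j in range(p)]
    my = sum(y) / n
    Xc = [[row[j] - mx[j] for j in range(p)] for row in X]
    yc = [v - my for v in y]
    A = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[j] * v for r, v in zip(Xc, yc)) for j in range(p)]
    beta = solve(A, b)
    return my - sum(bj * mj for bj, mj in zip(beta, mx)), beta

def cv_mse(X, y, lam, k=5):
    """k-fold cross-validated MSE for one value of lam (fold = row index mod k)."""
    n, total = len(y), 0.0
    for fold in range(k):
        tr = [i for i in range(n) if i % k != fold]
        te = [i for i in range(n) if i % k == fold]
        b0, beta = ridge_fit([X[i] for i in tr], [y[i] for i in tr], lam)
        for i in te:
            pred = b0 + sum(bj * xj for bj, xj in zip(beta, X[i]))
            total += (y[i] - pred) ** 2
    return total / n

rng = random.Random(0)
X, y = [], []
for _ in range(120):  # synthetic data with two correlated predictors (hypothetical)
    x1 = rng.gauss(0, 1)
    x2 = x1 + rng.gauss(0, 0.3)
    x3 = rng.gauss(0, 1)
    X.append([x1, x2, x3])
    y.append(1.0 + 2.0 * x1 + 0.5 * x3 + rng.gauss(0, 1))

grid = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
cv = {lam: cv_mse(X, y, lam) for lam in grid}
best_lam = min(cv, key=cv.get)
print("lambda chosen by CV:", best_lam)
```

After selecting `best_lam`, the model would be refit on the full training set and scored once on the test set, which is the number reported above.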
- Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
## [1] 4
## [1] 513191.6
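The count of non-zero coefficients matters because the lasso penalty can set estimates exactly to zero. One common way to compute the fit is cyclic coordinate descent with soft-thresholding. The sketch below is a minimal pure-Python version on synthetic, centered data. The objective scaling, the value `lam=0.5`, and all names are assumptions for illustration, not the code that produced the output above.

```python
import random

def soft_threshold(z, g):
    """Shrink z toward zero by g; values within [-g, g] become exactly 0."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/(2n)) * RSS + lam * sum(|beta_j|) by cyclic coordinate descent.

    Assumes the columns of X and the response y are already centered,
    so no intercept is needed.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of feature j with the residual that ignores feature j.
            rho = 0.0
            for row, yi in zip(X, y):
                others = sum(beta[k] * row[k] for k in range(p) if k != j)
                rho += row[j] * (yi - others)
            rho /= n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta

rng = random.Random(2)
n, p = 200, 4
raw = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
means = [sum(r[j] for r in raw) / n for j in range(p)]
X = [[r[j] - means[j] for j in range(p)] for r in raw]
# Synthetic response (hypothetical): only features 0 and 3 carry signal.
y_raw = [3.0 * r[0] - 2.0 * r[3] + rng.gauss(0, 0.5) for r in X]
y_mean = sum(y_raw) / n
y = [v - y_mean for v in y_raw]

beta = lasso_cd(X, y, lam=0.5)
n_nonzero = sum(1 for b in beta if b != 0.0)
print("coefficients:", [round(b, 3) for b in beta])
print("non-zero coefficients:", n_nonzero)
```

Coefficients of the pure-noise features land exactly on 0.0, not merely near it, so counting entries that differ from zero is a valid way to report how many features the model keeps.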
- Fit a PCR model on the training set, with \(M\) chosen by cross-validation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.
## Data: X dimension: 660 17
## Y dimension: 660 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4068 4007 2136 2139 1809 1644 1641
## adjCV 4068 4009 2134 2139 1811 1637 1636
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1630 1590 1571 1562 1571 1571 1571
## adjCV 1629 1580 1567 1557 1566 1567 1567
## 14 comps 15 comps 16 comps 17 comps
## CV 1574 1549 1242 1197
## adjCV 1570 1536 1233 1190
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.034 57.78 64.92 70.57 75.8 80.75 84.30 87.66
## Apps 3.259 73.13 73.14 80.92 84.7 84.83 85.02 85.86
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.70 93.09 95.16 96.94 97.98 98.78 99.36
## Apps 86.13 86.45 86.45 86.52 86.59 86.62 89.92
## 16 comps 17 comps
## X 99.83 100.00
## Apps 92.56 92.97
## [1] 521452.1
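The value of \(M\) can be read off the CV row of the summary above by taking the number of components with the smallest cross-validated RMSEP. A small Python sketch of that selection, with the values copied from the table (the variable names are illustrative):

```python
# CV RMSEP row from the summary above; index 0 is the intercept-only model,
# index m is the model with m principal components.
cv_rmsep = [4068, 4007, 2136, 2139, 1809, 1644, 1641,
            1630, 1590, 1571, 1562, 1571, 1571, 1571,
            1574, 1549, 1242, 1197]

best_m = min(range(len(cv_rmsep)), key=lambda m: cv_rmsep[m])
print(best_m, cv_rmsep[best_m])  # prints: 17 1197
```

With \(M = 17\) every component is used, so PCR reproduces the least squares fit. That is consistent with the PCR test error above being identical to the least squares test error.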
- Fit a PLS model on the training set, with \(M\) chosen by cross-validation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.
## [1] 521452.1
- Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
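Ridge and lasso give the lowest test errors: both print a test MSE of 513191.6 above, against 521452.1 for least squares, PCR, and PLS. The gap is under 2%, so there is little practical difference among the five approaches. PCR matches least squares exactly because cross-validation selects M = 17, that is, all components. The results also vary noticeably with the train/test split selected.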
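To put the test errors on the scale of the response, it helps to convert each MSE to an RMSE (in applications) and to compute the relative gap between the best and worst method. The numbers below are the test errors printed above; the dictionary layout is just for illustration.

```python
import math

# Test MSEs as printed in the outputs above.
test_mse = {
    "least squares": 521452.1,
    "ridge": 513191.6,
    "lasso": 513191.6,
    "PCR": 521452.1,
    "PLS": 521452.1,
}

rmse = {name: math.sqrt(v) for name, v in test_mse.items()}
best, worst = min(test_mse.values()), max(test_mse.values())
rel_gap = (worst - best) / worst

for name in test_mse:
    print(f"{name:<13} MSE = {test_mse[name]:>9.1f}  RMSE = {rmse[name]:.1f}")
print(f"relative gap between best and worst MSE: {rel_gap:.1%}")
```

The RMSE values come out at roughly 716 to 722 applications, which is the typical size of a prediction error on the held-out colleges.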
We will now try to predict per capita crime rate in the Boston data set.
- Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
[Figures: a seaborn pair plot (PairGrid) and a second single-axes plot.]
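Of the methods named in this part, best subset selection is the most direct to sketch: fit least squares on every subset of predictors and compare the subsets on held-out data, not on training error. The Python sketch below does this on synthetic data with four predictors. The data, the 100/50 split, and the helper names are assumptions, and it is not the analysis run on the Boston data.

```python
import random
from itertools import combinations

def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def ols_fit(X, y, cols):
    """Least squares with an intercept, using only the columns listed in cols."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    q = len(Z[0])
    A = [[sum(r[i] * r[j] for r in Z) for j in range(q)] for i in range(q)]
    b = [sum(r[i] * v for r, v in zip(Z, y)) for i in range(q)]
    return solve(A, b)

def val_mse(coef, X, y, cols):
    """Mean squared error of the fitted subset model on held-out rows."""
    err = 0.0
    for row, v in zip(X, y):
        pred = coef[0] + sum(c * row[j] for c, j in zip(coef[1:], cols))
        err += (v - pred) ** 2
    return err / len(y)

rng = random.Random(3)
# Synthetic data (hypothetical): only columns 0 and 2 affect the response.
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(150)]
y = [1.0 + 2.0 * r[0] - 1.5 * r[2] + rng.gauss(0, 0.5) for r in X]
X_train, y_train, X_val, y_val = X[:100], y[:100], X[100:], y[100:]

# Score every non-empty subset of the four predictors on the validation rows.
results = {}
for size in range(1, 5):
    for cols in combinations(range(4), size):
        coef = ols_fit(X_train, y_train, cols)
        results[cols] = val_mse(coef, X_val, y_val, cols)

best = min(results, key=results.get)
print("best subset:", best, "validation MSE:", round(results[best], 3))
```

Exhaustive search fits 2^p - 1 models, which is fine for a handful of predictors but grows quickly. That is one reason to also consider the lasso, ridge regression, and PCR.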
- Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
The crime variable seems to have the most nearly normal distribution (the fit reported later uses log(crim) as the response), so this is what we will try to fit.
## [1] 1.630519
## [1] 1.677635
## [1] 1.648161
## [1] 1.630519
## [1] 1.630519
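The lasso seems to perform very slightly better. It sets the coefficient of rm exactly to zero (the second (Intercept) row is presumably the constant column of the model matrix, which is also zeroed):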
## 7 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -12.157450324
## (Intercept) .
## rm .
## age 0.007925442
## medv -0.001304823
## nox 11.909560317
## ptratio 0.229580912
## [1] 13
- Does your chosen model involve all of the features in the data set? Why or why not?
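No. The lasso drops rm, the average number of rooms per dwelling, and in the least squares fit below rm is not statistically significant (p ≈ 0.70). medv is also not significant there (p ≈ 0.62), although the lasso keeps it with a very small coefficient.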
##
## Call:
## lm(formula = log(crim) ~ ., data = boston_data[train, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2827 -0.8619 0.0630 0.7705 3.5294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.653847 0.967427 -13.080 < 2e-16 ***
## rm 0.042559 0.110288 0.386 0.69976
## age 0.008314 0.002980 2.790 0.00549 **
## medv -0.004776 0.009562 -0.499 0.61768
## nox 12.061847 0.735765 16.394 < 2e-16 ***
## ptratio 0.240261 0.030233 7.947 1.56e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.207 on 449 degrees of freedom
## Multiple R-squared: 0.683, Adjusted R-squared: 0.6795
## F-statistic: 193.5 on 5 and 449 DF, p-value: < 2.2e-16