Question #6.8.2
iii because the lasso is less flexible than least squares, which reduces variance; it gives better predictions when that decrease in variance outweighs the accompanying increase in bias.
iii same as in part a: ridge regression is less flexible than least squares, which leads to lower variance but more bias.
ii because non-linear methods are more flexible, meaning less bias but more variance (the bias-variance trade-off).
Question #6.8.8
set.seed(1)
x = rnorm(100)
eps = rnorm(100)
y = 3 + 4*x - 2*x^2 + 0.4*x^3 + eps
After running best subset selection and plotting Cp, BIC, and adjusted R^2 against subset size, we see that each criterion selects a subset size of 3, so a 3-variable model is picked in every case.
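A minimal sketch of one way to produce these results with the leaps package (object names here are illustrative, not necessarily the ones used in the original code):

library(leaps)

# Best subset selection over polynomial terms of x up to degree 10
data.full = data.frame(y = y, x = x)
regfit.full = regsubsets(y ~ poly(x, 10, raw = TRUE), data = data.full, nvmax = 10)
reg.summary = summary(regfit.full)

# Plot Cp, BIC, and adjusted R^2 against subset size
par(mfrow = c(1, 3))
plot(reg.summary$cp, xlab = "Subset size", ylab = "Cp", type = "l")
plot(reg.summary$bic, xlab = "Subset size", ylab = "BIC", type = "l")
plot(reg.summary$adjr2, xlab = "Subset size", ylab = "Adjusted R^2", type = "l")

# Subset size chosen by each criterion
which.min(reg.summary$cp)
which.min(reg.summary$bic)
which.max(reg.summary$adjr2)

# Coefficient estimates for the chosen 3-variable model
coef(regfit.full, id = 3)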
## (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2
## 3.07187221 4.34563538 -2.15241773
## poly(x, 10, raw = T)5
## 0.07383426
All three statistics pick X^5 over X^3. The intercept, X, and X^2 coefficients are close to the values used to generate the data (3, 4, and -2), while a small X^5 term takes the place of the original 0.4 X^3 term:
B0 = 3.07, B1 = 4.34, B2 = -2.15, B5 = 0.074
## (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2
## 3.07187221 4.34563538 -2.15241773
## poly(x, 10, raw = T)5
## 0.07383426
## (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2
## 3.07187221 4.34563538 -2.15241773
## poly(x, 10, raw = T)5
## 0.07383426
## (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2
## 3.109977095 4.368852343 -2.244291070
## poly(x, 10, raw = T)5 poly(x, 10, raw = T)6
## 0.070193530 0.005223555
We see that forward Cp, backward Cp, forward BIC, backward BIC, and forward adjusted R^2 all pick a subset size of 3, while backward adjusted R^2 picks a subset size of 4. This is the only difference from part (c), where every criterion chose a subset size of 3.
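A sketch of the corresponding forward and backward stepwise fits, reusing data.full from the sketch above:

# Forward and backward stepwise selection on the same 10 polynomial terms
regfit.fwd = regsubsets(y ~ poly(x, 10, raw = TRUE), data = data.full,
                        nvmax = 10, method = "forward")
regfit.bwd = regsubsets(y ~ poly(x, 10, raw = TRUE), data = data.full,
                        nvmax = 10, method = "backward")
fwd.summary = summary(regfit.fwd)
bwd.summary = summary(regfit.bwd)

# Subset sizes picked by Cp and BIC (minimized) and adjusted R^2 (maximized)
c(fwd.cp = which.min(fwd.summary$cp), bwd.cp = which.min(bwd.summary$cp),
  fwd.bic = which.min(fwd.summary$bic), bwd.bic = which.min(bwd.summary$bic))
c(fwd.adjr2 = which.max(fwd.summary$adjr2), bwd.adjr2 = which.max(bwd.summary$adjr2))

# Coefficient estimates for the selected models
coef(regfit.fwd, id = 3)
coef(regfit.bwd, id = 3)
coef(regfit.bwd, id = 4)   # backward adjusted R^2 picks 4 variables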
## [1] 0.03621535
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 3.04236953
## poly(x, 10, raw = T)1 4.28347735
## poly(x, 10, raw = T)2 -2.10848549
## poly(x, 10, raw = T)3 0.04140829
## poly(x, 10, raw = T)4 .
## poly(x, 10, raw = T)5 0.06399416
## poly(x, 10, raw = T)6 .
## poly(x, 10, raw = T)7 .
## poly(x, 10, raw = T)8 .
## poly(x, 10, raw = T)9 .
## poly(x, 10, raw = T)10 .
We find that the optimal value of lambda is 0.0362, the value that minimizes the mean cross-validated error. At this lambda the lasso keeps a model of subset size 4 (non-zero coefficients on X, X^2, X^3, and X^5).
The resulting coefficient estimates are:
B0 = 3.04, B1 = 4.28, B2 = -2.11, B3 = 0.04, B5 = 0.064
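A sketch of a cross-validated lasso fit that produces numbers like those above, using glmnet; the model matrix of raw polynomial terms and the 10-fold default of cv.glmnet are assumptions here:

library(glmnet)

# Model matrix of the 10 raw polynomial terms (drop the intercept column)
xmat = model.matrix(y ~ poly(x, 10, raw = TRUE), data = data.full)[, -1]

# 10-fold cross-validation for the lasso (alpha = 1)
set.seed(1)
cv.lasso = cv.glmnet(xmat, y, alpha = 1)
best.lambda = cv.lasso$lambda.min
best.lambda
plot(cv.lasso)   # cross-validation curve

# Fit on the full data and report coefficients at the chosen lambda
fit.lasso = glmnet(xmat, y, alpha = 1)
predict(fit.lasso, s = best.lambda, type = "coefficients")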
## (Intercept) poly(x, 10, raw = T)7
## 2.95894 10.00077
## (Intercept) poly(x, 10, raw = T)2 poly(x, 10, raw = T)7
## 3.0704904 -0.1417084 10.0015552
## (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2
## 3.0762524 0.2914016 -0.1617671
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)7
## -0.2526527 10.0091338
## [1] 19.39191
## 11 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 4.309249
## poly(x, 10, raw = T)1 .
## poly(x, 10, raw = T)2 .
## poly(x, 10, raw = T)3 .
## poly(x, 10, raw = T)4 .
## poly(x, 10, raw = T)5 .
## poly(x, 10, raw = T)6 .
## poly(x, 10, raw = T)7 9.680819
## poly(x, 10, raw = T)8 .
## poly(x, 10, raw = T)9 .
## poly(x, 10, raw = T)10 .
The results show that the best model size using Cp was 2, the best model size using BIC was 1, and the best model size using adjusted R^2 was 4. All of the selected models include X^7.
The best lambda was found to be 19.39, the value that gives the minimum mean cross-validated error. The coefficient output shows that at this lambda the lasso keeps only a single non-zero regression coefficient (B7 = 9.68), which also matches the lasso plot.
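For part (f), a sketch of the same workflow on the new response; the values B0 = 3 and B7 = 10 are assumptions chosen only to be consistent with the fitted coefficients shown above:

# New response that depends only on X^7 (B0 = 3 and B7 = 10 are assumed values)
y7 = 3 + 10 * x^7 + eps
data.full7 = data.frame(y = y7, x = x)

# Best subset selection on the 10 polynomial terms
regfit.7 = regsubsets(y ~ poly(x, 10, raw = TRUE), data = data.full7, nvmax = 10)
reg7.summary = summary(regfit.7)
which.min(reg7.summary$cp)      # model size picked by Cp
which.min(reg7.summary$bic)     # model size picked by BIC
which.max(reg7.summary$adjr2)   # model size picked by adjusted R^2

# Cross-validated lasso on the same terms
xmat7 = model.matrix(y ~ poly(x, 10, raw = TRUE), data = data.full7)[, -1]
cv.lasso7 = cv.glmnet(xmat7, y7, alpha = 1)
cv.lasso7$lambda.min
predict(glmnet(xmat7, y7, alpha = 1), s = cv.lasso7$lambda.min, type = "coefficients")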