Question #6.8.2

  1. iii, because the lasso is less flexible than least squares; this reduces variance at the cost of some bias, so its predictions improve when the increase in bias is smaller than the decrease in variance.

  2. iii, same as in part (a): ridge regression is less flexible than least squares, which lowers variance but adds bias, so it predicts better when the variance reduction outweighs the added bias.

  3. ii, because non-linear methods are more flexible, meaning less bias and more variance (the bias-variance trade-off); they improve predictions when the decrease in bias outweighs the increase in variance.

Question #6.8.8

set.seed(1)
x = rnorm(100)    # predictor: 100 draws from N(0, 1)
eps = rnorm(100)  # noise term
  1. I am selecting B0 = 3, B1 = 4, B2 = -2, and B3 = 0.4.

Y = 3 + 4x - 2x^2 + 0.4x^3 + eps
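As a quick sketch of this step (the original code is not shown, so the object name y is my own), the response can be built directly from x and eps:

y = 3 + 4 * x - 2 * x^2 + 0.4 * x^3 + eps  # cubic signal plus noise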

After plotting Cp, BIC, and adjusted R^2 for best subset selection, all three criteria point to a subset size of 3, so a 3-variable model is picked in each case.
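A minimal sketch of the best subset step, assuming regsubsets() from the leaps package was used (object names are my own):

library(leaps)
data.full = data.frame(y = y, x = x)
# Best subset selection over polynomial terms of x up to degree 10
regfit.full = regsubsets(y ~ poly(x, 10, raw = T), data = data.full, nvmax = 10)
reg.summary = summary(regfit.full)
which.min(reg.summary$cp)     # subset size chosen by Cp
which.min(reg.summary$bic)    # subset size chosen by BIC
which.max(reg.summary$adjr2)  # subset size chosen by adjusted R^2
coef(regfit.full, 3)          # coefficients of the 3-variable model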

##           (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2 
##            3.07187221            4.34563538           -2.15241773 
## poly(x, 10, raw = T)5 
##            0.07383426

All three criteria pick X^5 rather than X^3. The estimated coefficients are otherwise very close to our original values:

B0 = 3.07, B1 = 4.35, B2 = -2.15, B5 = 0.074

  2. We fit forward and backward stepwise selection models to the data.
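A minimal sketch of the stepwise fits, again assuming regsubsets() from the leaps package (object names are my own):

# Forward and backward stepwise selection on the same polynomial basis
regfit.fwd = regsubsets(y ~ poly(x, 10, raw = T), data = data.full, nvmax = 10,
                        method = "forward")
regfit.bwd = regsubsets(y ~ poly(x, 10, raw = T), data = data.full, nvmax = 10,
                        method = "backward")
coef(regfit.fwd, 3)  # forward model chosen by Cp / BIC / adjusted R^2
coef(regfit.bwd, 3)  # backward model chosen by Cp / BIC
coef(regfit.bwd, 4)  # backward model chosen by adjusted R^2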

##           (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2 
##            3.07187221            4.34563538           -2.15241773 
## poly(x, 10, raw = T)5 
##            0.07383426
##           (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2 
##            3.07187221            4.34563538           -2.15241773 
## poly(x, 10, raw = T)5 
##            0.07383426
##           (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2 
##           3.109977095           4.368852343          -2.244291070 
## poly(x, 10, raw = T)5 poly(x, 10, raw = T)6 
##           0.070193530           0.005223555

Forward Cp, backward Cp, forward BIC, backward BIC, and forward adjusted R^2 all picked a subset size of 3; however, backward adjusted R^2 picked a subset size of 4. That is the only difference from the best subset results above, where every criterion chose a subset size of 3.

  3. The lasso is used to reduce the number of predictors in the regression model and to separate important predictors from redundant ones.
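A minimal sketch of the lasso step, assuming cv.glmnet() from the glmnet package (object names are my own):

library(glmnet)
# glmnet needs a numeric matrix of the polynomial terms (drop the intercept column)
xmat = model.matrix(y ~ poly(x, 10, raw = T), data = data.full)[, -1]
cv.lasso = cv.glmnet(xmat, y, alpha = 1)  # cross-validation for the lasso
best.lambda = cv.lasso$lambda.min
best.lambda
# Coefficients at the selected lambda
predict(glmnet(xmat, y, alpha = 1), s = best.lambda, type = "coefficients")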
## [1] 0.03621535

## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                  1
## (Intercept)             3.04236953
## poly(x, 10, raw = T)1   4.28347735
## poly(x, 10, raw = T)2  -2.10848549
## poly(x, 10, raw = T)3   0.04140829
## poly(x, 10, raw = T)4   .         
## poly(x, 10, raw = T)5   0.06399416
## poly(x, 10, raw = T)6   .         
## poly(x, 10, raw = T)7   .         
## poly(x, 10, raw = T)8   .         
## poly(x, 10, raw = T)9   .         
## poly(x, 10, raw = T)10  .

We find that the optimal value of lambda is 0.0362, the value that gives the minimum mean cross-validated error. This makes sense: at that lambda the lasso keeps a model with 4 nonzero predictors (X, X^2, X^3, and X^5).

The resulting coefficient estimates are:

B0 = 3.04, B1 = 4.28, B2 = -2.11, B3 = 0.04, B5 = 0.064

  4. I created a new Y that depends only on X^7, with B0 = 3 and B7 = 10: Y = 3 + 10x^7 + eps. Best subset selection and the lasso are then rerun on this new response, as sketched below.
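A minimal sketch of this last step (B0 = 3 and B7 = 10 are inferred from the fitted values below; object names are my own):

# New response generated from x^7 only
y2 = 3 + 10 * x^7 + eps
data.full2 = data.frame(y = y2, x = x)
regfit2 = regsubsets(y ~ poly(x, 10, raw = T), data = data.full2, nvmax = 10)
reg2.summary = summary(regfit2)
coef(regfit2, which.min(reg2.summary$bic))    # BIC picks a 1-variable model
coef(regfit2, which.min(reg2.summary$cp))     # Cp picks a 2-variable model
coef(regfit2, which.max(reg2.summary$adjr2))  # adjusted R^2 picks a 4-variable model
# Lasso with cross-validated lambda on the new response
xmat2 = model.matrix(y ~ poly(x, 10, raw = T), data = data.full2)[, -1]
cv.lasso2 = cv.glmnet(xmat2, y2, alpha = 1)
cv.lasso2$lambda.min
predict(glmnet(xmat2, y2, alpha = 1), s = cv.lasso2$lambda.min, type = "coefficients")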
##           (Intercept) poly(x, 10, raw = T)7 
##               2.95894              10.00077
##           (Intercept) poly(x, 10, raw = T)2 poly(x, 10, raw = T)7 
##             3.0704904            -0.1417084            10.0015552
##           (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2 
##             3.0762524             0.2914016            -0.1617671 
## poly(x, 10, raw = T)3 poly(x, 10, raw = T)7 
##            -0.2526527            10.0091338
## [1] 19.39191
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                               1
## (Intercept)            4.309249
## poly(x, 10, raw = T)1  .       
## poly(x, 10, raw = T)2  .       
## poly(x, 10, raw = T)3  .       
## poly(x, 10, raw = T)4  .       
## poly(x, 10, raw = T)5  .       
## poly(x, 10, raw = T)6  .       
## poly(x, 10, raw = T)7  9.680819
## poly(x, 10, raw = T)8  .       
## poly(x, 10, raw = T)9  .       
## poly(x, 10, raw = T)10 .

The results show that the best model size using Cp was 2, using BIC was 1, and using adjusted R^2 was 4. In every case the selected model includes X^7.

The best lambda was found to be 19.39, the value that gives the minimum mean cross-validated error. We can also see from the coefficients that at this lambda the model keeps only a single regression coefficient, B7 = 9.68, which is also visible on the lasso coefficient path plot.