1)

a)

\[ \text{The differences between criteria are verified because the BIC put a penalty on the models with more predictors} \\ \text{and chooses a simpler model with only 2 predictors.} \] ###b)

\[ \text{The tenfold cross validation matches with the BIC predictors with higher bias, but lower variance.} \]

2)

a)

b)

K = 5
N = dim(diabetes[1])
folds = cut(sample(1:N,N),breaks = K, labels = FALSE)

## Warning in 1:N: numerical expression has 2 elements: only the first used

X = diabetes
Y = response

## The purpose of this block is to generate the same lambda every time.
fitLasso = glmnet(as.matrix(X),as.matrix(Y))
lamb = fitLasso$lambda
##
n = length(lamb)

testMSE = array(0,n)

## My approach to to this cross validation is to repeat plot 2 in part a) once for each fold.
## Visually, you can imagine me plotting five graphs with the same X vectors and taking the mean of all five 
## points at each lambda as the final product. Not quite sure exactly if this is how k-fold is supposed to work.
for(k in 1:K)
{
  idxTrn = which(folds != k)
  idxTst = which(folds == k)
  
  Xtrain = as.matrix(X[idxTrn,])
  Ytrain = as.matrix(Y[idxTrn,])
  Xtest = as.matrix(X[idxTst,])
  Ytest = as.matrix(Y[idxTst,])
  
  fitLasso = glmnet(Xtrain, Ytrain)
  
  for(i in 1:n)
  {
    lassoPred = predict(fitLasso, Xtest, s=lamb[i])
    testMSE[i] = mean((lassoPred-Ytest)^2) + testMSE[i]
  }
} 
testMSE = testMSE/K
minIdx = which(testMSE == min(testMSE))
plot(log(lamb), testMSE, col = ifelse(lamb == lamb[minIdx], "red","black"), xlab = bquote('log('*lambda*')'))
legend(x=min(log(lamb)),y=max(testMSE),
       legend = c(paste("Folds: ", K),paste("Optimal Lambda: ", round(lamb[minIdx],2), sep=""), paste("Minimum MSE: ", round(testMSE[minIdx],2), sep=""))
       )

\[ \text{The test MSE is non-monotonic, unlike what we saw and expected from the in-sample training MSE.} \]

c)

## [1] 32

\[ \text{Mallow's} \ C_p \ \text{leads to similar results as k-fold cross-validation, but not the same as simply taking the training MSE.} \]

3)

## 51 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -0.0285694254
## X46          2.3935773666
## X40         -4.7848878826
## X13          0.0065776908
## X2           0.0093454701
## X29         -0.0179679243
## X14          0.0149241649
## X7           0.0386545169
## X1           0.0184512296
## X34         -0.0126954580
## X44          0.0029576318
## X18          0.0009637210
## X35          0.0191414590
## X9           0.0036079987
## X11          0.0131186088
## X5          -0.0087679714
## X26          0.0059099909
## X41          0.0131383887
## X49          0.0013024791
## X12         -0.0152135541
## X19          0.0016280856
## X24          0.0106938867
## X36         -0.0097395843
## X39         -0.0029391697
## X48          0.0106176244
## X23         -0.0080238236
## X46          2.3790549648
## X40         -4.7563830285
## X13          0.0061725003
## X2           0.0086656043
## X29         -0.0172258561
## X14          0.0146085756
## X7           0.0370864296
## X1           0.0179552332
## X34         -0.0121932899
## X44          0.0029622669
## X18          0.0008462304
## X35          0.0186781157
## X9           0.0030960035
## X11          0.0124021791
## X5          -0.0085922036
## X26          0.0055237033
## X41          0.0126120052
## X49          0.0014554567
## X12         -0.0146599117
## X19          0.0018517138
## X24          0.0102244878
## X36         -0.0094776645
## X39         -0.0028311254
## X48          0.0101359089
## X23         -0.0080449951

## 51 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -0.0063037620
## X46          4.8310436023
## X40         -9.7268694004
## X13          .           
## X2           .           
## X29          .           
## X14          .           
## X7           .           
## X1           .           
## X34          .           
## X44          .           
## X18          .           
## X35          .           
## X9           .           
## X11          .           
## X5           .           
## X26          .           
## X41          .           
## X49          .           
## X12          .           
## X19          .           
## X24          .           
## X36          .           
## X39          .           
## X48          .           
## X23          .           
## X46          0.0002113901
## X40         -0.1000443079
## X13          .           
## X2           .           
## X29          .           
## X14          .           
## X7           .           
## X1           .           
## X34          .           
## X44          .           
## X18          .           
## X35          .           
## X9           .           
## X11          .           
## X5           .           
## X26          .           
## X41          .           
## X49          .           
## X12          .           
## X19          .           
## X24          .           
## X36          .           
## X39          .           
## X48          .           
## X23          .

\[ \text{Ridge regression gives X1, X2, X26, and X27 weight in the model with lambda min. Lasso regression on the other}\\ \text{hand only gives X1 and X2 weight. The Lasso model is better for our data, because y depends on only a small} \\ \text{subset of predictors out of the 50 available to us. Lasso produces sparse models better representing the data.} \]

HW 3

William Chance, Shray Mittal, Benjamin Leu

October 27, 2019

1)

a)

2)

a)

b)

c)

3)