\[ \text{The differences between criteria are verified because the BIC put a penalty on the models with more predictors} \\ \text{and chooses a simpler model with only 2 predictors.} \] ###b)
\[ \text{The tenfold cross validation matches with the BIC predictors with higher bias, but lower variance.} \]
K = 5
N = dim(diabetes[1])
folds = cut(sample(1:N,N),breaks = K, labels = FALSE)
## Warning in 1:N: numerical expression has 2 elements: only the first used
X = diabetes
Y = response
## The purpose of this block is to generate the same lambda every time.
fitLasso = glmnet(as.matrix(X),as.matrix(Y))
lamb = fitLasso$lambda
##
n = length(lamb)
testMSE = array(0,n)
## My approach to to this cross validation is to repeat plot 2 in part a) once for each fold.
## Visually, you can imagine me plotting five graphs with the same X vectors and taking the mean of all five
## points at each lambda as the final product. Not quite sure exactly if this is how k-fold is supposed to work.
for(k in 1:K)
{
idxTrn = which(folds != k)
idxTst = which(folds == k)
Xtrain = as.matrix(X[idxTrn,])
Ytrain = as.matrix(Y[idxTrn,])
Xtest = as.matrix(X[idxTst,])
Ytest = as.matrix(Y[idxTst,])
fitLasso = glmnet(Xtrain, Ytrain)
for(i in 1:n)
{
lassoPred = predict(fitLasso, Xtest, s=lamb[i])
testMSE[i] = mean((lassoPred-Ytest)^2) + testMSE[i]
}
}
testMSE = testMSE/K
minIdx = which(testMSE == min(testMSE))
plot(log(lamb), testMSE, col = ifelse(lamb == lamb[minIdx], "red","black"), xlab = bquote('log('*lambda*')'))
legend(x=min(log(lamb)),y=max(testMSE),
legend = c(paste("Folds: ", K),paste("Optimal Lambda: ", round(lamb[minIdx],2), sep=""), paste("Minimum MSE: ", round(testMSE[minIdx],2), sep=""))
)
\[ \text{The test MSE is non-monotonic, unlike what we saw and expected from the in-sample training MSE.} \]
## [1] 32
\[ \text{Mallow's} \ C_p \ \text{leads to similar results as k-fold cross-validation, but not the same as simply taking the training MSE.} \]
## 51 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.0285694254
## X46 2.3935773666
## X40 -4.7848878826
## X13 0.0065776908
## X2 0.0093454701
## X29 -0.0179679243
## X14 0.0149241649
## X7 0.0386545169
## X1 0.0184512296
## X34 -0.0126954580
## X44 0.0029576318
## X18 0.0009637210
## X35 0.0191414590
## X9 0.0036079987
## X11 0.0131186088
## X5 -0.0087679714
## X26 0.0059099909
## X41 0.0131383887
## X49 0.0013024791
## X12 -0.0152135541
## X19 0.0016280856
## X24 0.0106938867
## X36 -0.0097395843
## X39 -0.0029391697
## X48 0.0106176244
## X23 -0.0080238236
## X46 2.3790549648
## X40 -4.7563830285
## X13 0.0061725003
## X2 0.0086656043
## X29 -0.0172258561
## X14 0.0146085756
## X7 0.0370864296
## X1 0.0179552332
## X34 -0.0121932899
## X44 0.0029622669
## X18 0.0008462304
## X35 0.0186781157
## X9 0.0030960035
## X11 0.0124021791
## X5 -0.0085922036
## X26 0.0055237033
## X41 0.0126120052
## X49 0.0014554567
## X12 -0.0146599117
## X19 0.0018517138
## X24 0.0102244878
## X36 -0.0094776645
## X39 -0.0028311254
## X48 0.0101359089
## X23 -0.0080449951
## 51 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.0063037620
## X46 4.8310436023
## X40 -9.7268694004
## X13 .
## X2 .
## X29 .
## X14 .
## X7 .
## X1 .
## X34 .
## X44 .
## X18 .
## X35 .
## X9 .
## X11 .
## X5 .
## X26 .
## X41 .
## X49 .
## X12 .
## X19 .
## X24 .
## X36 .
## X39 .
## X48 .
## X23 .
## X46 0.0002113901
## X40 -0.1000443079
## X13 .
## X2 .
## X29 .
## X14 .
## X7 .
## X1 .
## X34 .
## X44 .
## X18 .
## X35 .
## X9 .
## X11 .
## X5 .
## X26 .
## X41 .
## X49 .
## X12 .
## X19 .
## X24 .
## X36 .
## X39 .
## X48 .
## X23 .
\[ \text{Ridge regression gives X1, X2, X26, and X27 weight in the model with lambda min. Lasso regression on the other}\\ \text{hand only gives X1 and X2 weight. The Lasso model is better for our data, because y depends on only a small} \\ \text{subset of predictors out of the 50 available to us. Lasso produces sparse models better representing the data.} \]