The project fits a regression model to the diabetes dataset using five different methods: 1) Ordinary Least Squares, 2) Best Subset Selection using the BIC criterion, 3) Best Subset Selection using 10-fold cross-validation, 4) Ridge Regression, and 5) the Lasso.
Stated issues: 1) I was not able to figure out how to specify 10-fold cross-validation explicitly for the Ridge Regression and Lasso models; hence, the default cross-validation performed by cv.glmnet is used for those two methods.
The analysis relies heavily on the mean test (prediction) error to decide which method produces the best model. Note, however, that no goodness-of-fit analysis has been performed to judge whether a linear model is appropriate in the first place. In cases where adding a variable yields only a very small reduction in prediction error, the model with fewer variables was preferred. In other words, a variable is added only when the improvement in prediction error outweighs the cost of carrying an additional variable in the model.
## Loaded lars 1.2
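The remainder of the report assumes the diabetes data (apparently from the lars package, given the loading message above) have already been split into training and test sets and stored in objects such as data.all, data.train, x.train, y.train, x, y, and test. That setup code is not shown, so the sketch below is only an assumption of what it may have looked like; the 332/110 split is inferred from the 321 residual degrees of freedom in the OLS summary (321 + 11 estimated coefficients), and the seed and sampling order are guesses.
# Assumed setup (not shown in the original report)
library(lars)
data(diabetes)
x <- unclass(diabetes$x)            # 442 x 10 matrix of standardized predictors
y <- diabetes$y                     # response
data.all <- data.frame(x, y = y)    # columns age, sex, ..., glu, y
# Train/test split: 332 training rows implied by the OLS residual df; seed is a guess
set.seed(1306)
test <- sample(1:nrow(data.all), 110)
train <- setdiff(1:nrow(data.all), test)
data.train <- data.all[train, ]
data.test <- data.all[test, ]
x.train <- x[train, ]; y.train <- y[train]
x.test <- x[test, ]; y.test <- y[test]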
The Ordinary Least Squares model fit to the training data, with its coefficient estimates, standard errors, and the resulting mean prediction error:
## [1] "OLS Summary"
##
## Call:
## lm(formula = y.train ~ . - y, data = data.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -155.726 -36.065 -2.758 35.039 151.509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 149.920 2.976 50.382 < 2e-16 ***
## age -66.758 68.946 -0.968 0.33364
## sex -304.651 69.847 -4.362 1.74e-05 ***
## bmi 518.663 76.573 6.773 6.01e-11 ***
## map 388.111 72.755 5.335 1.81e-07 ***
## tc -815.268 537.549 -1.517 0.13034
## ldl 387.604 439.162 0.883 0.37811
## hdl 162.903 269.117 0.605 0.54539
## tch 323.832 186.803 1.734 0.08396 .
## ltg 673.620 206.888 3.256 0.00125 **
## glu 94.219 79.590 1.184 0.23737
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.05 on 321 degrees of freedom
## Multiple R-squared: 0.5213, Adjusted R-squared: 0.5064
## F-statistic: 34.96 on 10 and 321 DF, p-value: < 2.2e-16
## [1] "Mean Prediction Error"
## [1] 3111.265
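For reference, a sketch of how the fit and the error above can be reproduced (data.test and y.test are assumed to be the held-out portion of the split, as in the later sections):
# Fit OLS on the training data using all ten predictors
ols.fit <- lm(y.train ~ . - y, data = data.train)
summary(ols.fit)
# Mean squared prediction error on the held-out test set
ols.pred <- predict(ols.fit, newdata = data.test)
mean((y.test - ols.pred)^2)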
The Best Subset Selection method (using BIC as a proxy for which model will produce the lowest test error) shows that adding variables is likely to reduce the test error up to the point where six predictors are included. After the 6th predictor is added, the expected test error begins to increase again. Once the 6-variable model is selected as optimal, best subset selection is run a second time on the full data set. Although the selected variables are largely the same for the training-only and full-data fits, note the differences in the coefficients of the final models below.
The six variables selected for the BIC-based model are sex, bmi, map, tc, tch, and ltg.
Note: BIC is referred to as a proxy for test error because it does not actually estimate the test error; we expect it to be strongly related to test error, but it does not translate into test error directly. We also never have certainty that the number of variables minimizing BIC is the number that minimizes test error; we simply infer that the relationship is strong enough that the model size selected by min(BIC) is very likely also the (near-)optimal size for min(test error).
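The summary that follows comes from leaps::regsubsets; the BIC values and the chosen model size can be extracted along these lines (a sketch consistent with the call shown in the output; the object names here are illustrative):
library(leaps)
# Exhaustive best subset search on the training data, up to all 10 predictors
regfit.train <- regsubsets(y ~ ., data = data.train, nvmax = 10)
reg.summary <- summary(regfit.train)
# BIC of the best model of each size; its minimum (size 6 here) picks the model
reg.summary$bic
which.min(reg.summary$bic)
coef(regfit.train, which.min(reg.summary$bic))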
## [1] "Best Subset Model Summary"
## Subset selection object
## Call: regsubsets.formula(y ~ ., data = data.train, nvmax = 10)
## 10 Variables (and intercept)
## Forced in Forced out
## age FALSE FALSE
## sex FALSE FALSE
## bmi FALSE FALSE
## map FALSE FALSE
## tc FALSE FALSE
## ldl FALSE FALSE
## hdl FALSE FALSE
## tch FALSE FALSE
## ltg FALSE FALSE
## glu FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
## age sex bmi map tc ldl hdl tch ltg glu
## 1 ( 1 ) " " " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " "*" " " " " " " " " " " "*" " "
## 3 ( 1 ) " " " " "*" "*" " " " " " " " " "*" " "
## 4 ( 1 ) " " " " "*" "*" "*" " " " " " " "*" " "
## 5 ( 1 ) " " "*" "*" "*" " " " " "*" " " "*" " "
## 6 ( 1 ) " " "*" "*" "*" "*" " " " " "*" "*" " "
## 7 ( 1 ) " " "*" "*" "*" "*" " " " " "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" " " " " "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## [1] "BIC Metrics for each best subset model by number of variables [1:10]:"
## [1] -125.5340 -173.7155 -185.0856 -188.0941 -198.8718 -201.1269 -196.4423
## [8] -191.5214 -186.1681 -180.7417
## [1] "Best Subset Model Coefficients from Test Data:"
## (Intercept) sex bmi map tc tch
## 150.1166 -306.0420 538.8274 389.0673 -379.0379 332.6735
## ltg
## 527.5658
## [1] "Best Subset Model Coefficients from Full Data: (note the change in coefficients from the training only data)"
## (Intercept) sex bmi map tc ldl
## 152.1335 -226.5106 529.8730 327.2198 -757.9379 538.5859
## ltg
## 804.1923
## [1] "Mean Prediction Error"
## [1] 2876.677
10-fold cross-validation suggests that we can actually get a slightly better fit with only 5 variables. Even though the difference is very marginal, our preference should be for the model with fewer variables when all other attributes are equal.
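The prediction-error figures above and the loop below call predict() on regsubsets objects, for which the leaps package provides no method; the project presumably defines a helper along the lines of the standard ISLR lab version, sketched here as an assumption:
# Assumed predict method for regsubsets objects (ISLR-style helper):
# rebuild the model matrix from the original formula and multiply by the
# coefficients of the best model of the requested size
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}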
# Set the number of folds and assign each training observation to a fold
k <- 10
set.seed(1306)
folds <- sample(1:k, nrow(data.train), replace=TRUE)
cv.errors <- matrix(NA, k, 10, dimnames=list(NULL, paste(1:10)))
# For each fold, fit best subsets on the remaining training data and record the
# held-out error for every model size (the folds index data.train, so data.train
# is used throughout; predict() dispatches to the regsubsets helper above)
for(j in 1:k){
best.fit <- regsubsets(y~., data=data.train[folds!=j,], nvmax=10)
for(i in 1:10){
pred <- predict(best.fit, data.train[folds==j,], id=i)
cv.errors[j,i] <- mean((data.train$y[folds==j]-pred)^2)
}
}
mean.cv.errors <- apply(cv.errors, 2, mean)
print("Mean of the Cross Validation Errors (Training) for k=10")
## [1] "Mean of the Cross Validation Errors (Training) for k=10"
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 4168.124 3255.921 3132.571 3176.801 2965.080 2967.791 2970.210 2955.554
## 9 10
## 2958.412 2963.217
par(mfrow=c(1,1))
plot(mean.cv.errors ,type="b", main="K=10 Cross Validation")
# Refit best subsets on the full data set to obtain the 5-variable model
reg.best <- regsubsets(y~., data=data.all, nvmax=5)
print("Coefficients from K=10 cross validation model with 5 variables:")
## [1] "Coefficients from K=10 cross validation model with 5 variables:"
coef(reg.best ,5)
## (Intercept) sex bmi map hdl ltg
## 152.1335 -235.7756 523.5623 326.2358 -289.1169 474.2918
full_bss <- predict.regsubsets(reg.best, newdata=data.all, id=5)
print("Mean Prediction Error")
## [1] "Mean Prediction Error"
mean((y - full_bss)^2)
## [1] 2913.753
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-2
grid=10^seq(10,-2,length=100) # candidate lambda grid (not passed to cv.glmnet below, which uses its own default sequence)
set.seed(1306)
# Had an issue where glmnet produced NULL for $lambda.min; used cv.glmnet to get
# the cross-validated lambda values instead.
# Source: http://www.inside-r.org/packages/cran/glmnet/docs/cv.glmnet
ridge.mod <- cv.glmnet(x.train, y.train, alpha=0)
one_se_lam=ridge.mod$lambda.1se
print("Value for Lambda Used from CV.glmnet:")
## [1] "Value for Lambda Used from CV.glmnet:"
one_se_lam
## [1] 41.67209
print("Coefficients for Ridge Regression Model")
## [1] "Coefficients for Ridge Regression Model"
predict(ridge.mod,type="coefficients",s=one_se_lam)[1:11,]
## (Intercept) age sex bmi map tc
## 149.99086 -11.25502 -156.90281 374.44565 264.86245 -32.09103
## ldl hdl tch ltg glu
## -66.97779 -173.82190 124.03502 307.72524 134.51753
ridge.pred=predict(ridge.mod,s=one_se_lam ,newx=x[test,])
print("Mean Prediction Error:")
## [1] "Mean Prediction Error:"
mean((ridge.pred-y.test)^2)
## [1] 3070.94
The Lasso model retains the following variables: sex, bmi, map, hdl, ltg, and glu, for a total of 6 variables. The other four variables (age, tc, ldl, and tch) have coefficients of exactly zero, excluding them from the final model.
set.seed(1306)
cv.out <- cv.glmnet(x.train, y.train, alpha=1)
one_se_lam <- cv.out$lambda.1se
print("Largest value of lambda with cross_validation error within 1 standard error of the min:")
## [1] "Largest value of lambda with cross_validation error within 1 standard error of the min:"
one_se_lam
## [1] 4.791278
print("Lasso Model Mean Prediction Error:")
## [1] "Lasso Model Mean Prediction Error:"
lasso.pred=predict(cv.out, s=one_se_lam ,newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 2920.051
print("Lasso Model Coefficients")
## [1] "Lasso Model Coefficients"
lasso.coef=predict(cv.out,type="coefficients",s=one_se_lam)[1:11,]
lasso.coef
## (Intercept) age sex bmi map tc
## 149.95300 0.00000 -119.64893 501.48591 270.92404 0.00000
## ldl hdl tch ltg glu
## 0.00000 -180.30353 0.00000 390.57448 16.61318
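Regarding the stated issue about 10-fold cross-validation for the penalized models: cv.glmnet performs k-fold cross-validation internally (10 folds by default), and both the number of folds and the candidate lambda grid defined in the ridge section can be passed explicitly. A minimal sketch (object names here are illustrative):
set.seed(1306)
# Ridge (alpha=0) and lasso (alpha=1) with an explicit 10-fold cross-validation
# over the lambda grid defined earlier
ridge.cv <- cv.glmnet(x.train, y.train, alpha=0, lambda=grid, nfolds=10)
lasso.cv <- cv.glmnet(x.train, y.train, alpha=1, lambda=grid, nfolds=10)
ridge.cv$lambda.1se
lasso.cv$lambda.1se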
For this specific data set, Best Subset Selection using 10-fold cross-validation produced the best model. Its mean prediction error (2913.753) is very close to the lowest of the five methods (2876.677 for the 6-variable BIC-based model) while using the fewest variables, which matches the stated preference for smaller models when the difference in prediction error is marginal. The non-OLS methods consistently showed that not all variables have explanatory power for the response “y”; accordingly, the methods that are free to drop variables produced the best results. None of the methods was computationally difficult or time consuming with respect to system resources. A key issue restraining the conclusions is that I was not able to get 10-fold cross-validation to work explicitly with the Ridge Regression and Lasso approaches; based on the modest differences seen when 10-fold cross-validation was introduced for Best Subset Selection, it might yield an improvement for those methods as well, but likely not a significant one for this data set. Because several predictors appear unrelated to the dependent variable, methods with some ability to discriminate in variable selection are likely to outperform methods with none (i.e., full-variable OLS and Ridge Regression). For other data sets, however, the best method found here would not necessarily remain the best choice.