Introduction


The project attempts five different methods of fitting a regression model to the diabetes dataset: 1) Ordinary Least Squares, 2) Best Subset Selection using the BIC criterion, 3) Best Subset Selection using 10-fold cross-validation, 4) Ridge Regression, and 5) the Lasso.

Stated Issues:

  1. I was not able to figure out how to explicitly specify 10-fold cross-validation for the Ridge Regression and Lasso models, so the cv.glmnet defaults are used. (Note that cv.glmnet's nfolds argument defaults to 10, so the cross-validation performed is in fact 10-fold.)

  2. I was confused about the requirement to specify the Standard Error, even after searching several discussion threads on the course discussion forum. Hence, no Standard Errors are provided.
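For reference, a minimal sketch of requesting the fold count explicitly (cv.ridge and cv.lasso are illustrative names; x.train and y.train are the objects used later in this report):

library(glmnet)
# explicitly request 10-fold cross-validation (nfolds=10 is also the default)
cv.ridge <- cv.glmnet(x.train, y.train, alpha=0, nfolds=10)
cv.lasso <- cv.glmnet(x.train, y.train, alpha=1, nfolds=10)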

Analysis


The analysis relies primarily on the Mean Prediction Error on held-out test data to judge which method produces the best model. Note, however, that no goodness-of-fit analysis has been done to judge whether a linear model is appropriate in the first place. In cases where adding a variable yields only a very small improvement in the model's prediction error, the model containing the fewer number of variables was chosen. Hence, a variable is only added to the model when the improvement in prediction error outweighs the cost of the additional complexity.

Results



Ordinary Least Squares

The Ordinary Least Squares model fit on the training data, with coefficient estimates and standard errors, followed by the mean prediction error on the test set:
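The fitting code is not echoed in this subsection; a minimal sketch of the fit and the error computation, mirroring the Call shown below and assuming data.test/y.test hold the held-out split, would be:

# fit OLS on the training data (all predictors except the response column y)
ols.fit <- lm(y.train ~ . - y, data = data.train)
# mean squared prediction error on the held-out test set
ols.pred <- predict(ols.fit, newdata = data.test)
mean((y.test - ols.pred)^2)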

## [1] "OLS Summary"
## 
## Call:
## lm(formula = y.train ~ . - y, data = data.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -155.726  -36.065   -2.758   35.039  151.509 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  149.920      2.976  50.382  < 2e-16 ***
## age          -66.758     68.946  -0.968  0.33364    
## sex         -304.651     69.847  -4.362 1.74e-05 ***
## bmi          518.663     76.573   6.773 6.01e-11 ***
## map          388.111     72.755   5.335 1.81e-07 ***
## tc          -815.268    537.549  -1.517  0.13034    
## ldl          387.604    439.162   0.883  0.37811    
## hdl          162.903    269.117   0.605  0.54539    
## tch          323.832    186.803   1.734  0.08396 .  
## ltg          673.620    206.888   3.256  0.00125 ** 
## glu           94.219     79.590   1.184  0.23737    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54.05 on 321 degrees of freedom
## Multiple R-squared:  0.5213, Adjusted R-squared:  0.5064 
## F-statistic: 34.96 on 10 and 321 DF,  p-value: < 2.2e-16
## [1] "Mean Prediction Error"
## [1] 3111.265

Best subset selection using BIC

The Best Subset Selection method (using BIC as a proxy for determining which model will produce the lowest test error) shows that adding variables is likely to reduce the test error up to the point where six predictor variables are included. After the 6th predictor is added, the expected test error begins to increase. Once the six-variable model is selected as optimal, best subset selection is run again, this time on the full data set. Although the selected variables are nearly the same in both runs (the full-data run selects ldl in place of tch), note the differences in the coefficients of the final models below.

The six variables selected for the BIC-based model are sex, bmi, map, tc, tch, and ltg.

Note: BIC is referred to as a proxy for test error because it does not actually predict the test error. We expect it to track the test error closely, but it does not translate to it directly. We also never have certainty that the model size minimizing BIC also minimizes the test error; we only infer that the relationship is strong enough that the number of variables selected by min(BIC) is highly likely to also be the number of variables minimizing the test error.
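A minimal sketch of how the model size is chosen from the BIC values (regfit.full is an illustrative name; the leaps package supplies regsubsets):

library(leaps)
# exhaustive best subset search on the training data
regfit.full <- regsubsets(y ~ ., data = data.train, nvmax = 10)
reg.summary <- summary(regfit.full)
# BIC for each model size; the minimum marks the preferred size (6 here)
reg.summary$bic
which.min(reg.summary$bic)
plot(reg.summary$bic, type = "b", xlab = "Number of Variables", ylab = "BIC")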

## [1] "Best Subset Model Summary"
## Subset selection object
## Call: regsubsets.formula(y ~ ., data = data.train, nvmax = 10)
## 10 Variables  (and intercept)
##     Forced in Forced out
## age     FALSE      FALSE
## sex     FALSE      FALSE
## bmi     FALSE      FALSE
## map     FALSE      FALSE
## tc      FALSE      FALSE
## ldl     FALSE      FALSE
## hdl     FALSE      FALSE
## tch     FALSE      FALSE
## ltg     FALSE      FALSE
## glu     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
##           age sex bmi map tc  ldl hdl tch ltg glu
## 1  ( 1 )  " " " " "*" " " " " " " " " " " " " " "
## 2  ( 1 )  " " " " "*" " " " " " " " " " " "*" " "
## 3  ( 1 )  " " " " "*" "*" " " " " " " " " "*" " "
## 4  ( 1 )  " " " " "*" "*" "*" " " " " " " "*" " "
## 5  ( 1 )  " " "*" "*" "*" " " " " "*" " " "*" " "
## 6  ( 1 )  " " "*" "*" "*" "*" " " " " "*" "*" " "
## 7  ( 1 )  " " "*" "*" "*" "*" " " " " "*" "*" "*"
## 8  ( 1 )  "*" "*" "*" "*" "*" " " " " "*" "*" "*"
## 9  ( 1 )  "*" "*" "*" "*" "*" "*" " " "*" "*" "*"
## 10  ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## [1] "BIC Metrics for each best subset model by number of variables [1:10]:"
##  [1] -125.5340 -173.7155 -185.0856 -188.0941 -198.8718 -201.1269 -196.4423
##  [8] -191.5214 -186.1681 -180.7417

## [1] "Best Subset Model Coefficients from Test Data:"
## (Intercept)         sex         bmi         map          tc         tch 
##    150.1166   -306.0420    538.8274    389.0673   -379.0379    332.6735 
##         ltg 
##    527.5658
## [1] "Best Subset Model Coefficients from Full Data: (note the change in coefficients from the training only data)"
## (Intercept)         sex         bmi         map          tc         ldl 
##    152.1335   -226.5106    529.8730    327.2198   -757.9379    538.5859 
##         ltg 
##    804.1923
## [1] "Mean Prediction Error"
## [1] 2876.677
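No code is echoed for this subsection either; a sketch of how the test error above could be computed, using the predict.regsubsets helper defined in the next subsection and assuming data.test holds the held-out split (regfit.bic and bic.pred are illustrative names):

# fit best subsets on the training data, then evaluate the 6-variable model on the test set
regfit.bic <- regsubsets(y ~ ., data = data.train, nvmax = 10)
bic.pred <- predict(regfit.bic, data.test, id = 6)
mean((data.test$y - bic.pred)^2)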

Best subset selection using 10-fold cross-validation

10-fold cross-validation suggests that we can get nearly as good a fit with only 5 variables: the cross-validation error is minimized at 8 variables, but the 5-variable error is within a fraction of a percent of that minimum. Since the difference is marginal, our preference is for the model with fewer variables, all else being equal.
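The chunk below calls predict() on regsubsets objects, which have no built-in predict method; it assumes the standard helper from the ISLR labs, sketched here:

# predict method for regsubsets objects: rebuild the model matrix from the
# original formula and multiply by the coefficients of the id-th best model
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}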

# set the number of folds and set up the error matrix
k=10
set.seed(1306)
folds <- sample(1:k, nrow(data.train), replace=TRUE)
cv.errors=matrix(NA, k, 10, dimnames=list(NULL, paste(1:10)))

# for each fold, fit best subsets on the remaining folds and record the
# held-out mean squared error for each model size
for(j in 1:k){
  best.fit = regsubsets(y~., data=data.train[folds!=j,], nvmax=10)

  for(i in 1:10){
    pred = predict(best.fit, data.train[folds==j,], id=i)
    cv.errors[j,i] = mean((data.train$y[folds==j]-pred)^2)
  }
}

mean.cv.errors=apply(cv.errors, 2, mean)
print("Mean of the Cross Validation Errors (Training) for k=10")
## [1] "Mean of the Cross Validation Errors (Training) for k=10"
mean.cv.errors
##        1        2        3        4        5        6        7        8 
## 4168.124 3255.921 3132.571 3176.801 2965.080 2967.791 2970.210 2955.554 
##        9       10 
## 2958.412 2963.217
par(mfrow=c(1,1))
plot(mean.cv.errors, type="b", main="K=10 Cross Validation")

# refit on the full data set to obtain the best 5-variable model
reg.best=regsubsets(y~., data=data.all, nvmax=5)
print("Coefficients from K=10 cross validation model with 5 variables:")
## [1] "Coefficients from K=10 cross validation model with 5 variables:"
coef(reg.best, 5)
## (Intercept)         sex         bmi         map         hdl         ltg 
##    152.1335   -235.7756    523.5623    326.2358   -289.1169    474.2918
full_bss <- predict.regsubsets(reg.best, newdata=data.all, id=5)

print("Mean Prediction Error")
## [1] "Mean Prediction Error"
mean((y - full_bss)^2)
## [1] 2913.753

Ridge regression model using 10-fold cross-validation

library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-2
grid=10^seq(10,-2,length=100)  # candidate lambda grid (unused below; cv.glmnet builds its own sequence)
set.seed(1306)
# glmnet produced NULL for $lambda.min, so cv.glmnet is used instead to
# obtain the cross-validated lambda values.
# Source: http://www.inside-r.org/packages/cran/glmnet/docs/cv.glmnet

ridge.mod <- cv.glmnet(x.train, y.train, alpha=0)
# largest lambda whose CV error is within one standard error of the minimum
one_se_lam=ridge.mod$lambda.1se
print("Value for Lambda Used from CV.glmnet:")
## [1] "Value for Lambda Used from CV.glmnet:"
one_se_lam
## [1] 41.67209
print("Coefficients for Ridge Regression Model")
## [1] "Coefficients for Ridge Regression Model"
predict(ridge.mod, type="coefficients", s=one_se_lam)[1:11,]
## (Intercept)         age         sex         bmi         map          tc 
##   149.99086   -11.25502  -156.90281   374.44565   264.86245   -32.09103 
##         ldl         hdl         tch         ltg         glu 
##   -66.97779  -173.82190   124.03502   307.72524   134.51753
ridge.pred=predict(ridge.mod, s=one_se_lam, newx=x[test,])
print("Mean Prediction Error:")
## [1] "Mean Prediction Error:"
mean((ridge.pred-y.test)^2)
## [1] 3070.94
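To visualize how the cross-validation error varies with lambda, and where lambda.min and lambda.1se fall, the fitted cv.glmnet object can be plotted; a short sketch:

# CV error curve with error bars; dotted vertical lines mark lambda.min and lambda.1se
plot(ridge.mod)
c(min = ridge.mod$lambda.min, one.se = ridge.mod$lambda.1se)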

Lasso model using 10-fold cross-validation

The Lasso model retains the following variables: sex, bmi, map, hdl, ltg, and glu, for a total of six variables. The other four variables have coefficients of zero, excluding them from the final model.

set.seed(1306)
cv.out <- cv.glmnet(x.train, y.train, alpha=1)
one_se_lam <- cv.out$lambda.1se

print("Largest value of lambda with cross_validation error within 1 standard error of the min:")
## [1] "Largest value of lambda with cross_validation error within 1 standard error of the min:"
one_se_lam
## [1] 4.791278
print("Lasso Model Mean Prediction Error:")
## [1] "Lasso Model Mean Prediction Error:"
lasso.pred=predict(cv.out, s=one_se_lam, newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 2920.051
print("Lasso Model Coefficients")
## [1] "Lasso Model Coefficients"
lasso.coef=predict(cv.out, type="coefficients", s=one_se_lam)[1:11,]
lasso.coef
## (Intercept)         age         sex         bmi         map          tc 
##   149.95300     0.00000  -119.64893   501.48591   270.92404     0.00000 
##         ldl         hdl         tch         ltg         glu 
##     0.00000  -180.30353     0.00000   390.57448    16.61318
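As a quick follow-up, the variables the lasso retains can be listed by dropping the zeroed coefficients:

# keep only the nonzero coefficients, i.e. the variables retained by the lasso
lasso.coef[lasso.coef != 0]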

Conclusion

For this specific data set, Best Subset Selection using 10-fold cross-validation is the preferred method. Its Mean Prediction Error (2913.8) is close to the lowest observed (2876.7 for the BIC-based subset model), and it uses the fewest variables (five); following the criterion stated in the Analysis section, the marginal difference in error does not justify the additional variable.

The non-OLS methods consistently showed that not all variables have explanatory power for the response y, so the methods that are free to drop variables produced the best results. None of the methods was computationally difficult or time consuming with respect to system resources.

A caveat restraining the conclusions is that I was not able to explicitly control the 10-fold cross-validation for the Ridge Regression and Lasso approaches (although cv.glmnet performs 10-fold cross-validation by default). Based on the differences seen when introducing 10-fold cross-validation to Best Subset Selection, there could be some further improvement, but it would likely not be significant for this data set. Since several variables here are likely unrelated to the dependent variable, models able to discriminate in variable selection should outperform models with no such discrimination (i.e., full-variable OLS and Ridge Regression). For other data sets, however, the best method found here would not necessarily remain the best.