Discussion 1: Main Ideas

  1. Suppose that the true relationship between a response variable and one predictor variable is \(y=\beta_0+\beta_1x+\beta_2x^2+\varepsilon\). For a given data set, if the analyst fits a fifth-degree polynomial to the data, will the test MSE be higher or lower than the test MSE generated from a quadratic model? Answer the question again replacing the word “test” with “train”. ANSWER: Relationship Between Response Variable and Predictor Variable. Given relationship: \(y=\beta_0+\beta_1x+\beta_2x^2+\varepsilon\). This equation represents a quadratic model in which the response variable \(y\) is related to the predictor variable \(x\) plus an error term \(\varepsilon\).

Fifth-Degree Polynomial vs. Quadratic Model: Test MSE (Mean Squared Error): When fitting a fifth-degree polynomial to data that truly follows a quadratic relationship, the test MSE is likely to be higher than that of a quadratic model. This is because the fifth-degree polynomial can overfit the training data, capturing noise rather than the underlying relationship, which leads to poor performance on unseen (test) data.

Train MSE: For the training data, the MSE of the fifth-degree polynomial model will likely be lower than that of the quadratic model. This is because a more complex model (like a fifth-degree polynomial) can fit the training data more closely, even capturing noise, which reduces the training MSE. However, this does not necessarily mean better predictive performance on new data.
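
A minimal simulation sketch of this pattern (all data and object names here are illustrative and not part of the assignment): data are generated from a true quadratic model, then quadratic and fifth-degree fits are compared on train and test halves.

set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2*x - 3*x^2 + rnorm(n)                        #true relationship is quadratic
dat <- data.frame(x = x, y = y)
train <- sample(n, n/2)                                #50/50 train/test split
fit2 <- lm(y ~ poly(x, 2), data = dat[train, ])        #quadratic fit
fit5 <- lm(y ~ poly(x, 5), data = dat[train, ])        #fifth-degree fit
mean((dat$y[train] - predict(fit2, dat[train, ]))^2)   #train MSE, quadratic
mean((dat$y[train] - predict(fit5, dat[train, ]))^2)   #train MSE, degree 5 (usually lower)
mean((dat$y[-train] - predict(fit2, dat[-train, ]))^2) #test MSE, quadratic
mean((dat$y[-train] - predict(fit5, dat[-train, ]))^2) #test MSE, degree 5 (usually higher)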

  2. Under what scenario do you think performing feature selection would not be very helpful? Feature selection, the process of choosing relevant features for model building, may not always be beneficial. Here’s why:
  1. All Features Matter: If every feature in my dataset significantly impacts predictions, removing any could harm my model’s performance.

  2. Simple Data Structure: When my dataset has few features, feature selection isn’t crucial since the model isn’t complex and overfitting is less likely.

  3. Using Strong Regularization: With techniques like LASSO that automatically reduce less important features’ impact (see the short sketch after this list), I don’t need to do extra feature selection.

  4. Ample Computational Power: If I’m not limited by computational resources, managing many features isn’t a problem, so feature selection becomes less important.

  5. Self-Selecting Models: Models like decision trees or some ensemble methods already do feature selection as part of their training process, making additional steps redundant.
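
As a quick illustration of points 3 and 5, a minimal sketch on simulated data (the object names are my own; the glmnet package is also used in Discussion 3 below): the LASSO shrinks the coefficients of irrelevant predictors to exactly zero during fitting.

library(glmnet)
set.seed(1)
X <- matrix(rnorm(100*10), 100, 10)      #ten candidate predictors
y <- X[,1] + 0.5*X[,2] + rnorm(100)      #only the first two actually matter
lasso.cv <- cv.glmnet(X, y, alpha = 1)   #alpha = 1 gives the LASSO penalty
coef(lasso.cv, s = "lambda.1se")         #irrelevant predictors are shrunk to exactly zero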

Discussion 2: Forward Selection and Error Metrics

The following data set, GolfData2.csv, contains various measurements on professional golfers. There are three variables related to monetary compensation. Additionally, numerous attributes of the golfer are provided that can be lumped together into two categories. Variables such as average drive and driving accuracy tell us a little bit about the golfer’s ability to hit the ball at long range. Additional variables such as greens, average putts, and saves tell us how good the golfer is with their “short game” when working on shots closer to the hole. Finally, there are some additional “mystery variables” with generic labeling.

golf<-read.csv("GolfData2.csv")
names(golf)
##  [1] "Name"         "Age"          "AvgDrive"     "DriveAcc"     "Greens"      
##  [6] "AvgPutts"     "Save"         "Money.Rank"   "Events"       "TotalWinning"
## [11] "AvgWinnings"  "V12"          "V13"          "V14"          "V15"         
## [16] "V16"          "V17"          "V18"          "V19"          "V20"         
## [21] "V21"          "V22"          "V23"          "V24"          "V25"         
## [26] "V26"          "V27"          "V28"          "V29"          "V30"         
## [31] "V31"

A. Suppose we wanted to build a regression model to predict a player’s average winnings. Which variables in the data set do you think should be considered for use in a feature selection model and which ones should not? For the ones that should not, explain your reasons. ANSWER: When building a regression model to predict a player’s average winnings, I should consider variables that are likely to influence a golfer’s performance and, consequently, their earnings. Variables related to long-range ability (AvgDrive and DriveAcc) and short-game skill (Greens, AvgPutts, and Save) are relevant, and Age and Events may also carry useful information. Variables that should not be used as predictors are Name, which is just an identifier, and the other two compensation variables, Money.Rank and TotalWinning, since they are essentially alternative measures of the response and would make the model circular rather than predictive. I should also be cautious with the “mystery variables” (V12 to V31): with generic labels and no clear definition or connection to golf performance, they may contribute little beyond noise, so their relevance should be evaluated carefully during feature selection.

B. The following code chunk performs forward selection and computes traditional regression error metrics for each model iteration using log(AvgWinnings) as the response variable. When running the regsubsets function, it records the Adjusted R-squared, the Residual Sum of Squares, and the BIC. The red point highlights the best value of each metric across all of the models.

library(leaps)
## Warning: package 'leaps' was built under R version 4.3.2
golf2<-golf[,-c(1,8,10)]  #drop Name, Money.Rank, and TotalWinning before modeling
reg.fwd=regsubsets(log(AvgWinnings)~.,data=golf2,method="forward",nvmax=20)

summary(reg.fwd)$adjr2
##  [1] 0.3297415 0.5660627 0.6039178 0.6189419 0.6263985 0.6336879 0.6398874
##  [8] 0.6464599 0.6516239 0.6553905 0.6584060 0.6617997 0.6628219 0.6640956
## [15] 0.6643452 0.6637931 0.6629888 0.6617231 0.6605712 0.6593374
summary(reg.fwd)$rss
##  [1] 126.87220  81.71594  74.20086  71.01450  69.26034  67.55157  66.05696
##  [8]  64.50638  63.22427  62.20445  61.32682  60.38757  59.87606  59.32213
## [15]  58.95055  58.71947  58.53112  58.42087  58.28863  58.16811
summary(reg.fwd)$bic
##  [1]  -68.86949 -149.81790 -163.44855 -166.77321 -166.39736 -166.01555
##  [7] -165.12273 -164.50026 -163.15703 -161.06621 -158.57309 -156.32004
## [13] -152.70922 -149.25279 -145.20623 -140.69791 -136.04952 -131.14093
## [19] -126.30700 -121.43456
par(mfrow=c(1,3))
bics<-summary(reg.fwd)$bic
plot(1:20,bics,type="l",ylab="BIC",xlab="# of predictors")
index<-which(bics==min(bics))
points(index,bics[index],col="red",pch=10)

adjr2<-summary(reg.fwd)$adjr2
plot(1:20,adjr2,type="l",ylab="Adjusted R-squared",xlab="# of predictors")
index<-which(adjr2==max(adjr2))
points(index,adjr2[index],col="red",pch=10)

rss<-summary(reg.fwd)$rss
plot(1:20,rss,type="l",ylab="train RSS",xlab="# of predictors")
index<-which(rss==min(rss))
points(index,rss[index],col="red",pch=10)

Using the BIC metric, four predictors are selected. To see which predictors are included we can run the following code chunk. Repeat this code to examine which predictors are kept if you select the model based on Adjusted R-squared and residual sum of squares (train MSE). Which metric do you think you trust the most if I told you the “mystery variables” were randomly generated by me? When looking at Adjusted R-squared, does the graph suggest that as many predictors are needed as the maximum Adjusted R-squared suggests?

coef(reg.fwd,4)
##  (Intercept)       Greens     AvgPutts         Save       Events 
##  32.98715837   0.17393243 -19.22206757   0.02501048  -0.03953744

ANSWER: The code performs forward selection and evaluates models using Adjusted R-squared, Residual Sum of Squares (RSS), and the Bayesian Information Criterion (BIC). The BIC metric suggests using four predictors, which can be identified by running coef(reg.fwd,4). If I instead use Adjusted R-squared or RSS for model selection, I end up with larger models: the Adjusted R-squared curve peaks at 15 predictors, while the training RSS keeps decreasing all the way out to 20 predictors and therefore always favors the largest model considered.
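
A short sketch of that comparison using the summary object from the fit above (the model sizes in the comments come from the output shown earlier):

fwd.sum <- summary(reg.fwd)
coef(reg.fwd, which.min(fwd.sum$bic))    #BIC minimum: the 4-predictor model
coef(reg.fwd, which.max(fwd.sum$adjr2))  #Adjusted R-squared maximum: 15 predictors
coef(reg.fwd, which.min(fwd.sum$rss))    #training RSS minimum: all 20 predictors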

Given that the “mystery variables” are randomly generated, I would be cautious with metrics that favor more complex models, such as Adjusted R-squared and especially training RSS. BIC, which penalizes model complexity, is the metric I trust the most in this context. And although Adjusted R-squared is technically maximized at 15 predictors, the graph flattens out well before that point, suggesting that far fewer predictors are actually needed.

This code performs forward selection using the regsubsets function from the leaps package and evaluates models based on Adjusted R-squared, RSS, and BIC.

Adjusted R-squared: It increases with the number of predictors, but starts to plateau, suggesting that adding more predictors beyond a certain point doesn’t significantly improve the model.

RSS (Residual Sum of Squares): It decreases monotonically as predictors are added because it is computed on the training data, so on its own it always favors the largest model and is not a useful criterion for choosing the model size.

BIC (Bayesian Information Criterion): It decreases initially and then starts to increase, suggesting an optimal number of predictors. The lowest BIC value indicates the best model balance between fit and complexity.

The coef(reg.fwd,4) output shows the coefficients for the model with four predictors, chosen based on the BIC criterion. These predictors are Greens, AvgPutts, Save, and Events. This model suggests that these variables are significant in predicting log-transformed average winnings.

C. The previous code fits a model using the entire golf data set. The next set of code splits the data set into a training and validation set (50/50) and provides the MSE metric for both sets. What does the test MSE suggest the appropriate number of predictors should be? Are there any signs of overfitting based on the graphic? What happens if you change the split to 80/20?

library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: ggplot2
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.3.2
set.seed(1234)
trainIndex<-createDataPartition(golf2$AvgWinnings,p=.5,list=F)  #p: proportion of data in train

training<-golf2[trainIndex,]
validate<-golf2[-trainIndex,]

fwd.train=regsubsets(log(AvgWinnings)~.,data=training,method="forward",nvmax=20)

#Creating a prediction function for regsubsets objects (leaps does not supply a predict method)
predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])  #extract the model formula from the regsubsets call
  mat <- model.matrix(form, newdata)    #build the design matrix for the new data
  coefi <- coef(object, id = id)        #coefficients of the model with id predictors
  xvars <- names(coefi)
  mat[, xvars] %*% coefi                #predictions from the selected model size
}

valMSE<-c()
#note my index, i, runs to 20 since that is how many predictors I went up to during fwd selection
for (i in 1:20){
  predictions<-predict.regsubsets(object=fwd.train,newdata=validate,id=i) 
  valMSE[i]<-mean((log(validate$AvgWinnings)-predictions)^2)
}
par(mfrow=c(1,1))
plot(1:20,sqrt(valMSE),type="l",xlab="# of predictors",ylab="test vs train RMSE",ylim=c(0.3,.9))
index<-which(valMSE==min(valMSE))
points(index,sqrt(valMSE[index]),col="red",pch=10)

trainMSE<-summary(fwd.train)$rss/nrow(training)
lines(1:20,sqrt(trainMSE),lty=3,col="blue")  

ANSWER: The code chunk above splits the dataset into training and validation sets and calculates the Mean Squared Error (MSE) for both. The number of predictors at which the validation (test) MSE reaches its minimum suggests the appropriate model size. If the test MSE starts to rise as more predictors are added while the training RMSE keeps falling, that widening gap is a sign of overfitting, and the graphic makes this visible by plotting the two curves together.

Changing the split to 80/20 gives the model more data to learn from but leaves fewer observations for validation, so the validation MSE estimate becomes noisier and the selected model size may change. It remains important to monitor the test MSE to ensure the model isn’t overfitting.
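
A sketch of the 80/20 comparison, assuming the same workflow as above with only the partition proportion changed (the object names ending in 80 are my own; the valMSE loop and plots would then be re-run with these objects):

set.seed(1234)
trainIndex80<-createDataPartition(golf2$AvgWinnings,p=.8,list=F)  #80% of rows go to training
training80<-golf2[trainIndex80,]
validate80<-golf2[-trainIndex80,]
fwd.train80<-regsubsets(log(AvgWinnings)~.,data=training80,method="forward",nvmax=20)
#...then repeat the valMSE loop with fwd.train80 and validate80 to compare the curves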

Interpretation: BIC as a Reliable Metric: Given the presence of mystery variables, BIC is a reliable metric as it penalizes model complexity. It helps to avoid overfitting, especially when some predictors might be irrelevant or randomly generated.

Adjusted R-squared: The graph suggests that adding more predictors initially increases the explanatory power of the model but reaches a point of diminishing returns. This aligns with the concept of model parsimony, where simpler models are preferred if they adequately explain the data.

Model Selection: The choice of four predictors by BIC seems to strike a good balance, focusing on key performance indicators while avoiding overfitting with too many variables.

Discussion 3: GLMNET and Cross Validation

The following code chunk performs the glmnet procedure using caret and compares the MSE values via repeated 10-fold cross validation for various mixing and penalty terms.
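
For reference (this is the objective from the glmnet documentation rather than anything computed in this assignment), glmnet minimizes the penalized least-squares criterion

\[
\min_{\beta_0,\boldsymbol\beta}\;\frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-x_i^{\top}\boldsymbol\beta\right)^2+\lambda\left[\frac{1-\alpha}{2}\lVert\boldsymbol\beta\rVert_2^2+\alpha\lVert\boldsymbol\beta\rVert_1\right],
\]

where \(\alpha\) is the mixing parameter (\(\alpha=1\) gives the LASSO, \(\alpha=0\) gives ridge regression) and \(\lambda\) is the penalty term that caret tunes over below.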

library(caret)
set.seed(1234)

fitControl<-trainControl(method="repeatedcv",number=10,repeats=10) 
glmnet.fit<-train(log(AvgWinnings)~.,
               data=golf2,
               method="glmnet",
               trControl=fitControl
               )
glmnet.fit
## glmnet 
## 
## 196 samples
##  27 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 176, 178, 176, 177, 177, 176, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda       RMSE       Rsquared   MAE      
##   0.10   0.001137415  0.6305772  0.5964487  0.5189343
##   0.10   0.011374155  0.6287136  0.5973913  0.5172888
##   0.10   0.113741547  0.6193615  0.6066092  0.5059088
##   0.55   0.001137415  0.6304031  0.5967984  0.5188379
##   0.55   0.011374155  0.6228292  0.6026313  0.5125484
##   0.55   0.113741547  0.6302600  0.6041683  0.5169850
##   1.00   0.001137415  0.6299355  0.5972033  0.5184713
##   1.00   0.011374155  0.6175146  0.6079286  0.5078276
##   1.00   0.113741547  0.6438276  0.6008456  0.5263401
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.01137415.
plot(glmnet.fit)

opt.pen<-glmnet.fit$finalModel$lambdaOpt #penalty term
coef(glmnet.fit$finalModel,opt.pen)
## 28 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  3.457540e+01
## Age         -8.733798e-04
## AvgDrive     .           
## DriveAcc    -1.396854e-02
## Greens       1.803266e-01
## AvgPutts    -1.905866e+01
## Save         2.768164e-02
## Events      -4.023300e-02
## V12          6.671971e-02
## V13          7.652751e-02
## V14         -1.715237e-02
## V15          .           
## V16          .           
## V17         -8.801762e-02
## V18         -8.599175e-02
## V19         -1.800267e-02
## V20         -4.190729e-02
## V21          5.584152e-02
## V22          .           
## V23          6.611264e-02
## V24          .           
## V25          6.033225e-02
## V26          .           
## V27         -2.672171e-02
## V28         -3.932099e-03
## V29          6.963435e-03
## V30         -4.890659e-02
## V31         -2.297226e-02

A. Examine the output graphics and determine whether an elastic net (glmnet), LASSO, or ridge regression model was preferred. Compare the feature selection results relative to the forward selection results in the previous section. In particular, which predictors are included in the model and which are not? ANSWER:

The preferred model from the caret package analysis was LASSO regression, indicated by an alpha of 1 and a small lambda value of 0.01137415. This preference stems from LASSO’s ability to perform feature selection by shrinking certain coefficients to zero, hence simplifying the model and reducing overfitting.

In the final LASSO model, retained predictors include Age, DriveAcc, Greens, AvgPutts, Save, and Events, among others; these contribute to predicting average winnings. Conversely, predictors such as AvgDrive and V15 were eliminated from the model, their coefficients set to zero, implying they are less useful in the presence of the other variables.

This feature selection, which emphasizes model interpretability and focuses on influential predictors, is particularly beneficial when handling datasets with numerous predictors. It ensures a leaner, more understandable model.

When compared to forward selection, the LASSO builds its feature selection into a regularization penalty, so how aggressive that selection is depends on the size of lambda. With the fairly small lambda chosen here by cross-validated RMSE, the LASSO actually retains far more predictors (including many of the mystery variables) than the four-predictor model chosen by BIC under forward selection, although the retained coefficients are shrunk toward zero, which can help when predictors are correlated.

Ultimately, the decision to use LASSO over ridge or elastic net models is dictated by the dataset specifics and the desired balance between prediction accuracy and model simplicity. The LASSO model selected here kept the predictors with non-zero coefficients: Age, DriveAcc, Greens, AvgPutts, Save, Events, V12, V13, V14, V17, V18, V19, V20, V21, V23, V25, V27, V28, V29, V30, and V31. It excluded AvgDrive, V15, V16, V22, V24, and V26, whose coefficients were shrunk to exactly zero.
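
A small sketch of that comparison using objects already created above (the helper names fwd.vars and lasso.vars are my own):

fwd.vars<-names(coef(reg.fwd,4))[-1]                        #forward selection / BIC choice, intercept dropped
lasso.coefs<-as.matrix(coef(glmnet.fit$finalModel,opt.pen)) #convert the sparse coefficient matrix to a regular one
lasso.vars<-rownames(lasso.coefs)[lasso.coefs[,1]!=0][-1]   #names of the nonzero LASSO coefficients, intercept dropped
intersect(fwd.vars,lasso.vars)  #predictors chosen by both approaches
setdiff(lasso.vars,fwd.vars)    #predictors only the LASSO keeps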

B. The previous results in part A may have some folks worried. The glmnet package itself can also be run directly and has some additional options to discuss. Run the following code, and compare the penalty term and the resulting selected predictors from this version of glmnet to the previous one. What do you think about this result compared to the result in part A?
x=model.matrix(AvgWinnings~.,golf2)[,-1]
y=log(golf2$AvgWinnings)

library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.2
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.3.2
## Loaded glmnet 4.1-8
set.seed(1234)

#grid=10^seq(10,-2, length =100)
#lasso.mod=glmnet(x,y,alpha=1, lambda =grid)

cv.out=cv.glmnet(x,y,alpha=1)
plot(cv.out)

bestlambda<-cv.out$lambda.1se  #largest lambda within one standard error of the minimum cross-validated error
coef(cv.out,s=bestlambda)
## 28 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  34.21411790
## Age           .         
## AvgDrive      .         
## DriveAcc      .         
## Greens        0.14861219
## AvgPutts    -17.54870783
## Save          0.01418931
## Events       -0.02663602
## V12           .         
## V13           .         
## V14           .         
## V15           .         
## V16           .         
## V17          -0.01938770
## V18          -0.01445805
## V19           .         
## V20           .         
## V21           .         
## V22           .         
## V23           .         
## V24           .         
## V25           .         
## V26           .         
## V27           .         
## V28           .         
## V29           .         
## V30           .         
## V31           .

ANSWER: After running the glmnet-specific code, the results indeed offer a different perspective. The cv.glmnet function led to the selection of a more regularized model with a best lambda value determined by the lambda.1se criterion. This typically results in a model with fewer predictors compared to the one chosen by minimum MSE, which might alleviate concerns about overfitting.
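
Both penalty choices are stored in the cv.glmnet object, so the two fits can be inspected side by side (a quick sketch using the cv.out object from above):

cv.out$lambda.min               #lambda that minimizes the cross-validated error
cv.out$lambda.1se               #largest lambda within one SE of that minimum (more regularization)
coef(cv.out,s="lambda.min")     #typically keeps more predictors
coef(cv.out,s="lambda.1se")     #the sparser model shown above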

Upon comparing the selected predictors from this glmnet model to the previous caret model, it is clear that the lambda.1se choice has produced a much sparser model. Many predictors that were included in the caret model now have coefficients shrunk to zero, including Age, DriveAcc, and most of the mystery variables. Only a handful of predictors remain non-zero in both models: Greens, AvgPutts, Save, and Events, along with two of the mystery variables (V17 and V18).

I think this result is quite informative. It suggests that when we apply more conservative regularization, as in the glmnet approach, we end up with a model that is likely to generalize better to new data, though at the potential cost of excluding predictors that might have some predictive value. This underscores the importance of balancing model complexity with predictive accuracy and the need for domain expertise to understand which predictors are truly important.