This lab on Model Validation using Validation and Cross-Validation in R comes from p. 248-251 of “Introduction to Statistical Learning with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. It was re-implemented in Fall 2016 in tidyverse format by Amelia McNamara and R. Jordan Crouser at Smith College. Minor edits by Fatou Sanogo at Smith College.

library(ISLR)
library(leaps)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Model selection using the Validation Set Approach

In Lab 8, we saw that it is possible to choose among a set of models of different sizes using \(C_p\), BIC, and adjusted \(R^2\). We will now consider how to do this using the validation set and cross-validation approaches.

As in Lab 8, we’ll be working with the Hitters dataset from ISLR. Since we’re trying to predict Salary and we know from last time that some are missing, let’s first drop all the rows with missing values:

Hitters = na.omit(Hitters)

In order for these approaches to yield accurate estimates of the test error, we must use only the training observations to perform all aspects of model-fitting including variable selection. Therefore, the determination of which model of a given size is best must be made using only the training observations. This point is subtle but important. If the full data set is used to perform the best subset selection step, the validation set errors and cross-validation errors that we obtain will not be accurate estimates of the test error.

In order to use the validation set approach, we begin by splitting the observations into a training set and a test set as before. Here, we’ve decided to split the data in half using the sample_frac() method:

set.seed(1)

train = Hitters %>%
  sample_frac(0.5)

test = Hitters %>%
  setdiff(train)
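
As a quick sanity check (this step isn't in the original lab), we can confirm that each piece received roughly half of the observations:

# Each set should contain roughly half of the rows of Hitters
dim(train)
dim(test)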

Now, we apply regsubsets() to the training set in order to perform best subset selection*.

( *Note: If you’re trying to complete this lab on a machine that can’t handle calculating the best subset, or if you just want it to run a little faster, try forward or backward selection instead by adding the method = "forward" or method = "backward" parameter to your call to regsubsets(). You’ll get slightly different values, but the concepts are the same.)
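
For example, a forward-selection version of the call in the next chunk would look like the following (regfit_fwd_train is just an illustrative name; the rest of the lab assumes the exhaustive search):

# Forward stepwise selection instead of an exhaustive search (optional alternative)
regfit_fwd_train = regsubsets(Salary~., data = train, nvmax = 19, method = "forward")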

regfit_best_train = regsubsets(Salary~., data = train, nvmax = 19)

Notice that we pass data = train in the call so that the models are fit using only the training subset of the data. We now compute the validation set error for the best model of each model size. To do so, we first make a model matrix from the test data.

test_mat = model.matrix(Salary~., data = test)
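
If you're curious (this isn't required for the lab), you can peek at what model.matrix() built: it has one row per test observation and one column per predictor, plus an intercept column, with factors such as League and Division expanded into dummy variables like LeagueN and DivisionW:

# Inspect the dimensions and column names of the test model matrix
dim(test_mat)
head(colnames(test_mat))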

The model.matrix() function is used in many regression packages for building an \(X\) matrix from data. Now we run a loop, and for each size \(i\), we extract the coefficients from regfit_best_train for the best model of that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test MSE.
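
Concretely, for the best \(i\)-predictor model the quantity we store is \(\frac{1}{n_{test}}\sum_{j \in test}(y_j - \hat{y}_j)^2\), where the \(\hat{y}_j\) are the predictions formed by multiplying the test model matrix by the coefficients estimated on the training data.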

val_errors = rep(NA,19)

# Iterates over each size i
for(i in 1:19){
    
    # Extract the coefficient vector of the best model with i predictors
    coefi = coef(regfit_best_train, id = i)
    
    # Make predictions using matrix multiplication of the test matrix and the coefficient vector
    pred = test_mat[,names(coefi)]%*%coefi
    
    # Calculate the MSE
    val_errors[i] = mean((test$Salary-pred)^2)
}

Now let’s plot the errors and find the model size that minimizes them:

# Find the model with the smallest error
min = which.min(val_errors)

# Plot the errors for each model size
plot(val_errors, type = 'b')
points(min, val_errors[min][1], col = "red", cex = 2, pch = 20)

Voilà! We find that the best model (according to the validation set approach) is the one that contains 10 predictors.

This was a little tedious, partly because there is no predict() method for regsubsets(). Since we will be using this function again, we can capture our steps above and write our own predict() method:

predict.regsubsets = function(object,newdata,id,...){
      form = as.formula(object$call[[2]]) # Extract the formula used when we called regsubsets()
      mat = model.matrix(form,newdata)    # Build the model matrix
      coefi = coef(object,id=id)          # Extract the coefficients of the ith model
      xvars = names(coefi)                # Pull out the names of the predictors used in the ith model
      mat[,xvars]%*%coefi                 # Make predictions using matrix multiplication
}

This function pretty much mimics what we did above. The one tricky part is how we extracted the formula used in the call to regsubsets(), but you don’t need to worry too much about the mechanics of this right now. We’ll use this function to make our lives a little easier when we do cross-validation.
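
For example, with this method in place the prediction step from the loop above collapses to a single call (pred_10 is just an illustrative name; because of S3 dispatch, calling predict() on a regsubsets object now runs our predict.regsubsets()):

# Predictions from the best 10-predictor model, using our new method
pred_10 = predict(regfit_best_train, test, id = 10)

# This reproduces the validation MSE we computed above for size 10
mean((test$Salary - pred_10)^2)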

Now that we know what we’re looking for, let’s perform best subset selection on the full dataset (up to 10 predictors) and select the best 10-predictor model. It is important that we make use of the full data set in order to obtain more accurate coefficient estimates. Note that we perform best subset selection on the full data set and select the best 10-predictor model, rather than simply using the predictors that we obtained from the training set, because the best 10-predictor model on the full data set may differ from the corresponding model on the training set.

regfit_best = regsubsets(Salary~., data = Hitters, nvmax = 10)

In fact, we see that the best ten-variable model on the full data set has a different set of predictors than the best ten-variable model on the training set:

coef(regfit_best, 10)
##  (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns 
##  162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798    1.4082490 
##         CRBI       CWalks    DivisionW      PutOuts      Assists 
##    0.7743122   -0.8308264 -112.3800575    0.2973726    0.2831680
coef(regfit_best_train, 10)
##  (Intercept)         Hits        HmRun        Walks       CAtBat        CHits 
##  169.4733954   -1.6969274    8.3820633    4.0717043   -0.7313505    2.8789766 
##       CHmRun      LeagueN    DivisionW      PutOuts      Assists 
##    1.9483614   68.6924641 -127.5915486    0.1906435    0.5090893

Model selection using Cross-Validation

Now let’s try to choose among the models of different sizes using cross-validation. This approach is somewhat involved, as we must perform best subset selection* within each of the \(k\) training sets. Despite this, we see that with its clever subsetting syntax, R makes this job quite easy. First, we create a vector that assigns each observation to one of \(k = 10\) folds, and we create a matrix in which we will store the results:

* or forward selection / backward selection

k = 10        # number of folds
set.seed(1)   # set the random seed so we all get the same results

# Assign each observation to a single fold
folds = sample(1:k, nrow(Hitters), replace = TRUE)

# Create a matrix to store the results of our upcoming calculations
cv_errors = matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
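
Because the fold labels are drawn with replacement from 1:k, the folds will be roughly (not exactly) equal in size; a quick check, if you're curious:

# Count how many observations landed in each fold
table(folds)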

Now let’s write a for loop that performs cross-validation. In the \(j^{th}\) fold, the elements of folds that equal \(j\) are in the test set, and the remainder are in the training set. We make our predictions for each model size (using our new predict() method), compute the test errors on the appropriate subset, and store them in the appropriate slot in the matrix cv_errors.

# Outer loop iterates over all folds
for(j in 1:k){
    
    # Perform best subset selection on the full dataset, minus the jth fold
    best_fit = regsubsets(Salary~., data = Hitters[folds!=j,], nvmax=19)
    
    # Inner loop iterates over each size i
    for(i in 1:19){
        
        # Predict the values of the current fold from the "best subset" model on i predictors
        pred = predict(best_fit, Hitters[folds==j,], id=i)
        
        # Calculate the MSE, store it in the matrix we created above
        cv_errors[j,i] = mean((Hitters$Salary[folds==j]-pred)^2)
    }
}

This has filled up the cv_errors matrix such that the \((i,j)^{th}\) element corresponds to the test MSE for the \(i^{th}\) cross-validation fold for the best \(j\)-variable model. We can then use the apply() function to take the mean over the columns of this matrix. This will give us a vector for which the \(j^{th}\) element is the cross-validation error for the \(j\)-variable model.

# Take the mean over all folds for each model size
mean_cv_errors = apply(cv_errors, 2, mean)

# Find the model size with the smallest cross-validation error
min = which.min(mean_cv_errors)

# Plot the cross-validation error for each model size, highlight the min
plot(mean_cv_errors, type='b')
points(min, mean_cv_errors[min][1], col = "red", cex = 2, pch = 20)
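
If you also want a rough sense of how variable these estimates are from fold to fold (a small sketch, not part of the original lab), you can compute an approximate standard error for each model size:

# Approximate standard error of the CV error estimate for each model size
se_cv_errors = apply(cv_errors, 2, sd) / sqrt(k)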

We see that cross-validation selects an 11-predictor model. Now let’s use best subset selection on the full data set in order to obtain the 11-predictor model.

reg_best = regsubsets(Salary~., data = Hitters, nvmax = 19)
coef(reg_best, 11)
##  (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns 
##  135.7512195   -2.1277482    6.9236994    5.6202755   -0.1389914    1.4553310 
##         CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists 
##    0.7852528   -0.8228559   43.1116152 -111.1460252    0.2894087    0.2688277

For comparison, let’s also take a look at the model-selection statistics from the last lab:

par(mfrow=c(2,2))

reg_summary = summary(reg_best)

# Plot RSS
plot(reg_summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")

# Plot Adjusted R^2, highlight max value
plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
max = which.max(reg_summary$adjr2)
points(max, reg_summary$adjr2[max], col = "red", cex = 2, pch = 20)

# Plot Cp, highlight min value
plot(reg_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
min = which.min(reg_summary$cp)
points(min,reg_summary$cp[min], col = "red", cex = 2, pch = 20)

# Plot BIC, highlight min value
plot(reg_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
min = which.min(reg_summary$bic)
points(min, reg_summary$bic[min], col = "red", cex = 2, pch = 20)

Notice how some of the indicators agree with the cross-validated model, while others suggest a very different model size.

Your turn!

Now it’s time to test out these approaches (best / forward / backward selection) and evaluation methods (adjusted training error, validation set, cross validation) on other datasets. You may want to work with a team on this portion of the lab.

You may use any of the datasets included in the ISLR package, or choose one from the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets.html) or one from Kaggle. Download a dataset, and try to determine the optimal set of parameters to use to model it!

To get credit for this lab, please post your answers to the following questions to Moodle:

# best subset
regfit_full = regsubsets(Grad.Rate~., data = College, nvmax = 17)
reg_summary = summary(regfit_full)

# forward
regfit_fwd = regsubsets(Grad.Rate~., data=College, nvmax=17, method="forward")
regfit_fwd_summary = summary(regfit_fwd)

# backward
regfit_bwd = regsubsets(Grad.Rate~., data=College, nvmax=17, method="backward")
regfit_bwd_summary = summary(regfit_bwd)

The adjusted training error criteria (adjusted \(R^2\), \(C_p\), and BIC) for the three selection techniques point to similar sets of predictors, with the optimal number of predictors falling between 7 and 12. The validation set approach selects 8 predictors, and 10-fold cross-validation selects 7 predictors. All of this can be seen in the graphs below.

I decided to use best subset selection together with the adjusted training error criteria, since it is the approach that considers all possible models.

# best subset
# Set up a 2x2 grid so we can look at 4 plots at once
par(mfrow = c(2,2))
plot(reg_summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")

# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.
# The which.max() function can be used to identify the location of the maximum point of a vector
adj_r2_max = which.max(reg_summary$adjr2) # 11

# The points() command works like the plot() command, except that it puts points 
# on a plot that has already been created instead of creating a new plot
points(adj_r2_max, reg_summary$adjr2[adj_r2_max], col ="red", cex = 2, pch = 20)

# We'll do the same for C_p and BIC, this time looking for the models with the SMALLEST statistic
plot(reg_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
cp_min = which.min(reg_summary$cp) # 10
points(cp_min, reg_summary$cp[cp_min], col = "red", cex = 2, pch = 20)

plot(reg_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
bic_min = which.min(reg_summary$bic) # 6
points(bic_min, reg_summary$bic[bic_min], col = "red", cex = 2, pch = 20)

# forward
# Set up a 2x2 grid so we can look at 4 plots at once
par(mfrow = c(2,2))
plot(regfit_fwd_summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(regfit_fwd_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")

# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.
# The which.max() function can be used to identify the location of the maximum point of a vector
adj_r2_max = which.max(regfit_fwd_summary$adjr2) # 11

# The points() command works like the plot() command, except that it puts points 
# on a plot that has already been created instead of creating a new plot
points(adj_r2_max, regfit_fwd_summary$adjr2[adj_r2_max], col ="red", cex = 2, pch = 20)

# We'll do the same for C_p and BIC, this time looking for the models with the SMALLEST statistic
plot(regfit_fwd_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
cp_min = which.min(regfit_fwd_summary$cp) # 10
points(cp_min, regfit_fwd_summary$cp[cp_min], col = "red", cex = 2, pch = 20)

plot(regfit_fwd_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
bic_min = which.min(regfit_fwd_summary$bic) # 6
points(bic_min, regfit_fwd_summary$bic[bic_min], col = "red", cex = 2, pch = 20)

# backward
# Set up a 2x2 grid so we can look at 4 plots at once
par(mfrow = c(2,2))
plot(regfit_bwd_summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(regfit_bwd_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")

# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.
# The which.max() function can be used to identify the location of the maximum point of a vector
adj_r2_max = which.max(regfit_bwd_summary$adjr2) # 11

# The points() command works like the plot() command, except that it puts points 
# on a plot that has already been created instead of creating a new plot
points(adj_r2_max, regfit_bwd_summary$adjr2[adj_r2_max], col ="red", cex = 2, pch = 20)

# We'll do the same for C_p and BIC, this time looking for the models with the SMALLEST statistic
plot(regfit_bwd_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
cp_min = which.min(regfit_bwd_summary$cp) # 10
points(cp_min, regfit_bwd_summary$cp[cp_min], col = "red", cex = 2, pch = 20)

plot(regfit_bwd_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
bic_min = which.min(regfit_bwd_summary$bic) # 6
points(bic_min, regfit_bwd_summary$bic[bic_min], col = "red", cex = 2, pch = 20)

# validation set
set.seed(1)

train = College %>%
  sample_frac(0.5)

test = College %>%
  setdiff(train)

regfit_best_train = regsubsets(Grad.Rate~., data = train, nvmax = 17)
test_mat = model.matrix(Grad.Rate~., data = test)

val_errors = rep(NA,17)

# Iterates over each size i
for(i in 1:17){
    
    # Extract the coefficient vector of the best model with i predictors
    coefi = coef(regfit_best_train, id = i)
    
    # Make predictions using matrix multiplication of the test matrix and the coefficient vector
    pred = test_mat[,names(coefi)]%*%coefi
    
    # Calculate the MSE
    val_errors[i] = mean((test$Grad.Rate-pred)^2)
}

# Find the model with the smallest error
min = which.min(val_errors)

# Plot the errors for each model size
plot(val_errors, type = 'b')
points(min, val_errors[min][1], col = "red", cex = 2, pch = 20)

k = 10        # number of folds
set.seed(1)   # set the random seed so we all get the same results

# Assign each observation to a single fold
folds = sample(1:k, nrow(College), replace = TRUE)

# Create a matrix to store the results of our upcoming calculations
cv_errors = matrix(NA, k, 17, dimnames = list(NULL, paste(1:17)))

# Outer loop iterates over all folds
for(j in 1:k){
    
    # Perform best subset selection on the full dataset, minus the jth fold
    best_fit = regsubsets(Grad.Rate~., data = College[folds!=j,], nvmax=17)
    
    # Inner loop iterates over each size i
    for(i in 1:17){
        
        # Predict the values of the current fold from the "best subset" model on i predictors
        pred = predict(best_fit, College[folds==j,], id=i)
        
        # Calculate the MSE, store it in the matrix we created above
        cv_errors[j,i] = mean((College$Grad.Rate[folds==j]-pred)^2)
    }
}

# Take the mean over all folds for each model size
mean_cv_errors = apply(cv_errors, 2, mean)

# Find the model size with the smallest cross-validation error
min = which.min(mean_cv_errors)

# Plot the cross-validation error for each model size, highlight the min
plot(mean_cv_errors, type='b')
points(min, mean_cv_errors[min][1], col = "red", cex = 2, pch = 20)

Thus, I choose the seven-variable model selected by the best subset method (see the table below). The seven predictors are Apps, Top25perc, P.Undergrad, Outstate, Room.Board, perc.alumni, and Expend. In particular, Apps is the number of applications received; Top25perc is the percentage of new students from the top 25% of their high school class; P.Undergrad is the number of part-time undergraduates; Outstate is out-of-state tuition; Room.Board is room and board costs; perc.alumni is the percentage of alumni who donate; and Expend is instructional expenditure per student. The response is Grad.Rate, the graduation rate.

# best subset
regfit_full = regsubsets(Grad.Rate~., data = College, nvmax = 17)
summary(regfit_full)
## Subset selection object
## Call: regsubsets.formula(Grad.Rate ~ ., data = College, nvmax = 17)
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Outstate        FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: exhaustive
##           PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 )  " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 )  " "        " "  " "    " "    " "       "*"       " "        
## 3  ( 1 )  " "        " "  " "    " "    " "       "*"       " "        
## 4  ( 1 )  " "        " "  " "    " "    " "       "*"       " "        
## 5  ( 1 )  " "        "*"  " "    " "    " "       "*"       " "        
## 6  ( 1 )  " "        "*"  " "    " "    " "       "*"       " "        
## 7  ( 1 )  " "        "*"  " "    " "    " "       "*"       " "        
## 8  ( 1 )  " "        "*"  " "    " "    " "       "*"       " "        
## 9  ( 1 )  "*"        "*"  " "    " "    " "       "*"       " "        
## 10  ( 1 ) "*"        "*"  " "    " "    "*"       "*"       " "        
## 11  ( 1 ) "*"        "*"  " "    " "    " "       "*"       " "        
## 12  ( 1 ) "*"        "*"  "*"    " "    " "       "*"       " "        
## 13  ( 1 ) "*"        "*"  " "    " "    "*"       "*"       "*"        
## 14  ( 1 ) "*"        "*"  " "    " "    "*"       "*"       "*"        
## 15  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
## 16  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
## 17  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
##           P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 )  " "         "*"      " "        " "   " "      " " " "      " "      
## 2  ( 1 )  " "         "*"      " "        " "   " "      " " " "      " "      
## 3  ( 1 )  " "         "*"      " "        " "   " "      " " " "      " "      
## 4  ( 1 )  "*"         "*"      " "        " "   " "      " " " "      " "      
## 5  ( 1 )  "*"         "*"      " "        " "   " "      " " " "      " "      
## 6  ( 1 )  "*"         "*"      "*"        " "   " "      " " " "      " "      
## 7  ( 1 )  "*"         "*"      "*"        " "   " "      " " " "      " "      
## 8  ( 1 )  "*"         "*"      "*"        " "   "*"      " " " "      " "      
## 9  ( 1 )  "*"         "*"      "*"        " "   "*"      " " " "      " "      
## 10  ( 1 ) "*"         "*"      "*"        " "   "*"      " " " "      " "      
## 11  ( 1 ) "*"         "*"      "*"        " "   "*"      "*" "*"      " "      
## 12  ( 1 ) "*"         "*"      "*"        " "   "*"      "*" "*"      " "      
## 13  ( 1 ) "*"         "*"      "*"        " "   "*"      "*" "*"      " "      
## 14  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" "*"      " "      
## 15  ( 1 ) "*"         "*"      "*"        " "   "*"      "*" "*"      " "      
## 16  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" "*"      " "      
## 17  ( 1 ) "*"         "*"      "*"        "*"   "*"      "*" "*"      "*"      
##           perc.alumni Expend
## 1  ( 1 )  " "         " "   
## 2  ( 1 )  " "         " "   
## 3  ( 1 )  "*"         " "   
## 4  ( 1 )  "*"         " "   
## 5  ( 1 )  "*"         " "   
## 6  ( 1 )  "*"         " "   
## 7  ( 1 )  "*"         "*"   
## 8  ( 1 )  "*"         "*"   
## 9  ( 1 )  "*"         "*"   
## 10  ( 1 ) "*"         "*"   
## 11  ( 1 ) "*"         "*"   
## 12  ( 1 ) "*"         "*"   
## 13  ( 1 ) "*"         "*"   
## 14  ( 1 ) "*"         "*"   
## 15  ( 1 ) "*"         "*"   
## 16  ( 1 ) "*"         "*"   
## 17  ( 1 ) "*"         "*"
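
Row 7 of the table above marks exactly these seven predictors. To pull out the corresponding coefficient estimates, coef() works just as it did earlier in the lab:

# Coefficient estimates for the chosen seven-predictor model
coef(regfit_full, 7)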

Below is the linear model using all seven variables to predict Grad.Rate. All predictors are statistically significant, and the adjusted R-squared is about 0.45, meaning the model explains roughly 45% of the variance in graduation rate on the training data.

# how well the model fits?
lm_fit = lm(Grad.Rate~Apps+Top25perc+P.Undergrad+Outstate+Room.Board+perc.alumni+Expend, 
              data = train)
summary(lm_fit)
## 
## Call:
## lm(formula = Grad.Rate ~ Apps + Top25perc + P.Undergrad + Outstate + 
##     Room.Board + perc.alumni + Expend, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.874  -7.585  -1.084   7.009  52.749 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.7627912  3.0465437  10.098  < 2e-16 ***
## Apps         0.0007058  0.0001936   3.645 0.000304 ***
## Top25perc    0.1743101  0.0433546   4.021 7.00e-05 ***
## P.Undergrad -0.0021934  0.0005547  -3.954 9.16e-05 ***
## Outstate     0.0012345  0.0002826   4.368 1.62e-05 ***
## Room.Board   0.0019927  0.0008304   2.400 0.016889 *  
## perc.alumni  0.2823510  0.0693514   4.071 5.69e-05 ***
## Expend      -0.0003546  0.0001607  -2.206 0.028001 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.88 on 380 degrees of freedom
## Multiple R-squared:  0.4607, Adjusted R-squared:  0.4508 
## F-statistic: 46.37 on 7 and 380 DF,  p-value: < 2.2e-16
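
As a further check that isn't required by the lab, you could also see how this seven-variable model does on the held-out test set created earlier (lm_pred is just an illustrative name):

# Test MSE for the seven-predictor linear model
lm_pred = predict(lm_fit, newdata = test)
mean((test$Grad.Rate - lm_pred)^2)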
