library(tidyverse) # data manipulation and visualization
library(leaps) # model selection functions
library(ISLR2) # Hitters data
data("Hitters")
hitters <- na.omit(Hitters) # remove rows with missing values (Salary is NA for some players)
We'll use the leaps package to illustrate subset selection methods.
choose(20, 1)  # number of models containing exactly 1 of 20 predictors
choose(20, 2)  # number of models containing exactly 2 of 20 predictors
choose(20, 10) # number of models containing exactly 10 of 20 predictors
You can build your own for loop to keep track of the largest \(R^2\). The leaps package provides the regsubsets() function, which identifies the best model for each number of predictors, where best is quantified using RSS.
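As a minimal sketch of that do-it-yourself approach (illustrative only; regsubsets() below handles every model size at once), here is one way to find the best single-predictor model by looping over the candidates and tracking \(R^2\):
# fit each 1-predictor model for Salary and keep the one with the
# largest R^2 (equivalently, the smallest RSS for a fixed model size)
predictors <- setdiff(names(hitters), "Salary")
r2 <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "Salary"), data = hitters)
  summary(fit)$r.squared
})
predictors[which.max(r2)]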
best_subset <- regsubsets(Salary ~ ., hitters, nvmax = 19)
summary(best_subset)
The best 1-variable model is Salary ~ CRBI; the best 2-variable model is Salary ~ Hits + CRBI.
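We can confirm this by extracting the coefficients of those two models:
coef(best_subset, 1) # intercept and CRBI
coef(best_subset, 2) # intercept, Hits, and CRBI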
The training set MSE is generally an underestimate of the test MSE. Therefore, the training RSS and training \(R^2\) cannot be used to select from among a set of models with different numbers of variables.
set.seed(1)
sample <- sample(c(TRUE, FALSE), nrow(hitters), replace = TRUE, prob = c(0.6, 0.4)) # ~60/40 train/test split
train <- hitters[sample, ]
test <- hitters[!sample, ]
# perform best subset selection
best_subset <- regsubsets(Salary ~ ., train, nvmax = 19)
results <- summary(best_subset)
# extract and plot results
data.frame(predictors = 1:19,
           adj_R2 = results$adjr2,
           BIC = results$bic,
           Cp = results$cp) %>%
  pivot_longer(c(adj_R2, Cp, BIC), names_to = "statistic", values_to = "value") %>%
  ggplot(aes(predictors, value)) +
  geom_line() +
  facet_wrap(~ statistic, scales = "free")
Here we see that the three statistics identify slightly different models as the best. The adjusted \(R^2\) suggests the 10-variable model, BIC suggests the 4-variable model, and Cp suggests the 8-variable model.
which.max(results$adjr2)
which.min(results$bic)
which.min(results$cp)
We can compare the variables and coefficients that these models include using the coef() function.
coef(best_subset, 10)
coef(best_subset, 4)
coef(best_subset, 8)
Because exhaustive best subset selection is computationally intensive, we can use forward and backward stepwise selection instead.
## Stepwise selection
- Forward stepwise selection begins with a model containing no predictors and then adds predictors one at a time until all of the predictors are in the model. At each step, the variable that gives the greatest additional improvement to the fit is added.
- Backward stepwise selection works the other way around: it starts with the full model and removes the least useful predictor one at a time.
forward <- regsubsets(Salary ~ ., train, nvmax = 19, method = "forward")
backward <- regsubsets(Salary ~ ., train, nvmax = 19, method = "backward")
which.min(summary(forward)$cp)
which.min(summary(backward)$cp)
However, when we inspect these models we see that the three 8-variable models include different predictors. Although all of them include AtBat, Hits, Walks, CWalks, and PutOuts, each model also contains some unique variables.
coef(best_subset, 8)
coef(forward, 8)
coef(backward, 8)
We can directly estimate the test error using the validation set and cross-validation approaches. We now compute the validation set error for the best model of each model size. We first make a model matrix from the test data; the model.matrix() function is used in many regression packages for building an "X" matrix from data.
test_m <- model.matrix(Salary ~ ., data = test)
# create empty vector to fill with error values
validation_errors <- vector("double", length = 19)
for(i in 1:19) {
  coef_x <- coef(best_subset, id = i)                     # extract coefficients for model size i
  pred_x <- test_m[, names(coef_x)] %*% coef_x            # predict salary using matrix algebra
  validation_errors[i] <- mean((test$Salary - pred_x)^2)  # test MSE between actual and predicted salary
}
# plot validation errors
plot(validation_errors, type = "b")
The 1-variable model produced by the best subset approach gives the lowest test MSE!
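We can confirm which model size minimizes the validation error:
which.min(validation_errors)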
The least squares fitting problem: \(RSS = \sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij})^2\)
The ridge regression problem: \(RSS+\lambda\sum_{j=1}^p\beta_j^2\). The lasso problem: \(RSS+\lambda\sum_{j=1}^p|\beta_j|\).
The shrinkage penalty is small when the coefficients are close to zero.
We will use the glmnet package in order to perform ridge regression and the lasso.
library(glmnet)
We must pass in an \(x\) matrix as well as a \(y\) vector, and we do not use the y ~ x formula syntax. The model.matrix() function not only produces a matrix corresponding to the 19 predictors but also automatically transforms any qualitative variables into dummy variables.
x <- model.matrix(Salary ~ .,na.omit(Hitters))[, -1]
y <- na.omit(Hitters)$Salary
The glmnet() function has an alpha argument that determines what type of model is fit: alpha = 0 fits a ridge regression model and alpha = 1 fits a lasso model.
rid.mod <- glmnet(x, y, alpha = 0, lambda = 0) # ridge with lambda = 0 is essentially least squares
coef(rid.mod)
grid <- 10^seq(10, -2, length = 100) # lambda grid from 10^10 down to 10^-2
rid.mod <- glmnet(x, y, alpha = 0, lambda = grid)
cat(rid.mod$lambda[1], coef(rid.mod)[, 1]) # largest lambda and its heavily shrunken coefficients
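For comparison, here is a quick sketch (not in the original output) of the coefficients at the smallest λ in the grid, where there is almost no shrinkage:
cat(rid.mod$lambda[100], coef(rid.mod)[, 100]) # lambda = 0.01: estimates are much larger in magnitude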
We can use the predict() function for a number of purposes. For instance, we can obtain the ridge regression coefficients for a new value of λ, say 50:
predict(rid.mod , s = 50, type = "coefficients")[1:20, ]
We now split the samples into a training set and a test set in order to estimate the test error of ridge regression and the lasso.
set.seed(1)
train <- sample(1:nrow(x), nrow(x) / 2)
test <- -train
y.test <- y[test]
ridge.mod <- glmnet(x[train , ], y[train], alpha = 0,lambda = grid, thresh = 1e-12)
ridge.pred <- predict(ridge.mod , s = 4, newx = x[test , ])
mean((ridge.pred - y.test)^2)
The test MSE is 142,199.
We now check whether there is any benefit to performing ridge regression with λ = 4 instead of just performing least squares regression.
ridge.pred <- predict(ridge.mod, s = 0, newx = x[test, ], exact = T, x = x[train, ], y = y[train]) # s = 0 corresponds to least squares
mean((ridge.pred - y.test)^2)
lasso.mod <- glmnet(x[train , ], y[train], alpha = 1, lambda = grid)
plot(lasso.mod)
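The plot shows the lasso coefficient paths as the penalty varies. As a minimal sketch of a natural next step (not shown in the original code), we can use cross-validation on the training set to choose λ for the lasso and then estimate its test MSE:
set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1) # 10-fold CV over a default lambda sequence
bestlam <- cv.out$lambda.min                         # lambda with the smallest CV error
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test, ])
mean((lasso.pred - y.test)^2)                        # test MSE at the chosen lambda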