This question relates to the College data set.
Part A: Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
library(ISLR)
library(leaps)
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20
library(gbm)
## Loaded gbm 2.1.8
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-1
library(randomForest)
## randomForest 4.6-14
set.seed(12345)
# Draw a random 70% of the row indices (without replacement) for training
train <- sample(nrow(College), floor(nrow(College) * 0.7))
train_set <- College[train, ]
test_set <- College[-train, ]
# Forward stepwise selection, allowing models of up to all available predictors
forward_subset <- regsubsets(Outstate ~ ., data = train_set, nvmax = ncol(College) - 1, method = "forward")
model_summary <- summary(forward_subset)
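# Plot a selection criterion against model size and mark the smallest model
# whose value lies within a tolerance of the best value.
plot_metric <- function(metric, yaxis_label, reverse = FALSE) {
  plot(metric, xlab = "Number of Variables", ylab = yaxis_label, xaxt = "n", type = "l")
  axis(side = 1, at = seq_along(metric))
  # Tolerance: sd of the metric across model sizes divided by sqrt(n), a
  # heuristic in the spirit of the one-standard-error rule.
  tolerance <- sd(metric) / sqrt(length(metric))
  if (reverse) {
    # Higher is better (e.g. adjusted R2)
    metric_1se <- max(metric) - tolerance
    min_subset <- which(metric > metric_1se)
  } else {
    # Lower is better (e.g. Cp, BIC)
    metric_1se <- min(metric) + tolerance
    min_subset <- which(metric < metric_1se)
  }
  abline(h = metric_1se, col = "red", lty = 2)       # tolerance threshold
  abline(v = min_subset[1], col = "green", lty = 2)  # smallest qualifying model size
}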
par(mfrow=c(1, 3))
plot_metric(model_summary$cp, "Cp")
plot_metric(model_summary$bic, "BIC")
plot_metric(model_summary$adjr2, "Adjusted R2", reverse = TRUE) # higher values are better
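As a numeric cross-check on the three plots, the model size that optimises each criterion can be read directly from the `regsubsets` summary. The sketch below is self-contained, so it rebuilds a training split of its own; the random 70% split and the names `fit`, `s`, and `best_size` are assumptions made for illustration, and the sizes it reports need not match the plotted ones.

```r
library(ISLR)   # College data
library(leaps)  # regsubsets()

set.seed(12345)
# Assumed split: a random 70% of rows for training
train <- sample(nrow(College), floor(0.7 * nrow(College)))
train_set <- College[train, ]

fit <- regsubsets(Outstate ~ ., data = train_set,
                  nvmax = ncol(College) - 1, method = "forward")
s <- summary(fit)

# Model size that optimises each criterion
# (Cp and BIC: lower is better; adjusted R2: higher is better)
best_size <- c(Cp    = which.min(s$cp),
               BIC   = which.min(s$bic),
               AdjR2 = which.max(s$adjr2))
print(best_size)
```

The strict optimum often favours a larger model than the tolerance-based rule used in `plot_metric`, which deliberately picks the smallest model that is close to the best.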

help("College")
Predictors in the six-variable model chosen by forward selection:
coef(forward_subset, 6)
## (Intercept) PrivateYes Room.Board PhD perc.alumni
## -3769.0587788 2748.6944010 0.8999634 38.5143460 44.4889713
## Expend Grad.Rate
## 0.2543900 31.2043096
Part B: Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
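One possible way to approach Part B is sketched below, using the six predictors from the `coef(forward_subset, 6)` output. It is not a definitive solution: the random 70% split, the choice of smoothing splines with `df = 4` for every numeric predictor, and the name `gam_fit` are all assumptions made for illustration.

```r
library(ISLR)  # College data
library(gam)   # gam(), s()

set.seed(12345)
# Assumed split: a random 70% of rows for training
train <- sample(nrow(College), floor(0.7 * nrow(College)))
train_set <- College[train, ]

# Smoothing splines with 4 degrees of freedom on each numeric predictor
# (df = 4 is an illustrative assumption); Private enters as a factor
gam_fit <- gam(Outstate ~ Private + s(Room.Board, df = 4) + s(PhD, df = 4) +
                 s(perc.alumni, df = 4) + s(Expend, df = 4) + s(Grad.Rate, df = 4),
               data = train_set)

# One panel per term, with standard-error bands
par(mfrow = c(2, 3))
plot(gam_fit, se = TRUE, col = "blue")
```

In the resulting panels, a fitted curve that is close to a straight line suggests a roughly linear effect for that predictor, while visible curvature hints at a non-linear one.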
Part C: Evaluate the model obtained on the test set, and explain the results obtained.
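A minimal sketch of the test-set evaluation for Part C follows. It refits the same illustrative GAM so that it runs on its own; the split, the `df = 4` smoothers, and the names `pred`, `test_mse`, and `test_r2` are assumptions, and no particular numeric result is claimed here.

```r
library(ISLR)  # College data
library(gam)   # gam(), s()

set.seed(12345)
# Assumed split: a random 70% of rows for training, the rest for testing
train <- sample(nrow(College), floor(0.7 * nrow(College)))
train_set <- College[train, ]
test_set <- College[-train, ]

# Illustrative GAM on the six selected predictors (df = 4 is an assumption)
gam_fit <- gam(Outstate ~ Private + s(Room.Board, df = 4) + s(PhD, df = 4) +
                 s(perc.alumni, df = 4) + s(Expend, df = 4) + s(Grad.Rate, df = 4),
               data = train_set)

# Predict on held-out data and summarise the error
pred <- predict(gam_fit, newdata = test_set)
test_mse <- mean((test_set$Outstate - pred)^2)

# Test R2: share of test-set variance in Outstate explained by the model
test_tss <- sum((test_set$Outstate - mean(test_set$Outstate))^2)
test_r2 <- 1 - sum((test_set$Outstate - pred)^2) / test_tss

cat("Test MSE:", test_mse, "\n")
cat("Test R2: ", test_r2, "\n")
```

Comparing these figures with the same quantities for a plain linear model on the six predictors would show whether the smooth terms improve out-of-sample accuracy.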
Part D: For which variables, if any, is there evidence of a non-linear relationship with the response?
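For Part D, one common approach is to inspect the "Anova for Nonparametric Effects" table printed by `summary()` on the fitted GAM, and to compare nested models in which a single smooth term is replaced by a linear one. The sketch below illustrates this for `Expend` only; the split, the `df = 4` smoothers, the choice of `Expend`, and the names `gam_lin` and `cmp` are assumptions made for illustration.

```r
library(ISLR)  # College data
library(gam)   # gam(), s()

set.seed(12345)
# Assumed split: a random 70% of rows for training
train <- sample(nrow(College), floor(0.7 * nrow(College)))
train_set <- College[train, ]

# Full illustrative GAM: every numeric predictor gets a smooth term
gam_fit <- gam(Outstate ~ Private + s(Room.Board, df = 4) + s(PhD, df = 4) +
                 s(perc.alumni, df = 4) + s(Expend, df = 4) + s(Grad.Rate, df = 4),
               data = train_set)

# Same model, but with Expend entering linearly
gam_lin <- gam(Outstate ~ Private + s(Room.Board, df = 4) + s(PhD, df = 4) +
                 s(perc.alumni, df = 4) + Expend + s(Grad.Rate, df = 4),
               data = train_set)

# Per-term tests; a small p-value in the nonparametric table is
# evidence against a purely linear effect for that term
print(summary(gam_fit))

# F-test of the linear-Expend model against the smooth-Expend model
cmp <- anova(gam_lin, gam_fit, test = "F")
print(cmp)
```

Repeating the nested comparison for each numeric predictor in turn gives a per-variable answer to the question.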