1. Provide a detailed explanation of the algorithm that is used to fit a regression tree.
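A regression tree is grown by recursive binary splitting, a top-down, greedy procedure. Starting with all observations in a single region, at each step we consider every predictor X_j and every cutpoint s, and choose the pair (j, s) whose split into the half-planes {X | X_j < s} and {X | X_j ≥ s} gives the greatest reduction in RSS, i.e. minimizes the sum of squared deviations of each response from the mean response of its half. The splitting is then repeated within each resulting region, stopping once some criterion is met (for example, no region contains more than a minimum number of observations). A test observation is predicted with the mean of the training responses in the terminal node it falls into. Because the fully grown tree tends to overfit, it is usually followed by cost-complexity pruning: for a grid of values of α, we find the subtree minimizing RSS + α|T| (|T| being the number of terminal nodes) and choose α by cross-validation. For illustration, a minimal sketch fitting and plotting a single regression tree with the tree package (the package choice is an assumption; the rest of this document uses gbm):

library(tree)
library(ISLR)
# grow a regression tree for log Salary by recursive binary splitting
tree.hitters = tree(log(Salary) ~ ., data = na.omit(Hitters))
summary(tree.hitters)   # splits chosen to minimize RSS; leaves predict region means
plot(tree.hitters)
text(tree.hitters, pretty = 0)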
2. We now use boosting to predict “Salary” in the “Hitters” data set.
   a. Remove the observations for whom the salary information is unknown, and then log-transform the salaries.
library(ISLR)
Hitters = na.omit(Hitters)            # drop the 59 players with missing Salary
Hitters$Salary = log(Hitters$Salary)  # log-transform the response
   b. Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.
train = 1:200                     # first 200 observations for training
Hitters.train = Hitters[train, ]
Hitters.test = Hitters[-train, ]  # remaining 63 observations for testing
   c. Perform boosting on the training set with 1000 trees for a range of values of the shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.
library(gbm)
## Warning: package 'gbm' was built under R version 3.6.2
## Loaded gbm 2.1.8
set.seed(1)
pows = seq(-10, -0.2, by = 0.1)
lambdas = 10^pows                       # shrinkage grid from 1e-10 up to about 0.63
train.err = rep(NA, length(lambdas))
for (i in 1:length(lambdas)) {
    # fit 1000 boosted regression trees at each shrinkage value
    boost.hitters = gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian", n.trees = 1000, shrinkage = lambdas[i])
    pred.train = predict(boost.hitters, Hitters.train, n.trees = 1000)
    train.err[i] = mean((pred.train - Hitters.train$Salary)^2)  # training MSE
}
plot(lambdas, train.err, type = "b", xlab = "Shrinkage values", ylab = "Training MSE")

   d. Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.
set.seed(1)
test.err = rep(NA, length(lambdas))
for (i in 1:length(lambdas)) {
    # refit at each shrinkage value and evaluate on the held-out observations
    boost.hitters = gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian", n.trees = 1000, shrinkage = lambdas[i])
    yhat = predict(boost.hitters, Hitters.test, n.trees = 1000)
    test.err[i] = mean((yhat - Hitters.test$Salary)^2)  # test MSE
}
plot(lambdas, test.err, type = "b", xlab = "Shrinkage values", ylab = "Test MSE")

min(test.err)
## [1] 0.2540265
lambdas[which.min(test.err)]
## [1] 0.07943282
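The test MSE reaches its minimum of about 0.254 at λ ≈ 0.079. Rather than fixing the ensemble at 1000 trees, gbm can also select the number of trees by cross-validation via its cv.folds argument; a minimal sketch, assuming 5-fold CV at a shrinkage near the best value found above:

set.seed(1)
boost.cv = gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian", n.trees = 1000, shrinkage = 0.08, cv.folds = 5)
best.iter = gbm.perf(boost.cv, method = "cv")  # CV-estimated optimal number of trees
yhat.cv = predict(boost.cv, Hitters.test, n.trees = best.iter)
mean((yhat.cv - Hitters.test$Salary)^2)        # test MSE at the CV-chosen ensemble size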
   e. Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.
# Chapter 3 approach: ordinary least squares
fit1 = lm(Salary ~ ., data = Hitters.train)
pred1 = predict(fit1, Hitters.test)
mean((pred1 - Hitters.test$Salary)^2)
## [1] 0.4917959
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.2
## Loading required package: Matrix
## Loaded glmnet 4.1-1
# Chapter 6 approach: ridge regression (alpha = 0)
x = model.matrix(Salary ~ ., data = Hitters.train)
x.test = model.matrix(Salary ~ ., data = Hitters.test)
y = Hitters.train$Salary
fit2 = glmnet(x, y, alpha = 0)
pred2 = predict(fit2, s = 0.01, newx = x.test)  # penalty s = 0.01 fixed by hand
mean((pred2 - Hitters.test$Salary)^2)
## [1] 0.4570283

The test MSE for boosting (0.254) is lower than the test MSE for both linear regression (0.492) and ridge regression (0.457).
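The ridge penalty s = 0.01 above was fixed by hand; it could instead be chosen by cross-validation. A minimal sketch using cv.glmnet (10-fold CV by default; the seed is an assumption):

set.seed(1)
cv.ridge = cv.glmnet(x, y, alpha = 0)                         # CV over a grid of penalties
pred.cv = predict(cv.ridge, s = "lambda.min", newx = x.test)  # penalty minimizing CV error
mean((pred.cv - Hitters.test$Salary)^2)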

   f. Which variables appear to be the most important predictors in the boosted model?
# refit at the best shrinkage value found above; summary() reports relative influence
boost.hitters = gbm(Salary ~ ., data = Hitters.train, distribution = "gaussian", n.trees = 1000, shrinkage = lambdas[which.min(test.err)])
summary(boost.hitters)

##                 var    rel.inf
## CAtBat       CAtBat 20.8404970
## CRBI           CRBI 12.3158959
## Walks         Walks  7.4186037
## PutOuts     PutOuts  7.1958539
## Years         Years  6.3104535
## CWalks       CWalks  6.0221656
## CHmRun       CHmRun  5.7759763
## CHits         CHits  4.8914360
## AtBat         AtBat  4.2187460
## RBI             RBI  4.0812410
## Hits           Hits  4.0117255
## Assists     Assists  3.8786634
## HmRun         HmRun  3.6386178
## CRuns         CRuns  3.3230296
## Errors       Errors  2.6369128
## Runs           Runs  2.2048386
## Division   Division  0.5347342
## NewLeague NewLeague  0.4943540
## League       League  0.2062551

“CAtBat” (career at-bats) is by far the most important variable, with a relative influence of about 20.8%, followed by “CRBI” at about 12.3%.
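To see how the dominant predictor relates to (log) salary, one could follow up with a partial dependence plot; a minimal sketch using gbm's plot method:

plot(boost.hitters, i.var = "CAtBat")  # partial dependence of log Salary on CAtBat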