Q6) Provide a detailed explanation of the algorithm that is used to fit a regression tree.

Regression Trees: Fitting a regression tree involves roughly two steps. First, the predictor space, that is, the set of possible values for the predictors X1, ..., Xp, is divided into J distinct and non-overlapping rectangular regions R1, ..., RJ. Second, for every observation that falls into region Rj we make the same prediction, namely the mean of the response values of the training observations in Rj. Because it is computationally infeasible to consider every possible partition of the feature space into J boxes, a top-down, greedy approach called recursive binary splitting is used. It begins at the top of the tree and, at each step, makes the best split available at that point, without looking ahead to future splits or searching for the globally optimal tree. The goal of each split is to reduce the residual sum of squares (RSS): the algorithm considers all predictors and all possible cutpoints, and chooses the predictor and cutpoint whose split yields the lowest RSS, as in the sketch below. The same search is then repeated within each of the resulting regions until a stopping criterion is met, for example until no region contains more than a handful of observations.
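As a minimal illustration (not part of the original answer), the sketch below carries out a single greedy split: it scans every predictor and every candidate cutpoint, computes the RSS of the two resulting regions, and keeps the pair that minimizes it. The helper name best_split and the predictors used in the commented example are assumptions made purely for illustration.

# Hypothetical sketch of one greedy step of recursive binary splitting:
# for every predictor j and cutpoint s, compute the RSS of the two
# resulting regions and keep the (j, s) pair with the smallest RSS.
best_split = function(X, y) {
  best = list(rss = Inf, var = NA, cut = NA)
  for (j in seq_along(X)) {
    for (s in sort(unique(X[[j]]))) {
      left  = y[X[[j]] <  s]
      right = y[X[[j]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      rss = sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best = list(rss = rss, var = names(X)[j], cut = s)
    }
  }
  best
}

# Example (assumed) use on a few numeric predictors from Hitters:
# hit = na.omit(ISLR::Hitters)
# best_split(hit[, c("CAtBat", "Hits", "Years")], log(hit$Salary))

In an actual tree fit this search is applied repeatedly, each time within one of the regions created by the previous splits, which is what makes the procedure recursive.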

Q10) We now use boosting to predict Salary in the Hitters data set.

a) Remove the observations for whom the salary information is unknown, and then log-transform the salaries.

data("Hitters")

Hitters = Hitters %>% 
  na.omit() %>% 
  mutate(Salary = log(Salary))
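As a quick sanity check (an addition, not in the original), the dimensions and the absence of missing salaries can be confirmed:

dim(Hitters)

sum(is.na(Hitters$Salary))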

(b) Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.

train = Hitters[1:200, ]

test = Hitters[-c(1:200), ]

(c) Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.

library(gbm)

set.seed(123)

pow = seq(-2,0,0.1)

lambdas = 10^pow

train.error = rep(NA, length(lambdas))

for (i in 1:length(lambdas)) {
  
  model = gbm(Salary~., data = train,
                        distribution = "gaussian",
                        n.trees = 1000,
                        shrinkage = lambdas[i])
  
  # compute training-set predictions and MSE
  model.preds = predict(model, train, n.trees = 1000)
  train.error[i] = mean((model.preds - train$Salary)^2)
    
}

# plotting train error against Lambdas
plot(lambdas, train.error, type="b", xlab = "Shrinkage", ylab = "Train MSE")

(d) Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.

set.seed(123)

test.error = rep(NA, length(lambdas))

for (i in 1:length(lambdas)) {
  
  model = gbm(Salary~., data = train,
                        distribution = "gaussian",
                        n.trees = 1000,
                        shrinkage = lambdas[i])
  
  # compute test-set predictions and MSE
  model.preds = predict(model, test, n.trees = 1000)
  test.error[i] = mean((model.preds - test$Salary)^2)
  
}


plot(lambdas, test.error, type = "b", xlab = "Shrinkage", ylab = "Test MSE")

boost.model.test.err = min(test.error)

boost.model.test.err
## [1] 0.2515519

The minimum test MSE obtained by boosting is 0.2515519.
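For reference (a small addition, not in the original), the shrinkage value that attains this minimum can be recovered from the same grid; part (f) reuses it below.

lambdas[which.min(test.error)]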

(e) Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.

Linear Regression

lm.model = lm(Salary~., train)
lm.preds = predict(lm.model, test)
lm.test.err = mean((lm.preds - test$Salary)^2)
plt.err = data.frame(Linear_Model = lm.test.err)
lm.test.err
## [1] 0.4917959

Lasso Regression

library(glmnet)

# model matrices for the training and test sets
train.model.mat = model.matrix(Salary~., train)

test.model.mat = model.matrix(Salary~., test)

y = train$Salary

lasso.model = glmnet(train.model.mat, y, alpha = 1)

lasso.preds = predict(lasso.model, s = 0.01, newx = test.model.mat)

lasso.test.err = mean((lasso.preds - test$Salary)^2)

plt.err$Lasso_Model = lasso.test.err

lasso.test.err
## [1] 0.4700537
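As a hedged alternative (not in the original answer), the penalty could be chosen by cross-validation with cv.glmnet rather than fixed at s = 0.01; whether this improves on the error above would need to be checked on this split.

cv.lasso = cv.glmnet(train.model.mat, y, alpha = 1)

cv.preds = predict(cv.lasso, newx = test.model.mat, s = "lambda.min")

mean((cv.preds - test$Salary)^2)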
plt.err$Boosted_Model = boost.model.test.err
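The data frame assembled above can also be printed directly to compare the three models side by side (a small addition, not in the original):

plt.err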

The test MSEs for the different methods are shown below. The boosted model performs the best, achieving the lowest test MSE of the three.

Linear Regression: 0.4917959

Lasso Regression: 0.4700537

Boosted Model: 0.2515519

(f) Which variables appear to be the most important predictors in the boosted model?

boosted.model = gbm(Salary~., data = train, 
                              distribution = "gaussian", 
                              n.trees = 1000, 
                              shrinkage = lambdas[which.min(test.error)])

summary(boosted.model)

##                 var    rel.inf
## CAtBat       CAtBat 18.7023127
## CRBI           CRBI  9.8659213
## CRuns         CRuns  7.9576732
## PutOuts     PutOuts  7.4972912
## Years         Years  6.4960574
## CHits         CHits  6.4455652
## Walks         Walks  6.4069538
## RBI             RBI  5.4662886
## CWalks       CWalks  5.1848830
## CHmRun       CHmRun  5.1808186
## Assists     Assists  4.2844035
## AtBat         AtBat  3.8964989
## HmRun         HmRun  3.4330684
## Hits           Hits  3.2148601
## Errors       Errors  2.5437587
## Runs           Runs  2.1553117
## Division   Division  0.6012222
## NewLeague NewLeague  0.3699840
## League       League  0.2971277

The most important variables appear to be CAtBat and CRBI, which have the largest relative influence values in the output above.
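As an optional follow-up (not part of the original answer), gbm can also draw a partial dependence plot for the top variable, showing how the predicted log salary changes with CAtBat while the other predictors are averaged out:

plot(boosted.model, i.var = "CAtBat")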

(g) Now apply bagging to the training set. What is the test set MSE for this approach?

library(randomForest)

# bagging is a random forest that considers all 19 predictors at every split,
# so mtry is set equal to the number of predictors; the gbm-specific arguments
# (distribution, n.trees, shrinkage) do not apply to randomForest
bag.model = randomForest(Salary~., data = train,
                         mtry = 19,
                         importance = TRUE)



bag.preds = predict(bag.model, test)

bag.test.err = mean((bag.preds - test$Salary)^2)

bag.test.err
## [1] 0.2291835
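Since importance = TRUE was set, the bagged model's own variable importance measures can also be inspected (a small addition, not in the original) and compared with the boosting importances from part (f).

varImpPlot(bag.model)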

The bagged model performs the best of all of the models compared. Because it uses all 19 predictors at every split, bagging here is simply a random forest with mtry equal to the number of predictors, and it achieves a test MSE of 0.2291835, lower than the boosted model's 0.2515519.