Regression Trees: Building a regression tree involves roughly two steps. First, the predictor space (the set of possible values for X1, ..., Xp) is divided into J distinct, non-overlapping rectangular regions R1, ..., RJ. Second, for every observation that falls into region Rj we make the same prediction, namely the mean of the response values for the training observations in Rj. To divide the feature space into these J boxes, a top-down greedy approach called recursive binary splitting is used. It works from the top of the tree down and makes the best split at each step without looking ahead to future splits or searching for the overall best tree. The goal is to partition the predictor space into regions that minimize the residual sum of squares (RSS): at each step, the approach considers all predictors and chooses the predictor and cutpoint whose split yields the lowest RSS.
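As a concrete illustration, the sketch below fits a small regression tree for log(Salary) on two predictors from the Hitters data. It assumes the tree package is installed; the analysis that follows uses gbm, glmnet, and randomForest rather than single trees.
# Sketch: a single regression tree built by recursive binary splitting (tree package assumed installed)
library(tree)
library(ISLR)
data("Hitters")
hitters.clean = na.omit(Hitters)
# fit a small tree for log(Salary) using two predictors for illustration
simple.tree = tree(log(Salary) ~ Years + Hits, data = hitters.clean)
plot(simple.tree)
text(simple.tree, pretty = 0)
# each leaf's prediction is the mean response of the training observations in that region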
The code below removes observations with missing values, log-transforms Salary in the Hitters data set, and splits the data into a training set (the first 200 observations) and a test set; boosted models are then fit over a grid of shrinkage values.
library(ISLR)
library(dplyr)
library(gbm)
library(glmnet)
library(randomForest)
data("Hitters")
Hitters = Hitters %>%
  na.omit() %>%
  mutate(Salary = log(Salary))
train = Hitters[1:200, ]
test = Hitters[-c(1:200), ]
set.seed(123)
# grid of shrinkage (lambda) values from 10^-2 to 10^0
pow = seq(-2, 0, 0.1)
lambdas = 10^pow
train.error = rep(NA, length(lambdas))
for (i in 1:length(lambdas)) {
  model = gbm(Salary ~ ., data = train,
              distribution = "gaussian",
              n.trees = 1000,
              shrinkage = lambdas[i])
  # predict train error
  model.preds = predict(model, train, n.trees = 1000)
  train.error[i] = mean((model.preds - train$Salary)^2)
}
# plotting train error against Lambdas
plot(lambdas, train.error, type="b", xlab = "Shrinkage", ylab = "Train MSE")
set.seed(123)
test.error = rep(NA, length(lambdas))
for (i in 1:length(lambdas)) {
  model = gbm(Salary ~ ., data = train,
              distribution = "gaussian",
              n.trees = 1000,
              shrinkage = lambdas[i])
  # predict test error
  model.preds = predict(model, test, n.trees = 1000)
  test.error[i] = mean((model.preds - test$Salary)^2)
}
plot(lambdas, test.error, type = "b", xlab = "Shrinkage", ylab = "Test MSE")
boost.model.test.err = min(test.error)
boost.model.test.err
## [1] 0.2515519
The Minimum Test MSE obtained by boosting is 0.2515519
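The shrinkage value that produced this minimum can be read directly off the grid; the line below is a small check added for clarity, reusing the lambdas and test.error objects defined above.
# shrinkage value with the lowest test MSE
lambdas[which.min(test.error)]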
Linear Regression
lm.model = lm(Salary~., train)
lm.preds = predict(lm.model, test)
lm.test.err = mean((lm.preds - test$Salary)^2)
# collect test MSEs for model comparison
plt.err = data.frame(Linear_Model = lm.test.err)
lm.test.err
## [1] 0.4917959
Lasso Regression
# design matrices for the training and test sets
x.model.mat = model.matrix(Salary ~ ., train)
test.model.mat = model.matrix(Salary ~ ., test)
y = train$Salary
lasso.model = glmnet(x.model.mat, y, alpha = 1)
lasso.preds = predict(lasso.model, s = 0.01, newx = test.model.mat)
lasso.test.err = mean((lasso.preds - test$Salary)^2)
plt.err$Lasso_Model = lasso.test.err
lasso.test.err
## [1] 0.4700537
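The lasso fit above uses a fixed penalty of s = 0.01; as an alternative sketch (not part of the original analysis), cv.glmnet can choose the penalty by cross-validation instead.
# sketch: select the lasso penalty by cross-validation rather than fixing s = 0.01
cv.lasso = cv.glmnet(x.model.mat, y, alpha = 1)
cv.lasso.preds = predict(cv.lasso, s = cv.lasso$lambda.min, newx = test.model.mat)
mean((cv.lasso.preds - test$Salary)^2)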
plt.err$Boosted_Model = boost.model.test.err
The test MSEs for the different methods are shown below; the boosted model performs best, achieving the lowest test MSE among the three. A quick plot of these values follows the list.
Linear Regression: 0.4917959
Lasso Regression: 0.4700537
Boosted Model: 0.2515519
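The plt.err data frame assembled above collects these values; printing it or turning it into a quick barplot (a sketch added here, not part of the original output) makes the comparison easy to scan.
plt.err
# quick visual comparison of the three test MSEs
barplot(unlist(plt.err), ylab = "Test MSE")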
# refit the boosted model using the shrinkage value that minimized test MSE
boosted.model = gbm(Salary ~ ., data = train,
                    distribution = "gaussian",
                    n.trees = 1000,
                    shrinkage = lambdas[which.min(test.error)])
summary(boosted.model)
## var rel.inf
## CAtBat CAtBat 18.7023127
## CRBI CRBI 9.8659213
## CRuns CRuns 7.9576732
## PutOuts PutOuts 7.4972912
## Years Years 6.4960574
## CHits CHits 6.4455652
## Walks Walks 6.4069538
## RBI RBI 5.4662886
## CWalks CWalks 5.1848830
## CHmRun CHmRun 5.1808186
## Assists Assists 4.2844035
## AtBat AtBat 3.8964989
## HmRun HmRun 3.4330684
## Hits Hits 3.2148601
## Errors Errors 2.5437587
## Runs Runs 2.1553117
## Division Division 0.6012222
## NewLeague NewLeague 0.3699840
## League League 0.2971277
The most important variables appear to be CAtBat and CRBI, which have the highest relative influence values in the output.
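To see how the top variable relates to predicted salary, a partial dependence plot can be drawn from the boosted model; the call below is a sketch using plot.gbm on CAtBat.
# sketch: partial dependence of predicted log(Salary) on CAtBat
plot(boosted.model, i.var = "CAtBat", n.trees = 1000)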
# bagging: mtry = 19 uses all available predictors at each split
bag.model = randomForest(Salary ~ ., data = train,
                         mtry = 19,
                         importance = TRUE)
bag.preds = predict(bag.model, test)
bag.test.err = mean((bag.preds - test$Salary)^2)
bag.test.err
## [1] 0.2291835
The bagged approach performs best among all of the models compared. It considers all 19 predictors at each split and achieves a test MSE of 0.2291835.
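For completeness, the bagged model's test MSE can be appended to the comparison data frame built earlier (a small addition, not part of the original output).
plt.err$Bagged_Model = bag.test.err
plt.err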