This is an exploration of the gradient boosting method for classification and regression trees, together with grid search for hyperparameter tuning. We apply it to spam detection using the spam7 data set in the DAAG package in R. The material is derived from the boosting lectures and assignments in the Predictive Analytics course in my Master's program. The underlying mathematics is considerably deeper than what is presented here; my goal is simply to put together an overview with an example application.
Boosting is a method for improving the prediction performance of a decision tree. In the classification setting it works by sequentially applying a classification algorithm to reweighted versions of the training data and taking a weighted majority vote of the classifiers generated. For regression trees, the trees are grown sequentially using information from the previous trees: the idea is to fit the current tree to the residuals of the previous fit rather than to the response \(Y\). The newly grown tree is then added to the fitted function and the residuals are updated. Each tree has its number of splits constrained by the parameter \(d\) in the algorithm, so the trees are generally small by design.
The hypothesis behind boosting is that weak learners, whose predictive abilities are only slightly better than random guessing, can be combined into a strong learner by adding the information from each to the process sequentially. AdaBoost was the early success of this approach: decision stumps with only one split are fit to observations weighted by how difficult they are to classify, and each stump is added to the ensemble.
The boosting algorithm for regression trees is:

1. Set \(\hat{f}(x)=0\) and \(r_i = y_i\) for all \(i\) in the training set.
2. For \(b=1,2,\ldots,B\) repeat:
   - (a) Fit a tree \(\hat{f}^b\) with \(d\) splits (\(d+1\) terminal nodes) to the training data \((X, r)\).
   - (b) Update \(\hat{f}\) by adding a shrunken version of the new tree: \[\hat{f}(x)\leftarrow \hat{f}(x) + \lambda \hat{f}^b(x).\]
   - (c) Update the residuals: \[r_i\leftarrow r_i-\lambda \hat{f}^b(x_i).\]
3. Output the boosted model, \[\hat{f}(x)=\sum_{b=1}^B \lambda \hat{f}^b(x).\]
There are three tuning parameters in the boosting algorithm above:

1. The number of trees \(B\).
2. The shrinkage parameter \(\lambda\), a small positive number that controls the rate at which boosting learns. Smaller \(\lambda\) can require larger \(B\).
3. The number of splits \(d\) in each tree. Often \(d=1\) works well, creating a decision stump. The parameter \(d\) determines the interaction depth of the boosted model.

A minimal sketch of the algorithm with these parameters is shown below.
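To make the algorithm concrete, the following is a minimal R sketch of the regression-tree boosting loop above, not the gbm implementation. It assumes a numeric response vector y and a data frame of predictors X; the function name boost_trees and the use of rpart with maxdepth to approximate \(d\) splits are illustrative choices only.

library(rpart)
# Minimal sketch: B shallow trees, each fit to the current residuals and
# added to the model with shrinkage lambda.
boost_trees <- function(X, y, B = 1000, lambda = 0.01, d = 1) {
  f_hat <- rep(0, length(y))            # step 1: f(x) = 0
  r <- y                                # step 1: residuals start as the response
  trees <- vector("list", B)
  for (b in seq_len(B)) {
    # (a) fit a shallow tree to (X, r); maxdepth = d approximates d splits
    fit <- rpart(r ~ ., data = data.frame(X, r = r),
                 control = rpart.control(maxdepth = d, cp = 0))
    pred <- predict(fit, newdata = X)
    f_hat <- f_hat + lambda * pred      # (b) add the shrunken tree
    r <- r - lambda * pred              # (c) update the residuals
    trees[[b]] <- fit
  }
  list(trees = trees, lambda = lambda, fitted = f_hat)  # step 3: boosted model
}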
Consider the AdaBoost algorithm on a two-class problem with the response \(Y\) coded as either \(-1\) or \(+1\). Given predictors \(X\), a classifier \(G(X)\) predicts one of those two values. The error rate on the training sample is \[\overline{err}=\frac{1}{N}\sum_{i=1}^N I(y_i \neq G(x_i)),\] and the expected future error rate is \[E_{XY}\, I(Y\neq G(X)).\]
The final prediction is produced by a weighted majority vote of the classifiers, \[G(x)=\mathrm{sign}\Big(\sum_{m=1}^M\alpha_m G_m(x)\Big),\] where the \(\alpha_m\) for \(m=1,\ldots,M\) are computed by the boosting algorithm and weight the contribution of each \(G_m(x)\).
At each boosting step the data are modified by applying weights \(w_i,\ i = 1,\ldots,N\), to the training observations \((x_i, y_i)\), with the initial weights set to \(w_i=1/N\).
The AdaBoost.M1 algorithm follows these steps:

1. Initialize the observation weights \(w_i = 1/N,\ i = 1,\ldots,N\).
2. For \(m=1\) to \(M\):
   - (a) Fit a classifier \(G_m(x)\) to the training data using weights \(w_i\).
   - (b) Compute \[err_m=\frac{\sum_{i=1}^N w_i I(y_i\neq G_m(x_i))}{\sum_{i=1}^N w_i}.\]
   - (c) Compute \(\alpha_m = \log\big((1-err_m)/err_m\big)\).
   - (d) Set \(w_i \leftarrow w_i \cdot \exp\big[\alpha_m\cdot I(y_i\neq G_m(x_i))\big],\ i = 1,\ldots,N\).
3. Output \(G(x)=\mathrm{sign}\big[\sum_{m=1}^M \alpha_m G_m(x)\big]\).

A minimal sketch of this algorithm is given below.
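For completeness, here is a minimal R sketch of AdaBoost.M1 using rpart decision stumps as the weak learner; the function name adaboost_m1 is hypothetical, the response y is assumed to be coded \(-1/+1\), and no guard is included for the degenerate case \(err_m = 0\).

library(rpart)
adaboost_m1 <- function(X, y, M = 100) {
  N <- length(y)
  w <- rep(1 / N, N)                      # step 1: initial observation weights
  stumps <- vector("list", M)
  alpha <- numeric(M)
  for (m in seq_len(M)) {
    # (a) fit a weighted one-split stump
    fit <- rpart(y ~ ., data = data.frame(X, y = factor(y)), weights = w,
                 control = rpart.control(maxdepth = 1, cp = 0))
    G_m <- as.numeric(as.character(predict(fit, newdata = X, type = "class")))
    miss <- as.numeric(G_m != y)
    err_m <- sum(w * miss) / sum(w)       # (b) weighted error rate
    alpha[m] <- log((1 - err_m) / err_m)  # (c) classifier weight
    w <- w * exp(alpha[m] * miss)         # (d) up-weight misclassified points
    stumps[[m]] <- fit
  }
  list(stumps = stumps, alpha = alpha)    # predict with sign(sum_m alpha_m * G_m(x))
}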
To extend this idea to gradient boosting machines, we want to solve a function estimation problem by finding an estimate \(\hat{f}(x)\) that minimizes the expectation of a loss function, call it \(\psi(y, f)\), where
\[\begin{aligned} \hat{f}(x)&=\underset{f(x)}{\arg\min}\, E_{y,x}\,\psi(y, f(x))\\ &=\underset{f(x)}{\arg\min}\, E_x\big[E_{y\vert x}\,\psi(y, f(x))\,\big\vert\, x\big] \end{aligned}\]
Gradient boosting performs functional gradient descent: it optimizes the cost function over function space by iteratively choosing a function (a weak learner, per the hypothesis above) that points in the negative direction of the gradient.
The gradient boosting algorithm follows these steps:

- Initialize \(\hat{f}(x)\) to be a constant, \(\hat{f}(x)=\arg\min_{\rho}\sum_{i=1}^N\psi(y_i, \rho)\).
- For \(t\) in \(1,\ldots,T\):
  1. Compute the negative gradient as the working response \[z_i =-\frac{\partial}{\partial f(x_i)}\psi(y_i, f(x_i))\Big\vert_{f(x_i)=\hat{f}(x_i)}.\]
  2. Fit a regression model \(g(x)\) predicting \(z_i\) from the covariates \(x_i\).
  3. Choose a gradient descent step size \[\rho=\arg\min_{\rho}\sum_{i=1}^N\psi\big(y_i, \hat{f}(x_i)+\rho\, g(x_i)\big).\]
  4. Update the estimate of \(f(x)\) as \[\hat{f}(x)\leftarrow \hat{f}(x)+\rho\, g(x).\]

An R sketch of this loop with a specific loss function is given below.
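As an illustration of the generic loop, here is a minimal R sketch using the absolute loss \(\psi(y,f)=\vert y-f\vert\), whose negative gradient is \(\mathrm{sign}(y-f)\). It uses a single global line search via optimize() over an arbitrary interval, whereas gbm chooses the step within each terminal node of the tree, so this is a simplification; the function name gradient_boost and its arguments are illustrative only.

library(rpart)
gradient_boost <- function(X, y, n_steps = 500, depth = 1) {  # n_steps is T above
  f_hat <- rep(median(y), length(y))  # constant initializer minimizing sum |y - rho|
  for (t in seq_len(n_steps)) {
    z <- sign(y - f_hat)              # 1. working response = negative gradient
    g_fit <- rpart(z ~ ., data = data.frame(X, z = z),          # 2. fit g(x) to z
                   control = rpart.control(maxdepth = depth, cp = 0))
    g <- predict(g_fit, newdata = X)
    rho <- optimize(function(r) sum(abs(y - (f_hat + r * g))),  # 3. line search
                    interval = c(-10, 10))$minimum
    f_hat <- f_hat + rho * g          # 4. update the estimate of f(x)
  }
  f_hat
}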
The focus is on finding estimates of \(f(x)\) such that \[\hat{f}(x)=\arg\min_{f(x)}E_{y\vert x}\big[\psi(y, f(x))\,\big\vert\, x\big].\]
Parametric regression models assume that \(f(x)\) is a function with a finite number of parameters \(\beta\), which we estimate by minimizing the loss function over the training sample of \(N\) observations on \((y, x)\) pairs: \[\hat{\beta}=\arg\min_{\beta}\sum_{i=1}^N\psi(y_i, f(x_i; \beta)).\] For example, with squared-error loss and a linear \(f(x;\beta)\), this is just ordinary least squares.
If we are dealing with non-parametric estimation, the process is more involved. We modify the current estimate of \(f(x)\) by adding a new function in a greedy fashion: letting \(f_i=f(x_i)\), we want to decrease the \(N\)-dimensional function \[\begin{aligned} J(f)&=\sum_{i=1}^N\psi(y_i, f(x_i))\\ &=\sum_{i=1}^N\psi(y_i, f_i). \end{aligned}\]
The negative gradient of \(J(f)\) gives the direction of locally steepest descent. Gradient descent modifies \(f\) as \(\hat{f}\leftarrow \hat{f}-\rho\nabla J(f)\), where \(\rho\) is the step size along that direction.

Two practical problems arise: we can only evaluate the gradient where we have an observation \(x_i\), yet we want observations with similar values of \(x\) to have similar values of \(f(x)\). Both are overcome by selecting a class of functions that uses the covariate information to approximate the gradient, usually a regression tree. At each iteration the algorithm computes the gradient direction and selects the model that best agrees with it. With squared-error loss, \(\sum_{i=1}^N\psi(y_i,f(x_i))=\sum_{i=1}^N(y_i-f(x_i))^2\), this corresponds precisely to residual fitting, as the short derivation below shows.
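To make the connection explicit, take \(\psi(y,f)=\tfrac{1}{2}(y-f)^2\) (the factor \(\tfrac{1}{2}\) is a convenience; without it the gradient is simply scaled by two). The working response in step 1 of the gradient boosting algorithm is then \[z_i = -\frac{\partial}{\partial f(x_i)}\,\tfrac{1}{2}\big(y_i-f(x_i)\big)^2\Big\vert_{f(x_i)=\hat{f}(x_i)} = y_i - \hat{f}(x_i),\] which is exactly the current residual, so fitting \(g(x)\) to the \(z_i\) amounts to fitting a tree to the residuals.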
Hyperparameter tuning involves choosing the optimal set of hyperparameters for a learning algorithm. Grid search is an exhaustive search over a pre-specified subset of the hyperparameter space of the learning algorithm, in this case the gradient boosting algorithm. We use the trainControl() function in the caret package in R to set the method, number of folds, and search type, and expand.grid() to define the grid of gbm tuning values (n.trees, interaction.depth, shrinkage, and n.minobsinnode), as shown below.
Use grid search to tune a gbm model using the spam7 data in library(DAAG).
Separate the data into a training set (70%) and testing set (30%). Tune the model by grid search method and report the testing error.
Produce a plot to compare the different settings of the hyperparameters, similar to the graph on the last page of chapter-8-Gradient-Boosting-Machine-classification.pdf.
library(DAAG)
## Loading required package: lattice
library(caret)
## Loading required package: ggplot2
library(ggplot2)
set.seed(123)
data(spam7)
dim(spam7)
## [1] 4601 7
levels(spam7$yesno)
## [1] "n" "y"
train = sample(4601, 3220)
training <- spam7[train, ]
testing <- spam7[-train, ]
sum(training$yesno == 'y')/sum(training$yesno == 'n')
## [1] 0.6445352
sum(testing$yesno == 'y')/sum(testing$yesno == 'n')
## [1] 0.6638554
str(spam7)
## 'data.frame': 4601 obs. of 7 variables:
## $ crl.tot: num 278 1028 2259 191 191 ...
## $ dollar : num 0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
## $ bang : num 0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
## $ money : num 0 0.43 0.06 0 0 0 0 0 0.15 0 ...
## $ n000 : num 0 0.43 1.16 0 0 0 0 0 0 0.19 ...
## $ make : num 0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
## $ yesno : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 2 ...
We set the method to cross-validation using “cv”, the number of folds to \(10\), and the search to “grid”, and create tuneGrid using the expand.grid() function to specify the number of trees, the interaction depth, the shrinkage values, and the minimum number of observations per terminal node.
We then fit the gradient boosting model to the training data with caret’s train() function, using method = "gbm".
library(caret)
trainControl <- trainControl(method = "cv",
number = 10,
returnResamp="all", ### use "all" to return all cross-validated metrics
search = "grid")
tuneGrid <- expand.grid(
n.trees = c(5000, 10000),
interaction.depth = c( 6, 13),
shrinkage = c(0.01, 0.001),
n.minobsinnode=c(5, 10)
)
gbm_op <- train(yesno ~.,
data = training,
method = "gbm",
tuneGrid = tuneGrid,
trControl = trainControl,
verbose=FALSE)
plot(gbm_op)
library(gbm)
## Loaded gbm 2.1.8
spam7Imp <- varImp(gbm_op, scale = T)
plot(spam7Imp, top = 5)
spam7Imp
## gbm variable importance
##
## Overall
## bang 100.000
## dollar 84.659
## crl.tot 33.678
## money 14.086
## n000 7.583
## make 0.000
trellis.par.set(caretTheme())
plot(gbm_op, metric = "Kappa", plotType = "level",
scales = list(x = list(rot = 90)))
pre_caret_gbm <- predict(gbm_op, newdata=testing)
mean(pre_caret_gbm != testing$yesno) # test misclassification rate
## [1] 0.1230992
plot(testing$yesno, pre_caret_gbm)
names(gbm_op)
## [1] "method" "modelInfo" "modelType" "results" "pred"
## [6] "bestTune" "call" "dots" "metric" "control"
## [11] "finalModel" "preProcess" "trainingData" "resample" "resampledCM"
## [16] "perfNames" "maximize" "yLimits" "times" "levels"
## [21] "terms" "coefnames" "xlevels"
gbm_op$bestTune # parameter of the best model
## n.trees interaction.depth shrinkage n.minobsinnode
## 1 5000 6 0.001 5
(gbm_op$resample)$Accuracy # Accuracy values from the 10-fold cross-validation for each tuning combination
## [1] 0.8478261 0.8509317 0.8447205 0.8509317 0.8416149 0.8478261 0.8322981
## [8] 0.8509317 0.8447205 0.8478261 0.8447205 0.8416149 0.8385093 0.8478261
## [15] 0.8291925 0.8385093 0.9068323 0.9068323 0.9068323 0.9068323 0.9068323
## [22] 0.9068323 0.9068323 0.9099379 0.8975155 0.9037267 0.8944099 0.8975155
## [29] 0.8913043 0.9006211 0.9006211 0.8913043 0.8823529 0.8823529 0.8823529
## [36] 0.8792570 0.8730650 0.8792570 0.8730650 0.8823529 0.8792570 0.8823529
## [43] 0.8823529 0.8823529 0.8606811 0.8730650 0.8637771 0.8792570 0.8691589
## [50] 0.8722741 0.8691589 0.8753894 0.8753894 0.8691589 0.8691589 0.8629283
## [57] 0.8660436 0.8660436 0.8660436 0.8566978 0.8535826 0.8598131 0.8535826
## [64] 0.8629283 0.8850932 0.8850932 0.8850932 0.8819876 0.8850932 0.8913043
## [71] 0.8850932 0.8788820 0.8850932 0.8819876 0.8819876 0.8788820 0.8819876
## [78] 0.8819876 0.8819876 0.8819876 0.9040248 0.9009288 0.9071207 0.9009288
## [85] 0.9133127 0.9071207 0.9133127 0.9071207 0.9102167 0.9040248 0.9040248
## [92] 0.9040248 0.9071207 0.9133127 0.9040248 0.9133127 0.9099379 0.9068323
## [99] 0.9068323 0.9037267 0.9068323 0.9068323 0.9068323 0.9099379 0.8819876
## [106] 0.8975155 0.8819876 0.8975155 0.8788820 0.8850932 0.8757764 0.8913043
## [113] 0.8878505 0.8878505 0.8909657 0.8878505 0.9003115 0.8847352 0.9034268
## [120] 0.8878505 0.8847352 0.9003115 0.8878505 0.8909657 0.8878505 0.8847352
## [127] 0.8691589 0.8816199 0.8540373 0.8602484 0.8478261 0.8602484 0.8602484
## [134] 0.8509317 0.8540373 0.8540373 0.8602484 0.8602484 0.8633540 0.8571429
## [141] 0.8447205 0.8633540 0.8447205 0.8633540 0.8819876 0.8881988 0.8819876
## [148] 0.8788820 0.8695652 0.8850932 0.8726708 0.8850932 0.8602484 0.8633540
## [155] 0.8602484 0.8602484 0.8509317 0.8571429 0.8633540 0.8664596
gbm_op$results # Mean cross-validation metrics for each combination of tuning parameters
## shrinkage interaction.depth n.minobsinnode n.trees Accuracy Kappa
## 1 0.001 6 5 5000 0.8841543 0.7527226
## 3 0.001 6 10 5000 0.8826034 0.7495313
## 9 0.010 6 5 5000 0.8807391 0.7459046
## 11 0.010 6 10 5000 0.8766961 0.7370299
## 5 0.001 13 5 5000 0.8829092 0.7497149
## 7 0.001 13 10 5000 0.8829072 0.7493070
## 13 0.010 13 5 5000 0.8766951 0.7381968
## 15 0.010 13 10 5000 0.8770037 0.7382655
## 2 0.001 6 5 10000 0.8829101 0.7499231
## 4 0.001 6 10 10000 0.8822890 0.7481405
## 10 0.010 6 5 10000 0.8770066 0.7385153
## 12 0.010 6 10 10000 0.8766980 0.7377652
## 6 0.001 13 5 10000 0.8832265 0.7504684
## 8 0.001 13 10 10000 0.8816727 0.7471814
## 14 0.010 13 5 10000 0.8695570 0.7236796
## 16 0.010 13 10 10000 0.8686195 0.7214006
## AccuracySD KappaSD
## 1 0.01873939 0.04047873
## 3 0.01818698 0.03920598
## 9 0.02044145 0.04455086
## 11 0.02144147 0.04668073
## 5 0.02176952 0.04690944
## 7 0.02198079 0.04772389
## 13 0.02046232 0.04425314
## 15 0.02042876 0.04443814
## 2 0.02116748 0.04564782
## 4 0.02279398 0.04929545
## 10 0.01944444 0.04220321
## 12 0.01781855 0.03867984
## 6 0.02339480 0.05009626
## 8 0.02638461 0.05642846
## 14 0.02290508 0.04903168
## 16 0.02335900 0.05023024
# Label each resample row by its tuning parameters: shrinkage-depth-minobs-ntrees
Model <- c()
for (j in 1:nrow(gbm_op$resample)) {
  k1 <- ifelse(gbm_op$resample[j, 1] == 0.01, '01', '001')
  k4 <- ifelse(gbm_op$resample[j, 4] == 10000, '10k', '5k')
  key <- paste(k1, gbm_op$resample[j, 2],
               gbm_op$resample[j, 3], k4, sep = '-')
  Model <- c(Model, key)
}
cv_result <- data.frame(Model=Model, Accuracy =(gbm_op$resample)[, 'Accuracy'])
library(ggplot2)
ggplot(cv_result,aes(y = Accuracy, x = Model))+
geom_boxplot() + xlab("Model")+
ylab("Accuracy")+
stat_summary(fun = mean, shape = 3, col = 'red', geom = 'point')+
ggtitle("Boxplot of Cross Validation Accuracy over 10 Runs for GBM")+
theme(plot.title = element_text(hjust = 0.5))+
coord_flip()
trainControl <- trainControl(method = "cv",
number = 10,
search = "grid")
tuneGrid <- expand.grid(
interaction.depth = c(13),
shrinkage = c(0.01),
n.minobsinnode = c(5),
n.trees = c(10000))
op_boost <- train(yesno ~.,
data = training,
method = "gbm",
tuneGrid = tuneGrid,
trControl = trainControl,
verbose = FALSE )
pre_gbm <- predict(op_boost, newdata=testing)
mean(pre_gbm != testing$yesno) # test misclassification rate
## [1] 0.141202
Running this model on the test data gives a misclassification error of \(0.1412\), or \(14.12\%\).
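Beyond the single error rate, it can be useful to see how the misclassifications split across the two classes. One way to do this (output not shown here; results depend on the random split above) is caret’s confusionMatrix() applied to the test-set predictions:

# Cross-tabulate predicted vs. observed classes on the test set.
confusionMatrix(pre_gbm, testing$yesno)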