Gradient Boosting with Grid Search Tuning for Spam Detection

This is an exploration of the gradient boosting method for classification and regression trees, together with grid search as a tuning strategy. We apply it to the detection of spam using the spam7 data set in the DAAG package in R. The material is derived from the boosting lectures and assignments in the Predictive Analytics course in my Master’s program. The underlying mathematics is considerably deeper than what is presented here; my goal is simply to give an overview with an example application.

Boosting

Boosting is a method for improving the prediction performance of a decision tree. The method works by sequentially applying a classification algorithm to reweighted versions of the training data and taking a weighted majority vote of the classifiers generated. The decision trees are grown sequentially, using information from the previous trees. The idea is to fit the current tree to the residuals of the previous tree rather than to the response \(Y\). The new tree is then added to the fitted function, and the residuals are updated. Each tree has its number of splits constrained by the parameter \(d\) in the algorithm, and the trees are generally small by design.

The hypothesis behind boosting is that weak learners, whose predictive abilities are only slightly better than random guessing, can be combined into a strong learner by adding the information from each one into the process sequentially. AdaBoost was an early success of this approach: decision stumps with only one split are fit repeatedly, with the observations reweighted according to how difficult they are to classify, and the resulting classifiers are added to the ensemble.

The boosting algorithm for regression trees is:

- 1. Set \(\hat{f}(x)=0\) and \(r_i = y_i\) for all \(i\) in the training set.
- 2. For \(b=1,2,...,B\) repeat:
  - (a) Fit a tree \(\hat{f}^b\) with \(d\) splits (\(d+1\) terminal nodes) to the training data \((X, r)\).
  - (b) Update \(\hat{f}\) by adding in a shrunken version of the new tree: \[\hat{f}(x)\leftarrow \hat{f}(x) + \lambda \hat{f}^b(x).\]
  - (c) Update the residuals: \[r_i\leftarrow r_i-\lambda \hat{f}^b(x_i).\]
- 3. Output the boosted model, \[\hat{f}(x)=\sum_{b=1}^B \lambda \hat{f}^b(x).\]
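
As a hedged illustration of this loop (not part of the original assignment), the following R sketch runs it on simulated data, using small rpart trees as the weak learners; the names B, lambda, d, and f_hat are illustrative choices, not objects used later in the post.

library(rpart)
set.seed(1)

# Simulated data for the sketch
n <- 200
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)

B      <- 500     # number of trees
lambda <- 0.01    # shrinkage parameter
d      <- 1       # splits per tree (a decision stump)

f_hat <- rep(0, n)    # step 1: f_hat(x) = 0
r     <- y            # step 1: residuals start as the response
for (b in 1:B) {
  # step 2(a): fit a small tree to the current residuals
  tree_b <- rpart(r ~ x, data = data.frame(x = x, r = r),
                  control = rpart.control(maxdepth = d, cp = 0))
  pred_b <- predict(tree_b)
  # steps 2(b) and 2(c): add a shrunken copy of the tree and update residuals
  f_hat <- f_hat + lambda * pred_b
  r     <- r - lambda * pred_b
}
mean((y - f_hat)^2)   # training MSE of the boosted model

Each pass fits a stump to the current residuals and adds a shrunken copy of it to the fit, mirroring steps 2(a)-(c) above.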

There are three tuning parameters in the boosting algorithm above:

- 1. The number of trees \(B\).
- 2. The shrinkage parameter \(\lambda\), a small positive number that controls the rate at which boosting learns. A smaller \(\lambda\) can require a larger \(B\).
- 3. The number of splits \(d\) in each tree. Often \(d=1\) works well, creating a decision stump. \(d\) controls the interaction depth of the boosted model.
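
To connect these parameters to the implementation used later, here is a hedged sketch of a single gbm() call on the spam7 data (loaded formally in the project section below); spam_df and fit are illustrative names, and the parameter values are arbitrary examples rather than tuned choices.

library(gbm)
library(DAAG)
data(spam7)
set.seed(123)

# Recode the response as 0/1, which gbm's "bernoulli" distribution expects
spam_df <- spam7
spam_df$yesno <- as.numeric(spam_df$yesno == "y")

fit <- gbm(yesno ~ ., data = spam_df,
           distribution = "bernoulli",
           n.trees = 1000,          # B: the number of trees
           shrinkage = 0.01,        # lambda: the shrinkage (learning rate)
           interaction.depth = 1,   # d: one split per tree, a decision stump
           n.minobsinnode = 10)     # minimum observations in a terminal node

summary(fit, plotit = FALSE)        # relative influence of each predictor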

Consider the AdaBoost algorithm on a two-class problem with the response \(Y\) coded as either \(-1\) or \(+1\). Given predictors \(X\), a classifier \(G(X)\) predicts a response taking one of those two values. The error rate on the training sample is \[\overline{err}=\frac{1}{N}\sum_{i=1}^N I(y_i \neq G(x_i)),\] and the expected future error rate is \(E_{XY}\, I(Y\neq G(X))\).

The predictions are generated by applying the algorithm to produce a weighted majority vote, giving the final prediction \[G(x)=sign\Big(\sum_{m=1}^M\alpha_m G_m(x)\Big),\] where the \(\alpha_m\) for \(m=1,...,M\) are computed by the boosting algorithm and weight the contribution of each \(G_m(x)\).

At each boosting step the data are modified by applying weights \(w_i,\ i = 1,...,N\) to each of the training observations \((x_i, y_i)\), with the initial weights set to \(w_i=1/N\).

The AdaBoost.M1 algorithm follows the steps:

- 1. Initialize the observation weights \(w_i = 1/N,\ i = 1,...,N\).
- 2. For \(m=1\) to \(M\):
  - (a) Fit a classifier \(G_m(x)\) to the training data using the weights \(w_i\).
  - (b) Compute \[err_m=\frac{\sum_{i=1}^N w_i I(y_i\neq G_m(x_i))}{\sum_{i=1}^N w_i}.\]
  - (c) Compute \(\alpha_m = \log\big((1-err_m)/err_m\big)\).
  - (d) Set \(w_i \leftarrow w_i \cdot \exp\big[\alpha_m\cdot I(y_i\neq G_m(x_i))\big],\ i = 1,...,N\).
- 3. Output \(G(x)=sign\big[\sum_{m=1}^M \alpha_m G_m(x)\big]\).
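
A minimal R sketch of AdaBoost.M1 on the spam7 data, assuming rpart stumps as the weak learners \(G_m\); it omits safeguards (for example against \(err_m = 0\)) and the object names are illustrative only.

library(rpart)
library(DAAG)
data(spam7)
set.seed(123)

y_num <- ifelse(spam7$yesno == "y", 1, -1)                    # response coded -1/+1
dat   <- data.frame(spam7[, names(spam7) != "yesno"], y = factor(y_num))

N <- nrow(dat); M <- 50
w     <- rep(1 / N, N)    # step 1: initial observation weights
alpha <- numeric(M)
vote  <- numeric(N)       # accumulating weighted vote

for (m in 1:M) {
  # 2(a): fit a decision stump to the weighted training data
  stump <- rpart(y ~ ., data = dat, weights = w,
                 control = rpart.control(maxdepth = 1, cp = 0))
  G_m  <- ifelse(predict(stump, dat, type = "class") == "1", 1, -1)
  miss <- G_m != y_num
  # 2(b): weighted error rate, and 2(c): classifier weight
  err_m    <- sum(w * miss) / sum(w)
  alpha[m] <- log((1 - err_m) / err_m)
  # 2(d): increase the weights of the misclassified observations
  w    <- w * exp(alpha[m] * miss)
  vote <- vote + alpha[m] * G_m
}
mean(sign(vote) == y_num)   # training accuracy of the weighted majority vote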

Gradient Boosting

To extend this idea to gradient boosting machines, we want to solve a function estimation problem by finding an estimate \(\hat{f}(x)\) that minimizes the expectation of a loss function, call it \(\psi(y, f)\), where

\[\begin{aligned} \hat{f}(x)&=\underset{f(x)}{\arg\min}\, E_{y,x}\,\psi(y, f(x))\\ &=\underset{f(x)}{\arg\min}\, E_x\big[E_{y\vert x}\,\psi(y, f(x))\,\big\vert\, x\big]. \end{aligned}\]

Gradient boosting performs functional gradient descent: it optimizes the cost function over a function space by iteratively choosing a function (under the weak learner hypothesis) that points in the direction of the negative gradient.

The gradient boosting algorithm follows the steps:

- Initialize \(\hat{f}(x)\) to be a constant, \(\hat{f}(x)=\arg\min_{\rho}\sum_{i=1}^N\psi(y_i, \rho)\).
- For \(t\) in \(1,...,T\) do:
  - 1. Compute the negative gradient as the working response \[z_i =-\frac{\partial}{\partial f(x_i)}\psi(y_i, f(x_i))\Big\vert_{f(x_i)=\hat{f}(x_i)}.\]
  - 2. Fit a regression model, \(g(x)\), predicting \(z_i\) from the covariates \(x_i\).
  - 3. Choose a gradient descent step size as \[\rho=\arg\min_{\rho}\sum_{i=1}^N\psi(y_i, \hat{f}(x_i)+\rho\cdot g(x_i)).\]
  - 4. Update the estimate of \(f(x)\) as \[\hat{f}(x)\leftarrow \hat{f}(x)+\rho\cdot g(x).\]
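
To make the step-size search concrete, here is a hedged sketch of this loop for the squared-error loss \(\psi(y,f)=\tfrac{1}{2}(y-f)^2\) on simulated data, where the working response is the ordinary residual and the line search for \(\rho\) has a closed form; T_iter, f_hat, and g_fit are illustrative names.

library(rpart)
set.seed(1)

# Simulated data for the sketch
n <- 300
x <- runif(n, -3, 3)
y <- x^2 + rnorm(n)

T_iter <- 100
f_hat  <- rep(mean(y), n)   # initialize at the loss-minimizing constant

for (t in 1:T_iter) {
  z <- y - f_hat            # 1. negative gradient (working response)
  g_fit <- rpart(z ~ x, data = data.frame(x = x, z = z),
                 control = rpart.control(maxdepth = 2, cp = 0))
  g <- predict(g_fit)            # 2. regression model g(x) fitted to z
  rho <- sum(z * g) / sum(g^2)   # 3. argmin_rho of sum (z_i - rho * g_i)^2
  f_hat <- f_hat + rho * g       # 4. update the estimate
}
mean((y - f_hat)^2)              # training MSE after T iterations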

The focus is on finding estimates of \(f(x)\) such that \[\hat{f}(x)=\arg\min_{f(x)}E_{y\vert x}\big[\psi(y, f(x))\vert x\big].\]

Parametric regression models assume that \(f(x)\) is a function with a finite number of parameters \(\beta\), and we estimate them by minimizing the loss function over the training sample of \(N\) observations on \((y, x)\) pairs. \[\hat{\beta}=\arg\min_{\beta}\sum_{i=1}^N\psi(y_i, f(x_i; \beta))\]
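
As a small illustration of the parametric case (a toy example with made-up data, not from the assignment), the following minimizes the squared-error loss over a two-parameter linear model with optim(), recovering essentially the same estimates as lm():

set.seed(1)

# Parametric illustration: f(x; beta) = beta[1] + beta[2] * x, with beta
# estimated by minimizing the squared-error loss over the training sample
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)
loss <- function(beta) sum((y - (beta[1] + beta[2] * x))^2)
optim(c(0, 0), loss)$par   # close to c(2, 3) and to coef(lm(y ~ x))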

If we are dealing with a non-parametric estimate, the process is more involved. We modify the current estimate of \(f(x)\) by adding a new function in a greedy fashion: letting \(f_i=f(x_i)\), we want to decrease the \(N\)-dimensional function \[\begin{aligned} J(f)&=\sum_{i=1}^N\psi(y_i, f(x_i))\\ &=\sum_{i=1}^N\psi(y_i, f_i). \end{aligned}\]

The negative gradient of \(J(f)\) gives the direction of the locally greatest descent for \(J(f)\). Gradient descent requires a modification of \(f\) as \(\hat{f}\leftarrow \hat{f}-\rho\nabla J(f)\) where \(\rho\) is the step size along the line of greatest descent.

Two problems remain: we can only evaluate the gradient of \(J(f)\) at points \(x\) where we have an observation, and the raw gradient step ignores the fact that observations with similar values of \(x\) should have similar values of \(f(x)\). Both problems are overcome by selecting a class of functions that uses the covariate information to approximate the gradient, usually a regression tree. At each iteration the algorithm computes the gradient direction and selects the model that best agrees with that direction. With the squared-error loss \(\psi(y_i,f(x_i))=\tfrac{1}{2}\big(y_i-f(x_i)\big)^2\), the negative gradient is exactly the residual, so this corresponds precisely to residual fitting.
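
As a quick check of that last claim, differentiate the half squared-error loss with respect to \(f(x_i)\); the working response in step 1 of the gradient boosting algorithm is then exactly the residual, so fitting \(g(x)\) to the \(z_i\) is residual fitting:

\[z_i = -\frac{\partial}{\partial f(x_i)}\,\frac{1}{2}\big(y_i - f(x_i)\big)^2\Big\vert_{f(x_i)=\hat{f}(x_i)} = y_i - \hat{f}(x_i).\]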

Project Specifications:

Use grid search to tune a gbm model using the spam7 data in library(DAAG).

Load needed libraries and dataset and create the training and testing sets.

library(DAAG)
## Loading required package: lattice
library(caret)
## Loading required package: ggplot2
library(ggplot2)
set.seed(123)

data(spam7)
dim(spam7)
## [1] 4601    7
levels(spam7$yesno)
## [1] "n" "y"
train <- sample(4601, 3220)   # indices for a roughly 70/30 train/test split
training <- spam7[train, ]
testing <- spam7[-train, ]
sum(training$yesno == 'y')/sum(training$yesno == 'n')   # spam-to-non-spam ratio, training set
## [1] 0.6445352
sum(testing$yesno == 'y')/sum(testing$yesno == 'n')     # spam-to-non-spam ratio, testing set
## [1] 0.6638554
str(spam7)
## 'data.frame':    4601 obs. of  7 variables:
##  $ crl.tot: num  278 1028 2259 191 191 ...
##  $ dollar : num  0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
##  $ bang   : num  0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
##  $ money  : num  0 0.43 0.06 0 0 0 0 0 0.15 0 ...
##  $ n000   : num  0 0.43 1.16 0 0 0 0 0 0 0.19 ...
##  $ make   : num  0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
##  $ yesno  : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 2 ...

Use the caret package to build and tune a gbm() model.

We set the resampling method to cross-validation with “cv”, the number of folds to \(10\), and the search to “grid”, and we create tuneGrid with the expand.grid() function to specify the candidate numbers of trees, interaction depths, shrinkage values, and minimum numbers of observations per terminal node.

We then apply the gradient boosting algorithm using the gbm() function to the training data.

trainControl <- trainControl(method = "cv",
                          number = 10,
                          returnResamp="all", ### use "all" to return all cross-validated metrics
                          search = "grid")

tuneGrid <- expand.grid(
             n.trees = c(5000, 10000),
             interaction.depth = c(6, 13),
             shrinkage = c(0.01, 0.001),
             n.minobsinnode = c(5, 10))
gbm_op <- train(yesno ~.,
                data = training,
                method = "gbm",
                tuneGrid = tuneGrid,
                trControl = trainControl,
                verbose=FALSE)

Plot the cross-validation results across the tuning grid (accuracy against the maximum tree depth and number of boosting iterations).

plot(gbm_op)

Plot the variable importance and show the values.

library(gbm)
## Loaded gbm 2.1.8
spam7Imp <- varImp(gbm_op, scale = T)
plot(spam7Imp, top = 5)

spam7Imp
## gbm variable importance
## 
##         Overall
## bang    100.000
## dollar   84.659
## crl.tot  33.678
## money    14.086
## n000      7.583
## make      0.000

Plot a heatmap of the results.

trellis.par.set(caretTheme())
plot(gbm_op, metric = "Kappa", plotType = "level",
     scales = list(x = list(rot = 90)))

Make a prediction and show the error for this model.

pre_caret_gbm <- predict(gbm_op, newdata=testing)
mean(pre_caret_gbm != testing$yesno)   # test misclassification rate
## [1] 0.1230992

What does a plot of the confusion matrix look like?

plot(testing$yesno, pre_caret_gbm)
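
Because plot() applied to two factors draws a spineplot rather than a table of counts, the numeric confusion matrix could also be inspected with caret's confusionMatrix() function; a possible follow-up call (output not shown here) would be:

library(caret)
# Tabular confusion matrix for the predictions made above,
# treating "y" (spam) as the positive class
confusionMatrix(data = pre_caret_gbm, reference = testing$yesno, positive = "y")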

What model attribute names are available?

names(gbm_op)
##  [1] "method"       "modelInfo"    "modelType"    "results"      "pred"        
##  [6] "bestTune"     "call"         "dots"         "metric"       "control"     
## [11] "finalModel"   "preProcess"   "trainingData" "resample"     "resampledCM" 
## [16] "perfNames"    "maximize"     "yLimits"      "times"        "levels"      
## [21] "terms"        "coefnames"    "xlevels"

What are the parameters of the best model?

gbm_op$bestTune # parameter of the best model
##   n.trees interaction.depth shrinkage n.minobsinnode
## 1    5000                 6     0.001              5

What are the Accuracy values of the individual 10-fold cross-validation resamples, and what are the mean Accuracy and Kappa values for each parameter combination?

(gbm_op$resample)$Accuracy       # Finds the Accuracy and Kappa values of the 10-fold cross-validation 
##   [1] 0.8478261 0.8509317 0.8447205 0.8509317 0.8416149 0.8478261 0.8322981
##   [8] 0.8509317 0.8447205 0.8478261 0.8447205 0.8416149 0.8385093 0.8478261
##  [15] 0.8291925 0.8385093 0.9068323 0.9068323 0.9068323 0.9068323 0.9068323
##  [22] 0.9068323 0.9068323 0.9099379 0.8975155 0.9037267 0.8944099 0.8975155
##  [29] 0.8913043 0.9006211 0.9006211 0.8913043 0.8823529 0.8823529 0.8823529
##  [36] 0.8792570 0.8730650 0.8792570 0.8730650 0.8823529 0.8792570 0.8823529
##  [43] 0.8823529 0.8823529 0.8606811 0.8730650 0.8637771 0.8792570 0.8691589
##  [50] 0.8722741 0.8691589 0.8753894 0.8753894 0.8691589 0.8691589 0.8629283
##  [57] 0.8660436 0.8660436 0.8660436 0.8566978 0.8535826 0.8598131 0.8535826
##  [64] 0.8629283 0.8850932 0.8850932 0.8850932 0.8819876 0.8850932 0.8913043
##  [71] 0.8850932 0.8788820 0.8850932 0.8819876 0.8819876 0.8788820 0.8819876
##  [78] 0.8819876 0.8819876 0.8819876 0.9040248 0.9009288 0.9071207 0.9009288
##  [85] 0.9133127 0.9071207 0.9133127 0.9071207 0.9102167 0.9040248 0.9040248
##  [92] 0.9040248 0.9071207 0.9133127 0.9040248 0.9133127 0.9099379 0.9068323
##  [99] 0.9068323 0.9037267 0.9068323 0.9068323 0.9068323 0.9099379 0.8819876
## [106] 0.8975155 0.8819876 0.8975155 0.8788820 0.8850932 0.8757764 0.8913043
## [113] 0.8878505 0.8878505 0.8909657 0.8878505 0.9003115 0.8847352 0.9034268
## [120] 0.8878505 0.8847352 0.9003115 0.8878505 0.8909657 0.8878505 0.8847352
## [127] 0.8691589 0.8816199 0.8540373 0.8602484 0.8478261 0.8602484 0.8602484
## [134] 0.8509317 0.8540373 0.8540373 0.8602484 0.8602484 0.8633540 0.8571429
## [141] 0.8447205 0.8633540 0.8447205 0.8633540 0.8819876 0.8881988 0.8819876
## [148] 0.8788820 0.8695652 0.8850932 0.8726708 0.8850932 0.8602484 0.8633540
## [155] 0.8602484 0.8602484 0.8509317 0.8571429 0.8633540 0.8664596
gbm_op$results                  # Finds the mean of different metrics of cross-validation
##    shrinkage interaction.depth n.minobsinnode n.trees  Accuracy     Kappa
## 1      0.001                 6              5    5000 0.8841543 0.7527226
## 3      0.001                 6             10    5000 0.8826034 0.7495313
## 9      0.010                 6              5    5000 0.8807391 0.7459046
## 11     0.010                 6             10    5000 0.8766961 0.7370299
## 5      0.001                13              5    5000 0.8829092 0.7497149
## 7      0.001                13             10    5000 0.8829072 0.7493070
## 13     0.010                13              5    5000 0.8766951 0.7381968
## 15     0.010                13             10    5000 0.8770037 0.7382655
## 2      0.001                 6              5   10000 0.8829101 0.7499231
## 4      0.001                 6             10   10000 0.8822890 0.7481405
## 10     0.010                 6              5   10000 0.8770066 0.7385153
## 12     0.010                 6             10   10000 0.8766980 0.7377652
## 6      0.001                13              5   10000 0.8832265 0.7504684
## 8      0.001                13             10   10000 0.8816727 0.7471814
## 14     0.010                13              5   10000 0.8695570 0.7236796
## 16     0.010                13             10   10000 0.8686195 0.7214006
##    AccuracySD    KappaSD
## 1  0.01873939 0.04047873
## 3  0.01818698 0.03920598
## 9  0.02044145 0.04455086
## 11 0.02144147 0.04668073
## 5  0.02176952 0.04690944
## 7  0.02198079 0.04772389
## 13 0.02046232 0.04425314
## 15 0.02042876 0.04443814
## 2  0.02116748 0.04564782
## 4  0.02279398 0.04929545
## 10 0.01944444 0.04220321
## 12 0.01781855 0.03867984
## 6  0.02339480 0.05009626
## 8  0.02638461 0.05642846
## 14 0.02290508 0.04903168
## 16 0.02335900 0.05023024

Create model labels from the resample values and draw a comparison boxplot of the cross-validation accuracy.

Model <- c()
for (j in 1:nrow(gbm_op$resample)) {
  
      k1 = ifelse(gbm_op$resample[j, 1] == 0.01, '01', '001')   # shrinkage label
      k4 = ifelse(gbm_op$resample[j, 4] == 10000, '10k', '5k')  # n.trees label
       
      key <- paste(k1, gbm_op$resample[j,2],
                   gbm_op$resample[j,3], k4 , sep='-')
      Model <- c(Model, key)
}

cv_result <- data.frame(Model=Model, Accuracy =(gbm_op$resample)[, 'Accuracy'])


library(ggplot2)



ggplot(cv_result, aes(y = Accuracy, x = Model)) +
  geom_boxplot() +
  xlab("Model") +
  ylab("Accuracy") +
  stat_summary(fun = mean, shape = 3, col = 'red', geom = 'point') +
  ggtitle("Boxplot of Cross Validation Accuracy over 10 Runs for GBM") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()

Refit a gbm model using a single, fixed combination of tuning parameters.

trainControl <- trainControl(method = "cv",
                          number = 10,
                          search = "grid")

tuneGrid <- expand.grid(
  interaction.depth = c(13),
  shrinkage = c(0.01),
  n.minobsinnode = c(5),
  n.trees = c(10000))
op_boost <- train(yesno ~.,
                  data = training,
                  method = "gbm",
                  tuneGrid = tuneGrid,
                  trControl = trainControl,
                  verbose = FALSE )

Use the model from the train() function to make predictions and display the error.

pre_gbm <- predict(op_boost, newdata=testing)
mean(pre_gbm != testing$yesno)   # test misclassification rate
## [1] 0.141202

We get a test error of \(0.1412\), or \(14.12\%\), when running this model on the test data, slightly higher than the \(12.31\%\) error of the grid-searched model above.