Boosting is a class of ensemble learning techniques for regression and classification problems. Boosting aims to combine a set of weak learners (i.e. predictive models that are only slightly better than random chance) into one ‘strong’ learner (i.e. a predictive model that predicts the response variable with a high degree of accuracy). Gradient Boosting is a boosting method that builds this ensemble by optimising an arbitrary differentiable cost function (for example, squared error).
In essence, the algorithm is an iterative process: start with an initial model F1 fitted to the response, then at each iteration fit a weak learner to the residuals of the current model and add it on:
F1(x) ≈ y
h1(x) ≈ y - F1(x)
F2(x) = F1(x) + h1(x)
The updated model F2 then plays the role of F1 in the next iteration.
It should be evident that as the algorithm moves through its iterations, the combined model gets progressively stronger. Note also that gradient boosting places no restriction on the type of weak learner used; in practice, however, the weak learners are almost always shallow decision trees.
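To make these steps concrete, here is a minimal toy sketch of the residual-fitting loop in R, using shallow rpart trees as the weak learners on a squared-error problem. This loop is my own illustration of the idea above, not the implementation inside the gbm package; I have also included a shrinkage factor on each update, as gbm does.
library(rpart)
#Simulated regression data for the toy example
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
toy <- data.frame(x = x, y = y)
#Start from a constant model F1(x), then repeatedly fit a shallow tree to the residuals
pred <- rep(mean(toy$y), nrow(toy))
shrinkage <- 0.1
for (m in 1:100) {
  toy$resid <- toy$y - pred                                                  #residuals y - F(x)
  h <- rpart(resid ~ x, data = toy, control = rpart.control(maxdepth = 2))   #weak learner h(x)
  pred <- pred + shrinkage * predict(h, toy)                                 #F(x) <- F(x) + shrinkage * h(x)
}
mean((toy$y - pred)^2) #training mean squared error after 100 boosting rounds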
There are numerous packages you can use to build gradient boosting machines in R. Personally, I like to use the caret package; under the hood caret fits the model via the gbm package, so the underlying algorithm is the same.
In the example below I use the iris data set and attempt to predict the class “Species” from all of the other columns in the data set. I have also used repeated cross-validation and additional tuning parameters to demonstrate the functionality of the caret function train().
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
data <- iris
gbmTrain <- data[sample(nrow(data), round(nrow(data)*0.9), replace = F),]
#Here I have randomly sampled 90% of the data to form a training set. The remaining 10% could be held out as a test set to check that the model is not overfitting the training data (see the discussion below).
#This creates the tuning grid. Make sure the column names match the gbm hyperparameter names exactly. Hyperparameters are essentially the 'settings' of the algorithm.
grid <- expand.grid(n.trees = c(1000,1500), interaction.depth=c(1:3), shrinkage=c(0.01,0.05,0.1), n.minobsinnode=c(20))
#This creates the train control. In this example I am using repeated k-fold cross-validation with k = 5, repeated 2 times, and parallel processing allowed.
ctrl <- trainControl(method = "repeatedcv",number = 5, repeats = 2, allowParallel = T)
#Register parallel cores
registerDoParallel(detectCores()-1)
#build model
set.seed(124) #for reproducibility
unwantedoutput <- capture.output(GBMModel <- train(Species~.,data = gbmTrain,
method = "gbm", trControl = ctrl, tuneGrid = grid))
#Note that the "capture.output" function has been used here to avoid pages of output being displayed in the vignette, making it unreadable.
print(GBMModel)
## Stochastic Gradient Boosting
##
## 135 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 108, 108, 108, 109, 107, 108, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.trees Accuracy Kappa
## 0.01 1 1000 0.9449532 0.9173370
## 0.01 1 1500 0.9449532 0.9173370
## 0.01 2 1000 0.9412495 0.9117815
## 0.01 2 1500 0.9449532 0.9173370
## 0.01 3 1000 0.9450956 0.9175465
## 0.01 3 1500 0.9487993 0.9231020
## 0.05 1 1000 0.9411070 0.9115465
## 0.05 1 1500 0.9374033 0.9059909
## 0.05 2 1000 0.9335572 0.9001745
## 0.05 2 1500 0.9335572 0.9001744
## 0.05 3 1000 0.9411070 0.9115465
## 0.05 3 1500 0.9372609 0.9056780
## 0.10 1 1000 0.9335572 0.9001744
## 0.10 1 1500 0.9334147 0.8998874
## 0.10 2 1000 0.9334147 0.8998874
## 0.10 2 1500 0.9297110 0.8943319
## 0.10 3 1000 0.9258649 0.8885155
## 0.10 3 1500 0.9334147 0.8998874
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 20
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 1500,
## interaction.depth = 3, shrinkage = 0.01 and n.minobsinnode = 20.
confusionMatrix(GBMModel)
## Cross-Validated (5 fold, repeated 2 times) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction setosa versicolor virginica
## setosa 32.6 0.0 0.0
## versicolor 0.0 31.5 1.9
## virginica 0.0 3.3 30.7
##
## Accuracy (average) : 0.9481
The code follows this process:
1. Split the data into a training and a test set. Note that I haven't actually created a test set here, as the point of this article is to demonstrate the use of the algorithm; if I were to complete the modelling process, I would predict against the test set and compare the results to those from the training set (a sketch of this is given after this list).
2. Generate a tuning grid. The tuning grid is simply a data frame of every combination of the hyperparameters listed. It is passed to the train() function as an argument, which tells train() to build a model for each entry in the grid and then select the one that performs best on an evaluation metric. You can choose that metric with the metric argument of train() (see the example after this list).
3. Generate the train control design. I have used a repeated k-fold cross-validation design, which re-runs the analysis on different random subsets of the training data to check that the model generalises. Note that I have also set allowParallel = T so that the multiple models described above are fitted in parallel; the next line registers the number of cores that R is to use.
4. Run the model using train(). Put the class you wish to predict (i.e. Species) on the left of the formula, and use the . notation to include every other variable in the data frame as a predictor. For a gradient boosted model, set the method to "gbm"; other available models are listed here: https://topepo.github.io/caret/available-models.html
5. Print the model. This lets you see all of the different models that were run, along with their performance metrics. train() chooses the final model for you based on the evaluation metric you set in train() (or the default metric if you did not specify one).
6. Print the confusion matrix. The confusion matrix is a cross-tabulation of the predicted class versus the true class. It gives you insight not only into the overall performance of the algorithm, but also into its relative performance at differentiating the individual classes.
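To illustrate the metric argument mentioned in step 2, the call below is a variation of my own on the train() call above (not part of the original code). It selects the best grid entry by Cohen's kappa rather than the default accuracy, and passes verbose = FALSE through to gbm to suppress its iteration log.
GBMModelKappa <- train(Species~., data = gbmTrain, method = "gbm",
                       trControl = ctrl, tuneGrid = grid,
                       metric = "Kappa", verbose = FALSE)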
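As a follow-up to step 1, here is a minimal sketch of how the held-out 10% could be scored. This is my own addition rather than part of the original vignette; trainIndex and gbmTest are illustrative names, and the point is simply that the sampled row indices need to be kept at split time so the test rows are guaranteed to be disjoint from gbmTrain.
#Redo the split keeping the sampled row indices so the unused 10% forms a true test set
set.seed(124)
trainIndex <- sample(nrow(data), round(nrow(data)*0.9), replace = F)
gbmTrain <- data[trainIndex,]
gbmTest <- data[-trainIndex,]
#...refit GBMModel on gbmTrain as above, then predict on the unseen rows
testPred <- predict(GBMModel, newdata = gbmTest)
confusionMatrix(testPred, gbmTest$Species)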