Boosting

Boosting, along with random forests, is one of the most accurate out-of-the-box classifiers we can use. The general idea of boosting is to take a large number of possibly weak predictors, weight them in a way that takes advantage of their strengths, and add them up. Weighting and adding them up follows the same basic idea we used with bagging for regression trees, or with random forests: we take a large number of classifiers and average them together, and the averaged ensemble is a stronger predictor than any single one.

Probably the most famous boosting algorithm is AdaBoost. We can find more information and a tutorial on boosting at http://webee.technion.ac.il/people/rmeir/BoostingTutorial.pdf
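
To make the weighted-averaging idea concrete, here is a toy version of AdaBoost with decision stumps as the weak classifiers, written in base R on simulated data. This is only a sketch of the idea, not the implementation we use below; fitStump, adaboost, and predictAda are helper names made up for this illustration.

set.seed(1)
x<-runif(200)
y<-ifelse(x>0.5, 1, -1)                    # two classes coded -1/+1
flip<-sample(200, 20); y[flip]<- -y[flip]  # add some label noise

# Best single-split stump (threshold t, direction s) under weights w
fitStump<-function(x, y, w){
  best<-list(err=Inf)
  for(t in unique(x)) for(s in c(-1, 1)){
    pred<-ifelse(x>t, s, -s)
    err<-sum(w[pred!=y])                   # weighted misclassification error
    if(err<best$err) best<-list(t=t, s=s, err=err)
  }
  best
}

adaboost<-function(x, y, M=20){
  n<-length(x); w<-rep(1/n, n)
  stumps<-vector("list", M); alpha<-numeric(M)
  for(m in 1:M){
    st<-fitStump(x, y, w)
    pred<-ifelse(x>st$t, st$s, -st$s)
    alpha[m]<-0.5*log((1-st$err)/st$err)    # weight for this weak classifier
    w<-w*exp(-alpha[m]*y*pred); w<-w/sum(w) # upweight the points it missed
    stumps[[m]]<-st
  }
  list(stumps=stumps, alpha=alpha)
}

predictAda<-function(fit, x){
  f<-rep(0, length(x))
  for(m in seq_along(fit$alpha)){
    st<-fit$stumps[[m]]
    f<-f+fit$alpha[m]*ifelse(x>st$t, st$s, -st$s)
  }
  sign(f)                                  # weighted vote of the weak classifiers
}

fit<-adaboost(x, y)
mean(predictAda(fit, x)==y)                # training accuracy of the ensemble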

Boosting in R

  1. Boosting can be used with any subset of classifiers
  2. One large subclass is gradient boosting
  3. R has multiple boosting libraries. Differences include the choice of basic classification functions and combination rules. For example, gbm does boosting with trees.
  4. Most of these are available in the caret package
library(ISLR)
data(Wage)
library(ggplot2)
library(caret)
# Drop logwage, since it is just a transformation of the outcome we want to predict
Wage<-subset(Wage, select = -c(logwage))
# Split the data: 70% training, 30% testing
inTrain<-createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training<-Wage[inTrain,]
testing<-Wage[-inTrain,]
dim(training)
## [1] 2102   10
dim(testing)
## [1] 898  10

Let's fit the model. The outcome is the 'wage' variable, and we want to predict it from all the other variables. We use the 'gbm' method, which does boosting with trees, on the 'training' dataset. We set 'verbose=FALSE' because the 'gbm' method produces a lot of output otherwise.

modFit<-train(wage~., method="gbm", data=training, verbose=FALSE)
print(modFit)
## Stochastic Gradient Boosting 
## 
## 2102 samples
##    9 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   1                   50      34.99234  0.3255888  23.61795
##   1                  100      34.45784  0.3327618  23.30938
##   1                  150      34.40645  0.3327590  23.35266
##   2                   50      34.47642  0.3330006  23.30206
##   2                  100      34.44489  0.3307684  23.40489
##   2                  150      34.56962  0.3262456  23.52397
##   3                   50      34.41068  0.3330270  23.30010
##   3                  100      34.65371  0.3232172  23.54426
##   3                  150      34.89181  0.3156747  23.76426
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  1, shrinkage = 0.1 and n.minobsinnode = 10.

When we print the model fit we can see that there are 2102 samples and 9 predictors. The resampling was done by bootstrapping, with 2102 data points in each of the 25 resamples. Different numbers of trees and different interaction depths were tried. For example, the first combination had an interaction depth of 1 and 50 trees, giving a root mean squared error (RMSE) of 34.99234, an R-squared of 0.3255888, and a mean absolute error (MAE) of 23.61795. The combination with the smallest RMSE was selected as the final model: 150 trees with an interaction depth of 1.
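
caret tried its default grid of n.trees and interaction.depth values above. If we want to control the tuning ourselves, we can pass an explicit grid. Here is a sketch, which also swaps the default bootstrap for 5-fold cross-validation to speed things up:

gbmGrid<-expand.grid(n.trees=c(50, 100, 150),
                     interaction.depth=1:3,
                     shrinkage=0.1,
                     n.minobsinnode=10)
modFit2<-train(wage~., method="gbm", data=training, verbose=FALSE,
               tuneGrid=gbmGrid,
               trControl=trainControl(method="cv", number=5))

Now, let's plot the predicted wages for the testing set using the model we fit on the training set.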

qplot(predict(modFit, testing), wage, data=testing)

And we can see we get a reasonably good prediction, although there still seems to be a lot of variability there.
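
We can put a number on "reasonably good" by computing the test-set error. caret's RMSE function does this (it is equivalent to sqrt(mean((predictions - truth)^2))):

predWage<-predict(modFit, testing)
RMSE(predWage, testing$wage)    # root mean squared error on the held-out set
cor(predWage, testing$wage)^2   # a rough out-of-sample R-squared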

But the basic idea of fitting boosted trees, and of boosting algorithms in general, is to take weak classifiers and average them together with weights in order to get a better classifier.

Model Based Prediction

The basic ideas behind this model are:

  1. Assume the data follow a probabilistic model
  2. Use Bayes' theorem to identify optimal classifiers based on that probabilistic model (see the sketch after this list)
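
Concretely, Bayes' theorem gives P(Y = k | X = x) = pi_k f_k(x) / sum_l pi_l f_l(x), where pi_k is the prior probability of class k and f_k is the assumed density of the features within class k. Here is a minimal one-feature sketch assuming Gaussian class densities (the assumption behind LDA and Gaussian naive Bayes); 'posterior' is a helper written just for this illustration:

# Posterior class probabilities for a scalar x, assuming each class is
# Gaussian with mean and sd estimated from the training data
posterior<-function(x, feature, class){
  classes<-levels(class)
  prior<-sapply(classes, function(k) mean(class==k))           # pi_k
  lik<-sapply(classes, function(k)                             # f_k(x)
    dnorm(x, mean(feature[class==k]), sd(feature[class==k])))
  prior*lik/sum(prior*lik)                                     # Bayes' theorem
}

# Toy usage on simulated two-class data
set.seed(2)
f<-c(rnorm(50, 0), rnorm(50, 3))
cl<-factor(rep(c("a", "b"), each=50))
posterior(2.5, f, cl)   # should strongly favour class "b"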

PROS

  1. Can take advantage of the structure of the data
  2. May be computationally convenient
  3. Are reasonably accurate on real problems

CONS

  1. They make additional assumptions about the data. The assumptions don't have to be exactly satisfied for the prediction algorithm to work reasonably well, but if they are too far off, the model fails.
  2. When the model is incorrect we may get reduced accuracy

Let’s see an example of how this works in practice. We are using the ‘iris’ data again.

data(iris)
library(ggplot2)
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
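
Before fitting anything, it helps to look at the features. The three species separate fairly cleanly on the petal and sepal measurements, which is exactly the kind of structure a model-based classifier can exploit:

qplot(Petal.Width, Sepal.Width, colour=Species, data=iris)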

Let’s now break the data into training and testing sets.

inTrain<-createDataPartition(y=iris$Species, p=0.7, list=FALSE)
training<-iris[inTrain, ]
testing<-iris[-inTrain, ]
dim(training)
## [1] 105   5
dim(testing)
## [1] 45  5

Now, let's build the predictions using the training set and test them on the testing set.

library(caret)
modlda=train(Species~., data=training, method="lda")  # linear discriminant analysis
modnb=train(Species~., data=training, method="nb")    # naive Bayes (uses the klaR package)
plda=predict(modlda, testing)
pnb=predict(modnb, testing)
table(plda, pnb)
##             pnb
## plda         setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         14         0
##   virginica       0          1        15
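
The table above only compares the two models to each other. To measure actual accuracy, we can compare each set of predictions against the true species in the testing set using caret's confusionMatrix (output not shown here):

confusionMatrix(plda, testing$Species)
confusionMatrix(pnb, testing$Species)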

Now we are going to compare the two sets of predictions graphically:

equalPredictions=(plda==pnb)
qplot(Petal.Width, Sepal.Width, colour=equalPredictions, data=testing)
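
The cases where the two algorithms disagree tend to sit near the class boundary. We can pull out those rows to inspect them:

testing[plda!=pnb, c("Petal.Width", "Sepal.Width", "Species")]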