Boosting, along with random forests, is one of the most accurate out-of-the-box classifiers that we can use. In general, in boosting we take a large number of possibly weak predictors, weight them in a way that takes advantage of their strengths, and add them up. Weighting and adding them up is the same kind of idea we used with bagging for regression trees and with random forests: we take a large number of classifiers and average them, and by averaging them together we get a stronger predictor.
Probably the most famous boosting algorithm is AdaBoost. More information and a tutorial on boosting are available at http://webee.technion.ac.il/people/rmeir/BoostingTutorial.pdf
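To make the "weight them and add them up" idea concrete, here is a minimal AdaBoost-style sketch in base R that uses one-split decision stumps as the weak classifiers. It is an illustration under simplifying assumptions (two classes coded as -1/+1, exhaustive threshold search), not the algorithm that the ‘gbm’ method runs. The function names fit_stump, predict_stump, adaboost_fit and adaboost_predict are hypothetical helpers written for this example.

```r
# Find the stump (feature, threshold, direction) with the lowest weighted error.
fit_stump <- function(X, y, w) {
  best <- list(err = Inf)
  for (j in seq_len(ncol(X))) {
    for (thr in unique(X[, j])) {
      for (sgn in c(1, -1)) {
        pred <- ifelse(X[, j] <= thr, sgn, -sgn)
        err <- sum(w[pred != y])
        if (err < best$err) {
          best <- list(err = err, feature = j, threshold = thr, sign = sgn)
        }
      }
    }
  }
  best
}

# Predict -1 / +1 with a single stump.
predict_stump <- function(stump, X) {
  ifelse(X[, stump$feature] <= stump$threshold, stump$sign, -stump$sign)
}

# Fit n_rounds stumps, re-weighting the observations after each round.
adaboost_fit <- function(X, y, n_rounds = 20) {
  n <- nrow(X)
  w <- rep(1 / n, n)                        # start with uniform weights
  stumps <- vector("list", n_rounds)
  alphas <- numeric(n_rounds)
  for (m in seq_len(n_rounds)) {
    stump <- fit_stump(X, y, w)
    err <- min(max(stump$err, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha <- 0.5 * log((1 - err) / err)     # weight of this weak classifier
    pred <- predict_stump(stump, X)
    w <- w * exp(-alpha * y * pred)         # up-weight misclassified points
    w <- w / sum(w)
    stumps[[m]] <- stump
    alphas[m] <- alpha
  }
  list(stumps = stumps, alphas = alphas)
}

# Final prediction: sign of the weighted sum of the stump votes.
adaboost_predict <- function(model, X) {
  score <- rep(0, nrow(X))
  for (m in seq_along(model$stumps)) {
    score <- score + model$alphas[m] * predict_stump(model$stumps[[m]], X)
  }
  ifelse(score >= 0, 1, -1)
}

# Toy usage: versicolor vs. virginica from the built-in iris data.
two <- iris[iris$Species != "setosa", ]
X <- as.matrix(two[, 1:4])
y <- ifelse(two$Species == "virginica", 1, -1)
model <- adaboost_fit(X, y, n_rounds = 20)
train_acc <- mean(adaboost_predict(model, X) == y)   # boosted ensemble
stump_acc <- mean(predict_stump(model$stumps[[1]], X) == y)  # single stump
```

Comparing train_acc with stump_acc shows how much the weighted combination gains over the single best stump on the training data. A real analysis would measure this on held-out data instead.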
Boosting in R
library(ISLR)
data(Wage)
library(ggplot2)
library(caret)
Wage<-subset(Wage, select = -c(logwage))
inTrain<-createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training<-Wage[inTrain,]
testing<-Wage[-inTrain,]
dim(training)
## [1] 2102 10
dim(testing)
## [1] 898 10
Let’s fit the model. The outcome is the ‘wage’ variable, and we model it against all the other variables. We use the ‘gbm’ method, which does boosting with trees, on the ‘training’ dataset. We set ‘verbose=FALSE’ because the ‘gbm’ method otherwise produces a lot of output.
modFit<-train(wage~., method="gbm", data=training, verbose=FALSE)
print(modFit)
## Stochastic Gradient Boosting
##
## 2102 samples
## 9 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 50 34.99234 0.3255888 23.61795
## 1 100 34.45784 0.3327618 23.30938
## 1 150 34.40645 0.3327590 23.35266
## 2 50 34.47642 0.3330006 23.30206
## 2 100 34.44489 0.3307684 23.40489
## 2 150 34.56962 0.3262456 23.52397
## 3 50 34.41068 0.3330270 23.30010
## 3 100 34.65371 0.3232172 23.54426
## 3 150 34.89181 0.3156747 23.76426
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 1, shrinkage = 0.1 and n.minobsinnode = 10.
When we print the model fit we can see that there are 2102 data points and 9 predictors. The resampling used the bootstrap (25 repetitions), and each resample contained 2102 data points. Several numbers of trees and several interaction depths were tried. For example, the first row used an interaction depth of 1 and 50 trees; it had a root mean squared error (RMSE) of 34.99234, an R-squared value of 0.3255888, and a mean absolute error (MAE) of 23.61795. The combination with the smallest RMSE (n.trees = 150, interaction.depth = 1) was selected as the final model. Now, let’s plot the predictions for the testing set from the model fit on the training set.
qplot(predict(modFit, testing), wage, data=testing)
And we can see we get a reasonably good prediction, although there still seems to be a lot of variability there.
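The variability in the plot can also be put into numbers by computing on the testing set the same three metrics that the resampling table reports. The sketch below defines a hypothetical helper, regression_metrics, in base R and runs it on made-up toy vectors that stand in for the real predictions and wages. As far as I know, caret's postResample(pred, obs) reports the same three quantities.

```r
# Hypothetical helper: RMSE, R-squared and MAE for predicted vs. observed values.
regression_metrics <- function(pred, obs) {
  c(RMSE = sqrt(mean((pred - obs)^2)),
    Rsquared = cor(pred, obs)^2,   # squared correlation, as in the table above
    MAE = mean(abs(pred - obs)))
}

# Toy vectors (made up) standing in for predict(modFit, testing) and testing$wage.
obs  <- c(100, 120, 80, 150, 95)
pred <- c(110, 115, 90, 130, 100)
m <- regression_metrics(pred, obs)
print(m)

# With the fitted model from this section one would call:
# regression_metrics(predict(modFit, testing), testing$wage)
```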
But the basic idea for fitting a boosted tree, or a boosted algorithm in general, is to take weak predictors and average them together with weights in order to get a better predictor.
The basic ideas behind this model are:
PROS
CONS
Let’s see an example of how it works in practice. We are using the ‘iris’ data again.
data(iris)
library(ggplot2)
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
Let’s now break the data into training and testing sets.
inTrain<-createDataPartition(y=iris$Species, p=0.7, list=FALSE)
training<-iris[inTrain, ]
testing<-iris[-inTrain, ]
dim(training)
## [1] 105 5
dim(testing)
## [1] 45 5
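The 105/45 split comes from sampling within each species, so every class keeps the same 70/30 proportion. The base R sketch below imitates that stratified sampling by hand. It is an approximation for illustration only, not the actual implementation of createDataPartition, and the names idx_by_class, train_sketch and test_sketch are made up for this example.

```r
set.seed(123)  # arbitrary seed, used only to make the sketch reproducible

# Group the row indices by species, then sample 70% within each group.
idx_by_class <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}), use.names = FALSE)

train_sketch <- iris[train_idx, ]
test_sketch  <- iris[-train_idx, ]

# Each species contributes the same number of rows to the training set.
print(table(train_sketch$Species))
print(dim(test_sketch))
```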
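Now, let’s build the predictions using the training set and test them on the testing set. We fit a linear discriminant analysis model (‘lda’) and a naive Bayes model (‘nb’), then cross-tabulate their predictions.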
library(caret)
modlda<-train(Species~., data=training, method="lda")
modnb<-train(Species~., data=training, method="nb")
plda<-predict(modlda, testing)
pnb<-predict(modnb, testing)
table(plda, pnb)
## pnb
## plda setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 0
## virginica 0 1 15
Now we are going to compare the results:
equalPredictions<-(plda==pnb)
qplot(Petal.Width, Sepal.Width, colour=equalPredictions, data=testing)
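The agreement between the two models can also be summarised as a single number. The sketch below rebuilds the cross-tabulation printed earlier by table(plda, pnb) as a plain matrix, with the counts copied from that output, and computes the share of test cases on which the two models agree. The variable names agreement, agree_rate and n_disagree are chosen for this example.

```r
# Counts copied from the table(plda, pnb) output shown earlier.
species <- c("setosa", "versicolor", "virginica")
agreement <- matrix(c(15,  0,  0,
                       0, 14,  0,
                       0,  1, 15),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(plda = species, pnb = species))

# Diagonal cells are the cases where both models predict the same species.
agree_rate <- sum(diag(agreement)) / sum(agreement)
n_disagree <- sum(agreement) - sum(diag(agreement))
print(agree_rate)
print(n_disagree)

# With the live objects from this section the same rate would be:
# mean(plda == pnb)
```

Agreement between two models says nothing about whether either is correct. Accuracy would need a comparison against testing$Species.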