Boosting:

  1. Take multiple weak predictors.
  2. Weight them.
  3. Add them up to get a stronger predictor.

Boosting, along with random forests, is one of the most accurate out-of-the-box classifiers you can use.

Boosting works much like bagging, but the trees are grown sequentially: each tree is grown using information from the trees that have already been grown.

Also, bootstrap sampling is not used in boosting. Instead, each tree is fit on a modified version of the original data set.

The basic Idea:

The basic idea is to take a large number of possibly weak predictors, weight them in a way that takes advantage of their strengths, and add them up.

When we weight them and add them up, we are doing much the same thing as with bagging for regression trees, or with random forests, where we take a large number of classifiers and average them. By averaging them together, we get a stronger predictor.
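In the regression-tree setting used later in this note, that weighted sum is literally the model: the boosted predictor is f_hat(x) = sum_{b = 1}^{B} lambda * f_hat_b(x), where each f_hat_b is a small tree and lambda is the shrinkage (weight) applied to every tree.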

HOW IT WORKS:

The boosting approach learns slowly.

Given the current model, we fit a decision tree to the residuals from that model.

That is, we fit a tree using the current residuals rather than the outcome variable y.

We then add this new decision tree into the fitted function and update the residuals.

In general, statistical learning methods that learn slowly tend to perform better.
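To make the residual-fitting idea concrete, here is a minimal sketch in R (using rpart stumps and made-up values for B and lambda; this is only an illustration of the algorithm, not the code used in the worked example below):

library(rpart)

boost_regression = function(x, y, B = 100, lambda = 0.01, d = 1) {
  fhat  = rep(0, length(y))     # current fitted function, starts at zero
  r     = y                     # residuals start out equal to the outcome
  trees = vector("list", B)
  for (b in 1:B) {
    dat = data.frame(x, r = r)
    # fit a small tree (depth at most d) to the current residuals, not to y
    trees[[b]] = rpart(r ~ ., data = dat, maxdepth = d, cp = 0)
    # add a shrunken version of the new tree to the fitted function ...
    fhat = fhat + lambda * predict(trees[[b]], dat)
    # ... and update the residuals
    r = r - lambda * predict(trees[[b]], dat)
  }
  list(trees = trees, fitted = fhat, lambda = lambda)
}

Predictions for new data are then the same shrunken sum of the B tree predictions, each multiplied by lambda.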


Unlike in bagging, the construction of each tree depends strongly on the trees that have already been grown.

Similar to bagging, boosting uses voting/averaging to combine the output of individual models of the same type.

But in bagging each model is weighted equally, while boosting favors new models that address the instances handled incorrectly by previous models.

The way models are built also differs: in bagging, models are built independently, while in boosting each model is built using knowledge gained from the previous models.

In boosting, the final weight of each model is given by its performance.


The most famous boosting algorithm is probably AdaBoost.
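As a rough sketch of that idea (not a faithful or efficient implementation), a two-class AdaBoost-style loop re-weights the observations after each tree and weights each tree by its performance. Something along these lines, assuming the outcome y is coded as -1/+1 and using rpart stumps (no guard for a weighted error of 0 or 0.5; purely illustrative):

library(rpart)

adaboost_sketch = function(x, y, B = 50) {          # y coded as -1 / +1
  n = length(y)
  w = rep(1 / n, n)                                 # start with equal observation weights
  alpha  = numeric(B)
  stumps = vector("list", B)
  dat = data.frame(x, y = factor(y))
  for (b in 1:B) {
    # fit a stump, giving more weight to previously misclassified points
    stumps[[b]] = rpart(y ~ ., data = dat, weights = w, maxdepth = 1)
    pred = ifelse(predict(stumps[[b]], dat, type = "class") == "1", 1, -1)
    err  = sum(w * (pred != y)) / sum(w)            # weighted error rate
    alpha[b] = 0.5 * log((1 - err) / err)           # tree weight, given by its performance
    w = w * exp(-alpha[b] * y * pred)               # up-weight the misclassified observations
    w = w / sum(w)
  }
  list(stumps = stumps, alpha = alpha)              # final prediction: sign of the weighted vote
}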

Boosting has three tuning parameters:

  1. The number of trees B. Unlike bagging and random forests, boosting can overfit if B is too large; cross-validation can be used to select B.

  2. The shrinkage parameter lambda, a small positive number that controls the rate at which boosting learns. Typical values are 0.01 or 0.001, depending on the problem. A very small lambda requires a very large B to achieve good performance.

  3. The number of splits in each tree, d, which controls the complexity of the boosted ensemble. Often d = 1 (a single split, i.e. a stump) works well.

In boosting, because the growth of a particular tree takes into account the other trees that have already been grown, smaller trees are typically sufficient. Using smaller trees also helps interpretability.
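For reference, these three tuning parameters map directly onto arguments of gbm() in the gbm package. A sketch only, using the same Wage data that the worked example below loads; the parameter values here are illustrative, not tuned:

library(ISLR)
library(gbm)
data(Wage)

# B = n.trees, lambda = shrinkage, d = interaction.depth
boost.wage = gbm(wage ~ ., data = subset(Wage, select = -logwage),
                 distribution = "gaussian",   # squared-error loss for a continuous outcome
                 n.trees = 1000,              # B: number of trees
                 shrinkage = 0.01,            # lambda: learning rate
                 interaction.depth = 1)       # d: one split per tree (stumps)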
library(ISLR)
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
data(Wage)

names(Wage) = tolower(names(Wage))
head(Wage)
##        year age     sex           maritl     race       education
## 231655 2006  18 1. Male 1. Never Married 1. White    1. < HS Grad
## 86582  2004  24 1. Male 1. Never Married 1. White 4. College Grad
## 161300 2003  45 1. Male       2. Married 1. White 3. Some College
## 155159 2003  43 1. Male       2. Married 3. Asian 4. College Grad
## 11443  2005  50 1. Male      4. Divorced 1. White      2. HS Grad
## 376662 2008  54 1. Male       2. Married 1. White 4. College Grad
##                    region       jobclass         health health_ins
## 231655 2. Middle Atlantic  1. Industrial      1. <=Good      2. No
## 86582  2. Middle Atlantic 2. Information 2. >=Very Good      2. No
## 161300 2. Middle Atlantic  1. Industrial      1. <=Good     1. Yes
## 155159 2. Middle Atlantic 2. Information 2. >=Very Good     1. Yes
## 11443  2. Middle Atlantic 2. Information      1. <=Good     1. Yes
## 376662 2. Middle Atlantic 2. Information 2. >=Very Good     1. Yes
##         logwage      wage
## 231655 4.318063  75.04315
## 86582  4.255273  70.47602
## 161300 4.875061 130.98218
## 155159 5.041393 154.68529
## 11443  4.318063  75.04315
## 376662 4.845098 127.11574
names(Wage)
##  [1] "year"       "age"        "sex"        "maritl"     "race"      
##  [6] "education"  "region"     "jobclass"   "health"     "health_ins"
## [11] "logwage"    "wage"
dim(Wage)
## [1] 3000   12
# remove logwage, the log of the outcome, so it does not leak into the predictors:
Wage = subset(Wage, select=-c(logwage))

index = createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
train = Wage[index,]
test = Wage[-index,]

dim(train)
## [1] 2102   11
dim(test)
## [1] 898  11
modfit = train(wage ~., data=train, method = "gbm", verbose = FALSE) # verbose = FALSE suppresses gbm's fitting output
## Loading required package: gbm
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
modfit
## Stochastic Gradient Boosting 
## 
## 2102 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared 
##   1                   50      35.04887  0.2973296
##   1                  100      34.50768  0.3093947
##   1                  150      34.45017  0.3108135
##   2                   50      34.49474  0.3113621
##   2                  100      34.33434  0.3151927
##   2                  150      34.39562  0.3132805
##   3                   50      34.35256  0.3152163
##   3                  100      34.45733  0.3110375
##   3                  150      34.65550  0.3050664
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
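Note that caret's default grid only varied n.trees and interaction.depth, holding the shrinkage fixed at 0.1. To tune the shrinkage (and the other parameters) explicitly, one option is to pass your own grid to train(); a sketch, not run here:

gbm.grid = expand.grid(n.trees = c(100, 500, 1000),
                       interaction.depth = 1:3,
                       shrinkage = c(0.1, 0.01),
                       n.minobsinnode = 10)

modfit2 = train(wage ~ ., data = train, method = "gbm",
                tuneGrid = gbm.grid, verbose = FALSE)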
pred.wage = predict(modfit, test)
test.MSE = round(mean((pred.wage - test$wage)^2),2)  # test-set mean squared error
model.SE = sqrt(test.MSE)                            # root mean squared error (RMSE)
model.SE 
## [1] 32.75317

model.SE: this is the test-set root mean squared error, so the model's predictions are, roughly speaking, within about model.SE of the true wage values.

qplot(pred.wage, wage, data=test,
      xlab = "Predicted wage",
      ylab = "Observed wage from test-set")