Boosting, along with random forests, is one of the most accurate out-of-the-box classifiers you can use.
It is much like bagging, but the trees are generated sequentially: each tree is grown using information from the previously grown trees.
Also, bootstrap samples are not used in boosting; instead, each tree is fit on a modified version of the original data set.
The basic idea is to take a large number of possibly weak predictors, weight them in a way that takes advantage of their strengths, and add them up.
Weighting and adding them up is similar in spirit to bagging for regression trees, or to random forests, where we take a large number of classifiers and average them; by averaging them together we get a stronger predictor.
The boosting approach learns slowly.
Given the current model, we fit a decision tree to the residuals from that model.
That is, we fit a tree using the current residuals rather than the outcome variable y.
We then add this new decision tree into the fitted function and update the residuals.
In general, statistical learning methods that learn slowly tend to perform better.
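To make the residual-fitting idea concrete, here is a minimal sketch of boosting for regression, assuming the rpart package is available; the function name boost_regression and its defaults are illustrative, not part of any library.
library(rpart)
boost_regression = function(x, y, B = 100, lambda = 0.01, d = 1) {
  fhat  = rep(0, length(y))   # current fitted values, start at zero
  resid = y                   # residuals start as the outcome itself
  trees = vector("list", B)
  for (b in 1:B) {
    df = data.frame(x, r = resid)
    # fit a small tree (at most d levels of splits) to the current residuals, not to y
    trees[[b]] = rpart(r ~ ., data = df, maxdepth = d, cp = 0)
    # shrink the new tree's contribution, then update the fit and the residuals
    step  = lambda * predict(trees[[b]], df)
    fhat  = fhat + step
    resid = resid - step
  }
  list(trees = trees, lambda = lambda, fitted = fhat)
}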
Unlike in bagging, the construction of each tree depends strongly on the trees that have already been grown.
Similar to bagging, boosting uses voting/averaging to combine the output of individual models of the same type.
But in bagging each model is weighted equally, while boosting favors new models that address the instances handled incorrectly by previous models.
The way the models are built also differs: in bagging the models are built independently, while in boosting each model is built using knowledge gained from the previous models.
In boosting, the final weight of each model is determined by its performance.
The most famous boosting algorithm is probably AdaBoost.
Boosting has three tuning parameters.
The number of trees B: unlike bagging and random forests, boosting can overfit if B is too large; cross-validation can be used to select B.
The shrinkage parameter lambda, a small positive number: this controls the rate at which boosting learns. Typical values are 0.01 or 0.001, depending on the problem; a very small lambda requires a very large B to achieve good performance.
The number of splits in each tree, d: this controls the complexity of the boosted ensemble. Often d = 1 (a single split, i.e. a stump) works well. A sketch of tuning these parameters with caret is shown after the default fit below.
library(ISLR)
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
data(Wage)
names(Wage) = tolower(names(Wage))
head(Wage)
## year age sex maritl race education
## 231655 2006 18 1. Male 1. Never Married 1. White 1. < HS Grad
## 86582 2004 24 1. Male 1. Never Married 1. White 4. College Grad
## 161300 2003 45 1. Male 2. Married 1. White 3. Some College
## 155159 2003 43 1. Male 2. Married 3. Asian 4. College Grad
## 11443 2005 50 1. Male 4. Divorced 1. White 2. HS Grad
## 376662 2008 54 1. Male 2. Married 1. White 4. College Grad
## region jobclass health health_ins
## 231655 2. Middle Atlantic 1. Industrial 1. <=Good 2. No
## 86582 2. Middle Atlantic 2. Information 2. >=Very Good 2. No
## 161300 2. Middle Atlantic 1. Industrial 1. <=Good 1. Yes
## 155159 2. Middle Atlantic 2. Information 2. >=Very Good 1. Yes
## 11443 2. Middle Atlantic 2. Information 1. <=Good 1. Yes
## 376662 2. Middle Atlantic 2. Information 2. >=Very Good 1. Yes
## logwage wage
## 231655 4.318063 75.04315
## 86582 4.255273 70.47602
## 161300 4.875061 130.98218
## 155159 5.041393 154.68529
## 11443 4.318063 75.04315
## 376662 4.845098 127.11574
names(Wage)
## [1] "year" "age" "sex" "maritl" "race"
## [6] "education" "region" "jobclass" "health" "health_ins"
## [11] "logwage" "wage"
dim(Wage)
## [1] 3000 12
# remove logwage, since it is just a log transform of the outcome wage:
Wage = subset(Wage, select=-c(logwage))
# 70/30 train/test split (calling set.seed() beforehand would make it reproducible)
index = createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
train = Wage[index,]
test = Wage[-index,]
dim(train)
## [1] 2102 11
dim(test)
## [1] 898 11
modfit = train(wage ~., data=train, method = "gbm", verbose = FALSE) # verbose = FALSE suppresses gbm's progress output
## Loading required package: gbm
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
modfit
## Stochastic Gradient Boosting
##
## 2102 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared
## 1 50 35.04887 0.2973296
## 1 100 34.50768 0.3093947
## 1 150 34.45017 0.3108135
## 2 50 34.49474 0.3113621
## 2 100 34.33434 0.3151927
## 2 150 34.39562 0.3132805
## 3 50 34.35256 0.3152163
## 3 100 34.45733 0.3110375
## 3 150 34.65550 0.3050664
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 100,
## interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
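Here caret tuned only n.trees and interaction.depth over its default grid, holding shrinkage and n.minobsinnode fixed. Below is a minimal sketch, assuming we want to tune B (n.trees), lambda (shrinkage) and d (interaction.depth) ourselves with 5-fold cross-validation; the grid values and the object names gbm.grid, ctrl and modfit.tuned are illustrative.
# illustrative tuning grid for gbm via caret; the values are assumptions, not recommendations
gbm.grid = expand.grid(n.trees = c(100, 500, 1000),
                       interaction.depth = c(1, 2, 3),
                       shrinkage = c(0.1, 0.01),
                       n.minobsinnode = 10)
ctrl = trainControl(method = "cv", number = 5)  # switch resampling to 5-fold CV
modfit.tuned = train(wage ~ ., data = train, method = "gbm",
                     trControl = ctrl, tuneGrid = gbm.grid, verbose = FALSE)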
pred.wage = predict(modfit, test)
# test-set mean squared error
test.MSE = round(mean((pred.wage - test$wage)^2),2)
# root mean squared error (RMSE) on the test set
model.SE = sqrt(test.MSE)
model.SE
## [1] 32.75317
model.SE is the test-set root mean squared error: this model's predictions are typically off by about 32.75 (the value of model.SE) from the true wage value.
qplot(pred.wage, wage, data=test,
xlab = "Predicted wage",
ylab = "Observed wage from test-set")
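As an optional check (an addition, not part of the original plot), a 45-degree reference line can be overlaid so that perfectly calibrated predictions would fall on the line:
qplot(pred.wage, wage, data=test,
      xlab = "Predicted wage",
      ylab = "Observed wage from test-set") +
  geom_abline(intercept = 0, slope = 1, colour = "red")  # y = x reference line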