Tree-based models consist of one or more nested conditional statements on the predictors that partition the data. Within these partitions, a model is used to predict the outcome. In tree terminology, a tree with two splits, for example, partitions the data into three terminal nodes, or leaves. To get a prediction for a new sample, we follow the if-then statements defined by the tree, using the values of that sample’s predictors, until we reach a terminal node. The model formula in that terminal node is then used to generate the prediction.
To demonstrate various tree-based models, we will use the mlbench package, which contains the function mlbench.friedman1 for simulating nonlinear data.
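The training and test sets used throughout this post are not created in the code below; a minimal sketch of the setup assumed here (sample sizes, seed, and object names are illustrative choices, following the usual Friedman 1 simulation) is:
library(caret)
library(mlbench)

# Simulate the Friedman 1 benchmark data: the outcome depends nonlinearly
# on the first five predictors; the remaining five are uninformative noise.
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
trainingData$x <- data.frame(trainingData$x)   # convert predictor matrix to a data frame

testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)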
Regression trees partition a data set into smaller groups and then fit a simple model within each subgroup. Basic regression trees partition the data into groups that are more homogeneous with respect to the response. To achieve this, regression trees determine the predictor to split on and the value of that split, the depth or complexity of the tree, and the prediction equation in the terminal nodes.
The caret package implements the rpart method with cp as the tuning parameter. By default, caret prunes tree-based models; cp is the complexity parameter used by rpart to decide when to prune.
set.seed(317)
singletree.model <- train(x = trainingData$x,
                          y = trainingData$y,
                          method = "rpart",
                          tuneLength = 5,
                          trControl = trainControl(method = "cv"))
singletree.model
## CART
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.07534530 3.934818 0.37817606 3.180454
## 0.07900591 3.959539 0.37205821 3.197096
## 0.10653465 4.026705 0.33163441 3.239278
## 0.12310943 4.241378 0.26076200 3.458335
## 0.19745805 4.796923 0.09492617 3.965858
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.0753453.
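To see how the tuned tree performs on new data, one possible check (using the hypothetical testData object sketched earlier) is to predict on the held-out set and summarize performance with caret's postResample:
# Predict on the simulated test set and report RMSE, R-squared, and MAE
singletree.pred <- predict(singletree.model, newdata = testData$x)
postResample(pred = singletree.pred, obs = testData$y)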
One limitation of simple regression trees is that each terminal node uses the average of the training set outcomes in that node for prediction. As a consequence, these models may not do a good job predicting samples whose true outcomes are extremely high or low. The model tree approach differs from regression trees in that the splitting criterion is different, the terminal nodes predict the outcome using a linear model, and the prediction for a new sample is often a combination of the predictions from the models along its path through the tree.
To tune a model tree, the train function in the caret package offers method = "M5", which evaluates model trees and the rule-based versions of the model, along with the use of smoothing and pruning.
set.seed(317)
# Weka_control() is supplied by the RWeka package, which method = "M5" requires
mdltree.model <- train(x = trainingData$x,
                       y = trainingData$y,
                       method = "M5",
                       trControl = trainControl(method = "cv"),
                       control = Weka_control(M = 2))
mdltree.model
## Model Tree
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## pruned smoothed rules RMSE Rsquared MAE
## Yes Yes Yes 2.928634 0.6555064 2.341575
## Yes Yes No 3.257857 0.5666599 2.661570
## Yes No Yes 2.814728 0.6744466 2.229229
## Yes No No 3.088767 0.6026558 2.443961
## No Yes Yes 3.294271 0.5511199 2.648493
## No Yes No 3.226561 0.5699273 2.613535
## No No Yes 4.336560 0.3551470 3.597369
## No No No 3.573643 0.5040270 2.940323
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were pruned = Yes, smoothed = No and
## rules = Yes.
Bagging is short for bootstrap aggregation. It is a general approach that uses bootstrapping along with any regression model to construct an ensemble. Each model in the ensemble generates a prediction for a new sample, and these m predictions are averaged to give the bagged model’s prediction.
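Before turning to caret, it may help to see the idea written out directly. The sketch below (using rpart and the simulated data assumed earlier; the number of bootstrap models m is arbitrary) fits m regression trees on bootstrap samples and averages their predictions:
library(rpart)

m <- 25
train_df <- data.frame(trainingData$x, y = trainingData$y)
preds <- matrix(NA, nrow = nrow(testData$x), ncol = m)

for (i in seq_len(m)) {
  idx <- sample(nrow(train_df), replace = TRUE)       # bootstrap sample of the rows
  fit <- rpart(y ~ ., data = train_df[idx, ])         # one regression tree per bootstrap sample
  preds[, i] <- predict(fit, newdata = testData$x)    # that tree's predictions for new data
}

bagged_pred <- rowMeans(preds)                        # average the m predictions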
The train function uses the treebag method, along with bagControl, to implement bagged trees.
bagCtrl <- bagControl(fit = ctreeBag$fit,
                      predict = ctreeBag$pred,
                      aggregate = ctreeBag$aggregate)

bagg.model <- train(x = trainingData$x,
                    y = trainingData$y,
                    method = "treebag",
                    tuneLength = 2,
                    trControl = trainControl(method = "cv"),
                    bagCtrl = bagCtrl)
bagg.model
## Bagged CART
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2.850772 0.6859794 2.251662
A random forest consists of a large number of individual decision trees that work as an ensemble. Each tree in the ensemble generates a prediction for a new sample, and these predictions are averaged to give the forest’s prediction. Because the algorithm randomly selects predictors at each split, tree correlation is reduced compared with bagging. In the random forest algorithm, we first select the number of trees to build, then loop through this number, training a tree model each time. Once done, the predictions are averaged to give the overall prediction. In random forests, the trees are created independently, each tree is grown to maximum depth, and each tree contributes equally to the final model.
The train function has wrappers for tuning these models by specifying either method = "rf" or method = "cforest". Optimizing the mtry parameter may result in a slight increase in performance, and train can use standard resampling methods to estimate performance.
set.seed(317)
randfrst.model <- train(x = trainingData$x,
                        y = trainingData$y,
                        method = "rf",
                        tuneLength = 2,
                        trControl = trainControl(method = "cv"))
randfrst.model
## Random Forest
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 2.891769 0.8425261 2.363624
## 10 2.477719 0.7778500 1.987306
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 10.
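A common follow-up with random forests (not shown in the original output) is to look at which predictors the forest relies on; caret's varImp works directly on the fitted train object:
# Variable importance for the tuned random forest
rfImp <- varImp(randfrst.model)
rfImp
plot(rfImp, top = 10)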
Boosting algorithms are influenced by learning theory. A boosting algorithm seeks to improve predictive power by training a sequence of weak models, each of which compensates for the weaknesses of its predecessors. The trees in boosting are dependent on past trees, have minimal depth, and do not contribute equally to the final model. Boosting requires us to specify a weak learner (e.g., a regression model or a shallow decision tree) and then iteratively improves it.
As with the other models, the train function can be used to tune over different parameters. Here we define a tuning grid over interaction depth, number of trees, and shrinkage, and then train over this grid as follows.
set.seed(317)

# boosting regression trees via stochastic gradient boosting machines
gbmGrid <- expand.grid(interaction.depth = seq(1, 2, by = 1),
                       n.trees = seq(100, 200, by = 50),
                       shrinkage = 0.1,
                       n.minobsinnode = 5)

gbm.model <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "gbm",
                   tuneGrid = gbmGrid,
                   trControl = trainControl(method = "cv"),
                   verbose = FALSE)
gbm.model
## Stochastic Gradient Boosting
##
## 200 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared MAE
## 1 100 2.187126 0.8221941 1.802816
## 1 150 2.023134 0.8335915 1.664078
## 1 200 1.969022 0.8380564 1.609419
## 2 100 1.976804 0.8440204 1.594093
## 2 150 1.874660 0.8579755 1.523923
## 2 200 1.846831 0.8610865 1.495919
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 200, interaction.depth =
## 2, shrinkage = 0.1 and n.minobsinnode = 5.
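Since every model above was fit with 10-fold cross-validation on the same data, one possible way to wrap up (a follow-up step, not part of the original output) is to collect the resampling results with caret's resamples function and compare the models side by side:
# Compare the cross-validated performance of the five models
resamps <- resamples(list(CART       = singletree.model,
                          ModelTree  = mdltree.model,
                          BaggedCART = bagg.model,
                          RF         = randfrst.model,
                          GBM        = gbm.model))
summary(resamps)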
Applied Predictive Modeling by Max Kuhn and Kjell Johnson